GCP vs AWS for machine learning workloads - 2025 update

AWS Cloud

Last Post by William Smith 2 months ago

20 Posts

17 Users

0 Reactions

434 Views

[#83]

06/11/2025 9:34 pm

Topic starter

Nicholas Morgan

(@nicholas.morgan692)

New Member

0 Posts

We're running our machine learning workloads on AWS in production and wanted to share our experience.

Scale:
- 480 services deployed
- 89 TB data processed/month
- 25M requests/day
- 8 regions worldwide

Architecture:
- Compute: EC2 Auto Scaling
- Data: DocumentDB
- Queue: MSK (Kafka)

Monthly cost: ~$75k

Lessons learned:
1. Reserved instances save 40% on compute
2. S3 lifecycle policies are essential
3. Tagging strategy is critical
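
To make lesson 2 concrete, here is a rough sketch of a lifecycle rule, not our exact configuration. The prefix, the day thresholds, and the helper name `build_lifecycle_config` are illustrative assumptions. The returned dict follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`.

```python
def build_lifecycle_config(prefix, ia_after_days=30, glacier_after_days=90,
                           expire_after_days=365):
    """Build an S3 lifecycle configuration dict (hypothetical helper).

    All thresholds are placeholder values; tune them to your own access patterns.
    """
    if not (ia_after_days < glacier_after_days < expire_after_days):
        raise ValueError("expected ia_after_days < glacier_after_days < expire_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-and-expire-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to cheaper storage classes as they age.
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                # Delete objects once they are past the retention window.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = build_lifecycle_config("training-data/")
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```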

AMA about our setup!

Matthew Ramos

07/11/2025 7:13 am

Pro tip: if you're implementing this, check the default timeouts as well as your resource quotas. We spent 2 weeks debugging random failures only to discover the default timeout was too low. Raising it from 30s to 2min made all the issues disappear.
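
A minimal sketch of the idea, assuming the timeout can be passed explicitly at the call site instead of inherited from a library default. The helper `call_with_timeout` and the constant names are hypothetical; only the 30s and 2min values come from our setup.

```python
import concurrent.futures as cf
import time

DEFAULT_TIMEOUT_S = 30   # the default that was too low for us
TUNED_TIMEOUT_S = 120    # the value we moved to (2 minutes)


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) in a worker thread and stop waiting after timeout_s seconds.

    Raises concurrent.futures.TimeoutError if the deadline passes. The worker
    thread is not killed; it keeps running in the background until fn returns.
    """
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on the worker here, otherwise the timeout is pointless.
        pool.shutdown(wait=False)


# Usage: make the timeout visible where the call happens.
result = call_with_timeout(lambda: "ok", TUNED_TIMEOUT_S)
```

Keeping the value in a named constant makes it easy to find when random failures start showing up.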

Tom Chack

07/11/2025 5:10 pm

Security team blocked this due to compliance requirements.

09/11/2025 7:01 am

Joseph Peterson

(@joseph.peterson474)

New Member

1 Posts

Consider the long-term maintenance burden before adopting.

01/12/2025 7:09 am

James Allen

(@james.allen159)

New Member

0 Posts

The migration path we took:
Week 1-2: Research & POC
Week 3-4: Staging deployment
Week 5-6: Prod rollout (10% -> 50% -> 100%)
Week 7-8: Optimization
Total cost: ~200 eng hours
Would do it again in a heartbeat.
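
For the 10% -> 50% -> 100% rollout, one common way to split traffic is deterministic hash bucketing. This is a sketch of that general technique, not necessarily how our routing layer does it; `bucket_for`, `in_rollout`, and the key format are hypothetical names.

```python
import hashlib

ROLLOUT_STAGES = (10, 50, 100)  # percent of traffic per stage, as in weeks 5-6


def bucket_for(key: str) -> int:
    """Map a stable key (user ID, tenant ID, ...) to a bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(key: str, percent: int) -> bool:
    """Return True if this key is served by the new stack at the given percentage."""
    return bucket_for(key) < percent


# A key that is included at 10% stays included at 50% and 100%,
# so nobody flips back and forth between old and new as the stage advances.
for stage in ROLLOUT_STAGES:
    served = sum(in_rollout(f"user-{i}", stage) for i in range(1000))
    print(f"stage {stage}%: {served} of 1000 sample keys on the new stack")
```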

25/12/2025 7:47 am

Dennis King

(@dennis.king704)

New Member

0 Posts

Cautionary tale: we rushed this implementation without proper testing and it caused a 4-hour outage. The issue was a memory leak in the worker. Lesson learned: always test in staging first, especially when dealing with load balancers.
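
A staging check along these lines might have caught it earlier. This is a rough sketch using Python's standard `tracemalloc` module and assumes the worker's unit of work can be called as a plain function; `memory_growth_bytes` and the iteration count are hypothetical choices, not what we actually ran.

```python
import tracemalloc


def memory_growth_bytes(task, iterations=200):
    """Run task repeatedly and return how much traced memory grew, in bytes."""
    tracemalloc.start()
    try:
        task()  # warm-up call so one-time allocations are not counted as growth
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            task()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Example: a task that keeps references around grows on every call.
_cache = []


def leaky_task():
    _cache.append(bytearray(10_000))  # never released, so memory keeps climbing


growth = memory_growth_bytes(leaky_task)
print(f"traced memory grew by {growth} bytes over 200 calls")
```

A steady climb over many iterations is the signal to look for; a single run tells you very little.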
