Forum

Search
Close
AI Search
Classic Search
 Search Phrase:
 Search Type:
Advanced search options
 Search in Forums:
 Search in date period:

 Sort Search Results by:

AI Assistant
GCP vs AWS for mach...
 
Notifications
Clear all

GCP vs AWS for machine learning workloads - 2025 update

20 Posts
17 Users
0 Reactions
434 Views
0
[#83]
Topic starter
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian

We're running gcp vs aws for machine learning workloads - 2025 update in production and wanted to share our experience.

Scale:
- 480 services deployed
- 89 TB data processed/month
- 25M requests/day
- 8 regions worldwide

Architecture:
- Compute: EC2 Auto Scaling
- Data: DocumentDB
- Queue: MSK (Kafka)

Monthly cost: ~$75k

Lessons learned:
1. Reserved instances save 40% on compute
2. S3 lifecycle policies are essential
3. Tagging strategy is critical

AMA about our setup!


07/11/2025 7:13 am
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian

Pro tip: if you're implementing this, make sure to configure resource quotas correctly. We spent 2 weeks debugging random failures only to discover the default timeout was too low. Changed from 30s to 2min and all issues disappeared.


Tom Chack
07/11/2025 5:10 pm
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian

Security team blocked this due to compliance requirements.


09/11/2025 7:01 am
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian

Consider the long-term maintenance burden before adopting.


0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian

Consider the long-term maintenance burden before adopting.


0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian

The migration path we took:
Week 1-2: Research & POC
Week 3-4: Staging deployment
Week 5-6: Prod rollout (10% -> 50% -> 100%)
Week 7-8: Optimization
Total cost: ~200 eng hours
Would do it again in a heartbeat.


0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian

Cautionary tale: we rushed this implementation without proper testing and it caused a 4-hour outage. The issue was memory leak in the worker. Lesson learned: always test in staging first, especially when dealing with load balancers.


Share:
Scroll to Top