This is the 2025 update to our GCP vs AWS comparison for machine learning workloads. We're running the AWS setup below in production and wanted to share our experience.
Scale:
- 480 services deployed
- 89 TB data processed/month
- 25M requests/day
- 8 regions worldwide
Architecture:
- Compute: EC2 Auto Scaling
- Data: DocumentDB
- Queue: MSK (Kafka)
Monthly cost: ~$75k
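For a rough sense of unit economics, the figures above can be turned into averages. This is a back-of-envelope sketch, not a billing breakdown: it assumes a 30-day month and an even request rate, and the variable names are only for illustration.

```python
# Figures taken from the post; the 30-day month is an assumption.
REQUESTS_PER_DAY = 25_000_000
DATA_TB_PER_MONTH = 89
MONTHLY_COST_USD = 75_000
DAYS_PER_MONTH = 30  # assumption

SECONDS_PER_DAY = 24 * 60 * 60

# Average load, ignoring daily peaks and troughs.
avg_requests_per_second = REQUESTS_PER_DAY / SECONDS_PER_DAY

# Blended cost per unit of work (the whole bill divided by volume).
requests_per_month = REQUESTS_PER_DAY * DAYS_PER_MONTH
cost_per_million_requests = MONTHLY_COST_USD / (requests_per_month / 1_000_000)
cost_per_tb_processed = MONTHLY_COST_USD / DATA_TB_PER_MONTH

print(f"avg load: {avg_requests_per_second:.0f} req/s")
print(f"cost per 1M requests: ${cost_per_million_requests:.2f}")
print(f"cost per TB processed: ${cost_per_tb_processed:.2f}")
```

That works out to roughly 290 requests/second on average and about $100 per million requests. Peak load will be higher than the average, so size for the peak, not for this number.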
Lessons learned:
1. Reserved instances save 40% on compute
2. S3 lifecycle policies are essential
3. Tagging strategy is critical
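As an illustration of lesson 2, here is a minimal sketch of a lifecycle configuration built as a plain Python dict, in the shape that boto3's put_bucket_lifecycle_configuration accepts. The prefix, rule ID, day counts, and bucket name are hypothetical placeholders, not values from the setup above.

```python
def build_lifecycle_config(prefix: str, ia_after_days: int,
                           glacier_after_days: int,
                           expire_after_days: int) -> dict:
    """Build a one-rule lifecycle configuration for objects under `prefix`."""
    if not 0 < ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("day thresholds must be positive and strictly increasing")
    return {
        "Rules": [
            {
                "ID": f"tier-and-expire-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to cheaper storage classes as they age.
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                # Delete objects once they pass the retention window.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Hypothetical example: tier logs at 30 and 90 days, delete after a year.
config = build_lifecycle_config("logs/", 30, 90, 365)

# Applying it would look roughly like this (needs boto3 and credentials):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```

Keeping the rule as data makes it easy to review in a PR and to reuse across buckets with different prefixes.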
AMA about our setup!
Pro tip: if you're implementing this, check your resource quotas and default timeouts. We spent 2 weeks debugging random failures only to discover the default timeout was too low. Raising it from 30s to 2min made the failures disappear.
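The general idea is to make the timeout an explicit, visible setting instead of relying on a library default. The sketch below is generic Python, not the component we actually changed; CallSettings and call_with_timeout are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from dataclasses import dataclass


@dataclass(frozen=True)
class CallSettings:
    # Explicit timeout: 2 minutes instead of the 30s default mentioned above.
    timeout_seconds: float = 120.0


def call_with_timeout(func, settings: CallSettings, *args):
    """Run func(*args) in a worker thread and stop waiting after the timeout.

    The worker thread is not killed on timeout; the caller just stops
    waiting for it and gets a TimeoutError.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=settings.timeout_seconds)
    except FutureTimeout:
        raise TimeoutError(
            f"call exceeded {settings.timeout_seconds}s"
        ) from None
    finally:
        # Do not block on a still-running call.
        pool.shutdown(wait=False)


# Hypothetical usage with a trivial function.
result = call_with_timeout(lambda x: x * 2, CallSettings(), 21)
```

Putting the value in one settings object also makes it easy to log at startup, so a too-low timeout shows up in minutes instead of weeks.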
Security team blocked this due to compliance requirements.
Consider the long-term maintenance burden before adopting.
The migration path we took:
Week 1-2: Research & POC
Week 3-4: Staging deployment
Week 5-6: Prod rollout (10% -> 50% -> 100%)
Week 7-8: Optimization
Total cost: ~200 eng hours
Would do it again in a heartbeat.
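The 10% -> 50% -> 100% rollout can be sketched as deterministic percentage bucketing. This is one common way to do it, not necessarily how the traffic split above was implemented; the function names and the use of a request or user ID as the key are assumptions.

```python
import hashlib

# Rollout stages from the migration plan above.
ROLLOUT_STAGES = (10, 50, 100)


def bucket_for(key: str) -> int:
    """Map a key (e.g. a user or request ID) to a stable bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def routes_to_new_stack(key: str, rollout_percent: int) -> bool:
    """Return True if this key should be served by the new deployment."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return bucket_for(key) < rollout_percent


# Hypothetical usage: see which stage first picks up a given key.
for percent in ROLLOUT_STAGES:
    print(percent, routes_to_new_stack("user-12345", percent))
```

Because the bucket depends only on the key, anyone routed to the new stack at 10% stays on it at 50% and 100%, which keeps user experience consistent and makes regressions easier to trace.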
Cautionary tale: we rushed this implementation without proper testing and it caused a 4-hour outage. The cause was a memory leak in the worker. Lesson learned: always test in staging first, especially when dealing with load balancers.
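One way to catch a leak like this in staging is a soak check that runs the worker's task many times and measures how much traced memory is still held afterwards. Below is a minimal sketch using Python's standard tracemalloc module. The task functions are hypothetical stand-ins for a real worker, and tracemalloc only sees allocations made through Python's allocator.

```python
import tracemalloc


def memory_growth_bytes(task, iterations: int = 1000) -> int:
    """Run task() repeatedly and return the growth in traced memory."""
    tracemalloc.start()
    try:
        task()  # warm-up call so one-time setup is not counted as growth
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            task()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Hypothetical tasks: one keeps references forever, one does not.
_never_cleared = []


def leaky_task():
    _never_cleared.append(bytes(1024))  # retained on every call


def clean_task():
    sum(range(100))  # temporary objects only


print("leaky growth:", memory_growth_bytes(leaky_task))
print("clean growth:", memory_growth_bytes(clean_task))
```

A staging job could fail the build when growth exceeds a threshold you pick for your workload. Steady growth per iteration is the signal to look for, not the absolute number.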