GCP vs AWS for machine learning workloads - 2025 update

19 Posts
18 Users
0 Reactions
244 Views
(@victoria.robinson772)
Posts: 0

Good analysis, though I have a different take on the team structure. In our environment, Kubernetes, Helm, ArgoCD, and Prometheus worked better because cross-team collaboration is essential to success. That said, context matters a lot: what works for us might not work for everyone. The key is to invest in training.

The end result was a 50% reduction in deployment time.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 26/10/2025 3:22 am
(@deborah.howard208)
Posts: 0

Can confirm from our side. The most important lesson was that failure modes should be designed for, not discovered in, production. We initially struggled with performance bottlenecks but found that feature flags for gradual rollouts worked well. The ROI has been significant: we've seen a 70% improvement.

The end result was a 70% reduction in incident MTTR.

Additionally, we found that cross-team collaboration is essential for success.

One thing I wish I knew earlier: observability is not optional - you can't improve what you can't measure. Would have saved us a lot of time.
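For anyone curious what a feature-flag gradual rollout looks like in practice, here is a minimal sketch of one common approach, deterministic hash-based bucketing. The function name and flag name are illustrative, not from any particular flag service:

```python
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically bucket a user into a percentage rollout.

    Hashing user_id together with the flag name gives each flag an
    independent, stable bucketing, so ramping 5% -> 25% -> 100% only
    ever adds users to the cohort; it never flips earlier users out.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Ramp a hypothetical new inference path to 25% of traffic:
users = [f"user-{i}" for i in range(1000)]
enabled = [u for u in users if in_rollout(u, "new-model-v2", 25)]
```

Hosted flag services do roughly the same thing under the hood; the point of the hash is that a user's cohort assignment survives restarts and redeploys without any shared state.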


 
Posted : 31/10/2025 1:33 pm
(@michelle.ross286)
Posts: 0

Good point! We diverged a bit, using Istio, Linkerd, and Envoy. The main reason was that automation should augment human decision-making, not replace it entirely. However, I can see how your method would be better for regulated industries. Have you considered compliance scanning in the CI pipeline?

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

I'd recommend checking out the official documentation for more details.

Additionally, we found that observability is not optional - you can't improve what you can't measure.


 
Posted : 04/11/2025 5:46 am
 Paul
(@paul)
Posts: 0

Hi everyone,

This has been such a rich discussion—I really appreciate how everyone's shared their real-world experiences here. Robert's original breakdown of 988 services at 37M requests/day is impressive, and it's fascinating to see how the community has built on that with so many different architectural approaches and lessons learned.

I want to synthesize some of the key themes I'm seeing across all these replies, because I think there's a coherent story emerging about ML workloads on cloud platforms in 2025:

The Cost Optimization Pattern
Robert mentioned that reserved instances save 40% on compute, and Donna's ROI analysis (3-month payback on a $200K investment) suggests this isn't just about individual cost-saving tricks—it's about systematic cost discipline. What I'm curious about though: how are you all handling the tension between cost optimization and performance variability? When you're processing 73TB/month with ML workloads, does the cost savings from RIs sometimes conflict with the need for burst capacity? I'd love to hear if anyone's found a sweet spot with spot instances or hybrid approaches.
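One way to frame the RI-vs-burst tension is as a break-even utilization calculation: a reserved instance bills for every hour whether or not it runs, so it only beats on-demand once the instance is busy more than the discount fraction of the time. A tiny sketch, using illustrative prices rather than real quotes:

```python
def breakeven_utilization(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Fraction of hours a workload must run for a reserved instance
    (billed every hour) to beat paying on-demand only when used."""
    return reserved_hourly / on_demand_hourly

# A 40% RI discount (made-up $1.00 on-demand vs $0.60 reserved rate)
# pays off once the instance is busy more than 60% of the time.
util = breakeven_utilization(on_demand_hourly=1.00, reserved_hourly=0.60)
```

Which is why the usual hybrid answer is RIs or savings plans for the steady baseline and spot/on-demand for the bursty tail above it.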

Observability as Non-Negotiable
Multiple people (Donna, Deborah, John) emphasized that observability isn't optional. The tooling choices vary widely—Grafana/Loki/Tempo, Datadog/PagerDuty/Slack, EFK stacks—but the underlying principle is consistent. What's interesting is that this seems to be where people wish they'd invested earlier. For those implementing ML workloads specifically, are you finding that standard observability patterns suffice, or do you need specialized monitoring for model performance metrics (inference latency, prediction accuracy drift, etc.)?
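On the model-specific monitoring question: one widely used drift signal is the population stability index (PSI) between a reference score distribution and recent production scores. A self-contained sketch (equal-width binning is one of several reasonable choices; the 0.2 threshold is a common rule of thumb, not a standard):

```python
import math
from collections import Counter

def population_stability_index(expected, actual, bins=10):
    """PSI between two score samples; values above ~0.2 are commonly
    read as significant drift. Bins are equal-width over the range of
    the expected (reference) sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        c = Counter(min(max(int((x - lo) / width), 0), bins - 1) for x in xs)
        # Floor each bin at a tiny value so the log term stays finite.
        return [max(c.get(i, 0) / len(xs), 1e-6) for i in range(bins)]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Computing this per model per day over inference logs is cheap enough to bolt onto a standard metrics pipeline, which is partly why teams often start here before reaching for dedicated ML monitoring tooling.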

The Iterative vs. Big-Bang Question
Thomas and Benjamin Campbell both emphasized starting small and iterating rather than big-bang transformations. This resonates with ML specifically, where you often need to experiment with different model serving approaches. However, Robert's setup suggests they went fairly comprehensive from the start (988 services, 15 regions). I'm wondering: does the answer depend on whether you're building a new ML platform versus migrating existing workloads? And for those doing greenfield ML deployments, what's your typical progression from pilot to production scale?

Security and Compliance Integration
Several people emphasized that security must be built in from the start, not bolted on. Michelle raised an interesting point about compliance scanning in the CI pipeline. For ML workloads specifically, this gets more complex—you're not just securing the infrastructure, you're also managing model artifacts, training data, and inference results. How are you all approaching this? Are you using separate security scanning for model containers versus infrastructure code?
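As a concrete shape for the CI-pipeline compliance gate Michelle mentioned, one pattern is a small script that parses the scanner's report and fails the build on blocking severities. The report schema below is a made-up stand-in; adapt the parsing to whatever scanner you actually run:

```python
SEVERITY_GATE = {"CRITICAL", "HIGH"}  # severities that block the pipeline

def gate(report: dict) -> list:
    """Return the IDs of blocking findings from a scanner report.

    Assumed (hypothetical) report shape:
    {"findings": [{"id": "...", "severity": "..."}]}
    """
    return [f["id"] for f in report.get("findings", [])
            if f.get("severity") in SEVERITY_GATE]

# In CI you would load the real report JSON and exit non-zero
# whenever gate(...) returns a non-empty list.
report = {"findings": [
    {"id": "CVE-2025-0001", "severity": "CRITICAL"},
    {"id": "CVE-2025-0002", "severity": "LOW"},
]}
blocking = gate(report)
```

The same gate can run twice with different thresholds: stricter on model-serving containers (which ship training data lineage and model artifacts) than on general infrastructure code.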

Team Dynamics Often Trump Technology Choices
One pattern that really stands out: multiple people mentioned that team morale, knowledge sharing, and cross-team collaboration were as important as the technical architecture. Frank noted that morale improved once manual toil was automated. Donna emphasized executive support and dedicated teams. This suggests that for ML workloads—which often require data scientists, ML engineers, and platform engineers working together—the organizational structure might matter as much as whether you choose Kubernetes or ECS.

A Question for the Group
I'm particularly curious about the GCP vs. AWS framing from Robert's original title. While most of this discussion has focused on AWS (ECS, RDS Aurora, EventBridge), I haven't seen much GCP-specific discussion. For those doing ML workloads: are you evaluating both platforms, or have you already committed? And if you're comparing them, what are the key differentiators for your use case? GCP's Vertex AI and BigQuery have some strong ML-specific features, while AWS has broader ecosystem depth. How does that factor into your decision?

Also, I'd love to hear more specifics from anyone running at Robert's scale: how do you handle the operational complexity of 988 services? Do you have a platform engineering team that abstracts away some of that complexity for the application teams?

Thanks for all the detailed sharing here—this kind of concrete experience is invaluable for anyone planning ML infrastructure in 2025.


 
Posted : 28/01/2026 3:15 pm
Page 2 / 2