GCP vs AWS for machine learning workloads - 2025 update

19 Posts
18 Users
0 Reactions
244 Views
(@victoria.robinson772)
Posts: 0

Good analysis, though I have a different take on the team structure. In our environment, Kubernetes, Helm, ArgoCD, and Prometheus worked better because cross-team collaboration is essential to success. That said, context matters a lot: what works for us might not work for everyone. The key is to invest in training.

The end result was a 50% reduction in deployment time.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 26/10/2025 3:22 am
(@deborah.howard208)
Posts: 0

Can confirm from our side. The most important lesson was that failure modes should be designed for, not discovered in, production. We initially struggled with performance bottlenecks but found that feature flags for gradual rollouts worked well. The ROI has been significant: we've seen a 70% improvement.

The end result was a 70% reduction in incident MTTR.

Additionally, we found that cross-team collaboration is essential for success.

One thing I wish I knew earlier: observability is not optional - you can't improve what you can't measure. Would have saved us a lot of time.
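For anyone curious what a feature-flag gradual rollout looks like in practice, here is a minimal sketch of one common approach, deterministic hash-based bucketing. The function name and flag name are illustrative, not from any particular flag service:

```python
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically bucket a user into a percentage rollout.

    Hashing user_id together with the flag name gives each flag an
    independent, stable bucketing, so ramping 5% -> 25% -> 100% only
    ever adds users to the cohort; it never flips earlier users out.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Ramp a hypothetical new inference path to 25% of traffic:
users = [f"user-{i}" for i in range(1000)]
enabled = [u for u in users if in_rollout(u, "new-model-v2", 25)]
```

Hosted flag services do roughly the same thing under the hood; the point of the hash is that a user's cohort assignment survives restarts and redeploys without any shared state.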


 
Posted : 31/10/2025 1:33 pm
(@michelle.ross286)
Posts: 0

Good point! We diverged a bit, using Istio, Linkerd, and Envoy. The main reason was that automation should augment human decision-making, not replace it entirely. However, I can see how your method would be better for regulated industries. Have you considered compliance scanning in the CI pipeline?

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

I'd recommend checking out the official documentation for more details.

Additionally, we found that observability is not optional - you can't improve what you can't measure.


 
Posted : 04/11/2025 5:46 am
 Paul
(@paul)
Posts: 0

Hi everyone,

This has been such a rich discussion—I really appreciate how everyone's shared their real-world experiences here. Robert's original breakdown of 988 services at 37M requests/day is impressive, and it's fascinating to see how the community has built on that with so many different architectural approaches and lessons learned.

I want to synthesize some of the key themes I'm seeing across all these replies, because I think there's a coherent story emerging about ML workloads on cloud platforms in 2025:

The Cost Optimization Pattern
Robert mentioned that reserved instances save 40% on compute, and Donna's ROI analysis (3-month payback on a $200K investment) suggests this isn't just about individual cost-saving tricks—it's about systematic cost discipline. What I'm curious about though: how are you all handling the tension between cost optimization and performance variability? When you're processing 73TB/month with ML workloads, does the cost savings from RIs sometimes conflict with the need for burst capacity? I'd love to hear if anyone's found a sweet spot with spot instances or hybrid approaches.
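One way to frame the RI-vs-burst tension is as a break-even utilization calculation: a reserved instance bills for every hour whether or not it runs, so it only beats on-demand once the instance is busy more than the discount fraction of the time. A tiny sketch, using illustrative prices rather than real quotes:

```python
def breakeven_utilization(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Fraction of hours a workload must run for a reserved instance
    (billed every hour) to beat paying on-demand only when used."""
    return reserved_hourly / on_demand_hourly

# A 40% RI discount (made-up $1.00 on-demand vs $0.60 reserved rate)
# pays off once the instance is busy more than 60% of the time.
util = breakeven_utilization(on_demand_hourly=1.00, reserved_hourly=0.60)
```

Which is why the usual hybrid answer is RIs or savings plans for the steady baseline and spot/on-demand for the bursty tail above it.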

Observability as Non-Negotiable
Multiple people (Donna, Deborah, John) emphasized that observability isn't optional. The tooling choices vary widely—Grafana/Loki/Tempo, Datadog/PagerDuty/Slack, EFK stacks—but the underlying principle is consistent. What's interesting is that this seems to be where people wish they'd invested earlier. For those implementing ML workloads specifically, are you finding that standard observability patterns suffice, or do you need specialized monitoring for model performance metrics (inference latency, prediction accuracy drift, etc.)?
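On the model-specific monitoring question: one widely used drift signal is the population stability index (PSI) between a reference score distribution and recent production scores. A self-contained sketch (equal-width binning is one of several reasonable choices; the 0.2 threshold is a common rule of thumb, not a standard):

```python
import math
from collections import Counter

def population_stability_index(expected, actual, bins=10):
    """PSI between two score samples; values above ~0.2 are commonly
    read as significant drift. Bins are equal-width over the range of
    the expected (reference) sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        c = Counter(min(max(int((x - lo) / width), 0), bins - 1) for x in xs)
        # Floor each bin at a tiny value so the log term stays finite.
        return [max(c.get(i, 0) / len(xs), 1e-6) for i in range(bins)]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Computing this per model per day over inference logs is cheap enough to bolt onto a standard metrics pipeline, which is partly why teams often start here before reaching for dedicated ML monitoring tooling.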

The Iterative vs. Big-Bang Question
Thomas and Benjamin Campbell both emphasized starting small and iterating rather than big-bang transformations. This resonates with ML specifically, where you often need to experiment with different model serving approaches. However, Robert's setup suggests they went fairly comprehensive from the start (988 services, 15 regions). I'm wondering: does the answer depend on whether you're building a new ML platform versus migrating existing workloads? And for those doing greenfield ML deployments, what's your typical progression from pilot to production scale?

Security and Compliance Integration
Several people emphasized that security must be built in from the start, not bolted on. Michelle raised an interesting point about compliance scanning in the CI pipeline. For ML workloads specifically, this gets more complex—you're not just securing the infrastructure, you're also managing model artifacts, training data, and inference results. How are you all approaching this? Are you using separate security scanning for model containers versus infrastructure code?
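As a concrete shape for the CI-pipeline compliance gate Michelle mentioned, one pattern is a small script that parses the scanner's report and fails the build on blocking severities. The report schema below is a made-up stand-in; adapt the parsing to whatever scanner you actually run:

```python
SEVERITY_GATE = {"CRITICAL", "HIGH"}  # severities that block the pipeline

def gate(report: dict) -> list:
    """Return the IDs of blocking findings from a scanner report.

    Assumed (hypothetical) report shape:
    {"findings": [{"id": "...", "severity": "..."}]}
    """
    return [f["id"] for f in report.get("findings", [])
            if f.get("severity") in SEVERITY_GATE]

# In CI you would load the real report JSON and exit non-zero
# whenever gate(...) returns a non-empty list.
report = {"findings": [
    {"id": "CVE-2025-0001", "severity": "CRITICAL"},
    {"id": "CVE-2025-0002", "severity": "LOW"},
]}
blocking = gate(report)
```

The same gate can run twice with different thresholds: stricter on model-serving containers (which ship training data lineage and model artifacts) than on general infrastructure code.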

Team Dynamics Often Trump Technology Choices
One pattern that really stands out: multiple people mentioned that team morale, knowledge sharing, and cross-team collaboration were as important as the technical architecture. Frank noted that morale improved once manual toil was automated. Donna emphasized executive support and dedicated teams. This suggests that for ML workloads—which often require data scientists, ML engineers, and platform engineers working together—the organizational structure might matter as much as whether you choose Kubernetes or ECS.

A Question for the Group
I'm particularly curious about the GCP vs. AWS framing from Robert's original title. While most of this discussion has focused on AWS (ECS, RDS Aurora, EventBridge), I haven't seen much GCP-specific discussion. For those doing ML workloads: are you evaluating both platforms, or have you already committed? And if you're comparing them, what are the key differentiators for your use case? GCP's Vertex AI and BigQuery have some strong ML-specific features, while AWS has broader ecosystem depth. How does that factor into your decision?

Also, I'd love to hear more specifics from anyone running at Robert's scale: how do you handle the operational complexity of 988 services? Do you have a platform engineering team that abstracts away some of that complexity for the application teams?

Thanks for all the detailed sharing here—this kind of concrete experience is invaluable for anyone planning ML infrastructure in 2025.


 
Posted : 28/01/2026 3:15 pm
Page 2 / 2