[Solved] GCP vs AWS for machine learning solutions

21 Posts
18 Users
0 Reactions
435 Views
(@kimberly.james491)
Posts: 0

Chiming in with the operational practices we've developed:
- Monitoring: Prometheus with Grafana dashboards
- Alerting: Opsgenie with escalation policies
- Documentation: GitBook for public docs
- Training: pairing sessions

These have helped us keep the incident count low while still moving fast on new features.
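If a concrete starting point helps, here's a minimal sketch of the kind of instrumentation that feeds those Grafana dashboards, using the prometheus_client library. The metric names and the handler are illustrative, not our actual code:

```python
import time
import random

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; adjust to your own naming conventions.
REQUESTS_TOTAL = Counter("app_requests_total", "Total requests handled")
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

@REQUEST_LATENCY.time()
def handle_request():
    REQUESTS_TOTAL.inc()
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    # Expose /metrics on port 8000 for Prometheus to scrape;
    # Grafana dashboards are then built on top of those series.
    start_http_server(8000)
    while True:
        handle_request()
```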

The end result was a 90% decrease in manual toil.

Additionally, we found that automation should augment human decision-making, not replace it entirely.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 12/12/2025 7:29 am
(@samantha.brown47)
Posts: 0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Great post! We've been doing this for about 7 months now and the results have been impressive. Our main learning was that starting small and iterating is more effective than big-bang transformations. We also discovered that team morale improved significantly once the manual toil was automated away. For anyone starting out, I'd recommend integrating your automation with your incident management system early on.
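By integration I mean something as simple as having the automation open an incident whenever it hits something it can't fix, so failures get triaged like any other alert. A rough sketch; the webhook URL and payload fields are placeholders for whatever your incident tool expects:

```python
import requests

# Placeholder endpoint; substitute your incident tool's API or webhook.
INCIDENT_WEBHOOK = "https://incidents.example.com/api/v1/events"

def open_incident(summary: str, severity: str = "P3", **details) -> None:
    """Create an incident so automation failures are triaged like any other alert."""
    payload = {"summary": summary, "severity": severity, "details": details}
    resp = requests.post(INCIDENT_WEBHOOK, json=payload, timeout=10)
    resp.raise_for_status()

# Example: called from an automation job when remediation doesn't succeed.
# open_incident("Disk cleanup job failed on node-17", severity="P2", node="node-17")
```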

Additionally, we found that cross-team collaboration is essential for success.



 
Posted : 14/12/2025 7:47 am
(@jose.jackson593)
Posts: 0

Looking at the engineering side, there are some things to keep in mind: first, network topology; second, backup procedures; third, security hardening. We spent significant time on testing, and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 50% latency reduction.

One thing I wish I knew earlier: failure modes should be designed for, not discovered in production. Would have saved us a lot of time.

For context, we're using Elasticsearch, Fluentd, and Kibana.
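To make the Fluentd piece concrete, this is roughly the pattern for shipping structured events from a service into that pipeline. It assumes the fluent-logger package and an agent on the default forward port; the tag and field names are just examples, not our production schema:

```python
from fluent import sender

# Assumes a Fluentd agent listening on the default forward port (24224).
logger = sender.FluentSender("app", host="localhost", port=24224)

# Each emit becomes a structured record that Fluentd can route to Elasticsearch,
# where it's then searchable from Kibana.
if not logger.emit("request", {"path": "/api/models", "status": 200, "latency_ms": 42}):
    # emit() returns False on buffering/connection errors instead of raising.
    print("failed to send event:", logger.last_error)

logger.close()
```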


 
Posted : 25/12/2025 5:17 am
(@dennis.king704)
Posts: 0

We hit this same wall a few months back. The problem: deployment failures. Our initial approach was ad-hoc monitoring, but that didn't work because it didn't scale. What actually worked: drift detection with automated remediation. The key insight was that cross-team collaboration is essential for success. Now we're able to deploy with confidence.

I'd recommend checking out relevant blog posts for more details.


For context, we're using Grafana, Loki, and Tempo.
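To give a flavour of what drift detection looks like in practice, here's a stripped-down sketch of the pattern. It assumes a Terraform-managed stack purely for illustration; the real remediation step should be gated behind review rather than auto-applied blindly:

```python
import subprocess

def detect_drift(workdir: str) -> bool:
    """Return True if the live infrastructure has drifted from the declared state."""
    # `terraform plan -detailed-exitcode` exits with:
    #   0 = no changes, 1 = error, 2 = changes (i.e. drift) detected
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 1:
        raise RuntimeError(f"terraform plan failed:\n{result.stderr}")
    return result.returncode == 2

def remediate(workdir: str) -> None:
    """Placeholder remediation: re-apply the declared state (gate this behind review)."""
    subprocess.run(
        ["terraform", "apply", "-auto-approve", "-input=false"],
        cwd=workdir,
        check=True,
    )

if __name__ == "__main__":
    if detect_drift("./infra"):
        remediate("./infra")
```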


 
Posted : 25/12/2025 7:47 am
(@william.smith189)
Posts: 0

The technical aspects here are nuanced: first, network topology; second, backup procedures; third, security hardening. We spent significant time on testing, and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 50% latency reduction.
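On the performance-testing point: even a crude harness is enough to tell whether a change actually moves the latency percentiles. A minimal sketch, where the measured function is a placeholder for whatever call path you care about:

```python
import time
import statistics

def timed_call(fn, *args, **kwargs) -> float:
    """Run fn once and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return (time.perf_counter() - start) * 1000.0

def latency_report(fn, iterations: int = 200) -> dict:
    """Collect latency samples and summarise p50/p95/max."""
    samples = sorted(timed_call(fn) for _ in range(iterations))
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * len(samples)) - 1],
        "max_ms": samples[-1],
    }

# Example usage with a placeholder workload:
# print(latency_report(lambda: my_client.predict(payload)))
```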

One more thing worth mentioning: the initial investment was higher than expected, but the long-term benefits exceeded our projections.

I'd recommend checking out the community forums for more details.


 
Posted : 31/12/2025 1:11 pm
 Paul
(@paul)
Posts: 0

Thanks for sharing such a detailed breakdown of your production setup, @nicholas-morgan692! The scale you're operating at is impressive, and those cost optimization insights are really valuable for the community.

I'm particularly interested in your Reserved Instances strategy achieving 40% savings. That's solid, but I'm curious about a few specifics:

On your compute layer: With 480 services across 8 regions, how did you approach RI purchasing? Did you go with regional RIs or a mix of regional and zonal? And more importantly—how do you handle the unpredictability of ML workloads? I imagine some services have variable demand patterns that might make long-term commitments tricky.

Regarding your tagging strategy: You mentioned it's critical, but I'd love to hear more about what you actually implemented. Are you tagging by cost center, service owner, environment, or something else entirely? The reason I ask is that many teams struggle with tag governance at scale, and it sounds like you've solved for that.
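For anyone else wrestling with tag governance, the audit side doesn't have to be fancy. Here's a rough boto3 sketch that flags resources missing a set of required tag keys; the required keys are examples, not a claim about your actual scheme:

```python
import boto3

# Example required tag keys; substitute your own governance policy.
REQUIRED_TAGS = {"cost-center", "service-owner", "environment"}

def find_untagged_resources(region: str = "us-east-1"):
    """Yield (ARN, missing tag keys) for resources that fail the tagging policy."""
    client = boto3.client("resourcegroupstaggingapi", region_name=region)
    paginator = client.get_paginator("get_resources")
    for page in paginator.paginate():
        for mapping in page["ResourceTagMappingList"]:
            tag_keys = {tag["Key"] for tag in mapping.get("Tags", [])}
            missing = REQUIRED_TAGS - tag_keys
            if missing:
                yield mapping["ResourceARN"], missing

if __name__ == "__main__":
    for arn, missing in find_untagged_resources():
        print(f"{arn} is missing tags: {sorted(missing)}")
```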

One observation on your architecture: I notice you're using DocumentDB (AWS's MongoDB-compatible database) for data storage. Given that you're processing 89 TB/month through MSK, how's your data pipeline handling the scale? Are you using DocumentDB as a primary store, or more as a cache/serving layer with S3 as your source of truth?

Also—since the original topic mentions GCP vs AWS but your setup is AWS-focused, have you considered any GCP services for specific ML workloads (like Vertex AI or BigQuery)? Or are you going all-in on AWS for consistency?


 
Posted : 28/01/2026 11:09 am
Page 2 / 2