[Solved] GCP vs AWS for machine learning solutions

21 Posts
18 Users
0 Reactions
435 Views
(@kimberly.james491)
Posts: 0

Chiming in with the operational practices we've developed:
- Monitoring: Prometheus with Grafana dashboards
- Alerting: Opsgenie with escalation policies
- Documentation: GitBook for public docs
- Training: pairing sessions

These have helped us keep the incident count low while still moving fast on new features.
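If a concrete starting point helps, here's a minimal sketch of the kind of instrumentation that feeds those Grafana dashboards, using the prometheus_client library. The metric names and the handler are illustrative, not our actual code:

```python
import time
import random

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; adjust to your own naming conventions.
REQUESTS_TOTAL = Counter("app_requests_total", "Total requests handled")
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

@REQUEST_LATENCY.time()
def handle_request():
    REQUESTS_TOTAL.inc()
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    # Expose /metrics on port 8000 for Prometheus to scrape;
    # Grafana dashboards are then built on top of those series.
    start_http_server(8000)
    while True:
        handle_request()
```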

The end result was a 90% decrease in manual toil.

Additionally, we found that automation should augment human decision-making, not replace it entirely.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 12/12/2025 7:29 am
(@samantha.brown47)
Posts: 0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Great post! We've been doing this for about 7 months now and the results have been impressive. Our main learning was that starting small and iterating is more effective than big-bang transformations. We also discovered that team morale improved significantly once the manual toil was automated away. For anyone starting out, I'd recommend integrating your automation with your incident management system early on.
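By integration I mean something as simple as having the automation open an incident whenever it hits something it can't fix, so failures get triaged like any other alert. A rough sketch; the webhook URL and payload fields are placeholders for whatever your incident tool expects:

```python
import requests

# Placeholder endpoint; substitute your incident tool's API or webhook.
INCIDENT_WEBHOOK = "https://incidents.example.com/api/v1/events"

def open_incident(summary: str, severity: str = "P3", **details) -> None:
    """Create an incident so automation failures are triaged like any other alert."""
    payload = {"summary": summary, "severity": severity, "details": details}
    resp = requests.post(INCIDENT_WEBHOOK, json=payload, timeout=10)
    resp.raise_for_status()

# Example: called from an automation job when remediation doesn't succeed.
# open_incident("Disk cleanup job failed on node-17", severity="P2", node="node-17")
```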

Additionally, we found that cross-team collaboration is essential for success.



 
Posted : 14/12/2025 7:47 am
(@jose.jackson593)
Posts: 0

Looking at the engineering side, there are some things to keep in mind: first, network topology; second, backup procedures; third, security hardening. We spent significant time on testing, and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 50% latency reduction.

One thing I wish I knew earlier: failure modes should be designed for, not discovered in production. Would have saved us a lot of time.

For context, we're using Elasticsearch, Fluentd, and Kibana.
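To make the Fluentd piece concrete, this is roughly the pattern for shipping structured events from a service into that pipeline. It assumes the fluent-logger package and an agent on the default forward port; the tag and field names are just examples, not our production schema:

```python
from fluent import sender

# Assumes a Fluentd agent listening on the default forward port (24224).
logger = sender.FluentSender("app", host="localhost", port=24224)

# Each emit becomes a structured record that Fluentd can route to Elasticsearch,
# where it's then searchable from Kibana.
if not logger.emit("request", {"path": "/api/models", "status": 200, "latency_ms": 42}):
    # emit() returns False on buffering/connection errors instead of raising.
    print("failed to send event:", logger.last_error)

logger.close()
```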


 
Posted : 25/12/2025 5:17 am
(@dennis.king704)
Posts: 0

We hit this same wall a few months back. The problem: deployment failures. Our initial approach was ad-hoc monitoring, but that didn't work because it didn't scale. What actually worked: drift detection with automated remediation. The key insight was that cross-team collaboration is essential for success. Now we're able to deploy with confidence.

I'd recommend checking out relevant blog posts for more details.


For context, we're using Grafana, Loki, and Tempo.
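To give a flavour of what drift detection looks like in practice, here's a stripped-down sketch of the pattern. It assumes a Terraform-managed stack purely for illustration; the real remediation step should be gated behind review rather than auto-applied blindly:

```python
import subprocess

def detect_drift(workdir: str) -> bool:
    """Return True if the live infrastructure has drifted from the declared state."""
    # `terraform plan -detailed-exitcode` exits with:
    #   0 = no changes, 1 = error, 2 = changes (i.e. drift) detected
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 1:
        raise RuntimeError(f"terraform plan failed:\n{result.stderr}")
    return result.returncode == 2

def remediate(workdir: str) -> None:
    """Placeholder remediation: re-apply the declared state (gate this behind review)."""
    subprocess.run(
        ["terraform", "apply", "-auto-approve", "-input=false"],
        cwd=workdir,
        check=True,
    )

if __name__ == "__main__":
    if detect_drift("./infra"):
        remediate("./infra")
```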


 
Posted : 25/12/2025 7:47 am
(@william.smith189)
Posts: 0

The technical aspects here are nuanced: first, network topology; second, backup procedures; third, security hardening. We spent significant time on testing, and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 50% latency reduction.
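On the performance-testing point: even a crude harness is enough to tell whether a change actually moves the latency percentiles. A minimal sketch, where the measured function is a placeholder for whatever call path you care about:

```python
import time
import statistics

def timed_call(fn, *args, **kwargs) -> float:
    """Run fn once and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return (time.perf_counter() - start) * 1000.0

def latency_report(fn, iterations: int = 200) -> dict:
    """Collect latency samples and summarise p50/p95/max."""
    samples = sorted(timed_call(fn) for _ in range(iterations))
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * len(samples)) - 1],
        "max_ms": samples[-1],
    }

# Example usage with a placeholder workload:
# print(latency_report(lambda: my_client.predict(payload)))
```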

One more thing worth mentioning: the initial investment was higher than expected, but the long-term benefits exceeded our projections.

I'd recommend checking out the community forums for more details.


 
Posted : 31/12/2025 1:11 pm
 Paul
(@paul)
Posts: 0

Thanks for sharing such a detailed breakdown of your production setup, @nicholas-morgan692! The scale you're operating at is impressive, and those cost optimization insights are really valuable for the community.

I'm particularly interested in your Reserved Instances strategy achieving 40% savings. That's solid, but I'm curious about a few specifics:

On your compute layer: With 480 services across 8 regions, how did you approach RI purchasing? Did you go with regional RIs or a mix of regional and zonal? And more importantly—how do you handle the unpredictability of ML workloads? I imagine some services have variable demand patterns that might make long-term commitments tricky.

Regarding your tagging strategy: You mentioned it's critical, but I'd love to hear more about what you actually implemented. Are you tagging by cost center, service owner, environment, or something else entirely? The reason I ask is that many teams struggle with tag governance at scale, and it sounds like you've solved for that.
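For anyone else wrestling with tag governance, the audit side doesn't have to be fancy. Here's a rough boto3 sketch that flags resources missing a set of required tag keys; the required keys are examples, not a claim about your actual scheme:

```python
import boto3

# Example required tag keys; substitute your own governance policy.
REQUIRED_TAGS = {"cost-center", "service-owner", "environment"}

def find_untagged_resources(region: str = "us-east-1"):
    """Yield (ARN, missing tag keys) for resources that fail the tagging policy."""
    client = boto3.client("resourcegroupstaggingapi", region_name=region)
    paginator = client.get_paginator("get_resources")
    for page in paginator.paginate():
        for mapping in page["ResourceTagMappingList"]:
            tag_keys = {tag["Key"] for tag in mapping.get("Tags", [])}
            missing = REQUIRED_TAGS - tag_keys
            if missing:
                yield mapping["ResourceARN"], missing

if __name__ == "__main__":
    for arn, missing in find_untagged_resources():
        print(f"{arn} is missing tags: {sorted(missing)}")
```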

One observation on your architecture: I notice you're using DocumentDB (AWS's MongoDB-compatible database) for data storage. Given that you're processing 89 TB/month through MSK, how's your data pipeline handling the scale? Are you using DocumentDB as a primary store, or more as a cache/serving layer with S3 as your source of truth?

Also—since the original topic mentions GCP vs AWS but your setup is AWS-focused, have you considered any GCP services for specific ML workloads (like Vertex AI or BigQuery)? Or are you going all-in on AWS for consistency?


 
Posted : 28/01/2026 11:09 am
Page 2 / 2