GCP vs AWS for machine learning workloads - 2025 update

19 Posts
18 Users
0 Reactions
244 Views
(@robert.stewart107)
Posts: 0
Topic starter

We run this stack in production and wanted to share our experience as a data point for the GCP vs AWS discussion.

Scale:
- 988 services deployed
- 73 TB data processed/month
- 37M requests/day
- 15 regions worldwide

Architecture:
- Compute: ECS Fargate
- Data: RDS Aurora
- Events: EventBridge

Monthly cost: ~$164k
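
For a rough sense of unit economics, the scale and cost figures above can be combined as in the sketch below. It assumes a 30-day month and that the whole ~$164k bill is attributable to this traffic; neither is stated in the post, so treat the outputs as ballpark numbers only.

```python
# Figures quoted in the post; DAYS_PER_MONTH is an assumption.
MONTHLY_COST_USD = 164_000
REQUESTS_PER_DAY = 37_000_000
DATA_TB_PER_MONTH = 73
SERVICES = 988
DAYS_PER_MONTH = 30  # assumption, not from the post


def unit_costs():
    """Return rough per-unit costs derived from the monthly bill."""
    requests_per_month = REQUESTS_PER_DAY * DAYS_PER_MONTH
    return {
        "usd_per_million_requests": MONTHLY_COST_USD / (requests_per_month / 1_000_000),
        "usd_per_tb_processed": MONTHLY_COST_USD / DATA_TB_PER_MONTH,
        "usd_per_service": MONTHLY_COST_USD / SERVICES,
    }


if __name__ == "__main__":
    for name, value in unit_costs().items():
        print(f"{name}: {value:,.2f}")
```

Under those assumptions this works out to roughly $148 per million requests and about $166 per service per month.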

Lessons learned:
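1. Commitment-based discounts (Reserved Instances, Savings Plans) saved us about 40% on compute
2. NAT Gateways are costly, since they bill per hour and per GB processed
3. A consistent tagging strategy is critical for cost attribution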
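
As an illustration of the tagging point, one common way to enforce a strategy is to audit resources against a required tag set. This is a minimal sketch in plain Python; the tag keys and the shape of the resource dictionaries are hypothetical and not taken from the setup described above.

```python
# Hypothetical tagging policy; the keys are illustrative only.
REQUIRED_TAGS = {"team", "service", "env", "cost-center"}


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or empty, sorted."""
    return sorted(
        key for key in REQUIRED_TAGS
        if not str(resource_tags.get(key, "")).strip()
    )


def audit(resources):
    """Map resource id -> missing tag keys, for non-compliant resources only."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report


if __name__ == "__main__":
    inventory = {
        "svc-a": {"team": "ml", "service": "ranker", "env": "prod", "cost-center": "42"},
        "svc-b": {"team": "ml", "env": ""},
    }
    print(audit(inventory))
```

A check like this could run in CI or against an inventory export, so untagged resources are caught before they show up as unattributed spend.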

AMA about our setup!


 
Posted : 06/09/2025 2:56 pm
(@gregory.brooks453)
Posts: 0

This level of detail is exactly what we needed! I have a few questions: 1) How did you handle scaling? 2) What was your approach to blue-green deployments? 3) Did you run into any availability issues? We're considering a similar implementation and would love to learn from your experience.

Two more things worth mentioning from our side: integration with existing tools was smoother than anticipated, and team morale improved significantly once the manual toil was automated away.

I'd recommend checking out the official documentation for more details.


 
Posted : 06/09/2025 10:16 pm
(@thomas.robinson721)
Posts: 0

Our solution was somewhat different: we used Grafana, Loki, and Tempo. The main reason was that starting small and iterating is more effective than big-bang transformations. However, I can see how your method would be better for regulated industries. Have you considered real-time dashboards for stakeholder visibility?

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

One more thing worth mentioning: integration with existing tools was smoother than anticipated.


 
Posted : 12/09/2025 5:38 pm
(@andrew.roberts887)
Posts: 0

What a comprehensive overview! I have a few questions: 1) How did you handle authentication? 2) What was your approach to rollback? 3) Did you encounter any issues with costs? We're considering a similar implementation and would love to learn from your experience.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

For context, we're using Grafana, Loki, and Tempo.



 
Posted : 16/09/2025 12:03 pm
(@angela.nguyen556)
Posts: 0

Here are the technical specifics of our implementation:
- Architecture: microservices on Kubernetes
- Tools: Terraform, AWS CDK, and CloudFormation
- CI/CD: GitHub Actions workflows
- Performance: benchmarks showed a 50% latency reduction
- Security: secrets management with Vault

We documented everything in our internal wiki, and I'm happy to share snippets if helpful.

One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.


 
Posted : 17/09/2025 1:16 am
(@benjamin.taylor696)
Posts: 0

Here's how we approached this. We started about 23 months ago with a small pilot. The initial challenges were around tool integration, and the breakthrough came when we automated the testing. Developer productivity improved by 60%. The team's feedback has been overwhelmingly positive, though we still have room for improvement in documentation. Lesson learned: start simple. Our next step is to add more automation.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 18/09/2025 10:51 pm
(@frank.reyes19)
Posts: 0

Great post! We've been doing this for about 6 months now and the results have been impressive. Our main learning was that security must be built in from the start, not bolted on later. We also discovered that integration with existing tools was smoother than anticipated. For anyone starting out, I'd recommend feature flags for gradual rollouts.
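
For anyone unfamiliar with gradual rollouts, the core idea can be sketched as deterministic percentage bucketing. This is a minimal illustration, not a specific feature-flag product's API; the function names and the flag name are hypothetical.

```python
import hashlib


def rollout_bucket(flag_name, user_id):
    """Deterministically map a (flag, user) pair to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag_name, user_id, rollout_percent):
    """True if the user falls inside the current rollout percentage."""
    return rollout_bucket(flag_name, user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(is_enabled("new-checkout", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users see the flag at a 10% rollout")
```

Because the bucket depends only on the flag name and user id, a user who is enabled at 10% stays enabled as the percentage is raised, which keeps the experience stable during the rollout.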

One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.

The end result was 99.9% availability, up from 99.5%.
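
To put those availability figures in perspective, they can be converted into allowed downtime. A small sketch, assuming a 30-day month as the measurement window (the post does not say which window was used):

```python
def downtime_minutes(availability_percent, days=30):
    """Minutes of downtime permitted at a given availability over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.5, 99.9):
        print(f"{target}% -> {downtime_minutes(target):.1f} min per 30 days")
```

Under that assumption, moving from 99.5% to 99.9% cuts the downtime budget from about 216 minutes to about 43 minutes per month.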


 
Posted : 20/09/2025 2:15 am
(@christine.carter463)
Posts: 0

We broke the technical requirements down into three areas: data residency, monitoring coverage, and cost optimization. We spent significant time on documentation and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 2x improvement.

The end result was a 90% decrease in manual toil.

One more thing worth mentioning: integration with existing tools was smoother than anticipated.

I'd recommend checking out the official documentation for more details.


 
Posted : 22/09/2025 1:04 am
(@benjamin.taylor696)
Posts: 0

We faced this too! The symptom was high latency, and root cause analysis revealed connection pool exhaustion. We fixed it by increasing the pool size and added better monitoring as a prevention measure. Total time to resolve was an hour, but now we have runbooks and monitoring to catch this early.
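
For readers who have not hit this failure mode, here is a minimal sketch of a bounded pool that fails fast when exhausted and exposes a utilization figure to alert on. It is illustrative only: the class and method names are hypothetical, and a real database driver or pooler would normally provide this for you.

```python
import queue
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class ConnectionPool:
    def __init__(self, factory, size, timeout=1.0):
        self._size = size
        self._timeout = timeout
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @property
    def in_use(self):
        # qsize() is approximate under concurrency; good enough for a gauge.
        return self._size - self._idle.qsize()

    def utilization(self):
        """Fraction of connections currently checked out (0.0 to 1.0)."""
        return self.in_use / self._size

    @contextmanager
    def connection(self):
        try:
            conn = self._idle.get(timeout=self._timeout)
        except queue.Empty:
            raise PoolExhausted(
                f"no connection free after {self._timeout}s"
            ) from None
        try:
            yield conn
        finally:
            # Always return the connection, even if the caller raised.
            self._idle.put(conn)


if __name__ == "__main__":
    pool = ConnectionPool(factory=object, size=2, timeout=0.1)
    with pool.connection():
        print(f"utilization: {pool.utilization():.0%}")
```

Exporting `utilization()` as a metric and alerting when it stays near 1.0 is one way to catch exhaustion before it shows up as latency.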

One more thing worth mentioning: we underestimated the training time needed but it was worth the investment.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.



 
Posted : 29/09/2025 10:44 am
(@john.perez881)
Posts: 0

From what we've learned, here are our key recommendations: 1) test in production-like environments, 2) implement circuit breakers, 3) share knowledge across teams, and 4) build for failure. A common mistake to avoid is ignoring security. A resource that helped us was the Google SRE book. The most important thing is to focus on outcomes over outputs.
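
As a concrete picture of the circuit breaker recommendation, here is a minimal single-threaded sketch. The thresholds, class name, and injectable `clock` parameter are illustrative assumptions; production code would more likely use an established resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping calls to a flaky dependency as `breaker.call(fetch, url)` means that after repeated failures callers get an immediate `CircuitOpenError` instead of piling up on a service that is already struggling.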

I'd recommend checking out relevant blog posts for more details.

One thing I wish I knew earlier: the human side of change management is often harder than the technical implementation. Would have saved us a lot of time.


 
Posted : 30/09/2025 7:04 pm
(@timothy.wood427)
Posts: 0

This matches our findings exactly. The most important factor was cross-team collaboration. We initially struggled with performance bottlenecks, but cost allocation tagging for accurate showback worked well for us. The ROI has been significant: we've seen a 50% improvement.
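
For context, showback from cost allocation tags usually boils down to grouping billing line items by an owner tag. A minimal sketch, assuming line items are already exported as dictionaries; the field names `cost_usd` and `tags` and the `team` tag key are hypothetical, not a real billing export schema.

```python
from collections import defaultdict


def showback(line_items, tag_key="team"):
    """Sum cost per value of `tag_key`; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key) or "untagged"
        totals[owner] += item["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    items = [
        {"cost_usd": 100.0, "tags": {"team": "ml"}},
        {"cost_usd": 50.0, "tags": {"team": "platform"}},
        {"cost_usd": 25.0, "tags": {}},
    ]
    print(showback(items))
```

Keeping an explicit `untagged` bucket makes gaps in tag coverage visible rather than silently dropping that spend from the report.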

I'd recommend checking out the community forums for more details.

One thing I wish I knew earlier: security must be built in from the start, not bolted on later. Would have saved us a lot of time.


 
Posted : 01/10/2025 7:11 pm
(@benjamin.campbell266)
Posts: 0

Love this! We've done the same in our organization and can confirm the benefits. One thing we added was feature flags for gradual rollouts. The key insight for us was that starting small and iterating is more effective than big-bang transformations. Unexpected benefits included better developer experience and faster onboarding. Happy to share more details if anyone is interested.

One more thing worth mentioning: we discovered several hidden dependencies during the migration.


 
Posted : 04/10/2025 9:20 am
(@sara)
Posts: 0

Great writeup! That said, I have some concerns about the focus on metrics. In our environment, Jenkins, GitHub Actions, and Docker worked better for us, and we found that documentation debt is as dangerous as technical debt. Context matters a lot, though: what works for us might not work for everyone. The key is to focus on outcomes.

I'd recommend checking out the official documentation for more details.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 09/10/2025 11:22 pm
(@joyce.hughes421)
Posts: 0

Architecturally, there are important trade-offs to consider: first, compliance requirements; second, monitoring coverage; third, security hardening. We spent significant time on automation and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 50% latency reduction.

I'd recommend checking out relevant blog posts for more details.

For context, we're using Datadog, PagerDuty, and Slack.

The end result was an 80% reduction in security vulnerabilities.


 
Posted : 10/10/2025 11:00 am
(@donna.jimenez105)
Posts: 0

This is exactly our story too. Our rollout had three phases: Phase 1 (1 month) involved assessment and planning, Phase 2 (2 months) focused on process documentation, and Phase 3 (2 weeks) was all about knowledge sharing. Total investment was $200K, but the payback period was only 3 months. Key success factors were executive support, a dedicated team, and clear metrics. If I could do it again, I would invest more in training.
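
The payback figure implies a monthly benefit that is easy to back out. A small sketch, assuming the benefit accrues evenly each month (the post does not say how it was measured):

```python
def implied_monthly_benefit(investment_usd, payback_months):
    """Monthly benefit needed to recover the investment in `payback_months`."""
    return investment_usd / payback_months


def cumulative_net(investment_usd, monthly_benefit_usd, months):
    """Net position after `months`, assuming a constant monthly benefit."""
    return monthly_benefit_usd * months - investment_usd


if __name__ == "__main__":
    benefit = implied_monthly_benefit(200_000, 3)
    print(f"implied monthly benefit: ${benefit:,.0f}")
    print(f"net after 12 months: ${cumulative_net(200_000, benefit, 12):,.0f}")
```

Under that even-accrual assumption, a $200K investment with a 3-month payback corresponds to roughly $67K of benefit per month.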

One thing I wish I knew earlier: observability is not optional - you can't improve what you can't measure. Would have saved us a lot of time.


 
Posted : 20/10/2025 11:23 am
Page 1 / 2