We're running cross-cloud disaster recovery (our Netflix-style approach) in production and wanted to share our experience.
Scale:
- 438 services deployed
- 24 TB data processed/month
- 44M requests/day
- 5 regions worldwide
Architecture:
- Compute: Lambda + Step Functions
- Data: Redshift
- Queue: EventBridge
Monthly cost: ~$156k
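For anyone who wants to put the scale and cost figures side by side, here is a rough back-of-the-envelope sketch. It only uses the numbers quoted above; the 30-day month and the idea of spreading the whole bill across requests are simplifying assumptions, not how the bill is actually broken down.

```python
# Rough unit economics from the figures quoted in the post.
REQUESTS_PER_DAY = 44_000_000
MONTHLY_COST_USD = 156_000
DAYS_PER_MONTH = 30          # assumption: a 30-day month
SECONDS_PER_DAY = 86_400

# Average load, ignoring daily peaks and troughs.
avg_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY

# Whole monthly bill divided evenly across all requests.
requests_per_month = REQUESTS_PER_DAY * DAYS_PER_MONTH
cost_per_million = MONTHLY_COST_USD / (requests_per_month / 1_000_000)

print(f"average load: {avg_rps:.0f} req/sec")                    # average load: 509 req/sec
print(f"cost per million requests: ${cost_per_million:.2f}")     # cost per million requests: $118.18
```

The average hides peaks, so the real per-region load at busy hours is likely well above 509 req/sec.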
Lessons learned:
1. Spot instances are production-ready
2. NAT Gateways are costly
3. FinOps team paid for itself
AMA about our setup!
We evaluated Kubernetes last quarter and decided against it because of the learning curve. Instead, we went with Grafana, which better fit our use case. The main factors were cost (30% cheaper), ease of use (2 days of training vs. 2 weeks), and community support.
Exactly! This is what we implemented last month.
For those asking about cost: in our case (AWS, us-east-1, ~500 req/sec), we're paying about $5,000/month. That's 70% vs. our old Kubernetes setup. ROI was positive after just 2 months once you factor in engineering time saved.
This is a game changer for teams doing GitOps! We integrated it with our existing Jenkins + Docker and the results were immediate. Developer productivity up 40%, deployment frequency up 3x, and MTTR down 60%. Best investment we made this year.
Pro tip: if you're implementing this, make sure to configure the scaling parameters correctly. We spent 2 weeks debugging random failures, only to discover that the default timeout was too low. We changed it from 30s to 2 min and the failures went away.
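One way to see why a too-low timeout looks like "random" failures: only the slow tail of calls gets cut off. The sketch below is purely illustrative. The latency samples are made up (hypothetical, not from this thread), and `timeout_failure_rate` is a helper name invented here, not part of any tool mentioned above.

```python
def timeout_failure_rate(latencies_s, timeout_s):
    """Return the fraction of calls that would exceed the given timeout."""
    if not latencies_s:
        return 0.0
    timed_out = sum(1 for latency in latencies_s if latency > timeout_s)
    return timed_out / len(latencies_s)

# Hypothetical latencies in seconds: mostly fast, with a slow tail.
SAMPLE_LATENCIES_S = [2, 3, 4, 5, 8, 12, 25, 35, 48, 90]

DEFAULT_TIMEOUT_S = 30    # the default mentioned in the comment
TUNED_TIMEOUT_S = 120     # the 2 min value they switched to

rate_default = timeout_failure_rate(SAMPLE_LATENCIES_S, DEFAULT_TIMEOUT_S)
rate_tuned = timeout_failure_rate(SAMPLE_LATENCIES_S, TUNED_TIMEOUT_S)

print(f"failures at {DEFAULT_TIMEOUT_S}s: {rate_default:.0%}")   # failures at 30s: 30%
print(f"failures at {TUNED_TIMEOUT_S}s: {rate_tuned:.0%}")       # failures at 120s: 0%
```

Running the same check against your own latency logs before picking a timeout is probably cheaper than two weeks of debugging.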
The migration path we took:
Week 1-2: Research & POC
Week 3-4: Staging deployment
Week 5-6: Prod rollout (10% -> 50% -> 100%)
Week 7-8: Optimization
Total cost: ~200 eng hours
Would do it again in a heartbeat.
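For the 10% -> 50% -> 100% prod rollout, a minimal sketch of one common way to split traffic is stable hash bucketing. This is an assumption about how such a rollout could be done, not a description of what was actually used here; `bucket` and `routed_to_new` are hypothetical names.

```python
import hashlib

ROLLOUT_STAGES = [10, 50, 100]  # percent of traffic per stage, from the plan above

def bucket(key: str) -> int:
    """Map a key (e.g. a request or user ID) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def routed_to_new(key: str, percent: int) -> bool:
    """True if this key should hit the new deployment at the given stage."""
    return bucket(key) < percent

# Hypothetical keys, just to show the observed split at each stage.
keys = [f"request-{i}" for i in range(10_000)]
for percent in ROLLOUT_STAGES:
    share = sum(routed_to_new(k, percent) for k in keys) / len(keys)
    print(f"target {percent}% -> observed {share:.1%}")
```

Because the bucket for a key never changes, anything routed to the new deployment at 10% stays there at 50% and 100%, which keeps behavior consistent for the same caller as the rollout widens.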