We're running a fully serverless AWS stack in production and wanted to share our experience.
Scale:
- 404 services deployed
- 30 TB data processed/month
- 36M requests/day
- 13 regions worldwide
Architecture:
- Compute: Lambda + Step Functions
- Data: DynamoDB
- Queue: Kinesis
Monthly cost: ~$75k
Lessons learned:
1. Spot instances are production-ready
2. Data transfer is the hidden cost
3. FinOps team paid for itself
AMA about our setup!
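For anyone curious what the compute path looks like, here's a minimal sketch of a Kinesis-triggered Lambda handler staging records for DynamoDB. The item shape (`pk`, `payload`) is hypothetical for illustration; our real schema is more involved, and the actual DynamoDB write is left as a comment.

```python
import base64
import json

def handler(event, context=None):
    """Decode a batch of Kinesis records and build DynamoDB-style put items.

    The item attributes here ('pk', 'payload') are illustrative only.
    """
    puts = []
    for record in event.get("Records", []):
        # Kinesis delivers record data base64-encoded inside the Lambda event
        raw = base64.b64decode(record["kinesis"]["data"])
        body = json.loads(raw)
        puts.append({
            "pk": {"S": str(body["id"])},
            "payload": {"S": json.dumps(body)},
        })
    # In production this batch would go to DynamoDB, e.g.
    # boto3.client("dynamodb").batch_write_item(RequestItems=...)
    return {"batch_size": len(puts), "items": puts}
```

Batching like this is what keeps per-request cost sane at 36M requests/day.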
Here's how the journey unfolded for us. We started about 20 months ago with a small pilot; the main initial challenge was team training. The breakthrough came when we streamlined the process, and key metrics improved: a 50% reduction in deployment time. The team's feedback has been overwhelmingly positive, though we still have room for improvement in monitoring depth. Biggest lesson learned: measure everything. Next step for us: expand to more teams.
Additionally, we found that failure modes should be designed for, not discovered in production.
Much appreciated! We're kicking off our evaluation of this approach. Could you elaborate on your success metrics? Specifically, I'm curious about risk mitigation. Also, how long did the initial implementation take? Any gotchas we should watch out for?
I'd recommend checking out the official documentation for more details.
I'd recommend checking out the community forums for more details.
This resonates strongly. We've learned that the most important factor was that failure modes should be designed for, not discovered in production. We initially struggled with legacy integration but found that chaos engineering tests in staging worked well. The ROI has been significant: roughly a 3x improvement.
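To make the chaos-engineering bit concrete, our staging fault-injection tests are roughly this shape. This is a simplified sketch with illustrative names, not our actual harness: wrap a dependency so it fails at a configurable rate, then assert the retry policy survives.

```python
import random

def flaky(call, failure_rate, rng):
    """Wrap a dependency so it fails with the given probability (crude fault injection)."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return call(*args, **kwargs)
    return wrapped

def with_retries(call, attempts=3):
    """The retry policy under test: passes as long as one attempt succeeds."""
    last = None
    for _ in range(attempts):
        try:
            return call()
        except ConnectionError as exc:
            last = exc
    raise last
```

The point is less the wrapper itself and more that the failure path gets exercised on every staging run instead of for the first time in prod.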
The end result was 50% reduction in deployment time.
Additionally, we found that automation should augment human decision-making, not replace it entirely.
This mirrors what we went through. We learned: Phase 1 (1 month) involved assessment and planning. Phase 2 (2 months) focused on process documentation. Phase 3 (ongoing) was all about knowledge sharing. Total investment was $50K but the payback period was only 9 months. Key success factors: good tooling, training, patience. If I could do it again, I would invest more in training.
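For anyone sanity-checking the payback math, a 9-month payback on $50K implies roughly this monthly saving (break-even only; it ignores ongoing costs):

```python
investment = 50_000        # total spend cited above
payback_months = 9
# Monthly saving needed just to break even in the stated window
implied_monthly_savings = investment / payback_months  # ~$5,556/month
```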
For context, we're using Jenkins, GitHub Actions, and Docker.
Good point! We diverged a bit, using Terraform, AWS CDK, and CloudFormation. The main reason was that documentation debt is as dangerous as technical debt. However, I can see how your method would be better for larger teams. Have you considered chaos engineering tests in staging?
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.
This resonates with what we experienced last month. The problem: deployment failures. Our initial approach was manual intervention, but that didn't work because it was too error-prone. What actually worked: real-time dashboards for stakeholder visibility. The key insight was that security must be built in from the start, not bolted on later. Now we're able to detect issues early.
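The "detect issues early" part boils down to a rolling error-rate check feeding the dashboard. A stripped-down sketch — the window size and threshold here are illustrative, not our production settings:

```python
from collections import deque

class ErrorRateMonitor:
    """Rolling error-rate check over the last N requests; fires past a threshold."""

    def __init__(self, window=100, threshold=0.05):
        self.window = deque(maxlen=window)  # oldest outcomes fall off automatically
        self.threshold = threshold

    def record(self, ok):
        """Record one request outcome; return True if the alarm should fire."""
        self.window.append(0 if ok else 1)
        rate = sum(self.window) / len(self.window)
        return rate >= self.threshold
```

In practice the same numbers go to the dashboard, so stakeholders see the trend before the alarm fires.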
I'd recommend checking out relevant blog posts for more details.
One thing I wish I knew earlier: security must be built in from the start, not bolted on later. Would have saved us a lot of time.
We experienced the same thing! Our takeaway: Phase 1 (2 weeks) involved stakeholder alignment. Phase 2 (2 months) focused on pilot implementation. Phase 3 (ongoing) was all about knowledge sharing. Total investment was $200K but the payback period was only 3 months. Key success factors: executive support, dedicated team, clear metrics. If I could do it again, I would set clearer success metrics.
Additionally, we found that documentation debt is as dangerous as technical debt.
Great writeup! That said, I have some concerns about the metrics focus. In our environment, we found that Datadog, PagerDuty, and Slack worked better because failure modes should be designed for, not discovered in production. Of course, context matters a lot; what works for us might not work for everyone. The key is to experiment and measure.
The end result was 3x increase in deployment frequency.
Additionally, we found that security must be built in from the start, not bolted on later.
We tackled this from a different angle using Grafana, Loki, and Tempo. The main reason was that the human side of change management is often harder than the technical implementation. However, I can see how your method would be better for larger teams. Have you considered integration with our incident management system?
For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.
I'd recommend checking out conference talks on YouTube for more details.
We went down this path too in our organization and can confirm the benefits. One thing we added was chaos engineering tests in staging. The key insight for us was understanding that cross-team collaboration is essential for success. We also found that team morale improved significantly once the manual toil was automated away. Happy to share more details if anyone is interested.
For context, we're using Grafana, Loki, and Tempo.
Really helpful breakdown here! I have a few questions: 1) How did you handle authentication? 2) What was your approach to canary releases? 3) Did you encounter any issues with costs? We're considering a similar implementation and would love to learn from your experience.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
Spot on! From what we've seen, the most important factor was that failure modes should be designed for, not discovered in production. We initially struggled with scaling issues but found that real-time dashboards for stakeholder visibility worked well. The ROI has been significant: roughly a 3x improvement.
Great points overall! One aspect I'd add is security considerations. We learned this the hard way, and now we always make sure to test regularly. It adds maybe an hour to our process but prevents a lot of headaches down the line.
One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.
Here's what worked well for us: 1) Document as you go 2) Use feature flags 3) Review and iterate 4) Keep it simple. Common mistakes to avoid: skipping documentation. Resources that helped us: Accelerate by DORA. The most important thing is collaboration over tools.
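On point 2, a feature flag doesn't need to be fancy to be useful; the core is just a deterministic percentage rollout. A toy sketch, not any particular flag service:

```python
class FeatureFlags:
    """In-process flag store with percentage rollout (illustrative only)."""

    def __init__(self, flags=None):
        # name -> rollout percentage (0-100); in practice loaded from config
        self.flags = dict(flags or {})

    def enabled(self, name, user_id):
        """Deterministic bucketing: the same user always gets the same answer."""
        pct = self.flags.get(name, 0)
        return user_id % 100 < pct
```

Determinism is the important property: a user flipping in and out of a feature between requests causes far more confusion than a slow rollout.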
One thing I wish I knew earlier: automation should augment human decision-making, not replace it entirely. Would have saved us a lot of time.
Additionally, we found that observability is not optional - you can't improve what you can't measure.