Appreciate you laying this out so clearly! I have a few questions: 1) How did you handle testing? 2) What was your approach to backup? 3) Did you encounter any issues with consistency? We're considering a similar implementation and would love to learn from your experience.
For context, we're using Grafana, Loki, and Tempo.
The end result was a 3x increase in deployment frequency.
I'd recommend checking out the community forums for more details.
Great post! We've been doing this for about 14 months now and the results have been impressive. Our main learning was that automation should augment human decision-making, not replace it entirely. We also discovered several hidden dependencies during the migration. For anyone starting out, I'd recommend setting up cost allocation tagging early for accurate showback.
For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.
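The cost allocation tagging mentioned above boils down to rolling spend up by a team label. A minimal sketch of that showback calculation (the label key, workload names, and cost figures are all invented for illustration, not the commenter's actual setup):

```python
# Minimal showback sketch: roll hypothetical per-workload costs up by team label.
# The label key and sample data are illustrative assumptions.
from collections import defaultdict

TEAM_LABEL = "cost-center/team"  # hypothetical label key

workloads = [
    {"name": "api", "labels": {TEAM_LABEL: "payments"}, "monthly_cost": 420.0},
    {"name": "worker", "labels": {TEAM_LABEL: "payments"}, "monthly_cost": 180.0},
    {"name": "etl", "labels": {TEAM_LABEL: "data"}, "monthly_cost": 300.0},
    {"name": "legacy", "labels": {}, "monthly_cost": 75.0},  # untagged spend
]

def showback(items):
    """Sum monthly cost per team; untagged spend lands in 'unallocated'."""
    totals = defaultdict(float)
    for item in items:
        team = item["labels"].get(TEAM_LABEL, "unallocated")
        totals[team] += item["monthly_cost"]
    return dict(totals)

print(showback(workloads))
```

The "unallocated" bucket is the useful part in practice: it makes untagged resources visible so the tagging gap shrinks over time.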
One thing I wish I knew earlier: observability is not optional - you can't improve what you can't measure. Would have saved us a lot of time.
Lessons we learned along the way: 1) Test in production-like environments 2) Monitor proactively 3) Practice incident response 4) Measure what matters. Common mistakes to avoid: ignoring security. Resources that helped us: Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim (the DORA research). The most important thing is learning over blame.
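One concrete way to "measure what matters" is an SLO error budget. A toy sketch of the arithmetic (the 99.9% target and request counts below are invented examples, not figures from this thread):

```python
# Error-budget sketch: what fraction of the allowed failure budget is spent?
# The SLO target and traffic numbers are illustrative assumptions.
def error_budget_spent(total_requests, failed_requests, slo_target=0.999):
    """Return the fraction of the error budget consumed (may exceed 1.0)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed_failures

# 1M requests at a 99.9% SLO allows ~1000 failures; 250 failures ~= 25% spent.
print(error_budget_spent(1_000_000, 250))
```

A value above 1.0 means the budget is blown, which is the usual trigger for pausing feature work in favor of reliability.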
For context, we're using Terraform, AWS CDK, and CloudFormation.
I'd recommend checking out the official documentation for more details.
One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.
Our recommended approach: 1) Test in production-like environments 2) Use feature flags 3) Practice incident response 4) Build for failure. Common mistakes to avoid: over-engineering early. Resources that helped us: The Phoenix Project by Gene Kim. The most important thing is learning over blame.
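The feature-flag point can be sketched as a deterministic percentage rollout. A toy illustration (real setups usually delegate this to a flag service; the flag names and percentages here are invented):

```python
# Toy feature flag with deterministic percentage rollout.
# Hashing the user id keeps each user's experience stable across requests.
import zlib

FLAGS = {"new-checkout": 25}  # flag name -> rollout percentage (invented)

def is_enabled(flag, user_id):
    """Deterministically bucket user_id into 0..99 and compare to rollout %."""
    pct = FLAGS.get(flag, 0)
    bucket = zlib.crc32(f"{flag}:{user_id}".encode()) % 100
    return bucket < pct

# A flag at 0% is off for everyone; at 100% it is on for everyone.
FLAGS["legacy-off"] = 0
FLAGS["stable-on"] = 100
print(is_enabled("legacy-off", "alice"), is_enabled("stable-on", "alice"))
```

Hashing on `flag:user_id` rather than `user_id` alone keeps rollouts for different flags statistically independent.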
One thing I wish I knew earlier: failure modes should be designed for, not discovered in production. Would have saved us a lot of time.
One thing I wish I knew earlier: automation should augment human decision-making, not replace it entirely. Would have saved us a lot of time.
Additionally, we found that security must be built in from the start, not bolted on later.
Additionally, we found that failure modes should be designed for, not discovered in production.
The end result was an 80% reduction in security vulnerabilities.
One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.
Great approach! We've done something similar in our organization and can confirm the benefits. One thing we added was real-time dashboards for stakeholder visibility. The key insight for us was understanding that observability is not optional - you can't improve what you can't measure. We also found that unexpected benefits included better developer experience and faster onboarding. Happy to share more details if anyone is interested.
For context, we're using Vault, AWS KMS, and SOPS.
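Whatever the backend (Vault, KMS, SOPS), the application side usually reduces to "read the secret at runtime, never hard-code it". A minimal stdlib sketch of that fail-fast pattern (the variable name is an invented example, and the inline assignment only stands in for whatever injects the secret):

```python
# Fail-fast secret loading: the secret reaches the process via its environment
# (injected by a Vault agent, KMS decryption step, SOPS, etc.), never the code.
import os

def require_secret(name):
    """Return the named environment variable or raise immediately at startup."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value

os.environ["DB_PASSWORD"] = "example-only"  # stands in for the injected secret
print(require_secret("DB_PASSWORD"))
```

Failing at startup rather than on first use turns a missing secret into an immediate, obvious deploy failure instead of a latent runtime error.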
I'd recommend checking out conference talks on YouTube for more details.
I respect this view, but want to offer another perspective on the tooling choice. In our environment, we found that Istio, Linkerd, and Envoy worked better because observability is not optional - you can't improve what you can't measure. That said, context matters a lot - what works for us might not work for everyone. The key is to focus on outcomes.
The end result was a 90% decrease in manual toil.
I'd recommend checking out relevant blog posts for more details.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
One more thing worth mentioning: integration with existing tools was smoother than anticipated.
For context, we're using Datadog, PagerDuty, and Slack.
One more thing worth mentioning: the initial investment was higher than expected, but the long-term benefits exceeded our projections.
Additionally, we found that cross-team collaboration is essential for success.