We went down this path in our organization too and can confirm the benefits. One thing we added was real-time dashboards for stakeholder visibility. The key insight for us was that cross-team collaboration is essential, and we had to iterate several times before finding the right balance. Happy to share more details if anyone is interested.
For context, we're using Grafana, Loki, and Tempo.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
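For anyone evaluating a similar stack, here's a rough sketch of what shipping a log line into Loki looks like. The endpoint path and payload shape below follow Loki's push API as I understand it, but treat the details (and the label names, which are made up for illustration) as assumptions to verify against your Loki version:

```python
import json
import time

def loki_push_payload(labels: dict, lines: list) -> str:
    """Build the JSON body for Loki's /loki/api/v1/push endpoint.

    Loki expects nanosecond-precision string timestamps; the labels
    become the stream selector you later query with LogQL in Grafana.
    """
    ts_ns = str(time.time_ns())
    return json.dumps({
        "streams": [{
            "stream": labels,  # e.g. {"app": "checkout", "env": "staging"}
            "values": [[ts_ns, line] for line in lines],
        }]
    })

# Hypothetical service labels, not from a real deployment.
payload = loki_push_payload(
    {"app": "checkout", "env": "staging"},
    ["request handled in 42ms"],
)
```

From there, a LogQL query like `{app="checkout"}` in Grafana surfaces the stream, which is what drives the dashboards mentioned above.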
Great post! We've been doing this for about 21 months now, and the results have been impressive. Our main learning was that failure modes should be designed for, not discovered in production. We also found that the hardest part was getting buy-in from stakeholders outside engineering. For anyone starting out, I'd recommend running chaos engineering tests in staging.
For context, we're using Vault, AWS KMS, and SOPS.
The end result was an 80% reduction in security vulnerabilities.
One thing I wish I'd known earlier: automation should augment human decision-making, not replace it entirely. That alone would have saved us a lot of time.
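To make the chaos-engineering suggestion concrete, here's the kind of minimal fault-injection test we could have started with in staging: wrap a dependency call, inject failures at a configurable rate, and assert the caller degrades to a fallback instead of crashing. All names here are hypothetical, not from any particular chaos framework:

```python
import random

class FlakyDependency:
    """Simulates a downstream service that fails at a configurable rate."""

    def __init__(self, failure_rate: float, seed: int = 0):
        self.failure_rate = failure_rate
        # Seeded RNG so staging runs are reproducible.
        self.rng = random.Random(seed)

    def fetch_price(self, sku: str) -> float:
        if self.rng.random() < self.failure_rate:
            raise ConnectionError(f"injected fault for {sku}")
        return 9.99  # stand-in for a real response

def price_with_fallback(dep: FlakyDependency, sku: str, cached: float) -> float:
    """The behavior under test: degrade to a cached price, never crash."""
    try:
        return dep.fetch_price(sku)
    except ConnectionError:
        return cached

# Chaos test: even at a 50% failure rate, callers always get *a* price.
dep = FlakyDependency(failure_rate=0.5)
prices = [price_with_fallback(dep, "sku-123", cached=8.99) for _ in range(100)]
assert all(p in (9.99, 8.99) for p in prices)
```

The point is designing the failure mode (serve the cached price) up front, then proving it holds under injected faults rather than discovering it in production.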
Great writeup! That said, I have some concerns about the timeline. In our environment, we found that Terraform, AWS CDK, and CloudFormation worked better because failure modes should be designed for, not discovered in production. Of course, context matters a lot: what works for us might not work for everyone. The key is to start small and iterate.
Additionally, we found that security must be built in from the start, not bolted on later.
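On the "built in from the start" point, one cheap way to enforce it with Terraform is a policy check over `terraform show -json` plan output in CI. A rough sketch (the plan-walking below assumes the documented plan JSON shape, and the encryption attribute shown applies to older AWS provider versions where it lived on the bucket resource itself, so adapt it to your provider):

```python
def find_unencrypted_buckets(plan: dict) -> list:
    """Scan Terraform plan JSON (terraform show -json output) for S3
    buckets being created without server-side encryption configured."""
    violations = []
    for res in plan.get("resource_changes", []):
        if res.get("type") != "aws_s3_bucket":
            continue
        if "create" not in res.get("change", {}).get("actions", []):
            continue
        after = res["change"].get("after") or {}
        if not after.get("server_side_encryption_configuration"):
            violations.append(res["address"])
    return violations

# Trimmed-down stand-in for real plan output.
sample_plan = {
    "resource_changes": [{
        "address": "aws_s3_bucket.logs",
        "type": "aws_s3_bucket",
        "change": {"actions": ["create"], "after": {"bucket": "logs"}},
    }]
}
assert find_unencrypted_buckets(sample_plan) == ["aws_s3_bucket.logs"]
```

Failing the pipeline on any violation is what "not bolted on later" looked like for us: the insecure resource never gets applied in the first place.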