This mirrors what we went through. We ran it in three phases: Phase 1 (1 month) was assessment and planning, Phase 2 (3 months) focused on process documentation, and Phase 3 (1 month) was all about knowledge sharing. Total investment was $50K, but the payback period was only 3 months. Key success factors: good tooling, training, and patience. If I could do it again, I would invest more in training.
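As a rough sanity check on those numbers, here is a small sketch of what a 3-month payback on $50K implies. It assumes a simple, undiscounted payback model (investment divided by a constant monthly benefit), which is my simplification rather than how we actually tracked it. The function and variable names are just for illustration.

```python
def implied_monthly_benefit(investment: float, payback_months: float) -> float:
    """Simple (undiscounted) payback: investment / monthly benefit = payback period.

    Assumption: the benefit is constant month to month.
    """
    if payback_months <= 0:
        raise ValueError("payback_months must be positive")
    return investment / payback_months


# Phase durations in months, taken from the post above.
phase_months = {
    "assessment and planning": 1,
    "process documentation": 3,
    "knowledge sharing": 1,
}

total_months = sum(phase_months.values())
monthly = implied_monthly_benefit(investment=50_000, payback_months=3)

print(f"Rollout length: {total_months} months")
print(f"Implied monthly benefit: ${monthly:,.0f}")
```

Under that assumption, the rollout took 5 months in total and the work would need to return roughly $16.7K per month to pay back in 3 months.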
For context, we're using Vault, AWS KMS, and SOPS.
One thing I wish I had known earlier: observability is not optional. You can't improve what you can't measure. Knowing that up front would have saved us a lot of time.
This is almost identical to what we faced. The problem: scaling issues. Our initial approach was manual intervention, but that didn't work because we lacked visibility. What actually worked: chaos engineering tests in staging. The key insight was that starting small and iterating is more effective than big-bang transformations. Now we're able to deploy with confidence.
I'd recommend checking out relevant blog posts for more details.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.