I'd like to share our complete experience with this. We started about 16 months ago with a small pilot. Initial challenges included performance issues. The breakthrough came when we automated the testing. Key metrics improved: 99.9% availability, up from 99.5%. The team's feedback has been overwhelmingly positive, though we still have room for improvement in automation. Lessons learned: measure everything. Next steps for us: expand to more teams.
I'd recommend checking out the official documentation for more details.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
For context, we're using Grafana, Loki, and Tempo.
For context, we're using Datadog, PagerDuty, and Slack.
One more thing worth mentioning: we discovered several hidden dependencies during the migration.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
The end result was 80% reduction in security vulnerabilities.
The end result was 90% decrease in manual toil.