Here's a walkthrough of our whole process. We started about 22 months ago with a small pilot. Early challenges included tool integration; the breakthrough came when we improved observability. The headline metric: a 90% decrease in manual toil. The team's feedback has been overwhelmingly positive, though we still have room to improve testing coverage. Lesson learned: communicate often. Next step for us: expand to more teams.
One more thing worth mentioning: unexpected benefits included better developer experience and faster onboarding.
On the operational side, some practices we've settled on: Monitoring - CloudWatch with custom metrics. Alerting - Opsgenie with escalation policies. Documentation - Confluence with templates. Training - pairing sessions. These have helped us keep the incident count low while still moving fast on new features.
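For anyone wondering what "CloudWatch with custom metrics" can look like in practice, here is a minimal sketch, not the poster's actual setup. The namespace `MyTeam/Deploys`, the metric name `ManualSteps`, and the helper names are hypothetical; the request shape follows boto3's `put_metric_data` call, and `publish` assumes boto3 is installed and AWS credentials are configured.

```python
from typing import Any, Dict, Optional


def build_metric_request(
    namespace: str,
    name: str,
    value: float,
    unit: str = "Count",
    dimensions: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build the keyword arguments for CloudWatch's put_metric_data."""
    datum: Dict[str, Any] = {"MetricName": name, "Value": value, "Unit": unit}
    if dimensions:
        # CloudWatch expects dimensions as a list of Name/Value pairs.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in dimensions.items()
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


def publish(request: Dict[str, Any]) -> None:
    """Send the request to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_data(**request)


# Hypothetical example: record three manual steps for one service.
request = build_metric_request(
    "MyTeam/Deploys", "ManualSteps", 3, dimensions={"Service": "checkout"}
)
```

Keeping the request builder separate from the network call makes the metric shape easy to unit test without touching AWS.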
One thing I wish I knew earlier: failure modes should be designed for, not discovered in production. Would have saved us a lot of time.
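To make "designed for, not discovered" concrete, here is a rough sketch of one way to do it: decide up front what happens when a dependency fails, instead of letting the exception surface in production. The helper `call_with_fallback` and its parameters are hypothetical, and bounded retries with a fallback is just one pattern among several (timeouts and circuit breakers are others).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
    delay_s: float = 0.0,
) -> T:
    """Try primary up to retries + 1 times, then return fallback()."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            # The failure mode is explicit: wait, retry, then degrade.
            if attempt < retries:
                time.sleep(delay_s)
    return fallback()


def fetch_live_config() -> str:
    # Stand-in for a dependency that is currently down.
    raise RuntimeError("config service unavailable")


config = call_with_fallback(fetch_live_config, lambda: "cached-config")
```

The point is that the degraded behavior (here, a cached value) is a decision made at design time and can be tested like any other code path.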
For context, we're using Datadog, PagerDuty, and Slack.
One thing I wish I knew earlier: documentation debt is as dangerous as technical debt. Would have saved us a lot of time.
Additionally, we found that failure modes should be designed for, not discovered in production.
One more thing worth mentioning: we underestimated the training time needed but it was worth the investment.
Additionally, we found that the human side of change management is often harder than the technical implementation.
Thoughtful post, though I'd challenge one aspect: the metrics focus. In our environment, Elasticsearch, Fluentd, and Kibana worked better, because observability is not optional: you can't improve what you can't measure. That said, context matters a lot, and what works for us might not work for everyone. The key is to invest in training.
For context, we're using Jenkins, GitHub Actions, and Docker.
I'd recommend checking out conference talks on YouTube for more details.
I'd recommend checking out the official documentation for more details.