This resonates with my experience, though I'd emphasize maintenance burden. We learned the hard way that the hardest part was getting buy-in from stakeholders outside engineering. Now we always make sure to include them in design reviews. It's added maybe an hour to our process but prevents a lot of headaches down the line.
For context, we're using Grafana, Loki, and Tempo.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.
Great post! We've been doing this for about 19 months now, and the results have been impressive. Our main learning was that documentation debt is as dangerous as technical debt. We also discovered that we had underestimated the training time needed, but it was worth the investment. For anyone starting out, I'd recommend integrating with your incident management system.
For context, we're using Elasticsearch, Fluentd, and Kibana.
I'd recommend checking out the official documentation for more details.
For context, we're using Jenkins, GitHub Actions, and Docker.