Here are some operational practices that worked for us: Monitoring - Datadog APM and logs. Alerting - custom Slack integration. Documentat...
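For anyone curious what a custom Slack alerting integration can look like, here is a minimal sketch, not the poster's actual setup. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the service name, and the message are all made up for illustration, and the webhook URL is something you would supply yourself.

```python
import json
import urllib.request


def build_slack_alert(service: str, severity: str, message: str) -> dict:
    """Build a Slack incoming-webhook payload (a JSON object with a "text" field)."""
    return {"text": f"[{severity.upper()}] {service}: {message}"}


def send_slack_alert(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to the given webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical alert; nothing is sent until send_slack_alert() is called
# with a real webhook URL.
payload = build_slack_alert("checkout-api", "critical", "p99 latency above 2s")
```

Keeping payload construction separate from the network call makes the message format easy to unit test without touching Slack.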
I can relate! Here's what we learned: Phase 1 (1 month) involved tool evaluation. Phase 2 (1 month) focused on team training. Phase 3 (2 weeks) was a...
Playing devil's advocate here on the tooling choice. In our environment, we found that Datadog, PagerDuty, and Slack worked better because cross-team ...
Let me break down the technical requirements. First, data residency. Second, monitoring coverage. Third, security hardening. We spent significant tim...
Great post! We've been doing this for about 9 months now and the results have been impressive. Our main learning was that failure modes should be desi...
Love this! We've implemented this in our organization and can confirm the benefits. One thing we added was cost allocation tagging for accurate showback. The key insight for...
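To make the cost allocation tagging idea concrete, here is a rough sketch of how showback could be computed from tagged resources. It is only an illustration: the tag keys (`team`, `cost-center`), the resource records, and the function names are assumptions, and real numbers would come from your cloud provider's billing export.

```python
from collections import defaultdict

# Hypothetical tag keys that every resource is expected to carry.
REQUIRED_TAGS = ("team", "cost-center")


def showback_by_tag(resources, tag_key="team"):
    """Sum monthly cost per tag value; untagged resources are grouped together."""
    totals = defaultdict(float)
    for resource in resources:
        owner = resource.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += resource["monthly_cost"]
    return dict(totals)


def find_missing_tags(resources, required=REQUIRED_TAGS):
    """Map each resource id to the required tag keys it lacks."""
    gaps = {}
    for resource in resources:
        tags = resource.get("tags", {})
        missing = [key for key in required if key not in tags]
        if missing:
            gaps[resource["id"]] = missing
    return gaps


# Made-up sample data for illustration only.
resources = [
    {"id": "vm-1", "monthly_cost": 120.0,
     "tags": {"team": "payments", "cost-center": "cc-100"}},
    {"id": "db-1", "monthly_cost": 30.0, "tags": {"team": "payments"}},
    {"id": "bucket-1", "monthly_cost": 12.5, "tags": {}},
]

print(showback_by_tag(resources))
print(find_missing_tags(resources))
```

Reporting the untagged bucket alongside the per-team totals shows how much spend the showback report cannot yet attribute.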
We ran a parallel implementation in our organization and can confirm the benefits. One thing we added was compliance scanning in the CI pipeline. The key i...
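As a rough illustration of compliance scanning as a CI step, the sketch below checks resource configs against a couple of rules and derives an exit code that would fail the build. The rules, config keys, and resource names are hypothetical; a real pipeline would more likely run a dedicated scanner against actual infrastructure definitions.

```python
# Hypothetical rules: each maps a rule name to a check that returns True
# when the resource config is compliant.
RULES = {
    "encryption_at_rest": lambda cfg: cfg.get("encryption_at_rest") is True,
    "no_public_access": lambda cfg: cfg.get("public_access") is False,
}


def scan(resources):
    """Return a list of (resource_name, rule_name) pairs for every violation."""
    violations = []
    for name, cfg in resources.items():
        for rule_name, is_compliant in RULES.items():
            if not is_compliant(cfg):
                violations.append((name, rule_name))
    return violations


# Made-up configs for illustration only.
resources = {
    "orders-db": {"encryption_at_rest": True, "public_access": False},
    "logs-bucket": {"encryption_at_rest": False, "public_access": True},
}

violations = scan(resources)
for resource_name, rule_name in violations:
    print(f"FAIL {resource_name}: {rule_name}")

# A CI job would pass this value to sys.exit() so violations fail the build.
exit_code = 1 if violations else 0
```

A missing key counts as a violation here, so a resource has to opt in to compliance explicitly instead of passing by omission.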
Adding some engineering details from our implementation. Architecture: serverless with Lambda. Tools used: Grafana, Loki, and Tempo. Configuration hig...
Excellent thread! One consideration often overlooked is maintenance burden. We learned this the hard way when we had to iterate several times before f...
We felt this too! Here's what we learned: Phase 1 (2 weeks) involved assessment and planning. Phase 2 (2 months) focused on pilot implementation. Phase...
Here's how our journey unfolded with this. We started about 6 months ago with a small pilot. Initial challenges included team training. The breakthrou...
Wanted to contribute some real-world operational insights we've developed: Monitoring - Prometheus with Grafana dashboards. Alerting - Opsgenie with e...
Helpful context, as we're evaluating this approach. Could you elaborate on success metrics? Specifically, I'm curious about risk mitigation. Also, how...
While this is well-reasoned, I see things differently on the timeline. In our environment, we found that Grafana, Loki, and Tempo worked better becaus...
Great post! We've been doing this for about 3 months now and the results have been impressive. Our main learning was that observability is not optiona...