Great post! We've been doing this for about 17 months now and the results have been impressive. Our main learning was that security must be built in f...
This is almost identical to what we faced. The problem: security vulnerabilities. Our initial approach was ad-hoc monitoring, but that didn't work because...
Appreciate you laying this out so clearly! I have a few questions: 1) How did you handle scaling? 2) What was your approach to blue-green? 3) Did you ...
Great approach! We did the same in our organization and can confirm the benefits. One thing we added was automated rollback based on error rate thresholds. The key in...
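For anyone who wants to try this, here's a minimal sketch of the idea, assuming a Prometheus-style query endpoint and kubectl-managed deployments (the URLs, query, and names are illustrative, not our production code):

```python
import subprocess
import time

import requests

# Assumed setup: error rate exposed via a Prometheus query endpoint and a
# kubectl-managed deployment. Endpoint, query, and threshold are placeholders.
PROM_URL = "http://prometheus.internal:9090/api/v1/query"
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)
THRESHOLD = 0.05          # roll back above 5% errors
CHECK_INTERVAL_SECS = 30  # how often to re-check after a deploy

def current_error_rate() -> float:
    """Evaluate the error-rate query and return it as a fraction."""
    resp = requests.get(PROM_URL, params={"query": ERROR_RATE_QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def watch_and_rollback(deployment: str, checks: int = 10) -> None:
    """Poll the error rate after a deploy; roll back if it crosses the threshold."""
    for _ in range(checks):
        rate = current_error_rate()
        if rate > THRESHOLD:
            print(f"error rate {rate:.2%} above {THRESHOLD:.0%}, rolling back {deployment}")
            subprocess.run(
                ["kubectl", "rollout", "undo", f"deployment/{deployment}"],
                check=True,
            )
            return
        time.sleep(CHECK_INTERVAL_SECS)
    print("deploy looks healthy")

if __name__ == "__main__":
    watch_and_rollback("api-server")
```

The design choice that mattered for us was bounding the watch window: a healthy deploy exits after a fixed number of clean checks instead of being babysat by the script forever.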
Great post! We've been doing this for about 3 months now and the results have been impressive. Our main learning was that observability is not optional...
Solid analysis! From our perspective, the deciding factor was team dynamics. We learned this the hard way, though unexpected benefits included better developer experience and fa...
We tackled this from a different angle using Elasticsearch, Fluentd, and Kibana. The main reason was that failure modes should be designed for, not discovered...
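To make that concrete, here's roughly what the app side looks like, assuming Fluentd's in_http input on its default port 9880 forwarding into Elasticsearch (the host and tag are placeholders):

```python
import json
import time
from urllib.request import Request, urlopen

# Assumed setup: Fluentd's in_http input on its default port, forwarding to
# Elasticsearch. The host and the tag in the path are placeholders.
FLUENTD_ENDPOINT = "http://fluentd.internal:9880/app.events"

def emit_event(event: dict) -> None:
    """Ship one structured event to Fluentd; it lands in Elasticsearch for Kibana."""
    body = json.dumps({**event, "emitted_at": time.time()}).encode()
    req = Request(
        FLUENTD_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urlopen(req, timeout=5).read()

# Structured fields (service, outcome) make filtering in Kibana trivial later.
emit_event({"service": "checkout", "outcome": "payment_declined", "order_id": "o-123"})
```

Keeping fields structured from the start is what pays off: you filter on service or outcome in Kibana instead of regexing raw log lines.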
Allow me to present an alternative view on the tooling choice. In our environment, we found that Grafana, Loki, and Tempo worked better because failure...
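A quick illustration of what we mean: pulling the last few minutes of error lines for a service is one LogQL query over Loki's HTTP API. A sketch, with the URL and label names as placeholders:

```python
import time

import requests

# Assumed setup: a reachable Loki instance; the URL and the "app" label
# are placeholders for whatever your streams actually use.
LOKI_URL = "http://loki.internal:3100/loki/api/v1/query_range"

def recent_errors(service: str, minutes: int = 15) -> list[str]:
    """Pull recent error lines for one service via LogQL."""
    now = time.time()
    params = {
        "query": f'{{app="{service}"}} |= "error"',  # stream selector + line filter
        "start": int((now - minutes * 60) * 1e9),    # Loki expects ns timestamps
        "end": int(now * 1e9),
        "limit": 100,
    }
    resp = requests.get(LOKI_URL, params=params, timeout=10)
    resp.raise_for_status()
    streams = resp.json()["data"]["result"]
    return [line for stream in streams for _, line in stream["values"]]

for line in recent_errors("checkout"):
    print(line)
```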
I hear you, but here's where I disagree on the timeline. In our environment, we found that Datadog, PagerDuty, and Slack worked better because the human...
Building on this discussion, I'd highlight cost analysis. We learned this the hard way when we underestimated the training time needed, but it was worth...
Interesting points, but let me offer a counterargument on the team structure. In our environment, we found that Grafana, Loki, and Tempo worked better...
This is almost identical to what we faced. The problem: security vulnerabilities. Our initial approach was manual intervention, but that didn't work because...
From an operations perspective, here's what we recommend, based on the setup we've developed: Monitoring - Prometheus with Grafana dashboards. Alerting - custom Slack integration...
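Here's a stripped-down sketch of the alerting half, assuming a Prometheus server plus a Slack incoming webhook (the URLs, queries, and thresholds are illustrative, not our actual rules):

```python
import requests

# Assumed setup: both URLs below are placeholders for your own endpoints.
PROMETHEUS = "http://prometheus.internal:9090"
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

CHECKS = [
    # (alert name, instant PromQL expression, value must stay below this)
    ("5xx requests/sec", 'sum(rate(http_requests_total{status=~"5.."}[5m]))', 5.0),
    ("p99 latency (s)",
     'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
     1.0),
]

def evaluate(expr: str) -> float:
    """Run one instant query against Prometheus and return the first value."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def notify(text: str) -> None:
    """Send a plain-text message through the Slack incoming webhook."""
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10).raise_for_status()

for name, expr, threshold in CHECKS:
    value = evaluate(expr)
    if value > threshold:
        notify(f":rotating_light: {name} at {value:.2f}, threshold {threshold}")
```

The webhook call is the whole "custom Slack integration" in miniature: a POST with a text payload; dedup, routing, and silencing live elsewhere in the real setup.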