Here's what worked well for us:
1) Test in production-like environments
2) Use feature flags (a minimal sketch follows below)
3) Practice incident response
4) Keep it simple

Common mistakes to avoid: not measuring outcomes. Resources that helped us: the Google SRE book. The most important thing is consistency over perfection.
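Since feature flags carry most of the weight in that list, here's a minimal sketch of the gating pattern in Python. Everything in it is illustrative: the flag name, the env-var-backed store, and the checkout functions are placeholders for whatever flag backend and code paths you actually have.

```python
import os

# Hypothetical in-memory flag store; in practice this would be backed by a
# flag service (LaunchDarkly, Unleash, a ConfigMap, ...) -- names are placeholders.
FLAGS = {
    "new-checkout-flow": os.getenv("FF_NEW_CHECKOUT", "off") == "on",
}

def is_enabled(flag: str, default: bool = False) -> bool:
    """Look up a flag, falling back to a safe default if it is unknown."""
    return FLAGS.get(flag, default)

def legacy_checkout(order: dict) -> str:
    return f"legacy path for order {order['id']}"

def new_checkout(order: dict) -> str:
    return f"new path for order {order['id']}"

def handle_checkout(order: dict) -> str:
    # The new path ships dark; flipping the env var turns it on, and rolling
    # back is flipping it off again -- no redeploy needed.
    if is_enabled("new-checkout-flow"):
        return new_checkout(order)
    return legacy_checkout(order)

if __name__ == "__main__":
    print(handle_checkout({"id": 42}))
```

The point is operational rather than clever: the new path ships dark, and turning it on or off is a config change instead of a deploy.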
Two more things worth mentioning: integration with existing tools was smoother than anticipated, and we had to iterate several times before finding the right balance.
Perfect timing! We're currently evaluating this approach. Could you elaborate on how you selected your tools and how you measured success? Also, how long did the initial implementation take? Any gotchas we should watch out for?
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
The end result was availability up from 99.5% to 99.9%, and a 60% improvement in developer productivity.
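For anyone translating those nines into an error budget, the arithmetic over a 30-day month looks like this:

```python
# Downtime allowed per 30-day month at each SLO level.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

for slo in (0.995, 0.999):
    budget_min = (1 - slo) * MINUTES_PER_MONTH
    print(f"{slo:.1%} allows {budget_min:.0f} min of downtime per month")

# Output:
# 99.5% allows 216 min of downtime per month
# 99.9% allows 43 min of downtime per month
```

In other words, the jump from 99.5% to 99.9% shrank the monthly downtime budget from about three and a half hours to about 43 minutes.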
For context, on the monitoring and incident-response side we're using Datadog, PagerDuty, and Slack.
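Most of the alert wiring lives in the Datadog UI, but for hand-rolled glue, here's a minimal sketch of firing a PagerDuty event and a Slack message from Python. It assumes a PagerDuty Events API v2 routing key and a Slack incoming-webhook URL in env vars; the service name and alert text are made up.

```python
import json
import os
import urllib.request

def _post_json(url: str, body: dict) -> None:
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget for brevity; check the status in real code

def page_oncall(summary: str) -> None:
    # PagerDuty Events API v2 -- the routing key comes from a PD service integration.
    _post_json("https://events.pagerduty.com/v2/enqueue", {
        "routing_key": os.environ["PD_ROUTING_KEY"],
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "checkout-svc",  # placeholder service name
            "severity": "critical",
        },
    })

def notify_slack(text: str) -> None:
    # Slack incoming webhook -- the URL is a secret, keep it out of source control.
    _post_json(os.environ["SLACK_WEBHOOK_URL"], {"text": text})

if __name__ == "__main__":
    page_oncall("p95 latency SLO burn rate too high")
    notify_slack(":rotating_light: paged on-call: p95 latency SLO burn")
```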
On the platform side, we're running Kubernetes, Helm, ArgoCD, and Prometheus.
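ArgoCD handles the syncing itself; where a script needs to know whether a rollout has settled, something like this works with the official Kubernetes Python client (the deployment and namespace names are placeholders):

```python
from kubernetes import client, config

def rollout_complete(name: str, namespace: str) -> bool:
    """True when every desired replica is both updated and ready."""
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    dep = client.AppsV1Api().read_namespaced_deployment_status(name, namespace)
    desired = dep.spec.replicas or 0
    return (
        (dep.status.updated_replicas or 0) == desired
        and (dep.status.ready_replicas or 0) == desired
    )

if __name__ == "__main__":
    # "checkout-svc" and "prod" are placeholder names.
    print(rollout_complete("checkout-svc", "prod"))
```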
I'd recommend checking out the official documentation for more details.
For observability, we're using Grafana, Loki, and Tempo.
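If it's useful, here's a minimal sketch of pulling recent error lines out of Loki over its HTTP query_range API; the Loki address and the app label are placeholders:

```python
import json
import time
import urllib.parse
import urllib.request

LOKI_URL = "http://loki.example.internal:3100"  # placeholder address

def recent_errors(app: str, minutes: int = 5) -> list[str]:
    """Fetch the last few minutes of error lines for one app."""
    end = time.time_ns()                     # Loki accepts nanosecond epochs
    start = end - minutes * 60 * 10**9
    params = urllib.parse.urlencode({
        "query": f'{{app="{app}"}} |= "error"',  # LogQL: label match + line filter
        "start": start,
        "end": end,
        "limit": 100,
    })
    with urllib.request.urlopen(f"{LOKI_URL}/loki/api/v1/query_range?{params}") as resp:
        data = json.load(resp)
    # Each stream carries "values" as [timestamp, line] pairs.
    return [line for stream in data["data"]["result"] for _, line in stream["values"]]

if __name__ == "__main__":
    for line in recent_errors("checkout-svc"):  # placeholder app label
        print(line)
```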
For secrets management, we're using Vault, AWS KMS, and SOPS.
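And a minimal sketch of reading a secret from Vault's KV v2 engine with the community hvac client, assuming VAULT_ADDR and VAULT_TOKEN in the environment and a hypothetical myapp/db path (SOPS and KMS cover the encrypted-files side and aren't shown):

```python
import os
import hvac  # community Vault client; pip install hvac

def db_password() -> str:
    """Read one field from a KV v2 secret -- the path and key are placeholders."""
    client = hvac.Client(
        url=os.environ["VAULT_ADDR"],
        token=os.environ["VAULT_TOKEN"],  # prefer a short-lived auth method in CI
    )
    secret = client.secrets.kv.v2.read_secret_version(path="myapp/db")
    # KV v2 nests the key/value payload under data.data.
    return secret["data"]["data"]["password"]

if __name__ == "__main__":
    print("got a password of length", len(db_password()))
```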