Cool take! Our approach was a bit different: we used Istio, Linkerd, and Envoy. The main reason was our view that failure modes should be designed for, not discovered in production. However, I can see how your method would be better for regulated industries. Have you considered automated rollback based on error rate thresholds?
The end result was 99.9% availability, up from 99.5%.
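To put those two availability figures in perspective, here is a rough sketch that converts them into downtime per year. It assumes availability is measured against wall-clock time over a 365-day year, which may not match how the numbers above were calculated; the function name is just illustrative.

```python
def downtime_hours_per_year(availability_percent):
    """Convert an availability percentage into expected downtime hours per year.

    Assumption: a 365-day year of wall-clock time, with no excluded
    maintenance windows.
    """
    hours_per_year = 365 * 24
    return hours_per_year * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.5, 99.9):
        print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} hours/year")
```

Under that assumption, 99.5% works out to about 43.8 hours of downtime a year and 99.9% to about 8.76 hours.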
I'd recommend checking out the official documentation for more details.
The end result was a 70% reduction in incident MTTR.
Great approach! We've applied this in our organization and can confirm the benefits. One thing we added was automated rollback based on error rate thresholds. The key insight for us was that documentation debt is as dangerous as technical debt. We also underestimated the training time needed, but it was worth the investment. Happy to share more details if anyone is interested.
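For anyone curious what rollback on an error rate threshold can look like, here is a minimal sketch of the decision logic only. It is not the setup described above: the class name, the 5% threshold, the window size, and the minimum sample count are all hypothetical, and the actual rollback action (for example, shifting traffic back to the previous version) is left to whatever deployment tooling you use.

```python
from collections import deque


class ErrorRateRollback:
    """Track recent request outcomes and decide whether to roll back."""

    def __init__(self, threshold=0.05, window=200, min_requests=50):
        self.threshold = threshold        # hypothetical: max tolerated error rate
        self.min_requests = min_requests  # hypothetical: skip decisions on tiny samples
        # Sliding window of outcomes; True means the request failed.
        self.outcomes = deque(maxlen=window)

    def record(self, failed):
        """Record one request outcome."""
        self.outcomes.append(bool(failed))

    def error_rate(self):
        """Fraction of failed requests in the current window."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_roll_back(self):
        """True once enough requests were seen and the error rate exceeds the threshold."""
        if len(self.outcomes) < self.min_requests:
            return False
        return self.error_rate() > self.threshold
```

In use, you would call `record()` for each request served by the new version and check `should_roll_back()` on a timer or after each batch. The minimum sample count is there so a single early failure does not trigger a rollback on its own.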
The end result was a 90% decrease in manual toil.