How we achieved 99.99% uptime with chaos engineering

17 Posts
15 Users
(@nancy.howard864)
Posts: 0
 

This mirrors what happened to us earlier this year. The problem was deployment failures. Our initial approach was simple scripts, but that didn't scale. What actually worked was drift detection with automated remediation. The key insight was that the human side of change management is often harder than the technical implementation. Now we're able to scale automatically.

The end result was a 50% reduction in deployment time.
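
For anyone unfamiliar with the idea, here is a minimal sketch of drift detection with automated remediation. It is not our actual tooling: the names `detect_drift`, `remediate`, and `MISSING`, and the flat dictionary representation of state, are assumptions made purely for illustration. A real setup would read the desired state from version control and the actual state from the platform's API.

```python
from typing import Any, Dict, Tuple

State = Dict[str, Any]

# Hypothetical sentinel marking a key that is absent on one side.
MISSING = object()


def detect_drift(desired: State, actual: State) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (desired_value, actual_value)} for every key that differs."""
    drift = {}
    for key in desired.keys() | actual.keys():
        want = desired.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


def remediate(actual: State, drift: Dict[str, Tuple[Any, Any]]) -> State:
    """Return a copy of the actual state with every drifted key corrected."""
    fixed = dict(actual)
    for key, (want, _have) in drift.items():
        if want is MISSING:
            # Key exists in reality but not in the declared config: remove it.
            fixed.pop(key, None)
        else:
            # Key is missing or has the wrong value: restore the declared one.
            fixed[key] = want
    return fixed


# Illustrative values only, not taken from a real deployment.
desired = {"replicas": 3, "image": "app:1.4"}
actual = {"replicas": 2, "image": "app:1.4", "debug": True}

drift = detect_drift(desired, actual)
fixed = remediate(actual, drift)
```

In practice the detection step would run on a schedule, and the remediation step is where the change-management side comes in: whether a given drift is fixed automatically or routed to a human for approval is a policy decision, not a technical one.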

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 20/10/2025 10:17 pm
(@nicholas.gray779)
Posts: 0
 

Super useful! We're just starting to evaluate this approach. Could you elaborate on success metrics? Specifically, I'm curious about stakeholder communication. Also, how long did the initial implementation take? Any gotchas we should watch out for?

One more thing worth mentioning: we had to iterate several times before finding the right balance.

The end result was a 90% decrease in manual toil.

Also worth noting: integration with existing tools was smoother than anticipated.


 
Posted : 22/10/2025 7:15 am