How we achieved 99.99% uptime with chaos engineering

Alex Chen · 2025-10-27T22:57:42Z

Project: How we achieved 99.99% uptime with chaos engineering
Timeline: 9 months
Team: 5 engineers
Budget: $276k

Challenge: We needed to improve deployment speed while maintaining backward compatibility.

Solution: We implemented a strangler fig pattern using:
- GitOps with ArgoCD
- Feature flags
- DevSecOps integration

Results:
✓ Deployment frequency: 1/week → 50/day
✓ Onboarding time cut in half
✓ Team can focus on features

Happy to discuss our approach and share learnings!
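For readers new to the pattern, here is a minimal sketch of how a strangler fig facade can be combined with feature flags. It is not the team's actual implementation: the handlers, the `FLAGS` table, and the route names are all hypothetical, and a real setup would usually read flags from a flag service rather than a dict. The idea is that a thin routing layer sends a configurable percentage of traffic to the new implementation while the legacy path stays available for everyone else.

```python
import hashlib


def rollout_bucket(key: str) -> int:
    """Map a stable key (for example a user ID) to a bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def legacy_handler(user_id: str) -> str:
    # Stand-in for the existing system that is being strangled.
    return f"legacy:{user_id}"


def new_handler(user_id: str) -> str:
    # Stand-in for the replacement service.
    return f"new:{user_id}"


# Hypothetical flag table: route -> percent of traffic sent to the new code.
FLAGS = {"/orders": 25, "/invoices": 0}


def route(path: str, user_id: str, flags: dict = FLAGS) -> str:
    """Facade: pick the new or legacy handler based on the rollout percent."""
    percent = flags.get(path, 0)  # unknown routes stay on the legacy path
    if rollout_bucket(f"{path}:{user_id}") < percent:
        return new_handler(user_id)
    return legacy_handler(user_id)
```

Hashing the route and user ID keeps each user on the same side of the flag between requests, which helps with backward compatibility during the migration. Raising a route's percentage to 100 completes the cutover for that route, and setting it back to 0 is the rollback.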

Christopher Bennett

(@christopher.bennett288)

Solid analysis! From our perspective, the main concern is maintenance burden. We learned this the hard way: team morale improved significantly only once the manual toil was automated away. Now we always make sure to test regularly. It adds maybe a few hours to our process but prevents a lot of headaches down the line.

The end result was a 90% decrease in manual toil and 40% cost savings on infrastructure.

I'd recommend checking out the community forums for more details.

Posted : 24/11/2025 10:11 am

Victoria Robinson

(@victoria.robinson772)

Let me tell you how we approached this. We started about 8 months ago with a small pilot. Initial challenges included tool integration. The breakthrough came when we streamlined the process. Key metrics improved: 99.9% availability, up from 99.5%. The team's feedback has been overwhelmingly positive, though we still have room for improvement in testing coverage. Lessons learned: communicate often. Next steps for us: improve documentation.

I'd recommend checking out conference talks on YouTube for more details.
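To put the availability figures in this thread in perspective, here is a small sketch that converts an availability percentage into allowed downtime. The function name and the 365-day year are assumptions made for illustration. With those assumptions, 99.5% works out to roughly 43.8 hours of downtime per year, 99.9% to roughly 8.8 hours, and 99.99% to under an hour.

```python
def downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime allowed by an availability target over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


# Compare the targets mentioned in this thread.
for target in (99.5, 99.9, 99.99):
    minutes = downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.1f} h)")
```

Passing `period_days=30` gives a monthly budget instead, which is often easier to track against incidents.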

Posted : 02/12/2025 2:22 am

Rebecca Brown

(@rebecca.brown460)

Great writeup! I do have some concerns about the timeline, though. In our environment, Grafana, Loki, and Tempo worked better for us, because observability is not optional: you can't improve what you can't measure. That said, context matters a lot, and what works for us might not work for everyone. The key is to invest in training.

One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.

Posted : 05/12/2025 3:20 am

David Johnson

(@david.johnson369)

Valuable insights! I'd also consider team dynamics. We learned this the hard way, even though integration with existing tools was smoother than anticipated. Now we always make sure to monitor proactively. It adds maybe 30 minutes to our process but prevents a lot of headaches down the line.

Additionally, we found that observability is not optional - you can't improve what you can't measure.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

Posted : 08/12/2025 1:32 am

Matthew Ramos

(@matthew.ramos738)

Much appreciated! We're kicking off our evaluation of this approach. Could you elaborate on tool selection? Specifically, I'm curious about how you measured success. Also, how long did the initial implementation take? Any gotchas we should watch out for?

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

For context, we're using Jenkins, GitHub Actions, and Docker.

Posted : 12/12/2025 3:13 am

Donna Jimenez

(@donna.jimenez105)

This is a really thorough analysis! I have a few questions: 1) How did you handle authentication? 2) What was your approach to canary releases? 3) Did you encounter any issues with compliance? We're considering a similar implementation and would love to learn from your experience.

Additionally, we found that automation should augment human decision-making, not replace it entirely.

The end result was 40% cost savings on infrastructure.

I'd recommend checking out relevant blog posts for more details.

Posted : 14/12/2025 7:46 pm

Laura Rivera

(@laura.rivera601)

Happy to share technical details from our implementation. Architecture: microservices on Kubernetes. Tools used: Grafana, Loki, and Tempo. Configuration highlights: GitOps with ArgoCD apps. Performance benchmarks showed 99.99% availability. Security considerations: zero-trust networking. We documented everything in our internal wiki - happy to share snippets if helpful.

For context, we're using Vault, AWS KMS, and SOPS.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

Posted : 15/12/2025 11:52 pm

Gregory Ortiz

(@gregory.ortiz371)

Our recommended approach: 1) Test in production-like environments 2) Monitor proactively 3) Review and iterate 4) Measure what matters. Common mistakes to avoid: ignoring security. Resources that helped us: Team Topologies. The most important thing is learning over blame.

The end result was 99.9% availability, up from 99.5%.

For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.

Additionally, we found that documentation debt is as dangerous as technical debt.

Posted : 26/12/2025 9:48 pm
