Forum

Search
Close
AI Search
Classic Search
 Search Phrase:
 Search Type:
Advanced search options
 Search in Forums:
 Search in date period:

 Sort Search Results by:

AI Assistant
How we achieved 99....
 
Notifications
Clear all

How we achieved 99.99% uptime with chaos engineering

23 Posts
19 Users
0 Reactions
112 Views
(@christopher.bennett288)
Posts: 0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Solid analysis! From our perspective, maintenance burden. We learned this the hard way when team morale improved significantly once the manual toil was automated away. Now we always make sure to test regularly. It's added maybe a few hours to our process but prevents a lot of headaches down the line.

The end result was 90% decrease in manual toil.

I'd recommend checking out the community forums for more details.

The end result was 40% cost savings on infrastructure.


 
Posted : 24/11/2025 10:11 am
(@victoria.robinson772)
Posts: 0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Let me tell you how we approached this. We started about 8 months ago with a small pilot. Initial challenges included tool integration. The breakthrough came when we streamlined the process. Key metrics improved: 99.9% availability, up from 99.5%. The team's feedback has been overwhelmingly positive, though we still have room for improvement in testing coverage. Lessons learned: communicate often. Next steps for us: improve documentation.

I'd recommend checking out conference talks on YouTube for more details.


 
Posted : 02/12/2025 2:22 am
(@rebecca.brown460)
Posts: 0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Great writeup! That said, I have some concerns on the timeline. In our environment, we found that Grafana, Loki, and Tempo worked better because observability is not optional - you can't improve what you can't measure. That said, context matters a lot - what works for us might not work for everyone. The key is to invest in training.

For context, we're using Grafana, Loki, and Tempo.

One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.


 
Posted : 05/12/2025 3:20 am
(@david.johnson369)
Posts: 0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Valuable insights! I'd also consider team dynamics. We learned this the hard way when integration with existing tools was smoother than anticipated. Now we always make sure to monitor proactively. It's added maybe 30 minutes to our process but prevents a lot of headaches down the line.

Additionally, we found that observability is not optional - you can't improve what you can't measure.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 08/12/2025 1:32 am
(@matthew.ramos738)
Posts: 0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Much appreciated! We're kicking off our evaluating this approach. Could you elaborate on tool selection? Specifically, I'm curious about how you measured success. Also, how long did the initial implementation take? Any gotchas we should watch out for?

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

For context, we're using Jenkins, GitHub Actions, and Docker.


 
Posted : 12/12/2025 3:13 am
(@donna.jimenez105)
Posts: 0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

This is a really thorough analysis! I have a few questions: 1) How did you handle authentication? 2) What was your approach to canary? 3) Did you encounter any issues with compliance? We're considering a similar implementation and would love to learn from your experience.

Additionally, we found that automation should augment human decision-making, not replace it entirely.

The end result was 40% cost savings on infrastructure.

I'd recommend checking out relevant blog posts for more details.


 
Posted : 14/12/2025 7:46 pm
(@laura.rivera601)
Posts: 0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Happy to share technical details from our implementation. Architecture: microservices on Kubernetes. Tools used: Grafana, Loki, and Tempo. Configuration highlights: GitOps with ArgoCD apps. Performance benchmarks showed 99.99% availability. Security considerations: zero-trust networking. We documented everything in our internal wiki - happy to share snippets if helpful.

For context, we're using Vault, AWS KMS, and SOPS.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 15/12/2025 11:52 pm
(@gregory.ortiz371)
Posts: 0
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Our recommended approach: 1) Test in production-like environments 2) Monitor proactively 3) Review and iterate 4) Measure what matters. Common mistakes to avoid: ignoring security. Resources that helped us: Team Topologies. The most important thing is learning over blame.

The end result was 99.9% availability, up from 99.5%.

For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.

Additionally, we found that documentation debt is as dangerous as technical debt.


 
Posted : 26/12/2025 9:48 pm
Page 2 / 2
Share:
Scroll to Top