Automated root cause analysis using AI - case study

16 Posts
15 Users
0 Reactions
151 Views
(@david.morales35)
Posts: 0
Topic starter
[#57]

We've been experimenting with automated root cause analysis using AI for the past 2 months, and the results are impressive.

Our setup:
- Cloud: Multi-cloud
- Team size: 31 engineers
- Deployment frequency: 63/day

Key findings:
1. Cost anomalies caught automatically
2. False positives still an issue
3. Integrates well with existing tools
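
For readers wondering what "caught automatically" and "false positives" can look like in practice, here is a minimal sketch of one common approach, a modified z-score over a trailing window of daily spend. It is not the setup described in this post. The function name, the window size, the 3.5 threshold, and the 1% floor on the scale are all assumptions chosen for illustration. Raising the threshold or the floor trades missed anomalies for fewer false positives.

```python
from statistics import median


def find_cost_anomalies(daily_costs, window=7, threshold=3.5, min_rel_scale=0.01):
    """Return indices of days whose cost deviates sharply from the trailing window.

    Uses the modified z-score (median and median absolute deviation), which is
    less distorted by an earlier spike than a mean/stddev rule would be.
    All parameter defaults are illustrative assumptions.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        med = median(history)
        mad = median(abs(x - med) for x in history)
        # Floor the scale so a perfectly flat history does not turn every
        # tiny change into an alert (a typical source of false positives).
        scale = max(mad, min_rel_scale * abs(med), 1e-9)
        score = 0.6745 * (daily_costs[i] - med) / scale
        if abs(score) > threshold:
            anomalies.append(i)
    return anomalies


costs = [100, 102, 98, 101, 99, 103, 100, 250, 101]  # made-up daily spend
print(find_cost_anomalies(costs))  # → [7]
```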

Happy to answer questions about our implementation!


 
Posted : 13/11/2025 8:56 am
(@aaron.gutierrez941)
Posts: 0

Great writeup! That said, I have some concerns about the team structure. In our environment, we found that Grafana, Loki, and Tempo worked better because security must be built in from the start, not bolted on later. Context matters a lot, though - what works for us might not work for everyone. The key is to invest in training.

Additionally, we found that observability is not optional - you can't improve what you can't measure.


 
Posted : 05/01/2025 10:31 am
(@timothy.wood427)
Posts: 0

Experienced this firsthand! Symptoms: increased error rates. Root cause analysis revealed memory leaks. Fix: corrected routing rules. Prevention measures: chaos engineering. Total time to resolve was a few hours but now we have runbooks and monitoring to catch this early.

Additionally, we found that security must be built in from the start, not bolted on later.

For context, we're using Jenkins, GitHub Actions, and Docker.

One more thing worth mentioning: integration with existing tools was smoother than anticipated.


 
Posted : 29/11/2025 8:49 am
(@maria.turner939)
Posts: 0

Our data supports this. We found that the most important factor was the human side: change management is often harder than the technical implementation. We initially struggled with performance bottlenecks but found that compliance scanning in the CI pipeline worked well. The ROI has been significant - we've seen a 3x improvement.

One thing I wish I knew earlier: documentation debt is as dangerous as technical debt. Would have saved us a lot of time.

One more thing worth mentioning: integration with existing tools was smoother than anticipated.


 
Posted : 30/11/2025 3:48 pm
(@robert.stewart107)
Posts: 0

This level of detail is exactly what we needed! I have a few questions: 1) How did you handle authentication? 2) What was your approach to canary deployments? 3) Did you encounter any issues with compliance? We're considering a similar implementation and would love to learn from your experience.

I'd recommend checking out relevant blog posts for more details.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 01/12/2025 11:02 pm
(@maria.carter392)
Posts: 0

Here's what worked well for us: 1) Automate everything possible 2) Monitor proactively 3) Review and iterate 4) Keep it simple. Common mistakes to avoid: over-engineering early. Resources that helped us: Phoenix Project. The most important thing is consistency over perfection.

Additionally, we found that starting small and iterating is more effective than big-bang transformations.

The end result was a 70% reduction in incident MTTR.
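
To make a figure like a 70% MTTR reduction concrete, here is a small sketch of how it could be computed from incident timestamps. The incident data below is invented purely for illustration and is not from our systems. The function names are hypothetical too.

```python
from datetime import datetime, timedelta


def mttr_minutes(incidents):
    """Mean time to resolve, in minutes, over (detected, resolved) pairs."""
    durations = [
        (resolved - detected).total_seconds() / 60
        for detected, resolved in incidents
    ]
    return sum(durations) / len(durations)


def percent_reduction(before, after):
    """Relative improvement of `after` over `before`, as a percentage."""
    return 100 * (before - after) / before


t0 = datetime(2025, 1, 1)  # arbitrary reference time for the made-up data
before = [(t0, t0 + timedelta(minutes=m)) for m in (120, 80, 100)]
after = [(t0, t0 + timedelta(minutes=m)) for m in (30, 20, 40)]

reduction = percent_reduction(mttr_minutes(before), mttr_minutes(after))
print(f"{reduction:.0f}%")  # → 70%
```

It helps to agree up front on when the clock starts (detection vs. first alert) and stops (mitigation vs. full resolution), since that choice changes the number.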

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 02/12/2025 5:06 pm
(@evelyn.williams270)
Posts: 0

Just dealt with this! Symptoms: increased error rates. Root cause analysis revealed network misconfiguration. Fix: increased pool size. Prevention measures: load testing. Total time to resolve was 30 minutes but now we have runbooks and monitoring to catch this early.

For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.

The end result was 40% cost savings on infrastructure.

Additionally, we found that starting small and iterating is more effective than big-bang transformations.


 
Posted : 03/12/2025 4:48 pm
(@rebecca.brown460)
Posts: 0

Great points overall! One aspect I'd add is maintenance burden. We learned this the hard way when integration with existing tools was smoother than anticipated. Now we always make sure to include it in design reviews. It's added maybe 15 minutes to our process but prevents a lot of headaches down the line.

Additionally, we found that failure modes should be designed for, not discovered in production.

One more thing worth mentioning: integration with existing tools was smoother than anticipated.


 
Posted : 04/12/2025 9:46 pm
(@donald.white940)
Posts: 0

Practical advice from our team: 1) Automate everything possible 2) Use feature flags 3) Practice incident response 4) Keep it simple. Common mistakes to avoid: over-engineering early. Resources that helped us: Google SRE book. The most important thing is collaboration over tools.
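
On point 2, a rough sketch of the simplest kind of feature flag, a deterministic percentage rollout. This is an illustration only, not our production code. The function name and the flag and user identifiers are hypothetical, and a real system would normally read the rollout percentage from a config service rather than pass it in.

```python
import hashlib


def is_enabled(flag_name, user_id, rollout_percent):
    """Return True if this user falls inside the rollout for this flag.

    Hashing flag name + user id gives each user a stable bucket in 0..99,
    so the same user always gets the same answer for a given percentage.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < rollout_percent


# Hypothetical usage: expose a new code path to roughly 10% of users.
if is_enabled("new-rca-pipeline", "user-42", 10):
    print("new path")
else:
    print("old path")
```

Including the flag name in the hash means different flags pick different user subsets, so the same users are not always the first to see every change.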

One thing I wish I knew earlier: the human side of change management is often harder than the technical implementation. Would have saved us a lot of time.

One more thing worth mentioning: we underestimated the training time needed but it was worth the investment.


 
Posted : 07/12/2025 1:00 am
(@timothy.wood427)
Posts: 0

We encountered this as well! Symptoms: high latency. Root cause analysis revealed network misconfiguration. Fix: corrected routing rules. Prevention measures: chaos engineering. Total time to resolve was 30 minutes but now we have runbooks and monitoring to catch this early.

Additionally, we found that security must be built in from the start, not bolted on later.

I'd recommend checking out conference talks on YouTube for more details.

Additionally, we found that starting small and iterating is more effective than big-bang transformations.


 
Posted : 10/12/2025 5:19 am
(@stephanie.long568)
Posts: 0

Technically speaking, a few key factors come into play. First, compliance requirements. Second, backup procedures. Third, performance tuning. We spent significant time on testing and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 10x throughput increase.

The end result was a 3x increase in deployment frequency.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

For context, we're using Jenkins, GitHub Actions, and Docker.


 
Posted : 13/12/2025 1:04 am
(@benjamin.rivera487)
Posts: 0

We took a similar route in our organization and can confirm the benefits. One thing we added was compliance scanning in the CI pipeline. The key insight for us was understanding that security must be built in from the start, not bolted on later. We also found that integration with existing tools was smoother than anticipated. Happy to share more details if anyone is interested.

I'd recommend checking out relevant blog posts for more details.

One more thing worth mentioning: we underestimated the training time needed but it was worth the investment.


 
Posted : 14/12/2025 10:34 pm
 Paul
(@paul)
Posts: 0

Architecturally, there are important trade-offs to consider. First, compliance requirements. Second, backup procedures. Third, cost optimization. We spent significant time on monitoring and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 50% latency reduction.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

For context, we're using Terraform, AWS CDK, and CloudFormation.


 
Posted : 18/12/2025 11:00 pm
(@laura.rivera601)
Posts: 0

Some guidance based on our experience: 1) Automate everything possible 2) Monitor proactively 3) Review and iterate 4) Build for failure. Common mistakes to avoid: over-engineering early. Resources that helped us: Phoenix Project. The most important thing is consistency over perfection.

For context, we're using Grafana, Loki, and Tempo.

The end result was a 70% reduction in incident MTTR.

One thing I wish I knew earlier: starting small and iterating is more effective than big-bang transformations. Would have saved us a lot of time.


 
Posted : 23/12/2025 6:30 pm
(@mark.perez536)
Posts: 0

Playing devil's advocate here on the metrics focus. In our environment, we found that Grafana, Loki, and Tempo worked better because cross-team collaboration is essential for success. That said, context matters a lot - what works for us might not work for everyone. The key is to invest in training.

One more thing worth mentioning: we underestimated the training time needed but it was worth the investment.

For context, we're using Terraform, AWS CDK, and CloudFormation.


 
Posted : 24/12/2025 12:01 pm
Page 1 / 2