AI-driven incident response - our experience with PagerDuty Copilot

(@brandon.williams519)

Key takeaways from our implementation:
1) Test in production-like environments
2) Use feature flags (see the sketch below)
3) Practice incident response
4) Measure what matters

Common mistakes to avoid: over-engineering early. Resources that helped us: the Google SRE book. The most important thing is consistency over perfection.
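On the feature-flag point (takeaway 2), here's a minimal sketch of what that gating can look like. Everything in it is a hypothetical stand-in: the flags.json file, the auto_remediation flag name, and the handler. A real flag provider (LaunchDarkly, Unleash, etc.) would slot into flag_enabled the same way.

```python
# Minimal feature-flag gate for new automation (hypothetical names throughout).
import json


def flag_enabled(name: str, path: str = "flags.json") -> bool:
    """Read a boolean flag from a local JSON file; default to off on any failure."""
    try:
        with open(path) as f:
            return bool(json.load(f).get(name, False))
    except (OSError, ValueError):
        return False  # fail closed: a bad read leaves new automation off


def handle_incident(incident_id: str) -> None:
    # Gate the new automated path behind the flag; fall back to paging a human.
    if flag_enabled("auto_remediation"):
        print(f"running automated remediation for {incident_id}")
    else:
        print(f"flag off, escalating {incident_id} to the on-call engineer")
```

Failing closed is the important design choice here: if the flag source is missing or corrupt, the risky new path stays off instead of firing unexpectedly.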

The end result was a 90% decrease in manual toil.

One thing I wish I'd known earlier: observability is not optional; you can't improve what you can't measure. It would have saved us a lot of time.


 
Posted : 03/12/2025 10:42 am
(@rebecca.brown460)

Great points overall! One aspect I'd add is security considerations. We learned this the hard way and had to iterate several times before finding the right balance. Now we always make sure to document security decisions in our runbooks. It's added maybe a few hours to our process, but it prevents a lot of headaches down the line.

For context, we're using Grafana, Loki, and Tempo.
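Since Loki is in the mix: structured JSON logs make LogQL queries much easier, so here's a minimal sketch of emitting JSON log lines to stdout for an agent like Promtail to ship. The field names ("ts", "level", "msg", "service") and the "incident-bot" logger are our own conventions, not anything Loki requires.

```python
# Structured JSON logging to stdout; field names are a local convention.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "service": getattr(record, "service", "unknown"),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("incident-bot")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The `extra` dict attaches custom fields to the log record.
logger.info("remediation started", extra={"service": "checkout"})
```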

I'd recommend checking out conference talks on YouTube for more details.

Additionally, we found that observability is not optional: you can't improve what you can't measure.


 
Posted : 05/12/2025 6:47 am
(@victoria.rivera433)

Here's the full arc of our experience with this. We started about 12 months ago with a small pilot. Initial challenges included legacy compatibility. The breakthrough came when we improved our observability. Key metrics improved, including a 70% reduction in incident MTTR (a sketch of how we compute that is below). The team's feedback has been overwhelmingly positive, though we still have room for improvement in monitoring depth. Lessons learned: communicate often. Next steps for us: add more automation.
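For anyone curious how we arrive at the MTTR figure: nothing fancy, just the mean of open-to-resolve durations over a window. A minimal sketch; the hard-coded timestamps are hypothetical stand-ins for whatever your incident tracker exports.

```python
# Compute mean time to resolve (MTTR) from (opened, resolved) ISO timestamps.
from datetime import datetime


def mttr_hours(incidents: list[tuple[str, str]]) -> float:
    """Mean open-to-resolve duration in hours."""
    seconds = [
        (datetime.fromisoformat(resolved) - datetime.fromisoformat(opened)).total_seconds()
        for opened, resolved in incidents
    ]
    return sum(seconds) / len(seconds) / 3600


# Hypothetical sample data standing in for a real incident export.
incidents = [
    ("2025-11-02T10:00:00", "2025-11-02T14:30:00"),
    ("2025-11-10T08:15:00", "2025-11-10T09:45:00"),
]
print(f"MTTR: {mttr_hours(incidents):.2f} hours")
```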

Additionally, we found that security must be built in from the start, not bolted on later.


 
Posted : 05/12/2025 3:53 pm
(@jose.jackson593)

Looking at the engineering side, there are a few things to keep in mind: first, compliance requirements; second, backup procedures; third, performance tuning. We spent significant time on testing, and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 2x improvement; a rough sketch of how we measure that is below.
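To put a bit of shape on the performance numbers: this is the kind of crude before/after timing harness behind comparisons like that 2x figure. old_path and new_path are hypothetical placeholders for the real code paths under test.

```python
# Crude before/after micro-benchmark; replace the placeholder workloads
# with the real code paths being compared.
import time


def old_path() -> None:
    time.sleep(0.02)  # placeholder workload


def new_path() -> None:
    time.sleep(0.01)  # placeholder workload


def bench(fn, runs: int = 50) -> float:
    """Average wall-clock seconds per call over `runs` iterations."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


old, new = bench(old_path), bench(new_path)
print(f"old: {old * 1000:.1f} ms  new: {new * 1000:.1f} ms  speedup: {old / new:.1f}x")
```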

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

One thing I wish I'd known earlier: documentation debt is as dangerous as technical debt. It would have saved us a lot of time.


 
Posted : 09/12/2025 3:27 pm