AI-driven incident response - our experience with PagerDuty Copilot
We've been experimenting with AI-driven incident response using PagerDuty Copilot for the past 2 months, and the results are impressive.
Our setup:
- Cloud: GCP
- Team size: 47 engineers
- Deployment frequency: 27/day
Key findings:
1. Cost anomalies caught automatically
2. False positives still an issue
3. Impressive accuracy rate
Happy to answer questions about our implementation!
For those asking about cost: in our case (AWS, us-east-1, ~500 req/sec), we're paying about $2000/month. That's 30% vs our old setup with Docker. ROI was positive after just 2 months when you factor in engineering time saved.
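If it helps anyone checking the ROI claim against their own numbers, here is a rough break-even sketch. Only the $2000/month figure comes from the post above; the upfront cost and the monthly value of engineering time saved are hypothetical placeholders, so plug in your own.

```python
import math


def months_to_break_even(upfront_cost, monthly_cost, monthly_savings):
    """Return the first whole month in which cumulative net savings
    cover the upfront cost, or None if the tool never pays for itself."""
    net_per_month = monthly_savings - monthly_cost
    if net_per_month <= 0:
        return None  # running cost eats all of the savings
    return max(1, math.ceil(upfront_cost / net_per_month))


# monthly_cost is the figure quoted in the post; the other two
# values are made-up placeholders, not numbers from this thread.
months = months_to_break_even(
    upfront_cost=10_000,    # hypothetical migration cost
    monthly_cost=2_000,     # from the post
    monthly_savings=7_000,  # hypothetical value of eng time saved
)
print(months)
```

This ignores ramp-up time and assumes savings are flat from month one, so treat the result as an optimistic bound.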
Has anyone else encountered issues with Jenkins when running in GCP us-west2? We're seeing intermittent failures during peak traffic. Our setup: containerized, with New Relic. Starting to wonder if we should switch to GitLab CI.
Cautionary tale: we rushed this implementation without proper testing and it caused a 4-hour outage. The issue was a DNS resolution delay. Lesson learned: always test in staging first, especially when dealing with authentication services.
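One way to catch this kind of thing in staging is to time DNS lookups for the hosts your services depend on before promoting a build. A minimal sketch, assuming a plain Python check is acceptable in your pipeline; the function names and the 0.5-second threshold are illustrative choices, not what was actually run in the incident above.

```python
import socket
import time


def dns_resolution_seconds(hostname, resolver=socket.getaddrinfo):
    """Time a single DNS lookup for hostname, in seconds."""
    start = time.monotonic()
    resolver(hostname, None)
    return time.monotonic() - start


def slow_hosts(hostnames, threshold_seconds=0.5, resolver=socket.getaddrinfo):
    """Return (hostname, seconds) pairs whose lookup exceeded the threshold.

    A lookup that fails outright is reported with infinite latency,
    so it always shows up in the result.
    """
    slow = []
    for host in hostnames:
        try:
            elapsed = dns_resolution_seconds(host, resolver)
        except socket.gaierror:
            elapsed = float("inf")
        if elapsed > threshold_seconds:
            slow.append((host, elapsed))
    return slow
```

The resolver argument exists so the check can be exercised with a fake resolver in tests. Note that a local cache can hide slow upstream resolution, so run it from a cold container if possible.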
Spot on. This is the direction the industry is moving.
We evaluated ArgoCD last quarter and decided against it due to its learning curve. Instead, we went with Grafana, which better fit our use case. The main factors were cost (30% cheaper), ease of use (2-day vs 2-week training), and community support.
What about security? Did you run into any compliance issues? Our team is particularly concerned about production stability.
Here's our production setup:
- Tool A for X
- Tool B for Y
- Custom scripts for Z
Happy to share more details if interested.
The migration path we took:
Week 1-2: Research & POC
Week 3-4: Staging deployment
Week 5-6: Prod rollout (10% -> 50% -> 100%)
Week 7-8: Optimization
Total cost: ~200 eng hours
Would do it again in a heartbeat.
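For anyone wondering how a 10% -> 50% -> 100% rollout can be gated, a common approach is deterministic hash bucketing. This is a sketch under assumptions, not the exact mechanism used in the migration above: the unit of rollout (service name, team, tenant) and the function names are hypothetical.

```python
import hashlib

# Stages from the rollout plan above, as percentages.
ROLLOUT_STAGES = (10, 50, 100)


def bucket_for(unit_id: str) -> int:
    """Map an ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(unit_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(unit_id: str, percent: int) -> bool:
    """True if this unit is included at the given rollout percentage."""
    return bucket_for(unit_id) < percent


# Example: which of these (made-up) services are enabled at each stage.
services = [f"service-{n}" for n in range(20)]
for stage in ROLLOUT_STAGES:
    enabled = [s for s in services if in_rollout(s, stage)]
    print(f"{stage}%: {len(enabled)} of {len(services)} enabled")
```

Because the bucket depends only on the ID, anything enabled at 10% stays enabled at 50% and 100%, which keeps each stage a superset of the previous one and makes rollback predictable.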
How did you handle the migration? Any gotchas to watch for? Our team is particularly concerned about production stability.
Great for small teams, but doesn't scale well past 50 people.
In our production environment with 200+ microservices, we found that Ansible significantly outperformed Prometheus. The key was proper configuration of memory limits. Deployment time dropped from 45min to 8min. Highly recommended for teams running Kubernetes at scale.
Did you consider alternatives? Why did you choose this one? Trying to build a business case for management.
Resource consumption is a concern. What's your experience? Our team is particularly concerned about production stability.
For those asking about cost: in our case (AWS, us-east-1, ~500 req/sec), we're paying about $2000/month. That's 70% vs our old setup with Prometheus. ROI was positive after just 2 months when you factor in engineering time saved.
Security team blocked this due to compliance requirements.
We tried this but hit issues with X. How did you solve it? Our team is particularly concerned about production stability.
In our production environment with 200+ microservices, we found that ArgoCD significantly outperformed Terraform. The key was proper configuration of timeout settings. Deployment time dropped from 45min to 8min. Highly recommended for teams running Kubernetes at scale.
Just implemented this last week. Already seeing improvements!