
On-call rotation best practices to prevent burnout

22 Posts · 16 Users · 0 Reactions · 325 Views
(@evelyn.williams270)
Topic starter

On-call burnout was affecting our team's morale. The improvements we made: proper rotation schedules with adequate rest time, runbooks for common issues, clear escalation policies, follow-the-sun coverage for global teams, and blameless postmortems. We also set clear expectations: on-call means available, not necessarily working. Our incident rate dropped as we improved system reliability. What are your on-call best practices?
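To make the follow-the-sun idea concrete, here is a minimal sketch of a rotation generator. The regions, rosters, and shift windows are invented for illustration:

```python
from datetime import datetime, timedelta, timezone

# Made-up regions with 8-hour coverage windows (UTC), so each engineer
# is only paged during local business hours.
REGIONS = {
    "APAC": (0, 8),
    "EMEA": (8, 16),
    "AMER": (16, 24),
}

# Hypothetical rosters; engineers rotate week by week within a region,
# so nobody carries the pager two weeks in a row.
ROSTERS = {
    "APAC": ["asha", "kenji"],
    "EMEA": ["lena", "marko"],
    "AMER": ["sam", "rosa"],
}

def shifts(start: datetime, days: int):
    """Yield (region, engineer, shift_start, shift_end) for each day."""
    for day in range(days):
        date = start + timedelta(days=day)
        week = date.isocalendar().week  # ISO week drives the rotation
        for region, (begin, end) in REGIONS.items():
            engineer = ROSTERS[region][week % len(ROSTERS[region])]
            shift_start = date.replace(hour=begin)
            # end == 24 wraps to midnight of the next day
            shift_end = date.replace(hour=end % 24) + timedelta(days=end // 24)
            yield region, engineer, shift_start, shift_end

for region, eng, t0, t1 in shifts(datetime(2025, 7, 28, tzinfo=timezone.utc), 2):
    print(f"{region}: {eng} covers {t0:%a %H:%M}-{t1:%H:%M} UTC")
```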


 
Posted : 27/07/2025 11:21 am
(@christine.roberts720)

The technical aspects here are nuanced: for us the big three were network topology, monitoring coverage, and cost optimization. We spent significant time on monitoring, and it was worth it; code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 50% latency reduction.

One thing I wish I knew earlier: the human side of change management is often harder than the technical implementation. Would have saved us a lot of time.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

The end result was a 90% decrease in manual toil.

One more thing worth mentioning: we underestimated the training time needed, but it was worth the investment.

Another lesson learned the hard way: automation should augment human decision-making, not replace it entirely.

I'd recommend checking out conference talks on YouTube for more details.


 
Posted : 28/07/2025 11:40 am
(@michelle.ross286)

This is exactly the kind of detail that helps! I have a few questions: 1) How did you handle security? 2) What was your approach to rollback? 3) Did you encounter any issues with availability? We're considering a similar implementation and would love to learn from your experience.

One more thing worth mentioning: the hardest part for us was getting buy-in from stakeholders outside engineering, though integration with existing tools was smoother than anticipated.

Additionally, we found that security must be built in from the start, not bolted on later.


 
Posted : 29/07/2025 11:02 pm
(@james.allen159)

Valuable insights! I'd also consider team dynamics. Once we made documenting in runbooks a standard step, we saw unexpected benefits, including better developer experience and faster onboarding. It's added maybe 15 minutes to our process but prevents a lot of headaches down the line.

The end result was a 90% decrease in manual toil.

Additionally, we found that observability is not optional - you can't improve what you can't measure.

One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.


 
Posted : 30/07/2025 7:27 pm
(@donna.jimenez105)

Great points overall! One aspect I'd add is cost analysis. We learned this the hard way: we underestimated the training time needed, but it was worth the investment. Now we always make sure to test regularly. It's added maybe 15 minutes to our process but prevents a lot of headaches down the line.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

Additionally, we found that the human side of change management is often harder than the technical implementation.


 
Posted : 01/08/2025 1:07 am
(@elizabeth.perez157)

Looking at the engineering side, there are some things to keep in mind: compliance requirements, failover strategy, and performance tuning. We spent significant time on testing, and it was worth it; code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 10x throughput increase.

I'd recommend checking out conference talks on YouTube for more details.

One thing I wish I knew earlier: failure modes should be designed for, not discovered in production. Would have saved us a lot of time.
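The failover strategy itself isn't described above, so purely as an illustration, here's a minimal sketch of one common pattern - ordered client-side failover across replicas - with made-up hostnames:

```python
import requests

# Hypothetical replica endpoints, tried in priority order.
REPLICAS = [
    "https://api-primary.example.internal",
    "https://api-standby.example.internal",
]

def get_with_failover(path: str) -> requests.Response:
    """Try each replica in order; fall through on errors.

    raise_for_status() means a 5xx from one replica also triggers
    failover to the next, not just connection failures.
    """
    last_error = None
    for base in REPLICAS:
        try:
            resp = requests.get(base + path, timeout=2)
            resp.raise_for_status()
            return resp
        except requests.RequestException as err:
            last_error = err  # record it and try the next replica
    raise RuntimeError("all replicas failed") from last_error
```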


 
Posted : 01/08/2025 3:20 pm
(@mary.castillo14)

What a comprehensive overview! I have a few questions: 1) How did you handle security? 2) What was your approach to backup? 3) Did you encounter any issues with latency? We're considering a similar implementation and would love to learn from your experience.

The end results: a 60% improvement in developer productivity, 40% cost savings on infrastructure, and availability up from 99.5% to 99.9%.

For context, we're using Jenkins, GitHub Actions, and Docker on the CI/CD side, and Vault, AWS KMS, and SOPS for secrets management.

Additionally, we found that documentation debt is as dangerous as technical debt. One more thing worth mentioning: we underestimated the training time needed, but it was worth the investment.

I'd recommend checking out relevant blog posts for more details. Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
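Since Vault came up, here's a minimal sketch of reading a secret with the hvac Python client. The mount point and secret path are made up; adjust them to your own layout:

```python
import os
import hvac  # pip install hvac

client = hvac.Client(
    url=os.environ.get("VAULT_ADDR", "https://vault.example.internal:8200"),
    token=os.environ["VAULT_TOKEN"],
)

# Hypothetical KV v2 secret path, purely for illustration.
secret = client.secrets.kv.v2.read_secret_version(
    mount_point="secret",
    path="oncall/pagerduty",
)
routing_key = secret["data"]["data"]["routing_key"]
```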


 
Posted : 02/08/2025 12:49 pm
(@deborah.cook920)

From an implementation perspective, here are the key points: compliance requirements, monitoring coverage, and security hardening. We spent significant time on automation, and it was worth it; code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 2x improvement.

For context, we're using Grafana, Loki, and Tempo, alongside Elasticsearch, Fluentd, and Kibana for log aggregation.

Additionally, we found that failure modes should be designed for, not discovered in production.

The end result was 40% cost savings on infrastructure. One more thing worth mentioning: the initial investment was higher than expected, but the long-term benefits exceeded our projections.

I'd recommend checking out the community forums and the official documentation for more details. Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
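One concrete detail on the Fluentd/Elasticsearch side: emitting JSON instead of free-text logs means fields arrive pre-structured and indexable, with no parsing rules needed. A minimal sketch; the field names and service name are made up:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            "service": "checkout",  # hypothetical service name
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("oncall.demo").info("payment retries exhausted")
```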


 
Posted : 03/08/2025 12:36 am
(@michelle.ross286)

This mirrors what we went through. Phase 1 (1 month) involved stakeholder alignment, Phase 2 (2 months) focused on pilot implementation, and Phase 3 (2 weeks) was the full rollout. Total investment was $100K, but the payback period was only 6 months. Key success factors: good tooling, training, and patience. If I could do it again, I would start with better documentation.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 04/08/2025 9:42 pm
(@william.harris811)

Had this exact problem! Symptoms: increased error rates. Root cause analysis revealed memory leaks; the fix involved correcting our routing rules, with load testing added as a prevention measure. Total time to resolve was about an hour, but now we have runbooks and monitoring to catch this early.

One thing I wish I knew earlier: the human side of change management is often harder than the technical implementation. Would have saved us a lot of time.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 05/08/2025 9:49 am
(@deborah.howard208)

We went through something very similar. The problem: security vulnerabilities. Our initial approach was ad-hoc monitoring, but that didn't work because it lacked visibility. What actually worked: integration with our incident management system. The key insight was that automation should augment human decision-making, not replace it entirely. Now we're able to scale automatically.

One thing I wish I knew earlier: observability is not optional - you can't improve what you can't measure. Would have saved us a lot of time.
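The post doesn't say which incident management system was integrated, so purely as an example, here's a minimal sketch against PagerDuty's Events API v2; the routing key comes from an environment variable, and everything else is illustrative:

```python
import os
import requests

def trigger_incident(summary: str, source: str, severity: str = "error") -> str:
    """Open (or dedupe into) a PagerDuty incident; returns the dedup key."""
    resp = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": os.environ["PD_ROUTING_KEY"],  # per-service key
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": source,
                "severity": severity,  # critical | error | warning | info
            },
        },
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["dedup_key"]
```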


 
Posted : 06/08/2025 8:53 am
(@alexander.smith802)

From a technical standpoint, here's our implementation. Architecture: microservices on Kubernetes. Tools used: Elasticsearch, Fluentd, and Kibana. Configuration highlights: IaC with Terraform modules. Performance benchmarks showed 99.99% availability. Security considerations: container scanning in CI. We documented everything in our internal wiki - happy to share snippets if helpful.

Additionally, we found that failure modes should be designed for, not discovered in production.

One thing I wish I knew earlier: automation should augment human decision-making, not replace it entirely. Would have saved us a lot of time.
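The scanner behind "container scanning in CI" isn't named above, so as one concrete option, here's a minimal sketch that gates a CI job with Trivy; the image name is hypothetical:

```python
import subprocess
import sys

def scan_image(image: str) -> None:
    # --exit-code 1 makes Trivy return non-zero when HIGH/CRITICAL
    # findings exist, which is what fails the CI job.
    result = subprocess.run(
        ["trivy", "image", "--severity", "HIGH,CRITICAL",
         "--exit-code", "1", image],
        check=False,
    )
    if result.returncode != 0:
        sys.exit(f"vulnerability gate failed for {image}")

if __name__ == "__main__":
    scan_image("registry.example.com/checkout:latest")  # made-up image
```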


 
Posted : 06/08/2025 2:20 pm
(@alexander.smith802)

Adding my two cents here, focusing on cost analysis. We learned this the hard way: we underestimated the training time needed, but it was worth the investment. Now we always make sure to monitor proactively. It's added maybe an hour to our process but prevents a lot of headaches down the line.

For context, we're using Vault, AWS KMS, and SOPS.

Additionally, we found that the human side of change management is often harder than the technical implementation.


 
Posted : 07/08/2025 12:40 am
(@john.perez881)

Architecturally, there are important trade-offs to consider: network topology, backup procedures, and security hardening. We spent significant time on testing, and it was worth it; code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 50% latency reduction.

Additionally, we found that security must be built in from the start, not bolted on later.

For context, we're using Vault, AWS KMS, and SOPS.


 
Posted : 08/08/2025 7:11 pm
(@kathleen.watson88)

This really hits home! Phase 1 (2 weeks) involved assessment and planning, Phase 2 (3 months) focused on process documentation, and Phase 3 (2 weeks) was all about optimization. Total investment was $200K, but the payback period was only 9 months. Key success factors: good tooling, training, and patience. If I could do it again, I would invest more in training.

Additionally, we found that observability is not optional - you can't improve what you can't measure.


 
Posted : 09/08/2025 7:38 pm