AI Search

Classic Search

Search Phrase:

Search Type:

Advanced search options

Search in Forums:

Search in date period:

Sort Search Results by:

AI Assistant

Notifications

Clear all

Deep dive: On-call rotation best practices to prevent burnout

✦ Summarize Topic

Lessons Learned

Last Post by Christopher Mitchell 6 months ago

9 Posts

9 Users

0 Reactions

299 Views

RSS

Jose Jackson

(@jose.jackson593)

Posts: 0

Topic starter

Translate ▼

[#219]

This mirrors what we went through. We learned: Phase 1 (1 month) involved assessment and planning. Phase 2 (1 month) focused on pilot implementation. Phase 3 (ongoing) was all about knowledge sharing. Total investment was $50K but the payback period was only 6 months. Key success factors: automation, documentation, feedback loops. If I could do it again, I would invest more in training.

Additionally, we found that automation should augment human decision-making, not replace it entirely.

One thing I wish I knew earlier: documentation debt is as dangerous as technical debt. Would have saved us a lot of time.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

The end result was 3x increase in deployment frequency.

One thing I wish I knew earlier: starting small and iterating is more effective than big-bang transformations. Would have saved us a lot of time.

Posted : 27/09/2025 5:21 am

Sara Pike

(@sara)

Posts: 0

Translate ▼

Excellent thread! One consideration often overlooked is maintenance burden. We learned this the hard way when the initial investment was higher than expected, but the long-term benefits exceeded our projections. Now we always make sure to document in runbooks. It's added maybe 30 minutes to our process but prevents a lot of headaches down the line.

One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.

One thing I wish I knew earlier: documentation debt is as dangerous as technical debt. Would have saved us a lot of time.

Posted : 28/09/2025 6:41 am

Benjamin Taylor

(@benjamin.taylor696)

Posts: 0

Translate ▼

The depth of this analysis is impressive! I have a few questions: 1) How did you handle testing? 2) What was your approach to migration? 3) Did you encounter any issues with compliance? We're considering a similar implementation and would love to learn from your experience.

Additionally, we found that automation should augment human decision-making, not replace it entirely.

The end result was 99.9% availability, up from 99.5%.

I'd recommend checking out conference talks on YouTube for more details.

Posted : 28/09/2025 10:07 pm

Emily Gutierrez

(@emily.gutierrez57)

Posts: 0

Translate ▼

Practical advice from our team: 1) Test in production-like environments 2) Monitor proactively 3) Share knowledge across teams 4) Measure what matters. Common mistakes to avoid: ignoring security. Resources that helped us: Phoenix Project. The most important thing is collaboration over tools.

The end result was 80% reduction in security vulnerabilities.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

For context, we're using Istio, Linkerd, and Envoy.

I'd recommend checking out the official documentation for more details.

One thing I wish I knew earlier: observability is not optional - you can't improve what you can't measure. Would have saved us a lot of time.

One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.

Additionally, we found that starting small and iterating is more effective than big-bang transformations.

One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.

Posted : 30/09/2025 12:45 am

Evelyn Lewis

(@evelyn.lewis664)

Posts: 0

Translate ▼

From beginning to end, here's what we did with this. We started about 8 months ago with a small pilot. Initial challenges included tool integration. The breakthrough came when we streamlined the process. Key metrics improved: 90% decrease in manual toil. The team's feedback has been overwhelmingly positive, though we still have room for improvement in documentation. Lessons learned: start simple. Next steps for us: optimize costs.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

Posted : 30/09/2025 9:35 am

Tyler Foster

(@tyler.foster787)

Posts: 0

Translate ▼

I'd like to share our complete experience with this. We started about 5 months ago with a small pilot. Initial challenges included legacy compatibility. The breakthrough came when we streamlined the process. Key metrics improved: 60% improvement in developer productivity. The team's feedback has been overwhelmingly positive, though we still have room for improvement in documentation. Lessons learned: automate everything. Next steps for us: expand to more teams.

One more thing worth mentioning: unexpected benefits included better developer experience and faster onboarding.

I'd recommend checking out the community forums for more details.

One thing I wish I knew earlier: security must be built in from the start, not bolted on later. Would have saved us a lot of time.

One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

For context, we're using Vault, AWS KMS, and SOPS.

Posted : 01/10/2025 12:02 am

Dennis King

(@dennis.king704)

Posts: 0

Translate ▼

Here are some operational tips that worked for uss we've developed: Monitoring - CloudWatch with custom metrics. Alerting - Opsgenie with escalation policies. Documentation - Notion for team wikis. Training - pairing sessions. These have helped us maintain low incident count while still moving fast on new features.

Additionally, we found that failure modes should be designed for, not discovered in production.

The end result was 90% decrease in manual toil.

One thing I wish I knew earlier: starting small and iterating is more effective than big-bang transformations. Would have saved us a lot of time.

I'd recommend checking out relevant blog posts for more details.

One thing I wish I knew earlier: automation should augment human decision-making, not replace it entirely. Would have saved us a lot of time.

For context, we're using Istio, Linkerd, and Envoy.

Additionally, we found that starting small and iterating is more effective than big-bang transformations.

Additionally, we found that failure modes should be designed for, not discovered in production.

Posted : 02/10/2025 9:31 pm

Scott Allen

(@scott.allen968)

Posts: 0

Translate ▼

Our experience from start to finish with this. We started about 17 months ago with a small pilot. Initial challenges included legacy compatibility. The breakthrough came when we streamlined the process. Key metrics improved: 70% reduction in incident MTTR. The team's feedback has been overwhelmingly positive, though we still have room for improvement in testing coverage. Lessons learned: start simple. Next steps for us: optimize costs.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

Posted : 04/10/2025 6:16 am

Christopher Mitchell

(@christopher.mitchell35)

Posts: 0

Translate ▼

We felt this too! Here's how we learned: Phase 1 (2 weeks) involved assessment and planning. Phase 2 (2 months) focused on pilot implementation. Phase 3 (ongoing) was all about full rollout. Total investment was $100K but the payback period was only 9 months. Key success factors: good tooling, training, patience. If I could do it again, I would set clearer success metrics.

One more thing worth mentioning: we discovered several hidden dependencies during the migration.

Posted : 05/10/2025 2:40 am

11 Forums
309 Topics
4,684 Posts
0 Online
109 Members

Forum Icons: Forum contains no unread posts Forum contains unread posts

Topic Icons: Not Replied Replied Active Hot Sticky Unapproved Solved Private Closed