Project: Built a self-service platform for 100+ developers using Backstage
Timeline: 14 months
Team: 3 engineers
Budget: $111k
Challenge:
We needed to achieve compliance while maintaining zero downtime.
Solution:
We implemented a blue-green deployment strategy (minimal sketch after this list) using:
- Kubernetes for orchestration
- Feature flags
- SRE practices
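For anyone who wants the concrete shape of the cutover, here's a minimal sketch using the official Kubernetes Python client: repoint the Service selector from the blue track to green once the green deployment's health checks pass. The service name, namespace, and `track` label are placeholders, not our actual config.

```python
# Minimal blue-green cutover sketch using the official Kubernetes Python
# client. Assumes two Deployments labeled track=blue / track=green and a
# Service that routes by that label; all names are placeholders.
from kubernetes import client, config

def switch_traffic(service: str, namespace: str, target: str) -> None:
    """Repoint the Service selector at the given track (blue or green)."""
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    patch = {"spec": {"selector": {"app": service, "track": target}}}
    v1.patch_namespaced_service(service, namespace, patch)
    print(f"Service {service} now routes to {target}")

if __name__ == "__main__":
    # Cut over to green only after its health checks pass.
    switch_traffic("my-app", "production", "green")
```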
Results:
✓ MTTR: 4hrs → 15min
✓ Onboarding time cut in half
✓ Security posture improved dramatically
Happy to discuss our approach and share learnings!
Same experience on our end! We ran it in three phases: Phase 1 (1 month) was tool evaluation, Phase 2 (1 month) was the pilot implementation, and Phase 3 (ongoing) is knowledge sharing. Total investment was $200K, but the payback period was only 9 months. Key success factors: good tooling, training, and patience. If I could do it again, I would involve operations earlier.
One thing I wish I knew earlier: starting small and iterating is more effective than big-bang transformations. Would have saved us a lot of time.
We ran a similar implementation in our organization and can confirm the benefits. One thing we added was chaos engineering tests in staging. The key insight for us was that cross-team collaboration is essential for success. We also found unexpected benefits: better developer experience and faster onboarding. Happy to share more details if anyone is interested.
The end result was 40% cost savings on infrastructure.
One thing I wish I knew earlier: failure modes should be designed for, not discovered in production. Would have saved us a lot of time.
Same here! In practice, the most important factor was starting small and iterating rather than attempting a big-bang transformation. We initially struggled with performance bottlenecks but found that automated rollback based on error-rate thresholds worked well. The ROI has been significant - we've seen a 70% improvement.
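For anyone curious what "automated rollback on error-rate thresholds" can look like in its simplest form, here's a rough sketch: poll Prometheus for the 5xx ratio and roll the Deployment back when it breaches a threshold. The Prometheus URL, query, threshold, and deployment name are illustrative, not our production values.

```python
# Rough sketch of threshold-based automated rollback: poll Prometheus for
# the 5xx error ratio and roll the Deployment back if it breaches the
# threshold. PROM_URL, the query, and all names are illustrative.
import subprocess
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)
THRESHOLD = 0.05  # roll back above 5% errors

def error_rate() -> float:
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    rate = error_rate()
    if rate > THRESHOLD:
        print(f"error rate {rate:.2%} over threshold, rolling back")
        subprocess.run(
            ["kubectl", "rollout", "undo", "deployment/my-app"], check=True
        )
    else:
        print(f"error rate {rate:.2%} within budget")
```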
Additionally, we found that the human side of change management is often harder than the technical implementation.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
This level of detail is exactly what we needed! I have a few questions: 1) How did you handle monitoring? 2) What was your approach to blue-green? 3) Did you encounter any issues with costs? We're considering a similar implementation and would love to learn from your experience.
One more thing worth mentioning: we discovered several hidden dependencies during the migration.
Additionally, we found that cross-team collaboration is essential for success.
I'd recommend checking out the community forums for more details.
Funny timing - we just dealt with this. The problem: deployment failures. Our initial approach was simple scripts, but that didn't work because they lacked visibility. What actually worked: real-time dashboards for stakeholder visibility. The key insight was that failure modes should be designed for, not discovered in production. Now we're able to deploy with confidence.
For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.
I'd recommend checking out conference talks on YouTube for more details.
This is almost identical to what we faced. The problem: deployment failures. Our initial approach was manual intervention, but that didn't work because it lacked visibility. What actually worked: drift detection with automated remediation (sketch of the core loop below). The key insight was that starting small and iterating beats big-bang transformations. Now we're able to detect issues early.
For context, we're using Jenkins, GitHub Actions, and Docker.
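In case "drift detection with automated remediation" sounds abstract, the core loop can be very small. We're on Jenkins/GitHub Actions, but the idea is tool-agnostic; here's a minimal kubectl-based sketch for brevity. `kubectl diff` exits with code 1 when live state has drifted from the manifests, so that triggers a re-apply. The manifest path is a placeholder, and a real pipeline would gate the apply step behind approval.

```python
# Minimal drift-detection loop: `kubectl diff` exits 1 when the live
# cluster state differs from the manifests, so a non-zero code triggers
# remediation via `kubectl apply`. The manifest path is a placeholder.
import subprocess

MANIFESTS = "k8s/"  # placeholder path to the desired-state manifests

def detect_and_remediate() -> None:
    diff = subprocess.run(["kubectl", "diff", "-f", MANIFESTS])
    if diff.returncode == 0:
        print("no drift detected")
    elif diff.returncode == 1:
        print("drift detected, re-applying manifests")
        subprocess.run(["kubectl", "apply", "-f", MANIFESTS], check=True)
    else:
        raise RuntimeError("kubectl diff failed")  # e.g. connection error

if __name__ == "__main__":
    detect_and_remediate()
```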
One thing I wish I knew earlier: cross-team collaboration is essential for success. Would have saved us a lot of time.
While this is well-reasoned, I see things differently on the tooling side. In our environment, we found that tools like Istio, Linkerd, and Envoy worked better, because observability is not optional - you can't improve what you can't measure (minimal instrumentation sketch below). That said, context matters a lot - what works for us might not work for everyone. The key is to invest in training.
For context, we're using Datadog, PagerDuty, and Slack.
The end result was 99.9% availability, up from 99.5%.
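To make the "can't improve what you can't measure" point concrete: we're on Datadog (see above), but the principle is tool-agnostic, so here's a minimal sketch using the open-source prometheus_client library - a latency histogram plus an error counter around a request handler. Metric names and the simulated failure are illustrative only.

```python
# Tool-agnostic sketch of "you can't improve what you can't measure":
# a latency histogram and an error counter around a request handler,
# exposed for scraping via prometheus_client. Names are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("request_latency_seconds", "Request latency")
REQUEST_ERRORS = Counter("request_errors_total", "Failed requests")

@REQUEST_LATENCY.time()  # records how long each call takes
def handle_request() -> None:
    if random.random() < 0.1:  # placeholder for real request logic
        REQUEST_ERRORS.inc()
        raise RuntimeError("simulated failure")

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at :8000/metrics
    for _ in range(1000):
        try:
            handle_request()
        except RuntimeError:
            pass
        time.sleep(0.1)
```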
Here's our full story. We started about 4 months ago with a small pilot. Initial challenges included legacy compatibility. The breakthrough came when we simplified the architecture. Key metrics improved: 99.9% availability, up from 99.5%. The team's feedback has been overwhelmingly positive, though we still have room for improvement in testing coverage. Lessons learned: automate everything. Next steps for us: add more automation.
For context, we're using Jenkins, GitHub Actions, and Docker.
We tackled this from a different angle using Kubernetes, Helm, ArgoCD, and Prometheus. The main reason was that automation should augment human decision-making, not replace it entirely. However, I can see how your method would be better for larger teams. Have you considered real-time dashboards for stakeholder visibility?
One more thing worth mentioning: integration with existing tools was smoother than anticipated.
For context, we're using Elasticsearch, Fluentd, and Kibana.
Great post! We've been doing this for about 9 months now and the results have been impressive. Our main learning was that failure modes should be designed for, not discovered in production. We also found unexpected benefits: better developer experience and faster onboarding. For anyone starting out, I'd recommend chaos engineering tests in staging - see the sketch below.
For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.
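Since people often ask what "chaos engineering tests in staging" means concretely: the simplest useful experiment is deleting a random pod and verifying the system self-heals. Here's a minimal sketch with the Kubernetes Python client; the namespace and label selector are placeholders, and this should only ever run against staging.

```python
# Simplest useful chaos experiment: delete one random pod matching a
# label and verify the Deployment controller replaces it. Namespace and
# label selector are placeholders; run this against staging only.
import random

from kubernetes import client, config

def kill_random_pod(namespace: str, label_selector: str) -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    if not pods.items:
        print("no matching pods found")
        return
    victim = random.choice(pods.items)
    v1.delete_namespaced_pod(victim.metadata.name, namespace)
    print(f"deleted {victim.metadata.name}; watch that it gets rescheduled")

if __name__ == "__main__":
    kill_random_pod("staging", "app=my-app")
```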
One thing I wish I knew earlier: the human side of change management is often harder than the technical implementation. Would have saved us a lot of time.
Here's the full arc of our experience with this. We started about 6 months ago with a small pilot. Initial challenges included legacy compatibility. The breakthrough came when we simplified the architecture. Key metrics improved: a 3x increase in deployment frequency. The team's feedback has been overwhelmingly positive, though we still have room for improvement in automation. Lessons learned: automate everything. Next steps for us: add more automation.
For context, we're using Vault, AWS KMS, and SOPS.
Great post! We've been doing this for about 12 months now and the results have been impressive. Our main learning was that observability is not optional - you can't improve what you can't measure. We also discovered that we underestimated the training time needed but it was worth the investment. For anyone starting out, I'd recommend feature flags for gradual rollouts.
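If it helps anyone: "feature flags for gradual rollouts" doesn't have to mean a vendor product on day one. A deterministic hash of the user ID gets you stable percentage rollouts. A minimal sketch - flag names and percentages are illustrative, and real systems add targeting rules and a management UI on top.

```python
# Minimal percentage-based feature flag: hash the (flag, user) pair into
# a stable bucket 0-99 so each user consistently sees the flag on or off
# as the rollout percentage grows. Names and percentages are illustrative.
import hashlib

ROLLOUT = {"new-onboarding-flow": 25}  # flag -> percent of users enabled

def bucket(user_id: str, flag: str) -> int:
    """Stable 0-99 bucket per (user, flag) pair."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag: str, user_id: str) -> bool:
    return bucket(user_id, flag) < ROLLOUT.get(flag, 0)

if __name__ == "__main__":
    for user in ["alice", "bob", "carol", "dave"]:
        print(user, is_enabled("new-onboarding-flow", user))
```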
Great post! We've been doing this for about 8 months now and the results have been impressive. Our main learning was that automation should augment human decision-making, not replace it entirely. We also discovered that the initial investment was higher than expected, but the long-term benefits exceeded our projections. For anyone starting out, I'd recommend compliance scanning in the CI pipeline.
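To make "compliance scanning in the CI pipeline" concrete: one common pattern is failing the build when an image scanner finds high-severity issues. A sketch wrapping Trivy, assuming it's installed on the CI runner; the image name and severity gate are placeholders, not a statement about any particular pipeline.

```python
# CI gate sketch: run Trivy against the freshly built image and fail the
# pipeline on HIGH/CRITICAL findings via --exit-code 1. Assumes trivy is
# installed on the runner; image name and severities are placeholders.
import subprocess
import sys

IMAGE = "registry.example.com/my-app:latest"  # placeholder image

def scan(image: str) -> int:
    result = subprocess.run([
        "trivy", "image",
        "--exit-code", "1",            # non-zero when findings match
        "--severity", "HIGH,CRITICAL",
        image,
    ])
    return result.returncode

if __name__ == "__main__":
    sys.exit(scan(IMAGE))  # non-zero exit fails the CI job
```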
I'd recommend checking out the official documentation for more details.
The end result was 90% decrease in manual toil.
Practical advice from our team: 1) document as you go, 2) monitor proactively, 3) practice incident response, 4) keep it simple. Common mistakes to avoid: ignoring security. Resources that helped us: Accelerate (the DORA book). The most important thing is collaboration over tools.
The end result was 50% reduction in deployment time.
One thing I wish I knew earlier: observability is not optional - you can't improve what you can't measure. Would have saved us a lot of time.