
Building a comprehensive observability stack with OpenTelemetry

18 Posts
16 Users
0 Reactions
104 Views
(@sharon.garcia321)
Posts: 0

I'll walk you through our entire process with this. We started about 22 months ago with a small pilot. The initial challenge was tool integration; the breakthrough came once we improved observability, and the key metric followed: a 90% decrease in manual toil. The team's feedback has been overwhelmingly positive, though we still have room for improvement in testing coverage. Lessons learned: communicate often. Next step for us: expand to more teams.
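Since the thread is about OpenTelemetry, here's a minimal sketch of the kind of tracing setup a pilot like this might start from. To be clear, this is my illustration, not necessarily what we ran: it assumes Python with the OTLP/gRPC exporter, and the service name and collector endpoint are placeholders.

```python
# Minimal OpenTelemetry tracing setup (Python SDK).
# Assumes: pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service so backends can group its telemetry.
resource = Resource.create({"service.name": "pilot-service"})  # placeholder name

provider = TracerProvider(resource=resource)
# Batch spans and ship them to a local OTel Collector (default OTLP gRPC port).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("request.size", 128)  # example attribute
```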

One more thing worth mentioning: unexpected benefits included better developer experience and faster onboarding.


 
Posted : 28/07/2025 4:40 am
(@gregory.brooks453)
Posts: 0

On the operational side, some thoughts we've developed: monitoring with CloudWatch and custom metrics, alerting with Opsgenie and escalation policies, documentation in Confluence with templates, and training via pairing sessions. These have helped us keep our incident count low while still moving fast on new features.
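To make the CloudWatch piece concrete, here's a small sketch of publishing a custom metric with boto3. The namespace, metric name, and dimension are hypothetical, and region/credentials are assumed to come from the environment.

```python
# Publishing a custom CloudWatch metric (sketch; names are hypothetical).
import boto3

cloudwatch = boto3.client("cloudwatch")  # region/credentials from environment

cloudwatch.put_metric_data(
    Namespace="Platform/Toil",                    # hypothetical namespace
    MetricData=[
        {
            "MetricName": "ManualInterventions",  # hypothetical metric
            "Dimensions": [{"Name": "Team", "Value": "payments"}],
            "Value": 3,
            "Unit": "Count",
        }
    ],
)
```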

One thing I wish I knew earlier: failure modes should be designed for, not discovered in production. Would have saved us a lot of time.
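As one small illustration of designing for failure up front (my example, not code from our stack): give every outbound call an explicit timeout and a bounded retry policy instead of relying on defaults.

```python
# Designing for failure: explicit timeout + bounded retries (illustrative).
import requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

session = requests.Session()
# Retry only on transient gateway errors, with exponential backoff.
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retry))

# A hard timeout turns a hung dependency into a fast, handleable error.
resp = session.get("https://api.example.com/health", timeout=2.0)  # hypothetical URL
resp.raise_for_status()
```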

For context, we're using Datadog, PagerDuty, and Slack.

One thing I wish I knew earlier: documentation debt is as dangerous as technical debt. Would have saved us a lot of time.

One more thing worth mentioning: we underestimated the training time needed but it was worth the investment.

Additionally, we found that the human side of change management is often harder than the technical implementation.


 
Posted : 28/07/2025 10:29 pm
(@deborah.cook920)
Posts: 0

Thoughtful post - though I'd challenge one aspect: the focus on metrics. In our environment, Elasticsearch, Fluentd, and Kibana worked better. Our view is that observability is not optional - you can't improve what you can't measure. That said, context matters a lot; what works for us might not work for everyone. The key is to invest in training.
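If you go the Elasticsearch/Fluentd/Kibana route, the measurement loop can be as simple as querying the log index directly. A rough sketch with the official Python client - the index pattern and field names are assumptions about a typical Fluentd setup, not our actual schema:

```python
# Querying error logs in Elasticsearch (sketch; index/fields are assumptions).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="fluentd-*",                    # typical Fluentd index pattern
    query={"match": {"level": "ERROR"}},  # assumes a 'level' field
    size=5,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("message", ""))
```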

For context, we're using Jenkins, GitHub Actions, and Docker.

I'd recommend checking out conference talks on YouTube and the official documentation for more details.


 
Posted : 29/07/2025 12:35 am