Same here! In practice, the most important factor was observability is not optional - you can't improve what you can't measure. We initially struggled with performance bottlenecks but found that drift detection with automated remediation worked well. The ROI has been significant - we've seen 50% improvement.
One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.
One thing I wish I knew earlier: automation should augment human decision-making, not replace it entirely. Would have saved us a lot of time.
Looking at the engineering side, there are some things to keep in mind. First, compliance requirements. Second, monitoring coverage. Third, performance tuning. We spent significant time on documentation and it was worth it. Code samples available on our GitHub if anyone wants to take a look. Performance testing showed 50% latency reduction.
For context, we're using Grafana, Loki, and Tempo.
For context, we're using Datadog, PagerDuty, and Slack.
One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.