Some guidance based on our experience: 1) Automate everything possible. 2) Use feature flags. 3) Share knowledge across teams. 4) Keep it simple. A common mistake to avoid: ignoring security. A resource that helped us: the Google SRE book (Site Reliability Engineering). The most important thing is consistency over perfection.
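To make the feature-flag tip a bit more concrete, here is a minimal sketch of a percentage rollout. This is not our actual setup: the function name `is_enabled`, the flag name, and the hash-based bucketing scheme are all assumptions for illustration, and a real flag service would typically add storage, overrides, and auditing on top.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Hypothetical helper: True if user_id falls inside the rollout for flag.

    Hashing flag and user together gives each user a stable bucket per flag,
    so the same user gets the same answer on every call.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < rollout_percent


# Example: expose a hypothetical "new-checkout" flag to about 10% of users.
if is_enabled("new-checkout", "user-42", 10):
    print("new code path")
else:
    print("old code path")
```

Because the bucket does not depend on the percentage, raising the rollout from 10 to 50 only adds users; nobody who already had the feature loses it.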
I'd recommend checking out relevant blog posts for more details.
Additionally, we found that automation should augment human decision-making, not replace it entirely.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
I want to share our path through this. We started about 12 months ago with a small pilot. The initial challenges included legacy compatibility. The breakthrough came when we automated our testing. Key metrics improved: availability rose from 99.5% to 99.9%. The team's feedback has been overwhelmingly positive, though we still have room to improve our automation. Lessons learned: communicate often. Next steps for us: improve documentation.
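For a sense of what the availability change means in practice, here is a quick back-of-the-envelope calculation. It assumes availability is measured over a full 365-day year; the function name `downtime_hours` is just something I made up for this sketch, and your own measurement window may differ.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime implied by an availability percentage over a period."""
    return (1 - availability_pct / 100) * period_hours


for pct in (99.5, 99.9):
    print(f"{pct}% availability -> {downtime_hours(pct):.2f} hours of downtime per year")
```

Under that assumption, 99.5% works out to about 43.8 hours of downtime per year and 99.9% to about 8.76 hours, so the downtime budget shrinks by roughly a factor of five.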
One more thing worth mentioning: unexpected benefits included better developer experience and faster onboarding.