What we'd suggest based on our work: 1) Automate everything possible 2) Monitor proactively 3) Practice incident response 4) Build for failure. Common mistakes to avoid: skipping documentation. Resources that helped us: Team Topologies. The most important thing is consistency over perfection.
Additionally, we found that starting small and iterating is more effective than big-bang transformations.
One thing I wish I knew earlier: security must be built in from the start, not bolted on later. Would have saved us a lot of time.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
For context, we're using Grafana, Loki, and Tempo.
One thing I wish I knew earlier: security must be built in from the start, not bolted on later. Would have saved us a lot of time.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
For context, we're using Terraform, AWS CDK, and CloudFormation.