We're running AWS Organizations best practices across 50+ accounts in production and wanted to share our experience.
Scale:
- 654 services deployed
- 82 TB data processed/month
- 48M requests/day
- 15 regions worldwide
Architecture:
- Compute: Lambda + Step Functions
- Data: DocumentDB
- Queue: MSK (Kafka)
Monthly cost: ~$190k
Lessons learned:
1. Serverless isn't always cheaper
2. Data transfer is the hidden cost
3. Autoscaling needs careful tuning
AMA about our setup!
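On lesson 2 specifically: if you want to see the data-transfer line item yourself, here's a minimal sketch using boto3 and Cost Explorer. The dates are placeholders and the substring match on usage type is a rough heuristic, not our actual reporting pipeline.

```python
# Rough sketch: break out data-transfer spend from Cost Explorer with boto3.
# Dates are placeholders; the "DataTransfer" substring match is a heuristic.
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer is served from us-east-1

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

# Sum only the usage types that look like data transfer, e.g. "...-DataTransfer-Out-Bytes".
transfer_total = sum(
    float(group["Metrics"]["UnblendedCost"]["Amount"])
    for group in resp["ResultsByTime"][0]["Groups"]
    if "DataTransfer" in group["Keys"][0]
)

print(f"Data transfer spend for the month: ${transfer_total:,.2f}")
```

Grouping by LINKED_ACCOUNT instead shows which accounts are driving the transfer spend.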
Let me share some ops lessons we've learned: Monitoring - Datadog APM and logs. Alerting - Opsgenie with escalation policies. Documentation - GitBook for public docs. Training - pairing sessions. These have helped us keep deployments reliable while still moving fast on new features.
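To make the Datadog APM piece concrete, here's a minimal sketch with the ddtrace library; the service and resource names are made-up placeholders, not our real services.

```python
# Minimal sketch of Datadog APM instrumentation with ddtrace.
# Service/resource names are illustrative placeholders.
from ddtrace import tracer


@tracer.wrap(service="orders-api", resource="process_order")
def process_order(order_id: str) -> None:
    # The decorated function runs inside an APM span; nested spans show up
    # as children in the flame graph.
    with tracer.trace("orders.validate") as span:
        span.set_tag("order_id", order_id)
        # ... validation logic would go here ...


if __name__ == "__main__":
    process_order("demo-123")
```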
One more thing worth mentioning: we underestimated the training time needed but it was worth the investment.
The end result was 80% reduction in security vulnerabilities.
For context, we're using Elasticsearch, Fluentd, and Kibana.
What a comprehensive overview! I have a few questions: 1) How did you handle scaling? 2) What was your approach to backup? 3) Did you encounter any issues with compliance? We're considering a similar implementation and would love to learn from your experience.
The end result was 99.9% availability, up from 99.5%.
For context, we're using Vault, AWS KMS, and SOPS.
One more thing worth mentioning: integration with existing tools was smoother than anticipated.
Here's what operations has taught us: Monitoring - CloudWatch with custom metrics. Alerting - Opsgenie with escalation policies. Documentation - Notion for team wikis. Training - pairing sessions. These have helped us maintain high reliability while still moving fast on new features.
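For anyone wondering what "custom metrics" looks like in practice, here's a minimal boto3 sketch; the namespace, metric name, threshold, and SNS topic ARN are all placeholders.

```python
# Minimal sketch: publish a custom metric and alarm on it with CloudWatch.
# Namespace, metric name, threshold, and SNS ARN are illustrative placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

# Emit one data point for a business-level metric (e.g. failed checkouts).
cloudwatch.put_metric_data(
    Namespace="MyApp/Checkout",
    MetricData=[{"MetricName": "FailedCheckouts", "Value": 3, "Unit": "Count"}],
)

# Alarm when the 5-minute sum crosses a threshold; hook it to an SNS topic for paging.
cloudwatch.put_metric_alarm(
    AlarmName="failed-checkouts-high",
    Namespace="MyApp/Checkout",
    MetricName="FailedCheckouts",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pager-topic"],  # placeholder ARN
)
```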
Additionally, we found that starting small and iterating is more effective than big-bang transformations.
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
Been there with this one! Symptoms: increased error rates. Root cause analysis revealed network misconfiguration. Fix: corrected routing rules. Prevention measures: better monitoring. Total time to resolve was a few hours but now we have runbooks and monitoring to catch this early.
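As a rough illustration of the kind of check we added afterwards (not our exact tooling), here's a small boto3 sketch that flags blackholed routes across route tables:

```python
# Rough sketch: flag blackholed routes across route tables with boto3.
# Region/profile handling omitted; this is illustrative, not our exact check.
import boto3

ec2 = boto3.client("ec2")

for table in ec2.describe_route_tables()["RouteTables"]:
    for route in table.get("Routes", []):
        # A route enters the "blackhole" state when its target (e.g. a NAT
        # gateway or peering connection) has been deleted out from under it.
        if route.get("State") == "blackhole":
            dest = route.get("DestinationCidrBlock",
                             route.get("DestinationPrefixListId", "?"))
            print(f"{table['RouteTableId']}: blackhole route to {dest}")
```

Wiring a check like this into scheduled monitoring is one way to catch routing drift before it shows up as error rates.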
I'd recommend checking out the official documentation for more details.
I'd recommend checking out relevant blog posts for more details.
For context, we're using Grafana, Loki, and Tempo.
Great info! We're exploring and evaluating this approach. Could you elaborate on team structure? Specifically, I'm curious about how you measured success. Also, how long did the initial implementation take? Any gotchas we should watch out for?
The end result was 40% cost savings on infrastructure.
For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.
There are several engineering considerations worth noting. First, compliance requirements. Second, backup procedures. Third, performance tuning. We spent significant time on documentation and it was worth it. Code samples available on our GitHub if anyone wants to take a look. Performance testing showed 2x improvement.
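As one small example of the backup side (assuming DocumentDB like the OP; identifiers are placeholders, not our real names), a minimal boto3 sketch for a manual cluster snapshot:

```python
# Minimal sketch: take a manual DocumentDB cluster snapshot with boto3.
# Cluster and snapshot identifiers are placeholders, not real names.
import datetime
import boto3

docdb = boto3.client("docdb")

stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
snapshot_id = f"orders-cluster-{stamp}"

docdb.create_db_cluster_snapshot(
    DBClusterIdentifier="orders-cluster",        # placeholder cluster name
    DBClusterSnapshotIdentifier=snapshot_id,
)

# Creation is asynchronous: poll describe_db_cluster_snapshots until Status == "available"
# before treating the snapshot as a usable restore point.
status = docdb.describe_db_cluster_snapshots(
    DBClusterSnapshotIdentifier=snapshot_id
)["DBClusterSnapshots"][0]["Status"]
print(f"{snapshot_id}: {status}")
```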
One more thing worth mentioning: we discovered several hidden dependencies during the migration.
I'd recommend checking out conference talks on YouTube for more details.
This resonates with my experience, though I'd emphasize cost analysis. We learned this the hard way when we underestimated the training time needed, though it was worth the investment in the end. Now we always make sure to document everything in runbooks. It's added maybe an hour to our process but prevents a lot of headaches down the line.
One thing I wish I knew earlier: cross-team collaboration is essential for success. Would have saved us a lot of time.
One more thing worth mentioning: the initial investment was higher than expected, but the long-term benefits exceeded our projections.
Allow me to present an alternative view on the metrics focus. In our environment, Jenkins, GitHub Actions, and Docker worked better; our philosophy is that automation should augment human decision-making, not replace it entirely. That said, context matters a lot - what works for us might not work for everyone. The key is to focus on outcomes.
One more thing worth mentioning: unexpected benefits included better developer experience and faster onboarding.
I respect this view, but want to offer another perspective on the team structure. In our environment, Datadog, PagerDuty, and Slack worked better; we try to design for failure modes rather than discover them in production. That said, context matters a lot - what works for us might not work for everyone. The key is to focus on outcomes.
Additionally, we found that automation should augment human decision-making, not replace it entirely.
Here are the technical specifics of our implementation. Architecture: hybrid cloud setup. Tools used: Terraform, AWS CDK, and CloudFormation. Configuration highlights: CI/CD with GitHub Actions workflows. Performance benchmarks showed 3x throughput improvement. Security considerations: zero-trust networking. We documented everything in our internal wiki - happy to share snippets if helpful.
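To give a flavor of the CDK side, here's a minimal Python (CDK v2) sketch; the stack and bucket are placeholders rather than anything from our actual setup:

```python
# Minimal AWS CDK v2 sketch (Python): one stack with an encrypted, versioned bucket.
# Stack and bucket names are placeholders; real stacks are split per account/region.
from aws_cdk import App, Stack, RemovalPolicy
from aws_cdk import aws_s3 as s3
from constructs import Construct


class ArtifactsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Versioned, encrypted bucket with public access blocked.
        s3.Bucket(
            self,
            "ArtifactsBucket",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            removal_policy=RemovalPolicy.RETAIN,
        )


app = App()
ArtifactsStack(app, "artifacts-stack")
app.synth()
```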
From the ops trenches, here's our take: Monitoring - Datadog APM and logs. Alerting - custom Slack integration. Documentation - Notion for team wikis. Training - monthly lunch and learns. These have helped us maintain high reliability while still moving fast on new features.
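The "custom Slack integration" is less fancy than it sounds; here's a bare-bones sketch with an incoming webhook (the URL and message are obviously placeholders):

```python
# Bare-bones sketch: push an alert to Slack via an incoming webhook.
# The webhook URL and message fields are placeholders for illustration.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def send_alert(text: str) -> None:
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Slack replies with "ok" on success


if __name__ == "__main__":
    send_alert(":rotating_light: error rate above threshold on checkout service")
```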
Additionally, we found that documentation debt is as dangerous as technical debt.
Super useful! We're just starting to evaluate this approach. Could you elaborate on success metrics? Specifically, I'm curious about stakeholder communication. Also, how long did the initial implementation take? Any gotchas we should watch out for?
One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.
Playing devil's advocate here on the tooling choice. In our environment, Datadog, PagerDuty, and Slack worked better; we also find that starting small and iterating beats big-bang transformations. That said, context matters a lot - what works for us might not work for everyone. The key is to invest in training.
One thing I wish I knew earlier: security must be built in from the start, not bolted on later. Would have saved us a lot of time.
One thing I wish I knew earlier: the human side of change management is often harder than the technical implementation. Would have saved us a lot of time.