Follow-up: Data lake architecture on AWS: S3, Glue, and Athena

Jennifer Young · 2025-05-13T04:21:13Z

Great approach! We took it in our organization and can confirm the benefits. One thing we added was chaos engineering tests in staging. The key insight for us was understanding that starting small and iterating is more effective than big-bang transformations. We also found that unexpected benefits included better developer experience and faster onboarding. Happy to share more details if anyone is interested. For context, we're using Istio, Linkerd, and Envoy; Grafana, Loki, and Tempo; Kubernetes, Helm, ArgoCD, and Prometheus; and Elasticsearch, Fluentd, and Kibana.


AIOps Discussion

Last Post by Brian Cook 1 year ago

19 Posts

17 Users

0 Reactions

592 Views


Donald Lee

(@donald.lee803)

Posts: 0

Good analysis, though I have a different take on the team structure. In our environment, we found that Elasticsearch, Fluentd, and Kibana worked better because starting small and iterating is more effective than big-bang transformations. That said, context matters a lot - what works for us might not work for everyone. The key is to focus on outcomes.

One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.

I'd recommend checking out relevant blog posts for more details.

Posted : 02/06/2025 12:46 am

Mark Perez

(@mark.perez536)

Posts: 0

Let me share some operational lessons we've learned: Monitoring - Prometheus with Grafana dashboards. Alerting - Opsgenie with escalation policies. Documentation - Notion for team wikis. Training - certification programs. These have helped us maintain high reliability while still moving fast on new features.

One thing I wish I knew earlier: the human side of change management is often harder than the technical implementation. Would have saved us a lot of time.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

Posted : 02/06/2025 3:39 am

Elizabeth Perez

(@elizabeth.perez157)

Posts: 0

Been there with this one! Symptoms: frequent timeouts. Root cause: memory leaks, found during root cause analysis. Fix: patched the leak. Prevention: chaos engineering. Total time to resolve was an hour, and we now have runbooks and monitoring to catch this early.
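To illustrate what "monitoring to catch this early" can look like for a leak, here is a minimal sketch, not our actual setup. It assumes memory usage is sampled at regular intervals and flags a steady upward trend using a least-squares slope. The function names and the 1 MB-per-sample threshold are hypothetical and would need tuning for a real service.

```python
from typing import Sequence


def memory_growth_slope(samples_mb: Sequence[float]) -> float:
    """Least-squares slope of memory usage, in MB per sample interval."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2                      # mean of indices 0..n-1
    mean_y = sum(samples_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_leak(samples_mb: Sequence[float], min_slope_mb: float = 1.0) -> bool:
    """Hypothetical check: True when memory grows by at least min_slope_mb per sample."""
    return memory_growth_slope(samples_mb) >= min_slope_mb


if __name__ == "__main__":
    steady_growth = [100, 102, 104, 106, 108]   # climbs 2 MB every sample
    stable = [100, 101, 100, 101, 100]          # oscillates, no trend
    print(looks_like_leak(steady_growth))
    print(looks_like_leak(stable))
```

A trend check like this only catches slow, steady growth. Sudden spikes still need a plain threshold alert.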

The end result was a 50% reduction in deployment time.

One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.

For context, we're using Datadog, PagerDuty, and Slack.

Also, integration with existing tools was smoother than anticipated.

Posted : 03/06/2025 12:48 pm

Brian Cook

(@brian.cook36)

Posts: 0

Great job documenting all of this! I have a few questions: 1) How did you handle scaling? 2) What was your approach to rollback? 3) Did you encounter any issues with latency? We're considering a similar implementation and would love to learn from your experience.

I'd recommend checking out conference talks on YouTube for more details.

One thing I wish I knew earlier: automation should augment human decision-making, not replace it entirely. Would have saved us a lot of time.

For context, we're using Terraform, AWS CDK, and CloudFormation.

Posted : 04/06/2025 4:53 pm
