Follow-up: Data lake architecture on AWS: S3, Glue, and Athena

19 Posts
17 Users
0 Reactions
140 Views
(@donald.lee803)
Posts: 0
 

Good analysis, though I have a different take on the team structure. In our environment, we found that Elasticsearch, Fluentd, and Kibana worked better, because starting small and iterating is more effective than big-bang transformations. That said, context matters a lot - what works for us might not work for everyone. The key is to focus on outcomes.

One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.

I'd recommend checking out relevant blog posts for more details.


 
Posted : 02/06/2025 12:46 am
(@mark.perez536)
Posts: 0
 

Let me share some ops lessons we've learned: Monitoring - Prometheus with Grafana dashboards. Alerting - Opsgenie with escalation policies. Documentation - Notion for team wikis. Training - certification programs. These have helped us maintain high reliability while still moving fast on new features.

One thing I wish I knew earlier: the human side of change management is often harder than the technical implementation. Would have saved us a lot of time.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 02/06/2025 3:39 am
(@elizabeth.perez157)
Posts: 0
 

Been there with this one! Symptoms: frequent timeouts. Root cause: analysis revealed memory leaks. Fix: patched the leak. Prevention: chaos engineering. Total time to resolve was an hour, and we now have runbooks and monitoring to catch this early.
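To make the "monitoring to catch this early" part concrete, here is a minimal sketch of the kind of check that can flag steady memory growth before it turns into timeouts. It is not our actual monitoring setup: the function name `leak_suspected`, the window size, the 10% threshold, and the sample values are all made up for illustration.

```python
from statistics import mean
from typing import Sequence


def leak_suspected(
    samples_mb: Sequence[float],
    window: int = 3,          # hypothetical: samples per comparison window
    min_growth: float = 0.10  # hypothetical: 10% growth triggers the flag
) -> bool:
    """Compare the average of the oldest `window` samples with the average
    of the newest `window` samples and flag growth above `min_growth`."""
    if len(samples_mb) < 2 * window:
        return False  # not enough data to compare two separate windows
    baseline = mean(samples_mb[:window])
    recent = mean(samples_mb[-window:])
    if baseline <= 0:
        return False  # avoid dividing by zero on empty or bogus readings
    return (recent - baseline) / baseline > min_growth


# Illustrative memory readings in MB, taken at regular intervals.
steady = [510, 498, 505, 502, 507, 500]
growing = [500, 540, 590, 650, 700, 760]

print(leak_suspected(steady))
print(leak_suspected(growing))
```

A check like this is deliberately crude: it only compares two windows, so it will not distinguish a leak from a legitimate increase in load. In practice you would tune the window and threshold to your own traffic patterns.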

The end result was a 50% reduction in deployment time.

One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.

For context, we're using Datadog, PagerDuty, and Slack.

Also, integration with existing tools was smoother than anticipated.


 
Posted : 03/06/2025 12:48 pm
(@brian.cook36)
Posts: 0
 

Great job documenting all of this! I have a few questions: 1) How did you handle scaling? 2) What was your approach to rollback? 3) Did you encounter any issues with latency? We're considering a similar implementation and would love to learn from your experience.

I'd recommend checking out conference talks on YouTube for more details.

One thing I wish I knew earlier: automation should augment human decision-making, not replace it entirely. Would have saved us a lot of time.

For context, we're using Terraform, AWS CDK, and CloudFormation.


 
Posted : 04/06/2025 4:53 pm