I hear you, but here's where I disagree on the tooling choice. In our environment, Elasticsearch, Fluentd, and Kibana worked better for us; the bigger lesson, though, was that documentation debt is as dangerous as technical debt. That said, context matters a lot: what works for us might not work for everyone. The key is to focus on outcomes.
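In case it helps anyone evaluating the EFK route, here's a minimal sketch of shipping structured logs to Fluentd from Python, assuming the fluent-logger package and a Fluentd forwarder on its default port; the tag and field names are just examples, not anything from the original post.

```python
# Minimal sketch: emit structured events to a local Fluentd forwarder,
# which then ships them to Elasticsearch for querying in Kibana.
# Assumes the `fluent-logger` package and Fluentd listening on 24224.
from fluent import sender

logger = sender.FluentSender("myapp", host="localhost", port=24224)

# Structured fields are what make the logs searchable in Kibana later;
# a plain message string loses that.
logger.emit("request", {
    "service": "checkout",
    "route": "/api/orders",
    "status": 200,
    "latency_ms": 42,
})

logger.close()
```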
For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.
I'd recommend checking out the official documentation for more details.
Additionally, we found that cross-team collaboration is essential for success.
Super useful! We're just starting to evaluate this approach. Could you elaborate on your team structure? I'm also curious about how you measured success. And how long did the initial implementation take? Any gotchas we should watch out for?
Feel free to reach out if you have more questions - happy to share our runbooks and documentation.
Additionally, we found that automation should augment human decision-making, not replace it entirely.
What we'd suggest based on our work: 1) automate everything possible, 2) use feature flags, 3) review and iterate, 4) measure what matters. Common mistakes to avoid: ignoring security. Resources that helped us: Accelerate, the book behind the DORA research. The most important thing is consistency over perfection.
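On point 2, a feature flag doesn't need a platform to start with. Here's a minimal sketch of a flag check with a safe default, assuming flags come from an environment variable; the flag and variable names are made up for illustration.

```python
# Minimal feature-flag sketch: flags read from an environment variable,
# with a hard-coded safe default so a missing flag never breaks prod.
# The flag name and env var are hypothetical examples.
import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Return True if the comma-separated FEATURE_FLAGS env var lists `name`."""
    enabled = os.environ.get("FEATURE_FLAGS", "")
    flags = {f.strip() for f in enabled.split(",") if f.strip()}
    return name in flags or default

if flag_enabled("new_checkout_flow"):
    pass  # new code path, rolled out gradually
else:
    pass  # existing behaviour stays the default
```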
For context, we're using Jenkins, GitHub Actions, and Docker.
For context, we're using Grafana, Loki, and Tempo.
For context, we're using Istio, Linkerd, and Envoy.
This happened to us! Symptoms: increased error rates. Root cause analysis revealed connection pool exhaustion. Fix: corrected routing rules. Prevention measures: better monitoring. Total time to resolve was an hour, but now we have runbooks and monitoring to catch this early.
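For anyone who hits the same thing: the usual guardrail is to bound the pool and fail fast instead of queueing forever. A sketch of what that looks like, assuming SQLAlchemy and a Postgres driver; the numbers and DSN are placeholders, not recommendations.

```python
# Sketch of bounding a connection pool so exhaustion surfaces quickly
# instead of hanging requests. Assumes SQLAlchemy; sizes are examples only.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:pass@db:5432/app",   # placeholder DSN
    pool_size=10,        # steady-state connections kept open
    max_overflow=5,      # short bursts above pool_size
    pool_timeout=3,      # fail fast after 3s instead of queueing forever
    pool_pre_ping=True,  # drop dead connections before handing them out
)
```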
One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.
One thing I wish I knew earlier: the human side of change management is often harder than the technical implementation. Would have saved us a lot of time.
Great post! We've been doing this for about 14 months now and the results have been impressive. Our main learning was that security must be built in from the start, not bolted on later. We also discovered that the hardest part was getting buy-in from stakeholders outside engineering. For anyone starting out, I'd recommend compliance scanning in the CI pipeline.
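To make the compliance-scanning-in-CI idea concrete: the simplest pattern we know is a small gate script that reads the scanner's report and fails the pipeline above a severity threshold. A sketch below; the report path, JSON shape, and threshold are hypothetical, so map them to whatever scanner you actually run.

```python
# CI gate sketch: fail the build if the security scan reports findings at or
# above a chosen severity. The report filename and JSON layout are made-up
# examples - adapt them to your scanner's actual output format.
import json
import sys

SEVERITY_ORDER = ["low", "medium", "high", "critical"]
THRESHOLD = "high"  # fail on high or critical

def should_fail(findings: list[dict], threshold: str) -> bool:
    limit = SEVERITY_ORDER.index(threshold)
    return any(
        SEVERITY_ORDER.index(f.get("severity", "low")) >= limit
        for f in findings
    )

if __name__ == "__main__":
    with open("scan-report.json") as fh:      # hypothetical report path
        report = json.load(fh)
    if should_fail(report.get("findings", []), THRESHOLD):
        print("Security gate: findings at or above", THRESHOLD)
        sys.exit(1)                            # non-zero exit fails the CI job
    print("Security gate: passed")
```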
The end result was 99.9% availability, up from 99.5%.
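For anyone weighing what that jump means in practice, going from 99.5% to 99.9% cuts the allowed downtime budget by roughly a factor of five. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope downtime budgets for the two availability targets.
HOURS_PER_YEAR = 365 * 24

for target in (0.995, 0.999):
    downtime_h = (1 - target) * HOURS_PER_YEAR
    print(f"{target:.1%} availability -> ~{downtime_h:.1f} hours of downtime per year")

# 99.5% -> ~43.8 h/year; 99.9% -> ~8.8 h/year
```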
The end result was a 50% reduction in deployment time.
The end result was a 60% improvement in developer productivity.
There are several engineering considerations worth noting: first, data residency; second, failover strategy; third, performance tuning. We spent significant time on automation and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 10x throughput increase.
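Since failover strategy came up, here's the shape of the simplest client-side version: try the primary with a short timeout, then fall back to a secondary. This is only a sketch, assuming the `requests` library; the endpoint URLs are placeholders.

```python
# Client-side failover sketch: try endpoints in order with a short timeout
# each, returning the first success. URLs are placeholders. Assumes `requests`.
import requests

ENDPOINTS = [
    "https://primary.example.internal/api/health",
    "https://secondary.example.internal/api/health",
]

def fetch_with_failover(endpoints: list[str], timeout_s: float = 2.0) -> requests.Response:
    last_error: Exception | None = None
    for url in endpoints:
        try:
            resp = requests.get(url, timeout=timeout_s)
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            last_error = exc            # remember the failure, try the next endpoint
    raise RuntimeError("all endpoints failed") from last_error
```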
Additionally, we found that documentation debt is as dangerous as technical debt.
Architecturally, there are important trade-offs to consider: first, network topology; second, monitoring coverage; third, cost optimization. We spent significant time on monitoring and it was worth it. Code samples are available on our GitHub if anyone wants to take a look. Performance testing showed a 50% latency reduction.
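On the monitoring coverage point, one cheap way to verify it is to query Prometheus's HTTP API for the latency percentile you care about and treat a missing series as a coverage gap. A sketch, assuming a Prometheus server at the usual address and the common http_request_duration_seconds histogram; both are assumptions about your setup.

```python
# Sketch: ask Prometheus for p99 request latency over the last 5 minutes.
# Server address and metric name are assumptions - adjust for your stack.
# Uses Prometheus's standard /api/v1/query endpoint via `requests`.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
)

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=5)
resp.raise_for_status()
result = resp.json()["data"]["result"]

if not result:
    print("No latency data - a gap in monitoring coverage worth alerting on")
else:
    p99_seconds = float(result[0]["value"][1])
    print(f"p99 latency: {p99_seconds * 1000:.0f} ms")
```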
The end result was a 70% reduction in incident MTTR.
Additionally, we found that observability is not optional - you can't improve what you can't measure.
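Agreed, and the first step doesn't have to be a big platform. Here's a minimal sketch of instrumenting a code path with the prometheus_client library so there's something to measure at all; the metric names and workload are just examples, not anything from the original post.

```python
# Minimal instrumentation sketch with the official prometheus_client library.
# Exposes a /metrics endpoint Prometheus can scrape; metric names are examples.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_processed_total", "Orders processed")
LATENCY = Histogram("order_processing_seconds", "Time spent processing an order")

@LATENCY.time()                 # records the duration of each call
def process_order() -> None:
    REQUESTS.inc()              # counts every processed order
    time.sleep(0.05)            # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)     # metrics served at http://localhost:8000/metrics
    while True:
        process_order()
```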
The end result was a 90% decrease in manual toil.
I like this topic!
Totally agree with your approach.
The ROI has been significant; we've seen a 2x improvement.
For context, we’re using Datadog, PagerDuty, and Slack.
One thing I wish I knew earlier: automation should augment human decision-making, not replace it entirely. Would have saved us a lot of time.
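One concrete pattern for that is to keep a manual confirmation step in anything destructive, so the automation proposes and a human approves. A tiny sketch of the idea; the planned actions here are placeholders for whatever your tooling would actually do.

```python
# Sketch of an approval gate: automation plans the change, a human confirms.
# The "actions" are placeholders, not real commands.
def plan_changes() -> list[str]:
    return ["scale deployment web to 6 replicas", "rotate database credentials"]

def apply(action: str) -> None:
    print(f"applying: {action}")   # a real implementation would call your APIs here

if __name__ == "__main__":
    planned = plan_changes()
    print("Planned actions:")
    for action in planned:
        print(f"  - {action}")
    if input("Apply these changes? [y/N] ").strip().lower() == "y":
        for action in planned:
            apply(action)
    else:
        print("Aborted - nothing changed.")
```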