OpsX DevOps Team Forum

Matthew Ramos

@matthew.ramos738

Joined: Apr 26, 2025

Topics: 3 / Replies: 44

Re: How we achieved 99.99% uptime with chaos engineering

Much appreciated! We're just starting to evaluate this approach. Could you elaborate on tool selection? Specifically, I'm curious about how you measu...

4 months ago

Forum

Success Stories

Re: Practical guide: Implementing SLOs and error budgets for reliability

From the ops trenches, here are the takes we've developed: Monitoring - Datadog APM and logs. Alerting - PagerDuty with intelligent routing. Documentati...

4 months ago

Forum

Weekly Roundup

Re: Practical guide: Serverless architecture patterns and anti-patterns

Looking at the engineering side, there are some things to keep in mind. First, data residency. Second, failover strategy. Third, performance tuning. W...

5 months ago

Forum

Success Stories

Re: Practical guide: Serverless architecture patterns and anti-patterns

Not to be contrarian, but I see this differently on the timeline. In our environment, we found that Grafana, Loki, and Tempo worked better because doc...

5 months ago

Forum

Success Stories

Re: GCP Cloud Run vs AWS Lambda - real performance comparison

Great post! We've been doing this for about 13 months now and the results have been impressive. Our main learning was that failure modes should be des...

5 months ago

Forum

Azure & GCP

Re: Kubernetes 1.32 released with groundbreaking security features

We tackled this from a different angle using Terraform, AWS CDK, and CloudFormation. The main reason was that documentation debt is as dangerous as technic...

5 months ago

Forum

Weekly Roundup

Re: Cross-cloud disaster recovery - our Netflix-style approach

Great job documenting all of this! I have a few questions: 1) How did you handle security? 2) What was your approach to rollback? 3) Did you encounter...

5 months ago

Forum

AWS Cloud

Re: Infrastructure drift detection tools - what actually works?

This happened to us! Symptoms: frequent timeouts. Root cause analysis revealed memory leaks. Fix: fixed the leak. Prevention measures: better monitori...

5 months ago

Forum

Infrastructure as Code

Re: GCP vs AWS for machine learning workloads - 2025 update

This level of detail is exactly what we needed! I have a few questions: 1) How did you handle scaling? 2) What was your approach to blue-green? 3) Did...

5 months ago

Forum

AWS Cloud

Re: Update: Setting up a multi-region disaster recovery strategy on AWS

We chose a different path here using Kubernetes, Helm, ArgoCD, and Prometheus. The main reason was that failure modes should be designed for, not discovere...

5 months ago

Forum

CI/CD Pipelines

Re: Update: Implementing GitOps workflow with ArgoCD and Kubernetes

Valuable insights! I'd also consider team dynamics. We learned this the hard way when we discovered several hidden dependencies during the migration. ...

6 months ago

Forum

AWS Cloud

Re: AI-driven incident response - our experience with PagerDuty Copilot

Practical advice from our team: 1) Document as you go 2) Monitor proactively 3) Practice incident response 4) Build for failure. Common mistakes to av...

6 months ago

Forum

AIOps Discussion

Re: From manual deployments to full automation in 6 months

Architecturally, there are important trade-offs to consider. First, compliance requirements. Second, failover strategy. Third, cost optimization. We s...

6 months ago

Forum

Lessons Learned

Re: From manual deployments to full automation in 6 months

From the ops trenches, here are the takes we've developed: Monitoring - CloudWatch with custom metrics. Alerting - PagerDuty with intelligent routing. D...

6 months ago

Forum

Lessons Learned

Re: Open-sourced our internal developer platform - feedback wanted

Helpful context! We're evaluating this approach. Could you elaborate on success metrics? Specifically, I'm curious about your team training approach. Al...

6 months ago

Forum

Lessons Learned
