<?xml version="1.0" encoding="UTF-8"?>        <rss version="2.0"
             xmlns:atom="http://www.w3.org/2005/Atom"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
             xmlns:admin="http://webns.net/mvcb/"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <channel>
            <title>Clouds - AWS, Azure, GCP - OpsX DevOps Team Forum</title>
            <link>https://opsx.team/community/clouds/</link>
            <description>OpsX DevOps Team Discussion Board</description>
            <language>en-US</language>
            <lastBuildDate>Tue, 07 Apr 2026 21:57:25 +0000</lastBuildDate>
            <generator>wpForo</generator>
            <ttl>60</ttl>
							                    <item>
                        <title>Update: Implementing blue-green deployments with zero downtime</title>
                        <link>https://opsx.team/community/clouds/update-implementing-blue-green-deployments-with-zero-downtime-203/</link>
                        <pubDate>Sat, 15 Nov 2025 07:21:13 +0000</pubDate>
                        <description><![CDATA[Lessons we learned along the way: 1) Automate everything possible 2) Monitor proactively 3) Review and iterate 4) Keep it simple. Common mistakes to avoid: over-engineering early. Resources ...]]></description>
                        <content:encoded><![CDATA[Lessons we learned along the way: 1) Automate everything possible 2) Monitor proactively 3) Review and iterate 4) Keep it simple. Common mistakes to avoid: over-engineering early. Resources that helped us: Google SRE book. The most important thing is learning over blame.
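
To make "keep it simple" concrete, the core of a zero-downtime cutover is just: health-check the new (green) stack before shifting traffic, and keep the old (blue) stack warm for instant rollback. Here's a minimal Python sketch of that decision logic; the stack shape and `healthy` check are illustrative stand-ins, not our actual tooling:

```python
# Minimal blue-green cutover sketch (hypothetical names, not production code).
# The router only flips to "green" after it passes a health check, so "blue"
# keeps serving traffic and stays available for instant rollback.

def healthy(stack: dict) -> bool:
    """Stand-in health check; a real deployment would probe HTTP endpoints."""
    return stack.get("status") == "ok" and stack.get("error_rate", 1.0) < 0.01

def cut_over(router: dict, green: dict) -> dict:
    """Point the router at green only if green is healthy; otherwise keep blue."""
    if healthy(green):
        router["active"], router["standby"] = "green", "blue"
    return router

router = {"active": "blue", "standby": "green"}
green = {"status": "ok", "error_rate": 0.001}
print(cut_over(router, green)["active"])  # green
```

The same function doubles as the rollback path: if green never turns healthy, traffic simply never leaves blue.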

The end result was 99.9% availability, up from 99.5%.

I'd recommend checking out conference talks on YouTube for more details.

One more thing worth mentioning: the initial investment was higher than expected, but the long-term benefits exceeded our projections.

The end result was 3x increase in deployment frequency.

Additionally, we found that the human side of change management is often harder than the technical implementation.

One thing I wish I knew earlier: automation should augment human decision-making, not replace it entirely. Would have saved us a lot of time.]]></content:encoded>
						                            <category domain="https://opsx.team/community/clouds/">Clouds - AWS, Azure, GCP</category>                        <dc:creator>Jose Williams</dc:creator>
                        <guid isPermaLink="true">https://opsx.team/community/clouds/update-implementing-blue-green-deployments-with-zero-downtime-203/</guid>
                    </item>
				                    <item>
                        <title>Secrets management: HashiCorp Vault vs AWS Secrets Manager</title>
                        <link>https://opsx.team/community/clouds/secrets-management-hashicorp-vault-vs-aws-secrets-manager-133/</link>
                        <pubDate>Tue, 07 Oct 2025 05:21:13 +0000</pubDate>
                        <description><![CDATA[Managing secrets across multiple environments is challenging. We evaluated HashiCorp Vault and AWS Secrets Manager. Vault offers more features (dynamic secrets, PKI, transit encryption) but ...]]></description>
                        <content:encoded><![CDATA[Managing secrets across multiple environments is challenging. We evaluated HashiCorp Vault and AWS Secrets Manager. Vault offers more features (dynamic secrets, PKI, transit encryption) but requires operational overhead. Secrets Manager is simpler and integrates well with AWS services. Our choice: Vault for complex needs, Secrets Manager for AWS-native workloads. What's your secrets management approach?]]></content:encoded>
						                            <category domain="https://opsx.team/community/clouds/">Clouds - AWS, Azure, GCP</category>                        <dc:creator>Dennis King</dc:creator>
                        <guid isPermaLink="true">https://opsx.team/community/clouds/secrets-management-hashicorp-vault-vs-aws-secrets-manager-133/</guid>
                    </item>
				                    <item>
                        <title>Follow-up: Terraform vs Pulumi: A comprehensive comparison for IaC</title>
                        <link>https://opsx.team/community/clouds/follow-up-terraform-vs-pulumi-a-comprehensive-comparison-for-iac-165/</link>
                        <pubDate>Sat, 16 Aug 2025 21:21:13 +0000</pubDate>
                        <description><![CDATA[This is almost identical to what we faced. The problem: security vulnerabilities. Our initial approach was manual intervention but that didn&#039;t work because it didn&#039;t scale. What actually wor...]]></description>
                        <content:encoded><![CDATA[This is almost identical to what we faced. The problem: security vulnerabilities. Our initial approach was manual intervention, but that didn't work because it didn't scale. What actually worked: compliance scanning in the CI pipeline. The key insight was that starting small and iterating is more effective than big-bang transformations. Now we're able to scale automatically.
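
Roughly, the CI compliance gate boils down to walking the JSON that `terraform show -json` emits and flagging rule violations before apply. The rule set and sample plan below are illustrative stand-ins, not our real policy baseline:

```python
# Hypothetical compliance gate over `terraform show -json` output.
# Rules and the sample plan are illustrative, not a real policy set.

def find_violations(plan: dict) -> list:
    violations = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_s3_bucket" and after.get("acl") == "public-read":
            violations.append(f"{rc['address']}: public-read S3 bucket")
        if rc.get("type") == "aws_security_group_rule" and "0.0.0.0/0" in after.get("cidr_blocks", []):
            violations.append(f"{rc['address']}: ingress open to 0.0.0.0/0")
    return violations

plan = {"resource_changes": [
    {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
     "change": {"after": {"acl": "public-read"}}},
]}
print(find_violations(plan))  # ['aws_s3_bucket.logs: public-read S3 bucket']
```

Failing the pipeline on a non-empty list is what made this scale where manual review didn't.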

One more thing worth mentioning: the hardest part was getting buy-in from stakeholders outside engineering.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

I'd recommend checking out relevant blog posts for more details.

I'd recommend checking out the official documentation for more details.]]></content:encoded>
						                            <category domain="https://opsx.team/community/clouds/">Clouds - AWS, Azure, GCP</category>                        <dc:creator>Christine Moore</dc:creator>
                        <guid isPermaLink="true">https://opsx.team/community/clouds/follow-up-terraform-vs-pulumi-a-comprehensive-comparison-for-iac-165/</guid>
                    </item>
				                    <item>
                        <title>Prometheus and Grafana: Advanced monitoring techniques</title>
                        <link>https://opsx.team/community/clouds/prometheus-and-grafana-advanced-monitoring-techniques-130/</link>
                        <pubDate>Tue, 29 Jul 2025 20:21:13 +0000</pubDate>
                        <description><![CDATA[We&#039;ve been using Prometheus and Grafana for 2 years and wanted to share some advanced techniques. Recording rules for expensive queries, alerting best practices (avoid alert fatigue!), using...]]></description>
                        <content:encoded><![CDATA[We've been using Prometheus and Grafana for 2 years and wanted to share some advanced techniques. Recording rules for expensive queries, alerting best practices (avoid alert fatigue!), using Thanos for long-term storage and multi-cluster federation, and Grafana provisioning for dashboard-as-code. Our SRE team now manages 500+ services with these tools. What monitoring insights can you share?]]></content:encoded>
						                            <category domain="https://opsx.team/community/clouds/">Clouds - AWS, Azure, GCP</category>                        <dc:creator>Jose Jackson</dc:creator>
                        <guid isPermaLink="true">https://opsx.team/community/clouds/prometheus-and-grafana-advanced-monitoring-techniques-130/</guid>
                    </item>
				                    <item>
                        <title>Practical guide: MLOps: Building ML pipelines with Kubeflow and MLflow</title>
                        <link>https://opsx.team/community/clouds/practical-guide-mlops-building-ml-pipelines-with-kubeflow-and-mlflow-205/</link>
                        <pubDate>Sun, 20 Jul 2025 05:21:13 +0000</pubDate>
                        <description><![CDATA[This mirrors what we went through. We learned: Phase 1 (1 month) involved tool evaluation. Phase 2 (2 months) focused on pilot implementation. Phase 3 (2 weeks) was all about optimization. T...]]></description>
                        <content:encoded><![CDATA[This mirrors what we went through. We learned: Phase 1 (1 month) involved tool evaluation. Phase 2 (2 months) focused on pilot implementation. Phase 3 (2 weeks) was all about optimization. Total investment was $100K but the payback period was only 6 months. Key success factors: executive support, dedicated team, clear metrics. If I could do it again, I would start with better documentation.
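
A quick sanity check on the payback math above (figures straight from the numbers we shared, nothing else):

```python
# Back-of-the-envelope check: a $100K investment repaid in 6 months
# implies roughly $16.7K of monthly savings.
investment = 100_000
payback_months = 6
monthly_savings = investment / payback_months
print(round(monthly_savings))  # 16667
```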

For context, we're using Jenkins, GitHub Actions, and Docker.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

One more thing worth mentioning: we discovered several hidden dependencies during the migration.

Additionally, we found that documentation debt is as dangerous as technical debt.

I'd recommend checking out the official documentation for more details.]]></content:encoded>
						                            <category domain="https://opsx.team/community/clouds/">Clouds - AWS, Azure, GCP</category>                        <dc:creator>Jason Brooks</dc:creator>
                        <guid isPermaLink="true">https://opsx.team/community/clouds/practical-guide-mlops-building-ml-pipelines-with-kubeflow-and-mlflow-205/</guid>
                    </item>
				                    <item>
                        <title>Deep dive: Terraform vs Pulumi: A comprehensive comparison for IaC</title>
                        <link>https://opsx.team/community/clouds/deep-dive-terraform-vs-pulumi-a-comprehensive-comparison-for-iac-181/</link>
                        <pubDate>Sat, 05 Jul 2025 03:21:13 +0000</pubDate>
                        <description><![CDATA[Appreciated! We&#039;re in the process of evaluating this approach. Could you elaborate on success metrics? Specifically, I&#039;m curious about risk mitigation. Also, how long did the initial impleme...]]></description>
                        <content:encoded><![CDATA[Appreciated! We're in the process of evaluating this approach. Could you elaborate on success metrics? Specifically, I'm curious about risk mitigation. Also, how long did the initial implementation take? Any gotchas we should watch out for?

For context, we're using Grafana, Loki, and Tempo.

I'd recommend checking out relevant blog posts for more details.

The end result was 99.9% availability, up from 99.5%.
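
Since you asked about success metrics: that availability jump is bigger than it looks when you translate it into downtime (assuming a 30-day month):

```python
# What 99.5% -> 99.9% availability means in allowed downtime per month.
minutes_per_month = 30 * 24 * 60  # 43200
downtime_at_995 = minutes_per_month * (1 - 0.995)  # budget at 99.5%
downtime_at_999 = minutes_per_month * (1 - 0.999)  # budget at 99.9%
print(round(downtime_at_995), round(downtime_at_999))  # 216 43
```

That's a roughly 5x cut in the monthly downtime budget, which is the number I'd track during an evaluation.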

For secrets management specifically, we're on Vault, AWS KMS, and SOPS.

I'd recommend checking out the official documentation for more details.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

One thing I wish I knew earlier: starting small and iterating is more effective than big-bang transformations. Would have saved us a lot of time.]]></content:encoded>
						                            <category domain="https://opsx.team/community/clouds/">Clouds - AWS, Azure, GCP</category>                        <dc:creator>Jerry Green</dc:creator>
                        <guid isPermaLink="true">https://opsx.team/community/clouds/deep-dive-terraform-vs-pulumi-a-comprehensive-comparison-for-iac-181/</guid>
                    </item>
				                    <item>
                        <title>Part 2: SOC 2 compliance for cloud-native applications</title>
                        <link>https://opsx.team/community/clouds/part-2-soc-2-compliance-for-cloud-native-applications-166/</link>
                        <pubDate>Fri, 20 Jun 2025 02:21:13 +0000</pubDate>
                        <description><![CDATA[Here&#039;s our full story with this. We started about 6 months ago with a small pilot. Initial challenges included tool integration. The breakthrough came when we simplified the architecture. Ke...]]></description>
                        <content:encoded><![CDATA[Here's our full story with this. We started about 6 months ago with a small pilot. Initial challenges included tool integration. The breakthrough came when we simplified the architecture. Key metrics improved: 90% decrease in manual toil. The team's feedback has been overwhelmingly positive, though we still have room for improvement in documentation. Lessons learned: automate everything. Next steps for us: add more automation.

Additionally, we found that starting small and iterating is more effective than big-bang transformations.

Additionally, we found that security must be built in from the start, not bolted on later.

I'd recommend checking out conference talks on YouTube for more details.

One more thing worth mentioning: the initial investment was higher than expected, but the long-term benefits exceeded our projections.]]></content:encoded>
						                            <category domain="https://opsx.team/community/clouds/">Clouds - AWS, Azure, GCP</category>                        <dc:creator>Michelle Gutierrez</dc:creator>
                        <guid isPermaLink="true">https://opsx.team/community/clouds/part-2-soc-2-compliance-for-cloud-native-applications-166/</guid>
                    </item>
				                    <item>
                        <title>Practical guide: Comparing AWS, Azure, and GCP for enterprise workloads</title>
                        <link>https://opsx.team/community/clouds/practical-guide-comparing-aws-azure-and-gcp-for-enterprise-workloads-253/</link>
                        <pubDate>Sat, 12 Apr 2025 15:21:13 +0000</pubDate>
                        <description><![CDATA[Key takeaways from our implementation: 1) Test in production-like environments 2) Monitor proactively 3) Share knowledge across teams 4) Measure what matters. Common mistakes to avoid: over-...]]></description>
                        <content:encoded><![CDATA[Key takeaways from our implementation: 1) Test in production-like environments 2) Monitor proactively 3) Share knowledge across teams 4) Measure what matters. Common mistakes to avoid: over-engineering early. Resources that helped us: Phoenix Project. The most important thing is learning over blame.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

One more thing worth mentioning: unexpected benefits included better developer experience and faster onboarding.

The end results: a 70% reduction in incident MTTR, 99.9% availability (up from 99.5%), and a 60% improvement in developer productivity.

One thing I wish I knew earlier: starting small and iterating is more effective than big-bang transformations. Would have saved us a lot of time.]]></content:encoded>
						                            <category domain="https://opsx.team/community/clouds/">Clouds - AWS, Azure, GCP</category>                        <dc:creator>Sharon Garcia</dc:creator>
                        <guid isPermaLink="true">https://opsx.team/community/clouds/practical-guide-comparing-aws-azure-and-gcp-for-enterprise-workloads-253/</guid>
                    </item>
				                    <item>
                        <title>Follow-up: Data lake architecture on AWS: S3, Glue, and Athena</title>
                        <link>https://opsx.team/community/clouds/follow-up-data-lake-architecture-on-aws-s3-glue-and-athena-284/</link>
                        <pubDate>Wed, 26 Mar 2025 09:21:13 +0000</pubDate>
                        <description><![CDATA[We went through something very similar. The problem: deployment failures. Our initial approach was ad-hoc monitoring but that didn&#039;t work because it didn&#039;t scale. What actually worked: featu...]]></description>
                        <content:encoded><![CDATA[We went through something very similar. The problem: deployment failures. Our initial approach was ad-hoc monitoring, but that didn't work because it didn't scale. What actually worked: feature flags for gradual rollouts. The key insight was that failure modes should be designed for, not discovered in production. Now we're able to detect issues early.
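
For anyone implementing the gradual rollouts: percentage flags are usually backed by a stable hash, so raising the percentage only ever adds users and never flip-flops anyone. A hedged sketch (flag and user names are made up):

```python
# Hypothetical percentage-rollout flag: hash (flag, user) into a stable
# bucket, so bumping rollout_percent is strictly additive per user.
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_percent

print(flag_enabled("new-glue-writer", "user-42", 100))  # True
```

Hashing the flag name in with the user ID keeps buckets independent across flags, so one rollout doesn't always hit the same cohort first.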

Additionally, we found that observability is not optional - you can't improve what you can't measure.

The end result was 80% reduction in security vulnerabilities.

For context, we're using Istio, Linkerd, and Envoy.

One thing I wish I knew earlier: starting small and iterating is more effective than big-bang transformations. Would have saved us a lot of time.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.]]></content:encoded>
						                            <category domain="https://opsx.team/community/clouds/">Clouds - AWS, Azure, GCP</category>                        <dc:creator>Jose Williams</dc:creator>
                        <guid isPermaLink="true">https://opsx.team/community/clouds/follow-up-data-lake-architecture-on-aws-s3-glue-and-athena-284/</guid>
                    </item>
				                    <item>
                        <title>Follow-up: Using ChatGPT and Copilot for DevOps automation</title>
                        <link>https://opsx.team/community/clouds/follow-up-using-chatgpt-and-copilot-for-devops-automation-224/</link>
                        <pubDate>Tue, 11 Mar 2025 19:21:13 +0000</pubDate>
                        <description><![CDATA[Great post! We&#039;ve been doing this for about 18 months now and the results have been impressive. Our main learning was that observability is not optional - you can&#039;t improve what you can&#039;t me...]]></description>
                        <content:encoded><![CDATA[Great post! We've been doing this for about 18 months now and the results have been impressive. Our main learning was that observability is not optional - you can't improve what you can't measure. We also discovered several hidden dependencies during the migration. For anyone starting out, I'd recommend automated rollback based on error rate thresholds.
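
The error-rate-threshold rollback amounts to a sliding-window check over recent requests; here's a minimal sketch (the threshold and window size are illustrative, not our tuned values):

```python
# Sketch of error-rate-triggered rollback: fire when the error rate over a
# sliding window of recent requests crosses a threshold.
from collections import deque

class RollbackTrigger:
    def __init__(self, threshold: float = 0.05, window: int = 100):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # True = request errored

    def record(self, errored: bool) -> bool:
        """Record one request outcome; return True when rollback should fire."""
        self.samples.append(errored)
        return sum(self.samples) / len(self.samples) > self.threshold

trigger = RollbackTrigger(threshold=0.05, window=10)
fired = [trigger.record(e) for e in [False] * 9 + [True]]
print(fired[-1])  # True: 1 error in the last 10 requests is 10% > 5%
```

In practice we feed this from metrics rather than per-request hooks, but the window-plus-threshold shape is the same.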

For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.

I'd recommend checking out relevant blog posts for more details.

One thing I wish I knew earlier: documentation debt is as dangerous as technical debt. Would have saved us a lot of time.

One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.

The end result was 40% cost savings on infrastructure.]]></content:encoded>
						                            <category domain="https://opsx.team/community/clouds/">Clouds - AWS, Azure, GCP</category>                        <dc:creator>Mark Murphy</dc:creator>
                        <guid isPermaLink="true">https://opsx.team/community/clouds/follow-up-using-chatgpt-and-copilot-for-devops-automation-224/</guid>
                    </item>
							        </channel>
        </rss>
		