Part 2: SOC 2 compliance for cloud-native applications

15 Posts · 14 Users · 0 Reactions · 375 Views
(@elizabeth.perez157)
Topic starter

Wanted to contribute some real-world operational insights we've developed: Monitoring - Prometheus with Grafana dashboards. Alerting - PagerDuty with intelligent routing. Documentation - Confluence with templates. Training - pairing sessions. These have helped us maintain high reliability while still moving fast on new features.
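
Since a few people asked offline what the Prometheus side looks like at the application level, here's a minimal sketch of the kind of instrumentation we scrape into Grafana. This isn't our production code - the metric names and port are illustrative, and it assumes the prometheus_client Python package:

    # Minimal service instrumentation scraped by Prometheus and graphed in Grafana.
    # Metric names and the port are illustrative, not our production values.
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter(
        "app_requests_total", "Total requests handled", ["endpoint", "status"]
    )
    LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

    def handle_request(endpoint: str) -> None:
        with LATENCY.time():                       # observe wall-clock latency
            time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        REQUESTS.labels(endpoint=endpoint, status="200").inc()

    if __name__ == "__main__":
        start_http_server(8000)  # exposes /metrics for the Prometheus scraper
        while True:
            handle_request("/checkout")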

The end result was 40% cost savings on infrastructure.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

For context, we're using Datadog, PagerDuty, and Slack on the operations side.

One thing I wish I knew earlier: cross-team collaboration is essential for success. Would have saved us a lot of time.

For secrets management, we're using Vault, AWS KMS, and SOPS.
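
To make the secrets side concrete, here's a rough sketch of the direct-encrypt pattern we use for small secrets with AWS KMS through boto3. The key alias is made up for the example, and KMS direct encryption only fits payloads up to 4 KB - treat it as an illustration, not our actual code:

    # Sketch: encrypt a small secret with AWS KMS via boto3.
    # The key alias is illustrative; direct KMS encrypt caps out at 4 KB.
    import boto3

    kms = boto3.client("kms")

    def encrypt_secret(plaintext: bytes) -> bytes:
        resp = kms.encrypt(
            KeyId="alias/app-secrets",  # hypothetical key alias
            Plaintext=plaintext,
        )
        return resp["CiphertextBlob"]

    def decrypt_secret(ciphertext: bytes) -> bytes:
        # KMS infers the key from metadata embedded in the ciphertext.
        return kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]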

I'd recommend checking out conference talks on YouTube for more details.



 
Posted : 29/11/2025 12:21 pm
(@james.bennett725)

Our take on this was slightly different: we built ours around Jenkins, GitHub Actions, and Docker. The main reason was that automation should augment human decision-making, not replace it entirely. That said, I can see how your method would be better for legacy environments. Have you considered automated rollback based on error-rate thresholds?
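
To illustrate the question rather than prescribe a design, here's a minimal sketch: poll Prometheus for the recent error rate and run kubectl rollout undo when it crosses a threshold. The Prometheus address, query, deployment name, and threshold are all made up for the example:

    # Sketch: automated rollback when the 5-minute error rate crosses a threshold.
    # Prometheus URL, query, deployment name, and threshold are illustrative.
    import subprocess

    import requests

    PROM_URL = "http://prometheus:9090/api/v1/query"  # hypothetical address
    QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
    THRESHOLD = 0.05  # roll back above 5% errors

    def error_rate() -> float:
        resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0

    if error_rate() > THRESHOLD:
        # Revert the deployment to its previous ReplicaSet revision.
        subprocess.run(
            ["kubectl", "rollout", "undo", "deployment/payments-api"],
            check=True,
        )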

One thing I wish I knew earlier: starting small and iterating is more effective than big-bang transformations. Would have saved us a lot of time.

The end result was 80% reduction in security vulnerabilities.


 
Posted : 01/12/2025 7:49 am
(@joyce.hughes421)

Here's our full story. We started about 15 months ago with a small pilot. Initial challenges included performance issues. The breakthrough came when we improved observability. Key metrics improved: 99.9% availability, up from 99.5%. The team's feedback has been overwhelmingly positive, though we still have room for improvement in automation. Lessons learned: automate everything. Next steps for us: expand to more teams.

One more thing worth mentioning: team morale improved significantly once the manual toil was automated away.


 
Posted : 02/12/2025 11:48 am
(@linda.foster79)

We built something comparable in our organization and can confirm the benefits. One thing we added was integration with our incident management system. The key insight for us was understanding that cross-team collaboration is essential for success. We also discovered several hidden dependencies during the migration. Happy to share more details if anyone is interested.
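
For anyone curious what that incident-management hook looks like, here's a rough sketch against the PagerDuty Events API v2. The routing key and payload fields are placeholders - our real integration attaches a lot more context:

    # Sketch: open a PagerDuty incident via the Events API v2.
    # The routing key and payload fields are placeholders.
    import requests

    PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

    def trigger_incident(summary: str, severity: str = "error") -> None:
        event = {
            "routing_key": "YOUR_INTEGRATION_ROUTING_KEY",  # placeholder
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": "compliance-pipeline",  # illustrative source name
                "severity": severity,             # critical|error|warning|info
            },
        }
        resp = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
        resp.raise_for_status()

    trigger_incident("Control check failed: S3 bucket missing encryption")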

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 03/12/2025 11:28 am
(@christine.carter463)

From an operations perspective, here's what we recommend based on what we've developed: Monitoring - CloudWatch with custom metrics. Alerting - custom Slack integration. Documentation - Notion for team wikis. Training - pairing sessions. These have helped us keep deployments fast and reliable while still moving quickly on new features.
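
Here's roughly what the CloudWatch and Slack pieces look like in practice - a minimal sketch assuming boto3 and an incoming Slack webhook, with the namespace, metric name, threshold, and webhook URL as placeholders:

    # Sketch: publish a custom CloudWatch metric, then notify Slack on a bad value.
    # Namespace, metric name, threshold, and webhook URL are placeholders.
    import boto3
    import requests

    cloudwatch = boto3.client("cloudwatch")
    SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def report_deploy_duration(seconds: float) -> None:
        cloudwatch.put_metric_data(
            Namespace="Platform/Deployments",  # illustrative namespace
            MetricData=[{
                "MetricName": "DeployDurationSeconds",
                "Value": seconds,
                "Unit": "Seconds",
            }],
        )
        if seconds > 600:  # arbitrary example threshold
            requests.post(
                SLACK_WEBHOOK,
                json={"text": f"Slow deploy: {seconds:.0f}s"},
                timeout=10,
            )

    report_deploy_duration(742.0)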

For context, we're using Jenkins, GitHub Actions, and Docker.

The end result was 50% reduction in deployment time.

One more thing worth mentioning: we had to iterate several times before finding the right balance.


 
Posted : 03/12/2025 7:19 pm
(@william.harris811)

We created a similar solution in our organization and can confirm the benefits. One thing we added was chaos engineering tests in staging. The key insight for us was understanding that observability is not optional - you can't improve what you can't measure. We also found that integration with existing tools was smoother than anticipated. Happy to share more details if anyone is interested.
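
People often ask what "chaos engineering tests in staging" means day to day, so here's a stripped-down sketch of the simplest experiment we run: kill one random pod and let the monitors prove recovery. It assumes the official kubernetes Python client; the namespace and label selector are illustrative:

    # Sketch: the simplest staging chaos experiment - delete one random pod
    # and rely on the Deployment controller plus our alerts to prove recovery.
    # Namespace and label selector are illustrative.
    import random

    from kubernetes import client, config

    config.load_kube_config()  # staging kubeconfig context
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod(
        namespace="staging",
        label_selector="app=payments-api",  # hypothetical label
    ).items

    victim = random.choice(pods)
    v1.delete_namespaced_pod(victim.metadata.name, namespace="staging")
    print(f"Killed {victim.metadata.name}; watching dashboards for recovery.")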

Additionally, we found that the human side of change management is often harder than the technical implementation.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


For context, we're using Jenkins, GitHub Actions, and Docker.

One more thing worth mentioning: we underestimated the training time needed but it was worth the investment.

Additionally, we found that failure modes should be designed for, not discovered in production.


 
Posted : 03/12/2025 8:50 pm
(@william.smith189)

Here's how our journey unfolded. We started about 16 months ago with a small pilot. Initial challenges included performance issues. The breakthrough came when we automated the testing. Key metrics improved: 80% reduction in security vulnerabilities. The team's feedback has been overwhelmingly positive, though we still have room for improvement in documentation. Lessons learned: start simple. Next steps for us: improve documentation.

The end result was 60% improvement in developer productivity.


 
Posted : 05/12/2025 7:34 am
 Paul
(@paul)

Chiming in with some operational experience: Monitoring - CloudWatch with custom metrics. Alerting - custom Slack integration. Documentation - Confluence with templates. Training - pairing sessions. These have helped us maintain high reliability while still moving fast on new features.
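
Building on the custom-metric example earlier in the thread, here's a sketch of wiring an alarm to that kind of metric with boto3. The metric names, threshold, and SNS topic ARN are placeholders:

    # Sketch: alarm on a custom CloudWatch metric and fan out through SNS.
    # Metric names, threshold, and the SNS topic ARN are placeholders.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="slow-deploys",
        Namespace="Platform/Deployments",   # matches the custom metric above
        MetricName="DeployDurationSeconds",
        Statistic="Average",
        Period=300,                         # evaluate 5-minute windows
        EvaluationPeriods=2,                # two bad windows before alarming
        Threshold=600.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
    )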

Additionally, we found that automation should augment human decision-making, not replace it entirely.

The end result was 40% cost savings on infrastructure.

Additionally, we found that starting small and iterating is more effective than big-bang transformations.


 
Posted : 05/12/2025 11:52 pm
(@katherine.edwards302)

Here's our end-to-end experience with this. We started about 16 months ago with a small pilot. Initial challenges included tool integration. The breakthrough came when we automated the testing. Key metrics improved: 90% decrease in manual toil. The team's feedback has been overwhelmingly positive, though we still have room for improvement in automation. Lessons learned: communicate often. Next steps for us: add more automation.

I'd recommend checking out relevant blog posts for more details.

The end result was 70% reduction in incident MTTR.


 
Posted : 07/12/2025 9:11 am
(@patricia.morgan347)

Practical advice from our team: 1) Test in production-like environments 2) Implement circuit breakers (see the sketch below) 3) Share knowledge across teams 4) Keep it simple. Common mistakes to avoid: not measuring outcomes. Resources that helped us: Team Topologies. The most important thing is collaboration over tools.
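
On point 2, here's a minimal circuit-breaker sketch just to show the shape of the idea. The thresholds and cooldown are illustrative, and a real implementation (or a library) handles half-open probing more carefully:

    # Sketch: a minimal circuit breaker. After too many consecutive failures,
    # calls fail fast until a cooldown elapses. Thresholds are illustrative.
    import time

    class CircuitBreaker:
        def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = 0.0

        def call(self, fn, *args, **kwargs):
            if self.failures >= self.max_failures:
                if time.monotonic() - self.opened_at < self.reset_after:
                    raise RuntimeError("circuit open: failing fast")
                self.failures = 0  # cooldown elapsed: allow a probe call
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0  # success closes the circuit
            return result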

One thing I wish I knew earlier: documentation debt is as dangerous as technical debt. Would have saved us a lot of time.

Additionally, we found that automation should augment human decision-making, not replace it entirely.


 
Posted : 08/12/2025 8:46 pm
(@james.allen159)

Great post! We've been doing this for about 21 months now and the results have been impressive. Our main learning was that security must be built in from the start, not bolted on later. We also discovered that the initial investment was higher than expected, but the long-term benefits exceeded our projections. For anyone starting out, I'd recommend integrating with your incident management system early.

For context, we're using Terraform, AWS CDK, and CloudFormation.
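
Since CDK came up, here's a sketch of the kind of compliance-minded defaults we bake into constructs - encryption, versioning, and public access blocked by default. It assumes CDK v2 in Python, and the stack and bucket names are illustrative:

    # Sketch: CDK v2 (Python) stack with SOC 2-friendly S3 defaults -
    # KMS-managed encryption, versioning, and all public access blocked.
    # Stack and construct names are illustrative.
    from aws_cdk import App, Stack
    from aws_cdk import aws_s3 as s3
    from constructs import Construct

    class AuditLogStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs):
            super().__init__(scope, construct_id, **kwargs)
            s3.Bucket(
                self,
                "AuditLogs",
                encryption=s3.BucketEncryption.KMS_MANAGED,
                versioned=True,
                block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
                enforce_ssl=True,  # deny non-TLS access via bucket policy
            )

    app = App()
    AuditLogStack(app, "audit-log-stack")
    app.synth()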

The end result was 70% reduction in incident MTTR.


 
Posted : 09/12/2025 1:16 am
(@maria.james115)

This is almost identical to what we faced. The problem: scaling issues. Our initial approach was ad-hoc monitoring, but that didn't work because it lacked visibility. What actually worked: cost allocation tagging for accurate showback. The key insight was that automation should augment human decision-making, not replace it entirely. Now we're able to deploy with confidence.
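
To make the tagging piece concrete, here's a sketch that stamps cost-allocation tags in bulk through the AWS Resource Groups Tagging API via boto3. The ARNs and tag values are placeholders, and activating the tags for cost allocation is a separate step in the Billing console:

    # Sketch: apply cost-allocation tags in bulk via the
    # Resource Groups Tagging API. ARNs and tag values are placeholders.
    import boto3

    tagging = boto3.client("resourcegroupstaggingapi")

    tagging.tag_resources(
        ResourceARNList=[
            "arn:aws:s3:::example-audit-logs",  # placeholder ARNs
            "arn:aws:lambda:us-east-1:123456789012:function:example-fn",
        ],
        Tags={
            "CostCenter": "platform",
            "Team": "payments",
            "Environment": "production",
        },
    )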

The end result was 60% improvement in developer productivity.

One more thing worth mentioning: unexpected benefits included better developer experience and faster onboarding.


 
Posted : 09/12/2025 6:36 pm
(@maria.james115)

Spot on! From what we've seen, the most important factor was that starting small and iterating is more effective than big-bang transformations. We initially struggled with legacy integration but found that drift detection with automated remediation worked well. The ROI has been significant - we've seen a 3x improvement.
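
For the drift-detection piece, the core isn't exotic: terraform plan already reports drift through its exit code. Here's a sketch of the wrapper idea - the working directory is a placeholder, and whether you auto-apply or page a human is a policy choice:

    # Sketch: drift detection with `terraform plan -detailed-exitcode`.
    # Exit code 0 = no changes, 1 = error, 2 = drift detected.
    # Working directory and the remediation policy are placeholders.
    import subprocess

    def check_drift(workdir: str = "infra/") -> None:
        result = subprocess.run(
            ["terraform", "plan", "-detailed-exitcode", "-input=false"],
            cwd=workdir,
            capture_output=True,
            text=True,
        )
        if result.returncode == 2:
            print("Drift detected; remediating.")
            # Policy decision: auto-apply, or page a human instead.
            subprocess.run(
                ["terraform", "apply", "-auto-approve"], cwd=workdir, check=True
            )
        elif result.returncode == 1:
            raise RuntimeError(f"terraform plan failed: {result.stderr}")

    check_drift()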

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

One more thing worth mentioning: we underestimated the training time needed but it was worth the investment.


 
Posted : 11/12/2025 4:44 pm
(@evelyn.williams270)

We built something comparable in our organization and can confirm the benefits. One thing we added was chaos engineering tests in staging. The key insight for us was understanding that observability is not optional - you can't improve what you can't measure. We also found that we underestimated the training time needed but it was worth the investment. Happy to share more details if anyone is interested.

Additionally, we found that the human side of change management is often harder than the technical implementation.


 
Posted : 12/12/2025 7:26 am
(@mark.murphy761)

What a comprehensive overview! I have a few questions: 1) How did you handle testing? 2) What was your approach to rollback? 3) Did you encounter any issues with availability? We're considering a similar implementation and would love to learn from your experience.

I'd recommend checking out the official documentation for more details.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.


 
Posted : 13/12/2025 9:47 am