Deep dive: Setting up a multi-region disaster recovery strategy on AWS

22 Posts
18 Users
0 Reactions
37 Views
(@mark.perez536)
Topic starter

On the technical front, three aspects deserved the most attention: data residency, monitoring coverage, and performance tuning. We spent significant time on testing, and it paid off: load testing showed a 2x improvement. Code samples are available on our GitHub if anyone wants to take a look.
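Since the code samples live on GitHub rather than in the post, here's a minimal sketch (mine, not the original poster's - region names and the priority order are illustrative assumptions) of the core failover decision in a multi-region DR setup: prefer the primary region while its health checks pass, otherwise fall back to the first healthy standby. In a real AWS deployment this decision would typically live in Route 53 failover routing backed by health checks rather than in application code.

```python
# Hypothetical multi-region failover selection.
# Region names and health-check results below are illustrative only.

def pick_active_region(regions, health):
    """Return the first healthy region in priority order.

    regions: list of region names, primary first.
    health: dict mapping region name -> bool (health check passing).
    Raises RuntimeError if no region is healthy.
    """
    for region in regions:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available - page the on-call")

# Priority order: primary first, then standbys.
REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]

# Normal operation: primary is healthy, so traffic stays there.
print(pick_active_region(REGIONS, {"us-east-1": True, "eu-west-1": True}))
# Primary down: fail over to the first healthy standby.
print(pick_active_region(REGIONS, {"us-east-1": False, "eu-west-1": True}))
```

The priority-ordered list keeps failback deterministic: once the primary's health checks recover, it wins again automatically.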

A few more takeaways: security must be built in from the start, not bolted on later, and we had to iterate several times before finding the right balance. Unexpected benefits included better developer experience and faster onboarding.

The end results: availability rose from 99.5% to 99.9%, and incident MTTR dropped by 70%.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 21/04/2025 10:21 pm
(@kimberly.james491)
Same experience on our end! Here's how it broke down for us: Phase 1 (2 weeks) covered stakeholder alignment, Phase 2 (1 month) the pilot implementation, and Phase 3 (2 weeks) knowledge sharing. Total investment was $200K, but the payback period was only 3 months. Key success factors: automation, documentation, and feedback loops. If I could do it again, I would set clearer success metrics up front.
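For the numbers quoted above, the payback arithmetic is worth spelling out: a $200K investment recovered in 3 months implies roughly $67K/month in savings. A quick sketch using only the figures from the post:

```python
# Payback-period arithmetic for the figures quoted above.
investment = 200_000   # total investment, USD
payback_months = 3     # months to break even

monthly_savings = investment / payback_months
annual_savings = monthly_savings * 12

print(f"implied monthly savings: ${monthly_savings:,.0f}")  # ~$66,667
print(f"implied annual savings:  ${annual_savings:,.0f}")   # ~$800,000
```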

One more thing worth mentioning: we discovered several hidden dependencies during the migration, and we had to iterate several times before finding the right balance.

For context, we're using Istio, Linkerd, and Envoy.

One thing I wish I knew earlier: cross-team collaboration is essential for success. Would have saved us a lot of time.

The end result was an 80% reduction in security vulnerabilities.


 
Posted : 23/04/2025 4:30 pm
(@laura.rivera601)
A few operational practices we've developed, to add to the above: Monitoring - Prometheus with Grafana dashboards. Alerting - custom Slack integration. Documentation - GitBook for public docs. Training - certification programs. These have helped us keep deployments reliable while still moving fast on new features.

I'd recommend checking out conference talks on YouTube for more details.

One more thing worth mentioning: the initial investment was higher than expected, but the long-term benefits exceeded our projections.


 
Posted : 25/04/2025 1:44 am
(@linda.foster79)
Great approach! We built something similar in our organization and can confirm the benefits. One thing we added was feature flags for gradual rollouts. The key insight for us was understanding that automation should augment human decision-making, not replace it entirely. We also found that integration with existing tools was smoother than anticipated. Happy to share more details if anyone is interested.
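The feature-flag gradual rollout mentioned above can be sketched with deterministic hashing: each user gets a stable yes/no answer for a given flag, so ramping the percentage up only ever adds users and never flip-flops anyone. This is a minimal illustration (flag and user names are made up), not tied to any particular flag service:

```python
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically bucket a user into a 0-99 slot per flag.

    The same user always lands in the same bucket for a given flag,
    so raising `percent` only ever adds users, never removes them.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Ramp from 0% (nobody) to 100% (everybody); membership is monotonic.
user = "user-42"
enabled_at = [p for p in range(0, 101) if in_rollout(user, "new-checkout", p)]
print(f"{user} joins the rollout at {enabled_at[0]}%")
```

Hashing on `flag:user` (rather than just the user) keeps rollout populations independent across flags, so the same early cohort isn't always the guinea pig.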

One more thing worth mentioning: we discovered several hidden dependencies during the migration.


 
Posted : 25/04/2025 11:51 am
(@donald.price627)
Just dealt with this! Symptoms: frequent timeouts. Root cause analysis initially pointed to memory leaks, but the fix turned out to be correcting our routing rules. Prevention: better monitoring. Total time to resolve was 30 minutes, and we now have runbooks and monitoring to catch this class of issue early.
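The "better monitoring to catch this early" piece can be as simple as alerting when the timeout rate over a sliding window of recent requests crosses a threshold. A minimal, dependency-free sketch (window size and threshold are made-up values, not the poster's actual config):

```python
from collections import deque

class TimeoutRateMonitor:
    """Alert when the fraction of timed-out requests among the last
    `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.window = window
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def record(self, timed_out: bool) -> bool:
        """Record one request outcome; return True if we should alert."""
        self.samples.append(timed_out)
        # Only evaluate once the window is full, to avoid noisy startup alerts.
        if len(self.samples) < self.window:
            return False
        rate = sum(self.samples) / len(self.samples)
        return rate > self.threshold

monitor = TimeoutRateMonitor(window=10, threshold=0.2)
outcomes = [False] * 7 + [True] * 3   # 30% timeouts in the last 10 requests
alerts = [monitor.record(t) for t in outcomes]
print(f"alert fired: {alerts[-1]}")   # True, since 0.3 > 0.2
```

In practice this logic usually lives in Prometheus/CloudWatch alert rules rather than application code, but the windowed-rate idea is the same.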

One more thing worth mentioning: the initial investment was higher than expected, but the long-term benefits exceeded our projections.

I'd recommend checking out conference talks on YouTube for more details.

For context, we're using Grafana, Loki, and Tempo.


 
Posted : 25/04/2025 1:33 pm
(@nicholas.morgan692)
From the ops trenches, here's the setup we've developed: Monitoring - CloudWatch with custom metrics. Alerting - Opsgenie with escalation policies. Documentation - Confluence with templates. Training - certification programs. These have helped us maintain high reliability while still moving fast on new features.

I'd recommend checking out the community forums for more details.

For context, we're using Kubernetes, Helm, ArgoCD, and Prometheus.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 27/04/2025 8:38 am
(@joan.hill519)
Timely post! We're actively evaluating this approach. Could you elaborate on success metrics? Specifically, I'm curious about stakeholder communication. Also, how long did the initial implementation take? Any gotchas we should watch out for?

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

The end result was 90% decrease in manual toil.

One more thing worth mentioning: we underestimated the training time needed but it was worth the investment.


 
Posted : 28/04/2025 1:20 pm
(@jeffrey.alvarez11)
Valuable insights! I'd also consider maintenance burden. We saw its importance firsthand: team morale improved significantly once the manual toil was automated away. Now we always make sure to document everything in runbooks. It adds maybe a few hours to our process but prevents a lot of headaches down the line.

I'd recommend checking out the community forums and conference talks on YouTube for more details.


 
Posted : 29/04/2025 3:02 am
(@michelle.ross286)
Great post! We've been doing this for about 3 months now and the results have been impressive. Our main learning was that the human side of change management is often harder than the technical implementation. We also discovered several hidden dependencies during the migration. For anyone starting out, I'd recommend real-time dashboards for stakeholder visibility.

For context, we're using Jenkins, GitHub Actions, and Docker.

Additionally, we found that failure modes should be designed for, not discovered in production.


 
Posted : 30/04/2025 2:04 pm
(@katherine.nelson24)
Here's what worked well for us: 1) Automate everything possible 2) Implement circuit breakers 3) Review and iterate 4) Keep it simple. Common mistakes to avoid: not measuring outcomes. Resources that helped us: Team Topologies. The most important thing is learning over blame.
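Point 2 above ("implement circuit breakers") deserves a sketch. The idea: after N consecutive failures, stop calling the downstream dependency for a cool-down period instead of piling on. This is a minimal, dependency-free illustration with arbitrary thresholds; production libraries add half-open probing, per-endpoint state, and metrics.

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive failures;
    fail fast until `reset_after` seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open - failing fast")
            # Cool-down elapsed: close the circuit and try again.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the failure streak
        return result
```

Failing fast like this is what keeps a slow dependency from exhausting your own thread pools and cascading the outage upstream.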

Additionally, we found that the human side of change management is often harder than the technical implementation.

For context, we're using Datadog, PagerDuty, and Slack.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 01/05/2025 1:46 am
(@jeffrey.alvarez11)
What a comprehensive overview! I have a few questions: 1) How did you handle testing? 2) What was your approach to canary releases? 3) Did you encounter any availability issues? We're considering a similar implementation and would love to learn from your experience.

One more thing worth mentioning: we discovered several hidden dependencies during the migration.

For context, we're using Istio, Linkerd, and Envoy.

The end result was 3x increase in deployment frequency.

The end result was 90% decrease in manual toil.


 
Posted : 02/05/2025 5:00 pm
(@thomas.robinson721)
We created a similar solution in our organization and can confirm the benefits. One thing we added was compliance scanning in the CI pipeline. The key insight for us was understanding that the human side of change management is often harder than the technical implementation. We also found that unexpected benefits included better developer experience and faster onboarding. Happy to share more details if anyone is interested.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.

Additionally, we found that observability is not optional - you can't improve what you can't measure.

I'd recommend checking out relevant blog posts and the community forums for more details.

For context, we're using Terraform, AWS CDK, and CloudFormation.


 
Posted : 03/05/2025 10:29 pm
(@stephanie.howard98)
Our recommended approach: 1) Automate everything possible 2) Implement circuit breakers 3) Review and iterate 4) Build for failure. Common mistakes to avoid: ignoring security. Resources that helped us: Phoenix Project. The most important thing is consistency over perfection.

A few things I wish I knew earlier: automation should augment human decision-making, not replace it entirely, and security must be built in from the start, not bolted on later.

One more thing worth mentioning: the initial investment was higher than expected, but the long-term benefits exceeded our projections, including unexpected wins like better developer experience and faster onboarding. The hardest part was getting buy-in from stakeholders outside engineering.

Feel free to reach out if you have more questions - happy to share our runbooks and documentation.


 
Posted : 05/05/2025 11:27 am
(@jerry.green681)
Chiming in with the operational setup we've developed: Monitoring - CloudWatch with custom metrics. Alerting - custom Slack integration. Documentation - Notion for team wikis. Training - pairing sessions. These have helped us keep incident counts low while still moving fast on new features.

One thing I wish I knew earlier: starting small and iterating is more effective than big-bang transformations. Would have saved us a lot of time.

For context, we're using Jenkins, GitHub Actions, and Docker.


 
Posted : 06/05/2025 11:47 am
(@samantha.brown47)
This helps! Our team is evaluating this approach. Could you elaborate on the migration process? Specifically, I'm curious about team training approach. Also, how long did the initial implementation take? Any gotchas we should watch out for?

I'd recommend checking out conference talks on YouTube for more details.

Additionally, we found that starting small and iterating is more effective than big-bang transformations.

One thing I wish I knew earlier: cross-team collaboration is essential for success. Would have saved us a lot of time.


 
Posted : 08/05/2025 2:23 am
Page 1 / 2