
13 responses to “Some Lessons From AWS Outage”

  1. Roman Stanek

    I disagree. We followed the best practices from AWS and used multiple availability zones, with apparently no single point of failure, as hot spares (what you would call multiple data centers). But we had no visibility into the internal design of AWS, so we did not expect that all availability zones in one region would go down at the same time.

    Amazon has a concept of regions, but there are no tools to manage an app across multiple regions, the connectivity goes over the public web, and so on. So once we decide to use regions as the DR approach (and we probably have to), we may as well look at other cloud providers, because we will have to maintain multiple very separate instances anyway.

    I believe that the key to this is more transparency and communications from AWS. More here: http://bit.ly/aws-outage
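
    Roman’s point about maintaining “multiple very separate instances” per region can be made concrete. Below is a minimal sketch, using today’s boto3 with placeholder AMI IDs and a placeholder instance type, of standing the same stack up independently in two regions; AMIs are region-scoped, so each region needs its own copy:

    ```python
    import boto3

    # Hypothetical region-to-AMI mapping; the AMI IDs are placeholders.
    REGIONS = {
        "us-east-1": "ami-PRIMARY",
        "us-west-1": "ami-STANDBY",
    }

    for region, ami in REGIONS.items():
        # Each region gets its own, fully independent deployment.
        ec2 = boto3.client("ec2", region_name=region)
        ec2.run_instances(
            ImageId=ami,
            InstanceType="m1.small",  # placeholder instance type
            MinCount=1,
            MaxCount=1,
        )
    ```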

  2. Mark S. Rasmussen

    Well, the thing is, according to the AWS docs, the AZs should be completely separate and should not have any SPOFs in common. Given that, a multi-AZ deployment should theoretically be just as good as a multi-region deployment. But since we know nothing about the internal architecture of the AWS setup, we cannot be confident that a multi-region failure will not occur, given that a multi-AZ failure did occur even though it shouldn’t have.

    I totally agree that a DR strategy involving multi-region deployment is a sound plan, provided the expense is justifiable, but that does not change the fact that the current multi-AZ failure should not have occurred in the first place.

    1. till

      I very much agree with this comment.

      Busting people’s chops over this outage is pretty unreasonable, given that most of them closely followed not just AWS’s advice but pretty much everyone’s advice about multi-AZ deployments. Of course, there is always room for improvement, more drastic measures to take, and more complex setups to run.

      AZs are supposed to be in different physical locations, after all. I can’t recall that kind of simultaneous failure across four data centers anywhere else.

  3. Roman Stanek

    Mark:

    In theory, a multi-AZ deployment should be better than a multi-region deployment. AWS is not ready for multi-region deployment: some services, such as VPC, are available in only one region; tools like Elastic Load Balancing support multi-AZ deployment only (see the text below); and the additional complexity of dealing with multiple regions may generate more problems and downtime. And, as you pointed out, how do we know there is no SPOF between regions?

    Elastic Load Balancing

    You can build fault tolerant applications by placing your Amazon EC2 instances in multiple Availability Zones. To achieve even more fault tolerance with less manual intervention, you can use Elastic Load Balancing. You get improved fault tolerance by placing your compute instances behind an Elastic Load Balancer, as it can automatically balance traffic across multiple instances and multiple Availability Zones and ensure that only healthy Amazon EC2 instances receive traffic.
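
    The Elastic Load Balancing behavior quoted above maps to a handful of API calls. A minimal sketch follows, using boto3 against the classic ELB API, with placeholder names, zones, and instance IDs; note that the AvailabilityZones list is confined to a single region, which is exactly the limitation Roman describes:

    ```python
    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    # A classic load balancer spanning two Availability Zones (one region only).
    elb.create_load_balancer(
        LoadBalancerName="web-lb",
        Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80, "InstancePort": 80}],
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )

    # Health check so that only healthy instances receive traffic.
    elb.configure_health_check(
        LoadBalancerName="web-lb",
        HealthCheck={
            "Target": "HTTP:80/health",
            "Interval": 30,
            "Timeout": 5,
            "UnhealthyThreshold": 2,
            "HealthyThreshold": 2,
        },
    )

    # Register instances that live in the two different AZs.
    elb.register_instances_with_load_balancer(
        LoadBalancerName="web-lb",
        Instances=[{"InstanceId": "i-PRIMARY"}, {"InstanceId": "i-SPARE"}],
    )
    ```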

  4. Monday’s Musings: Lessons Learned From Amazon’s Cloud Outage | Constellation Research

    [...] a real disaster recovery strategy. The Amazon outage exposed that many start ups failed to have a disaster recovery strategy.  A number of solution providers now provide cloud disaster recovery.  More importantly, these [...]

  5. Twilio’s Cloud Architecture Principles | ITPark

    [...] a Disaster Recovery plan is not optional even in the Cloud, and Architecture is and will remain essential for building [...]

  6. Trevor O Connell

    I think one of the biggest challenges facing any start-up is how to develop a sustainable DR strategy and plan that takes into consideration the start-up’s current financial standing.

    I also think many start-ups with good intentions discuss, explore, and document the DR strategies most appropriate to their business, but how many actually test or execute the DR strategy prior to a major outage?

    And it’s not good enough to just test the infrastructure, application, and data.

    You’ve got to literally pull the plug.

    While many cloud vendors will claim that they can offer full redundancy and reliability, it’s my opinion that all start-ups, where it is financially viable, should select two cloud providers, i.e. a primary and a stand-by (DR).
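
    Trevor’s “pull the plug” test can be scripted as a recurring drill. A minimal sketch, assuming boto3, a hypothetical primary instance ID, and a hypothetical standby health endpoint: stop the primary for real, then check that the standby picks up within an assumed recovery window:

    ```python
    import time
    import urllib.request

    import boto3

    PRIMARY_INSTANCE = "i-PRIMARY"                      # hypothetical instance ID
    STANDBY_URL = "https://standby.example.com/health"  # hypothetical endpoint

    def healthy(url, timeout=5):
        """Return True if the endpoint answers with HTTP 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    # "Pull the plug": actually stop the primary instead of assuming failover works.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.stop_instances(InstanceIds=[PRIMARY_INSTANCE])

    deadline = time.time() + 300  # assumed five-minute recovery objective
    while time.time() < deadline:
        if healthy(STANDBY_URL):
            print("DR drill passed: standby is serving traffic")
            break
        time.sleep(10)
    else:
        print("DR drill FAILED: standby never became healthy")
    ```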

  7. Spike Robinson

    For me, the additional up-front costs – duplication of design, test, administration, and maintenance effort – strongly argue against using a second provider, though of course that is the ideal from a redundancy point of view.

    What would be most useful at this point would be for Amazon to disclose the multi-region architecture, so we (customers) can make an independent assessment of whether multi-region offers a sufficiently decoupled risk of failure. It also wouldn’t hurt for them to support multi-region Load Balancing, preferably working hand in hand with multi-region Auto Scaling.
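
    Until Amazon offers that, multi-region traffic management has to be stitched together at the DNS layer. A minimal sketch, using boto3 and Route 53 with a placeholder hosted zone ID, record name, and ELB DNS name, of repointing a low-TTL CNAME at whichever region is currently primary:

    ```python
    import boto3

    r53 = boto3.client("route53")

    def point_at(elb_dns_name):
        """Repoint the app's CNAME at the chosen region's load balancer."""
        r53.change_resource_record_sets(
            HostedZoneId="ZONE-ID",  # placeholder hosted zone
            ChangeBatch={
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",  # placeholder record
                        "Type": "CNAME",
                        "TTL": 60,  # low TTL so a failover propagates quickly
                        "ResourceRecords": [{"Value": elb_dns_name}],
                    },
                }]
            },
        )

    # Fail over from the east-coast load balancer to the west-coast one.
    point_at("web-lb-west.us-west-1.elb.amazonaws.com")
    ```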

    If I were Amazon, that’s what I would be doing as an immediate customer-confidence fix. They can tell us they have fixed the SPOFs in the AZ EBS model (the “backplane”), and they can even disclose the (fixed) architecture, but there is still going to be a trust problem with the AZs and, by implication, even with multi-region DR.