Yesterday’s AWS outage has been buzzing around the tech blogosphere even after 24+ hours. As usual, naysayers of cloud are up in arms, not wanting to miss a golden opportunity to create FUD, and Amazon’s competitors are tapping into the misery to push their own services. Well, people are conditioned to accept this as a legitimate strategy in a free-market system. Without ranting any further or spending time assigning blame for how badly Amazon botched this up, I want to talk about some of the lessons we can learn from this outage.
Before we get to the lessons learned from this AWS debacle, I want to emphasize one difference between the cloud world and the traditional IT world. In the FUD and noise surrounding the outage, many miss this important advantage of the cloud-based world. In traditional IT, there are significant costs associated with any DR plan because you have to provision the additional servers (and datacenters) needed for recovery well in advance. This not only adds significantly to the capital expense, it also adds substantially to the operating expenses. Even if your IT is with a managed provider, you spend a lot of money reserving capacity for any possible DR needs. The advantage of a cloud-based environment is that, as long as you keep your data backup current in another location, the processing power can be switched on with a swipe of your credit card, with no need to provision ahead of time or wait long after the disaster. This is a very important advantage of the cloud-based world: when disaster strikes, you can recover with minimal monetary pinch (provided the DR plan is solid).
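To make that concrete, here is a minimal sketch, purely for illustration, of what “switching on the processing power” in another location might look like. It uses Python with the boto3 library; the region, AMI, and instance type are assumptions, and the only prerequisite is that an up-to-date machine image and data backup already exist in the standby region.

```python
import boto3

# Hypothetical values for illustration; use your own DR region and a
# pre-baked AMI that is kept current alongside your data backups.
DR_REGION = "us-west-1"
DR_AMI = "ami-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name=DR_REGION)

# Capacity is provisioned only when the primary site is declared down,
# so there is no standing cost for idle DR servers.
response = ec2.run_instances(
    ImageId=DR_AMI,
    InstanceType="m1.large",  # illustrative; pick whatever your workload needs
    MinCount=2,
    MaxCount=2,
)

for instance in response["Instances"]:
    print("Started DR instance:", instance["InstanceId"])
```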
Yesterday’s EC2 outage exposed how many startups are running without a proper DR strategy. It is a shame that some well-funded startups didn’t bother to plan for such eventualities. I guess this outage will teach those startups (and their investors) a good lesson and prepare them before the next disaster. There are many lessons we can learn from yesterday’s outage, but I want to highlight some key ones in this post. After all, CloudAve is one of the well-respected blogs on cloud computing, and we cannot shy away from a topic that has reached even the consumer media.
The following are the key lessons we should learn from the episode:
- Even though I don’t like the idea of coding for failure, just do it. When we shop at Walmart, we clearly understand that there is a compromise on quality in exchange for low prices. If we want to take advantage of public clouds built on commodity servers, there is no option but to code for failure (a minimal sketch follows this list).
- Now imagine me jumping up and down on stage like Steve Ballmer, shouting “DR, DR, DR, DR, ……….”. A proper DR strategy is key to any cloud plan. As I pointed out in the paragraph above, cloud computing offers some cost advantages when planning for disaster recovery. In spite of that advantage, we have seen many businesses get hit in the AWS outage. There are many reasons why this happened. The picture painted by cloud evangelists (including myself in the past) gave the impression that the cloud is fail-proof. The heavy emphasis on devs over ops bred a certain complacency; people started believing, almost religiously, that the cloud removes ops from the picture entirely and that everything works automagically. All this evangelism-driven dogma led people to not worry about DR at all. I am glad this failure will wake people up from any complacency.
- SLAs are important, but what matters is how you have negotiated the compensation. This is one of the reasons I promote federated clouds over consolidation. When there are only a handful of infrastructure players, they will not care about compensating for losses during an outage unless the customers are Fortune 500 companies. We need providers who differentiate their offerings on the basis of how they compensate, and for that to happen we need large-scale competition, not consolidation. Only federated clouds can ensure a marketplace where customers are not screwed by cloud downtimes.
- Keep geographical redundancy, and proximity to another cloud provider, as key mantras while planning your DR strategy (a sketch of keeping data current in another region also follows this list).
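On the “code for failure” point above, here is a minimal sketch of the idea: retry a flaky call with exponential backoff and degrade gracefully when it keeps failing. The function names and the optional-feature example are hypothetical.

```python
import random
import time

def call_with_retries(operation, attempts=5, base_delay=0.5):
    """Retry a flaky call with exponential backoff and a little jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Sleep 0.5s, 1s, 2s, ... plus jitter so retries don't stampede.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

def fetch_recommendations():
    # Stand-in for a call to a backend that may be down during an outage.
    raise IOError("backend unavailable")

try:
    recommendations = call_with_retries(fetch_recommendations)
except Exception:
    # Serve the page without the optional feature instead of failing outright.
    recommendations = []
```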
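And on the geographical-redundancy mantra, one way to keep data backups current in another location is to copy EBS snapshots across regions. This is only a sketch, assuming boto3 and the cross-region snapshot copy capability are available to you; the regions and volume ID are made up.

```python
import boto3

# Hypothetical identifiers; substitute your own regions and volume.
PRIMARY_REGION = "us-east-1"
DR_REGION = "eu-west-1"

primary = boto3.client("ec2", region_name=PRIMARY_REGION)
dr = boto3.client("ec2", region_name=DR_REGION)

# Snapshot an EBS volume in the primary region...
snap = primary.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="nightly DR snapshot",
)
primary.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# ...then copy it into the DR region, so the data is restorable even if
# the whole primary region becomes unreachable.
copy = dr.copy_snapshot(
    SourceRegion=PRIMARY_REGION,
    SourceSnapshotId=snap["SnapshotId"],
    Description="DR copy of nightly snapshot",
)
print("Cross-region copy started:", copy["SnapshotId"])
```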
Whether we like it or not, customers are just as responsible for outages as the cloud providers. Cloud is not a magic pill that solves the erection, sorry, scaling problems without any other worries. As with the pills that help with erection issues, there are side effects associated with the cloud that helps with rapid infrastructure scaling. It is important that customers understand the compromises they have to make while taking advantage of the benefits offered by cloud computing. Yesterday’s AWS outage is a good opportunity to take a step back and get realistic about our approach to cloud.
I disagree. We followed the best practices from AWS and used multiple availability zones, with apparently no single point of failure, as hot spares (what you would call multiple datacenters), but we had no visibility into the internal design of AWS, and so we did not expect that all availability zones in one region would go down at the same time.
Amazon has a concept of regions, but there are no tools to manage an app across multiple regions, the connectivity goes over the public web, and so on. So once we decide to use regions as the DR approach (and we probably have to), we may as well look at other cloud providers, because we will have to maintain multiple, very separate instances anyway.
I believe that the key to this is more transparency and communications from AWS. More here: http://bit.ly/aws-outage
I agree with you on the need for transparency. But in my post I am not talking about Amazon alone, and when I talk of redundancy, I mean different regions, not availability zones.
There is nothing in your post that suggests you were referring to multiple regions. Furthermore, even if it’s what you actually meant (it sounds more like spin), it’s no better a DR plan for this case than spanning multiple AZs. The failure is in Amazon’s EBS system architecture, which did not properly isolate AZs. It could have just as well failed to isolate regions. The only true way to protect against this kind of failure is to span multiple providers, not regions.
I guess you didn’t read my post. I never said Amazon is not at fault. As I clearly stated, I wanted to take the conversation beyond blaming Amazon and focus on some of the best practices.
Well, the thing is, according to the AWS docs, the AZs should be completely separate and should not have any SPOFs in common. Given that, a multi-AZ deployment should theoretically be just as good as a multi-region deployment. Since we know nothing about the internal architecture of the AWS setup, we cannot trust that a multi-region failure might not occur, given that a multi-AZ failure did occur, even though it shouldn’t.
I totally agree that having a DR strategy that involves multi-region deployment is a sound plan, given the expenses are justifiable, but it does not change the fact that the current multi-AZ failure should not have occurred in the first place.
Mark, I agree with you that the multi-AZ failure should not have occurred in the first place. There are no second opinions on that assertion. The intent of my post is to tell users to look beyond the claims and prepare for any eventuality.
I also agree with Roman’s point about transparency, not just about the outage but also about their architecture.
I very much agree with this comment.
Busting people’s chops over this outage is pretty unreasonable, given that most of them closely follow not just AWS’s advice but pretty much anyone’s advice about multi-AZ. Of course there is always room for improvement, more drastic measures to take, and more complex setups to run.
AZs are supposed to be different physical locations and all. I can’t recall that kind of simultaneous failure across four datacenters anywhere else.
Mark:
In theory, a multi-AZ deployment should be better than a multi-region deployment. AWS is not ready for multi-region deployments (some services, such as VPC, are only available in one region), tools like Elastic Load Balancing support multi-AZ deployments only (see the text below), and the additional complexity of dealing with multiple regions may generate more problems and more downtime. And, as you pointed out, how do we know there is no SPOF between regions?
Elastic Load Balancing
You can build fault tolerant applications by placing your Amazon EC2 instances in multiple Availability Zones. To achieve even more fault tolerance with less manual intervention, you can use Elastic Load Balancing. You get improved fault tolerance by placing your compute instances behind an Elastic Load Balancer, as it can automatically balance traffic across multiple instances and multiple Availability Zones and ensure that only healthy Amazon EC2 instances receive traffic.
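For illustration only, the multi-AZ pattern the quoted documentation describes looks roughly like this with boto3 and the classic ELB API; the load balancer name, zones, and instance IDs are made up.

```python
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Create a load balancer that spans several Availability Zones.
elb.create_load_balancer(
    LoadBalancerName="web-frontend",
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80, "InstancePort": 80}],
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)

# Register instances that live in different zones; traffic is only routed
# to instances that pass the load balancer's health checks.
elb.register_instances_with_load_balancer(
    LoadBalancerName="web-frontend",
    Instances=[{"InstanceId": "i-0aaa1111"}, {"InstanceId": "i-0bbb2222"}],
)
```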
I think one of the biggest challenges facing any start-up is how to develop a sustainable DR strategy and plan that takes into consideration the current financial standing of the start-up.
I also think many start-ups with good intentions discuss, explore, and document the DR strategies most appropriate to their business, but how many actually test or execute the DR strategy prior to a major outage?
And it’s not good enough to just test the infrastructure, application, and data. You’ve got to literally pull the plug.
While many cloud vendors will claim that they can offer full redundancy and reliability, it’s my opinion that all start-ups, where it is financially viable, should select two cloud providers, i.e. a primary and a stand-by (DR).
For me, the additional up-front costs (duplication of design, test, administration, and maintenance effort) strongly argue against using a second provider, though of course that is the ideal from a redundancy point of view.
What would be most useful at this point would be for Amazon to disclose the multi-region architecture, so we (the customers) can make an independent assessment of whether multi-region offers a sufficiently decoupled risk of failure. It also wouldn’t hurt for them to support multi-region capability for Load Balancing, preferably working hand in hand with multi-region Auto Scaling.
If I were Amazon, that’s what I would be doing as an immediate customer-confidence fix. They can tell us they have fixed the SPoFs in the AZ EBS model (the “backplane”), and they can even disclose the (fixed) architecture, but there is still going to be a trust problem for the AZs and, by implication, even for multi-region DR.