The AWS outage from last week brought the idea of “design for failure” into focus in many discussions around the cloud world. Looking back at the outage, it is pretty clear that only the apps designed for failure withstood it; the rest, especially those without even a DR strategy, went down. Some stayed down for more than 24 hours, underscoring the importance of the “design for failure” concept. In this blog post, let me highlight some key points we should take into consideration while designing apps for the cloud.
What is designing for failure?
The public cloud infrastructure, in general, and Amazon cloud, in particular, are built in such a way that the control of application availability is in the hands of the developers. Unlike traditional apps, which are entirely dependent on the availability of the underlying infrastructure, cloud applications can be designed to withstand even big infrastructure outages. This is both the strength of the cloud model and its “weakness” (before you attack me on this, please note that I have used it within quotes). Its strength comes from the fact that developers can finally break away from their dependence on infrastructure availability and can even approach 100% uptime for their applications. The very nature of cloud, its utility pricing and on-demand provisioning, makes this process affordable and seamless. Its “weakness” stems from the fact that applications not designed for failure will face an outage when the underlying cloud, or even the hardware running the virtual machines, goes down. Since most public cloud providers run their clouds on top of commodity hardware, such outages could come at an inopportune time.
George Reese, CTO of enStratus, offers some basic steps for designing applications for failure in this blog post.
- Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure
- Each application component must make no assumptions about the underlying infrastructure—it must be able to adapt to changes in the infrastructure without downtime
- Each application component should be partition tolerant—in other words, it should be able to survive network latency (or loss of communication) among the nodes that support that component
- Automation tools must be in place to orchestrate application responses to failures or other changes in the infrastructure
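The redundancy and automation principles above can be sketched in a few lines of Python. This is a minimal, hypothetical client that treats each redundant deployment (say, an Availability Zone) as an endpoint and fails over automatically when one is unreachable; the zone names and handler functions are illustrative stand-ins, not a real AWS API:

```python
class EndpointDown(Exception):
    """Raised when a simulated endpoint is unavailable."""


class FailoverClient:
    """Try redundant endpoints in order until one succeeds."""

    def __init__(self, endpoints):
        # endpoints: list of (name, handler) pairs, one per redundant deployment
        self.endpoints = list(endpoints)

    def call(self, request):
        errors = []
        for name, handler in self.endpoints:
            try:
                return handler(request)
            except EndpointDown as exc:
                errors.append((name, str(exc)))  # record the failure and move on
        raise RuntimeError(f"all endpoints failed: {errors}")


def down(_request):
    # Simulates a zone taken out by an infrastructure outage
    raise EndpointDown("us-east-1a is unreachable")


def up(request):
    return f"handled:{request}"


client = FailoverClient([("us-east-1a", down), ("us-west-1a", up)])
print(client.call("GET /status"))  # falls through to the healthy zone
```

The point of the sketch is that the failover logic lives in the application (or its automation tooling), not in the infrastructure: the caller never needs to know which zone actually served the request.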
Some key points to consider
Even though many blog posts have highlighted key points regarding the design for failure approach to building applications, I want to consolidate some key points here in this post based on the queries I fielded from my clients and other enterprise application developers in the past few days.
- Though I said it already in a post last week, I want to emphasize it again: if you are developing for the cloud, design your application architecture for failure. Don’t look for alternatives
- Split your applications into different components and, as George Reese mentioned in his post, make sure every component of your application has redundancy with no common points of failure
- If you are using AWS, distributing across multiple (3 or 4) Availability Zones (AZs) should work, but some developers have reported that it didn’t help them during this outage. Ideally, spreading your application across multiple AWS regions or, even better, multiple cloud providers will ensure its availability. But this approach can send costs out of control and adds network latency, so it is not feasible for every application
- If your datastore is NoSQL based, you can design for failure fairly easily using different AZs/Regions/Providers. If you use a relational database for your data consistency needs, replicating it across regions is much harder and you are out of luck for the most part
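To make the NoSQL point concrete, here is a toy sketch of quorum replication, the approach popularized by Dynamo-style stores: with N = 3 replicas, W = 2 write acknowledgements and R = 2 read responses (W + R > N), the store keeps serving reads even when one replica, say one AZ, goes dark. All class and zone names here are made up for illustration:

```python
class Replica:
    """One copy of the data, e.g. in one Availability Zone."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def put(self, key, value, version):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = (version, value)

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)


class QuorumStore:
    """N replicas; writes need W acks, reads need R responses (W + R > N)."""

    def __init__(self, replicas, w=2, r=2):
        self.replicas, self.w, self.r = replicas, w, r
        self.version = 0  # monotonically increasing write version

    def put(self, key, value):
        self.version += 1
        acks = 0
        for rep in self.replicas:
            try:
                rep.put(key, value, self.version)
                acks += 1
            except ConnectionError:
                pass  # a dead replica just misses this write
        if acks < self.w:
            raise RuntimeError("write quorum not reached")

    def get(self, key):
        answers = []
        for rep in self.replicas:
            try:
                found = rep.get(key)
                if found is not None:
                    answers.append(found)
            except ConnectionError:
                pass
        if len(answers) < self.r:
            raise RuntimeError("read quorum not reached")
        return max(answers)[1]  # the newest version wins


replicas = [Replica("az-a"), Replica("az-b"), Replica("az-c")]
store = QuorumStore(replicas)
store.put("user:1", "alice")
replicas[0].alive = False       # one Availability Zone goes dark
print(store.get("user:1"))      # still readable: "alice"
```

A relational database that insists on a single consistent primary cannot shrug off a zone failure this way without careful (and expensive) synchronous replication, which is the crux of the “out of luck” remark above.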
- If your application processes large amounts of data (big data use case), the “design for failure” approach is going to be even more difficult because of the “inertia” associated with moving such large volumes of data
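A back-of-the-envelope calculation shows where that inertia comes from. Assuming, purely for illustration, a 1 Gbps link running at 70% effective throughput, moving 10 TB out of a failing region takes well over a day:

```python
def transfer_hours(terabytes, gbps, efficiency=0.7):
    """Hours to move `terabytes` (decimal TB) over a `gbps` link.

    `efficiency` is an assumed fraction of the nominal line rate
    actually achieved end to end.
    """
    bits = terabytes * 8 * 10**12          # decimal TB -> bits
    seconds = bits / (gbps * 10**9 * efficiency)
    return seconds / 3600


print(round(transfer_hours(10, 1), 1))     # ≈ 31.7 hours for 10 TB at 1 Gbps
```

With numbers like these, relocating a big-data workload reactively, after an outage has begun, is simply not an option; the redundant copies have to exist before the failure.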
- Even if you are preparing for such an eventuality, a simple oversight could land your app in trouble. According to this comment on George Reese’s post, a startup used the best practices to prepare for an outage like this one but still went down because their webapp was hosted on top of Heroku. Unfortunately for them, Heroku was not designed to withstand the AWS outage. I am sure Salesforce will move Heroku into its own datacenters faster after this incident
- Finally, if you are not interested in developing applications for the cloud but, instead, want to use legacy apps, I suggest you find a good cloud management platform that can help you deal with such outages with minimal impact
I really want to hear from application developers about some of the best practices they employ to avoid such outages. I am sharing two blog posts talking about how Twilio and SmugMug stayed up during the recent AWS outage. If you want to share your own experience of staying up by designing for failure, CloudAve will be happy to showcase it. Please contact me.
Talking about the death of cloud computing in the aftermath of the AWS outage is just naive. However, this incident is a useful speed bump, a chance to take a step back from all the hype and be more realistic. It gives us an opportunity to stop treating cloud as a belief system and use it in a more strategic way. Ever since the outage happened, we have been seeing many blog posts from actual practitioners sharing their best practices. In short, this incident has helped the cloud community get better educated and empowered to deal with any future outages while still taking advantage of the benefits of cloud computing.
Related articles:
- Traditional Vs Design for Failure (storagezilla.typepad.com)
- Seven lessons to learn from Amazon’s outage (zdnet.com)
- Amazon EC2 outage: summary and lessons learned (rightscale.com)
- Lessons From a Cloud Failure: It’s Not Amazon, It’s You (wired.com)
- How One Site Survived Amazon’s Outage (blogs.wsj.com)