It’s amazing, though not in the least bit surprising, that the recent AWS outage has generated such widespread attention, with a plethora of blog posts from customers to industry experts taking up pixel space across all corners of the globe.
I think it’s fairly obvious to all that since the event, everything that needs to be said has already been said by someone or other, so I’m not going to wax poetic on “who is or isn’t to blame” as it really is somewhat academic to those who are not a customer of AWS today. Or is it? To quote George Reese’s assessment:
…or you failed to design for Amazon’s cloud computing model.
This is a very interesting statement, specifically because I feel it (unintentionally) brings yet more uncertainty front and center in today’s enterprise mindset as it relates to public cloud adoption. I have said on countless occasions that many enterprises are essentially blackballed from moving to public cloud because of legacy application architectures that “don’t scale past the edge”, making movement, and subsequent “like for like” functionality of the components that comprise the application, impossible without overlay networking technologies. Add in this latest, often misunderstood concept of design for failure, and perhaps this is another potential nail in mass enterprise adoption at this moment in time? Could we say that the public cloud has even become discredited? (OK, so that was a cheap shot…)
In the early stages of the recent AWS event, I posted this tweet:
I say “rightly or wrongly” because, although there is no doubt we will continue to hear from those who wish to fan the flames of the age-old discussion on the reliability of public cloud services, consider the organization whose fall-back proposition is to build its own private cloud on the basic assumption that it will somehow be more reliable than AWS, delivering application workloads that could be deployed either publicly or privately but have never been explicitly assessed and designed for failure. Well, how do you think that’s going to end?
I’m convinced that today, for the traditional enterprise folks, it doesn’t take much more than an outage of this nature, combined with horror stories of how certain customers were catastrophically affected, and paradoxically worrying accounts of what it took for other customers not to be affected, to push the exploration of private cloud further up the to-do lists of many enterprise CIOs. That would be an entirely natural reaction. Somehow it always feels safer if you’re in control. In a strange way, it is the same as the irrational fear people have of flying: most times it’s not the actual fear of being propelled through the air at 530 mph in a metal tube, it’s the fact that someone else is in control of your destiny. Forget the fact that he or she is a highly trained, redundant resource in the flight deck!
Do enterprises give significant consideration to design for failure today? Do they even know what it means? Perhaps they even do it without realizing? My guess is that there are many confused people out there wondering whether design for failure means re-architecting applications “à la Netflix” (or any of the other customers who came through the event unscathed). That would be a fair conclusion to arrive at if you’re reading between the lines, and it wouldn’t be hard to see how that very premise could lead to putting public cloud on the “too hard” pile, citing “can’t afford to make the architectural changes required” as the reason.
Yet, I have a slightly different view. In my experience of building some very large enterprise environments over the years, and of course more recently, our incredibly successful private cloud, I have very rarely come across “traditional” enterprise applications in a behind-the-firewall context that are inherently capable of withstanding multiple component failures at any given layer, in any given physical location, and continuing to operate via geo-redundancy and without downtime. In most cases, the sheer complexity of doing that with the technologies in the estate is a huge barrier – data replication, workflow, business logic, latency – all high-precision, point-in-time stuff that, frankly, I have never seen truly implemented. This is complicated, difficult work, and although it is (IMHO) unfairly magnified in the public cloud context, no private cloud will fix it without addressing the same set of issues, both architecturally and operationally. But designing for failure need not mean your application has to withstand Chaos Monkey without missing a beat. It does, however, serve to underline my previous notion that “apps that support your business” are very different from “apps that are your business”.
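To make the point concrete: design for failure doesn’t have to begin with a full Netflix-style re-architecture. It can start as small as assuming any single endpoint or region can disappear, and degrading gracefully to an alternate. Here’s a minimal sketch of that idea in Python – the region names and `request_fn` callback are purely illustrative, not any particular cloud SDK:

```python
import random
import time


class AllEndpointsFailed(Exception):
    """Raised when every configured endpoint has been exhausted."""


def fetch_with_failover(endpoints, request_fn, retries_per_endpoint=2):
    """Try each endpoint in order, with bounded retries and jittered backoff.

    `endpoints` is an ordered list (e.g. primary region first);
    `request_fn(endpoint)` performs the actual call and raises on failure.
    """
    last_error = None
    for endpoint in endpoints:
        for attempt in range(retries_per_endpoint):
            try:
                return request_fn(endpoint)
            except Exception as err:  # in practice, catch the specific transport error
                last_error = err
                # exponential backoff with jitter before the next attempt
                time.sleep((2 ** attempt) * 0.1 * random.random())
    raise AllEndpointsFailed(f"all endpoints failed; last error: {last_error!r}")


# Usage: simulate a primary region that is down and a secondary that works.
def flaky(endpoint):
    if endpoint == "us-east":
        raise ConnectionError("region unavailable")
    return f"response from {endpoint}"


print(fetch_with_failover(["us-east", "eu-west"], flaky))  # response from eu-west
```

That’s obviously a toy – real failover also means replicated data, health checks and an understanding of consistency trade-offs – but it illustrates the mindset: failure of a component is an expected input, not an exceptional one.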
Put all that together, and here’s my thought:
Like a million others, I’ve spent a thousand hours putting Humpty back together again when data centers have broken out of sight. Many enterprises like ours, and those I know personally, have rock-solid disaster recovery and business continuity plans, many with business sponsorship. In most cases, these reflect the levels of risk and the acceptable RTO for the business that the applications support. Those considerations can and should transcend any cloud strategy – there is no magic formula, simply good planning and an understanding of your organization’s capability to deal with the unexpected, irrespective of where, when and on whose platform an outage occurs. To me, it’s more important to be complete in how you prepare for failure, even if you don’t consider it a true design.
Trying to provide the ultimate go-to solution for public cloud is an occupational hazard for leaders like Amazon – when you’re way out in front, there is an implied level of expectation upon you. When things break, as they absolutely will, it’s imperative that they are addressed fully and with honesty – and as has happened before, Amazon will certainly have to provide a good explanation in their post-mortem. I am sure they will. As AWS CTO Werner Vogels pointed out in a video from 2008, responding to a previous outage, most customers “would have seen a lot more downtime on their own infrastructures”. Still hard to argue with that, in fairness.
When the dust settles, I think we can all come out of this a little smarter.
You know I can’t resist an aviation analogy, so let me close with this. (source: Wikipedia)
The de Havilland DH 106 Comet was the world’s first commercial jet airliner to reach production. It first flew in 1949 and was a landmark in aeronautical design. It featured an extremely aerodynamically clean design with its four de Havilland Ghost turbojet engines buried in the wings, a low-noise pressurized cabin, and large windows; for the era, it was an exceptionally comfortable design for passengers and showed signs of being a major success in its first year of service.
However, a few years after introduction into commercial service, the Comet suffered from catastrophic metal fatigue, which, in combination with the pressurization, caused two well-publicized accidents where the aircraft tore apart in mid-flight. The Comet had to be extensively tested to discover the cause; the first incident had been incorrectly attributed to an onboard fire. Several contributory factors, such as the window installation methodology, were also identified as exacerbating the problem. The Comet was extensively redesigned to eliminate the flaw, while rival manufacturers developed their own aircraft and heeded the lessons learned from the Comet.
So, rather than see this whole episode as an opportunity to be negative, I choose to see it as a positive, an experience we can all learn from, just as de Havilland did over 60 years ago. They led an industry to where aviation has become a safe, affordable, mass-transportation commodity. In enterprise, no two needs or scenarios are the same, so it’s critical that you know the limitations, separating fact from fiction so you can make informed decisions specific to your use cases. Be thorough, because when it all breaks, there is no patch for stupid.