Before any wit comes up with any other suggestions for the meaning of the title of this post, the “swallow” in the above refers to the bird, Stelgidopteryx serripennis, and not the physiological movement, phagia.
For one swallow does not make the summer, nor does one day; and so too one day, or a short time, does not make a man blessed or happy
With apologies to Aristotle, who, if his long robes allowed it, would be spinning in his grave at my use of his words. Still, it seemed an (almost) appropriate quote with which to discuss the crescendo of naysayers claiming that the recent outages suffered by hosting/cloud vendors show how flawed a move to the cloud really is. One outage (or, for that matter, half a dozen) doth not a concept negate.
The naysayers' logic goes that since offsite seems to be so unreliable, any service that can be maintained in-house should be. Just to recount, the outages of note included:
- June 29, Rackspace Hosting experienced a power outage at its Dallas data center. Downtime lasted approximately 45 minutes, knocking many popular customer web sites offline.
- July 2, Equinix data centers in Sydney, Australia and Paris each experienced power failures. The Sydney event led to disruptions for VoIP service in parts of Australia, while the Paris outage caused downtime for the popular video site DailyMotion and the French portal for hosting firm ClaraNet.
- July 2, Google App Engine, the company’s cloud computing platform, had lengthy performance problems on Thursday, experiencing high latency and data loss.
- July 2, a fire at Fisher Plaza in Seattle late Thursday night left many of the building’s data centers without power. Payment gateway Authorize.net was offline for more than 12 hours, leaving its merchant customers unable to process credit card sales. Other sites experiencing lengthy downtime included AdHost, GeoCaching and Microsoft’s Bing Travel.
- July 5, a fire at 151 Front Street, the major carrier hotel in Toronto, knocked out power on several floors of the facility used by Peer 1 networks.
Phil Wainewright plays the Emperor’s New Clothes line by pointing out that the facilities that suffered the outages were not in fact cloud service data centers. In fact, many of those affected by the recent outages would have been enterprise customers with their own off-site private hosting – as Jesse Robbins, former techzilla at Amazon.com, said:
The people that I hold responsible for outages are not the data centers but the people who built systems that rely on a single data center or that depend on disaster recovery plans that they write once and never put into practice – and end up getting caught with their pants down
Putting this fact aside, however, let’s look at the on-premise vs off-premise hosting argument for a minute. Unfortunately the playing field we’re dealing with here is less than level. Anyone who has worked within large enterprises (or small businesses, for that matter) is aware of regular outages, both planned and otherwise. Because they are internal, however, these outages rarely hit the news, and as such we have exceptionally skewed metrics around uptime.
It’s one of the big challenges for SaaS and Cloud Computing vendors, that old chestnut that “mission critical software needs to be locally deployed in order to ensure reliability and accessibility at all times” – so I thought I’d talk to a few vendors about their view on all of this.
I spoke with Rod Drury, CEO of Xero, who put an interesting twist on the issue – especially for those providing services to small businesses. As Rod said:
Ironically I think it showed the resilience of the model. No data lost, seamless experience for everyone, and as long as you communicate the customers seem to understand… when the worst happened a large team of super smart people around the world immediately kicked into gear working on ‘Joe the Plumbers’ accounting system to make sure it was up and running and no data was lost. Within 45 minutes all back and up and running. So it showed the huge benefit of SaaS.
Which is a great answer for those providing services to small businesses where there is no option of providing internal infrastructure that comes close to that provided by a professional cloud provider. It’s also a legitimate answer for a situation where the vendor is open and communicative.
It doesn’t, however, provide the answer for an enterprise customer looking to deploy Google Apps and rightly concerned about both uptime and communication in the event of a problem. Scott McMullan, Google Apps Partner Lead, Google Enterprise, who no doubt fights these battles every day, pointed me to an official Google blog post that included some (admittedly slightly out of date) metrics:
According to the research firm Radicati Group, companies with on-premises email solutions averaged from 30 to 60 minutes of unscheduled downtime and an additional 36 to 90 minutes of planned downtime per month
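To put those monthly figures on the same footing as the uptime percentages vendors like to quote, here is a rough back-of-the-envelope sketch. It assumes a 30-day month, and the function name and constant are mine for illustration rather than anything from Radicati or Google.

```python
# Assumption: a 30-day month, i.e. 43,200 minutes.
MINUTES_PER_MONTH = 30 * 24 * 60


def availability_percent(downtime_minutes: float) -> float:
    """Return uptime as a percentage of a 30-day month."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_MONTH)


# Radicati figures quoted above: unscheduled plus planned downtime per month.
low_estimate = 30 + 36    # 66 minutes of total downtime
high_estimate = 60 + 90   # 150 minutes of total downtime

print(f"on-premises email, low estimate:  {availability_percent(low_estimate):.2f}%")
print(f"on-premises email, high estimate: {availability_percent(high_estimate):.2f}%")
```

On those assumptions the on-premises range works out to roughly 99.85% down to 99.65% availability. By the same arithmetic, a month containing a single 45-minute outage and nothing else comes to about 99.90%.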
At the end of the day, looking at the outages above, some guidelines can be drawn about how to avoid these incidents and how to react to them if and when they occur:
- Plan ad nauseam – multiple levels of redundancy are critical. Assess the worst-case scenario and then plan for something even worse. Test your ability to switch to alternative locations quickly and cleanly.
- Communicate ad nauseam – social media is cheap and easy, so use it. Develop a service notification dashboard and publicize it, both within the offering and at an alternative location (a company blog hosted elsewhere, for example). Post status updates to Twitter in the event of an outage.
- Test disaster recovery protocols. Surviving a data centre outage is one thing; the ability to restore backups, to quickly run consistency checks and to get users up and running with no data loss is what is critical.
- Hosting/Cloud vendors: react fast and react well. Witness Rackspace’s quick response with credits after its outage. Sure, its SLAs required this, but I’d like to think it would have happened anyway.