One of the biggest worries organizations have about cloud computing is unexpected outages and the disruption they cause. In fact, some traditional vendors use this very issue to spread FUD among their customers so that they can lock them in for the foreseeable future. By the same token, if anyone evangelizing cloud tells you to ignore stories about cloud outages, you had better send that person for a psychiatric evaluation. Let me be clear: cloud outages are real and they are here to stay. In fact, two outages in the last few days are worth mentioning in this context. Google Docs had an outage on Wednesday that lasted for one hour, cutting off the majority of its customers from their documents. Yesterday, Microsoft Hotmail and SkyDrive had a few hours of downtime.
Outages happen. They happened in the traditional services world, they are happening in the current cloud services era, and they will happen in whatever services come next. What is different from outages in traditional services is the transparency on the part of cloud providers. They clearly understand the critical nature of their services and are upfront about failures. Google wrote a blog post explaining the Google Docs outage:
So what happened? The outage was caused by a change designed to improve real time collaboration within the document list. Unfortunately this change exposed a memory management bug which was only evident under heavy usage.
Every time a Google Doc is modified, a machine looks up the servers that need to be updated. Due to the memory management bug, the lookup machines didn’t recycle their memory properly after each lookup, causing them to eventually run out of memory and restart. While they restarted, their load was picked up by the remaining lookup machines – making them run out of memory even faster. This meant that eventually the servers couldn’t properly process a large fraction of the requests to access document lists, documents, drawings, and scripts which led to the outage you saw on Wednesday.
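The cascade Google describes is easier to see in a toy model. The sketch below is not Google's system; every name and number in it (LookupMachine, simulate, the capacity, leak rate, memory limit and restart time) is a made-up assumption chosen only to illustrate the mechanism: each request leaks a little memory, a machine that hits its limit restarts, and the survivors absorb its load and therefore leak faster.

```python
from dataclasses import dataclass


@dataclass
class LookupMachine:
    """Hypothetical lookup machine; not modeled on any real Google component."""
    leaked: float = 0.0   # memory leaked since the last restart (arbitrary units)
    down_until: int = 0   # first tick at which the machine serves again


def simulate(machines=4, ticks=60, requests_per_tick=100.0,
             capacity_per_machine=40.0, leak_per_request=0.1,
             memory_limit=100.0, restart_ticks=5, stagger=10.0):
    """Return the fraction of requests served in each tick.

    All parameters are illustrative assumptions. `stagger` gives the machines
    different starting leak levels so they do not all restart at once.
    """
    fleet = [LookupMachine(leaked=i * stagger) for i in range(machines)]
    served_fraction = []
    for tick in range(ticks):
        # Only machines that are not in the middle of a restart take traffic.
        live = [m for m in fleet if tick >= m.down_until]
        served = 0.0
        if live:
            # Load is split evenly, so fewer live machines means a bigger share each.
            share = requests_per_tick / len(live)
            for m in live:
                handled = min(share, capacity_per_machine)
                served += handled
                # The bug: memory is never recycled after a lookup.
                m.leaked += handled * leak_per_request
                if m.leaked >= memory_limit:
                    # Out of memory: restart and stay down for restart_ticks ticks.
                    m.leaked = 0.0
                    m.down_until = tick + 1 + restart_ticks
        served_fraction.append(served / requests_per_tick)
    return served_fraction


healthy = simulate(leak_per_request=0.0)  # same fleet without the leak
leaky = simulate()                        # same fleet with the leak
print("worst tick without leak:", min(healthy))
print("worst tick with leak:   ", min(leaky))
```

In this model the fleet has spare capacity while all machines are up, so the leak is invisible until the first restart. After that, each survivor handles more requests per tick, reaches its memory limit sooner, and the served fraction drops once too few machines are left.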
The Windows Live team kept its users updated on its blog. When the AWS outage happened, even though there was criticism of AWS’s silence during the outage, they came out with a very detailed postmortem of what happened and described the steps they are taking to avoid such disruptions in the future. This is different from the traditional services approach, where it was always difficult to get vendors to talk frankly about the reasons for an outage. The cloud era is defined by a high level of transparency on the part of service providers, and it is going to be one of the pillars of the eventual success of this operational model. It is not just about being transparent to customers but also to the public. This, in turn, helps providers win the trust of future customers more easily.
Of course, transparency doesn’t compensate for the loss suffered by the organization. Cloud computing is not a shortcut that lets organizations avoid having DR and business continuity practices in place. If anything, moving to the cloud requires apps to be designed for failure, and organizations are expected to be even more vigilant about their DR plans. The only way to avoid any business loss is to have a well-designed (and tested) DR policy. What these transparency measures do address is another concern often cited by enterprise IT managers, i.e., losing visibility into disruptions. An IT manager knows pretty well that disruptions are part and parcel of the IT lifecycle. They also want to know the root cause of a disruption and the steps taken to avoid future disruptions from the same cause. By being completely transparent about an outage and by doing a thorough postmortem, cloud service providers give these managers the confidence that they are in control of disruptions even when they happen in the cloud, outside their perimeter. I think any argument about losing visibility into disruptions is meaningless when cloud providers are proactively transparent about them.
If you are a cloud service provider, make sure to revisit business school and learn that secrecy is no longer a competitive advantage. Being transparent is critical not just for your company’s business success but also for building trust among buyers. Even as traditional companies spread FUD about cloud outages and the associated loss of business, cloud service providers can blunt it by being excessively transparent. Transparency is one of the key mantras of the cloud, and it is eventually going to turn the tide completely in favor of cloud services. Do you agree with me?

You just can’t avoid outages. Cloud services still need to be maintained and upgraded.
But what’s great is that you never really need to worry about calling somebody to fix a problem. The people providing the service just do it without any prompting.
I agree that providers are getting more transparent. It is not only a function of a new “cloud” world; online accountability through social networking also has a huge impact. 10 years ago, if your outsourcer was doing a bad job, no one knew. With these new cloud service providers, it’s easier to switch because the infrastructure is commoditized, and blogging about providers is so much easier. All in all, a good thing though, because competition and accountability make for better services and more innovation.
Look at Amazon EC2 SLA. http://aws.amazon.com/ec2-sla/
This is the new transparency. Of course, there are SLAs and there are S*L*A*s. As with any SP, if you are willing to pay, they will add 9s to your annual availability. If it is outside their control, like force majeure, last-mile or third-party connections, you are in the same boat as any consumer of services. Stuff happens. It is getting better … fast. Knowing what the SLA is before subscribing and living with what you can afford is SOP for Cloud Commerce.
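To put “adding 9s” in concrete terms, here is a small sketch that converts an annual availability percentage into the downtime it still allows. The availability levels in the loop are generic examples, not figures taken from any particular provider’s SLA, and the helper name allowed_downtime_hours is made up for this illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year that still meet the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Example availability levels (illustrative, not from a specific SLA).
for level in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(level)
    print(f"{level}% availability allows about {hours:.2f} hours of downtime per year")
```

Each extra 9 cuts the permitted downtime by a factor of ten, which is why it pays to check what a given tier actually promises before subscribing.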
[...] Krishnan Subramanian, an analyst and researcher, wrote an article on Cloudave.com titled Loss Of Control And Transparency In The Cloud Era. Krishnan discusses the reality that moving to the cloud does not reduce your risk of outages to [...]