One of the biggest worries organizations have about cloud computing is unexpected outages and the disruptions that come with them. In fact, some traditional vendors use this very issue to spread FUD among their customers so that they can lock them in for the foreseeable future. Similarly, if anyone evangelizing cloud tells you to ignore stories about cloud outages, you had better send that person for a psychiatric evaluation. I want to emphasize that cloud outages are real and they are here to stay. In fact, there were two outages in the last few days that are worth mentioning in this context. Google Docs had an outage on Wednesday that lasted for one hour, cutting off the majority of its customers from their documents. Yesterday, Microsoft Hotmail and SkyDrive had a few hours of downtime.
Outages happen. They happened in the traditional services world, they are happening in the current cloud services era, and they will keep happening in whatever services come next. What is different from outages in traditional services is the transparency on the part of cloud providers. They clearly understand the critical nature of their services and are upfront about failures. Google wrote a blog post explaining the Google Docs outage:
So what happened? The outage was caused by a change designed to improve real time collaboration within the document list. Unfortunately this change exposed a memory management bug which was only evident under heavy usage.
Every time a Google Doc is modified, a machine looks up the servers that need to be updated. Due to the memory management bug, the lookup machines didn’t recycle their memory properly after each lookup, causing them to eventually run out of memory and restart. While they restarted, their load was picked up by the remaining lookup machines – making them run out of memory even faster. This meant that eventually the servers couldn’t properly process a large fraction of the requests to access document lists, documents, drawings, and scripts which led to the outage you saw on Wednesday.
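To make the snowball effect in Google's explanation concrete, here is a toy simulation. It is not Google's code and says nothing about their real architecture; the machine count, leak rate, memory limit, and restart time are all numbers I made up for illustration. It only shows the pattern described above: leaking machines restart, the survivors absorb their load, and so they leak faster.

```python
from dataclasses import dataclass


@dataclass
class LookupMachine:
    """Hypothetical lookup machine; all fields are illustrative."""
    memory_used: float = 0.0
    restart_ticks_left: int = 0

    @property
    def is_up(self) -> bool:
        return self.restart_ticks_left == 0


def simulate(num_machines=10, total_load=1000, leak_per_lookup=0.01,
             memory_limit=100.0, restart_ticks=5, ticks=200):
    """Return the number of live machines observed at each tick."""
    machines = [LookupMachine() for _ in range(num_machines)]
    # Stagger starting memory so the machines do not all fail on the same tick.
    for i, machine in enumerate(machines):
        machine.memory_used = i * (memory_limit / num_machines / 2)

    live_history = []
    for _ in range(ticks):
        # Machines that are restarting move one tick closer to being back.
        for machine in machines:
            if not machine.is_up:
                machine.restart_ticks_left -= 1

        live = [m for m in machines if m.is_up]
        live_history.append(len(live))
        if not live:
            continue  # nobody is serving lookups during this tick

        # The whole load is shared by whoever is still up, so fewer live
        # machines means more lookups (and more leaked memory) per machine.
        lookups_per_machine = total_load / len(live)
        for machine in live:
            machine.memory_used += lookups_per_machine * leak_per_lookup
            if machine.memory_used >= memory_limit:
                # Out of memory: restart with a clean slate.
                machine.memory_used = 0.0
                machine.restart_ticks_left = restart_ticks
    return live_history


if __name__ == "__main__":
    history = simulate()
    print("fewest live machines at any tick:", min(history))
```

Setting `leak_per_lookup=0.0` keeps every machine up for the whole run, which is the point: the load shifting is harmless on its own and only becomes a problem once the leak is there.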
The Windows Live team kept their users updated on their blog. When the AWS outage happened, even though there was criticism about AWS's silence during the outage, they came out with a very detailed postmortem of what happened and addressed the steps they are taking to avoid such disruptions in the future. This is different from the traditional services approach, where it was always difficult to get vendors to talk frankly about the reasons for an outage. The cloud era is defined by a high level of transparency on the part of service providers, and it is going to be one of the pillars of the eventual success of this operational model. It is not just about being transparent to their customers but also to the public. This, in turn, helps providers earn the trust of future customers more easily.
Of course, transparency doesn’t compensate for the loss suffered by the organization. Cloud computing is not a shortcut for organizations to avoid having DR and business continuity practices in place. If anything, moving to the cloud requires apps to be designed for failure, and organizations are expected to be even more vigilant about their DR plans. The only way to limit business loss is to have a well designed (and tested) DR policy. What these transparency measures do address is another concern often cited by enterprise IT managers, i.e., losing visibility into disruptions. An IT manager knows pretty well that disruptions are part and parcel of the IT lifecycle. They also want to know the root cause of the disruption and the steps taken to avoid future disruptions from the same cause. By being completely transparent about the outage and by doing a thorough postmortem, cloud service providers give these managers the confidence that they are on top of the disruptions even when they happen in the cloud, outside of their perimeter. I think any argument about losing visibility into disruptions is meaningless when cloud providers are proactively transparent about them.
If you are a cloud service provider, make sure to revisit business school and learn that secrecy is no longer a competitive advantage. Being transparent is critical not just for your company’s business success but also for building trust among buyers. Even as traditional companies spread FUD about cloud outages and the associated loss in business, cloud service providers can blunt it by being exceedingly transparent. Transparency is one of the key mantras of the cloud, and it is eventually going to turn the tide completely in favor of cloud services. Do you agree with me?