Skype went down this morning and, as expected, Techmeme and Twitter are already going crazy with the news about the downtime. There was speculation about problems with the centralized infrastructure Skype uses to authenticate users (which was a cause of trouble in one of its past outages), and some were even talking about “The Cloud Fail”. Whether it is the President or the Cloud, there are people who are eager to see them fail because of their own political or business interests. Skype has already responded about the downtime and, in spite of its explanation, the talk about “the cloud fail” continues to spread. I thought I would do a quick post to highlight the FUD as a public service message :-).
According to the Skype blog, the cause of the failure was not their centralized infrastructure but parts of their distributed infrastructure.
Skype isn’t a network like a conventional phone or IM network – instead, it relies on millions of individual connections between computers and phones to keep things up and running. Some of these computers are what we call ‘supernodes’ – they act a bit like phone directories for Skype. If you want to talk to someone, and your Skype app can’t find them immediately (for example, because they’re connecting from a different location or from a different device) your computer or phone will first try to find a supernode to figure out how to reach them.
Under normal circumstances, there are a large number of supernodes available. Unfortunately, today, many of them were taken offline by a problem affecting some versions of Skype. As Skype relies on being able to maintain contact with supernodes, it may appear offline for some of you.
Clearly, the problem here is not indicative of any trouble that could arise with cloud computing. It is a distributed system failure caused by software problems rather than a failure in a centralized infrastructure. Failures do happen in cloud computing, in the same way they happen in traditional IT. No one is claiming that the cloud, by its very nature, will offer 100% uptime. In fact, commodity cloud providers like Amazon advise you to expect failures and to architect your application for them. When people set out to explore cloud computing services, they don’t do so only because the cloud offers enormous cost savings and elasticity; they weigh these advantages against the associated risks and compare the result with the risk-benefit equation of traditional infrastructure. Any FUD that uses the Skype downtime to play up the risks of public clouds is irresponsible, and it carries an inherent assumption that the people who listen to it are complete idiots who explore public clouds based only on their advantages, without considering the risks. Enoughhhhh.
I am not sure whether I agree or disagree with you. So let me share my thoughts and let you decide whether we are agreeing or not. The Skype supernode overlay CAN be viewed as a cloud. After all, each supernode is an instance in a cloud. The problem Skype faced was that most of the instances were using the same version of the app and they ended up upgrading them all at the same time, thereby losing their drone army. A few years back they faced a similar failure because most of the supernodes used the same OS (Windows) and an OS patch made the app inoperable.
Alternatively, if they had deployed their own instances, they could have selected diff OSs, they could have operationally selected the supernodes for updates in batches etc. In this respect Skype failure points out how a cloud could fail because we have not paid attention to operational details.
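To make the idea of updating supernodes in batches concrete, here is a minimal sketch of a staged rollout that halts as soon as a freshly upgraded batch looks unhealthy. This is only an illustration, not how Skype deploys anything: the function `rollout_in_batches`, the node names, the version strings and the health check are all hypothetical.

```python
def rollout_in_batches(nodes, batch_size, upgrade, is_healthy):
    """Upgrade nodes one batch at a time; stop after the first unhealthy batch.

    Returns (upgraded, halted): the nodes that were upgraded, and whether
    the rollout was stopped early.
    """
    upgraded = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            upgrade(node)
        upgraded.extend(batch)
        # Check the batch before touching any more nodes.
        if not all(is_healthy(node) for node in batch):
            return upgraded, True
    return upgraded, False


# Toy run (hypothetical data): version "2.0" is assumed to be broken.
versions = {f"supernode-{i}": "1.0" for i in range(10)}
done, halted = rollout_in_batches(
    list(versions),
    batch_size=2,
    upgrade=lambda n: versions.__setitem__(n, "2.0"),
    is_healthy=lambda n: versions[n] != "2.0",
)
still_old = [n for n, v in versions.items() if v == "1.0"]
print(len(done), halted, len(still_old))  # prints: 2 True 8
```

In this toy run the bad version only reaches the first two nodes, and the other eight keep running the old one, which is the point of batching: a faulty update costs you one batch instead of the whole drone army.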
I agree with your comparison that supernodes are like EC2 instances. An EC2 instance may fail because the user messed up the OS or the software installed on it, but that is not a failure of the Amazon cloud. That's the distinction I wanted to make.
Come ON!!!
This is not acceptable for the big Skype!
It shouldn’t happen to them!
True but that was not my point.