Last fall I wrote a controversial whitepaper detailing my concerns about how distributed storage was being marketed. The blog introduction and the whitepaper were both entitled Converged Storage, Wishful Thinking, and Reality. There was a certain amount of expected blowback from folks at Red Hat and Ceph, as well as more thoughtful replies from DreamHost and SolidFire. Some of the feedback was pure screed, some was worthy of considered thought, and much of it missed the mark. I realized, however, that much of the feedback missed because of *my* failure to communicate clearly.
So I wrote an update to that whitepaper that I think helps clarify my thinking in this area. Fortuitously, the recent Google outage really helps to emphasize what I am talking about. If you haven’t been paying attention, you can find out more about it here and from Google themselves. In a nutshell, Google, which runs a three-to-four-nines operation for most services (meaning 99.9-99.99% uptime), had a 35-minute outage due to a software bug. It’s noteworthy that an event like this is so heavily scrutinized, given that the vast majority of businesses struggle to achieve even three nines of uptime.
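To put those numbers in perspective, here is a rough back-of-the-envelope calculation of what "nines" mean in minutes per year. This is only a sketch: the function names are mine, and it assumes a 365-day year and a simple annual downtime budget, which is not necessarily how Google or anyone else actually measures availability.

```python
# Rough availability arithmetic. Assumes a 365-day year (no leap day)
# and a simple "minutes of downtime allowed per year" budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def budget_consumed(outage_minutes: float, availability_percent: float) -> float:
    """Fraction of the annual downtime budget used up by a single outage."""
    return outage_minutes / downtime_budget_minutes(availability_percent)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        budget = downtime_budget_minutes(target)
        used = budget_consumed(35, target)
        print(f"{target}%: {budget:.1f} min/year allowed, "
              f"a 35-minute outage uses {used:.0%} of it")
```

Three nines allows roughly 526 minutes (under nine hours) of downtime a year, and four nines allows roughly 53 minutes. Under these assumptions, a single 35-minute outage eats about two-thirds of a four-nines annual budget, which is why one software bug matters so much at that level.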
More importantly, this outage highlights what I was trying to say in my original paper, which is that large homogeneous software systems are inherently dangerous to uptime. Systems fail because of hardware failure, operator error, or software bugs. Traditional high availability (HA) pairs solve for the hardware failure problem, but typically ignore operator error or software bugs. Newer distributed software systems *still* predominantly solve for hardware failure, ignoring the more likely failure scenarios of operator error and software bugs. This is a major problem.
There are a number of ways to solve these problems, including but not limited to: running more than one kind of software system (that is, moving to a more heterogeneous set of systems), using sharding techniques, and applying disaster recovery techniques. All of these approaches are essentially doing the same thing, which is limiting the size of the failure domain. Not having fault isolation means you will see cascading failures, as AWS has seen several times when it violated some of these basic principles. Operator errors combined with software bugs caused massive EBS failures, which spilled across Availability Zones because the EBS control plane spanned the AZs.
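As a toy illustration of what limiting the failure domain buys you, the sketch below spreads volumes across independent shards and measures how much is affected when one shard fails. Everything here is hypothetical: the names `shard_for` and `blast_radius`, the hash-based placement, and the assumption that a software bug or operator error takes out exactly one shard and nothing else. It is not how EBS or any particular storage system is built.

```python
import hashlib


def shard_for(volume_id: str, shard_count: int) -> int:
    """Map a volume to a shard with a stable hash (hypothetical placement scheme)."""
    digest = hashlib.sha256(volume_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def blast_radius(volume_ids: list[str], shard_count: int, failed_shard: int) -> float:
    """Fraction of volumes lost when one shard fails and the failure stays contained."""
    affected = [v for v in volume_ids if shard_for(v, shard_count) == failed_shard]
    return len(affected) / len(volume_ids)


if __name__ == "__main__":
    volumes = [f"vol-{i:04d}" for i in range(1000)]  # made-up volume IDs
    for shards in (1, 4, 16):
        worst = max(blast_radius(volumes, shards, s) for s in range(shards))
        print(f"{shards:>2} shard(s): worst-case single-shard failure "
              f"affects {worst:.1%} of volumes")
```

With a single shard, any bug or mistake that takes down the system takes down everything. With more shards the worst case shrinks, but only if the shards really are isolated: a control plane shared across all of them puts everything back into one failure domain.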
This is my problem with how distributed storage systems are being marketed. I love distributed storage. I think it has a place in the datacenter. Cloudscaling is currently evaluating our options for integrating open source distributed storage software into our product. The problem is that its place is not to run everywhere in the datacenter. The marketeers at various storage startups who believe otherwise, and who market so-called “unified” storage solutions, are really setting everyone up for failure. There is no spoon.
For more details, here is my updated storage whitepaper, now entitled The Case for Tiered Storage in Private Clouds, that hopefully furthers the conversation on the appropriate best practices and patterns to use for deploying storage in private clouds today.
In the interest of transparency, I will say that this new revision was kindly reviewed by the team at SolidFire, who have a viewpoint similar to mine. Many thanks to Dave Wright, Dave Cahill, and John Griffith (PTL for OpenStack Cinder) for their feedback.