One of the biggest failures of OpenStack to date is expectation setting. New potential OpenStack users and customers come to OpenStack and expect to find:

- A uniform, monolithic cloud operating system (like Linux)
- A set of well-integrated and interoperable components
- Interoperability with their own vendors of choice in hardware, software, and public cloud

Unfortunately, none of this exists, and probably none of it should ever have been expected, since OpenStack won’t ever become a unified cloud operating system.
The problem can be summed up by a request I still see regularly from customers:
I want ‘vanilla OpenStack’
Vanilla OpenStack does not exist, never has existed, and never will exist.
Examining The Request
First of all, it’s a reasonable request. The potential new OpenStack customer is indirectly asking for the things that led them to OpenStack in the first place. The twice-yearly user survey has already told us what people care about. The top reasons for adopting OpenStack boil down to:

- Help me reduce costs
- Help me reduce or eliminate vendor lock-in

Hence the desire for “vanilla” OpenStack.
But what is “vanilla”? It could be a number of things:
- Give me the “official” OpenStack release with no modifications
- Give me OpenStack with all default settings
- Give me an OpenStack that has no proprietary components
Which immediately leads into the next problem: what is “OpenStack”? That could also be a number of things:

1. Until recently, officially “OpenStack” meant Nova + Swift *only*
2. The de facto “baseline” set of commonly deployed OpenStack services
   - Nova, Keystone, Glance, Cinder, Horizon, and maybe Neutron
   - There is no name or official stance on this arbitrary grouping
3. Use of DefCore + RefStack to test for OpenStack
This last item is still in motion and is the intended replacement model for #1. All of this is covered elsewhere so I won’t go into details on what it means here.
Understanding The OpenStack Trademark and Its History
It’s not really a secret, but it’s deeply misunderstood: until the last few weeks, the bylaws very specifically said that “OpenStack” is Nova plus Swift. That’s it. No other projects were included in the definition. Technically, many folks who shipped “OpenStack” without Swift were not actually legally allowed to use the trademark and brand. This was widely unenforced because the Board and Foundation knew the definitions were broken. Hence DefCore.
Also, the earliest deployments out there of OpenStack were Swift-only. Cloudscaling launched the first ever production deployment of OpenStack outside of Rackspace in January of 2011, barely 6 months after launch. At that time, Nova was not ready for prime time. So the earliest OpenStack deployments also technically violated the trademark bylaws, although since the Foundation and bylaws had yet to be written this didn’t really matter.
My point here is that from the very beginning of OpenStack’s formation, drawing a line around “OpenStack” has been difficult.
Explosive Growth of Projects is Making Defining OpenStack Much Harder
There are now 20 projects in OpenStack. Removing libraries and non-shipping projects like Rally, there are still ~15 projects in “OpenStack” integrated status. And there are many more on the way. Don’t be surprised if by the end of 2016 there are as many as 30 shipping OpenStack projects in integrated status.
Many of these new projects are above the imaginary waterline many people have created in their minds for OpenStack, meaning that, for them, OpenStack is an IaaS-only effort. However, we can now see efforts like Zaqar, Sahara, and others blurring that line and moving us up into PaaS land.
So when a customer is asking for “OpenStack”, just what are they asking for? The answer is that we don’t know, and rarely do they. The lack of a clear definition from the Board, the Foundation, and the TC has made explaining this very challenging.
Vanilla-land: A Fairy-Tale in the Making
You can’t run OpenStack at scale and in production, the only measure that matters, in a “vanilla” manner. Here I am considering “vanilla” to include all three definitions above: default settings, no proprietary code, and no modifications. The closest you can get to this is DevStack, which is neither scalable nor acceptable for production.
Why?
It would really take far too long to go through all of the gory details, but I need to give you some solid examples so you can understand. Let’s do this.
General Configuration
First, there are well over 500 configuration options across OpenStack’s projects, and many of them immediately take you into proprietary land. Want to use your existing hypervisor, ESX? ESX is proprietary code, creating vendor lock-in and increasing costs. Want to use your existing storage or networking vendors? Same deal.
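To make that concrete, here is a minimal, hedged sketch of how a single nova.conf choice moves you from the open source default to a proprietary hypervisor. The option names reflect roughly this era of OpenStack and should be verified against your release; the vCenter values are obviously placeholders.

```ini
# /etc/nova/nova.conf -- illustrative fragment only
[DEFAULT]
# Open source default: libvirt driving KVM
compute_driver = libvirt.LibvirtDriver

[libvirt]
virt_type = kvm

# Reusing an existing ESX/vCenter estate swaps in proprietary code underneath:
# [DEFAULT]
# compute_driver = vmwareapi.VMwareVCDriver
# [vmware]
# host_ip = <vcenter address>
# host_username = <user>
# host_password = <password>
```

One line of configuration, and the “vanilla” deployment now depends on a proprietary hypervisor and its licensing.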
Don’t want to reuse technology you already have? Then be prepared for a shock about what you’ll get by default from core projects.
Networking
Whether you use nova-networking or Neutron, the “default” mode is what nova-networking calls “single_host” mode. Single host mode is very simple: you attach VLANs to a single virtual machine (or a single bare metal host) which acts as the gateway and firewall for all customers and all applications. Scalability and performance are limited, since an x86 server will never have the performance of a switch with proprietary ASICs and firmware. Worst of all, the only real failover model here is an active/passive high-availability pair. Most people use Linux-HA, which means that on failover you’re looking at 45-60 seconds when absolutely NO network traffic is running through your cloud. Can you imagine a system-wide networking outage of 60 seconds each time you fail over the single_host server to do maintenance?
You can’t run like this in production, which means you *will* be using a Neutron plugin that provides control over a proprietary non-OpenStack networking solution, whether that’s bare metal switching, SDN, or something else.
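As a rough sketch of where that plugin choice lives, here is an ML2 configuration fragment; the reference Open vSwitch mechanism driver is the open source path, and swapping that line for a vendor’s mechanism driver is precisely where the proprietary networking solution enters. Option names vary by release and vendor, so treat this as illustrative only.

```ini
# /etc/neutron/neutron.conf
[DEFAULT]
core_plugin = ml2

# /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2]
type_drivers = vlan,vxlan
tenant_network_types = vxlan
# Reference open source agent; in production this often becomes a vendor
# mechanism driver for an SDN controller or bare metal switch fabric.
mechanism_drivers = openvswitch
```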
Storage
Like networking, the default block-storage-on-demand capability in Cinder is not what people expect. By default, Cinder simply assumes that each hypervisor has its own locally attached storage (either DAS or some kind of shared storage using Fibre Channel, etc.). Calls to the Cinder API result in the hypervisor creating a new block device (disk) on that local storage. That means:
- The block storage is probably not network-attached
- You can’t back the block storage up to another system
- The block device can’t be moved between VMs like AWS EBS
- Hypervisor failure likely means loss of not only the VM, but also all storage attached to it
This is *not* what customers mean when they think about block storage. What they usually mean is that they want a pooled set of SAN devices providing block storage on demand, where snapshots are at least shared across the SAN devices or backed up elsewhere, and disks can be moved from VM to VM. Whether this is a storage array, distributed software-only storage, or whatever doesn’t matter.
That means that in order to meet the actual requirements of the customer, they have to load a Cinder plugin that manages a non-OpenStack block storage solution. That could be EMC’s own ScaleIO, an open source distributed block store like Ceph, industry-standard SAN arrays like VMAX/VNX, or really anything else.
And if you look at the laundry list of storage actually used in production you’ll see that over half of all deployments take this option.
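For illustration, here is roughly what wiring a non-default backend into cinder.conf looks like, using Ceph RBD as the example; a proprietary array would follow the same pattern with its vendor’s driver and options. Again, this is a sketch with placeholder values, not a reference configuration.

```ini
# /etc/cinder/cinder.conf -- illustrative fragment only
[DEFAULT]
enabled_backends = ceph

[ceph]
volume_backend_name = ceph
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
# A SAN array (VMAX/VNX), ScaleIO, or any other backend would use its own
# volume_driver and backend-specific options here instead.
```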
Application Management
Want your developers to create great new scalable cloud-native applications? Great, let’s do it, but it won’t be with Horizon. Horizon is a very basic dashboard, and even with Heat there are major gaps if you want to help your developers succeed. You’ll need Scalr or RightScale as a cloud application management framework (especially if you are going multi-cloud or hybrid cloud with the major public clouds), or you’ll need a PaaS like Cloud Foundry that does the heavy lifting for you.
You Can Reduce Vendor Lock-in, But … You Can’t Eliminate It
Are you trying to eliminate vendor lock-in? Power to you. That’s the right move. Just don’t expect to succeed completely: you can reduce, but not eliminate, vendor lock-in. It’s better to demand that your vendors provide open source solutions, which don’t necessarily eliminate lock-in but do reduce it.
Why isn’t it possible? Well, network switches, for example, are deeply proprietary. Even if you went with something like Cumulus Linux on ODM switches from Taiwan, you would *still* run proprietary firmware and use a proprietary closed-source ASIC from someone like Marvell or Broadcom. Not even Google gets around this.
Firmware and BIOS on standard x86 servers are deeply proprietary and strictly licensed, and this won’t change any time soon. Not even the Open Compute Project (OCP) can get entirely around this.
The Notion of Vanilla OpenStack is Dangerous
This idea that there is a generic “no lock-in” OpenStack is one of the most dangerous ideas in OpenStack land, and it needs to be quashed ruthlessly. Yes, you should absolutely push to have as much open source in your OpenStack deployment as possible, but since 100% isn’t possible, what you should be evaluating is which combination of open source and proprietary components gets you to the place where you can solve the business problem you are trying to conquer.
Excessive navel-gazing and trying to completely eliminate proprietary components is doomed to failure, even if you have the world’s most badass infrastructure operations and development team. If Google can’t do it, then you can’t either. Period.
The Process for Evaluating Production OpenStack
Here’s the right process for evaluating OpenStack:
1. Select the person in your organization to spearhead this work
2. Make him/her read this blog posting
3. The leader should immediately download and play with DevStack (a minimal local.conf sketch follows this list)
4. The leader should create a team to build a very simple POC (5 servers or less)
5. Understand how the plugins and optional components work
6. Commission a larger pilot (at least 20-40 servers) with a trusted partner or set of trusted partners who have various options for “more generic” and “more proprietary” OpenStack
7. Kick the crap out of this pilot; make sure you come with an exhaustive testing game plan:
   - VM launch times
   - Block storage and networking performance
   - etc…
8. Gather business requirements from the internal developers who will use the system
9. Figure out the gap between “more generic” and “more proprietary” and your business requirements
10. Dial into the right level of “lock-in” that you are comfortable with from a strategic point of view that meets the business requirements
11. If at all possible (it probably won’t be, but try anyway), get OpenStack from a vendor who can be a “single throat to choke”
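For step 3 above, a minimal DevStack local.conf is enough to stand up a throwaway single-node environment; this is only a sketch, and the passwords and IP are placeholders.

```ini
# devstack/local.conf -- minimal single-node sketch
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
HOST_IP=10.0.0.10
```

Run ./stack.sh and start poking at the APIs and Horizon. Nothing about this resembles a production deployment, which is exactly the point of step 3.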
Summarizing
I am trying to put a pragmatic face on what is a very challenging problem: how do you get to the next generation of datacenter? We all believe OpenStack is the cornerstone of such a strategy. Unfortunately, OpenStack itself is not a single monolithic turnkey system. It’s a set of interrelated but not always dependent projects, a set that is growing rapidly, and your own business probably needs only a subset of those projects, at least initially.
That means being realistic about what can be accomplished and what is a pipe dream. Follow these guidelines and you’ll get there. But whatever you do, don’t ask for “vanilla OpenStack”. It doesn’t exist and never will.
(Cross-posted @ Cloudscaling)