Last week, a lightning strike cut off part of Amazon EC2, belonging to a single zone, from the outside world. I don’t want to get into the debate over whether it was an outage or not; I want to take up a different kind of debate. Ever since Cloud Computing started gaining traction, the industry has debated whether an instance based setup or a fabric based one is better. I thought I would revisit this debate in the light of the recent Amazon EC2 “it’s not an outage” incident. Let me do a brief recap of the terminology and then see how the debate shapes up in the aftermath of the “Amazon lightning incident”.
Amazon jumped on the Cloud Computing bandwagon with the release of raw computing power inside virtual machine containers (EC2). They are like individual servers but are based on virtualization rather than on any actual physical hardware. It is an outgrowth of the VPS from the earlier dedicated server era, with the added advantage of elasticity. This is a simplistic description but it gives an idea of what an Instance is in Cloud Computing terminology. This kind of approach has its own merits and demerits. While it offers, among other advantages, the ability to port apps without having to rewrite the code, scaling apps on instances is not as smooth as it is on a fabric. Also, incidents like the recent lightning strike make us wonder whether the instance based approach is well suited to the needs of startups in particular.
The other approach is called the fabric based approach. In this case, all the physical and virtual infrastructure is abstracted away and attributes like transparent scaling, fault tolerance/self healing, etc. are added, thereby offering developers a uniform fabric to work with. Some of the advantages of a fabric include linear scaling and under the hood fault tolerance. Unlike with instances, we need not worry about individual hardware or virtual machines going down. The most cited disadvantages are the need to rewrite code and the possibility of vendor lock-in. There is a lot of confusion about the definition and characteristics of the Cloud Fabric but I will not dig into it right now. The above, somewhat simplistic, definition of fabric is good enough for our discussion. Some Cloud fabrics are closely tied to a particular software development platform and/or vendor while others are more general in nature. For example, Google App Engine and Microsoft Azure fit into the former category. With Google App Engine, developers can only develop apps using Python and, now, Java. Both Google App Engine and Microsoft Azure can run only from their vendors’ datacenters. One cannot take them and run them on the Amazon EC2 Cloud or in one’s own private datacenter. However, there are also general purpose Cloud fabrics like Appistry’s CloudIQ Platform, Gigaspaces’ eXtreme Application Platform, Eucalyptus, etc. that are not tied to a particular vendor’s datacenter and can be installed on any public cloud and/or private datacenter. Moreover, they are generally vendor agnostic when it comes to supporting software development platforms. Some of them are released under proprietary licenses and others under Open Source licenses.
Whether it is a startup or an enterprise, everyone strives for zero downtime. When a startup bets its Cloud deployment on an Instance based infrastructure like standalone Amazon instances, GoGrid Servers or Rackspace Cloud Servers, incidents like the recent lightning strike can render its application(s) unavailable. Well, there are ways to fire up instances in other availability zones from backups, but there will still be some downtime. If there is no option of multiple availability zones, the startup will have to wait and watch until the Cloud provider fixes the issue. If, instead, the startup uses a Cloud fabric spanning multiple vendors, such disasters will have no impact because the fabric can manage the whole healing process in a completely transparent manner. The same thing can also be achieved in a properly architected instance based setup, but the Cloud fabric makes it much more seamless to manage such incidents. By selecting a general purpose fabric that supports many software development platforms, it is possible to avoid platform and vendor lock-in. Well, some platform lock-in will always be there in any IT deployment; we cannot avoid it. The moment a particular software platform is selected for developing the apps, we are locking ourselves into that platform. Also, by selecting an Open Source fabric, we can even avoid some of the pitfalls associated with proprietary ones.
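To make the failover idea above a little more concrete, here is a minimal Python sketch of what a fabric (or a carefully architected instance setup) has to do under the hood: keep several copies of the app in different zones and with different vendors, and route to the first one that passes a health check. Everything in it is hypothetical and for illustration only: the `Deployment` class, the `pick_healthy` and `simulate_zone_outage` helpers, and the "provider-a"/"zone-1" names are my own assumptions, not any vendor's API, and a real health check would probe the application over the network.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Deployment:
    """One running copy of the application (hypothetical names throughout)."""
    provider: str  # e.g. "provider-a"
    zone: str      # e.g. "zone-1"


def pick_healthy(
    deployments: Iterable[Deployment],
    is_healthy: Callable[[Deployment], bool],
) -> Optional[Deployment]:
    """Return the first deployment that passes its health check, else None."""
    for deployment in deployments:
        if is_healthy(deployment):
            return deployment
    return None


# Preference order: primary zone, then a second zone, then a second vendor.
DEPLOYMENTS = [
    Deployment("provider-a", "zone-1"),
    Deployment("provider-a", "zone-2"),
    Deployment("provider-b", "zone-1"),
]


def simulate_zone_outage(down: Set[Tuple[str, str]]) -> Optional[Deployment]:
    """Pretend the given (provider, zone) pairs are unreachable and fail over."""
    return pick_healthy(
        DEPLOYMENTS,
        lambda d: (d.provider, d.zone) not in down,
    )
```

If a lightning strike takes out `("provider-a", "zone-1")`, the sketch falls back to the second zone; if the whole of provider-a is down, it falls back to provider-b. With a single standalone instance the list has one entry, and there is nothing to fall back to.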
I am not dismissing the Instance based approach at all. It has its own advantages, but I feel that by selecting a fabric based approach we can eliminate some of the risks that are unavoidable in the Instance approach. Well, this is definitely not the end of the debate, but I thought it was time to revisit it in the light of recent events. I have put forward my arguments in favor of a fabric based approach in this post and I would love to hear from the opposite camp. Feel free to pick my arguments apart and to offer your own technical arguments on this topic. As we always encourage here at Cloud Ave, you are welcome to post your rebuttal as a guest post.






[..] Let us do a brief recap of
how the Cloud is architected at present and then do a complete rethink of
this model to keep such downtime to a bare minimum. Before
describing the nature of Cloud Computing as it exists today, let us dig
back into the history of computing. Until a few decades ago,
computing was done on huge centralized mainframes and
supercomputers, which users accessed through dumb text based terminals.
All the software, peripherals, etc. were part of these huge, centralized,
powerful machines and were centrally managed by dedicated teams. This
centralized client-server model of computing was in vogue for quite
some time before the PC revolution ushered in a new era of distributed
client-server model. This new client-server model saw the federation of
management and offered greater flexibility than the centralized
client-server model. The past few years saw the emergence of Cloud
Computing, which is a much more sophisticated evolution of the centralized
client-server system but is built using large numbers of cheaper x86
systems. Even though the computing resources in the Cloud model appear
to be centralized like the centralized client-server model of mainframe
years, there are some significant differences. In the traditional
mainframe client-server model, the work was split between the server
and the client whereas in the Cloud model, the work is done completely
on the “server” side (I have used the double quotes here to
differentiate it from a single powerful server). In the traditional model,
the server was a single powerful machine like a mainframe or a
supercomputer whereas in the cloud model, the “server” is actually a
server farm with hundreds or thousands of cheap low end x86 machines
that act as a centralized computing resource. Even though the Cloud
model is a much more sophisticated evolution of the previous client-server
models, we are still dealing with a “centralized resource” from a
single vendor. Some of the big vendors use geographically distributed
datacenters and state of the art virtualization technologies or “fabric”
technology to offer high reliability in terms of uptime. However, it is
not the case with all the vendors. Many of them use a single datacenter
and a Cloud like architecture to offer their infrastructure services.
This leads to a single point of failure, like what happened in the case
of Rackspace recently. Even with geo-distributed datacenters, there are partial outages like the recent lightning strike on one of Amazon’s datacenters. [..]