Over the last few decades, the technologies we use and the approaches we take to make our systems reliable have undergone a steady evolution. In some cases the technology has simply become more reliable through quality control at the hardware level (consider an Intel blade today compared to the 1986 Zenith 8088 on which I wrote my first automated trading programs. Hard to believe that 8 MHz, two 5.25″ floppies and 512K of RAM was once the best machine money could buy, short of a mainframe. Ahh... the nostalgia... NOT).
For most of the time before the mid-90s we relied on hardware to make our systems reliable. We had mainframes for most things business critical, and toward the latter part of that period Unix machines were starting to be taken seriously by business as well as the scientific community. Regardless of whether you used Tandem NonStop technology, IBM 3X0-series mainframes or Stratus, you relied on the hardware to be fault tolerant and to just stay up. And for the most part it did, but at great cost and with relatively poor price/performance compared to the other platforms that were becoming available.
Coupled with this resilient hardware we typically had two data centers (and sometimes three) with essentially identical hardware for disaster recovery. Two of these centers were usually less than 30 miles apart, and the data was synchronized between them, again using hardware, with technology such as EMC's SAN replication. In fact a lot of systems still do this today where performance and response-time latency are not critical. However, post-9/11 the SEC mandated that financial firms keep their DR sites 300 miles apart, which means this SAN replication approach cannot be used for most new systems, as it is distance limited. Most other countries followed the SEC's lead. (Do you know how hard it is to find sites 300 km apart and still be within Switzerland? Because Swiss data, depending on the data type, can't be stored or transmitted outside of Switzerland, which is something for SaaS vendors to keep in mind. Well, you can't, so we cheat: usually one site in Zurich and one in Lugano, which is as good as you can do.)
By the mid-90s, though, we were starting to use more UNIX machines. Sun SPARC systems, IBM RS/6000s and HP-UX machines were coming on strong. Their hardware was better than a typical Intel desktop of the time, but it still didn't have the nines of uptime that a mainframe had. Now, for stateless applications such as those that were emerging on the web, we could throw an IP sprayer or load balancer, such as the BIG-IP product line from F5, in front of a hot-hot pair and be pretty good to go. This is still the best way to achieve HA for most stateless applications today, but I digress. So in order to assure reliability (and for this era we defined that more as no loss of data than as sheer system uptime), we had to do more with software to provide that reliability.
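The hot-hot idea can be sketched in a few lines. This is a toy illustration of the concept, not the behavior of any particular F5 product; the backend names and the health-check mechanism here are invented for the example. Because the applications are stateless, any live node can serve any request, so a failed node is simply dropped from the rotation:

```python
import random

class LoadBalancer:
    """Toy IP sprayer over a hot-hot pair of stateless backends."""

    def __init__(self, backends):
        self.backends = list(backends)

    def healthy(self):
        # A real load balancer would probe each node periodically;
        # here each backend just reports its own status.
        return [b for b in self.backends if b["up"]]

    def route(self, request):
        live = self.healthy()
        if not live:
            raise RuntimeError("no healthy backends")
        # Stateless apps carry no session affinity, so pick any live node.
        node = random.choice(live)
        return f"{node['name']} handled {request}"

lb = LoadBalancer([{"name": "web-a", "up": True},
                   {"name": "web-b", "up": True}])
print(lb.route("GET /quote"))
lb.backends[0]["up"] = False       # web-a fails...
print(lb.route("GET /quote"))      # ...traffic continues on web-b
```

The point is that availability comes from redundancy plus routing, not from any single machine staying up.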
This software augmentation centered on two primary technologies:
- Messaging middleware such as IBM's MQ, TIBCO EMS and Rendezvous
- Databases such as DB2, Oracle, Sybase and Informix
Well, I won't spend too much time on how we used these technologies ten years ago because, to be honest, it really hasn't changed much since. With the messaging software, we moved from a world in which all inter-process communication happened over a raw socket to one using messaging middleware, which removed the burden of message reliability from the programmer. No longer did we have to implement transactional semantics in every application by hand. We could instead rely on the middleware to make sure the messages got from point A to point B.
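What "the middleware makes sure the message arrives" actually means can be sketched with a toy queue manager. This is not the MQ API, just an illustration of the acknowledge-and-redeliver contract: a message handed to the queue is held until the consumer explicitly acknowledges it, so a consumer that crashes mid-processing gets the message again instead of losing it:

```python
import collections

class PersistentQueue:
    """Toy sketch of guaranteed delivery (stand-in for a durable store)."""

    def __init__(self):
        self._pending = collections.deque()
        self._inflight = {}      # delivered but not yet acknowledged
        self._next_id = 0

    def put(self, msg):
        self._next_id += 1
        self._pending.append((self._next_id, msg))

    def get(self):
        msg_id, msg = self._pending.popleft()
        self._inflight[msg_id] = msg   # not gone until acked
        return msg_id, msg

    def ack(self, msg_id):
        del self._inflight[msg_id]

    def recover(self):
        # Consumer crashed: unacked messages go back to the front,
        # oldest first, so nothing is lost or reordered.
        for msg_id, msg in sorted(self._inflight.items(), reverse=True):
            self._pending.appendleft((msg_id, msg))
        self._inflight.clear()

q = PersistentQueue()
q.put("BUY 100 IBM @ 120.50")
msg_id, msg = q.get()
q.recover()              # consumer died before acknowledging
msg_id, msg = q.get()    # the same trade is redelivered, not lost
q.ack(msg_id)
```

With a raw socket, that crashed consumer would simply have dropped the trade on the floor; the middleware's durable store and ack protocol are what relieve the application programmer of that worry.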
Today IBM MQ handles the messages for virtually every trade of US Treasuries, Eurobonds, stocks, etc. in the world. We can rely on it to deliver messages of any size from one application to another; even if one of the machines goes down and doesn't come back online for weeks, MQ ensures the message gets delivered. (Hopefully, being down for weeks doesn't actually happen in production, but the folks at IBM's Hursley lab do test these things.) Now, I will say that we don't use TIBCO or MQ when low latency and very high throughput are required. There is a new breed of messaging technologies which is preferred there, and I will touch on some of them in coming articles.
With the databases, we moved all of the transactional abilities we knew and loved off the mainframes and onto the distributed platforms. In addition, the database vendors implemented ways to run their databases in a cluster. This meant that if the database server failed, my application would, in theory and after a slight pause, fail over to the backup with no intervention on its part. In practice this took a few missteps to get right, but today it is old hat, and everyone relies on the big commercial databases to be able to do it. Some of the open source databases are not as strong here as their paid-for counterparts, but in time we will probably see this happen there as well.
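From the application's point of view, "no intervention" means the failover lives inside the client driver. The sketch below is a hypothetical simplification (the node names and classes are invented for illustration, and real cluster drivers also replay or reject in-flight transactions, which is omitted here): the connection knows both nodes and quietly retries against the standby when the primary stops answering:

```python
class NodeDown(Exception):
    pass

class DbNode:
    """Stand-in for one database server in the cluster."""

    def __init__(self, name, up=True):
        self.name, self.up = name, up

    def execute(self, sql):
        if not self.up:
            raise NodeDown(self.name)
        return f"{self.name}: ok ({sql})"

class FailoverConnection:
    """Toy client-side failover: try the primary, fall back to the standby."""

    def __init__(self, nodes):
        self.nodes = nodes     # primary first, then standby

    def execute(self, sql):
        last_err = None
        for node in self.nodes:
            try:
                return node.execute(sql)
            except NodeDown as err:
                last_err = err   # the "slight pause": move to the backup
        raise last_err           # whole cluster down

conn = FailoverConnection([DbNode("db-primary"), DbNode("db-standby")])
conn.nodes[0].up = False          # primary fails mid-run
print(conn.execute("SELECT 1"))   # silently served by db-standby
```

The application code just calls `execute` and never learns a failover happened, which is exactly the contract the clustered commercial databases gave us.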
So this brings us pretty much up to today's state of the world (or at least that of a few years ago for a typical enterprise application), in a very Cliff's Notes sort of summary. In the next article we will start a hypothetical design exercise as a way to ground the discussions going forward. This hypothetical will form the basis of the next few articles after it.
(Guest post by Paul Michaud, Global Executive IT Architect for the Financial Markets sector at IBM. Paul blogs @ Technology Musings)