Over the past few days I have been having some issues with my Twitter account. Beyond the well-known pauses in service, outages, etc., there are some less-known but more annoying problems with Twitter search. It turns out that many accounts don't show up in search at all. If you are one of those lucky accounts, no one other than your direct followers can see your tweets, and no one can find you or any of your Tweets, which makes the account pretty much useless. It also turns out this has been a known issue for over a year, with no fix other than to create a new account and tweet with that. Well, my account was one such account, which needless to say was very annoying and cost me two days of my time trying to figure out a viable workaround. As a result, Twitter earned the place of honor in today's blog.
Now, in defense of Evan Williams, Biz Stone and the rest of the gang at Twitter, they find themselves in the enviable position of having a hugely successful product on their hands, one which has no doubt outpaced their wildest growth projections over the past few years and thus put stress on their design and everything else. I, on the other hand, have the advantage of 20/20 hindsight, so in this blog we can design Twitter on Steroids from scratch using technology that was not even available when Twitter was conceived. I know the team at Twitter is busting their butts to keep up with their phenomenal growth, and my hat's off to them for their success.
So for those who have not read my Bio, I have been designing and building ultra high performance systems for the world's largest banks and stock exchanges for about 25 years. Just this June, a couple of colleagues and I designed and ran a stock exchange prototype system capable of 4.5 million transactions per second, with round-trip response times as low as 15 microseconds (yes, that's microseconds for multiple network hops, I/O, parsing, matching and the whole shebang, everything the NYSE does to tell you that you just bought 100 shares of IBM). We also showed that this system scales linearly for throughput by adding hardware, is fully fault tolerant, and can do dynamic load balancing if traffic at the exchange spikes. In this design I will be leveraging the lessons learned over those 25 years and the technologies used for the system above.
So let's dive in.
So what does our Twitter on Steroids need to do? Here is my overly simplistic list of requirements (I am only going to deal with the big ones):
The system shall allow users to create accounts.
The system must provide a means for users to submit Tweets.
The system must persist those Tweets.
Users shall be able to follow other users' Tweets.
The system shall provide a mechanism to search Tweets.
The system shall be highly responsive.
The system shall maintain response times even under load.
The system shall be highly scalable.
The system shall be highly available, with 99.999% or better uptime (it's doable).
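For a sense of what that last requirement means in practice, here is a quick back-of-envelope calculation (my own illustration, not from any product documentation): "five nines" availability leaves only a few minutes of downtime per year.

```c
/* Allowed downtime per year at a given availability level.
 * 99.999% works out to roughly 5.3 minutes per year. */
static double downtime_minutes_per_year(double availability)
{
    const double minutes_per_year = 365.25 * 24.0 * 60.0;
    return (1.0 - availability) * minutes_per_year;
}
```

That is why the design below leans so heavily on hot/hot replication: at five nines there is no time budget for manual recovery.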
Where Do We Start?
First some design principles:
- We will use a componentized SOA design
- The Twitter Web Site will use the same Service API that is exposed publicly
- The System will use a Hot/Hot High Availability Model based on component replication for reliability
- All Service Components will be implemented in a manner that ensures deterministic behaviour (the easiest way to do that, though not the only way, is to make them single threaded, which is what we do for most exchange systems. Thread context switches are expensive at speed, and multithreading can result in coherency issues, which Twitter seems to be suffering from based on the comments on their support site)
- To the maximum extent possible all I/O, remote Service calls, etc will be asynchronous
- All internal communication will be message based using multicast for efficiency
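To make the determinism principle concrete, here is a minimal sketch of a single-threaded component loop. This is purely illustrative (the types and function names are mine, not LLM's API): the point is that all state changes happen in one thread, in message arrival order, so a backup replica fed the same message stream reaches exactly the same state.

```c
/* Illustrative single-threaded, message-driven component. */
typedef struct { char body[64]; } msg_t;

static long tweets_seen;            /* the component's state */

static void on_message(const msg_t *m)
{
    /* All state mutation happens here, one message at a time. */
    (void)m;
    tweets_seen++;
}

/* The event loop: messages are applied strictly in arrival order,
 * so replaying the same stream always yields the same final state --
 * the property that keeps primary and backup replicas in lockstep. */
static long run_component(const msg_t *queue, int n)
{
    for (int i = 0; i < n; i++)
        on_message(&queue[i]);
    return tweets_seen;
}
```

No locks, no context switches, and no coherency questions: the message order *is* the execution order.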
About the Technology
I don’t normally like to reference specific technologies in my blog, but in this case I am going to, as there are a couple which provide unique capabilities for implementing this system design and with which people are probably not familiar. Apologies in advance for the product plug. They are as follows:
Websphere MQ Low Latency Messaging (LLM):
LLM is a unique high performance messaging product with purpose-built capabilities specifically designed for ultra high throughput, low latency, transactional systems. For one, it’s the fastest messaging available on the market, capable of throughput in excess of 9 million Tweets per second per connection, and application-to-application latency across a switch as low as 3 microseconds with Infiniband networking and about 12 microseconds with 10GbE.
More important than its speed for this type of application, though, are its high availability mechanisms. LLM provides a unique mechanism that allows me to deliver messages to a primary and secondary Service Component at speed while maintaining total order across all receivers. In addition, it provides mechanisms to perform failure detection, failover, state synchronization and component replication, all at speed. In exchange systems, LLM has detected a failure and failed over from a primary system to a backup in as little as 7 milliseconds, with no loss or duplication of messages and no system-level downtime even though a component failed.
Websphere DataPower XM70:
This is an appliance that was originally designed for Web Service and Web edge security. This model is specifically enabled to work with LLM above. It will allow us to expose REST or SOAP based services and convert them to message-based traffic for internal consumption. The XM70 can also do content based routing, parsing and transformation for us on the fly at wire speed, taking load off the back end Service Components.
This is a low cost storage appliance with great throughput and reliability. I have been able to sustain write speeds to it in excess of 5.5 Gb per second per Intel box.
For the rest we can use pretty commodity stuff. The disk above can also easily be swapped for your preferred flavour; this one just has great price performance.
What Does Twitter on Steroids Look Like?
My version of Twitter on Steroids would look like this (except I didn’t have room on the drawing to add the Account Management Service Components or the Follower Service Components, so just imagine they are in the diagram and follow the same pattern):
Twitter on Steroids
So let’s walk through this diagram.
- Firstly, we are using BIG-IP to load balance across the Web Servers and also across the Datapower appliances. This is pretty standard Web design, no surprises. The BIG-IP could also do this to a remote backup site if configured correctly, where we could twin this setup for failover or load balancing, or we could put the Instance 2’s in the second site. It just depends on the SLAs you are trying to meet. The logical design and coding would not change regardless.
- The Web Servers are making calls through the Datapower to the back end Service Components just like the external API calls. This ensures consistent behaviour and reduces the need to test and maintain two APIs.
- Datapower is converting all REST and SOAP payloads into messages on top of LLM
- This is important. Datapower is multicasting all messages out of the appliance using LLM’s high availability mechanisms. It is also putting those messages on different topics based on the content of the message. I am suggesting partitioning the incoming Tweets based on the first few letters of the Tweeter’s ID. The first two letters will do to start, giving us 676 topics to work with for load balancing. We can add more topics for finer partitioning later if need be.
- LLM is delivering the messages throughout the system and also providing all the reliability. It handles NAKs and ACKs automatically, retransmissions, etc., to assure messages get where they need to be without any additional work by the application.
- Tweets are first picked up by the Tweet Capture Service Components. Each partition subscribes to and handles a subset of the topics in order to provide load balancing. It is possible to add an external system which monitors load per topic and dynamically changes the subscriptions to adjust load. Also, by partitioning, we can use multiple databases in parallel, thus eliminating the databases as a throughput bottleneck.
- I/O in the Tweet Capture Service Components is asynchronous, providing very fast response times. We can batch-write the Tweets for higher throughput, and because we do component replication using LLM, if the primary Instance 1 fails, Instance 2 just takes over where it left off with no loss of messages or duplication.
- The Tweet Capture components rebroadcast (multicast) all messages to the Tweet Indexing Service Components. These are also twinned for high availability and partitioned for scalability. The indexing component does as the name says: it indexes the Tweets and stores a record in a database. I would recommend an in-memory database be used with a traditional database behind it, with bi-directional synchronization of current data between the two. SolidDB/DB2 is one pair, or possibly TimesTen/Oracle is another (but the latter pair is slower). I/O would be batched and asynchronous, again for speed.
- When a search request comes in, it would be routed by Datapower to the Search Service Components, which would then query the Indexing Service and receive back the matching records for each keyword in the search. A fast parallel algorithm would then be used to handle any “or” and “and” clauses in the search.
- These results would be returned to the caller via the Datapower box as a response to the original service call.
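The partitioning scheme described above (the first two letters of the Tweeter's ID selecting one of 26 × 26 = 676 topics) can be sketched in a few lines. This is my own illustration of the idea; the fold-to-'a' handling of non-letter characters is an assumption, since real Twitter IDs also allow digits and underscores, which would need their own buckets.

```c
#include <ctype.h>

/* Map a Tweeter's ID to one of 26*26 = 676 topics using its first
 * two letters. Non-letter or missing characters fold to bucket 0
 * (an assumption for this sketch). */
static int topic_for_user(const char *id)
{
    int a = 0, b = 0;
    if (id[0] != '\0') {
        if (isalpha((unsigned char)id[0]))
            a = tolower((unsigned char)id[0]) - 'a';
        if (id[1] != '\0' && isalpha((unsigned char)id[1]))
            b = tolower((unsigned char)id[1]) - 'a';
    }
    return a * 26 + b;   /* topic index 0 .. 675 */
}
```

Every component that needs to route a Tweet (Datapower's content-based routing, the Capture components' subscriptions) would apply the same mapping, so a given user's Tweets always land on the same topic.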
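The batched writes in the Tweet Capture step can also be sketched. This is illustrative only (buffer sizes and function names are mine): Tweets accumulate in memory and are flushed as one physical write when the batch fills, which is where much of the throughput comes from; `flush_batch` stands in for the real asynchronous I/O layer.

```c
#include <string.h>

/* Illustrative write batching for the Tweet Capture component. */
#define BATCH_SIZE 4

static char batch[BATCH_SIZE][141];   /* 140-char Tweets + NUL */
static int  batch_count;
static int  writes_issued;            /* physical writes performed */

static void flush_batch(void)
{
    if (batch_count == 0)
        return;
    writes_issued++;                  /* one write covers the whole batch */
    batch_count = 0;
}

static void capture_tweet(const char *text)
{
    strncpy(batch[batch_count], text, 140);
    batch[batch_count][140] = '\0';
    if (++batch_count == BATCH_SIZE)
        flush_batch();
}
```

With a batch size of 4, ten captured Tweets cost only three physical writes instead of ten; real systems would also flush on a short timer so a half-full batch never waits long.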
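The "and" handling in the search step amounts to intersecting posting lists: if each keyword's index lookup returns a sorted list of Tweet IDs, the AND of two keywords is the intersection of the two lists, computable in a single linear merge pass. A minimal sketch of that merge (my illustration, not any particular product's algorithm):

```c
#include <stddef.h>

/* Intersect two sorted posting lists of Tweet IDs.
 * Writes common IDs to out and returns how many were found.
 * Runs in O(na + nb) -- one pass over each list. */
static size_t intersect(const long *a, size_t na,
                        const long *b, size_t nb,
                        long *out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else {
            out[n++] = a[i];      /* ID appears in both lists */
            i++;
            j++;
        }
    }
    return n;
}
```

"Or" is the analogous sorted-merge union, and since the lists are independent per keyword, the lookups themselves can run in parallel before the merge, which is what keeps the search latency estimate below in the low hundreds of microseconds.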
So how fast would this be and how big could it scale?
Well, this is just a guess based on my experience, without ever having looked at any specific search algorithms that might be used by Twitter. Let’s assume we write everything in C behind the Datapower for speed and stability, and that we use 1GbE for networking, which is the slowest option at about 27 microseconds per hop. All latencies are round trip to and from the Datapower box.
- I think for Tweet Capture we could achieve round-trip latency per Tweet of about 50-60 microseconds, with throughput per partition somewhere in the 100,000-200,000 Tweets per second range if using a fast database and some solid state disk for the database log files, etc. Even higher if a custom binary file system is used (15,000,000+ Tweets per second, which we have done with stock orders of similar message size)
- Per-partition performance for the Tweet Indexing should be similar to that of the Capture
- For Tweet Search it’s a bit tougher to gauge, but I would guess about 100-150 microseconds per search depending on the algorithm used. Throughput should also be well into the hundreds of thousands per second if in-memory databases are used.
- Response times could be reduced by as much as 25 microseconds per network hop by using Infiniband networking instead of 1GbE
- From a scaling perspective, this should be able to scale linearly by adding hardware, almost without limit (limited only by the available network bandwidth)
Now clearly, this is a simplified case and I am sure there are lots of design details we are missing, but I think you get the idea. A bigger, badder Twitter (or any other app for that matter) is definitely possible, and by using the SOA pattern, async I/O, component replication, etc., we can do this to almost anything. So if anyone from Twitter (or anyone else for that matter) wants to talk specifics or other examples, feel free to leave a comment or reach me on Twitter (@techmusings) any time.
Sorry for picking on Twitter; they just seemed like a good example given my struggles. We all wish we had the “problems” that come with such a huge success.
(Guest post by Paul Michaud, Global Executive IT Architect for the Financial Markets sector at IBM. Paul blogs @ Technology Musings)