Recently a blogger wrote an article comparing mailing list interaction in the communities around major open source infrastructure projects. It is a personal project by a blogger using various data sources available on the internet. But the post kickstarted a discussion among the punditry about whether OpenStack or CloudStack is the top-ranking infrastructure project, with each party pointing to the metrics convenient to them. Even though the blogger's intentions were completely different, the cloud chatterati (or clouderati or pundits or whatever term you want to use here) are going wild analyzing these metrics. Not to be left out, I thought I would jump in and offer my analysis of this analysis.
What are the usual metrics used to study open source projects, and why do they matter?
Though different folks use different metrics to make the point they want to make, the most popular among them are:
- Software downloads
- Mailing list activity
- Production deployment claims
- Conference/PR case studies
- GitHub activity
- Google Trends
- Job board metrics
There are other metrics too, but these are the most widely used in any discussion or debate. In this post, I will offer my thoughts on some of these metrics and see if the discussion eventually leads to a more comprehensive metric that can be used to measure the health of an open source project.
To begin with, I want to dismiss two of the metrics right away as they are completely meaningless: software download numbers and mailing list activity (the metric behind the current brouhaha). Even though software downloads give some idea about the interest in the project, they don’t give any granular information, such as how many of the downloads are for use in production systems, how many have led to installations, how many of those installations are actually being used, etc. Without this insight, the number of software downloads means nothing. Similarly, activity on the mailing list is another meaningless metric. Even though mailing lists are like oxygen for open source projects, the activity on these lists can range from serious discussions to unnecessary flame wars. In fact, you can see many mailing lists where flame wars are a daily affair. Measuring the volume of activity doesn’t tell you anything about the project’s health. If activity on the mailing list were an accurate indication of a project’s health, we could conclude that OpenStack was healthiest during the board elections fiasco. #justsayin.
There are some suggestions that metrics on production deployments can accurately describe the health of an open source project. Yes, they do describe it accurately from the user's point of view, but they are still not a good metric. First, it is difficult to get the numbers uniformly for all open source projects. Many end users may not be willing to talk publicly about how they are using an open source project. Second, even if we assume that we can get the numbers accurately, they only offer information on the usage of the project and not on developer activity. In order to understand the actual health of an open source project, we need data on both the usage front and the developer contribution front. Not only is this metric difficult to obtain, it also gives only partial information about a project. The same argument applies to conference/PR case studies too.
GitHub activity, including pull requests, commits, number of lines of code, etc., is a good indicator of developer activity, but it doesn’t say anything about the actual usage of the software. Though it is possible to make an indirect observation about actual usage based on developer activity (why would developers spend their time if there is no traction for the software?), it still doesn’t give accurate information on the use of the software. So although GitHub activity is easier to obtain, it gives only partial information on the health of the project. Google Trends offers some interesting insight into the interest in a particular project, but that is still partial information. Job board metrics (like the number of job openings on sites like indeed.com) are a very good signal that can offer insight into the actual usage of the software, but they are still an indirect measure.
If someone is interested in getting a grip on the health of an open source project, it is important that they take into account many relevant metrics so that they can build an accurate story covering all the bases. Talking about the health of a project based on a single metric is meaningless. It is definitely a waste of time to talk about the health of a project based on metrics like the number of software downloads and mailing list activity. #justsayin.
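For the code-minded, here is a minimal Python sketch of what "covering all the bases" could look like. Everything in it is an assumption made for illustration: the `ProjectMetrics` fields, the idea that each signal has already been normalized to a 0..1 scale, and the plain averaging are all hypothetical choices, not an established formula. The only point it tries to capture is the argument above: you need at least one developer-side signal and at least one usage-side signal before saying anything about health, and downloads and mailing list volume are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ProjectMetrics:
    # Hypothetical signals, each assumed to be pre-normalized to 0..1.
    # None means the signal could not be obtained for this project.
    github_activity: Optional[float] = None         # developer side
    production_deployments: Optional[float] = None  # usage side
    job_postings: Optional[float] = None            # usage side (indirect)


def health_summary(m: ProjectMetrics) -> Optional[Dict[str, float]]:
    """Return developer and usage averages, or None if either side is missing."""
    developer = [v for v in (m.github_activity,) if v is not None]
    usage = [v for v in (m.production_deployments, m.job_postings) if v is not None]

    # A single-sided picture is exactly the partial story to avoid.
    if not developer or not usage:
        return None

    # Report both sides separately instead of collapsing them into one number.
    return {
        "developer": sum(developer) / len(developer),
        "usage": sum(usage) / len(usage),
    }
```

Calling `health_summary(ProjectMetrics(github_activity=0.9))` returns `None`, because developer activity alone says nothing about usage. The function also returns two numbers rather than one score on purpose, since a single combined figure would just be another single metric.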