I had coffee this morning with Anthony Goldbloom, Australian CEO of Kaggle. The company describes itself as “a platform for data prediction competitions,” and seeks to solve big problems by hosting competitions that match data owners who have a problem with (professional and amateur) data scientists who have the time, creativity and skills to crack it.
Whilst some of the site’s competitions carry sizeable prize funds (the Heritage Health Prize will pay $3 Million to the lucky winner), others either award more modest sums or offer little more than kudos and the warm feeling of having made the world a better place. Goldbloom suggests that an interesting data set or a compelling problem will often appeal more to competition entrants than the amount of money on offer. He goes on to stress, though, that large prize funds can be important in demonstrating just how serious data owners are in seeking viable solutions.
Goldbloom argues that ‘so many problems today are data problems,’ with effective data collection, sampling and analysis enabling companies to unlock a wealth of insight. Further, ‘there is a mismatch between those with the data and those best skilled to analyse it,’ with few companies able to recruit data scientists with the requisite mindset and skills. In academia, researchers and students are constantly seeking large real-world data sets on which to learn – and hone – their skills.
With Kaggle, data owners tap a deep pool of talent, and typically benefit as their data is subjected to a wide range of analytical techniques. Goldbloom claims that Kaggle’s competitions have all been well subscribed (tens or hundreds of entrants, typically), and that winning entries always beat the ‘default threshold’ set by the best efforts of traditional in-house approaches. Interestingly, winning entries tend not to come from statisticians, computer scientists and the like, but from physicists, electrical engineers and others for whom a tangible result outweighs the purity of an algorithm.
Established with cash from the Kaggle team itself, the company generates revenue by offering consultancy services to those competition hosts ‘that need the help.’ It is possible to create, administer, run, and award prizes in a competition without paying Kaggle a penny, but Goldbloom believes that there are enough organisations in need of a little hand-holding to continue covering Kaggle’s running costs. To support further growth, the company is exploring additional avenues such as a new careers site through which successful competition entrants might find employment.
With more big competitions to come, and a relocation to the Bay Area, there is plenty to keep the team busy. Goldbloom hopes to attract over 100,000 entries for the $3 Million Heritage Health Prize when it opens in the next few months, beating the 50,000 drawn to Netflix’s (non-Kaggle) $1 Million Prize. He is also keen to ensure that the really important problems aren’t swamped by those that happen to carry big prize funds. A new kaggle.org is on the cards, specifically intended to showcase scientific problems in healthcare, astronomy, and other fields without the budget for huge payouts.
There are also real opportunities to increase the feedback between data scientists and the domain experts with the knowledge to explain and interpret results. These real-world challenges are not simply academic exercises. The results have implications, and the implementation of findings may have consequences. The Heritage Health Prize, for example, may well bring real benefits by saving money and by highlighting the need for potentially life-saving preventative treatments. Whilst this is to be welcomed, the challenge is to ensure that less scrupulous health providers are unable to carry things a stage further and refuse care to those who need it most. No competition will solve that; only informed debate and enlightened public policy can.
(Cross-posted @ Paul Miller – The Cloud of Data)