I’m in Santa Clara this week, attending O’Reilly’s inaugural Strata Conference. Today, I’m spending the day in the event’s Executive Summit, where I hope to hear some of the ways in which ‘normal’ businesses are approaching the opportunity of making their data work harder.
The notes that follow are a rather raw summary of some of the things I’m hearing. Later in the week, I’ll try to come back and extract the main issues in a rather more polished form.
Trying to frame the conversation; why are we all here? Why is now the time that we’re all beginning to focus on ‘Big Data’? Drawing an analogy with the tar sands in Alberta; lots of oil there, but it’s been expensive to extract.
‘If information is the oil of the 21st century [as Gartner has suggested], then Big Data are the tar sands;’ lots of data, sitting in our data centres, waiting for us to invest in extracting usable data. Expensive, painful, but ultimately valuable.
‘Attack of the exponentials;’ cost of storage, bandwidth and compute falling exponentially. Number of nodes on the network rising exponentially. Intersection creates ‘data singularity.’
Unlike oil, data abundant and renewable. Like oil, extraction of data creates value. Cheaper and easier to extract value from data than ever before.
Three forces reshaping data landscape; sensor networks, cloud computing, and machine learning.
Sensor networks; now prevalent, all-pervasive, ubiquitous, and typically connected. Generating vast amounts of data.
Machine learning; gives us the capabilities to process the flood of data, intelligently. Smart Planet, Smart Grid, Smart Business, etc. Driverless cars, spam filters, recommendation engines all drawing upon Machine Learning ideas. All iterating and improving with increasing rapidity. Unlike Cloud and sensors (which are becoming commodities), machine learning algorithms are – and may remain – a competitive advantage.
Four consequences of all this; battle for finite number of good data scientists, changes in the way that data is published (and valued), the end of privacy (?), the rise of data startups.
Battle for data scientists; difficult to hire people who can munge, interpret, and tell stories with data. And everyone last night at bigdatacamp was ‘hiring’.
Retailers, banks, online publishers, etc have tended to hand over the keys of data management to third parties. Seeing pendulum shift the other way, as companies recognise the value of their data and seek to control it – and realise the value. Tension with ‘open data,’ data to the cloud, etc?
Privacy – not about shifting access to data, but about more accurately defining the ways in which it may be used.
‘What sort of educational background does a data scientist need?’
Knowing some stats helps. Knowing some programming helps. But curiosity is key. Not sure there’s a degree out there. Pick up the skills if you have the right mindset. ‘You have to be a bit of a hacker,’ says Mike.
‘What are the three big problems that data science will solve?’
Making sense of the world around you; Freakonomics, for example. Taking data and making sense of how the world is working. Scaling up decision making, so that a data-powered story can be presented in a way that lets people make intelligent decisions.
Next, Barry Devlin from 9sight Consulting talks about The Data-Driven Business and Other Lessons from History.
‘The old guy who has been brought along to talk about history,’ and ‘illegitimate grandfather of data warehousing.’
Address Past, Present, and Future.
Past – the origins of Data Warehousing
Data Warehouse architecture work at IBM in Europe in the mid-80s.
‘Big Data’ (a couple of hundred MB, at the time) created need to structure the Enterprise Data Warehouse in a particular way. Led to silos. ‘Hard information’ only, at the time. Warehouse designed in a well-architected fashion. Ensures that data flows in a single direction. Possibly too regimented for the 21st century?
Information quality and reliability are key; Master Data Management, etc. This is unlikely to change.
Data volumes and variety have presented big challenges over the years. Expectations and business demands outstrip technological capabilities. Organisational and political issues hamper progress. This is unlikely to change.
Exploration and analysis drive innovation.
Present – Business and technical challenges
3 key trends in business are driving rapid change; closed loop business, massive information volumes, collaboration driving innovation.
Really important to stop talking about ‘unstructured information;’ that’s just noise. Information has structure. Instead, hard information is data; tables, structure, computer-oriented. Meaning and values have been separated. Metadata explicit, and formally modelled. Soft information is not well defined, it is by and for people, it mixes meanings and values. Metadata is implicit, tacit, or non-existent.
Moving from information we understand – and control – to information outside the enterprise that we don’t. Implications for quality, meaning, etc.
Future – a new architecture?
Current architecture is 25 years old – time for a change?
eBay “fascinated with numbers;” early 1999 screenshot of homepage, showing lots of stats.
What role does data play in the business? Using Analytics focussed on big buckets; velocity, efficiency, trust, etc.
Efficiency drive – lower insertion fees to list/sell new products by 99%. Decision based on analysis of data?
Trust – top-rated sellers constantly monitored to ensure algorithm is reflecting reality. 22% of sales from trusted sellers in 2009. Now 32%. Actually surprised it’s not higher…
eBay handled $2Bn of sales on mobile devices last year.
Analytics and continual data analysis drive all the apps, trust metrics, etc.
Some figures… 50 TB/day of new data, etc. Lots of other numbers on slide, but it was only up for seconds…
Analytics work in marketing, sales, product dev, and all areas of the business.
More than 85% of eBay’s analytical workload is new and unknown; design for the unknown. Enable exploration of the data, rather than just reporting of established metrics.
Machine Learning: Data trumps Algorithms. This is the promise of Big Data; existing algorithms get better as you throw more data at them. It’s cheaper to throw more data at an algorithm than to invest in developing new algorithms.
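To make the ‘data trumps algorithms’ point concrete, here is a rough sketch of my own (not anything shown on the slides). It keeps one deliberately simple classifier fixed and only varies how much training data it sees. The two synthetic classes, the nearest-mean ‘algorithm’, and every function name below are assumptions chosen for illustration.

```python
import random
from statistics import mean


def sample(n, rng):
    """Draw n points per class: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1) (made-up data)."""
    class0 = [rng.gauss(-1.0, 1.0) for _ in range(n)]
    class1 = [rng.gauss(1.0, 1.0) for _ in range(n)]
    return class0, class1


def train(class0, class1):
    """The 'algorithm' never changes: it just remembers each class's mean."""
    return mean(class0), mean(class1)


def accuracy(model, class0, class1):
    """Fraction of held-out points assigned to the class with the nearer mean."""
    m0, m1 = model

    def predict(x):
        return 0 if abs(x - m0) <= abs(x - m1) else 1

    hits = sum(predict(x) == 0 for x in class0)
    hits += sum(predict(x) == 1 for x in class1)
    return hits / (len(class0) + len(class1))


def average_accuracy(train_size, trials=200, seed=0):
    """Average held-out accuracy over many training sets of the given size."""
    rng = random.Random(seed)
    test0, test1 = sample(500, rng)  # one fixed held-out set
    return mean(
        accuracy(train(*sample(train_size, rng)), test0, test1)
        for _ in range(trials)
    )


if __name__ == "__main__":
    # Same algorithm every time; only the amount of training data changes.
    for n in (2, 20, 200):
        print(f"{n:>4} training points per class: {average_accuracy(n):.3f}")
```

On this toy problem the average accuracy should climb as the training set grows and then flatten out, since no amount of data separates two overlapping classes perfectly. Real workloads are messier, but the shape of the argument is the same: the code stayed put, and the data did the work.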
Enterprise Data Warehouse for transactional data; purchase history, etc. Behavioural data to track clickstreams, impulse purchases, etc… much larger than the amount of transactional data in traditional systems. No technology silver bullet for the behavioural data; optimise for concurrency, or TCO, or CPU usage, or flexibility, or storage, or governance? Those priorities change the tool you should use.
eBay built a 500-node Hadoop cluster in June 2010. Now they have a much bigger cluster.
Data Marts; ‘a reality for many of us,’ because businesses need to give control to the user. Don’t want the infrastructure team or data scientists as a bottleneck. Totally opposite to the attitude expressed by the previous speaker. But Data Marts end up being very expensive and inefficient. eBay have built a virtual data mart; views onto a single pool of data. Far more efficient, in theory.
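The ‘views onto a single pool of data’ idea can be sketched in a few lines. This is only my illustration of the general pattern, using SQLite from Python’s standard library; it says nothing about how eBay actually built theirs, and the `events` table, the two view names, and the sample rows are all hypothetical.

```python
import sqlite3


def build_pool():
    """Create one shared pool of data plus two 'virtual marts' defined as views."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (user_id INTEGER, kind TEXT, amount REAL);
        INSERT INTO events VALUES
            (1, 'purchase', 20.0),
            (1, 'click', NULL),
            (2, 'purchase', 5.0),
            (2, 'click', NULL),
            (2, 'click', NULL);

        -- Each team gets its own view; no data is copied out of the pool.
        CREATE VIEW finance_mart AS
            SELECT user_id, SUM(amount) AS revenue
            FROM events WHERE kind = 'purchase' GROUP BY user_id;

        CREATE VIEW marketing_mart AS
            SELECT user_id, COUNT(*) AS clicks
            FROM events WHERE kind = 'click' GROUP BY user_id;
        """
    )
    return conn


def read_mart(conn, view):
    """Read a whole mart; the view name must be one of the two defined above."""
    if view not in ("finance_mart", "marketing_mart"):
        raise ValueError(f"unknown mart: {view}")
    return conn.execute(f"SELECT * FROM {view} ORDER BY user_id").fetchall()


if __name__ == "__main__":
    conn = build_pool()
    print(read_mart(conn, "finance_mart"))
    print(read_mart(conn, "marketing_mart"))
    # A new row lands in the pool once and every mart sees it immediately.
    conn.execute("INSERT INTO events VALUES (1, 'purchase', 10.0)")
    print(read_mart(conn, "finance_mart"))
```

The appeal is that each group still gets something shaped for its own questions, while there is only one copy of the data to store, load, and govern. Whether that holds up on a large, busy system is another matter, hence the ‘in theory.’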
And then I had to slip away, and miss the final session before lunch… More later…