This is the second and final part of this two-post series on Big Data myths. If you haven’t read the first part, check it out here.
Myth # 2: Big Data is old wine in a new bottle
I hear people say, “Oh, that Big Data, we used to call it BI.” One of the main challenges with legacy BI has been that you pretty much have to know what you’re looking for, based on the limited set of data sources available to you. The so-called “intelligence” is people going around gathering, cleaning, staging, and analyzing data to create pre-canned “reports and dashboards” that answer a few very specific, narrow questions. By the time a question is answered, the answer has lost much of its value. These restrictions stemmed from the fact that computational power was still scarce and the industry lacked sophisticated frameworks and algorithms to actually make sense of data. Traditional BI introduced redundancies at many levels, such as staging and cubes, which in turn reduced the amount of data actually available to analyze. On top of that, there were no self-service tools to do anything meaningful with this data. IT has always been a gatekeeper, and it has always been resource-constrained. A lot of you can relate to this. If you asked IT to analyze clickstream data, you became a laughing stock.
What is different about Big Data is not only that there’s no real need to throw away any kind of data, but also that “enterprise data”, which always got VIP treatment in the old BI world while everything else waited, has lost that elite status. In the world of Big Data, you don’t know which data is valuable and which is not until you actually look at it and do something with it. Every few years the industry reaches some sort of inflection point. In this case, the inflection point is the combination of cheap computing (cloud as well as on-premise appliances) and the emergence of several open, data-centric software frameworks that can leverage this cheap computing.
Traditional BI is a symptom of hardware restrictions and of legacy architecture that cannot use relatively newer data frameworks such as Hadoop and the many others in the current landscape. Unfortunately, retrofitting an existing technology stack may not be that easy if an organization truly wants to reap the benefits of Big Data. In many cases, buying some disruptive technology is nothing more than a line item on many CIOs’ wish lists. I would urge them to think differently. This is not BI 2.0. This is not BI at all as you have known it.
Myth # 1: A data scientist is a glorified data analyst
The role of the data scientist has grown rapidly in popularity. Recently, DJ Patil, a data scientist in residence at Greylock, was featured in Generation Flux by Fast Company. He is the kind of guy you want on your team. I know of quite a few companies that are unable to hire good data scientists despite their willingness to offer above-market compensation. This is also a controversial role, with some people arguing that a data scientist is just a glorified data analyst. This is not true. The data scientist is the human side of Big Data, and the role is real.
If you closely examine the skill sets of people in the traditional BI ecosystem, you’ll recognize that they fall into two main categories: database experts and reporting experts. People either specialize in complicated ETL processes, database schemas, vendor-specific data warehousing tools, SQL, etc., or they specialize in reporting tools, working with the “business” and delivering dashboards, reports, etc. This is a broad generalization, but you get the point. There are two challenges with this set-up: a) people are hired for vendor-specific skills, such as a particular database or reporting tool, and b) they have a shallow mandate to get things done within restrictions that typically lead to silos and the lack of a bigger picture.
The role of a data scientist is not to replace any existing BI people but to complement them. You could expect data scientists to have the following skills:
- A deep understanding of data and data sources, to explore and discover the patterns in how data is being generated.
- A theoretical as well as practical (tool-level) understanding of advanced statistical algorithms and machine learning.
- A strategic connection with the business at all levels, to understand broader as well as deeper business challenges and translate them into experiments with data.
- The ability to design and instrument the environment and applications to generate and gather new data, and to establish an enterprise-wide data strategy, since one of the promises of Big Data is to leave no data behind and to have no silos.
I have seen some enterprises that have a few people with some of these skills, but they are scattered around the company and typically lack high-level visibility and executive buy-in.
Whether data scientists should be domain experts or not is still being debated. I would strongly argue that the primary thing to look for when hiring a data scientist is how they deal with data, with great curiosity and a lot of whys, and not what kind of data they are dealing with. In my opinion, if you ask a domain expert to be a data expert, preconceived biases and assumptions (the curse of knowledge) would hinder discovery. Being naive and curious about a specific domain actually works better, since such people have no preconceived biases and are open to looking for insights in unusual places. Also, looking at data in different domains helps them connect the dots and apply the insights gained in one domain to problems in another.
No company would ever confess that its decisions are not based on hard facts derived from extensive data analysis and discovery. But, as I have often seen, most companies don’t even know that many of their decisions could prove to be completely wrong had they had access to the right data and insights. It’s scary, but that’s the truth. You don’t know what you don’t know. BI never had one human face that we all could point to. Now, in the new world of Big Data, we can. And it’s called a data scientist.