Wednesday, November 18, 2015

How many data scientists does it take to change a light bulb?

How many data scientists does it take to change a light bulb? Answer: 101. A hundred to build a scalable middleware infrastructure for managing multiple applications, fitting types, light bulb technologies and different wattages, and to build a schema or ontology to identify the relationships between these. One to change the light bulb.

This is actually a restatement of the old joke:
Q: How many engineers does it take to change a light bulb?
A: None. They are all too busy trying to design the perfect light bulb.
This is why the shift in thinking from "data warehouse" to "data lake" is important. A data warehouse is usually built in anticipation of every possible problem that a particular collection of data sets could be used to solve. Completing it is thus impossible: progress continues until the warehouse gets bogged down in data integration issues, and then the process repeats with the launch of a "new" data warehouse. Using the data lake terminology is really just an admission that the "solving all possible problems in advance" approach is unworkable.

However, there are problems here. Simply throwing a bunch of datasets into a high-performance big data platform with a few generic analytics tools does not solve any problems in and of itself. This "anti-warehouse" approach has become known as the data lake fallacy, arguably perpetuated by some infrastructure and analytics tools vendors. The beauty of data lakes is that they define and make accessible a set of resources that can potentially be intelligently selected, integrated and used to solve problems in a particular domain. But just as reservoirs don't become city water supplies without dams, pumping stations and lots of plumbing, data lakes don't become solutions without lots of semantic integration, curation, coding, and analysis. So I think the data lake really means that we should make data as accessible, harvestable and reusable as possible, focus on solving the problems at hand rather than imaginary future ones, and be prepared to do the hard plumbing work to make these solutions successful.
