Wednesday, November 18, 2015

How many data scientists does it take to change a light bulb?

How many data scientists does it take to change a light bulb? Answer: 101. A hundred to build a scalable middleware infrastructure for managing multiple applications, fitting types, light bulb technologies and different wattages, and to build a schema or ontology to identify the relationships between these. One to change the light bulb.

This is actually a restatement of the old joke:
Q: How many engineers does it take to change a light bulb?
A: None. They are all too busy trying to design the perfect light bulb.

This is why the shift in thinking from "data warehouse" to "data lake" is important. A data warehouse is usually built in anticipation of all possible problems that a particular collection of data sets could be used to solve. Completing it is thus impossible: progress continues until the warehouse gets bogged down in data integration issues, and then the process repeats with the launch of a "new" data warehouse. Using the data lake terminology is really just an admission that the "solving all possible problems in advance" approach is unworkable.

However, there are problems here. Simply throwing a bunch of datasets into a high-performance big data platform with a few generic analytics tools does not solve any problems in and of itself. This "anti-warehouse" approach has become known as the data lake fallacy, arguably perpetuated by some infrastructure and analytics tools vendors. The beauty of data lakes is that they define and make accessible a set of resources that can potentially be intelligently selected, integrated and used to solve problems in a particular domain. But just as reservoirs don't become city water supplies without dams, pumping stations and lots of plumbing, data lakes don't become solutions without lots of semantic integration, curation, coding, and analysis. So I think the data lake really means that we make data as accessible, harvestable and reusable as possible, focus on solving the problems at hand rather than imaginary future ones, and are prepared to do the hard plumbing work to make these solutions successful.

Friday, November 13, 2015

How can I become a data scientist?

I previously described the basic skill set of a data scientist. But how can you go about learning the skills needed to be a data scientist? There are many people looking to get into data science, from undergraduates in computer science, statistics, or other quantitative disciplines to industry professionals looking to reinvent themselves mid-career. One size certainly does not fit all, so here are a variety of ways to learn a basic skill set.

Teaching yourself

It is quite possible to teach yourself many of the "nuts and bolts" data science skills using nothing more than a computer and the many excellent books available on data science topics. You can start by looking at the book list on this site. Self-teaching requires you to be self-motivated and able to find the resources that are right for you. Teaching yourself means that you lack credentialing or accreditation, but if you become proficient enough in a skill to put it on your C.V., that might be all that is needed. Further, if you self-teach a subject, you can be more confident going into a class or a workshop on the subject at a later stage.

Non-university online courses and certifications

There are now several low-cost, low-barrier options for learning data science skills online. You won't go far wrong if you start with Coursera's Data Science and Big Data specializations. These can be taken in their entirety, or you can take individual courses of interest from the specializations. Udacity offers what they call Nanodegrees, including Data Analyst and Machine Learning Engineer specializations.

Residential workshops and bootcamps

There are several residential options that offer training in a physical location, often on a weekend or in an intense multi-day program. These are generally hosted by either a company or a non-profit, with an emphasis on practical skills. Prices vary immensely. Lists of bootcamps and workshops can be found at Skilledup, CourseReport, and ClassCentral.

Formal academic programs

Numerous universities are now offering formal qualifications in data science and related subjects such as data analytics. For a list of current M.S. programs related to data science, take a look at the Masters in Data Science website. Several institutions also offer certificate programs that require fewer courses than a master's degree, usually four graduate courses, and many of these certificates can be taken online instead of in residence. A list of certificate programs can also be found on the Masters in Data Science website.

Some institutions are starting to offer formal online M.S. degrees, including Indiana University (for full disclosure, I am part of this program), Berkeley, Illinois Institute of Technology, Northwestern, and Texas A&M.

Thursday, November 12, 2015

The skills needed to be successful in data science

A question I am often asked is: what is data science, and what skills do I need to be a successful data scientist? For the first question, I have a simple answer:

Data Science is an umbrella term for a set of statistical and computational techniques that suddenly seem very important for the future of the world

It's important to recognize this: data science is not all new stuff. It includes technical capabilities like relational databases and machine learning that have been around for decades. What's new is how these techniques can come together to transform big data into game changers for industry, government, academia, and the ordinary citizen. 

So what are these techniques? What are the skills you need to learn to be a successful data scientist? I categorize them into five "shopping bags" of skills: 


Systems refers to the physical infrastructure necessary to manage big data, and the distributed computing systems necessary to process big data. The skill sets you need in this area include: familiarity with cloud computing services, such as Amazon Web Services (AWS); distributed storage and processing using Hadoop and, increasingly, Apache Spark; and knowledge of High Performance Computing (HPC) techniques.

Big Data Management refers to the software and strategies of big data management. Many systems still use SQL and relational databases, and knowledge of these is a must for any data scientist; but increasingly new database technologies such as NoSQL (especially MongoDB), semantic databases and graph databases are being used. Important in this area is the emerging concept of the "data lake" as opposed to data warehousing.

Programming is the glue that brings everything together, be it writing code to manage or transform data, or user-side app development. The most popular languages are Java, C++, and Python.

Analytics is at the core of what most people consider data science. This is about being able to transform data into knowledge, insights and even wisdom. Good statistical training is an absolute must for this area. On top of this, visualization, data mining and machine learning are the most important techniques. Learning the R language is a good starting point too.

Human Data Interaction is about how to make analysis "move the needle" in positive and not negative ways for human beings. This bag is all about the strategy of making data science work for the world. It includes areas such as policy, strategy, ethics, security, and application of data science methods in different domains.

The job of the data scientist is to make recipes from the ingredients in these bags. As a data scientist, you don't have to be an expert in all these areas, but you should have a breadth of skills across them, and depth in one or two.