Sunday, May 29, 2016

The Complete Privacy & Security Desk Reference Volume I

For most of my life, privacy hasn't really been something I've thought about too much. I've happily given out my name, address, phone numbers, email addresses and other information. In 1993 I proudly made my first personal website, and in the last decade I have reveled in the human digital connections enabled by social media. However, as data scientists we know that our deep learning algorithms and cloud platforms are enabling a new era, where machines can gain unprecedented insights into our everyday lives by mining millions of data points about us. Much of this can be for good, but it can also work against us - for instance when your health insurance doubles in price because the insurance company's algorithms predict that your health is going to go downhill soon, maybe based on your grocery shopping habits, cellphone trail and hypochondriacal web searches of late; or when your credit card information gets leaked in the latest hack.

Since data scientists and data engineers are the people enabling these activities, I believe we as a community need to put as much effort into understanding and mitigating the human and social implications of data science, as we put into our coding and analytics. This has many dimensions, but one of those is understanding the choices we have as individuals about what we do and do not share with the rest of the world, and what access we give to sensitive information such as our credit card numbers.

The Complete Privacy & Security Desk Reference Volume I: Digital is by far the most comprehensive guide I have seen to understanding the privacy and security choices we make in the digital world, and to taking back some control over what gets shared about us. The book covers a multitude of techniques, from basic measures that we should all take - such as setting the privacy settings of browsers and using VPNs - to highly advanced methods such as masking credit card numbers, setting up aliases and keeping your home address completely private, which are probably only realistic if you are a public figure or are unfortunate enough to be threatened by someone. The chapters are helpfully organized into "basic", "intermediate", "advanced" and "expert". Several chapters lead you through a process to find out exactly what information about you is publicly accessible on the internet, and how to have some of it removed if you wish.

The book goes into a lot of detail about each of the topics it covers - for instance which browser you should use (Firefox), and exactly what settings to choose to prevent third-party cookies from tracking you. I have spent the last couple of weeks experimenting with a variety of the methods in the book, including using VOIP phones, VPNs, searching for myself on the internet, and closing a few security and privacy loopholes. What is for sure - and the book is clear about this - is that there is a tradeoff between security, privacy and convenience. If I have any criticism of this book, it would be that once you get started implementing its suggestions it is not clear where to stop, since everything is connected to everything else. Unless you want to live like a secret agent in a foreign country, you're going to have to draw the line somewhere. I am not sure how many of my experiments I will keep, but going through the process I have learned a lot about what digital trail I am leaving, and what choices I have to do something about it.

Overall I would highly recommend the book, as it shows that you have much more control over your digital data than you probably realize, and it gives you tools to help you find the right place for you on the privacy-convenience continuum.

Wednesday, March 2, 2016

Who owns the future? A must-read book

I'm not normally one for posting book reviews - in fact, if I am quite honest, I'm not normally one for reading books. I can just about get through a journal article or a magazine, but my attention span is just too short to stick with something the size of a book. However, occasionally a book grabs my attention on the first page, and goes on to have a real impact on my thinking. Jaron Lanier's Who Owns the Future? is such a book. I think it is a must-read for anyone working with technology or data in the 21st century (i.e. all of us). Unlike most books, which have one idea repeated over and over again, this one has new ideas on every page.

Who Owns the Future? is probably at heart an economics book. It is about big data infrastructure, and specifically what Lanier calls siren servers: hugely powerful cloud computing infrastructures like Amazon's and Google's. These siren servers become monopolies that everything else - people and things - revolves around. They can do this because data is now becoming more important than things - and perhaps even people. The answer to the title of the book then becomes apparent: those who own the future are those who have access to the most powerful computation to leverage the most from data.

How can this be? A good example is in healthcare. If you've been to the doctor recently, you'll have noticed that the nurses and usually the doctors spend more time talking to their laptops than to you. Doctors are arguably becoming data entry clerks - or at best a small part of a computation process that converts patients' symptoms into diagnosis codes and treatment plans. The real value comes from whoever can sum over all of the doctors making all of their decisions in aggregate and optimize accordingly - for example, learning which treatment plans work for which kinds of patients. Perhaps we can even replace the doctor with a machine learning model that learns from tens of thousands of real doctors. The doctors give up their value to the "server", and then the one who owns the server (a large provider network, a health insurance company, or maybe even ultimately Google) reaps the value.

There are many other examples of this around us. One recent example was when Amazon opened a bricks-and-mortar bookstore in Seattle. The bookstore can probably beat Barnes and Noble, because Amazon knows exactly what books people want to buy in that square mile of Seattle; they use customer reviews and ratings (given online for free by all of us!) to guide and add value for customers. Lanier goes on a fascinating journey, questioning whether this is desirable, exploring the economic impacts and the impacts on the value of people, and showing how the issues it raises have been addressed through philosophy. Lanier ultimately recommends a micropayments system, where the value of data is shared among all of us.

If you just read one book this year, I think this should be the one. There are many ways you can use it to impact your thinking. For instance, you could ask: "what would it look like if my company stopped being a [fill in the blank] company, and became a data company?"; you could use it to inform your ethics and the decisions you make in your data science career; or you could use it to position your career for the world ten years from now. But make sure you read it sooner rather than later.




Wednesday, November 18, 2015

How many data scientists does it take to change a light bulb?

How many data scientists does it take to change a light bulb? Answer: 101. A hundred to build a scalable middleware infrastructure for managing multiple applications, fitting types, light bulb technologies and different wattages, and to build a schema or ontology to identify the relationships between these. One to change the light bulb.

This is actually a restatement of the old joke:
Q: How many engineers does it take to change a light bulb?
A: None. They are all too busy trying to design the perfect light bulb.
This is why the shift in thinking from "data warehouse" to "data lake" is important. A data warehouse is usually built in anticipation of all possible problems that a particular collection of data sets could be used for. Completing it is thus impossible - progress just continues until the warehouse gets bogged down in data integration issues, and then the process repeats with the launch of a "new" data warehouse. Using the data lake terminology is really just an admission that the "solving all possible problems in advance" approach is unworkable.

However, there are problems here. Simply throwing a bunch of datasets into a high performance big data platform with a few generic analytics tools does not solve any problems in and of itself. This "anti-warehouse" approach has become known as the data lake fallacy, arguably perpetuated by some infrastructure and analytics tools vendors. The beauty of data lakes is that they define and make accessible a set of resources that can potentially be intelligently selected, integrated and used to solve problems in a particular domain. But just as reservoirs don't become city water supplies without dams, pumping stations and lots of plumbing, data lakes don't become solutions without lots of semantic integration, curation, coding, and analysis. So I think the data lake really means that we make data as accessible, harvestable and reusable as possible, focus on solving the problems at hand - not imaginary future ones - and be prepared to do the hard plumbing work to make these solutions successful.
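To make the "schema on read" idea behind data lakes concrete, here is a tiny sketch in Python (the event records and field names are invented for illustration): raw records land in the lake untouched, and structure is imposed only when a particular question is asked.

```python
import json

# Raw records are stored as-is ("schema on read"); no upfront warehouse schema.
raw_lake = [
    '{"user": "a1", "event": "click", "ts": 1}',
    '{"user": "b2", "event": "purchase", "ts": 2, "amount": 9.99}',
    '{"user": "a1", "event": "purchase", "ts": 3, "amount": 4.50}',
]

def revenue_per_user(records):
    """Apply just enough schema to answer one question: total spend per user."""
    totals = {}
    for line in records:
        rec = json.loads(line)
        if rec.get("event") == "purchase":
            totals[rec["user"]] = totals.get(rec["user"], 0.0) + rec["amount"]
    return totals

print(revenue_per_user(raw_lake))  # {'b2': 9.99, 'a1': 4.5}
```

Note how the "click" record carries no amount field at all, and nothing breaks - the plumbing work of interpreting each record is deferred to the specific analysis that needs it.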

Friday, November 13, 2015

How can I become a data scientist?

I previously described the basic skill set of a data scientist. But how can you go about learning the skills needed to be a data scientist? There are many people looking to get into data science, from undergraduates in computer science, statistics, or other quantitative disciplines to industry professionals looking to reinvent themselves mid-career. One size certainly does not fit all, so here are a variety of ways to learn a basic skill set.

Teaching yourself

It is quite possible to teach yourself many of the "nuts and bolts" data science skills using nothing more than a computer and the many excellent books available on data science topics. You can start by looking at the book list on this site. Self-teaching requires you to be self-motivated and able to navigate to the learning resources that are right for you. Teaching yourself also means that you lack credentials or accreditation, but if you become proficient enough in a skill to put it on your C.V., that might be all that is needed. Further, if you self-teach a subject, you can be more confident going into a class or workshop on that subject at a later stage.

Non-university online courses and certifications

There are now several low-cost, low-barrier-to-entry options for learning data science skills online. You won't go far wrong if you start with Coursera's Data Science and Big Data specializations. These can be taken in their entirety, or you can take individual courses of interest from the specializations. Udacity offers what it calls Nanodegrees, including Data Analyst and Machine Learning Engineer specializations.

Residential workshops and bootcamps

There are several residential options that offer training in a physical location, often on a weekend or in an intense multi-day program. These are generally hosted by a company or a non-profit, with an emphasis on practical skills. Prices vary immensely. Lists of bootcamps and workshops can be found at Skilledup, CourseReport, and ClassCentral.

Formal academic programs

Numerous universities are now offering formal qualifications in data science and related subjects such as data analytics. For a list of current M.S. programs related to data science, take a look at the Masters in Data Science website. Several institutions also offer certificate programs that require fewer courses than a master's (usually four graduate courses), and many of these certificates have options for online instead of residential study. A list of certificate programs can also be found on the Masters in Data Science website.

Some institutions are starting to offer formal online M.S. degrees, including Indiana University (for full disclosure, I am part of this program), Berkeley, Illinois Institute of Technology, Northwestern, and Texas A&M.

Thursday, November 12, 2015

The skills needed to be successful in data science

A question I am often asked is: what is data science, and what skills do I need to be a successful data scientist? For the first question, I have a simple answer:

Data Science is an umbrella term for a set of statistical and computational techniques that suddenly seem very important for the future of the world

It's important to recognize this: data science is not all new stuff. It includes technical capabilities like relational databases and machine learning that have been around for decades. What's new is how these techniques can come together to transform big data into game changers for industry, government, academia, and the ordinary citizen. 

So what are these techniques? What are the skills you need to learn to be a successful data scientist? I categorize them into five "shopping bags" of skills: 


Systems refers to the physical infrastructure necessary to manage big data, and the distributed computing systems necessary to process it. The skill sets you need in this area include: familiarity with cloud computing services, such as Amazon Web Services (AWS); distributed storage and processing using Hadoop, and increasingly Apache Spark; and knowledge of High Performance Computing (HPC) techniques.

Big Data Management refers to the software and strategies of big data management. Many systems still use SQL and relational databases, and knowledge of these is a must for any data scientist; but increasingly new database technologies such as NoSQL (especially MongoDB), semantic databases and graph databases are being used. Important in this area is the emerging concept of the "data lake" as opposed to data warehousing.
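Since SQL is the one must-have here, a quick illustration may help. This sketch uses Python's built-in sqlite3 module with a made-up events table - the kind of aggregate query every data scientist should be able to write without thinking.

```python
import sqlite3

# An in-memory database with a hypothetical events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, event TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("a1", "click", None), ("b2", "purchase", 9.99), ("a1", "purchase", 4.50)],
)

# Filter, aggregate, group: the bread and butter of relational querying.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events "
    "WHERE event = 'purchase' GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('a1', 4.5), ('b2', 9.99)]
```

The same SELECT/WHERE/GROUP BY pattern carries over almost unchanged to the big relational and SQL-on-Hadoop systems used in industry.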

Programming is the glue that brings everything together, be it writing code to manage or transform data, or user-side app development. The most popular languages are Java, C++, and Python.

Analytics is at the core of what most people consider data science. This is about being able to transform data into knowledge, insights and even wisdom. A good statistical training is an absolute must for this area. On top of this, visualization, data mining and machine learning are the most important techniques. Learning R is a good starting point too.
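As a tiny taste of what "transforming data into insight" means (the figures here are made up for illustration), even a least-squares line fit - the simplest statistical model there is - turns raw pairs of numbers into an actionable relationship:

```python
import statistics

# Hypothetical daily ad spend vs. sales figures.
ad_spend = [10, 20, 30, 40, 50]
sales    = [25, 44, 68, 81, 105]

# Ordinary least squares by hand: slope = cov(x, y) / var(x).
mean_x = statistics.mean(ad_spend)
mean_y = statistics.mean(sales)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, sales)) \
        / sum((x - mean_x) ** 2 for x in ad_spend)
intercept = mean_y - slope * mean_x

# The insight: roughly how many extra sales each unit of spend buys.
print(f"sales = {slope:.2f} * spend + {intercept:.2f}")
```

In practice you would reach for R's lm() or a Python statistics library rather than coding this yourself, but knowing what the fit is actually doing is exactly the statistical training this bag is about.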

Human Data Interaction is about how to make analysis "move the needle" in positive and not negative ways for human beings. This bag is all about the strategy of making data science work for the world. It includes areas such as policy, strategy, ethics, security, and application of data science methods in different domains.

The job of the data scientist is to make recipes from the ingredients in these bags. As a data scientist, you don't have to be an expert in all of these areas, but you should have a breadth of skills across them, and a depth in one or two.