Wednesday, October 17, 2012

Data scientist(?)

One of the apparently ‘hot’ jobs these days is the data scientist. So hot, in fact, that the Harvard Business Review has named it the sexiest job of the 21st century. I came across Rachel Schutt’s Data Science class at Columbia (HT: Andrew Gelman), and she has a description of what a data scientist does (or should do):

What is a Data Scientist?
Let me start with academia because that’s quicker. Then industry.
In Academia: No one calls themselves a Data Scientist yet in universities. There are 60 students in my class from across disciplines. I thought when I proposed the course it would be statisticians, applied mathematicians and computer scientists who showed up. Actually it’s them plus sociologists, journalists, political scientists, biomedical informatics students, students from NYC government agencies and non-profits related to social welfare, someone from the architecture school, environmental engineering, pure mathematicians, business marketing students, and students who already work as data scientists. Am I missing someone? They’re all interested in figuring out ways to solve important problems, often of social value, with data.

For the term Data Science to catch on in academia at the level of the faculty, the research area needs to be more formally defined. I see a rich set of problems that could be many PhD theses. My current working definition is a Data Scientist in this setting is a Scientist (from social scientists to biologists) who work with large amounts of data, and must grapple with computational problems posed by the structure, size, messiness and nature of the data, while simultaneously solving a real world problem. Across academic disciplines, the computational and deep data problems are the same. So if researchers across departments join forces, they can solve multiple real-world problems from different domains.

In Industry:
It depends on the level of seniority and whether you’re talking about the internet industry in particular. The role of data scientist need not be exclusive to the tech world, but that’s where the term originated so for the purposes of the conversation, let me say what it means there:

A Chief Data Scientist should be setting the data strategy of the company which involves a variety of things: setting everything up from the engineering and infrastructure for collecting data and logging, to privacy concerns; deciding what data will be user-facing, how data is going to be used to make decisions, and how it’s going to be built back into the product. She should manage a team of engineers, scientists and analysts and she should communicate with leadership across the company including the CEO, CTO and product leadership. She’ll also be concerned with patenting innovative solutions, and setting research goals.

More generally, a data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning and munging data, because data is never clean. This process requires persistence, statistics and software engineering skills – skills that are also necessary for understanding biases in the data, and for debugging logging. Once she gets the data into shape, a crucial part is exploratory data analysis which combines visualization and data sense. She’ll find patterns, build models and algorithms, some with the intention of understanding product usage and the overall health of the product, and others serve as prototypes that ultimately get baked back into the product. She may design experiments, and is a critical part of data-driven decision making. She’ll communicate with team members, engineers, and leadership in clear language and using data visualizations so that even if her colleagues are not immersed in the data themselves, they will understand the implications.

Looking at the syllabus, it sure sounds a lot like data mining. I guess being a scientist beats being a miner.
