Data science needs drudges

Quality data science outputs depend on quality inputs. Data cleansing and preparation may not be exciting work, but they are critical.

Contributor, InfoWorld


Data scientist may be one of the sexiest jobs of our century, as Harvard Business Review opines, but it sure does involve a lot of unsexy, manual labor. According to Anaconda’s 2021 State of Data Science survey, respondents said they spend “39% of their time on data prep and data cleansing, which is more than the time spent on model training, model selection, and deploying models combined.”

Data scientist? More like data janitor.

Not that there’s anything wrong with that. In fact, there’s much that is right with it. For years we’ve oversold the glamorous side of data science (build models that cure cancer!) while overlooking the simple reality that much of the job is cleaning and preparing data, and that this work is fundamental to doing data science well. As consultant Aaron Zhu notes, “Any statistical analysis and machine learning models can be as good as the quality of the data you feed into them.”

Someone’s got to get their hands dirty

Whether that is positive or negative, the time spent on data wrangling (data prep and cleaning) seems to be declining. Although data scientists today report that they spend 39% of their time on data wrangling, the same Anaconda survey put that number at 45% a year earlier. Just a few years ago, by some estimates, the number might have been closer to 80%.

Such sky-high estimates were almost certainly incorrect, as Leigh Dodds of the Open Data Institute has argued. Worse, he insists, by demeaning the act of data wrangling we misunderstand the value of that wrangling. “I would argue that spending time working with data to transform, explore, and understand it better is absolutely what data scientists should be doing. This is the medium they are working in. Understand the material better and you’ll get better insights.”

In other words, while we might want to focus on data science outputs, we can’t do so effectively if we’ve overlooked the inputs. Garbage in, garbage out.

The people part of data science

For as long as we’ve been talking about data science and its ancestor “big data,” we’ve wrung our hands about machines obviating the need for people. This is true for data science as a category, but also for data wrangling as an input to that category.

It’s tempting to think that we can simply automate all of this data prep—how much thought can go into cleaning up data, after all? But the reality is that although some data work can be automated, it is ultimately a human task. Why? Data wrangling is a “critical part of the analytical process,” as suggested by Tim Stobierski, a contributing writer for Harvard Business School Online. It requires someone who can “understand what clean data looks like and how to shape raw data into usable forms.” For example, during the discovery phase of data wrangling, you need someone who can see gaps in the data as well as patterns.

Or, as noted in the Anaconda 2021 report, “While data preparation and data cleansing are time-consuming and potentially tedious, automation is not the solution. Instead, having a human in the mix ensures data quality, more accurate results, and provides context for the data.”

This has always been the case. In the early days of big data, we imagined a world in which we could just throw data at Apache Hadoop and out would pop “actionable insights.” However, life doesn’t work that way, and neither does data science. As I wrote back in 2014, ultimately data science is a matter of people. “Those who do data science well blend statistical, mathematical, and programming skills with domain knowledge.” That domain knowledge enables human creativity with data. The more familiar people are with their business, the better they can prepare its data for modeling, and the more likely they are to intuit insights from patterns and anomalies.

Domain knowledge also should help with the eventual output of data science models. According to the Anaconda report, only “36% of people said their organization’s decision-makers are very data literate and understand the stories told by visualizations and models. In comparison, 52% described their organization’s decision-makers as mostly data literate but needing some coaching on the stories told by visualizations and models.” Well, that may partly be a problem with the recipients of the models/visualizations, but it also arguably has to do with the data scientists preparing them. Greater familiarity with their domains should enable them to more clearly articulate how their machine learning models describe what the business can learn from its data.

Again, that domain knowledge doesn’t first become useful when the data scientist is on the final sprint to the boardroom with the models. Its value starts early, in the not-so-lowly task of data wrangling that is the foundation for all good data science. We should celebrate that work, not deprecate it.


Matt Asay runs developer relations at MongoDB. The views expressed herein are Matt’s and do not reflect those of his employer.