
Lightweight testing for maintainable data science

When I began working in analytics, one of the most miserable tasks I ended up doing was re-running an old Jupyter notebook. Often it failed partway through with some inscrutable error, and figuring out what had gone wrong was a challenge; how was I supposed to remember the details of this particular notebook from five months ago? What's more, the underlying data sometimes stopped getting updated, a column name changed, or the date format in a particular field switched. You may have had similarly frustrating experiences. The good news is that simple techniques from software engineering can make this kind of work dramatically less painful.

As you may have guessed from the title of this article, I'm a big fan of testing. It's easier than you might realize, and it will save you a ton of headaches. For our purposes today, let's consider a machine learning project that consists of three phases: exploratory data analysis and prototyping, model training, and running in production. All three phases can benefit from testing.

One: EDA and prototyping

When exploring the data, we learn a significant amount of information. Here are some examples of questions we might answer: How many rows and columns does the data have? Which columns may contain nulls, and how sparse are they? Which columns are essential to the analysis? What range of values does each numeric column take on?

Too often, we keep the answers to these questions in our heads alone. That is part of what makes it so difficult to go back to an old notebook: the answers have fallen out of our short- and long-term memory by the time we return to it. Fortunately for us, computers have excellent memories! We could, of course, write down each answer directly in our Jupyter notebook, which will help us when we return. Better still, though, is expressing the answers as executable code -- as tests.

When doing initial analysis, I find it cumbersome to even think about running a testing framework inside my notebook. Fortunately, we can get by without one: Python includes the assert keyword, which will do just fine. For example, we might encode the knowledge that our DataFrame should have 8 columns like so:

assert df.shape[1] == 8

This is an improvement over a comment or markdown cell that simply states "DataFrame should have 8 columns" because the computer will actually check this for us each time the notebook is run. And if that condition is not met, we will see an error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-13-ed79b70114d8> in <module>
----> 1 assert df.shape[1] == 8

AssertionError:

In this case, the error may be acceptable as it stands: we can read the condition that was asserted and work backward to the conclusion that our DataFrame should have eight columns. But if we're feeling charitable toward our future self, we can add a message:

assert df.shape[1] == 8, "Expected 8 columns"

which, assuming the condition is not true, will result in this error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-14-18deb3201a98> in <module>
----> 1 assert df.shape[1] == 8, "Expected 8 columns"

AssertionError: Expected 8 columns

Writing an assert statement is a cheap insurance policy against unexpected changes. I highly recommend making assertions about the shape of your dataset, the sparsity of certain columns (assert df['a'].notnull().mean() > 0.9), the existence of particularly important columns (assert 'age' in df), and the range of numeric columns (assert (df['age'] < 0).sum() == 0). As a general rule, if you're making an assumption in your code, consider whether you can express that assumption as an assert statement.
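
To make that concrete, here is a rough sketch of what such a block of assertions might look like near the top of a notebook; the file name, column names, and thresholds are illustrative assumptions, not taken from any particular dataset:

import pandas as pd

df = pd.read_csv('survey_data.csv')  # hypothetical data source

# Shape: we expect exactly 8 columns and at least one row
assert df.shape[1] == 8, "Expected 8 columns"
assert len(df) > 0, "Expected a non-empty DataFrame"

# Existence of particularly important columns
assert 'age' in df, "Expected an 'age' column"

# Sparsity: column 'a' should be at least 90% populated
assert df['a'].notnull().mean() > 0.9, "Too many nulls in column 'a'"

# Range: no human being should have a negative age
assert (df['age'] < 0).sum() == 0, "Found negative ages"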

Two: training script

A common pattern I've seen in machine learning work is to take a Jupyter notebook that contains model-training code and turn it into a Python script (which is more easily run and monitored in certain environments). To do this, I recommend taking chunks of the notebook that each do a discrete unit of work and turning them into standalone functions that the notebook then calls. Specifically, create a .py file in the same directory as the notebook (say, helpers.py), define a new function there, and copy the code from the notebook into that function. Then, back in the notebook, import the function (for example, from helpers import age_range_to_midpoint), delete the original chunk of code from the notebook, and call the function instead.

As an example, suppose our data encodes age as a range ("0-25", "25-40", "40-100"), and we have decided that we want to represent this to our model with the midpoint of the range. Our helpers.py script might contain the following:

def age_range_to_midpoint(age_range):
    """Convert an age range such as '25-40' to the midpoint of the range."""
    endpoints = [e.strip() for e in age_range.split('-')]
    return sum(int(e) for e in endpoints) / len(endpoints)
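
Back in the notebook, the extracted chunk is replaced by an import and a call. A minimal sketch, assuming the raw ranges live in a hypothetical age_range column:

from helpers import age_range_to_midpoint

# The notebook now delegates the logic to the tested helper function
df['age_midpoint'] = df['age_range'].apply(age_range_to_midpoint)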

At this point, I believe it's worth using a testing framework. Python has one built in (unittest), but I love using pytest. As we create functions, we can add tests by defining one or more functions whose names begin with "test_":

def test_age_range():
    assert age_range_to_midpoint('20-30') == 25
    assert age_range_to_midpoint('0-31') == 15.5

Just as the asserts we created during EDA encode information about our data, these tests encode information about how our code works. By the end of this process, we have a tidy file of functions and a notebook that largely calls those functions in a certain order. Turning that notebook into a Python script is now simple, since the complex logic already lives in our helper file.
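
What that script looks like depends on the project, but here is a rough sketch of the shape it tends to take; the file names, feature set, and model choice are placeholder assumptions for illustration only:

# train.py -- a thin script that stitches the helper functions together
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

from helpers import age_range_to_midpoint

df = pd.read_csv('training_data.csv')  # hypothetical input file

# Feature preparation reuses the tested helper
df['age_midpoint'] = df['age_range'].apply(age_range_to_midpoint)

X = df[['age_midpoint']]  # placeholder feature set
y = df['mpg']             # placeholder target

model = LinearRegression().fit(X, y)
joblib.dump(model, 'model.joblib')  # persist the model for the prediction pipeline

The script stays thin because the interesting logic, and the tests that document it, live in helpers.py.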

We can run our tests with a simple command:

% pytest helpers.py
================================== test session starts ===================================
platform darwin -- Python 3.8.5, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /my/cool/project
collected 1 item

helpers.py .                                                                       [100%]

=================================== 1 passed in 0.00s ====================================

These tests enable us to make changes to our code more confidently: we can run them after making changes to confirm we haven't broken anything. Ideally, we can set up some sort of automated process that runs these tests as commits are made (both GitLab and GitHub offer tools that do this).

Further, these tests serve as executable documentation. While it is easy for comments to go stale, tests remain an accurate description of what a function does. If I introduce a change to the way a function works, I must also edit the tests (or else they will fail, and I will be sad). In this way, tests are a far more reliable and accurate kind of documentation than comments.

Three: production

While a thorough treatment of putting a model in production is outside the scope of this article, testing is certainly a part of it. In his book Building Machine Learning Powered Applications, Emmanuel Ameisen coins the term "check" to describe a test that runs in the production prediction pipeline (rather than in a CI/CD pipeline). The same kinds of common sense assert statements you wrote in your Jupyter notebook are also helpful sanity checks in a prediction pipeline.

You should write checks for both inputs and outputs of your model. Is someone passing in a negative value for the age of a human being? Is our model predicting that a car will have a fuel efficiency of over 9,000 miles per gallon? Both of these cases seem unexpected! Depending on the business requirements, we may take a variety of actions. For instance, if our model is predicting a huge value for miles per gallon, we might refuse to make a prediction:

class PredictionError(Exception):
    """Custom exception for implausible model outputs."""

y_pred = model.predict(X)
if y_pred < 0 or y_pred > 100:
    raise PredictionError('Problem predicting mpg for this car')

In other cases, we may be able to use a heuristic:

if y_pred < 0 or y_pred > 100:
    return 30  # fall back to a plausible default value for mpg

Sometimes, we may be able to swap in a simpler model if it's available and more robust. Or we can replace nonsensical feature values with nulls, or impute a value. There are a lot of options here, and you should be careful to choose the right one for your use case. A well-written check prevents a certain class of bug from ever becoming an issue, thereby improving the robustness of the system overall.
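
As a sketch of an input check along those lines (the bounds and fallback behavior here are assumptions made for the sake of the example), nonsensical ages could be treated as missing and left to whatever imputation the pipeline already performs:

import numpy as np

def clean_age(age):
    """Input check: treat impossible ages as missing rather than trusting them."""
    if age is None or age < 0 or age > 120:
        return np.nan  # let the pipeline's existing imputation handle it
    return age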

Go forth and test

Keep in mind how you can introduce testing throughout your process. Whether it's a quick assert statement in a Jupyter notebook, a unit test in a Python script, or a check that runs in production, well-written tests are a gift to your future self and your team. Tests make code less error-prone, easier to debug, and less vulnerable to decay.