In the fourth part of this worked example, we refactor the exploratory code from the Jupyter Notebook into the Titanic package that we created in part A - Setup.
The code in the Jupyter Notebook that we saw in the previous part of this tutorial is written as a script. The idea of refactoring for exploration is to restructure code and text so that analyses are easier to understand and further exploration becomes faster. For this, it is useful to introduce the concept of the code-to-text ratio.
Since the main purpose of an exploratory analysis is to prove a point rather than to show code, refactoring for exploration should aim at reducing the code-to-text ratio, within reason. In this way, a notebook looks more like a document that uses words (and plots) to reason and prove a point.
However, by this criterion alone, we might as well increase the number of words just to decrease the code-to-text ratio. This leads to longer documents that are both text and code heavy and, in turn, harder to read. A better solution is to simplify both code and text, while keeping the code-to-text ratio reasonably low.
I borrowed the idea of the code-to-text ratio from the data-ink ratio, introduced by Edward Tufte in The Visual Display of Quantitative Information as a quantity to maximise in order to convey information through graphics more effectively.
Related to the code-to-text ratio is Donald Knuth's paradigm of literate programming, which encourages programmers to write for people rather than for computers.
Note that if an analysis leads to no useful results and develops no useful tools, it may not be worth refactoring at all.
Following the principle above, a possible workflow for refactoring notebooks is the following.
- Repeated code can be moved into functions that are called multiple times, reducing the amount of code in the notebook and therefore improving readability (see the sketch below).
- Functions that become widely used in the analysis can be moved into Python modules located in the same directory as the exploratory analysis. This further reduces the code in the notebook and allows the functions to be called from other notebooks or scripts in the same folder as the analysis.
- Functions that become particularly important can be made more robust by writing unit tests, as explained in the section below.
The same workflow is easily adapted to IDEs and text editors.
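As a minimal sketch of the first two steps (the column names and the survival-rate computation are assumptions made for illustration, not the actual notebook code), a repeated groupby snippet can be wrapped in a function and later moved to a module:

```python
import pandas as pd

def survival_rate_by(df, column):
    """Return the mean survival rate for each value of `column`.

    Column names are illustrative; adapt them to the actual dataset.
    """
    return df.groupby(column)["Survived"].mean()

# The notebook then calls the function instead of repeating the groupby:
# survival_rate_by(train, "Sex")
# survival_rate_by(train, "Pclass")
```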
Once an exploratory analysis has taken a definite direction, it is useful to refactor the parts of the code that are going into production, for example the functions and methods that will form data pipelines.
Refactoring for production is a field covering many areas, such as readability, code complexity, code architecture and testing. To keep the tutorial easy to follow, the code was already made readable in part C - Explore, and the Titanic toy problem keeps the complexity and architecture simple. So, in this section, we will focus just on testing.
Since refactoring data science code for production is closer to software development than refactoring for exploration, we can rely more on standard testing methodologies. Moreover, data science broadly involves data preprocessing and predictive modelling, and testing is done differently for each.
In data processing, the general idea for writing tests, sketched below, is to:
- Create an input dataset with peculiar cases
- Create the output dataset that we expect from processing the input dataset
- Compare the processed input dataset with the expected output dataset
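A minimal sketch of such a test, assuming a hypothetical fill_missing_age function in titanic/data.py that replaces missing ages with the median age:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

from titanic.data import fill_missing_age  # hypothetical function, for illustration only

def test_fill_missing_age():
    # Input dataset with a peculiar case: a missing age
    input_df = pd.DataFrame({"Age": [22.0, None, 30.0]})
    # Output dataset that we expect from processing the input dataset
    expected_df = pd.DataFrame({"Age": [22.0, 26.0, 30.0]})
    # Compare the processed input dataset with the expected output dataset
    assert_frame_equal(fill_missing_age(input_df), expected_df)
```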
In predictive modelling, the situation is similar if we take the model predictions as the output and, if random processes are involved, fix the random seed. However, if we would like to improve the model along the way, we must allow the output to change; instead of testing the exact output, we test properties of the output. This kind of testing is analogous to validation testing in software development, where tests check that systems meet given requirements. In our example, we may require that the logistic regression model performs at least as well as the majority vote classifier.
Note that to run this validation test we need to store some data. For this example, we store the entire dataset, as it is small; for large datasets, you can store a smaller sample to use just for validation. Because this validation data is used only for tests, we keep it in a folder called validation_data/ inside the tests/ folder.
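A hedged sketch of such a validation test (the file name, feature columns and train/test split are assumptions for illustration; the actual tests are linked further below):

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def test_model_beats_majority_vote():
    # Load the data stored under tests/validation_data/ (file name is assumed)
    df = pd.read_csv("tests/validation_data/titanic.csv")
    X = df[["Pclass", "Fare"]].fillna(0)  # assumed feature columns
    y = df["Survived"]
    # Fix the random seed so the split is reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    model = LogisticRegression().fit(X_train, y_train)

    # Require that the model performs at least as well as the majority vote classifier
    assert model.score(X_test, y_test) >= majority.score(X_test, y_test)
```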
If more extensive validation tools are required, there are some useful choices for Python:
- Hypothesis — A package to create unit tests which are simpler to write and more powerful when run, finding edge cases in your code you wouldn’t have thought to look for
- Engarde — A package for defensive data analysis
- TDDA — A package for test-driven data analysis
- Faker — A package to generate fake data
- Feature Forge — A package that provides some help with the boilerplate of defining features and helps you test them
In this section, we refactor some of the notebook code into the titanic package for production. We do this by creating the titanic/data.py and titanic/models.py modules, which hold the functions for data processing and for predictive modelling respectively. This modular approach is motivated by the Single Responsibility Principle, which states that each piece of code should be focused on a single task with a limited scope. Some reasons behind this principle are that modular files are easier to maintain, discourage the use of global variables, and encourage the use of variables with narrow scopes and of input and output parameters.
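As an illustration of this split (the function names below are placeholders, not the actual contents of the modules, which are linked later in this section):

```python
# titanic/data.py -- functions for data processing (illustrative names)
def fill_missing_age(df, column="Age"):
    """Return a copy of df with missing values in `column` filled with the median."""
    result = df.copy()
    result[column] = result[column].fillna(result[column].median())
    return result


# titanic/models.py -- functions for predictive modelling (illustrative names)
from sklearn.linear_model import LogisticRegression

def train_logistic_regression(X, y):
    """Fit and return a logistic regression classifier."""
    return LogisticRegression().fit(X, y)
```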
So far, we have used the print function to display messages. The print function is fine during the exploration phase. However, when productionising code, it is good practice to use the logging module. The Python logging module allows us to handle messages flexibly, making it easy to redirect them to the console or to log files, and to change their display format. In this tutorial, we limit ourselves to messages displayed on the screen, as this project is simple. When a project becomes more complex, it is useful to write log messages to files.
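A minimal sketch of how print calls can be replaced with logging (the message format is just one possible choice):

```python
import logging

# Send messages to the screen and include the time and severity level
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
logger = logging.getLogger(__name__)

# Instead of print("Training the logistic regression model")
logger.info("Training the logistic regression model")
```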
We will also use the NumPy docstring format, as it is more readable than the standard Python reStructuredText format.
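For example, a function documented in the NumPy format looks like this (the function itself is only an illustration):

```python
def accuracy(y_true, y_pred):
    """Compute the fraction of correct predictions.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        True labels.
    y_pred : array-like of shape (n_samples,)
        Predicted labels.

    Returns
    -------
    float
        Fraction of predictions that match the true labels.
    """
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```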
When refactoring, keep in mind that code is read much more often than it is written.
In particular, these four actions help:
- Use explicit variable names
- Write docstrings
- Comment your code
- Avoid over-commenting: needing too many comments may be a sign that the code itself should be improved
Let us start by creating a branch for the refactoring explained above.
git checkout -b refactor_passenger_survival
On this branch, refactor the code as shown in the following files:
- ➠ Refactored Jupyter Notebook (Search for the "REFACTORED" keyword.)
- ➠ Data manipulation module: data.py
- ➠ Predictive models module: models.py
For these functions, we also create unit tests in the tests/ folder using PyTest, as this library is more user-friendly than the standard unittest library.
mkdir tests/
pip install pytest==4.3.1 pytest-runner==4.4
Add the following content to setup.py:
...
setup(
...
install_requires=[
...
'pytest>=4.3.1',
'pytest-runner>=4.4',
],
setup_requires=['pytest-runner'],
tests_require=['pytest'],
)
To tell Python to use PyTest for testing, create the configuration file setup.cfg with the following content.
[aliases]
test=pytest
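With this alias in place, pytest-runner also lets us run the tests through setuptools:
python setup.py test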
To see the tests for the functions in the data.py and models.py modules, click on the following links.
- ➠ Tests for the data manipulation module: test_data.py
- ➠ Tests for the predictive modelling module: test_models.py
To run the tests, you can use the following command.
python -m pytest
Finally, we commit the changes, merge the refactor_passenger_survival branch into the master branch, and push the content to the GitHub repository.
git add .
git commit -m "Refactor exploratory analysis of passenger survival predictions using ridge logistic regression"
git checkout master
git merge refactor_passenger_survival
git push
In this part of the tutorial we saw how to refactor an exploratory analysis. In the next part, we will discuss how to iterate exploration and refactoring to obtain a product.