In the fifth part of this worked example, we discuss how to iterate the explore-refactor cycle and we create a data pipeline that can be run from the command line: our end product.
So far, we have seen a single exploration step and a single refactoring step. The idea of the workflow presented in this tutorial is to cycle through small exploration and refactoring steps, rather than doing one big exploration followed by one big refactoring at the end.
Cycling through small exploration and refactoring steps has several advantages over a single exploration with a final refactoring.
- Cycling through refactoring and exploration keeps the code modular. In turn, the codebase stays smaller and easier to reuse and fix. If we view code as a liability, less code means less liability.
- Modular code lets us build tests that make the codebase robust against change. Writing tests also makes it easier to expose the weaknesses of a system.
- By calling functions and methods from the refactored modular code, analyses contain less code. With less code, analyses can focus on proving a point through argumentation, which is what an analysis is supposed to do. Basing an analysis on a modular codebase also prevents the dangerous tendency of copying and pasting code snippets from older analyses.
- By taking advantage of the modular codebase, new analyses are faster to carry out. For example, a few lines of code calling functions or methods from the codebase may be enough to clean our data or make customised plots. Using refactored and tested code also reduces the risk of reaching a wrong conclusion in an analysis.
- Modular code is faster and easier to simplify and document. Simple, documented code, together with focussed analyses, leads to work that is easier to read, understand, and change, which boosts collaboration.
- By following the explore-refactor cycle, the code is kept in a state that is quick to productionise.
To complete our product, we implement the work into a pipeline and create a command line tool to run the pipeline from the terminal.
First, we create a new Git branch, called data_pipeline.
git checkout -b data_pipeline
Then, we write the code of the pipeline into the module pipelines.py inside the titanic package.
➠ Go to the data pipeline module: pipelines.py
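For reference, a minimal sketch of what pipelines.py might contain is shown below. The helper names and submodules (run_pipeline, clean_data, train_and_evaluate) are hypothetical placeholders for the functions refactored in the earlier parts; the linked module is the authoritative version.

import pandas as pd

# Hypothetical helpers refactored in the earlier parts of this tutorial;
# see the linked pipelines.py for the actual implementation.
from titanic.data import clean_data
from titanic.models import train_and_evaluate


def run_pipeline(filename):
    """Load the raw Titanic data, clean it, and train and score a model."""
    df = pd.read_csv(filename)        # load the raw CSV
    df = clean_data(df)               # apply the refactored cleaning steps
    metrics = train_and_evaluate(df)  # fit the model and compute metrics
    print(metrics)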
To implement the command line tool that runs the pipeline, we use the Click library instead of the standard library's argparse, as Click is more user-friendly. We install it and regenerate requirements.txt, filtering the editable titanic package itself out of the freeze output.
pip install click==7.0
pip freeze | grep -v titanic > requirements.txt
We also add the following lines to setup.py.
...
setup(
    ...
    install_requires=[
        ...
        'click>=7.0'
    ],
    ...
    entry_points='''
        [console_scripts]
        titanic_analysis=titanic.command_line:titanic_analysis
    '''
)
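Note that the console script defined under entry_points is only generated when the package is installed. Assuming the titanic package was installed in editable mode earlier in the tutorial, re-running the editable install from the package root registers the new titanic_analysis command.
pip install -e .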
The command line tool is implemented in the following file.
➠ Go to the command line module: command_line.py
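As a hedged sketch, assuming the hypothetical run_pipeline function from pipelines.py above, the module boils down to a single Click command whose name matches the console_scripts entry point in setup.py.

import click

from titanic.pipelines import run_pipeline  # hypothetical pipeline entry point


@click.command()
@click.option('--filename', required=True, help='Path to the Titanic CSV file.')
def titanic_analysis(filename):
    """Run the Titanic data pipeline from the command line."""
    run_pipeline(filename)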
The tool is run from the terminal using the following command. Note that the virtual environment titanic has to be active to run this command.
titanic_analysis --filename exploration/data/titanic.csv
Finally, we commit the changes, merge the branch into master, and push the result to the GitHub repository.
git add .
git commit -m "Add data pipeline and command line tool to launch it"
git checkout master
git merge data_pipeline
git push
Once a product is ready, most people simply use the Python package without contributing to it. In this case, the package can be conveniently installed with a single command.
pip install -e 'git+https://github.com/<github_account>/titanic_datascience.git#egg=titanic'
For a private repository accessible only through SSH authentication, substitute git+https://github.com with git+ssh://[email protected].
This command installs only the titanic package, as specified in setup.py, and omits other files, like the exploration/ folder or requirements.txt.
In this part of the tutorial we saw how to iterate the explore-refactor cycle, how to create a data pipeline interface accessible through the command line, and how to distribute the product.
Congratulations on completing the tutorial! 🎉
To set up your next project, you can use the Cookiecutter template at the following link.