In the second part of this tutorial, we explain how multiple people can collaborate on the project.
Note that all the commands in this tutorial are run from the root folder of the project.
So that other people can reproduce the same environment that we are using, we create the file requirements.txt containing the list of packages in our virtual environment.
pip freeze | grep -v titanic > requirements.txt
The grep -v titanic command omits the local package titanic to avoid errors when installing packages from the requirements file, as we will see in section Contributing.
We also add pypandoc, the package we installed in part A - Setup, as a minimal package requirement in setup.py.
...
setup(
    ...
    install_requires=[
        'pypandoc>=1.4'
    ]
)
We will use these additions to reproduce the working environment in section Contributing. Before that, we need to share our project.
To allow other people access to the work, we share it on GitHub, using the Git command line tools.
Since we want to add only the necessary files to this repository, we list the files to omit in a .gitignore file. The files to omit are, for example, files created in the project folder when we installed the titanic package. A list of such files for Python has already been compiled by other people, so we can simply copy it into our project folder.
curl -o .gitignore https://raw.githubusercontent.com/github/gitignore/master/Python.gitignore
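To check that the downloaded rules behave as expected, git check-ignore reports which pattern matches a given path. A quick self-contained demonstration, run in a throwaway repository with two rules copied from Python.gitignore (the .pyc file name is a hypothetical example):

```shell
# Throwaway repository with two of the rules from Python.gitignore
cd "$(mktemp -d)"
git init -q
printf '__pycache__/\n*.py[cod]\n' > .gitignore

# Ask Git which rule, if any, causes a given path to be ignored
git check-ignore -v titanic/__pycache__/model.cpython-36.pyc
```

The path does not need to exist: check-ignore matches it against the patterns and prints the rule that applies.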
Note that operating-system-specific files should be omitted at the global level, using the command below that matches your operating system.
# Unix
curl -o $HOME/.gitignore_global https://raw.githubusercontent.com/github/gitignore/master/Global/Linux.gitignore
# Mac
curl -o $HOME/.gitignore_global https://raw.githubusercontent.com/github/gitignore/master/Global/macOS.gitignore
# Windows
curl -o $HOME/.gitignore_global https://raw.githubusercontent.com/github/gitignore/master/Global/Windows.gitignore
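Downloading the file alone is not enough: Git only consults it once the core.excludesfile setting points to it.

```shell
# Tell Git to use the global ignore file in every repository
git config --global core.excludesfile "$HOME/.gitignore_global"
```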
Next, we set up a new repository, call it titanic_datascience, and push the content we created into it by following the official GitHub guide:
➠ Creating a new GitHub repository
It is important to learn Git well, otherwise it is easy to mess up a repository. You may start with these resources and later take a proper course, such as the free course How to Use Version Control in Git & GitHub.
After these additions, the project structure, found at the top of the page, becomes the following.
📁 exploration/
📁 data/
📄 titanic.csv
📁 titanic/
📄 __init__.py
📄 .gitignore
📄 README.md
📄 requirements.txt
📄 setup.py
If other people would like to contribute to the project, they need to clone the repository and reproduce the working environment.
git clone <git-repository-url> # Download the repository from GitHub
cd titanic_datascience
mkvirtualenv --python=python3 titanic_datascience # Create empty virtual environment
pip install -r requirements.txt # Install packages listed in requirements.txt
pip install -e . # Install the titanic package in development mode
There are two ways in which we can contribute to a data science project aimed at production: we can explore the data through a data science analysis or refactor analyses into production.
In part A - Setup, we separated the files related to exploration and production by introducing the exploration folder to store analyses, and the titanic package to store the production code. To extend the separation of concerns of exploration and productionisation into how people contribute, we introduce the Git branching workflow exemplified in the following diagram.
In this workflow, we dedicate the Product branch (blue) to production code, the Explore branch (green) to exploratory analyses and the Refactor branch (orange) to refactoring exploratory analyses into production.
Let us walk through the example diagram, starting from the left. Explore branch A is created from the Product branch. Exploratory work is carried out and committed; commits are denoted by coloured circles. Once the exploratory work in branch A is over, the branch is merged into the Product branch. Refactor branch a is created, used to refactor the exploratory work carried out in branch A, and merged into the Product branch. At the same time, Explore branch B is created to carry out a new, different analysis, which is later refactored in Refactor branch b. While work is carried out in branch B, another Explore branch, C, is created, and so on.
For the Product branch we use the master branch, from which we branch out Explore and Refactor branches. To differentiate Explore and Refactor branches, we use the "explore" keyword for Explore branches, in the format explore_<name>.
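The workflow translates into ordinary Git commands. A self-contained sketch in a throwaway repository (the analysis name survival_rates is a hypothetical example, and the refactor_ prefix for Refactor branches is an assumption of ours, not fixed by the workflow):

```shell
# Throwaway repository standing in for the project; master is the Product branch
cd "$(mktemp -d)"
git init -q
git checkout -q -b master
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "Product branch"

# Explore branch A: carry out and commit exploratory work
git checkout -q -b explore_survival_rates
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "Exploratory analysis"

# Once the exploration is over, merge it into the Product branch
git checkout -q master
git merge -q explore_survival_rates

# Refactor branch a: refactor the analysis into the titanic package
git checkout -q -b refactor_survival_rates
```

In the real project, the commits would of course contain the notebooks and the refactored package code rather than being empty.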
A Refactor branch can effectively be viewed as a feature branch in the Feature Branch workflow for software development, with new features coming from insights and code from exploratory analyses. In this view, we can think of adding branches for exploratory analyses, with dedicated folders, to the Feature Branch workflow. In the same way, it is possible to extend other Git workflows, such as the Gitflow and Forking workflows, by introducing branches dedicated to exploration. As a consequence, software developers feel at home when using these workflows extended for exploration.
Now that we have a strategy to collaborate, we proceed to the next part of the tutorial, where we will do some exploratory data analysis.