This repository template follows the production-data-science workflow,
which focuses on productionizing a data scientist's work: making analysis or research reusable and applicable to
production.
The workflow is separated into two phases:
- The exploration phase is where data scientists explore the project, working mainly in Jupyter notebooks. All work in this phase is stored in the `exploration` folder.
- The production phase is where the data scientists' work is refactored into packages so it can be reused and imported. All work in this phase is stored in the `your_package` folder.
git clone https://gitlab.com/Phuc_Su/production_data_science_template.git
git clone <your_project_repository>
cd <your_project_name>
git checkout -b product-initial-setup
# open Finder and copy all contents of production_data_science_template into your project repository, except the .git and .idea folders
conda create --name <environment_name> python=3.6
source activate <environment_name>
pip install git-lfs
# in case you want to add some large file extension other than .jpg, .pdf, .csv, .xlsx
git lfs track <add large file path>
# rename the <your package> folder and modify setup.py; the most important field is install_requires. See example below
# write something about your project in README.md
pip install -e .
pip freeze | grep -v <package_name> > requirements.txt
git add .
git commit -m "First commit"
git push -u origin HEAD
Example of setup.py
from setuptools import setup

def readme():
    # Assumed helper (not shown in the original): returns README.md as the long description.
    with open('README.md') as f:
        return f.read()

setup(
    name='your_project',
    version='v0.1',
    description='',
    long_description=readme(),
    classifiers=[
        'Programming Language :: Python :: 3',
    ],
    url='https://github.com/phucnsp/production_data_science_template',
    author='Phuc_Su',
    author_email='',
    license='',
    # in-house packages to install
    packages=['your_package'],
    # third-party packages that the in-house packages depend on
    install_requires=[
        'pypandoc>=1.4',
        'watermark>=1.5.0',
        'pandas>=0.20.3',
        'scikit-learn>=0.19.0',
        'scipy>=0.19.1',
        'matplotlib>=2.1.0',
        'pytest>=3.2.3',
        'pytest-runner>=2.12.1',
        'click>=6.7'
    ],
    setup_requires=['pytest-runner'],
    tests_require=['pytest'],
)
And you are ready! 🎉
Note: if you want to set up Slack notifications for GitLab merge requests, see the reference here.
```bash
conda create --name <environment_name> python=3.6
source activate <environment_name>
git clone <repository url>
cd to/the/project/directory
pip install -r requirements.txt
pip install -e .
```
For a private repository accessible only through SSH authentication, substitute `https://github.com/` with `git@github.com:`.
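The substitution above can be sketched as a small helper. This is only an illustration, assuming the usual GitHub SSH clone form `git@github.com:<owner>/<repo>.git`; the function name `https_to_ssh` is hypothetical and not part of this template.

```python
HTTPS_PREFIX = "https://github.com/"
SSH_PREFIX = "git@github.com:"


def https_to_ssh(url):
    """Rewrite a GitHub HTTPS clone URL into its SSH form."""
    if not url.startswith(HTTPS_PREFIX):
        raise ValueError("not a GitHub HTTPS URL: {}".format(url))
    # Swap only the leading prefix; the owner/repo path is unchanged.
    return SSH_PREFIX + url[len(HTTPS_PREFIX):]


print(https_to_ssh("https://github.com/phucnsp/production_data_science_template.git"))
```

The resulting string can be passed to `git clone` in place of the HTTPS URL.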
- Some rules: one branch and one folder per exploration.
- Branch-name convention: `explore-*` for exploration, `refactor-*` for refactoring.
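The naming convention can be checked mechanically, for example in a pre-push hook. The sketch below is a minimal illustration: the two prefixes come from the convention above, while the allowed characters after the prefix and the name `is_valid_branch` are assumptions.

```python
import re

# Prefixes follow the convention above; the suffix character set is an assumption.
BRANCH_PATTERN = re.compile(r"^(explore|refactor)-[A-Za-z0-9._-]+$")


def is_valid_branch(name):
    """Return True if the branch name follows explore-* or refactor-*."""
    return BRANCH_PATTERN.match(name) is not None


# Hypothetical branch names, for illustration only.
for branch in ["explore-churn-features", "refactor-data-loader", "feature/new-model"]:
    print(branch, is_valid_branch(branch))
```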
git checkout master
git pull --all
# if you continue to work on old branch
git checkout <branch>
# if you want to start a new exploration
git checkout -b <new_branch>
# if your branch is far behind master and you want to merge
git merge master
#####################
# Start working
#####################
git add <path_to_work_files/folder>
git commit -m "some message"
git push -u origin HEAD
`requirements.txt` helps set up your virtual environment and ensures that all contributors work in the same environment. Whenever you install a new library, add it to `requirements.txt` by running `pip freeze | grep -v <package_name> > requirements.txt`.
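To make the `grep -v` step concrete, here is a rough Python equivalent of the filtering it performs. The function name `filter_requirements` and the sample `pip freeze` lines are hypothetical, chosen only for illustration.

```python
def filter_requirements(freeze_lines, package_name):
    """Drop every line that mentions package_name, like `grep -v`."""
    return [line for line in freeze_lines if package_name not in line]


# Hypothetical `pip freeze` output, for illustration only.
frozen = [
    "pandas==0.20.3",
    "-e git+https://github.com/phucnsp/production_data_science_template.git#egg=your_package",
    "scipy==0.19.1",
]
print("\n".join(filter_requirements(frozen, "your_package")))
```

Like `grep -v`, this is a plain substring match, so it would also drop any unrelated dependency whose line happens to contain the package name.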
`setup.py` allows you to create packages that you can redistribute. This script is meant to install your package on the end user's system, not to prepare the development environment.
- `packages` - in-house development packages.
- `install_requires` - packages that our in-house packages depend on.
- `py_modules=['new_module']` - in-house modules to install (placed in the root directory).

`pip install -e .` - installs the packages/modules from `setup.py` in editable mode.
- If you want to add large files to the working repository:
pip install git-lfs
git lfs install
# Tell LFS to track files with given path
git lfs track "path_to_large_file"
# Tell LFS to track files with format "*.jpg"
git lfs track "*.jpg"
# Tell LFS to track content of the whole directory
git lfs track "data/*"
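For context, `git lfs track` records each pattern in a `.gitattributes` file, which should be committed along with the tracked files. The sketch below only mimics the line format it writes for simple patterns without spaces; `lfs_attribute_line` is a hypothetical helper, and in practice it is safer to let `git lfs track` edit the file.

```python
def lfs_attribute_line(pattern):
    """Build the .gitattributes entry that marks a pattern as LFS-tracked."""
    return "{} filter=lfs diff=lfs merge=lfs -text".format(pattern)


for pattern in ["*.jpg", "data/*"]:
    print(lfs_attribute_line(pattern))
```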
Install the library
conda create --name <environment_name> python=3.6
source activate <environment_name>
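# editable VCS installs need a project name; your_project is assumed from the setup.py example
pip install -e 'git+https://github.com/phucnsp/production_data_science_template.git#egg=your_project'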
For a private repository accessible only through SSH authentication, substitute `git+https://github.com` with `git+ssh://git@github.com`.
Note that the `-e` argument above makes the installation editable.