```
.
├── README.md
├── models                <- compiled models in .pkl, HDF5 (.h5), or .pb format
├── config                <- any configuration files
├── data
│   ├── external          <- external data
│   ├── interim           <- data in an intermediate processing stage
│   ├── processed         <- data after all preprocessing has been done
│   └── raw               <- original, unmodified data; the source of truth and provenance
├── docs                  <- usage documentation and reference papers
├── notebooks             <- Jupyter notebooks for exploratory analysis and explanation
├── docker                <- Docker image(s) for running the project inside container(s)
└── src
    ├── data              <- data preparation and/or preprocessing code
    ├── evaluate          <- model evaluation stage code
    ├── pipelines         <- pipeline scripts
    ├── report            <- visualization code (often used in notebooks)
    ├── train             <- model training stage code
    ├── transforms        <- data transformation code (e.g., augmentation)
    └── utils.py          <- auxiliary functions and classes
```
Clone this repository:

```bash
git clone https://github.com/mlrepa/dvc-2-iris-demo-project.git
cd dvc-2-iris-demo-project
```

Download iris.csv:

```bash
wget -P data/raw/ -nc https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
```

This may not work on Windows. In that case, download the file from the URL above and put it into the data/raw/ folder manually.
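Alternatively, a minimal sketch using curl, which ships with Windows 10 and later (same URL as above):

```bash
# Download iris.csv with curl instead of wget
curl -L -o data/raw/iris.csv https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
```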
1) Install DVC:

```bash
pip install dvc
```

See the DVC installation instructions for other ways to install it.
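A quick way to confirm the installation succeeded:

```bash
dvc --version
```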
2) Initialize DVC ONLY if you build the project from scratch. For projects cloned from GitHub, it is already initialized.

Initialize DVC:

```bash
dvc init
```

Commit the DVC init:

```bash
git commit -m "Initialize DVC"
```
3) Add remote storage for DVC (any local folder):

```bash
dvc config cache.type copy
dvc remote add -d default_storage /tmp/dvc-storage
```
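These commands are recorded in .dvc/config. A hedged sketch of what it should now contain (the exact layout can vary between DVC versions):

```bash
cat .dvc/config
# Expected output, approximately:
#   [cache]
#       type = copy
#   ['remote "default_storage"']
#       url = /tmp/dvc-storage
#   [core]
#       remote = default_storage
```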
Set your Git credentials in config/.env (this file is picked up by docker-compose in the Docker steps below):

```
GIT_CONFIG_USER_NAME=<git user>
GIT_CONFIG_EMAIL=<git email>
```

Example:

```
GIT_CONFIG_USER_NAME=mnrozhkov
[email protected]
```
The tutorial should also work outside a Docker container, but this has not been tested.
1) Install Docker and docker-compose tools
The official Docker and docker-compose installation guides may help.
2) Build the docker image:

```bash
ln -sf config/.env && docker-compose build
```

3) Run the docker container via docker-compose:

```bash
docker-compose up
```
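To keep the terminal free, the container can also run in the background; a small sketch using standard docker-compose flags:

```bash
# Start in detached mode and follow the logs
docker-compose up -d
docker-compose logs -f
```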
- run everything in Jupyter notebooks
- i.e., reuse the main functions and classes from src
Pipeline (Python) scripts location: src/pipelines

Main stages:

- prepare_configs.py: loads config/pipeline_config.yml and splits it into configs specific to the following stages
- featurize.py: creates new features
- split_train_test.py: splits the source dataset into train and test sets
- train.py: trains the classifier
- evaluate.py: evaluates the model and creates a metrics file

Building these pipelines also (see the sketch after this list):

- adds pipeline dependencies under DVC control
- adds models/data/configs under DVC control
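Each `dvc run` command below writes a stage file (`*.dvc`) recording the command, its dependencies (`-d`), and its outputs (`-o`); `dvc repro` on a stage file re-executes any upstream stage whose dependencies changed. A hedged sketch for inspecting the full DAG once all five stages exist (the `dvc pipeline` command belongs to the DVC 0.x CLI that this tutorial's `dvc run` syntax implies):

```bash
# Draw the dependency graph of the pipeline in the terminal
dvc pipeline show --ascii stage_evaluate.dvc

# Re-run the whole pipeline from the final stage
dvc repro stage_evaluate.dvc
```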
1) Prepare configs

Run stage:

```bash
dvc run -f stage_prepare_configs.dvc \
    -d src/pipelines/prepare_configs.py \
    -d config/pipeline_config.yml \
    -o experiments/split_train_test_config.yml \
    -o experiments/featurize_config.yml \
    -o experiments/train_config.yml \
    -o experiments/evaluate_config.yml \
    python src/pipelines/prepare_configs.py \
        --config=config/pipeline_config.yml
```

Reproduce stage: `dvc repro stage_prepare_configs.dvc`
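The generated stage file belongs in Git. A minimal sketch (the commit message is just an example; `dvc run` also updates .gitignore for new outputs):

```bash
git add stage_prepare_configs.dvc .gitignore
git commit -m "Add prepare_configs stage"
```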
2) Feature extraction

Run stage:

```bash
dvc run -f stage_featurize.dvc \
    -d src/pipelines/featurize.py \
    -d experiments/featurize_config.yml \
    -d data/raw/iris.csv \
    -o data/interim/featured_iris.csv \
    python src/pipelines/featurize.py \
        --config=experiments/featurize_config.yml
```

This stage:

- creates a new dataset with new features (data/interim/featured_iris.csv)
- generates the stage file stage_featurize.dvc

Reproduce stage: `dvc repro stage_featurize.dvc`
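Outputs declared with `-o` go into the local DVC cache; to copy them to the remote storage configured earlier:

```bash
# Upload cached outputs (e.g. data/interim/featured_iris.csv) to /tmp/dvc-storage
dvc push
```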
3) Split train/test datasets

Run stage:

```bash
dvc run -f stage_split_train_test.dvc \
    -d src/pipelines/split_train_test.py \
    -d experiments/split_train_test_config.yml \
    -d data/interim/featured_iris.csv \
    -o data/processed/train_iris.csv \
    -o data/processed/test_iris.csv \
    python src/pipelines/split_train_test.py \
        --config=experiments/split_train_test_config.yml \
        --base_config=config/pipeline_config.yml
```

This stage:

- creates the csv files train_iris.csv and test_iris.csv in the folder data/processed
- generates the stage file stage_split_train_test.dvc

Reproduce stage: `dvc repro stage_split_train_test.dvc`
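Before reproducing, it can be useful to check which stages are out of date:

```bash
# Lists stages whose dependencies or outputs changed
dvc status
```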
4) Train model

Run stage:

```bash
dvc run -f stage_train.dvc \
    -d src/pipelines/train.py \
    -d experiments/train_config.yml \
    -d data/processed/train_iris.csv \
    -o models/model.joblib \
    python src/pipelines/train.py \
        --config=experiments/train_config.yml \
        --base_config=config/pipeline_config.yml
```

This stage:

- trains the model and saves it to models/model.joblib
- generates the stage file stage_train.dvc

Reproduce stage: `dvc repro stage_train.dvc`
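A typical experiment loop, sketched under the assumption that the model hyperparameters live in config/pipeline_config.yml: change the config, then let DVC re-run only what the change affects:

```bash
# 1) Edit a hyperparameter (any editor works)
nano config/pipeline_config.yml

# 2) Re-run prepare_configs and train only if their dependencies changed
dvc repro stage_train.dvc
```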
5) Evaluate model

Run stage:

```bash
dvc run -f stage_evaluate.dvc \
    -d src/pipelines/evaluate.py \
    -d experiments/evaluate_config.yml \
    -d models/model.joblib \
    -m experiments/eval.txt \
    python src/pipelines/evaluate.py \
        --config=experiments/evaluate_config.yml \
        --base_config=config/pipeline_config.yml
```

This stage:

- evaluates the model
- saves the evaluation report (metrics file experiments/eval.txt)
- generates the stage file stage_evaluate.dvc

Reproduce stage: `dvc repro stage_evaluate.dvc`
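Because experiments/eval.txt is declared with `-m`, DVC tracks it as a metrics file; a sketch for viewing and comparing it (flags as in the DVC 0.x CLI):

```bash
# Show metrics for the current workspace
dvc metrics show

# Compare metrics across all branches and tags
dvc metrics show -a
```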