Directory Structure

├── models  <- compiled models in .pkl, HDF5, or .pb format
├── config  <- any configuration files
├── data
│   ├── external <- external data
│   ├── interim <- data in intermediate processing stage
│   ├── processed <- data after all preprocessing has been done
│   └── raw <- original unmodified data acting as source of truth and provenance
├── docs  <- usage documentation or reference papers
├── notebooks <- jupyter notebooks for exploratory analysis and explanation 
├── docker <- docker image(s) for running project inside container(s)
└── src
    ├── data <- data preparation and/or preprocessing
    ├── evaluate <- model evaluation stage code
    ├── pipelines <- pipeline scripts
    ├── report <- visualization (often used in notebooks)
    ├── train <- model training stage code
    ├── transforms <- data transformation code (e.g., augmentation)
    └── <- auxiliary functions and classes
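
When building the project from scratch, the layout above can be created in one go. The following is a minimal sketch, not part of the repository: the helper name create_skeleton is hypothetical, and the unnamed last entry under src/ is skipped because its folder name is not given above.

```python
from pathlib import Path

# Folder names copied from the tree above. The last entry under src/
# (auxiliary functions and classes) is omitted: its name is not given.
PROJECT_DIRS = [
    "models",
    "config",
    "data/external",
    "data/interim",
    "data/processed",
    "data/raw",
    "docs",
    "notebooks",
    "docker",
    "src/data",
    "src/evaluate",
    "src/pipelines",
    "src/report",
    "src/train",
    "src/transforms",
]


def create_skeleton(root: str) -> list[Path]:
    """Create every project folder under `root` and return the paths."""
    created = []
    for rel in PROJECT_DIRS:
        path = Path(root) / rel
        path.mkdir(parents=True, exist_ok=True)  # safe to run repeatedly
        created.append(path)
    return created
```

Calling create_skeleton(".") from an empty project folder would produce the structure; existing folders are left untouched.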


1. Clone this repository

git clone

cd dvc-2-iris-demo-project

2. Get data

Download iris.csv

wget -P data/raw/ -nc

The wget command may not work on Windows. In that case, download the file manually from this link and place it in the data/raw/ folder.
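
Where wget is unavailable, the same download can be sketched with the Python standard library. This is an illustration under assumptions: the function name download_if_missing is hypothetical, and the data URL is not reproduced here, so it is passed in as a parameter.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest_dir: str = "data/raw") -> Path:
    """Download `url` into `dest_dir` unless the file already exists there.

    Mirrors `wget -P data/raw/ -nc <url>`: -P selects the target folder and
    -nc ("no clobber") skips the download when the file is already present.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the file name, as wget does.
    target = dest / url.rstrip("/").rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target
```

The function returns the local path, so a notebook or script can pass it straight to its CSV reader.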

3. Initialize DVC

1) Install DVC: pip install dvc

Link for installation instructions

2) Run dvc init ONLY if you build the project from scratch. For projects cloned from GitHub, DVC is already initialized.

Initialize DVC

dvc init

Commit dvc init

git commit -m "Initialize DVC"

3) Add remote storage for DVC (any local folder)

dvc config cache.type copy
dvc remote add -d default_storage /tmp/dvc-storage

4. Create .env file in config/ folder

GIT_CONFIG_EMAIL=<git email>


[email protected]

Set up Docker tools and build the Docker image

The tutorial should also work outside a Docker container, but this has not been tested.

1) Install Docker and docker-compose tools
Links may help:

2) Build docker image

ln -sf config/.env && docker-compose build


Run docker container via docker-compose

docker-compose up


Step 1: All in Jupyter Notebooks

  • run all in Jupyter Notebooks

Step 2: Move code to .py modules

  • i.e., the main functions and classes

Step 3: Add pipelines (stages) on Python modules

Pipeline (python) scripts location: src/pipelines

Main stages:

  • load config/pipeline_config.yml and split it into stage-specific configs for the subsequent stages

  • create new features

  • split source dataset into train/test

  • train classifier

  • evaluate model and create metrics file
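
The stage commands in this tutorial pass a --config option to each pipeline script. A minimal sketch of the command-line skeleton such a script might share is shown here; parse_args and run_stage are hypothetical names, and the actual scripts in src/pipelines may be organized differently.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command-line options assumed to be common to stage scripts."""
    parser = argparse.ArgumentParser(description="Run one pipeline stage")
    # --config is the option used in this tutorial's dvc run commands.
    parser.add_argument("--config", required=True,
                        help="path to the stage config file")
    return parser.parse_args(argv)


def run_stage(config_path: str) -> str:
    """Placeholder stage body: load the config, do the work, write outputs."""
    return f"stage would run with config: {config_path}"
```

Keeping argument parsing separate from the stage body makes the same function callable from a notebook and from the command line.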

Step 4: Automate pipelines (DAG) execution

  • put pipeline dependencies under DVC control
  • put models/data/configs under DVC control
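
To illustrate why declared dependencies and outputs are enough to automate execution, the sketch below derives a run order from them with a topological sort. It is a conceptual illustration, not DVC's implementation. The file names are the -d, -o, and -m values used in the stages of this step (script paths omitted), and execution_order is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Dependencies and outputs as declared for the stages of this tutorial.
STAGES = {
    "prepare_configs": {
        "deps": ["config/pipeline_config.yml"],
        "outs": [
            "experiments/split_train_test_config.yml",
            "experiments/featurize_config.yml",
            "experiments/train_config.yml",
            "experiments/evaluate_config.yml",
        ],
    },
    "featurize": {
        "deps": ["experiments/featurize_config.yml", "data/raw/iris.csv"],
        "outs": ["data/interim/featured_iris.csv"],
    },
    "split_train_test": {
        "deps": ["experiments/split_train_test_config.yml",
                 "data/interim/featured_iris.csv"],
        "outs": ["data/processed/train_iris.csv",
                 "data/processed/test_iris.csv"],
    },
    "train": {
        "deps": ["experiments/train_config.yml",
                 "data/processed/train_iris.csv"],
        "outs": ["models/model.joblib"],
    },
    "evaluate": {
        "deps": ["experiments/evaluate_config.yml", "models/model.joblib"],
        "outs": ["experiments/eval.txt"],
    },
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so that each one runs after the stages it depends on."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on the producers of its input files (if any).
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

For the stages above, the only valid order is prepare_configs, featurize, split_train_test, train, evaluate, because each stage consumes a file written by the previous one.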

1) Prepare configs

Run stage:

dvc run -f stage_prepare_configs.dvc \
        -d src/pipelines/ \
        -d config/pipeline_config.yml \
        -o experiments/split_train_test_config.yml \
        -o experiments/featurize_config.yml \
        -o experiments/train_config.yml \
        -o experiments/evaluate_config.yml \
        python src/pipelines/ \ 

Reproduce stage: dvc repro stage_prepare_configs.dvc
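
The core of this stage, splitting one pipeline config into four stage configs, might look like the sketch below. The layout of config/pipeline_config.yml is not shown in this tutorial, so the "base" section and the per-stage sections are assumptions, and split_pipeline_config is a hypothetical name. Reading and writing the YAML files (for example with PyYAML) is left out to keep the example dependency-free.

```python
import copy


def split_pipeline_config(pipeline_config: dict) -> dict:
    """Build one config dict per stage from a combined pipeline config.

    Assumption: a top-level "base" section holds shared settings and there
    is one optional section per stage name.
    """
    base = pipeline_config.get("base", {})
    stage_configs = {}
    for stage in ("split_train_test", "featurize", "train", "evaluate"):
        # Shared settings first, so stage-specific keys override them.
        merged = {
            **copy.deepcopy(base),
            **copy.deepcopy(pipeline_config.get(stage, {})),
        }
        stage_configs[f"experiments/{stage}_config.yml"] = merged
    return stage_configs
```

The returned dictionary is keyed by the output paths declared with -o in the command above, so each value can be written to its file in a single loop.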

2) Features extraction

dvc run -f stage_featurize.dvc \
    -d src/pipelines/ \
    -d experiments/featurize_config.yml \
    -d data/raw/iris.csv \
    -o data/interim/featured_iris.csv \
    python src/pipelines/ \

This stage:

  1. creates a new dataset with new features (data/interim/featured_iris.csv)
  2. generates the stage file stage_featurize.dvc

Reproduce stage: dvc repro stage_featurize.dvc
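
As an illustration of what a featurization script could do, the sketch below adds two ratio columns using only the standard library. The column names (sepal_length, sepal_width, petal_length, petal_width) and the ratio features themselves are assumptions; the features actually created by this project are not listed here.

```python
import csv


def featurize(rows: list[dict]) -> list[dict]:
    """Return copies of `rows` with two ratio features added.

    Column names and the chosen features are assumptions for illustration.
    """
    featured = []
    for row in rows:
        new_row = dict(row)  # do not modify the input rows
        new_row["sepal_length_to_sepal_width"] = (
            float(row["sepal_length"]) / float(row["sepal_width"])
        )
        new_row["petal_length_to_petal_width"] = (
            float(row["petal_length"]) / float(row["petal_width"])
        )
        featured.append(new_row)
    return featured


def featurize_csv(src_path: str, dst_path: str) -> None:
    """Read a non-empty raw CSV, add the features, and write the result."""
    with open(src_path, newline="") as src:
        rows = list(csv.DictReader(src))
    featured = featurize(rows)
    with open(dst_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(featured[0].keys()))
        writer.writeheader()
        writer.writerows(featured)
```

With these assumptions, featurize_csv("data/raw/iris.csv", "data/interim/featured_iris.csv") would map the stage's declared dependency to its declared output.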

3) Split train/test datasets

Run stage:

dvc run -f stage_split_train_test.dvc \
    -d src/pipelines/ \
    -d experiments/split_train_test_config.yml \
    -d data/interim/featured_iris.csv \
    -o data/processed/train_iris.csv \
    -o data/processed/test_iris.csv \
    python src/pipelines/ \
        --config=experiments/split_train_test_config.yml \

This stage:

  1. creates the CSV files train_iris.csv and test_iris.csv in the data/processed folder
  2. generates the stage file stage_split_train_test.dvc

Reproduce stage: dvc repro stage_split_train_test.dvc
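
A reproducible split matters for this stage: if the same inputs produced different train/test files on every run, reproducing the stage would invalidate everything downstream. The sketch below shows one seeded split using only the standard library; the function name, the default test_size of 0.2, and the seed are assumptions, and the project's own script may rely on a library routine instead.

```python
import random


def split_train_test(rows: list, test_size: float = 0.2,
                     seed: int = 42) -> tuple[list, list]:
    """Shuffle `rows` reproducibly and split them into (train, test)."""
    shuffled = list(rows)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split every run
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]
```

In practice the seed and test size would come from experiments/split_train_test_config.yml, so that changing them is picked up as a changed dependency.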

4) Train model

Run stage:

dvc run -f stage_train.dvc \
    -d src/pipelines/ \
    -d experiments/train_config.yml \
    -d data/processed/train_iris.csv \
    -o models/model.joblib \
    python src/pipelines/ \
        --config=experiments/train_config.yml \

This stage:

  1. trains and saves the model
  2. generates the stage file stage_train.dvc

Reproduce stage: dvc repro stage_train.dvc
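
To show the train-then-save shape of this stage without extra dependencies, the sketch below swaps in a deliberately simple nearest-centroid classifier and saves it with pickle. This is not the classifier used by the project, whose model is stored as models/model.joblib; all function names here are hypothetical.

```python
import pickle
from collections import defaultdict


def train_centroids(features: list[list[float]], labels: list[str]) -> dict:
    """Fit a nearest-centroid classifier: one mean feature vector per class."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    return {
        label: [sum(col) / len(col) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(model: dict, x: list[float]) -> str:
    """Return the label whose centroid is closest to `x` (squared distance)."""
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, model[label])),
    )


def save_model(model: dict, path: str) -> None:
    """Persist the model with pickle (a stand-in for the joblib file)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)
```

Whatever the model type, the important part for DVC is that the script writes exactly the file declared with -o, so the evaluate stage can depend on it.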

5) Evaluate model

Run stage:

dvc run -f stage_evaluate.dvc \
    -d src/pipelines/ \
    -d experiments/evaluate_config.yml \
    -d models/model.joblib \
    -m experiments/eval.txt \
    python src/pipelines/ \
        --config=experiments/evaluate_config.yml \

This stage:

  1. evaluates the model
  2. saves the evaluation report (metrics file experiments/eval.txt)
  3. generates the stage file stage_evaluate.dvc

Reproduce stage: dvc repro stage_evaluate.dvc
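
The evaluation step boils down to computing a metric and writing it to the metrics file. The sketch below uses accuracy and JSON text purely as assumptions; the metric and file format used by the project's evaluate script are not stated here, and both function names are hypothetical.

```python
import json


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def write_metrics(path: str, metrics: dict) -> None:
    """Write the metrics as JSON text to the metrics file."""
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
```

For example, write_metrics("experiments/eval.txt", {"accuracy": accuracy(y_true, y_pred)}) would produce the file declared with -m in the command above.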

References used for this tutorial

  1. DVC tutorial
  2. 100 - Logistic Regression with IRIS and pytorch


Data Version Control (DVC) tutorial 2. Iris Demo Project






