
DVC

Updated at 2020-02-03 16:15

DVC is a command-line tool that keeps large files out of your Git repository while still letting you reference them. You integrate it with e.g. AWS S3 or Azure Blob Storage to host the shared training data, features and resulting artifacts. The benefits of this approach are explored below.

You can go back to old datasets with Git commands that rewind the reference files (*.dvc files) to older versions, as long as the old files still exist in the configured remote storage.
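
For example, restoring an old dataset is just a Git checkout of the reference file followed by a DVC checkout (a minimal sketch; abc1234 is a placeholder commit):

git checkout abc1234 -- data/data.xml.dvc  # rewind just the reference file
dvc checkout data/data.xml.dvc             # restore matching data from the local cache
dvc pull data/data.xml.dvc                 # or download it from the remote if not cached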

In a nutshell:

git pull   # or otherwise go to specific version of your project
dvc pull   # if properly configured, this will download files used locally
dvc run ... python train.py  # run something and record the results
dvc push   # sends local files to remote storage
git add .
git push   # save code changes, allowing others to do `dvc pull` after `git pull`

# for more details, check out the usage examples down below

DVC will create meta-files (*.dvc) for:

  • all datasets and artifacts relating to the project
  • each dvc run you execute, recording the inputs and outputs of the command
  • metric files that record results from commands; these are saved to Git as-is

What DVC does not offer:

  • Orchestration: You need to handle your own orchestration and scheduling if you are not just running on a single computer. Data scientists frequently want access to scalable cloud resources like NVIDIA Tesla GPUs, and DVC doesn't help with that.
  • Dependency Management: You need discipline just to keep your requirements.txt in check, not to mention lower-level dependencies like the binaries you require, the operating system you support, the hardware you rely on, etc.
  • Remote Worker Processing: DVC pipeline steps all run on the single machine the pipeline was started on. This is frequently far from a real-world machine learning pipeline of moderate complexity, where different steps have different requirements. You can use DVC to manage datasets on your workers, but that requires another layer of abstraction on top of DVC.

Usage

https://dvc.org/doc/get-started

# create virtual env "dvc"...
echo 'dvc[all]' > requirements.in
pip install pip-tools
pip-compile requirements.in
grep -o '^[^#]*' requirements.txt | wc -l  # installs 84 packages, ugh
pip install -r requirements.txt
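
If you only need one storage backend, a lighter install is possible; a sketch using only the S3 extras instead of everything:

echo 'dvc[s3]' > requirements.in  # only the S3 dependencies
pip-compile requirements.in
pip install -r requirements.txt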

# how to create a new project with DVC
mkdir my-test
cd my-test
git init
dvc init
git commit -m 'Create a DVC project'
# now there is a new directory .dvc/ with a config file and tmp/

# you can use DVC with various "remotes"
# these are basically different data storages like S3 or a local directory

# create a new local remote and add it as the default
dvc remote add -d mylocal /tmp/dvc-test-remote
git add .
git commit -m 'Add local remote'

dvc config core.remote # "mylocal", our default remote
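
A cloud remote works the same way; a sketch with a hypothetical bucket name, requiring the S3 extras and AWS credentials:

dvc remote add -d s3remote s3://my-bucket/dvc-cache  # my-bucket is a placeholder
git add .dvc/config
git commit -m 'Add S3 remote'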

# download a file from a DVC project (from a public project)
mkdir data
dvc get \
  https://github.com/iterative/dataset-registry \
  get-started/data.xml \
  -o data/data.xml

# 1) replace the actual file with a *.dvc placeholder that goes to Git
# 2) add the actual file to .gitignore
dvc add data/data.xml
git add .
git commit -m 'Add raw data'

cat data/data.xml.dvc
# md5: 301598c8348f8ac0c95abc6fc19da952
# outs:
# - md5: a304afb96060aad90176268345e10355
#   path: data.xml
#   cache: true
#   metric: false
#   persist: false

# you can push DVC-managed files to your remotes
dvc push
ls -lh /tmp/dvc-test-remote/a3/04afb96060aad90176268345e10355
# here, the MD5 hash of the file is "a304afb96060aad90176268345e10355"
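
The remote is content-addressed: the first two hex characters of the MD5 become a directory and the rest the file name. The local cache under .dvc/cache uses the same layout (a quick check, assuming GNU coreutils' md5sum):

md5sum data/data.xml
# a304afb96060aad90176268345e10355  data/data.xml
ls -lh .dvc/cache/a3/04afb96060aad90176268345e10355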

# you can pull DVC managed files that were previously pushed
rm data/data.xml
dvc pull
ls -lh data/data.xml
# this will read *.dvc files to figure out what to download from your default remote
# or you can just pull individual files
rm data/data.xml
dvc pull data/data.xml.dvc
ls -lh data/data.xml
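
You can also check what differs between the local cache and the default remote without transferring anything (`-c` compares against the cloud remote):

dvc status -c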

# if you have multiple projects using the same files, use dvc import
dvc import https://github.com/iterative/dataset-registry get-started/data.xml
# this will include a "deps:" section in the "data.xml.dvc" file that links the projects
# note that the source Git repository doesn't contain the file itself, only data.xml.dvc,
# which is used to locate the file from that project's configured remote
git reset --hard
rm -f data.*
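
If the upstream dataset later changes, an imported file can be brought up to date; a sketch, as `dvc update` re-reads the source repository:

dvc update data.xml.dvc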

# let's get some example code to run
wget https://code.dvc.org/get-started/code.zip
unzip code.zip
rm -f code.zip
ls -l
git add .
git commit -m 'Add example source code'

# let's run some stuff;
# -f names the generated DVC file, -d declares a code or data dependency
# and -o declares a directory for the outputs
dvc run \
    -f prepare.dvc \
    -d src/prepare.py \
    -d data/data.xml \
    -o data/prepared \
    python src/prepare.py data/data.xml
# generates "prepare.dvc" which includes all "deps:" and "outs:" MD5s
# this also automatically adds the outputs to the DVC cache

cat prepare.dvc
# md5: 645d5baf13fb4404e17d77a2cf7461c4
# cmd: python src/prepare.py data/data.xml
# deps:
# - md5: 1a18704abffac804adf2d5c4549f00f7
#   path: src/prepare.py
# - md5: a304afb96060aad90176268345e10355
#   path: data/data.xml
# outs:
# - md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir
#   path: data/prepared
#   cache: true
#   metric: false
#   persist: false

git add .
git commit -m 'Add data preparation stage'
dvc push

# data pipelines
pip install -r src/requirements.txt
dvc run \
  -f featurize.dvc \
  -d src/featurization.py \
  -d data/prepared \
  -o data/features \
  python src/featurization.py data/prepared data/features
dvc run \
  -f train.dvc \
  -d src/train.py \
  -d data/features \
  -o model.pkl \
  python src/train.py data/features model.pkl
cat featurize.dvc
# md5: f89c792aacc96be22aa7349f61b32506
# cmd: python src/featurization.py data/prepared data/features
# deps:
# - md5: e6d8262e922894e85a959816f9a77ae7
#   path: src/featurization.py
# - md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir
#   path: data/prepared
# outs:
# - md5: 3338d2c21bdb521cda0ba4add89e1cb0.dir
#   path: data/features
#   cache: true
#   metric: false
#   persist: false
cat train.dvc
# md5: 86622ecb0df37e43993023133d273755
# cmd: python src/train.py data/features model.pkl
# deps:
# - md5: d05e0201a3fb47c878defea65bd85e4d
#   path: src/train.py
# - md5: 3338d2c21bdb521cda0ba4add89e1cb0.dir
#   path: data/features
# outs:
# - md5: 25431f604a2e9a1f5219de6c96792e0f
#   path: model.pkl
#   cache: true
#   metric: false
#   persist: false
git add .
git commit -m 'Add featurization and training stages'
dvc push

dvc pipeline show --tree train.dvc
# train.dvc
# └── featurize.dvc
#     └── prepare.dvc
#         └── data/data.xml.dvc

dvc pipeline show --ascii train.dvc --commands
#           +-------------------------------------+
#           | python src/prepare.py data/data.xml |
#           +-------------------------------------+
#                               *
#                               *
#                               *
# +---------------------------------------------------------+
# | python src/featurization.py data/prepared data/features |
# +---------------------------------------------------------+
#                               *
#                               *
#                               *
#       +---------------------------------------------+
#       | python src/train.py data/features model.pkl |
#       +---------------------------------------------+

# now I pushed the repository to GitHub, nuked my local version and cloned it fresh;
# I should be able to reproduce the files IF THE REMOTE IS CONFIGURED

pip install -r requirements.txt
pip install -r src/requirements.txt
dvc pull
dvc repro train.dvc
# blah, it does nothing because all results were already pulled from the remote...

rm model.pkl
dvc repro train.dvc
# ok, now it trained again, and says the result is already in the remote so it won't re-save it
# so, you would change some parts of the "pipeline" and run "dvc repro train.dvc"
# to produce new outputs...
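
As a quick illustration, any change to a declared dependency makes the downstream stages stale (a sketch; the appended comment is just a dummy edit):

echo '# dummy change' >> src/train.py
dvc status           # reports the train stage as changed
dvc repro train.dvc  # re-runs only the training stage; earlier stages stay cached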

# -M marks auc.metric as a metric file
dvc run \
  -f evaluate.dvc \
  -d src/evaluate.py \
  -d model.pkl \
  -d data/features \
  -M auc.metric \
  python src/evaluate.py model.pkl data/features auc.metric
dvc metrics show
# auc.metric: 0.588426
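
The metric file itself is plain text, which is why it can go into Git as-is (a quick check; the value matches the output above):

cat auc.metric
# 0.588426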

git add .
git commit -m 'Add evaluation stage'
git tag -a "baseline-experiment" -m "Baseline experiment evaluation"
git push --tags
dvc push

# change the featurization to use bigrams
vim src/featurization.py
# edit:
# bag_of_words = CountVectorizer(stop_words='english',
#                                max_features=6000,
#                                ngram_range=(1, 2))
dvc repro train.dvc
git add .
git commit -m 'Reproduce model using bigrams'
git push
dvc push

# comparing experiments
git checkout baseline-experiment
dvc checkout
git checkout master
dvc checkout
dvc repro evaluate.dvc
git add .
git commit -m 'Evaluate bigrams model'
git tag -a "bigrams-experiment" -m "Bigrams experiment evaluation"
dvc metrics show -T
# working tree:
#         auc.metric: 0.602818
# baseline-experiment:
#         auc.metric: 0.588426
# bigrams-experiment:
#         auc.metric: 0.602818

# if you want the old model.pkl...
git checkout baseline-experiment train.dvc  # restore the .dvc file
dvc checkout train.dvc                      # restore the old file based on the dvc file

# if you want to fully go to an old experiment...
git checkout baseline-experiment
dvc checkout
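
If the local cache no longer holds those versions, fetch them from the remote first (assuming they were pushed at some point):

dvc pull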

Sources

https://dvc.org/doc/get-started