The goal of this example is to give you some hands-on experience with a basic machine learning version control scenario: managing multiple datasets and ML models using DVC. We'll work with a tutorial that François Chollet put together to show how to build a powerful image classifier using a pretty small dataset.
We highly recommend reading François' tutorial itself. It's a great demonstration of how a general pre-trained model can be leveraged to build a new high-performance model, with very limited resources.
We first train a classifier model using 1000 labeled images, then we double the number of images (2000) and retrain our model. We capture both datasets and classifier results and show how to use dvc checkout to switch between workspace versions.
The specific algorithm used to train and validate the classifier is not important, and no prior knowledge of Keras is required. We'll reuse the script from the original blog post as a black box – it takes some data and produces a model file.
Preparation
This tutorial was last tested with Python 3.7.
You'll need Git to run the commands in this tutorial. Also, follow these instructions to install DVC if it's not already installed.
See Running DVC on Windows for important tips to improve your experience on Windows.
Okay! Let's first download the code and set up a Git repository:
$ git clone https://github.com/iterative/example-versioning.git
$ cd example-versioning
This command pulls a DVC project with a single script, train.py, that will train the model.
Let's now install the requirements. But before we do that, we strongly recommend creating a virtual environment:
$ python3 -m venv .env
$ source .env/bin/activate
$ pip install -r requirements.txt
Expand to learn about DVC internals
The repository you cloned is already DVC-initialized. It contains a .dvc/ directory with the config and .gitignore files. These and other files and directories are hidden from the user, as typically there's no need to interact with them directly.
First model version
Now that we're done with preparations, let's add some data and then train the first model. We'll capture everything with DVC, including the input dataset and model metrics.
$ dvc get https://github.com/iterative/dataset-registry \
          tutorials/versioning/data.zip
$ unzip -q data.zip
$ rm -f data.zip
dvc get can download any file or directory tracked in a DVC repository (and stored remotely). It's like wget, but for DVC or Git repos. In this case we use our dataset registry repo as the data source (refer to Data Registry for more info).
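If you're curious, dvc get also accepts an output path and a Git revision, so a specific registry version of the same file could be fetched like this (just an illustration of the options, not something you need to run for this tutorial; --rev master assumes master is the registry's default branch):

$ dvc get --rev master -o data.zip \
          https://github.com/iterative/dataset-registry \
          tutorials/versioning/data.zip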
This command downloads and extracts our raw dataset, consisting of 1000 labeled images for training and 800 labeled images for validation. In total, it's a 43 MB dataset, with a directory structure like this:
data
├── train
│   ├── dogs
│   │   ├── dog.1.jpg
│   │   ├── ...
│   │   └── dog.500.jpg
│   └── cats
│       ├── cat.1.jpg
│       ├── ...
│       └── cat.500.jpg
└── validation
    ├── dogs
    │   ├── dog.1001.jpg
    │   ├── ...
    │   └── dog.1400.jpg
    └── cats
        ├── cat.1001.jpg
        ├── ...
        └── cat.1400.jpg
(Who doesn't love ASCII directory art?)
Let's capture the current state of this dataset with dvc add:
$ dvc add data
You can use this command instead of git add on files or directories that are too large to be tracked with Git: usually input datasets, models, some intermediate results, etc. It tells Git to ignore the directory and puts it into the cache (while keeping a file link to it in the workspace, so you can continue working the same way as before). This is achieved by creating a tiny, human-readable .dvc file that serves as a pointer to the cache.
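For example, the data.dvc file generated by the command above looks roughly like this (a sketch; the hash and size values are placeholders, and the exact set of fields depends on your DVC version):

outs:
- md5: <hash-of-the-data-directory>.dir
  size: <total-size-in-bytes>
  nfiles: 1800
  path: data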
Next, we train our first model with train.py. Because of the small dataset, this training process should be light enough to run on most computers in a reasonable amount of time (a few minutes). The command outputs a bunch of files, among them model.weights.h5 and metrics.csv: the weights of the trained model and the metrics history. The simplest way to capture the current version of the model is to use dvc add again:
$ python train.py
$ dvc add model.weights.h5
We manually added the model output here, which isn't ideal. The preferred way of capturing command outputs is with dvc stage add. More on this later.
Let's commit the current state:
$ git add data.dvc model.weights.h5.dvc metrics.csv .gitignore
$ git commit -m "First model, trained with 1000 images"
$ git tag -a "v1.0" -m "model v1.0, 1000 images"
Expand to learn more about how DVC works
As we mentioned briefly, DVC does not commit the data/ directory and model.weights.h5 file with Git. Instead, dvc add stores them in the cache (usually in .dvc/cache) and adds them to .gitignore.

In this case, we created data.dvc and model.weights.h5.dvc, which contain file hashes that point to cached data. We then git commit these .dvc files.
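You can verify this by looking at the .gitignore in the project root: after the two dvc add calls it should contain entries along these lines (possibly alongside entries that were already there):

/data
/model.weights.h5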
Note that executing train.py produced other intermediate files. This is OK, we will use them later.

$ git status
...
    bottleneck_features_train.npy
    bottleneck_features_validation.npy
Second model version
Let's imagine that our image dataset doubles in size. The next command extracts 500 new cat images and 500 new dog images into data/train:
$ dvc get https://github.com/iterative/dataset-registry \
          tutorials/versioning/new-labels.zip
$ unzip -q new-labels.zip
$ rm -f new-labels.zip
For simplicity's sake, we keep the validation subset the same. Now our dataset has 2000 images for training and 800 images for validation, with a total size of 67 MB:
data
├── train
│   ├── dogs
│   │   ├── dog.1.jpg
│   │   ├── ...
│   │   └── dog.1000.jpg
│   └── cats
│       ├── cat.1.jpg
│       ├── ...
│       └── cat.1000.jpg
└── validation
    ├── dogs
    │   ├── dog.1001.jpg
    │   ├── ...
    │   └── dog.1400.jpg
    └── cats
        ├── cat.1001.jpg
        ├── ...
        └── cat.1400.jpg
We will now want to leverage these new labels and retrain the model:
$ dvc add data
$ python train.py
$ dvc add model.weights.h5
Let's commit the second version:
$ git add data.dvc model.weights.h5.dvc metrics.csv
$ git commit -m "Second model, trained with 2000 images"
$ git tag -a "v2.0" -m "model v2.0, 2000 images"
That's it! We've tracked a second version of the dataset, model, and metrics in DVC and committed the .dvc files that point to them with Git. Let's now look at how DVC can help us go back to the previous version if we need to.
Switching between workspace versions
The DVC command that helps get a specific committed version of data is designed to be similar to git checkout. All we need to do in our case is to additionally run dvc checkout to get the right data into the workspace.
There are two ways of doing this: a full workspace checkout or checkout of a specific data or model file. Let's consider the full checkout first. It's pretty straightforward:
$ git checkout v1.0
$ dvc checkout
These commands will restore the workspace to the first snapshot we made: code, data files, model, all of it. DVC optimizes this operation to avoid copying data or model files each time. So dvc checkout is quick even if you have large datasets, data files, or models.
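Under the hood, DVC links files from the cache into the workspace where the filesystem allows it. If you ever need to adjust that behavior, the link strategy is configurable; for example (optional, shown purely as an illustration):

$ dvc config cache.type reflink,hardlink,symlink,copy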
On the other hand, if we want to keep the current code, but go back to the previous dataset version, we can target specific data, like this:
$ git checkout v1.0 data.dvc
$ dvc checkout data.dvc
If you run git status, you'll see that data.dvc is modified and currently points to the v1.0 version of the dataset, while code and model files are from the v2.0 tag.
Automating capturing
dvc add makes sense when you need to keep track of different versions of datasets or model files that come from source projects. The data/ directory above (with cats and dogs images) is a good example.
On the other hand, there are files that are the result of running some code. In our example, train.py produces binary files (e.g. bottleneck_features_train.npy), the model file model.weights.h5, and the metrics file metrics.csv.
When you have a script that takes some data as an input and produces other data outputs, a better way to capture them is to use dvc stage add:
If you tried the commands in the Switching between workspace versions section, go back to the master branch code and data, and remove the model.weights.h5.dvc file with:

$ git checkout master
$ dvc checkout
$ dvc remove model.weights.h5.dvc
$ dvc stage add -n train -d train.py -d data \
          -o model.weights.h5 -o bottleneck_features_train.npy \
          -o bottleneck_features_validation.npy -M metrics.csv \
          python train.py
$ dvc repro
dvc stage add writes a pipeline stage named train (specified using the -n option) in dvc.yaml. It tracks all outputs (-o) the same way as dvc add does. Unlike dvc add, dvc stage add also tracks dependencies (-d) and the command (python train.py) that was run to produce the result.
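The resulting dvc.yaml should look roughly like this (a sketch; the ordering and minor details may differ between DVC versions):

stages:
  train:
    cmd: python train.py
    deps:
    - data
    - train.py
    outs:
    - bottleneck_features_train.npy
    - bottleneck_features_validation.npy
    - model.weights.h5
    metrics:
    - metrics.csv:
        cache: false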
At this point you could run git add . and git commit to save the train stage and its outputs to the repository.
dvc repro will run the train stage if any of its dependencies (-d) changed. For example, when we added new images to build the second version of our model, that was a dependency change. It also updates outputs and puts them into the cache.
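In other words, once the stage exists, the next dataset update boils down to something like this (a sketch of the workflow rather than commands to run right now; the commit message is just an example):

# unpack additional labeled images into data/train, then:
$ dvc add data      # record the new dataset version in data.dvc
$ dvc repro         # re-run the train stage, since its data dependency changed
$ git add data.dvc dvc.lock metrics.csv
$ git commit -m "Third model, trained with more images"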
To make things a little simpler: dvc add and dvc checkout provide a basic mechanism for model and large dataset versioning. dvc stage add and dvc repro provide a build system for machine learning models, which is similar to Make in software build automation.
What's next?
In this example, our focus was on giving you hands-on experience with dataset and ML model versioning. We specifically looked at the dvc add and dvc checkout commands. We'd also like to outline some topics and ideas you might be interested in trying next to learn more about DVC and how it makes managing ML projects simpler.
First, you may have noticed that the script that trains the model is written in a monolithic way. It uses the save_bottleneck_feature function to pre-calculate the bottom, "frozen" part of the net every time it is run. Features are written into files. The intention was probably that the save_bottleneck_feature call could be commented out after the first run, but it's not very convenient having to remember to do so every time the dataset changes.
Here's where the pipelines feature of DVC comes in handy. We touched on it briefly when we described dvc stage add and dvc repro. The next step would be splitting the script into two parts and utilizing pipelines. See Get Started: Data Pipelines to get hands-on experience with pipelines, and try to apply it here. Don't hesitate to join our community and ask any questions!
Another detail we only touched upon here is the way we captured the metrics.csv metrics file with the -M option of dvc stage add. Marking this output as a metric enables us to compare its values across Git tags or branches (for example, representing different experiments). See dvc metrics, Comparing Changes, and Comparing Many Experiments to learn more about managing metrics with DVC.
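For example, once the v1.0 and v2.0 tags exist, the marked metrics can in principle be inspected and compared from the command line (a sketch; dvc metrics expects a structured format it can parse, so depending on your DVC version the CSV history produced by train.py may need some massaging first):

$ dvc metrics show
$ dvc metrics diff v1.0 v2.0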