Machine Learning is not Software Engineering
$ tree data
data
├── raw_data.csv
├── cleaned_data.csv
├── cleaned_data_final.csv
├── cleaned_data_preprocessed_final.csv
└── cleaned_data_preprocessed_final.csv.bak
Versioning data by filename quickly becomes unmanageable; DVC (Data Version Control) helps handle these challenges.
The example project (github.com/midnightradio/handson-dvc) trains a small VGGNet-style model to classify images of cats and dogs.
Follow the instructions in the repository to build a Docker image and start a container running a Bash shell. The following commands should be run inside the Docker container.
$ cd cats_and_dogs
$ tree
.
├── data
│ ├── finalized
│ ├── processed
│ └── raw
├── environment.sh
├── notebooks
├── requirements.txt
├── scripts
│ ├── dataload.sh
│ └── deploy.sh
└── src
├── catdog
└── setup.py
$ git init
$ git add src
$ git commit -m 'initialize repository'
$ dvc init
$ git status
new file: .dvc/.gitignore
new file: .dvc/config
new file: .dvc/plots/confusion.json
new file: .dvc/plots/default.json
new file: .dvc/plots/scatter.json
new file: .dvc/plots/smooth.json
$ git add .dvc
$ git commit -m 'initialize dvc'
It is a fairly large dataset containing 25K images in total, half cats and half dogs.
$ scripts/dataload.sh
$ ls /tmp/PetImages
Cat Dog
$ cat << EOF > params.yaml
> data:
> raw: "/tmp/PetImages"
> processed: "data/processed"
>
> prep:
> split_rate: 0.9
> class_size: 2000
>
> train:
> learning_rate: 0.001
> batch_size: 100
> epochs: 15
> validation_rate: 0.2
> EOF
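These parameters imply concrete split sizes. A quick sanity check in plain Python (assuming `split_rate` is the training fraction and `validation_rate` is carved out of the training set):

```python
# Derive the dataset sizes implied by params.yaml (an assumption about
# how preprocess.py and train.py interpret the parameters).
class_size = 2000       # images sampled per class
split_rate = 0.9        # train/test split
validation_rate = 0.2   # share of training data held out for validation

train_per_class = round(class_size * split_rate)          # 1800
test_per_class = class_size - train_per_class             # 200
val_per_class = round(train_per_class * validation_rate)  # 360
fit_per_class = train_per_class - val_per_class           # 1440
print(train_per_class, test_per_class, val_per_class, fit_per_class)
```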
$ git add params.yaml
$ git commit -m "add parameters"
$ dvc run -n prep -p prep -d src/catdog/preprocess.py \
-o data/processed python -m catdog.preprocess
$ git status
data/
dvc.lock
dvc.yaml
$ cat data/.gitignore
/processed
$ git add dvc.yaml dvc.lock data
$ git commit -m "define prep stage"
Train and evaluate the first model, with one convolutional layer and one fully connected layer.
$ dvc run -n train -p train -d data/processed/ \
-d src/catdog/train.py -o data/model.h5 --plots-no-cache \
data/plot.json python -m catdog.train
......
$ dvc run -n evaluate -d data/model.h5 -d src/catdog/evaluate.py \
-m data/score.json python -m catdog.evaluate
......
$ git status
modified: data/.gitignore
modified: dvc.lock
modified: dvc.yaml
modified: src/catdog/train.py
$ cat data/.gitignore
/processed
/model.h5
/score.json
$ git add data/ dvc.lock dvc.yaml src/catdog/train.py
$ git commit -m '1 Conv, 1 FC'
$ git tag -a 0.1 -m "ver 0.1, 1 Conv, 1 FC"
$ dvc dag
+------+
| prep |
+------+
*
*
*
+-------+
| train |
+-------+
*
*
*
+----------+
| evaluate |
+----------+
`dvc repro` checks for changes in the dependencies and automatically re-runs the pipeline from the first stage where a change occurred.
$ dvc repro
Stage 'prep' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.
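DVC decides whether a stage is up to date by comparing hashes of its dependencies against those recorded in `dvc.lock`. A simplified illustration of the idea in pure Python (not DVC's actual implementation):

```python
import hashlib

def file_hash(content: bytes) -> str:
    """Hash a dependency's content (DVC records MD5s in dvc.lock)."""
    return hashlib.md5(content).hexdigest()

# Hashes recorded at the last successful run (as dvc.lock would store them).
locked = {"src/catdog/train.py": file_hash(b"model v1")}

def stage_changed(deps: dict) -> bool:
    """A stage must be re-run if any dependency's hash differs."""
    return any(locked.get(path) != file_hash(content)
               for path, content in deps.items())

print(stage_changed({"src/catdog/train.py": b"model v1"}))  # False: skip
print(stage_changed({"src/catdog/train.py": b"model v2"}))  # True: re-run
```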
Add a second convolutional layer and reproduce the affected stages.
model = keras.Sequential([
layers.Conv2D(16, (3, 3), activation='relu', input_shape=(224, 224, 3)),
layers.MaxPooling2D(2, 2),
layers.Conv2D(32, (3, 3), activation='relu'),
layers.MaxPooling2D(2, 2),
layers.Flatten(),
layers.Dense(512, activation='relu'),
layers.Dense(1, activation='sigmoid')
])
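As a sanity check against `model.summary()`, a `Conv2D` layer's parameter count can be computed by hand: each of the `out_channels` filters has `kernel_h * kernel_w * in_channels` weights plus one bias. A quick check for the two convolutional layers above:

```python
def conv2d_params(in_ch, out_ch, k=3):
    # Each filter: k*k*in_ch weights + 1 bias, repeated out_ch times.
    return out_ch * (k * k * in_ch + 1)

print(conv2d_params(3, 16))   # 448  (first Conv2D, RGB input)
print(conv2d_params(16, 32))  # 4640 (second Conv2D)
```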
$ dvc repro
......
$ git add data dvc.lock src/catdog/train.py
$ git commit -m '2 Conv, 1 FC'
$ git tag -a 0.2 -m "ver. 0.2, 2 Conv, 1 FC"
Add a third convolutional layer and reproduce the affected stages.
model = keras.Sequential([
layers.Conv2D(16, (3, 3), activation='relu', input_shape=(224, 224, 3)),
layers.MaxPooling2D(2, 2),
layers.Conv2D(32, (3, 3), activation='relu'),
layers.MaxPooling2D(2, 2),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D(2, 2),
layers.Flatten(),
layers.Dense(512, activation='relu'),
layers.Dense(1, activation='sigmoid')
])
$ dvc repro
......
$ git add data dvc.lock src/catdog/train.py
$ git commit -m '3 Conv, 1 FC'
$ git tag -a 0.3 -m "ver. 0.3, 3 Conv, 1 FC"
Judging by the accuracy, simply adding convolutional layers does not seem to improve the result.
$ dvc metrics show -T
workspace:
data/score.json:
acc: 0.675000011920929
0.1:
data/score.json:
acc: 0.6924999952316284
0.2:
data/score.json:
acc: 0.7124999761581421
0.3:
data/score.json:
acc: 0.675000011920929
The training curves for each experiment tell a fuller story.

[Plots: training accuracy (ACC) and validation accuracy (Val. ACC) over epochs for v0.1, v0.2, and v0.3]
A clear sign of overfitting --> apply regularization.
model = keras.Sequential([
layers.Conv2D(16, (3, 3), activation='relu', input_shape=(224, 224, 3)),
layers.MaxPooling2D(2, 2),
layers.Conv2D(32, (3, 3), activation='relu'),
layers.MaxPooling2D(2, 2),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D(2, 2),
layers.Flatten(),
layers.Dense(512, activation='relu'),
layers.Dropout(0.5),
layers.Dense(1, activation='sigmoid')
])
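Dropout combats overfitting by randomly zeroing a fraction of activations during training and scaling the survivors so their expected sum is unchanged. A toy sketch of inverted dropout in pure Python (an illustration only, not Keras's implementation):

```python
import random

def dropout(activations, rate=0.5, seed=0):
    """Zero each activation with probability `rate`;
    scale survivors by 1/(1 - rate) so the expected sum is preserved."""
    rng = random.Random(seed)  # seeded here for a reproducible illustration
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [0.4, 1.2, 0.7, 0.9]
print(dropout(acts))  # some units zeroed, the survivors doubled
```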
$ dvc repro
$ git add dvc.lock src/catdog/train.py data/plot.json
$ git commit -m '3 Conv, 1 FC, Dropout added'
$ git tag -a 0.4 -m "ver. 0.4, 3Conv, 1FC, Dropout(0.5)"
Rather than increasing the size of the dataset, try data augmentation.
$ git checkout -b data_augmentation 0.3
training_datagen = ImageDataGenerator(validation_split=validation_rate,
rescale=1.0/255.,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True)
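Each of these transforms yields a perturbed copy of an image, so the model rarely sees exactly the same input twice. `horizontal_flip`, for example, mirrors the image left to right; a minimal pure-Python illustration on a toy 2x3 "image" (not the Keras implementation):

```python
def horizontal_flip(image):
    """Mirror an image, given as rows of pixel values, left to right."""
    return [row[::-1] for row in image]

img = [[1, 2, 3],
       [4, 5, 6]]
print(horizontal_flip(img))  # [[3, 2, 1], [6, 5, 4]]
```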