This tutorial demonstrates how to train DeepConsensus locally using GPUs. It is intended as a starting point from which a more advanced training pipeline can be developed; such a pipeline will likely be necessary to train highly accurate DeepConsensus models.

Training examples can be created with DeepConsensus itself. See the Generate Training Examples documentation for details. This tutorial uses a small toy dataset.
These instructions were tested on a Google Cloud VM with 16 Intel cores (n1-standard-16) and one NVIDIA Tesla P100 GPU.
```bash
git clone https://github.com/google/deepconsensus.git
cd deepconsensus
./install-gpu.sh
curl https://raw.githubusercontent.com/google/deepvariant/r1.4/scripts/install_nvidia_docker.sh -o install_nvidia_docker.sh
bash install_nvidia_docker.sh
```
```bash
export DC_TRAIN_DIR="${HOME}/dc-model"
export TF_EXAMPLES="${DC_TRAIN_DIR}/tf-examples"
export DC_TRAIN_OUTPUT="${DC_TRAIN_DIR}/output"
mkdir "${DC_TRAIN_DIR}"
mkdir "${TF_EXAMPLES}"
mkdir "${DC_TRAIN_OUTPUT}"
gsutil -m cp -R gs://brain-genomics-public/research/deepconsensus/training-tutorial/v1.0/* "${TF_EXAMPLES}/"
```
The path to the training examples has to be set in the `_set_custom_data_hparams` function in `deepconsensus/models/model_configs.py`. For example, if the training data is located in `/home/user/dc-model/tf-examples`, the config will look like this:
```python
def _set_custom_data_hparams(params):
  """Updates the given config with values for human data aligned to CCS."""
  params.tf_dataset = ['/home/user/dc-model/tf-examples']
  params.max_passes = 20
```
It is assumed that this path contains the following subdirectories with TensorFlow examples generated by DeepConsensus:

- train
- eval
- test

The directory must also contain a `summary.training.json` file.
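As a quick sanity check before launching training, the expected layout can be verified with a short script. This is just a sketch; the path is the example location used above, and the expected entries are those listed in this tutorial:

```python
import os

# Example path from the config above; adjust to your own setup.
tf_examples = '/home/user/dc-model/tf-examples'

# Subdirectories and files this tutorial expects in the examples directory.
expected_dirs = ['train', 'eval', 'test']
expected_files = ['summary.training.json']

# Collect anything that is missing from the directory.
missing = [d for d in expected_dirs
           if not os.path.isdir(os.path.join(tf_examples, d))]
missing += [f for f in expected_files
            if not os.path.isfile(os.path.join(tf_examples, f))]

if missing:
    print('Missing entries:', ', '.join(missing))
else:
    print('Examples directory layout looks correct.')
```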
```bash
export PYTHONPATH=$PWD:$PYTHONPATH
export CONFIG=deepconsensus/models/model_configs.py:transformer_learn_values+custom
python3 deepconsensus/models/model_train_custom_loop.py \
  --params ${CONFIG} \
  --out_dir ${DC_TRAIN_OUTPUT} \
  --alsologtostderr
```
Beginning training from an existing model checkpoint will generally speed up training. To start from a checkpoint, add the `--checkpoint` flag pointing to the path and prefix of the checkpoint.
By default, training runs for 4 epochs with a batch size of 256, and the batch size is scaled based on the number of GPUs or TPUs available. These values can be configured in the `model_configs.py` file.
The number of steps in an epoch is `<number of examples> / <batch size>`. Once training is finished, the `$DC_TRAIN_OUTPUT/best_checkpoint.txt` file will contain the best performing checkpoint. This checkpoint can then be used during inference.
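To make the step arithmetic concrete, here is a small worked example. The dataset size is hypothetical; the per-accelerator batch size of 256 is the default mentioned above:

```python
# Hypothetical number of training examples; 256 is the documented default
# per-accelerator batch size.
num_train_examples = 100_000
per_gpu_batch_size = 256
num_gpus = 1

# The effective batch size scales with the number of GPUs or TPUs.
effective_batch_size = per_gpu_batch_size * num_gpus

# Steps per epoch = number of examples / effective batch size.
steps_per_epoch = num_train_examples // effective_batch_size
print(steps_per_epoch)  # 390 with a single GPU
```

With more accelerators the effective batch size grows, so each epoch takes proportionally fewer steps.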
The newly trained checkpoint can be used to run inference with DeepConsensus. See Run DeepConsensus for detailed instructions.