Add warm start to TPU tutorial
PiperOrigin-RevId: 496980689
akolesnikov authored and Copybara-Service committed Dec 21, 2022
1 parent 01a4e1c commit 38e2ac7
Showing 1 changed file with 46 additions and 1 deletion.
47 changes: 46 additions & 1 deletion docs/train_tpu_model.md
@@ -63,7 +63,8 @@ gcloud compute instances attach-disk "${USER}-disk-setup" \
--project ${PROJECT}
```

SSH to the newly created VM:
SSH to the newly created VM. Note that it takes a couple of minutes for the
VM to start up before you can SSH to it.
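
If you are scripting this step, one option is to poll until SSH succeeds before
continuing; a minimal sketch using `gcloud compute ssh --command` with the same
`${ZONE}` and `${PROJECT}` variables as above:

```bash
# Optional: retry until the VM accepts SSH connections.
until gcloud compute ssh "${USER}-disk-setup" \
  --zone "${ZONE}" --project "${PROJECT}" \
  --command "true" 2>/dev/null; do
  echo "VM not ready yet; retrying in 30s..."
  sleep 30
done
```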

```bash
gcloud compute ssh "${USER}-disk-setup" \
@@ -237,6 +238,43 @@ time python3 deepconsensus/models/model_train_custom_loop.py \

`--tpu_topology=4x4` here should work for TPU v2 and v3.

## Optionally launch training from a checkpoint (warm-start training)

Starting training from an existing model checkpoint (warm start) will generally
speed up training. To start from a checkpoint, add the `--checkpoint` parameter,
pointing it at the path and filename prefix of the checkpoint. The DeepConsensus
v1.1 checkpoint can be copied from
`gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint`.
This directory contains 3 files:

```
gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint/checkpoint.data-00000-of-00001
gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint/checkpoint.index
gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint/params.json
```
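
The value passed to `--checkpoint` below is this shared `checkpoint` prefix, not
one of the individual files. As an optional sanity check, listing the prefix with
a wildcard should show the `.index` and `.data` shards:

```bash
gsutil ls "gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint/checkpoint*"
```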

Copy the DeepConsensus checkpoint locally:

```bash
mkdir /mnt/disks/persist/model_checkpoint
gsutil cp gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint/* /mnt/disks/persist/model_checkpoint/
```
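
Optionally, confirm that all three files are present locally before training:

```bash
ls -lh /mnt/disks/persist/model_checkpoint/
```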

Add the optional `--checkpoint` flag to the
`deepconsensus/models/model_train_custom_loop.py` command:

```bash
export PYTHONPATH=$PWD:$PYTHONPATH
export CONFIG=deepconsensus/models/model_configs.py:transformer_learn_values+custom
time python3 deepconsensus/models/model_train_custom_loop.py \
--checkpoint /mnt/disks/persist/model_checkpoint/checkpoint \
--params ${CONFIG} \
--out_dir ${DC_TRAIN_OUTPUT} \
--alsologtostderr \
--tpu=local \
--tpu_topology=4x4 2>&1 | tee /tmp/dc-tpu.log &
```
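
This command runs in the background and pipes its output through `tee`, so you
can follow training progress in the log file:

```bash
tail -f /tmp/dc-tpu.log
```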

## Runtime

In the log, you can find what the batch size was set to. For example:
@@ -301,6 +339,13 @@ gcloud compute tpus tpu-vm delete ${USER}-tpu-name \
--zone ${ZONE} --project ${PROJECT}
```

To delete the Cloud VM used for persistent disk setup:

```bash
gcloud compute instances delete ${USER}-disk-setup \
--zone ${ZONE} --project ${PROJECT}
```

You can also use `stop` instead of `delete` to stop the VM.
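
For example, to stop the disk-setup VM instead of deleting it:

```bash
gcloud compute instances stop ${USER}-disk-setup \
  --zone ${ZONE} --project ${PROJECT}
```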

The disk can be cleaned up with:
