Update version numbers for v1.2.
PiperOrigin-RevId: 510288078
anastasiyabl authored and Copybara-Service committed Feb 17, 2023
1 parent f0f5dd0 commit 77798d6
Showing 10 changed files with 43 additions and 33 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -41,13 +41,13 @@ For context, we are the team that created and maintains both DeepConsensus and
DeepVariant. For variant calling with DeepVariant, we tested different models
and found that the best performance is with DeepVariant v1.4 using the normal
pacbio model rather than the model trained on DeepConsensus v0.1 output. We plan
-to include DeepConsensus v1.1 outputs when training the next DeepVariant model,
+to include DeepConsensus v1.2 outputs when training the next DeepVariant model,
so if there is a DeepVariant version later than v1.4 when you read this, we
recommend using that latest version.

### For assembly downstream

-We have confirmed that v1.1 outperforms v0.3 in terms of downstream assembly
+We have confirmed that v1.2 outperforms v0.3 in terms of downstream assembly
contiguity and accuracy. See the
[assembly metrics page](docs/assembly_metrics.md) for details.

@@ -76,15 +76,15 @@ to inspect some example model inputs and outputs.
If you're on a GPU machine:

```bash
-pip install deepconsensus[gpu]==1.1.0
+pip install deepconsensus[gpu]==1.2.0
# To make sure the `deepconsensus` CLI works, set the PATH:
export PATH="/home/${USER}/.local/bin:${PATH}"
```

If you're on a CPU machine:

```bash
-pip install deepconsensus[cpu]==1.1.0
+pip install deepconsensus[cpu]==1.2.0
# To make sure the `deepconsensus` CLI works, set the PATH:
export PATH="/home/${USER}/.local/bin:${PATH}"
```
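
After either install, a quick sanity check is to confirm that the CLI resolves and prints its flags. This is only a sketch; it assumes the `deepconsensus` entry point exposes the standard absl-style `--helpshort` flag:

```bash
# Confirm the entry point is on PATH and responds (the flag name is an assumption).
which deepconsensus
deepconsensus run --helpshort
```
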
@@ -94,13 +94,13 @@ export PATH="/home/${USER}/.local/bin:${PATH}"
For GPU:

```bash
-sudo docker pull google/deepconsensus:1.1.0-gpu
+sudo docker pull google/deepconsensus:1.2.0-gpu
```

For CPU:

```bash
-sudo docker pull google/deepconsensus:1.1.0
+sudo docker pull google/deepconsensus:1.2.0
```

### From source
4 changes: 2 additions & 2 deletions README_pip.md
@@ -3,15 +3,15 @@
If you're on a GPU machine:

```bash
-pip install deepconsensus[gpu]==1.1.0
+pip install deepconsensus[gpu]==1.2.0
# To make sure the `deepconsensus` CLI works, set the PATH:
export PATH="/home/${USER}/.local/bin:${PATH}"
```

If you're on a CPU machine:

```bash
-pip install deepconsensus[cpu]==1.1.0
+pip install deepconsensus[cpu]==1.2.0
# To make sure the `deepconsensus` CLI works, set the PATH:
export PATH="/home/${USER}/.local/bin:${PATH}"
```
2 changes: 1 addition & 1 deletion
@@ -19,5 +19,5 @@
"truth_bed": "None",
"truth_split": "None",
"ins_trim": "5",
"version": "1.1.0"
"version": "1.2.0"
}
2 changes: 1 addition & 1 deletion
@@ -26,5 +26,5 @@
"truth_bed": "testdata/human_1m/truth.bed",
"truth_split": "testdata/human_1m/truth_split.tsv",
"ins_trim": "5",
"version": "1.1.0"
"version": "1.2.0"
}
2 changes: 1 addition & 1 deletion
@@ -26,5 +26,5 @@
"truth_bed": "testdata/human_1m/truth.bed",
"truth_split": "testdata/human_1m/truth_split.tsv",
"ins_trim": "5",
"version": "1.1.0"
"version": "1.2.0"
}
2 changes: 1 addition & 1 deletion deepconsensus/utils/dc_constants.py
@@ -33,7 +33,7 @@
import tensorflow as tf

# DeepConsensus Version
-__version__ = '1.1.0'
+__version__ = '1.2.0'

# Vocab
GAP = ' '
8 changes: 6 additions & 2 deletions docs/generate_examples.md
@@ -58,7 +58,7 @@ mkdir "${TF_EXAMPLES_DIR}/eval"
mkdir "${TF_EXAMPLES_DIR}/test"

# Download the input PacBio Subread data.
-gsutil cp gs://brain-genomics-public/research/deepconsensus/quickstart/v1.1/n1000.subreads.bam "${BASE_DIR}"/
+gsutil cp gs://brain-genomics-public/research/deepconsensus/quickstart/v1.2/n1000.subreads.bam "${BASE_DIR}"/

# Truth Reference
gsutil cp gs://deepconsensus/pacbio/datasets/chm13/chm13v2.0_noY.fa "${BASE_DIR}"/
@@ -159,7 +159,7 @@ https://docs.docker.com/engine/install/ubuntu/ to install Docker.

```bash
# Define DOCKER_IMAGE *once* depending on whether you will be using CPU or GPU:
-DOCKER_IMAGE=google/deepconsensus:1.1.0 # For CPU
+DOCKER_IMAGE=google/deepconsensus:1.2.0 # For CPU
sudo docker pull ${DOCKER_IMAGE}
```

@@ -313,6 +313,9 @@ export truth_reference=chm13v2.0_noY.fa
export ccs_shard_bam="${shard_id}.ccs.bam"
export truth_split=chm13v2.0_noY.chrom_mapping.txt
export subreads_to_ccs_shard_bam="${shard_id}.subreads_to_ccs.bam"
+# If true, incorporate CCS Base Quality scores into tf.examples (DC v1.2).
+export use_ccs_bq=True
+
# Output
TF_EXAMPLES_DIR="tf_examples"
export ccs_shard_to_truth_alignment_unfiltered="${shard_id}.ccs_to_truth_ref.unfiltered.bam"
@@ -392,6 +395,7 @@ deepconsensus preprocess \
--truth_bed="${truth_shard_bed}" \
--truth_to_ccs="${truth_to_ccs_shard_bam}" \
--truth_split="${truth_split}" \
+--use_ccs_bq="${use_ccs_bq}" \
--output="${tf_example_fname_output}" \
--cpus="$(nproc)"

12 changes: 6 additions & 6 deletions docs/quick_start.md
@@ -68,10 +68,10 @@ Follow https://docs.docker.com/engine/install/ubuntu/ to install Docker.

## Parallelization

-One 8M SMRT Cell can take ~1000 hours to run (without parallelization) depending
+One 8M SMRT Cell can take ~500 hours to run (without parallelization) depending
on the fragment lengths of the sequencing library - see the
[yield metrics page](yield_metrics.md). If we split this into 500 shards, that
-is about 2 hours per shard. There is some variability between shards, but this
+is about 1 hour per shard. There is some variability between shards, but this
should give you an idea of what to expect. This estimate is only for the
DeepConsensus processing step, and does not include the preprocessing required
with *ccs* and *actc*.
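
As a rough sketch of how this sharding is usually arranged upstream of DeepConsensus, *ccs* can emit one chunk of the CCS reads at a time. The shard count, file names, and exact `--chunk` syntax below are placeholders to verify against your *ccs* version:

```bash
# Hypothetical shard i of n_total: ccs processes only its 1/n_total slice of
# ZMWs, so the downstream DeepConsensus work for that shard shrinks accordingly.
n_total=500
i=1
ccs --all --chunk="${i}/${n_total}" -j "$(nproc)" \
  subreads.bam "shard-${i}.ccs.bam"
```

Each shard's CCS BAM then goes through the *actc* and DeepConsensus steps that follow.
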
@@ -100,10 +100,10 @@ QS_DIR="${HOME}/deepconsensus_quick_start"
mkdir -p "${QS_DIR}" "${QS_DIR}/model"

# Download the input PacBio Subread data.
-gsutil cp gs://brain-genomics-public/research/deepconsensus/quickstart/v1.1/n1000.subreads.bam "${QS_DIR}"/
+gsutil cp gs://brain-genomics-public/research/deepconsensus/quickstart/v1.2/n1000.subreads.bam "${QS_DIR}"/

# Download the DeepConsensus model.
-gsutil cp -r gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint/* "${QS_DIR}"/model/
+gsutil cp -r gs://brain-genomics-public/research/deepconsensus/models/v1.2/model_checkpoint/* "${QS_DIR}"/model/
```

This directory should now contain the following files:
@@ -133,8 +133,8 @@ the appropriate version (CPU / GPU) depending on your use case.

```bash
# Define DOCKER_IMAGE *once* depending on whether you will be using CPU or GPU:
-DOCKER_IMAGE=google/deepconsensus:1.1.0 # For CPU
-DOCKER_IMAGE=google/deepconsensus:1.1.0-gpu # For GPU
+DOCKER_IMAGE=google/deepconsensus:1.2.0 # For CPU
+DOCKER_IMAGE=google/deepconsensus:1.2.0-gpu # For GPU
sudo docker pull ${DOCKER_IMAGE}
```

9 changes: 6 additions & 3 deletions docs/train_model.md
@@ -43,7 +43,7 @@ mkdir "${DC_TRAIN_DIR}"
mkdir "${TF_EXAMPLES}"
mkdir "${DC_TRAIN_OUTPUT}"
-gsutil -m cp -R gs://brain-genomics-public/research/deepconsensus/training-tutorial/v1.0/* "${TF_EXAMPLES}/"
+gsutil -m cp -R gs://brain-genomics-public/research/deepconsensus/training-tutorial/v1.2/* "${TF_EXAMPLES}/"
```

The path to training examples has to be set in
@@ -52,11 +52,14 @@ The path to training examples has to be set in
For example, if training data is located in /home/user/dc-model/tf-examples the
config will look like this:

-```
+```python
def _set_custom_data_hparams(params):
  """Updates the given config with values for human data aligned to CCS."""
  params.tf_dataset = ['/home/user/dc-model/tf-examples']
  params.max_passes = 20
+  # Set this to True if the tf examples contain ccs base quality scores.
+  # Option available starting in v1.2.
+  params.use_ccs_bq = True

```

@@ -85,7 +88,7 @@ parameter that points to the path + prefix of the checkpoint.

## Runtime

-By default, training will run for 4 epochs. Batch size is set to 256 by default,
+By default, training will run for 9 epochs. Batch size is set to 256 by default,
but this is scaled based on the number of GPUs or TPUs available. These values
can be configured by updating the `model_configs.py` file.
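
For illustration, such an override might look like the sketch below. Only `use_ccs_bq`, `tf_dataset`, and `max_passes` appear earlier on this page; the `num_epochs` and `batch_size` attribute names are assumptions to check against the actual `model_configs.py`:

```python
# Hypothetical edit inside model_configs.py; attribute names marked "assumed"
# should be verified against the config you are editing.
def _set_custom_data_hparams(params):
  params.tf_dataset = ['/home/user/dc-model/tf-examples']
  params.max_passes = 20
  params.use_ccs_bq = True  # v1.2 option described above
  params.num_epochs = 9     # assumed name for the default epoch count
  params.batch_size = 256   # assumed name; scaled across available GPUs/TPUs
```
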

23 changes: 13 additions & 10 deletions docs/train_tpu_model.md
@@ -98,7 +98,7 @@ Then, I copied over a dataset:
BASE_DIR=/mnt/disks/persist/dc_training_examples
mkdir -p ${BASE_DIR}/tf_examples
time gcloud alpha storage cp -R \
-gs://brain-genomics-public/research/deepconsensus/training-tutorial/v1.1/* \
+gs://brain-genomics-public/research/deepconsensus/training-tutorial/v1.2/* \
${BASE_DIR}/tf_examples
```

Expand Down Expand Up @@ -138,7 +138,7 @@ Get a Cloud TPU VM (`--accelerator-type=v2-8` specifies Cloud TPU v2):
gcloud compute tpus tpu-vm create ${USER}-tpu-name \
--zone=${ZONE} \
--accelerator-type=v2-8 \
---version=tpu-vm-tf-2.11.0 \
+--version=tpu-vm-tf-2.9.1 \
--project ${PROJECT} \
--data-disk source=projects/${PROJECT}/zones/${ZONE}/disks/${USER}-tpu-disk,mode=read-write
```
@@ -167,7 +167,7 @@ git clone https://github.com/google/deepconsensus.git

```
cd deepconsensus
-sed -i -e 's|python3 -m pip install --user "intel-tensorflow>=2.11.0"||' install.sh
+sed -i -e 's|python3 -m pip install --user "intel-tensorflow==2.9.1"||' install.sh
./install.sh
```

@@ -209,6 +209,9 @@ def _set_custom_data_hparams(params):
  # confusing.
  params.n_examples_train = 100_000_000
  params.n_examples_eval = 3_500_000
+  # Set this to True if the tf examples contain ccs base quality scores.
+  # Option available starting in v1.2.
+  params.use_ccs_bq = True
```

It is assumed that after copying training examples the
Expand Down Expand Up @@ -242,22 +245,22 @@ time python3 deepconsensus/models/model_train_custom_loop.py \

Beginning training from an existing model checkpoint will generally speed up
most training sessions. In order to start from a checkpoint add `--checkpoint`
-parameter that points to the path + prefix of the checkpoint. DeepConsensus v1.1
+parameter that points to the path + prefix of the checkpoint. DeepConsensus v1.2
checkpoint can be copied from
-`gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint`
+`gs://brain-genomics-public/research/deepconsensus/models/v1.2/model_checkpoint`
This directory contains 3 files:

```
-gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint/checkpoint.data-00000-of-00001
-gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint/checkpoint.index
-gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint/params.json
+gs://brain-genomics-public/research/deepconsensus/models/v1.2/model_checkpoint/checkpoint.data-00000-of-00001
+gs://brain-genomics-public/research/deepconsensus/models/v1.2/model_checkpoint/checkpoint.index
+gs://brain-genomics-public/research/deepconsensus/models/v1.2/model_checkpoint/params.json
```

Copy DeepConsensus checkpoint locally:

```bash
mkdir /mnt/disks/persist/model_checkpoint
-gsutil cp gs://brain-genomics-public/research/deepconsensus/models/v1.1/model_checkpoint/* /mnt/disks/persist/model_checkpoint/
+gsutil cp gs://brain-genomics-public/research/deepconsensus/models/v1.2/model_checkpoint/* /mnt/disks/persist/model_checkpoint/
```
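
To make the `--checkpoint` usage concrete, a minimal sketch is shown below. Only the script path and `--checkpoint` come from this page; any other flags the script requires are not shown here and should be checked against its `--helpfull` output:

```bash
# Append --checkpoint to the training command shown earlier; it takes the path
# plus the shared file prefix ("checkpoint"), not an individual .index/.data file.
time python3 deepconsensus/models/model_train_custom_loop.py \
  --checkpoint=/mnt/disks/persist/model_checkpoint/checkpoint \
  --alsologtostderr  # plus the same flags used when training from scratch
```
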

Add optional `--checkpoint` flag to
@@ -284,7 +287,7 @@ I1026 05:48:32.895524 140202426203200 model_utils.py:271] Per-replica batch-size
I1026 05:48:32.895847 140202426203200 model_utils.py:280] Global batch size is 8192
```

-By default, training will run for 7 epochs. Per-replica batch size and epochs
+By default, training will run for 9 epochs. Per-replica batch size and epochs
can be configured by updating the `model_configs.py` file. Global batch size is
scaled based on the TPU topology and number of cores you have available.
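
As a purely illustrative calculation (the replica count below is an assumption, not read from the log above), the global batch size is simply the per-replica batch size multiplied by the number of replicas:

```bash
# Per-replica batch size of 256 on an assumed 32 replicas:
echo $(( 256 * 32 ))  # 8192
```
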
