For each PacBio dataset (Movie ID), we compared yield at Q30 for `ccs` (baseline) and for v0.2, v0.3, and v1.0 of DeepConsensus.
Movie ID | Sample | Chemistry | Mean insert size |
---|---|---|---|
m64011_181218_235052 | HG002 | 1 | 11 kb |
m64008_201124_002822 | HG002 | 2.2 | 15 kb |
m64014_200920_132517 | HG002 | 2.2 | 24 kb |
version | movie | dataset | num_reads_ccs | num_reads | yield@emQ20 | yield@emQ20/ccs | yield@emQ30 | yield@emQ30/ccs | yield@emQ40 | yield@emQ40/ccs | hours |
---|---|---|---|---|---|---|---|---|---|---|---|
v1.0 | m64011_181218_235052 | chem1_11kb | 1,393,202 | 1,516,705 | 17.03 Gb | 109.85% | 12.17 Gb | 132.79% | 4.93 Gb | 203.01% | 251.04 |
v1.0 | m64008_201124_002822 | chem2.2_15kb | 2,689,147 | 2,851,015 | 42.80 Gb | 107.06% | 32.85 Gb | 124.98% | 9.33 Gb | 237.00% | 618.68 |
v1.0 | m64014_200920_132517 | chem2.2_24kb | 1,919,192 | 2,048,905 | 49.33 Gb | 107.77% | 32.55 Gb | 175.76% | 2.94 Gb | 854.15% | 796.88 |
`yield@emQ30/ccs`, or "Yield at empirical Q30 relative to CCS", is calculated as follows:
- Filter DeepConsensus output to predicted Q20.
- For each read, align it to the truth and calculate identity from that alignment: identity = # matches / (# matches + # mismatches + # insertions + # deletions).
- Take all the reads that have identity >= 0.999 (this is Q30).
- Because longer reads are more useful than shorter reads, we count the total bases and not just the number of reads.
- Next, we repeat the above for the original CCS reads (run with default parameters, i.e. Q20-filtered), then subtract and divide to get a percentage. For example, 40% means that DeepConsensus increased the yield of high-quality reads, in bases, by 40% over CCS.
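
The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the evaluation code used to produce the tables: the `AlignedRead` container and the function names are hypothetical, and the per-read alignment counts are assumed to have been extracted already (for example, from CIGAR operations of an alignment to the truth).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class AlignedRead:
    """Alignment summary of one read against the truth (hypothetical container)."""
    matches: int
    mismatches: int
    insertions: int
    deletions: int
    length: int  # read length in bases


def identity(read: AlignedRead) -> float:
    """identity = matches / (matches + mismatches + insertions + deletions)."""
    aligned = read.matches + read.mismatches + read.insertions + read.deletions
    return read.matches / aligned if aligned else 0.0


def phred_to_identity(q: float) -> float:
    """Identity threshold for a Phred score, e.g. Q30 is approximately 0.999."""
    return 1.0 - 10.0 ** (-q / 10.0)


def yield_at_identity(reads: Iterable[AlignedRead], min_identity: float = 0.999) -> int:
    """Total bases (not read count) in reads with identity >= min_identity."""
    return sum(r.length for r in reads if identity(r) >= min_identity)


def relative_increase(dc_yield: float, ccs_yield: float) -> float:
    """Subtract and divide: percent increase of DeepConsensus yield over CCS."""
    return 100.0 * (dc_yield - ccs_yield) / ccs_yield
```

With these definitions, `relative_increase(14000, 10000)` evaluates to `40.0`, matching the "40% over CCS" example. Thresholds for other empirical qualities can be derived with `phred_to_identity`, for instance `phred_to_identity(20)` for Q20.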
These were run on GCP `n1-standard-16` machines with no GPU (in 500 shards, combined above), with `--batch_zmws=100 --batch_size=1024`, which is generally what we recommend. For more information on compute setups, see the runtime metrics page.
The `--skip_windows_above` option (introduced in v0.3) allows DeepConsensus to skip windows whose average CCS base qualities are already above a certain quality threshold. The windows that are skipped just adopt the CCS sequence without correction. This saves runtime, but there is a yield tradeoff, shown in this chart for m64014_200920_132517-chr20:
The default in v1.0 is Q45, but you can adjust this level using `--skip_windows_above`.
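
To make the skipping rule concrete, here is a rough sketch of the decision for a single window. It is not DeepConsensus's implementation: the function names and the `correct_window` callback are hypothetical, and averaging the qualities in error-probability space (rather than taking the arithmetic mean of the Phred scores) is an assumption.

```python
import math
from typing import Callable, Sequence


def avg_phred(quals: Sequence[float]) -> float:
    """Average base quality, computed in error-probability space (assumption)."""
    if not quals:
        return 0.0
    probs = [10.0 ** (-q / 10.0) for q in quals]
    mean_prob = sum(probs) / len(probs)
    return -10.0 * math.log10(mean_prob)


def should_skip(window_quals: Sequence[float], threshold: float = 45.0) -> bool:
    """True if the window's average CCS quality is already above the threshold."""
    return avg_phred(window_quals) > threshold


def polish_window(
    ccs_seq: str,
    ccs_quals: Sequence[float],
    correct_window: Callable[[str], str],  # hypothetical stand-in for the model
    threshold: float = 45.0,
) -> str:
    """Adopt the CCS sequence for skipped windows; otherwise run correction."""
    if should_skip(ccs_quals, threshold):
        return ccs_seq
    return correct_window(ccs_seq)
```

Under this averaging assumption, one low-quality base is enough to pull a window below the threshold, so only uniformly high-quality windows are skipped. Raising the threshold skips fewer windows, trading runtime for yield.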