split-index mapping in separate processes #787

mlin · 2021-07-21T02:23:14Z

Hi @lh3 @hasindu2008

My colleagues @morsecodist @katrinakalantar @rzlim08 @tfrcarvalho and I hacked on the existing split-index mode so that we can map against each index part in a separate process (in general on a cluster of machines), then merge the intermediates in a final gather process. It's still a bit rough, but we'd appreciate any early feedback. This is for a metagenomics use case (e.g. mapping to all of nt) which I gather was contemplated in the design of the split index.

In this approach we first split the database FASTA file if needed, then build a minimap2 index file from each FASTA part separately. As far as I could tell, the existing split index is just a blob concatenation of part indexes, so there's no meaningful difference with the same indexes in separate files -- did I miss something there?
Then we run e.g. minimap2 --split-map part7.intermediate part7.idx reads_1.fq reads_2.fq on each index part to generate the intermediate files.
Lastly minimap2 --split-merge reads_1.fq reads_2.fq . part1.intermediate part2.intermediate ... to merge the intermediates and generate PAF/SAM. This uses the existing merge subroutine unchanged, we're just restructuring the intermediate file I/O.

We have a proof-of-concept workflow of this running (still to be tested at full scale).

The new approach coexists with the single-process --split-prefix mode -- the diff factors out its logic for formulating temporary filenames and deleting them afterwards.

One wart we have right now is that the mm_mapopt_t must get initialized consistently in all the invocations for steps 2 and 3. I think it's okay if the user takes care to set the same command-line options to them, but I'm not completely sure, given that step 3 never sees any part of the index -- advice appreciated.

Do you know any big pitfalls awaiting us in the merging subroutine if we try to use this on full-size metagenomics databases? (One we're anticipating is the mapQ being zeroed by all the redundancy in nt)

hasindu2008 · 2021-07-22T12:10:06Z

Hi @mlin

The --split-prefix was implemented for enabling mapping to large genomes in systems with limited RAM and was tested mostly for the human genome. Metagenomics was a possible use case as well, but I did not get to do any tests on the nt database. However, I anticipate the results on the nt using a split prefix would be close to running with a single index. I am also quite interested in seeing how the results look like if you managed to execute.

The answers to your questions are as follows. It was a while I implemented this and minimap2 perhaps have had changes in the versions after, so my answers may not be entirely accurate.

In this approach we first split the database FASTA file if needed, then build a minimap2 index file from each FASTA part separately. As far as I could tell, the existing split index is just a blob concatenation of part indexes, so there's no meaningful difference with the same indexes in separate files -- did I miss something there?

Yes. You are correct. Infact, the script I wrote to balance the size of the indexes for the human genome separately indexes and concatenates them. So having them in separate files does not do any difference.

One wart we have right now is that the mm_mapopt_t must get initialized consistently in all the invocations for steps 2 and 3. I think it's okay if the user takes care to set the same command-line options to them, but I'm not completely sure, given that step 3 never sees any part of the index -- advice appreciated.

I think this is all right to let the user take care of the options - that makes the implementation simple. Step three does not see the index, but relies on command line parameters passed. For instance, here.

Do you know any big pitfalls awaiting us in the merging subroutine if we try to use this on full-size metagenomics databases? (One we're anticipating is the mapQ being zeroed by all the redundancy in nt)

I am not sure how minimap2 with a single index behaves on the nt database. The MAPQ from the split method is anticipated to be closer estimates to ones from the single index. If minimap2 with a single index gives MAPQ 0, split index also likely to give the same. A potential pitfall would be if the architectures of the cluster nodes are different (different endianness or 64-bit vs 32-bit) which is unlikely.

This reverts commit bebba68.

mlin · 2021-09-20T11:42:06Z

@lh3 @hasindu2008 Promoting this out of draft now that we (mainly @rzlim08) have built up some experience using the proposed distributed mapping for our metagenomics use case (where the reference index is much larger than memory). Added a FAQ item discussing how to use it too.

We found one significant caveat to both the existing --split-prefix and the new distributed mode, which is that sensitivity can decrease as the index part size (-I) increases.

For short reads this seems to be due to the fixed arbitrary values of mid_occ and max_occ controlling when minimizers are ignored for occurring too many times in an index part. Those can be changed with -f.
The long read options presets seemed a bit more complex and I haven't figured out exactly how the effect arises for those.

This seems like an effect users ought to be made aware of, perhaps in the FAQ as well? But let us know if this sounds correct and/or if you can help fill in the long read explanation as well.

cc @morsecodist @katrinakalantar @tfrcarvalho

hasindu2008 · 2021-09-20T12:59:00Z

@mlin
Up to how many partitions did you split the index?

When I checked for the human genome up to 16 partitions, the difference in the sensitivity was marginal (sometimes less, but sometimes higher). As I benchmarked on the human genome and these were very slight, I did not investigate the parameters anymore. However, as those discrepancies were mostly observed for mappings that mapped with very low MAPQ and as your metagenomics use case results in a larger amount of low MAPQ mappings, this may have now become prominent. Are you mapping to the nt database?

mlin · 2021-09-20T20:40:42Z

@hasindu2008 I think we were exploring comparable numbers of partitions but @rzlim08 can confirm.

I assume the preset thresholds (for short reads) of mid_occ and max_occ are tuned to minimize the sensitivity impact on human-sized references (while speeding up mapping). So it makes sense that the effect would be minimal no matter how much we partition the human genome, yet more pronounced when we try to scale up to much larger references.

We have indeed been testing with nt so far (although that's not necessarily the final strategy, for the reasons we've both alluded to on this thread!).

mlin added 2 commits July 13, 2021 21:05

first run

0221f27

fixup

5288c5a

mlin added 6 commits July 27, 2021 15:45

Merge remote-tracking branch 'upstream/master' into distributed-mapping

dd07f1f

report hash of mm_mapopt_t

bebba68

Revert "report hash of mm_mapopt_t"

ce0ff47

This reverts commit bebba68.

Merge remote-tracking branch 'upstream/master' into distributed-mapping

f92c0ae

document in FAQ

5b0c2ce

fix numbering

fe0dee6

mlin marked this pull request as ready for review September 20, 2021 11:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

split-index mapping in separate processes #787

split-index mapping in separate processes #787

split-index mapping in separate processes #787

Are you sure you want to change the base?

split-index mapping in separate processes #787

Conversation