Using minimap2 to search against massive database for classification purposes (> 500GB) #144

Confurious · 2018-04-04T17:51:57Z

I have successfully used minimap2 to search queries against massive database with multi-index option -I. The command is minimap2 -c -k15 -w5 -t16 -I10G
I didn't set -N to more secondary alignments but I am already getting a lot hits for each query.

However, I also see a lot of hits with identical AS score, yet the other attributes of the alignment are not necesasrily the same. I am wondering if there is a more complicated way of evaluating how good a hit is?

Posting some output here for reference.

k99_1335274     898     22      732     +       GL456022.2      1934410 1207650 1208360 556     715     0       NM:i:159        ms:i:460        AS:i:460        nn:i:0  tp:A:P  cm:i:7  s1:i:40 s2:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I155M1I3M1D414M2D6M1I2M1I49M
k99_1335274     898     22      732     +       JH584266.1      4910976 1277142 1277852 556     715     0       NM:i:159        ms:i:460        AS:i:460        nn:i:0  tp:A:S  cm:i:7  s1:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I155M1I3M1D414M2D6M1I2M1I49M
k99_1335274     898     22      732     +       JH584328.1      4256209 1276903 1277613 556     715     0       NM:i:159        ms:i:460        AS:i:460        nn:i:0  tp:A:S  cm:i:7  s1:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I155M1I3M1D414M2D6M1I2M1I49M
k99_1335274     898     22      732     +       GL456022.2      1934410 1220564 1221274 557     716     0       NM:i:159        ms:i:458        AS:i:458        nn:i:0  tp:A:S  cm:i:7  s1:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I65M1I3M1D86M1I3M1D414M2D6M1I2M1I49M
k99_1335274     898     22      732     +       JH584266.1      4910976 1289953 1290663 557     716     0       NM:i:159        ms:i:458        AS:i:458        nn:i:0  tp:A:S  cm:i:7  s1:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I65M1I3M1D86M1I3M1D414M2D6M1I2M1I49M
k99_1335274     898     22      732     +       CM001010.2      94987271        34970860        34971570        557     716     0       NM:i:159        ms:i:458        AS:i:458        nn:i:0  tp:A:S  cm:i:7  s1:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I65M1I3M1D86M1I3M1D414M2D6M1I2M1I49M

For example, the top 3 hits have the same AS cores and the first one is labeled as Primary while the other two are labeled as Secondary. the identity score col10/col11 are the same as well. What other columns can be used to evaluate how good the alignment is?

Thanks

The text was updated successfully, but these errors were encountered:

lh3 · 2018-04-04T19:00:14Z

GenBank may contain multiple near identical strains of the same species. For your query, probably the first three database sequences are identical in the aligned part. It is not possible to separate them out. The real solution is to remove such obvious redundancies.

Confurious · 2018-04-04T20:08:55Z

Thanks. Those were perhaps not the best examples. Yes I am aware that GenBank and NCBI databases have some redundancy issues. However I am prepared to deal with that with my own parsing. For some reasons, for one query, I am seeing multiple "primary" alignment (tp:A:P), is this because of the multi-index step? So for every sub-index there is a primary alignment? Thanks!

k99_106222      536     0       536     +       AM270982.1      1609811 641610  642148  532     538     0       NM:i:6  ms:i:1036       AS:i:1036       nn:i:0  tp:A:P  cm:i:154        s1:i:496
        s2:i:496        dv:f:0.0081     cg:Z:31M1D30M1D475M
k99_106222      536     0       536     +       AM270982.1      1609811 633651  634189  532     538     0       NM:i:6  ms:i:1036       AS:i:1036       nn:i:0  tp:A:S  cm:i:154        s1:i:496
        dv:f:0.0081     cg:Z:31M1D30M1D475M
k99_106222      536     0       536     +       AM270982.1      1609811 650039  650577  532     538     0       NM:i:6  ms:i:1036       AS:i:1036       nn:i:0  tp:A:S  cm:i:154        s1:i:496
        dv:f:0.0081     cg:Z:31M1D30M1D475M
k99_106222      536     0       536     +       AM270982.1      1609811 657998  658536  532     538     0       NM:i:6  ms:i:1036       AS:i:1036       nn:i:0  tp:A:S  cm:i:154        s1:i:496
        dv:f:0.0081     cg:Z:31M1D30M1D475M
k99_106222      536     370     536     -       gi|702422764|ref|NW_010391501.1|        8982    181     348     129     167     22      NM:i:38 ms:i:104        AS:i:104        nn:i:0  tp:A:P  cm:i:9  s1:i:63 s2:i:44 dv:f:0.1156     cg:Z:29M1D137M
k99_106222      536     371     536     -       gi|550152876|ref|NW_005372339.1|        1197810 746153  746319  129     167     0       NM:i:38 ms:i:104        AS:i:104        nn:i:0  tp:A:S  cm:i:5  s1:i:49 dv:f:0.1508     cg:Z:25M1I2M2D137M
k99_106222      536     370     533     +       gi|557454186|ref|NW_005857785.1|        1558306 1147379 1147542 128     165     0       NM:i:37 ms:i:104        AS:i:104        nn:i:0  tp:A:S  cm:i:5  s1:i:53 dv:f:0.1522     cg:Z:84M2I1M1D50M1D26M
k99_106222      536     370     536     -       gi|949098558|ref|NW_014549397.1|        2147001 13732   13899   129     167     0       NM:i:38 ms:i:104        AS:i:104        nn:i:0  tp:A:S  cm:i:6  s1:i:44 dv:f:0.1427     cg:Z:29M1D137M
k99_106222      536     370     536     +       gi|695920435|ref|NW_009259687.1|        1991493 392082  392249  129     167     0       NM:i:38 ms:i:104        AS:i:104        nn:i:0  tp:A:S  cm:i:8  s1:i:48 dv:f:0.1235     cg:Z:137M1D29M
k99_106222      536     373     536     -       gi|554666229|ref|NW_005822290.1|        192154  8438    8602    127     164     0       NM:i:37 ms:i:104        AS:i:104        nn:i:0  tp:A:S  cm:i:8  s1:i:60 dv:f:0.1222     cg:Z:29M1D134M
k99_106222      536     370     536     -       AZMS01S00043927.1       226976  24414   24581   131     169     0       NM:i:38 ms:i:104        AS:i:104        nn:i:0  tp:A:S  cm:i:9  s1:i:63 dv:f:0.1156     cg:Z:27M1D65M1I4M2D3M1I65M
k99_106222      536     370     536     -       gi|701766627|ref|NW_010293388.1|        92771   61276   61443   129     167     0       NM:i:38 ms:i:104        AS:i:104        nn:i:0  tp:A:S  cm:i:9  s1:i:63 dv:f:0.1156     cg:Z:29M1D137M
k99_106222      536     370     536     +       KK343829.1      192418  185462  185629  133     172     0       NM:i:39 ms:i:104        AS:i:104        nn:i:0  tp:A:S  cm:i:9  s1:i:63 dv:f:0.1156     cg:Z:33M2I2M2D28M1I3M2D2M1I64M1I2M2D27M
k99_106222      536     370     536     +       HE587306.1      3773812 1654    1819    130     167     1       NM:i:37 ms:i:106        AS:i:106        nn:i:0  tp:A:P  cm:i:9  s1:i:63 s2:i:63 dv:f:0.1156     cg:Z:65M1I2M1D25M1I72M

lh3 · 2018-04-04T20:29:19Z

for one query, I am seeing multiple "primary" alignment (tp:A:P), is this because of the multi-index step? So for every sub-index there is a primary alignment?

Yes, each part gives one or multiple primary alignments. For a multi-part index, you should ignore the "tp" tag. It is misleading.

Confurious · 2018-04-04T20:38:21Z

Thanks. For several alignment with the same AS score (104), i noticed that the "cm", "s1", "dv" are all different. I realize AS 104 is not a particular high score and that may be why things get a bit shaky. However could you explain what does "cm" (Number of minimizers on the chain) and s1 or s2 (chaining score) exactly mean? the flag "dv" is not mentioned in the manual file. Is there a way to tell which alignments among these of the same AS score are slightly better? Thanks!

lh3 · 2018-04-04T20:44:59Z

"AS" is the ultimate standard. Other values are only useful when you don't compute "AS".

Confurious · 2018-04-05T18:05:34Z

Thanks. I have compared the results with blastn. it seems that of the 6500 testing queries, minimap2 using minimap2 -c -k15 -w5 -t16 -I10G was able to find "hits" for 2304 while dc-blast was able to get 4426 queries with potential hits. This is before any parsing performed. So obviously this is a rather larger difference. I am wondering if what other options I could try to increase the sensitivity? I don't mind sacrificing speed a bit at this point as currently the minimap2 is around 10 times faster than blastn for the same purposes.

lh3 · 2018-04-05T18:25:04Z

For a query file of your size, minimap2 will spend most of time on indexing the database. When you have a lot more query sequences, minimap2 should be even faster than blastn.

On sensitivity, blastn should be higher, as it is essentially using -k11 -w1. It would be good if you extract the 2122 (=4426-2304) pairs that minimap2 is unable to align. We can check if these pairs should be aligned together. If yes, we can optimize parameters on this much small dataset.

Confurious · 2018-04-05T18:31:04Z

Hi, yes I am basically doing this for a lot of collections of "contigs", which represent a lot samples. The speed as of now is fast enough that I am not worrying too much about it. But I need to get the sensitivity up. I am also not mapping reads but assembled contigs in these tests. each contig is 200- a few thousands bps. That's why I was thinking minimap2, being great on longer reads/contigs, should finally fit the bill.
I can get those 2122 contigs out, do you want me to try -k11 -w1 on them then? Thanks

lh3 · 2018-04-09T13:46:43Z

Sorry for the late response. Have you checked if those blast hits are good? When you increase sensitivity, you may reduce specificity as a cost.

Confurious · 2018-04-09T16:39:11Z

Yes, I have verified the blast hits based on mock experiments. I am trying to replace blast with something faster and preferably equally sensitive, hopefully minimap2. As for the sensitivity/specificity, I take very conservative approach (LCA) to parse the results and only report the consensus based on the top hits. So I would rather increase sensitivity at the query/search step. I have tried the parameters you suggested ```-k11 -w1``` that would produce blast-sensitivity-like output but so far it's requiring significantly more memory. for 10 GB per sub-index, the peak memory usage was more than 117 GB (>7*16). Is it expected? Thanks ``` batch 801964_10.b+ 7Gc 117429440K 71677484K 00:53:22 16 ```

…

On Mon, Apr 9, 2018 at 6:46 AM, Heng Li ***@***.***> wrote: Sorry for the late response. Have you checked if those blast hits are good? When you increase sensitivity, you may reduce specificity as a cost. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#144 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AVzXMeioMB7c5_ldnfZW8swg4B4bgccmks5tm2ZEgaJpZM4THOrg> .

-- Sincerely yours, Chao Jiang

Confurious · 2018-04-11T18:01:55Z

I am wondering what are the resources usage scaling for -k and -w parameters are? It seems the program is using dramatically more resources with -k11 -w1 and i am trying to scale down to something like -k11 -w3 or -k15 -w1. Could you please give me some tips?

Confurious changed the title ~~Using minimap2 to search against massive database for taxonomy classification purposes (> 500GB)~~ Using minimap2 to search against massive database for classification purposes (> 500GB) Apr 4, 2018

lh3 added the question label Apr 4, 2018

lh3 closed this as completed Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using minimap2 to search against massive database for classification purposes (> 500GB) #144

Using minimap2 to search against massive database for classification purposes (> 500GB) #144

Using minimap2 to search against massive database for classification purposes (> 500GB) #144

Using minimap2 to search against massive database for classification purposes (> 500GB) #144

Comments