Using minimap2 to search against massive database for classification purposes (> 500GB) #144
Comments
GenBank may contain multiple near identical strains of the same species. For your query, probably the first three database sequences are identical in the aligned part. It is not possible to separate them out. The real solution is to remove such obvious redundancies. |
Thanks. Those were perhaps not the best examples. Yes, I am aware that GenBank and NCBI databases have some redundancy issues, but I am prepared to deal with that in my own parsing. For some reason, I am seeing multiple "primary" alignments (tp:A:P) for a single query. Is this because of the multi-part index? Is there one primary alignment for every sub-index? Thanks! |
Yes, each part gives one or multiple primary alignments. For a multi-part index, you should ignore the "tp" tag. It is misleading. |
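A minimal sketch of how the advice above might be applied when post-processing: it ignores the "tp" tag, pools all hits for a query across index parts, and keeps every hit that ties for the highest "AS". It assumes standard tab-separated PAF output with an `AS:i:` tag (as produced with `-c`); the function names `parse_paf_line` and `top_hits_per_query` are hypothetical, not part of minimap2.

```python
from collections import defaultdict


def parse_paf_line(line):
    """Split one PAF record into the few fields used in this sketch."""
    cols = line.rstrip("\n").split("\t")
    tags = {}
    for field in cols[12:]:
        # Optional fields look like NAME:TYPE:VALUE, e.g. AS:i:104
        name, _type, value = field.split(":", 2)
        tags[name] = value
    return {
        "query": cols[0],
        "target": cols[5],
        "AS": int(tags["AS"]) if "AS" in tags else None,
    }


def top_hits_per_query(lines):
    """For each query, return all hits tied for the best AS, ignoring 'tp'."""
    by_query = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        hit = parse_paf_line(line)
        if hit["AS"] is not None:
            by_query[hit["query"]].append(hit)
    best = {}
    for query, hits in by_query.items():
        top = max(h["AS"] for h in hits)
        best[query] = [h for h in hits if h["AS"] == top]
    return best
```

Tied top hits are all returned rather than reduced to one, so a downstream step (for example an LCA-style consensus) can decide what to do with them.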
Thanks. For several alignments with the same AS score (104), I noticed that the "cm", "s1" and "dv" values are all different. I realize an AS of 104 is not a particularly high score, and that may be why things get a bit shaky. However, could you explain what "cm" (number of minimizers on the chain) and "s1" or "s2" (chaining score) mean exactly? The "dv" tag is not mentioned in the manual. Is there a way to tell which of these alignments with the same AS score are slightly better? Thanks! |
"AS" is the ultimate standard. Other values are only useful when you don't compute "AS". |
Thanks. I have compared the results with blastn. It seems that of the 6500 test queries, minimap2 using |
For a query file of your size, minimap2 will spend most of its time indexing the database. When you have a lot more query sequences, minimap2 should be even faster than blastn. On sensitivity, blastn should be higher, as it is essentially using |
Hi, yes, I am basically doing this for many collections of "contigs", which represent many samples. The speed is already fast enough that I am not too worried about it, but I need to get the sensitivity up. I am also not mapping reads but assembled contigs in these tests; each contig is between 200 bp and a few thousand bp long. That's why I thought minimap2, being great on longer reads/contigs, should finally fit the bill. |
Sorry for the late response. Have you checked if those blast hits are good? When you increase sensitivity, you may reduce specificity as a cost. |
Yes, I have verified the blast hits based on mock experiments. I am trying to replace blast with something faster and preferably equally sensitive, hopefully minimap2. As for sensitivity/specificity, I take a very conservative approach (LCA) to parse the results and only report the consensus based on the top hits. So I would rather increase sensitivity at the query/search step.
I have tried the parameters you suggested, `-k11 -w1`, which should produce blast-like sensitivity, but so far they require significantly more memory. With 10 GB per sub-index, the peak memory usage was more than 117 GB (>7*16). Is this expected? Thanks
```
batch 801964_10.b+ 7Gc 117429440K 71677484K 00:53:22 16
```
--
Sincerely yours,
Chao Jiang
|
I am wondering how resource usage scales with the -k and -w parameters. The program seems to use dramatically more resources with -k11 -w1, and I am trying to scale down to something like -k11 -w3 or -k15 -w1. Could you please give me some tips? |
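As a rough way to reason about the -w part of this question, here is a small sketch using the common approximation that, on random sequence, a window of w consecutive k-mers keeps about 2/(w+1) of all k-mers as minimizers. This is an approximation, not a measurement of minimap2, and it does not model the effect of -k (shorter k-mers tend to occur in more places, which is a separate cost). The function name `expected_minimizer_density` is hypothetical.

```python
def expected_minimizer_density(w):
    """Approximate fraction of k-mers kept as minimizers for window size w.

    Uses the 2 / (w + 1) rule of thumb for random sequence; real genomes
    and real indexes will deviate from this.
    """
    if w < 1:
        raise ValueError("window size must be at least 1")
    return 2.0 / (w + 1)


if __name__ == "__main__":
    # Compare the window sizes mentioned in this thread.
    for w in (1, 3, 5):
        density = expected_minimizer_density(w)
        print(f"-w{w}: about {density:.2f} minimizers per k-mer position")
```

Under this approximation, -w1 keeps every k-mer, -w3 about half and -w5 about a third, so the number of indexed minimizers (and plausibly the memory that depends on it) would shrink accordingly.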
I have successfully used minimap2 to search queries against a massive database with the multi-part index option -I. The command is
minimap2 -c -k15 -w5 -t16 -I10G
I didn't use -N to request more secondary alignments, but I am already getting a lot of hits for each query.
However, I also see a lot of hits with identical AS scores, yet the other attributes of the alignments are not necessarily the same. I am wondering if there is a more refined way of evaluating how good a hit is?
Posting some output here for reference.
For example, the top 3 hits have the same AS score; the first one is labeled as primary while the other two are labeled as secondary. The identity (col10/col11) is the same as well. What other columns can be used to evaluate how good the alignment is?
Thanks
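For reference, the identity mentioned above (col10/col11) can be computed from the standard PAF columns, together with the fraction of the query covered by the alignment (from columns 2 to 4). This is only an illustrative sketch: the helper name `paf_metrics` is hypothetical, and query coverage is an extra descriptor chosen here for illustration, not something minimap2 reports or recommends as a ranking criterion.

```python
def paf_metrics(line):
    """Compute identity and query coverage from one tab-separated PAF record."""
    cols = line.rstrip("\n").split("\t")
    # PAF columns (1-based): 2 = query length, 3 = query start, 4 = query end
    qlen, qstart, qend = int(cols[1]), int(cols[2]), int(cols[3])
    # PAF columns (1-based): 10 = matching bases, 11 = alignment block length
    matches, block_len = int(cols[9]), int(cols[10])
    return {
        "identity": matches / block_len,
        "query_coverage": (qend - qstart) / qlen,
    }
```

Two hits with the same AS and the same identity can still differ in how much of the query they cover, which is one reason the other columns may not agree.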