Using mappy across multiple processes #125

Closed
marcus1487 opened this issue Feb 22, 2018 · 3 comments

Comments

@marcus1487
Contributor

I am working on incorporating mappy into tombo (for nanopore modified base detection), and I am having some issues using mappy on larger genomes across many cores. My main issue is that I currently open a new mappy.Aligner object in each Python process (via the multiprocessing module). For larger genomes, this leaves a large memory footprint, since every process holds its own copy of the index. I opened a separate Aligner in each process to be safe about sharing the object across processes, but I am wondering whether there is an existing solution that would allow the in-memory minimap2 index to be shared across multiple Python processes via the mappy API and so reduce this footprint.

@lh3
Owner
lh3 commented Feb 23, 2018

This will be technically difficult. It is possible to access memory from different processes using shared memory, but implementing that in minimap2 will be nontrivial.

Can you use multiple threads? You can create one mappy.Aligner object and call Aligner.map on different threads like (I have not tested this part, though):

aligner = mappy.Aligner(fn_index)
# then in each thread
thr_buf = mappy.ThreadBuffer()
for hit in aligner.map(seq, buf=thr_buf):
    ...
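
A slightly fuller sketch of the pattern above, using a thread pool with one shared aligner and one buffer per worker thread. The helper name `map_reads_threaded` and its parameters are hypothetical and not part of mappy; it only assumes that the aligner exposes `map(seq, buf=...)` as in the snippet above, and it has not been benchmarked.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def map_reads_threaded(aligner, reads, make_buffer, n_threads=4):
    """Map (name, seq) pairs on a thread pool that shares one aligner.

    aligner     -- object with a map(seq, buf=...) method, e.g. mappy.Aligner
    make_buffer -- zero-argument callable returning a per-thread buffer,
                   e.g. mappy.ThreadBuffer
    (Hypothetical helper for illustration; not part of the mappy API.)
    """
    local = threading.local()

    def map_one(read):
        name, seq = read
        # Create the buffer lazily, once per worker thread, so that no two
        # threads ever share a buffer.
        if not hasattr(local, "buf"):
            local.buf = make_buffer()
        # Consume the generator inside the worker thread so the buffer is
        # only used by the thread that owns it.
        hits = list(aligner.map(seq, buf=local.buf))
        return name, hits

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map preserves the order of the input reads.
        return list(pool.map(map_one, reads))
```

With mappy installed, this would be called along the lines of `map_reads_threaded(mappy.Aligner(fn_index), reads, mappy.ThreadBuffer)`, so the index is loaded once and only the small per-thread buffers are duplicated.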

@marcus1487
Contributor Author

I figured this might be quite an undertaking if it were not already built into minimap2/mappy. I am using Python's multiprocessing module right now, but slimming down the memory footprint for large genomes might warrant a switch to threading.

I will run some tests to see if this is feasible. Thanks for the pointer and the use case for mappy.ThreadBuffer.

@marcus1487
Contributor Author

I have tested the interface and it works quite well. I will note that mixing multiprocessing with multithreading was quite a headache, but it is possible. For others who might be trying this: I had to open all multiprocessing objects (Pipes in my case) and start all processes before creating any threading objects.

While it would be nice to have a version that could share state across processes, I think this is a sufficient workaround, so I would consider this issue resolved.
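
For readers trying the same combination, here is a minimal sketch of the ordering described above: every Pipe is opened and every process is started while the parent is still single-threaded, and threads are created only afterwards. It is one possible reading of that advice, not the code used in tombo. The names `child_worker` and `run_pipeline` are hypothetical, the child just returns sequence lengths as a stand-in for real work, and the sketch assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import threading


def child_worker(conn):
    """Child process: reply with the length of each sequence until None arrives."""
    for seq in iter(conn.recv, None):
        conn.send(len(seq))
    conn.close()


def run_pipeline(seqs, n_procs=2):
    # Assumption: POSIX with the "fork" start method. Forking a process that
    # already has running threads is the hazard this ordering avoids.
    ctx = mp.get_context("fork")

    # Step 1: open all Pipes and start all processes before any thread exists.
    conns, procs = [], []
    for _ in range(n_procs):
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(target=child_worker, args=(child_conn,))
        proc.start()
        child_conn.close()  # the parent keeps only its own end of the pipe
        conns.append(parent_conn)
        procs.append(proc)

    # Step 2: only now create threads; each thread talks to one child process.
    results = [None] * n_procs

    def feed(i):
        out = []
        for seq in seqs[i::n_procs]:
            conns[i].send(seq)
            out.append(conns[i].recv())
        conns[i].send(None)  # tell the child to stop
        results[i] = out

    threads = [threading.Thread(target=feed, args=(i,)) for i in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for p in procs:
        p.join()
    return results
```

In a real pipeline the threads in step 2 would be the ones calling `Aligner.map` with their own `mappy.ThreadBuffer`.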
