suffix array/ bwt construction algorithms other than libdivsufsort #27

h-2 · 2017-09-20T14:19:50Z

We had already spoken about this with @simongog in march:
For both technical and licensing reasons we need replacements for libdivsufsort. We have both in-memory and external algorithms in SeqAn currently, that we could modernise and contribute if you would like to have them. Alternatively we would require some form interface so that we could use to plug our own algorithms into current functionality.
@simongog suggested you had planned on doing the second thing anyway. Do you have any concrete plans for this?

@cpockrandt what are your thought on this? Please contribute to this discussion as you are the one who will have to do the work either way 😉

mpetri · 2017-09-20T21:25:06Z

There are plans for this yes. We are meeting in Japan in 2 weeks so I suggest if you can, that you put all the feature requests that you want as issues in here so we can discuss them.

mpetri · 2017-09-20T21:30:53Z

Currently, we I mocked up some prototype that uses the builder pattern. So it would look like this:

construct_config()
               .set_sa_algorithm(sdsl::sa_alg::divsufsort)
               .set_lcp_algorithm(sdsl::lcp_alg::blabla)
               .set_cache_config()
               .set_other_stuff()
               .set_inmemory(true)
               .input_encoding(uint8_t)
               .construct(file_name);

h-2 · 2017-09-21T14:16:08Z

There are plans for this yes. We are meeting in Japan in 2 weeks so I suggest if you can, that you put all the feature requests that you want as issues in here so we can discuss them.

Ah, that's good to know! Thanks.

Currently, we I mocked up some prototype that uses the builder pattern

Ideally we could just pass a lambda to .set_sa_algorithm that calls our functions, but storing that in std::functions and getting the arguments right is not trivial. Note that we definitely need to be able to construct from a sequence (or multiple sequences) currently in memory (not just from files) since we have formatted files were we need to extract the relevant data first (and maybe apply transformations as well). Going through an extra IO-step for GBs of data would be very bad. You could require something like a sized random access range as input..
The inmemory attribute would from my POV also be dependent on the algorithm.

@cpockrandt Please chime in on these topics!

mpetri · 2017-09-22T00:57:05Z

Currently all inmemory construction (using constuct_im) is done by writing all temporary files to a "ram filesystem". So, if you pass in a "buffer" to consturct_im it is first copied to another buffer (a virtual file) in the ram_fs.

Do you think this is not sufficient?

If the sa_algorithm you pass in doesn't want to write stuff to disk (or the virtual ram fs) it doesn't have to do that. However, the wavelet tree construction algorithms all use streaming to be more memory efficient. In the ram_fs case, the stream would just be from a virtual file instead of from disk.

In the end, we can copy 3-4GB/s these days so that should never be a bottleneck.

In general we were thinking of moving towards just memory mapping everything letting the operating system handle things which is good practice anyway as far as I know.

h-2 · 2017-09-25T11:56:20Z

So the entire SA is copied from memory to mem-fs (or the other way around) at least once, or not? This would double the peak memory requirement...
We have been trying very hard to get usable in-place construction algorithms of the SA so that we can build SA+BWT in memory consumption of around 1.1 or 1.2 (<< 2) times the size of the SA. The reason is that we frequently build very large databases (50GiB and more). The on-disk algorithms that we had, had lower memory requirements, but prohibitively large disk-space requirements (~1TiB) which resulted in people using network storage which then increased the run-time even more. Most on-disk algorithms are also more difficult to parallelise.
Long story short: we would strongly prefer if we weren't forced to go through filesystems (virtual or not), even if other algorithms might do so.

PS: On an unrelated note: I also added progress reporting capabilities to some of our SACAs which was highly appreciated by our users (if something is expected to run somewhere between a day and a week it really helps to know whether it is closer to a day or a week!). Have you thought about this?

mpetri · 2017-09-25T12:22:41Z

Currently (and also in the future) we use an object called cache_config which contains a "file_map". The map stores locations of stuff we have constructed for reuse. In the current version of sdsl you could construct the SA and BWT externally and then add it do the file_map object. The CSA/CST construction process will then reuse those files instead of constructing the objects again. So if you want to do your own thing you can.

There are currently stages in the construct of a CSA where temporary files are written to the fs (virtual or not). Creating a separate inmemory version which doesn't just use a "virtual file" in the background would create lots of code deduplication.

I think in the ram filesystem we move memory around instead of copying whenever possible. We have some memory consumption visualizations and we are almost as efficient in terms of space consumption as possible and we are working on something to make this more efficient. The main issue is when to discard objects (e.g. the full SA or PSI) that are not needed anymore.

Regarding the "progress display", we could add this functionality to the new construction process. Maybe a show_progress() switch. However, not all SACAs might support this. For example, I would hesitate to add this to libdivsufsort. There are other parts of the CSA construction process which can be quite time consuming which would benefit from a progress bar. I think this should be turned off by default though.

Have you looked into existing work on SACAs which do (1) parallel and (2) on disk. I remember Dominik Kempa has been doing at lot of work in this area. I think all the code is available and I was planning to integrate this into sdsl if I ever have time and we have a reasonable interface for different SACAs. There is also a really good parallel libdivsufsort implementation here out there which gives almost linear speedup.

h-2 · 2017-09-28T16:03:13Z

There are currently stages in the construct of a CSA where temporary files are written to the fs (virtual or not). Creating a separate inmemory version which doesn't just use a "virtual file" in the background would create lots of code deduplication.

Hm, I understand that these changes are non-trivial....

I think in the ram filesystem we move memory around instead of copying whenever possible. We have some memory consumption visualizations and we are almost as efficient in terms of space consumption as possible and we are working on something to make this more efficient. The main issue is when to discard objects (e.g. the full SA or PSI) that are not needed anymore.

... but I still don't see how you can avoid doubling the peak memory usage in your model. Unless of course you detect when you are writing to ram-fs and then delete the currently held (real) memory block-wise after writing blocks – something that sounds rather inefficient and creates exactly the "special-case-code-overhead" that you are trying to avoid.
Like I said, we are really operating at the limit of available RAM often and the index construction is already really expensive compared to the requirements of the FM-Index once it's done so it would be a huge draw-back if the memory requirements of applications double when switching to SDSL.

Regarding the "progress display", we could add this functionality to the new construction process. Maybe a show_progress() switch. However, not all SACAs might support this

Yes, I don't think this should be a requirement, it's just bonus.

Have you looked into existing work on SACAs which do (1) parallel and (2) on disk.

We have fully parallel SACAs based on sorting in SeqAn. They work for non-repetitive data like protein sequences, but have problems with repetitive data like genomes. As on-disk SACAs we have Skew-variants, but they require a lot of disk space. If you have better alternatives here that would be very welcome!

cpockrandt · 2018-06-15T14:06:37Z

Do you have the mockup somewhere in a separate branch @mpetri? We'd then start implementing SA construction algorithms (at first our own inplace radix implementation from SeqAn). If you know a parallelizable algorithm that performs well in practice and you'd like to have in the SDSL, we're glad for any recommendation!

mpetri · 2018-06-16T02:03:04Z

I don't think the mockup that I created is very useful. I would just use the builder design pattern do all of this. However, there are quite a few things that need to be considered:

Depending on the different algorithms used the time certain temporary files can be removed changes. Simon and I were thinking of some sort of "stack" mechanism that figures this out from the type definitions. For example, the SA/BWT can't be deleted after computing a CSA, if the CSA is part of a CST which still requires those files.
We still have not resolved the inmemory construction issues discussed above.
The overall interface should still be simple. The current construct() call is quite easy to understand and the rewrite should provide something similar.

SA construction algorithms that should be included somehow:

Parallel divsufsort which is currently the fastest general purpose SA construction algorithm out there: https://github.com/jlabeit/parallel-divsufsort
This parallel external SA construction: https://www.cs.helsinki.fi/group/pads/pSAscan.html
an integer alphabet construction algorithm (still have to look up the references)

rrahn · 2020-04-21T17:02:34Z

@mpetri @h-2 @simongog @cpockrandt
Ok I started to look into this. Here is my current understanding and things we'd like to have.
So in general within the sdsl everything is nicely encapsulated through the file concept, which uses the cache_config to store the data structures in a file based approach. My first observation is that this is a really nice abstraction. It only comes with one issue. When building a CSA through the construct_im algorithm I count 3 copies that are made of the text. First, at the beginning to store the text in a ram file. 2. To check if the text contains a 0 characer. 3. to write it into a file with the appended 0. I am not sure what the text_mapper does? Is it a memory mapped file? So at least only a fraction of the text is held in memory? Is this observation correct?

I am a little puzzeled about the semantics of contains_no_zero_character. It never returns false but always throws. So this is already bad, as the client is never allowed to provide already a text with a trailing 0 character and possibly safe a copy. So if the client wants to build the SA before and pass it via the cache_config he also has to know when to add a trailing 0 and when not, which depends on the indext to construct. Also enabling an CSA with an implicit sentinel would complicate things.

I can understand this design decision, as it makes the code easier to use. How whould we know that if the last character of the text is 0, that this was on purpose or simply a regular symbol of the text and just happened to be the last one. On the other hand, this makes it even harder to provide everything from outside as the internals of the construction must be known and I feel, that this should be hidden.

So instead it would be indeed cool, if I can just add my own SACA algorithm from outside. It must get a text and an sdsl::int_vector for the SA. In general it is no problem to adapt divsufsort to work on a iterator instead of a const char_t * and I belive other algorithms can be similarly extended. To solve the text problem I am proposing an extended config_cache, that simply inherits from the config_cache and overwrites the operations for the text, such that it only holds a reference to the text it was constructed with. When dealing with the trailing 0 problem we could have simply a range adaptor that just extends the referenced text by one character with0 without changing the underlying text. I mean it must not be fully range::views compatible. This would be even more crucial if we want to have an index construction over a text collection as the current file based approach will get much more complicated if we want to preserve the collection structure in the mapped file.
So we would be able to pass through the text without writing it into some (possibly) memory mapped file, since we have it anyway the entire time in memory during the construction. Everything else should stay the way it is. However, it will require some refactorings to make the various construction algorithms aware of for dealing with a file or a text. I am looking into that right now.

One open question that I have is the current design of the construction call. There are basically two public interfaces:

void construct(t_index& idx, const std::string& file, cache_config& config, uint8_t num_bytes = 0);

and

void construct_im(t_index& idx, t_data&& data, uint8_t num_bytes = 0)

The first one always expects the cache to be given (also it is not documented). How is this interface intended to be used? I mean I can now cache a SA that might be different to the text that I pass. Basically, I think that I always need to construct something from a text. That is either inmemory or a file which needs to be loaded. So instead of passing the cache_config as a parameter, I would return it from the function with the filled data. If the cache_config is not captured then all temporary resources will be freed. Otherwise, the resources live as long as someone holds a handle on the cache_config. That would allow to still call more or less internal interfaces to construct certain data structures in a lazy fashion but simplifies the standard use case for the clients and makes the both interfaces more alike.

So here would be my current design proposal for the construcion functions:

cache_config construct(t_index& idx, const std::string& file, uint8_t num_bytes = 0);
text_cache_config<t_data> construct_im(t_index& idx, t_data&& data, uint8_t num_bytes = 0)

Later we can discuss about the config system. If we are allowing to set the construction algorithm from the client then the config chain config.option1(sdsl::divsufsort).option2(...) like proposed above must return a new config object and not a reference to the modified config. We could also model it like this new_config = old_config | option1(my_saca) | option2(...), which is more expressive in the sense that a new type will be returned and not the current one modified.

So that is my current take. Maybe I am missing a lot of issues and other design problems so feedback and other solutions/alternatives would be highly appreciated, especially since this is my first rodeo with the sdsl. Many thanks.

h-2 changed the title ~~suffix array constructions algorithms other than libdivsufsort~~ suffix array/ bwt construction algorithms other than libdivsufsort Sep 20, 2017

h-2 mentioned this issue Sep 20, 2017

(compressed) suffic arrays over collections #28

Open

rrahn mentioned this issue Jun 16, 2019

Suffix Array Construction in SDSL using lambda functions. seqan/seqan3#1099

Closed

rrahn mentioned this issue Mar 27, 2020

[Search] Provide mechanism to allow specification of the SACA algorithm from outside in order to have a better control over the index construction for different use cases. seqan/product_backlog#25

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

suffix array/ bwt construction algorithms other than libdivsufsort #27

suffix array/ bwt construction algorithms other than libdivsufsort #27

suffix array/ bwt construction algorithms other than libdivsufsort #27

suffix array/ bwt construction algorithms other than libdivsufsort #27

Comments