Corrupted tabix index #393

Closed
dariober opened this issue Nov 23, 2015 · 4 comments

@dariober

Hello,

It seems to me that tabix indexes created with writeBasedOnFeatureFile are corrupted. Am I doing something wrong?

Here's a bed test file, it has 1M rows on 10 chroms. I bgzipped with tabix 1.2.1:
https://dl.dropboxusercontent.com/u/53487723/tmp.bedGraph.gz

I created an index for it with the following program (java 8, htsjdk-1.141):

import java.io.File;
import java.io.IOException;
import htsjdk.tribble.bed.BEDCodec;
import htsjdk.tribble.index.IndexFactory;
import htsjdk.tribble.index.tabix.TabixFormat;
import htsjdk.tribble.index.tabix.TabixIndex;

public class TestTabix {

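    public static void main(String[] args) throws IOException {

        // Block-compressed (bgzip) BED file to index.
        String bgzfOut= "tmp.bedGraph.gz";
        // Build a tabix index from the feature file; null = no sequence dictionary.
        TabixIndex tabixIndexGz =
                IndexFactory.createTabixIndex(new File(bgzfOut), new BEDCodec(), TabixFormat.BED,
                        null);
        // Write the index alongside the input file.
        tabixIndexGz.writeBasedOnFeatureFile(new File(bgzfOut));
    }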

}

Querying the resulting index returns only the first records of chr1:

tabix tmp.bedGraph.gz chr1 | wc -l  # Should be 100000
    4500
tabix tmp.bedGraph.gz chr2 | wc -l  # Should be 100000
       0

@magicDGS
Member

Can you try with the latest master version, @dariober? I fixed some bugs in the Tabix/Tribble indexing recently...

@dariober
Author

Hi - Thanks for looking into this, and apologies for going quiet. I tested the code above on htsjdk-2.7.0 and the problem persists.

Also, after indexing the following file (call it chrom.bed.gz):

chr1	1	10
chr1	100	1000000

The query

tabix chrom.bed.gz chr1

skips the first record and returns:

chr1	100	1000000

However, this file is correctly queried:

chr1	1	10
chr1	100	1000

It seems to me there is something going on when features are in the first bin 0-16384.
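For context, here is a minimal sketch of how the two test files map onto tabix bins and 16 kb linear-index windows. It assumes the standard binning scheme described in the SAM/tabix specifications (`reg2bin`, 0-based half-open coordinates); the class and method names are hypothetical and this is not htsjdk code, nor a diagnosis of the bug.

```java
// Sketch only: standard UCSC-style binning as described in the SAM/tabix specs.
// BinSketch and its methods are illustrative names, not part of htsjdk.
class BinSketch {

    // Smallest bin that fully contains the 0-based half-open interval [beg, end).
    static int reg2bin(int beg, int end) {
        --end;
        if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + (beg >> 14);
        if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + (beg >> 17);
        if (beg >> 20 == end >> 20) return ((1 << 9) - 1) / 7 + (beg >> 20);
        if (beg >> 23 == end >> 23) return ((1 << 6) - 1) / 7 + (beg >> 23);
        if (beg >> 26 == end >> 26) return ((1 << 3) - 1) / 7 + (beg >> 26);
        return 0;
    }

    // Index of the 16 kb (2^14 bp) linear-index window containing a position.
    static int linearWindow(int pos) {
        return pos >> 14;
    }

    public static void main(String[] args) {
        int[][] features = { {1, 10}, {100, 1000}, {100, 1000000} };
        for (int[] f : features) {
            System.out.printf("chr1 %d %d -> bin %d, windows %d..%d%n",
                    f[0], f[1], reg2bin(f[0], f[1]),
                    linearWindow(f[0]), linearWindow(f[1] - 1));
        }
    }
}
```

Under this scheme the short features (1-10 and 100-1000) both land in the lowest-level bin 4681 and in window 0 only, while 100-1000000 lands in the higher-level bin 73 and spans windows 0 to 61, so the failing file is the one where a feature crosses out of the first 16 kb window.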

(Just to make sure... Can anybody reproduce this bug, or is it just me?)

@magicDGS
Member
magicDGS commented Mar 9, 2017

This is also happening to me, @dariober. I will create a PR with a failing index to show that this is happening and ask how it can be fixed.

@cmnbroad
Collaborator
cmnbroad commented Jun 16, 2017

@dariober There are a few problems with the IndexFactory code paths that create indices based on an input file. The worst one is that although the indexer recognizes block-compressed inputs, it wraps a PositionalBufferedStream around the BlockCompressedInputStream used to decode them, so it hands the indexer byte offsets instead of virtual file pointers. In addition, the first feature added to the index is always indexed as if it were at offset 0 in the input, even if the input file has a header. Finally, there is a bug in the calculation of the linear index part of the tabix index that manifests when a feature is located at offset 0 in the input file (which can happen with BED). Fixes are forthcoming.
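To illustrate why byte offsets and virtual file pointers are not interchangeable, here is a minimal sketch. It assumes the BGZF layout described in the SAM specification, where a virtual file pointer packs the compressed block's start address in the upper 48 bits and the offset within the uncompressed block in the lower 16 bits. The class and method names are hypothetical, not htsjdk API.

```java
// Sketch only: BGZF virtual file pointer packing as described in the SAM spec.
// VirtualOffsetSketch and its methods are illustrative names, not htsjdk API.
class VirtualOffsetSketch {

    // Pack a compressed block start address and an offset inside the
    // uncompressed block into a single 64-bit virtual file pointer.
    static long pack(long blockAddress, int withinBlock) {
        return (blockAddress << 16) | (withinBlock & 0xFFFFL);
    }

    // Compressed-file address of the block (upper 48 bits).
    static long blockAddress(long virtualPointer) {
        return virtualPointer >>> 16;
    }

    // Offset within the uncompressed block (lower 16 bits).
    static int withinBlock(long virtualPointer) {
        return (int) (virtualPointer & 0xFFFFL);
    }

    public static void main(String[] args) {
        // A real virtual pointer: block starting at compressed byte 1000, offset 42.
        long vfp = pack(1000, 42);
        System.out.println(blockAddress(vfp) + " / " + withinBlock(vfp));

        // A plain byte offset (70000) misread as a virtual pointer decodes to
        // "block at compressed byte 1, offset 4464", which is not a block start.
        long byteOffset = 70000L;
        System.out.println(blockAddress(byteOffset) + " / " + withinBlock(byteOffset));
    }
}
```

A reader such as `tabix` decodes whatever the index stores as (block address, in-block offset), so an index filled with plain byte offsets points it at the wrong place in the compressed file.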


3 participants