Corrupted tabix index #393

dariober · 2015-11-23T23:38:18Z

Hello,

It seems to me that tabix indexes created with writeBasedOnFeatureFile are corrupted. Am I doing something wrong?

Here's a BED test file with 1M rows on 10 chromosomes. I bgzipped it with tabix 1.2.1:
https://dl.dropboxusercontent.com/u/53487723/tmp.bedGraph.gz

I created an index for it with the following program (java 8, htsjdk-1.141):

import java.io.File;
import java.io.IOException;
import htsjdk.tribble.bed.BEDCodec;
import htsjdk.tribble.index.IndexFactory;
import htsjdk.tribble.index.tabix.TabixFormat;
import htsjdk.tribble.index.tabix.TabixIndex;

public class TestTabix {

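    public static void main(String[] args) throws IOException {

        // Block-compressed (bgzip) BED file to index.
        String bgzfOut = "tmp.bedGraph.gz";

        // Build the tabix index in memory; no sequence dictionary is supplied (null).
        TabixIndex tabixIndexGz =
                IndexFactory.createTabixIndex(new File(bgzfOut), new BEDCodec(), TabixFormat.BED,
                        null);

        // Write the index alongside the feature file.
        tabixIndexGz.writeBasedOnFeatureFile(new File(bgzfOut));
    }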

}

Querying the resulting index returns only the first records of chr1:

tabix tmp.bedGraph.gz chr1 | wc -l  # Should be 100000
    4500
tabix tmp.bedGraph.gz chr2 | wc -l  # Should be 100000
       0

magicDGS · 2016-09-17T13:34:47Z

Can you try with the latest master version, @dariober? I fixed some bugs in the Tabix/Tribble indexing recently...

dariober · 2016-11-23T21:11:49Z

Hi - Thanks for looking into this, and apologies for going quiet. I tested the code above on htsjdk-2.7.0 and the problem persists.

Also, after indexing the following file (call it chrom.bed.gz):

chr1	1	10
chr1	100	1000000

The query

tabix chrom.bed.gz chr1

skips the first record and returns:

chr1	100	1000000

However, this file is correctly queried:

chr1	1	10
chr1	100	1000

It seems to me there is something going on when features are in the first bin 0-16384.

(Just to make sure... Can anybody reproduce this bug, or is it just me?)
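
To make the "first bin" observation concrete, here is a minimal sketch of which 16 kb linear-index windows each of the test features touches. It assumes the standard tabix linear index granularity of 2^14 bases and BED-style 0-based, half-open coordinates; the class and method names are hypothetical and this is not htsjdk code. It does not reproduce the bug, it only shows that the record that goes missing shares window 0 with a feature spanning many windows, while the file that queries correctly stays entirely inside window 0.

```java
public class LinearIndexWindows {

    // Assumed granularity of the tabix linear index: 2^14 = 16384 bases per window.
    static final int WINDOW_SHIFT = 14;

    /** Returns the linear-index window that contains a 0-based position. */
    public static int windowOf(int zeroBasedPos) {
        return zeroBasedPos >> WINDOW_SHIFT;
    }

    public static void main(String[] args) {
        // Start/end pairs taken from the two small test files above.
        int[][] features = { {1, 10}, {100, 1000}, {100, 1000000} };
        for (int[] f : features) {
            int first = windowOf(f[0]);
            int last = windowOf(f[1] - 1); // end coordinate is exclusive
            System.out.println(f[0] + "-" + f[1] + " covers windows " + first + ".." + last);
        }
    }
}
```

Under these assumptions, the features 1-10 and 100-1000 only touch window 0, whereas 100-1000000 covers windows 0 through 61.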

magicDGS · 2017-03-09T11:47:34Z

This is also happening to me, @dariober. I will create a PR with a failing index to point out that this is happening and ask how it can be fixed.

cmnbroad · 2017-06-16T21:05:40Z

@dariober There are a few problems with the IndexFactory code paths that create indices based on an input file. The worst one is that although the indexer recognizes block-compressed inputs, it wraps a PositionalBufferedStream around the BlockCompressedInputStream used to decode them, so it hands the indexer byte offsets instead of virtual file pointers. Second, the first feature added to the index is always indexed as if it were at offset 0 in the input, even if the input file has a header. Third, there is a bug in the calculation of the linear index part of the tabix index that manifests when a feature is located at offset 0 in the input file (which can happen with BED). Fixes are forthcoming.
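
For readers unfamiliar with the distinction, the following is a minimal sketch of how a BGZF virtual file pointer differs from a plain byte offset. It assumes the layout described in the BGZF section of the SAM specification (compressed block address in the upper 48 bits, offset within the uncompressed block in the lower 16 bits). The class and method names are hypothetical and this is not the htsjdk implementation; it only illustrates why a reader that expects virtual pointers would seek to the wrong place when given byte offsets.

```java
public class VirtualOffsetSketch {

    /** Packs a compressed block address and an in-block offset into one virtual pointer. */
    public static long makeVirtualOffset(long blockAddress, int offsetInBlock) {
        if (offsetInBlock < 0 || offsetInBlock > 0xFFFF) {
            throw new IllegalArgumentException("offset within a block must fit in 16 bits");
        }
        return (blockAddress << 16) | offsetInBlock;
    }

    /** Upper 48 bits: where the BGZF block starts in the compressed file. */
    public static long blockAddress(long virtualOffset) {
        return virtualOffset >>> 16;
    }

    /** Lower 16 bits: position inside the decompressed block. */
    public static int offsetInBlock(long virtualOffset) {
        return (int) (virtualOffset & 0xFFFF);
    }

    public static void main(String[] args) {
        // A correct pointer: block starting at compressed byte 18000, 1234 bytes into it.
        long good = makeVirtualOffset(18_000L, 1_234);
        System.out.println(blockAddress(good) + " / " + offsetInBlock(good));

        // A plain byte offset stored as if it were a virtual pointer gets misread.
        long plainByteOffset = 70_000L;
        System.out.println("misread as block " + blockAddress(plainByteOffset)
                + ", in-block offset " + offsetInBlock(plainByteOffset));
    }
}
```

In this sketch the plain offset 70000 decodes to block address 1 and in-block offset 4464, neither of which corresponds to a real block boundary, which is consistent with tabix returning few or no records from such an index.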

dariober mentioned this issue Nov 28, 2016

Corrupted TabixIndex dariober/ASCIIGenome#38

Closed

magicDGS mentioned this issue Mar 9, 2017

Failing test for BED tabix indexing #820

Closed

5 tasks

magicDGS mentioned this issue Jun 2, 2017

IndexFeatureFile produces an incorrect tabix index on .vcf.gz files with a large header broadinstitute/gatk#2801

Closed

cmnbroad mentioned this issue Jun 20, 2017

Fix bugs in IndexFactory and BinningIndexBuilder. #906

Merged

2 tasks

magicDGS mentioned this issue Jul 25, 2017

BinningIndexBuilder assumes no feature can be at offset 0 #943

Closed

cmnbroad closed this as completed in #906 Aug 9, 2017

cmnbroad mentioned this issue Aug 14, 2017

TableCodec isn't properly indexable due to buffering done by OpenCSV reader. broadinstitute/gatk#3440

Open
