divonlan/genozip

Possible bug in BAM compression?

lweasel opened this issue · 2 comments

Hi,

Really nice tool! The speed and compression improvements over tools such as gzip are very impressive.

I think there may be a bug in the compression of BAM files. The BAM file I was originally trying has millions of records, but I narrowed the problem down to a single record. If I run genozip (v11.0.2) on a SAM file containing only the following line, it works fine (genozip --threads 1 -f test.sam):

NS500125:680:HNHVYBGXG:2:11209:16805:14650 256 4 145637796 1 9M1494270N67M * 0 GAGTACGGGGAAGTCATGGAGGGAGACTAGTGCCTAGTATTTGCGGTGCCTGAAAACTTTCTTAAGAAGCAGTTGT A/AAAEEEEEEEEEEEEEAE/EAEEEEEE6AEAEEEEEEEEAEEE<EAAEEEEEEEEEEEEE/EEEAEEEEAAEAE NH:i:4 HI:i:4 AS:i:69 nM:i:1 XS:A:+

However, if I convert that SAM file to a BAM file (I'm using sambamba: sambamba view -S -f bam test.sam -o test.bam), and run genozip --threads 1 -f test.bam, I get the following output:

genozip test.bam : 0%
op_len=1 too long in vb=1494270:
[1] 28905 abort (core dumped) genozip --threads 1 -f test.bam

I think it is complaining about the length of the number in the middle of the CIGAR string (i.e. 1494270). If I remove one digit from that number and reconvert the SAM file to BAM, genozip runs without error.
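For context, here is a minimal sketch of how the SAM/BAM specification packs CIGAR operations: each operation is stored as a little-endian 32-bit integer, with the length in the upper 28 bits and the operation code in the lower 4 bits. The Python below is illustrative only and makes no claim about genozip's internals; the function names are hypothetical. It suggests that a length of 1494270 is well within what BAM itself allows, so the limit being hit is presumably elsewhere.

```python
import re
import struct

# BAM operation codes 0..8 map to these characters (SAM/BAM specification).
CIGAR_OPS = "MIDNSHP=X"
# The length occupies the upper 28 bits of each packed uint32.
MAX_OP_LEN = (1 << 28) - 1


def parse_cigar(cigar):
    """Split a SAM CIGAR string into (op_len, op_char) pairs."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Reject strings with anything the pattern did not consume.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in pairs]


def encode_bam_cigar(cigar):
    """Pack each operation as a little-endian uint32: op_len << 4 | op_code."""
    out = b""
    for op_len, op in parse_cigar(cigar):
        if op_len > MAX_OP_LEN:
            raise ValueError(f"op_len={op_len} does not fit in 28 bits")
        out += struct.pack("<I", (op_len << 4) | CIGAR_OPS.index(op))
    return out


def decode_bam_cigar(data):
    """Inverse of encode_bam_cigar: rebuild the SAM CIGAR string."""
    parts = []
    for (packed,) in struct.iter_unpack("<I", data):
        parts.append(f"{packed >> 4}{CIGAR_OPS[packed & 0xF]}")
    return "".join(parts)


if __name__ == "__main__":
    cigar = "9M1494270N67M"  # the CIGAR string from the record above
    packed = encode_bam_cigar(cigar)
    print(len(packed), decode_bam_cigar(packed) == cigar)
```

Under this encoding the 7-digit skip length round-trips without loss, which is consistent with the SAM input compressing fine while only the BAM path fails.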

Thanks, Owen, for reporting this! I have fixed the bug.

That's great, thank you!