Truncation issue with GZIPInputStream when reading bgzip-compressed streams
jkmatila opened this issue · 2 comments
Description of the issue
When reading bgzip-compressed data from a stream (e.g. over the network or from a pipe), htsjdk can stop early and not read the stream all the way to the end, processing partial data. This is due to using java.util.zip.GZIPInputStream
which has a known bug JDK-7036144 affecting reading concatenated gzip streams when the complete size of the stream is not known beforehand.
The problem has been discussed earlier in the comment thread to broadinstitute/gatk issue #4224, but it seems that a part of the recommended fix, replacing the use of the buggy GZIPInputStream
class, was not adopted. There also is a comment in SamInputResource.java which references this problem and the aforementioned JDK bug.
For example, GzipCompressorInputStream
from Apache Commons Compress would be an alternative which does not suffer from this problem, as noted in its Javadoc.
Your environment
- version of htsjdk: 4.0.2
- version of Java: OpenJDK 17.0.8.1 (Eclipse Temurin-17.0.8.1+1), OpenJDK 21 (Eclipse Temurin-21+35)
- which OS: Linux x86_64 (Ubuntu 22.04.3)
Steps to reproduce
Here is an example program which demonstrates one instance of the problem, VCFIteratorBuilder
using the problematic GZIPInputStream
class when reading a bgzipped VCF stream, and only part of the stream gets read. (Note: The code attempts to read a 200 MB file from the Web, although it in practice due to the bug it does not get very far.)
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import htsjdk.variant.vcf.VCFIterator;
import htsjdk.variant.vcf.VCFIteratorBuilder;
public class Repro {
public static void main(String[] args) throws IOException {
URL url = new URL("https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz");
int expectedNumRecords = 1103547;
int numRecords = 0;
try (InputStream in = url.openStream();
VCFIterator vcfIterator = new VCFIteratorBuilder().open(in)) {
while (vcfIterator.hasNext()) {
vcfIterator.next();
numRecords++;
if (numRecords % 100000 == 0) {
System.out.println("Read " + numRecords + " records...");
}
}
}
if (numRecords != expectedNumRecords) {
System.err.println("URL: " + url);
System.err.println("Expected: " + expectedNumRecords + " records");
System.err.println("Parsed: " + numRecords + " records");
System.exit(1);
}
}
}
There are also other places where GZIPInputStream
is used in htsjdk.
Expected behaviour
The compressed VCF stream should be read all the way to the end and all records parsed.
Actual behaviour
The code stops reading the stream at some point, not parsing all records. No error is returned and parsing appears successful but in fact produces only part of the expected data. The exact location where it stops probably depends on how the block boundaries happen to coincide with the buffer size. It appears that if you have a large enough input file, it will likely stop at some point before reaching the end.
On both OpenJDK 17 (Eclipse Temurin-17.0.8.1+1) and on OpenJDK 21 (Eclipse Temurin-21+35) using the official Eclipse Temurin Docker images the above code outputs the following:
URL: https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Expected: 1103547 records
Parsed: 7395 records
Yeah, this bug is awful. We worked around it in GATK but never really fixed it in htsjdk. I decided to message the java development mailing list last week about it. I wish I had done that years ago, since someone responded really quickly. After 12 years or so since it was first reported, this bug is going to be fixed soon. I'm hoping/asking that it be back ported to java 17 or at least 21.