Potential problem with writing gz files
I'm using fgoxide to write a test involving .vcf.gz files.
I've defined some input lines like this:
let tempdir = TempDir::new().unwrap();
let io = fgoxide::io::Io::default();
let input = tempdir.path().join("input.vcf.gz");
let input_data = vec![
r#"##fileformat=VCFv4.0 "#,
r#"##build_id=20201027095038"#,
r#"##Population=https://www.ncbi.nlm.nih.gov/biosample/?term=GRAF-pop"#,
r#"##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">"#,
r#"##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">"#,
r#"#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMN10492695 SAMN10492696 SAMN10492697 SAMN10492698 SAMN10492699 SAMN10492700 SAMN10492701 SAMN10492702 SAMN11605645 SAMN10492703 SAMN10492704 SAMN10492705"#,
r#"NC_000001.9 144135212 rs1553120241 G A . . . AN:AC 8560:5387 8:8 256:224 336:288 32:24 170:117 32:24 18:13 20:15 344:296 288:248 9432:6100"#,
r#"NC_000001.9 144148243 rs2236566 G T . . . AN:AC 5996:510 0:0 0:0 0:0 0:0 0:0 0:0 0:0 84:8 0:0 0:0 6080:518"#,
];
io.write_lines(&input, &input_data)?;
My test is failing with Error: bytes remaining on stream, but passing with files I've run manually.
If I extract the generated VCF file and re-compress it with bgzip, my tool works:
gunzip input.vcf.gz
bgzip -c input.vcf > input2.vcf.gz
The same issue occurs when running the tool directly on the files (so the test itself isn't the problem, unless I've written the input file incorrectly).
The bgzip version is slightly larger, but the decompressed VCF content of the two files is identical:
diff <(gzcat test_path/input.vcf.gz) <(gzcat test_path/input2.vcf.gz)
@jrm5100 I wonder - how is your tool reading the VCF? I'm fairly sure that io::write_lines() will write a vanilla gzipped file if the extension is .gz - not a bgzipped file. Perhaps the tool you're running expects gzipped files to be bgzipped?
If that is the case I'd also be open to making io::write_lines() write bgzipped instead of plain gzipped files as that's probably more generally useful.
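For reference, the two formats can be told apart from the first few bytes of the file. Below is a minimal sketch, assuming the BGZF layout described in the SAM/BAM specification: a gzip member with the FEXTRA flag set whose extra field contains a "BC" subfield of length 2. The name is_bgzf is a hypothetical helper for illustration, not part of fgoxide.

```rust
/// Returns true if `bytes` starts with what looks like a BGZF block header.
/// Hypothetical helper; the layout is assumed from the SAM/BAM specification.
fn is_bgzf(bytes: &[u8]) -> bool {
    // Fixed gzip header: magic 0x1f 0x8b, CM = 8 (deflate), then FLG at byte 3.
    // We also need the 2-byte little-endian XLEN at bytes 10..12.
    if bytes.len() < 12 || bytes[0] != 0x1f || bytes[1] != 0x8b || bytes[2] != 8 {
        return false;
    }
    // FLG.FEXTRA (bit 2) must be set; a plain gzip file usually has no extra field.
    if bytes[3] & 0x04 == 0 {
        return false;
    }
    let xlen = u16::from_le_bytes([bytes[10], bytes[11]]) as usize;
    let extra = match bytes.get(12..12 + xlen) {
        Some(e) => e,
        None => return false, // header truncated
    };
    // Walk the extra subfields: SI1, SI2, SLEN (u16 LE), then SLEN data bytes.
    let mut i = 0;
    while i + 4 <= extra.len() {
        let slen = u16::from_le_bytes([extra[i + 2], extra[i + 3]]) as usize;
        if extra[i] == b'B' && extra[i + 1] == b'C' && slen == 2 {
            return true;
        }
        i += 4 + slen;
    }
    false
}
```

Passing the first few dozen bytes of input.vcf.gz and input2.vcf.gz to a check like this should show whether each file is plain gzip or BGZF, even though both decompress to the same text.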
Yeah, it just occurred to me the incompatibility might be with noodles-bgzf, and I guess the name might confirm that your hunch is correct.
That's what I get for trying to write up a strange issue on a Friday evening without thinking it through.
Not sure what the best approach is. Perhaps .vcf.gz should write bgzip files (since that is the norm for VCF), but I don't know that I'd want to encode too much logic based on file extensions.
What about the following simple Rust pseudo-code, which could be refactored into its own function? If it's only for testing purposes, that should work. We tend to use this library for compression: https://github.com/sstadick/gzp
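use std::{fs::File, io::{BufWriter, Write}};
use gzp::{BgzfSyncWriter, Compression}; // import paths assumed; check the gzp docs
use tempfile::TempDir; // assumes TempDir comes from the tempfile crate

let tempdir = TempDir::new().unwrap();
let input = tempdir.path().join("input.vcf.gz");
let writer = BufWriter::new(File::create(&input).unwrap());
// Write BGZF blocks rather than a single plain gzip stream.
let mut bgzf_writer = BgzfSyncWriter::new(writer, Compression::new(3));
bgzf_writer.write_all(b"@NAME\nGATTACA\n+\nIIIIIII\n").unwrap();
bgzf_writer.flush().unwrap();
// Drop the writer (not the undefined gz_writer) before the file is read back.
drop(bgzf_writer);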
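One possible shape for that refactoring, as a sketch only: a helper that is generic over any std::io::Write, so a test can hand it a BGZF writer while the helper itself stays independent of the compression library. The name write_lines_to is hypothetical and is not part of fgoxide or gzp.

```rust
use std::io::{self, Write};

/// Writes each line followed by '\n' to `writer`, then flushes it.
/// Hypothetical helper: the caller decides what kind of writer to pass in,
/// e.g. a plain file, an in-memory buffer, or a BGZF-compressing writer.
fn write_lines_to<W: Write, S: AsRef<str>>(mut writer: W, lines: &[S]) -> io::Result<()> {
    for line in lines {
        writer.write_all(line.as_ref().as_bytes())?;
        writer.write_all(b"\n")?;
    }
    writer.flush()
}
```

In a test, the writer argument would be the BgzfSyncWriter built as in the snippet above, and lines would be the input_data vector from the original report. Keeping the helper generic also makes it easy to check against an in-memory Vec<u8> without touching the filesystem.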