fulcrumgenomics/fgoxide

Potential problem with writing gz files

Opened this issue · 3 comments

I'm using fgoxide to write a test involving .vcf.gz files.

I've defined some input lines like

let tempdir = TempDir::new().unwrap();
let io = fgoxide::io::Io::default();

let input = tempdir.path().join("input.vcf.gz");
let input_data = vec![
        r#"##fileformat=VCFv4.0	"#,
        r#"##build_id=20201027095038"#,
        r#"##Population=https://www.ncbi.nlm.nih.gov/biosample/?term=GRAF-pop"#,
        r#"##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">"#,
        r#"##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">"#,
        r#"#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMN10492695	SAMN10492696	SAMN10492697	SAMN10492698	SAMN10492699	SAMN10492700	SAMN10492701	SAMN10492702	SAMN11605645	SAMN10492703	SAMN10492704	SAMN10492705"#,
        r#"NC_000001.9	144135212	rs1553120241	G	A	.	.	.	AN:AC	8560:5387	8:8	256:224	336:288	32:24	170:117	32:24	18:13	20:15	344:296	288:248	9432:6100"#,
        r#"NC_000001.9	144148243	rs2236566	G	T	.	.	.	AN:AC	5996:510	0:0	0:0	0:0	0:0	0:0	0:0	0:0	84:8	0:0	0:0	6080:518"#,
] ;

io.write_lines(&input, &input_data)?;

My test is failing with Error: bytes remaining on stream, but passing with files I've run manually.

If I extract the generated VCF file and re-compress it with bgzip

gunzip input.vcf.gz
bgzip -c input.vcf > input2.vcf.gz

My tool works. The same issue occurs when running the tool directly on the files (so the test itself isn't the problem, unless I've written the input file incorrectly).

The bgzip version is slightly larger.

There's no difference in the vcf version of each.

diff <(gzcat test_path/input.vcf.gz) <(gzcat test_path/input2.vcf.gz)

input.vcf.gz
input2.vcf.gz

@jrm5100 I wonder - how is your tool reading the VCF? I'm fairly sure that io::write_lines() will write a vanilla gzipped file if the extension is .gz - not a bgzipped file. Perhaps the tool you're running expects gzipped files to be bgzipped?

If that is the case I'd also be open to making io::write_lines() write bgzipped instead of plain gzipped files as that's probably more generally useful.

Yeah, it just occurred to me the incompatibility might be with noodles-bgzf, and I guess the name might confirm that your hunch is correct.

That's what I get for trying to write up a strange issue on a Friday evening without thinking it through.

Not sure what the best approach is. Perhaps .vcf.gz should write bgzip files (since that is the norm for vcf), but I don't know that I'd want to encode too much logic into how file extensions are written.

nh13 commented

What about the following simple rust psuedo-code that could be refactored into its own function. If it's only for testing purposes, that should work. We tend to use this library for compression: https://github.com/sstadick/gzp

let tempdir = TempDir::new().unwrap();
let io = fgoxide::io::Io::default();
let input = tempdir.path().join("input.vcf.gz");
let writer = BufWriter::new(File::create(&input).unwrap());
let mut bgzf_writer = BgzfSyncWriter::new(writer, Compression::new(3));
bgzf_writer.write_all(b"@NAME\nGATTACA\n+\nIIIIIII\n").unwrap();
bgzf_writer.flush().unwrap();
drop(gz_writer);