williamslab/ped-sim

bgzip for vcf output

genignored opened this issue · 9 comments

Hi,

Thank you for being so responsive. I hope this will be my last issue. Running ped-sim with compressed vcf files, the output from pedsim is gzipped vcf (.vcf.gz). However, bcftools requires that the file be compressed with bgzip, not gzip (which is also a .gz file, for reasons that I have never understood).

Is it possible to update to use bgzip not gzip for outputs?

Thank you,
=Matt

I'd like to do this. A few options:

  • Incorporating htslib code -- this wasn't so straightforward a few years ago when I looked, but they may have simplified the calls now. If you have the bandwidth to look, that'd be most helpful. Sadly, I'm not sure when this can happen unless someone can do the research.
  • Alternatively, and quite easily, Ped-sim could be modified to print the VCF output to, say, stderr, and you could pipe this to bgzip.
  • Related to the above, you could tell Ped-sim where the bgzip program is, and it could pipe to it internally.

Do you have a preference among these?

humm.

It's your code, so I won't say that I know what's best, but usually it seems like a better idea not to require the user have dependencies installed, unless absolutely necessary, so it's probably better to try for htslib.
For a temporary fix, would it be possible to have it output uncompressed vcf files, regardless of input type? Just to reduce confusion?

I am not sure whether I have the chops to work up htslib code or not, but I will take a look when I get a chance later today.

Yes, the way to do this is actually to gunzip it yourself, but you can do that inline in bash as:

ped-sim [options] -i <(zcat [input.vcf.gz])

In other shells, this hopefully works:

zcat [input.vcf.gz] | ped-sim [options] -i /dev/stdin

I think the relevant section of code is here:

ped-sim/bpvcffam.cc

Lines 118 to 130 in f975c59

int inVCFlen = strlen(inVCFfile);
if (strcmp(&CmdLineOpts::inVCFfile[ inVCFlen - 3 ], ".gz") == 0) {
sprintf(outFile, "%s.vcf.gz", CmdLineOpts::outPrefix);
makeVCF<gzFile>(simDetails, theSamples, totalFounderHaps,
CmdLineOpts::inVCFfile, /*outVCFfile=*/ outFile, map,
outs);
}
else {
sprintf(outFile, "%s.vcf", CmdLineOpts::outPrefix);
makeVCF<FILE *>(simDetails, theSamples, totalFounderHaps,
CmdLineOpts::inVCFfile, /*outVCFfile=*/ outFile, map,
outs);
}

I don't know enough about programming with htslib, but I think if you copy the relevant code in the else block up to the block that executes if input is .vcf.gz, then it'll output uncompressed .vcf regardless. You could still keep the block in place, such that in the future you could use htslib to do this.

But, IMHO, if you believe time is more expensive than space, I think uncompressed vcf (knowing it could be large) is better than gzipped (non-bgzip) VCF, which could require a user to decompress then recompress. If developers agree I'm happy to test those changes and submit a PR with the updated code.

It's easy to either make non-compressed vcfs the default behavior with an option to gzip (non-bgzip) or keep the current defaults an add an option to output uncompressed vcfs. But note that zcatting the input vcf works to get uncompressed output (as per the bash code above.)

I personally like the idea of uncompressed as default, but I don't know if that would introduce a breaking change to other users who expect gz output. I haven't tested yet with the bash process substitution, but that should be pretty straightforward without requiring a breaking change to the code.

The bash substitution works fine. I glanced at the code, hoping that it would be easy to just turn off the option for gz output at all, but it looks like that's non-trivial. Agree that there isn't a reason to break the outputs for others if they are using gz outs, just because it makes me add a second step into my scripts.

If there were an option/flag to always output to non-compressed, that would be spectacular.

You got it! Just pushed a --nogz option. Post here with any issues.

Toying with the idea of allowing a user to supply the location of the bgzip program for bgzipped output.