full vcf to hdf5...
Closed this issue · 3 comments
This is probably not much of a priority, but it is bugging me so I'm going to ask...
Is there are more streamlined way of generating hdf5 from vcf without going through the .npy intermediate files? If not, is it likely to be difficult?
Also, unless I'm being confused, you're normally running vcf2npy twice (variants and caldata_2d) before collecting that output into an hdf5. Is there a reason why vcf2npy cannot output both in a single pass? The code isn't structured to make that easy, but I haven't dug deeply enough to tell if it is doing fundamentally different things in the variant pass and the calldata_2d pass. (Though variant is about 5x faster than calldata_2d.)
Hi Travis,
Sorry, this is the most streamlined way of doing things that I've come up with so far.
It would be possible to load directly from a VCF into an HDF5. However, I typically need to parallelise the VCF parsing step because VCFs are big and parsing (especially calldata) is slow. Having the intermediate npy outputs makes it easy to parallelise this step via a qsub style scheduler or similar. Once I have the npy files for all genome regions, I then load them in series into an HDF5, which is quick. It would be possible but harder to come up with a system for parallel loading directly into an HDF5.
The reason why variants and calldata_2d need separate runs is just a technicality, it's because I'm using np.fromiter() under the hood. Again it would be possible to come up with a system for generating both variants and calldata from a single pass. This would increase convenience but wouldn't save much compute time as the calldata is what takes all the time to generate - parsing calldata is sloooow. When generating the variants arrays, vcfnp does not parse the calldata columns in the VCF, hence much faster.
I promise this step (VCF -> HDF5) is the most pain you'll suffer, once you've got the HDF5 life becomes a lot better :-)
Thanks, I'm starting to get it... I think.
I have a working script now which goes from VCF to HDF5. It needs a bit more work, but should be good for multicore systems or shared filesystem clusters since I'm just using gnu parallel.
forgot to close...