amplab/snap

Original fastq order

jkbonfield opened this issue · 2 comments

Hello,

I see you are now working on a dev branch of this (affineGap), so please consider this as a feature request.

SNAP is astoundingly fast, especially when driven to the limit with e.g. -f -D 0. It doesn't produce ideal alignments in that mode, but my usage is perhaps a little different.

I'd like to leverage this for rapid but effective fastq compression. It aligns over a million reads/sec on my 8-core system. These can then be piped into samtools to output compressed CRAM, giving around a 4-fold reduction over the size of gzipped fastq, but still at a fast rate (> 500k seq/sec). That's actually faster than the rate at which pigz compresses my test data(!), although not as fast as bgzip, which uses libdeflate and is 3x quicker than pigz. (You may wish to consider using that over zlib.)

However, the output order is not quite the same as the input order. I believe this is because each thread writes a block of data as its buffer fills, but these blocks aren't allocated and written in input order, so there is randomness in the output order. If instead they were put into an output queue and written in order, it would be possible to use snap+cram as a fastq compressor, with "samtools fastq" as the fastq decompressor. This would probably need to be optional, as it may have a slight performance hit.
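To illustrate the ordered-output idea, here's a minimal sketch (in Python rather than SNAP's own code, and with names of my own invention): worker threads tag each finished block with a sequence number, and a single writer parks out-of-order blocks until their turn comes up.

```python
import io
import queue
import threading

def ordered_writer(out, blocks, n_blocks):
    """Drain (seq_no, data) pairs from `blocks`, writing them in
    sequence order regardless of the order in which threads finish."""
    pending = {}      # out-of-order blocks parked until their turn
    next_seq = 0
    while next_seq < n_blocks:
        seq_no, data = blocks.get()
        pending[seq_no] = data
        # Flush every block now contiguous with what's been written.
        while next_seq in pending:
            out.write(pending.pop(next_seq))
            next_seq += 1

# Demo: three "aligner threads" deliver their blocks in scrambled order.
out = io.StringIO()
blocks = queue.Queue()
writer = threading.Thread(target=ordered_writer, args=(out, blocks, 3))
writer.start()
for seq_no, data in [(2, "block2\n"), (0, "block0\n"), (1, "block1\n")]:
    blocks.put((seq_no, data))
writer.join()
print(out.getvalue(), end="")
```

The cost is only the memory to hold blocks that arrive early, which is why making it optional seems reasonable.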

A related, trivial request would be an option to not emit the SAM header (or to replace it with my own). That's necessary so I can supply a pre-filled header with M5 tags, letting samtools do the CRAM conversion more rapidly by mmap()ing md5-named sequence files rather than parsing and loading a ref.fa file.
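For concreteness, a sketch of building such an @SQ line (the `sq_line` helper is hypothetical; my understanding of the SAM spec is that M5 is the MD5 of the sequence uppercased with whitespace removed):

```python
import hashlib

def sq_line(name, seq):
    """Build an @SQ header line carrying the reference's M5 checksum.
    Assumes the SAM-spec normalisation: strip whitespace, uppercase."""
    norm = "".join(seq.split()).upper()
    md5 = hashlib.md5(norm.encode()).hexdigest()
    return f"@SQ\tSN:{name}\tLN:{len(norm)}\tM5:{md5}"

# Toy reference; a real header would have one line per chromosome.
line = sq_line("chr_test", "acgt acgtn")
print(line)
```

With the M5 tags present, samtools can look up each reference by checksum instead of re-reading the fasta.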

The obvious next improvement would be to link directly against htslib so SNAP can emit the correct format itself. This would also permit considerably faster SAM and BAM encoding.

Thanks for the reply. Some useful hints there.

NoSQ is sufficient, I think: I can just prepend my output with a set of amended @SQ lines before piping into the SAM-to-CRAM conversion tool.

I do need alignments, albeit basic ones, as CRAM uses the alignment between sequence and reference for compression purposes. However, with Illumina data the number of reads with indels tends to be pretty small, so a basic match gets you most of the way, I'd think.
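The reason alignments matter can be shown with a toy model of reference-based encoding (real CRAM also handles indels, qualities, and much more, and these function names are mine, not CRAM's): an aligned read is stored as a position plus only its differences from the reference.

```python
def encode_read(ref, pos, read):
    """Store only the mapped position, length, and mismatching bases.
    A perfectly matching read compresses to just (pos, length, [])."""
    diffs = [(i, b) for i, b in enumerate(read) if ref[pos + i] != b]
    return pos, len(read), diffs

def decode_read(ref, pos, length, diffs):
    """Reconstruct the read from the reference plus stored diffs."""
    read = list(ref[pos:pos + length])
    for i, b in diffs:
        read[i] = b
    return "".join(read)

ref = "ACGTACGTACGT"
pos, length, diffs = encode_read(ref, 4, "ACTTA")  # one mismatch
print(diffs)
print(decode_read(ref, pos, length, diffs))
```

Even a crude ungapped match shrinks most Illumina reads to a handful of diffs, which is why basic alignments are enough for my purposes.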

As for htslib: it does support Windows, and that's part of our continuous integration testing, though not with the MSVC compiler. We build using MinGW, but it ought to be linkable from a program built with MSVC, I think. It's possible CRAM could be implemented faster, but probably not hugely so, and sadly it's a very complex beast. I'd also doubt your SAM parser is more efficient, and BAM definitely isn't, as you use zlib. BAM is dominated by the Deflate algorithm, and htslib's use of libdeflate trounces zlib. htslib's memory management lets it down a bit, though. (Of course you could, and probably should, use libdeflate yourselves, even if only for fastq.gz decoding.)

Looking forward to 1.0. :-)