amplab/snap

feature request: option to move fastq comments to sam tags

eboyden opened this issue · 26 comments

Hi, I was wondering if an option to move the fastq comment (everything after the first whitespace, including additional whitespaces) to a field in the output sam file could be added. If the fastq comments were properly formatted as tab-delimited sam tags, then this would result in a properly formatted sam file with all metadata preserved. This is similar functionality to the BWA MEM -C and Bowtie2 --sam-append-comment options, and is useful for e.g. transferring UMI information from fastqs to sam files without relying on additional files and formatting steps. Thanks!

Hi,

is there an update for this issue? We've got some fastq's with UMI data in sam comments we'd like to be able to align.

Thanks

I haven't done any work on it, but it's probably not all that hard to do. It may have a performance effect because it will most likely require dynamic memory allocation, which we studiously avoid in the main loop, but since it's optional that's OK. I imagine that any degradation would be far less than a post-processing stage.

I'll think about it more.

Thanks for the reply! Great to know you're keeping it in mind.
Looking forward to what the next release might bring 😄

How important is it for you to have this supported for BAM output? With SAM I can just copy the string, but for BAM I'd have to parse it to break it down into individual tags, which is a pain.

I'm planning on having it rely on the input being correct like BWA-mem does, so if you feed it a FASTQ file with comments that don't look like SAM optional fields it will just blindly copy them and generate broken SAM. I presume that's also OK.

Definitely okay to blindly copy like BWA and Bowtie2 do; user beware if they misformat the fastq comment. It would be preferable if it could output BAM; if you just blindly copy the entire string whitespace and all, requiring the user to have formatted it properly, do you still have to parse the individual tags?

Yeah, I have to parse for BAM. BAM isn't just compressed SAM, it's got a little more structure, and part of that is that each tag has its own element.

Not that big of a deal, I'd just hoped to avoid doing it.

Well beggars can't be choosers. I'd love BAM output but I'm not the one doing the work, and there are workarounds (SAM>BAM) if you decide it's too much effort and not high enough priority. Much appreciate whatever you're willing to do.

I've already put in the effort to plumb it through the parsing is only a few hours to implement. And using samtools for SAM->BAM is often slower than FASTQ->BAM with SNAP, so not having it kind of defeats the point.

How much do you need the "B" (binary array) tag type? It's kind of a pain to convert to BAM, so I'm inclined to leave it unimplemented unless you need it.

Not only have I never needed it, I'm not sure I've ever even seen a tag of that format. So I say don't bother. If someone needs it down the road, it can be addressed then.

Same here!

I just pushed the code for this into the dev branch. Check it out and let me know if it works for you.

To get the behavior, you'll need to include the new -pfc (preserve FASTQ comments) switch. Otherwise, you get the old behavior.

It's in 2.0.2.dev.4.

Please let me know if this works. We're thinking about pushing it to master, but it would be nice to have feedback before we do.

Yes, sorry - will do so by Monday

It seems to work great when outputting to SAM. When outputting to BAM, however, it runs to completion without error but produces a corrupt file; I get this error message when attempting to read the file with Samtools:

[E::sam_format1_append] Corrupted aux data for read MN01688:12:000H3MG5N:1:11102:4839:1176
samtools view: writing to standard output failed: Invalid argument

It works on my machine with my input/index (not that the index is likely to matter).

Would you be willing to send me an example fastq to try?

Interleaved fastq plus a (good) sam and a (corrupt) bam (which I had to zip to get it to upload)
NTC.pairedInterleaved.fastq.gz
NTC.pairedInterleaved.sam.gz
NTC.pairedInterleaved.bam.gz

Try 2.0.2.dev.7. Turns out I didn't fix the used buffer size properly for paired read output. I'd only tested it for single, which worked fine.

Interestingly, that small example I uploaded ran fine; however a substantially larger fastq did not, and samtools threw a similar error as before:

Error reading input file
[E::bam_aux_get] Corrupted aux data for read MN01688:12:000H3MG5N:1:12108:23198:14782
[E::bam_aux_get] Corrupted aux data for read MN01688:12:000H3MG5N:1:12108:23198:14782

It seems like in certain situations, especially on larger files, SNAP is just silently producing a truncated bam (regardless of whether either -so or -pfc are turned on or not). Often this bam apparently includes an EOF marker as the bam passes samtools quickcheck. I have anecdotally heard from another user that he's encountered this phenomenon as well (on version 2.0.0), especially on larger BAMs (but for whatever reason it does not seem to be an issue when outputting SAMs, again regardless of either -so or -pfc).

My input file that initially failed to produce bam output with -pfc and 2.0.2.dev.6, but worked with 2.0.2.dev.7, was only 1154 read pairs. My file that continued to fail (but worked with sam output) was 10M read pairs (20M reads). Truncating this to 5M, 2.5M, or 1.25M read pairs all failed with or without -pfc (regardless of whether they were from the top or bottom of the file), but 1M read pairs (and fewer) worked (again regardless of whether it was the first or last 1M). Aligning a fastq with 10M SE reads also worked. So insofar as I'm able to test, it seems like -pfc is working properly, and this other apparent bam output bug is likely unrelated.

There was a bug in the -pfc code that didn't deal properly with tags crossing the end of the output buffer, which I fixed.

Much more interestingly, in looking for it I discovered a bug in BAM output that's existed in SNAP forever that would corrupt output when a compression block was larger than the uncompressed data. Presumably, we didn't see this very often because even a single SAM line is pretty compressible and so wouldn't tickle the bug. I only hit it because I was using a reference with only one contig, which sometimes generated a header that didn't compress (depending on the command line, which made this kind of tricky to debug).

Check it out this time and see if it works for you.

It did indeed - I am impressed and grateful. I tried recapitulating the bug using publicly available GIAB fastqs and was unable to do so, and unfortunately I wasn't permitted to share my fastqs that repeatedly failed, so I wasn't sure how to better enable you to find it yourself. It had been suggested to me that I specify the hardware I was using, especially that it's a Linux server, in case it only affected the Linux distribution, but I guess that wasn't the case?

In any case, as far as I can tell, bam output is now working fine with or without -so, as is -pfc.

In 2.0.2