biosails/pheniqs

Losing header info from fastq files

prmunn opened this issue · 5 comments

I'm running Pheniqs with paired-end fastq files as input and creating bam files as output. The first field in the bam file retains the info from the fastq header row (the instrument type, run id, flowcell id, etc.), but it drops the "member of a pair", "read filtered", "control bits on", and index sequence info.

e.g. if my fastq header contains "@LH00497:14:22KTM5LT3:8:1101:2036:1064 1:N:0:ACTTCTGC+CCGGGACT" then only the "LH00497:14:22KTM5LT3:8:1101:2036:1064" part is retained.

Is there a way to move the rest of the header (the "1:N:0:ACTTCTGC+CCGGGACT" part) to the bam file, say as a tag field?

Hi @prmunn

Thank you for using Pheniqs :)

So... the fastq format only really defines the read id as the part that comes after the @ sign on the first line of each 4-line record. Anything after the space is considered a "comment" and can technically be anything, as long as the whole thing is less than 254 characters (I think; at least HTSLib assumes it to be). The ID itself (the part immediately after the @ and before any whitespace) is unique and also shared between segments in different fastq files, so it's kind of the glue that sticks all the segments together. The comment is more specific to a segment.

Illumina has a specific syntax for the comment, which is not really part of the fastq format, but Pheniqs still parses it. You can see the specific code here:

pheniqs/fastq.h

Line 102 in d4bd514

case Platform::ILLUMINA: {

Some of this info does make it to the designated fields in a SAM file, like the pass-QC flag or the segment number. The rest is very Illumina-specific and not really used downstream by anything I know of.
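To make the id/comment split concrete, here is a minimal Python sketch of how a header line like the one in the question breaks down. It is only an illustration, not the Pheniqs parser: the names `split_header`, `parse_illumina_comment` and `IlluminaComment` are made up for this example, and the four-field `segment:filtered:control:barcode` layout is assumed from the Illumina convention shown above.

```python
from typing import NamedTuple, Optional, Tuple


class IlluminaComment(NamedTuple):
    """Hypothetical container for the four Illumina comment fields."""
    segment: int    # member of a pair: 1 or 2
    filtered: bool  # True when the field is "Y" (read was filtered)
    control: int    # control bits, 0 when none are on
    barcode: str    # index sequence(s), e.g. "ACTTCTGC+CCGGGACT"


def split_header(line: str) -> Tuple[str, str]:
    """Split a FASTQ header line into (read id, comment)."""
    if not line.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # Everything up to the first whitespace is the id; the rest is the comment.
    parts = line[1:].rstrip("\r\n").split(None, 1)
    read_id = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return read_id, comment


def parse_illumina_comment(comment: str) -> Optional[IlluminaComment]:
    """Parse an Illumina-style comment; return None if it does not match."""
    fields = comment.split(":")
    if len(fields) != 4 or fields[1] not in ("Y", "N"):
        return None
    try:
        return IlluminaComment(
            segment=int(fields[0]),
            filtered=fields[1] == "Y",
            control=int(fields[2]),
            barcode=fields[3],
        )
    except ValueError:
        # Non-numeric segment or control field: not an Illumina comment.
        return None
```

With the header from the question, `split_header` gives back `LH00497:14:22KTM5LT3:8:1101:2036:1064` as the id and `1:N:0:ACTTCTGC+CCGGGACT` as the comment, which `parse_illumina_comment` then breaks into segment 1, not filtered, control 0, and the index sequence.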

BUT if you really really really need it I can try and add a flag to move that to some field.

The trick is that Pheniqs is in the business of manipulating the topology of a read. It may get a read with 4 segments and produce a read with 2 segments, or any rearrangement you can think of, really. It's not always clear which metadata from which input segment ends up on the output segment. Defining this for the general case is much more involved than it sounds at first.

What did you have in mind? What exactly are you trying to achieve?

Hi @moonwatcher

What I'm ultimately trying to achieve is to move the barcode segments (I have four segments defined in a "cellular" template) to the header of a fastq file, similar to the way the "sample" template works. I noticed that when I use a "sample" template the comment section of the header is also retained.

What I've been doing up until now is to create a bam file as the output of Pheniqs, which gives me a CB tag that has the "cellular" barcode segments, and then have an awk script convert this to a fastq file with the contents of the CB tag moved to the fastq header. I'm also able to use the sam flag to parse out the appropriate bit for the read number (for paired-end reads) and add that back into the header. This approach is working (minus the rest of the comment section), but it would be better (at least for me) if I could skip the conversion step and have Pheniqs create the fastq with the "cellular" segments moved to the header. Since the "sample" template already moves one segment to the header, maybe it could be modified to allow multiple templates for the "sample" section.
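For anyone following along, the conversion described above could look roughly like this in Python. This is a sketch, not the actual awk script: `sam_to_fastq` is a made-up name, the reads are assumed to be unaligned (so no reverse-complementing is needed), and the control field is simply written as 0 because that information is not in the SAM record.

```python
from typing import Iterable, Iterator


def sam_to_fastq(lines: Iterable[str], tag: str = "CB") -> Iterator[str]:
    """Yield one FASTQ record per SAM record, with `tag` moved to the comment."""
    prefix = tag + ":Z:"
    for line in lines:
        if line.startswith("@"):
            continue  # SAM header line, not a read
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]

        # Optional fields start at column 12; take the first matching tag.
        barcode = ""
        for aux in fields[11:]:
            if aux.startswith(prefix):
                barcode = aux[len(prefix):]
                break

        # 0x40 marks the first segment, 0x80 the last one.
        segment = 2 if (flag & 0x80 and not flag & 0x40) else 1
        # 0x200 marks a read that failed quality checks.
        filtered = "Y" if flag & 0x200 else "N"

        # Assumption: control bits are unknown here, so 0 is written.
        yield f"@{name} {segment}:{filtered}:0:{barcode}\n{seq}\n+\n{qual}\n"
```

To use it as a filter, something like `sys.stdout.writelines(sam_to_fastq(sys.stdin))` would read SAM on stdin and write FASTQ on stdout. Both segments end up interleaved in one stream, so splitting them into separate files would be an extra step.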

Sorry for the slow reply: personal issues and a deadline at my actual job. I'll get around soon to adding a flag for writing the header comment to an auxiliary field.

Like I said, because of the nature of what Pheniqs is doing, that might raise some questions about more complicated cases or require a default behavior.

Sorry, I just read your second reply again.

So what you ultimately want is to write the cell barcode to the fastq comment, same as the sample barcode. Which is working for sample barcodes...

There might be a way to do this with a few pipes. Pheniqs can write SAM (the uncompressed, plain-text form of bam) to stdout, then you can use sed to switch the tag from cellular to sample and pipe that into a second Pheniqs run that converts it to fastq. Am I getting this right? That should be very fast.
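The tag-switching step in the middle of that pipe is small enough to sketch. The version below is in Python rather than sed, and it rests on a few assumptions: that the cellular barcode sits in the `CB` tag and the sample barcode in `BC` (as in the SAM optional-fields spec), that the records do not already carry a `BC` tag, and that the matching quality tags can be ignored. `rename_tag` is a made-up helper name.

```python
def rename_tag(line: str, old: str = "CB", new: str = "BC") -> str:
    """Rename one optional-field tag in a SAM record, leaving headers alone."""
    if line.startswith("@"):
        return line  # SAM header line, pass through unchanged
    fields = line.rstrip("\r\n").split("\t")
    prefix = old + ":"
    # Only touch columns 12 and up, so SEQ or QUAL can never be rewritten.
    fields[11:] = [
        new + aux[len(old):] if aux.startswith(prefix) else aux
        for aux in fields[11:]
    ]
    return "\t".join(fields) + "\n"
```

Placed between the two Pheniqs runs, it would be applied to every line read from stdin, with the result written to stdout. Whether the second run then writes the renamed barcode to the fastq comment depends on how that run is configured, which is not shown here.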

Alternatively, if there was a flag to write cell barcodes to the fastq comment, the same way the sample barcode is written now, that would satisfy your needs?

Yes, if there was a flag to write cell barcodes to the fastq comment, the same way the sample barcode is written now, that would work.

That said, I have written an awk script that converts the sam to a fastq with the barcodes in the comment section, and it runs pretty quickly, so I can make do with things as they stand. However, other people might find such a flag useful.