neufeld/pandaseq

ignore headers

Closed this issue · 23 comments

audy commented

Is it possible to add a flag to Pandaseq to completely ignore header format?

The reason is that I have some preprocessing upstream of Pandaseq where I need to add some info to headers (has to be first field too because bioinformatics).

Possible, but impractical. What is the upstream processing doing?

audy commented

Labelling reads by barcode.

The barcode information is preserved, so the labelling can be done afterwards.

audy commented

We get the data from our sequencing core in a slightly different format. Barcodes are not in read headers; they're in a separate file.

Can you provide samples?

audy commented

Re #7 "Why some sequencing centres fail to do this is beyond my comprehension"

The reason is that we use custom barcodes that are incompatible with the Illumina software, so we have to demultiplex ourselves.

It looks easier to have PANDAseq read the barcodes and attach them to the sequences as they are read than to deal with manipulated headers.

audy commented

You lost me. Pandaseq can read barcode files? Where does it attach them to the sequence?

Currently, PANDAseq reads and parses the headers of the input FASTQ files. The files you provided have valid headers; there is just no barcode information present (which is not a problem).

You propose that you manipulate the headers and then PANDAseq ignores the headers. This is difficult since PANDAseq has many expectations about what it can do with the headers.

I propose changing PANDAseq so it can take three files as input (forward, reverse, and index). It will then use the barcodes provided and include the barcode in the output header. You can also use -C validtag:AAAAA to select a subset of sequences (i.e., use PANDAseq to do the demultiplexing as well).
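
To make the proposal concrete, here is a minimal Python sketch of the idea: walk the forward, reverse, and index FASTQ files in step, take each index read's sequence as the barcode, and optionally keep only the pairs whose barcode is in an allowed set. This is not PANDAseq's implementation. The names `read_fastq` and `tag_pairs` are hypothetical, and the sketch assumes all three files list the reads in the same order and share the read name before the first space.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from four-line FASTQ records.

    Loads all lines into memory, which is fine for a sketch but not for
    a full sequencing run.
    """
    cleaned = [line.rstrip("\n") for line in lines]
    cleaned = [line for line in cleaned if line]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        yield cleaned[i], cleaned[i + 1], cleaned[i + 3]


def tag_pairs(
    forward: Iterable[str],
    reverse: Iterable[str],
    index: Iterable[str],
    valid_tags: Optional[Set[str]] = None,
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (read name, barcode, forward sequence, reverse sequence).

    Assumption: the three inputs are in the same read order. zip stops at
    the shortest input; a real tool should check the lengths agree.
    """
    records = zip(read_fastq(forward), read_fastq(reverse), read_fastq(index))
    for fwd, rev, idx in records:
        name = fwd[0].split()[0]
        if rev[0].split()[0] != name or idx[0].split()[0] != name:
            raise ValueError(f"read names do not match at {name}")
        barcode = idx[1]  # the index read's sequence is the barcode
        if valid_tags is not None and barcode not in valid_tags:
            continue  # hypothetical stand-in for a validtag-style filter
        yield name, barcode, fwd[1], rev[1]
```

Passing `valid_tags={"AAAAA"}` keeps only the pairs whose index read is exactly `AAAAA`, which mirrors the demultiplexing use mentioned above.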

audy commented

I wasn't aware that Pandaseq needed information in the header for anything except verifying that I'm not doing something stupid like aligning reads from two different sequencing runs. If that's the only reason then I'd suggest adding a flag to skip it as sequencing companies change file formats all the time.

Can't I just attach the barcode sequence to the header myself? I can't find what a "good" header looks like in the docs.

I'm testing with this, where I manually added the "GATC":

pandaseq-checkid "@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC"
@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC
                                           ^
    BAD
    instrument = "M02780"
    run = 41
    flowcell = "000000000-ADDJ7"
    lane = 1
    tile = 1101
    x = 20524
    y = 1181
    tag = ""
    generator = CASAVA 1.7+

That's not the correct place for it.
From the manual:

The name of the input read did not follow the known Illumina standard formats. Older versions of CASAVA produce sequences with IDs that look like HWUSI-EAS1661_9323_FC619KG:7:1:1190:15190#ATCACG/1, where the fields are instrument:lane:tile:x:y#tag/direction. Newer version of CASAVA produce IDs that look like HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA, where the fields are instrument:run:flow‐cell:lane:tile:x:y direction:filtered:flags:tag. If your sequence headers do not look like either of these, either Illumina has created yet-another header format or, more likely, your sequence headers have been manipulated by some upstream processing, possibly at your sequencing centre. PANDAseq needs the original Illumina probabilities; not ones manipulated by other programs. We're very picky about that. Sometimes, for mysterious reasons, the sequences lack the barcoding tag. The -B option will cause the lack of barcode to be ignored. This will obviously invalidate the use of validation modules that depend on the barcode.
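
As an illustration of the two layouts the manual describes, the following Python sketch classifies a header with two regular expressions built only from the field lists quoted above. It is an assumption-laden approximation, not PANDAseq's parser: the name `classify_header` and the exact character classes are hypothetical, and it is stricter than the real tool (for example, it rejects a newer-style header that lacks the `direction:filtered:flags:tag` part, whereas PANDAseq reported the files in this thread as having valid headers).

```python
import re
from typing import Optional

# Older CASAVA: instrument:lane:tile:x:y#tag/direction
OLD_FORMAT = re.compile(
    r"^@?(?P<instrument>[^:\s]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<tag>[^/\s]*)/(?P<direction>\d)$"
)

# Newer CASAVA:
# instrument:run:flowcell:lane:tile:x:y direction:filtered:flags:tag
NEW_FORMAT = re.compile(
    r"^@?(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r" (?P<direction>\d):(?P<filtered>[YN]):(?P<flags>\d+):(?P<tag>\S*)$"
)


def classify_header(header: str) -> Optional[dict]:
    """Return the parsed fields plus a 'format' key, or None if neither
    layout matches. All field values are returned as strings."""
    text = header.strip()
    for name, pattern in (("old", OLD_FORMAT), ("new", NEW_FORMAT)):
        match = pattern.match(text)
        if match:
            return {"format": name, **match.groupdict()}
    return None
```

Under these patterns, both example IDs from the manual are recognised, while the hand-edited header from earlier in the thread, with `:GATC` appended after the y coordinate, matches neither layout.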

I'm halfway finished with my proposal anyway.

audy commented

So can I fake the direction:filtered:flags:tag?

audy commented

Ah this is weird. The sequencing core was previously sending us files with headers like HWI-ST822:85:C05C3ACXX:1:1101:1171:2104

They just recently switched to @M02780:41:000000000-ADDJ7:1:1101:20524:1181:200

I'll ask them what kind of upstream processing they may have performed.

I hope they don't have to re-generate the FASTQ files because this usually takes them a week.

That's the same format, just a different sequencing platform.

I've implemented something in dc65111. There is now a -i flag where you can supply the index reads and PANDAseq will apply the barcodes. There's no need to preprocess the files.

audy commented

I will give this a try. Thanks.

audy commented

This seems to work. Thanks for the quick response.

Hello,

I want to include pandaseq in a paper comparing the accuracy of overlap-based read-merging programs, but my methodology requires custom read headers. Are you willing to add a header-parsing override flag for this purpose?

This change is not practical to make. The parsed header structure is woven through the code.

You can use the PANDAseq API to track reads directly if desired. You will need to populate a header structure, but it need not be valid. See the panda_assembler_assemble function for details. If C is bothersome, there are bindings for Vala, a Java/C#-like language that provides an object-oriented interface for working directly with PANDAseq. These bindings are used in the regression test.

OK, that's unfortunate, but thanks for your explanation.

audy commented

Hello again. Would it be possible to get a new release that has the -i flag? I need to share my "Pandaseq analysis pipeline" with colleagues and could do without the extra "you must clone and compile this specific ref" step.

I have some bugs in the queue, but I can do it after those issues are resolved. It should take under a month.