ignore headers

Question

Closed this issue 9 years ago · 23 comments

Is it possible to add a flag to Pandaseq to completely ignore header format?

The reason is that I have some preprocessing upstream of Pandaseq where I need to add some info to headers (has to be first field too because bioinformatics).

Answer 1 · 2015-03-13T15:29:26.000Z

Possible, but impractical. What is the upstream processing doing?

Answer 2 · 2015-03-13T15:39:59.000Z

Labelling reads by barcode.

Answer 3 · 2015-03-13T15:41:01.000Z

The bar code information is preserved. It can be done after.

Answer 4 · 2015-03-13T15:41:43.000Z

We get the data from our sequencing core in a slightly different format. Barcodes are not in read headers; they're in a separate file.

Answer 5 · 2015-03-13T15:42:08.000Z

Can you provide samples?

Answer 6 · 2015-03-13T15:43:47.000Z

Sure: https://www.dropbox.com/s/cozbtkxbn3qymez/fastq-sample.tar.gz?dl=0

Answer 7 · 2015-03-13T15:49:54.000Z

Re #7 "Why some sequencing centres fail to do this is beyond my comprehension"

The reason is that we use custom barcodes that are incompatible with the Illumina software, so we have to demultiplex ourselves.

Answer 8 · 2015-03-13T15:51:43.000Z

It looks easier to have PANDAseq read the barcodes and attach them to the sequences as they are read than to deal with manipulated headers.

Answer 9 · 2015-03-13T15:52:43.000Z

You lost me. Pandaseq can read barcodes files? Where does it attach them to the sequence?

Answer 10 · 2015-03-13T15:56:15.000Z

Currently, PANDAseq reads and parses the headers of the input FASTQ files. The files you provided have valid headers; there is just no barcode information present (which is not a problem).

You propose that you manipulate the headers and then PANDAseq ignores the headers. This is difficult since PANDAseq has many expectations about what it can do with the headers.

I propose changing PANDAseq so it can take three files as input (forward, reverse, and index). It will then use the barcodes provided and include the barcode in the output header. You can also use -C validtag:AAAAA to select a subset of sequences (i.e., use PANDAseq to do the demultiplexing as well).
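To make the proposal concrete, here is a minimal Python sketch of the idea: walk a read file and an index file in lockstep and attach each index read's sequence to the matching read as its barcode. This is not PANDAseq's implementation. The function names, the `wanted_tag` parameter (a loose stand-in for what `-C validtag:AAAAA` does), and the sample records are all hypothetical, and how the barcode is rendered in the output header is deliberately left out.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        header, sequence, _plus, quality = record
        yield header, sequence, quality


def tag_reads(reads, index_reads, wanted_tag=None):
    """Pair each read with the index read at the same position.

    Yields (name, barcode, sequence, quality). If wanted_tag is given,
    only reads carrying that barcode are kept (hypothetical filter).
    """
    pairs = zip(read_fastq(reads), read_fastq(index_reads))
    for (header, sequence, quality), (index_header, barcode, _quality) in pairs:
        # Assumption: the text before the first space identifies the cluster
        # and must be identical in the read file and the index file.
        name = header.split()[0]
        if name != index_header.split()[0]:
            raise ValueError(f"read/index mismatch: {header!r} vs {index_header!r}")
        if wanted_tag is not None and barcode != wanted_tag:
            continue
        yield name, barcode, sequence, quality


# Made-up sample records, for illustration only.
FORWARD = "@r1 1:N:0\nACGT\n+\nIIII\n@r2 1:N:0\nTTTT\n+\nIIII\n"
INDEX = "@r1 2:N:0\nAAAAA\n+\nIIIII\n@r2 2:N:0\nCCCCC\n+\nIIIII\n"

all_reads = list(tag_reads(io.StringIO(FORWARD), io.StringIO(INDEX)))
only_aaaaa = list(tag_reads(io.StringIO(FORWARD), io.StringIO(INDEX), wanted_tag="AAAAA"))
```

Because the barcode is looked up from the index file, the read headers themselves never need to be edited.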

Answer 11 · 2015-03-13T15:59:24.000Z

I wasn't aware that Pandaseq needed information in the header for anything except verifying that I'm not doing something stupid like aligning reads from two different sequencing runs. If that's the only reason then I'd suggest adding a flag to skip it as sequencing companies change file formats all the time.

Can't I just attach the barcode sequence to the header myself? I can't find what a "good" header looks like in the docs.

I'm testing with this, where I manually added the "GATC":

pandaseq-checkid "@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC"
@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC
                                           ^
    BAD
    instrument = "M02780"
    run = 41
    flowcell = "000000000-ADDJ7"
    lane = 1
    tile = 1101
    x = 20524
    y = 1181
    tag = ""
    generator = CASAVA 1.7+

Answer 12 · 2015-03-13T16:02:29.000Z

That's not the correct place for it.
From the manual:

The name of the input read did not follow the known Illumina standard formats. Older versions of CASAVA produce sequences with IDs that look like HWUSI-EAS1661_9323_FC619KG:7:1:1190:15190#ATCACG/1, where the fields are instrument:lane:tile:x:y#tag/direction. Newer versions of CASAVA produce IDs that look like HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA, where the fields are instrument:run:flow‐cell:lane:tile:x:y direction:filtered:flags:tag. If your sequence headers do not look like either of these, either Illumina has created yet-another header format or, more likely, your sequence headers have been manipulated by some upstream processing, possibly at your sequencing centre. PANDAseq needs the original Illumina probabilities; not ones manipulated by other programs. We're very picky about that. Sometimes, for mysterious reasons, the sequences lack the barcoding tag. The -B option will cause the lack of barcode to be ignored. This will obviously invalidate the use of validation modules that depend on the barcode.
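As a rough illustration of the two layouts the manual describes, the Python sketch below matches a header against each one with a regular expression. The patterns are written from the field lists quoted above and are an assumption, not PANDAseq's actual parser, so they may be stricter or looser than the real thing. `parse_header` is a hypothetical helper name.

```python
import re

# Older CASAVA layout: instrument:lane:tile:x:y#tag/direction
OLD_FORMAT = re.compile(
    r"^@?(?P<instrument>[^:\s]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<tag>[^/\s]*)/(?P<direction>\d)$"
)

# Newer CASAVA layout:
# instrument:run:flow-cell:lane:tile:x:y direction:filtered:flags:tag
NEW_FORMAT = re.compile(
    r"^@?(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r" (?P<direction>\d):(?P<filtered>[YN]):(?P<flags>\d+):(?P<tag>\S*)$"
)


def parse_header(header):
    """Return (layout_name, fields) for a recognised header, else None."""
    text = header.strip()
    for name, pattern in (("old", OLD_FORMAT), ("new", NEW_FORMAT)):
        match = pattern.match(text)
        if match:
            return name, match.groupdict()
    return None


# The two examples quoted from the manual.
old_example = parse_header("HWUSI-EAS1661_9323_FC619KG:7:1:1190:15190#ATCACG/1")
new_example = parse_header("HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA")

# The header tested earlier in the thread: the tag is appended with a colon
# to the coordinate block, so it fits neither layout.
rejected = parse_header("@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC")

# Hypothetical variant with the tag moved into the second,
# space-separated block (the 1:N:0 values are made up).
relocated = parse_header("@M02780:41:000000000-ADDJ7:1:1101:20524:1181 1:N:0:GATC")
```

Under these patterns the colon-appended GATC header is rejected, while the variant with the tag after the space matches the newer layout. Whether PANDAseq itself accepts such a hand-edited header is not confirmed in this thread.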

I'm halfway finished with my proposal anyway.

Answer 13 · 2015-03-13T16:03:42.000Z

So can I fake the direction:filtered:flags:tag?

Answer 14 · 2015-03-13T16:07:33.000Z

Ah this is weird. The sequencing core was previously sending us files with headers like HWI-ST822:85:C05C3ACXX:1:1101:1171:2104

They just recently switched to @M02780:41:000000000-ADDJ7:1:1101:20524:1181:200

I'll ask them what kind of upstream processing they may have performed.

I hope they don't have to re-generate the FASTQ files because this usually takes them a week.

Answer 15 · 2015-03-13T16:08:14.000Z

That's the same format, just a different sequencing platform.

Answer 16 · 2015-03-13T16:30:06.000Z

I've implemented something in dc65111. There is now a -i flag where you can supply the index reads and PANDAseq will apply the barcodes. There's no need to preprocess the files.

Answer 17 · 2015-03-13T16:49:06.000Z

I will give this a try. Thanks.

Answer 18 · 2015-03-13T20:09:25.000Z

This seems to work. Thanks for the quick response.

Answer 19 · 2015-04-03T02:01:34.000Z

Hello,

I want to include pandaseq in a paper comparing the accuracy of overlap-based read-merging programs, but my methodology requires custom read headers. Are you willing to add a header-parsing override flag for this purpose?

Answer 20 · 2015-04-03T02:19:49.000Z

This change is not practical to make. The parsed header structure is woven through the code.

You can use the PANDAseq API to track reads directly if desired. You will need to populate a header structure, but it need not be valid. See the panda_assembler_assemble function for details. If C is bothersome, there are bindings for Vala, a Java/C#-like language that provides an object-oriented interface for working directly with PANDAseq. These bindings are used in the regression test.

Answer 21 · 2015-04-03T02:24:35.000Z

OK, that's unfortunate, but thanks for your explanation.

Answer 22 · 2015-07-01T15:43:07.000Z

Hello again. Would it be possible to get a new release that has the -i flag? I need to share my "Pandaseq analysis pipeline" with colleagues and could do without the extra "you must clone and compile this specific ref" step.

Answer 23 · 2015-07-01T15:47:13.000Z

I have some bugs in the queue, but I can do it after those issues are resolved. It should be under a month.