Zymo-Research/figaro

different read length due to different barcode length

Closed this issue · 4 comments

Dear all,

My Illumina MiSeq (300 bp) paired-end data contains 4 different inline barcodes (preceding the primers), with lengths of 13, 12, 11 and 10 nucleotides. When the barcodes are removed, what remains is a set of sequences + primers with 4 different lengths.

FIGARO cannot handle my data because of this length distribution issue. Is there any way to solve this problem? If I trim all the reads at the same position, I would end up with 4 different primer lengths, and FIGARO requires an exact primer length.

Also, after clipping the sequencing adapter remnants from all reads and detecting and clipping the primers, I end up with a set of "clean" reads ranging from 270 to 280 nt.
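For anyone wanting to check how wide their own length spread is before running FIGARO, here is a minimal sketch that tallies read lengths from a FASTQ stream. It assumes plain 4-line FASTQ records (no wrapped sequence lines); the function name `read_length_histogram` and the toy reads are made up for illustration and are not part of FIGARO.

```python
from collections import Counter
import io


def read_length_histogram(fastq_handle):
    """Count reads per sequence length in a 4-line-per-record FASTQ stream."""
    counts = Counter()
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 == 1:  # second line of each record is the sequence
            counts[len(line.strip())] += 1
    return counts


# Toy data standing in for a real file opened with open("reads.fastq").
example = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTACG\n+\nIIIIIII\n"
    "@r3\nACGTACGT\n+\nIIIIIIII\n"
)
histogram = read_length_histogram(example)
for length in sorted(histogram):
    print(length, histogram[length])
```

A narrow histogram (a handful of adjacent lengths) would point to technical trimming only, while a long tail of short reads usually means quality trimming happened somewhere upstream.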

Hope you can find a solution since according to the issue section I am not the first one with a "multiple lengths" problem.

Best,
Fair.

You are definitely not the first one with this issue. I am working out a solution to this problem. When I first wrote this, just about every case where someone was telling me that they had issues with running variable read lengths was caused by the reads being pretrimmed, often with quality trimming being part of the pipeline. I was not sure how this would affect the model, but I suspected it would probably lead to some bad calls (because quality trimming would make the later bases look better, on average, by removing bad ones). I still think it's important that reads not be pretrimmed for quality, but I'm seeing more pipelines where this is an issue because of things that should not interfere with the model (like having to remove a variable length primer). Making the program handle this scenario is something I am currently hashing out.

Hi Michael,

Thanks for your answer. The way we handle our data would be:

1. Remove the barcodes (4 different barcode lengths)
2. Remove the adapter sequences
3. Remove the primers

There is no quality trimming during the process, BUT we end up with quite a lot of different read lengths in our sequences, which are already 100% free of artificial sequence after the 3 steps mentioned above.
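To illustrate why these steps produce several read lengths even without quality trimming, here is a rough sketch of barcode and primer removal. It is not our production pipeline: the primer sequence, the toy read length of 60, and the function `strip_barcode_and_primer` are hypothetical, and it uses exact string matching only, whereas real primers often contain degenerate bases and sequencing errors.

```python
def strip_barcode_and_primer(sequence, quality, primer, max_barcode_len=13):
    """Remove an inline barcode and the primer that follows it.

    Returns the trimmed (sequence, quality) pair, or None if the primer is
    not found within the window allowed by max_barcode_len.
    """
    # The primer may start at any index from 0 to max_barcode_len.
    window_end = max_barcode_len + len(primer)
    primer_start = sequence.find(primer, 0, window_end)
    if primer_start == -1:
        return None
    cut = primer_start + len(primer)
    return sequence[cut:], quality[cut:]


PRIMER = "ACGTTGCA"  # hypothetical primer, for illustration only
RAW_LENGTH = 60      # toy stand-in for a 300 nt MiSeq read

trimmed_lengths = []
for barcode_len in (10, 11, 12, 13):
    barcode = "T" * barcode_len  # placeholder barcode
    insert = "G" * (RAW_LENGTH - barcode_len - len(PRIMER))
    sequence = barcode + PRIMER + insert
    quality = "I" * len(sequence)
    trimmed_seq, trimmed_qual = strip_barcode_and_primer(sequence, quality, PRIMER)
    trimmed_lengths.append(len(trimmed_seq))

print(trimmed_lengths)  # → [42, 41, 40, 39]
```

Every raw read has the same length, so each extra barcode base costs one base of insert, which gives exactly four trimmed lengths that differ by one base each.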

This is quite common for really big sequencing companies in Europe, like LGC with its pipeline based on inline barcodes of different lengths. If you are able to fix this issue, you will reach a much broader audience. My group is in fact looking for standardized quality trimming procedures, and we think FIGARO is the way to go.

Best,
Fair.

I'm starting to get a good picture of how to deal with this issue. From what I've been hearing from both you and several others running similar pipelines, the reads will come in pretrimmed for technical sequence, but not for quality. The trimming for technical sequence will tend to result in reads of variable length, but that variability is expected to span a relatively narrow range (as opposed to someone who ran quality trimming, where reads could vary greatly in length). Fixing that issue will be much simpler, since I'm just dealing with a few bases of variable length and you would never want to actually cut inside that variable tail end anyway. I could just exclude the variable-length tails from my model and wouldn't really even have to redo any attempts at validating.
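As a rough illustration of the idea of excluding the variable-length tails, here is a minimal sketch that truncates every read to the length of the shortest one, and refuses to do so when the spread looks too large to be explained by technical trimming alone. This is only a sketch under stated assumptions, not the code in FIGARO: the function name `trim_to_common_length` and the `max_spread` threshold of 15 are hypothetical.

```python
def trim_to_common_length(records, max_spread=15):
    """Truncate (sequence, quality) pairs to the shortest read length.

    Raises ValueError if there are no reads, or if read lengths differ by
    more than max_spread bases, which would suggest quality trimming rather
    than removal of technical sequence.
    """
    if not records:
        raise ValueError("no reads given")
    lengths = [len(sequence) for sequence, _ in records]
    shortest, longest = min(lengths), max(lengths)
    if longest - shortest > max_spread:
        raise ValueError(
            f"read lengths span {shortest}-{longest}; spread exceeds {max_spread}"
        )
    # Drop the variable tail so every read presents the same positions.
    return [(seq[:shortest], qual[:shortest]) for seq, qual in records]


# Toy reads of length 12, 11 and 10.
reads = [
    ("ACGTACGTACGT", "IIIIIIIIIIII"),
    ("ACGTACGTACG", "IIIIIIIIIII"),
    ("ACGTACGTAC", "IIIIIIIIII"),
]
uniform = trim_to_common_length(reads)
print([len(seq) for seq, _ in uniform])  # → [10, 10, 10]
```

Since nobody would choose a trim position inside those last few bases anyway, dropping them should cost little, and the guard keeps heavily quality-trimmed input from slipping through unnoticed.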

Also, just out of curiosity, where are you located/which group are you part of? Always happy to see where Figaro is getting adopted. Do the variable length barcodes help improve 16S quality scores (possibly by ensuring more heterogeneity at each position)? If so, I may need to include something about it in my bioinformatics workshop.

Happy to say that I've gotten a bit of a break from COVID issues for a few days, and I am actively working on a solution for this and a few other things. I may have an alpha version for testing very soon, so watch the new branches to follow progress and for a chance to test it out.