brwnj/fastq-multx

Unmatch decision question

Closed this issue · 9 comments

Hi @brwnj ,
This is actually more of a question than an issue.
As I understand it, your tool splits raw reads into different files according to their barcode.
So I launched it as follows:
fastq-multx -m 2 -b -x -B preprocess/barcodes.txt Undetermined_S0_L001_R1_001.fastq.gz Undetermined_S0_L001_R2_001.fastq.gz -o preprocess/multx/%_R1.fq.gz preprocess/multx/%_R2.fq.gz
The script runs fine, but the results are unexpected: fastq-multx sends a third of my reads to unmatched.fq.gz.
Many of the reads in unmatched.fq.gz contain my barcodes.
Is anything besides the barcode match involved in the unmatched decision? The insert length, or something else?
Thanks !
Bastien

brwnj commented

This is the work of Erik Aronesty, but I wanted to separate fastq-multx from the ea-utils package. That said, I will attempt to answer.

If you're using -m 2 and -d 2 with many barcodes, it's possible that the barcodes showing in unmatched had mismatches or were similar enough that the application was unable to resolve them. I recommend using -m 0.
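To make the interplay of the two thresholds concrete, here is a minimal Python sketch of the idea. It is not fastq-multx's actual code: it assumes that -m caps the mismatches allowed for the best barcode and that -d requires the runner-up to be at least that many mismatches worse. The names `hamming` and `assign` and the barcodes in the usage note are hypothetical.

```python
def hamming(a, b):
    """Count mismatching positions between two strings (compared pairwise)."""
    return sum(x != y for x, y in zip(a, b))


def assign(read, barcodes, max_mismatch=2, min_gap=2):
    """Return the name of the barcode a read is assigned to, or None.

    Assumed rule (not taken from the fastq-multx source): the best barcode
    may have at most max_mismatch mismatches, and the second-best barcode
    must be at least min_gap mismatches worse than the best one.
    """
    scored = sorted(
        (hamming(read[:len(seq)], seq), name) for name, seq in barcodes.items()
    )
    best_score, best_name = scored[0]
    if best_score > max_mismatch:
        return None  # too many mismatches even for the best candidate
    if len(scored) > 1 and scored[1][0] - best_score < min_gap:
        return None  # ambiguous: two barcodes fit almost equally well
    return best_name
```

Under this assumed rule, two barcodes that differ at a single position (say AAAA and AAAT) cannot be told apart with a gap of 2, so even a read that matches one of them perfectly ends up unmatched.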

Thank you for your reply. Unfortunately, the -m 0 option reduces the number of good results. A majority of my unmatched reads start with a 'T' followed by one of my barcodes (with no mismatch). I tried removing the -b option (force beginning of line (5') for barcode matching) to make barcode detection more flexible, but this did not change the number of results either.
I'm very curious to know by which criteria fastq-multx classifies a raw read as unmatched.

brwnj commented

Something to remember when working with sequence data is that retaining reads solely to inflate read count may not be the best idea. Errors in the barcode read and erroneous assignment into a sample bin can affect your biological conclusion. If you insist on keeping reads with nucleotide mismatches in their barcode, consider lowering the edit distance to 1 (-d 1).

I understand that inflating the read count is not a good idea, and that is not my aim here.
A short example: I have 10 barcodes and a set of reads (1.5M).
With fastq-multx, 36 reads are kept for the first barcode (let's say GATCTCTCGA, a totally random barcode), while the eighth barcode gets 500,000 reads.
When I zgrep my first barcode in unmatched_R1.fq.gz (the fastq-multx output) with zgrep -c "GATCTCTCGA" unmatched_R1.fq.gz (so with no mismatches, since it is a zgrep), I get about 60,000 hits like these:
TGATCTCTCGAAGTCTAC
TGATCTCTCGAGCTAGCA
TGATCTCTCGAGCTAGCC
Why are those reads not kept? The barcode is found with no mismatch starting at the second position of the read. Is the T at the first base throwing out the whole read?
As I said above, I did not use the -b option, which forces beginning of line (5') for barcode matching.
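A zgrep count does not say where in the read the barcode sits. A small Python sketch like the one below could tabulate the offset of the first exact hit near the 5' end. It assumes plain four-line FASTQ records (no wrapped sequences); `barcode_offsets`, `offsets_in_file`, and the `window` parameter are hypothetical names introduced for illustration.

```python
import gzip
from collections import Counter


def barcode_offsets(lines, barcode, window=5):
    """Count reads by the offset (0..window) of their first exact barcode hit.

    `lines` is any iterable of FASTQ lines; only the sequence line of each
    four-line record is inspected. Reads without a hit are not counted.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        pos = line.strip().find(barcode, 0, window + len(barcode))
        if pos != -1:
            counts[pos] += 1
    return counts


def offsets_in_file(path, barcode, window=5):
    """Same tabulation for a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return barcode_offsets(handle, barcode, window)
```

Calling `offsets_in_file("unmatched_R1.fq.gz", "GATCTCTCGA")` would then show how many unmatched reads carry the barcode at offset 0, 1, 2, and so on.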

brwnj commented

They're not kept because fastq-multx expects the barcode sequence to start at the end of the read, which in typical cases it will. An initial non-barcode base offsets the barcode, and every base is then recorded as a mismatch, since fastq-multx does not perform local sequence alignment.
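The effect of a one-base shift on a fixed-position comparison can be checked with a few lines of Python. This is only an illustration of position-by-position matching, not the fastq-multx implementation; `mismatches_at` is a hypothetical helper.

```python
def mismatches_at(read, barcode, offset=0):
    """Hamming distance between the barcode and the read at a fixed offset."""
    window = read[offset:offset + len(barcode)]
    return sum(a != b for a, b in zip(window, barcode))


read = "TGATCTCTCGAAGTCTAC"  # first unmatched read quoted above
barcode = "GATCTCTCGA"

print(mismatches_at(read, barcode, 0))  # 10: every position disagrees
print(mismatches_at(read, barcode, 1))  # 0: exact match once the T is skipped
```

With the comparison anchored at the first base, all ten barcode positions count as mismatches, far beyond what -m 2 tolerates.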

Any chance the sequences are the reverse complements of the barcodes? From your example, TGATCTCTCG becomes CGAGAGATCA.
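For checking the reverse-complement hypothesis against a whole barcode list, a short Python helper is enough. `reverse_complement` is a hypothetical name, and the translation table only covers A, C, G, T and N.

```python
# Translation table for the four bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("TGATCTCTCG"))  # CGAGAGATCA
```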

The -b option forces beginning of line (5') for barcode matching. So if I do not use this option, fastq-multx should not expect the barcode sequence to start at the end. Is there a way to look around the 5' position for a barcode match, for example by skipping the first base of a read if necessary?

My sequences are not the reverse complements of the barcodes.

brwnj commented

I don't know of a case where one would expect the barcode to ever shift positions. Are all of the barcodes off by one base? Are all of the sequences the same length? If you know for sure that those reads are starting with a T and then your barcode, why not trim the T then demultiplex?

Who can answer my question about these position shifts? I really need to know whether fastq-multx can handle this case. In my mind, it should be the default behaviour as soon as the -b option is disabled. Some of my reads are not offset, some are off by one base, some by two (TT or CT), and some could be off by more, but all my reads have the same length (251 bp). I will try trimming the first base from all the reads that are off by one base and see how fastq-multx behaves.
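The trimming step could be sketched in Python as below, run on each R1 record before demultiplexing. This is an assumption-laden illustration rather than a fastq-multx feature: it looks for an exact barcode hit within the first few bases and drops whatever precedes it, from both the sequence and the quality string. `trim_to_barcode` and `max_offset` are hypothetical names.

```python
def trim_to_barcode(seq, qual, barcodes, max_offset=2):
    """Drop the bases (and qualities) that precede an exact barcode hit.

    Offsets 0..max_offset are tried in order. If no barcode starts within
    that range, the record is returned unchanged.
    """
    for offset in range(max_offset + 1):
        if any(seq.startswith(bc, offset) for bc in barcodes):
            return seq[offset:], qual[offset:]
    return seq, qual


seq, qual = trim_to_barcode("TGATCTCTCGAAGTCTAC", "I" * 18, ["GATCTCTCGA"])
print(seq)  # GATCTCTCGAAGTCTAC
```

Only exact hits are trimmed here, so a read that carries both a shift and a sequencing error in its barcode would still be left as it is.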