brwnj/fastq-multx

What does "Skipped because of distance < 2 : 9439859" mean?

kaine1973 opened this issue · 1 comments

I have a usage problem.

  1. data type:

paired end
only an 8 bp I7 index at the tail of R2.

  1. barcode file:
SX20G0032	TTCTGGTG
SX20G0033	CCGAAAAC
SX20G0034	CGAAAAGG
SX20G0035	AACCAGCT
SX20G0081	GTTTGTGC
SX20G0083	CGGTTTTC
SX20G0084	TGACCGAA
SX20G0085	CCCTATTC
SX20G0086	ACGTTGTG
SX20G0087	AAAAGCGC
SX20G0088	TCACTCGT
SX20G0089	GCCGATTT
SX20G0090	TGCAAGAG
SX20U00079	CCCTTAAC
SX20U00080	CCTACCTA
SX20U00081	GTTTCAGC
  1. commandline
    fastq-multx -e -B barcode.txt -m 1 E100004487_L01_read_2.fq.gz E100004487_L01_read_1.fq.gz -o %_R2.fastq -o %_R1.fastq

  2. questions

     1. Did I use the right barcode format and command line?

     2. I got output like "Skipped because of distance < 2 : 9439859". Exactly what kind of reads were skipped?

  3. split rate:

Id	Count   File(s)

SX20G0032 232298140 SX20G0032_R2.fastq SX20G0032_R1.fastq
SX20G0033 16034321 SX20G0033_R2.fastq SX20G0033_R1.fastq
SX20G0034 245363709 SX20G0034_R2.fastq SX20G0034_R1.fastq
SX20G0035 599969474 SX20G0035_R2.fastq SX20G0035_R1.fastq
SX20G0081 596771042 SX20G0081_R2.fastq SX20G0081_R1.fastq
SX20G0083 40790595 SX20G0083_R2.fastq SX20G0083_R1.fastq
SX20G0084 1038522885 SX20G0084_R2.fastq SX20G0084_R1.fastq
SX20G0085 5594048 SX20G0085_R2.fastq SX20G0085_R1.fastq
SX20G0086 8100620 SX20G0086_R2.fastq SX20G0086_R1.fastq
SX20G0087 5314546 SX20G0087_R2.fastq SX20G0087_R1.fastq
SX20G0088 206429619 SX20G0088_R2.fastq SX20G0088_R1.fastq
SX20G0089 433365557 SX20G0089_R2.fastq SX20G0089_R1.fastq
SX20G0090 429231416 SX20G0090_R2.fastq SX20G0090_R1.fastq
SX20U00079 281572893 SX20U00079_R2.fastq SX20U00079_R1.fastq
SX20U00080 270611344 SX20U00080_R2.fastq SX20U00080_R1.fastq
SX20U00081 304485507 SX20U00081_R2.fastq SX20U00081_R1.fastq
unmatched 144289659 unmatched_R2.fastq unmatched_R1.fastq
total 563778079

So many reads are unmatched, and I am not sure whether this is a normal rate. If not, is it the result of a bad command line or a bad barcode setup?

As I understand it, the message counts reads that were skipped because their best and second-best barcode matches were too close to call. The difference between the "euclidean" distance from the index read to the best-matching barcode and its distance to the second-best-matching barcode must be at least the value of -d (which defaults to 2); reads that fall short of that gap are skipped. A caveat: I didn't write the code, and I've had some debate with others who have also looked at it over the meaning of the distance metric controlled by -d. If you have a high degree of trust in the accuracy of your index reads, setting -d to 1 should allow you to recover some of your unmatched/skipped reads.
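
To make the rule concrete, here is a minimal Python sketch of that interpretation. It is not the fastq-multx implementation: it assumes the distance is a plain mismatch count (Hamming distance), and `assign`, `max_mismatch`, and `min_gap` are hypothetical names standing in for the behaviour of -m and -d. The two toy barcodes are made up for illustration.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign(index_read, barcodes, max_mismatch=1, min_gap=2):
    """Assign a read to a barcode under the assumed -m / -d rule.

    Returns the barcode id, "unmatched" if even the best barcode has too
    many mismatches, or "skipped" if the best and second-best barcodes
    are closer together than min_gap.
    """
    scored = sorted((hamming(index_read, seq), name)
                    for name, seq in barcodes.items())
    (best, best_name), (second, _) = scored[0], scored[1]
    if best > max_mismatch:
        return "unmatched"
    if second - best < min_gap:
        return "skipped"
    return best_name


# Two made-up barcodes that differ at a single position.
toy = {"X": "AAAAAAAA", "Y": "AAAAAAAT"}
print(assign("AAAAAAAA", toy, min_gap=2))
print(assign("AAAAAAAA", toy, min_gap=1))
```

Under this reading, a perfect match to `X` is still skipped at the default gap of 2 because `Y` is only one mismatch away, and lowering the gap to 1 assigns it to `X`.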

However, this reveals a barcode design issue. The barcodes should be sufficiently different from one another to ensure that this does not happen. I believe (though I could be wrong, and this should be checked) that the distance calculation may account for an indel at the beginning of the sequence, which could explain the euclidean closeness of some of the barcodes. Supplying -b should theoretically mitigate this if that's an issue with your barcodes.
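
One way to check the barcode design is to list the barcode pairs that sit close together. The sketch below uses the barcodes from the issue and assumes plain Hamming distance with no indel handling, which may differ from what fastq-multx computes internally; `close_pairs` is a hypothetical helper, not part of the tool.

```python
from itertools import combinations

# Barcodes copied from the barcode file in the issue.
BARCODES = {
    "SX20G0032": "TTCTGGTG", "SX20G0033": "CCGAAAAC",
    "SX20G0034": "CGAAAAGG", "SX20G0035": "AACCAGCT",
    "SX20G0081": "GTTTGTGC", "SX20G0083": "CGGTTTTC",
    "SX20G0084": "TGACCGAA", "SX20G0085": "CCCTATTC",
    "SX20G0086": "ACGTTGTG", "SX20G0087": "AAAAGCGC",
    "SX20G0088": "TCACTCGT", "SX20G0089": "GCCGATTT",
    "SX20G0090": "TGCAAGAG", "SX20U00079": "CCCTTAAC",
    "SX20U00080": "CCTACCTA", "SX20U00081": "GTTTCAGC",
}


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def close_pairs(barcodes, threshold):
    """Return (distance, id_a, id_b) for every pair at or below threshold."""
    pairs = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(barcodes.items(), 2):
        dist = hamming(seq_a, seq_b)
        if dist <= threshold:
            pairs.append((dist, name_a, name_b))
    return sorted(pairs)


for dist, name_a, name_b in close_pairs(BARCODES, threshold=2):
    print(f"{name_a} vs {name_b}: {dist} mismatches")
```

For example, SX20G0081 (GTTTGTGC) and SX20U00081 (GTTTCAGC) differ at only two positions, so with -m 1 a single sequencing error at either position leaves the read one mismatch from both barcodes, and it cannot clear a required gap of 2.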