sequencing/NxTrim

Make nxtrim resistant to junk polyG/C stretches from NextSeq device reads

Closed this issue · 3 comments

Hi Jared,
while inspecting the *.pe.fastq.gz files from nxtrim -s .7 -w --separate --separate --preserve-mp --rf (v0.4.1-4965b00), I noticed that they contain suspicious polyG stretches (unlike the *.mp.fastq.gz files). I suspect they are produced by the read-joining code, and I would like nxtrim to avoid joining reads on anything even barely resembling polyG, polyC, or polyN.

I fished out reads from *.pe.fastq.gz with grep GGGGGGGGGGGGGGG.

$  wc -l  HFYJ5AFXX.polyG.lst
272 HFYJ5AFXX.polyG.lst
$

I extracted the read names and went back to the original files, from which I pulled both mates of each read. Note that very few of the original reads match the GGGGGGGGGGGGGGG query.

$ grep -c GGGGGGGGGGGGGGG HFYJ5AFXX.*.polyG.fastq 
HFYJ5AFXX.1_5kb_R1.polyG.fastq:2
HFYJ5AFXX.1_5kb_R2.polyG.fastq:3
HFYJ5AFXX.1_8kb_R1.polyG.fastq:1
HFYJ5AFXX.1_8kb_R2.polyG.fastq:4
$
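One caveat with counting by grep: `grep -c` counts matching lines, and in a FASTQ file the quality line can itself contain runs of the character G. A minimal sketch of a record-aware count follows, assuming plain 4-line FASTQ records; the function name and the two-record example are hypothetical and not taken from the attached files.

```python
from itertools import islice

def count_reads_with_motif(fastq_lines, motif="G" * 15):
    """Count FASTQ records whose sequence line contains `motif`.

    Assumes strict 4-line records (header, sequence, '+', quality).
    """
    # The sequence is the 2nd line of every 4-line record (index 1, step 4).
    sequences = islice(fastq_lines, 1, None, 4)
    return sum(1 for seq in sequences if motif in seq)

# Hypothetical example: read1 has a polyG sequence, read2 only has
# a run of 'G' characters in its quality line.
example = [
    "@read1", "ACGT" + "G" * 15, "+", "I" * 19,
    "@read2", "ACGTACGTACGTACGTACG", "+", "G" * 19,
]
n_polyg_reads = count_reads_with_motif(example)
```

Here a plain line-based search would match two lines, while the record-aware count only considers the sequence lines.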

HFYJ5AFXX.1_8kb_R2.polyG.fastq.txt
HFYJ5AFXX.1_8kb_R1.polyG.fastq.txt
HFYJ5AFXX.1_5kb_R2.polyG.fastq.txt
HFYJ5AFXX.1_5kb_R1.polyG.fastq.txt

I attach the original reads, not the adapter-cleaned reads that were actually fed into nxtrim. I could attach those as well, along with the corresponding reads from the *.pe.fastq.gz files.

Note that NxTrim does not join any reads by default; joining is enabled by the --joinreads option. I have not found this to be helpful and do not use it, but your mileage may vary.

Most likely you see these in the pe library since those are the ends of reads, which are typically of lower quality. It might be sensible to remove them before assembly, but in general I find assemblers are pretty darn robust to such issues.

In any case, this is out of scope for NxTrim, and you could apply such clipping with general-purpose read manipulation tools.
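As a rough illustration of the kind of clipping meant here, the sketch below removes a homopolymer run from the 3' end of a read. It is not part of NxTrim; the function name `trim_poly_tail` and the default minimum run length of 10 are assumptions chosen only for the example.

```python
import re

def trim_poly_tail(seq, qual, base="G", min_len=10):
    """Clip a trailing run of `base` that is at least `min_len` long.

    Returns the (sequence, quality) pair, trimmed to the same length.
    Reads without such a tail are returned unchanged.
    """
    # Builds e.g. "G{10,}$": a run of >= min_len bases anchored at the 3' end.
    match = re.search(f"{base}{{{min_len},}}$", seq)
    if match is None:
        return seq, qual
    cut = match.start()
    return seq[:cut], qual[:cut]

# Hypothetical read: 6 informative bases followed by 12 G's.
trimmed_seq, trimmed_qual = trim_poly_tail("ACGTAC" + "G" * 12, "I" * 18)
```

Running the same function with `base="C"` would handle reads that carry the run on the opposite strand; internal runs are deliberately left alone, since only the tail is anchored.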

Given that I did not include --joinreads on the command line, something else must have created them. As I said, you can hardly find GGGGGGGGGGGGGGG in the original reads, but you do find it in all of the output from nxtrim.

Yes, but you can find the reverse complement of GGGGGGGGGGGGGGG in the original reads:

$  grep -c CCCCCCCCCCCCCCC HFYJ5AFXX.*.polyG.*
HFYJ5AFXX.1_5kb_R1.polyG.fastq.txt:144
HFYJ5AFXX.1_5kb_R2.polyG.fastq.txt:0
HFYJ5AFXX.1_8kb_R1.polyG.fastq.txt:128
HFYJ5AFXX.1_8kb_R2.polyG.fastq.txt:0
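The point of the counts above is that a polyC run in an input read appears as polyG once that read is reverse-complemented. A minimal sketch of a strand-aware check follows; the helper names are hypothetical, and only the bases A, C, G, T and N are handled.

```python
# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def has_motif_either_strand(seq, motif="G" * 15):
    """True if `motif` occurs in `seq` or in its reverse complement."""
    return motif in seq or motif in reverse_complement(seq)

# Hypothetical read with a polyC tail: no polyG on the forward strand,
# but a polyG head after reverse-complementing.
read = "ACGT" + "C" * 15
flipped = reverse_complement(read)
```

Searching both strands this way avoids the apparent mismatch between the input and output counts.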