switch to Smith-Waterman for adapter matching
Closed this issue · 46 comments
The current adapter matching routine is very simple. It looks for a match (within a Hamming distance) on either side of the junction adapter. If a match to either 19bp side of the adapter is found, the entire 38bp junction adapter is clipped appropriately (basically a crude form of seed and extend). This can miss cases were both sides of the junction adapter have an indel error.
Smith-Waterman alignment will improve this.
@mmokrejs the Smith-Waterman alignment now appears to be working well if you are still interested. It can be turned on with the -w
flag.
Usage example:
## SW for adapter matching
../nxtrim -s .7 -w -1 EcMG1_ATGTCA_L001_R1_001.fastq.gz -2 EcMG1_ATGTCA_L001_R2_001.fastq.gz --stdout-mp | bwa mem EcMG.fna -p - | gzip -1 > mp.sw.bam
../nxtrim -s .7 -w -1 EcMG1_ATGTCA_L001_R1_001.fastq.gz -2 EcMG1_ATGTCA_L001_R2_001.fastq.gz --stdout-un | bwa mem EcMG.fna -p - | gzip -1 > un.sw.bam
I used nxtrim -s .7 -w --separate --preserve-mp --rf
for the testing of version v0.4.1-4965b00.
I still see plenty linkers unremoved with minimum similarity 86.11%. Something is wrong with the scoring I guess.
$ for f in *.nxtrim_*.fastq.gz; do echo $f; p=`basename $f .fastq.gz`; blastall -p blastn -d $p.fasta -i Nextera_MatePair_hairpin.fasta -F F -v 99999 -b 99999 -m 8 > $p.missed_linkers.blastn.tsv; done
$
$ wc -l *missed_linkers.blastn.tsv
1495 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
692 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
3701 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
2048 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
860 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
5568 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
1534 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
741 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
3486 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
1973 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
829 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
5494 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
1544 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
621 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
3830 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
1810 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
811 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
5557 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
1408 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
701 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
3487 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
1878 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
741 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
5376 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
1438 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
762 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
3374 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
2176 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
718 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
5476 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
1502 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
772 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
3347 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
2052 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
853 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
5445 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
1926 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
874 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
4897 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
2179 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
833 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
5749 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
2038 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
811 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
4642 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
1989 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
806 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
7821 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
118665 total
Some total counts when the files came out of trimmomatic:
$ grep '^Input Read Pairs' *.trimmomatic.ilmn.log
HFYJ5AFXX.1_5kb.trimmomatic.ilmn.log:Input Read Pairs: 17502155 Both Surviving: 16495106 (94.25%) Forward Only Surviving: 969194 (5.54%) Reverse Only Surviving: 20662 (0.12%) Dropped: 17193 (0.10%)
HFYJ5AFXX.1_8kb.trimmomatic.ilmn.log:Input Read Pairs: 17481778 Both Surviving: 16300973 (93.25%) Forward Only Surviving: 1056900 (6.05%) Reverse Only Surviving: 21681 (0.12%) Dropped: 102224 (0.58%)
HFYJ5AFXX.2_5kb.trimmomatic.ilmn.log:Input Read Pairs: 17177379 Both Surviving: 16253919 (94.62%) Forward Only Surviving: 903030 (5.26%) Reverse Only Surviving: 11399 (0.07%) Dropped: 9031 (0.05%)
HFYJ5AFXX.2_8kb.trimmomatic.ilmn.log:Input Read Pairs: 17087286 Both Surviving: 16044969 (93.90%) Forward Only Surviving: 980999 (5.74%) Reverse Only Surviving: 12909 (0.08%) Dropped: 48409 (0.28%)
HFYJ5AFXX.3_5kb.trimmomatic.ilmn.log:Input Read Pairs: 16845780 Both Surviving: 15867509 (94.19%) Forward Only Surviving: 943755 (5.60%) Reverse Only Surviving: 18413 (0.11%) Dropped: 16103 (0.10%)
HFYJ5AFXX.3_8kb.trimmomatic.ilmn.log:Input Read Pairs: 16841620 Both Surviving: 15716944 (93.32%) Forward Only Surviving: 1010720 (6.00%) Reverse Only Surviving: 19198 (0.11%) Dropped: 94758 (0.56%)
HFYJ5AFXX.4_5kb.trimmomatic.ilmn.log:Input Read Pairs: 15543789 Both Surviving: 14582359 (93.81%) Forward Only Surviving: 938112 (6.04%) Reverse Only Surviving: 12569 (0.08%) Dropped: 10749 (0.07%)
HFYJ5AFXX.4_8kb.trimmomatic.ilmn.log:Input Read Pairs: 15524361 Both Surviving: 14459869 (93.14%) Forward Only Surviving: 988095 (6.36%) Reverse Only Surviving: 14084 (0.09%) Dropped: 62313 (0.40%)
$ grep '^Input Read Pairs' *.trimmomatic.ilmn.log | awk '{print $4, $12, $17}'
17502155 969194 20662
17481778 1056900 21681
17177379 903030 11399
17087286 980999 12909
16845780 943755 18413
16841620 1010720 19198
15543789 938112 12569
15524361 988095 14084
$
Total input was 141925868 reads pairs (or just single-end orphans). Anyway, not a single-one should slip out.
HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt
HFYJ5AFXX.1_5kb_R2.sw_missed_linker.fastq.txt
HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.txt
HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.txt
HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.txt
HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.txt
HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.txt
HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.txt
In terms of practical performance, I am pretty happy at this stage and probably won't be working on this for the time being. The assemblies that come out of this are very good for short read sequence data.
I can align the NxTrim output back to the reference and check the proportion of RF against FR:
joconnell@ubuntu:~/workspace/NxTrim/test$ /usr/bin/python alignment_summary.py mp.sw.bam
Orientation Frequency Median insert size
FR 392764 3384
RF 1706 1242515
FF 2150 1009331
RR 2072 942504
RF/(FR+RF) = 0.004325
joconnell@ubuntu:~/workspace/NxTrim/test$ /usr/bin/python alignment_summary.py un.sw.bam
Orientation Frequency Median insert size
FR 110925 3193
RF 710 712807
FF 667 1080812
RR 569 1085896
RF/(FR+RF) = 0.006360
The RF here are nominally incorrectly orientated pairs. For confirmed mate-pairs, the RF proportion is less than both RR and FF suggesting very clean data. For unknown, it is slightly elevated, but this is a known limitation of pairs with undetected adapters.
The blast evaluation only shows sensitivity. You need to balance this against how many of those blast hits are not actual adapters. These adapters share substrings with actual genomic DNA once you get down to <=10bp.
You also need to check how many of those hits are from read pairs with additional adapter copies, these are hard to handle.
The blast evaluation only shows sensitivity. You need to balance this against how many of those blast hits are not actual adapters. These adapters share substrings with actual genomic DNA once you get down to <=10bp.
No, the matches are much longer, and second, I do not mind discarding valid DNA if it is similar tot he linker. That DNA will come into the assembly from a whole genome shotgun library, not from MatePaired library. That is the key point. Trying to "protect" possibly valid matches is just wrong apporach. MatePair library data are just not proper source of sequence if some locus has similar sequence.
>NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
Length = 26
Score = 52.0 bits (26), Expect = 5e-06
Identities = 26/26 (100%)
Strand = Plus / Plus
Query: 12 acacatctagatgtgtataagagaca 37
||||||||||||||||||||||||||
Sbjct: 1 acacatctagatgtgtataagagaca 26
and the worst from HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.txt file:
>NB501598:62:HFYJ5AFXX:1:11101:4458:1807 1:N:0:GCCAAT
Length = 155
Score = 32.2 bits (16), Expect = 4.4
Identities = 16/16 (100%)
Strand = Plus / Plus
Query: 8 ttatacacatctagat 23
||||||||||||||||
Sbjct: 139 ttatacacatctagat 154
Anyway, in both cases the match is clearly at the beginning/end of the raw read, so you are missing incomplete matches. BTW, I stick to the default -v 12
value. all hits reported by blast are just longer, much longer.
You also need to check how many of those hits are from read pairs with additional adapter copies, these are hard to handle.
Didn't you say elsewhere nxtrim discards read mate if after trimming a match remains? I wouldn't do anything different to it. Run blast, find matching readnames, and discard them, before feeding assembler.
Sorry, I think at this point you need to try another tool. Probably the trimming can be improved (albeit marginally) but the current implementation already produces very good assemblies.
No, the matches are much longer, and second, I do not mind discarding valid DNA if it is similar tot he >linker. That DNA will come into the assembly from a whole genome shotgun library, not from >MatePaired library. That is the key point. Trying to "protect" possibly valid matches is just wrong >apporach. MatePair library data are just not proper source of sequence if some locus has similar >sequence.
NxTrim is targeted at assembly using just a single Nextera MP library, not scaffolding. Maybe check out NextClip for scaffolding.
Best of luck!
nxtrim is clearly missing partial while exact matches on edges, the alignments show the Smith-Waterman somehow misses even mismatches, and you say from nucmbers of contigs/scaffodls that the assembly is good? And you did not answer why the above examples were missed. Did you try the tescases I prepared for you?
Thank you for pointing me to yet another partially complete tool. It would have been much more helpful if nxtrim supported a simple CSV input file format telling it where too look for linkers. It can do its own alignments but before starting splitting the reads it could "inject" the coordinates into its decision. That way one could provide a path list with coordinates one wants to use for splitting. I already proposed this before.
Sorry, I could not actually reproduce your first test case:
grep -A3 "NB501598:62:HFYJ5AFXX:1:11205:23960:15051" ~/Downloads/HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt > r1.fq
grep -A3 "NB501598:62:HFYJ5AFXX:1:11205:23960:15051" ~/Downloads/HFYJ5AFXX.1_5kb_R2.sw_missed_linker.fastq.txt > r2.fq
$ ./nxtrim -w -s 0.7 -1 r1.fq -2 r2.fq -O test
$ zcat test.pe.fastq.gz
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
AGTCGTTGTTTTATGTCACCTTCTTGTTTAGCCTGCGTTTCCTCGTGTTTTGCTTGCTTTGTTTTATTGTTTTGTACTTTGTTAAT
+
E<E/EEEEEE/AAEEEEAEA<E<EEEEEEEEEEEE<EEEEEAEAEEEE/AEE<EE<AE/AAEEEA//EEEEEAEE/6/AAAEEEAE
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 2:N:0:GCCAAT
AGCATTAAAAGCATTTTCAATTTAGTGTTAAATATTGGATATATAGCCATTTGCTTTTAGTTTACATAAAAAAATCATTTAAATGAAGAATAAAAAAATCAGGTGTTTATAATTCAAAATTTTTATACATTGGGCAAATTTTTCAACACACAGAC
+
AAAAAEEEEEEEEEEEEE666/EAE6E6AEEEAEEE/E/AAEEE/E<//6E/EEEEEEE6EEA/E/EE//AAAEEE<EEEAEEEEEEEE6/EE//</EE//EEEEE<E<<E//EEEE///EEEA<<A//EEEAA<AE<E<EEA<//E///</E<<
joconnell@ubuntu:~/workspace/NxTrim$
joconnell@ubuntu:~/workspace/NxTrim$ zcat test.pe.fastq.gz | grep ACACATCTAGATGTGTATAAGAGACA | wc -l
0
Maybe stick to --separate --preserve-mp --rf
?
$ grep NB501598:62:HFYJ5AFXX:1:11205:23960:15051 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_*.fastq
HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.mp.fastq:@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.mp.fastq:@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 2:N:0:GCCAAT
$
You have them assigned under PE instead of MP.
I still get a PE with those commands. As you say, the adapter is at the start of the read, so it has to be a PE (unless it misses the adapter).
$ grep -A3 "NB501598:62:HFYJ5AFXX:1:11205:23960:15051" ~/Downloads/HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt > r1.fq
$ grep -A3 "NB501598:62:HFYJ5AFXX:1:11205:23960:15051" ~/Downloads/HFYJ5AFXX.1_5kb_R2.sw_missed_linker.fastq.txt > r2.fq
joconnell@ubuntu:~/workspace/NxTrim$ ./nxtrim -1 r1.fq -2 r2.fq -O test --separate --preserve-mp --rf
Output: test.*.fastq.gz
Trimming:
R1: r1.fq
R2: r2.fq
--preserve-mp is on: will favour MPs over PEs
Trimming summary:
1 / 1 ( 100.00% ) reads passed chastity/purity filters.
0 / 1 ( 0.00% ) reads had TWO copies of adapter (filtered).
0 / 1 ( 0.00% ) read pairs were ignored because template length appeared less than read length
1 remaining reads were trimmed
0 / 1 ( 0.00% ) read pairs had MP orientation
1 / 1 ( 100.00% ) read pairs had PE orientation
0 / 1 ( 0.00% ) read pairs had unknown orientation
0 / 1 ( 0.00% ) were single end reads
0 / 1 ( 0.00% ) extra single end reads were generated from overhangs
joconnell@ubuntu:~/workspace/NxTrim$
joconnell@ubuntu:~/workspace/NxTrim$
joconnell@ubuntu:~/workspace/NxTrim$ zcat test_R1.pe.fastq.gz
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
ATTAACAAAGTACAAAACAATAAAACAAAGCAAGCAAAACACGAGGAAACGCAGGCTAAACAAGAAGGTGACATAAAACAACGACTAGCAAACACACAGTGAACAAGCAGTGTATAGGAGACAAGAAC
+
EAEEEAAA/6/EEAEEEEE//AEEEAA/EA<EE<EEA/EEEEAEAEEEEE<EEEEEEEEEEEE<E<AEAEEEEAA/EEEEEE/E<EEAEEEEEEAEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEE
joconnell@ubuntu:~/workspace/NxTrim$ zcat test_R2.pe.fastq.gz
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 2:N:0:GCCAAT
GTCTGTGTGTTGAAAAATTTGCCCAATGTATAAAAATTTTGAATTATAAACACCTGATTTTTTTATTCTTCATTTAAATGATTTTTTTATGTAAACTAAAAGCAAATGGCTATATATCCAATATTTAACACTAAATTGAAAATGCTTTTAATGCT
+
<<E/<///E//<AEE<E<EA<AAEEE//A<<AEEE///EEEE//E<<E<EEEEE//EE/<//EE/6EEEEEEEEAEEE<EEEAAA//EE/E/AEE6EEEEEEE/E6//<E/EEEAA/E/EEEAEEEA6E6EAE/666EEEEEEEEEEEEEAAAAA
joconnell@ubuntu:~/workspace/NxTrim$ zcat test_R1.pe.fastq.gz | grep ACACATCTAGATGTGTATAAGAGACA | wc -l
0
joconnell@ubuntu:~/workspace/NxTrim$ zcat test_R2.pe.fastq.gz | grep ACACATCTAGATGTGTATAAGAGACA | wc -l
0
Here is what I fed into nxtrim.
$ grep -A 3 NB501598:62:HFYJ5AFXX:1:11205:23960:15051 HFYJ5AFXX.1_5kb_R?.trimmomatic.?E.fastq
HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq:@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq-ACACATCTAGATGTGTATAAGAGACATGTTCTTGTCTCCTATACACTGCTTGTTCACTGTGTGTTTGCTAGTCGTTGTTTTATGTCACCTTCTTGTTTAGCCTGCGTTTCCTCGTGTTTTGCTTGCTTTGTTTTATTGTTTTGTACTTTGTTAAT
HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq-+
HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq-AAAA6EEAEA/AEEEEAEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEEEEEEAEEEEEEAEE<E/EEEEEE/AAEEEEAEA<E<EEEEEEEEEEEE<EEEEEAEAEEEE/AEE<EE<AE/AAEEEA//EEEEEAEE/6/AAAEEEAE
--
HFYJ5AFXX.1_5kb_R2.trimmomatic.PE.fastq:@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 2:N:0:GCCAAT
HFYJ5AFXX.1_5kb_R2.trimmomatic.PE.fastq-AGCATTAAAAGCATTTTCAATTTAGTGTTAAATATTGGATATATAGCCATTTGCTTTTAGTTTACATAAAAAAATCATTTAAATGAAGAATAAAAAAATCAGGTGTTTATAATTCAAAATTTTTATACATTGGGCAAATTTTTCAACACACAGAC
HFYJ5AFXX.1_5kb_R2.trimmomatic.PE.fastq-+
HFYJ5AFXX.1_5kb_R2.trimmomatic.PE.fastq-AAAAAEEEEEEEEEEEEE666/EAE6E6AEEEAEEE/E/AAEEE/E<//6E/EEEEEEE6EEA/E/EE//AAAEEE<EEEAEEEEEEEE6/EE//</EE//EEEEE<E<<E//EEEE///EEEA<<A//EEEAA<AE<E<EEA<//E///</E<<
$
But you have the sequences reverse-complemented.
I gave them out in correct orientation, IMHO:
$ grep -A 3 NB501598:62:HFYJ5AFXX:1:11205:23960:15051 HFYJ5AFXX.1_5kb_R?.trimmomatic.?E.fastq | grep -i acacatctagatgtgtataagagaca
HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq-ACACATCTAGATGTGTATAAGAGACATGTTCTTGTCTCCTATACACTGCTTGTTCACTGTGTGTTTGCTAGTCGTTGTTTTATGTCACCTTCTTGTTTAGCCTGCGTTTCCTCGTGTTTTGCTTGCTTTGTTTTATTGTTTTGTACTTTGTTAAT
$
$ grep -A 3 NB501598:62:HFYJ5AFXX:1:11205:23960:15051 *.sw_missed_linker.fastq.txt | grep ACACATCTAGATGTGTATAAGAGACA
HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt-ACACATCTAGATGTGTATAAGAGACATGTTCTTGTCTCCTATACACTGCTTGTTCACTGTGTGTTTGCTAGTCGTTGTTTTATGTCACCTTCTTGTTTAGCCTGCGTTTCCTCGTGTTTTGCTTGCTTTGTTTTATTGTTTTGTACTTTGTTAAT
$
I don't know what is happening, but I am finding the adapter in that example and the reads look the same:
joconnell@ubuntu:~/workspace/NxTrim$ cat r1.fq
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
ACACATCTAGATGTGTATAAGAGACATGTTCTTGTCTCCTATACACTGCTTGTTCACTGTGTGTTTGCTAGTCGTTGTTTTATGTCACCTTCTTGTTTAGCCTGCGTTTCCTCGTGTTTTGCTTGCTTTGTTTTATTGTTTTGTACTTTGTTAAT
+
AAAA6EEAEA/AEEEEAEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEEEEEEAEEEEEEAEE<E/EEEEEE/AAEEEEAEA<E<EEEEEEEEEEEE<EEEEEAEAEEEE/AEE<EE<AE/AAEEEA//EEEEEAEE/6/AAAEEEAE
joconnell@ubuntu:~/workspace/NxTrim$ cat r2.fq
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 2:N:0:GCCAAT
AGCATTAAAAGCATTTTCAATTTAGTGTTAAATATTGGATATATAGCCATTTGCTTTTAGTTTACATAAAAAAATCATTTAAATGAAGAATAAAAAAATCAGGTGTTTATAATTCAAAATTTTTATACATTGGGCAAATTTTTCAACACACAGAC
+
AAAAAEEEEEEEEEEEEE666/EAE6E6AEEEAEEE/E/AAEEE/E<//6E/EEEEEEE6EEA/E/EE//AAAEEE<EEEAEEEEEEEE6/EE//</EE//EEEEE<E<<E//EEEE///EEEA<<A//EEEAA<AE<E<EEA<//E///</E<<
I suspect your cases where you can blast out some adapters are of the "multiple adapter" sort. Such as this one (from my E.coli):
raw:
@M903:194:000000000-A5FBD:1:1112:6063:10169 1:N:0:ATGTCA
GCTGGAACACAGCCACTGGCCGCGATTTCTGAGGCTTCTTTACCGCTAATCTCTTGCCGCATCCATTCCCGCGCAGCTTCTGGCGACAGTACCAGTCTGTCTCTTATACACATCTAGATGTGTATAAGAGACAGAGATGTGTATAAGATAC
+
AAA@AFFF1CAF1FGF11FGCEGEAEFGEGHGCFGCFGFHH2GE?EAECHHHHHHHHGGGEEGGHFHHFHGGCEEGEFFDGHFFEGGGGHFEFHHHGFHHHHHHFHHHEHFDCHHGFHHGHDFFGHHHHHHFHHFHHHGHEFHFHGHHGHB
joconnell@ubuntu:~/workspace/NxTrim/test$ cat r2.fq
@M903:194:000000000-A5FBD:1:1112:6063:10169 2:N:0:ATGTCA
TATCGTAGATTCAGTATTAGATGTGACTTTATTTAATTGAACCGGCTGAGTTTTTAACGAAAGATCATTTAATGGAAACATTTACACTCCCTGTATCTTATACACATCTCTGTCTCTTATACACATCTAGATGTGTATAAGAGACAGACTG
+
AA1AAF>A>DFFDE33BE333AFDFF3AGFFHBEFHCFGH11G0EEEGCFDGBGECGHGGGAEGGFHHFFFHHHBB1GBGHHBGHHGHFHHHHGFFFGHHHFHFHFHEHHHEGHFHFGGHGBGHFHHFHHHHG2BBGHHHHHHG00B<GFG
Where are the adapters?
R1:
perfect match partial overlap
CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG AGATGTGTATAAGAGACAG
GCTGGAACACAGCCACTGGCCGCGATTTCTGAGGCTTCTTTACCGCTAATCTCTTGCCGCATCCATTCCCGCGCAGCTTCTGGCGACAGTACCAGT|CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG|AGATGTGTATAAGATAC
imperfect match perfect match
R2: CTGTCTCTTATACACATCT CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
TATCGTAGATTCAGTATTAGATGTGACTTTATTTAATTGAACCGGCTGAGTTTTTAACGAAAGATCATTTAATGGAAACATTTACACTCC|CTGTATCTTATACACATCT|CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG|ACTG
You can see there is a whole junction adapter and at least part of another one which is unexpected.
The code only catches these weird (unexpected) second adapter copies when there is a perfect string match, so in cases like above, they are missed. Yes, I could blast or sw-align, but they are so rare I just used a simple string find to catch a few of them. This could obviously be improved.
Yes, I could blast or sw-align, but they are so rare I just used a simple string find to catch a few of them. This could obviously be improved.
You probably think of me that I am nit picking but isn't this Issue about switch to Smith-Waterman matching? I thought everything in the code uses SW now (provided I used -w
option). This is unexpected to hear something still relies on exact string match. But I am happy to help your finiding these places, one by one. ;)
I did not read the code but I fear you are getting just a single match from SW, you really need to ask for all matches (you need the guarantee that you get them all). If not you are out of luck. I spent several years on developing a tool to clean Roche/454 data and in principle I was hitting same issues.
Practically, in this case, you are either thorough and look for the full-length junction adapter, then look for those halves, merge the regions eventually into a single range (allowing say 3nt in between if the alignment was narrower for some reason). Then you continue with usual splitting as you do in your code, which notably evaluates whether any sequence was left leftwards and rightwards from the splitting region. Alternatively you can throw away this particular mate, optionally whole pair (some user might prefer that for a good reason).
I wouldn't be so picky but if the linkers remain in the reads then you have mismerged contigs. It depends on the coverage and aligner, bt basically if you start assembly with these 'chimeric' reads with linker unremoved, you won't get anything good out. Assemblers can overcome some unremoved sequencing adapters provided they appear at the end of reads. This is not the case with linkers. For me 118k unremoved linkers is too much. It does not matter out of how many input reads. Assemblers will respect this as true sample sequence. They are more likely to normalize the "clean" reads and enrich these reads with unremoved linkers.
Nitpicking is fine, but it looks like we have been talking about two different things. The tool does find adapters on the edges, but it doesn't find all the weird cases of extra unexpected adapters. I use SW (or Hamming match) to find the single best match to the adapter. This is trimmed appropriately. Then, as a way to catch the edge case of an extra unexpected adapter (as in my above example), the code looks for a perfect match (regardless of whether SW/hamming is used - this has not changed).
This small bit of code to catch this is here:
https://github.com/sequencing/NxTrim/blob/master/matepair.cpp#L545-L550
So yes, this will only catch perfect match adapters after the primary adapter has been removed. I guess your remaining adapters are mostly these multi-adapter cases with some amount of mismatch in the second copy of the adapter. I would advocate just throwing the entire pair out, they just seem untrustworthy.
I'm happy to take a pull request if you want to try some solutions. I don't have a lot of time to spend on this at the moment.
FYI. I don't know what you are assembling, but I currently get the best results for bacteria using SPAdes and --justmp
. The pe/se libraries don't seem to help SPAdes.
I would advocate just throwing the entire pair out, they just seem untrustworthy.
In principle I am fine even with this approach, as a quick-and-dirty way to get data clean. Which is what we need, no matter some tiny data is lost.
Would it be relatively quick to implement (as those matches were just adjacent to the main match), I would prefer the range-based approach as some more real solution. This is not a corner-case IMHO. That works well. If you have two isolated ranges from each other then definitely the read/pair should be discarded. But in those adjacent cases you can save some sequence and in turn, good pairs.
This merely means you are not looping to get all matches out of the aligner. That is bad.
If you could replace the string exact matching approach it would be really helpful. I would say it is the priority number one now. Why cannot you just call the aligner on the upstream and downstream sequence in about the lines L545-L550? And if that finds a match, you return(1) as for the exact match? ;-)
Unfortunately I do not know C/C++ so I cannot help you here. I am certainly willing to test and help with other ideas. And I do not see almost any comments in the code, I mean useful comments. ;-)
I would say, this thread is also a proper place to document where elseare the exact string matches hiding.
BTW, I have a polyploid genome, fish. Previous, possibly velvet-based assembly is crap, full of gaps which are not supported in my eyes by reads (I have lots of readsspanning the gap proving there is not gap at all in the sequence). I assume something was wrong with scaffolding, but actually that must have been an issue during contigging because the contigs just should have been wider (the RapidRun reads just bear the whole chunk of sequence). Something was terribly wrong during data preparationIMHO, or wrong cmdline arguments for assembler, who knows. I did not receive any documentation what the previous person did.
@mmokrejs try this branch
git pull
git checkout iss37
make clean
make
The code now just runs the adapter detection routine twice. After the first run, the adapter is N
masked, if another adapter copy is found, the read is discarded. This probably doubles the run-time at the moment.
nxtrim v0.4.1-7fb1356 failed with an error:
nxtrim: matepair.cpp:105: int sw_match(uint8_t*, int, uint8_t*, int, int, float, int8_t*): Assertion `r.te!=-1' failed.
$ gdb NxTrim/nxtrim core.31390
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from NxTrim/nxtrim...done.
[New Thread 31390]
warning: Ignoring non-absolute filename: <linux-vdso.so.1>
Missing separate debuginfo for linux-vdso.so.1
Try: yum --enablerepo='*-debug*' install /usr/lib/debug/.build-id/79/86395b139f22da92f9154f30c5951e68b4b7f6
Missing separate debuginfo for /apps/gentoo/lib64/ld-linux-x86-64.so.2
Try: yum --enablerepo='*-debug*' install /usr/lib/debug/.build-id/79/86395b139f22da92f9154f30c5951e68b4b7f6
Reading symbols from /apps/gentoo/usr/lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /apps/gentoo/usr/lib64/libz.so.1
Reading symbols from /apps/gentoo/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libstdc++.so.6...done.
Loaded symbols for /apps/gentoo/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libstdc++.so.6
Reading symbols from /apps/gentoo/lib64/libm.so.6...done.
Loaded symbols for /apps/gentoo/lib64/libm.so.6
Reading symbols from /apps/gentoo/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libgcc_s.so.1...done.
Loaded symbols for /apps/gentoo/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libgcc_s.so.1
Reading symbols from /apps/gentoo/lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /apps/gentoo/lib64/libc.so.6
Reading symbols from /apps/gentoo/lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /apps/gentoo/lib64/ld-linux-x86-64.so.2
Core was generated by `NxTrim/nxtrim'.
Program terminated with signal 6, Aborted.
#0 0x00007fffce2a6044 in raise () from /apps/gentoo/lib64/libc.so.6
(gdb) where
#0 0x00007fffce2a6044 in raise () from /apps/gentoo/lib64/libc.so.6
#1 0x00007fffce2a74aa in abort () from /apps/gentoo/lib64/libc.so.6
#2 0x00007fffce29edcd in __assert_fail_base () from /apps/gentoo/lib64/libc.so.6
#3 0x00007fffce29ee82 in __assert_fail () from /apps/gentoo/lib64/libc.so.6
#4 0x0000000000407abe in sw_match(unsigned char*, int, unsigned char*, int, int, float, signed char*) () at matepair.cpp:105
#5 0x0000000000407df6 in matePair::findAdapter(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int, float, bool) () at matepair.cpp:187
#6 0x0000000000409b55 in matePair::build(readPair&, int, float, int, bool, bool, bool, bool) () at matepair.cpp:578
#7 0x000000000040312f in main () at nxtrim.cpp:176
(gdb) bt full
#0 0x00007fffce2a6044 in raise () from /apps/gentoo/lib64/libc.so.6
No symbol table info available.
#1 0x00007fffce2a74aa in abort () from /apps/gentoo/lib64/libc.so.6
No symbol table info available.
#2 0x00007fffce29edcd in __assert_fail_base () from /apps/gentoo/lib64/libc.so.6
No symbol table info available.
#3 0x00007fffce29ee82 in __assert_fail () from /apps/gentoo/lib64/libc.so.6
No symbol table info available.
#4 0x0000000000407abe in sw_match(unsigned char*, int, unsigned char*, int, int, float, signed char*) () at matepair.cpp:105
std::__ioinit = {static _S_refcount = 5, static _S_synced_with_stdio = true}
adapter1 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614028 "CTGTCTCTTATACACATCT"}}
adapter2 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614068 "AGATGTGTATAAGAGACAG"}}
r2_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614148 "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"}}
seq_nt4_table = '\004' <repeats 65 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 12 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 139 times>
r1_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140f8 "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"}}
adapterj = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140a8 "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"}}
#5 0x0000000000407df6 in matePair::findAdapter(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int, float, bool) () at matepair.cpp:187
std::__ioinit = {static _S_refcount = 5, static _S_synced_with_stdio = true}
adapter1 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614028 "CTGTCTCTTATACACATCT"}}
adapter2 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614068 "AGATGTGTATAAGAGACAG"}}
r2_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614148 "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"}}
seq_nt4_table = '\004' <repeats 65 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 12 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 139 times>
r1_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140f8 "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"}}
adapterj = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140a8 "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"}}
#6 0x0000000000409b55 in matePair::build(readPair&, int, float, int, bool, bool, bool, bool) () at matepair.cpp:578
std::__ioinit = {static _S_refcount = 5, static _S_synced_with_stdio = true}
adapter1 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614028 "CTGTCTCTTATACACATCT"}}
adapter2 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614068 "AGATGTGTATAAGAGACAG"}}
r2_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614148 "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"}}
seq_nt4_table = '\004' <repeats 65 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 12 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 139 times>
r1_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140f8 "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"}}
adapterj = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140a8 "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"}}
#7 0x000000000040312f in main () at nxtrim.cpp:176
std::__ioinit = {static _S_refcount = 5, static _S_synced_with_stdio = true}
(gdb)
OK, here is a better stacktrace from CXXFLAGS="-ggdb -O0"
binary:
(gdb) bt full
#0 0x00007fffce385625 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fffce386e05 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fffce37e74e in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fffce37e810 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000000000407ba4 in sw_match (target=0x80b740 '\004' <repeats 22 times>, tlen=22, query=0x6235f0 "\001\003\002\003\001\003\001\003\003", qlen=38, minoverlap=12, min_similarity=0.699999988, mat=0x7fffffffda40 "\001\376\376\376") at matepair.cpp:105
sa = 1
go = 1
ge = 3
substring_length = -1
sim = -0
__PRETTY_FUNCTION__ = "int sw_match(uint8_t*, int, uint8_t*, int, int, float, int8_t*)"
sb = 2
r = {score = 0, te = -1, qe = 0, score2 = -1, te2 = -1, tb = 0, qb = 0}
#5 0x0000000000407ed8 in matePair::findAdapter (this=0x7fffffffd8e0, s='N' <repeats 22 times>, minoverlap=12, similarity=0.699999988, use_hamming=false) at matepair.cpp:187
L2 = 19
perfect = 4294967295
s_tmp = 0x80b740 '\004' <repeats 22 times>
a = 4209342
L1 = 22
#6 0x0000000000409e85 in matePair::build (this=0x7fffffffd8e0, readpair=..., minovl=12, sim=0.699999988, ml=21, jr=false, uh=false, pmp=true, jmp=false) at matepair.cpp:586
tmp = {l = 22, filtered = false, description = true, h = "NB501598:62:HFYJ5AFXX:1:11107:2087:3600 2:N:0:CTTGTA", s = 'N' <repeats 22 times>, l3 = "+", q = '#' <repeats 22 times>}
L2 = 22
a1 = 38
b2 = 30
rc2 = {l = 155, filtered = false, description = false, h = "", s = "", l3 = "", q = "CAGCATACTTCCCAGCATGCCAACTATCCCCCTCCCTCCTGTCTCTTATACACAACTAGATGTGTATAAGAGACAGGTCTGTGGGGTGCCCCCAAACTTTACATGAGCTTTTGCTGCAGGACAGATTTTTAG"}
minoverlap2 = -9296
L1 = 132
a2 = -8
b1 = 76
rc1 = {l = 90, filtered = 255, description = 127, h = "", s = "", l3 = "", q = ""}
#7 0x00000000004032e4 in main (argc=12, argv=0x7fffffffde38) at nxtrim.cpp:176
weird = false
m = {joinreads = false, use_hamming = false, preserve_mp = true, _justmp = false, se = {l = 0, filtered = false, description = true, h = "", s = "", l3 = "", q = ""}, mp = {r1 = {l = 0, filtered = false, description = true, h = "", s = "", l3 = "", q = ""}, r2 = {l = 0, filtered = false, description = true,
h = "", s = "", l3 = "", q = ""}, l = 0, filtered = false}, pe = {r1 = {l = 0, filtered = false, description = true, h = "", s = "", l3 = "", q = ""}, r2 = {l = 0, filtered = false, description = true, h = "", s = "", l3 = "", q = ""}, l = 0, filtered = false}, unknown = {r1 = {l = 0, filtered = false,
description = true, h = "", s = "", l3 = "", q = ""}, r2 = {l = 0, filtered = false, description = true, h = "", s = "", l3 = "", q = ""}, l = 1, filtered = false}, minoverlap = 12, minlen = 21, similarity = 0.699999988, adapter1_sw = 0x6235b0 "\001\003\002\003\001\003\001\003\003",
adapter2_sw = 0x6235d0 "", adapterj_sw = 0x6235f0 "\001\003\002\003\001\003\001\003\003", sw_mat = "\001\376\376\376\000\376\001\376\376\000\376\376\001\376\000\376\376\376\001\000\000\000\000\000"}
npass = 2740
trim_warn = false
se_only = 470
pos = {first = 0, second = 0}
nweird = 5
nread = 2740
out = {weird = -9488, n_mp = 1507, n_unk = 76, n_se = 878, n_pe = 341, _justmp = false, _write_mp = true, _write_se = true, _write_pe = true, _write_un = true, print_to_stdout = 240, mp_out = {separate = true, outfile = {fp = 0x0, _stdout = false}, outfile1 = {fp = 0x623700, _stdout = false}, outfile2 = {
fp = 0x6691e0, _stdout = false}}, pe_out = {separate = true, outfile = {fp = 0x0, _stdout = 208}, outfile1 = {fp = 0x73a210, _stdout = false}, outfile2 = {fp = 0x77fcf0, _stdout = false}}, unknown_out = {separate = true, outfile = {fp = 0x0, _stdout = false}, outfile1 = {fp = 0x6aec80,
_stdout = false}, outfile2 = {fp = 0x6f4720, _stdout = false}}, se_out = {fp = 0x7c57d0, _stdout = false}}
minlen = 21
write_stdout_un = false
nodata = 341
p = {r1 = {l = 132, filtered = false, description = true, h = "NB501598:62:HFYJ5AFXX:1:11101:23244:5416 1:N:0:CTTGTA", s = "CAGCATACTTCCCAGCATGCCAACTATCCCCCTCCCTCCTGTCTCTTATACACAACTAGATGTGTATAAGAGACAGGTCTGTGGGGTGCCCCCAAACTTTACATGAGCTTTTGCTGCAGGACAGATTTTTAG", l3 = "+", q =
"AAAA6E6EAEE/EEE/EEEAEEEAEEAE</E/AEEEEE<AAAEEEEEEEAAEEE///EEA/E/EA/E/A///E//E</A<E/////A<</E/E/A//EEA//E<///<<EEE////E<//E/</EEEE/<<<"}, r2 = {l = 22, filtered = false, description = true, h = "NB501598:62:HFYJ5AFXX:1:11107:2087:3600 2:N:0:CTTGTA", s = "TATACACATCTAGATGTGTATA", l3 = "+", q =
"6AA6AEEAE/AEE6EEEEE6E/"}, l = 112, filtered = false}
similarity = 0.699999988
prefix = "HFYJ5AFXX.1_8kb.trimmomatic.PE.nxtrim"
ignorePF = false
write_stdout_mp = false
loptions = {{name = 0x410620 "r1", has_arg = 1, flag = 0x0, val = 49}, {name = 0x410623 "r2", has_arg = 1, flag = 0x0, val = 50}, {name = 0x410626 "output-prefix", has_arg = 1, flag = 0x0, val = 79}, {name = 0x410634 "stdout", has_arg = 0, flag = 0x0, val = 0}, {name = 0x41063b "stdout-mp", has_arg = 0,
flag = 0x0, val = 9}, {name = 0x410645 "stdout-un", has_arg = 0, flag = 0x0, val = 10}, {name = 0x41064f "justmp", has_arg = 0, flag = 0x0, val = 1}, {name = 0x410656 "joinreads", has_arg = 0, flag = 0x0, val = 2}, {name = 0x410660 "rf", has_arg = 0, flag = 0x0, val = 3}, {name = 0x410663 "preserve-mp",
has_arg = 0, flag = 0x0, val = 4}, {name = 0x41066f "ignorePF", has_arg = 0, flag = 0x0, val = 5}, {name = 0x410678 "mp", has_arg = 0, flag = 0x0, val = 6}, {name = 0x41067b "unknown", has_arg = 0, flag = 0x0, val = 7}, {name = 0x410683 "separate", has_arg = 0, flag = 0x0, val = 8}, {
name = 0x41068c "smith-waterman", has_arg = 0, flag = 0x0, val = 119}, {name = 0x41069b "similarity", has_arg = 1, flag = 0x0, val = 115}, {name = 0x4106a6 "minoverlap", has_arg = 1, flag = 0x0, val = 118}, {name = 0x4106b1 "minlength", has_arg = 1, flag = 0x0, val = 108}, {name = 0x0, has_arg = 0,
flag = 0x0, val = 0}}
c = -1
justmp = false
write_stdout = false
infile = {f1 = 0x614270, f2 = 0x61bc10}
joinreads = false
r1 = 0x7fffffffe1e9 "HFYJ5AFXX.1_8kb_R1.trimmomatic.SE.fastq"
rc = true
hamming = false
preserve_mp = true
minoverlap = 12
r2 = 0x7fffffffe214 "HFYJ5AFXX.1_8kb_R2.trimmomatic.SE.fastq"
separate = true
(gdb)
Try again with the last commit.
Works now ... I will keep the debug binary running on all data. ETA probably ~4-6hrs.
Okay. There may be no need for debug binary here. That above error was an assert failure due to a problematic SW alignment, it would have been caught in the release build.
I have already put the debug binary through valgrind without error.
OK, recompiled&restarted with CXXFLAGS="-ggdb -march=native -O2" make
Did you use for development testing the FASTQ files I uploaded above? Just to know what I shall anticipate. ;-)
joconnell@ubuntu:~/workspace/NxTrim$ ./nxtrim -1 /home/joconnell/Downloads/HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt -2 /home/joconnell/Downloads/HFYJ5AFXX.1_5kb_R2.sw_missed_linker.fastq.txt --separate --preserve-mp --rf -O test -w -s 0.7
Output: test.*.fastq.gz
Trimming:
R1: /home/joconnell/Downloads/HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt
R2: /home/joconnell/Downloads/HFYJ5AFXX.1_5kb_R2.sw_missed_linker.fastq.txt
--preserve-mp is on: will favour MPs over PEs
Trimming summary:
7185 / 7185 ( 100.00% ) reads passed chastity/purity filters.
173 / 7185 ( 2.41% ) reads had TWO copies of adapter (filtered).
8 / 7012 ( 0.11% ) read pairs were ignored because template length appeared less than read length
7004 remaining reads were trimmed
4168 / 7004 ( 59.51% ) read pairs had MP orientation
570 / 7004 ( 8.14% ) read pairs had PE orientation
1880 / 7004 ( 26.84% ) read pairs had unknown orientation
386 / 7004 ( 5.51% ) were single end reads
1194 / 7004 ( 17.05% ) extra single end reads were generated from overhangs
joconnell@ubuntu:~/workspace/NxTrim$ for i in test*gz;do echo $i;zcat $i | agrep -1 CTGTCTCTTATACACATCT | wc -l;done
test_R1.mp.fastq.gz
0
test_R1.pe.fastq.gz
0
test_R1.unknown.fastq.gz
0
test_R2.mp.fastq.gz
0
test_R2.pe.fastq.gz
0
test_R2.unknown.fastq.gz
0
test.se.fastq.gz
0
joconnell@ubuntu:~/workspace/NxTrim$ for i in test*gz;do echo $i;zcat $i | agrep -1 AGATGTGTATAAGAGACAG | wc -l;done
test_R1.mp.fastq.gz
0
test_R1.pe.fastq.gz
0
test_R1.unknown.fastq.gz
0
test_R2.mp.fastq.gz
0
test_R2.pe.fastq.gz
0
test_R2.unknown.fastq.gz
0
test.se.fastq.gz
0
Seems too few were found. Try this.
# find the missed matches using blastn
for f in *.mp.fastq.gz *.pe.fastq.gz *.unknown.fastq.gz *.se.fastq.gz; do p=`basename $f .fastq.gz`; echo "$f"; gzip -dc $f > "$p".fastq; fq2fa "$p".fastq "$p".fasta; formatdb -p F -i "$p".fasta; blastn -num_threads 18 -task blastn-short -db "$p".fasta -query Nextera_MatePair_hairpin.fasta -word_size 4 -dust no -reward 1 -max_target_seqs 99999 -max_hsps_per_subject 20 -perc_identity 75 -strand plus > "$p".missed_linkers.blastn.txt; done
for f in *.mp.fastq.gz *.pe.fastq.gz *.unknown.fastq.gz *.se.fastq.gz; do p=`basename $f .fastq.gz`; echo "$f"; gzip -dc $f > "$p".fastq; fq2fa "$p".fastq "$p".fasta; formatdb -p F -i "$p".fasta; blastn -num_threads 18 -task blastn-short -db "$p".fasta -query Nextera_MatePair_hairpin.fasta -word_size 4 -dust no -reward 1 -max_target_seqs 99999 -max_hsps_per_subject 20 -perc_identity 75 -strand plus -outfmt 6 | awk '{print $2}' > "$p".missed_linkers.txt; done
reads had TWO copies of adapter (filtered).
How about:
reads had TWO or more copies of adapter (discarding whole pair).
if that is true.
$ grep 'reads had TWO copies of' *nxtrim__preserve_mp_from_original.log | sort
HFYJ5AFXX.1_5kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:29721 / 16495106 ( 0.18% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.1_5kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:39 / 20662 ( 0.19% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.1_8kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:29331 / 16300973 ( 0.18% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.1_8kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:40 / 21681 ( 0.18% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.2_5kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:29124 / 16253919 ( 0.18% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.2_5kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:14 / 11399 ( 0.12% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.2_8kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:29476 / 16044969 ( 0.18% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.2_8kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:40 / 12909 ( 0.31% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.3_5kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:28822 / 15867509 ( 0.18% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.3_5kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:45 / 18413 ( 0.24% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.3_8kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:28387 / 15716944 ( 0.18% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.3_8kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:33 / 19198 ( 0.17% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.4_5kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:26571 / 14582359 ( 0.18% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.4_5kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:29 / 12569 ( 0.23% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.4_8kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:26923 / 14459869 ( 0.19% ) reads had TWO copies of adapter (filtered).
HFYJ5AFXX.4_8kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:29 / 14084 ( 0.21% ) reads had TWO copies of adapter (filtered).
The discarded total is maybe a bit higher than those 180k from previous tests but it is OK. Will now run the blastn searches ...
These still remain in MP data, for example. The scoring must be wrong somewhere.
> NB501598:62:HFYJ5AFXX:1:21310:6408:6560 1:N:0:GCCAAT
Length=151
Score = 38.2 bits (19), Expect = 0.072
Identities = 25/27 (93%), Gaps = 0/27 (0%)
Strand=Plus/Plus
Query 8 TTATACACATCTAGATGTGTATAAGAG 34
|||||||||| ||| ||||||||||||
Sbjct 26 TTATACACATATAGTTGTGTATAAGAG 52
> NB501598:62:HFYJ5AFXX:1:21310:1598:1343 1:N:0:GCCAAT
Length=155
Score = 36.2 bits (18), Expect = 0.28
Identities = 24/26 (92%), Gaps = 0/26 (0%)
Strand=Plus/Plus
Query 8 TTATACACATCTAGATGTGTATAAGA 33
|||||||||| || ||||||||||||
Sbjct 41 TTATACACATATATATGTGTATAAGA 66
> NB501598:62:HFYJ5AFXX:1:21301:22800:8378 1:N:0:GCCAAT
Length=155
Score = 36.2 bits (18), Expect = 0.28
Identities = 30/34 (88%), Gaps = 0/34 (0%)
Strand=Plus/Plus
Query 2 TGTCTCTTATACACATCTAGATGTGTATAAGAGA 35
|||||||||||| |||||||| | |||||||||
Sbjct 17 TGTCTCTTATACCGATCTAGATTTCTATAAGAGA 50
> NB501598:62:HFYJ5AFXX:1:21205:14665:11772 1:N:0:GCCAAT
Length=155
Score = 36.2 bits (18), Expect = 0.28
Identities = 30/34 (88%), Gaps = 0/34 (0%)
Strand=Plus/Plus
Query 4 TCTCTTATACACATCTAGATGTGTATAAGAGACA 37
|||| |||| |||||||||| ||||||||||||
Sbjct 31 TCTCATATAGACATCTAGATCAGTATAAGAGACA 64
> NB501598:62:HFYJ5AFXX:1:21203:19031:19916 1:N:0:GCCAAT
Length=155
Score = 36.2 bits (18), Expect = 0.28
Identities = 30/34 (88%), Gaps = 0/34 (0%)
Strand=Plus/Plus
Query 3 GTCTCTTATACACATCTAGATGTGTATAAGAGAC 36
|||||||||| | |||||||||| ||||| ||||
Sbjct 1 GTCTCTTATAAAAATCTAGATGTTTATAATAGAC 34
If you consider your first example, sure there is ~26bp string match, but the flanking bases are very divergent:
CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
TGCAAAAGTTCGGTAATGTATTATGTGTAATGTATTTACTTTTACTTTTACAAGCTTGACAAAGTGAAACAACTCATAGTGAAACATCCTAATACCTTACTCTTATACACAACTATATGTGTATAATAAACAGGTATAATTGTGTGTGATT
If you drop to -s 0.6
this will be caught, but I don't know how safe that is in general.
I won't be fixing cases like this sorry.
No, they are just true linkers.
> NB501598:62:HFYJ5AFXX:1:21310:6408:6560 1:N:0:GCCAAT
Length=151
Score = 44.6 bits (48), Expect = 4e-04
Identities = 30/34 (88%), Gaps = 0/34 (0%)
Strand=Plus/Plus
Query 1 CTGTCTCTTATACACATCTAGATGTGTATAAGAG 34
|||| | |||||||||| ||| ||||||||||||
Sbjct 19 CTGTTTATTATACACATATAGTTGTGTATAAGAG 52
> NB501598:62:HFYJ5AFXX:1:21301:22800:8378 1:N:0:GCCAAT
Length=155
Score = 44.6 bits (48), Expect = 4e-04
Identities = 32/37 (86%), Gaps = 0/37 (0%)
Strand=Plus/Plus
Query 2 TGTCTCTTATACACATCTAGATGTGTATAAGAGACAG 38
|||||||||||| |||||||| | ||||||||| ||
Sbjct 17 TGTCTCTTATACCGATCTAGATTTCTATAAGAGATAG 53
> NB501598:62:HFYJ5AFXX:1:21205:14665:11772 1:N:0:GCCAAT
Length=155
Score = 44.6 bits (48), Expect = 4e-04
Identities = 32/37 (86%), Gaps = 0/37 (0%)
Strand=Plus/Plus
Query 1 CTGTCTCTTATACACATCTAGATGTGTATAAGAGACA 37
|| |||| |||| |||||||||| ||||||||||||
Sbjct 28 CTCTCTCATATAGACATCTAGATCAGTATAAGAGACA 64
> NB501598:62:HFYJ5AFXX:1:21203:19031:19916 1:N:0:GCCAAT
Length=155
Score = 44.6 bits (48), Expect = 4e-04
Identities = 30/34 (88%), Gaps = 0/34 (0%)
Strand=Plus/Plus
Query 3 GTCTCTTATACACATCTAGATGTGTATAAGAGAC 36
|||||||||| | |||||||||| ||||| ||||
Sbjct 1 GTCTCTTATAAAAATCTAGATGTTTATAATAGAC 34
Here is the better commandline with increased reward for a match.
# find the missed matches using blastn
for f in *.mp.fastq.gz *.pe.fastq.gz *.unknown.fastq.gz *.se.fastq.gz; do p=`basename $f .fastq.gz`; echo "$f"; gzip -dc $f > "$p".fastq; fq2fa "$p".fastq "$p".fasta; formatdb -p F -i "$p".fasta; blastn -num_threads 18 -task blastn-short -db "$p".fasta -query Nextera_MatePair_hairpin.fasta -word_size 4 -dust no -reward 2 -max_hsps 20 -num_descriptions 99999 -num_alignments 99999 -perc_identity 75 -strand plus > "$p".missed_linkers.blastn.txt; done
for f in *.mp.fastq.gz *.pe.fastq.gz *.unknown.fastq.gz *.se.fastq.gz; do p=`basename $f .fastq.gz`; echo "$f"; gzip -dc $f > "$p".fastq; fq2fa "$p".fastq "$p".fasta; formatdb -p F -i "$p".fasta; blastn -num_threads 18 -task blastn-short -db "$p".fasta -query Nextera_MatePair_hairpin.fasta -word_size 4 -dust no -reward 2 -max_target_seqs 99999 -max_hsps_per_subject 20 -perc_identity 75 -strand plus -outfmt 6 | awk '{print $2}' > "$p".missed_linkers.txt; done
The scores you calculate are unusually low. Would it be possible possible to stick to the scoring which e.g. blastn uses? I know there is some web page describing why blastn scores are different from EMBOSS water, but they are roughly same. Your is a third approach. -s 0.6
sounds weird but these cases woudl really fit into 0.7
easily. See https://www.ncbi.nlm.nih.gov/books/NBK279678/ But I do not see how to figure out what blastn
, actually -task blastn-short
enforces. Ahh, luckily it is printed at the bottom of plaintext out. So here is what I propose (and what the above blastn really used):
Matrix: blastn matrix 2 -3
Gap Penalties: Existence: 5, Extension: 2
I am not debating that they are true junction adapters, but they are low quality reads and we have to accept a lot of mismatches to get them. You are welcome to play with the scoring.
Gap-open(go
)/gap-extend(ge
):
https://github.com/sequencing/NxTrim/blob/sw_full_match/matepair.cpp#L88
Match(sa
)/mismatch(sb
):
https://github.com/sequencing/NxTrim/blob/sw_full_match/matepair.cpp#L320
You mean this?
$ git diff
diff --git a/matepair.cpp b/matepair.cpp
index f0ab2cf..22615ed 100755
--- a/matepair.cpp
+++ b/matepair.cpp
@@ -91,7 +91,7 @@ int sw_match(uint8_t *target,int tlen,uint8_t *query,int qlen,int minoverlap,flo
}
int sa=1;
int sb=2;
- int go = 1, ge = 3;
+ int go = 5, ge = 2; // gap open, gap extension
// ta_opt_set_mat(sa, sb, mat);
kswr_t r = ksw_align(qlen, query, tlen, target, 5, mat, go, ge, KSW_XBYTE|KSW_XSTART|(8*sa), 0);
int substring_length = r.qe - r.qb < r.te - r.tb? r.qe - r.qb : r.te - r.tb;
@@ -347,7 +347,7 @@ int matePair::resolve_overhang(fqread & r1, fqread & r2,int a,int b) {
matePair::matePair()
{
- ta_opt_set_mat(1,2,sw_mat);
+ ta_opt_set_mat(2,-3,sw_mat); //match and mismatch
adapter1_sw=(uint8_t *)malloc(adapter1.length()+1);
for(size_t i=0;i<adapter1.length();i++)
{
But how about the actual calculation, will that work? I found these but they are not the page I was looking for.
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
http://embossgui.sourceforge.net/demo/manual/water.html
https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
Anyway, what do you say?
I made it clearer by putting all the penalty terms at the top of the code:
https://github.com/sequencing/NxTrim/blob/iss37/matepair.cpp#L13-L16
I don't know much about BLAST. The current SW similarity measure is calculated as the alignment score divided by the maximum possible score for the alignment. The maximum possible score is SW_MATCH multiplied by the alignment length.
I used -s .7 -w --separate --preserve-mp
but everything matched after this change.
$ git diff
diff --git a/matepair.cpp b/matepair.cpp
index a170c69..9518d38 100755
--- a/matepair.cpp
+++ b/matepair.cpp
@@ -10,10 +10,10 @@ string adapterj = adapter1+adapter2;
string r1_external_adapter = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC";
string r2_external_adapter = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
-#define SW_GAP_EXTENSION 3
-#define SW_GAP_OPEN 1
-#define SW_MATCH 1
-#define SW_MISMATCH 2
+#define SW_GAP_EXTENSION 2
+#define SW_GAP_OPEN 5
+#define SW_MATCH 2
+#define SW_MISMATCH -3
#define DEBUG 0
Trimming summary:
20662 / 20662 ( 100.00% ) reads passed chastity/purity filters.
19351 / 20662 ( 93.66% ) reads had TWO copies of adapter (filtered).
1311 / 1311 ( 100.00% ) read pairs were ignored because template length appeared less than read length
0 remaining reads were trimmed
0 / 0 ( -nan% ) read pairs had MP orientation
0 / 0 ( -nan% ) read pairs had PE orientation
0 / 0 ( -nan% ) read pairs had unknown orientation
0 / 0 ( -nan% ) were single end reads
It's a bit confusing - you need #define SW_MISMATCH 3
not -3
Good? No, to loose. Currently I am testing on the full dataset:
--- a/matepair.cpp
+++ b/matepair.cpp
@@ -10,9 +10,9 @@ string adapterj = adapter1+adapter2;
string r1_external_adapter = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC";
string r2_external_adapter = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
-#define SW_GAP_EXTENSION 3
-#define SW_GAP_OPEN 1
-#define SW_MATCH 1
+#define SW_GAP_EXTENSION 1
+#define SW_GAP_OPEN 2
+#define SW_MATCH 2
#define SW_MISMATCH 2
#define DEBUG 0
The read NB501598:62:HFYJ5AFXX:1:21310:6408:6560 is a good example I hope.
What I need is a much better log output into individual files (mp, pe, se, unknown).
$readname $linker_start $linker_end $similarity $matched_sequence
$ NxTrim/nxtrim -s 0.7 -w -1 minitest_R1.fastq -2 minitest_R2.fastq --stdout --preserve-mp --rf
Writing to stdout
Trimming:
R1: minitest_R1.fastq
R2: minitest_R2.fastq
--preserve-mp is on: will favour MPs over PEs
@NB501598:62:HFYJ5AFXX:1:21310:6408:6560 1:N:0:GCCAAT
TGCAAAAGTTCGGTAATGTATTATGTGTAATGTATTTACTTTTACTTTTACAAGCTTGACAAAGTGAAACAACTCATAGTGAAACATCCTAATAC
+
AAAA6EEE6EA///EEAEEEE/EA//A/EEEE<EE/EE//EAEAEE//EEAEEA6EE//</EEA//EEEEE6//////E<AEE/EAE////6E66
@NB501598:62:HFYJ5AFXX:1:21310:6408:6560 2:N:0:GCCAAT
GTGGGCCCCTCTTTAGTTAAAAAAACAATAACAAAATCAAACACAATTATAC
+
A//AAA/EAAAAAEE/EEEAAEEEE6//EEEEAEEEE/A///E<EEE/EEEE
The left part before the linkers is trimmed 4nt sooner than necessary so we loose CTTA
:
$ grep -A 3 NB501598:62:HFYJ5AFXX:1:21310:6408:6560 HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq | grep CTCTTATACACAACTATATGTGTATAATAAACAG
TGCAAAAGTTCGGTAATGTATTATGTGTAATGTATTTACTTTTACTTTTACAAGCTTGACAAAGTGAAACAACTCATAGTGAAACATCCTAATACCTTACTCTTATACACAACTATATGTGTATAATAAACAGGTATAATTGTGTGTGATTTTGT
The second mate is surprisingly shortened and I do not understand why. Probably nxtrim thought it is followed by the very same linker?
$ grep -A 3 NB501598:62:HFYJ5AFXX:1:21310:6408:6560 HFYJ5AFXX.1_5kb_R2.trimmomatic.PE.fastq | grep GTGGGCCCCTCTTTAGTTAAAAAAACAATAACAAAATCAAACACAATTATAC
GTGGGCCCCTCTTTAGTTAAAAAAACAATAACAAAATCAAACACAATTATACCTGTATCTTATAAACATCTAGATATGTATAATAGACAGTAATTAGGATGTTTCACTCTACCTTGTTTAACTTTTGCAAGATTGTAAAATTAAAATAAAATAAA
The second mate is surprisingly shortened and I do not understand why. Probably nxtrim thought it is followed by the very same linker?
No, I have a different theory. It looks the sequence GTATAATTGTGTGTGATTTTGT
downstream of the linker in R1 read was reverse-complemented and aligned to R2 read to find where the mates overlap. But the overlap=alignment should well cover the linker and even its sequence on its left.
You can turn on debugging output by setting #DEFINE DEBUG 2
. The output is very verbose so only run it on a very small dataset (ie one read-pair). You then see things like this:
read L1 = 155
read L2 = 155
adapter locations (first pass): 155 193 42 80
adapter locations (second pass): 155 193 42 80
adapter locations (third pass): 155 193 42 80
it is telling you where the adapters were detected. So here (42,80) means that was the detected adapter in read 2. When the adapter start is ==
to read length, no adapter is found.
@mmokrejs FYI I just pushed a minor bugfix that was causing minoverlap
to not be respected in the SW code. I don't think this substantially changes results though (except that some minoverlap
adapter matches were missed).
Thanks, will try but probably after I figure out how to get the SW-matching code to work. I with below changes I ended up with some 242k linkers unremoved, about 60k more then with original. :(
--- a/matepair.cpp
+++ b/matepair.cpp
@@ -12,8 +12,8 @@ string r2_external_adapter = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
#define SW_GAP_EXTENSION 3
#define SW_GAP_OPEN 1
-#define SW_MATCH 1
-#define SW_MISMATCH 2
+#define SW_MATCH 2
+#define SW_MISMATCH 3
#define DEBUG 0
Some difficult cases (I lazily assume they are in the FASTQ file I already provided):
> NB501598:62:HFYJ5AFXX:1:21312:10138:16626 1:N:0:GCCAAT
Length=105
Score = 42.8 bits (46), Expect = 0.001
Identities = 32/38 (84%), Gaps = 0/38 (0%)
Strand=Plus/Plus
Query 1 CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG 38
|||||||||||| | || ||||||| ||||| |||||
Sbjct 22 CTGTCTCTTATAAAAATATAGATGTTTATAATTGACAG 59
> NB501598:62:HFYJ5AFXX:1:21312:5193:12408 1:N:0:GCCAAT
Length=155
Score = 42.8 bits (46), Expect = 0.001
Identities = 32/38 (84%), Gaps = 0/38 (0%)
Strand=Plus/Plus
Query 1 CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG 38
|||||||||||| | || ||||||| ||||| || |||
Sbjct 23 CTGTCTCTTATAAAAATATAGATGTTTATAATAGTCAG 60
> NB501598:62:HFYJ5AFXX:1:21311:21303:18350 1:N:0:GCCAAT
Length=155
Score = 42.8 bits (46), Expect = 0.001
Identities = 32/38 (84%), Gaps = 0/38 (0%)
Strand=Plus/Plus
Query 1 CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG 38
|| ||||| ||| |||||||||| | ||||||| ||||
Sbjct 32 CTCTCTCTAATAAACATCTAGATCTCTATAAGATACAG 69
> NB501598:62:HFYJ5AFXX:1:21310:13226:18075 1:N:0:GCCAAT
Length=155
Score = 42.8 bits (46), Expect = 0.001
Identities = 32/38 (84%), Gaps = 0/38 (0%)
Strand=Plus/Plus
Query 1 CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG 38
||| || |||||||| ||||||| |||||||| | |||
Sbjct 48 CTGGCTGTTATACACTTCTAGATTTGTATAAGTGTCAG 85
You could also use https://ncbi.github.io/cxx-toolkit/pages/ch_blast#ch_blast.CLocalBlast provided you index both forward and reverse FASTA files separately (after FASTQ to FASTA conversion). But I still think it is faster to run external blastn in parallel (user can provide it before hand, and notably edit it easily -- e.g. remove false positives or add some missed cases). They you parse it's CSV output for forward and reverse-read streams separately, and import into nxtrim. You cannot beat its performance doing SW-alignment one by one read pair IMHO (CPU will do a couple of NOP cycles here and there due to latencies), not even speaking of paralellization being missing. Issue #46 is the way to go IMHO. I will just work and we can focus on the remaining code. I still foresee some problem in the case of NB501598:62:HFYJ5AFXX:1:21310:6408:6560
.
I just pushed a minor bugfix that was causing minoverlap to not be respected in the SW code. I don't think this substantially changes results though (except that some minoverlap adapter matches were missed).
Aha, so now the branch is gone and the region changed in the patch below was dropped as well. Was that intentional?
--- a/matepair.cpp
+++ b/matepair.cpp
@@ -12,8 +12,8 @@ string r2_external_adapter = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
#define SW_GAP_EXTENSION 3
#define SW_GAP_OPEN 1
-#define SW_MATCH 1
-#define SW_MISMATCH 2
+#define SW_MATCH 2
+#define SW_MISMATCH 3
Yes that looks right.
The branch is now merged to master with the bugfix and scoring tweaks.
Closing this - it turns out the SW is not ideal for this problem. Semi-global alignment is more appropriate, the cutadapt algorithm looks appropriate (chapter 2):
https://eldorado.tu-dortmund.de/bitstream/2003/31824/1/Dissertation.pdf
This is worth revisiting but I won't get time in the short term.
(smith-waterman remains on the sw
branch but this is unlikely to be merged)