sequencing/NxTrim

switch to Smith-Waterman for adapter matching

Closed this issue · 46 comments

The current adapter matching routine is very simple. It looks for a match (within a Hamming distance) on either side of the junction adapter. If a match to either 19bp side of the adapter is found, the entire 38bp junction adapter is clipped appropriately (basically a crude form of seed and extend). This can miss cases were both sides of the junction adapter have an indel error.

Smith-Waterman alignment will improve this.

@mmokrejs the Smith-Waterman alignment now appears to be working well if you are still interested. It can be turned on with the -w flag.

Usage example:

## SW for adapter matching
../nxtrim -s .7 -w -1 EcMG1_ATGTCA_L001_R1_001.fastq.gz -2 EcMG1_ATGTCA_L001_R2_001.fastq.gz --stdout-mp | bwa mem EcMG.fna -p - | gzip -1 > mp.sw.bam
 ../nxtrim -s .7 -w -1 EcMG1_ATGTCA_L001_R1_001.fastq.gz -2 EcMG1_ATGTCA_L001_R2_001.fastq.gz --stdout-un | bwa mem EcMG.fna -p - | gzip -1 > un.sw.bam

I used nxtrim -s .7 -w --separate --preserve-mp --rf for the testing of version v0.4.1-4965b00.

I still see plenty linkers unremoved with minimum similarity 86.11%. Something is wrong with the scoring I guess.

$ for f in *.nxtrim_*.fastq.gz; do echo $f; p=`basename $f .fastq.gz`; blastall -p blastn -d $p.fasta -i Nextera_MatePair_hairpin.fasta -F F -v 99999 -b 99999 -m 8 > $p.missed_linkers.blastn.tsv; done
$
$ wc -l *missed_linkers.blastn.tsv
    1495 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
     692 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
    3701 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
    2048 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
     860 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
    5568 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
    1534 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
     741 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
    3486 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
    1973 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
     829 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
    5494 HFYJ5AFXX.1_8kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
    1544 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
     621 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
    3830 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
    1810 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
     811 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
    5557 HFYJ5AFXX.2_5kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
    1408 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
     701 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
    3487 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
    1878 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
     741 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
    5376 HFYJ5AFXX.2_8kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
    1438 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
     762 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
    3374 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
    2176 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
     718 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
    5476 HFYJ5AFXX.3_5kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
    1502 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
     772 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
    3347 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
    2052 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
     853 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
    5445 HFYJ5AFXX.3_8kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
    1926 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
     874 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
    4897 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
    2179 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
     833 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
    5749 HFYJ5AFXX.4_5kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
    2038 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.tsv
     811 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R1.pe.missed_linkers.blastn.tsv
    4642 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R1.unknown.missed_linkers.blastn.tsv
    1989 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R2.mp.missed_linkers.blastn.tsv
     806 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R2.pe.missed_linkers.blastn.tsv
    7821 HFYJ5AFXX.4_8kb.trimmomatic.nxtrim_R2.unknown.missed_linkers.blastn.tsv
  118665 total

Some total counts when the files came out of trimmomatic:

$ grep '^Input Read Pairs' *.trimmomatic.ilmn.log
HFYJ5AFXX.1_5kb.trimmomatic.ilmn.log:Input Read Pairs: 17502155 Both Surviving: 16495106 (94.25%) Forward Only Surviving: 969194 (5.54%) Reverse Only Surviving: 20662 (0.12%) Dropped: 17193 (0.10%)
HFYJ5AFXX.1_8kb.trimmomatic.ilmn.log:Input Read Pairs: 17481778 Both Surviving: 16300973 (93.25%) Forward Only Surviving: 1056900 (6.05%) Reverse Only Surviving: 21681 (0.12%) Dropped: 102224 (0.58%)
HFYJ5AFXX.2_5kb.trimmomatic.ilmn.log:Input Read Pairs: 17177379 Both Surviving: 16253919 (94.62%) Forward Only Surviving: 903030 (5.26%) Reverse Only Surviving: 11399 (0.07%) Dropped: 9031 (0.05%)
HFYJ5AFXX.2_8kb.trimmomatic.ilmn.log:Input Read Pairs: 17087286 Both Surviving: 16044969 (93.90%) Forward Only Surviving: 980999 (5.74%) Reverse Only Surviving: 12909 (0.08%) Dropped: 48409 (0.28%)
HFYJ5AFXX.3_5kb.trimmomatic.ilmn.log:Input Read Pairs: 16845780 Both Surviving: 15867509 (94.19%) Forward Only Surviving: 943755 (5.60%) Reverse Only Surviving: 18413 (0.11%) Dropped: 16103 (0.10%)
HFYJ5AFXX.3_8kb.trimmomatic.ilmn.log:Input Read Pairs: 16841620 Both Surviving: 15716944 (93.32%) Forward Only Surviving: 1010720 (6.00%) Reverse Only Surviving: 19198 (0.11%) Dropped: 94758 (0.56%)
HFYJ5AFXX.4_5kb.trimmomatic.ilmn.log:Input Read Pairs: 15543789 Both Surviving: 14582359 (93.81%) Forward Only Surviving: 938112 (6.04%) Reverse Only Surviving: 12569 (0.08%) Dropped: 10749 (0.07%)
HFYJ5AFXX.4_8kb.trimmomatic.ilmn.log:Input Read Pairs: 15524361 Both Surviving: 14459869 (93.14%) Forward Only Surviving: 988095 (6.36%) Reverse Only Surviving: 14084 (0.09%) Dropped: 62313 (0.40%)

$ grep '^Input Read Pairs' *.trimmomatic.ilmn.log | awk '{print $4, $12, $17}' 
17502155 969194 20662
17481778 1056900 21681
17177379 903030 11399
17087286 980999 12909
16845780 943755 18413
16841620 1010720 19198
15543789 938112 12569
15524361 988095 14084
$

Total input was 141925868 reads pairs (or just single-end orphans). Anyway, not a single-one should slip out.

HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt
HFYJ5AFXX.1_5kb_R2.sw_missed_linker.fastq.txt

In terms of practical performance, I am pretty happy at this stage and probably won't be working on this for the time being. The assemblies that come out of this are very good for short read sequence data.

I can align the NxTrim output back to the reference and check the proportion of RF against FR:

joconnell@ubuntu:~/workspace/NxTrim/test$ /usr/bin/python alignment_summary.py mp.sw.bam
Orientation  Frequency   Median insert size
FR              392764                 3384
RF                1706              1242515
FF                2150              1009331
RR                2072               942504

RF/(FR+RF) = 0.004325
joconnell@ubuntu:~/workspace/NxTrim/test$ /usr/bin/python alignment_summary.py un.sw.bam
Orientation  Frequency   Median insert size
FR              110925                 3193
RF                 710               712807
FF                 667              1080812
RR                 569              1085896

RF/(FR+RF) = 0.006360

The RF here are nominally incorrectly orientated pairs. For confirmed mate-pairs, the RF proportion is less than both RR and FF suggesting very clean data. For unknown, it is slightly elevated, but this is a known limitation of pairs with undetected adapters.

The blast evaluation only shows sensitivity. You need to balance this against how many of those blast hits are not actual adapters. These adapters share substrings with actual genomic DNA once you get down to <=10bp.

You also need to check how many of those hits are from read pairs with additional adapter copies, these are hard to handle.

The blast evaluation only shows sensitivity. You need to balance this against how many of those blast hits are not actual adapters. These adapters share substrings with actual genomic DNA once you get down to <=10bp.

No, the matches are much longer, and second, I do not mind discarding valid DNA if it is similar tot he linker. That DNA will come into the assembly from a whole genome shotgun library, not from MatePaired library. That is the key point. Trying to "protect" possibly valid matches is just wrong apporach. MatePair library data are just not proper source of sequence if some locus has similar sequence.

>NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
          Length = 26

 Score = 52.0 bits (26), Expect = 5e-06
 Identities = 26/26 (100%)
 Strand = Plus / Plus

                                    
Query: 12 acacatctagatgtgtataagagaca 37
          ||||||||||||||||||||||||||
Sbjct: 1  acacatctagatgtgtataagagaca 26

and the worst from HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.mp.missed_linkers.blastn.txt file:

>NB501598:62:HFYJ5AFXX:1:11101:4458:1807 1:N:0:GCCAAT
          Length = 155

 Score = 32.2 bits (16), Expect = 4.4
 Identities = 16/16 (100%)
 Strand = Plus / Plus

                           
Query: 8   ttatacacatctagat 23
           ||||||||||||||||
Sbjct: 139 ttatacacatctagat 154

Anyway, in both cases the match is clearly at the beginning/end of the raw read, so you are missing incomplete matches. BTW, I stick to the default -v 12 value. all hits reported by blast are just longer, much longer.

You also need to check how many of those hits are from read pairs with additional adapter copies, these are hard to handle.

Didn't you say elsewhere nxtrim discards read mate if after trimming a match remains? I wouldn't do anything different to it. Run blast, find matching readnames, and discard them, before feeding assembler.

Sorry, I think at this point you need to try another tool. Probably the trimming can be improved (albeit marginally) but the current implementation already produces very good assemblies.

No, the matches are much longer, and second, I do not mind discarding valid DNA if it is similar tot he >linker. That DNA will come into the assembly from a whole genome shotgun library, not from >MatePaired library. That is the key point. Trying to "protect" possibly valid matches is just wrong >apporach. MatePair library data are just not proper source of sequence if some locus has similar >sequence.

NxTrim is targeted at assembly using just a single Nextera MP library, not scaffolding. Maybe check out NextClip for scaffolding.

Best of luck!

nxtrim is clearly missing partial while exact matches on edges, the alignments show the Smith-Waterman somehow misses even mismatches, and you say from nucmbers of contigs/scaffodls that the assembly is good? And you did not answer why the above examples were missed. Did you try the tescases I prepared for you?

Thank you for pointing me to yet another partially complete tool. It would have been much more helpful if nxtrim supported a simple CSV input file format telling it where too look for linkers. It can do its own alignments but before starting splitting the reads it could "inject" the coordinates into its decision. That way one could provide a path list with coordinates one wants to use for splitting. I already proposed this before.

Sorry, I could not actually reproduce your first test case:

grep -A3 "NB501598:62:HFYJ5AFXX:1:11205:23960:15051" ~/Downloads/HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt > r1.fq
grep -A3 "NB501598:62:HFYJ5AFXX:1:11205:23960:15051" ~/Downloads/HFYJ5AFXX.1_5kb_R2.sw_missed_linker.fastq.txt > r2.fq

$ ./nxtrim -w -s 0.7 -1 r1.fq -2 r2.fq -O test

$ zcat test.pe.fastq.gz 
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
AGTCGTTGTTTTATGTCACCTTCTTGTTTAGCCTGCGTTTCCTCGTGTTTTGCTTGCTTTGTTTTATTGTTTTGTACTTTGTTAAT
+
E<E/EEEEEE/AAEEEEAEA<E<EEEEEEEEEEEE<EEEEEAEAEEEE/AEE<EE<AE/AAEEEA//EEEEEAEE/6/AAAEEEAE
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 2:N:0:GCCAAT
AGCATTAAAAGCATTTTCAATTTAGTGTTAAATATTGGATATATAGCCATTTGCTTTTAGTTTACATAAAAAAATCATTTAAATGAAGAATAAAAAAATCAGGTGTTTATAATTCAAAATTTTTATACATTGGGCAAATTTTTCAACACACAGAC
+
AAAAAEEEEEEEEEEEEE666/EAE6E6AEEEAEEE/E/AAEEE/E<//6E/EEEEEEE6EEA/E/EE//AAAEEE<EEEAEEEEEEEE6/EE//</EE//EEEEE<E<<E//EEEE///EEEA<<A//EEEAA<AE<E<EEA<//E///</E<<
joconnell@ubuntu:~/workspace/NxTrim$ 
joconnell@ubuntu:~/workspace/NxTrim$ zcat test.pe.fastq.gz  | grep ACACATCTAGATGTGTATAAGAGACA | wc -l
0

Maybe stick to --separate --preserve-mp --rf?

$ grep NB501598:62:HFYJ5AFXX:1:11205:23960:15051 HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_*.fastq
HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R1.mp.fastq:@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
HFYJ5AFXX.1_5kb.trimmomatic.nxtrim_R2.mp.fastq:@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 2:N:0:GCCAAT
$

You have them assigned under PE instead of MP.

I still get a PE with those commands. As you say, the adapter is at the start of the read, so it has to be a PE (unless it misses the adapter).

$ grep -A3 "NB501598:62:HFYJ5AFXX:1:11205:23960:15051" ~/Downloads/HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt > r1.fq
$ grep -A3 "NB501598:62:HFYJ5AFXX:1:11205:23960:15051" ~/Downloads/HFYJ5AFXX.1_5kb_R2.sw_missed_linker.fastq.txt > r2.fq

joconnell@ubuntu:~/workspace/NxTrim$ ./nxtrim  -1 r1.fq -2 r2.fq -O test --separate --preserve-mp --rf 
Output: test.*.fastq.gz
Trimming:
R1:	r1.fq
R2:	r2.fq
--preserve-mp is on: will favour MPs over PEs

Trimming summary:
1 / 1	( 100.00% )	reads passed chastity/purity filters.
0 / 1	( 0.00% )	reads had TWO copies of adapter (filtered).
0 / 1	( 0.00% )	read pairs were ignored because template length appeared less than read length
1 remaining reads were trimmed

0 / 1	( 0.00% )	read pairs had MP orientation
1 / 1	( 100.00% )	read pairs had PE orientation
0 / 1	( 0.00% )	read pairs had unknown orientation
0 / 1	( 0.00% )	were single end reads

0 / 1	( 0.00% )	extra single end reads were generated from overhangs
joconnell@ubuntu:~/workspace/NxTrim$ 
joconnell@ubuntu:~/workspace/NxTrim$ 
joconnell@ubuntu:~/workspace/NxTrim$ zcat test_R1.pe.fastq.gz
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
ATTAACAAAGTACAAAACAATAAAACAAAGCAAGCAAAACACGAGGAAACGCAGGCTAAACAAGAAGGTGACATAAAACAACGACTAGCAAACACACAGTGAACAAGCAGTGTATAGGAGACAAGAAC
+
EAEEEAAA/6/EEAEEEEE//AEEEAA/EA<EE<EEA/EEEEAEAEEEEE<EEEEEEEEEEEE<E<AEAEEEEAA/EEEEEE/E<EEAEEEEEEAEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEE
joconnell@ubuntu:~/workspace/NxTrim$ zcat test_R2.pe.fastq.gz
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 2:N:0:GCCAAT
GTCTGTGTGTTGAAAAATTTGCCCAATGTATAAAAATTTTGAATTATAAACACCTGATTTTTTTATTCTTCATTTAAATGATTTTTTTATGTAAACTAAAAGCAAATGGCTATATATCCAATATTTAACACTAAATTGAAAATGCTTTTAATGCT
+
<<E/<///E//<AEE<E<EA<AAEEE//A<<AEEE///EEEE//E<<E<EEEEE//EE/<//EE/6EEEEEEEEAEEE<EEEAAA//EE/E/AEE6EEEEEEE/E6//<E/EEEAA/E/EEEAEEEA6E6EAE/666EEEEEEEEEEEEEAAAAA
joconnell@ubuntu:~/workspace/NxTrim$ zcat test_R1.pe.fastq.gz | grep ACACATCTAGATGTGTATAAGAGACA | wc -l
0
joconnell@ubuntu:~/workspace/NxTrim$ zcat test_R2.pe.fastq.gz | grep ACACATCTAGATGTGTATAAGAGACA | wc -l
0

Here is what I fed into nxtrim.

$ grep -A 3 NB501598:62:HFYJ5AFXX:1:11205:23960:15051 HFYJ5AFXX.1_5kb_R?.trimmomatic.?E.fastq
HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq:@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq-ACACATCTAGATGTGTATAAGAGACATGTTCTTGTCTCCTATACACTGCTTGTTCACTGTGTGTTTGCTAGTCGTTGTTTTATGTCACCTTCTTGTTTAGCCTGCGTTTCCTCGTGTTTTGCTTGCTTTGTTTTATTGTTTTGTACTTTGTTAAT
HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq-+
HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq-AAAA6EEAEA/AEEEEAEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEEEEEEAEEEEEEAEE<E/EEEEEE/AAEEEEAEA<E<EEEEEEEEEEEE<EEEEEAEAEEEE/AEE<EE<AE/AAEEEA//EEEEEAEE/6/AAAEEEAE
--
HFYJ5AFXX.1_5kb_R2.trimmomatic.PE.fastq:@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 2:N:0:GCCAAT
HFYJ5AFXX.1_5kb_R2.trimmomatic.PE.fastq-AGCATTAAAAGCATTTTCAATTTAGTGTTAAATATTGGATATATAGCCATTTGCTTTTAGTTTACATAAAAAAATCATTTAAATGAAGAATAAAAAAATCAGGTGTTTATAATTCAAAATTTTTATACATTGGGCAAATTTTTCAACACACAGAC
HFYJ5AFXX.1_5kb_R2.trimmomatic.PE.fastq-+
HFYJ5AFXX.1_5kb_R2.trimmomatic.PE.fastq-AAAAAEEEEEEEEEEEEE666/EAE6E6AEEEAEEE/E/AAEEE/E<//6E/EEEEEEE6EEA/E/EE//AAAEEE<EEEAEEEEEEEE6/EE//</EE//EEEEE<E<<E//EEEE///EEEA<<A//EEEAA<AE<E<EEA<//E///</E<<
$

But you have the sequences reverse-complemented.

I gave them out in correct orientation, IMHO:

$ grep -A 3 NB501598:62:HFYJ5AFXX:1:11205:23960:15051 HFYJ5AFXX.1_5kb_R?.trimmomatic.?E.fastq |  grep -i acacatctagatgtgtataagagaca
HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq-ACACATCTAGATGTGTATAAGAGACATGTTCTTGTCTCCTATACACTGCTTGTTCACTGTGTGTTTGCTAGTCGTTGTTTTATGTCACCTTCTTGTTTAGCCTGCGTTTCCTCGTGTTTTGCTTGCTTTGTTTTATTGTTTTGTACTTTGTTAAT
$
$ grep -A 3 NB501598:62:HFYJ5AFXX:1:11205:23960:15051 *.sw_missed_linker.fastq.txt |  grep ACACATCTAGATGTGTATAAGAGACA
HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt-ACACATCTAGATGTGTATAAGAGACATGTTCTTGTCTCCTATACACTGCTTGTTCACTGTGTGTTTGCTAGTCGTTGTTTTATGTCACCTTCTTGTTTAGCCTGCGTTTCCTCGTGTTTTGCTTGCTTTGTTTTATTGTTTTGTACTTTGTTAAT
$

I don't know what is happening, but I am finding the adapter in that example and the reads look the same:

joconnell@ubuntu:~/workspace/NxTrim$ cat r1.fq
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 1:N:0:GCCAAT
ACACATCTAGATGTGTATAAGAGACATGTTCTTGTCTCCTATACACTGCTTGTTCACTGTGTGTTTGCTAGTCGTTGTTTTATGTCACCTTCTTGTTTAGCCTGCGTTTCCTCGTGTTTTGCTTGCTTTGTTTTATTGTTTTGTACTTTGTTAAT
+
AAAA6EEAEA/AEEEEAEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEEEEEEAEEEEEEAEE<E/EEEEEE/AAEEEEAEA<E<EEEEEEEEEEEE<EEEEEAEAEEEE/AEE<EE<AE/AAEEEA//EEEEEAEE/6/AAAEEEAE

joconnell@ubuntu:~/workspace/NxTrim$ cat r2.fq 
@NB501598:62:HFYJ5AFXX:1:11205:23960:15051 2:N:0:GCCAAT
AGCATTAAAAGCATTTTCAATTTAGTGTTAAATATTGGATATATAGCCATTTGCTTTTAGTTTACATAAAAAAATCATTTAAATGAAGAATAAAAAAATCAGGTGTTTATAATTCAAAATTTTTATACATTGGGCAAATTTTTCAACACACAGAC
+
AAAAAEEEEEEEEEEEEE666/EAE6E6AEEEAEEE/E/AAEEE/E<//6E/EEEEEEE6EEA/E/EE//AAAEEE<EEEAEEEEEEEE6/EE//</EE//EEEEE<E<<E//EEEE///EEEA<<A//EEEAA<AE<E<EEA<//E///</E<<

I suspect your cases where you can blast out some adapters are of the "multiple adapter" sort. Such as this one (from my E.coli):

raw:

@M903:194:000000000-A5FBD:1:1112:6063:10169 1:N:0:ATGTCA
GCTGGAACACAGCCACTGGCCGCGATTTCTGAGGCTTCTTTACCGCTAATCTCTTGCCGCATCCATTCCCGCGCAGCTTCTGGCGACAGTACCAGTCTGTCTCTTATACACATCTAGATGTGTATAAGAGACAGAGATGTGTATAAGATAC
+
AAA@AFFF1CAF1FGF11FGCEGEAEFGEGHGCFGCFGFHH2GE?EAECHHHHHHHHGGGEEGGHFHHFHGGCEEGEFFDGHFFEGGGGHFEFHHHGFHHHHHHFHHHEHFDCHHGFHHGHDFFGHHHHHHFHHFHHHGHEFHFHGHHGHB
joconnell@ubuntu:~/workspace/NxTrim/test$ cat r2.fq 
@M903:194:000000000-A5FBD:1:1112:6063:10169 2:N:0:ATGTCA
TATCGTAGATTCAGTATTAGATGTGACTTTATTTAATTGAACCGGCTGAGTTTTTAACGAAAGATCATTTAATGGAAACATTTACACTCCCTGTATCTTATACACATCTCTGTCTCTTATACACATCTAGATGTGTATAAGAGACAGACTG
+
AA1AAF>A>DFFDE33BE333AFDFF3AGFFHBEFHCFGH11G0EEEGCFDGBGECGHGGGAEGGFHHFFFHHHBB1GBGHHBGHHGHFHHHHGFFFGHHHFHFHFHEHHHEGHFHFGGHGBGHFHHFHHHHG2BBGHHHHHHG00B<GFG

Where are the adapters?

R1:
                                                                                                 perfect match                          partial overlap
                                                                                                 CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG AGATGTGTATAAGAGACAG
GCTGGAACACAGCCACTGGCCGCGATTTCTGAGGCTTCTTTACCGCTAATCTCTTGCCGCATCCATTCCCGCGCAGCTTCTGGCGACAGTACCAGT|CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG|AGATGTGTATAAGATAC

                                                                                           imperfect match     perfect match
R2:                                                                                        CTGTCTCTTATACACATCT CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
TATCGTAGATTCAGTATTAGATGTGACTTTATTTAATTGAACCGGCTGAGTTTTTAACGAAAGATCATTTAATGGAAACATTTACACTCC|CTGTATCTTATACACATCT|CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG|ACTG

You can see there is a whole junction adapter and at least part of another one which is unexpected.

The code only catches these weird (unexpected) second adapter copies when there is a perfect string match, so in cases like above, they are missed. Yes, I could blast or sw-align, but they are so rare I just used a simple string find to catch a few of them. This could obviously be improved.

Yes, I could blast or sw-align, but they are so rare I just used a simple string find to catch a few of them. This could obviously be improved.

You probably think of me that I am nit picking but isn't this Issue about switch to Smith-Waterman matching? I thought everything in the code uses SW now (provided I used -w option). This is unexpected to hear something still relies on exact string match. But I am happy to help your finiding these places, one by one. ;)

I did not read the code but I fear you are getting just a single match from SW, you really need to ask for all matches (you need the guarantee that you get them all). If not you are out of luck. I spent several years on developing a tool to clean Roche/454 data and in principle I was hitting same issues.

Practically, in this case, you are either thorough and look for the full-length junction adapter, then look for those halves, merge the regions eventually into a single range (allowing say 3nt in between if the alignment was narrower for some reason). Then you continue with usual splitting as you do in your code, which notably evaluates whether any sequence was left leftwards and rightwards from the splitting region. Alternatively you can throw away this particular mate, optionally whole pair (some user might prefer that for a good reason).

I wouldn't be so picky but if the linkers remain in the reads then you have mismerged contigs. It depends on the coverage and aligner, bt basically if you start assembly with these 'chimeric' reads with linker unremoved, you won't get anything good out. Assemblers can overcome some unremoved sequencing adapters provided they appear at the end of reads. This is not the case with linkers. For me 118k unremoved linkers is too much. It does not matter out of how many input reads. Assemblers will respect this as true sample sequence. They are more likely to normalize the "clean" reads and enrich these reads with unremoved linkers.

Nitpicking is fine, but it looks like we have been talking about two different things. The tool does find adapters on the edges, but it doesn't find all the weird cases of extra unexpected adapters. I use SW (or Hamming match) to find the single best match to the adapter. This is trimmed appropriately. Then, as a way to catch the edge case of an extra unexpected adapter (as in my above example), the code looks for a perfect match (regardless of whether SW/hamming is used - this has not changed).

This small bit of code to catch this is here:

https://github.com/sequencing/NxTrim/blob/master/matepair.cpp#L545-L550

So yes, this will only catch perfect match adapters after the primary adapter has been removed. I guess your remaining adapters are mostly these multi-adapter cases with some amount of mismatch in the second copy of the adapter. I would advocate just throwing the entire pair out, they just seem untrustworthy.

I'm happy to take a pull request if you want to try some solutions. I don't have a lot of time to spend on this at the moment.

FYI. I don't know what you are assembling, but I currently get the best results for bacteria using SPAdes and --justmp. The pe/se libraries don't seem to help SPAdes.

I would advocate just throwing the entire pair out, they just seem untrustworthy.

In principle I am fine even with this approach, as a quick-and-dirty way to get data clean. Which is what we need, no matter some tiny data is lost.

Would it be relatively quick to implement (as those matches were just adjacent to the main match), I would prefer the range-based approach as some more real solution. This is not a corner-case IMHO. That works well. If you have two isolated ranges from each other then definitely the read/pair should be discarded. But in those adjacent cases you can save some sequence and in turn, good pairs.

This merely means you are not looping to get all matches out of the aligner. That is bad.

If you could replace the string exact matching approach it would be really helpful. I would say it is the priority number one now. Why cannot you just call the aligner on the upstream and downstream sequence in about the lines L545-L550? And if that finds a match, you return(1) as for the exact match? ;-)

Unfortunately I do not know C/C++ so I cannot help you here. I am certainly willing to test and help with other ideas. And I do not see almost any comments in the code, I mean useful comments. ;-)

I would say, this thread is also a proper place to document where elseare the exact string matches hiding.

BTW, I have a polyploid genome, fish. Previous, possibly velvet-based assembly is crap, full of gaps which are not supported in my eyes by reads (I have lots of readsspanning the gap proving there is not gap at all in the sequence). I assume something was wrong with scaffolding, but actually that must have been an issue during contigging because the contigs just should have been wider (the RapidRun reads just bear the whole chunk of sequence). Something was terribly wrong during data preparationIMHO, or wrong cmdline arguments for assembler, who knows. I did not receive any documentation what the previous person did.

@mmokrejs try this branch

git pull
git checkout iss37
make clean
make

The code now just runs the adapter detection routine twice. After the first run, the adapter is N masked, if another adapter copy is found, the read is discarded. This probably doubles the run-time at the moment.

nxtrim v0.4.1-7fb1356 failed with an error:

nxtrim: matepair.cpp:105: int sw_match(uint8_t*, int, uint8_t*, int, int, float, int8_t*): Assertion `r.te!=-1' failed.

$ gdb NxTrim/nxtrim core.31390
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from NxTrim/nxtrim...done.
[New Thread 31390]

warning: Ignoring non-absolute filename: <linux-vdso.so.1>
Missing separate debuginfo for linux-vdso.so.1
Try: yum --enablerepo='*-debug*' install /usr/lib/debug/.build-id/79/86395b139f22da92f9154f30c5951e68b4b7f6
Missing separate debuginfo for /apps/gentoo/lib64/ld-linux-x86-64.so.2
Try: yum --enablerepo='*-debug*' install /usr/lib/debug/.build-id/79/86395b139f22da92f9154f30c5951e68b4b7f6
Reading symbols from /apps/gentoo/usr/lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /apps/gentoo/usr/lib64/libz.so.1
Reading symbols from /apps/gentoo/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libstdc++.so.6...done.
Loaded symbols for /apps/gentoo/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libstdc++.so.6
Reading symbols from /apps/gentoo/lib64/libm.so.6...done.
Loaded symbols for /apps/gentoo/lib64/libm.so.6
Reading symbols from /apps/gentoo/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libgcc_s.so.1...done.
Loaded symbols for /apps/gentoo/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libgcc_s.so.1
Reading symbols from /apps/gentoo/lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /apps/gentoo/lib64/libc.so.6
Reading symbols from /apps/gentoo/lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /apps/gentoo/lib64/ld-linux-x86-64.so.2
Core was generated by `NxTrim/nxtrim'.
Program terminated with signal 6, Aborted.
#0  0x00007fffce2a6044 in raise () from /apps/gentoo/lib64/libc.so.6
(gdb) where
#0  0x00007fffce2a6044 in raise () from /apps/gentoo/lib64/libc.so.6
#1  0x00007fffce2a74aa in abort () from /apps/gentoo/lib64/libc.so.6
#2  0x00007fffce29edcd in __assert_fail_base () from /apps/gentoo/lib64/libc.so.6
#3  0x00007fffce29ee82 in __assert_fail () from /apps/gentoo/lib64/libc.so.6
#4  0x0000000000407abe in sw_match(unsigned char*, int, unsigned char*, int, int, float, signed char*) () at matepair.cpp:105
#5  0x0000000000407df6 in matePair::findAdapter(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int, float, bool) () at matepair.cpp:187
#6  0x0000000000409b55 in matePair::build(readPair&, int, float, int, bool, bool, bool, bool) () at matepair.cpp:578
#7  0x000000000040312f in main () at nxtrim.cpp:176
(gdb) bt full
#0  0x00007fffce2a6044 in raise () from /apps/gentoo/lib64/libc.so.6
No symbol table info available.
#1  0x00007fffce2a74aa in abort () from /apps/gentoo/lib64/libc.so.6
No symbol table info available.
#2  0x00007fffce29edcd in __assert_fail_base () from /apps/gentoo/lib64/libc.so.6
No symbol table info available.
#3  0x00007fffce29ee82 in __assert_fail () from /apps/gentoo/lib64/libc.so.6
No symbol table info available.
#4  0x0000000000407abe in sw_match(unsigned char*, int, unsigned char*, int, int, float, signed char*) () at matepair.cpp:105
        std::__ioinit = {static _S_refcount = 5, static _S_synced_with_stdio = true}
        adapter1 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614028 "CTGTCTCTTATACACATCT"}}
        adapter2 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614068 "AGATGTGTATAAGAGACAG"}}
        r2_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614148 "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"}}
        seq_nt4_table = '\004' <repeats 65 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 12 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 139 times>
        r1_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140f8 "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"}}
        adapterj = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140a8 "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"}}
#5  0x0000000000407df6 in matePair::findAdapter(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int, float, bool) () at matepair.cpp:187
        std::__ioinit = {static _S_refcount = 5, static _S_synced_with_stdio = true}
        adapter1 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614028 "CTGTCTCTTATACACATCT"}}
        adapter2 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614068 "AGATGTGTATAAGAGACAG"}}
        r2_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614148 "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"}}
        seq_nt4_table = '\004' <repeats 65 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 12 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 139 times>
        r1_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140f8 "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"}}
        adapterj = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140a8 "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"}}
#6  0x0000000000409b55 in matePair::build(readPair&, int, float, int, bool, bool, bool, bool) () at matepair.cpp:578
        std::__ioinit = {static _S_refcount = 5, static _S_synced_with_stdio = true}
        adapter1 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614028 "CTGTCTCTTATACACATCT"}}
        adapter2 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614068 "AGATGTGTATAAGAGACAG"}}
        r2_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x614148 "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"}}
        seq_nt4_table = '\004' <repeats 65 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 12 times>, "\000\004\001\004\004\004\002", '\004' <repeats 12 times>, "\003", '\004' <repeats 139 times>
        r1_external_adapter = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140f8 "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"}}
        adapterj = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x6140a8 "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"}}
#7  0x000000000040312f in main () at nxtrim.cpp:176
        std::__ioinit = {static _S_refcount = 5, static _S_synced_with_stdio = true}
(gdb)

OK, here is a better stacktrace from CXXFLAGS="-ggdb -O0" binary:

(gdb) bt full
#0  0x00007fffce385625 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fffce386e05 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007fffce37e74e in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3  0x00007fffce37e810 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x0000000000407ba4 in sw_match (target=0x80b740 '\004' <repeats 22 times>, tlen=22, query=0x6235f0 "\001\003\002\003\001\003\001\003\003", qlen=38, minoverlap=12, min_similarity=0.699999988, mat=0x7fffffffda40 "\001\376\376\376") at matepair.cpp:105
        sa = 1
        go = 1
        ge = 3
        substring_length = -1
        sim = -0
        __PRETTY_FUNCTION__ = "int sw_match(uint8_t*, int, uint8_t*, int, int, float, int8_t*)"
        sb = 2
        r = {score = 0, te = -1, qe = 0, score2 = -1, te2 = -1, tb = 0, qb = 0}
#5  0x0000000000407ed8 in matePair::findAdapter (this=0x7fffffffd8e0, s='N' <repeats 22 times>, minoverlap=12, similarity=0.699999988, use_hamming=false) at matepair.cpp:187
        L2 = 19
        perfect = 4294967295
        s_tmp = 0x80b740 '\004' <repeats 22 times>
        a = 4209342
        L1 = 22
#6  0x0000000000409e85 in matePair::build (this=0x7fffffffd8e0, readpair=..., minovl=12, sim=0.699999988, ml=21, jr=false, uh=false, pmp=true, jmp=false) at matepair.cpp:586
        tmp = {l = 22, filtered = false, description = true, h = "NB501598:62:HFYJ5AFXX:1:11107:2087:3600 2:N:0:CTTGTA", s = 'N' <repeats 22 times>, l3 = "+", q = '#' <repeats 22 times>}
        L2 = 22
        a1 = 38
        b2 = 30
        rc2 = {l = 155, filtered = false, description = false, h = "", s = "", l3 = "", q = "CAGCATACTTCCCAGCATGCCAACTATCCCCCTCCCTCCTGTCTCTTATACACAACTAGATGTGTATAAGAGACAGGTCTGTGGGGTGCCCCCAAACTTTACATGAGCTTTTGCTGCAGGACAGATTTTTAG"}
        minoverlap2 = -9296
        L1 = 132
        a2 = -8
        b1 = 76
        rc1 = {l = 90, filtered = 255, description = 127, h = "", s = "", l3 = "", q = ""}
#7  0x00000000004032e4 in main (argc=12, argv=0x7fffffffde38) at nxtrim.cpp:176
        weird = false
        m = {joinreads = false, use_hamming = false, preserve_mp = true, _justmp = false, se = {l = 0, filtered = false, description = true, h = "", s = "", l3 = "", q = ""}, mp = {r1 = {l = 0, filtered = false, description = true, h = "", s = "", l3 = "", q = ""}, r2 = {l = 0, filtered = false, description = true, 
              h = "", s = "", l3 = "", q = ""}, l = 0, filtered = false}, pe = {r1 = {l = 0, filtered = false, description = true, h = "", s = "", l3 = "", q = ""}, r2 = {l = 0, filtered = false, description = true, h = "", s = "", l3 = "", q = ""}, l = 0, filtered = false}, unknown = {r1 = {l = 0, filtered = false, 
              description = true, h = "", s = "", l3 = "", q = ""}, r2 = {l = 0, filtered = false, description = true, h = "", s = "", l3 = "", q = ""}, l = 1, filtered = false}, minoverlap = 12, minlen = 21, similarity = 0.699999988, adapter1_sw = 0x6235b0 "\001\003\002\003\001\003\001\003\003", 
          adapter2_sw = 0x6235d0 "", adapterj_sw = 0x6235f0 "\001\003\002\003\001\003\001\003\003", sw_mat = "\001\376\376\376\000\376\001\376\376\000\376\376\001\376\000\376\376\376\001\000\000\000\000\000"}
        npass = 2740
        trim_warn = false
        se_only = 470
        pos = {first = 0, second = 0}
        nweird = 5
        nread = 2740
        out = {weird = -9488, n_mp = 1507, n_unk = 76, n_se = 878, n_pe = 341, _justmp = false, _write_mp = true, _write_se = true, _write_pe = true, _write_un = true, print_to_stdout = 240, mp_out = {separate = true, outfile = {fp = 0x0, _stdout = false}, outfile1 = {fp = 0x623700, _stdout = false}, outfile2 = {
              fp = 0x6691e0, _stdout = false}}, pe_out = {separate = true, outfile = {fp = 0x0, _stdout = 208}, outfile1 = {fp = 0x73a210, _stdout = false}, outfile2 = {fp = 0x77fcf0, _stdout = false}}, unknown_out = {separate = true, outfile = {fp = 0x0, _stdout = false}, outfile1 = {fp = 0x6aec80, 
              _stdout = false}, outfile2 = {fp = 0x6f4720, _stdout = false}}, se_out = {fp = 0x7c57d0, _stdout = false}}
        minlen = 21
        write_stdout_un = false
        nodata = 341
        p = {r1 = {l = 132, filtered = false, description = true, h = "NB501598:62:HFYJ5AFXX:1:11101:23244:5416 1:N:0:CTTGTA", s = "CAGCATACTTCCCAGCATGCCAACTATCCCCCTCCCTCCTGTCTCTTATACACAACTAGATGTGTATAAGAGACAGGTCTGTGGGGTGCCCCCAAACTTTACATGAGCTTTTGCTGCAGGACAGATTTTTAG", l3 = "+", q = 
    "AAAA6E6EAEE/EEE/EEEAEEEAEEAE</E/AEEEEE<AAAEEEEEEEAAEEE///EEA/E/EA/E/A///E//E</A<E/////A<</E/E/A//EEA//E<///<<EEE////E<//E/</EEEE/<<<"}, r2 = {l = 22, filtered = false, description = true, h = "NB501598:62:HFYJ5AFXX:1:11107:2087:3600 2:N:0:CTTGTA", s = "TATACACATCTAGATGTGTATA", l3 = "+", q = 
    "6AA6AEEAE/AEE6EEEEE6E/"}, l = 112, filtered = false}
        similarity = 0.699999988
        prefix = "HFYJ5AFXX.1_8kb.trimmomatic.PE.nxtrim"
        ignorePF = false
        write_stdout_mp = false
        loptions = {{name = 0x410620 "r1", has_arg = 1, flag = 0x0, val = 49}, {name = 0x410623 "r2", has_arg = 1, flag = 0x0, val = 50}, {name = 0x410626 "output-prefix", has_arg = 1, flag = 0x0, val = 79}, {name = 0x410634 "stdout", has_arg = 0, flag = 0x0, val = 0}, {name = 0x41063b "stdout-mp", has_arg = 0, 
            flag = 0x0, val = 9}, {name = 0x410645 "stdout-un", has_arg = 0, flag = 0x0, val = 10}, {name = 0x41064f "justmp", has_arg = 0, flag = 0x0, val = 1}, {name = 0x410656 "joinreads", has_arg = 0, flag = 0x0, val = 2}, {name = 0x410660 "rf", has_arg = 0, flag = 0x0, val = 3}, {name = 0x410663 "preserve-mp", 
            has_arg = 0, flag = 0x0, val = 4}, {name = 0x41066f "ignorePF", has_arg = 0, flag = 0x0, val = 5}, {name = 0x410678 "mp", has_arg = 0, flag = 0x0, val = 6}, {name = 0x41067b "unknown", has_arg = 0, flag = 0x0, val = 7}, {name = 0x410683 "separate", has_arg = 0, flag = 0x0, val = 8}, {
            name = 0x41068c "smith-waterman", has_arg = 0, flag = 0x0, val = 119}, {name = 0x41069b "similarity", has_arg = 1, flag = 0x0, val = 115}, {name = 0x4106a6 "minoverlap", has_arg = 1, flag = 0x0, val = 118}, {name = 0x4106b1 "minlength", has_arg = 1, flag = 0x0, val = 108}, {name = 0x0, has_arg = 0, 
            flag = 0x0, val = 0}}
        c = -1
        justmp = false
        write_stdout = false
        infile = {f1 = 0x614270, f2 = 0x61bc10}
        joinreads = false
        r1 = 0x7fffffffe1e9 "HFYJ5AFXX.1_8kb_R1.trimmomatic.SE.fastq"
        rc = true
        hamming = false
        preserve_mp = true
        minoverlap = 12
        r2 = 0x7fffffffe214 "HFYJ5AFXX.1_8kb_R2.trimmomatic.SE.fastq"
        separate = true
(gdb)

Try again with the last commit.

Works now ... I will keep the debug binary running on all data. ETA probably ~4-6hrs.

Okay. There may be no need for debug binary here. That above error was an assert failure due to a problematic SW alignment, it would have been caught in the release build.

I have already put the debug binary through valgrind without error.

OK, recompiled&restarted with CXXFLAGS="-ggdb -march=native -O2" make

Did you use for development testing the FASTQ files I uploaded above? Just to know what I shall anticipate. ;-)

joconnell@ubuntu:~/workspace/NxTrim$ ./nxtrim -1 /home/joconnell/Downloads/HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt -2 /home/joconnell/Downloads/HFYJ5AFXX.1_5kb_R2.sw_missed_linker.fastq.txt --separate --preserve-mp --rf -O test -w -s 0.7
Output: test.*.fastq.gz
Trimming:
R1:	/home/joconnell/Downloads/HFYJ5AFXX.1_5kb_R1.sw_missed_linker.fastq.txt
R2:	/home/joconnell/Downloads/HFYJ5AFXX.1_5kb_R2.sw_missed_linker.fastq.txt
--preserve-mp is on: will favour MPs over PEs

Trimming summary:
7185 / 7185	( 100.00% )	reads passed chastity/purity filters.
173 / 7185	( 2.41% )	reads had TWO copies of adapter (filtered).
8 / 7012	( 0.11% )	read pairs were ignored because template length appeared less than read length
7004 remaining reads were trimmed

4168 / 7004	( 59.51% )	read pairs had MP orientation
570 / 7004	( 8.14% )	read pairs had PE orientation
1880 / 7004	( 26.84% )	read pairs had unknown orientation
386 / 7004	( 5.51% )	were single end reads

1194 / 7004	( 17.05% )	extra single end reads were generated from overhangs
joconnell@ubuntu:~/workspace/NxTrim$ for i in test*gz;do echo $i;zcat $i | agrep -1 CTGTCTCTTATACACATCT | wc -l;done
test_R1.mp.fastq.gz
0
test_R1.pe.fastq.gz
0
test_R1.unknown.fastq.gz
0
test_R2.mp.fastq.gz
0
test_R2.pe.fastq.gz
0
test_R2.unknown.fastq.gz
0
test.se.fastq.gz
0
joconnell@ubuntu:~/workspace/NxTrim$ for i in test*gz;do echo $i;zcat $i | agrep -1 AGATGTGTATAAGAGACAG | wc -l;done
test_R1.mp.fastq.gz
0
test_R1.pe.fastq.gz
0
test_R1.unknown.fastq.gz
0
test_R2.mp.fastq.gz
0
test_R2.pe.fastq.gz
0
test_R2.unknown.fastq.gz
0
test.se.fastq.gz
0

Seems too few were found. Try this.

# find the missed matches using blastn
for f in *.mp.fastq.gz *.pe.fastq.gz *.unknown.fastq.gz *.se.fastq.gz; do p=`basename $f .fastq.gz`; echo "$f"; gzip -dc $f > "$p".fastq; fq2fa "$p".fastq "$p".fasta; formatdb -p F -i "$p".fasta; blastn -num_threads 18 -task blastn-short -db "$p".fasta -query Nextera_MatePair_hairpin.fasta -word_size 4 -dust no -reward 1 -max_target_seqs 99999 -max_hsps_per_subject 20 -perc_identity 75 -strand plus > "$p".missed_linkers.blastn.txt; done

for f in *.mp.fastq.gz *.pe.fastq.gz *.unknown.fastq.gz *.se.fastq.gz; do p=`basename $f .fastq.gz`; echo "$f"; gzip -dc $f > "$p".fastq; fq2fa "$p".fastq "$p".fasta; formatdb -p F -i "$p".fasta; blastn -num_threads 18 -task blastn-short -db "$p".fasta -query Nextera_MatePair_hairpin.fasta -word_size 4 -dust no -reward 1 -max_target_seqs 99999 -max_hsps_per_subject 20 -perc_identity 75 -strand plus -outfmt 6 | awk '{print $2}' > "$p".missed_linkers.txt; done

reads had TWO copies of adapter (filtered).

How about:

reads had TWO or more copies of adapter (discarding whole pair).

if that is true.

$ grep 'reads had TWO copies of' *nxtrim__preserve_mp_from_original.log | sort
HFYJ5AFXX.1_5kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:29721 / 16495106   ( 0.18% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.1_5kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:39 / 20662 ( 0.19% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.1_8kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:29331 / 16300973   ( 0.18% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.1_8kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:40 / 21681 ( 0.18% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.2_5kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:29124 / 16253919   ( 0.18% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.2_5kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:14 / 11399 ( 0.12% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.2_8kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:29476 / 16044969   ( 0.18% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.2_8kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:40 / 12909 ( 0.31% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.3_5kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:28822 / 15867509   ( 0.18% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.3_5kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:45 / 18413 ( 0.24% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.3_8kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:28387 / 15716944   ( 0.18% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.3_8kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:33 / 19198 ( 0.17% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.4_5kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:26571 / 14582359   ( 0.18% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.4_5kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:29 / 12569 ( 0.23% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.4_8kb.trimmomatic.PE.nxtrim__preserve_mp_from_original.log:26923 / 14459869   ( 0.19% )       reads had TWO copies of adapter (filtered).
HFYJ5AFXX.4_8kb.trimmomatic.SE.nxtrim__preserve_mp_from_original.log:29 / 14084 ( 0.21% )       reads had TWO copies of adapter (filtered).

The discarded total is maybe a bit higher than those 180k from previous tests but it is OK. Will now run the blastn searches ...

These still remain in MP data, for example. The scoring must be wrong somewhere.


> NB501598:62:HFYJ5AFXX:1:21310:6408:6560 1:N:0:GCCAAT
Length=151

 Score = 38.2 bits (19),  Expect = 0.072
 Identities = 25/27 (93%), Gaps = 0/27 (0%)
 Strand=Plus/Plus

Query  8   TTATACACATCTAGATGTGTATAAGAG  34
           |||||||||| ||| ||||||||||||
Sbjct  26  TTATACACATATAGTTGTGTATAAGAG  52


> NB501598:62:HFYJ5AFXX:1:21310:1598:1343 1:N:0:GCCAAT
Length=155

 Score = 36.2 bits (18),  Expect = 0.28
 Identities = 24/26 (92%), Gaps = 0/26 (0%)
 Strand=Plus/Plus

Query  8   TTATACACATCTAGATGTGTATAAGA  33
           |||||||||| || ||||||||||||
Sbjct  41  TTATACACATATATATGTGTATAAGA  66


> NB501598:62:HFYJ5AFXX:1:21301:22800:8378 1:N:0:GCCAAT
Length=155

 Score = 36.2 bits (18),  Expect = 0.28
 Identities = 30/34 (88%), Gaps = 0/34 (0%)
 Strand=Plus/Plus

Query  2   TGTCTCTTATACACATCTAGATGTGTATAAGAGA  35
           ||||||||||||  |||||||| | |||||||||
Sbjct  17  TGTCTCTTATACCGATCTAGATTTCTATAAGAGA  50


> NB501598:62:HFYJ5AFXX:1:21205:14665:11772 1:N:0:GCCAAT
Length=155

 Score = 36.2 bits (18),  Expect = 0.28
 Identities = 30/34 (88%), Gaps = 0/34 (0%)
 Strand=Plus/Plus

Query  4   TCTCTTATACACATCTAGATGTGTATAAGAGACA  37
           |||| |||| ||||||||||  ||||||||||||
Sbjct  31  TCTCATATAGACATCTAGATCAGTATAAGAGACA  64


> NB501598:62:HFYJ5AFXX:1:21203:19031:19916 1:N:0:GCCAAT
Length=155

 Score = 36.2 bits (18),  Expect = 0.28
 Identities = 30/34 (88%), Gaps = 0/34 (0%)
 Strand=Plus/Plus

Query  3   GTCTCTTATACACATCTAGATGTGTATAAGAGAC  36
           |||||||||| | |||||||||| ||||| ||||
Sbjct  1   GTCTCTTATAAAAATCTAGATGTTTATAATAGAC  34


If you consider your first example, sure there is ~26bp string match, but the flanking bases are very divergent:

                                                                                               CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
TGCAAAAGTTCGGTAATGTATTATGTGTAATGTATTTACTTTTACTTTTACAAGCTTGACAAAGTGAAACAACTCATAGTGAAACATCCTAATACCTTACTCTTATACACAACTATATGTGTATAATAAACAGGTATAATTGTGTGTGATT

If you drop to -s 0.6 this will be caught, but I don't know how safe that is in general.

I won't be fixing cases like this sorry.

No, they are just true linkers.

> NB501598:62:HFYJ5AFXX:1:21310:6408:6560 1:N:0:GCCAAT
Length=151

 Score = 44.6 bits (48),  Expect = 4e-04
 Identities = 30/34 (88%), Gaps = 0/34 (0%)
 Strand=Plus/Plus

Query  1   CTGTCTCTTATACACATCTAGATGTGTATAAGAG  34
           |||| | |||||||||| ||| ||||||||||||
Sbjct  19  CTGTTTATTATACACATATAGTTGTGTATAAGAG  52


> NB501598:62:HFYJ5AFXX:1:21301:22800:8378 1:N:0:GCCAAT
Length=155

 Score = 44.6 bits (48),  Expect = 4e-04
 Identities = 32/37 (86%), Gaps = 0/37 (0%)
 Strand=Plus/Plus

Query  2   TGTCTCTTATACACATCTAGATGTGTATAAGAGACAG  38
           ||||||||||||  |||||||| | ||||||||| ||
Sbjct  17  TGTCTCTTATACCGATCTAGATTTCTATAAGAGATAG  53


> NB501598:62:HFYJ5AFXX:1:21205:14665:11772 1:N:0:GCCAAT
Length=155

 Score = 44.6 bits (48),  Expect = 4e-04
 Identities = 32/37 (86%), Gaps = 0/37 (0%)
 Strand=Plus/Plus

Query  1   CTGTCTCTTATACACATCTAGATGTGTATAAGAGACA  37
           || |||| |||| ||||||||||  ||||||||||||
Sbjct  28  CTCTCTCATATAGACATCTAGATCAGTATAAGAGACA  64


> NB501598:62:HFYJ5AFXX:1:21203:19031:19916 1:N:0:GCCAAT
Length=155

 Score = 44.6 bits (48),  Expect = 4e-04
 Identities = 30/34 (88%), Gaps = 0/34 (0%)
 Strand=Plus/Plus

Query  3   GTCTCTTATACACATCTAGATGTGTATAAGAGAC  36
           |||||||||| | |||||||||| ||||| ||||
Sbjct  1   GTCTCTTATAAAAATCTAGATGTTTATAATAGAC  34

Here is the better commandline with increased reward for a match.

# find the missed matches using blastn
for f in *.mp.fastq.gz *.pe.fastq.gz *.unknown.fastq.gz *.se.fastq.gz; do p=`basename $f .fastq.gz`; echo "$f"; gzip -dc $f > "$p".fastq; fq2fa "$p".fastq "$p".fasta; formatdb -p F -i "$p".fasta; blastn -num_threads 18 -task blastn-short -db "$p".fasta -query Nextera_MatePair_hairpin.fasta -word_size 4 -dust no -reward 2 -max_hsps 20 -num_descriptions 99999 -num_alignments 99999 -perc_identity 75 -strand plus > "$p".missed_linkers.blastn.txt; done

for f in *.mp.fastq.gz *.pe.fastq.gz *.unknown.fastq.gz *.se.fastq.gz; do p=`basename $f .fastq.gz`; echo "$f"; gzip -dc $f > "$p".fastq; fq2fa "$p".fastq "$p".fasta; formatdb -p F -i "$p".fasta; blastn -num_threads 18 -task blastn-short -db "$p".fasta -query Nextera_MatePair_hairpin.fasta -word_size 4 -dust no -reward 2 -max_target_seqs 99999 -max_hsps_per_subject 20 -perc_identity 75 -strand plus -outfmt 6 | awk '{print $2}' > "$p".missed_linkers.txt; done

The scores you calculate are unusually low. Would it be possible possible to stick to the scoring which e.g. blastn uses? I know there is some web page describing why blastn scores are different from EMBOSS water, but they are roughly same. Your is a third approach. -s 0.6 sounds weird but these cases woudl really fit into 0.7 easily. See https://www.ncbi.nlm.nih.gov/books/NBK279678/ But I do not see how to figure out what blastn, actually -task blastn-short enforces. Ahh, luckily it is printed at the bottom of plaintext out. So here is what I propose (and what the above blastn really used):

Matrix: blastn matrix 2 -3
Gap Penalties: Existence: 5, Extension: 2

I am not debating that they are true junction adapters, but they are low quality reads and we have to accept a lot of mismatches to get them. You are welcome to play with the scoring.

Gap-open(go)/gap-extend(ge):
https://github.com/sequencing/NxTrim/blob/sw_full_match/matepair.cpp#L88

Match(sa)/mismatch(sb):
https://github.com/sequencing/NxTrim/blob/sw_full_match/matepair.cpp#L320

You mean this?

$ git diff
diff --git a/matepair.cpp b/matepair.cpp
index f0ab2cf..22615ed 100755
--- a/matepair.cpp
+++ b/matepair.cpp
@@ -91,7 +91,7 @@ int sw_match(uint8_t *target,int tlen,uint8_t *query,int qlen,int minoverlap,flo
     }
     int sa=1;
     int sb=2;
-    int go = 1, ge = 3;    
+    int go = 5, ge = 2; // gap open, gap extension
 //    ta_opt_set_mat(sa, sb, mat);
     kswr_t r = ksw_align(qlen, query, tlen, target, 5, mat, go, ge, KSW_XBYTE|KSW_XSTART|(8*sa), 0);
     int substring_length = r.qe - r.qb < r.te - r.tb? r.qe - r.qb : r.te - r.tb;
@@ -347,7 +347,7 @@ int matePair::resolve_overhang(fqread & r1, fqread & r2,int a,int b) {
 
 matePair::matePair()
 {
-    ta_opt_set_mat(1,2,sw_mat);
+    ta_opt_set_mat(2,-3,sw_mat); //match and mismatch
     adapter1_sw=(uint8_t *)malloc(adapter1.length()+1);
     for(size_t i=0;i<adapter1.length();i++)
     {

But how about the actual calculation, will that work? I found these but they are not the page I was looking for.
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
http://embossgui.sourceforge.net/demo/manual/water.html
https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

Anyway, what do you say?

I made it clearer by putting all the penalty terms at the top of the code:

https://github.com/sequencing/NxTrim/blob/iss37/matepair.cpp#L13-L16

I don't know much about BLAST. The current SW similarity measure is calculated as the alignment score divided by the maximum possible score for the alignment. The maximum possible score is SW_MATCH multiplied by the alignment length.

I used -s .7 -w --separate --preserve-mp but everything matched after this change.

$ git diff
diff --git a/matepair.cpp b/matepair.cpp
index a170c69..9518d38 100755
--- a/matepair.cpp
+++ b/matepair.cpp
@@ -10,10 +10,10 @@ string adapterj = adapter1+adapter2;
 string r1_external_adapter = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC";
 string r2_external_adapter = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
 
-#define SW_GAP_EXTENSION 3
-#define SW_GAP_OPEN 1
-#define SW_MATCH 1 
-#define SW_MISMATCH 2
+#define SW_GAP_EXTENSION 2
+#define SW_GAP_OPEN 5
+#define SW_MATCH 2
+#define SW_MISMATCH -3
 
 #define DEBUG 0
Trimming summary:
20662 / 20662   ( 100.00% )     reads passed chastity/purity filters.
19351 / 20662   ( 93.66% )      reads had TWO copies of adapter (filtered).
1311 / 1311     ( 100.00% )     read pairs were ignored because template length appeared less than read length
0 remaining reads were trimmed

0 / 0   ( -nan% )       read pairs had MP orientation
0 / 0   ( -nan% )       read pairs had PE orientation
0 / 0   ( -nan% )       read pairs had unknown orientation
0 / 0   ( -nan% )       were single end reads

It's a bit confusing - you need #define SW_MISMATCH 3 not -3

Good? No, to loose. Currently I am testing on the full dataset:

--- a/matepair.cpp
+++ b/matepair.cpp
@@ -10,9 +10,9 @@ string adapterj = adapter1+adapter2;
 string r1_external_adapter = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC";
 string r2_external_adapter = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
 
-#define SW_GAP_EXTENSION 3
-#define SW_GAP_OPEN 1
-#define SW_MATCH 1 
+#define SW_GAP_EXTENSION 1
+#define SW_GAP_OPEN 2
+#define SW_MATCH 2
 #define SW_MISMATCH 2
 
 #define DEBUG 0

The read NB501598:62:HFYJ5AFXX:1:21310:6408:6560 is a good example I hope.

What I need is a much better log output into individual files (mp, pe, se, unknown).
$readname $linker_start $linker_end $similarity $matched_sequence

$ NxTrim/nxtrim -s 0.7 -w -1 minitest_R1.fastq -2 minitest_R2.fastq --stdout --preserve-mp --rf  
Writing to stdout
Trimming:
R1:     minitest_R1.fastq
R2:     minitest_R2.fastq
--preserve-mp is on: will favour MPs over PEs
@NB501598:62:HFYJ5AFXX:1:21310:6408:6560 1:N:0:GCCAAT
TGCAAAAGTTCGGTAATGTATTATGTGTAATGTATTTACTTTTACTTTTACAAGCTTGACAAAGTGAAACAACTCATAGTGAAACATCCTAATAC
+
AAAA6EEE6EA///EEAEEEE/EA//A/EEEE<EE/EE//EAEAEE//EEAEEA6EE//</EEA//EEEEE6//////E<AEE/EAE////6E66
@NB501598:62:HFYJ5AFXX:1:21310:6408:6560 2:N:0:GCCAAT
GTGGGCCCCTCTTTAGTTAAAAAAACAATAACAAAATCAAACACAATTATAC
+
A//AAA/EAAAAAEE/EEEAAEEEE6//EEEEAEEEE/A///E<EEE/EEEE

The left part before the linkers is trimmed 4nt sooner than necessary so we loose CTTA:

$ grep -A 3 NB501598:62:HFYJ5AFXX:1:21310:6408:6560 HFYJ5AFXX.1_5kb_R1.trimmomatic.PE.fastq |  grep CTCTTATACACAACTATATGTGTATAATAAACAG
TGCAAAAGTTCGGTAATGTATTATGTGTAATGTATTTACTTTTACTTTTACAAGCTTGACAAAGTGAAACAACTCATAGTGAAACATCCTAATACCTTACTCTTATACACAACTATATGTGTATAATAAACAGGTATAATTGTGTGTGATTTTGT

The second mate is surprisingly shortened and I do not understand why. Probably nxtrim thought it is followed by the very same linker?

$ grep -A 3 NB501598:62:HFYJ5AFXX:1:21310:6408:6560 HFYJ5AFXX.1_5kb_R2.trimmomatic.PE.fastq  |  grep GTGGGCCCCTCTTTAGTTAAAAAAACAATAACAAAATCAAACACAATTATAC
GTGGGCCCCTCTTTAGTTAAAAAAACAATAACAAAATCAAACACAATTATACCTGTATCTTATAAACATCTAGATATGTATAATAGACAGTAATTAGGATGTTTCACTCTACCTTGTTTAACTTTTGCAAGATTGTAAAATTAAAATAAAATAAA

The second mate is surprisingly shortened and I do not understand why. Probably nxtrim thought it is followed by the very same linker?

No, I have a different theory. It looks the sequence GTATAATTGTGTGTGATTTTGT downstream of the linker in R1 read was reverse-complemented and aligned to R2 read to find where the mates overlap. But the overlap=alignment should well cover the linker and even its sequence on its left.

You can turn on debugging output by setting #DEFINE DEBUG 2. The output is very verbose so only run it on a very small dataset (ie one read-pair). You then see things like this:

read L1 = 155
read L2 = 155
adapter locations (first pass): 155 193 42 80
adapter locations (second pass): 155 193 42 80
adapter locations (third pass): 155 193 42 80

it is telling you where the adapters were detected. So here (42,80) means that was the detected adapter in read 2. When the adapter start is == to read length, no adapter is found.

@mmokrejs FYI I just pushed a minor bugfix that was causing minoverlap to not be respected in the SW code. I don't think this substantially changes results though (except that some minoverlap adapter matches were missed).

Thanks, will try but probably after I figure out how to get the SW-matching code to work. I with below changes I ended up with some 242k linkers unremoved, about 60k more then with original. :(

--- a/matepair.cpp
+++ b/matepair.cpp
@@ -12,8 +12,8 @@ string r2_external_adapter = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
 
 #define SW_GAP_EXTENSION 3
 #define SW_GAP_OPEN 1
-#define SW_MATCH 1 
-#define SW_MISMATCH 2
+#define SW_MATCH 2
+#define SW_MISMATCH 3
 
 #define DEBUG 0

Some difficult cases (I lazily assume they are in the FASTQ file I already provided):

> NB501598:62:HFYJ5AFXX:1:21312:10138:16626 1:N:0:GCCAAT
Length=105

 Score = 42.8 bits (46),  Expect = 0.001
 Identities = 32/38 (84%), Gaps = 0/38 (0%)
 Strand=Plus/Plus

Query  1   CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG  38
           |||||||||||| | || ||||||| |||||  |||||
Sbjct  22  CTGTCTCTTATAAAAATATAGATGTTTATAATTGACAG  59


> NB501598:62:HFYJ5AFXX:1:21312:5193:12408 1:N:0:GCCAAT
Length=155

 Score = 42.8 bits (46),  Expect = 0.001
 Identities = 32/38 (84%), Gaps = 0/38 (0%)
 Strand=Plus/Plus

Query  1   CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG  38
           |||||||||||| | || ||||||| ||||| || |||
Sbjct  23  CTGTCTCTTATAAAAATATAGATGTTTATAATAGTCAG  60


> NB501598:62:HFYJ5AFXX:1:21311:21303:18350 1:N:0:GCCAAT
Length=155

 Score = 42.8 bits (46),  Expect = 0.001
 Identities = 32/38 (84%), Gaps = 0/38 (0%)
 Strand=Plus/Plus

Query  1   CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG  38
           || ||||| ||| |||||||||| | ||||||| ||||
Sbjct  32  CTCTCTCTAATAAACATCTAGATCTCTATAAGATACAG  69


> NB501598:62:HFYJ5AFXX:1:21310:13226:18075 1:N:0:GCCAAT
Length=155

 Score = 42.8 bits (46),  Expect = 0.001
 Identities = 32/38 (84%), Gaps = 0/38 (0%)
 Strand=Plus/Plus

Query  1   CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG  38
           ||| || |||||||| ||||||| |||||||| | |||
Sbjct  48  CTGGCTGTTATACACTTCTAGATTTGTATAAGTGTCAG  85

You could also use https://ncbi.github.io/cxx-toolkit/pages/ch_blast#ch_blast.CLocalBlast provided you index both forward and reverse FASTA files separately (after FASTQ to FASTA conversion). But I still think it is faster to run external blastn in parallel (user can provide it before hand, and notably edit it easily -- e.g. remove false positives or add some missed cases). They you parse it's CSV output for forward and reverse-read streams separately, and import into nxtrim. You cannot beat its performance doing SW-alignment one by one read pair IMHO (CPU will do a couple of NOP cycles here and there due to latencies), not even speaking of paralellization being missing. Issue #46 is the way to go IMHO. I will just work and we can focus on the remaining code. I still foresee some problem in the case of NB501598:62:HFYJ5AFXX:1:21310:6408:6560.

I just pushed a minor bugfix that was causing minoverlap to not be respected in the SW code. I don't think this substantially changes results though (except that some minoverlap adapter matches were missed).

Aha, so now the branch is gone and the region changed in the patch below was dropped as well. Was that intentional?

--- a/matepair.cpp
+++ b/matepair.cpp
@@ -12,8 +12,8 @@ string r2_external_adapter = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
 
 #define SW_GAP_EXTENSION 3
 #define SW_GAP_OPEN 1
-#define SW_MATCH 1 
-#define SW_MISMATCH 2
+#define SW_MATCH 2
+#define SW_MISMATCH 3

Yes that looks right.

The branch is now merged to master with the bugfix and scoring tweaks.

Closing this - it turns out the SW is not ideal for this problem. Semi-global alignment is more appropriate, the cutadapt algorithm looks appropriate (chapter 2):

https://eldorado.tu-dortmund.de/bitstream/2003/31824/1/Dissertation.pdf

This is worth revisiting but I won't get time in the short term.

(smith-waterman remains on the sw branch but this is unlikely to be merged)