sequencing/NxTrim

nxtrim does not filter reads with multiple Nextera linkers

mmokrejs opened this issue · 18 comments

Hi,
it seems nxtrim v0.4.1-7db257e does not discard reads with multiple linkers despite its log output. It seems those are two sample DNA inserts ligated together via the linker (either the forward or reverse sequence alone).

$ nxtrim -1 "$p"_R1.fastq -2 "$p"_R2.fastq -O "$p"_nxtrim --separate -v 8
...
$ tail HFYJ5AFXX.1_5kb_nxtrim.log
Trimming summary:
17502155 / 17502155     ( 100.00% )     reads passed chastity/purity filters.
87 / 17502155   ( 0.00% )       reads had TWO copies of adapter (filtered).
91342 / 17502068        ( 0.52% )       read pairs were ignored because template length appeared less than read length
17410726 remaining reads were trimmed

8240428 / 17410726      ( 47.33% )      read pairs had MP orientation
6885712 / 17410726      ( 39.55% )      read pairs had PE orientation
1330291 / 17410726      ( 7.64% )       read pairs had unknown orientation
954295 / 17410726       ( 5.48% )       were single end reads

6268836 / 17410726      ( 36.01% )      extra single end reads were generated from overhangs

I cannot say which were those 87 but were you eventully speaking of Illumina adapters instead of Nextera linkers (transposase sequences)?

Here are the total exact and full-length match counts before and after nxtrim. Only MP reads are shown after nxtrimming, the other output groups like unknown and pe seem not affected by this bug. BLASTN finds 1037+1608 matches in R1 or R2 reads, respectively, instead of 12+21 exact and full-length matches shown in a quick check in the below table.

image

I will send you the FASTQ files via email. It seems you should perform one more search for linkers and discard entries which still contain a match (discard both forward and reverse read as a pair, and introduce discarded FASTQ file group).

Thanks for the bug report. This was an unexpected edge case that the code would not catch. I have improved the detection of such cases.

Could you try branch 4f60628 and see if that solves the problem?

git pull
git checkout issue33
make

No, nxtrim v0.4.1-4f60628 is still missing full-legth Nextera hairpins (CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG) without any sequencing error:

@NB501598:62:HFYJ5AFXX:1:11101:6408:17346 1:N:0:GCCAAT
TGGCTCACAGCTGGCCGCGGGGGCTGCGCAAATATGCGTTCCATCCAGTGAGCCACATCGCACAGACACTGTGCAAGGTCAGGCAGGACGAGGAGTCTCAACCTGTCTCTTATACACATCTAGATGTGTATAAGAGACAGGTGCTGGACGGCTCG
+
6AAA/E/E/EEEEEEE/AE///E/////AAAE6/EEA/AE/A/<<E//E//<EAAA///A/E//E/</AA/AAEEEAEE<A/E<EA//EE//AE<E/////////////</////<E/EEEAA/A<E/AAEEE///<AEA/E<//AA</////EA
@NB501598:62:HFYJ5AFXX:1:21208:6197:9791 1:N:0:GCCAAT
TAGCCTCTGGCATGTTTATTGACATGTACCTTCTCTCACAGGCAGGTTCAAGGCTGTCTCTTATACACATCTAGATGTGTATAAGAGACAGACCTACCGTCTGCTGTAAACAGCCAAGAGGTGAGAACGCAAGATCTCAACAGAACACGTCCCAA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEE/EEEE<EEEAEEEEEE6/AEEA6<AEEEEE/E//<E//EEEEAEEEAE/EAEEE/<6AEEA/EEEEAEAAEAE<EEE<AE/A<<E<E<</E//E</EE<///E/A//<//</////A<
@NB501598:62:HFYJ5AFXX:1:21212:6738:1229 1:N:0:GCCAAT
AGTTTTCTGTATAGATCATAACCTTGTCAAAATGTCCAGCATACAACAGCTGTCTCTTATACACATCTAGATGTGTATAAGAGACAGGTTAGAGAGTGTGCGTGAGAAGGAGAGAGAGAAAGTGTGTGTGTGTGGTGATCTGAAGGGCCCAGGTC
+
AAAAAEEEEAAEEE6E/EEEEEAEEEEEEEEEEEEEAAEEEEEEE/EEEAEEEAA<EEEEEEEEAEEEEEE/EEEEEEEEEEEEEEEAEEEAEEEEEEEAEEEEEEEEEEEEEE<EEEE/EEEAEEAE<EEAEE//</6//AE/6//////////
@NB501598:62:HFYJ5AFXX:1:21301:17123:18191 1:N:0:GCCAAT
ATCATGCTAACTTCATGCTAATCATGCTAACATCATGCTAGCTTCATGCTAATCATGTTAGCTTCATGCTAGCTTCTGTCTCTTATACACATCTAGATGTGTATAAGAGACAGCACGGGGGTTGGACCGTTTTACTCGTGACTCGGAGGGTAATC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEAA</A/<<E/AA/EEEEEEAEAEEEEEEEEEAEEE<AAEEEEE<AA6/AEAEEAE/EEEEEAEE<EEEAAAAAEEAEEE/EA

More examples attached.

HFYJ5AFXX.1_5kb_R1.missed.fastq.txt
HFYJ5AFXX.1_5kb_R2.missed.fastq.txt

Sorry, I am having trouble replicating this. I ran the following on your examples and don't see missing adapters:

joconnell@ubuntu:~/workspace/NxTrim$ ./nxtrim -1 example/HFYJ5AFXX.1_5kb_R1.missed.fastq.txt -2 example/HFYJ5AFXX.1_5kb_R2.missed.fastq.txt -O test
Output: test.*.fastq.gz
Trimming:
R1:	example/HFYJ5AFXX.1_5kb_R1.missed.fastq.txt
R2:	example/HFYJ5AFXX.1_5kb_R2.missed.fastq.txt

Trimming summary:
21 / 21	( 100.00% )	reads passed chastity/purity filters.
0 / 21	( 0.00% )	reads had TWO copies of adapter (filtered).
0 / 21	( 0.00% )	read pairs were ignored because template length appeared less than read length
21 remaining reads were trimmed

9 / 21	( 42.86% )	read pairs had MP orientation
11 / 21	( 52.38% )	read pairs had PE orientation
1 / 21	( 4.76% )	read pairs had unknown orientation
0 / 21	( 0.00% )	were single end reads

4 / 21	( 19.05% )	extra single end reads were generated from overhangs
joconnell@ubuntu:~/workspace/NxTrim$ zcat test.pe.fastq.gz |grep CTGTCTCTTATACACATCT | wc -l
0
joconnell@ubuntu:~/workspace/NxTrim$ zcat test.mp.fastq.gz |grep CTGTCTCTTATACACATCT | wc -l
0
joconnell@ubuntu:~/workspace/NxTrim$ zcat test.unknown.fastq.gz |grep CTGTCTCTTATACACATCT | wc -l
0
joconnell@ubuntu:~/workspace/NxTrim$ zcat test.se.fastq.gz |grep CTGTCTCTTATACACATCT | wc -l
0

I used -v 8 --preserve-mp --rf --stdout-mp arguments.

Sorry - I still cannot recreate it with those parameters:

joconnell@ubuntu:~/workspace/NxTrim$ ./nxtrim -1 example/HFYJ5AFXX.1_5kb_R1.missed.fastq.txt -2 example/HFYJ5AFXX.1_5kb_R2.missed.fastq.txt -v 8 --preserve-mp --rf --stdout-mp 2> /dev/null | grep CTGTCTCTTATACACATCT | wc -l
0
joconnell@ubuntu:~/workspace/NxTrim$ ./nxtrim -1 example/HFYJ5AFXX.1_5kb_R1.missed.fastq.txt -2 example/HFYJ5AFXX.1_5kb_R2.missed.fastq.txt -v 8 --preserve-mp --rf --stdout-mp 2> /dev/null | grep AGATGTGTATAAGAGACAG | wc -l
0

I would advise against -v 8 , this is likely to cause false positive matches to the adapter (maybe it is okay for a viral genome). There will be some amount of true DNA with an 8bp substring of the adapter.

$ git branch  
* issue33
  master
$ git status .
On branch issue33
Your branch is up-to-date with 'origin/issue33'.
nothing to commit, working tree clean
$

Weird, I wil re-test. Yes, I know -v 8 is a bit too low but I need it for de novo assembly and I want only absolutely pure MP reads.

OK, seems the exact matches to Nextera linkers are gone now in branch33, I probably provided the testcases based on older output or I have no explanation.

So how many mismatches does nxtrim allow? I think the below numbers seem too high IMHO, especially those in *.mp.fastq.gz files which were already split by either of the linkers. Could it be they were split by the forward or reverse variant only instead of the full-length dimer? ;-)

$ for f in *nxtrim_*.fastq.gz; do echo -n "$f "; gzip -dc $f | agrep -1 -c CTGTCTCTTATACACATCT; done
HFYJ5AFXX.1_5kb_nxtrim_R1.mp.fastq.gz 122
HFYJ5AFXX.1_5kb_nxtrim_R1.pe.fastq.gz 436
HFYJ5AFXX.1_5kb_nxtrim_R1.unknown.fastq.gz 489
HFYJ5AFXX.1_5kb_nxtrim_R2.mp.fastq.gz 141
HFYJ5AFXX.1_5kb_nxtrim_R2.pe.fastq.gz 460
HFYJ5AFXX.1_5kb_nxtrim_R2.unknown.fastq.gz 465
$
$ for f in *nxtrim_*.fastq.gz; do echo -n "$f "; gzip -dc $f | agrep -2 -c CTGTCTCTTATACACATCT; done
HFYJ5AFXX.1_5kb_nxtrim_R1.mp.fastq.gz 343
HFYJ5AFXX.1_5kb_nxtrim_R1.pe.fastq.gz 725
HFYJ5AFXX.1_5kb_nxtrim_R1.unknown.fastq.gz 942
HFYJ5AFXX.1_5kb_nxtrim_R2.mp.fastq.gz 360
HFYJ5AFXX.1_5kb_nxtrim_R2.pe.fastq.gz 843
HFYJ5AFXX.1_5kb_nxtrim_R2.unknown.fastq.gz 954
$
$ for f in *nxtrim_*.fastq.gz; do echo -n "$f "; gzip -dc $f | agrep -3 -c CTGTCTCTTATACACATCT; done
HFYJ5AFXX.1_5kb_nxtrim_R1.mp.fastq.gz 12785
HFYJ5AFXX.1_5kb_nxtrim_R1.pe.fastq.gz 4146
HFYJ5AFXX.1_5kb_nxtrim_R1.unknown.fastq.gz 21729
HFYJ5AFXX.1_5kb_nxtrim_R2.mp.fastq.gz 16745
HFYJ5AFXX.1_5kb_nxtrim_R2.pe.fastq.gz 5457
HFYJ5AFXX.1_5kb_nxtrim_R2.unknown.fastq.gz 32335
$
$ agrep
usage: agrep [-#cdehiklnpstvwxBDGIS] [-f patternfile] pattern [files]

summary of frequently used options:
-#: find matches with at most # errors
-c: output the number of matched records
-d: define record delimiter
-h: do not output file names
-i: case-insensitive search, e.g., 'a' = 'A'
-l: output the names of files that contain a match
-n: output record prefixed by record number
-v: output those records containing no matches
-w: pattern has to match as a word, e.g., 'win' will not match 'wind'
-B: best match mode. find the closest matches to the pattern
-G: output the files that contain a match
$

*  app-text/agrep
      Latest version available: 2.04-r2
      Latest version installed: 2.04-r2
      Size of files: 61 KiB
      Homepage:      ftp://ftp.cs.arizona.edu/agrep/README
      Description:   A tool for the fast searching of text allowing for errors in the search pattern
      License:       AGREP

Look for CTGTCTCTTATACACACT, which is a 1nt deletion of the CTGTCTCTTATACACATCT query. Many of them are actually incomplete hairpins:

incomplete:          CTGTCTCTTATACACA CTAGATCTGTAGTAGAGACA
full-length hairpin: CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG

Why such matches are still detectable in *nxtrim_R1.mp.fastq.gz files? They were already split using some linker sequence. Actually, would they be split then they would not be 151 nucleotides long in my case. Needless to say, I did not check the other mate sequence, maybe that one was split.

Why such matches are still detectable in *nxtrim_R1.mp.fastq.gz files? They were already split using some linker sequence.

I assume these *.mp.fastq.gz examples with remaining junction adapters are examples where there are two copies of the adapter in a original read?

There are two steps at play here:

  1. Find the junction adapter with up to a 15% hamming mismatch rate and trim/reverse-complement appropriately.
  2. Check the remaining DNA for perfect matches of either CTGTCTCTTATACACATCT / AGATGTGTATAAGAGACAG. If these exist, discard the read-pair.

You can see that step 2 has very stringent requirements for matching. Why? Well these occurrences should be very rare (about 1/5000 read pairs in my data using agrep -3) so I just did something simple. What is the rate you are seeing?

If it is high, then maybe step 2. needs improved sensitivity from approximate matching. The relevant code is here:
https://github.com/sequencing/NxTrim/blob/master/matepair.cpp#L388-L392

$ grep -c "^@" HFYJ5AFXX.1_5kb_R1.fastq 
17502155
$

So 1/13689 in my case.

I still have problems to understand why you discard whole read pairs (in step 2 described above).

So 1/13689 in my case.

I speculate the effect on assembly is going to be negligible then :) We could still look at improving this, there is little doubt your matches are true adapters with a small amount of errors.

I still have problems to understand why you discard whole read pairs (in step 2 described above).

If we see two copies of the junction adapter, then something funny is going on at a molecular level. Maybe two fragments have stuck together or something, I have never investigated in detail. Given the very low rate, it seems safer to just remove them than try to fix them.

I speculate the effect on assembly is going to be negligible then.

I don't think so. Actually this is why I am redoing a previous work of somebody else. We have gaps in scaffolds which contradict RapidSeq reads. Obviously something went wrong during scaffolding. Having having about 100k such overlaps up to ~38nt long (even a bit more) times 8 datasets will create havoc.

If we see two copies of the junction adapter, then something funny is going on at a molecular level.

But there two copies. fwd+revcom, one after each other! Or vice versa, revcomp+fwd.

You mean fwd+fwd or revcomp+revcomp instead? That would be true. Or do you mean two isolated copies spaced with deemed sample sequence by at least 5 nt? That would be true as well IMHO.

But there two copies. fwd+revcom, one after each other!

You mean fwd+fwd or revcomp+revcomp instead? That would be true. Or do you mean two isolated >copies spaced with deemed sample sequence by at least 5 nt? That would be true as well IMHO.

I mean in addition to the 'CTGTCTCTTATACACATCT'+'AGATGTGTATAAGAGACAG' string there is at least one more of 'CTGTCTCTTATACACATCT' either 'AGATGTGTATAAGAGACAG'. This should not happen in a well formed mate-pair.

OK, that I agree with.

Okay. So there were two issues in this thread.

  1. A bug that did not find certain cases where there were perfect additional copies of the junction adapter (indicative of a problematic molecule). This is now fixed.

  2. The other issue is missed adapters leading to untrimmed output. This is a sensitivity/precision trade-off.

Quite possibly my similarity threshold -s .85 is too stringent. It is missing cases like this one:

$ grep "@NB501598:62:HFYJ5AFXX:1:11101:12753:4668" example/HFYJ5AFXX.1_5kb_R1.missed_mismatch.fastq.txt -A1

@NB501598:62:HFYJ5AFXX:1:11101:12753:4668 1:N:0:GCCAAT
GACCGGGCATTCCCTTCCCATTATAAGTCTATGGGGAAATTTTGGGGGCCTCTTACACCCCAGGGGTACAGCTGTCTCATAAACACATGTTGAGTTGTATAAGAGACAGATAGAATCATGATTAGAAATAACCAAGAATGATCATAACACTAATA

The adapter is obviously in there but does have 6 mismatches (1-6/38=0.842 which is below my threshold).

We can tweak the -s parameter to give better results:

$ ./nxtrim -1 example/HFYJ5AFXX.1_5kb_R1.missed_mismatch.fastq.txt -2 example/HFYJ5AFXX.1_5kb_R2.missed_mismatch.fastq.txt --stdout-mp -s .85 2> /dev/null | agrep -1 AGATGTGTATAAGAGACAG | wc -l
26
$ ./nxtrim -1 example/HFYJ5AFXX.1_5kb_R1.missed_mismatch.fastq.txt -2 example/HFYJ5AFXX.1_5kb_R2.missed_mismatch.fastq.txt --stdout-mp -s .8 2> /dev/null | agrep -1 AGATGTGTATAAGAGACAG | wc -l
12

So there are still 12 potential adapters in there (with a gapped alignment). What do these look like? At least this one looks like a true missed junction adapter:

TAGATGTGTAATAAGAGACAGGTCTATAGCGTCGGTGGCAGGGCAGCCAGAGCTTTTTTTGATGCTCTCGCCGGTGGCGGTGTGAAAGCACAGTTACACAGCCATCCAATCACAGTGGAGGAGGGGTGGGACAAATATCACAACAACCAACCAGC
 AGATGTGTA-TAAGAGACAG

The solution here is to switch to a gap-aware alignment, but this would be some work. I originally had this in the code but it was slower and I didn't find it to improve assemblies. I might be able to do this, but would like to see what rate such problematic MPs are occurring at.

I think it only makes sense to do Smith-Waterman alignment here, and I do not mind the slower speed. Would you mind just accepting a simple tabular input format defining position of the linker? See blastn example below. You could run the blastn process directly from within nxtrim. And even fallback to blastall -p blastn is blastn is missing from $PATH.

Meanwhile I would again suggest improving README emphasizing that not even a single InDel is practically allowed, and -s 0.8 allows just one in the full-length hairpin linker (still no InDel allowed in either just the forward or revcomp linker halves).

To count number of incidences of unremoved linkers, run blastn?

blastn -task blastn-short -db "$p".fasta -query Nextera_MatePair_single_junction.fasta -word_size 4 -dust no -reward 1 -outfmt 6

I dug out my code with SW alignment but it is actually in a pretty terrible state (the branch is sw but I wouldn't recommend using it). You could check out nextclip which I believe is focused on scaffolding.

I think it only makes sense to do Smith-Waterman alignment here, and I do not mind the slower speed.

Agree that SW is the right thing to do. In practice, the rate of missed adapters in the current implementation is something like 1/5,000 (at least in my data) so my Hamming heuristic was good enough for me. Tuning the SW penalties and scores when you have incomplete adapters on the edges can be a delicate operation whereas Hamming is straightforward. The main aim of this work was to demonstrate the suitability of Nextera mate-pairs as a standalone library for assembly.

Meanwhile I would again suggest improving README emphasizing that not even a single InDel is practically allowed

It is actually a bit more sophisticated. It will match either side of the junction adapter (one of CTGTCTCTTATACACATCT AGATGTGTATAAGAGACAG) with Hamming distance, and then clip appropriately. So the only case it does not catch is when indels occur in both sides. I updated the README with a description.

Closing this one. See #37