Evaluation of repeat identification
Opened this issue · 4 comments
Dear developers,
I am using NCFR to identify in our nanopore reads 4 different motifs: TG, CCCG, CCTG, TCTG. I've set as minimum length of the repeat 4 times its length, using the following command lines:
cat $trimmed_reads | /opt/NoiseCancellingRepeatFinder/NCRF TG --minlength=8 --scoring=nanopore --minmratio=0.85 --stats=events --positionalevents
cat $trimmed_reads | /opt/NoiseCancellingRepeatFinder/NCRF CCCG --minlength=16 --scoring=nanopore --minmratio=0.85 --stats=events --positionalevents
cat $trimmed_reads | /opt/NoiseCancellingRepeatFinder/NCRF CCTG --minlength=16 --scoring=nanopore --minmratio=0.85 --stats=events --positionalevents
cat $trimmed_reads | /opt/NoiseCancellingRepeatFinder/NCRF TCTG --minlength=16 --scoring=nanopore --minmratio=0.85 --stats=events --positionalevents
I've put all the results together in a file, filling the gaps with a "N" motif in regions where nothing was identified. However, I found a read with a repetition of CCTG in a region theoretically free from any of these repetitions (from position 11683 to 11705, with 100% identity). Attached a file of this example, with the read sequence in the last column of the file.
Do you have any idea of the reason of this issue?
Please, let me know if you need further information.
Thank you in advance for your help.
Barbara
Thanks for reporting that, I will take a look at it.
I can verify that it doesn't report that repeat segment when given the entire read. There are a couple of stages of internal filtering intended to discard alignments and subalignments that aren't "good enough", so my next step will be to check whether this segment got thrown out with the bathwater in of those.
I also asked it to search just the 11412..11704 hole indicated by your spreadsheet (by hardmasking the rest of the sequence). It does manage to report the missing segment you mention, plus two others (see below).
So my best guess at the moment is that the initial alignment identified a much longer segment than is reported, and one of the filtering steps excised some subsegments as being suboptimal and whittled them out. And I might have made the assumption that any whittled-out subsegments can't contain any subsubsegments that are "good enough" to be reported.
# score=123 querybp=16 mRatio=93.8% m=15 mm=0 i=0 d=1 =========x======
fw_0c99392b-a8c5-420f-a61e-600a8878496d 15925 15bp 11485-11500 GCCTGCCTG-CTGCCT
CCTG+ 16bp score=123 GCCTGCCTGCCTGCCT
# score=313 querybp=46 mRatio=93.5% m=43 mm=1 i=0 d=2 ==========x===================x==x============
fw_0c99392b-a8c5-420f-a61e-600a8878496d 15925 44bp 11501-11545 CTGCCTGCCT-CCTGCCTGCCTGCCTGCCTACC-GCCTGCCTGCCT
CCTG+ 46bp score=313 CTGCCTGCCTGCCTGCCTGCCTGCCTGCCTgCCTGCCTGCCTGCCT
# score=313 querybp=35 mRatio=97.1% m=34 mm=0 i=0 d=1 =======x===========================
fw_0c99392b-a8c5-420f-a61e-600a8878496d 15925 34bp 11671-11705 CTGCCTG-CTGCCTGCCTGCCTGCCTGCCTGCCTG
CCTG+ 35bp score=313 CTGCCTGCCTGCCTGCCTGCCTGCCTGCCTGCCTG
What I said was my best guess is what happened.
The first pass identified 3244..14078 as a single alignment. A 'debridging' step identified 11412..11704, with mRatio 76.5%, as a probable random bridge. (As discussed in supplementary note S1B, linked below). Removing the bridge partitions that long alignment into segments with mRatio 98.1, 76.5, and 96.9. That's reasonable.
What isn't reasonable is my assumption that the bridge is all garbage. What I need to do instead of throwing it out entirely is to search it for repeats.
I think this will be a pretty simple fix, but I'm not sure yet. If you are in a hurry, a workaround would be to (a) N-mask out all the intervals NcRF reported, and (b) run a second pass of NcRF. But if all goes well I should have a fix here in the repo by tomorrow.
The manuscript supplement can be found here. Note S1B explains the rationale for removing alignment artifacts like random bridges.
https://academic.oup.com/bioinformatics/article/35/22/4809/5530597#supplementary-data
I've updated the main branch to include what I think is a fix. I given it a new temporary version number, 1.01.04. After I subject it to more thorough testing I'll bump the version number again and make it a tagged release, and then see if I can figure out how to update the bioconda version of it.
@biadarola if you could try running this on some of your data and try to verify that it reports a superset of what the earlier version reported, that would be helpful. And of course, that the newly-reported alignments meet your criterion.
I expect this might be a little slower.
Here's what I now get for the read you sent, searching for CCTG, and sorted by position on the read.
#line motif seq start end strand seqLen querybp mRatio m mm i d
1 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 3244 11412 + 15925 8175 98.1% 8080 28 60 67
9 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 11485 11500 + 15925 16 93.8% 15 0 0 1
17 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 11501 11545 + 15925 46 93.5% 43 1 0 2
25 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 11671 11705 + 15925 35 97.1% 34 0 0 1
33 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 11705 14078 + 15925 2379 96.9% 2330 18 25 31
41 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 14176 14193 + 15925 17 94.1% 16 1 0 0
49 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 14198 14254 + 15925 58 96.6% 56 0 0 2
57 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 14331 14390 + 15925 62 93.5% 58 1 0 3
65 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 14476 14557 + 15925 80 94.0% 78 0 3 2
73 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 14577 14601 + 15925 24 92.0% 23 0 1 1
81 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 14632 14654 + 15925 22 100.0% 22 0 0 0
89 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 14688 14722 + 15925 34 97.1% 33 1 0 0
97 CCTG fw_0c99392b-a8c5-420f-a61e-600a8878496d 14772 14862 + 15925 89 91.3% 84 3 3 2