makovalab-psu/NoiseCancellingRepeatFinder

Evaluation of repeat identification

Opened this issue · 4 comments

Dear developers,

I am using NCFR to identify in our nanopore reads 4 different motifs: TG, CCCG, CCTG, TCTG. I've set as minimum length of the repeat 4 times its length, using the following command lines:

cat $trimmed_reads | /opt/NoiseCancellingRepeatFinder/NCRF TG --minlength=8 --scoring=nanopore --minmratio=0.85 --stats=events --positionalevents 

cat $trimmed_reads | /opt/NoiseCancellingRepeatFinder/NCRF CCCG --minlength=16 --scoring=nanopore --minmratio=0.85 --stats=events --positionalevents 

cat $trimmed_reads | /opt/NoiseCancellingRepeatFinder/NCRF CCTG --minlength=16 --scoring=nanopore --minmratio=0.85 --stats=events --positionalevents 

cat $trimmed_reads | /opt/NoiseCancellingRepeatFinder/NCRF TCTG --minlength=16 --scoring=nanopore --minmratio=0.85 --stats=events --positionalevents 

I've put all the results together in a file, filling the gaps with a "N" motif in regions where nothing was identified. However, I found a read with a repetition of CCTG in a region theoretically free from any of these repetitions (from position 11683 to 11705, with 100% identity). Attached a file of this example, with the read sequence in the last column of the file.

Example_repeat.xlsx

Do you have any idea of the reason of this issue?

Please, let me know if you need further information.

Thank you in advance for your help.

Barbara

Thanks for reporting that, I will take a look at it.

I can verify that it doesn't report that repeat segment when given the entire read. There are a couple of stages of internal filtering intended to discard alignments and subalignments that aren't "good enough", so my next step will be to check whether this segment got thrown out with the bathwater in of those.

I also asked it to search just the 11412..11704 hole indicated by your spreadsheet (by hardmasking the rest of the sequence). It does manage to report the missing segment you mention, plus two others (see below).

So my best guess at the moment is that the initial alignment identified a much longer segment than is reported, and one of the filtering steps excised some subsegments as being suboptimal and whittled them out. And I might have made the assumption that any whittled-out subsegments can't contain any subsubsegments that are "good enough" to be reported.

# score=123 querybp=16 mRatio=93.8% m=15 mm=0 i=0 d=1          =========x======
fw_0c99392b-a8c5-420f-a61e-600a8878496d 15925 15bp 11485-11500 GCCTGCCTG-CTGCCT
CCTG+                                         16bp score=123   GCCTGCCTGCCTGCCT

# score=313 querybp=46 mRatio=93.5% m=43 mm=1 i=0 d=2          ==========x===================x==x============
fw_0c99392b-a8c5-420f-a61e-600a8878496d 15925 44bp 11501-11545 CTGCCTGCCT-CCTGCCTGCCTGCCTGCCTACC-GCCTGCCTGCCT
CCTG+                                         46bp score=313   CTGCCTGCCTGCCTGCCTGCCTGCCTGCCTgCCTGCCTGCCTGCCT

# score=313 querybp=35 mRatio=97.1% m=34 mm=0 i=0 d=1          =======x===========================
fw_0c99392b-a8c5-420f-a61e-600a8878496d 15925 34bp 11671-11705 CTGCCTG-CTGCCTGCCTGCCTGCCTGCCTGCCTG
CCTG+                                         35bp score=313   CTGCCTGCCTGCCTGCCTGCCTGCCTGCCTGCCTG

What I said was my best guess is what happened.

The first pass identified 3244..14078 as a single alignment. A 'debridging' step identified 11412..11704, with mRatio 76.5%, as a probable random bridge. (As discussed in supplementary note S1B, linked below). Removing the bridge partitions that long alignment into segments with mRatio 98.1, 76.5, and 96.9. That's reasonable.

What isn't reasonable is my assumption that the bridge is all garbage. What I need to do instead of throwing it out entirely is to search it for repeats.

I think this will be a pretty simple fix, but I'm not sure yet. If you are in a hurry, a workaround would be to (a) N-mask out all the intervals NcRF reported, and (b) run a second pass of NcRF. But if all goes well I should have a fix here in the repo by tomorrow.

The manuscript supplement can be found here. Note S1B explains the rationale for removing alignment artifacts like random bridges.
https://academic.oup.com/bioinformatics/article/35/22/4809/5530597#supplementary-data

I've updated the main branch to include what I think is a fix. I given it a new temporary version number, 1.01.04. After I subject it to more thorough testing I'll bump the version number again and make it a tagged release, and then see if I can figure out how to update the bioconda version of it.

@biadarola if you could try running this on some of your data and try to verify that it reports a superset of what the earlier version reported, that would be helpful. And of course, that the newly-reported alignments meet your criterion.

I expect this might be a little slower.

Here's what I now get for the read you sent, searching for CCTG, and sorted by position on the read.

#line motif seq                                     start end   strand seqLen querybp mRatio m    mm i  d
1     CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 3244  11412 +      15925  8175    98.1%  8080 28 60 67
9     CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 11485 11500 +      15925  16      93.8%  15   0  0  1
17    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 11501 11545 +      15925  46      93.5%  43   1  0  2
25    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 11671 11705 +      15925  35      97.1%  34   0  0  1
33    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 11705 14078 +      15925  2379    96.9%  2330 18 25 31
41    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 14176 14193 +      15925  17      94.1%  16   1  0  0
49    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 14198 14254 +      15925  58      96.6%  56   0  0  2
57    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 14331 14390 +      15925  62      93.5%  58   1  0  3
65    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 14476 14557 +      15925  80      94.0%  78   0  3  2
73    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 14577 14601 +      15925  24      92.0%  23   0  1  1
81    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 14632 14654 +      15925  22      100.0% 22   0  0  0
89    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 14688 14722 +      15925  34      97.1%  33   1  0  0
97    CCTG  fw_0c99392b-a8c5-420f-a61e-600a8878496d 14772 14862 +      15925  89      91.3%  84   3  3  2