unexpected result
Closed this issue · 12 comments
Hi @aaronmck ,
I ran into an issue that I do not expect and can not explain. I attached an image containing the relevant result.
As you can see there is a 0-mismatch off-target in the result. Interestingly, it happens to be within my target gene.
So I quickly checked the sequence using bedtools getfasta
. (It seems coordinates in flashfry results are 1-based) I got these:
>chr3:52211336-52211359
ACACCGTACGGTCCTATGCTGCT
>chr3:52211337-52211360
CACCGTACGGTCCTATGCTGCTG
>chr3:52211338-52211361
ACCGTACGGTCCTATGCTGCTGG
>chr3:52211339-52211362
CCGTACGGTCCTATGCTGCTGGC
This indicates that the 0-mismatch off-target is supposed to be the target. Am I right? If so, then this result from flashfry output is incorrect.
Can you please help me to sort this out? Thanks a lot!
Thanks for the report Huanle! I'm trying to figure this out, it certainly looks like a bug. I'm making a simple test case to track it down, as I can't clearly see where the +1 offset is getting added to the off-target sequence positions. Hopefully I'll find the issue today and update with a new version.
I can't seem to reproduce this with a simple example genome (the off-target positions are correct and match the in-genome target). Could you let me know what genome you're using? I can try to see if I can recreate your exact issue.
I also tried with a single guide against the mouse genome (ROSA26 primer CTAAAGAAGAGGCTGTGCTTTGG) and it got the correct zero-based positions. Are you running all of chromosome 3 against itself here?
I found the above guide in human HG38 for your entry:
chr3:52211338-52211361
ACCGTACGGTCCTATGCTGCTGG
Which is the zero-index representation of the right position that I get when BLAT'ing the sequence at NCBI:
So the off-target should be at position 52211338; I'm not sure why your target is off by one position (52211337), how are you setting up your input fasta file to find targets?
Hi @aaronmck ,
thanks a lot for your quick response.
You are right that this guide is designed using hg38 as the reference genome.
Below are the commands I used for the guide designing and off-target prediction and I attached the input files (including all but the indexed database) used in these commands.
Hope this helps you and ultimately me to sort it out.
Best,
Huanle
exe="../flashfry/FlashFry-assembly-1.12.jar"
db="../hg38_cas9**NRG**_database"
fasta="test.fa"
java -Xmx8g -jar ${exe} \
discover \
--flankingSequence=10 \
--maxMismatch 4 \
--maximumOffTargets=4000 \
--positionOutput \
--database ${db} \
--fasta ${fa} \
--output ${out}
java -Xmx8g -jar ${exe} \
score \
--database ${db} \
--includeOTs \
--input ${out} \
--output ${out}.score \
--inputAnnotationBed GeneFeature:${ano_bed} \
--transformPositions ${seq_bed} \
-scoringMetrics doench2014ontarget,moreno2015,JostandSantos,doench2016cfd,dangerous,hsu2013,minot,folding,rank,bedannotator
PS:
you need to change the file suffix to .tgz and uncompress it.
Thanks, this was helpful! The issue is just zero-based vs one-based BED files. When I first wrote the tool it seemed more people assumed BED files were one-based (besides UCSC), but it seems things have changed and the standard is now zero-based. I've updated the code to be zero-based, there's a new 1.14 release that should fix this. Your target and "off-target" is now at 52211338 using your bed files.
Hi @aaronmck ,
thanks a lot for your effort. It seems to be fixed.
Is there any specific reason that you changed the output format by placing Doench2016CFDScore
into the off-targets
column rather than in an independent column as in the past?
I think it's in both places? DoenchCFD_maxOT and DoenchCFD_specificityscore are per target scores, whereas the Doench2016CFDScore is a per-off-target score? Maybe something got mangled here?
hmm. i am not sure.
Another thing that seems to have been changed is to include the on-target into the offTargets
. Maybe i was like this before and i failed to notice it?
Thanks a lot for your efforts.
Sure, the on-target should always be an 'off-target' if you're running against the same genome. Some scoring schemes take this into account, others (like Hsu et al.) need to be told what to do; we default to not including it in the score unless asked to (--countOnTargetInScore parameter).
Hi @aaronmck ,
I found exceptions similar to what is shown in the attached image: the number of 0-mismatch off-target is 0 instead of q.
In my case, two-thirds of guides have 1-perfect off-target
(aka, the guide/target themselves) and one-third have no perfectly matched off-target.
Your target overflowed (OVERFLOW set) which mean not all off-targets were captured. We do this to avoid processing targets that won't be suitable candidates. You can set the maximum with the --maximumOffTargets parameter.