czbiohub-sf/dashit

revcomp issue

emilydc opened this issue · 3 comments

Hi guys - when we re-ran everything with the corrected input guide file, it solved the problem of that first guide still being present in the undashed.fasta file. However, we still found that the undahsed.fasta file created with nribo2_150_guides.csv is full of reads that map to ribosomal RNA, even mapping to places very close to actual nribo2_150 guides. What we then realized is that score_guides was only putting reads in undashed.fasta that include a forward sequence of one of the guides - NOT if they include the reverse complement. It should be doing both. It seems unlikely (though not impossible) that we have come this far with DASHit without discovering this problem before, so I'm wondering if it's something that was introduced accidentally when score_guides was updated to have that function of splitting the fasta file in two.

Optimize_guides should also consider something "hit" if either the forward or the rev comp of a guide is present in it, though I suspect that this actually wouldn't make too much difference in the output of optimize_guides. Could you check though and let me know whether optimize_guides considers rev comp?

In any case, I'm hoping this will be an easy fix for score_guides. Thanks for addressing it!

Emily

Here's how data flows through DASHit, with respect to revcomps:

  1. crispr_sites -r scans FASTQ files looking for PAM sites. This search looks in both directions, in the sense that it looks for ...NGG and CCN... and then outputs either i) the 20-mer ... in the ...NGG case, or ii) the reverse complement of the 20-mer ... in the CCN... case
    • The result of this is that the candidate guides crispr_sites outputs always precede NGG on one of the strands
  2. crispr_sites outputs all candidate guides, along with the indices of the reads those guides were found in, in a _sites_to_reads.txt file
  3. optimize_guides chooses candidate guides from _sites_to_reads.txt in a greedy fashion, to cover as many reads as possible. Since crispr_sites found candidate guides in both the forward and reverse direction, in this step we will choose guides that match both strands of DNA

To me it seems like the guide selection process is OK, in the sense that it will correctly identify and select guides based on hitting reads in either the forward or reverse direction.

As you say, it is true that score_guides does a simple search for each guide in each read, not checking for reverse complements. Here is the pseudocode:

for each read r
    for each guide g
        if g a substring of r
            readHit = True

if readHit:
    hits++
else
    misses++

Since the guides that crispr_sites produces are always normalized to precede the NGG PAM, any reads that match guides via CCN... will not be counted if we simply check if the guide is a substring of the read, which is what we're doing now.

An easy example to reproduce, this FASTA file:

>
CCGAAAAATTTTTCCCCCATATA

produces this:

Unique sites to reads: 1
Guide 0 had largest number of reads: 1
Outputting 1 unique guides.
Total reads: 1
TATATGGGGGAAAAATTTTT    1

but if we search for the guide TATATGGGGGAAAAATTTTT as a substring in the FASTA file, we won't find it.

The simple fix is to simply check both the guide and its reverse complement against every read.

I'm not sure why we didn't see this problem before - perhaps we were getting lucky and weren't seeing a lot of CCN... reads? Or there were small enough numbers of them that we didnt notice amidst the millions of reads getting hit?

Another possibility is that the old BioPython version of score_guides.py used the code if read in guide, and perhaps BioPython overloads that membership to test reverse complements.

Actually, I just realized there's another bug here too.

The guides that are produced by crispr_sites don't contain the PAM. So when score_guides looks for hits in a FASTQ file, it needs to additionally check that the guide it sees is adjacent to a PAM (either the revcomp of the guide after CCN or the guide itself before a NGG)

Currently, since we're just seeing if the guide is a substring of the read, if we got (un)lucky enough to have a guide sequence in a read that wasn't adjacent to a PAM, score_guides would count it as a hit, when in fact Cas9 wouldn't recognize it.

Fixed this bug and pushed changes to DASHit server.