pritykinlab/guidescan-cli

Question on the precomputed GuideScan2 hg38 database

mzhibo opened this issue · 6 comments

Thanks for developing this great tool!
I am trying to use guidescan2-cli to find good paired gRNA candidates for a list of genes. I thought your precomputed database could help save some running time. Maybe I missed it, but I couldn't find the parameters that were used to generate the precomputed databases. Do you remember which --mismatches and --alt-pam values were used?
Additionally, are alternative PAMs weighted relative to the NGG PAM in the off-target score calculation? Would it be a good idea to include the NNG and NGA non-canonical PAMs?
Thanks,
Zhibo

Did you find what you need? The --mismatches parameter was set to 3, --alt-pam was NAG, and all off-targets are weighted according to the CFD score introduced by Doench et al. The CFD score incorporates the binding affinity of alternative PAMs, so the weighting is implicit.

@schmidt73 Thank you for the details.
May I ask what the best approach is for querying your precomputed databases? I already have a list of kmers that I want to rank by specificity and on-target efficiency. I tried running the enumerate step, which worked well when the list was on the order of 10^3 kmers but slowed down significantly at the order of 10^5 to 10^6.
Do you have a script that queries the precomputed BAM and decodes only the entries for the query kmers?
Thanks,
Zhibo

There are a few available options. You could search directly for the kmers in the BAM database; this will be slow but will scale up. If you have the positions of the kmers, I would use a tool such as BedMap to intersect your list of kmer positions with those in the database. This is very efficient and will scale up to millions of kmers.
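To illustrate what the position intersection does, here is a minimal in-memory sketch in Python. It is not BedMap and not part of GuideScan2; it only shows the idea of keeping database records whose coordinates exactly match a query kmer. The four-column layout (chrom, start, end, name) and the function names `load_bed` and `intersect_exact` are assumptions made for this example, and the database would first have to be exported to a BED-like text form. For millions of kmers a sorted-interval tool is the better choice.

```python
import csv
import io


def load_bed(handle):
    """Read tab-separated BED-like rows (chrom, start, end, name) into tuples.

    The column layout is an assumption for this sketch, not a GuideScan2 format.
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else ""
        rows.append((chrom, start, end, name))
    return rows


def intersect_exact(query_rows, database_rows):
    """Return database records whose (chrom, start, end) matches a query row.

    Results follow the order of query_rows and carry the database name column.
    """
    index = {(c, s, e): name for c, s, e, name in database_rows}
    hits = []
    for c, s, e, _ in query_rows:
        key = (c, s, e)
        if key in index:
            hits.append((c, s, e, index[key]))
    return hits


if __name__ == "__main__":
    database = load_bed(io.StringIO("chr1\t100\t120\tguideA\nchr1\t200\t220\tguideB\n"))
    queries = load_bed(io.StringIO("chr1\t200\t220\tq1\nchr2\t5\t25\tq2\n"))
    for hit in intersect_exact(queries, database):
        print(*hit, sep="\t")
```

The dictionary lookup keeps the cost per query constant, but everything has to fit in memory, which is the main reason to prefer a streaming tool on sorted files for genome-scale databases.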

If your list is on the order of 10^5 or 10^6 kmers, you could probably use the indices directly. If you have access to many computers, one way to do this is to split the list into chunks of 10K-20K kmers and run a separate instance of GuideScan2 on each computer. This will be fast and lets you customize the number of mismatches, alternative PAMs, etc. This sort of approach is detailed in the manual.
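The chunking step could be sketched in Python as below. This is only an illustration: the helper names `chunked` and `write_chunks`, the output file naming, and the one-kmer-per-line layout are assumptions, so adapt the output to whatever kmer file format your GuideScan2 invocation expects. The optional `header` argument repeats a header line at the top of every chunk in case the input format has one.

```python
from itertools import islice
from pathlib import Path


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_chunks(kmers, out_dir, chunk_size=10_000, prefix="kmers_chunk", header=None):
    """Write `kmers` (an iterable of text lines) into numbered chunk files.

    File naming and line layout are hypothetical; returns the written paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunked(kmers, chunk_size)):
        lines = ([header] if header is not None else []) + chunk
        path = out_dir / f"{prefix}_{i:04d}.txt"
        path.write_text("\n".join(lines) + "\n")
        paths.append(path)
    return paths
```

Each resulting file can then be handed to its own GuideScan2 run, either on separate machines or one after another on a single workstation, and the per-chunk outputs concatenated afterwards.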

@schmidt73 Thank you for the tips. I will try them and see which one works better. I have several large sets of kmers but only a workstation, so large-scale parallelization is somewhat limited.

Let me know if you need any more help. I think BedMap is the best approach for your usage. I'll close this for now.

@schmidt73 thank you! I will try it.