The F1 cross analysis attempts to find sets of guides that are specific to each genome. The difficult part in the standard analysis that I performed was not generating these sets, but finding the corresponding witnesses. So I had to take another approach.
I took the set of indels/SNPS (as a VCF file) and generated a synthetic 129S1 genome along with an associated mapping file using the tool called MMARGE. The variant calls are drawn from Ensemble's mm10 genome, so the resultant genome is
The script prepare_genome.sh
in the scripts directory will
perform this entire procedure from scratch given that the MMARGE
tool is installed.
To put the genomes in a standard FASTA format, simply run:
$ cat 129S1_SVIMJ/*.fa > 129s1.fna
$ cat genomes/*.fa > mm10.fna
We generate the sets of gRNAs for both 129S1 and B6 independently
using our build_dbs
script that parallelizes execution of Guidescan2.
The results of this generation are located under the experiment14
directory.
Now that we have both databases and their corresponding
off-targets, we run the 129S1 set of guides against the B6 genome
and vice versa. This will run more quickly, since most guides are
filtered out because of their perfect match in the other genome. To
perform this procedure, we run the pull_kmers.py
script on the
SAM databases for mm10 and 129S1 to generate a kmer set for both
organisms.
$ python pull_kmers.py 129s1.sam > 129s1.kmers.csv
$ python pull_kmers.py mm10.sam > mm10.kmers.csv
With the kmer sets in hand, we run the 129S1 kmer set against mm10 and the mm10 kmer set against 129S1 using Guidescan2 with the no perfect matches option enabled.
At this point, we have a total of 4 databases: - A mm10 database with guides targeting mm10 - A mm10 specific database with guides targeting mm10 and not 129S1 - A 129S1 database with guides targeting 129S1 - A 129S1 specific database with guides targeting 129S1 and not mm10
The first two databases are relatively easy to work with as the guideRNA coordinates
are with respect to mm10, it is only the off-targets that are with respect to another
genome. For the latter two we need to convert sgRNA coordinates to mm10. We do this
using MMARGE as follows where 129s1_dir
is the directory in which the prepare_genomes.sh
script was executed in the first step.
$ $MMARGE shift -data_dir 129s1_dir/ -ind 129S1_SVIMJ -sam -files 129s1_x_mm10.sam
In the final step, we parse out off-target information and classify the type of disruption to create two CSVs for mm10 and 129S1 allele specific gRNAs.
$ python scripts/classify_disruption.py 129s1_dir/inputs/129S1_SvImJ.mgp.v5.{snps,indels}.*.sorted.gz 129s1_x_mm10.coords_shifted_from_129S1_SVIMJ.sam --annotation-db /data/leslie/schmidth/genomes/grcm38/GRCm38_68.db --chromosome-mapping inputs/mm10_chr_mapping.txt --non-specific-db 129s1.bam.sorted -o 129s1_x_mm10.csv
$ python scripts/classify_disruption.py ../experiment15/inputs/129S1_SvImJ.mgp.v5.{snps,indels}.*.sorted.gz mm10_x_129s1.sam --annotation-db /data/leslie/schmidth/genomes/grcm38/GRCm38_68.db --chromosome-mapping inputs/mm10_chr_mapping.txt --non-specific-db mm10.bam.sorted -o mm10_x_129s1.csv
We then merge these CSVs using awk
.