CRISPETa is a flexible tool to design optimal pairs of sgRNAs for deletion of desired genomic regions. Given a BED file as input, CRISPETa finds, analyzes, and scores all possible sgRNAs. The program returns:
- A file with information of n ranked pairs of sgRNAs for every target region (n, the desired number of targets, can be selected by the user): position in the genome, sequence of the sgRNA+PAM, individual and paired scores and distance between sgRNAs.
- A log file with a summary of the run: settings, number of analyzed regions, mean of sgRNA scores (individual and paired), sgRNAs removed by each filter, etc.
- A BED file of designed sgRNAs ready to be uploaded to UCSC Genome Browser for visualization of target regions and sgRNA pairs related to them.
- A PDF with plots based on the results: a histogram of pairs per target region, histograms of the individual and paired score distributions, and a pie chart of complete versus incomplete designs.
- An HTML file with the same plots as the PDF above.
The code can be found on GitHub: https://github.com/guigolab/CRISPETA or on our web server: http://crispeta.crg.eu
- python 2.7
- Numpy
- biopython
- mysql-python
- plotly and chart-studio (optional)
- pdfkit (optional)
- BEDtools
- MySQL (tested on v5.1 and v5.5)
. ~/anaconda3/etc/profile.d/conda.sh
conda activate
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -n crispeta --yes python=2 numpy biopython plotly chart-studio mysql-python pdfkit bedtools
To activate the environment, run
. ~/anaconda3/etc/profile.d/conda.sh
conda activate crispeta
- Default settings for connecting to the database are in the config.py file. Change the parameters as necessary.
- Before running CRISPETa, the user must create a database in MySQL/MariaDB to store off-target information for sgRNAs. This step can take a while depending on the size of the database and the computer resources (more than 1 hour for human). Files with precomputed off-target information for some organisms can be downloaded directly from our web server (http://crispeta.crg.eu/download). NOTE: all files must be uncompressed before calling the creation scripts.
The off-target database can be created:
- Using module crispeta_mysql.py
- Manually using MySQL.
crispeta_mysql can use the comma-separated files downloaded from the web site to create the database. By default, crispeta_mysql uses the following values to create and access the database:
- user name: "crispeta"
- password: "crispeta"
- host: "localhost"
- database name: "crispeta"
- table name: "crispeta"
- column names: "grna", "off0", "off1", "off2", "off3", "off4"
The user can modify these parameters, except the column names, directly from the command line. If you change these values, remember to change them in the config.py file as well.
Example:
$ python crispeta_mysql.py -i [comma_separated_file.txt] -u [user_name] -p [pwd]
Make sure that all required files (crispeta_mysql.py, config.py and func.py) are in the same directory.
The comma separated file can be loaded directly to MySQL from the terminal using the following commands:
mysql> CREATE DATABASE crispeta;
mysql> USE crispeta;
mysql> CREATE TABLE [genome_name] (
-> grna VARCHAR(20) NOT NULL,
-> off0 INT NOT NULL,
-> off1 INT NOT NULL,
-> off2 INT NOT NULL,
-> off3 INT NOT NULL,
-> off4 INT NOT NULL,
-> PRIMARY KEY (grna));
mysql> LOAD DATA LOCAL INFILE '[comma_separated_file.txt]' INTO TABLE [genome_name]
-> FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
In config.py, change the parameter [table] to your [genome_name].
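The table above stores five counts per sgRNA (off0 to off4), and the -t option described under Parameters gives one limit per mismatch level (default 1,0,0,x,x). The following is a minimal sketch of how such a limit string could be checked against one table row. It assumes that off0 to off4 hold the number of genomic matches with 0 to 4 mismatches; the function names are hypothetical and are not taken from func.py.

```python
def parse_offtarget_limits(spec):
    """Turn a string such as "1,0,0,x,x" into a list of five limits.

    "x" means "no limit" and is stored as None.
    """
    parts = spec.split(",")
    if len(parts) != 5:
        raise ValueError("expected five comma-separated values, got %r" % spec)
    return [None if part == "x" else int(part) for part in parts]


def passes_offtarget_filter(counts, limits):
    """Return True if every off-target count is within its limit.

    counts: the off0..off4 values of one sgRNA (assumed meaning, see above).
    limits: the list returned by parse_offtarget_limits().
    """
    return all(limit is None or count <= limit
               for count, limit in zip(counts, limits))


limits = parse_offtarget_limits("1,0,0,x,x")
print(passes_offtarget_filter((1, 0, 0, 7, 52), limits))  # True
print(passes_offtarget_filter((1, 0, 2, 7, 52), limits))  # False
```

The second call fails because the sgRNA has two matches with 2 mismatches, while the limit at that level is 0.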
Running example
$ python CRISPETA.py -i [file.bed] -g [genome.fasta] -o results.txt
Make sure that all program files (CRISPETA.py, config.py and func.py) are in the same directory.
Parameters (Also see Table 1 of the CRISPETa manuscript for details)
"dir" = directory e.g: path/to/file; "int" = integer e.g: 10; "float" = decimal number e.g: 0.5; "string" = text e.g: this is a string; "bool" = boolean e.g: True or False
- -i dir: Path to input BED file.
- -g dir: Path to genome fasta file
- -t string (default: 1,0,0,x,x): Off-targets: String with maximum number of off-targets allowed with 0,1,2,3 and 4 mismatches (x: no limit). Text must have five integers (or x), comma-separated, with no spaces.
- -o dir (default: ./sgRNA_pairs): Path/prefix of output files
- -n int (default: 10): Maximum number of pairs to be returned for each target.
- -du int (default: 500): Upstream design region: Length of upstream region for sgRNA search
- -dd int (default: 500): Downstream design region: Length of downstream region for sgRNA search
- -eu int (default: 100): Exclude upstream: Length of upstream region adjacent to target excluded from sgRNA search
- -ed int (default: 100): Exclude downstream: Length of downstream region adjacent to target excluded from sgRNA search
- -v float (default: 0.5): Diversity measure: The maximum fraction of returned pairs for each target that contain the same sgRNA
- -si float (default: 0.2): Individual score cutoff: The minimum score individual sgRNAs must have to be considered
- -sp float (default: 0.4): Paired score cutoff: The minimum combined score that an sgRNA pair must have to be considered
- -sc string (default: +): Score combination: Method by which individual scores are combined to yield the pair score: addition ("sum") or multiplication ("product")
- -r string (default: score): Ranking method: Criterion for ranking protospacer pairs ("score" or "dist")
- -c string (default: None): Construction method: Method applied when making sgRNA pairs and constructing oligos: "none" or "DECKO" (only returns pairs where the first protospacer starts with G)
- -mp dir: Positive mask: File with favoured regions from genome, in BED format
- -mn dir: Negative mask: File with disfavoured regions from genome, in BED format
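The four region parameters (-du, -dd, -eu, -ed) can be read as defining two search windows around each target. The sketch below computes them under the assumption that each window starts right after the excluded flank and extends for the design length; strand is ignored, and the function name is hypothetical. With the default values and a target at 29557453-29557454, the boundaries agree with those shown for region1 in the Target_Regions track of the DESIGN BED example under Outputs.

```python
def design_windows(start, end, du=500, dd=500, eu=100, ed=100):
    """Return (upstream_window, downstream_window) as (start, end) tuples.

    Each window begins after the excluded flank and spans the design length.
    Coordinates are clamped at 0; strand is ignored in this sketch.
    """
    upstream = (max(0, start - eu - du), max(0, start - eu))
    downstream = (end + ed, end + ed + dd)
    return upstream, downstream


up, down = design_windows(29557453, 29557454)
print(up)    # (29556853, 29557353)
print(down)  # (29557554, 29558054)
```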
Input
The input data should be specified using a tab separated file (BED format) and passing it to the pipeline command with the option -i. Here is an example of the file format:
chr1 29557453 29557454 region1 0 -
chr2 32716839 32716840 region2 0 +
chr7 151129139 151129140 region3 0 +
chr12 151138423 151138424 region4 0 -
The fields in the file correspond to:
- Chromosome name
- Start position
- End position
- Unique ID
- Score (irrelevant here)
- Strand
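Before a run, it can help to check that the input file really has these six tab-separated fields. The following is a small sketch of such a check, independent of CRISPETa's own parser; the names BedRegion and parse_bed_line are hypothetical.

```python
from collections import namedtuple

# One record per input line, using the six fields listed above.
BedRegion = namedtuple("BedRegion", "chrom start end name score strand")


def parse_bed_line(line):
    """Parse one tab-separated BED line into a BedRegion."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected 6 tab-separated fields, got %d" % len(fields))
    chrom, start, end, name, score, strand = fields[:6]
    if strand not in ("+", "-"):
        raise ValueError("strand must be + or -, got %r" % strand)
    return BedRegion(chrom, int(start), int(end), name, score, strand)


lines = [
    "chr1\t29557453\t29557454\tregion1\t0\t-",
    "chr2\t32716839\t32716840\tregion2\t0\t+",
]
regions = [parse_bed_line(line) for line in lines]

# The fourth field must be unique across the file.
names = [region.name for region in regions]
assert len(names) == len(set(names)), "duplicate region IDs"

for region in regions:
    print("%s %s:%d-%d %s" % (region.name, region.chrom,
                              region.start, region.end, region.strand))
```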
Outputs
- DESIGN file: sgRNA pairs found by CRISPETa for each target region. Each line corresponds to one pair.
Sequence_ID(#pair) chromosome start end sgRNA_1+PAM score_1 chromosome start end sgRNA_2+PAM score_2 distance_to_exclude_up_region distance_to_exclude_down_region distance_between_sgRNAs paired_score mask_score oligo
region1(1) chr1 29557261 29557284 GCTTGTCTATGGGCACCACGGGG 0.944 chr1 29557844 29557867 CGTGTACTCTCCTCAGTGTAGGG 0.572 -69 290 560 1.516 2 .
region1(2) chr1 29557261 29557284 GCTTGTCTATGGGCACCACGGGG 0.944 chr1 29557805 29557828 CCTATGCCGTTACATGGTAGTGG 0.542 -69 251 521 1.486 2 .
region1(3) chr1 29557261 29557284 GCTTGTCTATGGGCACCACGGGG 0.944 chr1 29557576 29557599 GACTGCGTGTGGGCCCCGGAGGG 0.501 -69 22 292 1.445 2 .
- DESIGN BED file: Pairs and target regions in BED format, ready to upload to the UCSC Genome Browser as a custom track. The new tracks show the target regions and the pairs for each region passed as input to CRISPETa.
track name="Target_Regions" description="Regions_for_sgRNA_searching" visibility=1 itemRgb="On"
chr1 29557453 29557454 region1 0 + 0 0 0,0,0
chr1 29557353 29557554 region1 0 + 0 0 180,180,130
chr1 29556853 29558054 region1 0 + 0 0 40,175,40
...
track name="sgRNA_pairs" description="sgRNA_pairs" visibility=3 itemRgb="On"
chr1 29557261 29557867 region1(1) 1.516 . 29557261 29557867 193.0,0,0 2 23,23 0,583
chr1 29557261 29557828 region1(2) 1.486 . 29557261 29557828 189.0,0,0 2 23,23 0,544
chr1 29557261 29557599 region1(3) 1.445 . 29557261 29557599 184.0,0,0 2 23,23 0,315
...
- DESIGN Settings & Statistics: A summary of the design performance. It contains the number of regions analyzed, mean of individual and paired scores, mean of pair distances, number of sgRNAs excluded by filters, etc.
- DESIGN Plots: Plots generated from the design information, in HTML and PDF format.
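For downstream processing, each DESIGN file row can be split into its 17 columns. The sketch below parses the first example row and checks two relationships that hold in the example data: the paired score equals the sum of the two individual scores, and the distance between sgRNAs equals the start of the second sgRNA minus the end of the first. The column keys with _1 and _2 suffixes are hypothetical names chosen here to tell the repeated chromosome/start/end columns apart, and whitespace-delimited rows are an assumption.

```python
COLUMNS = [
    "sequence_id", "chrom_1", "start_1", "end_1", "sgrna_1", "score_1",
    "chrom_2", "start_2", "end_2", "sgrna_2", "score_2",
    "dist_exclude_up", "dist_exclude_down", "distance_between_sgRNAs",
    "paired_score", "mask_score", "oligo",
]
INT_COLUMNS = ("start_1", "end_1", "start_2", "end_2", "dist_exclude_up",
               "dist_exclude_down", "distance_between_sgRNAs", "mask_score")
FLOAT_COLUMNS = ("score_1", "score_2", "paired_score")


def parse_design_row(line):
    """Split one DESIGN row into a dict with typed values."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d"
                         % (len(COLUMNS), len(values)))
    row = dict(zip(COLUMNS, values))
    for key in INT_COLUMNS:
        row[key] = int(row[key])
    for key in FLOAT_COLUMNS:
        row[key] = float(row[key])
    return row


row = parse_design_row(
    "region1(1) chr1 29557261 29557284 GCTTGTCTATGGGCACCACGGGG 0.944 "
    "chr1 29557844 29557867 CGTGTACTCTCCTCAGTGTAGGG 0.572 "
    "-69 290 560 1.516 2 ."
)

# Consistency checks that hold for the example rows shown above.
score_ok = abs(row["score_1"] + row["score_2"] - row["paired_score"]) < 1e-6
distance_ok = row["start_2"] - row["end_1"] == row["distance_between_sgRNAs"]
print(score_ok, distance_ok)
```

The score check applies only when scores are combined by addition, as in this example; a run using the "product" method would need the corresponding product check instead.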
CRISPETA.py -> Main script.
crispeta_mysql.py -> Module to load off-target information to MySQL database.
config.py -> Options and values for MySQL configuration.
func.py -> necessary functions for CRISPETA.py and crispeta_mysql.py to work.
README.md -> Markdown file with information about CRISPETA.
- Setup conda environment.
You need to install the Anaconda Python distribution and adjust the conda path to your setup (the default is shown).
. ~/anaconda3/etc/profile.d/conda.sh
conda activate
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -n crispeta --yes python=2 numpy biopython plotly chart-studio mysql-python pdfkit bedtools
- The crispeta executable is located in /data/projects/p283_rna_and_disease/projects/CRISPETa_data/CRISPETA. Enter this directory before running the commands.
- Testing environment
- Log in to the cluster.
- Open an interactive session:
srun --pty --time=360 --cpus-per-task=1 --mem=5G /bin/bash
- Run with a minimal file:
cd /data/projects/p283_rna_and_disease/projects/CRISPETa_data/CRISPETA
. ~/anaconda3/etc/profile.d/conda.sh
conda activate crispeta
python CRISPETA.py --help
python CRISPETA.py -i /data/projects/p283_rna_and_disease/projects/CRISPETa_data/test/example.bed -g /data/projects/p283_rna_and_disease/projects/CRISPETa_data/hg19_masked.fa -o /data/projects/p283_rna_and_disease/projects/CRISPETa_data/test/results
Results should be at /data/projects/p283_rna_and_disease/projects/CRISPETa_data/test/results*.