See the included Jupyter notebook.
- python3
- jupyter (for notebook integration)
- numpy
- matplotlib
- pandas
- biopython
- gffutils
- pybedtools
- pysam
- openpyxl (for XLSX file generation)
- bedtools v2.26.0
- NCBI BLAST 2.7.1
- ViennaRNA Package 2.4.3
Quick installation with pipenv
To satisfy the Python dependencies, install pipenv using pip or your preferred package manager. Then run
pipenv install
from the program directory. This installs all the required modules with their dependencies, as specified in the provided Pipfile and Pipfile.lock, into a new virtual environment.
Quick installation with conda
For jupyter and the non-Python dependencies, run
conda install -c defaults -c conda-forge -c bioconda jupyter bedtools blast=2.7.1 viennarna
An Ensembl GTF annotation file, a soft-masked genomic sequence FASTA file and a GVF variation file are required, unless you work only on user input (without any database sequence lookup).
The Ensembl cDNA FASTA file and ncRNA FASTA file are required for BLAST database searches.
Download the required files from the Ensembl FTP site.
gene2csm.py is a tool for efficient design of crRNA guide sequences for RNA-mediated RNA processing enzymes. It is designed so that the oligonucleotide satisfies the following criteria:
- Specificity (within target organism)
  Every sequence is searched with BLAST [1] against a database of coding and non-coding RNA sequences of the target organism, to exclude off-target effects on highly homologous sequences.
- Isoform prevalence
  The candidate sequence should lie in the region covered by the maximal number of annotated isoforms, for a complete target knock-down.
- Absence of alternative, annotated start codons downstream of the cut site
  The candidate sequence should not have any annotated start codons downstream of the target site on the protein-coding isoforms.
- Avoid exon-exon boundaries
  The sequence should not contain any annotated splice sites.
- Not spanning known SNP sites
  The sequence should not contain any variable nucleotides listed in the dbSNP database and Ensembl resources.
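The two positional criteria above (no splice sites, no SNP sites) both come down to checking whether a candidate window contains a forbidden position. The following is a minimal sketch of such a check, assuming 0-based, half-open windows; the function names and the example coordinates are hypothetical and are not taken from gene2csm.py.

```python
from bisect import bisect_left


def window_is_clean(start, end, forbidden_sorted):
    """Return True if no forbidden position falls inside [start, end).

    forbidden_sorted is a sorted list of 0-based positions (for example
    SNP sites or splice sites); the name and layout are illustrative only.
    """
    i = bisect_left(forbidden_sorted, start)  # index of first position >= start
    return i == len(forbidden_sorted) or forbidden_sorted[i] >= end


def clean_windows(region_start, region_end, length, forbidden):
    """Yield every (start, end) window of the given length inside
    [region_start, region_end) that avoids all forbidden positions."""
    forbidden_sorted = sorted(forbidden)
    for start in range(region_start, region_end - length + 1):
        end = start + length
        if window_is_clean(start, end, forbidden_sorted):
            yield start, end


# Hypothetical example: a 12 nt region with a single SNP at position 104.
windows = list(clean_windows(100, 112, 4, {104}))
```

In this example every window that covers position 104 is dropped, so only the window ending at 104 and the windows starting at 105 or later remain.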
Other features taken into consideration:
- No low complexity regions
  The sequence should not contain any interspersed repeats or low complexity sequences masked by RepeatMasker.
- Balanced GC content
  The sequence should have a GC content within the given limits.
- No target mRNA secondary structures within the binding region
  The target sequence should avoid highly structured regions of the transcript, to ensure the highest accessibility of the RNA strand. The RNA secondary structure modeling is performed with the ViennaRNA package [2].
- No self-complementarity
  The candidate sequence should not form stable homodimeric duplexes.
- No crRNA secondary structures
  The sequence should not contain any local secondary structures.
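The first two criteria in this list can be checked on the sequence alone. The sketch below assumes that the target site was extracted from a soft-masked genome, where masked bases are written in lowercase. The function names and the GC limits are placeholders and are not the values used by gene2csm.py. The structure-related criteria need ViennaRNA and are not covered here.

```python
def gc_fraction(seq):
    """Fraction of G/C bases in seq (case-insensitive); 0.0 for empty input."""
    if not seq:
        return 0.0
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def passes_sequence_filters(target, gc_min=0.3, gc_max=0.7):
    """Check a target-site sequence taken from a soft-masked genome.

    Lowercase letters mark soft-masked (repeat / low complexity) bases.
    gc_min and gc_max are placeholder limits, not gene2csm.py defaults.
    """
    if any(base.islower() for base in target):
        return False  # the site overlaps a masked region
    return gc_min <= gc_fraction(target) <= gc_max
```

For example, `passes_sequence_filters("ATGCATGC")` accepts the site, while `passes_sequence_filters("ATGCatgc")` rejects it because half of it is soft-masked.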
The output table is sorted by the score column and contains the 50 best-scoring crRNA sequences, characterized as follows:
- 1st column contains a unique index number; the numbering follows the genomic start position of the crRNA in descending order, starting from 0;
- seq column contains the sequence of the putative crRNA, reverse-complementary to the chosen target transcript;
- GC column contains the %GC content of the crRNA;
- ent column contains the mean positional entropy value of the target mRNA sequence in the binding position of the crRNA; it describes the structural well-definedness of the region; the higher the better;
- dG_AA column contains the change in Gibbs free energy of the homodimer duplex formed by two crRNA oligonucleotides; the higher (less negative) the better, as the homodimer complex is less stable;
- G_A column contains the Gibbs free energy of the monomeric crRNA; the higher (less negative) the better, as the monomer is less structured;
- bitscore column contains the bitscore value of the best alignment of the sequence to the sequences from the database, as defined by the BLAST algorithm; the lower the better;
- nident column contains the number of identical matching nucleotides in the best-scoring BLAST alignment of the crRNA to the sequences from the database; the lower the better;
- chr column contains the name of the chromosome the target locus is on;
- start column contains the start position of the crRNA on the chromosome; 0-based;
- end column contains the end position of the crRNA on the chromosome; non-inclusive;
- score column contains the cumulative rank score calculated from the entropy value and the bitscore value of the sequence only; no other characteristic is taken into consideration; the lower the better.
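The text above states only that the score is a cumulative rank score built from the entropy and the bitscore. As an illustration, the sketch below assumes that each candidate is ranked separately by entropy (higher is better) and by bitscore (lower is better), and that the two ranks are summed. The ranking scheme, the function names and the example values are assumptions and may differ from what gene2csm.py actually does.

```python
def dense_ranks(values, descending=False):
    """Return 0-based dense ranks; equal values share the same rank."""
    ordered = sorted(set(values), reverse=descending)
    position = {value: i for i, value in enumerate(ordered)}
    return [position[value] for value in values]


def rank_scores(candidates):
    """Sum the entropy rank and the bitscore rank for each candidate.

    candidates is a list of dicts with 'ent' and 'bitscore' keys.
    A lower combined score is better.
    """
    # Higher entropy is better, so the highest entropy gets rank 0.
    ent_ranks = dense_ranks([c["ent"] for c in candidates], descending=True)
    # Lower bitscore is better, so the lowest bitscore gets rank 0.
    bit_ranks = dense_ranks([c["bitscore"] for c in candidates])
    return [e + b for e, b in zip(ent_ranks, bit_ranks)]


# Hypothetical candidates; the values are made up for illustration.
candidates = [
    {"id": "A", "ent": 0.9, "bitscore": 30.0},
    {"id": "B", "ent": 0.5, "bitscore": 28.0},
    {"id": "C", "ent": 0.9, "bitscore": 40.0},
]
scores = rank_scores(candidates)
best_first = sorted(zip(scores, (c["id"] for c in candidates)))
```

Sorting by the combined score in ascending order puts the best candidates first, which matches the ordering of the output table.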
References:
- Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
- Ronny Lorenz, Stephan H. Bernhart, Christian Höner zu Siederdissen, Hakim Tafer, Christoph Flamm, Peter F. Stadler, and Ivo L. Hofacker (2011), "ViennaRNA Package 2.0", Algorithms for Molecular Biology 6(1):26, doi:10.1186/1748-7188-6-26