/CRIS

Find CRISPR binding sites

Primary LanguagePython

alt text

Find CRISPR/Cas9 binding sites in a genbank file.

To get help:

python CRIS.py -h
usage: CRIS.py [-h] -s SEQ_INFILE [-l LENGTH_TARGET_SEQ]
               [-t THREE_PRIME_CLAMP] [-p PAM_SEQ] [-q FEATURE_QUALIFIER] [-v]
               [-n | -o]

Find CRISPR/Cas9 target sites in a genbank file. Accepts files with multiple
genbank records per file. Author: dr.mark.schultz@gmail.com. Acknowledgements:
Torsten Seemann, Ian Monk, Timothy P. Stinear.

optional arguments:
  -h, --help            show this help message and exit
  -s SEQ_INFILE, --seq_infile SEQ_INFILE
                        DNA seqs (contigs) to scan (Genbank format)
  -l LENGTH_TARGET_SEQ, --length_target_seq LENGTH_TARGET_SEQ
                        Set total length of target sequence NOT including the
                        PAM sequence. Default=20.
  -t THREE_PRIME_CLAMP, --three_prime_clamp THREE_PRIME_CLAMP
                        At the 3' end, how long do you want the clamp sequence
                        to be? Default=12.
  -p PAM_SEQ, --PAM_seq PAM_SEQ
                        Protospacer Adjacent Motif (PAM). Depends on Cas9
                        species. Default='NGG'.
  -q FEATURE_QUALIFIER, --feature_qualifier FEATURE_QUALIFIER
                        Genbank feature qualifier in which to find target
                        sites. Could be 'gene', 'CDS', 'mRNA' etc. Case-
                        sensitive, exact spelling required. Default='gene'.
  -v, --verbose         Verbose on. Default=False
  -n, --no_overwrite    Do not overwrite output file if it exists.
  -o, --overwrite       Overwrite output file if it exists, otherwise write
                        new. Default is to overwrite.

Example usage:

python CRIS.py -s test_multigbk.gbk

##Explanation of CRIS.py Reads in multi-record genbank file.

User sets up the PAM sequence, target length and 3'-clamp length and/or accepts the defaults.

For each record, searches through the features of type requested on command line (e.g., gene, CDS or mRNA) and finds all full length CRISPR/Cas9 target sequences matching the RegEx.

Within the feature, CRIS.py assesses for each of the full length CRISPR/Cas9 sequences whether the sequence at the 3'-clamp (specified on command line) is unique throughout the genome.

After finding unique hits, CRIS.py assesses whether each of the full length CRISPR/Cas9 target sequences overlaps with other features of the requested type. If a sequence is unique AND does not overlap other features, it is stored in the candidate list. Within the candidate list, the GC content of each CRISPR/Cas9 target sequences is calculated. Sequences with a GC content equal to the maximum GC content in the candidate set are retained. If only one hit, this is retained as the 'best' match. If more than one is retained, the CRISPR/Cas9 target sequence closest to the 5' end of the gene is selected as the 'best' match. The best match is reported. A summary is printed now and at the end of the run. In verbose mode, lots of statements are printed as the run progresses.

The output of the run is a copy of the input genbank file with all the best hits marked up in the file. This annotated genbank can be viewed in Artemis etc.