This is a Python pipeline and CLI for identifying novel orthologs of CRISPR-Cas effectors from publicly available metagenome data.
Given a set of known protein sequences, it:
- queries local metagenomic BLAST databases for similar ORFs using TBLASTN 🧬
- searches significant contigs for putative CRISPR arrays using CRISPRFinder [1] 🔍
- ranks, cleans, and summarizes the results for subsequent synthesis and experimental characterization. 🧪
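The first and last steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `build_tblastn_command` and `parse_hits`, the `Hit` type, and the default E-value cutoff are all hypothetical. It assumes TBLASTN is run with BLAST+ tabular output (`-outfmt 6`), whose default columns are listed in `BLAST_COLUMNS`.

```python
import csv
import io
from typing import List, NamedTuple

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


class Hit(NamedTuple):
    """Hypothetical record for one TBLASTN hit against a metagenomic contig."""
    query: str
    contig: str
    evalue: float
    bitscore: float


def build_tblastn_command(query_fasta: str, db_path: str,
                          evalue: float = 1e-5, threads: int = 1) -> List[str]:
    """Build an argument list suitable for subprocess.run (assumed defaults)."""
    return [
        "tblastn",
        "-query", query_fasta,
        "-db", db_path,
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-num_threads", str(threads),
    ]


def parse_hits(tabular_text: str, max_evalue: float = 1e-5) -> List[Hit]:
    """Keep hits at or below max_evalue, ranked by descending bit score."""
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=BLAST_COLUMNS, delimiter="\t")
    hits = []
    for row in reader:
        evalue = float(row["evalue"])
        if evalue <= max_evalue:
            hits.append(Hit(row["qseqid"], row["sseqid"],
                            evalue, float(row["bitscore"])))
    return sorted(hits, key=lambda h: h.bitscore, reverse=True)
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)` and the captured standard output fed to `parse_hits`; the contigs in the returned hits are the ones that would then be searched for CRISPR arrays.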
In addition, it includes helper functions for:
- sorting, formatting, and preprocessing raw sequence data into searchable BLAST databases
- efficiently deduplicating sequences
- multithreading
- logging output
- basic job scheduling, including customizable SMS alerts whenever a run finishes or fails.
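As an illustration of the deduplication helper, one common approach is to track a fixed-size hash of each sequence instead of the sequence itself, which keeps memory use low on large inputs. The sketch below is an assumption about how this could be done, not a description of this project's implementation; the `deduplicate` name and the `(header, sequence)` record format are hypothetical.

```python
import hashlib
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (header, sequence); assumed record format


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Return records with unique sequences, keeping the first occurrence.

    Sequences are compared case-insensitively via a SHA-1 digest, so only
    20 bytes per distinct sequence are held in the lookup set.
    """
    seen = set()
    unique = []
    for header, sequence in records:
        digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append((header, sequence))
    return unique
```

For example, `deduplicate([("a", "ATGC"), ("b", "atgc"), ("c", "GGCC")])` keeps the records with headers `a` and `c`.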
Developed during my Summer 2019 research in the Hsu Lab at the Salk Institute.
1. Grissa I, Vergnaud G, Pourcel C. CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Res. 2007;35(Web Server issue):W52-7. https://doi.org/10.1093/nar/gkm360