/CRISPR-search

Automated pipeline for mining metagenomes for new CRISPR systems.

Primary language: Perl. License: GNU General Public License v3.0 (GPL-3.0).

Mining metagenomes for new CRISPR effectors

This is a Python pipeline and CLI for identifying novel orthologs of CRISPR-Cas effectors from publicly available metagenome data.

Given a set of known protein sequences, it:

  1. queries local metagenomic BLAST databases for similar ORFs using TBLASTN 🧬
  2. searches significant contigs for putative CRISPR arrays using CRISPRFinder [1] 🔍
  3. ranks, cleans, and summarizes the results for subsequent synthesis and experimental characterization. 🧪
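
As a rough illustration of steps 1 and 3, the sketch below builds a TBLASTN call, parses its default tabular output (`-outfmt 6`), and ranks contigs by their best hit. It is not the pipeline's actual code: the names `Hit`, `tblastn_command`, `parse_hits`, `best_hit_per_contig`, and `search`, the default e-value cutoff, and the choice of bit score as the ranking key are all assumptions made for this example. Running `search` requires BLAST+ on the `PATH`.

```python
import subprocess
from typing import NamedTuple


class Hit(NamedTuple):
    """One TBLASTN alignment (hypothetical record type for this sketch)."""
    query: str
    contig: str
    identity: float
    evalue: float
    bitscore: float


def tblastn_command(query_fasta, db, evalue=1e-5, threads=4):
    """Build a TBLASTN command that prints tabular (outfmt 6) hits to stdout."""
    return [
        "tblastn",
        "-query", query_fasta,
        "-db", db,
        "-evalue", str(evalue),   # assumed cutoff, not taken from the pipeline
        "-outfmt", "6",
        "-num_threads", str(threads),
    ]


def parse_hits(tabular_text):
    """Parse default outfmt 6 rows (12 tab-separated columns) into Hit records."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        # Default columns: qseqid sseqid pident ... evalue (10) bitscore (11)
        hits.append(Hit(cols[0], cols[1], float(cols[2]),
                        float(cols[10]), float(cols[11])))
    return hits


def best_hit_per_contig(hits):
    """Keep the highest-bitscore hit per contig, ordered best first."""
    best = {}
    for hit in hits:
        if hit.contig not in best or hit.bitscore > best[hit.contig].bitscore:
            best[hit.contig] = hit
    return sorted(best.values(), key=lambda h: h.bitscore, reverse=True)


def search(query_fasta, db):
    """Run TBLASTN and return one ranked hit per contig."""
    result = subprocess.run(tblastn_command(query_fasta, db),
                            capture_output=True, text=True, check=True)
    return best_hit_per_contig(parse_hits(result.stdout))
```

The contigs returned by `search` would then be the candidates passed to CRISPRFinder in step 2.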

In addition, it includes helper functions for:

  • sorting, formatting, and preprocessing raw sequence data into searchable BLAST databases
  • efficiently deduplicating sequences
  • multithreading
  • logging output
  • basic job scheduling, including customizable SMS alerts whenever a run finishes or fails.
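
One common way to deduplicate sequences efficiently is to store a fixed-size hash of each sequence rather than the sequence itself, so memory use stays small even for long contigs. The sketch below shows that idea only; it is not the helper shipped with this pipeline, and `read_fasta` and `deduplicate` are hypothetical names. It assumes that sequences differing only in letter case count as duplicates.

```python
import hashlib


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def deduplicate(records):
    """Yield the first record seen for each distinct sequence."""
    seen = set()
    for header, seq in records:
        # Store a 32-byte digest instead of the full sequence.
        digest = hashlib.sha256(seq.upper().encode("ascii")).digest()
        if digest not in seen:
            seen.add(digest)
            yield header, seq
```

Both functions are generators, so a large FASTA file can be streamed through them, for example `deduplicate(read_fasta(open(path)))`, without loading every record into memory at once.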

Developed during my Summer 2019 research in the Hsu Lab at the Salk Institute.

References

1. Grissa I, Vergnaud G, Pourcel C. CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Res. 2007;35(Web Server issue):W52-7. https://doi.org/10.1093/nar/gkm360