AcrFinder, a tool for automated identification of Acr-Aca loci

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

AcrFinder

(c) Yin Lab@UNL 2019

Contents:

I. Installation / Dependencies

II. About

III. Using AcrFinder

IV. Docker Support

V. Examples

VI. Workflow

VII. FAQ


I. Installation / Dependencies

Dependencies

Clone or download the repository. Some dependencies are included in the dependencies/ directory. The program expects these versions; using other versions can result in unexpected behavior.

CRISPRCasFinder - Already in dependencies/ directory. To use CRISPRCasFinder on your machine, make sure you run its install script. The manual can be found here. Running the install script will set up paths for all the dependencies of CRISPRCasFinder.

Forgetting to install CRISPRCasFinder is a common problem, so confirm that CRISPRCasFinder runs properly before executing acr_aca_cri_runner.py.

blastn - acr_aca_cri_runner.py will call/use blastn to search a genome. Install blastn from NCBI.

psiblast+ - Used with CDD to find mobilome proteins. Install psiblast+ from NCBI.

blastp - Used with the prophage database to find prophage. Install blastp from NCBI.

python3 - For all scripts with .py extension. Use any version at or above 3.4.

PyGornism - Already in dependencies/ directory. Used to parse organism files and generate organism files in certain formats.
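
Before a first run, it can help to confirm that the external tools listed above are reachable on PATH. The following is a minimal sketch of such a check; the executable names in REQUIRED_TOOLS are assumptions based on the list above and may differ on your system, and find_missing_tools is a hypothetical helper, not part of AcrFinder.

```python
import shutil

# Assumed executable names, taken from the dependency list above.
# Adjust them to match how the tools are installed on your machine.
REQUIRED_TOOLS = ["blastn", "psiblast", "blastp", "python3"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be located on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that the executables exist. It does not verify their versions, and it does not replace running the CRISPRCasFinder install script.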

Database Preparation

After cloning the repository, there are three databases to install.

Prophage

cd dependencies/prophage && makeblastdb -in prophage_virus.db -dbtype prot -out prophage

CDD-MGE

cd dependencies/ && tar -xzf cdd-mge.tar.gz && rm cdd-mge.tar.gz

CDD

mkdir -p dependencies/cdd
cd dependencies/cdd && wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz && tar -xzf cdd.tar.gz && rm cdd.tar.gz
makeprofiledb -title CDD.v.3.12 -in Cdd.pn -out Cdd -threshold 9.82 -scale 100.0 -dbtype rps -index true

II. About

AcrFinder is a tool used to identify Anti-CRISPR proteins (Acr) using both sequence homology and guilt-by-association approaches.

This README covers only the Python scripts found in the current directory. These are the scripts used to identify genomic loci that contain Acr and/or Aca homologs.

To find out how to use the other dependencies, consult their online documentation:

CRISPRCasFinder - https://crisprcas.i2bc.paris-saclay.fr/CrisprCasFinder/Index

CRISPRCasFinder is used to identify CRISPR-Cas systems. Its results are then used to classify the genomic loci that contain Acr and/or Aca homologs. If no CRISPR-Cas systems are found in a genome, only the homology-based search for Acr homologs is performed.


III. Using AcrFinder

Input

AcrFinder takes .fna, .gff and .faa files as input. Supplying only the .fna file is also acceptable; in that case, the .gff and .faa files are generated by running Prodigal.

List of Options

| Option | Alternative | Purpose |
| --- | --- | --- |
| -h | --help | Shows all available options |
| -n | --inFNA | Required. Path to the fna file |
| -f | --inGFF | Path to gff file to use/parse. Required unless only the fna file is supplied |
| -a | --inFAA | Path to faa file to use/parse. Required unless only the fna file is supplied |
| -m | --aaThresh | Max size of a protein in order to be considered Aca/Acr (aa) {default = 200} [integer] |
| -d | --distThresh | Max intergenic distance between proteins (bp) {default = 150} [integer] |
| -r | --minProteins | Min number of proteins needed per locus {default = 2} [integer] |
| -y | --arrayEvidence | Minimum evidence level a CRISPR spacer needs in order to be used {default = 3} [integer] |
| -o | --outDir | Path to output directory to store results in. If the directory does not exist, the program will attempt to create it at the given path |
| -t | --aca | Known Aca file (.faa) used with diamond to search for candidate Aca proteins in candidate Acr-Aca loci |
| -u | --acr | Known Acr file (.faa) used with diamond to search for Acr homologs |
| -z | --genomeType | How to treat the genome. There are three options: Virus, Bacteria and Archaea. Virus will not run CRISPRCasFinder (note: when Virus is selected, also set -c 0 so that no MGE search is run for the virus), Archaea will run CRISPRCasFinder with a special Archaea flag (-ArchaCas), Bacteria will use CRISPRCasFinder without the Archaea flag {default = V} [string] |
| -e | --proteinUpDown | Number of surrounding (up- and down-stream) proteins to use when gathering a neighborhood {default = 10} [integer] |
| -c | --minCDDProteins | Minimum number of proteins in the neighborhood that must have a CDD mobilome hit so the Acr/Aca locus can be attributed to a CDD hit {default = 1} [integer] |
| -g | --gi | Uses IslandViewer (GI) database. {default = false} [boolean] |
| -p | --prophage | Uses PHASTER (prophage) database. {default = false} [boolean] |
| -s | --strict | All proteins in the locus must lie within a region found in the DB(s) being used {default = false} [boolean] |
| -l | --lax | Only one protein must lie within a region found in the DB(s) being used {default = true} [boolean] |
| --blsType | None | Which BLAST type to use when searching for mobile genetic elements (MGE). {default = blastp} Possible choices: blastp or rpsblast |
| --identity | None | The --id (identity) parameter for the diamond search {default=30} [integer] |
| --coverage | None | The --query-cover parameter for the diamond search {default=0.8} [float] |
| --e_value | None | The -e (e-value) parameter for the diamond search {default=0.01} [float] |
| --blast_slack | None | How far (bp) an Acr/Aca locus is allowed to be from a blastn hit to be considered high confidence {default=5000} |
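
The -m, -d and -r thresholds can be read as a filter over runs of neighboring genes. The following is a minimal sketch of that reading, not AcrFinder's actual implementation: the Protein class, its fields and the candidate_loci function are hypothetical, the thresholds are treated as inclusive by assumption, and strand and contig boundaries are ignored.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    name: str
    start: int      # start coordinate of the gene (bp)
    end: int        # end coordinate of the gene (bp)
    length_aa: int  # protein length in amino acids


def candidate_loci(proteins, aa_thresh=200, dist_thresh=150, min_proteins=2):
    """Group consecutive short, closely spaced proteins into candidate loci."""
    loci, current = [], []
    for protein in sorted(proteins, key=lambda p: p.start):
        short_enough = protein.length_aa <= aa_thresh
        # Intergenic distance to the previous protein in the current run.
        close_enough = not current or protein.start - current[-1].end <= dist_thresh
        if short_enough and close_enough:
            current.append(protein)
            continue
        # The run is broken: keep it only if it has enough proteins.
        if len(current) >= min_proteins:
            loci.append(current)
        # A short protein that was merely too far away starts a new run.
        current = [protein] if short_enough else []
    if len(current) >= min_proteins:
        loci.append(current)
    return loci
```

With the defaults, two 100 aa proteins separated by 100 bp would form one candidate locus, while a 380 aa protein would end the current run and would not be part of any locus.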

Output

Classification

There are three levels of classification in output:

| Classification | Meaning |
| --- | --- |
| Low Confidence | If the genome containing this Acr-Aca locus has a CRISPR-Cas locus but no self-targeting spacers, the Acr-Aca locus is labeled as “low confidence” and inferred to target the CRISPR-Cas locus. |
| Medium Confidence | If this Acr-Aca locus has a self-targeting spacer target in the genome but not nearby, it is labeled as “medium confidence” and inferred to target the CRISPR-Cas locus with the self-targeting spacer. "Nearby" means within 5,000 bp. |
| High Confidence | If this Acr-Aca locus has a nearby self-targeting spacer target, it is labeled as “high confidence” and inferred to target the CRISPR-Cas locus with the self-targeting spacer. |
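
The three levels can be summarized as a small decision rule. The sketch below is an illustration of the table above under stated assumptions, not the code AcrFinder runs: classify_locus and its parameters are hypothetical, each self-targeting spacer target is reduced to a single coordinate on the same sequence as the locus, and a genome without any CRISPR-Cas locus returns None because only the homology-based search applies in that case.

```python
def classify_locus(locus_start, locus_end, has_crispr_cas, target_positions, slack=5000):
    """Return the confidence label for one Acr-Aca locus, or None.

    target_positions: coordinates (bp) of self-targeting spacer targets.
    slack: how far (bp) a target may be from the locus to count as nearby.
    """
    if not has_crispr_cas:
        return None  # no CRISPR-Cas system: no confidence level assigned
    if not target_positions:
        return "Low Confidence"
    nearby = any(
        locus_start - slack <= pos <= locus_end + slack
        for pos in target_positions
    )
    return "High Confidence" if nearby else "Medium Confidence"
```

For a locus spanning 10,000 to 11,000 bp, a target at 14,000 bp falls within the default 5,000 bp window, while a target at 50,000 bp does not.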

Output files

| Name | Meaning |
| --- | --- |
| <output_dir>/CRISPRCas_OUTPUT | The output folder of CRISPRCasFinder |
| <output_dir>/subjects | The folder that contains the input files |
| <output_dir>/intermediates | The folder that contains intermediate result files |
| <output_dir>/intermediates/blast_out.txt | Results from blast+ |
| <output_dir>/<organism_id>_guilt-by-association.out | The final set of Acr/Aca regions that passed the initial filters as well as the CDD mobilome and prophage/gi filters. |
| <output_dir>/<organism_id>_homology_based.out | The final set of proteins that have similarity to proteins in the Acr database under the given similarity threshold. |
| <output_dir>/intermediates/masked_db/ | The directory containing the db (fna with CRISPR array regions masked) used as the database for the blastn search for self-targeting spacer matches |
| <output_dir>/intermediates/spacers_with_desired_evidence.fna | CRISPR spacers extracted from CRISPRCasFinder results that have the desired evidence level. Used as the query for the blastn search |
| <output_dir>/intermediates/<organism_id>_candidate_acr_aca.txt | Potential Acr/Aca regions that passed initial filters. |
| <output_dir>/intermediates/<organism_id>_candidate_acr_aca.faa | Potential Acr/Aca regions in faa format. |
| <output_dir>/intermediates/<organism_id>_candidate_acr_aca_neighborhood.faa | An extension of the previous file that also includes the neighboring proteins of the potential Acr/Aca. Used as the query for the blastp search against prophage. |
| <output_dir>/intermediates/<organism_id>_candidate_acr_aca_{blastp/rpsblast}_results.txt | Result file from blastp against the prophage database or rpsblast against the cdd-mge database. |
| <output_dir>/intermediates/<organism_id>_candidate_acr_aca_diamond_result.txt | Results of diamond. These are search results with the Aca database as the query and <output_dir>/intermediates/<organism_id>_candidate_acr_aca.faa as the database. |
| <output_dir>/intermediates/<organism_id>_candidate_acr_homolog_result.txt | Results of diamond. These are search results with the Acr database as the query and <output_dir>/intermediates/<organism_id>_candidate_acr_aca.faa as the database. |
| <output_dir>/intermediates/<organism_id>_candidate_acr_aca_diamond_database.dmnd | Diamond database made from the <organism_id>_candidate_acr_aca.faa file. |
| <output_dir>/intermediates/<organism_id>_acr_homolog_result.txt | Results of diamond. These are search results with the Acr database as the query and <output_dir>/subjects/<organism_id>_protein.faa as the database. |
| <output_dir>/intermediates/<organism_id>_acr_homolog_result.fasta | Protein sequence file (.faa) of the proteins in <output_dir>/intermediates/<organism_id>_acr_homolog_result.txt |
| <output_dir>/intermediates/<organism_id>_acr_diamond_database.dmnd | Diamond database made from the <output_dir>/subjects/<organism_id>_protein.faa file |
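
When scripting around AcrFinder, the naming scheme above can be turned into paths programmatically. The following is a minimal sketch covering a few of the listed files; expected_outputs and missing_outputs are hypothetical helpers, and whether every file is produced on every run (for example when no CRISPR-Cas system is found) is not guaranteed here.

```python
from pathlib import Path


def expected_outputs(output_dir, organism_id):
    """Map short labels to some of the output paths listed in the table."""
    out = Path(output_dir)
    inter = out / "intermediates"
    return {
        "guilt_by_association": out / f"{organism_id}_guilt-by-association.out",
        "homology_based": out / f"{organism_id}_homology_based.out",
        "candidate_loci": inter / f"{organism_id}_candidate_acr_aca.txt",
        "blast_out": inter / "blast_out.txt",
    }


def missing_outputs(output_dir, organism_id):
    """Return the labels of listed output files that do not exist on disk."""
    paths = expected_outputs(output_dir, organism_id)
    return sorted(label for label, path in paths.items() if not path.exists())
```

For example, missing_outputs("output_dir", "GCF_000210795.2") would list which of these four files are absent after a run.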

IV. Docker Support

To make the environment easier to configure, we provide a Dockerfile. Build the image with the commands below, where [tag name] is any tag name you choose:

git clone https://github.com/haidyi/acrfinder.git
cd acrfinder
docker build -t [tag name] .

If you don't want to build the image yourself, AcrFinder is also available on Docker Hub. You can pull the AcrFinder image directly with the command:

docker pull [OPTIONS] haidyi/acrfinder:latest

V. Examples

python3 acr_aca_cri_runner.py -n sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.fna -f sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.gff -a sample_organisms/GCF_000210795.2/GCF_000210795.2_protein.faa -o [output_dir] -z B -c 2 -p true -g true

Alternatively, you can supply only the .fna file as input:

python3 acr_aca_cri_runner.py -n sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.fna -o [output_dir] -z B -c 2 -p true -g true
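
For batch runs, the same command lines can be assembled from Python. This is a minimal sketch, assuming it is executed from the directory that contains acr_aca_cri_runner.py; build_command is a hypothetical helper, and it only uses flags documented in the option list above.

```python
import shlex


def build_command(fna, gff=None, faa=None, out_dir=None, genome_type="B", extra=None):
    """Assemble an acr_aca_cri_runner.py command as an argument list."""
    cmd = ["python3", "acr_aca_cri_runner.py", "-n", fna]
    # The gff and faa files are passed together; with only the fna file,
    # AcrFinder generates them by running Prodigal.
    if gff and faa:
        cmd += ["-f", gff, "-a", faa]
    if out_dir:
        cmd += ["-o", out_dir]
    cmd += ["-z", genome_type]
    # Any further documented options, e.g. {"-c": 2, "-p": "true"}.
    for flag, value in (extra or {}).items():
        cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    command = build_command(
        "sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.fna",
        out_dir="output_dir",
        extra={"-c": 2, "-p": "true", "-g": "true"},
    )
    # Print the command for inspection; pass the list to
    # subprocess.run(command, check=True) to actually execute it.
    print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.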

Run the container

First, make sure the Docker image has been pulled from Docker Hub or built locally. AcrFinder is located in the working directory of the container.

Interactive Usage

docker run [OPTIONS] [NAME:TAG] /bin/bash
python3 acr_aca_cri_runner.py -n sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.fna -f sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.gff -a sample_organisms/GCF_000210795.2/GCF_000210795.2_protein.faa -o [output_dir] -z B -c 2 -p true -g true

Use own sequence

To analyze your own sequences, use Docker's -v flag to mount a host directory into the container.

For example, if the directory ~/GCF_000210795.2 contains the .fna, .gff and .faa files for GCF_000210795.2, you can run AcrFinder on it with the command below:

docker run --rm -it -v ~/GCF_000210795.2:/app/acrfinder/GCF_000210795.2 haidyi/acrfinder:latest python3 acr_aca_cri_runner.py -n GCF_000210795.2/GCF_000210795.2_genomic.fna -f GCF_000210795.2/GCF_000210795.2_genomic.gff -a GCF_000210795.2/GCF_000210795.2_protein.faa -o GCF_000210795.2/output_dir -z B -c 2 -p true -g true

Then, you will see the output result in ~/GCF_000210795.2/output_dir.

For more information about how to use docker, you can refer to https://docs.docker.com.


VI. Workflow of AcrFinder


VII. FAQ

Q) I ran acr_aca_cri_runner.py and got errors that pertain to CRISPR/Cas. What's the issue?

A) Make sure CRISPRCasFinder is installed properly. CRISPRCasFinder has many dependencies of its own and will only work if they are all installed correctly. A good indicator of a correctly installed CRISPRCasFinder is the following terminal output:

################################################################
# --> Welcome to dependencies/CRISPRCasFinder/CRISPRCasFinder.pl (version 4.2.17)
################################################################


vmatch2 is...............OK
mkvtree2 is...............OK
vsubseqselect2 is...............OK
fuzznuc (from emboss) is...............OK
needle (from emboss) is...............OK
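
If you capture this banner in a log, the status lines can be checked automatically. The sketch below parses lines of the form shown above; parse_dependency_report and failed_tools are hypothetical helpers, and treating any status other than OK as a failure is an assumption, since the output for a broken installation is not shown here.

```python
import re

# Matches lines such as "vmatch2 is...............OK".
STATUS_LINE = re.compile(r"^(?P<tool>.+?) is\.+(?P<status>\S+)\s*$")


def parse_dependency_report(text):
    """Return a {tool: status} mapping for each status line found in text."""
    report = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            report[match.group("tool")] = match.group("status")
    return report


def failed_tools(text):
    """Return the tools whose reported status is anything other than OK."""
    return [tool for tool, status in parse_dependency_report(text).items() if status != "OK"]
```

An empty list from failed_tools, together with a non-empty report, suggests that the dependencies listed in the banner were all found.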