Tomas Bruna, Alexandre Lomsadze, Mark Borodovsky
Georgia Institute of Technology, Atlanta, Georgia, USA
Reference: GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins
ProtHint is a pipeline for predicting and scoring hints (in the form of introns, start codons and stop codons) in the genome of interest by mapping and spliced-aligning predicted genes to a database of reference protein sequences.
To install, copy the contents of this distribution to the desired location. To verify the installation, run ProtHint with the sample input located in the example folder.
Running ProtHint requires a Linux system with Bash. The following dependencies need to be satisfied.
Perl 5.10 or higher is required.
The following non-Core Perl modules are required:
MCE::Mutex
threads
YAML
Math::Utils
The Core module Thread::Queue needs to be updated to version 3.11 or higher.
These modules are available at CPAN and can be installed/updated with
cpan MCE::Mutex threads YAML Thread::Queue Math::Utils
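Whether the Perl modules listed above can be loaded can be checked with a short Python sketch such as the one below. It relies on the standard `perl -MModule -e 1` idiom, which exits with a non-zero status when a module cannot be loaded. The helper names `perl_module_command` and `missing_perl_modules` are hypothetical and not part of ProtHint, and the sketch checks only that each module is present; it does not verify the Thread::Queue version.

```python
import subprocess

# Perl modules named in the dependency list above.
REQUIRED_MODULES = ["MCE::Mutex", "threads", "YAML", "Math::Utils", "Thread::Queue"]


def perl_module_command(module):
    """Build a command that exits with status 0 only if Perl can load `module`."""
    return ["perl", "-M" + module, "-e", "1"]


def missing_perl_modules(modules=REQUIRED_MODULES):
    """Return the modules that could not be loaded (all of them if perl is absent)."""
    missing = []
    for module in modules:
        try:
            result = subprocess.run(
                perl_module_command(module),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            loaded = result.returncode == 0
        except OSError:
            # The perl executable itself could not be started.
            loaded = False
        if not loaded:
            missing.append(module)
    return missing


if __name__ == "__main__":
    print("Missing Perl modules:", missing_perl_modules() or "none")
```

Any module reported as missing can then be installed with the cpan command shown above.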
Python 3.3 or higher is required. No libraries outside of the Python Standard Library are required.
There are two ways of using GeneMark-ES in ProtHint:
- Run ProtHint with the --geneMarkGtf genemark.gtf option, which specifies the path to a file with GeneMark-ES predictions. If this option is used, GeneMark-ES does not need to be installed as a part of ProtHint.
- Install GeneMark-ES; ProtHint will run it automatically as a part of the pipeline.
Download and extract the contents of the GeneMark-ES suite (versions 4.30 and up) into the ProtHint/dependencies/GeneMarkES folder. The GeneMark-ES suite is available at http://exon.gatech.edu/GeneMark/license_download.cgi
To verify that GeneMark-ES is installed correctly, run the following command:
ProtHint/dependencies/GeneMarkES/check_install.bash
DIAMOND local sequence aligner (available at https://github.com/bbuchfink/diamond) is included in this distribution package.
In case the included version is not working, install DIAMOND from https://github.com/bbuchfink/diamond and replace the diamond binary in the ProtHint/dependencies folder.
Spaln, a space-efficient spliced alignment program (available at https://github.com/ogotoh/spaln), is included in this distribution package.
In case the included version is not working, install Spaln from https://github.com/ogotoh/spaln and replace the spaln binary in the ProtHint/dependencies folder.
A binary for parsing and scoring hints from Spaln's alignment output is included in this distribution package.
In case the included binary is not working, compile it from source at https://github.com/gatech-genemark/spaln-boundary-scorer and replace the spaln_boundary_scorer binary in the ProtHint/dependencies folder.
ProtHint inputs consist of:
- Genomic sequence from the target species in multi-FASTA format
- Reference protein sequences in multi-FASTA format
The tool is applicable to complete as well as draft genome assemblies. Every sequence in each multi-FASTA input needs to have a unique ID (first word of a FASTA header is used for ID). Examples of valid FASTA headers:
>contig10
ID: contig10
> seq3 genome Z
ID: seq3
>IV contig 25
ID: IV
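The ID rule illustrated above can be sketched in Python as follows. This is a minimal illustration, not ProtHint's own parsing code: the function names `fasta_id` and `duplicate_ids` are hypothetical, and the handling of whitespace after `>` is inferred from the second example header.

```python
def fasta_id(header):
    """Return the sequence ID: the first word after '>' in a FASTA header."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    words = header[1:].split()
    if not words:
        raise ValueError("FASTA header has no ID: %r" % header)
    return words[0]


def duplicate_ids(lines):
    """Return IDs that occur more than once among the header lines of `lines`."""
    seen, duplicates = set(), []
    for line in lines:
        if line.startswith(">"):
            seq_id = fasta_id(line)
            if seq_id in seen and seq_id not in duplicates:
                duplicates.append(seq_id)
            seen.add(seq_id)
    return duplicates


if __name__ == "__main__":
    for header in [">contig10", "> seq3 genome Z", ">IV contig 25"]:
        print(header, "->", fasta_id(header))
```

Passing an open file handle to `duplicate_ids` gives a quick check that every sequence in an input file has a unique ID before ProtHint is run.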
We recommend using a relevant portion of the OrthoDB protein database as the source of reference protein sequences.
For example, if your genome of interest is an insect, download arthropoda proteins:
wget https://v100.orthodb.org/download/odb10_arthropoda_fasta.tar.gz
tar xvf odb10_arthropoda_fasta.tar.gz
and concatenate proteins from all species into a single file:
cat arthropoda/Rawdata/* > proteins.fasta
For other genomes of interest, select the most specific OrthoDB section from the list below and repeat the procedure described above.
- Fungi: https://v100.orthodb.org/download/odb10_fungi_fasta.tar.gz
- Metazoa: https://v100.orthodb.org/download/odb10_metazoa_fasta.tar.gz
- Protozoa: https://v100.orthodb.org/download/odb10_protozoa_fasta.tar.gz
- Viridiplantae: https://v100.orthodb.org/download/odb10_plants_fasta.tar.gz
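The concatenation step can also be done in Python, which may be convenient when some of the downloaded files lack a trailing newline (plain cat would then glue a sequence line to the next FASTA header). The sketch below is an illustration only; the function name `concatenate_fasta` is hypothetical, and the output file is assumed not to match the input pattern.

```python
import glob


def concatenate_fasta(pattern, output_path):
    """Concatenate all files matching `pattern` into `output_path`.

    Returns the number of input files that were written.
    """
    paths = sorted(glob.glob(pattern))
    with open(output_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                last = ""
                for line in handle:
                    out.write(line)
                    last = line
                # Keep the next file's first header on its own line.
                if last and not last.endswith("\n"):
                    out.write("\n")
    return len(paths)
```

Calling `concatenate_fasta("arthropoda/Rawdata/*", "proteins.fasta")` mirrors the cat command shown earlier.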
To run ProtHint, use the following command:
prothint.py genome.fasta proteins.fasta
See the example folder for a sample input and output.
To display a list of all available options, use:
prothint.py --help
Frequently used options are:
--workdir WORKDIR Folder for results and temporary files. If not
specified, current directory is used
--geneMarkGtf GENEMARKGTF
File with GeneMark-ES predictions in gtf format. If
this file is provided, GeneMark-ES run is skipped.
--diamondPairs DIAMONDPAIRS
File with "seed gene-protein" hits generated by
DIAMOND. If this file is provided, DIAMOND search for
protein hits is skipped.
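When ProtHint is called from another script, the command line can be assembled programmatically. The following sketch uses only the options described above; the wrapper names `build_prothint_command` and `run_prothint` are hypothetical, and it assumes that prothint.py is executable and reachable through PATH.

```python
import subprocess


def build_prothint_command(genome, proteins, workdir=None,
                           genemark_gtf=None, diamond_pairs=None):
    """Assemble the prothint.py argument list from the options described above."""
    command = ["prothint.py", genome, proteins]
    if workdir is not None:
        command += ["--workdir", workdir]
    if genemark_gtf is not None:
        # Skips the GeneMark-ES run by reusing existing predictions.
        command += ["--geneMarkGtf", genemark_gtf]
    if diamond_pairs is not None:
        # Skips the DIAMOND search by reusing existing seed gene-protein hits.
        command += ["--diamondPairs", diamond_pairs]
    return command


def run_prothint(*args, **kwargs):
    """Run ProtHint and raise CalledProcessError if it exits with an error."""
    subprocess.check_call(build_prothint_command(*args, **kwargs))
```

For example, `run_prothint("genome.fasta", "proteins.fasta", workdir="results", genemark_gtf="genemark.gtf")` would run the pipeline in the results folder without running GeneMark-ES again.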
ProtHint generates two main outputs:
- prothint.gff: GFF file with all reported hints (introns, starts and stops).
- evidence.gff: High-confidence subset of prothint.gff which is, for instance, suitable for the GeneMark-EP Plus mode. This set is generated using default thresholds in the ProtHint/bin/print_high_confidence.py script. If you wish to use different filtering criteria, re-run the print_high_confidence.py script with custom thresholds.
An output which is ready to be used in BRAKER and AUGUSTUS is also generated:
prothint_augustus.gff
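A quick way to inspect any of these output files is to count the records per hint type. The sketch below assumes the usual tab-separated GFF layout with nine columns, where the third column holds the feature type; the function name `count_features` is hypothetical, and no particular feature names are assumed.

```python
from collections import Counter


def count_features(gff_lines):
    """Count records per feature type (third tab-separated GFF column)."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # Skip blank lines and comments.
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # Skip lines that are not complete GFF records.
        counts[fields[2]] += 1
    return counts
```

Opening prothint.gff and passing the file handle to `count_features` returns a Counter that maps each feature type to its number of records. Running the same function on evidence.gff shows how many hints of each type passed the high-confidence filter.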
ProtHint is developed by Tomas Bruna and Alexandre Lomsadze at Dr. Mark Borodovsky's Bioinformatics Lab, Georgia Institute of Technology, Atlanta, USA.