The goal of GeneValidator is to identify problems with gene predictions and to provide useful information based on their similarities to genes in public databases. The results provide evidence on how sequencing curation may be done and can be useful for improving, or trying out new approaches for, gene prediction tools. The tool is aimed mainly at biologists who wish to validate the data produced in their labs.
If you use GeneValidator in your work, please cite us as follows:
"Dragan M, Moghul MI, Priyam A & Wurm Y (in prep.) GeneValidator: identify problematic gene predictions"
GeneValidator currently carries out a number of validations which include:
- Length validation by clusterization (a graph is dynamically produced)
- Length validation by ranking
- Check gene merge (a graph is dynamically produced)
- Check duplications
- Reading frame validation (for nucleotides)
- Main ORF validation (for nucleotides) (a graph is dynamically produced)
- Validation based on multiple alignment (a graph is dynamically produced)
- Ruby (>= 1.9.3)
- NCBI BLAST+ (>= 2.2.25+)
- MAFFT installation (download it from http://mafft.cbrc.jp/alignment/software/).
Linux and MacOS are officially supported!
- Mozilla Firefox: in order to dynamically produce graphs for some of the validations, GeneValidator relies on a dependency called 'd3'. Unfortunately, at the moment, d3 only works in Firefox.
- Type the following command in the terminal
$ gem install GeneValidator
- After installing, GeneValidator can be run by typing the following command in the terminal
USAGE:
$ genevalidator [OPTIONS] INPUT_FILE
ARGUMENTS:
INPUT_FILE: Path to the input FASTA file containing the predicted sequences.
OPTIONAL ARGUMENTS:
-v, --validations <String> The Validations to be applied.
Validation Options Available (separated by comma):
all = run all validations (default)
lenc = length validation by clusterization
lenr = length validation by ranking
frame = reading frame validation
merge = check gene merge
dup = check duplications
orf = main ORF validation (applicable for nucleotides)
align = validation based on multiple alignment
-d, --db [BLAST_DATABASE] Name of the BLAST database
e.g. "swissprot -remote" or a local BLAST database
-x, --skip_blast [FILENAME] Skip blast-ing part and provide a blast xml or tabular output
as input to this script.
Only BLAST xml (BLAST -outfmt 5) or basic tabular (BLAST -outfmt 6
or 7) outputs accepted
-t, --tabular [BLAST OUTFMT STRING] Custom format used in BLAST -outfmt argument
Usage:
$ genevalidator -x tabular_file -t "slen qstart qend" INPUT_FILE
See BLAST+ manual pages for more details
-m, --mafft [MAFFT_PATH] Path to MAFFT bin folder
-b, --blast [BLAST_PATH] Path to BLAST+ bin folder
-r, --raw_seq [FASTA_FILE] Fasta file containing the raw sequences of each of the BLAST hits in
BLAST XML output file.
-n, --num_threads num_of_threads Specify the number of processor
threads to utilise when running BLAST and Mafft within
GeneValidator.
--version The version of GeneValidator that you are running.
-h, --help Show this screen.
To see this information in your terminal, type:
$ genevalidator -h
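When GeneValidator is called from a script rather than typed by hand, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of GeneValidator itself: the helper name `genevalidator_command` and the file name `predictions.fa` are hypothetical, and only flags listed in the help text above (`-v`, `-d`, `-n`) are used.

```ruby
require 'shellwords'

# Validation codes as listed in the help text above.
KNOWN_VALIDATIONS = %w[all lenc lenr frame merge dup orf align].freeze

# Hypothetical helper (not part of GeneValidator): builds an argument
# array from an options hash, using only flags shown in the help text.
def genevalidator_command(input_file, opts = {})
  cmd = ['genevalidator']

  validations = Array(opts[:validations])
  unknown = validations - KNOWN_VALIDATIONS
  unless unknown.empty?
    raise ArgumentError, "unknown validation(s): #{unknown.join(', ')}"
  end
  # -v expects a comma-separated list of validation codes.
  cmd.push('-v', validations.join(',')) unless validations.empty?

  cmd.push('-d', opts[:db]) if opts[:db]
  cmd.push('-n', opts[:num_threads].to_s) if opts[:num_threads]
  cmd << input_file
end

argv = genevalidator_command('predictions.fa',
                             validations: %w[lenc lenr dup],
                             db: 'swissprot -remote',
                             num_threads: 4)

# Print a shell-escaped command line that can be pasted into a terminal.
puts Shellwords.join(argv)

# To actually run it (requires GeneValidator to be installed):
# system(*argv)
```

Keeping the arguments in an array and passing them to `system(*argv)` avoids shell quoting problems with values that contain spaces, such as the database string in this example.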
The output produced by GeneValidator is presented in three ways:
Firstly, the output is produced as a colourful HTML file. This file is titled 'results.html' (found in the 'html' folder) and can be opened in a web browser (please use Mozilla Firefox). It presents all the results in an easy-to-view manner, with graphical visualisations.
The output is also produced in YAML. This allows you to reuse the results and all the related global variables within your own programs.
Lastly, a summary of the results is also printed in the terminal to provide quick feedback.
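As an illustration of reusing the YAML output in your own program, here is a minimal Ruby sketch. The layout and key names of the sample document are assumptions made purely for this example; the structure of the YAML that GeneValidator actually writes may differ, so inspect your own file before adapting the code.

```ruby
require 'yaml'

# ASSUMPTION: this sample only stands in for GeneValidator's YAML output.
# The real file layout, key names and values may be different.
sample_yaml = <<-YAML
predicted_gene_1:
  lenc: pass
  dup: fail
predicted_gene_2:
  lenc: pass
  dup: pass
YAML

# For a real results file, use YAML.load_file(path_to_yaml) instead.
results = YAML.load(sample_yaml)

# Collect the sequences for which at least one (hypothetical) check failed.
flagged = results.select { |_id, checks| checks.values.include?('fail') }.keys

puts "Sequences needing attention: #{flagged.join(', ')}"
```

Once the results are loaded into plain Ruby hashes, they can be filtered, counted or exported like any other data structure.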