For a full ReadMe, click here.
Basic Syntax:
run_FragGeneScan.pl -genome=< FASTA file > -out=< output path > -complete=< whole genome? > -train=< sequencing type >
The -genome flag specifies the input sequence path.
The -out flag tells FragGeneScan where to place the output files and what to name them.
The -complete flag is a boolean that specifies if the input is a complete genomic sequence or a composition of short reads.
The -train flag tells FragGeneScan to use a particular set of sequencing parameters for scanning the input genome. The options are:
- Complete
- Sanger Sequencing (0.5% & 1% error)
- 454 Pyrosequencing (1% & 3% error)
- Illumina Sequencing (0.5% & 1% error)
Output: FragGeneScan generates three output files:
- SeqName.out
- This file holds the locations of various genes in the input sequence. It details the start position, end position, strand, frame, and score.
- SeqName.faa
- This file holds the amino acid sequences that correspond to the genes described in the .out file. We will use this file for HMMer.
- SeqName.ffn
- This file holds the nucleotide sequences that correspond to the genes described in the .out file.
HUGE Issue with FragGeneScan: No multithreaded support (this package is REALLY SLOW). Some of our datasets took 9-12 hours to analyze completely.
A newer version has been developed with support for cluster computing called FragGeneScan-Plus. FragGeneScan-Plus claims to analyze data 5x faster than FragGeneScan on a single core, and about 50x faster using 8 cores. This package can be found here.
Update: In our tests using 64 cores, FragGeneScan-Plus analyzed one of our datasets in minutes! This is a HUGE improvement over FragGeneScan.
For the offical (120 page) HMMer manual, click here to visit the offical HMMer website and click the link to download the documentation PDF.
HMMer has many functions:
- hmmbuild - Build profile HMM from input multiple alignment
- hmmalign - Make multiple aligment to a common profile HMM
- phmmer - Search single protein sequence against protein sequence database (BLASTP-like)
- jackhmmer - Iteratively search protein sequence against protein sequence database (PSIBLAST-like)
- hmmsearch - Search protein profile HMM against a protein sequence database
- hmmscan - Seach protein sequence against a protein profile HMM database
- hmmpgmd - Search daemon for submitting jobs to hmmer.org
- nhmmer - Search DNA sequence, alignment, or profile HMM against a DNA sequence database (BLASTN-like)
- nhmmscan - Search DNA sequence against a DNA profile HMM database
- hmmpress - Format HMM database into binary format for hmmscan
This package encourages the user to create an HMM profile for the data being analyzed. A sample HMM profile looks like this (taken directly from the HMMer3.1 user manual):
HMMER3/f [3.1 | February 2013] NAME fn3
ACC PF00041.13
DESC Fibronectin type III domain LENG 86
ALPH amino
RF no
MM no
CONS yes
CS yes
MAP yes
DATE Fri
NSEQ 106
EFFN 11.415833
CKSUM 3564431818
GA 8.00 7.20
TC 8.00 7.20
NC 7.90 7.90
STATS LOCAL MSV
STATS LOCAL VITERBI -9.7737 0.71847 STATS LOCAL FORWARD -3.8341 0.71847
HMM
-9.4043 0.71847
A C D E F G H I (...) Y
(...)
//
85 2.48488 5.72055
2.68618 4.42225
0.00338 6.08833
86 3.03720 5.94099
2.68618 4.42225 0.00227 6.08723
Feb 15 06:04:13 2013
m->m m->i COMPO 2.70330 4.91262 2.68618 4.42225 0.00338 6.08833
1 3.16986 5.21447 2.68629 4.42236 0.09796 2.38361
2 2.70230 5.97353 2.68618 4.42225 0.00338 6.08833
m->d i->m i->i d->m 3.03272 2.64079 3.60307 2.84344 2.77519 2.73123 3.46354 2.40513 6.81068 0.61958 0.77255 0.00000 4.52134 3.29953 4.34285 4.18764 2.77530 2.73088 3.46365 2.40512 6.81068 0.10064 2.34607 0.48576 2.24744 2.62947 5.31433 2.60356 2.77519 2.73123 3.46354 2.40513 6.81068 0.61958 0.77255 0.48576
3.87501 1.97538 3.04853 3.48010
2.77519 2.73123 3.46354 2.40513 6.81068 0.61958 0.77255 0.48576 3.75455 2.96917 5.26587 2.91682
2.77519 2.73123 3.46354 2.40513 * 0.61958 0.77255 0.00000
d->d
3.74204 3.07942 (...) 3.21526 3.72494 3.29354 (...) 3.61503
*
4.30886 3.35801 (...) 3.93889 3.72505 3.29365 (...) 3.61514 0.95510
4.43584 4.79731 (...) 4.25623 3.72494 3.29354 (...) 3.61503 0.95510
4.51877 3.51898 (...) 3.43366
3.72494 3.29354 (...) 3.61503 0.95510
3.66571 4.11840 (...) 4.99111
3.72494 3.29354 (...) 3.61503 *
1p---
3s---
120 e - - B
121 s - - E
Creating an HMM profile is not always possible because the parser in hmmbuild is incredibly selective of what data format it accepts. The recommended data format is Stockholm, which looks like this (again, taken from the HMMer3.1 user manual):
# STOCKHOLM 1.0
seq1 ACDEF...GHIKL
seq2 ACDEF...GHIKL
seq3 ...EFMNRGHIKL
seq1 MNPQTVWY
seq2 MNPQTVWY
seq3 MNPQT...
//
In this tutorial, we revert to using hmmpress and hmmscan to analyze the protein sequences produced by FragGeneScan (since HMM profiles cannot be created with our data). The hmmscan program will accept a protein sequence and compare it against an HMM database. Before hmmscan can be used, the HMM database must be converted into a binary format to optimize the speed of the scan. The hmmpress program prepares an HMM profile for hmmscan, and its basic syntax is as follows:
hmmpress < HMM profile >
The basic syntax for hmmscan is as follows:
hmmscan <options> < HMM profile > < FAA file >
For the complete set of options that go with every command of HMMer, consult the user manual or the man pages using the -h flag after the HMMer command.
Speed of HMMer: The user manual states that HMMer 3.1 should be as fast as BLAST for "most protein queries." However, the older FASTA format does not fall into this category. Even with 32 cores available, our first dataset is STILL being processed as we present this tutorial!
EMBL provides an alternate method of scanning protein sequences against their Pfam database. They provide a tool called Pfam_Scan, which is essentially a Perl script that runs HMMer against the Pfam-A and Pfam-B databases. It also offers a "pretty" output format for easier interpretation of the data, which includes Pfam abundance info. We were not able to get this script to install on Proteus due to permission issues around modifying the Perl configuration. However, if you were to run this script, the general syntax is:
${PFAMPATH}/pfam_scan.pl -fasta < Protein Sequence > -out < Pfam Database Directory >
For more on Pfam, click here.
General settings for Proteus:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -M ejm335@drexel.edu
#$ -P nsftuesPrj
#$ -pe shm 32
#$ -l h_rt=48:00:00
#$ -q all.q@@amdhosts
. /etc/profile.d/modules.sh
module load shared
module load proteus
module load sge/univa
module load gcc/4.8.1
First, we need to download our data. This command downloads our dataset (we must then move it all into one folder):
wget -r --no-parent ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX144/SRX144807/
Once all of the data is in one place, we need to unzip the files.
gunzip *.gz
We need to convert all of the SRA data files into a FASTA format.
# Define the path for SRA Toolkit & HMMer
PATH=/mnt/HA/groups/nsftuesGrp/.local/bin:$PATH
# Convert from SRA to FASTA
fastq-dump --fasta ${DATAPATH}/*.sra
Now we have all of the files we need to run FragGeneScan.
# Define Data/Package Locations
FRAGPATH=/home/ejm335/fraggenescan-code
TMPPATH=${TMP}/FragGeneScanOutput
DATAPATH=/home/ejm335/TutorialData
OUTPUTPATH=/home/ejm335/TutorialData/output
# Define data array
datafiles=(SRR492065 SRR492066 SRR492182 SRR492183 SRR492184 SRR492185 SRR492186 SRR492187 SRR492188 SRR492189 SRR492190 SRR492191 SRR492192 SRR492193 SRR492194 SRR492195 SRR492196 SRR492197)
# Make a folder to hold output
mkdir ${TMP}/FragGeneScanOutput
# Run FragGeneScan (we ended up running this separately per file, each file takes ~9 hours to analyze)
for file in ${datafiles[@]}
do
${FRAGPATH}/run_FragGeneScan.pl -genome=${DATAPATH}/${file}.fasta -out=${TMPPATH}/${file} -complete=0 -train=454_30
done
# Move output from scatch to home
mv ${TMPPATH}/* ${OUTPUTPATH}/
We now have three output files from each FASTA file. Sample of each: SRR492065.out
>SRR492065.1 HWI-EAS385_0095_FC:2:1:6702:1434 length=200
1 72 - 3 1.435572 I:10, D:
>SRR492065.2 HWI-EAS385_0095_FC:2:1:6931:1435 length=200
1 86 - 3 1.358242 I: D:
>SRR492065.3 HWI-EAS385_0095_FC:2:1:9984:1431 length=200
1 120 - 1 1.331453 I: D:
>SRR492065.4 HWI-EAS385_0095_FC:2:1:11577:1434 length=200
>SRR492065.5 HWI-EAS385_0095_FC:2:1:15121:1434 length=200
1 84 - 1 1.379883 I: D:
SRR492065.ffn
>SRR492065.1_1_72_-
GTAACGACAATCGTGACAAGTCGCGATTGGAGTGGTCTTCACAGTGAAGCCAAGCATAGGATGGCT
>SRR492065.2_1_86_-
ATNNNNNTTTCTTCCGACAAAGTTGATCAAGTCGCTGAGTTTGGAAATTCTAGTAAAATCACAGTCGGTGAGCCTGCTATT
>SRR492065.3_1_120_-
GTAAAACAANNNNNGAAAGGCGGCGATTGGCGTGCTNNNNNTTTGAATATTGTTTTAATTCTAGTGGCCATCTTGATTTACTATCCGTTCTTTGTAGCTTATGATAAAAATGAGCTT
>SRR492065.5_1_84_-
NNNNNTAAATCAGCCCATGACTCAGGTGCTGCTCTACTATGTAAACATGACGGTGTCGTAGAATTCGTCGATGCCAAAGAA
SRR492065.faa
>SRR492065.1_1_72_-
VTTIVTSRDWSGLHSEAKHRMA
>SRR492065.2_1_86_-
XXXSSDKVDQVAEFGNSSKITVGEPAI
>SRR492065.3_1_120_-
VKQXXKGGDWRAXXLNIVLILVAILIYYPFFVAYDKNEL
>SRR492065.5_1_84_-
XXKSAHDSGAALLCKHDGVVEFVDAKE
>SRR492065.6_1_200_+
NYPCKIYIYDGNVKLENLNIDEDFIXXKEEIWKAXXTSXXXXXVPLNHXXILLQQLLLQTHFYAL
Now that we have run FragGeneScan, we now need to download the Pfams database in order to anotate each dataset's protein sequences (.faa). Navigate to the desired directory and run this command:
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam//releases/Pfam27.0/Pfam-A.hmm.gz
HMMer offers many commands to analyze data, but the most straightfoward one to use is hmmscan. This is because it takes in a protein sequence and compares it to a protein database in HMM format. In the previous command, we downloaded the Pfam-A database in the HMM format. Before we can use hmmscan, we need to compress this database into a binary format using hmmpress:
# Define the path for SRA Toolkit & HMMer
PATH=/mnt/HA/groups/nsftuesGrp/.local/bin:$PATH
# Define other paths
DBPATH=/home/ejm335/TutorialData
DATAPATH=/home/ejm335/TutorialData/output/
OUTPATH=/home/ejm335/TutorialData/output/HMMer
# Compress Pfam Database
hmmpress ${DBPATH}/Pfam-A.hmm
This generates four binary files in the ${DATAPATH} path we defined earlier. We are now all set to run hmmscan (and wait a VERY LONG time):
## Run HMMer for each protein sequence
for file in ${datafiles[@]}
do
hmmscan -o ${OUTPATH}/${file} --cpu 32 ${DBPATH}/Pfam-A.hmm ${DATAPATH}/${file}.faa
done
This creates a long output file in the ${OUTPATH} directory that looks like this:
# hmmscan :: search sequence(s) against a profile database
# HMMER 3.1b1 (May 2013); http://hmmer.org/
# Copyright (C) 2013 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query sequence file: /home/ejm335/TutorialData/output//SRR492190.faa
# target HMM database: /home/ejm335/TutorialData/Pfam-A.hmm
# number of worker threads: 32
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Query: SRR492190.4_1_200_+ [L=65]
Scores for complete sequence (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Model Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------
1.3e-06 28.2 0.1 1.7e-06 27.8 0.1 1.2 1 IF-2 Translation-initiation factor 2
Domain annotation for each model (and alignments):
>> IF-2 Translation-initiation factor 2
# score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc
--- ------ ----- --------- --------- ------- ------- ------- ------- ------- ------- ----
1 ! 27.8 0.1 1.2e-10 1.7e-06 15 48 .. 2 35 .. 1 42 [. 0.90
Alignments for each domain:
== domain 1 score: 27.8 bits; conditional E-value: 1.2e-10
IF-2 15 vkelnvivkaDvqGsleAlkesLeklsteevkvk 48
ke+n+ivkaDvqG eA+++sL+ + e v++
SRR492190.4_1_200_+ 2 FKEVNIIVKADVQGXDEAVSASLQXXDVEGVRLX 35
589************************9999986 PP
We can use the --pfamtblout flag to produce a formatted output, and we can limit the inclusion threshold by specifying an inclusion e-value of 1e-5 with the -incE flag.
# HMMScan with formatted output
hmmscan --pfamtblout ${OUTPATH}/${file} --cpu 32 --incE 0.00001 ${DBPATH}/Pfam-A.hmm ${DATAPATH}/${file}.faa
This produces output that looks like this (first 25 matches in the SRR492190 dataset):
# name bits E-value n exp bias description
# ------------------- ------ --------- --- ----- ----- ---------------------
IF-2 28.2 1.3e-06 1 1.2 0.1 Translation-initiation factor 2
Asp23 16.5 0.0074 1 1.1 0.0 Asp23 family
RIX1 32.3 6.6e-08 1 1.1 0.2 rRNA processing/ribosome biogenesis
DUF2969 23.2 5.1e-05 1 1.1 0.2 Protein of unknown function (DUF2969)
Malic_M 13.6 0.036 1 1.1 0.0 Malic enzyme, NAD binding domain
Glyco_hydro_35 14.1 0.022 1 1.0 0.0 Glycosyl hydrolases family 35
ThiW 35.8 6.3e-09 1 1.0 0.8 Thiamine-precursor transporter protein (ThiW)
YbaB_DNA_bd 15.5 0.011 1 1.0 0.1 YbaB/EbfC DNA-binding family
TPR_2 24.8 1.2e-05 1 1.3 0.0 Tetratricopeptide repeat
TPR_1 24.3 1.5e-05 1 1.4 0.0 Tetratricopeptide repeat
TPR_8 24.3 1.6e-05 1 1.2 0.0 Tetratricopeptide repeat
TPR_12 23.3 4.4e-05 1 1.1 0.0 Tetratricopeptide repeat
TPR_7 22.9 4.6e-05 1 1.4 0.0 Tetratricopeptide repeat
TPR_11 18.6 0.001 1 1.1 0.0 TPR repeat
TPR_14 18.1 0.0028 1 1.3 0.0 Tetratricopeptide repeat
TPR_10 16.9 0.0041 1 1.3 0.0 Tetratricopeptide repeat
TPR_16 16.9 0.0076 1 1.2 0.1 Tetratricopeptide repeat
TPR_6 15.9 0.013 1 1.2 0.0 Tetratricopeptide repeat
CbiQ 16.8 0.0037 1 1.1 0.0 Cobalt transport protein
DUF1129 13.8 0.026 1 1.1 1.0 Protein of unknown function (DUF1129)
SHMT 19.6 0.00024 1 1.0 1.0 Serine hydroxymethyltransferase
SAICAR_synt 17.3 0.0018 1 1.1 0.1 SAICAR synthetase
H_kinase_N 19.8 0.00046 1 1.1 0.0 Signal transduction histidine kinase
Cna_B 19.5 0.00059 1 1.2 0.5 Cna protein B-type domain
OTCace 15.4 0.012 1 1.1 0.0 Aspartate/ornithine carbamoyltransferase, Asp/Orn binding domain
MoaE 25.4 9.4e-06 1 1.1 0.0 MoaE protein