/FragGeneScan_HPC

Acceleration of a gene prediction method on multicore clusters

Primary LanguageCGNU General Public License v3.0GPL-3.0

FragGeneScan_HPC

Installation

To install FragGeneScan_HPC, please follow the steps below:

1. Untar the downloaded file "FragGeneScan_HPC.tar.gz". This will automatically generate the directory "FragGeneScan_HPC".

2. Make sure that you also have a C and MPI compiler such as "mpicc" and perl interpreter.

3. Run "makefile" to compile and build excutable "FragGeneScan_HPC"

	make clean
	make fgs

Running the program

1. To run FragGeneScan_HPC:

./run_FragGeneScan.pl -genome=[seq_file_name] -out=[output_file_name] -complete=[1 or 0] -train=[train_file_name] -processes=[processes]

  • [seq_file_name]: sequence file name including the full path

  • [output_file_name]: output file name including the full path

  • [whole_genome]: 1 if the sequence file has complete genomic sequences 0 if the sequence file has short sequence reads

  • [train_file_name]: file name that contains model parameters; this file should be in the "train" directory.

     [complete] for complete genomic sequences or short sequence reads without sequencing error  
     [sanger_5] for Sanger sequencing reads with about 0.5% error rate  
     [sanger_10] for Sanger sequencing reads with about 1% error rate  
     [454_5] for 454 pyrosequencing reads with about 0.5% error rate  
     [454_10] for 454 pyrosequencing reads with about 1% error rate  
     [454_30] for 454 pyrosequencing reads with about 3% error rate  
     [illumina_5] for Illumina sequencing reads with about 0.5% error rate  
     [illumina_10] for Illumina sequencing reads with about 1% error rate  
    

    Note that files containing model parameters already exist in the "train" directory.

  • [processes]: number of processes used in FragGeneScan. Default 1.

2. To test FragGeneScan_HPC with a complete genomic sequence:

./run_FragGeneScan.pl -genome=./example/NC_000913.fna -out=./example/NC_000913-fgs -complete=1 -train=complete

  • [NC_000913.fna]: this sequence file has the complete genomic sequence of E.coli

    (NCBI gene predictions for this genome are available under the same folder example/)

3. To test FragGeneScan_HPC with sequencing reads:

./run_FragGeneScan.pl -genome=./example/NC_000913-454.fna -out=./example/NC_000913-454-fgs -complete=0 -train=454_10

  • [NC_000913-454.fna]: this sequence file has simulated reads (pyrosequencing, average length = 400 bp and sequencing error = 1%) generated using Metasim

    For illumina reads, please use illumina_5 or illumina_10 as the train model.

4. To test FragGeneScan_HPC with assembly contigs:

./run_FragGeneScan.pl -genome=./example/contigs.fna -out=./example/contigs-fgs -complete=1 -train=complete

Note: -complete=1 & -train=complete are used as the parameters.

5. To test FragGeneScan_HPC with whole genome:

./run_FragGeneScan.pl -genome=./example/NC_000913.fna -out=./example/NC_000913-fgs -complete=1 -train=complete

Output

Upon completion, FragGeneScan_HPC generates four files.

1. The first file is "[output_file_name].out", which lists the coordinates of putative genes. This file consists of five columns (start position, end position, strand, frame, score). For example,

>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome
108     440     -       3       1.378688
337     2799    +       1       1.303498
2801    3733    +       2       1.317386
3734    5020    +       2       1.293573
5234    5530    +       2       1.354725
5683    6459    -       1       1.290816
6529    7959    -       1       1.326412
8238    9191    +       3       1.286832
9306    9893    +       3       1.317067

2. The second file is "[output_file_name].ffn", which lists nucleotide sequences corresponding to the putative genes in "[output_file_name].out". For example,

>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome start=108 e
nd=338 strand=-
GTTGTTACCTCGTTACCTTTGGTCGAAAAAAAAAGCCCGCACTGTCAGGTGCGGGCTTTTTTCTGTGTTTCCTGTACGCGTCAGCCCGCACCGTTACCTG
TGGTAATGGTGATGGTGGTGGTAATGGTGGTGCTAATGCGTTTCATGGATGTTGTGTACTCTGTAATTTTTATCTGTCTGTGCGCTATGCCTATATTGGT
TAAAGTATTTAGTGACCTAAGTCAA
>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome start=343 e
nd=2799 strand=+
TTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCC
TCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTAT
TTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACAT
GTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCG

3. The third file is "[output_file_name].faa", which lists amino acid sequences corresponding to the putative genes in "[output_file_name].out". For example,

>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome start=108 e
nd=338 strand=-
VVTSLPLVEKKSPHCQVRAFFCVSCTRQPAPLPVVMVMVVVMVVLMRFMDVVYSVIFICLCAMPILVKVFSDLSQ
>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome start=343 e
nd=2799 strand=+
LKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKH
VLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLG
RNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRDEDE
LPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVG
DGMRTLRGISAKFFAALARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSWLKNKHIDLRVCGV
ANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKF
LYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEP
VLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFKVKNGENALAFYSHYYQPLPLVLRGYGAGNDVTA
AGVFADLLRTLSWKLGV
>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome start=2801
end=3733 strand=+
VKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEPRENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSAC
SVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHYDNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCI
AHGRHLAGFIHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVADWLGKNYLQNQEGFVHICRLD
TAGARVLEN

4. The fourth file is "[output_file_name].gff" which show the gene prediction results in gff format.

License

Copyright (C) 2010 Mina Rho, Yuzhen Ye and Haixu Tang.

Copyright (C) 2020 Bruno Cabado.

The source for FragGeneScan_HPC is released under the GNU General Public License as published by the Free Software Foundation, either version 3 of the License (See COPYING.txt).