MetAMOS v1.1 README Last updated: December 14th 2012 NEWS: *HTML report vastly improved *FunctionalAnnotation step added *Added Bowtie2 support *Updated to ruffus version 2.3b *Fixed Issues #69,71 *Various other minor bug fixes > SUMMARY * A/ HARDWARE REQUIREMENTS * B/ SOFTWARE REQUIREMENTS * C/ INSTALLING MetAMOS * D/ QUICK START * E/ TEST SUITE * F/ EXAMPLE OUTPUT * G/ CONTACT * H/ CITING ---------------------------------------------------------------------------------- A/ HARDWARE REQUIREMENTS MetAMOS was designed to work on any standard 64bit Linx environment. To use MetAMOS for tutorial/teaching purposes, a minimum of 8 GB RAM is required. To get started on real data sets a minimum of 32 GB of RAM is recommened, and anywhere from 64-512 GB may be necessary for larger datasets. In our experience, for most 50-100 million read datasets, 64 GB is a good place to start (128 GB of memory available on High Memory Instance at Amazon Elastic Compute Cloud ). ---------------------------------------------------------------------------------- B/ SOFTWARE REQUIREMENTS The main prequisite software is python2.6+ and AMOS (available from http://amos.sf.net). Once python2.6+ and AMOS are installed, there should not be any other major prerequisites as most everything that is needed is distributed with MetAMOS inside of the /Utilities directory. Depending on your platform/Linux distribution, you might need to download and install the following BEFORE running INSTALL.py: 1) git 2) wget 3) gcc 4) automake 5) python-tools 6) python-devel 7) zlib-devel 8) numpy 9) freetype, freetype-devel 10) libpng-devel Additionally, there is some software that MetAMOS can incorporate into its pipeline that we are not allowed to distribute, such as MetaGeneMark. To get a license to use MetaGeneMark, plesae visit: http://exon.gatech.edu/license_download.cgi. ---------------------------------------------------------------------------------- C/ INSTALLING MetAMOS To download the software, go to https://github.com/treangen/MetAMOS and click on Downloads. Once downloaded, simply unpack the files and open the MetAMOS directory. Once inside the MetAMOS directory, run: python INSTALL.py This will download and install any external dependencies (or they can be refused by answering NO), which may take minutes or hours to download depending on your connection speed. If all dependencies are downloaded (including optional ones), this will take quite awhile to complete (plan on a few hours to 2 days). ---------------------------------------------------------------------------------- D/ QUICK START Before you get started using MetAMOS a brief review of its design will help clarify its intended use. MetAMOS gas two main components: 1) initPipeline 2) runPipeline The first component, initPipeline, is for creating new projects and also initiliazing sequence libraries. Currently interleaved & non-interleaved fasta, fastq, and SFF files are supported. The file-type flags (-f, -q, and -s) must be specified before the file. Once specified, they remain in effect until a different file type is specified. usage: initPipeline -f/-q -1 file.fastq.1 -2 file.fastq.2 -d projectDir -i 300:500 options: -s -c -q, -f, -1, -2, -d, -m, -i -1: either non-paired file of reads or first file in pair, can be list of multiple separated by a comma -2: second paired read file, can be list of multiple separated by a comma -c: fasta file containing contigs -d: output project directory (required) -f: boolean, reads are in fasta format (default is fastq) -h: display help message -i: insert size of library, can be list separated by commas for multiple libraries -l: SFF linker type -m: interleaved file of paired reads -o: reads are in outtie orientation (default innie) -q: boolean, reads are in fastq format (default is fastq) -s/--sff: boolean, reads are in SFF format (default is fastq) For example, to input a: (non-interleaved fastq, single library) initPipeline -q -1 file.fastq.1 -2 file.fastq.2 -d projectDir -i 300:500 (non-interleaved fasta, single library) initPipeline -f -1 file.fastq.1 -2 file.fastq.2 -d projectDir -i 300:500 (interleaved fastq, single library) initPipeline -q -m file.fastq.12 -d projectDir -i 300:500 (interleaved fastq, multiple libraries) initPipeline -q -m file.fastq.12,file2.fastq.12 -d projectDir -i 300:500,1000:2000 (interleaved fastq, multiple libraries, existing assembly) initPipeline -q -m file.fastq.12,file2.fastq.12 -c file.contig.fa -d projectDir -i 300:500,1000:2000 The second component, runPipeline, takes a project directory as input and runs the following steps by default: 1. Preprocess 2. Assemble 3. FindORFs 4. FindRepeats 5. Abundance 6. Annotate 7. FunctionalAnnotation 8. Scaffold 9. Propagate 10. FindScaffoldORFs 11. Classify 12. Postprocess usage info: usage: runPipeline [options] -d projectdir -h = <bool>: print help [this message] -j = <bool>: just output all of the programs and citations then exit (default = NO) -v = <bool>: verbose output? (default = NO) -d = <string>: directory created by initPipeline (REQUIRED) [options]: [pipeline_opts] [misc_opts] [pipeline_opts]: options that affect the pipeline execution Pipeline consists of the following steps: Preprocess, Assemble, FindORFS, MapReads, Abundance, Annotate, Scaffold, Propagate, Classify, Postprocess Each of these steps can be referred to by the following options: -f = <string>: force this step to be run (default = NONE) -s = <string>: start at this step in the pipeline (default = Preprocess) -e = <string>: end at this step in the pipeline (default = Postprocess) -n = <string>: step to skip in pipeline (default=NONE) For each step you can fine-tune the execution as follows [Preprocess] -t = <bool>: filter input reads? (default = NO) -q = <bool>: produce FastQC quality report for reads with quality information (fastq or sff)? (default = NO) [Assemble] -a = <string>: genome assembler to use (default = SOAPdenovo) -k = <kmer size>: k-mer size to be used for assembly (default = 31) -o = <int>: min overlap length [MapReads] -m = <string>: read mapper to use? (default = bowtie) -i = <bool>: save bowtie (i)ndex? (default = NO) -b = <bool>: create library specific per bp coverage of assembled contigs (default = NO) [FindORFS] -g = <string>: gene caller to use (default=FragGeneScan) -l = <int>: min contig length to use for ORF call (default = 300) -x = <int>: min contig coverage to use for ORF call (default = 3X) [Annotate] -c = <string>: classifier to use for annotation (default = FCP) -u = <bool>: annotate unassembled reads? (default = NO) [Classify] -z = <string>: taxonomic level to categorize at (default = class) [misc_opts]: Miscellaneous options -r = <bool>: retain the AMOS bank? (default = NO) -p = <int>: number of threads to use (be greedy!) (default=1) -4 = <bool>: 454 data? (default = NO) For example, to enable read filtering: -t and to enable meta-IDBA as the assembler: -a metaidba And to use PhyloSift to annotate: -c phylosift Any single step in the pipeline can be skipped by passing the following parameter to runPipeline: -n,--skipsteps=Step1,.. MetAMOS reruns steps based on timestamp information, so if the input files for a step in the pipeline hasn't changed since the last run, it will be skipped automatically. However, you can forefully run any step in the pipeline by passing the following parameter to runPipeline: -f,--force=Step1,.. MetAMOS stores a summary of the input libraries in pipeline.ini in the working directory. The pipeline.conf file stores the list of programs available to MetAMOS. Finally, pipeline.run stores the selected parameters and programs for the current run. MetAMOS also stores detailed logs of all commands executed by the pipeline in Logs/COMMANDS.log and a log for each step of the pipeline in Logs/<STEP NAME>.log Upon completion, all of the final results will be stored in the Postprocess/out directory. A component, create_summary.py, takes this directory as input and as output, generates an HTML page with with summary statistics and a few plots. An optional component, create_plots.py, takes one or multiple Posprocess/out directories as input and generates comparative plots. ---------------------------------------------------------------------------------- E/ Test suite We have developed a set of scripts for testing the various features of MetAMOS. All of these regression test scripts are available inside the /Test directory and include all necessary datasets to run them. Here is a brief listing of the test scripts we currently include: *Test initPipeline ./Test/test_create.sh *Vanilla test ./Test/run_test.sh *Test PhlyoSift ./Test/test_amphora.sh *Test Minimus ./Test/test_minimus.sh *Test Preprocess filtration of non-interleaved fastq files ./Test/test_filter_noninterleaved_fastq.s *Test Newbler (if available) ./Test/test_newbler.sh *Test CA (fasta) ./Test/test_ca_fasta.sh *Test CA (fastq) ./Test/test_ca.sh *Test SOAPdenovo ./Test/test_soap.sh *Test MetaVelvet ./Test/test_metavelvet.sh *Test SparseAssembler ./Test/test_sparse.sh *Test Velvet ./Test/test_velvet.sh *Test FCP ./Test/test_fcp.sh ..more on the way! ---------------------------------------------------------------------------------- F/ Example output MetAMOS generates an interactive webpage once a run successfully completes: http://www.cbcb.umd.edu/software/metamos/summary.html This includes summary statistics and taxonomic information based on Krona Krona publication: Ondov BD, Bergman NH, Phillippy AM.. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011 Sep 30;12:385. PMID: 21961884 Additionally, since MetAMOS stores all of its results in an AMOS bank, the assemblies can be visualized with Hawkeye. ---------------------------------------------------------------------------------- G/ BUGS If you encounter any problems/bugs, please check the wiki page: https://github.com/treangen/MetAMOS/wiki and the known issues pages: https://github.com/treangen/MetAMOS/issues?direction=desc&sort=created&state=open to see if it has already been documented. If not, please report the issue either using the contact information below or by submitting a new issue online. Please include information on your run, any output produced by runPipeline, as well as the pipeline.* files and the Log/<LAST_STEP> file (if not too large). H/ CONTACT Who to contact to report bugs, forward complaints, feature requests: Todd Treangen: treangen@gmail.com Sergey Koren: sergek@umd.edu ---------------------------------------------------------------------------------- I/ CITE Treangen TJ*, Koren S*, Sommer DD, Liu B, Astrovskaya I, Ondov B, Darling AE, Phillippy AM, Pop M. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol. 2013 Jan 15;14(1):R2. PMID: 23320958. url: http://genomebiology.com/content/pdf/gb-2013-14-1-r2.pdf *Indicates both authors contributed equally to this work