metagenomeFilter

Requirements

Docker method:

Build the Docker image on your system

Non-Docker method:

You will need CLARK

http://clark.cs.ucr.edu/Download/CLARKV1.2.3.tar.gz

If you plan on installing the required modules using setup.py, run sudo apt-get update and then install the following packages (these are the packages that need to be installed in an Ubuntu:16.04 Docker container):

build-essential
git
python-dev
python-pip
wget
zlib1g-dev (only if installing seqtk)

sudo apt-get install -y build-essential git python-dev python-pip wget zlib1g-dev

If you plan on filtering reads (the -f option of automateCLARK), you will also need seqtk:

git clone https://github.com/lh3/seqtk.git
cd seqtk
make

Installation

Clone the repository (--recursive will clone the necessary submodules):

git clone https://github.com/adamkoziol/metagenomeFilter.git --recursive

Install python dependencies:

cd metagenomeFilter/
python setup.py install

Usage

automateCLARK

Used to automate the taxonomic assignment performed by CLARK

Example command

python automateCLARK.py -s /path/sequences -d /database/folder -C /CLARK/scripts/folder /path

Required arguments:

path (positional argument): path to the folder used to store results
-s: path to the folder containing the sequences (.fastq or .fastq.gz files). I normally set this to /path/sequences
-d: path to folder containing CLARK database files
-C: path to folder containing CLARK scripts
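
For scripted runs, the same command can be assembled from Python. The sketch below is only an illustration: the helper name build_clark_command and the example paths are hypothetical, and it uses only the flags documented in this README. It assumes automateCLARK.py is in the current working directory.

```python
def build_clark_command(results_path, sequence_path, database_path, clark_path,
                        rank=None, threads=None, filter_reads=False, cutoff=None):
    """Assemble the argument list for automateCLARK.py (hypothetical helper)."""
    # Required flags, in the same order as the example command
    cmd = ["python", "automateCLARK.py",
           "-s", sequence_path,
           "-d", database_path,
           "-C", clark_path]
    # Optional flags are only added when a value is supplied
    if rank is not None:
        cmd += ["-r", rank]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if filter_reads:
        cmd.append("-f")
    if cutoff is not None:
        cmd += ["-c", str(cutoff)]
    # The positional path argument goes last
    cmd.append(results_path)
    return cmd


command = build_clark_command("/path", "/path/sequences",
                              "/database/folder", "/CLARK/scripts/folder")
print(" ".join(command))
```

The resulting list can be passed to subprocess.check_call(command) to launch the run. Building a list rather than a single string avoids quoting problems when paths contain spaces.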

See usage below:

usage: automateCLARK.py [-h] -s SEQUENCEPATH -d DATABASEPATH -C CLARKPATH
                        [-r RANK] [-D DATABASE] [-t THREADS] [-f] [-c CUTOFF]
                        path

Automates CLARK metagenome software

positional arguments:
  path                  Specify input directory

optional arguments:
  -h, --help            show this help message and exit
  -s SEQUENCEPATH, --sequencepath SEQUENCEPATH
                        Path of .fastq(.gz) files to process.
  -d DATABASEPATH, --databasepath DATABASEPATH
                        Path of CLARK database files to use.
  -C CLARKPATH, --clarkpath CLARKPATH
                        Path to the CLARK scripts
  -r RANK, --rank RANK  Choose the taxonomic rank to use in the analysis:
                        species (the default value), genus, family, order,
                        class or phylum
  -D DATABASE, --database DATABASE
                        Choose the database to use in the analysis (one or
                        more from: bacteria, viruses, human,and custom. To
                        select more than one, use commas to separate your
                        selection. Custom databases need to be set up before
                        they will work
  -t THREADS, --threads THREADS
                        Number of threads. Default is the number of cores in
                        the system
  -f, --filter          Optionally split the samples based on taxonomic
                        assignment
  -c CUTOFF, --cutoff CUTOFF
                        Cutoff value for setting taxIDs to use when filtering
                        fastq files. Defaults to 1 percent. Please note that
                        you must use a decimal format: enter 0.05 to get a 5
                        percent cutoff.
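
To make the decimal cutoff concrete, here is a minimal sketch of how a proportion-based cutoff could select taxIDs. It is an assumption about the behaviour, not the code in automateCLARK.py: the function name taxids_above_cutoff and the read counts are invented for illustration, and the usage text does not say whether the comparison is inclusive.

```python
def taxids_above_cutoff(read_counts, cutoff=0.01):
    """Return the taxIDs whose share of all assigned reads is at least `cutoff`.

    `read_counts` maps taxID -> number of reads (hypothetical input format).
    `cutoff` is a decimal fraction, e.g. 0.05 for 5 percent.
    """
    total = sum(read_counts.values())
    if total == 0:
        return []
    # float() keeps the division correct under Python 2 as well
    return sorted(taxid for taxid, count in read_counts.items()
                  if count / float(total) >= cutoff)


# Made-up counts: 1000 reads in total
counts = {"562": 900, "1280": 60, "9606": 40}
selected = taxids_above_cutoff(counts, cutoff=0.05)
print(selected)
```

With these counts, the taxID holding 4 percent of the reads falls below the 0.05 cutoff and is dropped, while the default of 0.01 would keep all three.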