Primary language: Python. License: MIT.

metagenomeFilter

Requirements

Docker method:

Build the Docker image on your system

Non-Docker method:

You will need CLARK

http://clark.cs.ucr.edu/Download/CLARKV1.2.3.tar.gz

If you plan to install the required modules using setup.py, run sudo apt-get update and then install the following packages (these are the packages that need to be installed in an ubuntu:16.04 Docker container):

  • git
  • python-dev
  • python-pip
  • wget
  • zlib1g-dev (only if installing seqtk)

sudo apt-get install -y build-essential git python-dev python-pip wget zlib1g-dev

If you plan on filtering reads, you will also need seqtk:

git clone https://github.com/lh3/seqtk.git;
cd seqtk; make

Installation

Clone the repository (--recursive will clone the necessary submodules):

git clone https://github.com/adamkoziol/metagenomeFilter.git --recursive

Install Python dependencies:

cd metagenomeFilter/
python setup.py install

Usage

automateCLARK

Used to automate the taxonomic assignment performed by CLARK

Example command

python automateCLARK.py -s /path/sequences -d /database/folder -C /CLARK/scripts/folder /path

Required arguments:

  • path to the folder to be used to store results
  • -s: path to folder containing sequences. I normally set this to be /path/sequences
  • -d: path to folder containing CLARK database files
  • -C: path to folder containing CLARK scripts
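For scripted runs over several result folders, the example command can be assembled programmatically. The sketch below only builds the argument list from the flags shown in the usage output of this section and does not execute anything. The helper name `build_clark_command` and the example paths are hypothetical and not part of this repository.

```python
import shlex


def build_clark_command(results_path, sequence_path, database_path, clark_path,
                        rank=None, databases=None, threads=None,
                        filter_reads=False, cutoff=None):
    """Return the automateCLARK.py command as a list of arguments.

    Hypothetical helper: it only mirrors the flags listed in this README.
    """
    cmd = ["python", "automateCLARK.py",
           "-s", sequence_path,
           "-d", database_path,
           "-C", clark_path]
    if rank is not None:
        cmd += ["-r", rank]
    if databases:
        # Several databases are passed as a single comma-separated value
        cmd += ["-D", ",".join(databases)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if filter_reads:
        cmd.append("-f")
    if cutoff is not None:
        # The cutoff is a decimal fraction: 0.05 means 5 percent
        cmd += ["-c", str(cutoff)]
    # The positional results path goes last, as in the example command
    cmd.append(results_path)
    return cmd


if __name__ == "__main__":
    command = build_clark_command(
        "/path", "/path/sequences", "/database/folder", "/CLARK/scripts/folder",
        databases=["bacteria", "viruses"], filter_reads=True, cutoff=0.05)
    # Quote each part so the line can be pasted into a shell
    print(" ".join(shlex.quote(part) for part in command))
```

The returned list can be handed to `subprocess.run` unchanged, which avoids shell quoting problems when paths contain spaces.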

See usage below:

usage: automateCLARK.py [-h] -s SEQUENCEPATH -d DATABASEPATH -C CLARKPATH
                        [-r RANK] [-D DATABASE] [-t THREADS] [-f] [-c CUTOFF]
                        path

Automates CLARK metagenome software

positional arguments:
  path                  Specify input directory

optional arguments:
  -h, --help            show this help message and exit
  -s SEQUENCEPATH, --sequencepath SEQUENCEPATH
                        Path of .fastq(.gz) files to process.
  -d DATABASEPATH, --databasepath DATABASEPATH
                        Path of CLARK database files to use.
  -C CLARKPATH, --clarkpath CLARKPATH
                        Path to the CLARK scripts
  -r RANK, --rank RANK  Choose the taxonomic rank to use in the analysis:
                        species (the default value), genus, family, order,
                        class or phylum
  -D DATABASE, --database DATABASE
                        Choose the database to use in the analysis (one or
                        more from: bacteria, viruses, human,and custom. To
                        select more than one, use commas to separate your
                        selection. Custom databases need to be set up before
                        they will work
  -t THREADS, --threads THREADS
                        Number of threads. Default is the number of cores in
                        the system
  -f, --filter          Optionally split the samples based on taxonomic
                        assignment
  -c CUTOFF, --cutoff CUTOFF
                        Cutoff value for setting taxIDs to use when filtering
                        fastq files. Defaults to 1 percent. Please note that
                        you must use a decimal format: enter 0.05 to get a 5
                        percent cutoff.
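To make the decimal cutoff concrete, here is a minimal sketch of how a fractional cutoff could select taxIDs from per-taxID read counts. It illustrates the idea under stated assumptions and is not the script's actual code: the function name `select_taxids`, the input dictionary, and the choice to compare each taxID's share of all counted reads against the cutoff are hypothetical.

```python
from __future__ import division  # true division on Python 2 as well


def select_taxids(read_counts, cutoff=0.01):
    """Return the taxIDs whose share of the total read count exceeds `cutoff`.

    read_counts: dict mapping taxID -> number of reads assigned to it
    cutoff: fraction between 0 and 1 (0.05 means 5 percent)
    """
    if not 0 <= cutoff <= 1:
        raise ValueError("cutoff must be a decimal fraction, e.g. 0.05 for 5 percent")
    total = sum(read_counts.values())
    if total == 0:
        return []
    # Keep only taxIDs strictly above the cutoff, in sorted order
    return sorted(taxid for taxid, count in read_counts.items()
                  if count / total > cutoff)


if __name__ == "__main__":
    # Made-up counts for illustration only
    counts = {"562": 900, "1280": 80, "1639": 20}
    print(select_taxids(counts, cutoff=0.05))
```

Passing 5 instead of 0.05 raises an error in this sketch, which reflects the note in the help text that the cutoff must be given in decimal format.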

metagenomeFilter

Used to filter reads based on the taxonomic assignment made by CLARK. It can either be invoked through automateCLARK.py with the -f flag, or run as a stand-alone script.

Not yet implemented: an option to choose specific taxIDs to use, rather than selecting all taxIDs that are greater than a cutoff value.

Example command

python filtermetagenome.py -s /path/sequences -d /CLARK/outputs/folder /path

Required arguments:

  • path to the folder to be used to store results
  • -s: path to folder containing sequences. I normally set this to be /path/sequences
  • -d: path to folder containing CLARK output files

See usage below:

usage: filtermetagenome.py [-h] [-v] [-t THREADS] -s SEQUENCEPATH -d DATAPATH
                           [-c CUTOFF] [-x TAXIDS]
                           path

Filter reads based on taxonomic assignment

positional arguments:
  path                  Specify path

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -t THREADS, --threads THREADS
                        Number of threads. Default is the number of cpus in
                        the system
  -s SEQUENCEPATH, --sequencepath SEQUENCEPATH
                        Path of .fastq(.gz) files to process.
  -d DATAPATH, --datapath DATAPATH
                        Path of .csv files created by CLARK with read ID,
                        length, and assignment.
  -c CUTOFF, --cutoff CUTOFF
                        Cutoff value for deciding which taxIDs to use when
                        sorting .fastq files. Defaults to 1 percent. Please
                        note that you must use a decimal format: enter 0.05 to
                        get a 5 percent cutoff value
  -x TAXIDS, --taxids TAXIDS
                        NOT IMPLEMENTED: CSV of desired taxIDs from each
                        sample.
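As a rough illustration of filtering by assignment, the sketch below groups read IDs by the taxID recorded in a CLARK-style .csv file. It is written under assumptions rather than taken from filtermetagenome.py: the function name `group_reads_by_taxid` is hypothetical, the file is assumed to have one header row followed by three columns (read ID, length, assignment), and unassigned reads are assumed to be marked `NA`.

```python
import csv
import io
from collections import defaultdict


def group_reads_by_taxid(handle, unassigned="NA"):
    """Map each assignment (taxID) to the list of read IDs that received it.

    Assumes one header row, then rows of: read ID, length, assignment.
    """
    groups = defaultdict(list)
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        read_id, assignment = row[0].strip(), row[2].strip()
        if assignment == unassigned:
            continue  # reads without an assignment are left out
        groups[assignment].append(read_id)
    return dict(groups)


if __name__ == "__main__":
    # Made-up rows for illustration only
    example = io.StringIO(
        "Object_ID,Length,Assignment\n"
        "read1,150,562\n"
        "read2,150,NA\n"
        "read3,148,562\n"
        "read4,151,1280\n"
    )
    for taxid, read_ids in sorted(group_reads_by_taxid(example).items()):
        print(taxid, read_ids)
```

Each list of read IDs could then be written to a text file, one ID per line, and passed to `seqtk subseq` together with the original .fastq file to extract the matching reads. Whether filtermetagenome.py works exactly this way is not stated here.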