Docker method:
Build the Docker image on your system
Non-Docker method:
You will need CLARK
http://clark.cs.ucr.edu/Download/CLARKV1.2.3.tar.gz
If you plan on installing the required modules using setup.py, run sudo apt-get update and then
install the following packages (these are the packages required in an ubuntu:16.04
Docker container):
- build-essential
- git
- python-dev
- python-pip
- wget
- zlib1g-dev (only if installing seqtk)
sudo apt-get install -y build-essential git python-dev python-pip wget zlib1g-dev
If you plan on filtering reads, you will also need seqtk:
git clone https://github.com/lh3/seqtk.git;
cd seqtk; make
Clone the repository (--recursive will clone the necessary submodules):
git clone https://github.com/adamkoziol/metagenomeFilter.git --recursive
Install python dependencies:
cd metagenomeFilter/
python setup.py install
automateCLARK.py automates the taxonomic assignment performed by CLARK:
python automateCLARK.py -s /path/sequences -d /database/folder -C /CLARK/scripts/folder /path
Required arguments:
- path to the folder to be used to store results
- -s: path to folder containing sequences. I normally set this to be /path/sequences
- -d: path to folder containing CLARK database files
- -C: path to folder containing CLARK scripts
See usage below:
usage: automateCLARK.py [-h] -s SEQUENCEPATH -d DATABASEPATH -C CLARKPATH
[-r RANK] [-D DATABASE] [-t THREADS] [-f] [-c CUTOFF]
path
Automates CLARK metagenome software
positional arguments:
path Specify input directory
optional arguments:
-h, --help show this help message and exit
-s SEQUENCEPATH, --sequencepath SEQUENCEPATH
Path of .fastq(.gz) files to process.
-d DATABASEPATH, --databasepath DATABASEPATH
Path of CLARK database files to use.
-C CLARKPATH, --clarkpath CLARKPATH
Path to the CLARK scripts
-r RANK, --rank RANK Choose the taxonomic rank to use in the analysis:
species (the default value), genus, family, order,
class or phylum
-D DATABASE, --database DATABASE
Choose the database to use in the analysis (one or
more from: bacteria, viruses, human,and custom. To
select more than one, use commas to separate your
selection. Custom databases need to be set up before
they will work
-t THREADS, --threads THREADS
Number of threads. Default is the number of cores in
the system
-f, --filter Optionally split the samples based on taxonomic
assignment
-c CUTOFF, --cutoff CUTOFF
Cutoff value for setting taxIDs to use when filtering
fastq files. Defaults to 1 percent. Please note that
you must use a decimal format: enter 0.05 to get a 5
percent cutoff.
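The options above can be assembled programmatically when many runs need to be scripted. The following is only a minimal sketch: build_clark_command is a hypothetical helper that is not part of this repository, and its keyword names are assumptions. The flags, the rank names and the database names are the ones listed in the help text; the cutoff check simply enforces the decimal format requested there.

```python
# Values taken from the automateCLARK.py help text.
VALID_RANKS = {"species", "genus", "family", "order", "class", "phylum"}
VALID_DATABASES = {"bacteria", "viruses", "human", "custom"}


def build_clark_command(output_path, sequence_path, database_path, clark_path,
                        rank=None, databases=None, threads=None,
                        filter_reads=False, cutoff=None):
    """Return the argument list for one automateCLARK.py call (hypothetical helper)."""
    command = ["python", "automateCLARK.py",
               "-s", sequence_path, "-d", database_path, "-C", clark_path]
    if rank is not None:
        if rank not in VALID_RANKS:
            raise ValueError("unknown rank: {}".format(rank))
        command += ["-r", rank]
    if databases:
        unknown = [name for name in databases if name not in VALID_DATABASES]
        if unknown:
            raise ValueError("unknown database(s): {}".format(", ".join(unknown)))
        # Multiple databases are passed as one comma-separated value.
        command += ["-D", ",".join(databases)]
    if threads is not None:
        command += ["-t", str(threads)]
    if filter_reads:
        command.append("-f")
    if cutoff is not None:
        # The help text asks for a decimal: 0.05 means 5 percent.
        if not 0 < cutoff <= 1:
            raise ValueError("cutoff must be a decimal fraction, e.g. 0.05")
        command += ["-c", str(cutoff)]
    # The positional path argument goes last, as in the usage line.
    command.append(output_path)
    return command


if __name__ == "__main__":
    print(" ".join(build_clark_command(
        "/path", "/path/sequences", "/database/folder", "/CLARK/scripts/folder",
        rank="genus", databases=["bacteria", "viruses"], filter_reads=True,
        cutoff=0.05)))
```

The returned list can be handed to subprocess.call from the metagenomeFilter folder. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.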
filtermetagenome.py filters reads based on taxonomic assignment by CLARK. It can either
be called as a module (automateCLARK.py -f)
or run as a stand-alone script.
Not yet implemented: choosing specific taxIDs to use, rather than selecting all taxIDs that are greater than a cutoff value.
python filtermetagenome.py -s /path/sequences -d /CLARK/outputs/folder /path
Required arguments:
- path to the folder to be used to store results
- -s: path to folder containing sequences. I normally set this to be /path/sequences
- -d: path to folder containing CLARK output files
See usage below:
usage: filtermetagenome.py [-h] [-v] [-t THREADS] -s SEQUENCEPATH -d DATAPATH
[-c CUTOFF] [-x TAXIDS]
path
Filter reads based on taxonomic assignment
positional arguments:
path Specify path
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-t THREADS, --threads THREADS
Number of threads. Default is the number of cpus in
the system
-s SEQUENCEPATH, --sequencepath SEQUENCEPATH
Path of .fastq(.gz) files to process.
-d DATAPATH, --datapath DATAPATH
Path of .csv files created by CLARK with read ID,
length, and assignment.
-c CUTOFF, --cutoff CUTOFF
Cutoff value for deciding which taxIDs to use when
sorting .fastq files. Defaults to 1 percent. Please
note that you must use a decimal format: enter 0.05 to
get a 5 percent cutoff value
-x TAXIDS, --taxids TAXIDS
NOT IMPLEMENTED: CSV of desired taxIDs from each
sample.
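To illustrate what the cutoff means, the sketch below computes the fraction of reads assigned to each taxID in one CLARK .csv file and keeps the taxIDs whose fraction exceeds the cutoff. It is not the code used by filtermetagenome.py. It assumes the file has a header row followed by rows of read ID, length and assignment, that unassigned reads are marked NA, and that the fraction is taken over all reads in the file. The function name taxids_above_cutoff is hypothetical.

```python
import csv
from collections import Counter


def taxids_above_cutoff(csv_path, cutoff=0.01):
    """Return {taxID: fraction of reads} for taxIDs whose fraction exceeds cutoff."""
    counts = Counter()
    total = 0
    with open(csv_path) as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if len(row) < 3:
                continue  # ignore blank or malformed lines
            total += 1
            taxid = row[2].strip()
            # Assumption: "NA" marks a read without an assignment.
            if taxid and taxid != "NA":
                counts[taxid] += 1
    if total == 0:
        return {}
    fractions = {taxid: count / float(total) for taxid, count in counts.items()}
    # Cutoff is a decimal fraction: 0.05 keeps taxIDs with more than 5 percent of reads.
    return {taxid: frac for taxid, frac in fractions.items() if frac > cutoff}
```

With the default cutoff of 0.01, a taxID needs more than 1 percent of the reads in the file to be kept; raising the cutoff to 0.05 restricts the result to taxIDs that account for more than 5 percent.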