A simple tool to get biome-related data from MGnify and ENA
We recommend having conda installed to manage the virtual environments.
First, we create a conda virtual environment with:
wget https://raw.githubusercontent.com/genomewalker/get-biomes/master/environment.yml
conda env create -f environment.yml
Then we activate the environment (conda activate get-biomes) and install using pip:
pip install get-biomes
Using pip
pip install git+ssh://git@github.com/genomewalker/get-biomes.git
By cloning in a dedicated conda environment
git clone git@github.com:genomewalker/get-biomes.git
cd get-biomes
conda env create -f environment.yml
conda activate get-biomes
pip install -e .
getBiomes will take a file with a list of biomes and will retrieve all the samples belonging to those biomes from MGnify and ENA. It will then download the samples and create a directory structure to store the samples.
$ getBiomes --help
usage: getBiomes [-h] [--version] {search,download} ...
A simple tool to get biomes related data from MGnify and ENA
positional arguments:
{search,download} positional arguments
search Search for biomes in MGnify and ENA
download Download the biomes gathered by the subcommand search
optional arguments:
-h, --help show this help message and exit
--version Print program version
$ getBiomes search --help
usage: getBiomes search [-h] [--debug] -b BIOMES [--mgnify-filter MG_FILTER] [--ena-filter ENA_FILTER]
[--exclude-terms EXCLUDE_TERMS] [-p PREFIX] [-t THREADS] [--combine] [--clean]
optional arguments:
-h, --help show this help message and exit
--debug Print debug messages
search required arguments:
-b BIOMES, --biomes BIOMES
A txt file containing MGnify biomes. Ex: root:Environmental:Aquatic:Marine
search optional arguments:
--mgnify-filter MG_FILTER
Key-value pairs to filter the MGnify metadata. Valid values are:
experiment_type, biome_name, lineage, geo_loc_name, latitude_gte, latitude_lte,
longitude_gte, longitude_lte, species, instrument_model, instrument_platform,
metadata_key, metadata_value_gte, metadata_value_lte, metadata_value,
environment_material, environment_feature, study_accession or include
--ena-filter ENA_FILTER
Key-value pairs to filter the ENA metadata. Valid values are: read_count,
instrument_model, instrument_platform, library_layout, library_strategy,
library_selection or library_source
--exclude-terms EXCLUDE_TERMS
A comma-separated list of terms to exclude from the metadata
-p PREFIX, --prefix PREFIX
Prefix for the output file
-t THREADS, --threads THREADS
Number of threads to use
--combine Combine all output files into one
--clean Remove existing output files
One would run getBiomes search as:
getBiomes search -b test-biomes.txt -t 24
Where test-biomes.txt is a file containing the biomes to retrieve, one per line. For example:
root:Environmental:Aquatic:Marine
root:Environmental:Aquatic:Freshwater
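A biomes file like the one above can also be generated from a script. The following is a minimal Python sketch and is not part of getBiomes: the helper name write_biomes_file is hypothetical, and the check that every lineage starts with root is an assumption drawn from the example lineages rather than a documented rule.

```python
from pathlib import Path


def write_biomes_file(lineages, path):
    """Write one biome lineage per line, skipping blanks and duplicates.

    Hypothetical helper. The "root" check is an assumption based on the
    example lineages, not a rule known to be enforced by getBiomes.
    """
    kept = []
    for lineage in lineages:
        lineage = lineage.strip()
        if not lineage or lineage in kept:
            continue  # ignore empty entries and repeats
        if lineage.split(":")[0] != "root":
            raise ValueError(f"lineage does not start with 'root': {lineage!r}")
        kept.append(lineage)
    Path(path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return kept
```

Calling write_biomes_file(["root:Environmental:Aquatic:Marine", "root:Environmental:Aquatic:Freshwater"], "test-biomes.txt") would produce the two-line file shown above.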
By default, getBiomes will retrieve all the samples from MGnify and ENA that belong to the biomes specified in the input file. However, it is possible to filter the retrieved samples using the --mgnify-filter and --ena-filter options, and to drop unwanted entries with --exclude-terms. For example, the following command retrieves only the samples from the biomes specified in the input file that have been sequenced using Illumina, that are WGS and that have more than 10M reads. In addition, it removes any entry that contains the words human or 16S:
getBiomes search -b test-biomes.txt -t 24 --mgnify-filter '{"instrument_platform":"illumina","metadata_key":"investigation type","metadata_value":"metagenome"}' --ena-filter '{"library_layout":"PAIRED","library_strategy":"WGS","library_source":"METAGENOMIC","library_selection":"RANDOM", "read_count":10000000}' --exclude-terms human,16S
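Hand-writing the quoted filter strings is error-prone, so it can help to build the command line from Python dictionaries. The sketch below assumes that the filters are parsed as JSON, as the example above suggests. The function name build_search_command is hypothetical, and only flags listed in the search help text are used.

```python
import json
import shlex


def build_search_command(biomes_file, mgnify_filter=None, ena_filter=None,
                         exclude_terms=(), threads=1):
    """Return a shell-ready getBiomes search command line.

    Hypothetical helper; assumes the filter options accept JSON objects.
    """
    args = ["getBiomes", "search", "-b", str(biomes_file), "-t", str(threads)]
    if mgnify_filter:
        args += ["--mgnify-filter", json.dumps(mgnify_filter)]
    if ena_filter:
        args += ["--ena-filter", json.dumps(ena_filter)]
    if exclude_terms:
        args += ["--exclude-terms", ",".join(exclude_terms)]
    # shlex.join quotes each argument so the JSON survives the shell intact
    return shlex.join(args)
```

Printing the returned string gives a command that can be pasted into a shell. Alternatively, the argument list can be passed to subprocess.run directly, which avoids shell quoting altogether.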
The output file will contain the following columns:
accession
sample_accession
sample_name
longitude
latitude
geo_loc_name
studies
biome
sample_desc
environment_biome
environment_feature
environment_material
study_accession
experiment_accession
run_accession
read_count
instrument_model
instrument_platform
library_layout
library_strategy
library_selection
library_source
fastq_ftp
query_biome
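Before downloading, it can be useful to inspect the search results. The following Python sketch assumes the output file is tab-separated with a header row naming the columns listed above. The function names are hypothetical and not part of getBiomes. The fastq_ftp field follows the ENA convention of separating the files of one run with semicolons.

```python
import csv
from collections import Counter


def read_search_table(handle):
    """Parse a search table, assumed tab-separated with a header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def runs_per_biome(rows):
    """Count rows for each value of the query_biome column."""
    return Counter(row["query_biome"] for row in rows)


def fastq_locations(row):
    """Split the fastq_ftp field into individual file locations.

    ENA lists several files for one run (for example paired-end reads)
    as a single semicolon-separated string.
    """
    return [loc for loc in row["fastq_ftp"].split(";") if loc]
```

For a real file, one would open it with open("test-biomes__combined.tsv", newline="") and pass the handle to read_search_table.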
$ getBiomes download --help
usage: getBiomes download [-h] [--debug] -i INPUT [-o OUTDIR] [--clean] [-t THREADS]
optional arguments:
-h, --help show this help message and exit
--debug Print debug messages (default: False)
download required arguments:
-i INPUT, --input INPUT
A txt file containing MGnify biomes. Ex: root:Environmental:Aquatic:Marine
(default: None)
download optional arguments:
-o OUTDIR, --outdir OUTDIR
The directory to save the fastq files (default: None)
--clean Remove existing output files (default: False)
-t THREADS, --threads THREADS
Number of threads to use (default: 1)
Once the samples have been retrieved, one can download them using the subcommand download. For example:
getBiomes download -i test-biomes__combined.tsv -o test-biomes -t 24
Where test-biomes__combined.tsv is the output file from the subcommand search and test-biomes is the directory where the samples will be downloaded. The output directory contains the file download_report.tsv with the status of the downloaded files. One can continue downloading the samples that failed in a previous run. If --clean is specified, the output directory will be removed before downloading the samples.
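To decide whether another download run is needed, the report can be summarised with a few lines of Python. This is only a sketch: the layout of download_report.tsv is not described here, so the code assumes a tab-separated file with a header row, and the caller must supply the name of the column holding the status. The function name tally_report is hypothetical.

```python
import csv
from collections import Counter


def tally_report(handle, status_column):
    """Count rows per value of status_column in a tab-separated report.

    status_column is supplied by the caller because the real column
    names of download_report.tsv are an unknown here (assumption).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if status_column not in (reader.fieldnames or []):
        raise KeyError(f"column {status_column!r} not found in {reader.fieldnames}")
    return Counter(row[status_column] for row in reader)
```

After checking the header of the real report, one would open test-biomes/download_report.tsv with newline="" and pass the handle together with the appropriate column name.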