The Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP)

This is the Github repository for the Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP). Metabolic Gene Clusters (MGCs) are responsible of the synthesis of small molecules, that are known to have a major impact on the host. To evaluate their contribution to a given host phenotype, it is important to assess the MGC abundance and expression profile in a given dataset. Thus, we have built BiG-MAP, a bioinformatic tool to profile abundance and expression levels of gene clusters across metagenomic and metatranscriptomic data and evaluate their differential abundance and expression between different conditions. BiG-MAP is composed of 4 different modules:

BiG-MAP.download.py
BiG-MAP.family.py
BiG-MAP.map.py
BiG-MAP.analyse.py

For information on how to install and run BiG-MAP, check this tutorial. Please, take into account that this tool has been only tested on Ubuntu.

Installation

Install BiG-MAP dependencies using conda. Conda can be installed from miniconda_link. First pull the BiG-MAP repository from github:

~$ git clone https://github.com/medema-group/BiG-MAP.git

Then install all the dependencies from the BiG-MAP.yml file with:

# For BiG-MAP.download.py, BiG-MAP.family.py and BiG-MAP.map.py
~$ conda env create -f BiG-MAP_process.yml BiG-MAP_process
~$ conda activate BiG-MAP_process

# For BiG-MAP.analyse.py
~$ conda env create -f BiG-MAP_analyse.yml BiG-MAP_analyse
~$ conda activate BiG-MAP_analyse

This step is optional but to make use of the second redundancy filtering step in the BiG-MAP.family module, download BiG-SCAPE using:

~$ git clone https://git.wageningenur.nl/medema-group/BiG-SCAPE

Further below you will find more information on how to install BiG-SCAPE and its dependencies, but you can also check the wiki page for more information: https://git.wageningenur.nl/medema-group/BiG-SCAPE/-/wikis/installation.

Overview and example run

The typical workflow for BiG-MAP consists of the following steps:

Download WGS data using BiG-MAP.download.py (Optional)
Group gene clusters onto gene cluster families (GCFs) and housekeeping gene families (HGFs) using BiG-MAP.family.py
Assess gene cluster (and HGF representatives) abundance and expression profiles using BiG-MAP.map.py
Performs differential abundance/expression analysis and visualizes the output using BiG-MAP.analyse.py

The four modules are described more in detail below where you will also find the commands to run them.

1) BiG-MAP.download.py

This script is created to easily download the metagenomic and/or metatranscriptomic samples available in the SRA repository (https://www.ncbi.nlm.nih.gov/sra). First, the samples are downloaded in .SRA format, and then they are converted into .fastq using fastq-dump.

conda activate BiG-MAP_process
python3 BiG-MAP.download.py -h
python3 BiG-MAP.download.py [Options]* -A [accession_list_file] -O [path_to_outdir]

To download the samples, go to the SRA run selector and use the BioProject record of interest. Next, from the resulting page, download the Accession list. This is a file that contains a list of sample accessions and it is used as input for the BiG-MAP.download module. In this tutorial, the IBD-cohort of Schirmer et al. (2018) is used, thus the Accession list of the BioProject PRJNA389280 was downloaded. With the aim of simplifying the tutorial and speeding up the analysis, 8 metagenomic samples were chosen: 4 samples from patients suffering Crohn Disease (CD) and 4 from Ulcerative Colitis (UC). Create a file with the following 8 samples’ accessions and run BiG-MAP.download command to get the fastq files:

Acc_list.txt:
SRR5947837
SRR5947861
SRR5947824
SRR5947881
SRR5947836
SRR5947862
SRR5947855
SRR5947841

python3 BiG-MAP.download.py -A Acc_list.txt -O /usr001/fastq/schirmer/

2) BiG-MAP.family.py

The main purpose of this module is to group gene clusters into GCFs using sequence similarity. The first redundancy filtering step is performed by MASH, that by default uses 0.8 sequence similarity cut-off but can be changed as desired. Additionally, a second round of redundancy filtering can be performed by using BiG-SCAPE. We strongly recommend using BiG-SCAPE for a more accurate redundancy filtering. For that, look at the BiG-SCAPE wiki on how to install it: https://git.wageningenur.nl/medema-group/BiG-SCAPE/-/wikis/installation. To run BiG-SCAPE, you will also need to have the latest (processed) Pfam database Pfam-A.hmm.gz available from the Pfam FTP website (https://pfam.xfam.org/). Once the Pfam-A.hmm.gz file is downloaded, uncompress it and process it using the hmmpress command from the HMMER suit (http://hmmer.org/).

BiG-MAP.family takes as input the output directories of any anti- or gutSMASH run. Given a set of genomes, gutSMASH/antiSMASH can predict multiple gene clusters, thus the output folders containing the predicted gene clusters for each genome are the ones used as input for this module. Please, take into account that it needs to be run beforehand. To follow this tutorial without previously running anti- or gutSMASH, you can find 10 exemplary gutSMASH output folders in here: example data folder.

To make use of the tutorial gutSMASH folders, first extract the files using:

tar -xf BiG-MAP_tutorial_genomes.tar.gz

The general usage of BiG-MAP.family is:

conda activate BiG-MAP_process
python3 BiG-MAP.family.py -h
python3 BiG-MAP.family.py [Options]* -D [input dir(s)] -O [output dir]

Check the command below to see how to run this module with the tutorial samples and BiG-SCAPE:

python3 BiG-MAP.family.py -D /usr001/BiG-MAP_tutorial_genomes/ -b /usr001/BiG-SCAPE_location/ -pf /usr001/pfam_files_location/ -O /usr001/results_family/

This yields:
BiG-MAP.GCF.bed = Bedfile to extract core regions in BiG-MAP.map.py
BiG-MAP.GCF.fna = Reference file to map the WGS reads to
BiG-MAP.GCs.json = Dictionary that contains the GCFs
BiG-MAP.GCF.json = Dictionary that contains the BiG-SCAPE GCFs

In general, the anti- or gutSMASH-output folder should contain the results of at least several runs. Optional flags to run this module include:

-tg: Fraction between 0 and 1; the similarity threshold that determines when the protein sequences of the gene clusters can be considered similar. If the threshold is set to zero, all gene clusters will form their own gene cluster families, whereas a threshold of one will result in one large family containing all gene clusters. Default = 0.8.

-th: Fraction between 0 and 1; the similarity threshold that determines when the protein sequences of the housekeeping genes can be considered similar. Default = 0.1

-f: Specify here the number of genes that are flanking the core genes of the gene cluster. 0 –> only the core, n –> n genes included that flank the core. Default = 0

-g: Output whole genome fasta files for the MASH filtered gene clusters as well. This uses more disk space in the output directory. ‘True’ | ‘False’. Default = False

-p: Number of used parallel threads in the BiG-SCAPE filtering step. Default = 6

NOTE: the number of predicted MGCs may exceed the maximum number of sequences that MASH (“sketch” and “dist” functions) is able to compare, leading to an error. In this scenario, the family module can be run in batches or the code can be slightly modified to manually run the MASH analysis and MASH “paste” function (more information is available in their documentation at https://mash.readthedocs.io/en/latest/) and pick up the analysis again from that step onwards.

3) BiG-MAP.map.py

This module is designed to align the WGS (paired or unpaired) reads to the reference representatives of each GCF and HGF using bowtie2. The following will be computed: RPKM, coverage, core coverage. The coverage is calculated using Bedtools, and the read count values using Samtools. The general usage is:

conda activate BiG-MAP_process
python3 BiG-MAP.map.py -h
python3 BiG-MAP.map.py {-I1 [mate-1s] -I2 [mate-2s] | -U [samples]} {-R [reference] -F [family] | -P [pickled file]} -O [outdir] -b [metadata] [Options*]

To map the 8 samples from Schirmer et al. (2018) to the GCF reference representatives, and correct for the BiG-SCAPE GCF size, run:

NOTE: It is important for downstream analysis to also use the -b flag. Also, if it is prefered to use the averaged number of reads mapped per GCF (instead of summed), the flag -a or –average needs to be included in the command below

python3 BiG-MAP.map.py -b /usr001/results/schirmer_metadata.txt -I1 /usr001/fastq/schirmer/*pass_1* -I2 /usr001/fastq/schirmer/*pass_2* -O /usr001/results_mapping/ -F /usr001/results_family/


the schirmer_metadata.txt is set up as follows (tab-delimited):
#run.ID	host.ID	SampleType	DiseaseStatus
SRR5947837	M2026C2_MGX	METAGENOMIC	UC
SRR5947861	M2026C3_MGX	METAGENOMIC	UC
SRR5947824	M2026C4_MGX	METAGENOMIC	UC
SRR5947881	M2026C7_MGX	METAGENOMIC	UC
SRR5947836	M2027C1_MGX	METAGENOMIC	CD
SRR5947862	M2027C2_MGX	METAGENOMIC	CD
SRR5947855	M2027C3_MGX	METAGENOMIC	CD
SRR5947841	M2027C5_MGX	METAGENOMIC	CD

note the '#' to denote the header row!!!

4) BiG-MAP.analyse.py

This module performs a statistical analysis on the metagenomic/metatranscriptomic samples. First, the script normalizes and filters the data. Whereafter, the best covered gene clusters can be observed using the –explore flag. Next, the Kruskal Wallis and fitZIG model will be used to compute differentially abundant/expressed gene clusters and Benjamini-Hochberg FDR compensates for multiple hypothesis testing. The output of the script are several heatmaps in pdf format.

To run the script, the BiG-MAP_analyse conda environment should be activated. The general usage is:

conda activate BiG-MAP_analyse
python3 BiG-MAP.analyse.py -h
python3 BiG-MAP.analyse.py --explore --compare -B [biom_file] -T [metagenomic/metatranscriptomic] -M [metagroup] -O [outdir] [Options*]

Example command for the explore heatmap:
python3 BiG-MAP.analyse.py --explore -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -O /usr001/results_analysis

Example command for the compare heatmap:
python3 BiG-MAP.analyse.py --compare -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -g UC CD -O /usr001/results_analysis

Example command including both the explore and the compare heatmap:
python3 BiG-MAP.analyse.py --explore --compare -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -g UC CD -O /usr001/results_analysis

Note: You can either choose between the BiG-MAP.map.metacore.dec.biom or the BiG-MAP.mapcore.metacore.dec.biom as -B flag input file, depending if you are interested on plotting the results for the whole gene clusters or only the core genomic region of the gene clusters respectively.

Output: 
explore_heatmap.pdf & explore_heatmap.eps -> contains the top 20 best covered gene clusters
UCvsCD_fz.pdf & UCvsCD.eps -> comparison between UC and CD using the fitZIG model
UCvsCD_kw.pdf & UCvsCD_kw.eps -> comparison between UC and CD using the Kruskal Wallis model
tsv-results -> directory containing tsv files with the raw data

Snakemake workflow

This Snakemake workflow allows for a more automated and streamlined running of the separated BiG-MAP modules. For more information on how to install and run the Snakemake version of BiG-MAP, check the instructions below.

Installation and run overview

Install BiG-MAP dependencies using conda. Conda can be installed from miniconda_link. First pull the BiG-MAP repository from github:

~$ git clone https://github.com/medema-group/BiG-MAP.git

Install Snakemake with the following command:

~$ conda create -n snakemake -c conda-forge -c bioconda snakemake=6.0.2
~$ conda activate snakemake

Next, copy the BiG-MAP_snakemake folder to the preferred output location

~$ cp -r BiG-MAP/BiG-MAP_snakemake/ /path/to/output/location/

Navigate to the BiG-MAP_snakemake folder and adjust the config.yaml file. In this file, the locations of files and folders should be included based on the wanted BiG-MAP run settings. After adjusting the config file, use the following command to start the BiG-MAP run:

~$ snakemake --use-conda --cores 10

NOTE: It’s recommended to use a conda-prefix location to a folder in which the BiG-MAP conda environments can be installed (–conda-prefix path/to/snakemake/conda/envs). Additionally, the number of cores used by BiG-MAP can be adjusted with the –cores flag.

Requirements

Input data:

antiSMASH v5.0 or higher
gutSMASH

Software:

Python 3+
R statistics
fastq-dump
Mash
HMMer
Bowtie2
Samtools
Bedtools
biom
BiG-SCAPE=20191011

Packages:

Python

BioPython
pandas

R

metagenomeSeq
biomformat
ComplexHeatmap=2.0.0
viridisLite
RColorBrewer
tidyverse

raufs/BiG-MAP