A workflow to perform functional enrichment analysis for prokaryotes using deltaTE or DESeq2 input. This workflow was developed to be used with the processing pipeline of HRIBO, but will also work with any other correctly formatted input data.
💡 An example and a description on how to run it is available in the example section 💡
❗ We suggest installing dependencies for the tools using conda. We suggest installing miniconda.
To run pathsnake which uses the tool clusterprofiler, an annotation database is needed to perform Gene Ontology Term-based enrichment analysis.
When no curated database is available for an organism (e.g. for E. coli), it is possible to create a database based on an existing annotation. In our case, we use the UniProtKB to retrieve available GO Term annotation and create an annotation database.
The general idea of this method is to retrieve a table of Gene Identifiers and GO Terms, which can be turned into an SQ-lite Database. This method will work with any table of GO Terms / GeneIDs (and others... It only requires structuring said table correctly).
The scripts and files, we used are available in the database_creation
subfolder.
If you want to create your own database, the following dependencies are needed to run the provided scripts:
- python3
- pandas
- r-base
- bioconductor-annotationforge
- bioconductor-go.db
You can install these using conda with the database.yml
. All tool versions used for our analysis can be found in this file.
conda env create -f database.yml
conda activate database_creation
To retrieve the table visit UniProtKB and enter an identifier/name to access your organism.
Click on customize columns and make sure all desired columns are visible in the table.
For our example we required: Entry
, Gene Names
, Gene Names (ordered locus)
, Gene Ontology IDs
, KEGG
.
Depending on your use-case you might need more or less of these columns.
After retrieving the table, we need to create a table for each SQL-database table.
To this end we ran filter_uniprot_table.py
, which extracts the columns from the downloaded table and creates go.csv
, ko.csv
, refseq.csv
and uniprot.csv
.
These files contain mappings from different identifiers to either GO Terms or KEGG identifiers.
python3 filter_uniprot_table.py
We then use the AnnotationForge R library to create the SQLite database.
To this end adjust the settings in the make_annotation_db.R
script (i.e. name, email, identifiers for your organism) and run:
Rscript make_annotation_db.R
This will create an annotation database in the subfolder you specified, with the correct naming convention. This can then be used as an input for clusterprofiler
.
Pathsnake expects as an input an annotation database, as well as an deltaTE
output folder from HRIBO
.
It also works with any other correctly formatted differential expression output tables and does not necessarily require HRIBO
.
To run the workflow, install snakemake all other dependencies will be installed automatically when running pathsnake
.
You can also install it using conda:
conda env create -f snakemake.yml
conda activate snakemake
This file can either be downloaded if a curated version exists for your organism. Otherwise you can create a database yourself using the guide in the previous section.
The differential expression output is required by clusterprofiler to do the analysis.
In fact any differential expression results will work, the difference between both options is that deltaTE provides RNA-seq and RIBO-seq results, while DESeq only provides RNA-seq based results. As long as they are structured correctly, you can use any differential expression results.
The required format for deltaTE
is:
- An excel table with the relevant data in the first sheet.
- Naming convention is |treated_condition|-|control|_sorted.xlsx (e.g. B-A_sorted.xlsx)
- It needs to contain the columns:
RIBO_log2FoldChange
,RIBO_padj
,RNA_log2FoldChange
,RNA_padj
The required format for deseq2
is:
- An excel table with the relevant data in the first sheet.
- Naming convention is |treated_condition|-|control|_sorted.xlsx (e.g. B-A_sorted.xlsx)
- It needs to contain the columns:
log2FC
,pvalue_adjusted
As long as this format is met, any results can be used.
Multiple files excel
files can be present in the folder, they will be automatically processed in parallel. (e.g. for multiple conditions)
Set up the config file:
- Set the path to the differential expression folder.
- Set the idColumn to be used in your provided differential expression tables (e.g. Locus_tag, Old_locus_tag, etc...).
⚠️ This ID needs to be compatible with the Identifiers used in the annotation database. - Set the organism code for KEGG.
- Set the organism annotation database for clusterprofiler.
Run the workflow by calling:
snakemake -p -k --use-conda -s pathsnake/Snakefile --configfile pathsnake/config.yaml --directory ${PWD} -j 20 --latency-wait 60
.snakemake
folder with all the dependencies required to run the tool, make sure to delete it when you no longer need the tool.
Unfortunately, it is very hard to create universally well-looking plots for every kind input data using clusterprofiler
.
Known issues might be:
- manual filtering of redundant terms
- very long GO Terms that break the plotting
- custom coloring
- custom setup of conditions
- etc ...
This was also needed for the data used in our publication, therefore we used the script provided in the custom_scripts
subfolder to customize the appearance of our output results.
We added it for reproducability of our plots and to expose the functions that can be used to customize your plots using ggplot2
.
Before starting with the example run, download the repository onto your linux
machine. We tested this on a system using Ubuntu 20.04.3 LTS
.
git clone https://github.com/RickGelhausen/pathsnake.git
❗ For the remainder of this example, we will assume that you navigated into the downloaded repository.
conda
might take a long time to install the dependencies. If this is the case, we suggest to use mamba
instead.
First we will generate the database for methanosarcina mazei.
❗ The database generation step can be skipped. The finished database can be found in the example
folder.
- Install the dependencies:
conda env create -f database.yml
conda activate database_creation
- Run the filtering script on the UniProtKB table:
❗ Details on how to get this file for your organism can be found here: UniProtKB.
python3 database_creation/filter_uniprot_table.py -t=example/methanosarcina_uniprot_230403.tsv -o=example/
This will filter the file for missing IDs and remove incomplete entries. It will create 4 tables that will subsequently be used to create tables in the SQLite database.
- Building the database using
annotationForge
:
Rscript database_creation/make_annotation_db.R --input_folder example/ --output_dir example/test_database/ --version "0.1" --maintainer "Main Tainer <main.tainer@email.com>" --author "Au Thor <au.thor@email.com>" --tax_id 192952 --genus "Methanosarcina" --species "mazei"
This will create the database into a new folder example/test_database/
and tests whether it can be loaded via R
.
- Install the dependencies:
conda env create -f snakemake.yml
conda activate snakemake
❗ Make sure that you have a working c compiler installed to ensure that all R packages can be compiled.
On Ubuntu you can install the build-essential
s.
sudo apt-get install build-essential
- Prepare your differential expression results
We use results from the tool deltaTE
, generated using HRIBO
. If you want to use your own data please refer to this section.
We provide a mock example file in the example
folder.
❗ The values in this file are completely unrelated to the real results provided in the publication.
- Setup the config file
This is already prepared to be used for this example.
It contains information on the location of the input file, the column to use as an identifier and kegg ids to be used.
- Run the workflow
To run the workflow, simply call snakemake and it will automatically download the dependencies.
nohup snakemake -p -k --use-conda -s pathsnake/Snakefile --configfile pathsnake/config.yaml --directory ${PWD} -j 20 --latency-wait 60 &
❗ The first run can take a long time due to the installation of all the R dependencies. Subsequent runs will be a lot faster.
- Inspecting the results
The results are automatically put into a folder called functional_analysis
.
This folder contains all contrasts
for different input excel files. Inside of these folders the following files are available:
- ribo_diffex.tsv and rna_diffex.tsv: These files contain the raw input data for the tool deltaTE.
- RIBO / RNA folders: These contain the results for the RIBO-seq and RNA-seq data respectively.
- GSEA / KEGG / ORA folders: These contain the results for the Gene Set Enrichment Analysis, the KEGG analysis and the Over Representation Analysis respectively.
- plots / tables folders: These contain the plots or tables resulting from
clusterProfiler
.
- Custom scripts
Often the resulting tables and plots are not very desirable. Therefore, we usually manually customize the plots after inspecting the output tables of clusterprofiler
.
The plotting script can be adapted and called for specific files with adapted ggplot2 functions.
Rscript custom_scripts/plot_ORA_mmazei.R
Using the following command line arguments.
option_list = list(
make_option(c("-i", "--input_file"), type="character", default=NULL, help="File containing a column log2FC and an ID/gene symbol", metavar="character"),
make_option(c("-d", "--organism_database"), type="character", default=NULL, help="Annotation Database for the organism of interest", metavar="character"),
make_option(c("-o", "--output_path"), type="character", default=NULL, help="The output path containg all output tables and plots.", metavar="character"),
make_option(c("-p", "--padj_cutoff"), type="double", default=0.05, help="The p-value cutoff for the enrichment analysis", metavar="double"),
make_option(c("-l", "--log2FC_cutoff"), type="double", default=1, help="The log2FC cutoff for the enrichment analysis", metavar="double")
)
❗ This script allows to customize the ORA plots by changing the scripts. This is just an example of how we created the final plots for the publication.