
A highly scalable, user-interactive tool for the large-scale analysis of Biosynthetic Gene Cluster data

Primary language: Python. License: GNU Affero General Public License v3.0 (AGPL-3.0).

BiG-SLiCE

Biosynthetic Gene clusters - Super Linear Clustering Engine

Version 2.0 is here!

  • Clustering now uses cosine-like (via l2-normalization) distances (as in https://www.nature.com/articles/s41564-022-01110-2)
  • pHMM databases have been updated to PFAM 35.0
  • BGC class definition has been updated to antiSMASH v7.0.0
  • Switching from HMMER to pyHMMER (speed-ups, can now be fully installed via pip)
  • General speed improvement
  • Ability to export pre-calculated BGC and GCF tables into TSV files (use the --export-csv parameter)
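
The "cosine-like (via l2-normalization)" wording in the first bullet rests on a standard identity: once two vectors are scaled to unit length, their squared Euclidean distance equals 2 × (1 − cosine similarity), so a Euclidean-based comparison of normalized vectors ranks pairs the same way cosine distance would. The sketch below checks that identity on two made-up feature vectors. The vectors and helper names are illustrative assumptions and are not taken from BiG-SLiCE's code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical feature vectors, for illustration only.
bgc_a = [3.0, 0.0, 4.0]
bgc_b = [1.0, 2.0, 2.0]

lhs = squared_euclidean(l2_normalize(bgc_a), l2_normalize(bgc_b))
rhs = 2.0 * (1.0 - cosine_similarity(bgc_a, bgc_b))

# Both sides agree up to floating-point rounding.
print(math.isclose(lhs, rhs))
```

Because the two quantities differ only by a monotonic transformation, nearest-neighbour rankings are the same under either measure.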

Quick start

  1. Make sure you have HMMer (version 3.2b1 or later) installed.
  2. Install BiG-SLiCE using pip:
  • from PyPI (stable)
user@local:~$ pip install bigslice
  • from source (bleeding edge -- only do this when you know what you are doing!)
user@local:~$ pip install git+https://github.com/medema-group/bigslice.git
  3. Fetch the latest HMM models (~271MB gzipped):
user@local:~$ download_bigslice_hmmdb
  4. Check your installation:
user@local:~$ bigslice --version .

==============
BiG-SLiCE version 2.0.0
HMM databases version: bigslice-models-2022-11-30
Biosynthetic-pfam md5: 37495cac452bf1dd8aff2c4ad92065fe
Sub-pfam md5: 2e6b41d06f3c318c61dffb022798091e
==============
  5. Run BiG-SLiCE clustering analysis (see wiki: Input folder on how to prepare the input folder):
user@local:~$ bigslice -i <input_folder> <output_folder>

For a "minimal" test run, you can use the example input folder that we provided.

Querying antiSMASH BGCs

Using the --query mode, you can perform a blazing-fast query of a putative BGC against the pre-processed set of Gene Cluster Family (GCF) models that BiG-SLiCE outputs. For example, you can query against our pre-processed result on ~1.2M microbial BGCs from the NCBI database (a 17GB zipped file download); note that there is currently no pre-processed result for BiG-SLiCE v2, and we will work to make one available soon. You will get a ranked list of GCFs and BGCs similar to the BGC in question, which helps in determining the function and/or novelty of that BGC. To perform a GCF query, simply use:

user@local:~$ bigslice --query <antismash_output_folder> --n_ranks <int> <output_folder>

This performs a query analysis on the latest clustering result contained in the output folder (see wiki: Program parameters for more advanced options). The top n_ranks matching GCFs are returned along with their similarity measurements. You can then view the query results using the user-interactive output (see below).
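
When queries need to be scripted (for instance, one run per antiSMASH result folder), the command line above can be assembled in Python. This is a minimal sketch that uses only the flags shown in this section (--query and --n_ranks); the helper name build_query_command and the folder names are hypothetical, and the actual call is left commented out because it requires BiG-SLiCE on the PATH and an existing clustering result.

```python
import shlex


def build_query_command(antismash_dir, output_dir, n_ranks):
    """Return the argument list for a BiG-SLiCE query run.

    Hypothetical helper: it only mirrors the command shown in this section.
    """
    if n_ranks < 1:
        raise ValueError("n_ranks must be a positive integer")
    return [
        "bigslice",
        "--query", str(antismash_dir),
        "--n_ranks", str(n_ranks),
        str(output_dir),
    ]


# Placeholder folder names, for illustration only.
cmd = build_query_command("my_antismash_run", "bigslice_out", 5)
print(shlex.join(cmd))

# To actually run the query:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when folder names contain spaces.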

Custom GenBank input

To perform GCF analyses on BGCs not covered by antiSMASH/MIBiG (e.g., from tools like ClusterFinder and DeepBGC, or BGCs with manually refined cluster borders), you can use the converter script that we provided, which takes a (genome) GenBank file along with a comma-separated descriptor file for every BGC to be generated (please see the example input files provided in the script's folder).

User Interactive output

BiG-SLiCE's output folder contains both the processed input data (in the form of an SQLite3 database file) and some scripts that power a mini web-app to visualize that data. To run this visualization engine, follow these steps:

  1. Fulfill the web-app's package requirements:
user@local:~$ pip install -r <output_folder>/requirements.txt
  2. Run the Flask server:
user@local:~$ bash <output_folder>/start_server.sh <port(optional)>
  3. Open a web browser, then go to the URL described by the previous step:
  • e.g. * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
  • then go to http://0.0.0.0:5000 in your browser

Programmatic Access and Postprocessing

To access BiG-SLiCE's preprocessed data, (advanced) users need to be able to run SQL(ite) queries. Although the learning curve may be steeper than with conventional tabular output files, once you are familiar with it, the SQL database provides an easy-to-use yet very powerful way to wrangle the data. Please refer to our publication manuscript to get an idea of what can be done with the output data. Additionally, you can download and reuse some Jupyter notebook scripts that we wrote to perform all analyses and generate figures for the manuscript.
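
A reasonable first step with an unfamiliar SQLite file is to list its tables and their row counts before writing any specific query. The sketch below does that using only Python's standard sqlite3 module and the built-in sqlite_master catalog, so it makes no assumption about BiG-SLiCE's schema. The function name summarize_tables is hypothetical, and the demo runs on a throwaway database; to inspect real output, pass the path of the database file inside your output folder (its exact location is not stated here, so check the folder).

```python
import os
import sqlite3
import tempfile


def summarize_tables(db_path):
    """Return {table_name: row_count} for every user table in an SQLite file."""
    if not os.path.isfile(db_path):
        # sqlite3.connect() would silently create an empty file otherwise.
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite%' "
                "ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier; table names cannot be bound as parameters.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()


# Demo on a throwaway database so the sketch runs without BiG-SLiCE output.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.db")
    demo_conn = sqlite3.connect(demo_path)
    demo_conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, label TEXT)")
    demo_conn.executemany("INSERT INTO demo (label) VALUES (?)", [("a",), ("b",)])
    demo_conn.commit()
    demo_conn.close()
    demo_counts = summarize_tables(demo_path)

print(demo_counts)
```

Once the table names are known, the same connection pattern can be reused for ordinary SELECT queries or handed to a dataframe library.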

What kind of software is this, anyway?

[Figure: bgc_gcf_illustration]

Bacteria and fungi produce a vast array of bioactive compounds in nature, which can be useful for us as antibiotics (see this list), antivirals (see this list) and anticancer drugs (see Salinisporamide). To optimize and retain the production of those complex chemical agents, microbes organize the responsible genes into genomic 'clumps' colloquially termed "Biosynthetic Gene Clusters (BGCs)" (above picture, left panel). Using bioinformatics tools such as antiSMASH, we can now take a genome sequence, identify its BGCs and predict the secondary metabolites that the organism may produce (see this example analysis for the S. coelicolor genome). Furthermore, by doing a large-scale comparative analysis of homologous BGCs sharing similar domain architectures (we call them "Gene Cluster Families (GCFs)"), we can practically chart an atlas of biosynthetic diversity among all sequenced microbes (above picture, right panel).

[Figure 1]

To enable such large-scale analysis, BiG-SLiCE was specifically designed with scalability and speed as the #1 priority (Figure 1A), as opposed to our previous tool, BiG-SCAPE, which was able to sensitively capture the slightest differences in both domain architecture and sequence similarity between pairs of BGCs (see our paper for the details). As a result, BiG-SLiCE can reliably take an input of more than 1.2 million BGCs and process it in less than a week of runtime on a 36-core machine with 128GB RAM (Figure 1B), while keeping enough sensitivity to delineate the essential biosynthetic 'signals' among the input BGCs (Figure 1C). Moreover, to facilitate exploration and investigation of the analysis results, BiG-SLiCE also produces an interactive, easy-to-use output visualization that can be run with minimal software and hardware requirements.

This software was initially developed and is currently maintained by Satria Kautsar (twitter: @satriaphd) as part of a fully funded PhD project granted to Dr. Marnix Medema (website: marnixmedema.nl, twitter: @marnixmedema) by the Graduate School of Experimental Plant Sciences, NL. Contributions and feedback are very welcome. Feel free to drop us an e-mail if you have any questions regarding or related to BiG-SLiCE. In the future, we aim to make BiG-SLiCE a comprehensive platform for all sorts of downstream large-scale BGC analyses, taking advantage of its portable and powerful SQLite3-based data storage combined with the flexible Flask-based web app architecture as the foundation.

Find our software useful? Please cite!

Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154. https://doi.org/10.1093/gigascience/giaa154