Note!

This repository provides data and examples that were used for development of DeepBGC and its evaluation with ClusterFinder and antiSMASH.

See https://github.com/Merck/deepbgc for the DeepBGC tool.

Note!

DeepBGC development & evaluation code

Reproducing data

Reproduction and storage of data files is managed using DVC (development version 0.22.0). Each data file has a .dvc history file that contains the command that was used to generate the output along with md5 hashes of its dependencies.

Installation

Install python 3, ideally using conda
Run pip install -r requirements.txt to download DVC and other requirements

Downloading a file

Run the AWS config script to generate temporary AWS credentials in ~/.aws/credentials:
- generate-aws-config --account lab --insecure
Run dvc pull data/path/to/file.dvc to download required file.

High-level overview

Main folders

bgc_detection/ all the code
data/ all the data
- bacteria/ 3k reference bacteria
  - candidates/ novel detected BGC candidates
- clusterfinder/ ClusterFinder (Cimermancic et al.) datasets
- evaluation/ Cross-validation, Leave-Class-Out and Bootstrap evaluation
- features/ Pfam2vec and other protein domain features
- figures/ Paper figures
- mibig/ MIBiG BGC database samples
- models/ Model configurations and trained models
- pfam/ Pfam repository files
- training/ Negative and positive training data and t-SNE visualizations
notebooks/ Jupyter (iPython) notebooks

Training a model

Define a JSON config file, see data/models/config for reference.
Run bgc_detection/run_training.py with given config and path to training data. See DVC files in data/models/trained for reference.
Trained model will be presented as Python pickle file.

Predicting using trained model

Prepare a protein FASTA file, e.g. using Prodigal (see data/bacteria/proteins.dvc for reference) or extract it from an annotated GenBank file using bgc_detection/preprocessing/proteins2fasta.py.
Detect protein domains using Hmmscan (see data/bacteria/domtbl.dvc for reference)
Convert the Hmmscan domtbl file into a Domain CSV file using bgc_detection/preprocessing/domtbl2csv.py (see data/bacteria/domains.dvc for reference)
Predict BGC domain-level probability using bgc_detection/run_prediction.py (see data/bacteria/prediction/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k.dvc for reference)
Threshold and merge domain-level predictions into a BGC candidate CSV file using bgc_detection/candidates/threshold_candidates.py (see [data/bacteria/candidates/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k-fpr2/candidates.csv.dvc] for reference)

prihoda/bgc-pipeline-1

Note!

Note!

DeepBGC development & evaluation code

Reproducing data

Installation

Downloading a file

High-level overview

Main folders

Training a model

Predicting using trained model

Bootstrap validation on 9 Fully-annotated genomes

Leave Class Out validation and Cross validation

Random Forest classification

Novel BGC candidates generation