/bgc-pipeline-1

Primary language: Jupyter Notebook. License: MIT.

Note!

This repository provides the data and examples that were used for the development of DeepBGC and its evaluation with ClusterFinder and antiSMASH.

See https://github.com/Merck/deepbgc for the DeepBGC tool.

DeepBGC development & evaluation code

Reproducing data

Reproduction and storage of data files are managed using DVC (development version 0.22.0). Each data file has a .dvc history file that contains the command used to generate the output, along with MD5 hashes of its dependencies.
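To illustrate how the recorded hashes can be used, here is a minimal sketch that computes the MD5 digest of a local file so it can be compared by hand with a hash copied from a .dvc file. The function names `file_md5` and `matches_recorded_hash` are hypothetical and are not part of DVC or this repository. The sketch hashes raw bytes; DVC's own hashing may differ in details (for example, line-ending handling for text files), so a mismatch does not necessarily mean the file is corrupted.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_md5):
    """Compare a file's MD5 with a hash copied by hand from a .dvc file.

    Assumption: recorded_md5 is a plain hex string; surrounding whitespace
    and letter case are ignored.
    """
    return file_md5(path) == recorded_md5.strip().lower()
```

For example, `matches_recorded_hash("data/path/to/file", "<hash from the .dvc file>")` returns `True` only when the raw bytes on disk hash to the recorded value.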

Installation

  • Install Python 3, ideally using conda
  • Run pip install -r requirements.txt to download DVC and other requirements

Downloading a file

  • Run the AWS config script to generate temporary AWS credentials in ~/.aws/credentials:
    • generate-aws-config --account lab --insecure
  • Run dvc pull data/path/to/file.dvc to download the required file.
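When several files are needed, the pull step above can be scripted. The following is a rough sketch, assuming the `dvc` executable is on the PATH and that AWS credentials have already been generated as described. The helper names `build_pull_command` and `pull_files` are hypothetical and do not exist in DVC or in this repository.

```python
import subprocess


def build_pull_command(dvc_file):
    """Build the argument list for `dvc pull <file.dvc>`."""
    dvc_file = str(dvc_file)
    if not dvc_file.endswith(".dvc"):
        raise ValueError(f"expected a .dvc file, got: {dvc_file}")
    return ["dvc", "pull", dvc_file]


def pull_files(dvc_files, runner=subprocess.run):
    """Pull each .dvc file in turn, stopping at the first failure.

    `runner` defaults to subprocess.run; with check=True it raises
    CalledProcessError when dvc exits with a non-zero status.
    """
    for dvc_file in dvc_files:
        runner(build_pull_command(dvc_file), check=True)
```

Calling `pull_files(["data/path/to/file.dvc"])` runs the same command as the manual step. The `runner` parameter exists only so that the helper can be exercised without invoking DVC.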

High-level overview

Main folders

Training a model

Predicting using a trained model

Bootstrap validation on 9 fully annotated genomes

See notebooks/LabelledContigBootstrap.ipynb.

Leave-class-out validation and cross-validation

See data/evaluation/lco-neg-10k (TODO).

See data/evaluation/cv-10fold-neg-10k (TODO).

Random Forest classification

See notebooks/CandidateClassification.ipynb and notebooks/CandidateActivityClassification.ipynb.

Novel BGC candidates generation

See notebooks/NovelCandidates.ipynb.