MEGA (Microbial hEterogeneous Graph Attention)

Warning: This repository is under heavy development and the content is not final yet.

MEGA is a deep learning-based Python package for identifying cancer-associated intratumoral microbes.

If you have any questions or feedback, please contact Qin Ma (qin.ma@osumc.edu).

The package is also available on PyPI: https://pypi.org/project/pyMEGA/

News

v0.0.5 - 4/14/2023

Updated:

Added tutorials for the Circos plot and for the network and UpSet plots

v0.0.5 - 3/12/2023

Updated:

Rename to MEGA

v0.0.4 - 2/18/2023

Updated:

Fixed grammar and spelling errors
Updated MEGA installation steps

v0.0.3 - 2/16/2023

Added:

Complete workflow from the raw abundance matrix and metadata labels to the final prediction results
Improved tutorial for GPU and CPU version usage

v0.0.2 - 2/3/2023

Added:

Example data using a TCGA subset
Example databases, including the NJS16 metabolic database and the NCBI taxonomy database

v0.0.1 - 1/24/2023

Added:

GitHub published: https://github.com/OSU-BMBL/MEGA
PyPI published: https://pypi.org/project/pyMEGA/

Dev environment

MEGA is developed and tested in the following software and hardware environment:

Python: 3.7.12
PyTorch: 1.4.0
NVIDIA Driver Version: 450.102.04
CUDA Version: 11.6
GPU: A100-PCIE-80GB
System: Red Hat Enterprise Linux release 8.3 (Ootpa)

Installation

The following packages and versions are required to run MEGA:

Python: 3.7+
CUDA: 10.2
torch==1.4.0 (must be 1.4.0)
torch-cluster==1.5.4
torch-geometric==1.4.3
torch-scatter==2.0.4
torch-sparse==0.6.1
R > 4.0
taxizedb (an R package for querying the NCBI taxonomy database)

Note: We strongly recommend installing the dependencies with micromamba (about 10 minutes) rather than conda (which can take more than 2 hours). If you prefer conda, replace micromamba with conda in the commands below.

If you have a GPU available, follow the "GPU version (CUDA 10.2)" section.

If you only have a CPU available, follow the "CPU version" section.

GPU version (CUDA 10.2)

Add channels using conda

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Create a virtual environment for MEGA

micromamba create -n MEGA_env python=3.7 -y

Activate MEGA_env

micromamba activate MEGA_env

Install PyTorch v1.4.0

micromamba install pytorch==1.4.0 cudatoolkit=10.1 -c pytorch -y

Install the other required packages with pip

pip install dill kneed imblearn matplotlib tqdm seaborn pipx

Install torch-geometric for PyTorch v1.4.0

pip install torch-scatter==2.0.4 torch-sparse==0.6.1 torch-cluster==1.5.4 torch-spline-conv==1.2.0 torch-geometric==1.4.3 -f https://data.pyg.org/whl/torch-1.4.0%2Bcu101.html

Install MEGA

pip install pyMEGA

Install R and taxizedb

micromamba install R -y
Rscript -e 'install.packages("taxizedb", repos = "https://cloud.r-project.org")'

Verify the installation

MEGA -h

CPU version

Add channels using conda

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Create a virtual environment for MEGA

micromamba create -n MEGA_cpu_env python=3.7 -y

Activate MEGA_cpu_env

micromamba activate MEGA_cpu_env

Install PyTorch v1.4.0

#micromamba install pytorch==1.4.0 cpuonly -c pytorch -y
pip install torch==1.4.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

Install the other required packages with pip

pip install dill kneed imblearn matplotlib tqdm seaborn pipx

Install torch-geometric for PyTorch v1.4.0

pip install torch-scatter==2.0.4 torch-sparse==0.6.1 torch-cluster==1.5.4 torch-spline-conv==1.2.0 torch-geometric==1.4.3 -f https://data.pyg.org/whl/torch-1.4.0%2Bcpu.html

Install MEGA

pip install pyMEGA

Install R and taxizedb

micromamba install R -y
Rscript -e 'install.packages("taxizedb", repos = "https://cloud.r-project.org")'

Verify the installation

MEGA -h
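
Because the PyTorch-related packages are pinned to exact versions, it can help to confirm what actually got installed in the active environment. The following is a minimal sketch, not part of MEGA: the function names are hypothetical, the pins are copied from the requirements list above, and it assumes each package exposes a `__version__` attribute under its usual import name (for example, torch-scatter imports as `torch_scatter`).

```python
import importlib

# Pins copied from the requirements list above; keys are import names.
PINNED = {
    "torch": "1.4.0",
    "torch_cluster": "1.5.4",
    "torch_geometric": "1.4.3",
    "torch_scatter": "2.0.4",
    "torch_sparse": "0.6.1",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be loaded."""
    try:
        module = importlib.import_module(module_name)
    except (ImportError, OSError):  # OSError: a compiled extension failed to load
        return None
    return getattr(module, "__version__", None)


def matches_pin(found, pinned):
    """True if `found` equals the pin, ignoring a local suffix such as '+cpu'."""
    return found is not None and found.split("+")[0] == pinned


def check_environment(pins=PINNED):
    """Map each module name to a (found_version, is_ok) pair."""
    report = {}
    for name, pinned in pins.items():
        found = installed_version(name)
        report[name] = (found, matches_pin(found, pinned))
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name:16s} expected {PINNED[name]:8s} found {found} [{status}]")
```

Run it inside the activated environment (MEGA_env or MEGA_cpu_env). A MISMATCH line points at the package to reinstall.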

Input data

Data format

Abundance matrix: A CSV file in which rows are species and columns are samples. The first column contains the species IDs or official NCBI taxonomy names, and the first row contains the sample names. MEGA automatically tries to convert species names to IDs when needed.

Sample labels: A CSV file with a header row. Each subsequent row corresponds to one sample, and the file has two columns: the sample name and the sample label.
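
To make the two layouts concrete, here is a small sketch that parses a toy abundance matrix and a toy label file and checks that they agree. It is illustrative only: the values, sample names, and header names are made up, the helper functions are hypothetical, and it assumes the label file lists the sample name in its first column and the label in its second.

```python
import csv
import io


def read_abundance(text):
    """Parse an abundance CSV: header row = sample names, first column = species."""
    rows = list(csv.reader(io.StringIO(text)))
    samples = rows[0][1:]  # skip the corner cell above the species column
    abundance = {row[0]: [float(v) for v in row[1:]] for row in rows[1:]}
    return samples, abundance


def read_labels(text):
    """Parse a two-column label CSV with a header row into {sample: label}."""
    rows = list(csv.reader(io.StringIO(text)))
    return {row[0]: row[1] for row in rows[1:]}  # rows[0] is the header


def check_consistency(samples, abundance, labels):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for species, values in abundance.items():
        if len(values) != len(samples):
            problems.append(
                f"{species}: expected {len(samples)} values, got {len(values)}"
            )
    missing = [s for s in samples if s not in labels]
    if missing:
        problems.append(f"samples without a label: {missing}")
    return problems


# Toy data, made up for illustration. Species may be given as IDs or names.
ABUNDANCE_CSV = """,sample_1,sample_2,sample_3
562,0.10,0.00,0.25
Bacteroides fragilis,0.40,0.35,0.05
"""
LABELS_CSV = """sample,label
sample_1,tumor_A
sample_2,tumor_A
sample_3,tumor_B
"""

samples, abundance = read_abundance(ABUNDANCE_CSV)
labels = read_labels(LABELS_CSV)
print(check_consistency(samples, abundance, labels))  # prints []
```

The same check can be pointed at real files by reading their contents into strings first. Every sample column in the abundance matrix should have a matching row in the label file.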

Example data

cre_abundance_data.csv: The abundance matrix has 995 species and 230 samples

cre_metadata.csv: The sample labels of the corresponding abundance matrix. It has 230 rows (samples) and 2 columns

NJS16_metabolic_relation.txt: Human gut metabolic relationship database (reference: https://www.nature.com/articles/ncomms15393). MEGA loads the built-in NJS16 metabolic database if you do not provide one. The database content is available in the MEGA repository (https://github.com/OSU-BMBL/MEGA).

wget https://raw.githubusercontent.com/OSU-BMBL/MEGA/master/MEGA/data/cre_abundance_data.csv

wget https://raw.githubusercontent.com/OSU-BMBL/MEGA/master/MEGA/data/cre_metadata.csv

How to run MEGA

We will use the example data for the following tutorial.

Quick start

input1: the path to the abundance matrix
input2: the path to the sample metadata
cuda: the GPU device to use. Set it to -1 if you only have a CPU available

Running time:

GPU version: about 15 mins
CPU version: about 60 mins

GPU version

MEGA -cuda 0 -input1 cre_abundance_data.csv -input2 cre_metadata.csv -db NJS16_metabolic_relation.txt -o ./out

CPU version

MEGA -cuda -1 -input1 cre_abundance_data.csv -input2 cre_metadata.csv -db NJS16_metabolic_relation.txt -o ./out
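
If the same script has to run on machines with and without a GPU, the value passed to -cuda can be chosen at run time. The sketch below is one possible way to do that and is not part of MEGA: `pick_cuda_device` is a hypothetical helper, and it assumes that device 0 is the GPU you want whenever PyTorch reports CUDA as available.

```python
def pick_cuda_device():
    """Return 0 if PyTorch can see a CUDA GPU, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # no PyTorch in this interpreter: fall back to the CPU value
    return 0 if torch.cuda.is_available() else -1


print(pick_cuda_device())
```

Saved under a file name of your choice, its printed value can be substituted for the number after -cuda in the commands above.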

Enabling other parameters

Run MEGA -h to see details about all parameters.


INPUT1=cre_abundance_data.csv
INPUT2=cre_metadata.csv
DB=NJS16_metabolic_relation.txt
CUDA=0
LR=0.003
N_HID=128
EPOCH=30
KL_COEF=0.00005
THRES=3
OUTPUT=./out
MEGA -input1 ${INPUT1} -input2 ${INPUT2} -db ${DB} -epoch ${EPOCH} -cuda ${CUDA} -n_hid ${N_HID} -lr ${LR} -kl_coef ${KL_COEF} -o ${OUTPUT}
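
The same call can be assembled from Python, which is convenient when several runs with different settings are needed. This is a minimal sketch under stated assumptions: `build_command` and `run` are hypothetical helpers, only the flags that appear in the command above are used, and values are kept as strings so they reach MEGA exactly as written (for example 0.00005 rather than 5e-05).

```python
import shlex
import subprocess

# Values copied from the shell variables above. THRES is left out because
# the command above does not pass it to MEGA.
PARAMS = {
    "input1": "cre_abundance_data.csv",
    "input2": "cre_metadata.csv",
    "db": "NJS16_metabolic_relation.txt",
    "epoch": "30",
    "cuda": "0",
    "n_hid": "128",
    "lr": "0.003",
    "kl_coef": "0.00005",
    "o": "./out",
}


def build_command(params, executable="MEGA"):
    """Turn {'lr': '0.003'} into ['MEGA', '-lr', '0.003'] (single-dash flags)."""
    command = [executable]
    for flag, value in params.items():
        command += [f"-{flag}", str(value)]
    return command


def run(params, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    command = build_command(params)
    print(" ".join(shlex.quote(part) for part in command))
    if not dry_run:
        subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return command


run(PARAMS)  # dry run: only prints the command
```

Calling `run({**PARAMS, "lr": "0.001", "o": "./out_lr_0.001"}, dry_run=False)` would launch a run with a different learning rate and a separate output location. The dry-run default makes it easy to inspect the command before committing to a long training job.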

Output files

*_final_taxa.txt: cancer-associated microbial signatures. This is the final output file, an unstructured tab-separated text file.

*_taxa_num.csv: normalized attention score for each species under each cancer label
*_metabolic_matrix.csv: metabolic relationship network extracted from the database
*_phy_matrix.csv: phylogenetic relationship network extracted from the NCBI taxonomy database
*_attention.csv: raw attention matrix extracted from the deep learning model
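
After a run finishes, the files can be located by their suffixes and the final result inspected. The sketch below is an illustration rather than a documented MEGA utility: the helper names are hypothetical, it assumes the files are written directly into the directory given to -o (./out in the examples), and it makes no assumption about the columns of *_final_taxa.txt beyond it being tab-separated with rows of varying length.

```python
import csv
import glob
import os

# File-name suffixes taken from the list above.
SUFFIXES = [
    "_final_taxa.txt",
    "_taxa_num.csv",
    "_metabolic_matrix.csv",
    "_phy_matrix.csv",
    "_attention.csv",
]


def find_outputs(out_dir):
    """Map each suffix to the matching files found in out_dir (possibly none)."""
    return {
        suffix: sorted(glob.glob(os.path.join(out_dir, "*" + suffix)))
        for suffix in SUFFIXES
    }


def read_tab_rows(path):
    """Read a tab-separated file whose rows may differ in length; skip blank lines."""
    with open(path, newline="") as handle:
        return [row for row in csv.reader(handle, delimiter="\t") if row]


if __name__ == "__main__":
    outputs = find_outputs("./out")
    for suffix, paths in outputs.items():
        print(suffix, "->", paths or "not found")
    for path in outputs["_final_taxa.txt"]:
        for row in read_tab_rows(path)[:5]:  # peek at the first five rows
            print(row)
```

A "not found" entry for any suffix is a quick signal that the run stopped before producing that file.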

Visualization

UpSet plot and Cytoscape figures

See the README file in the ./figures folder:

./figures/README.md

Circos plot

See the README file in the ./figures/circos folder:

./figures/circos/README.md

Acknowledgements

Maintainer: Cankun Wang

Contributors:

Cankun Wang
Megan McNutt
Anjun Ma
Zhaoqian Liu
Yuhan Sun