Benchmarking study of machine learning methods for prediction of synthetic lethality

Benchmarking of Machine Learning Methods for Predicting Synthetic Lethality Interactions

Welcome to the official code repository for the paper "Benchmarking of Machine Learning Methods for Predicting Synthetic Lethality Interactions." This repository hosts the implementations of various machine learning models evaluated in our study, along with the preprocessing methods and training data necessary for synthetic lethality (SL) prediction.

About the Study

We benchmark 11 recent machine learning methods for SL prediction: three matrix factorization models and eight deep learning models. We test model performance under different data splitting scenarios, positive-to-negative sample ratios, and negative sampling methods. The evaluation covers both classification and ranking tasks to assess the models' generalizability and robustness.
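
To make the idea of a positive-to-negative ratio concrete, here is a minimal sketch of random negative sampling. It assumes that a "random" negative is simply any unordered gene pair that is not a known SL pair; the repository's NSM_Rand method may apply additional rules. The function name `sample_random_negatives`, its parameters, and the gene labels `G1` to `G6` are hypothetical and used only for illustration.

```python
import random
from itertools import combinations


def sample_random_negatives(genes, positive_pairs, neg_per_pos=1, seed=0):
    """Draw random gene pairs that are not in the positive (SL) set.

    Assumption: any unordered pair absent from `positive_pairs` is a valid
    negative. This is an illustration, not the repository's NSM_Rand code.
    """
    rng = random.Random(seed)
    positives = {frozenset(pair) for pair in positive_pairs}
    # Enumerating every pair is fine for a toy example but does not scale
    # to genome-sized gene lists.
    candidates = [
        pair for pair in combinations(sorted(genes), 2)
        if frozenset(pair) not in positives
    ]
    n_neg = neg_per_pos * len(positives)
    if n_neg > len(candidates):
        raise ValueError("not enough candidate negative pairs")
    return rng.sample(candidates, n_neg)


# Toy data: six placeholder genes and two placeholder positive pairs.
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
positive_pairs = [("G1", "G2"), ("G3", "G4")]

# A 1:5 positive-to-negative ratio yields 2 * 5 = 10 negatives.
negatives = sample_random_negatives(genes, positive_pairs, neg_per_pos=5)
print(len(negatives))  # 10
```

Fixing the seed keeps the sampled negatives reproducible across runs, which matters when comparing models on the same split.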

Workflow of the benchmarking study

[Figure: Benchmarking process flowchart]

Benchmarking results

The following figure depicts the performance of the machine learning models across various scenarios:

Results A, B and C represent the model performance under different negative sampling methods (NSM_Rand, NSM_Exp and NSM_Dep), where lighter colors indicate better performance. The figure is structured into five key sections:

  • a) A list of the 11 models.
  • b) The overall scores of the models along with combined scores for the classification task and the ranking task.
  • c) and d) Model performances under the classification and ranking tasks across six experimental scenarios, which include 2 Data Splitting Methods (DSMs) and 3 Positive to Negative Ratios (PNRs).
  • e) The average time required for the models to complete one round of cross-validation.

Key Highlights

  • Extensive benchmarking of ML models tailored for SL prediction.
  • Evaluation of model performance in varied experimental conditions.
  • Examination of how data splitting, negative sampling, and positive-to-negative ratios affect model outcomes.

Benchmarked models

| Method | Paper Title | Article Link | Code Link |
| --- | --- | --- | --- |
| GRSMF | Predicting synthetic lethal interactions in human cancers using graph regularized self-representative matrix factorization | GRSMF | GRSMF |
| SL2MF | SL2MF: Predicting Synthetic Lethality in Human Cancers via Logistic Matrix Factorization | SL2MF | SL2MF |
| CMFW | Predicting synthetic lethal interactions using heterogeneous data sources | CMFW | CMFW |
| SLMGAE | Prediction of Synthetic Lethal Interactions in Human Cancers Using Multi-View Graph Auto-Encoder | SLMGAE | SLMGAE |
| NSF4SL | NSF4SL: negative-sample-free contrastive learning for ranking synthetic lethal partner genes in human cancers | NSF4SL | NSF4SL |
| PTGNN | Pre-training graph neural networks for link prediction in biomedical networks | PTGNN | PTGNN |
| PiLSL | PiLSL: pairwise interaction learning-based graph neural network for synthetic lethality prediction in human cancers | PiLSL | PiLSL |
| KG4SL | KG4SL: knowledge graph neural network for synthetic lethality prediction in human cancers | KG4SL | KG4SL |
| DDGCN | Dual-dropout graph convolutional network for predicting synthetic lethality in human cancers | DDGCN | DDGCN |
| GCATSL | Graph contextualized attention network for predicting synthetic lethality in human cancers | GCATSL | GCATSL |
| MGE4SL | Predicting Synthetic Lethality in Human Cancers via Multi-Graph Ensemble Neural Network | MGE4SL | MGE4SL |

Other SL prediction models

Machine Learning-Based Methods

| Method | Paper Title | Article Link |
| --- | --- | --- |
| Paladugu et al. | Mining protein networks for synthetic genetic interactions | Paladugu et al. |
| Pandey et al. | An Integrative Multi-Network and Multi-Classifier Approach to Predict Genetic Interactions | Pandey et al. |
| MetaSL | In Silico Prediction of Synthetic Lethality by Meta-Analysis of Genetic Interactions, Functions, and Pathways in Yeast and Human Cancer | MetaSL |
| EXP2SL | EXP2SL: A Machine Learning Framework for Cell-Line-Specific Synthetic Lethality Prediction | EXP2SL |
| DiscoverSL | DiscoverSL: an R package for multi-omic data driven prediction of synthetic lethality in cancers | DiscoverSL |
| Li et al. | Identification of synthetic lethality based on a functional network by using machine learning algorithms | Li et al. |
| SLant | Predicting synthetic lethal interactions using conserved patterns in protein interaction networks | SLant |
| Wu et al. | Synthetic Lethal Interactions Prediction Based on Multiple Similarity Measures Fusion | Wu et al. |
| De Kegel et al. | Comprehensive prediction of robust synthetic lethality between paralog pairs in cancer cell lines | De Kegel et al. |
| PARIS | Uncovering cancer vulnerabilities by machine learning prediction of synthetic lethality | PARIS |
| SBSL | Overcoming selection bias in synthetic lethality prediction | SBSL |

Deep Learning-Based Methods

| Method | Paper Title | Article Link |
| --- | --- | --- |
| MAGCN | MAGCN: A Multiple Attention Graph Convolution Networks for Predicting Synthetic Lethality | MAGCN |
| MVGCN-iSL | Multi-view graph convolutional network for cancer cell-specific synthetic lethality prediction | MVGCN-iSL |
| SLGNN | SLGNN: Synthetic lethality prediction in human cancers based on factor-aware knowledge graph neural network | SLGNN |

Repository Structure

This repository is organized as follows:

  • data/: This directory is meant to contain the dataset required for training the models. Given the large size of the data files, we have compressed and uploaded them to Google Drive for users to download.

  • results/: This directory will store the prediction results of the models. It is currently empty and will be populated with data as you run the models.

  • src/: Main source directory.

    • config.py: Configuration settings for the models.
    • main.py: Entry point of the SL prediction models.
    • models/: Contains the model implementations used in the study.
      • *.py: Each model has its own Python file (e.g., ddgcn.py, gcatsl.py, etc.).
    • preprocess.py: Script for data preprocessing.
    • summary_metrics.ipynb: Jupyter notebook for summarizing results.
    • train/: Training scripts for each model.
    • utils/: Utility scripts that support model operations and data manipulation.
    • wandb/: Weights & Biases tracking files for experiment tracking.
    • preprocess_exp_dep_scores.ipynb: Notebook detailing the preprocessing of the expression (Exp) and dependency (Dep) scores.

How to use

Data Preparation and Download Instructions

Follow these steps to download and prepare the training data:

Step 1: Download all the data parts from the Google Drive link provided in the repository. The exact command depends on how you download files from Google Drive.

# Step 2: Verify the integrity of the downloaded files.
md5sum -c md5sum.txt

# Step 3: Combine the parts into a single archive.
cat data_split* > data.tar.gz

# Step 4: Extract the dataset (the extracted folder will be approximately 90GB in size).
tar -xzvf data.tar.gz
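
For environments without `md5sum`, `cat`, or `tar`, steps 2 to 4 can be approximated in Python with the standard library. This is a minimal sketch, assuming `md5sum.txt` uses the usual `<hash>  <filename>` layout and that the parts sort correctly by name; the helper names (`md5_of`, `verify_checksums`, `combine_parts`, `extract`) are hypothetical and not part of the repository.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Return the names of files whose MD5 differs from the listed value."""
    base = Path(checksum_file).parent
    mismatched = []
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        if md5_of(base / name) != expected:
            mismatched.append(name)
    return mismatched


def combine_parts(directory, pattern="data_split*", output="data.tar.gz"):
    """Concatenate the parts in name order, like `cat data_split* > data.tar.gz`."""
    out_path = Path(directory) / output
    with open(out_path, "wb") as out:
        for part in sorted(Path(directory).glob(pattern)):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return out_path


def extract(archive, dest):
    """Unpack a gzip-compressed tar archive into `dest`."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

Calling `verify_checksums("md5sum.txt")`, then `combine_parts(".")`, then `extract("data.tar.gz", ".")` mirrors the three shell steps; an empty list from `verify_checksums` means every part matched its checksum.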

Run models

# Create the output directories inside results/
cd path/to/results
mkdir Rand_score_mats Exp_score_mats Dep_score_mats score_dist

# Navigate to the src directory and run a model.
#   -m   SL prediction method: 'GRSMF', 'SL2MF', 'CMFW', 'SLMGAE', 'NSF4SL', 'PTGNN', 'PiLSL', 'KG4SL', 'DDGCN', 'GCATSL', or 'MGE4SL'
#   -ns  negative sampling method: 'Rand', 'Exp', or 'Dep'
#   -ds  data splitting method: 'CV1', 'CV2', or 'CV3'
#   -pn  positive to negative ratio: '1', '5', '20', or '50'
cd path/to/src
python main.py -m SLMGAE -ns Rand -ds CV1 -pn 1
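
To run several configurations in sequence, the command lines can be generated from the option values listed above. The sketch below only builds the argument lists; it assumes `main.py` accepts every combination of these values, which has not been verified here. The helper `build_commands` and the constant names are hypothetical.

```python
from itertools import product

# Option values as listed for main.py in this README.
METHODS = ["GRSMF", "SL2MF", "CMFW", "SLMGAE", "NSF4SL", "PTGNN",
           "PiLSL", "KG4SL", "DDGCN", "GCATSL", "MGE4SL"]
NEG_SAMPLING = ["Rand", "Exp", "Dep"]
DATA_SPLITS = ["CV1", "CV2", "CV3"]
PN_RATIOS = ["1", "5", "20", "50"]


def build_commands(methods=METHODS, neg_sampling=NEG_SAMPLING,
                   data_splits=DATA_SPLITS, pn_ratios=PN_RATIOS):
    """Return one argument list per combination of the given options."""
    return [
        ["python", "main.py", "-m", m, "-ns", ns, "-ds", ds, "-pn", pn]
        for m, ns, ds, pn in product(methods, neg_sampling, data_splits, pn_ratios)
    ]


# All settings for a single model: 3 sampling methods x 3 splits x 4 ratios.
commands = build_commands(methods=["SLMGAE"])
print(len(commands))  # 36
for cmd in commands[:3]:
    print(" ".join(cmd))

# Each list could then be run from the src directory, for example with
# subprocess.run(cmd, cwd="path/to/src", check=True).
```

Building argument lists rather than shell strings avoids quoting problems when the commands are passed to `subprocess.run`.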

⚠️ Important: Ensure you have at least 500GB of free space to store training data and model prediction results.
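
A quick way to check the free space before starting is `shutil.disk_usage` from the Python standard library. This is a small sketch that reads "500GB" as 500 × 10^9 bytes, which is an assumption; the function name `has_enough_space` is hypothetical.

```python
import shutil

# 500 GB interpreted as decimal gigabytes (an assumption about the README's unit).
REQUIRED_BYTES = 500 * 10**9


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free bytes."""
    return shutil.disk_usage(path).free >= required_bytes


if not has_enough_space("."):
    print("Warning: less than 500 GB free on the current filesystem.")
```

Pass the directory that will actually hold `data/` and `results/` if it is on a different disk from the current working directory.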

Future Directions

Beyond providing the necessary tools for SL prediction, this repository serves as a foundation for future improvements in the predictive accuracy and interpretability of ML methods in SL discovery.

We encourage the scientific community to leverage this repository for advancing the research in synthetic lethality and the pursuit of precision medicine in oncology.