Welcome to the official code repository for the paper "Benchmarking of Machine Learning Methods for Predicting Synthetic Lethality Interactions." This repository hosts the implementations of various machine learning models evaluated in our study, along with the preprocessing methods and training data necessary for synthetic lethality (SL) prediction.
Our study benchmarks 11 recent machine learning methods: three matrix factorization models and eight deep learning models. We evaluate each model under different data splitting scenarios, negative sampling methods, and positive-to-negative sample ratios. The evaluation covers both classification and ranking tasks, with the aim of assessing the models' generalizability and robustness.
The following graph depicts the performance of the machine learning models across various scenarios:
A, B, and C represent the model performance under different negative sampling methods (`NSM_Rand`, `NSM_Exp`, and `NSM_Dep`), where lighter colors indicate better performance. The figure is structured into five key sections:
- a) A list of the 11 models.
- b) The overall scores of the models along with combined scores for the classification task and the ranking task.
- c) and d) Model performance on the classification and ranking tasks across six experimental scenarios, formed by combining 2 Data Splitting Methods (DSMs) with 3 Positive to Negative Ratios (PNRs).
- e) The average time required for the models to complete one round of cross-validation.
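
To make the notions of random negative sampling and a positive-to-negative ratio concrete, here is a minimal sketch. It is not the repository's implementation: the function name `sample_random_negatives`, the toy gene identifiers, and the choice to treat every unlabeled gene pair as a candidate negative are all assumptions made for illustration.

```python
import random
from itertools import combinations


def sample_random_negatives(genes, positive_pairs, neg_per_pos, seed=0):
    """Draw `neg_per_pos` negatives per positive, uniformly from unlabeled pairs.

    Hypothetical helper for illustration only; the benchmark's own sampling
    code may differ.
    """
    # Treat pairs as unordered so (a, b) and (b, a) count as the same pair.
    positives = {frozenset(pair) for pair in positive_pairs}
    # Assumption: any gene pair without a positive label is a candidate negative.
    candidates = [
        pair
        for pair in combinations(sorted(genes), 2)
        if frozenset(pair) not in positives
    ]
    n_wanted = neg_per_pos * len(positives)
    if n_wanted > len(candidates):
        raise ValueError("not enough unlabeled pairs for the requested ratio")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(candidates, n_wanted)


# Toy data: six placeholder genes and two known positive pairs.
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
positive_pairs = [("G1", "G2"), ("G3", "G4")]

# A positive-to-negative ratio of 1:5 means five negatives per positive.
negatives = sample_random_negatives(genes, positive_pairs, neg_per_pos=5)
print(len(negatives))
```

Sampling without replacement from the candidate list guarantees that no negative is duplicated and that none coincides with a known positive pair.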
Key features of this work:
- Extensive benchmarking of ML models tailored for SL prediction.
- Evaluation of model performance in varied experimental conditions.
- Examination of the impact of data manipulation on model outcomes.
Method | Paper Title | Article Link | Code Link |
---|---|---|---|
GRSMF | Predicting synthetic lethal interactions in human cancers using graph regularized self-representative matrix factorization | GRSMF | GRSMF |
SL2MF | SL2MF: Predicting Synthetic Lethality in Human Cancers via Logistic Matrix Factorization | SL2MF | SL2MF |
CMFW | Predicting synthetic lethal interactions using heterogeneous data sources | CMFW | CMFW |
SLMGAE | Prediction of Synthetic Lethal Interactions in Human Cancers Using Multi-View Graph Auto-Encoder | SLMGAE | SLMGAE |
NSF4SL | NSF4SL: negative-sample-free contrastive learning for ranking synthetic lethal partner genes in human cancers | NSF4SL | NSF4SL |
PTGNN | Pre-training graph neural networks for link prediction in biomedical networks | PTGNN | PTGNN |
PiLSL | PiLSL: pairwise interaction learning-based graph neural network for synthetic lethality prediction in human cancers | PiLSL | PiLSL |
KG4SL | KG4SL: knowledge graph neural network for synthetic lethality prediction in human cancers | KG4SL | KG4SL |
DDGCN | Dual-dropout graph convolutional network for predicting synthetic lethality in human cancers | DDGCN | DDGCN |
GCATSL | Graph contextualized attention network for predicting synthetic lethality in human cancers | GCATSL | GCATSL |
MGE4SL | Predicting Synthetic Lethality in Human Cancers via Multi-Graph Ensemble Neural Network | MGE4SL | MGE4SL |
Method | Paper Title | Article Link |
---|---|---|
Paladugu et al. | Mining protein networks for synthetic genetic interactions | Paladugu et al. |
Pandey et al. | An Integrative Multi-Network and Multi-Classifier Approach to Predict Genetic Interactions | Pandey et al. |
MetaSL | In Silico Prediction of Synthetic Lethality by Meta-Analysis of Genetic Interactions, Functions, and Pathways in Yeast and Human Cancer | MetaSL |
EXP2SL | EXP2SL: A Machine Learning Framework for Cell-Line-Specific Synthetic Lethality Prediction | EXP2SL |
DiscoverSL | DiscoverSL: an R package for multi-omic data driven prediction of synthetic lethality in cancers | DiscoverSL |
Li et al. | Identification of synthetic lethality based on a functional network by using machine learning algorithms | Li et al. |
SLant | Predicting synthetic lethal interactions using conserved patterns in protein interaction networks | SLant |
Wu et al. | Synthetic Lethal Interactions Prediction Based on Multiple Similarity Measures Fusion | Wu et al. |
De Kegel et al. | Comprehensive prediction of robust synthetic lethality between paralog pairs in cancer cell lines | De Kegel et al. |
PARIS | Uncovering cancer vulnerabilities by machine learning prediction of synthetic lethality | PARIS |
SBSL | Overcoming selection bias in synthetic lethality prediction | SBSL |
Method | Paper Title | Article Link |
---|---|---|
MAGCN | MAGCN: A Multiple Attention Graph Convolution Networks for Predicting Synthetic Lethality | MAGCN |
MVGCN-iSL | Multi-view graph convolutional network for cancer cell-specific synthetic lethality prediction | MVGCN-iSL |
SLGNN | SLGNN: Synthetic lethality prediction in human cancers based on factor-aware knowledge graph neural network | SLGNN |
This repository is organized as follows:
- `data/`: This directory is meant to contain the dataset required for training the models. Given the large size of the data files, we have compressed and uploaded them to Google Drive for users to download.
- `results/`: This directory will store the prediction results of the models. It is currently empty and will be populated with data as you run the models.
- `src/`: Main source directory.
  - `config.py`: Configuration settings for the models.
  - `main.py`: Entry point of the SL prediction models.
  - `models/`: Contains the model implementations used in the study.
    - `*.py`: Each model has its own Python file (e.g., `ddgcn.py`, `gcatsl.py`, etc.).
  - `preprocess.py`: Script for data preprocessing.
  - `summary_metrics.ipynb`: Jupyter notebook for summarizing results.
  - `train/`: Training scripts for each model.
  - `utils/`: Utility scripts that support model operations and data manipulation.
  - `wandb/`: Weights & Biases tracking files for experiment tracking.
  - `preprocess_exp_dep_scores.ipynb`: Notebook detailing preprocessing of experimental dependency scores.
Follow these steps to download and prepare the training data:
# Step 1: Download all the data parts from the Google Drive link provided in the repository. (The actual command will depend on how you're downloading files from Google Drive.)
# Step 2: Verify the integrity of the downloaded files.
md5sum -c md5sum.txt
# Step 3: Combine the parts into a single archive.
cat data_split* > data.tar.gz
# Step 4: Extract the dataset (the extracted folder will be approximately 90GB in size).
tar -xzvf data.tar.gz
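
On systems without `md5sum` and `cat`, steps 2 and 3 can be approximated with the Python standard library. The sketch below assumes that `md5sum.txt` uses the usual `md5sum` output format (a hex digest, whitespace, then a file name); the helper names and the two tiny demo part files are hypothetical stand-ins for the real downloads.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(listing, base_dir="."):
    """Return the names of files whose MD5 differs from the listing."""
    failed = []
    for line in Path(listing).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary-mode entries with '*'
        if md5_of(Path(base_dir) / name) != expected.lower():
            failed.append(name)
    return failed


def combine_parts(pattern, output, base_dir="."):
    """Concatenate the parts matching `pattern`, in sorted order, into `output`."""
    parts = sorted(Path(base_dir).glob(pattern))
    with open(Path(base_dir) / output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return parts


# Demo on two tiny stand-in parts (the part names here are made up).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "data_split_aa").write_bytes(b"hello ")
    (tmp / "data_split_ab").write_bytes(b"world")
    listing = tmp / "md5sum.txt"
    listing.write_text(
        "".join(f"{md5_of(p)}  {p.name}\n" for p in sorted(tmp.glob("data_split*")))
    )
    failed = verify_checksums(listing, base_dir=tmp)
    parts = combine_parts("data_split*", "data.tar.gz", base_dir=tmp)
    combined = (tmp / "data.tar.gz").read_bytes()

print(failed, combined)
```

Sorting the matched parts mirrors the lexicographic order in which the shell expands `data_split*`, so the bytes are joined in the same sequence as with `cat`. Step 4 could likewise be handled with the standard `tarfile` module.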
# Create the output directories inside the results directory
cd path/to/results
mkdir Rand_score_mats Exp_score_mats Dep_score_mats score_dist
# Navigate to the src directory
cd path/to/src
# Run a model. Options:
#   -m   SL prediction method: 'GRSMF', 'SL2MF', 'CMFW', 'SLMGAE', 'NSF4SL', 'PTGNN', 'PiLSL', 'KG4SL', 'DDGCN', 'GCATSL', or 'MGE4SL'
#   -ns  negative sampling method: 'Rand', 'Exp', or 'Dep'
#   -ds  data splitting method: 'CV1', 'CV2', or 'CV3'
#   -pn  positive to negative ratio: '1', '5', '20', or '50'
python main.py -m SLMGAE -ns Rand -ds CV1 -pn 1
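
To sweep several settings rather than launching runs one by one, the option values listed for `main.py` can be enumerated programmatically. The following sketch only builds the command lines; `build_commands` is a hypothetical helper, and whether every model supports every combination of options is an assumption that should be checked before launching a full sweep.

```python
import itertools
import shlex

# Option values as listed for main.py.
METHODS = [
    "GRSMF", "SL2MF", "CMFW", "SLMGAE", "NSF4SL", "PTGNN",
    "PiLSL", "KG4SL", "DDGCN", "GCATSL", "MGE4SL",
]
NEG_SAMPLING = ["Rand", "Exp", "Dep"]
DATA_SPLITS = ["CV1", "CV2", "CV3"]
PN_RATIOS = ["1", "5", "20", "50"]


def build_commands(methods=METHODS, neg_sampling=NEG_SAMPLING,
                   data_splits=DATA_SPLITS, pn_ratios=PN_RATIOS):
    """Yield one argument list per combination of the given option values."""
    grid = itertools.product(methods, neg_sampling, data_splits, pn_ratios)
    for method, ns, ds, pn in grid:
        yield ["python", "main.py", "-m", method, "-ns", ns, "-ds", ds, "-pn", pn]


# Example: all settings for a single model.
commands = list(build_commands(methods=["SLMGAE"]))
for cmd in commands[:3]:
    print(shlex.join(cmd))
print(len(commands), "commands for SLMGAE")
```

Each argument list can be passed to `subprocess.run(cmd, cwd="path/to/src", check=True)` to execute the runs sequentially. Keeping commands as lists rather than strings avoids shell quoting problems.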
⚠️ Important: Ensure you have at least 500GB of free space to store training data and model prediction results.
Beyond providing the necessary tools for SL prediction, this repository serves as a foundation for future improvements in the predictive accuracy and interpretability of ML methods in SL discovery.
We encourage the scientific community to use this repository to advance research on synthetic lethality and the pursuit of precision medicine in oncology.