Cell-Vision-Fusion

A SwinV2 transformer-based fusion approach combining Cell Painting images, image-based profiles and compound structures to predict kinase inhibitor mechanism of action.

Primary language: Jupyter Notebook. License: Creative Commons Attribution 4.0 International (CC-BY-4.0).


Description

Abstract: Image-based profiling of the cellular response to drug compounds has proven to be an effective method to characterize the morphological changes resulting from chemical perturbation experiments. This approach has been useful in the field of drug discovery, ranging from phenotype-based screening to identifying a compound's mechanism of action or toxicity. As a greater amount of data becomes available, however, there are growing demands for deep learning methods to be applied to perturbation data. In this paper, we applied the transformer-based SwinV2 computer vision architecture to predict the mechanism of action of ten kinase inhibitor compounds directly from raw images of the cellular response. This method outperforms the standard approach of using image-based profiles, multidimensional feature set representations generated by bioimaging software. Furthermore, we combined the best-performing models for three different data modalities, raw images, image-based profiles and compound chemical structures, to form a fusion model, Cell-Vision Fusion (CVF). This approach classified the kinase inhibitors with 69.79% accuracy and 70.56% F1 score, 4.2% and 5.49% greater, respectively, than the best-performing image-based profile method. Our work provides three techniques, specific to Cell Painting images, which enable the SwinV2 architecture to train effectively, and explores approaches to combat the significant batch effects present in large Cell Painting perturbation datasets.

Approach Overview

Primary Reference Material and Data Sources

Path - Description
JUMP Cell Painting Repository - JUMP Consortium GitHub repository
Broad Cell Painting Gallery S3 Bucket - AWS S3 storage bucket for Cell Painting data
JUMP cpg0016 Paper - "Morphological impact of 136,000 chemical and genetic perturbations" - Chandrasekaran et al. (2023)

Requirements

  • 1–2 GPUs with at least 12 GB of memory.
  • 64-bit Python 3.10 and PyTorch 1.8.1. See https://pytorch.org/ for PyTorch install instructions.
  • CUDA toolkit 11.0 or later.
  • Python libraries: see reqs.txt for necessary libraries.

Note: To use later versions of PyTorch with Python 3.10.x, code changes will have to be made to the CUDA package to avoid dependency conflicts.

Getting Started

# Download the codebase:
git clone https://github.com/williamdee1/Cell-Vision-Fusion
cd Cell-Vision-Fusion

# Create environment and install requirements (instructions for Anaconda):
conda create -n [env_name] python=3.10.9
conda activate [env_name] 
pip install -r reqs.txt

Set up the AWS CLI - https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html. Ensure your AWS credentials and config files are in your ~/.aws directory.
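Once the CLI is configured, raw images are addressed by S3 keys built from the JUMP metadata. The sketch below shows one way such a key might be assembled; the exact folder layout varies by data source, so treat the pattern (and the function name) as an illustrative assumption and verify it against the bucket for your source.

```python
# Sketch of assembling a raw-image S3 URI from JUMP cpg0016 metadata.
# The path pattern reflects one common layout in the Cell Painting
# Gallery bucket; it differs between sources, so check before relying
# on it. Function and argument names are illustrative.
BUCKET = "cellpainting-gallery"

def image_s3_uri(source: str, batch: str, plate: str, filename: str) -> str:
    key = f"cpg0016-jump/{source}/images/{batch}/images/{plate}/Images/{filename}"
    return f"s3://{BUCKET}/{key}"

uri = image_s3_uri("source_4", "2021_04_26_Batch1", "BR00117035",
                   "r01c01f01p01-ch1sk1fk1fl1.tiff")
print(uri)
```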

Load the Kinase Inhibitor Dataset

See below for the steps required to recreate the kinase inhibitor dataset used in this work:

  1. The JUMP cpg0016 metadata files were downloaded from the JUMP Cell Painting Repository Metadata folder. These were combined in the Combining JUMP Metadata notebook.
  2. Compound mechanism of action (MOA) information was downloaded from the ChEMBL Drug mechanisms data portal and the Drug Repurposing Hub (v. 3/24/2020). This data was processed and aligned with the JUMP data in the Preparing MOA Label Data and Aligning MOA to JUMP Data notebooks, following the Compound Annotator GitHub repository's example.
  3. Inhibitor classes were included/excluded on the basis of their suitability (i.e., likelihood of producing a significant phenotypic response) for the U2OS cell line. The alignment of these classes with the DepMap TPM gene expression values was performed in the U2OS Kinase Expression notebook.
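At its core, step 2 joins MOA annotations onto the JUMP compound metadata via a shared structural key. A minimal dependency-free sketch of that join, assuming both tables carry an InChIKey field (the actual notebooks follow the Compound Annotator workflow; all field values below are placeholders):

```python
# Minimal sketch of aligning MOA labels with JUMP compound metadata by
# InChIKey. The real notebooks use the Compound Annotator repository's
# approach; keys and values here are hypothetical placeholders.
jump_meta = [
    {"Metadata_InChIKey": "KEY_A", "Metadata_JCP2022": "JCP2022_000001"},
    {"Metadata_InChIKey": "KEY_B", "Metadata_JCP2022": "JCP2022_000002"},
]
moa_labels = {"KEY_A": "EGFR inhibitor"}  # InChIKey -> MOA annotation

# Keep only compounds with a known MOA, attaching the label to each row.
labelled = [
    {**row, "moa": moa_labels[row["Metadata_InChIKey"]]}
    for row in jump_meta
    if row["Metadata_InChIKey"] in moa_labels
]
print(labelled)
```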

4. Download CellProfiler Image-based Profile data:

Image-based profile data was downloaded for each datapoint in cpg0016 matched with a kinase inhibitor label - see the code below:

python ibp_dl_main.py --data=data/all_cpnd_pert.csv --out_path=data/ibp/all_cpnd_ibp.csv
  5. Following the download, quality control was performed to exclude images with high/low saturation, low focus and high levels of blur. This process also highlighted a number of images containing artefacts that were present prior to quality control (see image below). The QC process is documented in Quality Control and Dataset Selection. Post-QC, the remaining datapoints were selected to form the kinase inhibitor IBP dataset, choosing a maximum of ten and a minimum of four replicates per compound.

Example low quality image excluded by quality control procedures (source_9, 20211102-Run15, GR00004394, U24, field 2)
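The post-QC replicate selection (at least four, at most ten QC-passing replicates per compound) can be sketched as follows; the function and threshold names are illustrative, not taken from the codebase:

```python
import random

# Sketch of post-QC replicate selection: drop any compound with fewer
# than MIN_REPS usable replicates, and cap the rest at MAX_REPS.
# Names are illustrative assumptions, not from the repository.
MIN_REPS, MAX_REPS = 4, 10

def select_replicates(replicates_by_compound, seed=0):
    rng = random.Random(seed)
    selected = {}
    for compound, reps in replicates_by_compound.items():
        if len(reps) < MIN_REPS:
            continue  # too few QC-passing replicates -> exclude compound
        reps = list(reps)
        rng.shuffle(reps)          # avoid a systematic choice of wells
        selected[compound] = reps[:MAX_REPS]
    return selected

sample = {"cmpd_a": [f"well_{i}" for i in range(12)], "cmpd_b": ["w1", "w2"]}
print({k: len(v) for k, v in select_replicates(sample).items()})  # {'cmpd_a': 10}
```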

6. Download Raw Image Cell Painting data:

Using the metadata from the IBP dataset, the associated images were downloaded from the JUMP CP S3 Bucket using the following code:

python img_dl_main.py --data=data/images/ki_img_dl.csv --output_dir=dl_imgs

This function downloads all field images for each datapoint and applies the illumination function for the plate where the experiment was situated.
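Illumination correction in Cell Painting pipelines is typically applied by dividing each field image by its channel's smooth, per-plate illumination function. A minimal numpy sketch of that step, assuming the illumination function has already been loaded as an array:

```python
import numpy as np

# Sketch of per-plate illumination correction: divide each field image
# by the channel's illumination function (in the JUMP data this is a
# smooth per-plate array). Division is the usual CellProfiler-style
# operation; values here are synthetic.
def apply_illum(field: np.ndarray, illum: np.ndarray) -> np.ndarray:
    # Guard against division by zero at dark edges of the illum function.
    return field / np.maximum(illum, 1e-6)

field = np.full((4, 4), 100.0)   # synthetic field image
illum = np.full((4, 4), 2.0)     # synthetic illumination function
corrected = apply_illum(field, illum)
print(corrected[0, 0])  # 50.0
```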

Modelling the Data

Image-based Profiles

Applying MAD normalization, Harmony and Spherization to the IBP data was performed by following the code in the 2023_Arevalo_BatchCorrection GitHub repository, which is linked to the following research paper - Evaluating batch correction methods for image-based cell profiling. The config associated with Scenario 4 of that paper was used as it most closely mirrors the scenario found in this study (i.e. multiple microscope types, multiple laboratories, few compounds, multiple replicates).

The impact of using MAD normalization, spherization and either Pycytominer or Shapley feature selection for reducing batch effects caused by the microscope used can be observed below:
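MAD normalization is a robust z-score: each feature is centred on a reference median and scaled by its median absolute deviation. A small numpy sketch of the idea (the repository's pipeline follows the 2023_Arevalo_BatchCorrection code; the 1.4826 consistency constant, which makes the MAD comparable to a standard deviation under normality, and the control-based reference are common conventions rather than confirmed defaults here):

```python
import numpy as np

# Sketch of MAD (robust z-score) normalisation of image-based profiles:
# centre each feature on a reference median (e.g. per-plate controls)
# and scale by its MAD. Conventions here are assumptions; the actual
# pipeline follows the 2023_Arevalo_BatchCorrection repository.
def mad_normalize(X: np.ndarray, reference: np.ndarray) -> np.ndarray:
    med = np.median(reference, axis=0)
    mad = np.median(np.abs(reference - med), axis=0) * 1.4826
    return (X - med) / np.maximum(mad, 1e-8)

reference = np.array([[1.0], [2.0], [3.0]])  # synthetic control profiles
X = np.array([[2.0], [4.0]])                 # synthetic treated profiles
print(mad_normalize(X, reference))
```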

Modelling was then performed in the IBP RF SVM XGBoost and the IBP MLP Kaggle notebooks.

Image Model Training Time

Model   Resolution  GPUs  sec/kimg  Max RSS  Max PSS
ENet    240x240     1     ~8.0      8.0 GB   5.6 GB
SwinV2  896x896     2     ~63.2     59.1 GB  9.6 GB
CVF     896x896     2     ~65.8     61.4 GB  9.7 GB

Compound Structures

The CP Chem MOA GitHub repository was used as the basis for converting compound SMILES strings into Morgan fingerprints, before using these representations of chemical structure as input into an MLP model. The process can be found in the MLP_Structural_Model notebook.
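The SMILES-to-fingerprint step can be sketched with RDKit as below. The radius and bit-length are common defaults, not necessarily the settings used in the CP Chem MOA repository, and the function name is illustrative:

```python
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

# Sketch of converting a SMILES string into a Morgan fingerprint bit
# vector for the structural MLP. radius=2 / 2048 bits are common
# defaults, not confirmed settings from the CP Chem MOA repository.
def smiles_to_fp(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

vec = smiles_to_fp("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example
print(vec.shape)
```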

Cell-Vision Fusion Model

To train and evaluate the Cell-Vision Fusion model, run the following command once this repository and the associated image, compound-structure and image-based-profile data (see above) have been downloaded and preprocessed.

python train_model.py --config=config/fusion.yml
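For orientation, a fusion config of this shape might contain fields like the following. Every key and value below is an illustrative assumption, not the actual contents of config/fusion.yml:

```yaml
# Illustrative fusion config sketch -- key names and values are
# assumptions, not the real contents of config/fusion.yml.
data:
  image_dir: dl_imgs
  ibp_csv: data/ibp/all_cpnd_ibp.csv
  structure_csv: data/all_cpnd_pert.csv
model:
  image_backbone: swinv2
  resolution: 896
  fusion: concat        # concatenate per-modality embeddings
training:
  epochs: 50
  batch_size: 8
  lr: 1.0e-4
  folds: 5
```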

Evaluation

The results from each cross-validation fold for the best-performing model in each data modality - images (SwinV2), image-based profiles (MLP) and compound structure (MLP) - as well as the CVF results, are shared in the Results folder.

These results are combined and performance metrics for each model are calculated in the Model Results notebook. These are displayed in the table below:
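The headline metrics (accuracy and F1) are computed per fold and averaged. A dependency-free sketch of multi-class accuracy and macro-averaged F1 follows; the Model Results notebook may well use scikit-learn equivalents instead:

```python
# Dependency-free sketch of the evaluation metrics: multi-class
# accuracy and macro-averaged F1. Illustrative only; the actual
# notebook may use scikit-learn's implementations.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)  # unweighted mean over classes

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(accuracy(y_true, y_pred))  # 0.75
```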

Results