Contrastive learning for feature aggregation in image-based cell profiling (on an AWS instance)

We propose a Deep Sets-based method that learns the best way of aggregating single-cell feature data into a profile that predicts a compound’s mechanism of action better than average profiling does. This is achieved by applying weakly supervised contrastive learning in a multiple instance learning setting. The proposed model provides a more accessible method for aggregating single-cell feature data than previous studies while significantly improving upon the average profiling baseline.
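
To illustrate the difference between average profiling and a Deep Sets style aggregator, here is a minimal sketch in plain Python. It is not the model from this repository: the layer sizes, the sum pooling, the random (untrained) weights, and all function names are assumptions made for illustration. In the actual method the weights would be learned with a contrastive loss.

```python
import random
from typing import List

Vector = List[float]

def average_profile(cells: List[Vector]) -> Vector:
    """Baseline: feature-wise mean over all cells in a well."""
    n = len(cells)
    return [sum(column) / n for column in zip(*cells)]

def linear_relu(x: Vector, weights: List[Vector]) -> Vector:
    """One dense layer followed by ReLU; each row of weights yields one output unit."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def deep_sets_profile(cells: List[Vector], phi_w: List[Vector], rho_w: List[Vector]) -> Vector:
    """Compute rho(sum_i phi(x_i)): embed each cell, sum-pool, transform the pooled vector."""
    embedded = [linear_relu(cell, phi_w) for cell in cells]
    pooled = [sum(column) for column in zip(*embedded)]
    return linear_relu(pooled, rho_w)

rng = random.Random(0)
cells = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]  # 5 cells, 4 features each
phi_w = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]  # phi: 4 -> 3 (untrained)
rho_w = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2)]  # rho: 3 -> 2 (untrained)

baseline = average_profile(cells)
profile = deep_sets_profile(cells, phi_w, rho_w)
# Reordering the cells should not change the profile (up to floating-point rounding).
reordered = deep_sets_profile(list(reversed(cells)), phi_w, rho_w)
```

Because the per-cell embeddings are pooled with a sum, the resulting profile does not depend on the order of the cells, which is the property that makes this family of models suitable for sets of single cells.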

All of the scripts that were used to develop, train, and evaluate the model are included in this repository. The scripts for computing the figures can be found in the Jupyter notebooks folder.

You can find more details in FeatureAggregationManuscriptV1.pdf.

To apply this method yourself on the LINCS dataset

First install required packages:

# If starting on an empty AWS EBS volume, otherwise ...
sudo su
mkdir ~/ebs_tmp/
cd ~/ebs_tmp
# ... start here 
sudo yum install git -y
sudo amazon-linux-extras install epel

Then clone this GitHub repo:

git clone

Configure AWS credentials

aws configure

Enter your credentials when prompted.

Download the LINCS metadata

cd aws_scripts
git init
git remote add -f origin
git config core.sparseCheckout true
echo "metadata/platemaps/2016_04_01_a549_48hr_batch1/" >> .git/info/sparse-checkout
git pull origin master

# Download the repurposing info table to access perturbation names and MoAs (repurposing_info_long.tsv)
cd metadata/platemaps/2016_04_01_a549_48hr_batch1
curl -o repurposing_info_long.tsv

Set up the conda environment

cd ..
conda update --all
conda create -n FAenv python=3.9 scipy=1.8 pytorch umap-learn pandas matplotlib seaborn pycytominer
conda activate FAenv
conda install datashader bokeh holoviews scikit-image colorcet 
pip install kneed scikit-learn pytorch-metric-learning wandb tabulate

# Alternatively, create the environment directly from environment.yml
conda env create -f environment.yml

Download and preprocess all LINCS plates from batch 1

Edit the lincs_preprocessing_input.txt file. Note that an empty line corresponds to a boolean value of False:

  • p1: dataset name
  • p2: sqlite path
  • p3: metadata path
  • p4: barcode platemap filename
  • p5: boolean for subsample (used for developing code)
  • p6: boolean for downloading only the 10 and 3.33 uM dose points
  • p7: path to aws commands text file (get_data_LINCS.txt)

Optionally, edit the get_data_LINCS.txt file to download only a subset of the data, for example with "nano /aws_scripts/get_data_LINCS.txt"

python @script_input_files/lincs_preprocessing_input.txt
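
The "python @file" invocation above passes an argument file with one value per line. As a minimal sketch, assuming the scripts read these files through argparse's fromfile_prefix_chars="@" option (this README does not confirm it), the following shows why an empty line is read as False. The argument names, the file name, and the values are hypothetical.

```python
import argparse
import os
import tempfile

# Hypothetical parser: three positional arguments, one value per line in the file.
parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
parser.add_argument("dataset_name")
parser.add_argument("subsample", type=bool)    # bool("") is False
parser.add_argument("dose_filter", type=bool)  # bool("1") is True

# Write an illustrative argument file; the second line is left empty on purpose.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_input.txt")
    with open(path, "w") as handle:
        handle.write("LINCS\n\n1\n")
    args = parser.parse_args(["@" + path])

print(args.dataset_name, args.subsample, args.dose_filter)
```

Under this assumption any non-empty line, including the text "False" or "0", would be read as True, so a boolean should be switched off by leaving its line empty.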

Train the feature aggregation model on the preprocessed plates

Modify the "script_input_files/main_LINCS_input.txt" file to set the hyperparameters that you want to use:

  • p1: metadata path

  • p2: wandb mode parameter

  • p3: number of input features of the model

  • p4: learning rate

  • p5: epochs

  • p6: number of sets per compound type

  • p7: batch size (note: true batch size = p6*p7)

  • p8: mean of the Gaussian distribution used to sample the number of cells (sd=800)

  • p9: kFilters

  • p10: minimum number of replicates for compounds to be included (default=0)

      python @script_input_files/main_LINCS_input.txt
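
To make p6, p7, and p8 more concrete, here is a minimal sketch of how they could interact. It assumes that the number of cells in each set is drawn from a Gaussian with mean p8 and sd 800 and then clipped to the number of available cells; the clipping, the function names, and the example values are assumptions, not the repository's implementation.

```python
import random
from typing import List, Optional

def effective_batch_size(sets_per_compound: int, batch_size: int) -> int:
    """True batch size as described above: p6 * p7."""
    return sets_per_compound * batch_size

def sample_cells(cell_ids: List[int], mean_cells: float, sd: float = 800.0,
                 rng: Optional[random.Random] = None) -> List[int]:
    """Draw a set size from N(mean_cells, sd) and sample that many cells without replacement."""
    rng = rng or random.Random()
    n_cells = int(round(rng.gauss(mean_cells, sd)))
    # Assumption: clip the set size to [1, number of available cells].
    n_cells = max(1, min(n_cells, len(cell_ids)))
    return rng.sample(cell_ids, n_cells)

# Hypothetical values: 4 sets per compound, batch size 18, mean set size 1500.
total_sets = effective_batch_size(4, 18)
well = list(range(3000))  # stand-in for the cells of one well
cell_set = sample_cells(well, mean_cells=1500, rng=random.Random(0))
```

Varying the set size in this way exposes the model to wells with different cell counts during training.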

Evaluate the trained model

Modify the "script_input_files/fulleval_input.txt" file to select the type of evaluation you want to run and the dataset to run it on:

  • p1: number of input features of the model

  • p2: kFilters

  • p3: boolean for saving newly inferred profiles as CSV

  • p4: evaluation mode: replicate prediction (empty) or MoA prediction (1)

  • p5: dataset name

  • p6: model directory

  • p7: metadata path

  • p8: dose point to evaluate on (10 or 3)

  • p9: output directory

      python @script_input_files/fulleval_input.txt
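
When running several evaluations, it can be convenient to generate the argument file instead of editing it by hand. The helper below is a hypothetical sketch, not part of this repository: it writes one value per line in the order p1 to p9 and leaves the line empty for False. All values shown are placeholders.

```python
import os
import tempfile
from typing import Sequence, Union

Value = Union[str, int, float, bool, None]

def write_argument_file(path: str, values: Sequence[Value]) -> None:
    """Write one value per line; False or None becomes an empty line, True becomes "1"."""
    lines = []
    for value in values:
        if value is None or value is False:
            lines.append("")
        elif value is True:
            lines.append("1")
        else:
            lines.append(str(value))
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")

# Placeholder values in the order p1..p9; none of them come from the repository.
fulleval_values = [
    1000,                 # p1: number of input features of the model
    1,                    # p2: kFilters
    False,                # p3: do not save the inferred profiles as CSV
    True,                 # p4: MoA prediction (False/empty selects replicate prediction)
    "LINCS",              # p5: dataset name
    "path/to/model_dir",  # p6: model directory
    "path/to/metadata",   # p7: metadata path
    10,                   # p8: dose point to evaluate on
    "path/to/output",     # p9: output directory
]

with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "fulleval_input_example.txt")
    write_argument_file(example_path, fulleval_values)
    with open(example_path) as handle:
        written_lines = handle.read().splitlines()
```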

Create paper figures

mkdir notebooks
cd notebooks
git init 
git remote add -f origin
git config core.sparseCheckout true
echo "6.paper_figures/figure4_percentmatching.ipynb" >> .git/info/sparse-checkout
git pull origin master