/compath

deep learning pipeline repository for the paper "Geospatial immune variability illuminates differential evolution of lung adenocarcinoma"

Primary LanguageMATLAB

Histology single-cell identification pipeline

Deep learning pipeline repository for our paper "Geospatial immune variability illuminates differential evolution of lung adenocarcinoma" published in Nature Medicine.

In addition to a combination of Python, MATLAB and R scripts, this repository also includes, example H&E images and their final outputs and single-cell annotations data for external cohort testing.

The pipeline accepts a standard H&E (e.g. ndpi format) and outputs a spatial map, where all cancer, lymphocyte and stromal cells can be recognized. The SCCNN method was first published in doi.org/10.1109/TMI.2016.2525803 but re-implemented with different parameters in Python-TensorFlow here. Tissue segmentation is based on MicroNet: doi.org/10.1016/j.media.2018.12.003.

Citation

If you use this pipeline or some of its steps, or if you use the attached annotation data, please cite:

  • AbdulJabbar, K. et al. Geospatial immune variability illuminates differential evolution of lung adenocarcinoma. Nature Medicine (2020). doi: 10.1038/s41591-020-0900-x

Highlights

The steps can be further explained as follows:

  • Tiling: to convert a raw microscopy image into 2000x2000 JPEG tiles.
  • Tissue segmentation: to segment viable tissue area from a H&E slide.

The above two steps can be skipped, e.g. if you already have small sections of a H&E as JPEG tiles, or if you don't think there is any need to segment tissue areas. However, please note, tissue segmentation is a fast step that rids large unwanted tiles from a standard H&E to save time for the next two steps.

  • Cell detection: identifying cell nucleus,
  • Cell classification: predicting the class of an identified cell (cancer, stromal, lymphocyte, other)

Both cell detection and classification algorithms contain pre processing routines. You can turn this off/on or modify it from the main run script or sub matlab dir.

To execute, you need the below Conda virtual environments.

Python-TensorFlow virtual envs (Linux)

  • For cell detection and classification:
module load anaconda/3/4.4.0
conda create -n tfdavrosCPU1p3 python=3.5.4
conda activate tfdavrosCPU1p3
conda install scipy=0.19 pandas=0.20 numpy=1.13.1
pip install /apps/tensorflow/tensorflow-1.3.0-cp35-cp35m-linux_x86_64.whl

cd /apps/MATLAB/R2018b/extern/engines/python
#replace your dir:
python setup.py build --build-base="/home/dir/tmp" install
pip install pillow==4.2.1 h5py==2.7.1
conda deactivate

#check by running python then 'import tensorflow as tf'
  • For tiling raw ndpi files:
module load anaconda/3/4.4.0
conda create –n CWS python=3.5
source activate CWS
conda install numpy
module load java/sun8/1.8.0u66
pip install 'python-bioformats<=1.3.0'
module load openjpeg/2.1.2
module load openslide/3.4.1
pip install openslide-python
source deactivate CWS

Example data

Under data/example we provide sample tiles. The aim should be to run both cell detection and classification and replicate the results as seen under example/results.

  • example/data: raw tiled JPEGs, ready for cell detection and cell classification.
  • example/results: the output of this pipeline in the form of annotated images and cell coordinates.

Post processsing

A likely scenario is to see a lot of rubbish being detected outside the tissue regions. This happens simply because our algorithm hasn't seen enough 'negative non-cell' events from a chohort other than Lung TRACERx. Though much of this rubbish should be avoided with tissue segmentation, however, we provide a simple MATLAB script for post processing (cleaning) under: post_proc. This script should also create a summary for all slides in one table: number and relative percentage of cells identified for each class.

Test data (LATTICe-A annotations)

Single-cell expert pathology annotations from the LATTICe-A cohort are provided under: test_data. This test dataset represents one of several external validations performed in the paper.

The R scripts is provided to re-generate single-cell accuracy results - you should be able to replicate Table S3 from the paper using:

  • latticea_test_data/imgs: the original raw H&E tiles used for single-cell pathology annoations.

  • latticea_test_data/gt_celllabels: expert pathology annotations in the form of class, and x,y coordinates.

  • latticea_test_data/dl_celllabels: our final cell predictions from this pipeline.

Multiplex IHC

By large, this pipeline is designed for H&E images as they make the bulk of our paper. For multiplex IHC images (CD8-CD4-FOXP3); refer to Methods in the paper. Depending on your IHC images (combination of colors, cytoplasmic/nuclear staining), the pipeline may need some modification.

Training

Training codes are available for each step of this pipeline. We aim to update this repo with a more recent version (updated codes, tf version 1.13).