DS-GA 1006 Capstone Project for Joyce Wu, Raúl Delgado Sánchez and Eduardo Fierro Farah.
- Download the GDC Data Transfer Tool (`gdc-client`)
- Create a manifest by selecting Cases > CANCER_TYPE and Files > Data Type > Tissue Slide Image.
- Download the manifest into `manifest_file`
- Run `gdc-client download -m manifest_file` in Terminal
Note that the data tiling and sorting scripts come from Nicolas Coudray; please refer to the README within DeepPATH_code for the full range of options. Also note that these scripts are computationally intensive: we recommend submitting sections 2.1 and 2.2 as jobs on a high-performance computing cluster with multiple CPUs.
Run `Tiling/0b_tileLoop_deepzoom2.py` to tile the .svs images into .jpeg images. To replicate this particular project, select the following specifications:

```
python -u Tiling/0b_tileLoop_deepzoom2.py -s 512 -e 0 -j 28 -f jpeg -B 25 -o <OUT_PATH> "<INPUT_PATH>/*/*svs"
```
- `<INPUT_PATH>`: path to the outer directory of the original .svs files
- `<OUT_PATH>`: path to which the tile files will be saved
- `-s 512`: tile size of 512x512 pixels
- `-e 0`: zero pixels of overlap between tiles
- `-j 28`: 28 CPU threads
- `-f jpeg`: output .jpeg files
- `-B 25`: 25% allowed background within a tile
To ensure that the later sections work properly, we recommend running these commands within `<ROOT_PATH>`, the directory in which your images will be stored:

```
mkdir <CANCER_TYPE>TilesSorted
cd <CANCER_TYPE>TilesSorted
```

- `<CANCER_TYPE>`: the dataset, such as 'Lung', 'Breast', or 'Kidney'
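For example, for the lung dataset the two commands would read:

```
mkdir LungTilesSorted
cd LungTilesSorted
```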
Next, run `Tiling/0d_SortTiles.py` to sort the tiles into train, valid, and test datasets with the following specifications:

```
python -u <FULL_PATH>/Tiling/0d_SortTiles.py --SourceFolder="<INPUT_PATH>" --JsonFile="<JSON_FILE_PATH>" --Magnification=20 --MagDiffAllowed=0 --SortingOption=3 --PercentTest=15 --PercentValid=15 --PatientID=12 --nSplit 0
```
- `<FULL_PATH>`: the full path to the cloned repository
- `<INPUT_PATH>`: path in which the tile files were saved; should be the same as `<OUT_PATH>` of step 2.1
- `<JSON_FILE_PATH>`: path to the JSON file that was downloaded with the .svs tiles
- `--Magnification=20`: magnification at which the tiles should be considered (20x)
- `--MagDiffAllowed=0`: if the requested magnification does not exist for a given slide, take the nearest existing magnification, but only if it is within +/- the amount allowed here (0)
- `--SortingOption=3`: sort according to type of cancer (types of cancer + Solid Tissue Normal)
- `--PercentValid=15 --PercentTest=15`: the percentage of data to be assigned to the validation and test sets; in this case, it results in a 70/15/15% train-valid-test split
- `--PatientID=12`: ensures that all tiles from a given patient end up in exactly one of the train, valid, or test sets, rather than being divided among them (the first 12 characters of the file name are used as the patient identifier)
- `--nSplit=0`: if nSplit > 0, it overrides the PercentValid and PercentTest options, splitting the data into n even folds
Run `Tiling/BuildTileDictionary.py` to build a dictionary of slides that is used to map each slide to a 2D array of tile paths and the true label. This is used in the `aggregate` function during training and evaluation.

```
python -u Tiling/BuildTileDictionary.py --data <CANCER_TYPE> --path <ROOT_PATH>
```
- `<ROOT_PATH>`: the directory in which the sorted tiles folder is stored, same as in 2.2

Note that this code assumes the sorted tiles are stored in `<ROOT_PATH><CANCER_TYPE>TilesSorted`. If you do not follow this convention, you may need to modify this code.
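To sanity-check the result, you can load the pickled dictionary directly. A minimal sketch, assuming the output file is the `<CANCER_TYPE>_FileMappingDict.p` referenced in the training section below (the concrete path is hypothetical):

```python
import pickle

# Hypothetical path; substitute your own <ROOT_PATH> and <CANCER_TYPE>.
tile_dict_path = "/scratch/me/Lung_FileMappingDict.p"

with open(tile_dict_path, "rb") as f:
    tile_dict = pickle.load(f)

# Each key is a slide; each value holds its grid of tile paths and true label.
slide = next(iter(tile_dict))
print(slide, tile_dict[slide])
```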
In the Load data section of `train.py` (lines ~85-96), please modify these variables:

- `root_dir = "<ROOT_PATH><CANCER_TYPE>TilesSorted/"`: change the path to your file path
- `tile_dict_path = "<ROOT_PATH><CANCER_TYPE>_FileMappingDict.p"`: change the path to your tile dictionary path
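For instance, with the lung data stored under a scratch directory (a hypothetical path for illustration), the two lines might read:

```python
root_dir = "/scratch/me/LungTilesSorted/"
tile_dict_path = "/scratch/me/Lung_FileMappingDict.p"
```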
NOTE: We use a very useful tool called Comet ML to keep track of our experiments. If you would like to use it as well, replace the API key in the training code (line ~57) with your own. Otherwise, you can remove the lines of code related to Comet ML.
Run `train.py` to train with our CNN architecture. The sbatch file `run_job.sh` is provided as an example script for submitting this as a GPU job.
- `--cuda`: enables CUDA
- `--ngpu`: number of GPUs to use (default=1)
- `--data`: data to train on (lung/breast/kidney)
- `--augment`: whether to use data augmentation
- `--batchSize`: batch size for data loaders (default=32)
- `--imgSize`: the height/width that the image will be shrunk to (default=299)
- `--metadata`: whether to use metadata. IMPORTANT NOTE: this option is not fully implemented! Please see section 6 for additional information about using the metadata.
- `--nc`: input image channels, plus concatenated info channels if metadata=True (default=3 for RGB)
- `--niter`: number of epochs to train for (default=25)
- `--lr`: learning rate for the optimizer (default=0.001)
- `--decay_lr`: activate the learning rate decay function
- `--optimizer`: Adam, SGD, or RMSprop (default=Adam)
- `--beta1`: beta1 for Adam (default=0.5)
- `--earlystop`: use early stopping
- `--init`: initialization method (normal, xavier, or kaiming; default=normal)
- `--model`: path to a model checkpoint to continue training from (default='')
- `--experiment`: where to store samples and models (default=None)
- `--nonlinearity`: nonlinearity to use (selu, prelu, or leaky; default=relu)
- `--dropout`: probability of dropout in each block (default=0.5)
- `--method`: aggregation prediction method (max; default=average)
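As an illustration, a training run combining several of these options might look like the following; the experiment name is hypothetical, and exact flag handling should be checked against the argument parser in `train.py`:

```
python -u train.py --cuda --data lung --niter 25 --experiment lung_baseline
```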
Run `train_inception.py` to train Google's Inception V3. The sbatch file `run_job_inception.sh` is provided as an example script for submitting this as a GPU job. Note that this version has only been coded to run with the lung cancer task with 3 classes.
Run `test.py` to evaluate a specific model on the test data; `run_test.sh` is the associated sbatch file.

- `--data`: data to evaluate on (lung/breast/kidney)
- `--experiment`: name of the experiment to test, same as in section 4.1
- `--model`: name of the model checkpoint to test, e.g. `epoch_10.pth`
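For example, to evaluate the checkpoint saved after epoch 10 of the hypothetical experiment above:

```
python -u test.py --data lung --experiment lung_baseline --model epoch_10.pth
```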
We explored concatenating the metadata included in the patient JSON file as additional information for the neural network to process. We did not fully implement this option for a few reasons:

- Training was greatly slowed.
- Different cancers may not be compared on a fair basis (e.g. cigarettes smoked per day is more informative for lung cancer than for kidney cancer).
- The work we were replicating did not use any metadata, and we wanted a fair comparison.
However, if you would like to try your hand at using the metadata, you must:

- Look at `iPython Notebooks/LungJsonDescription.ipynb` to explore the metadata you'd like to add
- Create a dictionary of desired JSON inputs as per `JsonParser/LungJsonCleaner.py`
- Modify `parse_json` in `utils/dataloader.py` to add the desired metadata (not implemented for kidney and breast)
- Modify the `aggregate` function in `train.py` so that it concatenates the metadata to the image as per `utils/dataloader.py`. This is the main part that is not implemented for any of the cancers; a sketch of one possible approach follows this list.
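Since this last step is unimplemented, the design is open. Below is a minimal sketch of one possible approach, broadcasting each metadata value into a constant-valued image channel before concatenation; all names are hypothetical, and `--nc` would need to match the resulting channel count:

```python
import torch

def concat_metadata(image, metadata):
    """Append each scalar metadata value to an image as a constant channel.

    image:    float tensor of shape (3, H, W), an RGB tile
    metadata: float tensor of shape (k,), e.g. normalized JSON fields
    returns:  float tensor of shape (3 + k, H, W)
    """
    _, h, w = image.shape
    # Broadcast each metadata scalar into a full H x W plane.
    meta_channels = metadata.view(-1, 1, 1).expand(-1, h, w)
    return torch.cat([image, meta_channels], dim=0)

# Example: a 299x299 RGB tile plus two metadata values -> 5 input channels.
tile = torch.rand(3, 299, 299)
meta = torch.tensor([0.4, 1.0])
print(concat_metadata(tile, meta).shape)  # torch.Size([5, 299, 299])
```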
- `100RandomExamples.ipynb` visualizes 100 random examples of tiles in the datasets
- `Final evaluation and viz.ipynb` provides code for visualizing the output predictions of a model, and for evaluating a model on the test set on CPU
- `LungJsonDescription.ipynb` explores the potential of the metadata that can be used as extra information for training
- `new_transforms_examples.ipynb` visualizes a few examples of the data augmentation used for training; one can tune the data augmentation here
- Documents in this folder are all of the required submissions for our Capstone course. The final report explains our methodology and presents some of our results.