Starter code for the CellSignal NeurIPS 2019 competition hosted on Kaggle.
To learn more about the dataset please visit RxRx.ai.
Here are some notebooks to illustrate how this code can be used.
This starter code works with Python 2.7 and above. To install the deps needed for training and visualization, run:
pip install -r requirements.txt
If you plan on using the preprocessing functionality, you also need to install additional deps:
pip install -r preprocessing_requirements.txt
Reading individual image files can become an IO bottleneck during training. This will be a common problem for anyone using this dataset, so we are also releasing example scripts that pack the images into TFRecord and zarr files. We are also making some pre-created TFRecords available in Google Cloud Storage; read more about the provided TFRecords below.
Script that packs raw images from the rxrx1 dataset into TFRecords. This script runs locally or on Google Cloud Dataflow. Run python -m rxrx.preprocess.images2tfrecords --help for usage instructions.
Script that packs raw images from the rxrx1 dataset into zarrs. This script only runs locally but could easily be extended to run on Google Cloud Dataflow. It packs each site image into a single zarr: instead of having to load 6 separate channel pngs for a single image, all of those channels are saved together in one zarr file. You could extend the script to pack more images into a single zarr file, similar to what is done for TFRecords. This is left as an exercise for the IO-bound reader. :) Read more about the Zarr format and library here. Run python -m rxrx.preprocess.images2zarr --help for usage instructions.
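To give a sense of what consuming one of these files might look like, here is a minimal sketch; the output path is hypothetical, and the array layout (a single 512 x 512 x 6 stack per site, one channel per stain) is an assumption about how the script writes its output:

import numpy as np
import zarr

# Hypothetical path; the actual naming is determined by images2zarr.
site = zarr.open("out/HEPG2-10_p1_s1.zarr", mode="r")

# Assumed layout: one array per site with all 6 stain channels stacked.
img = np.asarray(site)
print(img.shape)  # e.g. (512, 512, 6)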
This repo has barebones starter code for training a model on the RxRx1 dataset using Google Cloud TPUs.
The easiest way to see this in action is to look at this notebook.
You can also spin up a VM to launch jobs from. To understand TPUs, the best place to start is the TPU quickstart guide. The ctpu command is helpful, and you can find its documentation here. Note that you can easily download and install ctpu on your local machine.
First spin up a VM:
ctpu up -vm-only -forward-agent -forward-ports -name my-tpu-vm
This command will create the VM and ssh you into it. Note how the -vm-only flag is used: this allows you to spin up the VM separately from the TPU, which helps prevent spending money on idle TPUs.
Next, set up the repo and install the dependencies:
git clone git@github.com:recursionpharma/rxrx1-utils.git
cd rxrx1-utils
pip install -r requirements.txt # optional if just training!
Note that for just training you can skip the pip install since the VM will already have all the needed deps.
Next you need to spin up a TPU for training:
export TPU_NAME=my-tpu-v3-8
ctpu up -name "$TPU_NAME" -preemptible -tpu-only -tpu-size v3-8
Once that is complete you can start a training job:
python -m rxrx.main --model-dir "gs://path-to-bucket/trial-id/"
You'll also want to launch tensorboard to check on the results:
tensorboard --logdir=gs://path-to-bucket/
Since we used the -forward-ports flag in the ctpu command when starting the VM, you will be able to view tensorboard on your localhost.
Once you are done with the TPU, be sure to delete it!
ctpu delete -name "$TPU_NAME" -tpu-only
You can then iterate on the code and spin up a new TPU when you are ready for another run.
When you are done with your VM you can either stop it or delete it with the ctpu command, for example:
ctpu delete -name my-tpu-vm
As noted above, we are providing pre-created TFRecords. They live in the following buckets:
gs://rxrx1-us-central1/tfrecords
gs://rxrx1-europe-west4/tfrecords
The data lives in these two regional buckets because TPU training should read from a bucket in the same region as the TPU. Remember to use the bucket that matches your TPU's region!
The directory structure of the TFRecords is as follows:
└── tfrecords
    ├── by_exp_plate_site-42
    │   ├── HEPG2-10_p1_s1.tfrecord
    │   ├── HEPG2-10_p1_s2.tfrecord
    │   ├── …
    │   ├── U2OS-03_p3_s2.tfrecord
    │   ├── U2OS-03_p4_s1.tfrecord
    │   └── U2OS-03_p4_s2.tfrecord
    └── random-42
        ├── train
        │   ├── 001.tfrecord
        │   ├── 002.tfrecord
        │   ├── …
The random-42 partition denotes that the data has been split randomly across different tfrecords, each record holding ~1000 examples. The 42 is the random seed used to generate this partition. The example code in this repository uses this version of the data.
The by_exp_plate_site-42 grouping is where each TFRecord contains all of the images for a particular experiment, plate, and site. Within each TFRecord the well addresses are in random order. The advantage of this grouping is that you can be selective about which experiments you train on, as sketched below. Due to the grouping, each TFRecord here holds only about 277 examples.
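For example, here is a minimal sketch of selecting one experiment's files; the bucket path follows the layout above, and the HEPG2-10 filename prefix is taken from the tree (treat both as illustrative):

import tensorflow as tf

# List only the HEPG2-10 experiment's files from the grouped layout.
pattern = "gs://rxrx1-us-central1/tfrecords/by_exp_plate_site-42/HEPG2-10_*.tfrecord"
files = tf.io.gfile.glob(pattern)
print(len(files), "tfrecord files for HEPG2-10")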
For good training batch diversity it is recommended that you use the TF Dataset API to interleave examples from these various files. The provided input_fn in this repository already does this.
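If you want to roll your own input pipeline, a minimal sketch of that interleaving pattern looks like the following; the path follows the random-42 layout above, the cycle and batch sizes are illustrative, and parsing of the serialized examples is left to the feature spec used by the provided input_fn:

import tensorflow as tf

# Shuffle the file list, then read several files at once so each batch
# mixes examples drawn from different tfrecords.
files = tf.data.Dataset.list_files(
    "gs://rxrx1-us-central1/tfrecords/random-42/train/*.tfrecord",
    shuffle=True)
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=8)
dataset = dataset.shuffle(buffer_size=2048).batch(128)
# Each element is a serialized tf.Example; parse it with the same
# feature spec the provided input_fn uses.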