tensor_gp: A Stan repository from davidaknowles

Code for Tensor Gaussian Process Regression for predicting drug combination synergy, developed for the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge 2015:
https://www.synapse.org/#!Synapse:syn4231880/wiki/235645

More details of the method are available here:
https://www.synapse.org/#!Synapse:syn5586109/wiki/394920

Required R packages: rstan, data.table, dplyr, doMC (or at least foreach), irlba (if you want to run the convex PCA component).

Data required:
- Both the Drug_synergy_data and Sanger_molecular_data directories must be in the main directory. ch1_LB.csv and ch2_LB.csv should be in the synergy data folder also.
- CCLE expression data file: CCLE_Expression_Entrez_2012-09-29.gct available from http://www.broadinstitute.org/ccle/data/browseData?conversationPropagation=begin

Key files:
- molecular_data.R processes the raw cell line data and produces distance matrices. It addtionally performs imputation for GE for two cell lines from CCLE and for methylation.
- mono_pca_all.R performs a nuclear-norm regularized PCA on the monotherapy data
- train_gp.R This is run to produce results for all challenge tasks.
Rscript train_gp.R <run> <setup> <sub> <cores> <usetissue> <max_its>

With the full training data (we use all of Ch 1 and Ch 2 for both) the memory consumption is pretty high (like 40Gb per thread) so I only run two cores on the 120Gb machines we have. So Ch 1A was trained using:
train_gp.R 1 final2 A 2 1 30
and predictions are made using
train_gp.R 0 final2 A 1 1 30
Similary for Ch 1B (just change A -> B).

For Ch 2, the training from Ch1 A can be used (pick the top likelihood seed), and then run sub2_predict.R. In principle this could be done using a Stan run, but in practice it's faster to do it manually.

davidaknowles/tensor_gp