Nordlund-ALL-subtyping

DNA methylation-based subtype prediction for pediatric acute lymphoblastic leukemia

ALL subtyping based on 450k methylation assay

Original publication: Nordlund et al. 2014, DNA methylation-based subtype prediction for pediatric acute lymphoblastic leukemia.

This code was produced by the groups of Cancer Pharmacology and Computational Medicine and Molecular Medicine at the Department of Medical Sciences at Uppsala University.

System requirements

Software

The code is written for a Unix or Linux operating system with R version 3.0.1 or later, but can be modified to run under Windows fairly easily. The packages doMC (1.3-0), GEOquery (2.26.2), pamr, predict (2.1-8) and roxygen2 (2.2.2) are required, but are installed automatically. The predict package was developed in-house and will soon be released on CRAN (manuscript in preparation).

The analyse450k package that appears at the end of analyze_final.R is an in-house package for data management that will not be distributed. Instead, when the manuscript is accepted for publication and the validation dataset is made publicly available on GEO, it will be incorporated into setup.R.

Hardware

Peak memory usage is 20 GB of RAM (in process_methylation.R). If you wish to run the analysis in multicore mode, 13 GB per core is required.
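The memory figures above can be turned into a rough estimate of how many cores a machine can support. The sketch below is illustrative only: the function name max_cores is hypothetical, and it assumes that memory use scales linearly at 13 GB per core and that at least 20 GB is needed to run at all.

```shell
#!/bin/sh
# Memory figures are taken from this README.
# ASSUMPTION: memory use scales linearly with the number of cores.
SINGLE_CORE_GB=20   # peak memory of a single process
PER_CORE_GB=13      # memory needed per core in multicore mode

# max_cores RAM_GB: print how many cores fit in RAM_GB of memory
# (hypothetical helper, not part of the repository).
max_cores() {
    ram_gb=$1
    if [ "$ram_gb" -lt "$SINGLE_CORE_GB" ]; then
        echo 0      # below the single-process peak: not enough memory
        return
    fi
    echo $(( ram_gb / PER_CORE_GB ))   # integer division rounds down
}

max_cores 64   # prints 4
```

On Linux, the total RAM in GB can usually be read with awk '/MemTotal/ {print int($2/1024/1024)}' /proc/meminfo and passed to the function.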

Instructions

The most convenient way to run the analysis is to clone the repository to your computer and run the files as shown below (commands for Unix/Linux).

git clone git@github.com:Molmed/Nordlund-ALL-subtyping.git

Replication of the study

To replicate the entire training procedure and create the classifier used for final prediction, run the following commands.

cd Nordlund-ALL-subtyping
R -f setup.R
R -f analyze_tune.R
R -f analyze_final.R

Note that analyze_tune.R and analyze_final.R are designed to run on multiple CPU cores, and that you need to specify manually how many to use by editing the files and setting the variable number.of.cores. analyze_tune.R can also be run on multiple machines to reduce computation time further; see the comments at the beginning of the file for instructions.
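If you prefer not to edit the files by hand, the core count could be set from the shell. This is a minimal sketch, not part of the repository: the helper set_cores is hypothetical, and it assumes the scripts assign the variable on a single line of the form number.of.cores <- 1 (or with =). Check the actual files before relying on this pattern.

```shell
#!/bin/sh
# set_cores FILE N: rewrite the integer assigned to number.of.cores in FILE.
# ASSUMPTION: the assignment sits on one line, e.g.
#   number.of.cores <- 1
# A backup of the original file is kept as FILE.bak.
set_cores() {
    file=$1
    n=$2
    sed -i.bak -E \
        "s/^([[:space:]]*number\.of\.cores[[:space:]]*(<-|=)[[:space:]]*)[0-9]+/\1$n/" \
        "$file"
}

# Example, run from the repository root:
#   set_cores analyze_tune.R 4
#   set_cores analyze_final.R 4
```

Lines that do not match the assumed pattern are left unchanged, so it is worth confirming the edit with grep number.of.cores on each file afterwards.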

  • setup.R will download all data from GEO and prepare it for use in R.
  • analyze_tune.R will perform the doubly cross-validated feature selection routine necessary for model parameter tuning and performance estimation.
  • analyze_final.R will perform the model tuning, estimate performance and build the final model that was presented in the paper. It will also produce some tables of the results.

Use the final classifier directly

To use the trained classifier on your own data, follow the instructions in the script classifier/classifier.R.