Bioinformatics 2018 - Distinguishing prognostic and predictive biomarkers: An information theoretic approach
Information theoretic predictive biomarker ranking
Date: 02/02/2018
Paper: Distinguishing prognostic and predictive biomarkers: An information theoretic approach Authors: Konstantinos Sechidis, Konstantinos Papangelou, Paul D. Metcalfe, David Svensson, James Weatherall and Gavin Brown
Platform: R Version 3.3.1
Required packages: MASS, infotheo
Maintainer: Konstantinos Sechidis konstantinos.sechidis@manchester.ac.uk
Description: Deriving rankings that capture the predictive biomarker strength through univariate (INFO) or higher-order (INFO+) methods
Functions:
INFOplus.Output_Categorical.Covariates_Categorical(data,labels,treatment,top_k)$ranking
This function returns the predictive ranking, the input arguments are
data: A matrix containing the covariates (biomarkers). The columns capture the different covariates, while the rows the different examples (patients). For this function the covariates are categorical (nominal).
labels: A vector that contains the output (target) label for each patient, in this case it takes categorical (nominal) values.
treatment: A vector that describes the treatment allocation (i.e. T=0 control group, T=1 experimental treatment).
top_k: The number of top-k predictive biomarkers to be returned.
Furthermore we provide functions that can be used for various data types:
INFOplus.Output_Categorical.Covariates_Continuous
: The covariates can be either all continuous or mixed (continuous and categorical). To discretise continuous covariates we follow by default Scott's rule.
INFOplus.Output_Survival.Covariates_Categorical
: For survival (time-to-event) output targets and categorical covariates.
INFOplus.Output_Survival.Covariates_Categorical
: For survival (time-to-event) output targets and continuous or mixed (continuous and categorical) covariates.
Finally, we provide the same functions for deriving the uni-variate INFO ranking.
Example
We provide a source code (Functions-GenerateData.R
) to generate the synthetic scenarios presented in the paper. The following example shows how to derive the predictive rankings using our code.
## Load libraries
library(MASS) # To generate synthetic data by sampling a Multivariate Normal
library(infotheo) # Information theoretic library
## Load sources
source("Functions-GenerateData.R") # Function to generate synthetic data
source("InformationTheory-PredictiveRankings.R") # Functions to derive predictive rankings
###################################
##### Generate synthetic data #####
###################################
model <- 3 ; # Which model to use (1, 2, 3, 4, 5, 6, 7) - details on the paper
theta_pred <- 1 # Strength of predictive part
num_features <- 20 # Number of covariates
sample_size <- 2000 # Number of examples
dataset <- Generate.Data(sample_size,num_features,theta_pred,model)
# The methods will return the top-k biomarkers
top_k <-5
#######################################################
# Ranking the biomarkers on their predictive strength #
#######################################################
# INFO, which captures first order interactions (returns the top_k = 5 biomarkers)
INFO.Output_Categorical.Covariates_Categorical(dataset$data,dataset$labels,dataset$treatment)$ranking[1:top_k] # this function returns the ranking
# INFO+, which captures second order interactions (returns the top_k = 5 biomarkers)
INFOplus.Output_Categorical.Covariates_Categorical(dataset$data,dataset$labels,dataset$treatment,top_k)$ranking # this function returns the ranking