WISP (Weighted In Silico Pathology) is a novel approach to assess intra-tumoral heterogeneity from bulk molecular profiles. Based on predefined pure molecular or histological populations for a particular cancer type, our approach gives a fine description of each tumor in a standardized way. The methodology is based on non-negative least squares regression and quadratic programming optimization for estimating the mixing proportions of distinct populations for a tumor sample. It can be applied on trancriptomic data or methylation data.
WISP consists in 3 main steps:
1- Calculation of pure population centroid profiles.
2- Estimation of mixing proportions of pure populations for each tumoral sample.
3- Graphical representations of the mixing proportion estimations for all samples.
For now, please provide a link to this github repository:
https://cit-bioinfo.github.io/WISP/
You may install this package with devtools:
devtools::install_github("cit-bioinfo/WISP")
library(WISP)
Illustrative gene expression datasets of presupposed pure histological samples and mixed samples from pleural mesothelioma restricted to specific markers (for computing time reason).
data(dataWISP)
dataWISP
is a list containing the following objects:
datapure
: data frame of gene expression profiles of the presupposed pure histological samples.
dataWISP$datapure[1:5,1:5]
## T036 T025 T022 T004 T016
## gene.1 11.456054 11.80786 10.205246 12.201666 6.373776
## gene.2 10.643914 13.40374 13.182918 13.249751 11.930280
## gene.3 10.722577 12.25111 7.501498 11.650849 5.084441
## gene.4 9.395325 10.71960 10.827885 9.734774 9.992148
## gene.5 6.529778 11.83156 9.828759 11.486970 6.404284
clpure
: vector of histology annotation for the presupposed pure histological samples.
head(dataWISP$clpure)
## T036 T025 T022 T004 T016 T014
## Epure Epure Epure Epure Epure Epure
## Levels: Epure normal Spure
data
: data frame of gene expression profiles of mixed histological samples.
dataWISP$data[1:5,1:5]
## T013 T058 T066 T070 T078
## gene.1 8.178341 11.550232 11.840814 4.877618 11.587651
## gene.2 12.967090 12.756574 13.391280 6.559053 13.054004
## gene.3 11.042867 11.442587 11.789554 4.659080 11.373127
## gene.4 8.818490 8.240092 9.041123 4.707080 8.435392
## gene.5 10.638619 10.920845 11.460884 5.814253 9.994315
(optional) histo
: histological annotation for all the samples present in data
object.
head(dataWISP$histo)
## T013 T058
## "Epithelioid mesothelioma" "Epithelioid mesothelioma"
## T066 T070
## "Epithelioid mesothelioma" "Epithelioid mesothelioma"
## T078 T080
## "Epithelioid mesothelioma" "Epithelioid mesothelioma"
WISP.getPureCentro
function takes as input a data frame of gene expression restricted to samples considered as pure by the user (in our example datapure
) and a vector of population names for each sample (in our example clpure
). If pureSamples_filtering
is set to TRUE, WISP will perform its sample filtering procedure. More precisely, for each presupposed pure sample, it estimates the proportions of all populations: if the difference between the supposed pure class proportion and the one of the second existing contingent is more that the threshold specified in pureSamples_deltaTopWeights
argument, the sample is considered as a mixture of populations and is removed. For example, a threshold of 0.6 means that one expects to have at least 60% of the main existing population in a sample to consider it as pure.
If pureSamples_filtering
is set to FALSE the function will take all the presupposed pure samples to calculate the pure centroid profiles.
For the centroid calculation, the function retrieves the markers that are specific to each population: if you want a particular number of markers per class you can set nb_markers_selection
to "custom" and nb_markers_max_perClass
with the desired number of markers. The final selection will be based on ANOVA p-value cutoff (markers_cutoff_pval_anovatest
) and AUC cutoff (markers_cutoff_auc
); genes are ranked by their logFC. If nb_markers_selection
is set to "optim_kappa", the number of markers is chosen so that it optimizes the condition number.
If you wish particular genes to be present in the final centroid signature you can add a vector of gene names in add_markers
. If sum_LessThanOne
is set to TRUE (default), quadratic optimization will use the constraint: sum of coefficients less than or equal to 1, if FALSE the sum of the coefficient will be equal to 1. This can be important if not all pure populations present in the mixed samples have been considered by the user.
resPure = WISP.getPureCentro(data= dataWISP$datapure,cl= dataWISP$clpure, pureSamples_filtering = TRUE, nb_markers_selection = "custom", nb_markers_max_perClass = 50, markers_cutoff_pval_anovatest = 0.05, markers_pval_anovatest_fdr = TRUE, markers_cutoff_auc = 0.8, pureSamples_deltaTopWeights = 0.6, plot_heatmap = TRUE, add_markers = NULL, col_purePop = c("Epure"="#f46d43","normal"="grey","Spure"="#66c2a5"), sum_LessThanOne = FALSE)
## [1] "Retrieving markers"
## [1] "Centroid calculation"
## [1] "Weight estimation"
## [1] "5 samples were removed T036" "5 samples were removed T016"
## [3] "5 samples were removed T080" "5 samples were removed N018"
## [5] "5 samples were removed N017"
## [1] "Retrieving markers"
## [1] "Centroid calculation"
## [1] "Weight estimation"
## [1] "No sample was removed "
## [1] "Final Centroid calculation (on 17 samples)"
The returned object of addition of a heatmap (if plot_heatmap
=TRUE) is a list with the following objects:
genescentro
: centroids of pure populations based on the pure samples.
indkept
: pure samples kept by WISP.
indremoved
: samples that were not considered as pure and removed by WISP.
genesclasses
: list of markers that were used in centroid calculation for each pure population.
lapply(resPure, head)
## $genescentro
## Epure Spure normal
## gene.1 11.40341 4.847256 7.698763
## gene.2 12.86137 6.867839 10.366323
## gene.3 10.32061 4.643993 7.937093
## gene.4 9.68924 4.802799 5.343581
## gene.5 10.69767 5.933601 6.433749
## gene.6 12.78011 8.357490 8.071400
##
## $indkept
## Epure Epure Epure Epure Epure Epure
## "T025" "T022" "T004" "T014" "T131" "T115"
##
## $indremoved
## Epure Epure Epure normal normal
## "T036" "T016" "T080" "N018" "N017"
##
## $genesclasses
## $genesclasses$Epure
## [1] "gene.1" "gene.2" "gene.3" "gene.4" "gene.5" "gene.6" "gene.7" "gene.8" "gene.9"
## [10] "gene.10" "gene.11" "gene.12" "gene.13" "gene.14" "gene.15" "gene.16" "gene.17" "gene.18"
## [19] "gene.19" "gene.20" "gene.21" "gene.22" "gene.23" "gene.24" "gene.25" "gene.26" "gene.27"
## [28] "gene.28" "gene.29" "gene.30" "gene.31" "gene.32" "gene.33" "gene.34" "gene.35" "gene.36"
## [37] "gene.37" "gene.38" "gene.39" "gene.40" "gene.41" "gene.42" "gene.43" "gene.44" "gene.45"
## [46] "gene.46" "gene.47" "gene.48" "gene.49" "gene.50"
##
## $genesclasses$normal
## [1] "gene.51" "gene.52" "gene.53" "gene.54" "gene.55" "gene.56" "gene.57" "gene.58"
## [9] "gene.59" "gene.60" "gene.61" "gene.62" "gene.63" "gene.64" "gene.65" "gene.66"
## [17] "gene.67" "gene.68" "gene.69" "gene.70" "gene.71" "gene.72" "gene.73" "gene.74"
## [25] "gene.75" "gene.76" "gene.77" "gene.78" "gene.79" "gene.80" "gene.81" "gene.82"
## [33] "gene.83" "gene.84" "gene.85" "gene.86" "gene.87" "gene.88" "gene.89" "gene.90"
## [41] "gene.91" "gene.92" "gene.93" "gene.94" "gene.95" "gene.96" "gene.97" "gene.98"
## [49] "gene.99" "gene.100"
##
## $genesclasses$Spure
## [1] "gene.101" "gene.102" "gene.103" "gene.104" "gene.105" "gene.106" "gene.107" "gene.108"
## [9] "gene.109" "gene.110" "gene.111" "gene.112" "gene.113" "gene.114" "gene.115" "gene.116"
## [17] "gene.117" "gene.118" "gene.119" "gene.120" "gene.121" "gene.122" "gene.123" "gene.124"
## [25] "gene.125" "gene.126" "gene.127" "gene.128" "gene.129" "gene.130" "gene.131" "gene.132"
## [33] "gene.133" "gene.134" "gene.135" "gene.136" "gene.137" "gene.138" "gene.139" "gene.140"
## [41] "gene.141" "gene.142" "gene.143" "gene.144" "gene.145" "gene.146" "gene.147" "gene.148"
## [49] "gene.149" "gene.150"
WISP.getWeight
function takes as input a data frame of samples without any prior knowledge and the centroid profiles given by WISP.getPureClass
. We let the user perform its own normalization method for the expression dataset as well as using unlogged data and centroids which can provide better results. If the centroid result is applied to another technology (example: pure centroid profiles estimated using expression array dataset and then used to estimate weights from RNAseq profiles), we suggest the user to center or scale the data using the scaling
argument. In order to evaluate the accuracy of the models, the function performs a global F-test for each model and calculates the adjusted R-squared. Based on these two criteria, the function will annotate the sample as "LIMIT" or "OK" in the last column of the output object. Cutoff for the F-test p-value and adjusted R-squared can be set in the arguments cutoff_gobalFtest
and Rsquared_cutoff
respectively. If sum_LessThanOne
is set to TRUE (default), quadratic optimization will use the constraint: sum of coefficients less than or equal to 1, if FALSE the sum of the coefficient will be equal to 1. This can be important if not all pure populations present in the mixed samples have been considered by the user.
resW = WISP.getWeight(dataWISP$data, resPure$genescentro, scaling = c("none", "scale", "center")[1], cutoff_gobalFtest = 0.05, Rsquared_cutoff = 0.5, cutoff_ttest_weights = 0.05, sum_LessThanOne = FALSE)
head(resW)
## weight.Epure weight.normal weight.Spure dist.Obs.Model Ftest.pvalue
## T013 0.8087 0.0766 0.1147 12.18 5.943785e-127
## T058 0.8475 0.0459 0.1066 8.19 6.107129e-152
## T066 0.9998 0.0002 0.0000 7.33 1.049369e-159
## T070 0.0000 0.0000 1.0000 10.93 2.392456e-132
## T078 0.8389 0.0000 0.1611 11.34 4.226689e-131
## T080 0.6398 0.0532 0.3070 15.95 3.107440e-109
## Adjusted.R.squared Pvalue.Epure Pvalue.normal Pvalue.Spure
## T013 0.982 2.194218e-55 0.003894200 5.203173e-04
## T058 0.991 5.567254e-80 0.009932865 2.485520e-06
## T066 0.993 1.791295e-96 0.989862939 1.000000e+00
## T070 0.984 1.000000e+00 1.000000000 2.429116e-72
## T078 0.984 4.331416e-61 1.000000000 3.287125e-07
## T080 0.960 3.788155e-32 0.121916513 2.150684e-11
## weight.Epure.filtered weight.normal.filtered weight.Spure.filtered
## T013 0.8087 0.0766 0.1147
## T058 0.8475 0.0459 0.1066
## T066 0.9998 0.0000 0.0000
## T070 0.0000 0.0000 1.0000
## T078 0.8389 0.0000 0.1611
## T080 0.6398 0.0000 0.3070
## topWeightedClass deltaTopWeights CI.Epure CI.normal CI.Spure
## T013 Epure 0.6940 [0.75, 0.87] [0.02, 0.13] [0.05, 0.18]
## T058 Epure 0.7409 [0.8, 0.89] [0.01, 0.08] [0.06, 0.15]
## T066 Epure 0.9996 [0.96, 1] [0, 0.03] [0, 0.04]
## T070 Spure 1.0000 [0, 0.06] [0, 0.05] [0.94, 1]
## T078 Epure 0.6778 [0.78, 0.9] [0, 0.05] [0.1, 0.22]
## T080 Epure 0.3328 [0.56, 0.72] [0, 0.12] [0.22, 0.39]
## WARNING
## T013 OK
## T058 OK
## T066 OK
## T070 OK
## T078 OK
## T080 OK
The output is a data frame containing the weight (=proportion) of each pure population, the distance to the model, the F-test p-value, the adjusted R-squared and for each weight the Confidence Interval and the t-test p-value. The last column gives a warning if the model does not meet the criteria described above.
We propose two different representations. Either a heatmap or a barplot as specified in the argument plot_type
.
WISP.getPlot(resW, annotsup = dataWISP$histo, plot_type = "heatmap", col_purePop = c("Epure"="#f46d43","Spure"="#66c2a5", "normal"="grey"))
The barplot can be divided by the top weight population assignment for each sample or by an optional supplementary annotation.
WISP.getPlot(resW, annotsup = dataWISP$histo, plot_type = "barplot", barplot_split = "by.topweight", col_purePop = c("Epure"="#f46d43","Spure"="#66c2a5", "normal"="grey"))
WISP.getPlot(resW, annotsup = dataWISP$histo, plot_type = "barplot", barplot_split = "by.annotsup", col_purePop = c("Epure"="#f46d43","Spure"="#66c2a5", "normal"="grey"))