The ClusterSignificance package is written in R and is hosted on Bioconductor.
The ClusterSignificance package provides tools to assess whether clusters, e.g. in a principal component analysis (PCA), are separated differently from random or permuted data. This is accomplished in a three-step process: projection, classification, and permutation. To compare cluster separations, each separation must be given a score. First, all data points in each cluster are projected onto a line (projection), after which the separation between two groups at a time is scored (classification). Finally, to obtain a p-value for the separation, the separation score for the real data is compared to the separation scores for permuted data (permutation).
The release version of ClusterSignificance can be installed in R from Bioconductor as follows:
install.packages("BiocManager")
BiocManager::install("ClusterSignificance")
To install the development version use:
install.packages("devtools")
devtools::install_github("jasonserviss/ClusterSignificance")
While we recommend reading the vignette, the instructions that follow will allow you to quickly get a feel for how ClusterSignificance works and what it is capable of.
Here we utilize the example data included in the ClusterSignificance package for the Pcp method.
We start by projecting the points into one dimension using the Pcp method. We are able to visualize each step in the projection by plotting the results as shown below.
library(ClusterSignificance)
classes <- rownames(pcpMatrix)
prj <- pcp(pcpMatrix, classes)
plot(prj)
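To make the idea of projecting into one dimension concrete, here is a minimal base R sketch. It is not the package's Pcp implementation: as a simplified stand-in it projects hypothetical toy data onto a straight line (the first principal component from prcomp). The names toy, toyClasses, and oneDim are invented for this illustration.

```r
## Toy data (hypothetical): two groups of 10 points in three dimensions.
set.seed(1)
toy <- rbind(
  matrix(rnorm(30, mean = 0), ncol = 3),
  matrix(rnorm(30, mean = 6), ncol = 3)
)
toyClasses <- rep(c("a", "b"), each = 10)

## Simplified stand-in for the projection step: place every point on a
## straight line, here the first principal component.
pca <- prcomp(toy, center = TRUE, scale. = FALSE)
oneDim <- pca$x[, 1]

## Each point is now described by a single coordinate along that line.
split(round(oneDim, 2), toyClasses)
```

Once every point has a single coordinate, separating two groups reduces to choosing a cut point along the line.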
Now that the points are in one dimension, we can score each possible separation and deduce the max separation score. This is accomplished with the classify command (again, we can plot the results afterwards). The vertical lines in the plot represent the separation score for each possible separation.
## Classify and plot.
cl <- classify(prj)
plot(cl)
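The idea of scoring every possible separation can be sketched in base R as follows. The data are hypothetical, and the score used here (correct minus incorrect assignments for a given cut) is an assumption chosen for illustration; the formula that classify actually uses is documented in the vignette.

```r
## Hypothetical one-dimensional coordinates for two groups.
oneDim <- c(0.2, 0.5, 0.9, 1.4, 2.1, 2.6, 3.0, 3.3)
grp    <- c("a", "a", "a", "b", "a", "b", "b", "b")

## Candidate separations: midpoints between neighbouring sorted points.
sorted <- sort(oneDim)
cuts <- (head(sorted, -1) + tail(sorted, -1)) / 2

## Illustrative score (an assumption, not the package's formula):
## correct minus incorrect assignments, ignoring which side is which.
scoreCut <- function(cut) {
  left <- oneDim < cut
  correct <- sum(left & grp == "a") + sum(!left & grp == "b")
  as.numeric(abs(correct - (length(oneDim) - correct)))
}
scores <- vapply(cuts, scoreCut, numeric(1))

## The max score over all candidate separations summarizes the pair.
maxScore <- max(scores)
bestCut <- cuts[which.max(scores)]
```

In this toy example one point from group "a" lies among the "b" points, so no cut separates the groups perfectly and the max score stays below the number of points.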
Finally, now that we have determined the max separation score, we can permute the data to examine how many permuted max scores exceed our real max score and thus calculate a p-value for our separation. Plotting the permutation results shows a histogram of the permuted max scores, with the red line representing the real score.
## Set the seed and number of iterations.
set.seed(3)
iterations <- 100
## Permute and plot.
pe <- permute(
  mat = pcpMatrix,
  iter = iterations,
  classes = classes,
  projmethod = "pcp"
)
## initializing permutation analysis
## 100 iterations were sucessfully completed for comparison class1 vs class2
## 100 iterations were sucessfully completed for comparison class1 vs class3
## 100 iterations were sucessfully completed for comparison class2 vs class3
plot(pe)
To calculate the p-value we use the following command.
pvalue(pe)
## class1 vs class2 class1 vs class3 class2 vs class3
## 0.01 0.15 0.01
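The reasoning behind a permutation p-value can be sketched in base R. The scores below are hypothetical, and the floor of 1 / iterations when no permuted score exceeds the real one is an assumption made for this sketch, not a statement about how the package computes its result.

```r
## Hypothetical max scores: one from the real data, 100 from permuted data.
realScore <- 8
permScores <- rep(c(3, 5, 9, 4, 6), times = 20)

## Fraction of permuted max scores that exceed the real max score.
nExceed <- sum(permScores > realScore)
pval <- nExceed / length(permScores)

## Assumption: a finite number of permutations cannot resolve p-values
## below 1 / iterations, so report that limit instead of zero.
pval <- max(pval, 1 / length(permScores))
pval
```

Under this assumption, 100 iterations cannot yield a p-value below 0.01, so increasing the number of iterations is the way to resolve smaller p-values.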
Questions about the ClusterSignificance package can be asked on the Bioconductor support site (https://support.bioconductor.org). Issues and bugs can be reported via GitHub at jasonserviss/ClusterSignificance.
Jason T. Serviss, Jesper R. Gådin, Per Eriksson, Lasse Folkersen, Dan Grandér; ClusterSignificance: a bioconductor package facilitating statistical analysis of class cluster separations in dimensionality reduced data, Bioinformatics, Volume 33, Issue 19, 1 October 2017, Pages 3126–3128, https://doi.org/10.1093/bioinformatics/btx393
Citation information can be found in R using:
library(ClusterSignificance)
citation("ClusterSignificance")