YuLab-SMU/ProjectYulab

Cluster name of emapplot_cluster() in enrichplot package

huerqiang opened this issue · 2 comments

emapplot_cluster() currently uses a word cloud as the name of each cluster.

    library(DOSE)
    data(geneList)
    de <- names(geneList)[1:100]
    x <- enrichDO(de)
    x2 <- pairwise_termsim(x)
    emapplot_cluster(x2)

But it is not good enough:
YuLab-SMU/enrichplot#241 (comment)

Please suggest a better way to display the cluster information.
The word cloud code is here:
https://github.com/YuLab-SMU/enrichplot/blob/master/R/wordcloud.R

Question: is there an easy way to extract the clustering information from emapplot()? I've been struggling with this for days... 😢😢😢

In theory, how understandable a cluster label is depends on the number of keywords displayed: the more keywords are shown, the easier the cluster is to interpret. So I think we can leave the choice to users and let them decide how many keywords to show. Here's my example:

    library(DOSE)
    library(enrichplot)
    library(reshape2)
    library(igraph)
    library(magrittr)

    data(geneList)
    de <- names(geneList)[1:100]
    x <- enrichDO(de)
    x2 <- pairwise_termsim(x)

    ## build the term similarity graph
    x3 <- as.data.frame(x2)
    x4 <- x2@termsim[as.character(x3$Description), as.character(x3$Description)]
    w <- melt(x4)
    wd <- w[w[, 1] != w[, 2], ] %>% na.omit()
    wd <- wd[wd$value != 0, ]
    g <- graph_from_data_frame(wd[, -3], directed = FALSE)
    E(g)$value <- wd[, 3]

    ## cluster the terms; the number of clusters is ceiling(sqrt(number of terms))
    centers_g <- ceiling(sqrt(nrow(x4)))
    set.seed(123)  # kmeans() starts from random centers; fix the seed for reproducibility
    k_means <- kmeans(as_adjacency_matrix(g, sparse = FALSE), centers = centers_g)

    ## get the terms of a certain cluster (the 3rd cluster, for instance)
    info_n <- k_means$cluster[k_means$cluster == 3] %>% names()

    ## borrowing the word frequency function from @huerqiang
    ## for each word, count the number of terms that contain it
    get_word_freq <- function(wordd) {
      dada <- strsplit(wordd, " ")
      didi <- table(unlist(dada))
      didi <- didi[order(didi, decreasing = TRUE)]
      word_name <- names(didi)
      fun_num_w <- function(ww) {
        sum(vapply(dada, function(w) ww %in% w, FUN.VALUE = logical(1)))
      }
      word_num <- vapply(word_name, fun_num_w, FUN.VALUE = numeric(1))
      word_num[order(word_num, decreasing = TRUE)]
    }

    ## how many keywords do you want to show? take 80% as an example
    ## (rounded up, so at least one keyword is kept)
    word_freq <- get_word_freq(info_n)
    info_cluster <- head(word_freq, ceiling(0.8 * length(word_freq))) %>% names()

It's still not perfect, but it gives a clearer clue for understanding what each cluster is about.
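
To label every cluster at once rather than only the 3rd one, the same idea could be wrapped in a small helper. This is only a minimal sketch in base R: `top_keywords()` and its `prop` argument are hypothetical names, not part of enrichplot, and the `cluster` vector is made-up toy data standing in for `k_means$cluster`.

```r
# Hypothetical helper (not part of enrichplot): for each word, count how many
# term descriptions contain it, then keep the top `prop` fraction of words.
top_keywords <- function(terms, prop = 0.8) {
  words <- strsplit(terms, " ", fixed = TRUE)
  # unique() per term, so a word repeated inside one term is counted once
  counts <- sort(table(unlist(lapply(words, unique))), decreasing = TRUE)
  n_keep <- ceiling(prop * length(counts))
  names(counts)[seq_len(n_keep)]
}

# Toy cluster assignment standing in for k_means$cluster (made-up data)
cluster <- c("lung cancer" = 1, "breast cancer" = 1, "skin cancer" = 1,
             "heart disease" = 2, "kidney disease" = 2)

# One keyword vector per cluster; the user chooses the proportion to show
keywords <- lapply(split(names(cluster), cluster), top_keywords, prop = 0.5)
```

Here `prop` plays the role of the 80% above: a smaller value gives a shorter, more general label, and a larger one gives a longer, more specific label.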