Discrepancy in topic content when summarizing and visualizing with LDAvis
leungi opened this issue · 1 comment
Apologies for not providing a reprex (due to size), but the code below uses an example from the textmineR package, so it should be reproducible.
Issue: when reviewing model$summary for, say, topic 1 (t_1), its content doesn't seem to match the topic marked t_1 in the LDAvis plot.
I believe the definitions of phi = P(token|topic) and theta = P(topic|document) are the same across textmineR and LDAvis, so I'd expect similar topic/word clusters.
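As a quick sanity check of those definitions (a minimal sketch, to be run after fitting the model in the script below), each row of phi and each row of theta should sum to 1:

stopifnot(all(abs(rowSums(model$phi) - 1) < 1e-8))   # each topic is a distribution over tokens
stopifnot(all(abs(rowSums(model$theta) - 1) < 1e-8)) # each document is a distribution over topics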
Note that this issue was originally posted against textmineR (TommyJones/textmineR#72), and its author suggested that the cause may lie with LDAvis.
library(textmineR)
# load nih_sample data set from textmineR
data(nih_sample)
# create a document term matrix
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT, # character vector of documents
                 doc_names = nih_sample$APPLICATION_ID, # document names
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # English stopwords from the stopwords package
                                  stopwords::stopwords(source = "smart")), # this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE, # numbers - this is the default
                 verbose = FALSE, # turn off status bar for this demo
                 cpus = 2) # default is all available cpus on the system
dtm <- dtm[, colSums(dtm) > 2] # keep terms appearing more than twice
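The pruning step above keeps only terms that appear more than twice in the corpus; an optional peek at what survives (not in the original post):

dim(dtm) # documents x retained vocabulary terms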
set.seed(12345)
model <- FitLdaModel(dtm = dtm,
                     k = 20,
                     iterations = 200, # I usually recommend at least 500 iterations or more
                     burnin = 180,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 2)
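Since calc_likelihood = TRUE, convergence can be eyeballed before comparing topics; model$log_likelihood is a data frame of iteration versus log likelihood (a minimal sketch, not in the original post):

plot(model$log_likelihood$iteration, model$log_likelihood$log_likelihood,
     type = "l", xlab = "iteration", ylab = "log likelihood")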
model$top_terms <- GetTopTerms(phi = model$phi, M = 10)
# Get the prevalence of each topic
# You can make this discrete by applying a threshold, say 0.05, for
# topics in/out of documents.
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100
# textmineR has a naive topic labeling tool based on probable bigrams
model$labels <- LabelTopics(assignments = model$theta > 0.05,
                            dtm = dtm,
                            M = 1)
model$summary <- data.frame(topic = rownames(model$phi),
                            label = model$labels,
                            coherence = round(model$coherence, 3),
                            prevalence = round(model$prevalence, 3),
                            top_terms = apply(model$top_terms, 2, function(x){
                              paste(x, collapse = ", ")
                            }),
                            stringsAsFactors = FALSE)
model$summary[order(model$summary$prevalence, decreasing = TRUE), ][1:10, ]
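To make the comparison with the plot concrete, the row for the topic in question can be pulled out directly, e.g. for t_1 (a hypothetical one-liner, not in the original post; textmineR names topics t_1 through t_k):

model$summary[model$summary$topic == "t_1", ]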
# summary of document lengths
doc_lengths <- rowSums(dtm)
# get counts of tokens across the corpus
tf_mat <- TermDocFreq(dtm = dtm)
tf_mat
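One classic source of mismatched visualizations is misalignment between vocab / term.frequency and the columns of phi, since createJSON assumes vocab is in the same order as the columns of phi. A minimal check (assuming TermDocFreq returns terms in dtm column order):

stopifnot(identical(tf_mat$term, colnames(model$phi)))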
library(LDAvis)
# create the JSON object to feed the visualization:
json <- createJSON(
  phi = model$phi,
  theta = model$theta,
  doc.length = doc_lengths,
  vocab = tf_mat$term,
  term.frequency = tf_mat$term_freq
)
serVis(json, open.browser = TRUE)
Having played with @leungi's example, it looks like the row index of the phi matrix is shuffled in LDAvis relative to the row order of model$phi that is fed into the JSON.
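For what it's worth, LDAvis reorders topics by decreasing frequency before plotting, and createJSON stores that permutation in the JSON under topic.order, so the plot's topic numbers can be mapped back to rows of model$phi (a sketch using jsonlite, not part of the original thread):

library(jsonlite)
topic_order <- fromJSON(json)$topic.order
# topic_order[k] is the row of model$phi displayed as topic k in the LDAvis plot
rownames(model$phi)[topic_order]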