cpsievert/LDAvis

Discrepancy in topic content when summarizing and visualizing with LDAvis

leungi opened this issue · 1 comment

Apologies for not providing a minimal reprex (it would be too large), but the code below uses the example from the textmineR package, so it should be reproducible.

Issue: reviewing model$summary for, say, topic 1 (t_1), its content doesn't seem to match the t_1 marked in the LDAvis plot.

I believe the definitions of phi, P(token|topic), and theta, P(topic|document), are the same across textmineR and LDAvis, so I'd expect similar topic/word clusters from both.

Note that this issue was originally posted against textmineR (TommyJones/textmineR#72), and its author suggested that the cause may lie with LDAvis.

library(textmineR)

# load nih_sample data set from textmineR
data(nih_sample)

# create a document term matrix 
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT, # character vector of documents
                 doc_names = nih_sample$APPLICATION_ID, # document names
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # English stopwords from the stopwords package
                                  stopwords::stopwords(source = "smart")), # SMART list - this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE, # numbers - this is the default
                 verbose = FALSE, # Turn off status bar for this demo
                 cpus = 2) # default is all available cpus on the system

dtm <- dtm[,colSums(dtm) > 2]

set.seed(12345)

model <- FitLdaModel(dtm = dtm, 
                     k = 20,
                     iterations = 200, # I usually recommend at least 500 iterations
                     burnin = 180,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 2) 
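
# A quick sanity check (my addition, a hedged sketch rather than part of the
# original example): if phi is P(token|topic) and theta is P(topic|document),
# each row of phi and each row of theta should sum to 1.
stopifnot(isTRUE(all.equal(unname(rowSums(model$phi)), rep(1, nrow(model$phi)))))
stopifnot(isTRUE(all.equal(unname(rowSums(model$theta)), rep(1, nrow(model$theta)))))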

model$top_terms <- GetTopTerms(phi = model$phi, M = 10)

# Get the prevalence of each topic
# You can make this discrete by applying a threshold, say 0.05, for
# topics in/out of documents (sketched below).
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100
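
# Illustrative sketch of the discrete alternative mentioned above (my
# addition, not part of the original example): the percentage of documents
# in which each topic's theta exceeds the 0.05 threshold.
prevalence_discrete <- colMeans(model$theta > 0.05) * 100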

# textmineR has a naive topic labeling tool based on probable bigrams
model$labels <- LabelTopics(assignments = model$theta > 0.05, 
                            dtm = dtm,
                            M = 1)


model$summary <- data.frame(topic = rownames(model$phi),
                            label = model$labels,
                            coherence = round(model$coherence, 3),
                            prevalence = round(model$prevalence,3),
                            top_terms = apply(model$top_terms, 2, function(x){
                              paste(x, collapse = ", ")
                            }),
                            stringsAsFactors = FALSE)
model$summary[ order(model$summary$prevalence, decreasing = TRUE) , ][ 1:10 , ]



# summary of document lengths
doc_lengths <- rowSums(dtm)
# get counts of tokens across the corpus
tf_mat <- TermDocFreq(dtm = dtm)
tf_mat
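
# Defensive check before building the JSON (my suggestion, not in the
# original example): createJSON() assumes vocab and term.frequency are
# aligned with the columns of phi, so reorder tf_mat to match just in case.
tf_mat <- tf_mat[match(colnames(model$phi), tf_mat$term), ]
stopifnot(identical(tf_mat$term, colnames(model$phi)))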


library(LDAvis)
# create the JSON object to feed the visualization:
json <- createJSON(
  phi = model$phi,
  theta = model$theta,
  doc.length = doc_lengths,
  vocab = tf_mat$term,
  term.frequency = tf_mat$term_freq
)

serVis(json, open.browser = TRUE)

Having played with @leungi's example, it looks like the rows of the phi matrix are shuffled in LDAvis relative to the row order of the model$phi that is fed into the JSON.
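
Specifically, createJSON() re-sorts the topics by decreasing prevalence before building the visualization, and it records that permutation in the JSON's topic.order field. A minimal sketch of recovering the mapping back to model$phi (assuming the jsonlite package for parsing the JSON string):

library(jsonlite)

# topic.order[k] is the row of model$phi (textmineR's t_ index) that
# LDAvis displays as topic k
topic_order <- fromJSON(json)$topic.order

# e.g. the topic LDAvis labels "1" corresponds to this row of the summary:
model$summary[model$summary$topic == paste0("t_", topic_order[1]), ]

If your version of LDAvis exposes a reorder.topics argument to createJSON(), passing reorder.topics = FALSE should preserve the original row order (I haven't checked which released versions include it).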