slycoder/R-lda

lda.collapsed.gibbs.sampler on new data

Closed this issue · 3 comments

Hi Jonathan (and any other githuber passing by),
I am trying to use lda.collapsed.gibbs.sampler to find which pre-defined topics are associated with new documents.

library(tm)
library(lda)
library(topicmodels)
# data: character vector of 500 elements
docs <- Corpus(VectorSource(data))
# preprocess docs with tm_map (toSpace -’‘•”“, removePunctuation, removeNumbers, stripWhitespace, tolower, stemDocument, removeWords [stopwords])
dtm <- DocumentTermMatrix(docs, control = list(wordLengths = c(4, 50),
                                               bounds = list(global = c(3, length(docs) - 3))))
# found the optimal number of topics k using FindTopicsNumber from the ldatuning package
k <- 150
text <- dtm2ldaformat(dtm, omit_empty = FALSE)
ldaDocs <- lda.collapsed.gibbs.sampler(text$documents,
                                       k,
                                       text$vocab,
                                       1000,
                                       alpha = 50 / k,
                                       eta = 200 / ncol(dtm))

# data2: character vector of 50 elements
docs2 <- Corpus(VectorSource(data2))
# same preprocessing
dtm2 <- DocumentTermMatrix(docs2, control = list(wordLengths = c(4, 50),
                                                 bounds = list(global = c(3, length(docs2) - 3))))
text2 <- dtm2ldaformat(dtm2, omit_empty = FALSE)
ldaDocs2 <- lda.collapsed.gibbs.sampler(text2$documents,
                                        k,
                                        text2$vocab,
                                        1000,
                                        alpha = 50 / k,
                                        eta = 200 / ncol(dtm2),
                                        freeze.topics = TRUE,
                                        initial = list(topics = ldaDocs$topics,
                                                       topic_sums = ldaDocs$topic_sums))

However, because my two corpora do not have the same number of documents or the same vocabulary, I get the following error:

Error in structure(.Call("collapsedGibbsSampler", documents, as.integer(K), : Initial topics (150 x 8937) must be a 150 x 388 integer matrix.

How am I supposed to do this?
Thank you

So you have a few options.

  1. When doing the initial training, include both training and test words in your vocabulary, i.e., create a unioned vocabulary and then change all your document indices to be consistent across both training and test corpora.

  2. Remove words that were not present during training from your test corpora. Again, make sure your indices are consistent.

  3. Keep the test words, but pad your trained topics with zeros (or some other smoothing amount); see the sketch after this list. In this case, if you unfreeze the topics, you could arguably get improved performance by learning something about the unseen words.
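
For option 3, a minimal sketch (reusing text, text2, ldaDocs, and k from your code; unionVocab and paddedTopics are illustrative names) might look like this:

unionVocab <- union(text$vocab, text2$vocab)
paddedTopics <- matrix(0L, nrow = k, ncol = length(unionVocab),
                       dimnames = list(NULL, unionVocab))
# copy the learned counts into the matching columns by name;
# test-only words keep a count of zero
paddedTopics[, text$vocab] <- ldaDocs$topics
storage.mode(paddedTopics) <- "integer"  # the sampler expects an integer matrix
# note: text2$documents must also be re-indexed against unionVocab before
# calling the sampler with initial = list(topics = paddedTopics,
#                                         topic_sums = ldaDocs$topic_sums)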

I think in practice most people do (1), but it depends on your application. For your code, the easiest way to do that (which would also keep the indices consistent) would be to concatenate the corpora (e.g. docs and docs2), do all the preprocessing once on that combined set, and then, in the calls to lda.collapsed.gibbs.sampler, pass in the training rows (something like head(text$documents, NUMBER_OF_TRAINING_DOCUMENTS)) and the test rows (something like tail(text$documents, NUMBER_OF_TESTING_DOCUMENTS)), as sketched below.
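
A minimal sketch of that recipe against the code in your question (nTrain, nTest, docsAll, dtmAll, textAll, fit, and inferred are illustrative names; the preprocessing step is elided):

nTrain <- length(data)     # 500 training documents
nTest  <- length(data2)    # 50 test documents
docsAll <- Corpus(VectorSource(c(data, data2)))
# ... apply the same tm_map preprocessing as above to docsAll ...
dtmAll <- DocumentTermMatrix(docsAll, control = list(wordLengths = c(4, 50)))
textAll <- dtm2ldaformat(dtmAll, omit_empty = FALSE)

# train on the first nTrain documents only
fit <- lda.collapsed.gibbs.sampler(head(textAll$documents, nTrain),
                                   k,
                                   textAll$vocab,
                                   1000,
                                   alpha = 50 / k,
                                   eta = 200 / ncol(dtmAll))

# infer on the last nTest documents with frozen topics; the shared
# vocabulary makes fit$topics exactly the size the sampler expects
inferred <- lda.collapsed.gibbs.sampler(tail(textAll$documents, nTest),
                                        k,
                                        textAll$vocab,
                                        1000,
                                        alpha = 50 / k,
                                        eta = 200 / ncol(dtmAll),
                                        initial = list(topics = fit$topics,
                                                       topic_sums = fit$topic_sums),
                                        freeze.topics = TRUE)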

Hope that helps.

I followed these instructions and was able to create the phi and theta matrices. However, the theta matrix shows only 3 documents (whereas I need the document-to-topic probability distribution as part of the inference step on new documents). Additionally, I found no difference between the phi and theta matrices obtained before and after the testing (inference) phase. Here is my code:
library(lda)
library(tm)
#set working directory (modify path as needed)
setwd("D:\Implementations\source")

#load files into corpus
#get listing of .txt files in directory
filenames <- list.files(getwd(),pattern="*.txt")

#read files into a character vector
files <- lapply(filenames,readLines)

#create corpus from vector
CorpusObj <- Corpus(VectorSource(files))

CorpusObj <- tm_map(CorpusObj, content_transformer(tolower)) # convert all text to lower case
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
CorpusObj <- tm_map(CorpusObj, toSpace, "[^[:alnum:] ]")
CorpusObj <- tm_map(CorpusObj, removePunctuation)
CorpusObj <- tm_map(CorpusObj, removeNumbers)
CorpusObj <- tm_map(CorpusObj, removeWords, stopwords("english"))
CorpusObj <- tm_map(CorpusObj, stemDocument, language = "english") ## stem the words
CorpusObj <- tm_map(CorpusObj, stripWhitespace)
myStopwords <- c("a", "about", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "computer", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "shall", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "abstract", "continue", "for", "new", "switch", "assert", "default", "goto", "package", "synchronized", "boolean", "do", "if", "private", "this", "break", "double", "implements", "protected", "throw", "byte", "else", "import", "public", "throws", "case", "enum", "instanceof", "return", "transient", "catch", "extends", "int", "short", "try", "char", "final", "interface", "static", "void", "class", "finally", "long", "strictfp", "volatile", "const", "float", "native", "super", "while", "org", "eclipse", "swt", "string", "main", "args", "null", "this", "extends", "true", "false")
CorpusObj <- tm_map(CorpusObj, removeWords, myStopwords)
##create a term document matrix
#CorpusObj.tdm <- TermDocumentMatrix(CorpusObj, control = list(minWordLength = 3))

corpusLDA <- lexicalize(CorpusObj)

ldaModel <- lda.collapsed.gibbs.sampler(corpusLDA$documents, K = 10,
                                        vocab = corpusLDA$vocab,
                                        burnin = 9999, num.iterations = 1000,
                                        alpha = 1, eta = 0.1)
ldaRes <- lda.collapsed.gibbs.sampler(tail(corpusLDA$documents, 150), K = 10,
                                      initial = head(list(topics = ldaModel$topics,
                                                          topic_sums = ldaModel$topic_sums), 6413),
                                      vocab = corpusLDA$vocab,
                                      burnin = 9999, num.iterations = 1000,
                                      alpha = 1, eta = 0.1, freeze.topics = TRUE)
#save(ldaModel, file = "D:/Implementations/source/ldaModel.saved")
#top.words <- top.topic.words(ldaModel$topics, 5, by.score = TRUE)
#print(top.words)
theta <- t(apply(ldaModel$document_sums + 1, 2, function(x) x / sum(x)))   # 1 is alpha
phi <- t(apply(t(ldaModel$topics) + 0.1, 2, function(x) x / sum(x)))       # 0.1 is eta
thetaRes <- t(apply(ldaRes$document_sums + 1, 2, function(x) x / sum(x)))  # 1 is alpha
phiRes <- t(apply(t(ldaRes$topics) + 0.1, 2, function(x) x / sum(x)))      # 0.1 is eta

Am I making a mistake in how I pass the training and testing rows? (My training rows are the first 6413 and my testing rows are the 150 after that.)

@razuiit

  • I'm not sure why you are taking the head of a list, as that's a no-op in this case.
  • In your training set, you should probably exclude the test documents, i.e. take head(corpusLDA$documents, 6413). Because you included the test documents in your training run (ldaModel), theta contains rows which overlap with the test run (ldaRes). See the sketch after this list.
  • Phi should be the same in both models (since the test run should not modify phi).
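
A minimal sketch of the corrected calls (assuming, as you say, that the first 6413 documents are training and the last 150 are test; nTrain and nTest are illustrative names):

nTrain <- 6413
nTest  <- 150
# train on the training rows only
ldaModel <- lda.collapsed.gibbs.sampler(head(corpusLDA$documents, nTrain),
                                        K = 10, vocab = corpusLDA$vocab,
                                        burnin = 9999, num.iterations = 1000,
                                        alpha = 1, eta = 0.1)
# infer on the test rows with frozen topics; pass the initial list directly
# (wrapping it in head() was the no-op mentioned above)
ldaRes <- lda.collapsed.gibbs.sampler(tail(corpusLDA$documents, nTest),
                                      K = 10, vocab = corpusLDA$vocab,
                                      burnin = 9999, num.iterations = 1000,
                                      alpha = 1, eta = 0.1,
                                      initial = list(topics = ldaModel$topics,
                                                     topic_sums = ldaModel$topic_sums),
                                      freeze.topics = TRUE)
# ldaRes$document_sums should now be 10 x 150 (one column per test document)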

I didn't quite understand what you meant by theta being 3 documents. Do you mean that ldaRes$document_sums is a matrix of size 10x3?