gregversteeg/corex_topic

How can we test the model on new data?

Suhaib441 opened this issue · 11 comments

Hello, thank you for this tutorial. I want to build an anchored model for text classification of sentences (I have 5 classes), so I trained an anchored model with 5 topics. How can I test the model on new sentences? There is a "predict" attribute, but I get an error.

Hello,

Would you be able to give an example of the error that you're getting?

If you have a topic model tm trained on a document-term matrix X, and you have a new document-term matrix X2, then you should be able to do either tm.predict(X2) or tm.transform(X2). X2 should be in the same document-term matrix format as the original input X, with exactly the same number of columns, corresponding to the same terms.

If you do tm.predict(X2, details=False), then it will return one output, labels, which is a document x topic matrix of 0s and 1s indicating which topics are present for each document. If you do tm.predict(X2, details=True), then you'll get two outputs, p_y_given_x and log_z. p_y_given_x is a document x topic matrix that gives the probability a topic is present in a document given its terms, and log_z is a document x topic matrix that measures how "surprising" each topic is given the terms of a document.
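
If it helps, here is a minimal sketch of that workflow. The variable names docs_train and docs_new are placeholders, and the vectorizer/model settings are just illustrative:

import scipy.sparse as ss
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

# Fit the vectorizer on the training documents only
vectorizer = CountVectorizer(stop_words='english', binary=True)
X = ss.csr_matrix(vectorizer.fit_transform(docs_train))
words = list(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn

# Train the topic model (add anchors=... / anchor_strength=... in fit() for an anchored model)
tm = ct.Corex(n_hidden=5, seed=1)
tm.fit(X, words=words)

# New documents must go through the SAME fitted vectorizer (transform, not fit_transform),
# so that X2 has exactly the same columns as X
X2 = ss.csr_matrix(vectorizer.transform(docs_new))

labels = tm.predict(X2, details=False)              # documents x topics, 0s and 1s
p_y_given_x, log_z = tm.predict(X2, details=True)   # probabilities and "surprise"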

If those explanations don't fix things, then we should be able to help more if we have an example of exactly what error you're getting.

@ryanjgallagher thanks for the informative answer here. I'm wondering if you could explain a bit more what the term 'surprising' means here. Is it a 'surprise' because, given the terms in the document, the correlation is higher or lower than what we would have expected after training?

People use "surprise" in information theory to refer to log probability ratios. If p(y|x) = 2 p(y), we'd say "y is twice as likely give x as the baseline probability for y" and then log_2 p(y|x)/p(y) = 1 (bit) of surprise. The average of surprise is mutual information (between y and x).
The case here is a little bit more complicated... I really had to think about it to remember why we refer to it that way. It's also true in this case that the average is a (conditional) (multivariate) mutual information. It's the multivariate mutual information / total correlation of the words, conditioned on a specific latent factor. Each latent factor predicts certain correlations (e.g., factor 1 says that "cat" and "dog" are either present together or absent together). So if cat and dog appear together in a document and latent factor 1 strongly predicted this, then we have a positive "surprise" that these occur with higher than the background probability. However, the converse could occur. Latent factor 1 predicts cat and dog appear together but in this document only cat, but not dog, appears. In this case we can get negative surprise. The latent factor predicted a correlation which didn't actually appear in the document.

Sorry, it's not a super-straightforward measure to interpret. I can discuss a longer example if you'd like.
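
As a toy numerical version of the log-ratio reading of "surprise" (just the intuition above, not the library's internal computation):

import numpy as np

p_y = 0.25           # baseline probability that the latent factor is "on"
p_y_given_x = 0.5    # probability after seeing the document's words

print(np.log2(p_y_given_x / p_y))   # 1.0 bit: "twice as likely given x"

# If the factor predicted a correlation (cat AND dog together) that the document
# violates (cat without dog), p(y|x) can fall below the baseline and the
# surprise goes negative:
print(np.log2(0.1 / p_y))           # about -1.32 bits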

@gregversteeg thanks so much for the clarification. I think most of this makes sense to me. One follow-up question that I'm hoping you can shed some light on: what is the semantic difference in results when sorting by p_y_given_x vs. getting the highest log_z? If I'm trying to get the documents which are most representative of a latent factor/topic, will I see huge differences between these two methods, or is there a fundamental misunderstanding on my end here?

I think sorting by p_y_given_x is what you want. There is a small failure mode there... the latent factor Y_j can be zero or one. Generally "1" is labeled as the less likely class, with the intuition that most topics occur with probability less than one half. But you never know! For instance, corex_topic likes to group function words "the, and, but..." into their own factor, and this one might occur frequently.

Ryan might be able to comment on what sorting by log_z does, as he has experimented with that a bit more. I would generally expect the highest log_z documents to be similar to the highest p_y_given_x ones: they would contain many of the correlated words that appear in that topic. However, the meaning as you go down the list is less clear. Imagine the cat/dog example from above. If a document contains cat but not dog, you might get p(y=1|x) = 1/2, but log_z could actually be negative (if cat and dog always appear together in the training data). Then for a document with only unrelated words (neither cat nor dog), p(y=1|x) = 0, but log_z would be near zero, since no words related to this factor appear.

Sorting on log_z and understanding more precisely how it changes the rankings of the documents was something that we wanted to get around to but haven't had the resources to do it yet.

I would recommend sorting on log_p_y_given_x and not log_z. I think log_z has the potential to be useful, but it's difficult to interpret properly if you aren't really on top of your information theory. I've used it to rank documents on some small examples and it can give counterintuitive results that are confusing if you don't understand it really well. I think log_p_y_given_x is easier to interpret and easier to explain to others.
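
As a sketch, ranking documents for a single topic by p_y_given_x might look like this (assuming tm is a fitted CorEx model and X2 is a document-term matrix with the same columns as the training matrix; the topic index and cutoff are arbitrary):

import numpy as np

p_y_given_x, log_z = tm.predict(X2, details=True)

topic_idx = 0
ranking = np.argsort(p_y_given_x[:, topic_idx])[::-1]   # most to least representative
top_docs = ranking[:10]                                  # 10 most representative documents
print(top_docs)
print(p_y_given_x[top_docs, topic_idx])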

Hi Ryan, the error that I get while trying to use predict is a "dimension mismatch" error. I have used the same vectorizer on the train dataset and the test dataset, and I have followed your example to get the matrix:

from sklearn.feature_extraction.text import CountVectorizer
import scipy.sparse as ss

vectorizer = CountVectorizer(stop_words='english', binary=True)
doc_word = vectorizer.fit_transform(docs)
doc_word = ss.csr_matrix(doc_word)

and for test data, I did

with open('textdata.txt', 'r') as file:
    text_data = file.read().replace('\n', ' ')
doc_word1 = vectorizer.transform([text_data])
doc_word1 = ss.csr_matrix(doc_word1)

anchored_topic_model.predict(doc_word1)

The error I get is

    518 
    519             result = self._mul_multivector(np.asarray(other))

ValueError: dimension mismatch

Hi @sidgitind, sorry for the slow response. I'm not quite sure where the error might be coming from because I don't think the line result = self._mul_multivector(np.asarray(other)) is in the CorEx code. Please let me know if I'm mistaken and I'll open this back up.

If it's an issue with the dimensions using the CountVectorizer, then you might be better off asking on stackexchange. I don't actually use it that much for my own work, but it was helpful for the quick notebook example.

Did you ever figure this out, @sidgitind? I'm having the same error.

Never mind, got it: you have to re-use the same vectorizer fitted on the original data (so it has the same dictionary).
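
For anyone else who hits this, a minimal sketch of that fix (train_docs and anchored_topic_model stand in for your own objects): fit the vectorizer once on the training documents and only call transform on new text, so the test matrix has the same columns as the training matrix.

import scipy.sparse as ss
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vectorizer once, on the training documents
vectorizer = CountVectorizer(stop_words='english', binary=True)
doc_word = ss.csr_matrix(vectorizer.fit_transform(train_docs))

# ... train anchored_topic_model on doc_word ...

# Re-use the SAME fitted vectorizer for new text (transform, not fit_transform)
with open('textdata.txt', 'r') as f:
    text_data = f.read().replace('\n', ' ')

doc_word1 = ss.csr_matrix(vectorizer.transform([text_data]))
labels = anchored_topic_model.predict(doc_word1)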