datquocnguyen/LFTM

Get the probability that a document belongs to a topic

afergadis opened this issue · 2 comments

Suppose I have a training set of tweets, such as test/corpus.txt. It's straightforward to create the topic clusters from it.
Now I have a test set (in one file), and I want to get the probability that each tweet (each line of the file) belongs to each of the topic clusters found in the first step.

Example: from testLFLDA.topWords you have:

Topic0: iphone great siri ios time awesome amazing day loving yeah shows pretty store year love job million macbook phone mango

Topic1: android nexus cream ice sandwich ics samsung phone search good galaxy nice works iphone smart mango screen windows awesome beautiful

Topic2: facebook love free retweets users world application ios work blackberry technology today feel power mac show fucking impressive email working

Topic3: windows good people lol facebook bookcase haven back sleep agree social great man shit ipad text wow happy store cloud

If I have a tweet "I enjoy using siri in my iphone", I would expect a result such as [0.5, 0.1, 0.3, 0.1], where each value is the probability for Topic0, Topic1, etc.

I don't have any gold labels and I don't need any labels. Is that possible? If yes, how?

Hi,
I just updated the code to support topic inference on an unseen corpus.
Now you can train a topic model on training data and then infer the topic distribution of unseen/new test data.
If there is any issue with the code, please let me know.
Thanks.
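
For readers who want to see the general idea of inferring topics for unseen text, here is a minimal sketch in Python. It is not the LFTM implementation: it assumes a plain "folding-in" EM procedure in which the topic-word probabilities p(w|z) stay fixed and only the topic proportions of the new document are estimated. The function name `infer_topic_proportions`, the `floor` value for unseen words, and the toy probabilities are all made up for illustration.

```python
from collections import Counter


def infer_topic_proportions(tokens, topic_word, n_iter=50, floor=1e-6):
    """Estimate p(topic | document) for one unseen document by EM folding-in.

    topic_word[k][w] is p(w | topic k) and is kept fixed throughout.
    Words missing from a topic get the small `floor` probability
    (an assumption of this sketch, not something taken from LFTM).
    """
    n_topics = len(topic_word)
    counts = Counter(tokens)
    total = sum(counts.values())
    theta = [1.0 / n_topics] * n_topics  # start from a uniform guess
    if total == 0:
        return theta  # empty document: nothing to update
    for _ in range(n_iter):
        expected = [0.0] * n_topics
        for word, n in counts.items():
            # E-step: how responsible is each topic for this word?
            weights = [theta[k] * topic_word[k].get(word, floor)
                       for k in range(n_topics)]
            norm = sum(weights)
            for k in range(n_topics):
                expected[k] += n * weights[k] / norm
        # M-step: turn expected topic counts into proportions
        theta = [v / total for v in expected]
    return theta


# Toy, hand-written probabilities (NOT output of a trained model).
toy_topic_word = [
    {"iphone": 0.4, "siri": 0.3, "ios": 0.2, "phone": 0.1},
    {"android": 0.5, "nexus": 0.3, "phone": 0.1, "iphone": 0.1},
]
tweet = "i enjoy using siri in my iphone".split()
theta = infer_topic_proportions(tweet, toy_topic_word)
print(theta)  # one value per topic, summing to 1
```

Each line of a test file could be passed through such a function to get one probability vector per tweet. With a real model, the fixed p(w|z) values would come from training rather than from a hand-written table.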

In the code, I can see that the topic probability p(z) is recalculated for a new topic. I just wanted to know whether that is necessary. Can't p(z) be initialised from the training corpus, as is done for the topic-word distributions p(w|z)?
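
One way to frame this question is to write out a standard LDA-style Gibbs update. This is a general sketch under that assumption, not a transcription of the LFTM code. Here p(w|t) is the topic-word probability carried over from training, N_{d,¬i}^{t} is the number of tokens in the new document d currently assigned to topic t (excluding position i), N_d is the length of d, T is the number of topics, and α is the Dirichlet hyperparameter:

```latex
P(z_{d,i} = t \mid \mathbf{z}_{d,\neg i}, w_{d,i})
  \;\propto\; \bigl(N_{d,\neg i}^{t} + \alpha\bigr)\, p(w_{d,i} \mid t),
\qquad
\hat{\theta}_{d,t} = \frac{N_{d}^{t} + \alpha}{N_{d} + T\alpha}
```

In this formulation the topic-word term is shared across documents and can be held fixed, while the counts N_d^t belong to the new document alone and have no counterpart in the training corpus. Corpus-level topic frequencies from training could at most serve as a starting point or prior for those counts.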