Get the probability that a document belongs to a topic
afergadis opened this issue · 2 comments
Suppose I have a training set of tweets, such as test/corpus.txt. It's straightforward to create the topic clusters.
Now I have a test set (in one file), and I want the probability that each tweet (each line of the file) belongs to each of the topic clusters found in the first step.
Example: From the testLFLDA.topWords you have:
Topic0: iphone great siri ios time awesome amazing day loving yeah shows pretty store year love job million macbook phone mango
Topic1: android nexus cream ice sandwich ics samsung phone search good galaxy nice works iphone smart mango screen windows awesome beautiful
Topic2: facebook love free retweets users world application ios work blackberry technology today feel power mac show fucking impressive email working
Topic3: windows good people lol facebook bookcase haven back sleep agree social great man shit ipad text wow happy store cloud
If I have the tweet "I enjoy using siri in my iphone", I would expect a result such as [0.5, 0.1, 0.3, 0.1], where each value is the probability for Topic0, Topic1, etc.
I don't have any gold labels and I don't need any labels. Is that possible? If yes, how?
Hi,
I just updated the code to support topic inference on an unseen corpus.
Now you can train a topic model on training data and then infer the topic distribution of unseen/new test data.
If there is any issue with the code, please let me know.
Thanks.
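For readers who want to see the general idea of inferring a topic distribution for an unseen document, here is a minimal sketch. It is not the code in this repository: it assumes the trained topic-word distributions p(w|z) are available as a NumPy array, and it uses a generic EM "fold-in" step that keeps p(w|z) fixed. The vocabulary, the numbers in `phi`, and the function name `infer_topic_distribution` are all made up for illustration.

```python
import numpy as np

def infer_topic_distribution(doc_words, vocab, phi, n_iter=50, alpha=0.1):
    """Estimate p(z|d) for one new document while keeping p(w|z) fixed.

    doc_words: list of tokens in the new document
    vocab:     dict mapping word -> column index in phi (hypothetical)
    phi:       array of shape (K, V); row k is p(w | z=k), all entries > 0
    alpha:     smoothing added to the expected topic counts (assumption)
    """
    ids = [vocab[w] for w in doc_words if w in vocab]  # skip unknown words
    n_topics = phi.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)          # start from uniform
    if not ids:
        return theta                                   # nothing to infer from
    for _ in range(n_iter):
        # E-step: responsibility of each topic for each token
        resp = theta[:, None] * phi[:, ids]
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: smoothed expected topic counts, renormalised
        theta = resp.sum(axis=1) + alpha
        theta /= theta.sum()
    return theta

# Toy topic-word distributions (illustrative numbers, not model output).
words = ["iphone", "siri", "android", "nexus", "facebook", "windows"]
vocab = {w: i for i, w in enumerate(words)}
phi = np.array([
    [0.450, 0.450, 0.025, 0.025, 0.025, 0.025],  # Topic0
    [0.100, 0.025, 0.400, 0.400, 0.025, 0.050],  # Topic1
    [0.050, 0.025, 0.025, 0.025, 0.850, 0.025],  # Topic2
    [0.025, 0.025, 0.025, 0.025, 0.200, 0.700],  # Topic3
])

tweet = "I enjoy using siri in my iphone".split()
theta = infer_topic_distribution(tweet, vocab, phi)
print(theta)  # one probability per topic
```

Running this once per line of the test file would give one distribution per tweet. Words that are not in the training vocabulary are simply ignored in this sketch.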
In the code, I can see that the topic probability p(z) is recalculated for a new topic. I just wanted to know whether that is necessary. Can't p(z) be initialised from the training corpus, as is done for the topic-word distributions p(w|z)?
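To make the suggestion concrete, here is a small sketch of what reusing p(z) from training could look like. It assumes a simple one-topic-per-document mixture, where p(z|d) is proportional to p(z) times the product of p(w|z) over the words in d. That model choice, the function name `posterior_with_fixed_prior`, and all the numbers are assumptions for illustration, not a description of what the repository does.

```python
import numpy as np

def posterior_with_fixed_prior(word_ids, phi, p_z):
    """p(z|d) proportional to p(z) * prod_w p(w|z), with p(z) and p(w|z) fixed."""
    # Work in log space to avoid underflow on longer documents.
    log_post = np.log(p_z) + np.log(phi[:, word_ids]).sum(axis=1)
    log_post -= log_post.max()        # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Toy p(w|z) for 2 topics over a 3-word vocabulary (illustrative numbers).
phi_toy = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
])

# Hypothetical number of training documents assigned to each topic.
train_topic_counts = np.array([30.0, 10.0])
p_z_train = train_topic_counts / train_topic_counts.sum()

doc = [0, 0]  # a test document containing word 0 twice
post_uniform = posterior_with_fixed_prior(doc, phi_toy, np.array([0.5, 0.5]))
post_trained = posterior_with_fixed_prior(doc, phi_toy, p_z_train)
print(post_uniform, post_trained)
```

With a prior taken from training, the posterior shifts toward topics that were common in the training corpus. Whether that is desirable depends on whether the test data is expected to have the same topic proportions as the training data.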