Mapping between vocabulary and columns in topic-word-matrix
rjsu26 opened this issue · 1 comments
- OCTIS version: 1.10.3
- Python version: 3.8.10
- Operating System: Ubuntu 20.04.3 LTS
Description
I want to take a search query from a user and, based on this query, return the top 5 topics (out of the 50 generated by running the LDA model) that best match it.
What I Did
For this task, I made an all-zero list of size len(vocabulary) (loaded from vocabulary.txt) and set the indices corresponding to the words in the search query to 1, i.e.
search_vec = [0] * len(vocabulary)
for word in query:
    if word in vocabulary:
        idx = vocabulary.index(word)
        search_vec[idx] = 1
# N-hot encoding complete
I later ran some nearest-neighbor functions using the topic-word-matrix as the original data and search_vec as my query vector. The problem, as I figured out, is that the ordering of words in my vocabulary list is not the same as the ordering used to create the columns of the topic-word-matrix.
How do I get that ordering? Is there a method that gives me the index each vocabulary word had when it was used as a column in the topic-word-matrix?
Hello,
When you train a topic model, you initialize the dataset first. This dataset has a vocabulary whose indices correspond to the columns of the topic-word-matrix. You can get it in the following way:
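from octis.dataset.dataset import Dataset

dataset = Dataset()
dataset.load_custom_dataset_from_folder("dataset_folder")  # or your preferred way to initialize the dataset
vocabulary = dataset.get_vocabulary()  # word order matches the columns of the topic-word-matrix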
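As a minimal sketch of how this vocabulary order could then be used for the search described in the question: the snippet below N-hot encodes a query against the vocabulary and ranks topics by cosine similarity. The helper names build_query_vector and top_topics are hypothetical, cosine similarity is only one possible choice of nearest-neighbor measure, and the vocabulary and matrix are toy values standing in for dataset.get_vocabulary() and the trained model's topic-word-matrix.

```python
import math


def build_query_vector(query_words, vocabulary):
    """N-hot encode the query using the same word order as the matrix columns."""
    # A dict lookup avoids the repeated linear scans of list.index().
    word_to_idx = {word: i for i, word in enumerate(vocabulary)}
    vec = [0.0] * len(vocabulary)
    for word in query_words:
        idx = word_to_idx.get(word)
        if idx is not None:  # words outside the vocabulary are ignored
            vec[idx] = 1.0
    return vec


def top_topics(query_vec, topic_word_matrix, k=5):
    """Return the indices of the k topics most similar to the query (cosine)."""
    q_norm = math.sqrt(sum(x * x for x in query_vec))
    scores = []
    for topic_idx, row in enumerate(topic_word_matrix):
        r_norm = math.sqrt(sum(x * x for x in row))
        if q_norm == 0 or r_norm == 0:
            sim = 0.0  # empty query or empty topic row
        else:
            dot = sum(q * r for q, r in zip(query_vec, row))
            sim = dot / (q_norm * r_norm)
        scores.append((sim, topic_idx))
    # Highest similarity first; ties broken by lower topic index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [topic_idx for _, topic_idx in scores[:k]]


# Toy stand-ins (assumptions): in practice these would come from
# dataset.get_vocabulary() and the trained model's topic-word-matrix.
vocabulary = ["cat", "dog", "stock", "market"]
topic_word_matrix = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0: pets
    [0.05, 0.05, 0.45, 0.45],  # topic 1: finance
]

query_vec = build_query_vector(["stock", "price"], vocabulary)
best = top_topics(query_vec, topic_word_matrix, k=1)
print(best)
```

With 50 topics, calling top_topics(query_vec, topic_word_matrix, k=5) would give the five closest topics under the same assumptions.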
Hope this helped. Thanks for your patience,
Silvia