Mapping between vocabulary and columns in topic-word-matrix

Question

rjsu26 opened this issue 2 years ago · 1 comments

OCTIS version: 1.10.3
Python version: 3.8.10
Operating System: Ubuntu 20.04.3 LTS

Description

I want to take a search query from a user and, based on that query, return the top 5 topics (out of the 50 generated by the LDA model) that best match it.

What I Did

For this task, I made an all-zero list with one entry per word in vocabulary.txt and set the entries at the indices of the query words to 1, i.e.:

search_vec = [0] * len(vocabulary)
for word in query:
    if word in vocabulary:
        idx = vocabulary.index(word)
        search_vec[idx] = 1
# N-hot encoding complete

I then ran some nearest-neighbor searches, using the topic-word-matrix as the data and search_vec as the query vector. The problem, as I found out, is that the word order in my vocabulary list is not the same as the column order used to create the topic-word-matrix.

How do I get that ordering? Is there a method that gives me, for each vocabulary word, the index of its column in the topic-word-matrix?

Answer 1 · 2022-11-01T09:45:40.000Z

Hello,
when you train a topic model, you first initialize the dataset. This dataset has a vocabulary whose indices correspond to the columns of the topic-word-matrix. You can get it in the following way:

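from octis.dataset.dataset import Dataset

dataset = Dataset()
dataset.load_custom_dataset_from_folder("dataset_folder")  # or your preferred way to initialize the dataset
vocabulary = dataset.get_vocabulary()  # index i corresponds to column i of the topic-word-matrix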
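With the vocabulary in the right order, the query-to-topic matching from the question could be sketched roughly as below. This is only an illustration, not part of OCTIS: the function name top_topics_for_query, the choice of cosine similarity, and the toy vocabulary and matrix are all assumptions made for the example. In practice the vocabulary would come from dataset.get_vocabulary() and the matrix from the trained model's output.

```python
import math

def top_topics_for_query(query_words, vocabulary, topic_word_matrix, k=5):
    """Rank topics by cosine similarity to an N-hot query vector (hypothetical helper)."""
    # Assumes vocabulary[i] labels column i of topic_word_matrix.
    word_to_idx = {word: i for i, word in enumerate(vocabulary)}
    # Columns set to 1 in the N-hot query vector; unknown words are skipped.
    cols = sorted({word_to_idx[w] for w in query_words if w in word_to_idx})
    if not cols:
        return []  # no query word appears in the vocabulary
    query_norm = math.sqrt(len(cols))
    scores = []
    for topic_id, row in enumerate(topic_word_matrix):
        dot = sum(row[c] for c in cols)  # dot product with the N-hot vector
        row_norm = math.sqrt(sum(x * x for x in row))
        score = dot / (row_norm * query_norm) if row_norm > 0 else 0.0
        scores.append((score, topic_id))
    scores.sort(key=lambda pair: -pair[0])  # highest similarity first
    return [topic_id for _, topic_id in scores[:k]]

# Toy data standing in for the real vocabulary and topic-word-matrix.
vocabulary = ["cat", "dog", "stock", "market"]
topic_word_matrix = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0: mostly pet words
    [0.05, 0.05, 0.45, 0.45],  # topic 1: mostly finance words
    [0.25, 0.25, 0.25, 0.25],  # topic 2: uniform
]
result = top_topics_for_query(["stock", "market", "unknown"], vocabulary, topic_word_matrix, k=2)
print(result)  # [1, 2]
```

The dictionary lookup replaces repeated vocabulary.index(word) calls, which matters once the vocabulary has thousands of words. Any other similarity measure or nearest-neighbor routine can be swapped in, as long as the query vector uses the same column order as the matrix.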

Hope this helped. Thanks for your patience,

Silvia