datquocnguyen/LFTM

word2id Vocabulary

yxtay opened this issue · 3 comments

yxtay commented

Thank you for sharing your work.

Is there any way to get the word2id or id2word for the vocabulary? In jLDADMM, you had the script write out a .vocabulary file. However, there is no corresponding output for this project.

I need the word2id as I am using the topic-word distribution to do some custom keyword scoring. Without the right vocabulary order, I have no idea which word each column in the matrix refers to.
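For illustration, here is a minimal sketch of why the mapping matters for this kind of scoring. It assumes the word ids are contiguous integers from 0 to V-1 and that column j of a topic row holds the probability of the word with id j. The class `TopWords` and its methods are hypothetical helpers, not part of LFTM.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of LFTM.
class TopWords {
    // Invert word2id into an id-indexed array so that column j maps to id2word[j].
    // Assumes ids are contiguous and start at 0.
    static String[] invert(Map<String, Integer> word2id) {
        String[] id2word = new String[word2id.size()];
        for (Map.Entry<String, Integer> e : word2id.entrySet()) {
            id2word[e.getValue()] = e.getKey();
        }
        return id2word;
    }

    // Return the k highest-probability words for one row of the topic-word matrix.
    static List<String> topWords(double[] topicRow, String[] id2word, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int j = 0; j < topicRow.length; j++) {
            ids.add(j);
        }
        // Sort column indices by descending probability.
        ids.sort(Comparator.comparingDouble((Integer j) -> -topicRow[j]));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(k, ids.size()); i++) {
            out.add(id2word[ids.get(i)]);
        }
        return out;
    }
}
```

With a wrong or missing id2word, the same row would silently be labelled with the wrong words, which is the problem described above.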

How would I be able to get the word2id in this case? After a quick exploration, I know it definitely does not follow the word order of the word embeddings file.

EDIT:
I realised that you have a function writeDictionary() for writing the word2id to a file, but it is not called in the write() function. I think it would be a good idea to call it there.

The same goes for the writeTopicVectors() function. I believe users will benefit from having access to those outputs. I have recompiled the jar file after changing write() to call both functions, and they give me the expected outputs.
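As a rough sketch of what such a vocabulary dump could look like, the snippet below writes one "word id" pair per line. This is an assumption about a convenient format, not the actual body of writeDictionary() in LFTM, and the class name `VocabularyWriter` and the output path are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical sketch; the real writeDictionary() in LFTM may differ.
class VocabularyWriter {
    // Write one "word id" pair per line so matrix columns can be mapped back to words.
    static void writeDictionary(Map<String, Integer> word2id, Path out) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Integer> e : word2id.entrySet()) {
                writer.write(e.getKey() + " " + e.getValue());
                writer.newLine();
            }
        }
    }
}
```

Calling a method like this from the same place that writes the topic-word matrix keeps the vocabulary file and the matrix columns in sync for each run.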

Hi, thanks for the comments. I will close this issue.

Hi @yxtay, years later I'm in the same position you were in. Thanks for showing the way out. However, I cannot easily recompile the project because I don't know Java. Could you please share your source or jar with me? Best

Hi again, I think I managed to compile a version that writes the vocabulary as well. It's here if you'd like to check it out. I have not committed it to the original repository, as I have only one day of Java coding experience so far :)