Implement/Find out how to use libraries for semantic word embeddings
Closed this issue · 3 comments
ThaiJamesLee commented
- GloVe
- word2vec
- transform the query and the document into embedded word vectors
- use the average embedding vector as the representation
- use pickle to cache the vectors
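The averaging-plus-caching idea above can be sketched as follows (the dict contents, dimensionality, and cache path are assumptions for illustration, not the project's actual values):

```python
import os
import pickle

import numpy as np

# Hypothetical embedding dict, term -> vector, as it would look after
# parsing a GloVe file (real GloVe vectors are 50-300 dimensional).
embeddings = {
    "neural": np.array([0.1, 0.2, 0.3]),
    "network": np.array([0.3, 0.1, 0.0]),
}


def average_embedding(tokens, embeddings, dim=3):
    """Represent a query/document as the mean of its term vectors.

    Tokens without an embedding are skipped; an all-unknown input
    falls back to a zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Cache the dict with pickle so it need not be rebuilt on every run.
os.makedirs("cache", exist_ok=True)
with open("cache/embeddings.pkl", "wb") as f:
    pickle.dump(embeddings, f)
```

Averaging loses word order but gives a fixed-size vector for any text length, which makes query/document similarity a simple cosine between two vectors.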
ThaiJamesLee commented
- created the folder cache/
- it contains cached word embedding vectors from GloVe
- read it with pickle to get a dict with key = term, value = embedding vector (array)
- the processed paragraph corpus was used for the vocabulary
- we might need to run it again for the original corpus
see commit: a6b9c6d
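A minimal sketch of how such a term-to-vector dict can be built from a GloVe text file (restricted to a corpus vocabulary, as described above) and cached with pickle; the function names and file paths are assumptions, not necessarily what the commit does:

```python
import pickle

import numpy as np


def load_glove(path, vocab=None):
    """Parse a GloVe text file ('term v1 v2 ...' per line) into {term: vector}.

    If a vocabulary set is given, only terms in it are kept, which keeps
    the cached dict small when the corpus vocabulary is limited.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            term = parts[0]
            if vocab is None or term in vocab:
                vectors[term] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def cache_vectors(vectors, path):
    """Serialize the dict once so later runs can skip the GloVe parse."""
    with open(path, "wb") as f:
        pickle.dump(vectors, f)


def load_cached(path):
    """Reload the cached {term: vector} dict."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Rerunning for the original (non-processed) corpus would then just mean calling `load_glove` again with the other vocabulary set.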
ThaiJamesLee commented
Issue: how to handle unknown words in the pretrained GloVe model
- https://stackoverflow.com/questions/49239941/what-is-unk-in-the-pretrained-glove-vector-files-e-g-glove-6b-50d-txt
- word2vec models would provide an 'unk' vector for unknown words; the pretrained GloVe files do not
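One common way to handle this is a lookup with a fallback chain: use the model's 'unk' vector if the embedding dict has one, otherwise a zero vector (or skip the token entirely when averaging). A sketch, with the key name and dimensionality as assumptions:

```python
import numpy as np


def lookup(term, embeddings, unk_key="unk", dim=50):
    """Return the vector for term, falling back for out-of-vocabulary words.

    Fallback order: the model's own 'unk' entry (word2vec-style models may
    ship one), then a zero vector (GloVe pretrained files have no 'unk').
    """
    if term in embeddings:
        return embeddings[term]
    if unk_key in embeddings:
        return embeddings[unk_key]
    return np.zeros(dim)
```

A zero vector simply contributes nothing to an average embedding, so heavily OOV texts drift toward the origin; that is exactly why reducing the OOV rate (see the next comment) matters.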
heatherwan commented
too many unknown terms in the corpus => lemmatize the train/test dataframes and rerun the tests
read /processed_test.pkl, lemma_processed_query.pkl, or lemma_processed_paragraph.pkl
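Why lemmatization helps here: inflected forms ("networks", "queries") are often missing from the embedding vocabulary even when their lemmas are present. A toy sketch of the effect, using a hypothetical lemma map in place of a real lemmatizer (the project would use something like NLTK's WordNetLemmatizer on the dataframes above):

```python
# Hypothetical lemma map standing in for a real lemmatizer.
LEMMAS = {"networks": "network", "queries": "query", "running": "run"}


def lemmatize(tokens):
    """Map each token to its lemma, leaving unmapped tokens unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]


def oov_rate(tokens, embeddings):
    """Fraction of tokens that have no embedding vector."""
    return sum(t not in embeddings for t in tokens) / len(tokens)
```

Comparing `oov_rate(tokens, embeddings)` before and after `lemmatize` makes the improvement measurable, which is a quick sanity check for the updated pkl files.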