Implement/Find out how to use libraries for semantic word embeddings
Closed this issue · 3 comments
ThaiJamesLee commented
- GloVe
- word2vec
- transform the query and the document into embedded word vectors
- use the average embedding vector as the representation
- use pickle to cache the vectors
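The averaging-plus-caching idea above can be sketched as follows (the dict contents, dimensionality, and cache path are assumptions for illustration, not the project's actual values):

```python
import os
import pickle

import numpy as np

# Hypothetical embedding dict, term -> vector, as it would look after
# parsing a GloVe file (real GloVe vectors are 50-300 dimensional).
embeddings = {
    "neural": np.array([0.1, 0.2, 0.3]),
    "network": np.array([0.3, 0.1, 0.0]),
}


def average_embedding(tokens, embeddings, dim=3):
    """Represent a query/document as the mean of its term vectors.

    Tokens without an embedding are skipped; an all-unknown input
    falls back to a zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Cache the dict with pickle so it need not be rebuilt on every run.
os.makedirs("cache", exist_ok=True)
with open("cache/embeddings.pkl", "wb") as f:
    pickle.dump(embeddings, f)
```

Averaging loses word order but gives a fixed-size vector for any text length, which makes query/document similarity a simple cosine between two vectors.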
ThaiJamesLee commented
- created the folder cache/
- it contains cached word embedding vectors from GloVe
- read it with pickle to get a dict with key = term, value = embedding vector (array)
- the processed paragraph corpus was used for the vocabulary
- we might need to run it again for the original corpus
see commit: a6b9c6d
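A minimal sketch of how such a term-to-vector dict can be built from a GloVe text file (restricted to a corpus vocabulary, as described above) and cached with pickle; the function names and file paths are assumptions, not necessarily what the commit does:

```python
import pickle

import numpy as np


def load_glove(path, vocab=None):
    """Parse a GloVe text file ('term v1 v2 ...' per line) into {term: vector}.

    If a vocabulary set is given, only terms in it are kept, which keeps
    the cached dict small when the corpus vocabulary is limited.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            term = parts[0]
            if vocab is None or term in vocab:
                vectors[term] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def cache_vectors(vectors, path):
    """Serialize the dict once so later runs can skip the GloVe parse."""
    with open(path, "wb") as f:
        pickle.dump(vectors, f)


def load_cached(path):
    """Reload the cached {term: vector} dict."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Rerunning for the original (non-processed) corpus would then just mean calling `load_glove` again with the other vocabulary set.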
ThaiJamesLee commented
Issue: how to handle unknown words in the pretrained GloVe model
- https://stackoverflow.com/questions/49239941/what-is-unk-in-the-pretrained-glove-vector-files-e-g-glove-6b-50d-txt
- word2vec models would provide an 'unk' vector for unknown words; the pretrained GloVe files do not
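One common way to handle this is a lookup with a fallback chain: use the model's 'unk' vector if the embedding dict has one, otherwise a zero vector (or skip the token entirely when averaging). A sketch, with the key name and dimensionality as assumptions:

```python
import numpy as np


def lookup(term, embeddings, unk_key="unk", dim=50):
    """Return the vector for term, falling back for out-of-vocabulary words.

    Fallback order: the model's own 'unk' entry (word2vec-style models may
    ship one), then a zero vector (GloVe pretrained files have no 'unk').
    """
    if term in embeddings:
        return embeddings[term]
    if unk_key in embeddings:
        return embeddings[unk_key]
    return np.zeros(dim)
```

A zero vector simply contributes nothing to an average embedding, so heavily OOV texts drift toward the origin; that is exactly why reducing the OOV rate (see the next comment) matters.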
heatherwan commented
too many unknown terms in the corpus => lemmatize the train/test dataframes and rerun the tests
read /processed_test.pkl, lemma_processed_query.pkl, or lemma_processed_paragraph.pkl
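Why lemmatization helps here: inflected forms ("networks", "queries") are often missing from the embedding vocabulary even when their lemmas are present. A toy sketch of the effect, using a hypothetical lemma map in place of a real lemmatizer (the project would use something like NLTK's WordNetLemmatizer on the dataframes above):

```python
# Hypothetical lemma map standing in for a real lemmatizer.
LEMMAS = {"networks": "network", "queries": "query", "running": "run"}


def lemmatize(tokens):
    """Map each token to its lemma, leaving unmapped tokens unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]


def oov_rate(tokens, embeddings):
    """Fraction of tokens that have no embedding vector."""
    return sum(t not in embeddings for t in tokens) / len(tokens)
```

Comparing `oov_rate(tokens, embeddings)` before and after `lemmatize` makes the improvement measurable, which is a quick sanity check for the updated pkl files.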