ThaiJamesLee/IR_Complex_Question_Retrieval

Implement/Find out how to use libraries for semantic word embeddings

Closed this issue · 3 comments

  • GloVe
  • word2vec
  • transform each query and document into word embedding vectors
  • represent each text by the average of its word embedding vectors
  • use pickle to cache the embeddings
  • created the folder cache/
  • it contains the cached word embedding vectors from GloVe
  • read it with pickle to get a dict that maps each term (key) to its embedding vector (value, an array)
  • the processed paragraph corpus was used as the vocabulary
  • we might need to run it again for the original corpus
    see commit: a6b9c6d
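
A minimal sketch of the steps listed above, not the code from the commit: parse a GloVe-style text file into a term-to-vector dict, cache it with pickle, and average the vectors of a tokenized text. The function names, the cache file name, and the choice to skip terms without a vector are assumptions made for illustration.

```python
import pickle
from pathlib import Path

import numpy as np


def load_glove_text(path):
    """Parse a GloVe-style text file: each line is a term followed by its floats."""
    embeddings = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings


def save_cache(embeddings, cache_file):
    """Pickle the term -> vector dict, creating the cache folder if needed."""
    cache_file = Path(cache_file)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_file, "wb") as fh:
        pickle.dump(embeddings, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_cache(cache_file):
    """Read the pickled dict back (only unpickle files you created yourself)."""
    with open(cache_file, "rb") as fh:
        return pickle.load(fh)


def average_embedding(tokens, embeddings, dim):
    """Average the vectors of all known tokens; zero vector if none is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)
```

With a cache written once via `save_cache(load_glove_text(...), "cache/glove.pkl")` (hypothetical file name), later runs can call `load_cache` and pass the dict to `average_embedding` for each query and document.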

Issue: how to handle words that are unknown to the pretrained GloVe model
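
One common option, shown here only as a sketch and not as the approach this thread settled on, is to map every unknown term to a single fallback vector, for example the mean of all known vectors. The helper names below are hypothetical, and `embeddings` is assumed to be a dict from term to NumPy array.

```python
import numpy as np


def build_unknown_vector(embeddings):
    """Mean of all known vectors, used as a stand-in for unseen terms."""
    return np.mean(np.stack(list(embeddings.values())), axis=0)


def lookup(term, embeddings, unknown_vector):
    """Return the term's vector, or the fallback vector if the term is unknown."""
    return embeddings.get(term, unknown_vector)


def embed_tokens(tokens, embeddings, unknown_vector):
    """Average over all tokens, substituting the fallback for unknown ones."""
    if not tokens:
        return unknown_vector.copy()
    return np.mean([lookup(t, embeddings, unknown_vector) for t in tokens], axis=0)
```

Other choices, such as dropping unknown terms or using a zero vector, change how much the unknown terms pull the average, so the choice is worth checking against retrieval quality.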

Too many terms in the corpus are unknown to the GloVe model => update the lemmatized train/test dataframes for testing.
Read /processed_test.pkl, lemma_processed_query.pkl, or lemma_processed_paragraph.pkl.
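
To judge whether the lemmatized data actually reduces the number of unknown terms, the share of tokens without a vector can be measured before and after. This is a sketch under assumptions: `tokenized_docs` is taken to be a list of token lists (the real layout of the pickled dataframes is not shown in this thread), and `oov_report` is a hypothetical helper.

```python
from collections import Counter


def oov_report(tokenized_docs, vocabulary, top_n=10):
    """Return the out-of-vocabulary token rate and the most frequent unknown terms."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    total = sum(counts.values())
    unknown = Counter({t: c for t, c in counts.items() if t not in vocabulary})
    missing = sum(unknown.values())
    rate = missing / total if total else 0.0
    return rate, unknown.most_common(top_n)
```

Passing the keys of the cached embedding dict as `vocabulary` gives the rate for that model; the list of frequent unknown terms helps show whether they are typos, rare entities, or lemmatization artifacts.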