stanfordnlp/GloVe

Idea - rebuild glove vectors

AngledLuffa opened this issue · 1 comment

Goal: build new GloVe vectors with a current vocabulary

  • newly prominent people (names that postdate the original training data) should now have word vectors
  • match the tokenization of the GloVe vectors with the upcoming CoreNLP tokenization
  • also, we don't have good records of what data we used to build the vectors
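
One way to measure the vocabulary gap before rebuilding: a minimal sketch that lists which tokens have no entry in an existing vectors file. It assumes the plain-text GloVe format (one token per line followed by space-separated floats) and that tokens contain no spaces. The function names and the sample data are made up for illustration.

```python
from typing import Iterable, List, Set


def load_glove_vocab(lines: Iterable[str]) -> Set[str]:
    """Collect the first space-separated field of each line as the vocabulary."""
    vocab = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: the token is everything before the first space,
        # and the remaining fields are the vector components.
        vocab.add(line.split(" ", 1)[0])
    return vocab


def missing_tokens(tokens: Iterable[str], vocab: Set[str]) -> List[str]:
    """Return tokens with no vector, in input order, without duplicates."""
    seen = set()
    missing = []
    for tok in tokens:
        if tok not in vocab and tok not in seen:
            seen.add(tok)
            missing.append(tok)
    return missing


if __name__ == "__main__":
    # Hypothetical three-dimensional vectors, just to exercise the functions.
    sample = ["the 0.1 0.2 0.3", "stanford 0.4 0.5 0.6"]
    vocab = load_glove_vocab(sample)
    print(missing_tokens(["the", "stanford", "newname"], vocab))
```

With a real vectors file, `load_glove_vocab` can be given the open file object directly, since it only iterates over lines.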

Candidate data: Wikipedia + Gigaword
maybe Common Crawl and/or Twitter as well

could look at Attardi's Wikipedia cleaner for extracting plain text from the dump

note for internal use: /u/downloads/data

This would be very useful. I want something up to date and I can't seem to find any alternatives.