Idea - rebuild GloVe vectors
AngledLuffa opened this issue · 1 comment
AngledLuffa commented
Goal: build new GloVe vectors with a current vocabulary
- people who became well known after the old vectors were built should now have word vectors
- match the tokenization of the GloVe vectors with the upcoming CoreNLP tokenization
- also, we don't have good records of what data we used to build the existing vectors
Data sources: Wikipedia + Gigaword
maybe Common Crawl and/or Twitter as well
could look at Attardi's Wikipedia cleaner
note for internal use: /u/downloads/data
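A rough way to check the first goal, once new vectors exist, is to tokenize some recent text and list the tokens that have no vector. This is only a minimal sketch: it assumes the standard GloVe text format (one word per line, followed by space-separated floats), and the helper names `load_glove_vocab` and `missing_tokens` plus the sample data are made up for illustration.

```python
from typing import Iterable, List, Set


def load_glove_vocab(lines: Iterable[str]) -> Set[str]:
    """Collect the words from GloVe text-format lines ("word v1 v2 ... vn")."""
    vocab = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The word is everything before the first space; the rest is the vector.
        vocab.add(line.split(" ", 1)[0])
    return vocab


def missing_tokens(tokens: Iterable[str], vocab: Set[str]) -> List[str]:
    """Return the tokens with no vector, deduplicated, in first-seen order."""
    seen = set()
    missing = []
    for tok in tokens:
        if tok not in vocab and tok not in seen:
            seen.add(tok)
            missing.append(tok)
    return missing


# Hypothetical sample data: two vector lines and an already-tokenized sentence.
sample_vectors = ["the 0.1 0.2", "obama 0.3 0.4"]
sample_tokens = ["the", "newname", "obama", "newname"]

vocab = load_glove_vocab(sample_vectors)
print(missing_tokens(sample_tokens, vocab))
```

For a real vectors file, the same function can take an open file handle, for example `load_glove_vocab(open(path, encoding="utf-8"))`, where `path` is wherever the vectors end up. The tokens should come from the same tokenizer that was used to build the vectors, otherwise mismatches in tokenization will show up as missing words.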
Big-Tree commented
This would be very useful. I want something up to date and I can't seem to find any alternatives.