A PyTorch implementation of GloVe: Global Vectors for Word Representation.
We use the text8 corpus. To get the data, run:

```bash
cd data
./get-data.sh
```
To obtain the cooccurrence counts and construct the sparse matrices, run:

```bash
cd data
mkdir vocab pairs cooccur
./get-cooccurrences.sh
```
By default this script constructs a vocabulary of the 10,000 most common words.
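The script's exact counting scheme isn't shown here, but the standard GloVe recipe counts each pair of words within a context window, weighted by 1/distance. A minimal sketch of that recipe (function name and details are illustrative, not taken from the script):

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count weighted cooccurrences: each pair of words at most `window`
    positions apart contributes 1/distance, as in the original GloVe paper.
    Both orderings are recorded, so the resulting matrix is symmetric."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            dist = i - j
            counts[(word, tokens[j])] += 1.0 / dist
            counts[(tokens[j], word)] += 1.0 / dist
    return counts
```

For example, `cooccurrence_counts("the cat sat on the mat".split(), window=2)` gives `("the", "cat")` a count of 1.0 from the adjacent pair plus 0.5 from the distance-2 pair.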
To train 100-dimensional vectors on the cooccurrence matrices constructed above, run:

```bash
mkdir vec
./main.py train --name text8.10k --emb-dim 100 --out-dir vec
```
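Training minimizes the GloVe weighted least-squares objective, J = Σ f(X_ij)(wᵢ·w̃ⱼ + bᵢ + b̃ⱼ − log X_ij)². A NumPy sketch of that objective over a dense count matrix (parameter names are illustrative, not the script's actual variables):

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """GloVe objective: f(X) * (w_i . w~_j + b_i + b~_j - log X_ij)^2, summed.
    Zero-count cells get zero weight, so the log is only taken where X > 0."""
    weight = np.where(X < x_max, (X / x_max) ** alpha, 1.0)
    weight[X == 0] = 0.0
    log_X = np.log(np.where(X > 0, X, 1.0))           # placeholder 1 where X == 0
    pred = W @ W_ctx.T + b[:, None] + b_ctx[None, :]  # w_i . w~_j + b_i + b~_j
    return np.sum(weight * (pred - log_X) ** 2)
```

The actual training code works on the sparse matrices batch by batch, but the quantity being minimized is the same.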
The vectors are saved in `vec/text8.10k.100d.txt`.
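Assuming the usual plain-text embedding layout (one word per line followed by its space-separated components; the actual file format may differ), the saved vectors can be read back with a few lines:

```python
import numpy as np

def load_vectors(path):
    """Read a plain-text embedding file with lines of the form
    `word v1 v2 ... vD` into a dict of word -> float32 array."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:      # skip blank or malformed lines
                continue
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors
```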
To plot (a number of) these vectors, use:

```bash
./main.py plot --vec-dir vec/text8.10k.100d.txt
```
The plots are saved as HTML in `plots`. An example can be seen here. (GitHub does not render HTML files. To render, download and open the file, or use this link.)
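The plot subcommand's internals aren't shown here; presumably it projects the embeddings to 2-D with scikit-learn's t-SNE before rendering with bokeh. A minimal sketch of just the projection step (function name and defaults are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

def project_2d(vectors, n_words=500, perplexity=30.0, seed=0):
    """Project the first n_words embedding rows to 2-D with t-SNE."""
    emb = np.asarray(vectors)[:n_words]
    # Perplexity must be smaller than the number of points being embedded.
    perplexity = min(perplexity, len(emb) - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=seed)
    return tsne.fit_transform(emb)
```

Restricting the projection to the most frequent words keeps the plot readable and the (quadratic-ish) t-SNE run fast.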
```
torch==0.4.1
numpy
tqdm
bokeh    # for t-SNE plot
sklearn  # for t-SNE plot
```
- Add vector evaluation tests.
- Investigate why training is so slow on GPU.
- Hogwild training, for fun.