A PyTorch implementation of GloVe: Global Vectors for Word Representation.
We use the text8 corpus. To get the data, run:

```bash
cd data
./get-data.sh
```
To obtain the cooccurrence counts and construct the sparse matrices, run:

```bash
cd data
mkdir vocab pairs cooccur
./get-cooccurrences.sh
```
By default this script constructs a vocabulary of the 10,000 most common words.
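The script's exact counting scheme isn't shown here, but the standard GloVe recipe counts each pair of words within a context window, weighted by 1/distance. A minimal sketch of that recipe (function name and details are illustrative, not taken from the script):

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count weighted cooccurrences: each pair of words at most `window`
    positions apart contributes 1/distance, as in the original GloVe paper.
    Both orderings are recorded, so the resulting matrix is symmetric."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            dist = i - j
            counts[(word, tokens[j])] += 1.0 / dist
            counts[(tokens[j], word)] += 1.0 / dist
    return counts
```

For example, `cooccurrence_counts("the cat sat on the mat".split(), window=2)` gives `("the", "cat")` a count of 1.0 from the adjacent pair plus 0.5 from the distance-2 pair.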
To train 100-dimensional vectors on the cooccurrence matrices constructed above, run:

```bash
mkdir vec
./main.py train --name text8.10k --emb-dim 100 --out-dir vec
```
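Training minimizes the GloVe weighted least-squares objective, J = Σ f(X_ij)(wᵢ·w̃ⱼ + bᵢ + b̃ⱼ − log X_ij)². A NumPy sketch of that objective over a dense count matrix (parameter names are illustrative, not the script's actual variables):

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """GloVe objective: f(X) * (w_i . w~_j + b_i + b~_j - log X_ij)^2, summed.
    Zero-count cells get zero weight, so the log is only taken where X > 0."""
    weight = np.where(X < x_max, (X / x_max) ** alpha, 1.0)
    weight[X == 0] = 0.0
    log_X = np.log(np.where(X > 0, X, 1.0))           # placeholder 1 where X == 0
    pred = W @ W_ctx.T + b[:, None] + b_ctx[None, :]  # w_i . w~_j + b_i + b~_j
    return np.sum(weight * (pred - log_X) ** 2)
```

The actual training code works on the sparse matrices batch by batch, but the quantity being minimized is the same.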
The vectors are saved in `vec/text8.10k.100d.txt`.
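Assuming the usual plain-text embedding layout (one word per line followed by its space-separated components; the actual file format may differ), the saved vectors can be read back with a few lines:

```python
import numpy as np

def load_vectors(path):
    """Read a plain-text embedding file with lines of the form
    `word v1 v2 ... vD` into a dict of word -> float32 array."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:      # skip blank or malformed lines
                continue
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors
```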
To plot (a number of) these vectors, use:

```bash
./main.py plot --vec-dir vec/text8.10k.100d.txt
```
The plots are saved as HTML in `plots`. An example can be seen here. (GitHub does not render HTML files. To render, download and open the file, or use this link.)
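The plot subcommand's internals aren't shown here; presumably it projects the embeddings to 2-D with scikit-learn's t-SNE before rendering with bokeh. A minimal sketch of just the projection step (function name and defaults are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

def project_2d(vectors, n_words=500, perplexity=30.0, seed=0):
    """Project the first n_words embedding rows to 2-D with t-SNE."""
    emb = np.asarray(vectors)[:n_words]
    # Perplexity must be smaller than the number of points being embedded.
    perplexity = min(perplexity, len(emb) - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=seed)
    return tsne.fit_transform(emb)
```

Restricting the projection to the most frequent words keeps the plot readable and the (quadratic-ish) t-SNE run fast.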
```
torch==0.4.1
numpy
tqdm
bokeh    # for t-SNE plot
sklearn  # for t-SNE plot
```
- Add vector evaluation tests.
- Investigate why training is so slow on GPU.
- Hogwild training, for fun.