stanfordnlp/GloVe

Attempting to train on own corpus

garrett-yoon opened this issue · 4 comments

Hi, I'm unsure why the loss is trending toward infinity when I train on my small corpus. The output vector.txt is filled with 'nan'. I've adjusted 'eta' (the learning rate) and am still having problems.
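For reference, this is roughly the invocation I'm using, with eta and the other hyperparameters passed as flags to build/glove (a sketch: the file names follow the demo script, and I've been varying -eta from its 0.05 default):

build/glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt -save-file vector -vector-size 500 -iter 15 -eta 0.05 -x-max 10 -threads 8 -binary 2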

gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
BUILDING VOCABULARY
Processed 146823 tokens.
Counted 2957 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1284.

COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 1284 words.
Building lookup table...table contains 1648657 elements.
Processed 146823 tokens.
Writing cooccurrences to disk......2 files in total.
Merging cooccurrence files: processed 325221 lines.

SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 325221 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 325221 lines.

TRAINING MODEL
Read 325221 lines.
Initializing parameters...done.
vector size: 500
vocab size: 1284
x_max: 10.000000
alpha: 0.750000
iter: 001, cost: nan
iter: 002, cost: nan
iter: 003, cost: nan
iter: 004, cost: nan
iter: 005, cost: nan
iter: 006, cost: nan
iter: 007, cost: nan
iter: 008, cost: nan
iter: 009, cost: nan
iter: 010, cost: nan
iter: 011, cost: nan
iter: 012, cost: nan
iter: 013, cost: nan
iter: 014, cost: nan
iter: 015, cost: nan
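In case it helps with debugging: my understanding is that the cost printed above is the weighted least-squares objective from the GloVe paper, with the x_max and alpha values shown parameterizing the weighting function (notation from the paper, not from glove.c):

J = \sum_{i,j} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
where f(x) = (x / x_{max})^{\alpha} for x < x_{max}, and 1 otherwise.

Every term is finite as long as the counts X_{ij} are positive, so I assume a nan cost from the very first iteration means the parameters themselves are diverging during the first pass, rather than the cooccurrence file being bad.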

Hi there,

No worries. I think I figured out the issue.