gloVe performances
lfoppiano opened this issue · 4 comments
Hi all,
it's been a while now that I've been trying to train gloVe on a large dataset (1.2Tb).
The script successfully created the vocabulary but it's been like two month that it's running on the cooccurrence extraction:
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
I can see the file cooccurrence.bin
growing slowly, but I was wondering if it is normal that it's running for such long time?
-rw-r--r-- 1 lfoppian0 tdm 631G Sep 8 08:08 cooccurrence.bin
Thank you in advance
For information, I'm attaching the modified the script demo.sh
. Mainly I changed memory=80Gb, CPUs=72 and modified various paths and file names:
#!/bin/bash
set -e
# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python
make
CORPUS=myCorpus.txt
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=80.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=72
X_MAX=10
if hash python 2>/dev/null; then
PYTHON=python
else
PYTHON=python3
fi
echo
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
echo "$ $PYTHON eval/python/evaluate.py"
$PYTHON eval/python/evaluate.py
~
I understand the situation. Thanks anyway for answering. :-)
I checked and things seem fine (no swapping, some memory is still free).
However, after checking this I stopped and restarted with verbose = 0 and with more memory, although I don't think the memory was the issue there.
I can confirm that restarting the process with verbose=0
and more memory finished in much less time