The utils.py file stores the functions used for I/O and other procedures on the corpus.
The SkipW2V.py file implements the word2vec skip-gram architecture with negative sampling.
The main.py file is used for training and testing the model.
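As a rough illustration of what skip-gram with negative sampling optimizes, here is a minimal sketch of one SGD step on a single (center, context) pair. It is not the code in SkipW2V.py: the function names (sigmoid, dot, sgns_step), the plain-list vectors, and the learning rate are all assumptions made for this example.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_step(v_c, u_o, u_negs, lr=0.05):
    """One SGD step for skip-gram with negative sampling (hypothetical helper).

    v_c    -- input vector of the center word
    u_o    -- output vector of the observed context word (label 1)
    u_negs -- output vectors of the sampled negative words (label 0)
    Vectors are updated in place; the loss before the update is returned.
    """
    pairs = [(u_o, 1.0)] + [(u, 0.0) for u in u_negs]
    loss = 0.0
    grad_c = [0.0] * len(v_c)
    for u, label in pairs:
        p = sigmoid(dot(u, v_c))
        # Binary cross-entropy: -log(p) for the true pair, -log(1 - p) for negatives.
        loss -= math.log(p if label == 1.0 else 1.0 - p)
        g = p - label  # derivative of the loss with respect to the score
        for i in range(len(v_c)):
            grad_c[i] += g * u[i]      # accumulate with the old output vector
            u[i] -= lr * g * v_c[i]    # update the output vector
    for i in range(len(v_c)):
        v_c[i] -= lr * grad_c[i]       # update the center vector last
    return loss
```

Calling sgns_step repeatedly on the same pair should push the score of the true context word up and the scores of the negatives down, so the returned loss shrinks over time.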
Example commands:
Training:
python ./main.py -c ../data/1bwc50000.txt -w 2 -min 0 -ll 50000 -tsize 10000 -nex 5 -opt sgd -e 15
Testing with the word "man":
python ./main.py --train_test test -words man
General papers & notes:
Misc.:
Optimize passes over the data.
- Implement subsampling when reading the corpora
- Discard words that do not meet the min_count.
- Implement random batching of data
- Implement an independent testing suite?
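
The subsampling and min_count items above could look roughly like the sketch below. It assumes the corpus is already a list of tokens and uses the keep probability sqrt(t / f(w)) from the original word2vec paper; the function name filter_and_subsample and its parameters (min_count, t, seed) are hypothetical and not part of utils.py.

```python
import math
import random
from collections import Counter


def filter_and_subsample(tokens, min_count=5, t=1e-5, seed=0):
    """Drop rare words, then randomly thin out very frequent ones.

    tokens    -- list of word strings (assumed to be already tokenized)
    min_count -- words seen fewer times than this are discarded
    t         -- subsampling threshold; smaller values drop more frequent words
    seed      -- seed for a private RNG so runs are repeatable
    """
    rng = random.Random(seed)
    counts = Counter(tokens)
    # Keep only words that meet the minimum count.
    vocab = {w: c for w, c in counts.items() if c >= min_count}
    total = sum(vocab.values())
    kept = []
    for w in tokens:
        c = vocab.get(w)
        if c is None:
            continue  # below min_count
        freq = c / total
        # Keep probability sqrt(t / f(w)), capped at 1 for infrequent words.
        keep_prob = min(1.0, math.sqrt(t / freq))
        if rng.random() < keep_prob:
            kept.append(w)
    return kept
```

On a small corpus the default threshold would remove almost everything, so a larger value such as t=0.01 is more practical for quick experiments.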