PyTorch implementation of word2vec.
A pipeline for training word embeddings with word2vec on a web-scraped corpus.
- Set the training configuration in configs/config.yaml
- To train, run: python3 -m word2vec.trainer
- The final embeddings will be saved in the embeddings folder
- To visualize the embeddings, run: python3 embedding_projector.py (this automatically generates the logs folder)
- Then, in a terminal, run: tensorboard --logdir logs/
- Skip-gram
- Batch update
- Negative Sampling
- Sub-sampling of frequent words
- Nearest neighbors search and TensorBoard visualization
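
The ideas behind skip-gram pairs, sub-sampling, and negative sampling can be illustrated with a short, framework-free sketch. This is not the code from this repository: the function names, the window size, and the threshold of 1e-3 are all assumptions chosen for illustration. It uses the formulas from the original word2vec paper, where a word with relative frequency f is discarded with probability 1 - sqrt(t / f) and negatives are drawn from the unigram distribution raised to the 3/4 power.

```python
import math
import random
from collections import Counter


def keep_probability(freq, threshold=1e-3):
    """Probability of keeping a word with relative corpus frequency `freq`.

    Assumed form: discard with probability 1 - sqrt(threshold / freq),
    so rare words (freq <= threshold) are always kept.
    """
    return min(1.0, math.sqrt(threshold / freq))


def subsample(tokens, threshold=1e-3, rng=random):
    """Randomly drop frequent words from a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [
        w for w in tokens
        if rng.random() < keep_probability(counts[w] / total, threshold)
    ]


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def negative_sampling_weights(counts, power=0.75):
    """Normalized sampling weights proportional to count ** power."""
    weights = {w: c ** power for w, c in counts.items()}
    norm = sum(weights.values())
    return {w: v / norm for w, v in weights.items()}


if __name__ == "__main__":
    # Hypothetical toy sentence, only to show how the pieces fit together.
    tokens = "the car is a fast car and the bus is slow".split()
    pairs = skipgram_pairs(tokens, window=1)
    weights = negative_sampling_weights(Counter(tokens))
    negatives = random.choices(list(weights), weights=list(weights.values()), k=5)
    print(pairs[:3], negatives)
```

In an actual training loop, each (center, context) pair would be scored against a handful of sampled negatives, and many such pairs would be grouped into one batch per optimizer step.
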
- To find the nearest neighbors of a word, run: python3 nearest_neighbors.py -word car -topk 5
Here are some awesome examples
python3 nearest_neighbors.py -word car -topk 5
- Top 1 nearest: cars, score 0.67
- Top 2 nearest: automobiles, score 0.58
- Top 3 nearest: vehicle, score 0.58
- Top 4 nearest: mileage, score 0.58
- Top 5 nearest: vehicles, score 0.52
python3 nearest_neighbors.py -word covid
- Top 1 nearest: coronavirus, score 0.62
- Top 2 nearest: outbreak, score 0.61
- Top 3 nearest: pandemic, score 0.56
- Top 4 nearest: cmeminsave, score 0.54
- Top 5 nearest: crisis, score 0.53
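
A nearest-neighbor lookup like the one above can be sketched in a few lines. This sketch assumes the reported score is cosine similarity, which the README does not state, and it uses a hypothetical in-memory dictionary of toy 2-d vectors rather than the embeddings saved by the trainer. The function names are illustrative and not taken from nearest_neighbors.py.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, embeddings, topk=5):
    """Return the `topk` (word, score) pairs most similar to `word`, best first."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word  # exclude the query word itself
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topk]


# Toy vectors invented for illustration only; real embeddings have many more dimensions.
toy = {
    "car": [1.0, 0.0],
    "cars": [0.9, 0.1],
    "vehicle": [0.7, 0.7],
    "banana": [0.0, 1.0],
}

for rank, (neighbor, score) in enumerate(nearest_neighbors("car", toy, topk=2), start=1):
    print(f"- Top {rank} nearest: {neighbor}, score {score:.2f}")
```

For a full vocabulary, the same ranking is usually computed as one matrix-vector product over row-normalized embeddings instead of a Python loop.
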