This repository hosts the source code for an efficient implementation of "Word Mover's Distance" (WMD) using the Sinkhorn-Knopp algorithm. Paper reference will be added upon publication.
- Requirement: gcc 7.1.0 or higher
- Set up the Intel compiler (icc) environment: source your_icc_compiler
- Build by sourcing the compile script: source compile
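The gcc requirement above can be checked before building. The following is a minimal sketch, not part of this repository: `version_ge` is a hypothetical helper name, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# version_ge HAVE WANT
# Hypothetical helper: succeeds when version HAVE is at least version WANT.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    # Under version sort, the smaller of the two must be WANT.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Possible use (assumes gcc is on PATH):
#   version_ge "$(gcc -dumpfullversion -dumpversion)" 7.1.0 || echo "gcc is older than 7.1.0"
```

The helper only compares version strings; it does not check which compiler the compile script actually invokes.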
Download the embedding file from https://www.kaggle.com/datasets/yekenot/fasttext-crawl-300d-2m. The file is not included in this repository because it is large.

Then perform the following steps to prepare the input file:
- Take the first 100001 lines (the header line plus the first 100,000 word vectors): head -n100001 crawl-300d-2M.vec > test.out
- Remove the header line: sed '1d' test.out > test2.out
- Remove the first column (the word) from each line: cut -d" " -f2- test2.out > data/vecs.out
- Discard the temporary files: rm test.out test2.out
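The preparation steps can also be combined into a single pipeline that avoids the temporary files. This is a sketch under stated assumptions: `prepare_vecs` and `check_dims` are hypothetical helper names, not part of this repository, and the default count of 100000 vectors simply mirrors the steps above.

```shell
# prepare_vecs INPUT OUTPUT [COUNT]
# Hypothetical helper: writes the first COUNT vectors (default 100000) of a
# .vec file to OUTPUT, dropping the header line and the leading word column.
prepare_vecs() {
    local input="$1" output="$2" count="${3:-100000}"
    mkdir -p "$(dirname "$output")"
    # Header + COUNT vectors, then drop the header, then drop the first column.
    head -n "$((count + 1))" "$input" | sed '1d' | cut -d" " -f2- > "$output"
}

# check_dims FILE DIM
# Hypothetical sanity check: succeeds only if every line of FILE has exactly
# DIM whitespace-separated fields.
check_dims() {
    awk -v d="$2" 'NF != d { bad = 1 } END { exit bad }' "$1"
}
```

For example, `prepare_vecs crawl-300d-2M.vec data/vecs.out` followed by `check_dims data/vecs.out 300` should produce the same file as the manual steps and confirm that each line holds a 300-dimensional vector.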
Set KMP_AFFINITY to control how threads are bound to cores. For example: export KMP_AFFINITY=compact,1,0,granularity=fine

Then run the executable: ./name_of_executable
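Setting the affinity and launching the program can be wrapped in one small function. This is only a sketch: `run_pinned` is a hypothetical name, and the fallback value is the example setting given above, not a tuned recommendation.

```shell
# run_pinned COMMAND [ARGS...]
# Hypothetical wrapper: runs COMMAND with KMP_AFFINITY set. An already
# exported KMP_AFFINITY is kept; otherwise the example value is used.
run_pinned() {
    KMP_AFFINITY="${KMP_AFFINITY:-compact,1,0,granularity=fine}" "$@"
}

# Possible use (name_of_executable is the placeholder used in this README):
#   run_pinned ./name_of_executable
```

Because the variable is set only for the launched command, the calling shell's environment is left unchanged.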
A small test input is also provided in data (v2, r2, sample.mat). To run it, set the input in the program accordingly and set the word2vec size to 3.
@article{tithi2020efficient,
  title={An Efficient Shared-memory Parallel Sinkhorn-Knopp Algorithm to Compute the Word Mover's Distance},
  author={Tithi, Jesmin Jahan and Petrini, Fabrizio},
  journal={arXiv preprint arXiv:2005.06727},
  year={2020}
}