Speeding up Word Mover's Distance and its variants via properties of distances between embeddings

This repository contains the source code used for all experiments.

Paper: https://arxiv.org/pdf/1912.00509.pdf

A Python wrapper package of the main algorithms is also available (under development).

Environment tested

  • Ubuntu 16.04 and 18.04
  • Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, with 8 GB of RAM

Installing dependencies

sh install_dependencies.sh

This installs the Eigen3 and OR-Tools dependencies.

Building

sh build-project.sh

The application binary will be placed under the build folder.

Getting started

Datasets

The datasets used during this study can be obtained in two ways:

  • Dropbox: https://www.dropbox.com/sh/9rnx8vwjvwjirsp/AADGwbT-aCqzZG7C_L1uNsY1a?dl=0
    • Download each dataset and uncompress it inside the dataset folder.
  • Parse the datasets made available by Kusner in his repository
    • Use the python scripts inside dataset folder
    • For AMAZON, BBCSPORT, CLASSIC, RECIPE and TWITTER use kusner_dataset_parser.py. E.g.:
    python3 kusner_dataset_parser.py amazon 0
    
    • For 20NEWS, OHSUMED and REUTERS use kusner_dataset_parser-2.py. E.g.:
    python3 kusner_dataset_parser-2.py 20news
    

Application

Inside the build folder you will find the fast-wmd executable.

Kusner datasets

./fast-wmd kusner --tr <path-to-train-data> --te <path-to-test-data> --emb <path-to-embeddings> --k <k> --func <distance-function> --r <related-words> --verbose <true|false>

Options:

  • k: number of neighbours used by the k-NN classifier
  • func: distance function to be used with k-NN (the cheaper variants are illustrated in the sketch after this list). It can be:
    • cosine: Cosine distance
    • wmd: Word Mover's Distance (WMD)
    • rwmd: Relaxed Word Mover's Distance (RWMD)
    • lc-rwmd: Linear-Complexity Relaxed Word Mover's Distance (following Atasu et al., 2017)
    • wcd: Word Centroid Distance (WCD)
    • rel-wmd: WMD + edge reduction (following Pele and Werman, 2009)
    • rel-rwmd: RWMD + edge reduction
    • lc-rel-rwmd: Linear-Complexity RWMD + edge reduction
  • r: number of related words per word. Used only by the rel-wmd, rel-rwmd and lc-rel-rwmd functions.
  • verbose: dumps info about the distance computation between each document pair (used for measuring the time spent in the distance function)
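
For intuition, below is a minimal NumPy sketch of the cheaper distance variants listed above. It assumes documents are (word-ids, weights) pairs with weights summing to 1, and Euclidean costs between embeddings; all names and representations here are illustrative only, since the repository implements these functions in C++.

import numpy as np

# Documents are (ids, weights) pairs: `ids` indexes rows of the embedding
# matrix `emb` (vocab_size x dim), `weights` is the normalized bag-of-words
# mass. Illustrative sketch, not the repository's implementation.

def wcd(emb, ids1, w1, ids2, w2):
    # Word Centroid Distance: distance between the weighted centroids of
    # the two documents; a cheap lower bound of WMD.
    return float(np.linalg.norm(emb[ids1].T @ w1 - emb[ids2].T @ w2))

def one_sided_rwmd(emb, ids1, w1, ids2):
    # One-sided relaxation of WMD: each word of doc1 ships all of its mass
    # to the closest word of doc2 (constraints on doc2 are dropped).
    cost = np.linalg.norm(emb[ids1][:, None, :] - emb[ids2][None, :, :], axis=2)
    return float(w1 @ cost.min(axis=1))

def rwmd(emb, ids1, w1, ids2, w2):
    # RWMD: the tighter of the two one-sided relaxations; still <= WMD.
    return max(one_sided_rwmd(emb, ids1, w1, ids2),
               one_sided_rwmd(emb, ids2, w2, ids1))

def lc_rwmd_one_sided(emb, query_ids, docs):
    # Linear-complexity RWMD (Atasu et al., 2017): precompute, for every
    # vocabulary word, the distance to its nearest query word; each
    # candidate document then costs only a single weighted lookup.
    nearest = np.linalg.norm(
        emb[:, None, :] - emb[query_ids][None, :, :], axis=2).min(axis=1)
    return [float(w @ nearest[ids]) for ids, w in docs]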

Triplets datasets

./fast-wmd triplets --triplets <path-to-test-set> --docs <path-to-test-data> --emb <path-to-embeddings> --num_clusters <number-of-clusters> --max_iter <number-of-iterations> --func <distance-function> --r <related-words> --verbose <true|false>

Options:

  • num_clusters: number of clusters used to group the word embeddings in the preprocessing step
  • max_iter: maximum number of iterations of the clustering step

After running, the application prints to the console the configuration used, the preprocessing and classification times, and the error rate.
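
For reference, here is a minimal sketch of how a triplet test of this kind is typically scored, assuming each triplet (a, b, c) should place document a closer to b than to c; the identifiers and the dist function are placeholders, not the repository's exact I/O format.

def triplet_error_rate(triplets, docs, dist):
    # A triplet (a, b, c) counts as an error when the distance function
    # fails to rank b closer to a than c is.
    errors = sum(1 for a, b, c in triplets
                 if dist(docs[a], docs[b]) >= dist(docs[a], docs[c]))
    return errors / len(triplets)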

Scripts

As the number of experiments is large, we provide scripts that make it easy to run most of them:

For AMAZON, BBCSPORT, CLASSIC, RECIPE, TWITTER:

sh run-kusner-experiments.sh "<datasets>" "<partitions>" "<function>" "<r>" "<verbose>"

E.g.:

sh run-kusner-experiments.sh "amazon bbcsport" "0 1 2 3 4" "rel-wmd" "1 2 4 8"

This runs the REL-WMD function with 1, 2, 4 and 8 related words on all partitions of the AMAZON and BBCSPORT datasets.

For 20NEWS, OHSUMED, REUTERS:

sh run-kusner-experiments.sh "<datasets>" "<function>" "<r>" "<verbose>"

E.g.:

sh run-kusner-experiments.sh "20news" "wmd rwmd"

This runs the WMD and RWMD functions on the 20NEWS dataset.

For ARXIV, WIKIPEDIA:

sh run-triplets-experiments.sh "<datasets>" "<function>" "<r>" "<num-clusters>" "<max-iterations>"

E.g.:

sh run-triplets-experiments.sh "arxiv" "lc-rwmd lc-rel-rwmd" "16" "229" "5"

This runs the LC-RWMD and LC-REL-RWMD functions with r=16 on the ARXIV dataset, after a preprocessing step that groups the embeddings into 229 clusters (using at most 5 iterations).
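
For intuition, the clustering preprocessing can be pictured as plain k-means over the word vectors, with each word then represented by its cluster centroid so the effective vocabulary shrinks to num_clusters entries. The sketch below uses scikit-learn for illustration and may differ from the repository's C++ implementation.

from sklearn.cluster import KMeans

def cluster_embeddings(emb, num_clusters=229, max_iter=5):
    # emb: (vocab_size x dim) word-embedding matrix.
    km = KMeans(n_clusters=num_clusters, max_iter=max_iter, n_init=1).fit(emb)
    # Word i is now represented by km.cluster_centers_[km.labels_[i]].
    return km.cluster_centers_, km.labels_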