Learning ranking distributions using hyperplane arrangement
Written by Shizuo KAJI
Look at the following paper for details:
- "A hyper-surface arrangement model of ranking distributions" by S. Kaji, A. Horiguchi, T. Abe (Kyushu University), and Y. Watanabe (ZOZO Technologies), ACM SIGKDD 2021.
Requirements
- Python 3: Anaconda is recommended
- Python libraries: Chainer, chainerui:
pip install chainer chainerui
- (optional) for parallel learning: mpi4py:
conda install mpi4py
Terminology
- label = alternative = item are those that are ranked.
- instance = agent = judge are those who rank labels.
- ranking is a partial ordering of labels by an instance.
- ranking distribution is a probability distribution on the orderings of labels that arises from a collection of rankings.
- In our model, labels and instances are represented by points in the m-dimensional ball.
- ranking distribution can be represented by an n!-dimensional vector, where n is the number of labels.
- Our model approximates the ranking distribution by the distribution of the volumes of the cells formed by the hyper-plane arrangements in the m-dimensional ball. The dimension of our model is nm.
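The relation between cell volumes and ranking probabilities can be illustrated with a small Monte Carlo sketch. It assumes, purely for illustration, that an instance ranks labels by increasing Euclidean distance, so that the cells are cut out by the perpendicular bisector hyperplanes of pairs of labels. The function names are hypothetical and are not taken from rankLearn.py or arrangement.py.

```python
import itertools
from collections import Counter

import numpy as np


def sample_unit_ball(num_points, dim, rng):
    """Draw points uniformly from the unit ball in R^dim."""
    directions = rng.standard_normal((num_points, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Scaling radii by U^(1/dim) makes the samples uniform in volume.
    radii = rng.random(num_points) ** (1.0 / dim)
    return directions * radii[:, None]


def estimate_ranking_distribution(labels, num_samples=20000, seed=0):
    """Estimate the relative volume of each cell by Monte Carlo sampling.

    Assumption (illustration only): an instance prefers closer labels.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=float)
    n, dim = labels.shape
    instances = sample_unit_ball(num_samples, dim, rng)
    # dists[i, j] = distance from instance i to label j
    dists = np.linalg.norm(instances[:, None, :] - labels[None, :, :], axis=2)
    rankings = np.argsort(dists, axis=1)
    counts = Counter(map(tuple, rankings.tolist()))
    # One entry per permutation, i.e. an n!-dimensional vector.
    return {perm: counts.get(perm, 0) / num_samples
            for perm in itertools.permutations(range(n))}


# Three labels in the 2-dimensional ball (n = 3, m = 2).
labels = [[-0.5, 0.0], [0.5, 0.0], [0.0, 0.5]]
dist = estimate_ranking_distribution(labels)
```

Here `dist` has n! = 6 entries, while the only parameters are the nm = 6 label coordinates; for larger n the gap between n! and nm grows quickly.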
How to use
- Partial ranking data should be prepared in a csv file in which each line has the following format:
id, x, y, z, ...
where x > y > z > ... for instance id. There can be multiple lines having the same id so that any partial ordering can be specified.
- Learning the model: The following learns the ranking distribution of ranking.csv and outputs labels.csv (learned label coords) and instances.csv (learned instance coords) under the directory named result.
python rankLearn.py ranking.csv -e 50 -lr 0.01 -lri 1e-3 -o result
The batch size (-b 20), number of epochs (-e 50), and learning rate (-lr 0.01) have a large impact on the speed and accuracy of learning.
Note that only labels.csv is needed for sampling from the learned model.
- Learning can be resumed by giving initial configurations of label and/or instance coordinates.
python rankLearn.py ranking.csv -e 50 -lr 0.01 -lri 1e-3 -o result -i instances.csv -l labels.csv
- Parallel learning: if mpi4py is installed, learning can be parallelised using MPI.
mpiexec -n 4 python rankLearn.py ranking.csv -e 50 -lr 0.01 -o result --mpi
- To see the list of command-line options,
python rankLearn.py -h
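The ranking file format described under "How to use" can be read with a few lines of Python. The following is a minimal sketch, not the loader used by rankLearn.py: the function name `load_partial_rankings` is hypothetical, and it assumes integer ids and labels with no header row. Each line `id, x, y, z` is expanded into the pairs (x, y), (x, z), (y, z), and pairs from lines sharing the same id are merged.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations


def load_partial_rankings(lines):
    """Map each instance id to a set of (preferred, less_preferred) pairs."""
    prefs = defaultdict(set)
    for row in csv.reader(lines):
        fields = [int(v) for v in row if v.strip()]
        if len(fields) < 3:
            continue  # need an id and at least two labels
        instance_id, chain = fields[0], fields[1:]
        # x > y > z implies every earlier label beats every later one.
        prefs[instance_id].update(combinations(chain, 2))
    return dict(prefs)


# In-memory stand-in for a ranking file; instance 1 uses two lines.
sample = io.StringIO("0, 2, 0, 1\n1, 1, 2\n1, 0, 2\n")
prefs = load_partial_rankings(sample)
```

For a file on disk, the same function can be given an open file handle, for example `open("ranking.csv", newline="")`.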
Example
- For the SUSHI dataset, a sample embedding setting is (--batchsize 50 --epoch 100 --dim 9 --margin 0.01 --learning_rate 0.01 --lambda_ord 1 --lambda_repel_instance 0.0001 --lambda_ball 1)
- For the toy dataset with two items (2label90-10.csv), a sample setting is (-b 100 -e 200 -d 2 -m 0.03 -lr 0.1 -lo 10 -lri 1 -lrl 1e-9 -lb 10 -se)
Visualisation of the first two principal components:
Evaluating the model
- To compare the learned model with the data in terms of various metrics,
python arrangement.py --label labels.csv -r2 ranking.csv --compute_wasserstein --top_n 3
- To plot the coordinates of labels and instances,
python arrangement.py --label labels.csv --instance instances.csv --plot
Sample ranking data creation
- To sample 1000 rankings from the learned model labels.csv,
python arrangement.py --label labels.csv -ni 1000 -g --dim 2 -o result
The sampled rankings (ranking.csv) and instance coordinates (instances.csv) are found under the directory named result.
- To create a 2D sample arrangement with 5 labels and 1000 instances,
python arrangement.py -nl 5 -ni 1000 -g --dim 2
- To see the list of command-line options,
python arrangement.py -h
TODO
- Use the Lehmer code for indexing permutations.
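The Lehmer code mentioned in the TODO item is a standard way of numbering permutations. The following is a minimal sketch, independent of this repository's code and with hypothetical function names, of how a full ranking of n labels could be mapped to a position in an n!-dimensional vector and back. It assumes labels are numbered 0 to n-1.

```python
from math import factorial


def lehmer_code(perm):
    """code[i] = number of later entries smaller than perm[i]."""
    return [sum(later < p for later in perm[i + 1:])
            for i, p in enumerate(perm)]


def perm_to_index(perm):
    """Rank a permutation of 0..n-1 in lexicographic order (0 .. n!-1)."""
    n = len(perm)
    return sum(c * factorial(n - 1 - i)
               for i, c in enumerate(lehmer_code(perm)))


def index_to_perm(index, n):
    """Inverse of perm_to_index."""
    remaining = list(range(n))
    perm = []
    for i in range(n):
        # Each factorial-base digit selects one of the unused labels.
        digit, index = divmod(index, factorial(n - 1 - i))
        perm.append(remaining.pop(digit))
    return perm
```

This version computes the code in O(n^2) time, which is adequate for small n; it is meant only to show the indexing scheme.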