MACH (Merged-Average Classifiers via Hashing) is a method for reducing the time and space cost of extreme classification. Paper. Dataset.
In short, MACH uses hash functions to map `L` labels into `B` buckets. There are `R` different hash functions in total, which gives `R` groups of buckets. One classification model is then trained for each group. At prediction time, the outputs of the `R` models are computed, each predicting scores over `B` buckets. The bucket scores are mapped back to the `L` labels, and the scores from the `R` groups are averaged to produce the final score for each label, yielding a vector of length `L`.
Terminologies:
- Bucket: a hash function f: N -> [0, B) hashes any integer into one of the buckets 0 to B-1.
- Repetition: each of the `R` label groups is called a repetition.
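To make the scoring scheme concrete, here is a minimal, self-contained sketch of the bucket-to-label aggregation. It is illustrative only: the hash functions, shapes and values are made up, and this is not the code in this repository.

```python
import numpy as np

# Minimal sketch of MACH score aggregation (illustrative, not this repo's code).
# L: number of original labels, B: number of buckets, R: number of repetitions.
L, B, R = 1000, 32, 4
rng = np.random.default_rng(0)

# R independent hash functions, simulated here as random label -> bucket maps.
label_to_bucket = rng.integers(0, B, size=(R, L))

# Pretend each of the R models produced bucket scores for one test instance.
bucket_scores = rng.random((R, B))

# Map bucket scores back to labels and average over the R repetitions.
label_scores = np.zeros(L)
for r in range(R):
    label_scores += bucket_scores[r, label_to_bucket[r]]
label_scores /= R  # final score vector of length L
```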
- Test on CPU.
- Debug Evaluation in xclib.
- Test on GPU.
- Add mAP to final evaluation.
- Add an option to not use feature hashing.
- Test on large datasets.
- Decide A and B for each dataset.[cannot]
- Tune hyper-parameters.
  - bibtex
  - delicious
  - mediamill
  - Eurlex
  - wiki10
  - Amazon670k
- Trim labels.
- Code for training and evaluating trimmed datasets.
- Documentation for trimmed datasets.
- Does trimming labels break MACH's theoretical guarantee?
  - bibtex
  - delicious
  - mediamill
  - Eurlex
  - wiki10
  - Amazon670k
- Train a fully connected neural network for multilabel classification, using BCE loss.
- Evaluate sequentially.
For evaluation, pyxclib is needed. It currently has a bug which causes precision to be calculated incorrectly. See this issue and change the code: add `indices = indices[:, ::-1]` after these two lines in `_get_top_k`: here and here.
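The reversal presumably turns an ascending index order into a descending one. A toy illustration of the pattern (not pyxclib's actual code):

```python
import numpy as np

scores = np.array([[0.1, 0.9, 0.4]])
indices = np.argsort(scores, axis=1)   # ascending: lowest-scored label first
indices = indices[:, ::-1]             # descending: highest-scored label first
print(indices)                         # [[1 2 0]]
```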
To calculate mAP (mean average precision), torchnet is used. It has a bug (issue). Please change `return 0` here to `return torch.tensor([0.])`.
Other prerequisites are specified in `requirements.txt`.
Please run the code from the project root directory, not from `src`.
The structure of the `data` directory is as follows:
data/
├── Delicious -> ../../dataset_uncompressed/Delicious
│ ├── Delicious_test.txt
│ └── Delicious_train.txt
└── bibtex -> ../../dataset_uncompressed/Bibtex/
├── bibtex_train.txt
└── bibtex_test.txt
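For example, assuming the raw datasets were extracted to ../../dataset_uncompressed as in the tree above, the symlinks can be created with:
mkdir -p data
ln -s ../../dataset_uncompressed/Delicious data/Delicious
ln -s ../../dataset_uncompressed/Bibtex/ data/bibtex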
Note: The data subdirectory's name must match the `name` field in the config file, case-sensitively. E.g. for the Delicious dataset, the `name` field in `config/data/delicious.yaml` and the name of the subdirectory under `data` should both be Delicious. The prefix of the txt files in the subdirectory, however, can be anything, as long as it is specified in the dataset config file: the training file under the `bibtex` directory can be `Delicious_train.txt`, as long as the `prefix` field in the config file is Delicious.
`*_test.txt` contains the test set and `*_train.txt` contains both the training and validation sets. The number of training instances is specified by `train_size` in the data config file. Assume it is m: the first m instances in `*_train.txt` are used for training, and the rest form the validation set.
The format of the text files can be found here (search README). The first line in the original dataset file contains 3 numbers indicating the number of instances, features and labels. These should also be entered in the dataset config manually.
Configuration files are placed in `config/data` (dataset meta-information) and `config/models` (model hyperparameters). There is one file called `test.yaml` in each directory, with comments, serving as an example. Config file names should start with a lower-case letter.
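For orientation, a data config might look roughly like the following. Only `name`, `prefix`, `train_size`, `trimmed` and the three header counts are mentioned in this README, so the exact field names and values below are guesses; `test.yaml` remains the authoritative reference.

```yaml
# Hypothetical data config sketch -- check config/data/test.yaml for the real field names.
name: Delicious        # must match the subdirectory name under data/ (case-sensitive)
prefix: Delicious      # txt files are data/Delicious/Delicious_train.txt and Delicious_test.txt
train_size: 11000      # first 11000 lines of *_train.txt train the model, the rest validate
trimmed: false         # see the trim_labels.py section below
# Counts copied manually from the first line of the original dataset file
# (illustrative values and field names):
num_instances: 12920
num_features: 500
num_labels: 983
```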
After setting up the config files, run preprocessing to generate auxiliary files in the `record` directory.
python src/preprocess.py -h
usage: preprocess.py [-h] --model MODEL --dataset DATASET
optional arguments:
-h, --help show this help message and exit
--model MODEL, -m MODEL
Path to the model config yaml file.
--dataset DATASET, -d DATASET
Path to the data config yaml file.
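For example (assuming Bibtex config files named as described above; adjust the paths to your actual config files):
Example: python src/preprocess.py --model config/models/bibtex.yaml --dataset config/data/bibtex.yaml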
python src/train.py -h
usage: train.py [-h] [--rep REP] --model MODEL --dataset DATASET [--gpus GPUS]
optional arguments:
-h, --help show this help message and exit
--rep REP, -r REP Which repetition to train. Default 0.
--model MODEL, -m MODEL
Path to the model config yaml file.
--dataset DATASET, -d DATASET
Path to the data config yaml file.
--gpus GPUS, -g GPUS A string that specifies which GPU you want to use,
split by comma. Eg 0,1. Default 0.
Example: python src/train.py --rep 0 --model $MODEL_CONFIG --dataset $DATASET_CONFIG --gpus 0
The training script trains the model for only one repetition. It creates a `models` directory in which trained models are saved. The structure of the directory is:
models/
└── Bibtex
├── B_100_R_32_feat_1000_hidden_[32, 32]_rep_00
│ ├── best_ckpt.pkl
│ ├── final_ckpt.pkl
│ └── train.log
├── B_100_R_32_feat_1000_hidden_[32, 32]_rep_01
│ ├── best_ckpt.pkl
│ ├── final_ckpt.pkl
│ └── train.log
...
└── Bibtex_eval.log
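Since each invocation trains only one repetition, all R repetitions can be trained with a simple shell loop, for example (assuming R = 32, as in the directory names above):
for r in $(seq 0 31); do
    python src/train.py --rep $r --model $MODEL_CONFIG --dataset $DATASET_CONFIG --gpus 0
done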
python src/evaluate.py -h
usage: evaluate.py [-h] --model MODEL --dataset DATASET [--gpus GPUS]
[--type TYPE] [--rate RATE]
optional arguments:
-h, --help show this help message and exit
--model MODEL, -m MODEL
Path to the model config yaml file.
--dataset DATASET, -d DATASET
Path to the data config yaml file.
--gpus GPUS, -g GPUS A string that specifies which GPU you want to use,
split by comma. Eg 0,1
--type TYPE, -t TYPE Evaluation type. Should be 'all'(default) and/or
'trim_eval', split by comma. Eg. 'all,trim_eval'. If
it is 'trim_eval', the rate parameter should be
specified.
'all': Evaluate normally. If the 'trimmed'
field in data config file is true, the code will
automatically map the rest of the labels back to the
original ones. 'trim_eval': Trim labels when
evaluating. The scores of tail labels will be set to
0 in order to not predict these ones. This checks how
much tail labels affect final evaluation metrics. Plus
it will evaluate average precision on tail and head
labels only.
--rate RATE, -r RATE If evaluation needs trimming, this parameter specifies
how many labels will be trimmed, decided by cumsum.
Should be a string containing trimming rates split by
comma. Eg '0.1,0.2'. Default '0.1'.
After training all `R` repetitions, running `evaluate.py` provides the following metrics: Precision, nDCG, PSPrecision, PSnDCG and mAP, which are described in XMLrepo. It also logs them into `models/[dataset]/[dataset]_eval.log` (see above).
Example usage:
python src/evaluate.py --model config/model/eurlex.yaml --dataset config/dataset/eurlex.yaml -t all,trim_eval -r 0.1,.2,.3,.4,.5,.6,.7,.8,.9
python src/trim_labels.py -h
usage: trim_labels.py [-h] --dataset DATASET --type TYPE
optional arguments:
-h, --help show this help message and exit
--dataset DATASET, -d DATASET
Dataset name. Initial should be CAPITAL.
--type TYPE, -t TYPE Should be 'cumsum' or 'rank'.
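Example: python src/trim_labels.py --dataset Bibtex --type cumsum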
This script reads the config files in `config/dataset/`, creates 9 more config files under `config/data_trim/$DATASET/`, and creates 9 more dataset text files in `data/$DATASET/`, as follows. These are datasets which retain only the major instances and labels.
data_trim/
├── Bibtex
│ ├── Bibtex_trimcumsum0.1.yaml
│ ├── Bibtex_trimcumsum0.2.yaml
│ ├── ...
│ ├── Bibtex_trimrank0.1.yaml
│ ├── Bibtex_trimrank0.2.yaml
│ ├── ...
data/Bibtex/
├── Bibtex_trimcumsum0.1_meta.json
├── Bibtex_trimcumsum0.1_train.txt
├── Bibtex_trimcumsum0.2_meta.json
├── Bibtex_trimcumsum0.2_train.txt
├── ...
There are two modes for trimming off tail labels: `cumsum` and `rank`. Let r be the trimming rate, N the total number of instances, and L the total number of labels. In `rank` mode, the r*L labels with the fewest instances are cut. In `cumsum` mode, the rarest labels whose instance counts add up to r*N are cut.
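A rough sketch of the two selection rules, assuming per-label instance counts are available as an array (this is an illustration, not the repository's implementation):

```python
import numpy as np

# counts[i] = number of training instances that carry label i (illustrative data).
counts = np.array([500, 300, 120, 40, 25, 10, 5])
r = 0.3
L = len(counts)
N = counts.sum()            # with multi-label data this is really a sum of label occurrences

order = np.argsort(counts)  # labels from fewest to most instances

# rank mode: drop the r*L labels with the fewest instances.
rank_cut = order[: int(r * L)]

# cumsum mode: drop the rarest labels whose instance counts sum up to r*N.
csum = np.cumsum(counts[order])
cumsum_cut = order[csum <= r * N]

print("rank mode trims labels:", rank_cut)
print("cumsum mode trims labels:", cumsum_cut)
```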
After running the trimming script, train and evaluate models as usual using the generated config files.