Code for NGAME: Negative mining-aware mini-batching for extreme classification [1]
+-- <work_dir>
|  +-- programs
|  |  +-- ngame
|  |     +-- ngame
|  +-- data
|  |  +-- <dataset>
|  +-- models
|  +-- results
* Download the raw data (zipped file) from The Extreme Classification Repository [5].
* Extract the zipped file into the data directory.
* The following files should be available in <work_dir>/data/<dataset> (create empty filter files if unavailable):
  - trn.json.gz
  - tst.json.gz
  - lbl.json.gz
  - filter_labels_test.txt
  - filter_labels_train.txt
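The check above can be scripted. The following is a minimal sketch, not part of the NGAME code base: `prepare_raw_dir` is a hypothetical helper that creates any absent filter file as an empty file and reports which raw files are still missing. It assumes the test filter file is named filter_labels_test.txt, the name that run_main.sh expects.

```python
from pathlib import Path

# Raw files that must be downloaded from the repository.
REQUIRED = ("trn.json.gz", "tst.json.gz", "lbl.json.gz")
# Filter files; an empty file is acceptable when none is provided.
# Assumption: the test filter uses the name expected by run_main.sh.
FILTERS = ("filter_labels_test.txt", "filter_labels_train.txt")


def prepare_raw_dir(dataset_dir):
    """Create absent filter files as empty files; return missing raw files."""
    root = Path(dataset_dir)
    for name in FILTERS:
        # touch() creates an empty file and keeps the content of an existing one.
        (root / name).touch(exist_ok=True)
    return [name for name in REQUIRED if not (root / name).is_file()]
```

An empty return value means the directory is ready for tokenization; otherwise the list names the files that still need to be extracted.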
Extract and tokenize data as follows.
./prepare_data.sh LF-AmazonTitles-131K 32
The algorithm can be run as follows. A JSON file (e.g., config/NGAME/LF-AmazonTitles-131K.json) specifies the architecture and other arguments. Please refer to the full documentation below for more details.
./run_main.sh 0 NGAME LF-AmazonTitles-131K 0 108
./prepare_data.sh <dataset> <seq-len>
* dataset
  - Name of the dataset.
  - The tokenizer expects the following files in <work_dir>/data/<dataset>:
    - trn.json.gz
    - tst.json.gz
    - lbl.json.gz
  - It writes the following six tokenized files:
    - trn_doc_input_ids.npy
    - trn_doc_attention_mask.npy
    - tst_doc_input_ids.npy
    - tst_doc_attention_mask.npy
    - lbl_input_ids.npy
    - lbl_attention_mask.npy
* seq-len
  - Sequence length of text to consider while tokenizing:
    - 32 for titles datasets
    - 256 for Wikipedia
    - 128 for other full-text datasets
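The sequence-length guidance above can be expressed as a small lookup. This is only a sketch under an assumption: it guesses the dataset kind from its name ("Titles" or "Wikipedia" appearing in it), which is a naming heuristic and not something the scripts themselves do. Both `suggest_seq_len` and `prepare_command` are hypothetical helpers; pass the length explicitly when the heuristic does not fit.

```python
def suggest_seq_len(dataset):
    """Return the sequence length suggested above, guessed from the name."""
    name = dataset.lower()
    if "titles" in name:
        return 32    # titles datasets
    if "wikipedia" in name:
        return 256   # Wikipedia full text
    return 128       # other full-text datasets


def prepare_command(dataset):
    """Build the argument list for ./prepare_data.sh <dataset> <seq-len>."""
    return ["./prepare_data.sh", dataset, str(suggest_seq_len(dataset))]
```

For LF-AmazonTitles-131K this reproduces the command shown earlier, `./prepare_data.sh LF-AmazonTitles-131K 32`.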
./run_main.sh <gpu_id> <type> <dataset> <version> <seed>
* gpu_id: Run the program on this GPU.
* type
  NGAME builds upon SiameseXML [2] and DeepXML [3] for training. An encoder is trained in M-I and the classifier is trained in M-IV.
  - NGAME: The intermediate representation is not fine-tuned while training the classifier (more scalable; suitable for large datasets).
  - NGAME++: The intermediate representation is fine-tuned while training the classifier (leads to better accuracy on some datasets). #TODO
* dataset
  - Name of the dataset.
  - NGAME expects the following files in <work_dir>/data/<dataset>:
    - trn_doc_input_ids.npy
    - trn_doc_attention_mask.npy
    - trn_X_Y.txt
    - tst_doc_input_ids.npy
    - tst_doc_attention_mask.npy
    - tst_X_Y.txt
    - lbl_input_ids.npy
    - lbl_attention_mask.npy
    - filter_labels_test.txt (provide an empty file, or set it to null in the config, when unavailable)
* version
  - Different runs can be managed by version and seed.
  - Models and results are stored with this argument.
* seed
  - Seed value used by numpy and PyTorch.
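Before launching a run, it can help to confirm that the inputs listed above are present and to assemble the five positional arguments in the documented order. The sketch below is illustrative only: `missing_inputs` and `run_command` are hypothetical helpers, not part of NGAME. The optional filter_labels_test.txt is left out of the check because it may be set to null in the config.

```python
from pathlib import Path

# Input files listed under the dataset argument above.
EXPECTED = (
    "trn_doc_input_ids.npy", "trn_doc_attention_mask.npy", "trn_X_Y.txt",
    "tst_doc_input_ids.npy", "tst_doc_attention_mask.npy", "tst_X_Y.txt",
    "lbl_input_ids.npy", "lbl_attention_mask.npy",
)


def missing_inputs(work_dir, dataset):
    """Return the expected files absent from <work_dir>/data/<dataset>."""
    root = Path(work_dir) / "data" / dataset
    return [name for name in EXPECTED if not (root / name).is_file()]


def run_command(gpu_id, type_, dataset, version, seed):
    """Build: ./run_main.sh <gpu_id> <type> <dataset> <version> <seed>."""
    return ["./run_main.sh", str(gpu_id), type_, dataset,
            str(version), str(seed)]
```

The list returned by `run_command(0, "NGAME", "LF-AmazonTitles-131K", 0, 108)` matches the example invocation shown earlier and can be passed to `subprocess.run` from the directory that contains run_main.sh.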
- Training encoders
- Training classifiers
- Getting embeddings
- Prediction
- Score-fusion
- Other feature encoders
- Non-shared shortlist
- Other negative-samplers
- Tokenizer
- Multi-GPU training
@InProceedings{Dahiya23,
author = "Dahiya, K. and Gupta, N. and Saini, D. and Soni, A. and Wang, Y. and Dave, K. and Jiao, J. and Gururaj, K. and Dey, P. and Singh, A. and Hada, D. and Jain, V. and Paliwal, B. and Mittal, A. and Mehta, S. and Ramjee, R. and Agarwal, S. and Kar, P. and Varma, M.",
title = "NGAME: Negative mining-aware mini-batching for extreme classification",
booktitle = "WSDM",
month = "March",
year = "2023"
}
- SiameseXML: Siamese networks meet extreme classifiers with 100M labels
- DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents
- DECAF: Deep Extreme Classification with Label Features
- ECLARE: Extreme Classification with Label Graph Correlations
- GalaXC: Graph Neural Networks with Labelwise Attention for Extreme Classification
[1] K. Dahiya, N. Gupta, D. Saini, A. Soni, Y. Wang, K. Dave, J. Jiao, K. Gururaj, P. Dey, A. Singh, D. Hada, V. Jain, B. Paliwal, A. Mittal, S. Mehta, R. Ramjee, S. Agarwal, P. Kar and M. Varma. NGAME: Negative mining-aware mini-batching for extreme classification. In WSDM, Singapore, March 2023.
[2] K. Dahiya, A. Agarwal, D. Saini, K. Gururaj, J. Jiao, A. Singh, S. Agarwal, P. Kar and M. Varma. SiameseXML: Siamese networks meet extreme classifiers with 100M labels. In ICML, July 2021.
[3] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal and M. Varma. DeepXML: A deep extreme multi-label learning framework applied to short text documents. In WSDM, 2021.
[4] pyxclib: https://github.com/kunaldahiya/pyxclib
[5] The Extreme Classification Repository: http://manikvarma.org/downloads/XC/XMLRepository.html