In this project, we experimented with various extreme multi-label classification algorithms on a large-scale dataset.
Extreme classification is a multi-label classification problem in which each data point is annotated with its most relevant subset of labels from an extremely large label set. It has applications in areas such as dynamic search advertising, text classification, and recommender systems. The main technical challenges are improving prediction accuracy while reducing training time, prediction time, and model size.
We performed extreme multi-label classification on the EURLex-4K dataset, a collection of documents about European Union law with 3993 categories.
We first applied traditional multi-label algorithms as baselines. We tried two families of traditional methods:
- Problem Transformation (Binary Relevance or Classifier Chains combined with traditional ML algorithms such as RF and KNN)
- Algorithm Adaptation (adapted KNN, SVM, etc.)
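To illustrate the problem-transformation idea, here is a minimal sketch of Binary Relevance with a KNN base learner: each label is treated as an independent binary problem and decided by a majority vote among the nearest training points. The function name `knn_binary_relevance` and its parameters are hypothetical, chosen for illustration; this is not the project's actual implementation.

```python
def knn_binary_relevance(X_train, Y_train, x, k=3):
    """Predict a 0/1 label vector for x, one independent KNN vote per label.

    X_train: list of feature vectors; Y_train: list of 0/1 label vectors.
    """
    # Rank training points by squared Euclidean distance to the query.
    order = sorted(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
    )
    nearest = order[:k]
    n_labels = len(Y_train[0])
    # Binary Relevance: every label is predicted on its own, ignoring the others.
    return [
        int(2 * sum(Y_train[i][j] for i in nearest) > k)
        for j in range(n_labels)
    ]


X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
Y_train = [[1, 0, 0], [1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 1, 0]]
print(knn_binary_relevance(X_train, Y_train, [0.2, 0.1]))
```

Because every label is predicted separately, this approach ignores label correlations. A Classifier Chain addresses that by feeding earlier label predictions to later classifiers as extra features.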
We further implemented the embedding-based models Principal Label Space Transformation (PLST) and Sparse Local Embeddings for Extreme Multi-label Classification (SLEEC), and we modified the existing algorithms to improve them.
Finally, we focused on Partitioned Label Trees (Parabel), one of the leading one-vs-all based extreme classifiers.
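As a rough illustration of the embedding-based approach, the following sketch shows the core of PLST using NumPy: project the centered label matrix onto its top `k` principal directions, regress the compressed labels on the features, and map predictions back to the full label space. The function names, the ridge regularizer, and the omission of a bias term are assumptions made for brevity, not details taken from the project.

```python
import numpy as np


def plst_fit(X, Y, k, ridge=1e-3):
    """Fit a PLST-style model: SVD label compression plus ridge regression."""
    y_mean = Y.mean(axis=0)
    # Rows of Vt are the principal directions of the centered label matrix.
    _, _, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
    V = Vt[:k].T                  # (n_labels, k) projection matrix
    Z = (Y - y_mean) @ V          # (n_samples, k) compressed labels
    # Ridge regression from features to the compressed label space (no bias term).
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Z)
    return W, V, y_mean


def plst_predict(X, W, V, y_mean):
    """Return real-valued label scores; threshold or rank them as needed."""
    return X @ W @ V.T + y_mean
```

The returned scores can be thresholded at 0.5 for hard predictions or used directly for ranking-based metrics.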
We used label ranking average precision (LRAP) as our evaluation metric to assess label ranking performance.
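For reference, LRAP can be computed as in the sketch below: for each relevant label of a sample, take the fraction of labels ranked at or above it that are also relevant, then average over relevant labels and over samples. The function name `lrap` is hypothetical. The handling of ties and of samples with no relevant labels (or only relevant labels), which score 1.0, follows the convention of scikit-learn's `label_ranking_average_precision_score`; whether the project used that exact convention is an assumption.

```python
def lrap(y_true, y_score):
    """Label ranking average precision.

    y_true: list of 0/1 label lists; y_score: list of score lists, same shape.
    """
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t]
        # Degenerate samples (no or all labels relevant) count as perfect.
        if not relevant or len(relevant) == len(truth):
            total += 1.0
            continue
        sample = 0.0
        for j in relevant:
            # Rank of label j: number of labels scored at least as high.
            rank = sum(1 for s in scores if s >= scores[j])
            # Relevant labels scored at least as high as label j.
            hits = sum(1 for r in relevant if scores[r] >= scores[j])
            sample += hits / rank
        total += sample / len(relevant)
    return total / len(y_true)


y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
print(lrap(y_true, y_score))
```

A score of 1.0 means every relevant label is ranked above every irrelevant one for all samples.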
We also recorded training times to evaluate model efficiency.
The results show that Parabel achieves the highest LRAP score as well as the shortest training time among all the algorithms we experimented with.
- Man Jin (mj1637@nyu.edu)
- Florence Denglin Jiang (florence.jiang@nyu.edu)
- Hong Gong (hg1153@nyu.edu)
- Jacqueline Yuwei Wang (yw1854@nyu.edu)
- Yi Xu (yx2090@nyu.edu)