eXtreme Multi-label Text Classification with BERT
This is a README for the experimental code in our paper
Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, Inderjit Dhillon
Preprint 2019
Installation
Requirements
- conda
- python=3.6
- cuda=9.0
- Pytorch=0.4.1
- pytorch-pretrained-BERT=0.6.2
- allenlp=0.8.4
Depedencies via Conda Environment
> conda create -n xbert-env python=3.6
> source activate xbert-env
> (xbert-env) conda install scikit-learn
> (xbert-env) conda install pytorch=0.4.1 cuda90 -c pytorch
> (xbert-env) pip install pytorch-pretrained-bert==0.6.2
> (xbert-env) pip install allennlp==0.8.4
> (xbert-env) pip install -e .
**Warning: you need to install pytorch=0.4.1 based on the cuda version on your machine.
**Notice: the following examples are executed under the > (xbert-env)
conda virtual environment
Reproduce Evaulation Results in the Paper
We demonstrate how to reproduce the evaluation results in our paper by downloading the raw dataset and pretrained models.
Download Dataset (Eurlex-4K, Wiki10-31K, AmazonCat-13K, Wiki-500K)
Change directory into ./datasets folder, download and unzip each dataset
cd ./datasets
bash download-data.sh Eurlex-4K
bash download-data.sh Wiki10-31K
bash download-data.sh AmazonCat-13K
bash download-data.sh Wiki-500K
cd ../
Each dataset contains the following files
X.trn.npz, X.val.npz, X.tst.npz
: data tf-idf sparse matrixY.trn.npz, Y.val.npz, Y.tst.npz
: label sparse matrixL.elmo.npz, L.pifa.npz
: label embedding matrixmlc2seq/{train,valid.test}.txt
: each line is label_ids \tab raw_textmlc2seq/label_vocab.txt
: each line is label_count \tab label_text
Download Pretrained Models (Indexing codes, matcher models and ranker models)
Change directory into ./pretrained_models folder, download and unzip models for each dataset
cd ./pretrained_models
bash download-model.sh Eurlex-4K
bash download-model.sh Wiki10-31K
bash download-model.sh AmazonCat-13K
bash download-model.sh Wiki-500K
cd ../
Prediction and Evaluation Pipeline
load indexing codes, generate predicted codes from pretrained matchers, and predict labels from pretrained rankers.
export DATASETS=Eurlex-4K
bash scripts/run_linear_eval.sh ${DATASETS}
bash scripts/run_xbert_eval.sh ${DATASETS}
bash scripts/run_xttention_eval.sh ${DATASETS}
DATASETS
: the dataset name such as Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K.
Ensemble prediction and Evaluation
python -m xbert.evaluator \
-y [path to Y.tst.npz] \
-e prediction-path [prediction-path ... ]
For example, given the ranker prediction files (tst.pred.xbert.npz),
python -m xbert.evaluator \
-y datasets/Eurlex-4K/Y.tst.npz \
-e pretrained_models/Eurlex-4K/*/ranker/tst.pred.xbert.npz
which computes the metric for the X-BERT ensemble of the label_emb={elmo,pifa} and seed={0,1,2} combinations.
Pipeline for running X-BERT on a new dataset
Generate label embedding
We support ELMo and PIFA label embedding given the file label_vocab.txt.
cd ./datasets/
python label_embedding.py --dataset ${DATASET} --embed-type ${LABEL_EMB}
cd ../
DATASETS
: the customized dataset name which contains the necessary files as described in [download dataset section]LABEL_EMB
: currently support either elmo or pifa
Semantic Indexing and Linear Ranker
Before training deep neural matcher, we first obtain indexed label codes and linear ranker.
The following example assume to have a similar structure as the pretrained_models
folder.
Semantic Label Indexing
An example usage would be:
OUTPUT_DIR=save_models/${DATASET}/${LABEL_EMB}-a${ALGO}-s${SEED}
mkdir -p ${OUTPUT_DIR}/indexer
python -m xbert.indexer \
-i datasets/${DATASET}/L.${LABEL_EMB}.npz \
-o ${OUTPUT_DIR}/indexer \
-d ${DEPTH} --algo ${ALGO} --seed ${SEED} \
--max-iter 20
ALGO
: clustering algorithm. 0 for KMEANS, 5 for SKMEANSDEPTH
: The depth of hierarchical 2-meansSEED
: random seed
Linear Ranker training
An example usage would be:
OUTPUT_DIR=save_models/${DATASET}/${LABEL_EMB}-a${ALGO}-s${SEED}
mkdir -p $OUTPUT_DIR/ranker
python -m xbert.ranker train \
-x datasets/${DATASET}/X.trn.npz \
-y datasets/${DATASET}/Y.trn.npz \
-c ${OUTPUT_DIR}/indexer/code.npz \
-o ${OUTPUT_DIR}/ranker
Linear Ranker Prediction
An example usage would be:
OUTPUT_DIR=save_models/${DATASET}/${LABEL_EMB}-a${ALGO}-s${SEED}
mkdir -p $OUTPUT_DIR/ranker
python -m xbert.ranker predict \
-m ${OUTPUT_DIR}/ranker \
-x datasets/${DATASET}/X.tst.npz \
-y datasets/${DATASET}/Y.tst.npz \
-c ${OUTPUT_DIR}/matcher/${MATCHER}/C_eval_pred.npz \
-o ${OUTPUT_DIR}/ranker/tst.prediction.npz
Neural Matching via XBERT or Xttention
Create Data Binary as Preprocessing
Before training, we need to generate preprocessed data as binary pickle files.
OUTPUT_DIR=save_models/${DATASET}/${LABEL_EMB}-a${ALGO}-s${SEED}
mkdir -p $OUTPUT_DIR/data-bin-${MATCHER}
CUDA_VISIBLE_DEVICES=GPUS python -m xbert.preprocess \
-m ${MATCHER} \
-i datasets/${DATASET} \
-c ${OUTPUT_DIR}/indexer/code.npz \
-o ${OUTPUT_DIR}/data-bin-${MATCHER}
-GPUS
: the available gpu_id
-MATCHER
: currently support xttention
or xbert
Training XBERT
Set hyper-parameters properly, an example would be
GPUS=0,1,2,3,4,5
MATCHER=xbert
TRAIN_BATCH_SIZE=36
EVAL_BATCH_SIZE=64
LOG_INTERVAL=1000
EVAL_INTERVAL=10000
NUM_TRAIN_EPOCHS=12
LEARNING_RATE=5e-5
WARMUP_RATE=0.1
Users can also check scripts/run_xbert.sh
to see the detailed setting for each datasets used in the paper.
We are now ready to run the xbert models:
OUTPUT_DIR=save_models/${DATASET}/${LABEL_EMB}-a${ALGO}-s${SEED}
mkdir -p ${OUTPUT_DIR}/matcher/${MATCHER}
CUDA_VISIBLE_DEVICES=${GPUS} python -u -m xbert.matcher.bert \
-i ${OUTPUT_DIR}/data-bin-${MATCHER}/data_dict.pt \
-o ${OUTPUT_DIR}/matcher/${MATCHER} \
--bert_model bert-base-uncased \
--do_train --do_eval --stop_by_dev \
--learning_rate ${LEARNING_RATE} \
--warmup_proportion ${WARMUP_RATE} \
--train_batch_size ${TRAIN_BATCH_SIZE} \
--eval_batch_size ${EVAL_BATCH_SIZE} \
--num_train_epochs ${NUM_TRAIN_EPOCHS} \
--log_interval ${LOG_INTERVAL} \
--eval_interval ${EVAL_INTERVAL} \
> |& tee ${OUTPUT_DIR}/matcher/${MATCHER}.log
Training Xttention
Set hyper-parameters properly, an example would be
GPUS=0
MATCHER=xttention
TRAIN_BATCH_SIZE=128
LOG_INTERVAL=100
EVAL_INTERVAL=1000
NUM_TRAIN_EPOCHS=10
Users can also check scripts/run_xttention.sh
to see the detailed setting for each datasets used in the paper.
We are now ready to run the xttention models:
OUTPUT_DIR=save_models/${DATASET}/${LABEL_EMB}-a${ALGO}-s${SEED}
mkdir -p ${OUTPUT_DIR}/matcher/${MATCHER}
CUDA_VISIBLE_DEVICES=${GPUS} python -u -m xbert.matcher.attention \
-i ${OUTPUT_DIR}/data-bin-${MATCHER}/data_dict.pt \
-o ${OUTPUT_DIR}/matcher/${MATCHER} \
--do_train --do_eval --cuda --stop_by_dev \
--train_batch_size ${TRAIN_BATCH_SIZE} \
--num_train_epochs ${NUM_TRAIN_EPOCHS} \
--log_interval ${LOG_INTERVAL} \
--eval_interval ${EVAL_INTERVAL} \
> |& tee ${OUTPUT_DIR}/matcher/${MATCHER}.log
Acknowledge
Some portions of this repo is borrowed from the following repos: