LibMultiLabel is a simple tool with the following functionalities.
- end-to-end services from raw texts to final evaluation/analysis
- support of common network architectures for multi-label text classification
This is an on-going development so many improvements are still being made. Comments are very welcome.
- Python: 3.6+
- CUDA: 10.2 (if GPU used)
- Pytorch 1.8+
If you have a different version of CUDA, go to the website for the detail of PyTorch installation.
To install the latest development version, run:
pip3 install -r requirements.txt
- Create a data sub-directory within
LibMultiLabel
and go to this sub-directory.
mkdir -p data/rcv1
cd data/rcv1
- Download the
rcv1
dataset from LIBSVM data sets by the following commands.
wget -O train.txt.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_train.txt.bz2
wget -O test.txt.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_test.txt.bz2
- Uncompress data files and change the directory back to
LibMultiLabel
.
bzip2 -d *.bz2
cd ../..
Train a cnn model and predict the test set by an example config. Use --cpu
to run the program on the cpu.
python3 main.py --config example_config/rcv1/kim_cnn.yml
The LibMultiLabel toolkit uses a yaml file to configure the dataset and the training process. At a high level, the config file is split into four parts:
-
data: the data config consists of data paths that tell where to place the datasets, pre-trained word embeddings, and vocabularies.
-
model: the model config defines the parameters that are related to the network definition (i.e., model name).
-
train: the train config specifies the hyperparameters (i.e., batch size, learning rate, etc.) used when training a model.
-
eval: the eval config decides metrics monitored or reported during the evaluation process.
For more information, run --help
or check the example configs.
python3 main.py --help
Each parameter can also be specified through the command line.
Put labels and texts in the training, validation, and test set separately in train.txt
, valid.txt
, and test.txt
, or specifying the path by the arguments train_path
, valid_path
, and test_path
. If validation set is not provided, then the program internally splits the training set to two parts for training and validation. To specify the size of the validation set, use val_size
in one of the following two ways:
- a ratio in [0,1] to indicate the percentage of training data allocated for validation
- an integer to indicate the number of training data allocated for validation
- one sample per line
- seperate ID, labels and texts by
<TAB>
(the ID column is optional) - labels are split by spaces
With ID column:
2286<TAB>E11 ECAT M11 M12 MCAT<TAB>recov recov recov recov excit ...
2287<TAB>C24 CCAT<TAB>uruguay uruguay compan compan compan ...
Without ID column:
E11 ECAT M11 M12 MCAT<TAB>recov recov recov recov excit ...
C24 CCAT<TAB>uruguay uruguay compan compan compan ...
In the test set, the labels are used to calculate accuracy or errors. If it's unknown, any string (even an empty one) is fine. For example,
With ID column:
2286<TAB><TAB>recov recov recov recov excit ...
2287<TAB><TAB>uruguay uruguay compan compan compan ...
Without ID column:
<TAB>recov recov recov recov excit ...
<TAB>uruguay uruguay compan compan compan ...
In the training procedure, you can build a model from scratch or start from some pre-obtained information.
python3 main.py --config CONFIG_PATH [--load_checkpoint CHECKPOINT_PATH] \
[--embed_file EMBED_NAME_OR_EMBED_PATH] [--vocab_file VOCAB_CSV_PATH]
- config: configure parameters in a yaml file. See the section Usage.
If a model was trained before by this package, the training procedure can start with it.
- load_checkpoint: specify the path to a pre-trained model.
To use your own word embeddings or vocabulary set, specify the following parameters:
- embed_file: choose one of the pretrained embeddings defined in torchtext or specify the path to your word embeddings with each line containing a word followed by its vectors. Example:
team -0.17901678383350372 1.2405720949172974 ... language 0.8667483925819397 5.001194953918457 ...
- vocab_file: set the file path to a predefined vocabulary set that contains lines of words.
For the validation process in the training procedure, you can evaluate the model with a set of evaluation metrics. Set monitor_metrics
to define what you want to print on the screen. The argument val_metric
is the metric for picking the best model. Example:
monitor_metrics: [P@1, P@3, P@5]
val_metric: P@1
If test_path
is specified or DATA_DIR/test.txt
exists, the model with the highest val_metric
will be evaluated after training.
To deploy/evaluate a model (i.e., a pre-obtained checkpoint), you can predict a test set by the following command.
python3 main.py --eval --config CONFIG_PATH --load_checkpoint CHECKPOINT_PATH --test_path TEST_DATA_PATH --save_k_predictions K --predict_out_path PREDICT_OUT_PATH
- Use
--save_k_predictions
to save the top K predictions for each instance in the test set. K=100 if not specified. - Use
--predict_out_path
to specify the file for storing the predicted top-K labels/scores.