Deep Learning for Multi-Label Text Classification

This repository is my research project and also a study of TensorFlow and deep learning models (fastText, CNN, LSTM, etc.).

The main objective of the project is to solve the multi-label text classification problem with deep neural networks. Because a record can belong to several classes at once, each label is a multi-hot vector such as [0, 1, 0, ..., 1, 1], with one position per class.
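As a minimal sketch of that label format, the snippet below turns a list of label indices into a 0/1 vector. The function name `to_multi_hot` and the class count used in the example are illustrative assumptions, not names taken from this repository.

```python
from typing import List


def to_multi_hot(label_indices: List[int], num_classes: int) -> List[int]:
    """Return a 0/1 vector with ones at the given label positions."""
    vector = [0] * num_classes
    for index in label_indices:
        if not 0 <= index < num_classes:
            raise ValueError(f"label index {index} is outside [0, {num_classes})")
        vector[index] = 1
    return vector


# Example with a small, made-up label space of 6 classes.
print(to_multi_hot([1, 4, 5], num_classes=6))  # [0, 1, 0, 0, 1, 1]
```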

Requirements

Python 3.6
TensorFlow 1.15.0
TensorBoard 1.15.0
scikit-learn 0.19.1
NumPy 1.16.2
Gensim 3.8.3
tqdm 4.49.0

Project

The project structure is below:

.
├── Model
│   ├── test_model.py
│   ├── text_model.py
│   └── train_model.py
├── data
│   ├── word2vec_100.model.* [Need Download]
│   ├── Test_sample.json
│   ├── Train_sample.json
│   └── Validation_sample.json
├── utils
│   ├── checkmate.py
│   ├── data_helpers.py
│   └── param_parser.py
├── LICENSE
├── README.md
└── requirements.txt

Innovation

Data part

Support both Chinese and English data (segment with jieba or nltk).
Support your own pre-trained word vectors (load them with gensim).
Add embedding visualization in TensorBoard (you need to create metadata.tsv first).
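The metadata.tsv file mentioned above can be as simple as one token per line, in the same order as the rows of the embedding matrix (a single-column file needs no header row). The sketch below assumes that layout; the helper name `write_metadata` and the tiny vocabulary are hypothetical and not part of this repository.

```python
import os
import tempfile
from typing import Iterable


def write_metadata(words: Iterable[str], path: str) -> int:
    """Write one token per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for word in words:
            # Tabs and newlines would break the TSV layout, so replace them.
            handle.write(word.replace("\t", " ").replace("\n", " ") + "\n")
            count += 1
    return count


# Hypothetical vocabulary, listed in embedding-row order.
vocab = ["pressure", "meter", "force"]
metadata_path = os.path.join(tempfile.mkdtemp(), "metadata.tsv")
print(write_metadata(vocab, metadata_path))
```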

Model part

Add the correct L2 loss calculation operation.
Add gradient clipping to prevent exploding gradients.
Add exponential learning rate decay.
Add a new Highway Layer (which is useful, judging by the model performance).
Add Batch Normalization Layer.
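To make two of the items above concrete, here is a plain-Python sketch of the arithmetic behind exponential learning rate decay and gradient clipping. It is not the repository's TensorFlow code. It assumes the common formulation lr * decay_rate ** (step / decay_steps) and clipping by global norm; which clipping variant the models actually use is an assumption, and all parameter values are made up.

```python
import math
from typing import List


def exponential_decay(base_lr: float, step: int, decay_steps: int,
                      decay_rate: float, staircase: bool = False) -> float:
    """Decay the learning rate by decay_rate every decay_steps steps."""
    # With staircase=True the rate drops in discrete jumps instead of smoothly.
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent


def clip_by_global_norm(grads: List[List[float]], clip_norm: float) -> List[List[float]]:
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale is 1.0 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads]


print(exponential_decay(0.001, step=1000, decay_steps=500, decay_rate=0.5))
print(clip_by_global_norm([[3.0, 4.0]], clip_norm=1.0))
```

Clipping by global norm keeps the direction of the overall update and only shrinks its length, which is why it is a common choice for recurrent models.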

Code part

Can choose to train the model directly or restore it from a checkpoint in train.py.
Can predict the labels via threshold and top-K in train.py and test.py.
Can calculate the evaluation metrics AUC and AUPRC.
Can create a prediction file in test.py that includes the predicted values and predicted labels for the test set.
Add other useful data preprocessing functions in data_helpers.py.
Use logging to record run information (parameter display, model training info, etc.).
Provide the ability to save the best n checkpoints in checkmate.py, whereas the tf.train.Saver can only save the last n checkpoints.
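The two prediction strategies in the list above (threshold and top-K) can be sketched as follows. This is an illustration under assumptions, not the code in train.py or test.py: the function names, the `>=` comparison, and the default threshold of 0.5 are all hypothetical.

```python
from typing import List


def predict_by_threshold(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of all classes whose score reaches the threshold."""
    return [index for index, score in enumerate(scores) if score >= threshold]


def predict_top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring classes, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda index: scores[index], reverse=True)
    return sorted(ranked[:k])


scores = [0.1, 0.8, 0.3, 0.6]  # made-up per-class scores for one record
print(predict_by_threshold(scores))  # [1, 3]
print(predict_top_k(scores, k=1))    # [1]
```

A threshold can return any number of labels (including none), while top-K always returns exactly K, so the better choice depends on how much the number of labels per record varies.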

Data

See the data format in the /data folder, which includes the data sample files. For example:

{"testid": "3935745", "features_content": ["pore", "water", "pressure", "metering", "device", "incorporating", "pressure", "meter", "force", "meter", "influenced", "pressure", "meter", "device", "includes", "power", "member", "arranged", "control", "pressure", "exerted", "pressure", "meter", "force", "meter", "applying", "overriding", "force", "pressure", "meter", "stop", "influence", "force", "meter", "removing", "overriding", "force", "pressure", "meter", "influence", "force", "meter", "resumed"], "labels_index": [526, 534, 411], "labels_num": 3}

"testid": just the id.
"features_content": the segmented words (after removing the stopwords).
"labels_index": the label indices of the data record.
"labels_num": The number of labels.
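A record in this format can be read and sanity-checked with the standard library. The sketch below assumes one JSON object per line; the helper name `parse_record` is hypothetical, and the sample line is an abbreviated version of the record shown above.

```python
import json

# Abbreviated copy of the sample record (features_content shortened).
sample_line = (
    '{"testid": "3935745", "features_content": ["pore", "water", "pressure"], '
    '"labels_index": [526, 534, 411], "labels_num": 3}'
)


def parse_record(line: str) -> dict:
    """Parse one JSON record and check that its fields are consistent."""
    record = json.loads(line)
    required = ("testid", "features_content", "labels_index", "labels_num")
    missing = [key for key in required if key not in record]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    if record["labels_num"] != len(record["labels_index"]):
        raise ValueError("labels_num does not match the length of labels_index")
    return record


record = parse_record(sample_line)
print(record["testid"], record["labels_index"])
```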

Text Segment

You can use the nltk package if you are going to deal with English text data.
You can use the jieba package if you are going to deal with Chinese text data.

Data Format

This repository can be used with other text classification datasets in two ways:

Convert your dataset into the same format as the sample.
Modify the data preprocessing code in data_helpers.py.

Which way is better depends on your data and task.
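For the first option, converting your own records into the sample format could look like the sketch below. The helper `to_sample_record`, the label names, and the label-to-index mapping are hypothetical; the tokens are assumed to be segmented and stopword-filtered already.

```python
import json
from typing import Dict, List


def to_sample_record(doc_id: str, tokens: List[str], label_names: List[str],
                     label_to_index: Dict[str, int]) -> str:
    """Serialize one document as a JSON line with the same fields as the sample."""
    indices = [label_to_index[name] for name in label_names]
    record = {
        "testid": str(doc_id),
        "features_content": tokens,
        "labels_index": indices,
        "labels_num": len(indices),
    }
    # ensure_ascii=False keeps Chinese tokens readable in the output file.
    return json.dumps(record, ensure_ascii=False)


# Made-up label vocabulary and document.
label_to_index = {"hydraulics": 0, "optics": 1, "sensors": 2}
line = to_sample_record("42", ["pressure", "meter"], ["hydraulics", "sensors"], label_to_index)
print(line)
```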

🤔 Before you open a new issue about the data format, please check data_sample.json and read the other open issues first, because someone may have already asked the same question. For example: