This project performs multi-label text classification with BERT (Bidirectional Encoder Representations from Transformers). Each article's title and abstract are combined into a single input, and the model predicts every category the article belongs to, since one article can carry several categories at once.
The dataset contains article titles and abstracts together with the categories assigned to each article. Categories that occur too rarely are dropped so that the model is trained only on categories with enough examples.
Ensure you have the following libraries installed:
- pandas
- matplotlib
- numpy
- torch
- transformers
- scikit-learn
First, the dataset is loaded and a preview of the first few rows is displayed to understand its structure.
A bar chart is plotted to show the frequency of each category. Categories with low frequencies are dropped to ensure the model focuses on significant ones. Articles that do not belong to any of the remaining categories are also removed.
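The filtering step could be sketched as follows. This is a minimal illustration, assuming the categories are stored as one 0/1 column per category; the function name `drop_rare_categories` and the `min_count` cutoff are hypothetical and not taken from the project.

```python
import pandas as pd


def drop_rare_categories(df, label_cols, min_count):
    """Drop label columns with fewer than `min_count` positives,
    then drop rows left without any remaining label.

    Assumes each label column holds 0/1 indicators (an assumption,
    not confirmed by the project description).
    """
    counts = df[label_cols].sum()  # number of articles per category
    kept = [c for c in label_cols if counts[c] >= min_count]
    dropped = [c for c in label_cols if c not in kept]

    out = df.drop(columns=dropped)
    # Keep only articles that still belong to at least one category.
    out = out[out[kept].sum(axis=1) > 0].reset_index(drop=True)
    return out, kept
```

The same `counts` series is what a frequency bar chart would show, for example via `counts.sort_values().plot(kind="bar")`.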
The title and abstract of each article are combined into a single string to increase the amount of information available for classification. The original title and abstract columns are then dropped.
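A minimal sketch of this step is shown here. The column names `TITLE`, `ABSTRACT`, and `text`, and the choice of `". "` as a separator, are assumptions made for illustration.

```python
import pandas as pd


def combine_text(df, title_col="TITLE", abstract_col="ABSTRACT", text_col="text"):
    """Join title and abstract into one text column and drop the originals.

    Column names and the ". " separator are illustrative assumptions.
    """
    out = df.copy()
    title = out[title_col].fillna("").str.strip()
    abstract = out[abstract_col].fillna("").str.strip()
    out[text_col] = title + ". " + abstract
    return out.drop(columns=[title_col, abstract_col])
```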
The dataset is split into training, validation, and test sets to allow for proper training and evaluation of the model.
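One way to obtain three disjoint sets is to call scikit-learn's `train_test_split` twice. The sketch below assumes an 80/10/10 split and a fixed seed; the project's actual proportions are not stated, so these values are placeholders.

```python
from sklearn.model_selection import train_test_split


def three_way_split(data, val_size=0.1, test_size=0.1, seed=42):
    """Split `data` into train, validation, and test parts.

    The fractions and seed are illustrative assumptions. Sizes are
    converted to absolute counts so both fractions refer to the full
    dataset rather than to what remains after the first split.
    """
    n = len(data)
    n_test = int(round(test_size * n))
    n_val = int(round(val_size * n))

    train_val, test = train_test_split(data, test_size=n_test, random_state=seed)
    train, val = train_test_split(train_val, test_size=n_val, random_state=seed)
    return train, val, test
```

The function accepts anything `train_test_split` can index, such as a list or a pandas DataFrame.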
Several hyperparameters are defined, including the maximum length of token sequences, batch sizes for training, validation, and testing, the number of epochs, learning rate, and the threshold for classification.
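Grouping these settings in one place keeps the rest of the code free of magic numbers. The sketch below uses a frozen dataclass; every value in it is a placeholder chosen for illustration, not the configuration used in this project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperParams:
    """Training settings. All defaults are illustrative placeholders."""
    max_len: int = 256            # maximum token sequence length
    train_batch_size: int = 32
    valid_batch_size: int = 32
    test_batch_size: int = 32
    epochs: int = 3
    learning_rate: float = 2e-5
    threshold: float = 0.5        # probability cutoff for assigning a label


hp = HyperParams()
```

Individual values can be overridden at construction time, for example `HyperParams(epochs=5)`.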
The BERT tokenizer is initialized to convert text into tokens that can be fed into the BERT model.
A custom dataset class is created to encode the text and prepare the inputs for the BERT model. The class tokenizes each text and pads or truncates it to the specified maximum length.
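A dataset class along these lines might look like the sketch below. The class name `ArticleDataset` and the dictionary keys are assumptions; the tokenizer is passed in and is expected to behave like a Hugging Face tokenizer (for instance one created with `BertTokenizer.from_pretrained(...)`).

```python
import torch
from torch.utils.data import Dataset


class ArticleDataset(Dataset):
    """Encodes one text per item and pairs it with its label vector.

    `tokenizer` is assumed to follow the Hugging Face tokenizer call
    interface; `labels[idx]` is assumed to be a 0/1 vector.
    """

    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = list(texts)
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding="max_length",   # pad short texts up to max_len
            truncation=True,        # cut long texts down to max_len
            return_token_type_ids=True,
            return_attention_mask=True,
            return_tensors="pt",
        )
        # The tokenizer returns tensors of shape (1, max_len); drop the
        # leading dimension so a DataLoader can batch the items itself.
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "token_type_ids": enc["token_type_ids"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.float),
        }
```

Labels are stored as floats because multi-label losses such as binary cross-entropy expect float targets.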
A custom BERT-based model class is defined. This class includes a BERT model with a dropout layer and a linear layer for classification. The forward method specifies how the input data passes through the model to produce output predictions.
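The architecture described above can be sketched as a small PyTorch module. The class name, the dropout rate of 0.3, and the use of the encoder's pooled output are assumptions; the encoder is injected so that any module exposing `pooler_output` (such as a `transformers.BertModel`) can be used.

```python
import torch
from torch import nn


class BertMultiLabelClassifier(nn.Module):
    """Encoder -> dropout -> linear layer, one logit per category.

    `encoder` is assumed to return an object with a `pooler_output`
    attribute of shape (batch, hidden_size), as `transformers.BertModel`
    does. The dropout rate is an illustrative default.
    """

    def __init__(self, encoder, hidden_size, num_labels, dropout=0.3):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        pooled = self.dropout(outputs.pooler_output)
        # Raw logits are returned; a sigmoid is applied later, either
        # inside the loss function or at prediction time.
        return self.classifier(pooled)
```

With Hugging Face Transformers, the encoder could be created with `BertModel.from_pretrained("bert-base-uncased")` (the checkpoint name is an assumption) and `hidden_size` read from `encoder.config.hidden_size`.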
Functions are defined for training and evaluating the model. The training function runs the forward and backward passes, clips gradients, and updates the model parameters. The evaluation function measures the model's performance on the validation set without updating any parameters.
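A pair of functions of this kind might look like the sketch below. Several details are assumptions rather than facts about this project: the batch keys, a loss that takes raw logits (for example `nn.BCEWithLogitsLoss`), a clipping norm of 1.0, and accuracy measured per individual label decision.

```python
import torch
from torch import nn


def _run_batch(model, batch, loss_fn, device, threshold):
    """Forward pass for one batch; returns loss, correct count, total count."""
    inputs = {
        key: batch[key].to(device)
        for key in ("input_ids", "attention_mask", "token_type_ids")
    }
    labels = batch["labels"].to(device)
    logits = model(**inputs)
    loss = loss_fn(logits, labels)
    preds = (torch.sigmoid(logits) >= threshold).float()
    # Element-wise accuracy: each (article, category) decision counts once.
    correct = (preds == labels).sum().item()
    return loss, correct, labels.numel()


def train_epoch(model, loader, loss_fn, optimizer, device,
                threshold=0.5, max_grad_norm=1.0):
    """Train for one pass over `loader`; returns (accuracy, mean loss)."""
    model.train()
    total_loss, correct, count = 0.0, 0, 0
    for batch in loader:
        optimizer.zero_grad()
        loss, c, n = _run_batch(model, batch, loss_fn, device, threshold)
        loss.backward()
        # Clip the gradient norm to limit the size of each update.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
        optimizer.step()
        total_loss += loss.item()
        correct += c
        count += n
    return correct / count, total_loss / len(loader)


def eval_epoch(model, loader, loss_fn, device, threshold=0.5):
    """Evaluate without gradient tracking; returns (accuracy, mean loss)."""
    model.eval()
    total_loss, correct, count = 0.0, 0, 0
    with torch.no_grad():
        for batch in loader:
            loss, c, n = _run_batch(model, batch, loss_fn, device, threshold)
            total_loss += loss.item()
            correct += c
            count += n
    return correct / count, total_loss / len(loader)
```

Calling `model.eval()` also disables dropout, so evaluation results are deterministic for a given set of weights.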
A training loop iterates over a specified number of epochs. In each epoch, the model is trained on the training set and evaluated on the validation set. The accuracy and loss for both training and validation are tracked. The best model (based on validation accuracy) is saved.
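The loop and the best-model checkpointing can be sketched as follows. The function `fit` and its `train_step` and `eval_step` arguments are hypothetical: each is assumed to be a callable that takes the model and returns an `(accuracy, loss)` pair.

```python
import torch


def fit(model, epochs, train_step, eval_step, checkpoint_path):
    """Run `epochs` rounds of training and validation.

    `train_step(model)` and `eval_step(model)` are hypothetical callables
    returning (accuracy, loss). The weights are saved whenever validation
    accuracy improves, so the file always holds the best model so far.
    """
    history = {"train_acc": [], "train_loss": [], "val_acc": [], "val_loss": []}
    best_val_acc = float("-inf")

    for _ in range(epochs):
        train_acc, train_loss = train_step(model)
        val_acc, val_loss = eval_step(model)

        history["train_acc"].append(train_acc)
        history["train_loss"].append(train_loss)
        history["val_acc"].append(val_acc)
        history["val_loss"].append(val_loss)

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), checkpoint_path)

    return history, best_val_acc
```

The saved weights can later be restored with `model.load_state_dict(torch.load(checkpoint_path))`.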
The best model, saved during training, is loaded for evaluation on the test set.
The model's accuracy and loss are calculated on the test set to assess its performance.
A classification report is generated using the test set to provide detailed metrics, such as precision, recall, and F1-score, for each category.
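scikit-learn's `classification_report` accepts multi-label indicator arrays directly, so the report can be produced from thresholded probabilities. In the sketch below the function name, the 0.5 threshold, and the `as_dict` switch are assumptions added for illustration.

```python
import numpy as np
from sklearn.metrics import classification_report


def multilabel_report(y_true, probs, label_names, threshold=0.5, as_dict=False):
    """Per-category precision, recall, and F1 for multi-label predictions.

    `y_true` is a 0/1 array of shape (n_samples, n_labels) and `probs`
    holds predicted probabilities of the same shape.
    """
    y_pred = (np.asarray(probs) >= threshold).astype(int)
    return classification_report(
        np.asarray(y_true),
        y_pred,
        target_names=label_names,
        zero_division=0,      # report 0 instead of warning on empty classes
        output_dict=as_dict,
    )
```

By default the function returns the formatted text table; passing `as_dict=True` returns the same numbers as a nested dictionary keyed by category name.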
The model is tested on new, unseen data to demonstrate its ability to classify new articles. The text is tokenized, fed into the model, and the predicted categories are displayed.
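Prediction on a single new text could be sketched as below. The function name and argument list are hypothetical; the tokenizer is assumed to follow the Hugging Face call interface, and the model is assumed to return one raw logit per category.

```python
import torch


def predict_categories(text, tokenizer, model, label_names, max_len,
                       threshold=0.5, device="cpu"):
    """Return the names of all categories whose probability >= threshold."""
    model.eval()
    enc = tokenizer(
        text,
        max_length=max_len,
        padding="max_length",
        truncation=True,
        return_token_type_ids=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(
            input_ids=enc["input_ids"].to(device),
            attention_mask=enc["attention_mask"].to(device),
            token_type_ids=enc["token_type_ids"].to(device),
        )
    # A sigmoid per label (not a softmax) lets several categories
    # be selected for the same article.
    probs = torch.sigmoid(logits).squeeze(0)
    return [name for name, p in zip(label_names, probs.tolist()) if p >= threshold]
```

An empty list means that no category reached the threshold for the given text.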
This project demonstrates multi-label text classification with BERT. The model is trained and evaluated on a dataset of article titles and abstracts, and its quality is reported as test-set accuracy and loss together with per-category precision, recall, and F1-score.