All library versions match Google Colab as of April 2022. Additionally, run `pip install pytorch-pretrained-bert` to get the BERT model classes, or install everything at once with `pip install -r requirements.txt`.
Models are stored in Google Drive. We recommend using `finetuned_pytorch_model_32_ep5.bin` or `finetuned_pytorch_model_32_ep7.bin`. The epoch-3 version can serve as a checkpoint for further fine-tuning, provided `num_classes` stays fixed.
The model performs well on the dataset used for training (a Kaggle dataset):
- 89% accuracy on the validation set
- 87.9% accuracy on the test set
`bert_classification_model_inference.ipynb` presents the inference code. By default, the `evals.predict()` function returns:
- Class probabilities
- Logits (the softmax of the logits gives the class probabilities)
- 768-dimensional text embeddings
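The relationship between the returned logits and probabilities can be sketched as follows. This is a minimal NumPy illustration of softmax, not the library's actual implementation:

```python
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    z = logits - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw per-class model outputs
probs = softmax(logits)             # class probabilities summing to 1
```

The predicted class is simply the argmax, which is identical whether taken over the logits or the probabilities.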
An example dataset is attached at `data/new_test.csv`. Predictions are performed on the first 32 words of the `text` column of a given dataset. The notebook contains comments with further details.
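Taking the first 32 words of a text can be sketched as below. This assumes a simple whitespace split; the notebook may instead truncate at the level of BERT subword tokens:

```python
def first_n_words(text, n=32):
    # Keep only the first n whitespace-separated words
    return " ".join(text.split()[:n])

sample = "lorem ipsum " * 50          # a long input text
truncated = first_n_words(sample)     # only its first 32 words survive
```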
`infer.py` provides an interface for json-to-json prediction.
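A json-to-json interface of this kind could look roughly like the sketch below. Here `run_model` is a hypothetical stand-in for the actual model call in `infer.py`, and the input/output record shapes are assumptions:

```python
import json

def run_model(text):
    # Hypothetical stand-in for the real BERT inference call
    return {"label": "example_class", "probability": 0.9}

def predict_json(input_path, output_path):
    # Read a JSON list of {"text": ...} records, predict each,
    # and write the predictions back out as JSON
    with open(input_path) as f:
        records = json.load(f)
    results = [{"text": r["text"], **run_model(r["text"])} for r in records]
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)
```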
`bert_classification_model_training.ipynb` presents the training code. The training dataset is attached at `data/cleaned.csv`. Training for one epoch takes 20-30 minutes on a Google Colab GPU, depending on the training set size; 5-7 epochs are enough to reach decent quality (85+% accuracy on validation). Training uses the first 32 words of the `text` column of a given dataset. The notebook contains comments with further details.
If you want to train on another dataset with a different set of classes, all you need to do is:
- Save your data to `data/cleaned.csv`. It must have `category` and `text` columns (extra columns are fine)
- Change `args['label_list']` so that it contains all the unique classes in the new dataset
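One way to populate `args['label_list']` is to derive it from the new dataset itself. This is a sketch assuming pandas; the inline DataFrame stands in for `pd.read_csv("data/cleaned.csv")`:

```python
import pandas as pd

# Illustrative stand-in for: df = pd.read_csv("data/cleaned.csv")
df = pd.DataFrame({
    "category": ["sports", "politics", "sports", "tech"],
    "text": ["...", "...", "...", "..."],
})

args = {}
# Collect every unique class so the classifier head matches the data
args["label_list"] = sorted(df["category"].unique())
```

Sorting keeps the class-to-index mapping deterministic across runs.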