Dialogue Classification using Bidirectional Encoder Representations from Transformers (BERT)

A dialogue classifier for the show The Office. This code uses a fine-tuned BERT-based model to classify each line of dialogue as either Jim's or Dwight's.

Jim vs. Dwight Dialogue Speaker Classification

This project classifies speakers in TV series dialogue using a fine-tuned BERT-based model. The model is trained on a dataset of dialogue lines from the TV series "The Office" and predicts the speaker of a line with 84% accuracy.

Dataset

The dataset used in this project consists of dialogues from The Office. Note: The model was trained only on the dialogues in the file train.csv, which was provided for a Kaggle competition by IEEE NITK's Computer Intelligence Society.

Requirements

TensorFlow
Transformers
Pandas
NumPy

Application

About the dataset:
- The training dataset was in CSV format with columns: "id", "line", "speaker".
- The validation dataset was split from the train.csv file, so it is in the same format.
- The data was cleaned and pre-processed before training.
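
The split described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the helper name `split_train_validation`, the 20% validation fraction, the random seed, and the choice to stratify by speaker are all assumptions.

```python
import pandas as pd


def split_train_validation(df, val_fraction=0.2, seed=42):
    """Hold out a fraction of each speaker's lines for validation.

    Assumes df has the columns "id", "line", "speaker" and a unique index
    (the default index from pd.read_csv is fine).
    """
    # Sample the same fraction from every speaker so both classes
    # are represented in the validation set.
    val = df.groupby("speaker").sample(frac=val_fraction, random_state=seed)
    # Everything that was not sampled stays in the training set.
    train = df.drop(val.index)
    return train.reset_index(drop=True), val.reset_index(drop=True)


# Possible usage (assumes train.csv is in the working directory):
# df = pd.read_csv("train.csv")
# train_df, val_df = split_train_validation(df)
```

Stratifying by speaker keeps the class balance of the validation set close to that of train.csv, which makes the reported accuracy easier to interpret.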
Fine-tuning the BERT model:
- Run the BERT-ieeekagglecup-2023.ipynb notebook to fine-tune the BERT model on the training dataset.
- The notebook tokenizes the text, prepares input tensors, and trains the model.
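
For readers who want a feel for these steps without opening the notebook, here is a hedged sketch. It is not the notebook's code. The checkpoint name `bert-base-uncased`, the label mapping, and every hyperparameter (sequence length, learning rate, batch size, epochs) are assumptions, and it presumes a Transformers release that still ships the TensorFlow model classes.

```python
import numpy as np

MODEL_NAME = "bert-base-uncased"   # assumed checkpoint, not confirmed by the project
LABELS = {"Jim": 0, "Dwight": 1}   # assumed speaker names and label ids


def encode_labels(speakers):
    """Map speaker names to integer class ids."""
    return np.array([LABELS[s] for s in speakers], dtype=np.int32)


def fine_tune(lines, speakers, epochs=3, max_length=64):
    """Tokenize the lines, build input tensors, and fine-tune BERT."""
    # Heavy imports are kept local so the helper above can be used
    # without TensorFlow installed.
    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
    model = TFBertForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(LABELS)
    )

    # Tokenize: pads and truncates every line, returns TensorFlow tensors
    # (input_ids, token_type_ids, attention_mask).
    encodings = tokenizer(
        list(lines),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="tf",
    )

    # The model outputs raw logits, hence from_logits=True.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(dict(encodings), encode_labels(speakers), epochs=epochs, batch_size=16)
    return tokenizer, model
```

A small learning rate such as 2e-5 is a common starting point for fine-tuning BERT, but the value used in the notebook may differ.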
Evaluation:
- The trained model is evaluated on the validation dataset using accuracy and a classification report.
- The evaluation results can be used to assess the model's performance.
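
To make the metrics concrete, the sketch below computes accuracy and a per-speaker precision/recall/F1 table in plain Python. It only illustrates what these numbers measure. The notebook may well rely on a library routine such as scikit-learn's `classification_report`, which is an assumption; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, report) for two equal-length label sequences.

    report maps each label to its precision, recall, F1 score and support.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true examples of this label
        }
    return accuracy, report


# Possible usage with validation labels and model predictions:
# acc, report = classification_metrics(val_speakers, predicted_speakers)
```

Looking at per-speaker precision and recall alongside accuracy shows whether the model favours one character over the other.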
Making predictions:
- Use the trained model to predict the speaker of dialogue lines.
- The test.csv dataset is in CSV format with columns: "id", "line".
- The predictions were written to the submission CSV for the Kaggle contest - my first one :)
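
The prediction step can be sketched as turning the model's per-class scores into speaker names and writing them out. This is an illustrative sketch only: the submission columns "id" and "speaker", the label mapping, and the helper names are assumptions, since the competition's required format is not stated here.

```python
import csv

ID_TO_SPEAKER = {0: "Jim", 1: "Dwight"}  # assumed mapping from class id to speaker


def logits_to_speakers(logits):
    """Pick the highest-scoring class for each row of scores."""
    speakers = []
    for row in logits:
        best = max(range(len(row)), key=lambda i: row[i])  # argmax over classes
        speakers.append(ID_TO_SPEAKER[best])
    return speakers


def write_submission(ids, logits, out_file):
    """Write one "id,speaker" row per test line to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["id", "speaker"])  # assumed submission header
    for row_id, speaker in zip(ids, logits_to_speakers(logits)):
        writer.writerow([row_id, speaker])


# Possible usage, where test_ids come from test.csv and test_logits
# from the fine-tuned model:
# with open("submission.csv", "w", newline="") as f:
#     write_submission(test_ids, test_logits, f)
```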

Feel free to contribute, open issues, or submit pull requests to enhance the project!

