Thai-text-classification: Thai text classification with a Transformer model, trained on the Tcas61_2.csv dataset.
In this project, I developed a Thai text classification model using the state-of-the-art Transformer architecture.
The primary goal was to achieve a high F1 score on the Tcas61_2.csv dataset, which contains Thai text data with binary labels.
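For context, loading the dataset might look like this; the label column name comes from the value_counts output below, while the text column name is an assumption:

```python
import pandas as pd

# 'label' holds the binary class (0 or 1); the text column name is assumed.
df = pd.read_csv('Tcas61_2.csv')
print(df['label'].value_counts(normalize=True))
```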
- Balanced the dataset using RandomOverSampler:
- Class distribution before RandomOverSampler:
```
0    0.653226
1    0.346774
Name: label, dtype: float64
```
- Class distribution after RandomOverSampler:
```
0    0.5
1    0.5
```
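A minimal sketch of the balancing step, assuming imbalanced-learn's RandomOverSampler applied to the df loaded above (the random_state is illustrative):

```python
from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class rows until both classes appear equally often.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(df[['text']], df['label'])
print(y_res.value_counts(normalize=True))   # now 0.5 / 0.5
```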
- Split the dataset into train (70%), validation (15%), and test (15%) sets; after oversampling, the training classes are balanced (Class 0: 50%, Class 1: 50%)
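One way to realize the 70/15/15 split with scikit-learn (the two-step split and the stratification are assumptions):

```python
from sklearn.model_selection import train_test_split

# 70% train, then split the remaining 30% evenly into validation and test.
train_text, temp_text, train_labels, temp_labels = train_test_split(
    X_res['text'], y_res, test_size=0.3, stratify=y_res, random_state=42)
val_text, test_text, val_labels, test_labels = train_test_split(
    temp_text, temp_labels, test_size=0.5, stratify=temp_labels, random_state=42)
```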
- Set the max_seq_len to 25
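Tokenization with the matching tokenizer, padding or truncating every text to 25 tokens (a sketch using the same checkpoint as the encoder below):

```python
from transformers import AutoTokenizer

max_seq_len = 25
tokenizer = AutoTokenizer.from_pretrained('poom-sci/WangchanBERTa-finetuned-sentiment')

# Pad or truncate every sequence to exactly max_seq_len tokens.
tokens_train = tokenizer(list(train_text), max_length=max_seq_len,
                         padding='max_length', truncation=True, return_tensors='pt')
```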
- Chose the pre-trained WangchanBERTa model and froze its weights (the freezing loop appears in the architecture sketch below):
```python
from transformers import AutoModel

bert = AutoModel.from_pretrained('poom-sci/WangchanBERTa-finetuned-sentiment')
```
- Set the batch_size to 32
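Batching with a PyTorch DataLoader (a sketch; tensor names follow the tokenization snippet above):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

batch_size = 32
train_data = TensorDataset(tokens_train['input_ids'],
                           tokens_train['attention_mask'],
                           torch.tensor(train_labels.values))
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data),
                              batch_size=batch_size)
```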
- Added more layers on top of the frozen encoder (assembled in the sketch after this list):
- Dropout layer with a rate of 0.1
- Dense layer 1: `self.fc1 = nn.Linear(768, 512)`
- Dense layer 2: `self.fc2 = nn.Linear(512, 256)`
- Dense layer 3 (output layer): `self.fc3 = nn.Linear(256, 2)`
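A minimal sketch of the resulting architecture; the ReLU activations, the use of the first ([CLS]) token's hidden state, and the log-softmax output are assumptions not fixed by the list above:

```python
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert                 # frozen WangchanBERTa encoder
        self.dropout = nn.Dropout(0.1)   # dropout rate of 0.1
        self.relu = nn.ReLU()            # assumed activation
        self.fc1 = nn.Linear(768, 512)   # dense layer 1
        self.fc2 = nn.Linear(512, 256)   # dense layer 2
        self.fc3 = nn.Linear(256, 2)     # output layer (2 classes)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_ids, attention_mask):
        # Use the hidden state of the first ([CLS]) token as the sentence vector.
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        x = self.dropout(self.relu(self.fc1(hidden)))
        x = self.dropout(self.relu(self.fc2(x)))
        return self.log_softmax(self.fc3(x))

# Freeze the pre-trained weights so only the new head is trained.
for param in bert.parameters():
    param.requires_grad = False
model = BertClassifier(bert)
```

With a log-softmax output, the matching loss function would be nn.NLLLoss (also an assumption).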
- Defined the optimizer using AdamW with a learning rate of 2e-5 and weight_decay of 0.01
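The corresponding optimizer definition, here with PyTorch's torch.optim.AdamW (whether the notebook used this or the transformers variant is not stated):

```python
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```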
- Trained the model for 100 epochs
- Saved the best model with the lowest validation loss:
```python
if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), 'saved_weights.pt')
```
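Putting the loop together (a sketch; train() and evaluate() are hypothetical stand-ins for the notebook's per-epoch training and validation routines):

```python
import torch

best_valid_loss = float('inf')
for epoch in range(100):
    train_loss = train(model, train_dataloader, optimizer)   # hypothetical helper
    valid_loss = evaluate(model, val_dataloader)             # hypothetical helper
    # Checkpoint whenever the validation loss improves.
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
```

After the final epoch, the saved weights can be reloaded with model.load_state_dict(torch.load('saved_weights.pt')) before computing the F1 score on the held-out test set.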
Deliverables:
- Google Colab notebook with the final Transformer model implementation, training, and evaluation code
- Screenshots of the training and test results, including the F1 score