Thai-text-classification: Thai text classification with a Transformer model, trained on the Tcas61_2.csv dataset.
In this project, I developed a Thai text classification model using the state-of-the-art Transformer architecture.
The primary goal was to achieve a high F1 score on the Tcas61_2.csv dataset, which contains Thai text data with binary labels.
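For context, loading the dataset might look like this; the label column name comes from the value_counts output below, while the text column name is an assumption:

```python
import pandas as pd

# 'label' holds the binary class (0 or 1); the text column name is assumed.
df = pd.read_csv('Tcas61_2.csv')
print(df['label'].value_counts(normalize=True))
```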
- Balanced the dataset using RandomOverSampler:
- Class distribution before RandomOverSampler:
```
0    0.653226
1    0.346774
Name: label, dtype: float64
```
- Class distribution after RandomOverSampler:
```
0    0.5
1    0.5
```
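A minimal sketch of the balancing step, assuming imbalanced-learn's RandomOverSampler applied to the df loaded above (the random_state is illustrative):

```python
from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class rows until both classes appear equally often.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(df[['text']], df['label'])
print(y_res.value_counts(normalize=True))   # now 0.5 / 0.5
```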
- Split the dataset into train (70%), validation (15%), and test (15%) sets; after oversampling, the training classes are balanced (Class 0: 50%, Class 1: 50%)
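One way to realize the 70/15/15 split with scikit-learn (the two-step split and the stratification are assumptions):

```python
from sklearn.model_selection import train_test_split

# 70% train, then split the remaining 30% evenly into validation and test.
train_text, temp_text, train_labels, temp_labels = train_test_split(
    X_res['text'], y_res, test_size=0.3, stratify=y_res, random_state=42)
val_text, test_text, val_labels, test_labels = train_test_split(
    temp_text, temp_labels, test_size=0.5, stratify=temp_labels, random_state=42)
```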
- Set the max_seq_len to 25
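Tokenization with the matching tokenizer, padding or truncating every text to 25 tokens (a sketch using the same checkpoint as the encoder below):

```python
from transformers import AutoTokenizer

max_seq_len = 25
tokenizer = AutoTokenizer.from_pretrained('poom-sci/WangchanBERTa-finetuned-sentiment')

# Pad or truncate every sequence to exactly max_seq_len tokens.
tokens_train = tokenizer(list(train_text), max_length=max_seq_len,
                         padding='max_length', truncation=True, return_tensors='pt')
```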
- Chose the pre-trained WangchanBERTa model and froze its weights (the freezing loop appears in the architecture sketch below):
```python
from transformers import AutoModel

bert = AutoModel.from_pretrained('poom-sci/WangchanBERTa-finetuned-sentiment')
```
- Set the batch_size to 32
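Batching with a PyTorch DataLoader (a sketch; tensor names follow the tokenization snippet above):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

batch_size = 32
train_data = TensorDataset(tokens_train['input_ids'],
                           tokens_train['attention_mask'],
                           torch.tensor(train_labels.values))
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data),
                              batch_size=batch_size)
```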
- Added more layers on top of the frozen encoder (assembled in the sketch after this list):
- Dropout layer with a rate of 0.1
- Dense layer 1: `self.fc1 = nn.Linear(768, 512)`
- Dense layer 2: `self.fc2 = nn.Linear(512, 256)`
- Dense layer 3 (output layer): `self.fc3 = nn.Linear(256, 2)`
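A minimal sketch of the resulting architecture; the ReLU activations, the use of the first ([CLS]) token's hidden state, and the log-softmax output are assumptions not fixed by the list above:

```python
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert                 # frozen WangchanBERTa encoder
        self.dropout = nn.Dropout(0.1)   # dropout rate of 0.1
        self.relu = nn.ReLU()            # assumed activation
        self.fc1 = nn.Linear(768, 512)   # dense layer 1
        self.fc2 = nn.Linear(512, 256)   # dense layer 2
        self.fc3 = nn.Linear(256, 2)     # output layer (2 classes)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_ids, attention_mask):
        # Use the hidden state of the first ([CLS]) token as the sentence vector.
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        x = self.dropout(self.relu(self.fc1(hidden)))
        x = self.dropout(self.relu(self.fc2(x)))
        return self.log_softmax(self.fc3(x))

# Freeze the pre-trained weights so only the new head is trained.
for param in bert.parameters():
    param.requires_grad = False
model = BertClassifier(bert)
```

With a log-softmax output, the matching loss function would be nn.NLLLoss (also an assumption).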
- Defined the optimizer using AdamW with a learning rate of 2e-5 and weight_decay of 0.01
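The corresponding optimizer definition, here with PyTorch's torch.optim.AdamW (whether the notebook used this or the transformers variant is not stated):

```python
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```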
- Trained the model for 100 epochs
- Saved the best model with the lowest validation loss:
```python
if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), 'saved_weights.pt')
```
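Putting the loop together (a sketch; train() and evaluate() are hypothetical stand-ins for the notebook's per-epoch training and validation routines):

```python
import torch

best_valid_loss = float('inf')
for epoch in range(100):
    train_loss = train(model, train_dataloader, optimizer)   # hypothetical helper
    valid_loss = evaluate(model, val_dataloader)             # hypothetical helper
    # Checkpoint whenever the validation loss improves.
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
```

After the final epoch, the saved weights can be reloaded with model.load_state_dict(torch.load('saved_weights.pt')) before computing the F1 score on the held-out test set.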
Deliverables:
- Google Colab notebook with the final Transformer model implementation, training, and evaluation code
- Screenshots of the training and test results, including the F1 score