Thai-text-classification

Thai text classification using a Transformer model, fine-tuned on the Tcas61_2.csv dataset.

In this project, I developed a Thai text classification model using a state-of-the-art Transformer architecture.

The primary goal was to achieve a high F1 score on the Tcas61_2.csv dataset, which contains Thai text data with binary labels.

Summary of the Fine-tuning Process

  1. Balanced the dataset using RandomOverSampler (see the data-preparation sketch after this list):
    • Class distribution before RandomOverSampler:
      0    0.653226
      1    0.346774
    • Class distribution after RandomOverSampler:
      0    0.5
      1    0.5
    • Split the balanced data into train (70%; 50% class 0, 50% class 1), validation (15%), and test (15%) sets
  2. Set the max_seq_len to 25
  3. Chose the pre-trained WangchanBERTa model and froze its weights (see the model sketch after this list):
     bert = AutoModel.from_pretrained('poom-sci/WangchanBERTa-finetuned-sentiment')
  4. Set the batch_size to 32
  5. Added more layers on top of the frozen backbone:
    • Dropout layer with a rate of 0.1
    • Dense layer 1: self.fc1 = nn.Linear(768, 512)
    • Dense layer 2: self.fc2 = nn.Linear(512, 256)
    • Dense layer 3 (output layer): self.fc3 = nn.Linear(256, 2)
  6. Defined the optimizer using AdamW with a learning rate of 2e-5 and a weight_decay of 0.01
  7. Trained the model for 100 epochs (see the training-loop and evaluation sketches after this list)
  8. Saved the best model with the lowest validation loss:
     if valid_loss < best_valid_loss:
         best_valid_loss = valid_loss
         torch.save(model.state_dict(), 'saved_weights.pt')
            
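A minimal data-preparation sketch for steps 1 and 2: balancing with RandomOverSampler, a stratified 70/15/15 split, and tokenization at max_seq_len = 25. The column names text and label, the random seeds, and the stratified splitting calls are assumptions, not code from the original notebook.

    # Data-preparation sketch (assumed column names: 'text', 'label')
    import pandas as pd
    from imblearn.over_sampling import RandomOverSampler
    from sklearn.model_selection import train_test_split
    from transformers import AutoTokenizer

    df = pd.read_csv('Tcas61_2.csv')

    # Balance the classes by randomly duplicating minority-class rows
    ros = RandomOverSampler(random_state=42)  # seed is an assumption
    X_res, y_res = ros.fit_resample(df[['text']], df['label'])
    print(y_res.value_counts(normalize=True))  # expect 0.5 / 0.5

    # Stratified 70/15/15 split into train, validation, and test sets
    train_text, temp_text, train_labels, temp_labels = train_test_split(
        X_res['text'], y_res, test_size=0.30, stratify=y_res, random_state=42)
    val_text, test_text, val_labels, test_labels = train_test_split(
        temp_text, temp_labels, test_size=0.50, stratify=temp_labels, random_state=42)

    # Tokenize with the WangchanBERTa tokenizer, padding/truncating to 25 tokens
    tokenizer = AutoTokenizer.from_pretrained('poom-sci/WangchanBERTa-finetuned-sentiment')
    tokens_train = tokenizer(train_text.tolist(), max_length=25,
                             padding='max_length', truncation=True,
                             return_tensors='pt')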
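A sketch of the model from steps 3-5: the pre-trained WangchanBERTa backbone is loaded and frozen, and the added head (dropout 0.1 plus the three dense layers) is trained on top. The ReLU activations and the use of the first-token representation are assumptions.

    # Model sketch: frozen backbone + trainable classification head
    import torch.nn as nn
    from transformers import AutoModel

    bert = AutoModel.from_pretrained('poom-sci/WangchanBERTa-finetuned-sentiment')
    for param in bert.parameters():
        param.requires_grad = False  # freeze the pre-trained weights

    class BertClassifier(nn.Module):
        def __init__(self, bert):
            super().__init__()
            self.bert = bert
            self.dropout = nn.Dropout(0.1)
            self.relu = nn.ReLU()           # activation is an assumption
            self.fc1 = nn.Linear(768, 512)  # dense layer 1
            self.fc2 = nn.Linear(512, 256)  # dense layer 2
            self.fc3 = nn.Linear(256, 2)    # dense layer 3 (output)

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]  # first-token representation
            x = self.dropout(cls)
            x = self.relu(self.fc1(x))
            x = self.relu(self.fc2(x))
            return self.fc3(x)

    model = BertClassifier(bert)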
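A minimal training-loop sketch for steps 6-8, assuming DataLoaders (train_loader, val_loader) built from the tokenized splits with batch_size = 32; only the checkpoint with the lowest validation loss is kept.

    # Training-loop sketch (train_loader / val_loader are assumed)
    import torch
    import torch.nn as nn
    from torch.optim import AdamW

    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    criterion = nn.CrossEntropyLoss()
    best_valid_loss = float('inf')

    for epoch in range(100):
        model.train()
        for input_ids, attention_mask, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(input_ids, attention_mask), labels)
            loss.backward()
            optimizer.step()

        # Average validation loss for this epoch
        model.eval()
        valid_loss = 0.0
        with torch.no_grad():
            for input_ids, attention_mask, labels in val_loader:
                valid_loss += criterion(model(input_ids, attention_mask), labels).item()
        valid_loss /= len(val_loader)

        # Save only the best checkpoint
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'saved_weights.pt')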
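Since the primary goal is a high F1 score, a short evaluation sketch for the held-out test set; test_loader and the saved checkpoint are assumed from the sketches above.

    # Evaluation sketch: reload the best checkpoint and report F1
    import torch
    from sklearn.metrics import classification_report, f1_score

    model.load_state_dict(torch.load('saved_weights.pt'))
    model.eval()

    preds, gold = [], []
    with torch.no_grad():
        for input_ids, attention_mask, labels in test_loader:
            logits = model(input_ids, attention_mask)
            preds.extend(logits.argmax(dim=1).tolist())
            gold.extend(labels.tolist())

    print('F1:', f1_score(gold, preds))
    print(classification_report(gold, preds))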

Attachments

  • Google Colab notebook with the final Transformer model implementation, training, and evaluation code
  • Screenshots of the captured results for both training and testing, including the F1 score
by Witsarut Wongsim