/BahasaRojakSentimentAnalysis

Handling Bahasa Rojak (Malaysian Code Mixing Language) OOV and performing Sentiment Analysis using downstreamed XLM-R

Primary LanguageJupyter Notebook

BahasaRojakSentimentAnalysis πŸ˜ΈπŸ˜‘πŸ˜Ύ

Handling Bahasa Rojak (Malaysian Code Mixing Language) OOV and performing Sentiment Analysis using downstreamed Cross Lingual Model XLM-RoBERTa (XLM-T)

Jupyter Notebooks includes detailing of:

  1. Text Preprocessing
  2. Model Fine Tuning
  3. New Data Inference Pipeline

For further resources regarding the project, please access link below.

Access the project here: https://drive.google.com/drive/folders/12Uir9KE4B1VL6oQWdj2BWvCUZOC0vWa2

Ablation Settings:

Preprocessing Method Model 1 (V1) Model 2 (V2) Model 3 (V3) Model 4 (V4)
Remove URLs βœ” βœ” βœ” βœ”
Convert Lowercase βœ” βœ” βœ” -
Remove Punctuations βœ” βœ” βœ” -
Remove Irregular Spaces βœ” βœ” βœ” βœ”
Handle OOV βœ” βœ” βœ” βœ”
Remove Stopwords βœ” βœ” - -
Chinese Character Segmentation - βœ” βœ” -
Remove Rare Words - - βœ” -

image

Model Results:

Precision Recall F1-Score Accuracy
0 1 0 1 0 1
Model V1 0.716 0.830 0.840 0.702 0.773 0.760 0.767
Model V2 0.768 0.771 0.735 0.801 0.751 0.786 0.770
Model V3 0.794 0.703 0.691 0.802 0.739 0.749 0.744
Model V4 0.861 0.833 0.802 0.884 0.831 0.858 0.845

Web Application to Test out the Sentiment Analysis Model (w/ Twitter Web Scraping):

Scrap tweets related to "britneyspears":

image

Inference Results:

image