MuRIL - Profanity classification for Dravidian Languages

MuRIL (Multilingual Representations for Indian Languages) is a BERT-based model pretrained on text in Indian languages such as Hindi, Tamil, and Kannada.

This repository walks you through fine-tuning an NLP model for a task-specific application using the Transformers (:hugs:) implementation. Here we tackle a hate-speech classification task, but remember, you can always generalise the approach to any number of classes (as long as you can procure the right dataset :grinning:) just by adjusting the number of outputs in the final layer.

TASK: A six-class classification problem on tweets in Tamil, Kannada, and Malayalam (credits: ACL).

Class Labels:

  • 'Not_offensive'
  • 'Offensive_Targeted_Insult_Group'
  • 'Offensive_Targeted_Insult_Individual'
  • 'Offensive_Targeted_Insult_Other'
  • 'Offensive_Untargeted'
  • 'Not {language_name}'
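
To make the "number of outputs" idea concrete, here is a minimal sketch of how the six labels above could be turned into integer ids. The helper name `build_label_maps` and the id ordering are assumptions made for illustration and are not taken from the notebook. The only thing that matters downstream is that the label count matches the size of the final layer.

```python
# Five labels shared by every language, copied from the list above.
BASE_LABELS = [
    "Not_offensive",
    "Offensive_Targeted_Insult_Group",
    "Offensive_Targeted_Insult_Individual",
    "Offensive_Targeted_Insult_Other",
    "Offensive_Untargeted",
]


def build_label_maps(language_name):
    """Return (label2id, id2label) for one language.

    Hypothetical helper: the id order is an assumption, any fixed order works
    as long as training and inference use the same one.
    """
    labels = BASE_LABELS + [f"Not {language_name}"]
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps("Tamil")

# This count is what the final layer needs as its output size.
num_labels = len(label2id)
print(num_labels)
print(id2label[num_labels - 1])
```

With Transformers, this count is typically passed as the `num_labels` argument when loading a sequence classification model, so switching to a different label set only means changing the list above.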

You could dive right into the Colab notebook!

Open In Colab

NOTE: You might want to make a copy of the notebook first.

Or, you could read on to learn more!

What do we need?

  1. A basic understanding of how BERT works. (Insightful: Article)
  2. Understanding of tokenizers and word embeddings.
  3. PyTorch framework.
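
If tokenizers are new to you, the toy sketch below may help. BERT-style models split words into subword pieces using a greedy longest-match-first rule (WordPiece), where `##` marks a piece that continues a word. The function name and the tiny vocabulary here are made up for illustration. The real tokenizer ships with the model and has a far larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word is treated as unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not MuRIL's real one.
TOY_VOCAB = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece_tokenize("playing", TOY_VOCAB))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", TOY_VOCAB))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", TOY_VOCAB))         # ['[UNK]']
```

Each piece is then mapped to an integer id, and the model looks up one embedding vector per id. That lookup table is the "word embeddings" part of the prerequisite above.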

Click the icon to know more about PyTorch and how it works