
Tashkeel


Overview

This project addresses the problem of Arabic diacritization using a Bi-LSTM.

For example:

ذَهَبَ اَلْوَلَدُ إِلَى اَلْمَدْرَسَةِ <-- ذهب الولد إلى المدرسة

Our Achievement

We ranked 5th on the leaderboard, with an accuracy of 97% on the hidden test set.

Get Started

For Cleaning

python cleaning.py --mode (choices: train, test, validate)
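
For example, to clean the training split:

python cleaning.py --mode train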

For Training

# Load the preprocessed sentences and their label sequences from the pickled files
val_train_sentences, val_train_labels, val_train_size = get_params(vocab_map, classes, 'val_train_X.pickle', 'val_train_Y.pickle')
# Wrap them in a dataset that pads every sequence up to max_length
val_train_dataset = TashkeelDataset(val_train_sentences, val_train_labels, vocab_map['<PAD>'], max_length)
model = Tashkeel()
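
A minimal training-loop sketch built on the dataset above, using a standard PyTorch DataLoader. The padding label index (pad_class here) and the hyperparameters are illustrative assumptions, not the exact values used in the notebook:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
loader = DataLoader(val_train_dataset, batch_size=64, shuffle=True)

model = Tashkeel().to(device)
criterion = nn.CrossEntropyLoss(ignore_index=pad_class)  # pad_class: assumed index of the padding label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for chars, labels in loader:
        chars, labels = chars.to(device), labels.to(device)
        logits = model(chars)                                   # (batch, seq_len, n_classes)
        loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()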

For Inference

inferencing = model.inference(test_input)
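
A sketch of turning the predicted class indices back into diacritized text. It assumes inferencing holds one predicted class index per character of test_input and that classes maps each diacritic string to its class index, as suggested by the training snippet; attach_diacritics is a hypothetical helper, not part of the project's API:

# Invert the class mapping: class index -> diacritic string ('' for "no diacritic")
id_to_diacritic = {idx: diacritic for diacritic, idx in classes.items()}

def attach_diacritics(text, predictions):
    out = []
    for ch, pred in zip(text, predictions):
        out.append(ch)
        if ch != ' ':                                  # whitespace never carries a diacritic
            out.append(id_to_diacritic.get(pred, ''))
    return ''.join(out)

print(attach_diacritics(test_input, inferencing))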

Modules

Preprocessing

Cleaning Process [Train & Validation Only] (a regex sketch of these steps follows the list)
  1. Remove HTML tags
  2. Remove URLs
  3. Remove the Arabic elongation character (Kashida)
  4. Separate numbers from surrounding text
  5. Collapse multiple whitespaces
  6. Clear punctuation
  7. Remove English letters and English and Arabic numerals
  8. Remove shifts
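
A minimal sketch of these cleaning steps with regular expressions; the exact patterns and ordering in cleaning.py may differ, so treat this as an illustration rather than the project's actual code:

import re
import string

def clean(text):
    text = re.sub(r'<[^>]+>', ' ', text)                         # 1. remove HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)           # 2. remove URLs
    text = text.replace('\u0640', '')                            # 3. remove Kashida (tatweel)
    text = re.sub(r'([0-9\u0660-\u0669]+)', r' \1 ', text)       # 4. separate numbers from words
    drop = set(string.punctuation) - set('.,?:')                 # keep sentence-splitting marks for tokenization
    text = ''.join(' ' if ch in drop else ch for ch in text)     # 6. clear punctuation
    text = re.sub(r'[A-Za-z0-9\u0660-\u0669]+', ' ', text)       # 7. drop English letters and English/Arabic digits
    text = re.sub(r'[ \t]+', ' ', text).strip()                  # 5. collapse whitespace (newlines kept for splitting)
    return text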
Tokenization
  • Split Using: [\n.,،؛:«»?؟]+
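
For instance, splitting a cleaned document into sentence chunks with that pattern (cleaned_text here stands for the output of the cleaning step):

import re

sentences = re.split(r'[\n.,،؛:«»?؟]+', cleaned_text)
sentences = [s.strip() for s in sentences if s.strip()]   # drop empty fragments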
Fix Diacritization Issues [Train & Validation Only] (a rough sketch follows the list)
  1. Replace consecutive diacritics with a single diacritic
  2. Ending Diacritics: Remove diacritics at the end of a word
  3. Misplaced Diacritics: Remove spaces between characters and diacritics
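
A rough regex sketch of steps 1 and 3 above. Treating Shadda + haraka as a valid pair is an assumption based on the class list in the next subsection, and step 2's exact rule for word-final diacritics is not shown here because this README does not specify it:

import re

DIACRITICS = '\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652'
D = '[' + DIACRITICS + ']'

def fix_diacritization(text):
    # 1. Consecutive diacritics: collapse repeated identical diacritics into one
    #    (a Shadda followed by a haraka is a valid pair and is left untouched)
    text = re.sub('(' + D + r')\1+', r'\1', text)
    # 3. Misplaced diacritics: remove spaces separating a character from its diacritic
    text = re.sub(r'(\S) +(' + D + ')', r'\1\2', text)
    return text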
Tashkeel Removal [Train & Validation Only]
  • For every character, record its gold class (diacritic) and strip it from the input text (a sketch follows the harakat list below)
  • Harakat:
    1. "Fatha":"\u064e"
    2. "Fathatan":  "\u064b"
    3. "Damma":"\u064f"
    4. "Dammatan":"\u064c"
    5. "Kasra":"\u0650"
    6. "Kasratan":"\u064d"
    7. "Sukun":"\u0652"
    8. "Shadda":"\u0651"
    9. "Shadda Fatha":"\u0651\u064e"
    10. "Shadda Fathatan":"\u0651\u064b"
    11. "Shadda Damma":"\u0651\u064f"
    12. "Shadda Dammatan":"\u0651\u064c"
    13. "Shadda Kasra":"\u0651\u0650"
    14. "Shadda Kasratan":"\u0651\u064d"      

Network


import torch.nn as nn

class Tashkeel(nn.Module):
  def __init__(self, vocab_size=vocab_size, embedding_dim=100, hidden_size=256, n_classes=n_classes):
    """
    The constructor of our Tashkeel model
    Inputs:
    - vocab_size: the number of unique characters in the vocabulary
    - embedding_dim: the embedding dimension
    - hidden_size: the LSTM hidden state size
    - n_classes: the number of final classes (diacritic tags)
    """
    super(Tashkeel, self).__init__()
    # (1) Create the embedding layer
    self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

    # (2) Create a 2-layer bidirectional LSTM with hidden size = hidden_size and batch_first = True
    self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True, num_layers=2, bidirectional=True)

    # (3) Create a linear layer with number of neurons = n_classes (it takes 2*hidden_size inputs, one per direction)
    self.linear = nn.Linear(2*hidden_size, n_classes)
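
  # Note: the forward pass is not included in the snippet above; the method below is a
  # minimal sketch consistent with the layers defined in __init__ (details such as
  # dropout or packing padded sequences may differ in the actual notebook).
  def forward(self, sentences):
    """
    Map a batch of character-id sequences to per-character class scores.
    Inputs:
    - sentences: (batch_size, seq_len) tensor of character ids
    Returns: (batch_size, seq_len, n_classes) logits
    """
    embeddings = self.embedding(sentences)   # (batch, seq_len, embedding_dim)
    lstm_out, _ = self.lstm(embeddings)      # (batch, seq_len, 2*hidden_size)
    logits = self.linear(lstm_out)           # (batch, seq_len, n_classes)
    return logits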

Contributors

Ahmed Hany

Mohab Zaghloul

Shaza Mohamed

Basma Elhoseny

License

This software is licensed under the MIT License. See the License file for more information. © Basma Elhoseny.