
Tashkeel


Overview

This project addresses the problem of Arabic diacritization using a Bi-LSTM.

For example:

ذَهَبَ اَلْوَلَدُ إِلَى اَلْمَدْرَسَةِ <-- ذهب الولد إلى المدرسة

Our Achievement

We ranked 5th on the leaderboard, with an accuracy of 97% on the hidden test set.

Get Started

For Cleaning

python cleaning.py --mode (choices: train, test, validate)
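
For example, to clean the training split:

python cleaning.py --mode train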

For Training

# Load the preprocessed sentences and their label sequences from the pickled files
val_train_sentences, val_train_labels, val_train_size = get_params(vocab_map, classes, 'val_train_X.pickle', 'val_train_Y.pickle')
# Wrap them in a dataset that pads every sequence up to max_length
val_train_dataset = TashkeelDataset(val_train_sentences, val_train_labels, vocab_map['<PAD>'], max_length)
model = Tashkeel()
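
A minimal training-loop sketch built on the dataset above, using a standard PyTorch DataLoader. The padding label index (pad_class here) and the hyperparameters are illustrative assumptions, not the exact values used in the notebook:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
loader = DataLoader(val_train_dataset, batch_size=64, shuffle=True)

model = Tashkeel().to(device)
criterion = nn.CrossEntropyLoss(ignore_index=pad_class)  # pad_class: assumed index of the padding label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for chars, labels in loader:
        chars, labels = chars.to(device), labels.to(device)
        logits = model(chars)                                   # (batch, seq_len, n_classes)
        loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()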

For Inference

inferencing = model.inference(test_input)
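
A sketch of turning the predicted class indices back into diacritized text. It assumes inferencing holds one predicted class index per character of test_input and that classes maps each diacritic string to its class index, as suggested by the training snippet; attach_diacritics is a hypothetical helper, not part of the project's API:

# Invert the class mapping: class index -> diacritic string ('' for "no diacritic")
id_to_diacritic = {idx: diacritic for diacritic, idx in classes.items()}

def attach_diacritics(text, predictions):
    out = []
    for ch, pred in zip(text, predictions):
        out.append(ch)
        if ch != ' ':                                  # whitespace never carries a diacritic
            out.append(id_to_diacritic.get(pred, ''))
    return ''.join(out)

print(attach_diacritics(test_input, inferencing))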

Modules

Preprocessing

Cleaning Process [Train & Validation Only] (a regex sketch of these steps follows the list)
  1. Remove HTML tags
  2. Remove URLs
  3. Remove the Arabic elongation character (Kashida)
  4. Separate numbers from surrounding text
  5. Collapse multiple whitespaces
  6. Clear punctuation
  7. Remove English letters and English and Arabic numerals
  8. Remove shifts
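
A minimal sketch of these cleaning steps with regular expressions; the exact patterns and ordering in cleaning.py may differ, so treat this as an illustration rather than the project's actual code:

import re
import string

def clean(text):
    text = re.sub(r'<[^>]+>', ' ', text)                         # 1. remove HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)           # 2. remove URLs
    text = text.replace('\u0640', '')                            # 3. remove Kashida (tatweel)
    text = re.sub(r'([0-9\u0660-\u0669]+)', r' \1 ', text)       # 4. separate numbers from words
    drop = set(string.punctuation) - set('.,?:')                 # keep sentence-splitting marks for tokenization
    text = ''.join(' ' if ch in drop else ch for ch in text)     # 6. clear punctuation
    text = re.sub(r'[A-Za-z0-9\u0660-\u0669]+', ' ', text)       # 7. drop English letters and English/Arabic digits
    text = re.sub(r'[ \t]+', ' ', text).strip()                  # 5. collapse whitespace (newlines kept for splitting)
    return text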
Tokenization
  • Split Using: [\n.,،؛:«»?؟]+
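
For instance, splitting a cleaned document into sentence chunks with that pattern (cleaned_text here stands for the output of the cleaning step):

import re

sentences = re.split(r'[\n.,،؛:«»?؟]+', cleaned_text)
sentences = [s.strip() for s in sentences if s.strip()]   # drop empty fragments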
Fix Diacritization Issues [Train & Validation Only] (a rough sketch follows the list)
  1. Replace consecutive diacritics with a single diacritic
  2. Ending Diacritics: Remove diacritics at the end of a word
  3. Misplaced Diacritics: Remove spaces between characters and diacritics
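
A rough regex sketch of steps 1 and 3 above. Treating Shadda + haraka as a valid pair is an assumption based on the class list in the next subsection, and step 2's exact rule for word-final diacritics is not shown here because this README does not specify it:

import re

DIACRITICS = '\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652'
D = '[' + DIACRITICS + ']'

def fix_diacritization(text):
    # 1. Consecutive diacritics: collapse repeated identical diacritics into one
    #    (a Shadda followed by a haraka is a valid pair and is left untouched)
    text = re.sub('(' + D + r')\1+', r'\1', text)
    # 3. Misplaced diacritics: remove spaces separating a character from its diacritic
    text = re.sub(r'(\S) +(' + D + ')', r'\1\2', text)
    return text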
Tashkeel Removal [Train & Validation Only]
  • For every character, record its gold class (diacritic) and strip it from the input text (a sketch follows the harakat list below)
  • Harakat:
    1. "Fatha":"\u064e"
    2. "Fathatan":  "\u064b"
    3. "Damma":"\u064f"
    4. "Dammatan":"\u064c"
    5. "Kasra":"\u0650"
    6. "Kasratan":"\u064d"
    7. "Sukun":"\u0652"
    8. "Shadda":"\u0651"
    9. "Shadda Fatha":"\u0651\u064e"
    10. "Shadda Fathatan":"\u0651\u064b"
    11. "Shadda Damma":"\u0651\u064f"
    12. "Shadda Dammatan":"\u0651\u064c"
    13. "Shadda Kasra":"\u0651\u0650"
    14. "Shadda Kasratan":"\u0651\u064d"      

Network


import torch.nn as nn

class Tashkeel(nn.Module):
  def __init__(self, vocab_size=vocab_size, embedding_dim=100, hidden_size=256, n_classes=n_classes):
    """
    The constructor of our Tashkeel model
    Inputs:
    - vocab_size: the number of unique characters in the vocabulary
    - embedding_dim: the embedding dimension
    - hidden_size: the LSTM hidden state size
    - n_classes: the number of final classes (diacritic tags)
    """
    super(Tashkeel, self).__init__()
    # (1) Create the embedding layer
    self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

    # (2) Create a 2-layer bidirectional LSTM with hidden size = hidden_size and batch_first = True
    self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True, num_layers=2, bidirectional=True)

    # (3) Create a linear layer with number of neurons = n_classes (it takes 2*hidden_size inputs, one per direction)
    self.linear = nn.Linear(2*hidden_size, n_classes)
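
  # Note: the forward pass is not included in the snippet above; the method below is a
  # minimal sketch consistent with the layers defined in __init__ (details such as
  # dropout or packing padded sequences may differ in the actual notebook).
  def forward(self, sentences):
    """
    Map a batch of character-id sequences to per-character class scores.
    Inputs:
    - sentences: (batch_size, seq_len) tensor of character ids
    Returns: (batch_size, seq_len, n_classes) logits
    """
    embeddings = self.embedding(sentences)   # (batch, seq_len, embedding_dim)
    lstm_out, _ = self.lstm(embeddings)      # (batch, seq_len, 2*hidden_size)
    logits = self.linear(lstm_out)           # (batch, seq_len, n_classes)
    return logits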

Contributors

Ahmed Hany

Mohab Zaghloul

Shaza Mohamed

Basma Elhoseny

License

This software is licensed under the MIT License. See the License file for more information. © Basma Elhoseny.