
Course Project for Natural Language Processing

Primary language: Python · License: MIT

Arabic Text Diacritization


”مِنْ لَمْ يُحْتَمَلْ ذُلُّ اَلتَّعَلُّمِ سَاعَةً ، بَقِيَ فِي ذُلِّ اَلْجَهْلِ أَبَدًا .“




📙 Overview

  • Arabic is one of the most widely spoken languages in the world. Although the use of Arabic on the Internet has increased, Arabic NLP still lags behind NLP for other languages. One aspect that distinguishes Arabic is its diacritics: marks for short vowels of constant length that are pronounced but usually omitted from written text, because Arabic speakers can usually infer them easily.
  • The same word in Arabic can have different meanings and pronunciations depending on how it is diacritized. Restoring these diacritics is very useful in many NLP systems, such as text-to-speech (TTS) and machine translation, because diacritics remove ambiguity in both pronunciation and meaning. Here is an example of Arabic text diacritization:

  • Real input: ذهب علي إلى الشاطئ
    Golden output: ذَهَبَ عَلِي إِلَى اَلشَّاطِئِ
  • Built using Python.
  • You can view the data set that was used to train the model.
  • Project Report

🚀 How To Run

  • Install the dependencies
pip install -r requirements.txt
  • Folder Structure
├───dataset
├───src
│   ├──utils
│   ├──constants.py
│   ├──evaluation.py
│   ├──featureExtraction.py
│   ├──models.py
│   ├──preprocessing.py
│   └──train.py
├───trained_models
├───requirements.txt
...
  • Navigate to the src directory
cd src
  • Run the train.py file to train the model
python train.py
  • Run the evaluation.py file to test the model
python evaluation.py
  • Features are saved in ../trained_models

🤖 Modules

Preprocessing Module

  1. Data Cleaning: The first step is always to clean the sentences read from the corpus. We define the basic Arabic letters in all their forms (36 unique letters), the main diacritics of the language (15 unique diacritics), and all punctuation marks and white space. Any character outside these sets is filtered out.
  2. Tokenization: The approach that yielded the best results is to divide the corpus into sentences of a fixed size (a window of length 1000). If a sentence exceeds the window size, we move backward to the first space we encounter and cut the sentence there; the word that would have been split becomes the first word of the next sentence, and so on. If a sentence is shorter than the window size, we pad the remainder so that all sentences end up with almost equal length.
  3. Encoding: The last step is to encode each character and diacritic as a specific index, defined in our character-to-index and diacritic-to-index dictionaries. This transforms letters and diacritics into the numerical form that our model takes as input.
  4. Failed Trials: We tried giving the model a small sliding window instead of a full sentence. The window is flexible in size: we can choose how many previous words and how many next words to include.
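
The cleaning, windowing, and encoding steps above can be sketched roughly as follows. This is a minimal illustration, not the code in preprocessing.py: the function names, the punctuation set, and the padding index 0 are assumptions, and the letter and diacritic ranges are one plausible reading of "36 letters" and "the main diacritics". A run of marks after a letter (for example shadda followed by fatha) is treated as a single label.

```python
import re

# Assumed character sets: Arabic letters U+0621..U+063A and U+0641..U+064A
# (36 code points), the basic diacritics U+064B..U+0652, white space, and a
# small punctuation set chosen for illustration.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}
DISALLOWED = re.compile(
    r"[^\u0621-\u063A\u0641-\u064A\u064B-\u0652\s.,:;!?()\-\u060C\u061B\u061F]"
)


def clean_text(text):
    """Drop every character outside the allowed sets and collapse white space."""
    text = DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def split_into_windows(text, window=1000):
    """Cut text into chunks of at most `window` characters, cutting at spaces."""
    chunks = []
    text = text.strip()
    while len(text) > window:
        # Walk back to the last space that still fits inside the window.
        cut = text.rfind(" ", 0, window + 1)
        if cut <= 0:  # no space available: fall back to a hard cut
            cut = window
        chunks.append(text[:cut])
        text = text[cut:].lstrip()  # the split word starts the next chunk
    if text:
        chunks.append(text)
    return chunks


def separate(text):
    """Split diacritized text into base characters and their diacritic labels."""
    chars, marks = [], []
    for ch in text:
        if ch in DIACRITICS:
            if chars:  # attach the mark to the preceding character
                marks[-1] += ch
        else:
            chars.append(ch)
            marks.append("")  # empty label means "no diacritic"
    return chars, marks


def build_vocab(symbols):
    """Map each distinct symbol to an index, reserving 0 for padding."""
    return {sym: i for i, sym in enumerate(sorted(set(symbols)), start=1)}


def encode(seq, vocab, length, pad=0):
    """Turn a sequence of symbols into indices, padded to a fixed length."""
    ids = [vocab[s] for s in seq][:length]
    return ids + [pad] * (length - len(ids))


if __name__ == "__main__":
    sample = "ذَهَبَ عَلِي إِلَى اَلشَّاطِئِ"
    window = 12  # tiny window so the example produces several chunks
    pairs = [separate(c) for c in split_into_windows(clean_text(sample), window)]
    char2idx = build_vocab(ch for letters, _ in pairs for ch in letters)
    diac2idx = build_vocab(m for _, marks in pairs for m in marks)
    for letters, marks in pairs:
        print(encode(letters, char2idx, window), encode(marks, diac2idx, window))
```

Here the window is measured on the diacritized text before letters and marks are separated; the project may measure it differently.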

Feature Extraction Module

  1. Trainable Embeddings (Used): Here we use the Embedding layer provided by torch.nn, which gives us trainable embeddings at the character level. This layer transforms discrete input elements (in our case, character indices) into continuous vector representations. Each unique input element is associated with a learnable vector, and these vectors capture semantic relationships between the elements.
  2. AraVec CBOW Word Embeddings: AraVec is an open-source project that offers pre-trained distributed word representations, specifically designed for the Arabic natural language processing (NLP) research community.
  3. AraBERT: Pre-trained Transformers for Arabic Language Understanding and Generation (Arabic BERT, Arabic GPT2, Arabic ELECTRA)
The following approaches did not work directly on raw Arabic letters, because the libraries involved tokenize English text and expect the data to be sentences in Latin script, so we had to find a workaround. After some research, we found a method that maps Arabic letters and diacritics to English letters and symbols: using the Buckwalter transliterate and untransliterate functions, we were able to switch scripts ourselves for the feature extraction part.
  • Bag of Words: We use CountVectorizer, fitted on the whole corpus. After fitting, we get the feature names from the vectorizer and save them to a CSV file, which represents the bag-of-words model. We can look up any word and get its corresponding vector, which describes its count in the corpus but carries no information about the word's position.
  • TF-IDF: We initialize the model with TfidfVectorizer and turn off lowercasing. After fitting the model on the input data and transforming it, we extract the feature names; these form our word set, which we use as the column headings of the output CSV file.
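
As an illustration of the Buckwalter mapping mentioned above, here is a small self-contained sketch. The project most likely calls ready-made transliterate and untransliterate functions from a library; the table and function names below are written out by hand for illustration and cover only the standard Buckwalter letters and diacritics.

```python
# Standard Buckwalter table, built from consecutive Unicode ranges.
# Each tuple is (first Arabic code point, Latin symbols for that range).
_RANGES = (
    (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghain
    (0x0640, "_fqklmnhwYyFNKaui~o"),         # tatweel, feh .. yeh, diacritics
    (0x0670, "`{"),                          # dagger alif, alif wasla
)

AR_TO_BW = {
    chr(start + offset): latin
    for start, symbols in _RANGES
    for offset, latin in enumerate(symbols)
}
BW_TO_AR = {latin: arabic for arabic, latin in AR_TO_BW.items()}

_TO_BW = str.maketrans(AR_TO_BW)
_TO_AR = str.maketrans(BW_TO_AR)


def transliterate(text):
    """Map Arabic letters and diacritics to Buckwalter symbols."""
    return text.translate(_TO_BW)


def untransliterate(text):
    """Map Buckwalter symbols back to Arabic letters and diacritics."""
    return text.translate(_TO_AR)


if __name__ == "__main__":
    word = "ذَهَبَ"
    latin = transliterate(word)
    print(latin, untransliterate(latin) == word)
```

Buckwalter is case-sensitive (for example, H and h stand for different letters), so lowercasing the transliterated text would merge distinct letters; this is a likely reason for turning lowercasing off in the vectorizer.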

Model Selection

We fit the training data and labels to a 5-layer bidirectional LSTM, which gave us 97% accuracy.
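
A character-level model of this kind might look like the sketch below. It is not the code in models.py: the class name, the embedding and hidden sizes, the padding index, and the vocabulary sizes in the usage example are all assumptions. Only the trainable character embedding and the 5-layer bidirectional LSTM come from the description above.

```python
import torch
from torch import nn


class DiacritizerSketch(nn.Module):
    """Character embedding -> 5-layer BiLSTM -> per-character diacritic logits."""

    def __init__(self, vocab_size, n_classes, embed_dim=128, hidden=256,
                 layers=5, pad_idx=0):
        super().__init__()
        # Trainable character embeddings; the padding row stays at zero.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        # Forward and backward states are concatenated, hence 2 * hidden.
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, char_ids):
        x = self.embedding(char_ids)   # (batch, seq, embed_dim)
        x, _ = self.lstm(x)            # (batch, seq, 2 * hidden)
        return self.out(x)             # (batch, seq, n_classes)


if __name__ == "__main__":
    # Hypothetical sizes: index 0 is padding in both vocabularies.
    model = DiacritizerSketch(vocab_size=40, n_classes=16)
    char_ids = torch.randint(1, 40, (2, 50))   # batch of 2 windows, 50 chars
    labels = torch.randint(1, 16, (2, 50))
    logits = model(char_ids)
    # Padded positions (label 0) are ignored when computing the loss.
    loss = nn.CrossEntropyLoss(ignore_index=0)(logits.view(-1, 16), labels.view(-1))
    print(logits.shape, loss.item())
```

Predicting one label per input character keeps the output aligned with the encoded letters, so the predicted diacritics can be written back after each letter.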


👑 Contributors


Abdelrahman Hamdy


Beshoy Morad


Abdelrahman Noaman


Eslam Ashraf

🔒 License

Note: This software is licensed under the MIT License. See License for more information. ©AbdelrahmanHamdyy.