This project addresses the problem of Arabic diacritization using a Bi-LSTM network.
For example:
ذَهَبَ اَلْوَلَدُ إِلَى اَلْمَدْرَسَةِ <-- ذهب الولد إلى المدرسة
Our team ranked 5th on the leaderboard, with an accuracy of 97% on the hidden test set.
For Cleaning
python cleaning.py --mode (choices: train, test, validate)
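A minimal sketch of the argument parsing cleaning.py is assumed to use; only the --mode flag and its choices come from the usage above, everything else is illustrative:

import argparse

parser = argparse.ArgumentParser(description="Clean one split of the Tashkeel corpus.")
parser.add_argument("--mode", choices=["train", "test", "validate"], required=True,
                    help="Which split of the corpus to clean.")
args = parser.parse_args()
print(f"Cleaning the {args.mode} split ...")
# The actual cleaning steps applied to the chosen split are listed below.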
For Training
val_train_sentences, val_train_labels, val_train_size = get_params(vocab_map, classes, 'val_train_X.pickle', 'val_train_Y.pickle')
val_train_dataset = TashkeelDataset(val_train_sentences, val_train_labels, vocab_map['<PAD>'], max_length)
model = Tashkeel()
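From here, a conventional training loop could look roughly like the sketch below. The batch size, optimizer, learning rate, epoch count, and the assumption that the model returns per-character logits of shape (batch, seq_len, n_classes) are illustrative, not values taken from the project:

import torch
from torch.utils.data import DataLoader

# Assumed hyperparameters; the original training script may differ.
train_loader = DataLoader(val_train_dataset, batch_size=64, shuffle=True)
criterion = torch.nn.CrossEntropyLoss()   # padding positions can be masked with ignore_index
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(10):
    for sentences, labels in train_loader:
        optimizer.zero_grad()
        logits = model(sentences)                                    # (batch, seq_len, n_classes)
        loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
        loss.backward()
        optimizer.step()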
For Inference
inferencing = model.inference(test_input)
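Assuming model.inference returns one predicted class index per input character and that classes maps diacritic strings to class indices, the predictions can be turned back into diacritized text with a small hypothetical helper:

# Invert the classes mapping (class index -> diacritic string); '' means no diacritic.
rev_classes = {idx: diacritic for diacritic, idx in classes.items()}

def attach_diacritics(characters, predicted_ids):
    # Place each predicted diacritic directly after its base character.
    return "".join(ch + rev_classes.get(int(cls), "") for ch, cls in zip(characters, predicted_ids))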
1. Remove HTML tags
2. Remove URLs
3. Remove special Arabic character (Kashida)
4. Separate Numbers
5. Remove Multiple Whitespaces
6. Clear Punctuations
7. Remove English letters and English and Arabic numerals
8. Remove shifts
• Split Using: [\n.,،؛:«»?؟]+
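Roughly, the cleaning and sentence-splitting steps above could be implemented as in the sketch below; the exact regular expressions are assumptions, not the project's own code:

import re

SENTENCE_DELIMS = r"[\n.,،؛:«»?؟]+"

def clean_text(text):
    # (1) Remove HTML tags and (2) URLs.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # (3) Remove the Kashida / Tatweel elongation character (U+0640).
    text = text.replace("\u0640", "")
    # (6)+(7) Drop everything that is not an Arabic letter, a diacritic,
    # a sentence delimiter, or whitespace (this also removes English
    # letters and English/Arabic digits).
    text = re.sub(r"[^\u0621-\u063A\u0641-\u064A\u064B-\u0652\n.,،؛:«»?؟\s]", " ", text)
    # (5) Collapse multiple whitespaces.
    return re.sub(r"[ \t]+", " ", text).strip()

def split_sentences(text):
    # Split cleaned text into sentences on the delimiters listed above.
    return [s.strip() for s in re.split(SENTENCE_DELIMS, text) if s.strip()]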
1. Replace consecutive diacritics with a single diacritic
2. Ending Diacritics: Remove diacritics at the end of a word
3. Misplaced Diacritics: Remove spaces between characters and diacritics
• Remove the gold diacritic class from every character (the removed diacritics are kept separately as the training labels)
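A sketch of how the gold diacritic class can be separated from each character; the handling of stray diacritics and of undiacritized characters is an assumption:

DIACRITICS = "\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652"

def extract_gold_labels(diacritized_sentence):
    # Split a diacritized sentence into bare characters and one gold
    # diacritic label per character ('' means no diacritic).
    characters, labels = [], []
    i = 0
    while i < len(diacritized_sentence):
        ch = diacritized_sentence[i]
        i += 1
        if ch in DIACRITICS:
            continue  # misplaced diacritic with no base character: drop it
        label = ""
        while i < len(diacritized_sentence) and diacritized_sentence[i] in DIACRITICS:
            label += diacritized_sentence[i]
            i += 1
        characters.append(ch)
        labels.append(label)
    return characters, labels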
• Harakat:
1. "Fatha":"\u064e"
2. "Fathatan": "\u064b"
3. "Damma":"\u064f"
4. "Dammatan":"\u064c"
5. "Kasra":"\u0650"
6. "Kasratan":"\u064d"
7. "Sukun":"\u0652"
8. "Shadda":"\u0651"
9. "Shadda Fatha":"\u0651\u064e"
10. "Shadda Fathatan":"\u0651\u064b"
11. "Shadda Damma":"\u0651\u064f"
12. "Shadda Dammatan":"\u0651\u064c"
13. "Shadda Kasra":"\u0651\u0650"
14. "Shadda Kasratan":"\u0651\u064d"
import torch.nn as nn

class Tashkeel(nn.Module):
    def __init__(self, vocab_size=vocab_size, embedding_dim=100, hidden_size=256, n_classes=n_classes):
        """
        The constructor of our Tashkeel model
        Inputs:
        - vocab_size: the size of the vocabulary (number of unique input tokens)
        - embedding_dim: the embedding dimension
        - hidden_size: the hidden size of the LSTM
        - n_classes: the number of final classes (tags)
        """
        super(Tashkeel, self).__init__()
        # (1) Create the embedding layer
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        # (2) Create a 2-layer bidirectional LSTM with hidden size = hidden_size and batch_first = True
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True,
                            num_layers=2, bidirectional=True)
        # (3) Create a linear layer with number of neurons = n_classes
        # (the input is 2 * hidden_size because the LSTM is bidirectional)
        self.linear = nn.Linear(2 * hidden_size, n_classes)
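Only the constructor is shown above; a minimal sketch of a matching forward method (assumed, not taken from the project's source):

    def forward(self, sentences):
        # sentences: (batch_size, max_length) tensor of character ids.
        embedded = self.embedding(sentences)   # (batch, seq, embedding_dim)
        lstm_out, _ = self.lstm(embedded)      # (batch, seq, 2 * hidden_size)
        return self.linear(lstm_out)           # (batch, seq, n_classes)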
Ahmed Hany | Mohab Zaghloul | Shaza Mohamed | Basma Elhoseny
This software is licensed under the MIT License. See LICENSE for more information. © Basma Elhoseny.