Collected by: Nora Madi
Email: nmadi at ksu dot edu dot sa
Site: https://github.com/iwan-rg
Reference: N. Madi and H. S. Al‐Khalifa, “A7’ta: Data on a Monolingual Arabic Parallel Corpus for Grammar Checking,” Data in Brief, vol. 22, pp. 237–240, 2019.
The parallel corpus is a collection of Modern Standard Arabic (MSA) sentences (and words) extracted from the book كشاف الأخطاء اللغوية - الصحافة السعودية أنموذجاً (Linguistic Error Detector – Saudi Press as a Sample).
It contains erroneous Arabic sentences paired with their correct counterparts.
1- Text format
2- UTF-8 encoding
The data contains 300 documents, 445 erroneous sentences and their error-free counterparts, and a total of 3,532 words. Each pair of sentences differs in only one word.
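Since each pair differs in only one word, the differing word can be located by comparing the two sentences token by token. The following is a minimal sketch, assuming whitespace tokenization and that both sentences have the same number of tokens; the function name `differing_words` and the placeholder strings in the usage note are hypothetical and not taken from the corpus.

```python
def differing_words(erroneous: str, correct: str) -> list[tuple[str, str]]:
    """Return (erroneous_word, correct_word) for each position that differs.

    Assumption: words are separated by whitespace and the correction does not
    change the number of words. Pairs that break this assumption are reported
    with a ValueError instead of being silently misaligned.
    """
    err_tokens = erroneous.split()
    cor_tokens = correct.split()
    if len(err_tokens) != len(cor_tokens):
        raise ValueError("sentences have different token counts")
    return [(e, c) for e, c in zip(err_tokens, cor_tokens) if e != c]
```

For example, with invented placeholder sentences, `differing_words("w1 w2 w3", "w1 wX w3")` returns `[("w2", "wX")]`. A pair from the corpus would be expected to yield exactly one tuple if the one-word-difference property holds under this tokenization.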
- There are 8 folders, one for each of the eight main categories in the book.
- Within each main folder, there is a sub-folder for each sub-category of that main category, if any.
- Inside each main folder or sub-folder, there are folders for each type of error.
- Within each error type folder, there are two files: one for the correctly written sentences (الصواب) and another for the erroneous sentences (الخطأ).
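
The folder layout above can be traversed with a short script. This is a sketch under several assumptions that the description does not confirm: that the two file names contain the words الصواب and الخطأ, that each file holds one sentence per line, and that line N of one file corresponds to line N of the other. The function names `read_lines` and `load_pairs` are hypothetical.

```python
import unicodedata
from pathlib import Path

# Assumed markers in the file names (taken from the labels in the description).
CORRECT_MARK = "الصواب"
ERROR_MARK = "الخطأ"


def read_lines(path: Path) -> list[str]:
    """Read a UTF-8 text file and return its non-empty lines, stripped."""
    # "utf-8-sig" also accepts files that start with a byte order mark.
    text = path.read_text(encoding="utf-8-sig")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_pairs(root) -> list[tuple[str, str, str]]:
    """Collect (folder_label, erroneous, correct) triples from the corpus tree.

    Any folder that holds exactly one file of each kind is treated as an
    error type folder; all other folders are skipped.
    """
    root = Path(root)
    pairs = []
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        files = [f for f in folder.iterdir() if f.is_file()]
        # Normalize names so decomposed Unicode file names still match.
        named = [(unicodedata.normalize("NFC", f.name), f) for f in files]
        correct = [f for name, f in named if CORRECT_MARK in name]
        wrong = [f for name, f in named if ERROR_MARK in name]
        if len(correct) != 1 or len(wrong) != 1:
            continue
        bad_lines = read_lines(wrong[0])
        good_lines = read_lines(correct[0])
        if len(bad_lines) != len(good_lines):
            raise ValueError(f"line counts differ in {folder}")
        # Category / sub-category / error type, relative to the corpus root.
        label = folder.relative_to(root).as_posix()
        pairs.extend((label, b, g) for b, g in zip(bad_lines, good_lines))
    return pairs
```

Calling `load_pairs("path/to/corpus")` (a placeholder path) returns a flat list in which the label records the category, sub-category, and error type of each pair. If the files turn out to use a different naming or alignment scheme, the two markers and the line-by-line pairing are the parts to adjust.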