A-Monolingual-Arabic-Parallel-Corpus-

A7'ta: A Monolingual Arabic Parallel Corpus for Grammar Checking

Collected by Nora Madi. Email: nmadi at ksu dot edu dot sa. Site: https://github.com/iwan-rg

Reference: N. Madi and H. S. Al‐Khalifa, “A7’ta: Data on a Monolingual Arabic Parallel Corpus for Grammar Checking,” Data in Brief, vol. 22, pp. 237–240, 2019.

Resource

The parallel corpus is a collection of Modern Standard Arabic (MSA) sentences (and words) extracted from the book كشاف الأخطاء اللغوية - الصحافة السعودية أنموذجاً (Linguistic Error Detector – Saudi Press as a Sample).

Data Files:

The data files contain erroneous Arabic sentences and their correct counterparts.

Data Structure:

1- Text format
2- UTF-8 encoding

Statistics:

The data contains 300 documents, 445 erroneous sentences and their error-free counterparts, and a total of 3,532 words. Each pair of sentences differs in only one word.
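Because each pair differs in only one word, the erroneous word and its correction can be located by comparing the two sentences token by token. The following is a minimal sketch, not part of the corpus tooling. It assumes whitespace tokenization and that the correction does not change the number of tokens, neither of which is guaranteed by the corpus description. The function name `differing_words` and the sample sentences are invented for illustration and are not taken from the corpus.

```python
def differing_words(erroneous, correct):
    """Return (position, wrong_word, right_word) for each whitespace token that differs.

    Assumption: both sentences split into the same number of tokens.
    """
    bad, good = erroneous.split(), correct.split()
    if len(bad) != len(good):
        raise ValueError("sentences have different token counts")
    return [(i, b, g) for i, (b, g) in enumerate(zip(bad, good)) if b != g]


# Invented example pair (not from the corpus): a missing hamza in the third word.
print(differing_words("ذهب الطلاب الى المدرسة", "ذهب الطلاب إلى المدرسة"))
```

If the token counts differ, the function raises an error instead of guessing an alignment, which makes pairs that violate the assumption easy to spot.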

Folder structure:

There are eight folders, one for each of the eight main categories in the book.
Within each folder, there is a sub-folder for each sub-category of that main category, if any.
Inside each main folder or sub-folder, there are folders for each type of error.
Within each error-type folder, there are two files: one for the correctly written sentences (الصواب) and another for the erroneous sentences (الخطأ).
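The layout described above can be traversed with a short script that pairs the two files in each error-type folder. This is a sketch under stated assumptions, not an official loader: it assumes the file names contain the words الصواب and الخطأ, that each file holds one sentence per line, and that line i of one file corresponds to line i of the other. The exact file names and line alignment should be checked against the downloaded data. The names `read_lines` and `iter_pairs` are hypothetical.

```python
import unicodedata
from pathlib import Path

CORRECT_MARK = "الصواب"  # assumed to appear in the name of the correct-sentence file
ERROR_MARK = "الخطأ"     # assumed to appear in the name of the erroneous-sentence file


def read_lines(path):
    """Return the non-empty, stripped lines of a UTF-8 text file (a BOM is tolerated)."""
    text = Path(path).read_text(encoding="utf-8-sig")
    return [line.strip() for line in text.splitlines() if line.strip()]


def iter_pairs(root, correct_mark=CORRECT_MARK, error_mark=ERROR_MARK):
    """Yield (relative_folder, erroneous_sentence, correct_sentence) tuples.

    A folder counts as an error-type folder only if it holds exactly one file
    matching each mark; all other folders are skipped.
    """
    for folder in sorted(p for p in Path(root).rglob("*") if p.is_dir()):
        files = [f for f in folder.iterdir() if f.is_file()]
        # Normalize names so composed and decomposed Arabic letters compare equal.
        names = {f: unicodedata.normalize("NFC", f.name) for f in files}
        correct = [f for f in files if correct_mark in names[f]]
        wrong = [f for f in files if error_mark in names[f]]
        if len(correct) != 1 or len(wrong) != 1:
            continue
        # Assumption: the two files are aligned line by line.
        for bad, good in zip(read_lines(wrong[0]), read_lines(correct[0])):
            yield folder.relative_to(root).as_posix(), bad, good
```

Calling `list(iter_pairs("path/to/corpus"))` on a local copy (the path is a placeholder) would give one tuple per sentence pair, with the relative folder path carrying the category, sub-category, and error type.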