This is our bachelor's project in computer science. The goal is to classify Arabic text into five dialect groups: Gulf (GLF), Egyptian (EGY), Iraqi (IRQ), Levantine (LEV), and North African (NOR).
You can find the research paper at research/Graduation_project.pdf.
Here is a link to test the model: https://arabic.hawzen.me/
Dataset | Source |
---|---|
SMADC | Areej Alshutayri and Eric Atwell. Classifying Arabic dialect text in the Social Media Arabic Dialect Corpus (SMADC). January 2021. |
AOC-dialectal-annotations | Ryan Cotterell and Chris Callison-Burch. A multi-dialect, multigenre corpus of informal written Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 241–245, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). |
annotated_data | Omar F. Zaidan and Chris Callison-Burch. The Arabic online commentary dataset: an annotated dataset of informal Arabic with high dialectal content. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 37–41, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. |
Dart | Israa Alsarsour, Esraa Mohamed, Reem Suwaileh, and Tamer Elsayed. DART: A large dataset of dialectal Arabic tweets. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA). |
extra_data | Collected by us (the project team) |
Dr. Nasser A. AlSadhan
Abdulrahman Al-Shawi
Musaad Al-Qubayl
Khaled Al-Bader
Abdullah Al-Suwailem
Mohand Al-Rasheed