This repository contains a list of Modern Standard Arabic closed-class words, which can be used as a stop list for a variety of natural language processing applications. The list contains 740 inflected words and clitics in the Arabic Treebank (ATB) tokenization scheme (Maamouri et al., 2004; Habash, 2010). The inflected words are based on 309 lemmas from the Standard Arabic Morphological Analyzer, SAMA (Graff et al., 2009).
The list was created by Wael Salloum and Nizar Habash. The repository contains a technical report detailing its design decisions.
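As a rough illustration of using the list as a stop list, the sketch below loads entries into a set and drops matching tokens. It assumes the list is available as plain text with one entry per line; that layout, the helper names `load_stop_words` and `remove_stop_words`, and the three sample entries are illustrative assumptions and are not taken from the repository.

```python
from typing import Iterable, List, Set


def load_stop_words(lines: Iterable[str]) -> Set[str]:
    """Build a stop-word set from lines of text.

    Assumes one entry per line (hypothetical layout); blank lines are skipped.
    """
    stop_words = set()
    for line in lines:
        word = line.strip()
        if word:
            stop_words.add(word)
    return stop_words


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Return the tokens that are not in the stop-word set, in order."""
    return [token for token in tokens if token not in stop_words]


# Illustrative entries only; the real list has 740 entries.
sample_lines = ["في\n", "من\n", "\n", "إلى\n"]
stop_words = load_stop_words(sample_lines)

# Input is assumed to be tokenized already, in the same scheme as the list.
tokens = ["ذهب", "الولد", "إلى", "المدرسة", "في", "الصباح"]
content_tokens = remove_stop_words(tokens, stop_words)
print(content_tokens)
```

With a file on disk, the same function can take an open file handle, for example `load_stop_words(open(path, encoding="utf-8"))`, where `path` points to wherever the list is stored. Because the entries follow the ATB tokenization scheme, input text most likely needs to be tokenized the same way for clitics to match.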
If you use this resource, please cite:
- Wael Salloum and Nizar Habash. 2012. A Modern Standard Arabic Closed-Class Word List. Columbia University's Center for Computational Learning Systems Tech Report #CCLS-12-03.
- D. Graff, M. Maamouri, B. Bouziri, S. Krouna, S. Kulick, and T. Buckwalter. Standard Arabic Morphological Analyzer (SAMA) Version 3.1, 2009. Linguistic Data Consortium LDC2009E73.
- N. Habash. Introduction to Arabic Natural Language Processing. Morgan & Claypool Publishers, 2010.
- M. Maamouri, A. Bies, T. Buckwalter, and W. Mekki. The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. In NEMLAR Conference on Arabic Language Resources and Tools, pages 102–109, Cairo, Egypt, 2004.