Arabic Light Stemming with Python
This code performs light stemming for Arabic words. The implementation is based on the ISRI Arabic Stemmer, a rooting algorithm for Arabic text. The ISRI Arabic Stemmer is described in:
Taghva, K., Elkoury, R., and Coombs, J. 2005. Arabic Stemming without a root dictionary. Information Science Research Institute. University of Nevada, Las Vegas, USA.
Light stemming applies the following steps:
- remove diacritics representing Arabic short vowels
- remove length-three and length-two prefixes, in this order
- remove length-three and length-two suffixes, in this order
- remove connective ‘و’ if it precedes a word beginning with ‘و’
- normalize initial hamza to bare alif
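The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code shipped in light_stem.py: the function names are hypothetical, and the affix lists are a short illustrative subset chosen for the example (an assumption; the NLTK ISRI stemmer linked below uses longer lists). The minimum-length checks before stripping are likewise an assumption modelled on that stemmer.

```python
import re

# Illustrative affix lists (assumption: a small subset, not the full ISRI lists).
PREFIXES_3 = ["\u0643\u0627\u0644", "\u0628\u0627\u0644", "\u0648\u0644\u0644", "\u0648\u0627\u0644"]
PREFIXES_2 = ["\u0627\u0644", "\u0644\u0644"]
SUFFIXES_3 = ["\u062a\u0627\u0646", "\u062a\u064a\u0646"]
SUFFIXES_2 = ["\u0648\u0646", "\u0627\u062a", "\u0627\u0646", "\u064a\u0646", "\u0647\u0627", "\u0647\u0645"]

SHORT_VOWELS = re.compile(r"[\u064B-\u0652]")       # fathatan .. sukun
INITIAL_HAMZA = re.compile(r"^[\u0622\u0623\u0625]")  # alif with madda / hamza above / hamza below
WAW = "\u0648"
BARE_ALIF = "\u0627"


def strip_prefix(word):
    """Remove one length-three prefix, else one length-two prefix."""
    if len(word) >= 6:
        for prefix in PREFIXES_3:
            if word.startswith(prefix):
                return word[3:]
    if len(word) >= 5:
        for prefix in PREFIXES_2:
            if word.startswith(prefix):
                return word[2:]
    return word


def strip_suffix(word):
    """Remove one length-three suffix, else one length-two suffix."""
    if len(word) >= 6:
        for suffix in SUFFIXES_3:
            if word.endswith(suffix):
                return word[:-3]
    if len(word) >= 5:
        for suffix in SUFFIXES_2:
            if word.endswith(suffix):
                return word[:-2]
    return word


def light_stem(word):
    """Apply the five light-stemming steps in the order listed above."""
    word = SHORT_VOWELS.sub("", word)          # 1. remove short-vowel diacritics
    word = strip_prefix(word)                  # 2. prefixes of length 3, then 2
    word = strip_suffix(word)                  # 3. suffixes of length 3, then 2
    if len(word) >= 4 and word.startswith(WAW + WAW):
        word = word[1:]                        # 4. drop connective waw before a waw
    return INITIAL_HAMZA.sub(BARE_ALIF, word)  # 5. initial hamza -> bare alif
```

With these lists, a word such as "والمعلمون" loses the prefix "وال" and the suffix "ون", leaving "معلم".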
For more information, please refer to http://www.nltk.org/_modules/nltk/stem/isri.html
Note that the steps above are derived from http://www.nltk.org/_modules/nltk/stem/isri.html#ISRIStemmer.stem
pip install -r requirements
usage: light_stem.py [-h] -i INFILE -o OUTFILE
performs light stemming for Arabic words.
optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        input file.
  -o OUTFILE, --outfile OUTFILE
                        out file.
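For readers who want to see how a command-line interface like the one above could be wired up, here is a rough sketch using argparse. It is not the actual source of light_stem.py: the function names are hypothetical, the assumption that the input is split into whitespace-separated tokens line by line is mine, and the default stemmer here is a placeholder that only strips diacritics.

```python
import argparse
import re


def build_parser():
    """Build a parser with the same flags as the usage text above."""
    parser = argparse.ArgumentParser(
        prog="light_stem.py",
        description="performs light stemming for Arabic words.",
    )
    parser.add_argument("-i", "--infile", required=True, help="input file.")
    parser.add_argument("-o", "--outfile", required=True, help="out file.")
    return parser


def strip_diacritics(word):
    """Placeholder stemmer (assumption): removes short-vowel diacritics only."""
    return re.sub(r"[\u064B-\u0652]", "", word)


def stem_file(infile, outfile, stem=strip_diacritics):
    """Stem every whitespace-separated token, keeping one output line per input line."""
    with open(infile, encoding="utf-8") as src, open(outfile, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(stem(token) for token in line.split()) + "\n")


def main(argv=None):
    """Parse the command line (or an explicit argument list) and process the file."""
    args = build_parser().parse_args(argv)
    stem_file(args.infile, args.outfile)
```

Calling main() with no arguments reads the flags from the command line; passing a list such as ["-i", "in.txt", "-o", "out.txt"] makes the same code easy to exercise from a test.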