Hazm

Python library for digesting Persian text.

Text cleaning
Sentence and word tokenizer
Word lemmatizer
POS tagger
Shallow parser
Dependency parser
Interfaces for Persian corpora
NLTK compatible

Documentation

Visit https://roshan-ai.ir/hazm/docs to view the full documentation.

Module accuracy

Module name	Accuracy	Pre-trained model
Lemmatizer	89.9%
Chunker	93.4%	download pre-trained model
POSTagger	97.2% (universal: 98.8%)	download pre-trained model
DependencyParser	97.1%	download pre-trained model

Installation

The latest stable version of Hazm can be installed through pip:

pip install hazm

To test or use Hazm with the latest updates, install it directly from the master branch instead:

pip install https://github.com/roshan-research/hazm/archive/master.zip --upgrade
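
After either command, it can be useful to confirm which version actually ended up installed. The following is a minimal sketch that uses only the Python standard library; `installed_version` is a hypothetical helper name introduced here for illustration and is not part of Hazm.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "hazm"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"hazm {found}" if found else "hazm is not installed")
```

Running this in the same environment that pip installed into avoids confusion when several Python interpreters are present.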

Usage

>>> from hazm import *

>>> normalizer = Normalizer()
>>> normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند')
'اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند'

>>> sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
['ما هم برای وصل کردن آمدیم!', 'ولی برای پردازش، جدا بهتر نیست؟']
>>> word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')
['ولی', 'برای', 'پردازش', '،', 'جدا', 'بهتر', 'نیست', '؟']

>>> stemmer = Stemmer()
>>> stemmer.stem('کتاب‌ها')
'کتاب'
>>> lemmatizer = Lemmatizer()
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'
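
For verbs, the lemma shown above joins two stems with `#`: the past stem first, then the present stem. When the stems are needed separately, plain string handling is enough. The sketch below assumes that `past#present` format; `split_verb_lemma` is a hypothetical helper, not a Hazm function.

```python
def split_verb_lemma(lemma: str):
    """Split a 'past#present' verb lemma into its two stems.

    Returns (past_stem, present_stem). A lemma without '#'
    (for example a noun) is returned as (lemma, None).
    """
    if "#" not in lemma:
        return lemma, None
    past_stem, present_stem = lemma.split("#", 1)
    return past_stem, present_stem


# Literal lemma strings; no Hazm model or import is needed here.
print(split_verb_lemma("رفت#رو"))
print(split_verb_lemma("کتاب"))
```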

>>> tagger = POSTagger(model='resources/pos_tagger.model')
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]
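
The tagger returns an ordinary list of `(word, tag)` tuples, so standard Python is enough to filter or group the result. The sketch below reuses the output shown above as literal data and does not load a model; `words_by_tag` is a hypothetical helper introduced for illustration.

```python
from collections import defaultdict


def words_by_tag(tagged):
    """Group words from (word, tag) pairs under their tag, keeping order."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)


# Literal copy of the tagger output from the example above.
tagged = [("ما", "PRO"), ("بسیار", "ADV"), ("کتاب", "N"), ("می‌خوانیم", "V")]
groups = words_by_tag(tagged)
print(groups["N"])  # the words tagged as nouns
```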

>>> chunker = Chunker(model='resources/chunker.model')
>>> tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
>>> tree2brackets(chunker.parse(tagged))
'[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'
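
`tree2brackets` returns a flat string in which each chunk is wrapped in square brackets and ends with its label. When the chunks are needed as data, that string can be split back apart. The sketch below assumes exactly the format shown above (the label is the last whitespace-separated token inside each bracket pair) and ignores any tokens that fall outside brackets; `parse_brackets` is a hypothetical helper, not part of Hazm.

```python
import re

# Matches the text inside one non-nested [...] pair.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def parse_brackets(text: str):
    """Return a list of (phrase, label) pairs from a bracketed chunk string."""
    chunks = []
    for inner in BRACKETED.findall(text):
        parts = inner.rsplit(None, 1)  # split off the trailing label
        if len(parts) == 2:
            phrase, label = parts
            chunks.append((phrase, label))
    return chunks


# Literal copy of the chunker output from the example above.
for phrase, label in parse_brackets("[کتاب خواندن NP] [را POSTP] [دوست داریم VP]"):
    print(label, phrase)
```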

>>> parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer)
>>> parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟'))
<DependencyGraph with 8 nodes>

Hazm in other languages

Disclaimer: These ports are not developed or maintained by Roshan. They may not have the same functionality or quality as the original Hazm.

JHazm: A Java port of Hazm
NHazm: A C# port of Hazm

Contribution

We welcome contributions to this repo, such as bug reports, feature requests, code improvements, and documentation updates. Please follow the Contribution guideline when contributing. You can open an issue, or fork the repo, write your code, create a pull request, and wait for review and feedback. Thank you for your interest in and support of this repo!

Thanks

Code contributors

Others

Thanks to the Virastyar project for providing the Persian word list.
