- Evaluation
- Introduction
- Features
- Installation
- Pretrained-Models
- Usage
- Documentation
- Hazm in other languages
- Contribution
- Thanks
## Evaluation

Module name | Accuracy |
---|---|
DependencyParser | 85.6% |
POSTagger | 98.8% |
Chunker | 93.4% |
Lemmatizer | 89.9% |
## Introduction

Hazm is a Python library for performing natural language processing tasks on Persian text. It offers various features for analyzing, processing, and understanding Persian text. You can use Hazm to normalize text, tokenize sentences and words, lemmatize words, assign part-of-speech tags, identify dependency relations, create word and sentence embeddings, and read popular Persian corpora.
## Features

- Normalization: Converts text to a standard form, such as removing diacritics and correcting spacing.
- Tokenization: Splits text into sentences and words.
- Lemmatization: Reduces words to their base forms.
- POS tagging: Assigns a part of speech to each word.
- Dependency parsing: Identifies the syntactic relations between words.
- Embedding: Creates vector representations of words and sentences.
- Persian corpora reading: Easily read popular Persian corpora with ready-made readers and minimal code (see the sketch after this list).
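As a taste of the corpus-reading feature, here is a minimal sketch. `PersicaReader` is one of Hazm's bundled readers, but you must download the Persica corpus separately; the `persica.csv` path and the record fields shown are assumptions about your local setup.

```python
# Minimal corpus-reading sketch: assumes the Persica corpus was
# downloaded separately; adjust 'persica.csv' to your local path.
from hazm import PersicaReader

reader = PersicaReader('persica.csv')
for doc in reader.docs():   # yields one record per document
    print(doc['title'])     # field names may differ by corpus
    break
```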
## Installation

To install the latest release of Hazm, run the following command in your terminal:

pip install hazm

Alternatively, you can install the latest development version from GitHub (this version may be unstable and buggy):

pip install git+https://github.com/roshan-research/hazm.git
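To check that the installation works, you can run a quick smoke test from your terminal (the sample sentence is arbitrary):

python -c "from hazm import word_tokenize; print(word_tokenize('سلام دنیا'))"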
## Pretrained-Models

Finally, if you want to use our pretrained models, you can download them from the links below:
Module name | Size |
---|---|
Download WordEmbedding | ~ 5 GB |
Download SentEmbedding | ~ 1 GB |
Download POSTagger | ~ 18 MB |
Download DependencyParser | ~ 15 MB |
Download Chunker | ~ 4 MB |
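Once downloaded and unpacked, the models are loaded by file path. Here is a minimal loading sketch, assuming the files sit in your working directory under the names used in the Usage section below (adjust the paths to wherever you extracted the downloads):

```python
# Load pretrained models from local files; the paths are assumptions
# and should point to wherever you extracted the downloads.
from hazm import POSTagger, Chunker, WordEmbedding

tagger = POSTagger(model='pos_tagger.model')
chunker = Chunker(model='chunker.model')
embedding = WordEmbedding(model_type='fasttext', model_path='word2vec.bin')
```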
## Usage

>>> from hazm import *
>>> normalizer = Normalizer()
>>> normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند')
'اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند'
>>> sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
['ما هم برای وصل کردن آمدیم!', 'ولی برای پردازش، جدا بهتر نیست؟']
>>> word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')
['ولی', 'برای', 'پردازش', '،', 'جدا', 'بهتر', 'نیست', '؟']
>>> stemmer = Stemmer()
>>> stemmer.stem('کتابها')
'کتاب'
>>> lemmatizer = Lemmatizer()
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'
>>> tagger = POSTagger(model='pos_tagger.model')
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]
>>> chunker = Chunker(model='chunker.model')
>>> tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
>>> tree2brackets(chunker.parse(tagged))
'[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'
>>> word_embedding = WordEmbedding(model_type='fasttext', model_path='word2vec.bin')
>>> word_embedding.doesnt_match(['سلام', 'درود', 'خداحافظ', 'پنجره'])
'پنجره'
>>> word_embedding.doesnt_match(['ساعت', 'پلنگ', 'شیر'])
'ساعت'
>>> parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer)
>>> parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟'))
<DependencyGraph with 8 nodes>
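Putting the steps above together, here is a minimal end-to-end sketch as a plain script, assuming `pos_tagger.model` has been downloaded to the working directory (see Pretrained-Models above):

```python
# End-to-end sketch: normalize, split into sentences, tokenize, and tag.
from hazm import Normalizer, POSTagger, sent_tokenize, word_tokenize

normalizer = Normalizer()
tagger = POSTagger(model='pos_tagger.model')

text = 'ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟'
for sentence in sent_tokenize(normalizer.normalize(text)):
    print(tagger.tag(word_tokenize(sentence)))  # list of (word, POS tag) pairs
```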
## Documentation

Visit https://roshan-ai.ir/hazm/docs to view the full documentation.
## Hazm in other languages

Disclaimer: These ports are not developed or maintained by Roshan, and they may not have the same functionality or quality as the original Hazm.
## Contribution

We welcome and appreciate contributions to this repo, such as bug reports, feature requests, code improvements, and documentation updates. Please follow the Contribution guideline when contributing: open an issue, fork the repo, write your code, create a pull request, and wait for review and feedback. Thank you for your interest in and support of this project!
## Thanks

- Thanks to the Virastyar project for providing the Persian word list.