Mohammad H. Forouhesh
Metodata Inc ®
April 25, 2022
This repository contains a Persian part-of-speech tagger based on Conditional Random Fields (CRF) and a native text normalizer.
- CRF tagger commit#64
- Wapiti tagger commit#56
- Native Normalizer pull#4
- UnitTesting commit#127
- CI/CD pull#5
- Scrutinize Coverage issue#8
- Documentation pull#9
- Improve Coverage pull#9
- Smooth Installation issue#12 pull#13
- Excel code quality pull#11
- Adding documentation and flowchart of the code.
- CircleCI CI/CD Pipeline Config issue#14
- Interactive Docker container via docker-compose pull#23
A small interactive Docker container is provided for production use.
git clone https://github.com/MohammadForouhesh/crf-pos-persian.git
cd crf-pos-persian
docker-compose run --rm crf-pos
! pip install crf_pos
$ git clone https://github.com/MohammadForouhesh/crf-pos-persian
$ cd crf-pos-persian
$ python setup.py install
! pip install git+https://github.com/MohammadForouhesh/crf-pos-persian.git
from crf_pos.pos_tagger.wapiti import WapitiPosTagger
pos_tagger = WapitiPosTagger()
tokens = 'او رئیسجمهور حجتالاسلاموالمسلمین ابرهیم رئیسی رئیس جمهور ایران اسلامی می باشد'
pos_tagger[tokens]
[1]:
[('او', 'PRO'),
('رئیس\u200cجمهور', 'N'),
('حجت\u200cالاسلام\u200cوالمسلمین', 'N'),
('ابرهیم', 'N'),
('رئیسی', 'N'),
('رئیس\u200cجمهور', 'N'),
('ایران', 'N'),
('اسلامی', 'ADJ'),
('می\u200cباشد', 'V')]
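The output above is a list of `(token, tag)` pairs, so it can be post-processed with plain Python. The following is a minimal sketch, not part of the library: the `tagged` list is copied from the output above rather than produced by calling the tagger, and the helper name `tokens_with_tag` is hypothetical.

```python
from collections import Counter

# Pairs copied verbatim from the example output above (not a live tagger call).
tagged = [
    ('او', 'PRO'),
    ('رئیس\u200cجمهور', 'N'),
    ('حجت\u200cالاسلام\u200cوالمسلمین', 'N'),
    ('ابرهیم', 'N'),
    ('رئیسی', 'N'),
    ('رئیس\u200cجمهور', 'N'),
    ('ایران', 'N'),
    ('اسلامی', 'ADJ'),
    ('می\u200cباشد', 'V'),
]


def tokens_with_tag(pairs, tag):
    """Return the tokens whose part-of-speech tag equals `tag` (hypothetical helper)."""
    return [token for token, token_tag in pairs if token_tag == tag]


# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Collect all tokens tagged as nouns.
nouns = tokens_with_tag(tagged, 'N')

print(tag_counts)
print(nouns)
```

The same pattern should apply to any list of pairs returned by the tagger, assuming the output keeps the shape shown above.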
Training and testing are performed on Mojgan Seraji's Uppsala Persian Corpus.
Part-of-Speech | Description | precision | recall | f1-score | support |
---|---|---|---|---|---|
N | Noun | 0.985 | 0.970 | 0.977 | 186585 |
P | Preposition | 0.998 | 0.998 | 0.998 | 89450 |
V | Verb | 0.999 | 0.999 | 0.999 | 87762 |
ADV | Adverb | 0.976 | 0.972 | 0.974 | 15983 |
FW | Foreign Word | 0.989 | 0.992 | 0.991 | 2784 |
DET | Determiner | 0.973 | 0.977 | 0.975 | 19786 |
ADJ | Adjective | 0.978 | 0.975 | 0.977 | 61526 |
INT | Interjection | 1.000 | 1.000 | 1.000 | 73 |
CONJ | Conjunction | 0.996 | 0.997 | 0.997 | 74796 |
PRO | Pronoun | 0.973 | 0.974 | 0.973 | 23094 |
NUM | Numeral | 0.988 | 0.992 | 0.990 | 24864 |
avg/total | - | 0.985 | 0.985 | 0.985 | 586703 |
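As a sanity check on the table above, the per-tag supports can be summed and a support-weighted average of the per-tag F1 scores computed. This is a minimal sketch under an assumption: the source does not state how the `avg/total` row was averaged, so the weighted figure here is only expected to land close to the reported value, not to reproduce it exactly. The values are copied from the table, and the name `weighted_f1` is illustrative.

```python
# (f1-score, support) per tag, copied from the table above.
per_tag = {
    'N':    (0.977, 186585),
    'P':    (0.998, 89450),
    'V':    (0.999, 87762),
    'ADV':  (0.974, 15983),
    'FW':   (0.991, 2784),
    'DET':  (0.975, 19786),
    'ADJ':  (0.977, 61526),
    'INT':  (1.000, 73),
    'CONJ': (0.997, 74796),
    'PRO':  (0.973, 23094),
    'NUM':  (0.990, 24864),
}

# Total number of evaluated tokens across all tags.
total_support = sum(support for _, support in per_tag.values())

# Support-weighted mean of the per-tag F1 scores.
weighted_f1 = sum(f1 * support for f1, support in per_tag.values()) / total_support

print(total_support)
print(round(weighted_f1, 3))
```

Because the per-tag scores in the table are already rounded to three decimals, small deviations from the reported average are to be expected.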