khmer-nltk

Khmer language processing toolkit

Primary language: Python. License: Apache License 2.0 (Apache-2.0)

πŸ…Khmer natural language processing toolkitπŸ…

🎯TODO

  • Sentence Segmentation
  • Word Segmentation
  • Part of speech Tagging
  • Named Entity Recognition
  • Text classification

💪Installation

pip install khmer-nltk
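
To confirm the install worked, one quick check is whether Python can locate the package. This is a minimal sketch using only the standard library; it assumes the importable module is named `khmernltk`, as in the import statements in this README, and the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the package installed by `pip install khmer-nltk`
# is imported as `khmernltk`.
print("khmernltk available:", is_installed("khmernltk"))
```

The check only looks the module up and does not import it, so it is safe to run before the package is installed.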

🏹 Quick tour

[Blog]

For evaluation results of each khmer-nltk feature, please refer to the corresponding sub-module's README.

Sentence tokenization

>>> from khmernltk import sentence_tokenize
>>> raw_text = "αžαž½αž”αž†αŸ’αž“αžΆαŸ†αž‘αžΈαŸ’αŸ¨! ្៣ αžαž»αž›αžΆ αžŸαŸ’αž˜αžΆαžšαžαžΈαž•αŸ’αžŸαŸ‡αž•αŸ’αžŸαžΆαž‡αžΆαžαž·αžšαžœαžΆαž„αžαŸ’αž˜αŸ‚αžšαž“αž·αž„αžαŸ’αž˜αŸ‚αžš αžˆαžΆαž“αž‘αŸ…αž”αž‰αŸ’αž…αž”αŸ‹αžŸαž„αŸ’αžšαŸ’αž‚αžΆαž˜ αž“αžΆαŸ†αž–αž“αŸ’αž›αžΊαžŸαž“αŸ’αžαž·αž—αžΆαž– αž“αž·αž„αž€αžΆαžšαžšαž½αž”αžšαž½αž˜αž‡αžΆαžαŸ’αž˜αžΈ"
>>> print(sentence_tokenize(raw_text))
['αžαž½αž”αž†αŸ’αž“αžΆαŸ†αž‘αžΈαŸ’αŸ¨!', '្៣ αžαž»αž›αžΆ αžŸαŸ’αž˜αžΆαžšαžαžΈαž•αŸ’αžŸαŸ‡αž•αŸ’αžŸαžΆαž‡αžΆαžαž·αžšαžœαžΆαž„αžαŸ’αž˜αŸ‚αžšαž“αž·αž„αžαŸ’αž˜αŸ‚αžš αžˆαžΆαž“αž‘αŸ…αž”αž‰αŸ’αž…αž”αŸ‹αžŸαž„αŸ’αžšαŸ’αž‚αžΆαž˜ αž“αžΆαŸ†αž–αž“αŸ’αž›αžΊαžŸαž“αŸ’αžαž·αž—αžΆαž– αž“αž·αž„αž€αžΆαžšαžšαž½αž”αžšαž½αž˜αž‡αžΆαžαŸ’αž˜αžΈ']

Word tokenization

>>> from khmernltk import word_tokenize
>>> raw_text = "αžαž½αž”αž†αŸ’αž“αžΆαŸ†αž‘αžΈαŸ’αŸ¨! ្៣ αžαž»αž›αžΆ αžŸαŸ’αž˜αžΆαžšαžαžΈαž•αŸ’αžŸαŸ‡αž•αŸ’αžŸαžΆαž‡αžΆαžαž·αžšαžœαžΆαž„αžαŸ’αž˜αŸ‚αžšαž“αž·αž„αžαŸ’αž˜αŸ‚αžš αžˆαžΆαž“αž‘αŸ…αž”αž‰αŸ’αž…αž”αŸ‹αžŸαž„αŸ’αžšαŸ’αž‚αžΆαž˜ αž“αžΆαŸ†αž–αž“αŸ’αž›αžΊαžŸαž“αŸ’αžαž·αž—αžΆαž– αž“αž·αž„αž€αžΆαžšαžšαž½αž”αžšαž½αž˜αž‡αžΆαžαŸ’αž˜αžΈ"
>>> print(word_tokenize(raw_text, return_tokens=True))
['αžαž½αž”', 'αž†αŸ’αž“αžΆαŸ†', 'αž‘αžΈ', '្៨', '!', ' ', '្៣', ' ', 'αžαž»αž›αžΆ', ' ', 'αžŸαŸ’αž˜αžΆαžšαžαžΈ', 'αž•αŸ’αžŸαŸ‡αž•αŸ’αžŸαžΆ', 'αž‡αžΆαžαž·', 'αžšαžœαžΆαž„', 'αžαŸ’αž˜αŸ‚αžš', 'αž“αž·αž„', 'αžαŸ’αž˜αŸ‚αžš', ' ', 'αžˆαžΆαž“', 'αž‘αŸ…', 'αž”αž‰αŸ’αž…αž”αŸ‹', 'αžŸαž„αŸ’αžšαŸ’αž‚αžΆαž˜', ' ', 'αž“αžΆαŸ†', 'αž–αž“αŸ’αž›αžΊ', 'αžŸαž“αŸ’αžαž·αž—αžΆαž–', ' ', 'αž“αž·αž„', 'αž€αžΆαžšαžšαž½αž”αžšαž½αž˜', 'αž‡αžΆαžαŸ’αž˜αžΈ']

Part-of-speech tagging

>>> from khmernltk import pos_tag
>>> raw_text = "αžαž½αž”αž†αŸ’αž“αžΆαŸ†αž‘αžΈαŸ’αŸ¨! ្៣ αžαž»αž›αžΆ αžŸαŸ’αž˜αžΆαžšαžαžΈαž•αŸ’αžŸαŸ‡αž•αŸ’αžŸαžΆαž‡αžΆαžαž·αžšαžœαžΆαž„αžαŸ’αž˜αŸ‚αžšαž“αž·αž„αžαŸ’αž˜αŸ‚αžš αžˆαžΆαž“αž‘αŸ…αž”αž‰αŸ’αž…αž”αŸ‹αžŸαž„αŸ’αžšαŸ’αž‚αžΆαž˜ αž“αžΆαŸ†αž–αž“αŸ’αž›αžΊαžŸαž“αŸ’αžαž·αž—αžΆαž– αž“αž·αž„αž€αžΆαžšαžšαž½αž”αžšαž½αž˜αž‡αžΆαžαŸ’αž˜αžΈ"
>>> print(pos_tag(raw_text))
[('αžαž½αž”', 'n'), ('αž†αŸ’αž“αžΆαŸ†', 'n'), ('αž‘αžΈ', 'n'), ('្៨', '1'), ('!', '.'), (' ', 'n'), ('្៣', '1'), (' ', 'n'), ('αžαž»αž›αžΆ', 'n'), (' ', 'n'), ('αžŸαŸ’αž˜αžΆαžšαžαžΈ', 'n'), ('αž•αŸ’αžŸαŸ‡αž•αŸ’αžŸαžΆ', 'n'), ('αž‡αžΆαžαž·', 'n'), ('αžšαžœαžΆαž„', 'o'), ('αžαŸ’αž˜αŸ‚αžš', 'n'), ('αž“αž·αž„', 'o'), ('αžαŸ’αž˜αŸ‚αžš', 'n'), (' ', 'n'), ('αžˆαžΆαž“', 'v'), ('αž‘αŸ…', 'v'), ('αž”αž‰αŸ’αž…αž”αŸ‹', 'v'), ('αžŸαž„αŸ’αžšαŸ’αž‚αžΆαž˜', 'n'), (' ', 'n'), ('αž“αžΆαŸ†', 'v'), ('αž–αž“αŸ’αž›αžΊ', 'n'), ('αžŸαž“αŸ’αžαž·αž—αžΆαž–', 'n'), (' ', 'n'), ('αž“αž·αž„', 'o'), ('αž€αžΆαžšαžšαž½αž”αžšαž½αž˜', 'n'), ('αž‡αžΆαžαŸ’αž˜αžΈ', 'o')]

✍️ Citation

@misc{hoang-khmer-nltk,
  author = {Phan Viet Hoang},
  title = {Khmer Natural Language Processing Toolkit},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/VietHoang1512/khmer-nltk}}
}

Used in:

πŸ‘¨β€πŸŽ“ References

📜 Advisor