Urdu Word Segmentation

This repository contains the code and dataset for Urdu word segmentation, as described in the paper Urdu Word Segmentation using Conditional Random Fields (CRFs).

Requirement(s)

The tool is implemented in Python and requires scikit-learn and python-crfsuite.
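
For readers new to CRF-based segmentation, the following is a minimal sketch of how segmented text might be converted into per-character labels and features for a sequence tagger. The B/I labeling scheme, the feature names, and the helper functions (`chars_and_labels`, `char_features`, `labels_to_words`) are illustrative assumptions, not the formulation or feature set used in the paper or in this repository.

```python
# Sketch only: the B/I scheme and the feature templates below are
# illustrative assumptions, not the paper's actual setup.

def chars_and_labels(words):
    """Flatten gold-standard words into characters with B/I labels.

    "B" marks the first character of a word, "I" any later character.
    """
    chars, labels = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def char_features(chars, i):
    """Build a small list of string features for the character at index i."""
    feats = ["bias", "char=" + chars[i]]
    feats.append("prev=" + chars[i - 1] if i > 0 else "BOS")
    feats.append("next=" + chars[i + 1] if i < len(chars) - 1 else "EOS")
    return feats


def labels_to_words(chars, labels):
    """Rebuild words from characters and (gold or predicted) B/I labels."""
    words = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # continue the current word
    return words


# Example sentence, already segmented into words.
words = ["یہ", "ایک", "مثال", "ہے"]
chars, labels = chars_and_labels(words)
xseq = [char_features(chars, i) for i in range(len(chars))]

# Round trip: the labels recover the original segmentation.
assert labels_to_words(chars, labels) == words
```

Under these assumptions, each `(xseq, labels)` pair is one training sequence; with python-crfsuite such pairs would typically be added to a `pycrfsuite.Trainer` via `append(xseq, yseq)` before training.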

Dataset

A manually annotated corpus of approximately 111,000 tokens is available for download.

Reference(s)

If you use this tool in your work, please cite the paper below.

Urdu Word Segmentation using Conditional Random Fields (CRFs)

@InProceedings{C18-1217,
  author = 	"Bin Zia, Haris
		and Raza, Agha Ali
		and Athar, Awais",
  title = 	"Urdu Word Segmentation using Conditional Random Fields (CRFs)",
  booktitle = 	"Proceedings of the 27th International Conference on Computational Linguistics",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"2562--2569",
  location = 	"Santa Fe, New Mexico, USA",
  url = 	"http://aclweb.org/anthology/C18-1217"
}

License(s)

Code licensed under the MIT License: http://opensource.org/licenses/MIT

Data licensed under CC-BY 4.0: https://creativecommons.org/licenses/by/4.0/
