Code for the paper "Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition".
The n-gram statistics and the trained BERT classifier cannot be released publicly because of privacy policy.
python -m graces -s 饮食可,睡眠可,大便不规律,小便正常,体重无明显减轻。
python -m graces -f ./input.txt -o ./output.txt
import graces
graces.cut("饮食可,睡眠可,大便不规律,小便正常,体重无明显减轻。") # Segment a single sentence
graces.cut_k("饮食可,睡眠可,大便不规律,小便正常,体重无明显减轻。", k=8) # Segment a single sentence with fixed word count k.
graces.cut_file("./input.txt", "./output.txt") # Segment a file
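Because `cut_k` fixes the number of words, calling it with several values of `k` should give segmentations of the same sentence at different granularities. The following is a minimal sketch that relies only on the `cut_k(sentence, k=...)` call shown above. The helper name `segment_at_granularities` is hypothetical and not part of `graces`; the segmenter is passed in as an argument so the helper assumes nothing about the return type.

```python
from typing import Any, Callable, Dict, Iterable


def segment_at_granularities(
    sentence: str,
    ks: Iterable[int],
    cut_k: Callable[..., Any],
) -> Dict[int, Any]:
    """Call cut_k once per requested word count and key the results by k.

    cut_k is expected to accept (sentence, k=...), as graces.cut_k does above.
    """
    return {k: cut_k(sentence, k=k) for k in ks}


# Intended usage (assumes graces is installed):
#   import graces
#   results = segment_at_granularities("饮食可,睡眠可", [2, 4], graces.cut_k)
```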
We asked MD students to annotate coarse-level and fine-level word segmentations on EHRs for validation. These data are not used for training.
- dev.txt: Unlabeled EHRs from part of CCKS2019.
- dev_label_coarse.txt: Coarse-level word segmentation labels.
- dev_label_fine.txt: Fine-level word segmentation labels.
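One way to score a segmentation against these labels is word-level precision, recall and F1 over character spans. The sketch below is an illustration under stated assumptions, not necessarily the evaluation protocol used in the paper: it assumes each segmentation is available as a list of words covering the same text, and the function names `words_to_spans` and `segmentation_prf` are hypothetical.

```python
from typing import List, Set, Tuple


def words_to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        if not word:  # ignore empty strings
            continue
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(pred: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F1 of pred against gold.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly.
    """
    if "".join(pred) != "".join(gold):
        raise ValueError("pred and gold must cover the same text")
    pred_spans = words_to_spans(pred)
    gold_spans = words_to_spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = segmentation_prf(["大便", "不", "规律"], ["大便", "不规律"])
    print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

How the word lists are read from `dev_label_coarse.txt` and `dev_label_fine.txt` depends on the delimiter used in those files, so check the file format before parsing.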
If you find our code or data useful, please cite:
@article{YUAN2020103542,
title = "Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition",
journal = "Journal of Biomedical Informatics",
volume = "110",
pages = "103542",
year = "2020",
issn = "1532-0464",
doi = "https://doi.org/10.1016/j.jbi.2020.103542",
url = "http://www.sciencedirect.com/science/article/pii/S1532046420301702",
author = "Zheng Yuan and Yuanhao Liu and Qiuyang Yin and Boyao Li and Xiaobin Feng and Guoming Zhang and Sheng Yu",
}