Introduction

Code for Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition.

Usage

N-gram and trained BERT classifier cannot be public since privacy policy.

Use in command lines

python -m graces -s 饮食可，睡眠可，大便不规律，小便正常，体重无明显减轻。
python -m graces -f ./input.txt -o ./output.txt

Import from python

import graces
graces.cut("饮食可，睡眠可，大便不规律，小便正常，体重无明显减轻。") # Segment a single sentence
graces.cut_k("饮食可，睡眠可，大便不规律，小便正常，体重无明显减轻。", k=8) # Segment a single sentence with fixed word count k.
graces.cut_file("./input.txt", "./output.txt") # Segment a file

Data

We ask MD students to construct coarse and fine level word segmentation on EHRs for validation. We do not use data for training!

dev.txt: Unlabeled EHRs from part of CCKS2019.
dev_label_coarse.txt: Coarse-level word segmentation labels.
dev_label_fine.txt: Fine-level word segmentation labels.

Citation

If you find our codes or data useful, please cite: