An implementation of Lexical Unit Analysis (LUA) for sequence segmentation tasks (e.g., Chinese POS Tagging). Note that this is not an officially supported Tencent product.

Preparation

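Preparation takes two steps. First, convert the chunking data sets into the format shown below and place them in a new folder named "dataset". The folder contains three files: train.json, dev.json, and test.json. Each JSON file is a list of dicts, and both fields of a dict are strings holding a Python-style list. See the following NER example: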

[ 
 {
  "sentence": "['Somerset', '83', 'and', '174', '(', 'P.', 'Simmons']",
  "labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'O'), (5, 6, 'PER')]"
 },
 {
  "sentence": "['Leicestershire', '22', 'points', ',', 'Somerset', '4', '.']",
  "labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'ORG'), (5, 5, 'O'), (6, 6, 'O')]"
 }
]
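
The conversion step could look roughly like the sketch below. It assumes the source data uses BIO tags (e.g., B-PER, I-PER, O), which this README does not state, and that spans are inclusive (start, end, label) triples in which every "O" token forms its own span, as in the example above. The helper names bio_to_spans and to_record are hypothetical and are not part of this repository.

```python
import json


def bio_to_spans(tags):
    """Turn BIO tags into inclusive (start, end, label) spans.

    Assumption: every 'O' token becomes its own single-token span,
    mirroring the example records in this README.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag with the same label extends the currently open span.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Otherwise close the open span, if there is one.
        if label is not None:
            spans.append((start, i - 1, label))
            start, label = None, None
        if tag == "O":
            spans.append((i, i, "O"))
        else:
            # B- tag (or a stray I- tag) opens a new span.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags) - 1, label))
    return spans


def to_record(tokens, tags):
    """Build one dict; both fields are stored as stringified Python lists."""
    return {
        "sentence": str(tokens),
        "labeled entities": str(bio_to_spans(tags)),
    }


tokens = ["Somerset", "83", "and", "174", "(", "P.", "Simmons"]
tags = ["B-ORG", "O", "O", "O", "O", "B-PER", "I-PER"]
record = to_record(tokens, tags)

# A list of such records is what each of train/dev/test.json would hold.
print(json.dumps([record], indent=1))
```

Writing the resulting list with json.dump to dataset/train.json (and likewise for dev and test) would then produce files in the layout described above.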

Second, prepare the pretrained LM (i.e., BERT) and the evaluation script. Create another directory, "resource", with the following layout:

resource
- pretrained_lm
  - model.pt
  - vocab.txt
- conlleval.pl

For Chinese tasks, "pretrained_lm" is built from bert-base-chinese.
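
Before training, it may help to confirm that the "resource" directory matches the layout above. The following is a minimal sketch of such a check; the function name missing_resources is hypothetical and is not part of this repository.

```python
from pathlib import Path

# Files expected under the resource directory, per the layout above.
EXPECTED = [
    "pretrained_lm/model.pt",
    "pretrained_lm/vocab.txt",
    "conlleval.pl",
]


def missing_resources(resource_dir):
    """Return the expected files that are absent from resource_dir."""
    root = Path(resource_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


# An empty list means every expected file is in place.
print(missing_resources("resource"))
```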

Training and Test

CUDA_VISIBLE_DEVICES=0 python main.py -dd dataset -sd dump -rd resource

Citation

@inproceedings{li-etal-2021-segmenting-natural,
    title = "Segmenting Natural Language Sentences via Lexical Unit Analysis",
    author = "Li, Yangming  and  Liu, Lemao  and  Shi, Shuming",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.18",
    doi = "10.18653/v1/2021.findings-emnlp.18",
    pages = "181--187",
}
