C-LLM

This is the source code for the paper "C-LLM: Learn to Check Chinese Spelling Errors Character by Character" (https://arxiv.org/pdf/2406.16536).

[2024.9.20] Our paper has been accepted to the EMNLP 2024 Main Conference!

Environment

  • Python: 3.8
  • CUDA: 12.0 (NVIDIA A100-SXM4-40GB)
  • Packages: pip install -r requirements.txt

Data

Train

Test

The test sets are provided under /dataset/test_date/.

Character-Level Tokenization

First, run tokenizer_prune_qwen.py to prune the BPE tokenizer's vocabulary so that text is tokenized character by character. Then run pruner.py to update the model embeddings to match the new vocabulary.

python tokenizer_prune_qwen.py 
python pruner.py
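
Conceptually, the embedding-update step copies the embedding rows of the kept tokens into a smaller embedding matrix. Below is a minimal sketch of that idea, not the repository's pruner.py; the model name, the kept_token_ids.pt file, and the output path are assumptions:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B")

# Hypothetical artifact of the tokenizer-pruning step: tensor mapping new token id -> old token id.
kept_ids = torch.load("kept_token_ids.pt")

# Shrink the input embedding to the kept rows only.
old_in = model.get_input_embeddings().weight.data
new_in = torch.nn.Embedding(len(kept_ids), old_in.size(1))
new_in.weight.data.copy_(old_in[kept_ids])
model.set_input_embeddings(new_in)

# Shrink the output head the same way when it is not tied to the input embedding.
if not model.config.tie_word_embeddings:
    old_out = model.get_output_embeddings().weight.data
    new_out = torch.nn.Linear(old_in.size(1), len(kept_ids), bias=False)
    new_out.weight.data.copy_(old_out[kept_ids])
    model.set_output_embeddings(new_out)

model.tie_weights()  # re-ties the output head to the new input embedding if the config requires it
model.config.vocab_size = len(kept_ids)
model.save_pretrained("qwen1.5-char-pruned")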

Continued Pre-training

The full pre-training corpus comprises approximately 19B tokens, but we train for only 30,000 steps, covering about 2B of them. The backbone model is Qwen1.5.
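
As a rough illustration of this stage, the sketch below runs plain causal-LM continued pre-training on the pruned model with the Hugging Face Trainer; the corpus file, sequence length, batch size, and learning rate are placeholders rather than the paper's settings:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("qwen1.5-char-pruned")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("qwen1.5-char-pruned")

# Hypothetical raw-text corpus, one document per line.
raw = load_dataset("text", data_files={"train": "pretrain_corpus.txt"})
train = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="ckpt_cpt",
    max_steps=30_000,                 # the step count reported above
    per_device_train_batch_size=8,    # placeholder
    learning_rate=1e-5,               # placeholder
    bf16=True,
    save_steps=5_000,
)
Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()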

Supervised Fine-tuning

After the above steps are completed, run train.sh for fine-tuning.

sh train.sh
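
The exact prompt template and hyperparameters live in train.sh and the training code it calls; purely as a hedged illustration, a character-level SFT example for spelling correction could be built as below (the instruction text and the loss masking are assumptions):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("qwen1.5-char-pruned")  # hypothetical path from the steps above

def build_example(src: str, tgt: str, max_len: int = 512):
    # Illustrative instruction: "Correct the misspelled characters in the following sentence:"
    prompt = f"纠正下面句子中的错别字：{src}\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(tgt, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]
    input_ids = (prompt_ids + target_ids)[:max_len]
    # Only the corrected sentence contributes to the loss.
    labels = ([-100] * len(prompt_ids) + target_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}

print(build_example("我今天很高心。", "我今天很高兴。"))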

Inference

After fine-tuning, run test.sh for inference. Please update the checkpoint path in the script to match the path where you saved the fine-tuned parameters.

bash test.sh
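
test.sh is the authoritative entry point; as an illustration only, loading a saved checkpoint and correcting a single sentence could look like the sketch below. The checkpoint path and the prompt template are assumptions and must match whatever the fine-tuning step actually used:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "output/c-llm-sft"  # hypothetical: wherever train.sh saved the fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

src = "我今天很高心。"
prompt = f"纠正下面句子中的错别字：{src}\n"  # must match the fine-tuning template
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))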

Evaluation

We designed two methods for handling sentence pairs of unequal length: one based on CheRRANT and the other on difflib. In the paper we adopt the CheRRANT-based method; to use it, CheRRANT must first be downloaded to the specified directory.
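
For reference, the difflib-based variant boils down to aligning the source sentence and the prediction at the character level and reading off the edit operations; a minimal sketch using only the standard library:

import difflib

def char_edits(src: str, hyp: str):
    """Return the non-equal (op, src_span, hyp_span) edits between source and hypothesis."""
    sm = difflib.SequenceMatcher(None, src, hyp)
    return [(op, src[i1:i2], hyp[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

print(char_edits("我今天很高心", "我今天很高兴"))  # [('replace', '心', '兴')]
print(char_edits("我今很高兴", "我今天很高兴"))    # [('insert', '', '天')]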

Run evaluate_result.py for evaluation:

python evaluate_result.py

The script for calculating metrics is adapted from CSCD-NS.

Citation

If you find this work useful for your research, please cite the following paper: C-LLM: Learn to Check Chinese Spelling Errors Character by Character (https://arxiv.org/pdf/2406.16536).

@article{li2024c,
  title={C-LLM: Learn to Check Chinese Spelling Errors Character by Character},
  author={Li, Kunting and Hu, Yong and He, Liang and Meng, Fandong and Zhou, Jie},
  journal={arXiv preprint arXiv:2406.16536},
  year={2024}
}