topic-segment-chatglm

Topic segmentation is the task of dividing a document into topically coherent segments. It is typically a prerequisite for downstream tasks such as question retrieval and knowledge-augmented LLMs. This repository presents a generative-model-based approach to topic segmentation and releases a chatglm-6b model fine-tuned on the wiki727k dataset. The model has also been validated on QMSum, a low-resource, cross-domain meeting dataset, where it reaches the lowest Pk among the compared methods after LoRA fine-tuning.
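To make the task concrete, here is a minimal sketch of how segmentation can be framed as text generation: number the sentences, ask the model where new segments start, and parse the answer. The prompt wording, the output format, and the helper names (`build_prompt`, `parse_boundaries`, `split_segments`) are all hypothetical; the repository's actual prompt and output format may differ.

```python
import re


def build_prompt(sentences):
    """Number each sentence and ask for the indices that start a new segment.

    The instruction text is illustrative, not the prompt used in this repo.
    """
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences))
    instruction = (
        "Split the following document into topical segments. "
        "Answer with the indices of the sentences that start a new segment, "
        "separated by commas.\n"
    )
    return instruction + numbered


def parse_boundaries(generated, num_sentences):
    """Extract valid segment-start indices from the generated text."""
    indices = {int(m) for m in re.findall(r"\d+", generated)}
    # Index 0 always starts the first segment, so only 1..n-1 are informative.
    return sorted(i for i in indices if 0 < i < num_sentences)


def split_segments(sentences, starts):
    """Cut the sentence list at the given segment-start indices."""
    bounds = [0] + list(starts) + [len(sentences)]
    return [sentences[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    doc = ["Cats are mammals.", "They purr.", "Rust is a language.",
           "It has a borrow checker.", "Paris is in France."]
    print(build_prompt(doc))
    starts = parse_boundaries("2, 4", len(doc))  # stand-in for model output
    print(split_segments(doc, starts))
```

Out-of-range indices in the generated text are dropped rather than raising an error, since a generative model can produce malformed answers.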

Dependencies

torch==1.13.1+cu116
transformers==4.27.1
datasets==2.10.0

loralib
deepspeed==0.8.0
accelerate==0.16.0
bitsandbytes
peft==0.3.0
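A quick way to check an environment against the pinned versions above is sketched below. It uses only the standard library; the `check_versions` helper is hypothetical and not part of this repository. torch is left out because its local version tag depends on the CUDA build.

```python
from importlib import metadata

# Pinned versions copied from the list above (unpinned packages omitted).
PINNED = {
    "transformers": "4.27.1",
    "datasets": "2.10.0",
    "deepspeed": "0.8.0",
    "accelerate": "0.16.0",
    "peft": "0.3.0",
}


def check_versions(pinned):
    """Return {package: (wanted, installed_or_None, matches)}."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {installed} [{status}]")
```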

Data

Download the original wiki727k dataset and put it in data/wiki_727, then run the preprocessing script:

cd utils

sh wiki_process.sh
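For orientation, the sketch below shows one way to turn a wiki727k article into sentences and boundary labels. It assumes the commonly distributed file layout, in which each section begins with a header line starting with `========` and every other non-empty line is one sentence. It is not a description of what wiki_process.sh does, and `parse_wiki727k` is a hypothetical helper.

```python
SECTION_MARK = "========"  # assumed section-header prefix in wiki727k files


def parse_wiki727k(text):
    """Return (sentences, labels); label 1 marks the last sentence of a segment."""
    segments = [[]]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(SECTION_MARK):
            segments.append([])  # a header opens a new segment
        else:
            segments[-1].append(line)
    segments = [seg for seg in segments if seg]  # drop empty sections

    sentences, labels = [], []
    for seg in segments:
        sentences.extend(seg)
        labels.extend([0] * (len(seg) - 1) + [1])
    return sentences, labels


if __name__ == "__main__":
    sample = (
        "========,1,preface.\n"
        "First sentence.\n"
        "Second sentence.\n"
        "========,2,History.\n"
        "Third sentence.\n"
    )
    print(parse_wiki727k(sample))
```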

Train

wiki727k: full fine-tuning, using 4x A100 80G

sh scripts/ds_train_finetune.sh

qmsum: LoRA fine-tuning, using 1x A100 80G

sh scripts/train_lora.sh
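The difference in hardware between the two setups follows from how few parameters LoRA trains: for a weight matrix of shape d_in x d_out, a rank-r adapter adds only r * (d_in + d_out) trainable parameters. The arithmetic is sketched below. The layer shape (4096 x 12288) and rank 8 are illustrative assumptions, not values read from scripts/train_lora.sh.

```python
def full_param_count(d_in, d_out):
    """Trainable weights when the whole matrix is fine-tuned."""
    return d_in * d_out


def lora_param_count(d_in, d_out, rank):
    """Trainable weights for a LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 12288, 8  # assumed shape and rank, for illustration
    full = full_param_count(d_in, d_out)
    lora = lora_param_count(d_in, d_out, rank)
    print(f"full: {full:,}  lora: {lora:,}  ratio: 1/{full // lora}")
```

Under these assumptions the adapter trains 131,072 weights for that matrix instead of 50,331,648, a factor of 384 fewer.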

Test

wiki727k:

sh scripts/evaluate_finetune.sh

qmsum:

sh scripts/evaluate_lora.sh

Results

wiki727k:

Performance compared with other methods (Pk, lower is better). The chatglm+finetune model can be found on huggingface.co.

Method	Pk
CATS	15.95
StructuredSummarization	15.0
chatglm+finetune (ours)	14.0
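Pk is the probability that two sentences k positions apart are wrongly judged to be in the same or in different segments, so lower is better. A minimal sketch of the metric follows. It takes segment lengths as input and sets k to half the average reference segment length using integer division; the evaluation scripts in this repository may round or window differently. The tables appear to report Pk multiplied by 100.

```python
def segment_ids(lengths):
    """Expand segment lengths, e.g. [2, 3], into per-sentence ids [0, 0, 1, 1, 1]."""
    return [seg for seg, n in enumerate(lengths) for _ in range(n)]


def pk(ref_lengths, hyp_lengths, k=None):
    """Pk between a reference and a hypothesis segmentation.

    Both arguments are lists of positive segment lengths (in sentences)
    that must sum to the same document length.
    """
    ref, hyp = segment_ids(ref_lengths), segment_ids(hyp_lengths)
    if len(ref) != len(hyp):
        raise ValueError("segmentations cover different numbers of sentences")
    n = len(ref)
    if k is None:
        # Half the average reference segment length, at least 1.
        k = max(1, n // (2 * len(ref_lengths)))
    windows = n - k
    if windows <= 0:
        raise ValueError("document too short for window size k")
    # An error is a window whose two ends are in the same segment in one
    # segmentation but in different segments in the other.
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]) for i in range(windows)
    )
    return errors / windows


if __name__ == "__main__":
    print(pk([4, 4], [4, 4]))  # identical segmentations
    print(pk([4, 4], [8]))     # hypothesis misses the only boundary
```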

qmsum:

Performance compared with other methods (Pk, lower is better):

Method	Pk
DialogLM	38.0
StructuredSummarization	32.8
chatglm(wiki)-no train	37.7
chatglm-lora	36.0
chatglm(wiki)-lora	32.3

Reference

code: https://github.com/THUDM/ChatGLM-6B