Chinese MentalBERT: Domain-Adaptive Pre-training on Social Media for Chinese Mental Health Text Analysis
This repository contains:
- a link to the domain-adaptively pretrained model for the Chinese mental health domain (link)
- links to the trained models for the 4 evaluation tasks: two semantic recognition tasks (link), suicide classification (link), and cognitive distortion classification (link)
- code and materials for domain-adaptive pretraining (link)
- code and materials for fine-tuning and evaluation on downstream tasks (link)
- Download pretraining corpus:
- Sina Weibo Depression Dataset (SWDD) [5] : https://github.com/ethan-nicholas-tsai/DepressionDetection
- Weibo User Depression Detection Dataset (WU3D) [6]: https://github.com/aidenwang9867/Weibo-User-Depression-Detection-Dataset
- Download the depression lexicon [7]: https://github.com/omfoggynight/Chinese-Depression-domain-Lexicon
- Download the word segmentation tool LTP: https://github.com/HIT-SCIR/ltp
- Download the Chinese pre-trained BERT model (Chinese-BERT-wwm-ext) [3]: https://huggingface.co/hfl/chinese-bert-wwm-ext
- Download the datasets:
- SMP2020-EWECT (Sentiment analysis tasks): https://github.com/BrownSweater/BERT_SMP2020-EWECT
- Suicide and cognitive distortion tasks [4]: https://github.com/HongzhiQ/SupervisedVsLLM-EfficacyEval
- Download the pretrained model:
- Chinese MentalBERT: https://huggingface.co/zwzzz/Chinese-MentalBERT
We use two public datasets as example pretraining corpora and the depression lexicon for the guided masking mechanism; see the links above for details. Feel free to add more related corpora or lexicons to enrich your pretraining material.
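The lexicon-guided masking idea can be illustrated with a minimal sketch (the function name, toy sentence, and toy lexicon below are illustrative, not the repository's actual API): words that appear in the depression lexicon are preferred as mask candidates, and any remaining budget is filled from the other positions.

```python
import random

def choose_mask_words(words, lexicon, mask_rate=0.15, seed=0):
    """Pick word indices to mask, preferring words from the domain lexicon.

    words   -- a sentence after word segmentation (e.g. by LTP)
    lexicon -- a set of domain words (e.g. from the depression lexicon)
    """
    rng = random.Random(seed)
    budget = max(1, round(len(words) * mask_rate))
    # Lexicon words are candidates first; the rest fill any remaining budget.
    priority = [i for i, w in enumerate(words) if w in lexicon]
    rest = [i for i in range(len(words)) if i not in set(priority)]
    rng.shuffle(priority)
    rng.shuffle(rest)
    return sorted((priority + rest)[:budget])

words = ["我", "最近", "感到", "抑郁", "和", "焦虑", "，", "睡", "不", "着"]
# With a 20% budget (2 of 10 words), both picks come from lexicon positions.
print(choose_mask_words(words, {"抑郁", "焦虑", "失眠"}, mask_rate=0.2))  # [3, 5]
```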
The preprocessing script pre_processing.py includes the following steps:
- Data cleaning: remove irrelevant information, including URLs, user tags (e.g., @username), and topic tags (e.g., #topic#), as well as special symbols, emoticons, and unstructured characters.
- Sentence concatenation: Connect all cleaned sentences in their original sequence to form a continuous stream of text.
- Segmentation into 128-token samples: Split the continuous text stream into multiple samples, each containing 128 tokens, to facilitate efficient processing and enable the model to learn long-distance dependencies in the text.
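The three steps above can be sketched as follows. The regular expressions are illustrative approximations of the cleaning rules, not the exact patterns in pre_processing.py, and the sketch splits by character, which approximates tokens for a character-level Chinese BERT:

```python
import re

def clean(text):
    """Step 1: remove URLs, user tags, topic tags and other noise."""
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"@\S+", "", text)          # user tags, e.g. @username
    text = re.sub(r"#[^#]*#", "", text)       # topic tags, e.g. #topic#
    # Keep Chinese characters and common punctuation; drop emoticons etc.
    return re.sub(r"[^\u4e00-\u9fff，。！？、；：]", "", text)

def make_samples(posts, sample_len=128):
    """Steps 2-3: concatenate cleaned posts in order, then split the
    continuous stream into fixed-length samples."""
    stream = "".join(clean(p) for p in posts)
    return [stream[i:i + sample_len] for i in range(0, len(stream), sample_len)]

print(clean("看 http://t.cn/abc @小明 #抑郁#今天很难过"))  # 看今天很难过
```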
Set the path to your processed data as TRAIN_FILE when you run the pre-training.
Download the word segmentation tool LTP (link) and set its path as LTP_RESOURCE when you run the pre-training.
We use the Chinese pre-trained BERT model (Chinese-BERT-wwm-ext) [3] as the foundational pre-trained model in our experiments. Set its path as BERT_RESOURCE when you run the pre-training.
You could run the following:
```bash
export TRAIN_FILE=/path/to/train/file
export LTP_RESOURCE=/path/to/ltp/tokenizer
export BERT_RESOURCE=/path/to/bert/tokenizer
export SAVE_PATH=/path/to/data/ref.txt

python run_chinese_ref.py \
    --file_name=$TRAIN_FILE \
    --ltp=$LTP_RESOURCE \
    --bert=$BERT_RESOURCE \
    --save_path=$SAVE_PATH
```
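run_chinese_ref.py aligns LTP's word segmentation with BERT's character-level tokenization so that whole-word masking knows which characters continue a word. The core idea can be illustrated as follows (a simplified sketch of the alignment logic, not the script's exact output format):

```python
def word_continuation_positions(segmented_words):
    """Indices of characters that continue (rather than start) a segmented word.

    Whole-word masking uses this information to mask every character of a
    word whenever any one of its characters is selected for masking.
    """
    refs, pos = [], 0
    for word in segmented_words:
        for k in range(len(word)):
            if k > 0:  # this character continues the current word
                refs.append(pos)
            pos += 1
    return refs

# "抑郁症" is one LTP word occupying positions 2-4; positions 3 and 4 continue it.
print(word_continuation_positions(["我", "有", "抑郁症"]))  # [3, 4]
```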
Then you can run the script like this:
```bash
export TRAIN_FILE=/path/to/train/file
export VALIDATION_FILE=/path/to/validation/file
export TRAIN_REF_FILE=/path/to/train/chinese_ref/file
export VALIDATION_REF_FILE=/path/to/validation/chinese_ref/file
export OUTPUT_DIR=/tmp/test-mlm-wwm

python run_mlm_wwm.py \
    --model_name_or_path hfl/chinese-bert-wwm-ext \
    --train_file $TRAIN_FILE \
    --validation_file $VALIDATION_FILE \
    --train_ref_file $TRAIN_REF_FILE \
    --validation_ref_file $VALIDATION_REF_FILE \
    --do_train \
    --do_eval \
    --output_dir $OUTPUT_DIR
```
Our trained model is publicly available at: https://huggingface.co/zwzzz/Chinese-MentalBERT. You can load it and fine-tune it on your downstream task.
Chinese MentalBERT is evaluated on four public datasets in the mental health domain, including two semantic recognition tasks (link), suicide classification (link), and cognitive distortion classification (link). In the provided open-source code, we use cognitive distortion multi-label classification as an example to demonstrate fine-tuning and evaluation on downstream tasks.
You can download the public datasets used in our experiments as detailed in [link](#Material-for-fine-tuning-on-downstream-task), and place them under the downstreamTasks path.
You can download the pretrained model and set up its path for fine-tuning.
You could run the following:
```bash
python finetuning.py
```
Then you can evaluate like this:
```bash
python evaluate.py
```
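For a multi-label task such as cognitive distortion classification, evaluation typically reports F1 over the per-label binary decisions. A minimal micro-averaged F1 sketch for intuition (the actual metrics are computed inside evaluate.py; this helper is only an illustration):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-label 0/1 matrices (lists of rows)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p and t:
                tp += 1   # label predicted and present
            elif p:
                fp += 1   # label predicted but absent
            elif t:
                fn += 1   # label present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 2 true positives, 1 false positive, 1 false negative -> F1 = 2/3
print(micro_f1([[1, 0, 1], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]]))
```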
1. Genghao Li, Bing Li, Langlin Huang, Sibing Hou, et al. 2020. Automatic construction of a depression-domain lexicon based on microblogs: text mining study. JMIR Medical Informatics, 8(6):e17650.
2. Wanxiang Che, Yunlong Feng, Libo Qin, and Ting Liu. 2020. N-LTP: An open-source neural language technology platform for Chinese. arXiv preprint arXiv:2009.11616.
3. Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, and Ziqing Yang. 2021. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3504–3514.
4. Hongzhi Qi, Qing Zhao, Changwei Song, Wei Zhai, Dan Luo, Shuo Liu, Yi Jing Yu, Fan Wang, Huijing Zou, Bing Xiang Yang, et al. 2023. Evaluating the efficacy of supervised learning vs large language models for identifying cognitive distortions and suicidal risks in Chinese social media. arXiv preprint arXiv:2309.03564.
5. Yicheng Cai, et al. 2023. Depression detection on online social network with multivariate time series feature of user depressive symptoms. Expert Systems with Applications, 217:119538.
6. Yiding Wang, et al. 2020. A multitask deep learning approach for user depression detection on Sina Weibo. arXiv preprint arXiv:2008.11708.
7. Genghao Li, et al. 2020. Automatic construction of a depression-domain lexicon based on microblogs: text mining study. JMIR Medical Informatics, 8(6):e17650.