/MentalLLaMA

This repository introduces MentaLLaMA, the first open-source instruction-following large language model for interpretable mental health analysis.


Kailai Yang (1,2), Tianlin Zhang (1,2), Shaoxiong Ji (3), Qianqian Xie (1,2), Ziyan Kuang (6), Sophia Ananiadou (1,2,4), Jimin Huang (5)
(1) National Centre for Text Mining; (2) The University of Manchester; (3) University of Helsinki; (4) Artificial Intelligence Research Center, AIST; (5) Wuhan University; (6) Jiangxi Normal University

News

📢 Oct. 31, 2023 We release the MentaLLaMA-33B-lora model, a 33B edition of MentaLLaMA based on Vicuna-33B and the full IMHI dataset, trained with LoRA due to limited computational resources!

📢 Oct. 13, 2023 We release the training data for the following datasets: DR, dreaddit, SAD, MultiWD, and IRF. More to come, stay tuned!

📢 Oct. 7, 2023 Our evaluation paper, "Towards Interpretable Mental Health Analysis with Large Language Models", has been accepted by the EMNLP 2023 main conference as a long paper!

Ethical Considerations

This repository and its contents are provided for non-clinical research only. None of the material constitutes actual diagnosis or advice, and help-seekers should get assistance from professional psychiatrists or clinical practitioners. No warranties, express or implied, are offered regarding the accuracy, completeness, or utility of the predictions and explanations. The authors and contributors are not responsible for any errors, omissions, or any consequences arising from the use of the information herein. Users should exercise their own judgment and consult professionals before making any clinical-related decisions. The use of the software and information contained in this repository is entirely at the user's own risk.

The raw datasets collected to build our IMHI dataset are from public social media platforms such as Reddit and Twitter, and we strictly follow the privacy protocols and ethical principles to protect user privacy and guarantee that anonymity is properly applied in all the mental health-related texts. In addition, to minimize misuse, all examples provided in our paper are paraphrased and obfuscated utilizing the moderate disguising scheme.

In addition, recent studies have indicated LLMs may introduce some potential bias, such as gender gaps. Meanwhile, some incorrect prediction results, inappropriate explanations, and over-generalization also illustrate the potential risks of current LLMs. Therefore, there are still many challenges in applying the model to real-scenario mental health monitoring systems.

By using or accessing the information in this repository, you agree to indemnify, defend, and hold harmless the authors, contributors, and any affiliated organizations or persons from any and all claims or damages.

Introduction

This project presents our efforts towards interpretable mental health analysis with large language models (LLMs). In earlier work, we comprehensively evaluated the zero-shot/few-shot performance of the latest LLMs, such as ChatGPT and GPT-4, on generating explanations for mental health analysis. Based on the findings, we built the Interpretable Mental Health Instruction (IMHI) dataset with 105K instruction samples, the first multi-task and multi-source instruction-tuning dataset for interpretable mental health analysis on social media. Based on the IMHI dataset, we propose MentaLLaMA, the first open-source instruction-following LLM series for interpretable mental health analysis. MentaLLaMA can perform mental health analysis on social media data and generate high-quality explanations for its predictions. We also introduce the first holistic evaluation benchmark for interpretable mental health analysis with 19K test samples, which covers 8 tasks and 10 test sets. Our contributions are presented in these 2 papers:

The MentaLLaMA Paper | The Evaluation Paper

MentaLLaMA Model

We provide 5 model checkpoints evaluated in the MentaLLaMA paper:

  • MentaLLaMA-33B-lora: This model is fine-tuned based on the Vicuna-33B foundation model and the full IMHI instruction tuning data. The training data covers 8 mental health analysis tasks. The model can follow instructions to make accurate mental health analysis and generate high-quality explanations for the predictions. Due to limited computational resources, we trained the MentaLLaMA-33B model with the PEFT technique LoRA, which significantly reduces memory usage.

  • MentaLLaMA-chat-13B: This model is fine-tuned based on the Meta LLaMA2-chat-13B foundation model and the full IMHI instruction tuning data. The training data covers 8 mental health analysis tasks. The model can follow instructions to make accurate mental health analysis and generate high-quality explanations for the predictions. Due to the model size, inference is relatively slow.

  • MentaLLaMA-chat-7B: This model is fine-tuned based on the Meta LLaMA2-chat-7B foundation model and the full IMHI instruction tuning data. The training data covers 8 mental health analysis tasks. The model can follow instructions to make mental health analysis and generate explanations for the predictions.

  • MentalBART: This model is fine-tuned based on the BART-large foundation model and the full IMHI-completion data. The training data covers 8 mental health analysis tasks. The model cannot follow instructions, but can make mental health analysis and generate explanations in a completion-based manner. The smaller size of this model allows faster inference and easier deployment.

  • MentalT5: This model is fine-tuned based on the T5-large foundation model and the full IMHI-completion data. The model cannot follow instructions, but can make mental health analysis and generate explanations in a completion-based manner. The smaller size of this model allows faster inference and easier deployment.

You can use the MentaLLaMA models in your Python project with the Hugging Face Transformers library. Here is a simple example of how to load the fully fine-tuned model:

from transformers import LlamaTokenizer, LlamaForCausalLM
tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map='auto')

In this example, LlamaTokenizer loads the tokenizer and LlamaForCausalLM loads the model. The device_map='auto' argument places the model on a GPU automatically if one is available. MODEL_PATH denotes your model save path.

After loading the models, you can generate a response. Here is an example:

prompt = 'Consider this post: "work, it has been a stressful week! hope it gets better." Question: What is the stress cause of this post?'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # move inputs to the model's device

# Generate
generate_ids = model.generate(inputs.input_ids, max_length=2048)
response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)

Running this code on MentaLLaMA-chat-13B produces the following response:

Answer: This post shows the stress cause related to work. Reasoning: The post explicitly mentions work as being stressful and expresses a hope that it gets better. This indicates that the poster is experiencing stress in relation to their work, suggesting that work is the primary cause of their stress in this instance.
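The example queries in this README share one pattern: the post text introduced by "Consider this post:" and followed by "Question:". The helper below is a minimal sketch of assembling such a prompt. The name build_prompt is hypothetical and not part of this repository, and because the examples differ in whether the post is wrapped in quotes, quoting is left as an option.

```python
def build_prompt(post: str, question: str, quote_post: bool = True) -> str:
    """Assemble a query in the 'Consider this post: ... Question: ...' pattern."""
    post = post.strip()
    if quote_post:
        # Wrap the post in double quotes, as in the stress-cause example.
        post = f'"{post}"'
    return f"Consider this post: {post} Question: {question.strip()}"


prompt = build_prompt(
    "work, it has been a stressful week! hope it gets better.",
    "What is the stress cause of this post?",
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of a hand-written prompt.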

For the MentaLLaMA-33B-lora model, since our model is based on the Vicuna-33B foundation model, you need to first download the Vicuna-33B model here, and put it under the ./vicuna-33B dir. Then download the MentaLLaMA-33B-lora weights and put them under the ./MentaLLaMA-33B-lora dir.

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
peft_model = AutoPeftModelForCausalLM.from_pretrained("./MentaLLaMA-33B-lora")
tokenizer = AutoTokenizer.from_pretrained('./MentaLLaMA-33B-lora')

After loading the models, you can generate a response. Here is an example:

input_data = ["Consider this post: I'm finally dead inside and I don't know how to feel about it Fear, Anger, Sadness... It's all gone. I just feel numb. Question: Does the poster suffer from depression?"]

inputs = tokenizer(input_data, return_tensors="pt", padding=True)
input_ids = inputs.input_ids

generate_ids = peft_model.generate(**inputs, max_length=2048)

# Drop the prompt tokens so that only the newly generated tokens are decoded
trunc_ids = generate_ids[0][len(input_ids[0]) :]
response = tokenizer.decode(trunc_ids, skip_special_tokens=True, spaces_between_special_tokens=False)

Running this code on MentaLLaMA-33B-lora produces the following response:

Reasoning: Yes, the poster suffers from depression. Reasoning: The poster's statement expresses a sense of emotional numbness and a lack of emotional response. This is a common symptom of depression, as individuals with depression often experience a diminished ability to feel emotions. The poster also mentions feeling dead inside, which further suggests a lack of emotional connection and a sense of hopelessness, both of which are common in depression. Overall, the language used and the description of emotional numbness align with symptoms commonly associated with depression.
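The slice in the LoRA example works because the output of generate for a causal language model starts with the prompt tokens, so dropping the first len(input_ids[0]) positions leaves only the newly generated tokens. Here is a toy illustration with plain Python lists standing in for token-id tensors (the ids are made up):

```python
# Made-up token ids; real ids would come from the tokenizer and generate().
prompt_ids = [1, 5289, 445, 1400]            # stands in for input_ids[0]
generated = prompt_ids + [673, 29901, 3869]  # generate() output repeats the prompt first

# Keep only what the model produced after the prompt.
new_tokens = generated[len(prompt_ids):]
print(new_tokens)  # [673, 29901, 3869]
```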

The IMHI Dataset

We collect raw data from 10 existing datasets covering 8 mental health analysis tasks, and convert them into test data for interpretable mental health analysis. Statistics on the 10 test sets are as follows:

| Name | Task | Data Split | Data Source | Annotation | Released |
|---|---|---|---|---|---|
| DR | depression detection | 1,003/430/405 | Reddit | Weak labels | Yes |
| CLP | depression detection | 456/196/299 | Reddit | Human annotations | Not yet |
| dreaddit | stress detection | 2,837/300/414 | Reddit | Human annotations | Yes |
| SWMH | mental disorders detection | 34,822/8,705/10,882 | Reddit | Weak labels | Not yet |
| T-SID | mental disorders detection | 3,071/767/959 | Twitter | Weak labels | Not yet |
| SAD | stress cause detection | 5,547/616/684 | SMS | Human annotations | Yes |
| CAMS | depression/suicide cause detection | 2,207/320/625 | Reddit | Human annotations | Not yet |
| loneliness | loneliness detection | 2,463/527/531 | Reddit | Human annotations | Not yet |
| MultiWD | Wellness dimensions detection | 15,744/1,500/2,441 | Reddit | Human annotations | Yes |
| IRF | Interpersonal risks factors detection | 3,943/985/2,113 | Reddit | Human annotations | Yes |
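As a quick consistency check, the split sizes in the table can be summed. This sketch assumes that the three numbers in each Data Split entry are the train/validation/test sizes, which the table does not state explicitly. Under that reading, the test column adds up to roughly the 19K test samples, and the grand total to roughly the 105K instruction samples, mentioned in the Introduction.

```python
# (train, validation, test) sizes copied from the table; the order is an assumption.
splits = {
    "DR": (1003, 430, 405),
    "CLP": (456, 196, 299),
    "dreaddit": (2837, 300, 414),
    "SWMH": (34822, 8705, 10882),
    "T-SID": (3071, 767, 959),
    "SAD": (5547, 616, 684),
    "CAMS": (2207, 320, 625),
    "loneliness": (2463, 527, 531),
    "MultiWD": (15744, 1500, 2441),
    "IRF": (3943, 985, 2113),
}

# Sum each column across the 10 datasets.
train_total, val_total, test_total = (sum(col) for col in zip(*splits.values()))
grand_total = train_total + val_total + test_total

print(train_total, val_total, test_total, grand_total)
# 72093 14346 19353 105792
```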

Training data

We introduce IMHI, the first multi-task and multi-source instruction-tuning dataset for interpretable mental health analysis on social media. We currently release the training and evaluation data from the following sets: DR, dreaddit, SAD, MultiWD, and IRF. The instruction data is put under

/train_data/instruction_data

The items are easy to follow: the query row denotes the question, and the gpt-3.5-turbo row denotes our modified and evaluated predictions and explanations from ChatGPT. gpt-3.5-turbo is used as the golden response for evaluation.
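The following is a minimal sketch of reading such query/response pairs. It assumes the data files are CSV files in which query and gpt-3.5-turbo are column headers; the exact layout is not spelled out here, so check a released file first. The sample contents are invented for illustration.

```python
import csv
import io

# Invented two-column sample standing in for a file under /train_data/instruction_data.
sample_csv = io.StringIO(
    'query,gpt-3.5-turbo\n'
    '"Consider this post: ""work, it has been a stressful week!"" Question: What is the stress cause of this post?",'
    '"This post shows the stress cause related to work. Reasoning: The post mentions work as being stressful."\n'
)

# Each record pairs a question with its reference response.
pairs = [(row["query"], row["gpt-3.5-turbo"]) for row in csv.DictReader(sample_csv)]

for query, reference in pairs:
    print("QUERY:    ", query)
    print("REFERENCE:", reference)
```

For a real file, replace the io.StringIO object with open(path, newline="", encoding="utf-8").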

To facilitate training on models with no instruction-following ability, we also release part of the training data for IMHI-completion. The data is put under

/train_data/complete_data

The file layout is the same as that of the instruction tuning data.

Evaluation Benchmark

We introduce the first holistic evaluation benchmark for interpretable mental health analysis with 19K test samples. We currently release the test data from the following sets: DR, dreaddit, SAD, MultiWD, and IRF. The instruction data is put under

/test_data/test_instruction

The items are easy to follow: the query row denotes the question, and the gpt-3.5-turbo row denotes our modified and evaluated predictions and explanations from ChatGPT. gpt-3.5-turbo is used as the golden response for evaluation.

To facilitate testing on models with no instruction-following ability, we also release part of the test data for IMHI-completion. The data is put under

/test_data/test_complete

The file layout is the same as that of the instruction tuning data.

Model Evaluation

Response Generation

To evaluate your trained model on the IMHI benchmark, first load your model and generate responses for all test items. We use the Hugging Face Transformers library to load the model. For LLaMA-based models, you can generate the responses with the following commands:

cd src
python IMHI.py --model_path MODEL_PATH --batch_size 8 --model_output_path OUTPUT_PATH --test_dataset IMHI --llama --cuda

MODEL_PATH and OUTPUT_PATH denote the model save path and the save path for generated responses. All generated responses will be put under ../model_output. Some generated examples are shown in

./examples/response_generation_examples

You can also evaluate with the IMHI-completion test set with the following commands:

cd src
python IMHI.py --model_path MODEL_PATH --batch_size 8 --model_output_path OUTPUT_PATH --test_dataset IMHI-completion --llama --cuda

You can also load models that are not based on LLaMA by removing the --llama argument. In the generated examples, the goldens row denotes the reference explanations and the generated_text row denotes the generated responses from your model.
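If you produce responses with your own pipeline rather than IMHI.py, the output can be written in the same shape. The sketch below assumes the output is a CSV file with goldens and generated_text as column headers; real files may contain additional columns, so compare against the files in ./examples/response_generation_examples. The row contents are invented.

```python
import csv
import tempfile
from pathlib import Path

# Invented example row; real files may contain additional columns.
rows = [
    {
        "goldens": "This post shows the stress cause related to work. Reasoning: The post names work as stressful.",
        "generated_text": "This post shows the stress cause related to work. Reasoning: The post mentions a stressful week.",
    },
]

output_dir = Path(tempfile.mkdtemp())  # stands in for your OUTPUT_PATH
output_file = output_dir / "SAD.csv"   # one file per dataset

with output_file.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["goldens", "generated_text"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the round trip.
with output_file.open(newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))
print(loaded[0]["generated_text"])
```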

Correctness Evaluation

The first evaluation metric for our IMHI benchmark is the classification correctness of the model generations. If your model generates very regular responses, a rule-based classifier can reliably assign a label to each response. We provide a rule-based classifier in IMHI.py, and you can use it during the response generation process by adding the --rule_calculate argument to your command. The classifier requires the following template:

[label] Reasoning: [explanation]
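To illustrate what this template allows, here is a minimal sketch that splits a response on the "Reasoning:" marker. The function split_response is hypothetical and is not the classifier implemented in IMHI.py; mapping the extracted label text to a dataset label would still need dataset-specific rules.

```python
from typing import Optional, Tuple


def split_response(response: str) -> Optional[Tuple[str, str]]:
    """Split '[label] Reasoning: [explanation]' into (label, explanation).

    Returns None when the response does not contain the 'Reasoning:' marker.
    """
    label, marker, explanation = response.partition("Reasoning:")
    if not marker:
        return None
    return label.strip(), explanation.strip()


regular = (
    "This post shows the stress cause related to work. "
    "Reasoning: The post explicitly mentions work as being stressful."
)
print(split_response(regular))
print(split_response("The poster seems to be under pressure at work."))  # None
```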

However, as most LLMs are trained to generate diverse responses, a rule-based label classifier is impractical. For example, MentaLLaMA can have the following response for an SAD query:

This post indicates that the poster's sister has tested positive for ovarian cancer and that the family is devastated. This suggests that the cause of stress in this situation is health issues, specifically the sister's diagnosis of ovarian cancer. The post does not mention any other potential stress causes, making health issues the most appropriate label in this case.

To solve this problem, in our MentaLLaMA paper we train 10 neural network classifiers based on MentalBERT, one for each collected raw dataset. The classifiers are trained to assign a classification label given the explanation. We release these 10 classifiers to facilitate future evaluations on the IMHI benchmark.

All trained models achieve over 95% accuracy on the IMHI test data. Before you assign the labels, make sure you have converted your output files to the format of /examples/response_generation_examples and named them DATASET.csv. Put all the output files you want to label under the same DATA_PATH dir. Then download the corresponding classifier models from the following links:

The models download links: CAMS, CLP, DR, dreaddit, Irf, loneliness, MultiWD, SAD, swmh, t-sid

Put all downloaded models under a MODEL_PATH dir and name each model with its dataset. For example, the model for DR dataset should be put under /MODEL_PATH/DR. Now you can obtain the labels using these models with the following commands:
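Before running label inference, it can help to confirm that every output file has a matching classifier directory. The sketch below is a hypothetical pre-flight check, not part of this repository; it assumes the classifier directory name must match the DATASET.csv file stem exactly, including case.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_classifiers(model_path: Path, data_path: Path) -> List[str]:
    """Return dataset names that have a DATASET.csv but no MODEL_PATH/DATASET dir."""
    datasets = sorted(p.stem for p in data_path.glob("*.csv"))
    return [name for name in datasets if not (model_path / name).is_dir()]


# Demo on a throwaway directory tree.
root = Path(tempfile.mkdtemp())
model_path, data_path = root / "models", root / "data"
(model_path / "DR").mkdir(parents=True)
data_path.mkdir()
(data_path / "DR.csv").touch()
(data_path / "SAD.csv").touch()

print(missing_classifiers(model_path, data_path))  # ['SAD']
```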

cd src
python label_inference.py --model_path MODEL_PATH --data_path DATA_PATH --data_output_path OUTPUT_PATH --cuda

where MODEL_PATH and DATA_PATH denote your specified model and data dirs, and OUTPUT_PATH denotes your output path. After processing, the output files should have the same format as the examples in /examples/label_data_examples. If you want to calculate metrics such as the weighted F1 score and accuracy, add the argument --calculate to the above command.
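For reference, here is a self-contained sketch of what accuracy and weighted F1 measure. It is an illustration of the metrics only, not the code used by label_inference.py, and the labels are invented.

```python
from collections import Counter
from typing import Sequence, Tuple


def accuracy_and_weighted_f1(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float]:
    """Compute accuracy and support-weighted F1 over string labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

    support = Counter(gold)    # number of gold items per label
    predicted = Counter(pred)  # number of predictions per label
    true_pos = Counter(g for g, p in zip(gold, pred) if g == p)

    weighted_f1 = 0.0
    for label, n_gold in support.items():
        tp = true_pos[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_gold
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        # Weight each per-label F1 by that label's share of the gold items.
        weighted_f1 += f1 * n_gold / len(gold)
    return accuracy, weighted_f1


gold = ["yes", "yes", "no", "no", "no"]
pred = ["yes", "no", "no", "no", "yes"]
acc, wf1 = accuracy_and_weighted_f1(gold, pred)
print(f"accuracy={acc:.3f} weighted_f1={wf1:.3f}")
```

In this toy case the per-label F1 scores are 0.5 for "yes" and 2/3 for "no", so both the accuracy and the weighted F1 come out at 0.6.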

Explanation Quality Evaluation

The second evaluation metric for the IMHI benchmark is the quality of the generated explanations. The results in our evaluation paper show that BART-score is moderately correlated with human annotations in 4 human evaluation aspects, and outperforms other automatic evaluation metrics. Therefore, we use BART-score to evaluate the quality of the generated explanations. First, generate responses using the IMHI.py script and obtain the response dir as in examples/response_generation_examples. Next, download the BART-score directory and put it under /src, then download the BART-score checkpoint. Finally, score your responses with BART-score using the following commands:

cd src
python score.py --gen_dir_name DIR_NAME --score_method bart_score --cuda

DIR_NAME denotes the dir name of your generated responses, which should be put under ../model_output. We also provide other scoring methods. You can change --score_method to 'GPT3_score', 'bert_score', 'bleu', or 'rouge' to use these metrics. For GPT-score, you need to first download the project and put it under /src.
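For orientation, BART-score rates a text by the log-likelihood that a pretrained BART model assigns to a target sequence given a source sequence. The formula below follows the definition in the BARTScore paper; the symbols are that paper's notation rather than names used in this repository, and which text serves as source and which as target in score.py is not stated here.

```latex
\mathrm{BARTScore} = \sum_{t=1}^{m} \omega_t \, \log p\left(y_t \mid y_{<t},\, x,\, \theta\right)
```

Here x is the source text, y = (y_1, ..., y_m) is the target text, θ denotes the BART parameters, and ω_t is a per-token weight. Because the score is a sum of log-probabilities, it is negative, and values closer to zero indicate a better match.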

Human Annotations

We release our human annotations on AI-generated explanations to facilitate future research on aligning automatic evaluation tools for interpretable mental health analysis. Based on these human evaluation results, we tested various existing automatic evaluation metrics on correlation with human preferences. The results in our evaluation paper show that BART-score is moderately correlated with human annotations in all 4 aspects.

Quality Evaluation

In our evaluation paper, we manually labeled a subset of the AIGC results for the DR dataset in 4 aspects: fluency, completeness, reliability, and overall. The annotations are released in this dir:

/human_evaluation/DR_annotation

where we labeled 163 ChatGPT-generated explanations for the depression detection dataset DR. The file chatgpt_data.csv includes 121 explanations that were correctly classified by ChatGPT. chatgpt_false_data.csv includes 42 explanations that were falsely classified by ChatGPT. We also include 121 explanations that were correctly classified by InstructGPT-3 in gpt3_data.csv.

Expert-written Golden Explanations

In our MentaLLaMA paper, we invited a domain expert majoring in quantitative psychology to write explanations for 350 selected posts (35 posts for each raw dataset). The golden set is used to accurately evaluate the explanation-generation ability of LLMs in an automatic manner. To facilitate future research, we release the expert-written explanations for the following datasets: DR, dreaddit, SWMH, T-SID, SAD, CAMS, loneliness, MultiWD, and IRF (35 samples each). The data is released in this dir:

/human_evaluation/test_instruction_expert

The expert-written explanations are processed to follow the same format as other test datasets to facilitate model evaluations. You can test your model on the expert-written golden explanations with similar commands as in response generation. For example, you can test LLaMA-based models as follows:

cd src
python IMHI.py --model_path MODEL_PATH --batch_size 8 --model_output_path OUTPUT_PATH --test_dataset expert --llama --cuda

Citation

If you use the human annotations or analysis in the evaluation paper, please cite:

@misc{yang2023interpretable,
      title={Towards Interpretable Mental Health Analysis with Large Language Models}, 
      author={Kailai Yang and Shaoxiong Ji and Tianlin Zhang and Qianqian Xie and Ziyan Kuang and Sophia Ananiadou},
      year={2023},
      eprint={2304.03347},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

If you use MentaLLaMA in your work, please cite:

@article{yang2023mentalllama,
  title={MentaLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models},
  author={Yang, Kailai and Zhang, Tianlin and Kuang, Ziyan and Xie, Qianqian and Ananiadou, Sophia},
  journal={arXiv preprint arXiv:2309.13567},
  year={2023}
}

License

MentaLLaMA is licensed under the MIT License. Please find more details in the MIT file.