📈 CFGPT: Chinese Financial Assistant with Large Language Model

Introduction

We introduce CFGPT, an open-source language model trained in two stages. First, a general-purpose LLM is further pretrained on collected and cleaned Chinese finance text data (CFData-pt), which includes financial domain-specific data (announcements, finance articles, finance exams, finance news, finance research papers) and general data (Wikipedia). Second, the model is fine-tuned on knowledge-intensive instruction tuning data (CFData-sft). For preliminary evaluation we use CFBenchmark-Basic, on which CFGPT outperforms baseline models of similar parameter size on both objective and subjective tasks.

  • We develop CFGPT2 (7B&20B) based on InternLM, which also integrates modules for retrieval enhancement, fact verification, compliance checking, and risk monitoring. While enhancing the real-time performance and accuracy of financial large model services, it effectively monitors and controls financial risks.

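    • CFGPT2-7B: Obtained by further pretraining and supervised fine-tuning of InternLM2 on our dataset; it outperforms the baselines on several financial tasks.
  • We release CFGPT1 (7B) in three parts: a pretrained model, a LoRA adapter model, and a fully fine-tuned model (see Quick Start).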

  • We release CFBenchmark, a Chinese financial assistant benchmark for large language models. The basic version of CFBenchmark includes 3917 financial texts spanning three aspects and eight tasks, for evaluating the financial text processing capability of LLMs in the Chinese financial market.

  • We release the code for further pretraining and instruction tuning of CFGPT.

  • We further provide several samples of our CFData-sft.

The following is the overview of training CFGPT:

Quick Start

1. Prepare the code and the environment

Clone our repository, then create and activate a Python environment with the following commands:

git clone https://github.com/TongjiFinLab/CFGPT.git
cd CFGPT
conda create -n env_name python=3.10
conda activate env_name
pip install -r requirements.txt

2. Prepare the pretrained CFGPT1

CFGPT1 consists of three parts: a pretrained model (InternLM-7B further pretrained on our CFData-pt), an adapter model (trained via PEFT on our CFData-sft), and a fully fine-tuned model trained on top of the pretrained model.

| Pretrain model | Adapter model | Full SFT model |
|---|---|---|
| CFGPT1-pt-7B | CFGPT1-sft-7B-LoRA | CFGPT1-sft-7B-Full |

3. Use CFGPT1-sft-7B-LoRA

import torch  # needed for torch.bfloat16 below
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel
base_model = 'TongjiFinLab/CFGPT1-pt-7B'
lora_weights = 'TongjiFinLab/CFGPT1-sft-7B-LoRA'
device_map = 'cuda:0'
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model,
    trust_remote_code=True,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(
    model,
    lora_weights,
    device_map=device_map,
)
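model = model.eval()
# The prompt below is in Chinese. It asks the model, acting as a finance practitioner, to classify
# the sentiment of a news item as neutral, positive, or negative, and ends with "回答:" ("Answer:").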
inputs = tokenizer("""你是一名金融从业者,请对这篇新闻进行情感分析。请从(中性、积极、消极)中选取答案。新闻内容:挖贝快讯:特步国际发布2023年第二季度**内地业务营运状况,披露截至2023年6月30日止3个月零售销售实现高双位数同比增长(包括线上线下渠道),零售折扣水平约七五折。同时,2022年7月MSCI首次予以特步ESG评级,一年后评级表现即迎来提升。明晟MSCI上调特步ESG评级,由“BB”升至“BBB”。\n回答:""", return_tensors='pt').to(device_map)
pred = model.generate(**inputs, max_new_tokens=64, do_sample=False, repetition_penalty=1.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True).split('回答:')[1])

4. Use CFGPT1-sft-7B-Full

import torch  # needed for torch.bfloat16 below
from transformers import AutoModel, AutoTokenizer
base_model = 'TongjiFinLab/CFGPT1-sft-7B-Full'
device_map = 'cuda:0'
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model,
    trust_remote_code=True,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
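model = model.eval()
# Same Chinese sentiment-analysis prompt as in the LoRA example: choose neutral, positive, or
# negative for the news item; the prompt ends with "回答:" ("Answer:").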
inputs = tokenizer("""你是一名金融从业者,请对这篇新闻进行情感分析。请从(中性、积极、消极)中选取答案。新闻内容:挖贝快讯:特步国际发布2023年第二季度**内地业务营运状况,披露截至2023年6月30日止3个月零售销售实现高双位数同比增长(包括线上线下渠道),零售折扣水平约七五折。同时,2022年7月MSCI首次予以特步ESG评级,一年后评级表现即迎来提升。明晟MSCI上调特步ESG评级,由“BB”升至“BBB”。\n回答:""", return_tensors='pt').to(device_map)
pred = model.generate(**inputs, max_new_tokens=64, do_sample=False, repetition_penalty=1.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True).split('回答:')[1])
  • More details are in ./code/test
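The two snippets above share the same prompt template and the same answer-extraction step. A minimal sketch of that pattern as reusable helpers follows. The function names `build_sentiment_prompt` and `extract_answer` are hypothetical and not part of the CFGPT repository; the template simply mirrors the example prompt.

```python
# Hypothetical helpers; the names are illustrative and not from the CFGPT codebase.
ANSWER_MARKER = "回答:"  # the "Answer:" marker that ends the example prompt


def build_sentiment_prompt(news: str) -> str:
    """Wrap a news item in the sentiment-analysis template used in the examples above."""
    return (
        "你是一名金融从业者,请对这篇新闻进行情感分析。"
        "请从(中性、积极、消极)中选取答案。"
        f"新闻内容:{news}\n{ANSWER_MARKER}"
    )


def extract_answer(decoded: str) -> str:
    """Return the text that follows the first answer marker, as the examples do."""
    parts = decoded.split(ANSWER_MARKER)
    # Guard against outputs where the marker is missing instead of raising IndexError.
    return parts[1].strip() if len(parts) > 1 else ""
```

With these helpers, the tokenizer call becomes `tokenizer(build_sentiment_prompt(news), return_tensors='pt')`, and the final line becomes `print(extract_answer(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)))`.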

Use Cases

Data

In this repo, we share samples of CFData:

  • CFData: ./data

    CFData comprises a pre-training dataset and a supervised fine-tuning dataset. The pre-training dataset collates Chinese financial data and analytics, alongside a smaller subset of general-purpose text, with 584M documents and 141B tokens in total. The supervised fine-tuning dataset is tailored for six distinct financial tasks, embodying various facets of financial analysis and decision-making, with 1.5M instruction pairs and 1.5B tokens in total.

Further pretrain

The pre-training dataset consists of 584 million documents and 141 billion tokens across six sub-datasets:

  • CFData-CP (6.24%): 39 thousand corporate prospectuses with 13 billion tokens;
  • CFData-CA (12.28%): 6 million corporate announcements with 17 billion tokens;
  • CFData-RR (2.51%): 392 thousand research reports with 3 billion tokens;
  • CFData-FN (18.70%): 82 million financial news articles with 26 billion tokens;
  • CFData-SM (60.15%): 495 million social media posts with 84 billion tokens;
  • CFData-Wiki (0.09%): 255 thousand Wikipedia entries with 137 million tokens.

For further pretraining of InternLM-7B, we sample a financial text sub-corpus of 13.7 billion tokens from CFData-pt. It combines a large amount of Chinese financial data and analytics with a small amount of general-purpose text, covering announcements, research reports, social media content, financial news articles, and Wikipedia. Most of the data was collected by ourselves.

Supervised Finetuning

The supervised fine-tuning dataset consists of 1.6 million instruction pairs and 1.5 billion tokens, covering six financial tasks:

  • CFData-SA (5.69%): 120 thousand instances with 86 million tokens for sentiment analysis;
  • CFData-RS (50.60%): 369 thousand instances with 765 million tokens for report summary;
  • CFData-ED (22.69%): 490 thousand instances with 343 million tokens for event detection;
  • CFData-TD (12.37%): 369 thousand instances with 187 million tokens for topic decomposition;
  • CFData-QA (0.39%): 12 thousand instances with 6 million tokens for question answering;
  • CFData-SP (8.27%): 212 thousand instances with 125 million tokens for stock movement prediction.
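The percentages in the list above appear to be each subset's share of the total token count rather than of the instance count. That reading is an inference from the numbers, not something stated in the source. The short check below recomputes the shares from the rounded token counts listed above.

```python
# Token counts in millions for each CFData-sft subset, copied from the list above.
sft_tokens_m = {
    "CFData-SA": 86,
    "CFData-RS": 765,
    "CFData-ED": 343,
    "CFData-TD": 187,
    "CFData-QA": 6,
    "CFData-SP": 125,
}

# Sum of the rounded counts: 1512 million, i.e. roughly 1.5 billion tokens.
total_m = sum(sft_tokens_m.values())

# Percentage of the total token count contributed by each subset.
token_share = {name: 100 * n / total_m for name, n in sft_tokens_m.items()}

for name, share in token_share.items():
    print(f"{name}: {share:.2f}%")
```

Five of the six recomputed shares match the listed percentages to two decimals. CFData-QA comes out at about 0.40% instead of 0.39%, which is plausibly an effect of the token counts being rounded to whole millions.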

We employ high-quality domain-specific data to achieve finance domain adaptation during supervised fine-tuning. The dataset includes six financial datasets that reflect different aspects of financial analysis and decision-making: sentiment analysis, event detection, report summarization, topic decomposition, question answering, and stock movement prediction. CFData-sft provides a large amount of financial-domain text, allowing a FinLLM to learn from a variety of sources. Considering real-world requirements, we reorganize these supervised fine-tuning datasets into ten tasks.

The details are as follows:

| Task | Task Description | Dataset | Size |
|---|---|---|---|
| Sentiment | Identify the sentiment associated with a financial document | CFData-SA | 13K |
| Summary | Generate a content summary based on the provided financial document | CFData-RS | 18K |
| Risk | Generate risk alerts based on the provided financial document | CFData-RS | 20K |
| Suggestion | Generate investment recommendations based on the provided financial document | CFData-RS | 18K |
| Event | Identify the event categories associated with a financial document | CFData-ED | 12K |
| Industry | Identify the industry categories associated with a financial document | CFData-ED | 14K |
| Company | Identify the company names associated with a financial document | CFData-ED | 12K |
| Product | Identify the product names associated with a financial document | CFData-ED | 21K |
| Exam | Answer true-false questions related to finance | CFData-QA | 16K |
| Stock | Predict future stock movement | CFData-SP | 15K |

Researchers can refer to the sample cases of CFData.

Code

Further Pretrain

The training script is in ./code/train/pretrain:

deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_port 60002 bf_16_parallel_train.py --config bf_16_parallel_train.yml > bf_16_parallel_train.log 2>&1

The trainer parameters we use are in ./code/train/pretrain/bf_16_parallel_train.yml:

# basic setting
model_name: path/of/your/further/pretrain/model
dataset: path/to/your/further/pretrain/dataset
deepspeed: ./ds_config.json
seed: 42
max_seq_length: 2048

# train setting 
output_dir: ./bf_16_parallel_train
logging_steps: 10
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 16
learning_rate: 2.0e-4
weight_decay: 0.01
warmup_steps: 1000
save_steps: 1000
fp16: 0
bf16: 1
torch_compile: 0
save_strategy: steps
remove_unused_columns: 0
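As a rough worked example, the trainer settings above imply the following effective batch size and step count. The GPU count is taken from the `--include localhost:0,...,7` launch command, and the 13.7 billion token figure from the Data section. The calculation assumes every sequence is packed to the full `max_seq_length`, which this README does not state, so the step count is only an estimate.

```python
import math

# Values from bf_16_parallel_train.yml and the deepspeed launch command above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
num_gpus = 8                      # localhost:0,1,2,3,4,5,6,7
max_seq_length = 2048
corpus_tokens = 13.7e9            # sampled CFData-pt sub-corpus (see the Data section)

# Sequences consumed per optimizer step across all GPUs.
sequences_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Upper bound on tokens per optimizer step.
# ASSUMPTION: every sequence is packed to max_seq_length.
tokens_per_step = sequences_per_step * max_seq_length

# Estimated optimizer steps for one epoch over the sampled corpus.
estimated_steps = math.ceil(corpus_tokens / tokens_per_step)

print(sequences_per_step)   # 256
print(tokens_per_step)      # 524288
print(estimated_steps)      # 26131
```

Under these assumptions, `warmup_steps: 1000` covers roughly the first 4% of training.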

The deepspeed parameters we use are in ./code/train/pretrain/ds_config.json:

{
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 1,
        "reduce_bucket_size": 5e8
    }
}

Supervised Finetuning

The training script is in ./code/train/lora. Here we use lora-bf16 as an illustration.

deepspeed --include localhost:6,7 --master_port 60005 lora_bf_16_parallel_train.py --config lora_bf_16_parallel_train.yml > lora_bf_16_parallel_train.log 2>&1

The trainer parameters we use are in ./code/train/lora/bf16/bf_16_parallel_train.yml:

# basic setting
model_name: path/of/your/supervised/finetuning/model
dataset: path/to/your/supervised/finetuning/dataset
dataset_eval: path/to/your/evaluate/dataset
deepspeed: ./ds_config.json
seed: 42
max_seq_length: 2048

# train setting 
output_dir: ./lora_bf_16_parallel_train
num_train_epochs: 1
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
gradient_accumulation_steps: 1
learning_rate: 2.0e-4
weight_decay: 0.01
warmup_steps: 500
fp16: 0
bf16: 1
torch_compile: 0
save_strategy: steps
save_steps: 500
evaluation_strategy: steps
eval_steps: 100
logging_steps: 10
remove_unused_columns: 0

# lora setting
rank: 64
lora_alpha: 16
lora_dropout: 0.05
target_modules: ['k_proj', 'o_proj', 'down_proj', 'v_proj', 'q_proj', 'gate_proj', 'up_proj']
bias: 'none'

# restart info
resume_from_checkpoint: null
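To give a sense of scale for the LoRA settings above, the sketch below estimates the number of trainable adapter parameters and the LoRA scaling factor. The layer shapes (hidden size 4096, intermediate size 11008, 32 layers) are an assumption based on the common LLaMA-style 7B layout and are not stated in this README; check them against the base model's config before relying on the numbers.

```python
# LoRA hyperparameters from the trainer config above.
rank = 64
lora_alpha = 16

# ASSUMPTION: LLaMA-style 7B shapes; not given in this README.
hidden_size = 4096
intermediate_size = 11008
num_layers = 32

# (in_features, out_features) of each linear layer listed in target_modules.
module_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, hidden_size),
    "v_proj": (hidden_size, hidden_size),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) to each wrapped layer."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, rank) for i, o in module_shapes.values())
total = per_layer * num_layers

# Standard LoRA scales the low-rank update by lora_alpha / rank.
scaling = lora_alpha / rank

print(per_layer)  # 4997120
print(total)      # 159907840
print(scaling)    # 0.25
```

Under these assumptions the adapter trains about 160 million parameters, a small fraction of the 7B base model. In PEFT these settings would typically be passed to `LoraConfig` through its `r`, `lora_alpha`, `lora_dropout`, `target_modules`, and `bias` arguments.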

The deepspeed parameters we use are in ./code/train/lora/bf16/ds_config.json:

{
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 0
    }
}

Evaluation

The performance of our CFGPT2 (7B&20B) is shown as follows:

C-Eval

| Model | Size | STEM | Social Science | Humanities | Others | Average | Average (hard) |
|---|---|---|---|---|---|---|---|
| GPT-4 | - | 67.1 | 77.6 | 64.5 | 67.8 | 68.7 | 54.9 |
| ChatGPT | 175B | 52.9 | 61.8 | 50.9 | 53.6 | 54.4 | 41.4 |
| InternLM-7B | 7B | 48.0 | 67.4 | 55.4 | 45.8 | 52.8 | 37.1 |
| ChatGLM2-6B | 6B | 48.6 | 60.5 | 51.3 | 49.8 | 51.7 | 37.1 |
| Qwen-7B | 7B | 52.8 | 74.1 | 63.1 | 55.2 | 59.6 | 41.0 |
| Qwen-14B | 14B | 65.7 | 85.4 | 75.3 | 68.4 | 72.1 | 53.7 |
| Baichuan-7B | 7B | 38.2 | 52.0 | 46.2 | 39.3 | 42.8 | 31.5 |
| Baichuan-13B | 13B | 47.0 | 66.8 | 57.3 | 49.8 | 53.6 | 36.7 |
| Baichuan2-13B-Chat | 13B | 48.4 | 70.5 | 60.3 | 55.0 | 56.6 | 37.9 |
| InternLM2-7B | 7B | 52.3 | 71.9 | 64.9 | 61.0 | 60.8 | 38.8 |
| InternLM2-20B | 20B | 56.1 | 75.7 | 62.6 | 62.4 | 63.0 | 46.3 |
| CFGPT2-7B | 7B | 56.7 | 76.4 | 63.9 | 63.0 | 63.5 | 43.2 |
| CFGPT2-20B | 20B | 64.6 | 80.8 | 72.1 | 68.9 | 69.2 | 49.9 |

FinEval

| Model | Size | Finance | Economy | Accounting | Certificate | Average |
|---|---|---|---|---|---|---|
| GPT-4 | - | 71.0 | 74.5 | 59.3 | 70.4 | 68.6 |
| ChatGPT | 175B | 59.3 | 61.6 | 45.2 | 55.1 | 55.0 |
| InternLM-7B | 7B | 49.0 | 49.2 | 40.5 | 49.4 | 47.1 |
| ChatGLM2-6B | 6B | 46.5 | 46.4 | 44.5 | 51.5 | 47.4 |
| Qwen-Chat-7B | 7B | 51.5 | 52.1 | 44.5 | 53.6 | 50.5 |
| Qwen-7B | 7B | 54.5 | 54.4 | 50.3 | 55.8 | 53.8 |
| Baichuan-7B-Chat | 7B | 44.9 | 41.5 | 34.9 | 45.6 | 42.0 |
| Baichuan-13B-Chat | 13B | 51.6 | 51.1 | 41.7 | 52.8 | 49.4 |
| InternLM2-7B | 7B | 54.2 | 54.0 | 43.5 | 55.4 | 51.9 |
| InternLM2-20B | 20B | 57.3 | 58.9 | 47.4 | 58.6 | 55.5 |
| CFGPT2-7B | 7B | 62.6 | 63.9 | 58.9 | 66.0 | 62.9 |
| CFGPT2-20B | 20B | 64.0 | 64.9 | 62.1 | 67.9 | 64.8 |

CFBenchmark-Basic

| Model | Size | Company | Product | R.Avg | Sector | Event | Sentiment | C.Avg | Summary | Risk | Suggestion | G.Avg | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HUMAN | - | 0.931 | 0.744 | 0.838 | 0.975 | 0.939 | 0.912 | 0.942 | 1.000 | 1.000 | 1.000 | 1.000 | 0.927 |
| ChatGPT | 20B | 0.797 | 0.198 | 0.498 | 0.453 | 0.458 | 0.425 | 0.455 | 0.593 | 0.541 | 0.771 | 0.635 | 0.529 |
| ERNIE-Bot | 260B | 0.807 | 0.300 | 0.533 | 0.408 | 0.350 | 0.186 | 0.315 | 0.715 | 0.590 | 0.716 | 0.673 | 0.507 |
| ERNIE-Bot-4 | - | 0.819 | 0.417 | 0.618 | 0.418 | 0.358 | 0.375 | 0.384 | 0.721 | 0.629 | 0.718 | 0.689 | 0.564 |
| Falcon-7B | 7B | 0.671 | 0.168 | 0.420 | 0.169 | 0.132 | 0.250 | 0.184 | 0.302 | 0.301 | 0.246 | 0.283 | 0.296 |
| Falcon-7B-chat | 7B | 0.582 | 0.046 | 0.314 | 0.112 | 0.142 | 0.153 | 0.135 | 0.307 | 0.299 | 0.258 | 0.288 | 0.246 |
| bloomz-7B1 | 7B | 0.765 | 0.166 | 0.465 | 0.252 | 0.154 | 0.394 | 0.267 | 0.451 | 0.371 | 0.462 | 0.428 | 0.387 |
| bloomz-7B1-mt | 7B | 0.751 | 0.157 | 0.454 | 0.087 | 0.182 | 0.380 | 0.216 | 0.425 | 0.379 | 0.396 | 0.400 | 0.357 |
| Qwen-7B | 7B | 0.780 | 0.357 | 0.569 | 0.480 | 0.335 | 0.379 | 0.398 | 0.750 | 0.505 | 0.713 | 0.656 | 0.541 |
| Qwen-Chat-7B | 7B | 0.763 | 0.360 | 0.562 | 0.400 | 0.367 | 0.265 | 0.344 | 0.548 | 0.307 | 0.379 | 0.411 | 0.439 |
| Qwen-14B | 14B | 0.805 | 0.421 | 0.613 | 0.481 | 0.350 | 0.385 | 0.405 | 0.754 | 0.608 | 0.717 | 0.693 | 0.570 |
| Qwen-Chat-14B | 14B | 0.814 | 0.442 | 0.628 | 0.382 | 0.400 | 0.350 | 0.377 | 0.732 | 0.478 | 0.736 | 0.649 | 0.551 |
| ChatGLM2-6B | 6B | 0.747 | 0.313 | 0.530 | 0.285 | 0.300 | 0.357 | 0.314 | 0.657 | 0.454 | 0.671 | 0.594 | 0.479 |
| Baichuan2-7B-Base | 7B | 0.672 | 0.340 | 0.506 | 0.342 | 0.490 | 0.480 | 0.437 | 0.739 | 0.619 | 0.751 | 0.703 | 0.549 |
| Baichuan2-7B-Chat | 7B | 0.757 | 0.402 | 0.579 | 0.425 | 0.475 | 0.323 | 0.408 | 0.725 | 0.648 | 0.732 | 0.702 | 0.563 |
| Baichuan2-13B-Base | 13B | 0.781 | 0.330 | 0.555 | 0.436 | 0.496 | 0.477 | 0.470 | 0.725 | 0.503 | 0.747 | 0.658 | 0.561 |
| Baichuan2-13B-Chat | 13B | 0.797 | 0.314 | 0.556 | 0.472 | 0.507 | 0.387 | 0.455 | 0.739 | 0.634 | 0.746 | 0.706 | 0.572 |
| InternLM-7B | 7B | 0.612 | 0.233 | 0.423 | 0.266 | 0.311 | 0.328 | 0.302 | 0.378 | 0.336 | 0.379 | 0.364 | 0.363 |
| InternLM-7B-Chat | 7B | 0.632 | 0.261 | 0.447 | 0.272 | 0.364 | 0.399 | 0.345 | 0.363 | 0.270 | 0.353 | 0.329 | 0.374 |
| InternLM-20B | 20B | 0.809 | 0.358 | 0.583 | 0.500 | 0.427 | 0.417 | 0.448 | 0.706 | 0.653 | 0.728 | 0.695 | 0.575 |
| InternLM-20B-Chat | 20B | 0.488 | 0.362 | 0.425 | 0.323 | 0.327 | 0.370 | 0.340 | 0.706 | 0.578 | 0.762 | 0.662 | 0.476 |
| CFGPT1-sft-LoRA | 7B | 0.820 | 0.414 | 0.617 | 0.569 | 0.729 | 0.769 | 0.689 | 0.745 | 0.584 | 0.609 | 0.646 | 0.650 |
| CFGPT1-sft-Full | 7B | 0.836 | 0.476 | 0.656 | 0.700 | 0.808 | 0.829 | 0.779 | 0.798 | 0.669 | 0.808 | 0.758 | 0.731 |
| CFGPT2-7B | 7B | 0.834 | 0.470 | 0.652 | 0.644 | 0.750 | 0.793 | 0.729 | 0.801 | 0.692 | 0.790 | 0.761 | 0.714 |
| CFGPT2-20B | 20B | 0.891 | 0.501 | 0.696 | 0.722 | 0.825 | 0.865 | 0.806 | 0.825 | 0.727 | 0.823 | 0.792 | 0.755 |
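The aggregate columns are not defined in this README. For most rows the reported values are consistent with R.Avg, C.Avg, and G.Avg being the mean of the task columns to their left, and Avg being the mean of those three group averages. This is an inference from the numbers, not a documented definition. The sketch below checks that reading against the CFGPT1-sft-Full row.

```python
# Per-task scores copied from the CFGPT1-sft-Full row of the table above.
scores = {
    "Company": 0.836, "Product": 0.476,
    "Sector": 0.700, "Event": 0.808, "Sentiment": 0.829,
    "Summary": 0.798, "Risk": 0.669, "Suggestion": 0.808,
}

# ASSUMPTION: each group average covers the task columns listed before it.
groups = {
    "R.Avg": ["Company", "Product"],
    "C.Avg": ["Sector", "Event", "Sentiment"],
    "G.Avg": ["Summary", "Risk", "Suggestion"],
}


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


group_avg = {name: mean(scores[t] for t in tasks) for name, tasks in groups.items()}

# ASSUMPTION: the overall score is the unweighted mean of the three group averages.
overall = mean(group_avg.values())

for name, value in group_avg.items():
    print(f"{name}: {value:.3f}")
print(f"Avg: {overall:.3f}")
```

For this row the recomputed values round to the reported 0.656, 0.779, 0.758, and 0.731.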

OpenFinData

| Model | Size | Knowledge | Calculation | Explanation | Identification | Analysis | Compliance | Average |
|---|---|---|---|---|---|---|---|---|
| ERNIE-Bot-3.5 | - | 78.0 | 70.4 | 82.1 | 75.3 | 77.7 | 36.7 | 70.0 |
| ERNIE-Bot-4 | - | 87.3 | 73.6 | 84.3 | 77.0 | 79.1 | 37.3 | 73.1 |
| InternLM-7B | 7B | 65.3 | 45.8 | 71.4 | 62.5 | 59.2 | 37.2 | 56.9 |
| ChatGLM2-6B | 6B | 62.4 | 37.2 | 70.8 | 59.2 | 58.3 | 38.7 | 54.4 |
| Qwen-Chat-7B | 7B | 71.3 | 40.5 | 71.4 | 58.6 | 51.3 | 40.0 | 55.5 |
| Qwen-Chat-14B | 14B | 78.0 | 57.6 | 75.6 | 71.6 | 59.3 | 40.6 | 63.8 |
| Baichuan2-7B-Chat | 7B | 46.2 | 37.0 | 76.5 | 60.2 | 55.0 | 28.7 | 50.6 |
| Baichuan2-13B-Chat | 13B | 69.3 | 39.5 | 75.3 | 65.7 | 62.0 | 31.3 | 57.2 |
| InternLM2-7B | 7B | 70.2 | 39.9 | 73.4 | 62.8 | 61.4 | 39.5 | 57.8 |
| InternLM2-20B | 20B | 76.4 | 52.6 | 76.3 | 66.2 | 63.9 | 42.1 | 62.9 |
| CFGPT2-7B | 7B | 81.9 | 62.8 | 75.2 | 71.3 | 64.1 | 68.2 | 70.5 |
| CFGPT2-20B | 20B | 84.6 | 66.5 | 78.1 | 75.9 | 66.0 | 71.9 | 73.8 |

Acknowledgements

CFGPT has referred to the following open-source projects. We want to express our gratitude to the researchers of the projects.

To-Do List

  • Series of applications with CFGPT.
  • Constructing more comprehensive training tasks and their corresponding databases.
  • Continued improvement of the capabilities of CFGPT in more complex financial tasks.

License

The source code of CFGPT is released under the Apache 2.0 License. The CFGPT models also support commercial use, subject to the licenses of the base InternLM 7B and 20B models and the Terms of Use for data generated by OpenAI. Please contact us if you find any potential violations.

Thanks To Our Contributors :

Citation

If you find CFGPT useful for your research, please consider citing the following papers.

@article{li2023cfgpt,
  title={CFGPT: Chinese financial assistant with large language model},
  author={Li, Jiangtong and Bian, Yuxuan and Wang, Guoxuan and Lei, Yang and Cheng, Dawei and Ding, Zhijun and Jiang, Changjun},
  journal={arXiv preprint arXiv:2309.10654},
  year={2023}
}

@article{li2024ra,
  title={RA-CFGPT: Chinese financial assistant with retrieval-augmented large language model},
  author={Li, Jiangtong and Lei, Yang and Bian, Yuxuan and Cheng, Dawei and Ding, Zhijun and Jiang, Changjun},
  journal={Frontiers of Computer Science},
  volume={18},
  number={5},
  pages={185350},
  year={2024},
  publisher={Springer}
}