LLSec

Primary language: Python. License: MIT.

LLSec is a prototype tool for generating business rule specifications in the securities domain based on large language models. It is the official implementation of the accompanying paper. In the paper, we propose an automatic specification method for business rules in the securities domain that leverages the natural language processing ability of large language models to assist in classifying and extracting business rules related to software requirements. We also use domain knowledge to refine business rules and identify rule relationships, ultimately producing a requirement specification in the form of data flows.

Project Structure

  • data/. Annotation, training, and validation data for rule filtering and rule extraction.
    • business_rules/. The annotation data.
    • rule_*. Training and validation data for Llama2 (.csv) and Mengzi (.json).
    • knowledge.json. The domain knowledge base.
  • experiment. Data and code for producing the experimental results.
  • fine_tune_llama2_model. Code for fine-tuning Llama2 and running inference with the trained model.
  • lora_train_llama2_model. Code for LoRA-training Llama2 and running inference with the trained model.
  • specification_generation. Code for the 4-step framework that generates specifications from rule documents.
  • train_rule_extraction_model. Code for fine-tuning Mengzi for the rule extraction task.
  • train_rule_filtering_model. Code for fine-tuning Mengzi for the rule filtering task.
  • transfer. Code to convert the rule format.

Getting Started

We provide commands that will install all the necessary dependencies step by step (sudo rights might be required). We conducted all the experiments on a workstation equipped with a 32-core AMD Ryzen Threadripper PRO 5975WX CPU, 256GB RAM, and an NVIDIA RTX 3090Ti GPU running Ubuntu 22.04.

  1. Install dependencies.

    sudo apt update
    sudo apt upgrade -y
    sudo apt install build-essential zlib1g-dev libbz2-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev
    sudo apt-get install -y libgl1-mesa-dev
    sudo apt-get install libglib2.0-dev
    sudo apt install wget
    sudo apt install git
  2. Install Miniconda.

    cd ~
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    source ~/.bashrc
  3. Create a virtual Python environment and install all the required dependencies.

    git clone https://github.com/LingelLi/LLSec  
    cd LLSec  
    
    # Use conda to install
    conda create -n LLSec python=3.10  
    conda activate LLSec  
    pip config set global.index-url https://pypi.mirrors.ustc.edu.cn/simple  
    pip install -r requirements.txt  
    
    # Install flash-attention based on your CUDA version. For example:
    wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl  
    pip install flash_attn-2.5.6+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl  
    
    pip install -e .  
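The flash-attention wheel filename encodes the package version and the CUDA, torch, ABI, and CPython tags it was built against. As a sanity check before installing, those tags can be pulled out of the filename and compared with your environment. This is an illustrative sketch only; the naming pattern is inferred from the example filename above, so verify it against the release page for other versions.

```python
import re

def parse_flash_attn_wheel(name: str) -> dict:
    """Split a flash-attention wheel filename into its version tags.

    The pattern is inferred from names like
    flash_attn-2.5.6+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
    and may need adjusting for other releases.
    """
    m = re.match(
        r"flash_attn-(?P<version>[\d.]+)\+cu(?P<cuda>\d+)"
        r"torch(?P<torch>[\d.]+)cxx11abi(?P<abi>TRUE|FALSE)"
        r"-cp(?P<python>\d+)-cp\d+-(?P<platform>\w+)\.whl",
        name,
    )
    if m is None:
        raise ValueError(f"unrecognized wheel name: {name}")
    tags = m.groupdict()
    tags["cuda"] = f"{tags['cuda'][:-1]}.{tags['cuda'][-1]}"      # cu122 -> 12.2
    tags["python"] = f"{tags['python'][0]}.{tags['python'][1:]}"  # cp310 -> 3.10
    return tags

tags = parse_flash_attn_wheel(
    "flash_attn-2.5.6+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
)
print(tags["cuda"], tags["torch"], tags["python"])  # 12.2 2.1 3.10
```

Compare the printed tags with `python --version` and `torch.version.cuda` in the LLSec environment (Python 3.10 per the step above) before installing the wheel.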
  4. Download the trained LLMs.

    git lfs install
    git clone https://huggingface.co/whotookmycookie/LLSec
    cp -r LLSec/mengzi_rule_filtering/ train_rule_filtering_model/model/
    cp -r LLSec/mengzi_rule_extraction/ train_rule_extraction_model/model/
    cp -r LLSec/llama2_rule_filtering_fine_tune/ fine_tune_llama2_model/model/rule_filtering/
    cp -r LLSec/llama2_rule_extraction_fine_tune/ fine_tune_llama2_model/model/rule_extraction/
    cp -r LLSec/llama2_rule_filtering_lora/ lora_train_llama2_model/model/rule_filtering/
    cp -r LLSec/llama2_rule_extraction_lora/ lora_train_llama2_model/model/rule_extraction/
    
    git clone https://huggingface.co/FlagAlpha/Atom-7B
    cp -r Atom-7B/ fine_tune_llama2_model/model/
    cp -r Atom-7B/ lora_train_llama2_model/model/
    
    git clone https://huggingface.co/Langboat/mengzi-bert-base-fin
    cp -r mengzi-bert-base-fin/ train_rule_filtering_model/model/
    cp -r mengzi-bert-base-fin/ train_rule_extraction_model/model/
    
    rm -rf LLSec
    rm -rf Atom-7B
    rm -rf mengzi-bert-base-fin
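After the copies above, every later script expects its model files in a fixed location. A small check, run from the repository root, can confirm nothing was missed. The directory list is only a convenience that mirrors the cp commands above and assumes each copy kept the source directory's name; adjust it if your layout differs.

```python
from pathlib import Path

# Destinations implied by the cp commands above (assumed names, not
# verified against the scripts); adjust if your copies landed elsewhere.
EXPECTED = [
    "train_rule_filtering_model/model/mengzi_rule_filtering",
    "train_rule_extraction_model/model/mengzi_rule_extraction",
    "fine_tune_llama2_model/model/rule_filtering/llama2_rule_filtering_fine_tune",
    "fine_tune_llama2_model/model/rule_extraction/llama2_rule_extraction_fine_tune",
    "lora_train_llama2_model/model/rule_filtering/llama2_rule_filtering_lora",
    "lora_train_llama2_model/model/rule_extraction/llama2_rule_extraction_lora",
    "fine_tune_llama2_model/model/Atom-7B",
    "lora_train_llama2_model/model/Atom-7B",
    "train_rule_filtering_model/model/mengzi-bert-base-fin",
    "train_rule_extraction_model/model/mengzi-bert-base-fin",
]

def missing_dirs(root: str = ".") -> list:
    """Return the expected model directories that do not exist under root."""
    return [p for p in EXPECTED if not (Path(root) / p).is_dir()]

for path in missing_dirs():
    print("missing:", path)
```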
  5. Run a test demo.

    cd specification_generation
    python main.py

    After the command finishes running, the generated specifications are saved at rules_cache/output.json.
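The internal schema of output.json is defined by the tool, so a quick way to inspect a run without assuming that schema is to report only the file's top-level shape. A minimal helper (hypothetical, not part of LLSec):

```python
import json
from pathlib import Path

def summarize_spec(path: str) -> str:
    """Report the top-level shape of a generated specification file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return f"dict with keys: {sorted(data)}"
    if isinstance(data, list):
        return f"list with {len(data)} entries"
    return type(data).__name__

# Example: print(summarize_spec("rules_cache/output.json"))
```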

Data Reproduction

We provide scripts that reproduce the experimental results in our paper. First, change into the experiment directory:

    cd experiment
  1. To obtain the results in Table 1, run:

    python case_study.py

    After the command finishes running, the results are organized into tabular data in CSV format and stored in case_study_data/table1.csv. Example outputs of Table 1 are:

    Step                      Rule form          #Business rules   Time (s)   #Domain knowledge   #Rule relations   #Business paths in specification
    Rule filtering            Natural language   98                3.89       -                   -                 -
    Rule extraction           FBR                135               ~3000      -                   -                 -
    Rule understanding        FBR                2308              0.37       28                  -                 -
    Relation identification   FBR                2562              5.46       12                  326               2562
  2. To obtain the results in Table 4, run:

    python generate_specification_exp1.py
    python compute_function_and_accuracy_exp1.py

    After the commands finish running, the results are organized into tabular data in CSV format and stored in exp1_data/table4.csv. Example outputs of Table 4, where each method reports #DF, FPI(%), and time in minutes:

    Dataset | #Rules #Deps #DF | Domain expert  | Non-expert     | GPT-4          | GLM-4          | LLSec
    1       | 10     0     11  | 20  89.11  33  | 29  69.81  75  | 38  55.99  20  | 41  50.69  18  | 14    87.52  4
    2       | 12     0     50  | 48  93.92  40  | 36  74.05  73  | 85  80.31  15  | 88  82.11  19  | 700   93.92  5
    3       | 12     4     78  | 40  91.26  35  | 50  64.75  85  | 41  82.63  25  | 48  71.73  9   | 276   93.27  5
    4       | 12     3     112 | 56  86.20  40  | 36  61.57  70  | 58  59.56  17  | 66  60.26  20  | 1288  90.95  6
    5       | 11     17    168 | 83  83.25  50  | 55  52.22  74  | 90  75.83  25  | 64  40.56  20  | 518   94.20  6
  3. To obtain the results in Figure 6, run:

    python compute_sc_LLM_acc.py
    python compute_tc_LLM_acc.py
    python draw_figure.py

    After the commands finish running, the results are organized into figures: Figure 6(a) is stored in rule_filtering_data/figure_6a.svg and Figure 6(b) in rule_extraction_data/figure_6b.svg.

  4. To obtain the results in Table 5, run:

    python generate_specification_exp3.py
    python compute_function_and_accuracy_exp3.py

    After the commands finish running, the results are organized into tabular data in CSV format and stored in exp3_data/table5.csv. Example outputs of Table 5, where each method reports #DF and FPI(%):

    Dataset | #Rules #DF  #Deps | Domain expert | Non-expert | GPT-4      | GLM-4      | LLSec (no domain knowledge) | LLSec
    1       | 5      24   0     | 47  95.83     | 16  75.00  | 48   79.17 | 25  78.33  | 16  70.00                   | 198  90.00
    2       | 10     42   0     | 73  92.24     | 47  69.99  | 107  77.46 | 50  73.41  | 36  71.02                   | 358  90.60
    3       | 10     86   8     | 63  88.09     | 51  58.12  | 87   67.71 | 44  64.19  | 30  58.72                   | 333  94.31
    4       | 11     186  42    | 72  78.28     | 76  51.79  | 81   51.28 | 52  55.15  | 37  44.49                   | 524  94.29
  5. To obtain the results in Table 6, run:

    python generate_specification_exp4.py
    python compute_function_and_accuracy_exp4.py

    After the commands finish running, the results are organized into tabular data in CSV format and stored in exp4_data/table6.csv. Example outputs of Table 6, where each method reports #DF and FPI(%):

    Dataset | Name                               | Source                    | #Rules #DF #Deps | GPT-4      | GLM-4      | LLSec
    1       | NYSE stock trading rules           | NYSE Rules                | 10     44  7     | 48  73.15  | 56  68.93  | 579   82.95
    2       | NYSE trading and settlement rules  | NYSE Rules                | 9      28  4     | 30  72.12  | 76  56.50  | 2497  76.68
    3       | TSE stock business regulations     | TSE Business Regulations  | 12     30  6     | 27  55.04  | 62  51.79  | 368   77.67
    4       | TSE bond business regulations      | TSE Business Regulations  | 9      22  5     | 20  56.49  | 36  52.91  | 537   81.60
    5       | HKEX trading mechanism             | HKEX Trading Mechanism    | 11     38  5     | 25  60.24  | 52  69.20  | 3062  79.84

Detailed Instructions

We provide details about how to generate specifications for a financial document, as well as how to train the LLMs used in the process.

  1. To generate specifications for a financial document named input_file, run:

    cd specification_generation
    python main.py --input_file {input_file}
    
    # for example:
    python main.py --input_file ./rules_cache/深圳证券交易所债券交易规则.pdf

    The workflow of generating specifications:

    • document_preprocess.py.
      • Read the input file and divide it into sentences.
      • Input: input_file (pdf or txt format), e.g., ./rules_cache/深圳证券交易所债券交易规则.pdf.
      • Output: ./rules_cache/sci.json and setting.json.
    • rule_filtering.py.
      • Use the trained Mengzi model to classify each rule as 0, 1, or 2.
      • Input: ./rules_cache/sci.json.
      • Output: ./rules_cache/sco.json.
    • Use GPT-4 to perform rule extraction (completed manually).
      • Input: rules.txt (all the rules in ./rules_cache/sco.json where type=1).
      • Output: chatgpt_output.txt.
    • gpt_output_to_input.py.
      • Read the output of GPT-4 and convert it to JSON format.
      • Input: chatgpt_output.txt.
      • Output: input.json.
    • rule_assembly.py.
      • Assemble the extracted rule elements into FBR.
      • Input: input.json and setting.json.
      • Output: BR.mydsl. mydsl is a plain text format for FBR.
    • rule_understanding.py.
      • Complete and combine the rules based on domain knowledge.
      • Input: BR.mydsl.
      • Output: UBR.mydsl and UBR.json.
    • rule_relation_mining.py.
      • Mine relations among the rules based on domain knowledge.
      • Input: UBR.mydsl.
      • Output: RUBR.mydsl, RUBR.json, relation.json, explicit_relation.json, and implicit_relation.json.
    • main.py
      • Integrate the above workflow.
      • Input: input_file (pdf or txt format) for rule extraction, and input.json, which is the result of the rule extraction performed by GPT-4, for rule understanding and relation mining.
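The stages above form a chain in which each script consumes files produced earlier, so the handoff can be summarized as data and checked mechanically. The sketch below uses the stage and file names from the workflow above; the two inputs it reports as "external" are the original document and rules.txt, which is assembled manually from sco.json rather than written by a script.

```python
# (stage, inputs, outputs) triples, using the filenames from the workflow above.
PIPELINE = [
    ("document_preprocess", ["input_file"], ["sci.json", "setting.json"]),
    ("rule_filtering", ["sci.json"], ["sco.json"]),
    ("gpt4_rule_extraction (manual)", ["rules.txt"], ["chatgpt_output.txt"]),
    ("gpt_output_to_input", ["chatgpt_output.txt"], ["input.json"]),
    ("rule_assembly", ["input.json", "setting.json"], ["BR.mydsl"]),
    ("rule_understanding", ["BR.mydsl"], ["UBR.mydsl", "UBR.json"]),
    ("rule_relation_mining", ["UBR.mydsl"], ["RUBR.mydsl", "RUBR.json", "relation.json"]),
]

def external_inputs(pipeline) -> list:
    """Return inputs that no earlier stage produces, in pipeline order."""
    produced, external = set(), []
    for _name, inputs, outputs in pipeline:
        external += [f for f in inputs if f not in produced]
        produced.update(outputs)
    return external

print(external_inputs(PIPELINE))  # ['input_file', 'rules.txt']
```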
  2. To train the Mengzi model for the rule filtering task, run:

    cd train_rule_filtering_model
    nohup ./train_model.sh >./output/train_model.log &

    After the command finishes running, the trained models are saved at ./model/best_{timestamp}. The code for training and validation is in train.py.

  3. To train the Mengzi model for the rule extraction task, run:

    cd train_rule_extraction_model
    nohup ./train_model.sh >./output/train_model.log &

    After the command finishes running, the trained models are saved at ./model/best_{timestamp}. The code for training and validation is in train.py.

  4. To fine-tune the Llama2 model for the rule filtering task, run:

    cd fine_tune_llama2_model
    nohup bash run_rule_filtering.sh >./output/run_rule_filtering.log &

    After the command finishes running, the trained models are saved at ./model/rule_filtering/best_fine-tune_model_{timestamp}. The code for training is in train.py and the code for validation is in predict.py.

  5. To fine-tune the Llama2 model for the rule extraction task, run:

    cd fine_tune_llama2_model
    nohup bash run_rule_extraction.sh >./output/run_rule_extraction.log &

    After the command finishes running, the trained models are saved at ./model/rule_extraction/best_fine-tune_model_{timestamp}. The code for training is in train.py and the code for validation is in predict.py.

  6. To train the Llama2 model for the rule filtering task using LoRA, run:

    cd lora_train_llama2_model
    nohup bash run_rule_filtering.sh >./output/run_rule_filtering.log &

    After the command finishes running, the trained models are saved at ./model/rule_filtering/best_lora_model_{timestamp}. The code for training is in train.py and the code for validation is in predict.py.

  7. To train the Llama2 model for the rule extraction task using LoRA, run:

    cd lora_train_llama2_model
    nohup bash run_rule_extraction.sh >./output/run_rule_extraction.log &

    After the command finishes running, the trained models are saved at ./model/rule_extraction/best_lora_model_{timestamp}. The code for training is in train.py and the code for validation is in predict.py.
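All six training commands above save their best model under a timestamped best_* directory, so several runs accumulate side by side. A small helper can pick the newest checkpoint to load; this is a convenience sketch, assuming only that checkpoints are directories whose names start with best_ as in the paths above.

```python
from pathlib import Path
from typing import Optional

def latest_checkpoint(model_dir: str, prefix: str = "best_") -> Optional[Path]:
    """Return the most recently modified checkpoint directory, or None.

    Assumes checkpoints are saved as directories named '{prefix}...{timestamp}',
    e.g. ./model/rule_filtering/best_lora_model_{timestamp}.
    """
    candidates = [p for p in Path(model_dir).glob(f"{prefix}*") if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)

# Example: latest_checkpoint("lora_train_llama2_model/model/rule_filtering")
```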