LLSec

Primary language: Python. License: MIT.

LLSec is a prototype tool for generating business rule specifications in the securities domain based on large language models. It is the official implementation of the accompanying paper. In the paper, we propose an automatic specification method for business rules in the securities domain that leverages the natural language processing ability of large language models to assist in classifying and extracting business rules related to software requirements. We also use domain knowledge to refine business rules and identify rule relationships, ultimately producing a requirement specification in the form of data flows.

Project Structure

  • data/. Annotation, training, and validation data for rule filtering and rule extraction.
    • business_rules/. The annotation data.
    • rule_*. Training and validation data for Llama2 (.csv) and Mengzi (.json).
    • knowledge.json. The domain knowledge base.
  • experiment. Data and code for producing the experimental results.
  • fine_tune_llama2_model. Code for fine-tuning Llama2 and running inference with the trained model.
  • lora_train_llama2_model. Code for LoRA-training Llama2 and running inference with the trained model.
  • specification_generation. Code for the 4-step framework that generates specifications from rule documents.
  • train_rule_extraction_model. Code for fine-tuning Mengzi for the rule extraction task.
  • train_rule_filtering_model. Code for fine-tuning Mengzi for the rule filtering task.
  • transfer. Code to convert the rule format.

Getting Started

We provide commands that will install all the necessary dependencies step by step (sudo rights might be required). We conducted all the experiments on a workstation equipped with a 32-core AMD Ryzen Threadripper PRO 5975WX CPU, 256GB RAM, and an NVIDIA RTX 3090Ti GPU running Ubuntu 22.04.

  1. Install dependencies.

    sudo apt update
    sudo apt upgrade -y
    sudo apt install build-essential zlib1g-dev libbz2-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev
    sudo apt-get install -y libgl1-mesa-dev
    sudo apt-get install libglib2.0-dev
    sudo apt install wget
    sudo apt install git
  2. Install Miniconda.

    cd ~
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    source ~/.bashrc
  3. Create a virtual Python environment and install all the required dependencies.

    git clone https://github.com/LingelLi/LLSec  
    cd LLSec  
    
    # Use conda to install
    conda create -n LLSec python=3.10  
    conda activate LLSec  
    pip config set global.index-url https://pypi.mirrors.ustc.edu.cn/simple  
    pip install -r requirements.txt  
    
    # Install flash-attention based on your CUDA version. For example:
    wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl  
    pip install flash_attn-2.5.6+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl  
    
    pip install -e .  
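The flash-attention wheel filename encodes the package version and the CUDA, torch, ABI, and CPython tags it was built against. As a sanity check before installing, those tags can be pulled out of the filename and compared with your environment. This is an illustrative sketch only; the naming pattern is inferred from the example filename above, so verify it against the release page for other versions.

```python
import re

def parse_flash_attn_wheel(name: str) -> dict:
    """Split a flash-attention wheel filename into its version tags.

    The pattern is inferred from names like
    flash_attn-2.5.6+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
    and may need adjusting for other releases.
    """
    m = re.match(
        r"flash_attn-(?P<version>[\d.]+)\+cu(?P<cuda>\d+)"
        r"torch(?P<torch>[\d.]+)cxx11abi(?P<abi>TRUE|FALSE)"
        r"-cp(?P<python>\d+)-cp\d+-(?P<platform>\w+)\.whl",
        name,
    )
    if m is None:
        raise ValueError(f"unrecognized wheel name: {name}")
    tags = m.groupdict()
    tags["cuda"] = f"{tags['cuda'][:-1]}.{tags['cuda'][-1]}"      # cu122 -> 12.2
    tags["python"] = f"{tags['python'][0]}.{tags['python'][1:]}"  # cp310 -> 3.10
    return tags

tags = parse_flash_attn_wheel(
    "flash_attn-2.5.6+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
)
print(tags["cuda"], tags["torch"], tags["python"])  # 12.2 2.1 3.10
```

Compare the printed tags with `python --version` and `torch.version.cuda` in the LLSec environment (Python 3.10 per the step above) before installing the wheel.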
  4. Download the trained LLMs.

    git lfs install
    git clone https://huggingface.co/whotookmycookie/LLSec
    cp -r LLSec/mengzi_rule_filtering/ train_rule_filtering_model/model/
    cp -r LLSec/mengzi_rule_extraction/ train_rule_extraction_model/model/
    cp -r LLSec/llama2_rule_filtering_fine_tune/ fine_tune_llama2_model/model/rule_filtering/
    cp -r LLSec/llama2_rule_extraction_fine_tune/ fine_tune_llama2_model/model/rule_extraction/
    cp -r LLSec/llama2_rule_filtering_lora/ lora_train_llama2_model/model/rule_filtering/
    cp -r LLSec/llama2_rule_extraction_lora/ lora_train_llama2_model/model/rule_extraction/
    
    git clone https://huggingface.co/FlagAlpha/Atom-7B
    cp -r Atom-7B/ fine_tune_llama2_model/model/
    cp -r Atom-7B/ lora_train_llama2_model/model/
    
    git clone https://huggingface.co/Langboat/mengzi-bert-base-fin
    cp -r mengzi-bert-base-fin/ train_rule_filtering_model/model/
    cp -r mengzi-bert-base-fin/ train_rule_extraction_model/model/
    
    rm -rf LLSec
    rm -rf Atom-7B
    rm -rf mengzi-bert-base-fin
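After the copies above, every later script expects its model files in a fixed location. A small check, run from the repository root, can confirm nothing was missed. The directory list is only a convenience that mirrors the cp commands above and assumes each copy kept the source directory's name; adjust it if your layout differs.

```python
from pathlib import Path

# Destinations implied by the cp commands above (assumed names, not
# verified against the scripts); adjust if your copies landed elsewhere.
EXPECTED = [
    "train_rule_filtering_model/model/mengzi_rule_filtering",
    "train_rule_extraction_model/model/mengzi_rule_extraction",
    "fine_tune_llama2_model/model/rule_filtering/llama2_rule_filtering_fine_tune",
    "fine_tune_llama2_model/model/rule_extraction/llama2_rule_extraction_fine_tune",
    "lora_train_llama2_model/model/rule_filtering/llama2_rule_filtering_lora",
    "lora_train_llama2_model/model/rule_extraction/llama2_rule_extraction_lora",
    "fine_tune_llama2_model/model/Atom-7B",
    "lora_train_llama2_model/model/Atom-7B",
    "train_rule_filtering_model/model/mengzi-bert-base-fin",
    "train_rule_extraction_model/model/mengzi-bert-base-fin",
]

def missing_dirs(root: str = ".") -> list:
    """Return the expected model directories that do not exist under root."""
    return [p for p in EXPECTED if not (Path(root) / p).is_dir()]

for path in missing_dirs():
    print("missing:", path)
```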
  5. Run a test demo.

    cd specification_generation
    python main.py

    After the command finishes running, the generated specifications are saved at rules_cache/output.json.
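The internal schema of output.json is defined by the tool, so a quick way to inspect a run without assuming that schema is to report only the file's top-level shape. A minimal helper (hypothetical, not part of LLSec):

```python
import json
from pathlib import Path

def summarize_spec(path: str) -> str:
    """Report the top-level shape of a generated specification file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return f"dict with keys: {sorted(data)}"
    if isinstance(data, list):
        return f"list with {len(data)} entries"
    return type(data).__name__

# Example: print(summarize_spec("rules_cache/output.json"))
```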

Data Reproduction

We provide scripts that reproduce the experimental results in our paper. First, change into the experiment directory:

    cd experiment
  1. To obtain the results in Table 1, run:

    python case_study.py

    After the command finishes running, the results are organized into tabular data in CSV format and stored in case_study_data/table1.csv. Example outputs of Table 1 are:

    Step                      Rule form          #Business rules   Time (s)   #Domain knowledge   #Rule relations   #Business paths in specification
    Rule filtering            Natural language   98                3.89       -                   -                 -
    Rule extraction           FBR                135               ~3000      -                   -                 -
    Rule understanding        FBR                2308              0.37       28                  -                 -
    Relation identification   FBR                2562              5.46       12                  326               2562
  2. To obtain the results in Table 4, run:

    python generate_specification_exp1.py
    python compute_function_and_accuracy_exp1.py

    After the commands finish running, the results are organized into tabular data in CSV format and stored in exp1_data/table4.csv. Example outputs of Table 4, where each method reports #DF, FPI(%), and time in minutes:

    Dataset | #Rules #Deps #DF | Domain expert  | Non-expert     | GPT-4          | GLM-4          | LLSec
    1       | 10     0     11  | 20  89.11  33  | 29  69.81  75  | 38  55.99  20  | 41  50.69  18  | 14    87.52  4
    2       | 12     0     50  | 48  93.92  40  | 36  74.05  73  | 85  80.31  15  | 88  82.11  19  | 700   93.92  5
    3       | 12     4     78  | 40  91.26  35  | 50  64.75  85  | 41  82.63  25  | 48  71.73  9   | 276   93.27  5
    4       | 12     3     112 | 56  86.20  40  | 36  61.57  70  | 58  59.56  17  | 66  60.26  20  | 1288  90.95  6
    5       | 11     17    168 | 83  83.25  50  | 55  52.22  74  | 90  75.83  25  | 64  40.56  20  | 518   94.20  6
  3. To obtain the results in Figure 6, run:

    python compute_sc_LLM_acc.py
    python compute_tc_LLM_acc.py
    python draw_figure.py

    After the commands finish running, the results are organized into figures: Figure 6(a) is stored in rule_filtering_data/figure_6a.svg and Figure 6(b) in rule_extraction_data/figure_6b.svg.

  4. To obtain the results in Table 5, run:

    python generate_specification_exp3.py
    python compute_function_and_accuracy_exp3.py

    After the commands finish running, the results are organized into tabular data in CSV format and stored in exp3_data/table5.csv. Example outputs of Table 5, where each method reports #DF and FPI(%):

    Dataset | #Rules #DF  #Deps | Domain expert | Non-expert | GPT-4      | GLM-4      | LLSec (no domain knowledge) | LLSec
    1       | 5      24   0     | 47  95.83     | 16  75.00  | 48   79.17 | 25  78.33  | 16  70.00                   | 198  90.00
    2       | 10     42   0     | 73  92.24     | 47  69.99  | 107  77.46 | 50  73.41  | 36  71.02                   | 358  90.60
    3       | 10     86   8     | 63  88.09     | 51  58.12  | 87   67.71 | 44  64.19  | 30  58.72                   | 333  94.31
    4       | 11     186  42    | 72  78.28     | 76  51.79  | 81   51.28 | 52  55.15  | 37  44.49                   | 524  94.29
  5. To obtain the results in Table 6, run:

    python generate_specification_exp4.py
    python compute_function_and_accuracy_exp4.py

    After the commands finish running, the results are organized into tabular data in CSV format and stored in exp4_data/table6.csv. Example outputs of Table 6, where each method reports #DF and FPI(%):

    Dataset | Name                               | Source                    | #Rules #DF #Deps | GPT-4      | GLM-4      | LLSec
    1       | NYSE stock trading rules           | NYSE Rules                | 10     44  7     | 48  73.15  | 56  68.93  | 579   82.95
    2       | NYSE trading and settlement rules  | NYSE Rules                | 9      28  4     | 30  72.12  | 76  56.50  | 2497  76.68
    3       | TSE stock business regulations     | TSE Business Regulations  | 12     30  6     | 27  55.04  | 62  51.79  | 368   77.67
    4       | TSE bond business regulations      | TSE Business Regulations  | 9      22  5     | 20  56.49  | 36  52.91  | 537   81.60
    5       | HKEX trading mechanism             | HKEX Trading Mechanism    | 11     38  5     | 25  60.24  | 52  69.20  | 3062  79.84

Detailed Instructions

We provide details about how to generate specifications for a financial document, as well as how to train the LLMs used in the process.

  1. To generate specifications for a financial document named input_file, run:

    cd specification_generation
    python main.py --input_file {input_file}
    
    # for example:
    python main.py --input_file ./rules_cache/深圳证券交易所债券交易规则.pdf

    The workflow of generating specifications:

    • document_preprocess.py.
      • Read the input file and divide it into sentences.
      • Input: input_file (pdf or txt format), e.g., ./rules_cache/深圳证券交易所债券交易规则.pdf.
      • Output: ./rules_cache/sci.json and setting.json.
    • rule_filtering.py.
      • Use the trained Mengzi model to classify each rule as 0, 1, or 2.
      • Input: ./rules_cache/sci.json.
      • Output: ./rules_cache/sco.json.
    • Use GPT-4 to perform rule extraction (completed manually).
      • Input: rules.txt (all the rules in ./rules_cache/sco.json where type=1).
      • Output: chatgpt_output.txt.
    • gpt_output_to_input.py.
      • Read the output of GPT-4 and convert it to JSON format.
      • Input: chatgpt_output.txt.
      • Output: input.json.
    • rule_assembly.py.
      • Assemble the extracted rule elements into FBR.
      • Input: input.json and setting.json.
      • Output: BR.mydsl. mydsl is a plain text format for FBR.
    • rule_understanding.py.
      • Complete and combine the rules based on domain knowledge.
      • Input: BR.mydsl.
      • Output: UBR.mydsl and UBR.json.
    • rule_relation_mining.py.
      • Mine relations among the rules based on domain knowledge.
      • Input: UBR.mydsl.
      • Output: RUBR.mydsl, RUBR.json, relation.json, explicit_relation.json, and implicit_relation.json.
    • main.py
      • Integrate the above workflow.
      • Input: input_file (pdf or txt format) for rule extraction, and input.json, which is the result of the rule extraction performed by GPT-4, for rule understanding and relation mining.
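The stages above form a chain in which each script consumes files produced earlier, so the handoff can be summarized as data and checked mechanically. The sketch below uses the stage and file names from the workflow above; the two inputs it reports as "external" are the original document and rules.txt, which is assembled manually from sco.json rather than written by a script.

```python
# (stage, inputs, outputs) triples, using the filenames from the workflow above.
PIPELINE = [
    ("document_preprocess", ["input_file"], ["sci.json", "setting.json"]),
    ("rule_filtering", ["sci.json"], ["sco.json"]),
    ("gpt4_rule_extraction (manual)", ["rules.txt"], ["chatgpt_output.txt"]),
    ("gpt_output_to_input", ["chatgpt_output.txt"], ["input.json"]),
    ("rule_assembly", ["input.json", "setting.json"], ["BR.mydsl"]),
    ("rule_understanding", ["BR.mydsl"], ["UBR.mydsl", "UBR.json"]),
    ("rule_relation_mining", ["UBR.mydsl"], ["RUBR.mydsl", "RUBR.json", "relation.json"]),
]

def external_inputs(pipeline) -> list:
    """Return inputs that no earlier stage produces, in pipeline order."""
    produced, external = set(), []
    for _name, inputs, outputs in pipeline:
        external += [f for f in inputs if f not in produced]
        produced.update(outputs)
    return external

print(external_inputs(PIPELINE))  # ['input_file', 'rules.txt']
```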
  2. To train the Mengzi model for the rule filtering task, run:

    cd train_rule_filtering_model
    nohup ./train_model.sh >./output/train_model.log &

    After the command finishes running, the trained models are saved at ./model/best_{timestamp}. The code for training and validation is in train.py.

  3. To train the Mengzi model for the rule extraction task, run:

    cd train_rule_extraction_model
    nohup ./train_model.sh >./output/train_model.log &

    After the command finishes running, the trained models are saved at ./model/best_{timestamp}. The code for training and validation is in train.py.

  4. To fine-tune the Llama2 model for the rule filtering task, run:

    cd fine_tune_llama2_model
    nohup bash run_rule_filtering.sh >./output/run_rule_filtering.log &

    After the command finishes running, the trained models are saved at ./model/rule_filtering/best_fine-tune_model_{timestamp}. The code for training is in train.py and the code for validation is in predict.py.

  5. To fine-tune the Llama2 model for the rule extraction task, run:

    cd fine_tune_llama2_model
    nohup bash run_rule_extraction.sh >./output/run_rule_extraction.log &

    After the command finishes running, the trained models are saved at ./model/rule_extraction/best_fine-tune_model_{timestamp}. The code for training is in train.py and the code for validation is in predict.py.

  6. To train the Llama2 model for the rule filtering task using LoRA, run:

    cd lora_train_llama2_model
    nohup bash run_rule_filtering.sh >./output/run_rule_filtering.log &

    After the command finishes running, the trained models are saved at ./model/rule_filtering/best_lora_model_{timestamp}. The code for training is in train.py and the code for validation is in predict.py.

  7. To train the Llama2 model for the rule extraction task using LoRA, run:

    cd lora_train_llama2_model
    nohup bash run_rule_extraction.sh >./output/run_rule_extraction.log &

    After the command finishes running, the trained models are saved at ./model/rule_extraction/best_lora_model_{timestamp}. The code for training is in train.py and the code for validation is in predict.py.
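All six training commands above save their best model under a timestamped best_* directory, so several runs accumulate side by side. A small helper can pick the newest checkpoint to load; this is a convenience sketch, assuming only that checkpoints are directories whose names start with best_ as in the paths above.

```python
from pathlib import Path
from typing import Optional

def latest_checkpoint(model_dir: str, prefix: str = "best_") -> Optional[Path]:
    """Return the most recently modified checkpoint directory, or None.

    Assumes checkpoints are saved as directories named '{prefix}...{timestamp}',
    e.g. ./model/rule_filtering/best_lora_model_{timestamp}.
    """
    candidates = [p for p in Path(model_dir).glob(f"{prefix}*") if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)

# Example: latest_checkpoint("lora_train_llama2_model/model/rule_filtering")
```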