⚖️ LAiW: A Chinese Legal Large Language Models Benchmark

| English | Chinese |

LAiW: A Comprehensive Benchmark for Chinese Legal Large Language Models (LLMs)

🔥 LAiW Leaderboard

🔥 Technical Report and Official Paper

News

🔄 Recent Updates

  • [2024/4/19] The official paper has been updated.

📅 Earlier News

  • [2024/1/22] Added evaluation results for the general LLM Baichuan-7B.
  • [2024/1/14] Provided more detailed information on the evaluation dataset here, along with the calculation method for the model evaluation metric SCULAiW.
  • [2024/1/12] Further confirmed and improved relevant evaluation results, optimized the layout of the evaluation leaderboard SCULAiW, and supplemented more detailed information on evaluated models.
  • [2024/1/10] Added evaluations for the commercial LLM GPT-4 and the general LLMs Llama-7B, Llama-13B, and Chinese-LLaMA-13B.
  • [2024/1/2] Announced the scoring mechanism for the legal capabilities of LLMs here and published the evaluation scores for LLMs here.
  • [2024/1/2] Released test datasets for 14 foundational tasks here.
  • [2024/1/1] Updated the legal capability evaluation results for SCULAiW.
  • [2023/12/31] Completed legal capability evaluations for mainstream LLMs. In addition to the models mentioned earlier, this round added the general LLM ChatGLM and the legal LLMs Lawyer-LLaMA, Fuzi-Mingcha, Wisdom-Interrogatory, and LexiLaw.
  • [2023/10/12] Published the initial version of the LAiW Technical Report.
  • [2023/10/08] Released the first phase evaluation system for LAiW capabilities here.
  • [2023/10/08] Completed the first phase evaluation of the Basic Information Retrieval capabilities of LLMs, including commercial LLMs: ChatGPT; general LLMs: Llama2, Ziya-LLaMA, Chinese-LLaMA, Baichuan2; and legal LLMs: HanFei, ChatLaw, LaWGPT.
  • [2023/10/08] Released evaluation scores and calculation methods for legal capabilities and foundational tasks.


Evaluation structure diagram

Scores for LLMs

According to the scoring mechanism described below, we have evaluated 7 legal LLMs and 11 general LLMs at this stage. The model scores are as follows:

| Model | Size | Domain | Total Score | BIR | LFI | CLA | Base Model |
|---|---|---|---|---|---|---|---|
| GPT-4 | - | General | 69.63 | 80.92 | 69.27 | 58.69 | - |
| ChatGPT | - | General | 64.09 | 75.99 | 58.32 | 57.96 | - |
| Baichuan2-Chat | 13B | General | 48.04 | 53.67 | 32.03 | 58.40 | - |
| ChatGLM | 6B | General | 47.01 | 51.51 | 37.08 | 52.44 | - |
| Ziya-LLaMA | 13B | General | 45.79 | 61.47 | 29.44 | 46.45 | Llama-13B |
| Fuzi-Mingcha | 6B | Legal | 40.62 | 39.68 | 27.46 | 54.71 | ChatGLM-6B |
| HanFei | 7B | Legal | 35.69 | 37.42 | 16.33 | 53.31 | - |
| LexiLaw | 6B | Legal | 31.31 | 41.32 | 8.88 | 43.73 | ChatGLM-6B |
| Lawyer-LLaMA | 13B | Legal | 29.25 | 30.85 | 6.39 | 50.50 | Chinese-LLaMA-13B |
| Llama2-Chat | 7B | General | 27.76 | 31.86 | 12.77 | 38.64 | - |
| ChatLaw | 13B | Legal | 25.77 | 58.02 | 12.54 | 6.74 | Ziya-LLaMA-13B |
| Chinese-LLaMA | 13B | General | 24.99 | 21.02 | 19.16 | 34.80 | Llama-13B |
| Chinese-LLaMA | 7B | General | 24.91 | 22.32 | 18.25 | 34.16 | Llama-7B |
| LaWGPT | 7B | Legal | 22.69 | 15.47 | 14.27 | 38.32 | Chinese-LLaMA-7B |
| Baichuan | 7B | General | 22.51 | 21.20 | 15.46 | 30.86 | - |
| Llama | 13B | General | 21.00 | 18.51 | 15.08 | 29.40 | - |
| Wisdom-Interrogatory | 7B | Legal | 18.83 | 12.66 | 10.45 | 33.37 | Baichuan-7B |
| Llama | 7B | General | 16.35 | 11.12 | 15.40 | 22.54 | - |

The overall scores and the scores at each level of legal capability are ranked as follows:

(Histograms: Overall | BIR | LFI | CLA)

Tasks

With the joint efforts of legal experts and artificial intelligence experts, we categorize the legal capabilities of LLMs into three levels, ranging from easy to difficult: Basic Information Retrieval (BIR), Legal Foundation Inference (LFI), and Complex Legal Application (CLA), comprising 14 foundational tasks in total. The diagram above shows the structure of these three capability levels.

  • Basic Information Retrieval. This capability covers fundamental tasks in the field of law that can be directly transferred from NLP, as well as some simple yet crucial pre-tasks in the legal domain. It includes 5 foundational tasks: Legal Article Recommendation (AR), Element Recognition (ER), Named Entity Recognition (NER), Judicial Summarization (JS), and Case Recognition (CR).
  • Legal Foundation Inference. This capability tests basic legal applications of LLMs. It includes 6 foundational tasks: Controversial Focus Mining (CFM), Similar Case Matching (SCM), Charge Prediction (CP), Prison Term Prediction (PTP), Civil Trial Prediction (CTP), and Legal Question Answering (LQA).
  • Complex Legal Application. This capability covers the challenging tasks that LLMs may face, such as complex reasoning in the legal field and aligning with real legal logic. Here, we focus on 3 tasks: Judicial Reasoning Generation (JRG), Case Understanding (CU), and Legal Consultation (LC).

Below is a brief description of each evaluation task.

| Capability | Task | Description |
|---|---|---|
| BIR | Legal Article Recommendation | It aims to provide relevant law articles based on the description of the case. |
| | Element Recognition | It analyzes and assesses each sentence to identify the pivotal elements of the case. |
| | Named Entity Recognition | It aims to extract nouns and phrases with legal characteristics from various legal documents. |
| | Judicial Summarization | It aims to condense, summarize, and synthesize the content of legal documents. |
| | Case Recognition | It aims to determine, based on the relevant description of the case, whether it pertains to a criminal or civil matter. |
| LFI | Controversial Focus Mining | It aims to extract the logical and interactive arguments between the defense and prosecution in legal documents, which are analyzed as a key component of the tasks related to the case result. |
| | Similar Case Matching | It aims to find the cases that bear the closest resemblance, a core aspect of legal systems worldwide, since similar cases require consistent judgments to ensure the fairness of the law. |
| | Criminal Judgment Prediction | It predicts the guilt or innocence of the defendant, along with the potential sentencing, based on the results of basic legal NLP: the facts of the case, the evidence presented, and the applicable law articles. It is therefore divided into two tasks: Charge Prediction and Prison Term Prediction. |
| | Civil Trial Prediction | It uses factual descriptions to predict the judgment on the plaintiff's claim against the defendant, for which the Controversial Focus should be considered. |
| | Legal Question Answering | It utilizes the model's legal knowledge to address the national judicial examination, which encompasses various specific legal types. |
| CLA | Judicial Reasoning Generation | It aims to generate relevant legal reasoning texts based on the factual description of the case. It is a complex reasoning task, because the court must further elaborate the reasoning behind the judgment based on the determination of the facts. The task also involves aligning with the logical structure of syllogism in law. |
| | Case Understanding | It is expected to provide reasonable and compliant answers to questions posed about the case-related descriptions in judicial documents, which is also a complex reasoning task. |
| | Legal Consultation | It covers a wide range of legal areas and aims to provide accurate, clear, and reliable answers to legal questions from different users. It therefore usually requires all of the aforementioned capabilities to provide professional and reliable analysis. |

Datasets

We have reorganized and constructed the evaluation datasets for the aforementioned tasks based on existing publicly available Chinese legal datasets. These datasets are collectively referred to as the Legal Evaluation Dataset (LED). The evaluation dataset for each foundational task is listed below; for more detailed information about the datasets, please see here.

| Level | Task | Main Dataset | Evaluation Dataset | Data Size | Category |
|---|---|---|---|---|---|
| BIR | Legal Article Recommendation | CAIL-2018 | legal_ar | 1,000 | Classification |
| | Element Recognition | CAIL-2019 | legal_er | 1,000 | Classification |
| | Named Entity Recognition | CAIL-2021 | legal_ner | 1,040 | Named Entity Recognition |
| | Judicial Summarization | CAIL-2020 | legal_js | 364 | Text Generation |
| | Case Recognition | CJRC | legal_cr | 2,000 | Classification |
| LFI | Controversial Focus Mining | LAIC-2021 | legal_cfm | 306 | Classification |
| | Similar Case Matching | CAIL-2019 | legal_scm | 260 | Classification |
| | Charge Prediction | Criminal-S | legal_cp | 827 | Classification |
| | Prison Term Prediction | MLMN | legal_ptp | 349 | Classification |
| | Civil Trial Prediction | MSJudge | legal_ctp | 800 | Classification |
| | Legal Question Answering | JEC-QA | legal_lqa | 855 | Classification |
| CLA | Judicial Reasoning Generation | AC-NLG | legal_jrg | 834 | Text Generation |
| | Case Understanding | CJRC | legal_cu | 1,054 | Text Generation |
| | Legal Consultation | CrimeKgAssitant | legal_lc | 916 | Text Generation |

Scoring Mechanism

⭐️ Scores for each task

$$
S_{\text{Task}} =
\begin{cases}
\mathrm{F1} \times 100, & \text{if Task} \in \text{Classification} \\
\frac{1}{3}\,(\mathrm{R1} + \mathrm{R2} + \mathrm{RL}) \times 100, & \text{if Task} \in \text{Text Generation} \\
\mathrm{Acc} \times 100, & \text{if Task} \in \text{NER}
\end{cases}
$$

Currently, our evaluation benchmark consists of three types of tasks: classification, text generation, and named entity recognition. For classification tasks, we use the F1 score. For text generation tasks, we use the average of the ROUGE-1, ROUGE-2, and ROUGE-L scores. For the legal Named Entity Recognition task, we use the extraction accuracy of legal entities as the score.
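For illustration, the rule above can be implemented roughly as follows. This is a minimal sketch: the choice of scikit-learn and rouge-score, the macro averaging for F1, and all helper names are our assumptions, not the repository's actual evaluation code.

```python
# Sketch of the per-task scoring rule above (illustrative, not LAiW's code).
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer

def classification_score(y_true, y_pred):
    # Macro-averaged F1 is one common choice; LAiW may average differently.
    return f1_score(y_true, y_pred, average="macro") * 100

def generation_score(references, predictions):
    # Average of ROUGE-1, ROUGE-2, and ROUGE-L F-measures over all samples.
    # Note: rouge-score tokenizes English by default; Chinese text would
    # need pre-tokenization (e.g., space-separated characters).
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
    per_sample = []
    for ref, pred in zip(references, predictions):
        scores = scorer.score(ref, pred)
        per_sample.append(sum(s.fmeasure for s in scores.values()) / 3)
    return sum(per_sample) / len(per_sample) * 100

def ner_score(gold_entities, extracted_entities):
    # Fraction of gold legal entities that were extracted exactly.
    hits = sum(g == e for g, e in zip(gold_entities, extracted_entities))
    return hits / len(gold_entities) * 100
```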

🌟 Scores for each LLM

For each LLM, we first calculate the average score of the tasks at each level as its legal capability score for that level. We then take the average of these three legal capability scores as the final evaluation score for the LLM. Model evaluation scores can be found here.
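In formula form (our notation, restating the description above):

$$
S_{\text{LLM}} = \frac{1}{3} \sum_{\ell \in \{\mathrm{BIR},\, \mathrm{LFI},\, \mathrm{CLA}\}} \frac{1}{|T_\ell|} \sum_{t \in T_\ell} S_t
$$

where $T_\ell$ is the set of foundational tasks at level $\ell$. Note that the three levels are weighted equally, regardless of how many tasks each contains.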

Run

We will continue to evaluate the performance of existing LLMs on these tasks according to the structure diagram of the 14 foundational tasks. For details, please refer to the leaderboard.

1. Preparation

git clone https://github.com/Dai-shen/LAiW.git --recursive
cd LAiW
pip install -r requirements.txt
cd src/financial-evaluation
pip install -e .[multilingual]

2. Output of LLM

We select the model and the legal tasks to be evaluated. Running the following command produces the model's outputs.

export CUDA_VISIBLE_DEVICES="1,2"
pretrained_model="/path/to/model"  # placeholder: Hugging Face model ID or local path
python eval.py \
    --model "hf-causal-experimental" \
    --model_args "use_accelerate=True,pretrained=$pretrained_model,tokenizer=$pretrained_model,use_fast=False,trust_remote_code=True" \
    --tasks "legal_ar,legal_er,legal_js" \
    --no_cache \
    --num_fewshot 0 \
    --write_out \
    --output_base_path ""

Parameter Description

  • model: Model interface type; the available options are listed in src/financial-evaluation/lm_eval/models/__init__.py
  • tasks: Predefined task names; you can define your own tasks in src/tasks/__init__.py and src/tasks/legal.py (see the sketch after this list)
  • pretrained_model: Path to the LLM (Hugging Face model ID or local model path)
  • output_base_path: Path where the model's outputs are saved
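Since src/financial-evaluation is a fork of EleutherAI's lm-evaluation-harness, a custom task would roughly follow that project's pre-0.4 Task interface. The sketch below is hypothetical: the class name, dataset path, and document fields are placeholders, and the actual conventions in src/tasks/legal.py may differ.

```python
# Hypothetical custom task, following the upstream lm-evaluation-harness
# (pre-0.4) Task interface. Class name, DATASET_PATH, and document fields
# are placeholders, not names from the LAiW repository.
from lm_eval.base import Task, rf
from lm_eval.metrics import mean

class LegalMyTask(Task):
    VERSION = 0
    DATASET_PATH = "path/to/dataset"  # placeholder: HF dataset name or local loader

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return False

    def has_test_docs(self):
        return True

    def test_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        # Prompt presented to the model (fields are placeholders).
        return f"案情描述：{doc['text']}\n答案："

    def doc_to_target(self, doc):
        return doc["answer"]

    def construct_requests(self, doc, ctx):
        # Greedy generation, stopping at a newline.
        return rf.greedy_until(ctx, ["\n"])

    def process_results(self, doc, results):
        # Exact-match accuracy as a stand-in metric.
        return {"acc": float(results[0].strip() == doc["answer"])}

    def aggregation(self):
        return {"acc": mean}

    def higher_is_better(self):
        return {"acc": True}
```

The new class would then be registered in src/tasks/__init__.py (the parameter notes above point there) so it can be selected via --tasks.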

Contributors

  • Sichuan University: Yongfu Dai, Duanyu Feng, Haochen Jia, Yifang Zhang and Hao Wang
  • Wuhan University: Qianqian Xie, Weiguang Han and Jimin Huang
  • Southwest Petroleum University: Wei Tian

Disclaimer

This project is provided for academic and educational purposes only. We do not take responsibility for any issues, risks, or adverse consequences that may arise from the use of this project.

Acknowledgements

This project is built upon the following open-source projects, and we are grateful to them:

Cite

If this project has been helpful to your research, please consider citing our work:

@article{dai2023laiw,
  title={LAiW: A Chinese legal large language models benchmark},
  author={Dai, Yongfu and Feng, Duanyu and Huang, Jimin and Jia, Haochen and Xie, Qianqian and Zhang, Yifang and Han, Weiguang and Tian, Wei and Wang, Hao},
  journal={arXiv preprint arXiv:2310.05620},
  year={2023}
}