
Code for the paper "A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction"


⚖️
A Comprehensive Evaluation of LLMs on
Legal Judgment Prediction

[📜 Paper][🐱 GitHub]

Repo for "A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction"
published at EMNLP Findings 2023

💡 Introduction

To comprehensively evaluate the legal capabilities of large language models, we propose baseline solutions and evaluate them on the task of legal judgment prediction.

Motivation
Existing benchmarks, e.g., lm_eval_harness, mainly adopt a perplexity-based approach that selects the most likely option as the prediction for classification tasks. However, LMs typically interact with humans through open-ended generation. It is therefore critical to directly evaluate the contents generated by greedy decoding or sampling.

Evaluation on LM Generated Contents
We propose an automatic evaluation pipeline to directly evaluate the generated contents for classification tasks.

  1. Prompt LMs with a task instruction to generate class labels. The generated contents may not strictly match the standard label names.
  2. A parser then maps the generated contents to labels based on text similarity scores.
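
The parsing step above can be sketched as follows. This is a minimal illustration, not the repo's parser: the similarity measure (character-level ratio from Python's difflib), the function name parse_label, and the English label names are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def parse_label(generated: str, label_names: list[str]) -> str:
    """Return the standard label name most similar to the generated text.

    Hypothetical helper: the similarity function here is an illustrative
    choice and may differ from the one used in the repo.
    """
    text = generated.strip().lower()

    def similarity(label: str) -> float:
        # Character-level similarity ratio in [0, 1].
        return SequenceMatcher(None, text, label.lower()).ratio()

    # Pick the label with the highest similarity to the generated text.
    return max(label_names, key=similarity)


# Illustrative label set; the generated text does not match any label exactly.
labels = ["theft", "fraud", "robbery"]
prediction = parse_label("crime of fraud", labels)
```

Because the parser always returns one of the standard labels, the output can be scored with ordinary classification metrics even when the model's wording varies.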

LM + Retrieval System

To assess how LMs perform with retrieved information in the legal domain, additional information, e.g., label candidates and similar cases as demonstrations, is included in the prompts. Combining these two kinds of additional information yields four sub-settings of prompts:

  • (free, zero shot): No additional information. Only task instruction.
  • (free, few shot): Task instruction + demonstrations
  • (multi, zero shot): Task instruction + label candidates (options)
  • (multi, few shot): Task instruction + label candidates + demonstrations
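
As a rough sketch of how the four sub-settings differ, the function below assembles a prompt from optional label candidates and optional demonstrations. The instruction text, the field names "Case" and "Charge", and the function name build_prompt are hypothetical; the actual prompt templates live in the repo.

```python
def build_prompt(fact, label_candidates=None, demonstrations=None):
    """Assemble a prompt for one of the four sub-settings.

    label_candidates: list of label names (multi) or None (free).
    demonstrations: list of (fact, label) pairs (few shot) or None (zero shot).
    All wording below is a placeholder, not the repo's template.
    """
    parts = ["Predict the charge for the following case."]
    if label_candidates:
        # multi setting: list the candidate labels as options.
        parts.append("Options: " + "; ".join(label_candidates))
    for demo_fact, demo_label in demonstrations or []:
        # few-shot setting: prepend similar cases with their labels.
        parts.append(f"Case: {demo_fact}\nCharge: {demo_label}")
    # The query case comes last, with the label left for the model to fill in.
    parts.append(f"Case: {fact}\nCharge:")
    return "\n\n".join(parts)


free_zero = build_prompt("The defendant took a phone from a shop.")
multi_few = build_prompt(
    "The defendant took a phone from a shop.",
    label_candidates=["theft", "fraud"],
    demonstrations=[("The defendant forged invoices.", "fraud")],
)
```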


🔥 Leaderboard

rank model score free-0shot free-1shot free-2shot free-3shot free-4shot multi-0shot multi-1shot multi-2shot multi-3shot multi-4shot
1 gpt4 63.05 50.52 62.72 67.54 68.61 71.02 62.31 70.42 71.81 73.24 74.00
2 chatgpt 58.13 43.14 58.42 61.86 64.40 66.16 60.67 63.51 66.85 69.59 66.62
3 chatglm_6b 47.74 41.89 50.30 47.76 48.59 48.67 53.74 49.26 47.56 47.61 45.32
4 bloomz_7b 44.14 46.90 53.28 51.06 50.90 49.26 50.68 29.25 27.92 25.27 23.37
5 vicuna_13b 39.83 25.50 48.85 47.64 49.49 39.82 44.70 41.73 41.48 35.03 21.61

Note:

  • Metric: Macro-F1
  • $score = (free\text{-}0shot + free\text{-}2shot + multi\text{-}0shot + multi\text{-}2shot)/4$
  • OpenAI model names: gpt-3.5-turbo-0301, gpt-4-0314
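
The score formula can be checked against a leaderboard row. The sketch below recomputes the chatgpt score from its four Macro-F1 values in the table; the function name overall_score is only illustrative.

```python
def overall_score(row: dict[str, float]) -> float:
    """Average the four Macro-F1 values used for the leaderboard score."""
    keys = ("free-0shot", "free-2shot", "multi-0shot", "multi-2shot")
    return sum(row[k] for k in keys) / len(keys)


# Values copied from the chatgpt row of the leaderboard.
chatgpt = {
    "free-0shot": 43.14,
    "free-2shot": 61.86,
    "multi-0shot": 60.67,
    "multi-2shot": 66.85,
}
print(f"{overall_score(chatgpt):.2f}")
```

Small differences in the last digit are possible for other rows, since the table values are already rounded to two decimals.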

🚀 Quick Start

⚙️ Install

git clone https://github.com/srhthu/LM-CompEval-Legal.git

# Enter the repo
cd LM-CompEval-Legal

pip install -r requirements.txt

bash download_data.sh
# Download evaluation dataset to data_hub/ljp
# Download model generated results to runs/paper_version

The data is available on Google Drive.

Evaluate Models

There are 10 sub_tasks in total: {free,multi}-{0..4}.

Evaluate a Hugging Face model on all sub_tasks:

CUDA_VISIBLE_DEVICES=0 python main.py \
--config ./config/default_hf.json \
--output_dir ./runs/test/<model_name> \
--model_type hf \
--model <path of model>

Evaluate an OpenAI model on all sub_tasks:

CUDA_VISIBLE_DEVICES=0 python main.py \
--config ./config/default_openai.json \
--output_dir ./runs/test/<model_name> \
--model_type openai \
--model <path of model>

To evaluate only a subset of the settings, add one more argument, e.g.,

--sub_tasks 'free-0shot,free-2shot,multi-0shot,multi-2shot'
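
For scripting runs, the ten sub-task names and the comma-separated value passed to --sub_tasks can be generated programmatically. This is a small sketch: the naming pattern follows the example argument above, and the helper sub_tasks_arg is hypothetical, not part of the repo.

```python
SETTINGS = ("free", "multi")
SHOTS = range(5)  # 0 to 4 demonstrations

# All ten names, e.g. "free-0shot", ..., "multi-4shot".
ALL_SUB_TASKS = [f"{setting}-{n}shot" for setting in SETTINGS for n in SHOTS]


def sub_tasks_arg(names):
    """Validate sub-task names and join them for the --sub_tasks argument."""
    unknown = [name for name in names if name not in ALL_SUB_TASKS]
    if unknown:
        raise ValueError(f"unknown sub_tasks: {unknown}")
    return ",".join(names)


# The four sub-tasks that enter the leaderboard score.
arg = sub_tasks_arg(["free-0shot", "free-2shot", "multi-0shot", "multi-2shot"])
```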

The Hugging Face paths of the models evaluated in the paper are

  • ChatGLM: THUDM/chatglm-6b
  • BLOOMZ: bigscience/bloomz-7b1-mt
  • Vicuna: lmsys/vicuna-13b-delta-v1.1

Features:

  • If the evaluation process is interrupted, just run it again with the same parameters. The process saves model outputs immediately and skips previously finished samples when resuming.
  • Samples that trigger a GPU out-of-memory error will be skipped. You can change the configurations and run the process again. (See suggested GPU configurations below)
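
The resume behaviour described above can be pictured with the following sketch. It assumes one JSON record per line with an "id" field, appended as soon as each sample finishes; the repo's actual output format and function names may differ.

```python
import json
from pathlib import Path


def load_finished_ids(output_path):
    """Collect ids of samples already written to the output file."""
    path = Path(output_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def run_with_resume(samples, generate, output_path):
    """Run generate() on each unfinished sample, saving results immediately."""
    finished = load_finished_ids(output_path)
    with open(output_path, "a", encoding="utf-8") as f:
        for sample in samples:
            if sample["id"] in finished:
                continue  # already done in a previous run
            record = {"id": sample["id"], "output": generate(sample)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # make the result durable before moving on
```

Appending and flushing after every sample means an interrupted run loses at most the sample that was in progress.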

Suggested GPU configurations

  • 7B model
    • 1 GPU with around 24G of memory (RTX 3090, A5000)
    • If total GPU memory >= 32G, e.g., 2*RTX3090 or 1*V100(32G), add the --speed argument for faster inference.
  • 13B model
    • 2 GPUs with >= 24G of memory (e.g., 2*V100)
    • If total GPU memory >= 64G, e.g., 3*RTX3090 or 2*V100, add the --speed argument for faster inference.

When the context is long, e.g., in the multi-4shot setting, one GPU with 24G of memory may be insufficient for a 7B model. You have to either increase the number of GPUs or decrease the demonstration length (default 500) by modifying the demo_max_len parameter in config/default_hf.json.
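
Rather than editing the default config in place, one option is to write a copy with a shorter demonstration length and pass it via --config. The sketch below assumes demo_max_len is a top-level key of the JSON file, which has not been verified against the repo; the helper name is hypothetical.

```python
import json


def write_config_with_demo_len(src, dst, demo_max_len):
    """Copy a JSON config, overriding the demonstration length.

    Assumption: demo_max_len is a top-level key in the config file.
    """
    with open(src, encoding="utf-8") as f:
        config = json.load(f)
    config["demo_max_len"] = demo_max_len
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, ensure_ascii=False)
    return config
```

For example, write_config_with_demo_len("config/default_hf.json", "config/short_demo_hf.json", 300) would produce a config with demonstrations capped at 300, where the destination file name is only an illustration.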

Create Result Table

After evaluating some models locally, the leaderboard can be generated in csv format:

python scripts/get_result_table.py \
--exp_dir runs/paper_version \
--metric f1  \
--save_path resources/paper_version_f1.csv

Citation

@misc{shui2023comprehensive,
      title={A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction}, 
      author={Ruihao Shui and Yixin Cao and Xiang Wang and Tat-Seng Chua},
      year={2023},
      eprint={2310.11761},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}