

CaseGNN & CaseGNN++

Code for CaseGNN (ECIR 2024 paper):

Title: CaseGNN: Graph Neural Networks for Legal Case Retrieval with Text-Attributed Graphs

Authors: Yanran Tang, Ruihong Qiu, Yilun Liu, Xue Li and Zi Huang

And CaseGNN++ (Extension of CaseGNN):

Title: CaseGNN++: Graph Contrastive Learning for Legal Case Retrieval with Graph Augmentation

Authors: Yanran Tang, Ruihong Qiu, Yilun Liu, Xue Li and Zi Huang


Requirements are listed in /requirements.txt.


Datasets can be downloaded from COLIEE2022 and COLIEE2023.

Specifically, the downloaded COLIEE2022 folders task1_train_files_2022 and task1_test_files_2022 should be put into /PromptCase/task1_train_2022/ and /PromptCase/task1_test_2022/ respectively.

The label files task1_train_labels_2022.json and task1_test_labels_2022.json should be put into the folder /label/.

COLIEE2023 folders should be set in a similar way.
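The placement described above can be checked before any preprocessing is run. The following is a minimal sketch and not part of the repository: the helper names `expected_inputs` and `missing_inputs` are hypothetical, and the 2023 paths assume that the 2022 naming pattern carries over unchanged.

```python
from pathlib import Path


def expected_inputs(year: int) -> list[str]:
    """Relative paths this README asks for, for one COLIEE year."""
    paths = []
    for split in ("train", "test"):
        # Raw case files, e.g. PromptCase/task1_train_2022/task1_train_files_2022
        paths.append(f"PromptCase/task1_{split}_{year}/task1_{split}_files_{year}")
        # Label files, e.g. label/task1_train_labels_2022.json
        paths.append(f"label/task1_{split}_labels_{year}.json")
    return paths


def missing_inputs(root: str, year: int) -> list[str]:
    """Return the expected paths that do not exist under `root`."""
    return [p for p in expected_inputs(year) if not (Path(root) / p).exists()]


if __name__ == "__main__":
    # Assumes the script is started from the project root.
    for year in (2022, 2023):
        for path in missing_inputs(".", year):
            print(f"missing: {path}")
```

Run from the project root, the script prints nothing when every expected folder and label file is in place.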

The final project structure is as follows:

$ ./CaseGNN/
│   └── data_load.py
├── Graph_generation
│   ├── graph
│   │   ├── graph_bin_2022
│   │   └── graph_bin_2023
│   └── TACG.py
├── Information_extraction  
│   ├── coliee2022_ie    
│   ├── coliee2023_ie
│   ├── lexnlp             
│   ├── stanford-openie
│   ├── create_structured_csv.py
│   ├── knowledge_graph.py
│   └── relation_extractor.py             
├── label 
│   ├── hard_neg_top50_train_2022.json
│   ├── hard_neg_top50_train_2023.json
│   ├── task1_test_labels_2022.json            
│   ├── task1_test_labels_2023.json 
│   ├── task1_train_labels_2022.json 
│   ├── task1_train_labels_2023.json 
│   ├── test_2022_candidate_with_yearfilter.json
│   └── test_2023_candidate_with_yearfilter.json     
├── PromptCase
│   ├── preprocessing
│   │   ├── openaiAPI.py
│   │   ├── process.py
│   │   └── reference.py
│   ├── promptcase_embedding
│   ├── PromptCase_embedding_generation.py
│   ├── task1_test_2022
│   │   └── task1_test_files_2022
│   ├── task1_test_2023
│   │   └── task1_test_files_2023
│   ├── task1_train_2022
│   │   └── task1_train_files_2022
│   └── task1_train_2023
│       └── task1_train_files_2023
├── CaseGNN2022_run.sh
├── CaseGNN2023_run.sh
├── CaseGNN++2022_run.sh
├── CaseGNN++2023_run.sh
├── LegalFeatureExtraction.sh
├── RelationExtraction.sh
├── PromptcaseEmbeddingGeneration.sh
├── TACG.sh
├── main.py
├── model.py
├── train.py
├── main_casegnn2plus.py
├── model_casegnn2plus.py
├── train_casegnn2plus.py
├── EUGATConv.py
├── torch_metrics.py
├── requirements.txt
└── README.md          

Data Preparation

1. Information Extraction

    1. Legal Feature Extraction
    • PromptCase preprocessing is used to extract the legal facts and issues from the cases.

    • Run . ./LegalFeatureExtraction.sh to generate files in the following three folders:

      • /PromptCase/task1_test_2022/processed/,
      • /PromptCase/task1_test_2022/processed_new/, which contains the legal issues of the cases,
      • /PromptCase/task1_test_2022/summary_test_2022_txt/, which contains the legal facts of the cases.
    • For COLIEE2023, follow the same process after changing --data 2022 to --data 2023 in LegalFeatureExtraction.sh.

    2. Relation Extraction
    • Run . ./RelationExtraction.sh.

    • The final relation triplets are in the folder /Information_extraction/coliee2022_ie/coliee2022train(or test)_sum(or fact)/result/.

    • For COLIEE2023, follow the same process after changing --data 2022 to --data 2023 in RelationExtraction.sh.

    • Relation extraction is based on knowledge_graph_from_unstructured_text and lexnlp.

  • Note: Legal feature extraction should be done first since the relation extraction is based on the extracted legal features.

  • The extracted information can also be downloaded here.
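The result folder pattern given in the Relation Extraction step, coliee2022train(or test)_sum(or fact), can be spelled out explicitly. The sketch below assumes the pattern expands to every combination of train/test with sum/fact, which is one reading of the notation; `triplet_result_dirs` is a hypothetical helper and not part of the repository.

```python
from itertools import product


def triplet_result_dirs(year: int) -> list[str]:
    """Expand 'train(or test)_sum(or fact)' into the four result folders."""
    base = f"Information_extraction/coliee{year}_ie"
    return [
        f"{base}/coliee{year}{split}_{part}/result"
        for split, part in product(("train", "test"), ("sum", "fact"))
    ]


for folder in triplet_result_dirs(2022):
    print(folder)
```

Under that assumption, each year yields four folders, starting with Information_extraction/coliee2022_ie/coliee2022train_sum/result for 2022.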

2. PromptCase Embedding Generation

  • PromptCase is used to generate the case embedding (the feature of the virtual global node).
    • Run . ./PromptcaseEmbeddingGeneration.sh.
    • The generated case embeddings and the corresponding index list of cases are saved in the folder /PromptCase/promptcase_embedding/.
    • For COLIEE2023, follow the same process after changing --data 2022 to --data 2023 in PromptcaseEmbeddingGeneration.sh.
  • The generated PromptCase embeddings can also be downloaded here.
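Several steps ask for --data 2022 to be changed to --data 2023 inside a shell script. As a rough sketch of that edit, the function below rewrites the argument in a script's text. It is not part of the repository: `switch_year` is a hypothetical helper, and the example command line is invented for illustration because the actual contents of the .sh files are not shown here.

```python
import re


def switch_year(script_text: str, old: int = 2022, new: int = 2023) -> str:
    """Replace every '--data <old>' argument with '--data <new>'."""
    # \b keeps longer numbers such as 20221 from being rewritten.
    pattern = re.compile(rf"(--data\s+){old}\b")
    return pattern.sub(rf"\g<1>{new}", script_text)


# Hypothetical script line; the real .sh files may look different.
example = "python some_script.py --data 2022"
print(switch_year(example))
```

Editing the scripts by hand works just as well; the sketch only makes the intended change explicit.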

3. TACG Construction

  • TACG construction utilises the results of Information Extraction and PromptCase Embedding Generation. Please ensure that the folders coliee2022_ie/coliee2022train(or test)_sum(or fact)/result/ and /PromptCase/promptcase_embedding/ have been generated or downloaded.

  • Run . ./TACG.sh

  • The TACG graphs are saved in folder /Graph_generation/graph/

  • For COLIEE2023, follow the same process after changing --data 2022 to --data 2023 in TACG.sh.
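The data preparation steps above depend on each other: relation extraction needs the extracted legal features, and TACG construction needs both the relation triplets and the PromptCase embeddings. The sketch below runs the four scripts in the order they are listed. It is an illustration only: `run_pipeline` is a hypothetical helper, and it invokes each script with `bash` instead of sourcing it with `. ./`, so it assumes the scripts do not rely on state in the calling shell (for example an activated environment).

```python
import subprocess

# Order taken from the Data Preparation steps in this README.
PIPELINE = [
    "LegalFeatureExtraction.sh",
    "RelationExtraction.sh",
    "PromptcaseEmbeddingGeneration.sh",
    "TACG.sh",
]


def run_pipeline(dry_run: bool = True) -> list[list[str]]:
    """Run each script in order and stop at the first failure."""
    commands = [["bash", script] for script in PIPELINE]
    for cmd in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With dry_run=True the function only prints the commands, which is a cheap way to confirm the order before starting the long-running steps.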

Model Training

1. CaseGNN Model Training

Run . ./CaseGNN2022_run.sh and . ./CaseGNN2023_run.sh for COLIEE2022 and COLIEE2023, respectively.

2. CaseGNN++ Model Training

Run . ./CaseGNN++2022_run.sh and . ./CaseGNN++2023_run.sh for COLIEE2022 and COLIEE2023, respectively.

Specifically, graph augmentation can be applied to:

  • Positive samples only (--pos_aug)
  • Random negative samples only (--ran_aug)
  • Both positive and random negative samples (--pos_aug --ran_aug)
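To illustrate how the two flags combine, here is a small argparse sketch. It assumes that --pos_aug and --ran_aug are boolean switches that default to off, which matches how they are written above but has not been checked against main_casegnn2plus.py; `build_parser` and `augmented_sample_types` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Augmentation flag sketch")
    # Assumption: both flags are boolean switches that default to off.
    parser.add_argument("--pos_aug", action="store_true",
                        help="augment positive samples")
    parser.add_argument("--ran_aug", action="store_true",
                        help="augment random negative samples")
    return parser


def augmented_sample_types(args: argparse.Namespace) -> list[str]:
    """List which sample types would be augmented for the given flags."""
    chosen = []
    if args.pos_aug:
        chosen.append("positive")
    if args.ran_aug:
        chosen.append("random negative")
    return chosen


args = build_parser().parse_args(["--pos_aug", "--ran_aug"])
print(augmented_sample_types(args))
```

Passing neither flag leaves the list empty, which in this sketch corresponds to training without augmentation.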


If you find this repo useful, please cite:

  author       = {Yanran Tang and
                  Ruihong Qiu and
                  Yilun Liu and
                  Xue Li and
                  Zi Huang},
  title        = {CaseGNN++: Graph Contrastive Learning for Legal Case Retrieval with Graph Augmentation},
  journal      = {CoRR},
  volume       = {abs/2405.11791},
  year         = {2024},

  author       = {Yanran Tang and
                  Ruihong Qiu and
                  Yilun Liu and
                  Xue Li and
                  Zi Huang},
  title        = {CaseGNN: Graph Neural Networks for Legal Case Retrieval with Text-Attributed Graphs},
  booktitle    = {ECIR},
  year         = {2024}

  author       = {Yanran Tang and
                  Ruihong Qiu and
                  Xue Li},
  title        = {Prompt-Based Effective Input Reformulation for Legal Case Retrieval},
  booktitle    = {ADC},
  year         = {2023}