Ashish Chouhan and Michael Gertz
Heidelberg University
Contact us at: {chouhan, gertz}@informatik.uni-heidelberg.de
A pre-print of our work is available; it has also been accepted for the main conference of LREC-COLING 2024. The conference proceedings will be available in May 2024.
With the increase in legislative documents at the EU, the number of new terms and their definitions is increasing as well. As per the Joint Practical Guide, terms used in legal documents shall be consistent, and identical concepts shall be expressed without departing from their meaning in ordinary, legal, or technical language. Thus, while drafting a new legislative document, having a framework that provides insights about existing definitions and helps define new terms based on a document’s context will support such harmonized legal definitions across different regulations and thus avoid ambiguities. In this paper, we present LexDrafter, a framework that assists in drafting Definitions articles for legislative documents using retrieval augmented generation (RAG) and existing term definitions present in different legislative documents. For this, definition elements are built by extracting definitions from existing documents. Using definition elements and RAG, a Definitions article can be suggested on demand for a legislative document that is being drafted. We demonstrate and evaluate the functionality of LexDrafter using a collection of EU documents from the energy domain. The code for the LexDrafter framework is available at https://github.com/achouhan93/LexDrafter.
LexDrafter helps users who are drafting a legal act, in particular when the other sections of the document have been drafted but the section with terminology definitions (the Definitions article) is missing. The fragments of the drafted sections that include a given term are the key input required by LexDrafter, as such fragments provide contextual information for defining the term using our RAG approach. In our work, the terms for which definitions need to be determined are selected by the user. The LexDrafter framework is realized with two workflows. The first workflow (see Figure 1) takes EUR-Lex legal acts (here for the Energy domain) as input, preprocesses them, and stores them in an IR system. The data acquisition process in this work is similar to the one employed by Aumiller et al. (2022), where a particular legal act web page is crawled to store the text and metadata in an OpenSearch instance. The stored text is further processed by the DocStruct component, which extracts components from the retrieved legal acts, preprocesses them, and stores them in the IR system in order to build the Document Corpus. The Definition Corpus, on the other hand, is built by first identifying the Definitions article in the legal act and then extracting the definitions from it, similar to the approach proposed by Damaratskaya (2023).
Figure 1: LexDrafter Workflow 1
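The definition extraction step of the first workflow can be illustrated with a minimal sketch. It assumes that entries in a Definitions article follow the common pattern of a number in parentheses, a term in single curly quotes, and the word "means"; the function name extract_definitions, the regular expression, and the sample text are illustrative assumptions and are not taken from the LexDrafter code base.

```python
import re
from typing import Dict

# Assumed entry layout: "(1) ‘term’ means explanation;" with the last entry
# ending in a full stop. Real Definitions articles can deviate from this.
DEFINITION_PATTERN = re.compile(
    r"\(\d+\)\s*‘(?P<term>[^’]+)’\s+means\s+"
    r"(?P<definition>.+?)(?=;\s*\(\d+\)|[;.]\s*$)",
    re.DOTALL,
)

# Made-up sample text for illustration only, not quoted from a legal act.
SAMPLE_ARTICLE = (
    "For the purposes of this Regulation, the following definitions apply: "
    "(1) ‘smart meter’ means an electronic device that records energy "
    "consumption; (2) ‘final customer’ means a customer who purchases "
    "energy for own use."
)


def extract_definitions(article_text: str) -> Dict[str, str]:
    """Map each lower-cased term to its definition text."""
    definitions = {}
    for match in DEFINITION_PATTERN.finditer(article_text):
        term = match.group("term").strip().lower()
        # Collapse line breaks and repeated spaces inside the definition.
        definitions[term] = " ".join(match.group("definition").split())
    return definitions


if __name__ == "__main__":
    for term, definition in extract_definitions(SAMPLE_ARTICLE).items():
        print(f"{term}: {definition}")
```

A pattern-based extractor like this only covers the simplest entries; definitions that refer to other legal acts would need the additional citation handling described later in this README.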
The second workflow (see Figure 2) takes a term selected by the user and either determines existing definitions or generates a definition for that term. Existing definitions can easily be identified and retrieved from the Definition Corpus, and new definitions are generated using a retrieval augmented generation (RAG) approach.
Figure 2: LexDrafter Workflow 2
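The choice made in the second workflow, reuse an existing definition or prepare a generation request, can be sketched as follows. This is a simplified illustration under stated assumptions: the definition corpus is modelled as a plain dictionary, fragments are found by substring matching, and the names resolve_term and find_fragments as well as the prompt wording are hypothetical. No language model is called here.

```python
from typing import Dict, List


def find_fragments(term: str, draft_sentences: List[str]) -> List[str]:
    """Return the draft sentences that mention the term (case-insensitive)."""
    needle = term.lower()
    return [s for s in draft_sentences if needle in s.lower()]


def resolve_term(
    term: str,
    definition_corpus: Dict[str, str],
    draft_sentences: List[str],
) -> Dict[str, str]:
    """Reuse an existing definition, or build a prompt for generation."""
    key = term.lower()
    if key in definition_corpus:
        # The term is already defined: reuse the stored definition.
        return {"source": "existing", "text": definition_corpus[key]}

    # Otherwise collect the fragments that give context for the term and
    # assemble a prompt that a generator model could receive (assumed wording).
    context = "\n".join(f"- {f}" for f in find_fragments(term, draft_sentences))
    prompt = (
        f"Define the term '{term}' for a legislative document, "
        f"using only the context below.\n{context}"
    )
    return {"source": "generate", "text": prompt}


if __name__ == "__main__":
    corpus = {"final customer": "a customer who purchases energy for own use"}
    draft = [
        "Each smart meter shall be certified.",
        "Member States shall report annually.",
    ]
    print(resolve_term("final customer", corpus, draft)["source"])
    print(resolve_term("smart meter", corpus, draft)["text"])
```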
Clone the repository by executing the below command
git clone https://github.com/achouhan93/LexDrafter.git
Navigate to the cloned repository folder
cd LexDrafter
Once the repository is cloned and you have navigated to the folder, execute the steps below to set up the Python environment (tested with Python 3.9.0):
- Set up a venv with python (or conda)
python -m venv .venv
- Activate venv
source .venv/bin/activate
- Install all necessary dependencies by running
pip install -r requirements.txt
- Rename .env-example to .env and populate the file with the required credentials
LOG_EXE_PATH="logs/execution.log"
LOG_PATH="logs/"
# Required for Dataset Collection, Schema Creation, and Definition Generation
# Opensearch Connection Details
DB_USERNAME = "your_opensearch_username"
DB_PASSWORD = "your_opensearch_password"
DB_HOSTNAME = "your_opensearch_hostname"
DB_PORT = "your_opensearch_port"
DB_LEXDRAFTER_INDEX = "your_opensearch_index_name"
# Required for Schema Creation, Ground Truth Definition Corpus Creation, and Definition Generation
# Postgresql Connection Details
PG_USER = "your_postgres_username"
PG_PWD = "your_postgres_password"
PG_DATABASE = "your_postgres_database"
PG_SERVER = "your_postgres_hostname"
PG_HOST = "your_postgres_port"
# Required for Definition Generation
# HuggingFace Key
HUGGINGFACE_AUTH_KEY = "huggingface_auth_key"
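Before running the scripts, it can help to check that the .env file is complete. The following is a minimal sketch using only the standard library; it assumes the file keeps the KEY = "value" layout shown above. The helper name parse_env and the example values are hypothetical, and the repository may load its configuration differently.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY = "value" lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate optional spaces around "=" and surrounding double quotes.
        settings[key.strip()] = value.strip().strip('"')
    return settings


if __name__ == "__main__":
    # Placeholder values for illustration; replace with real credentials.
    example = '''
    # Opensearch Connection Details
    DB_HOSTNAME = "localhost"
    DB_PORT = "9200"
    DB_LEXDRAFTER_INDEX ="lexdrafter"
    '''
    settings = parse_env(example)
    required = ["DB_USERNAME", "DB_PASSWORD", "DB_HOSTNAME", "DB_PORT"]
    missing = [name for name in required if name not in settings]
    print("Missing settings:", missing)
```

To check the real file, the same function can be applied to the contents of .env, for example via open(".env").read().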
This code base provides the scripts for the dataset collection process (code/1. dataset_collection), followed by building the document corpus, i.e., storing the document content in a schema (code/2. docStruct_component). Once the document corpus is built, the next step is to build a definition corpus using an approach similar to the one proposed by Damaratskaya (2023). Executing the scripts (code/3. defExtract_component) builds the definition corpus, which comprises the existing definitions present in energy domain documents on the EUR-Lex platform. As the definition corpus comprises static and dynamic definitions, the script (code/4. citeResolver_component) is executed to extract the citation information for the dynamic definition fragments. After the document corpus and the definition corpus are built, the next step in the framework is to check whether the selected term has one or more existing static definitions or whether a new definition needs to be generated based on the contextual information present in the legislative document. This decision is made by the TermRetriever component, whose scripts are located at (code/5. termretriever_component). If TermRetriever decides that a new definition is needed for a term selected by the user, the definition is generated using a retrieval augmented generation (RAG) pipeline. The scripts (code/6. ragenerator_component) generate definitions using LLAMA-2 and Vicuna. Finally, the generated definitions are compared with the ground-truth definitions present in the definition corpus and evaluated with the BLEU, BERTScore, and BLEURT metrics using the scripts (code/7. definition_evaluation).
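To give an intuition for the overlap-based part of the evaluation, the sketch below computes clipped n-gram precision, the building block that BLEU combines over several n-gram orders. It is a simplified stand-in and not full BLEU: there is no brevity penalty and no averaging over n-gram orders. The function names and whitespace tokenization are assumptions; the scripts in the repository may rely on established metric implementations instead.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Share of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as often as it appears in the
    reference (clipping), so repeating a matching word does not inflate the
    score.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


if __name__ == "__main__":
    generated = "a customer who buys energy for own use"
    ground_truth = "a customer who purchases energy for own use"
    print(clipped_precision(generated, ground_truth, n=1))
    print(clipped_precision(generated, ground_truth, n=2))
```

Purely lexical scores such as this one penalize valid paraphrases, which is one reason to report embedding-based metrics such as BERTScore and BLEURT alongside BLEU.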
If you use the dataset or other parts of this code base, please use the following citation for attribution:
@misc{chouhan2024lexdrafter,
title={LexDrafter: Terminology Drafting for Legislative Documents using Retrieval Augmented Generation},
author={Ashish Chouhan and Michael Gertz},
year={2024},
eprint={2403.16295},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
The editorial content of the EUR-Lex website and the legislative document content owned by the EU are licensed under the Creative Commons Attribution 4.0 International licence (CC BY 4.0), as mentioned on the official EUR-Lex website. Any data artifacts remain licensed under the CC BY 4.0 license.
Per the recommendation of Creative Commons, we apply a separate license to the software component of this repository. We use the standard MIT license for code artifacts.
See license/LICENSE.txt for more information.