python
repository for summarizing greek legal judgements
This repository holds the code used for the purposes of under review publication on Automated Summarization of Legal Judgements (el: Αυτόματη Περίληψη Δικαστικών Αποφάσεων).
We experiment with various unsupervised models that are used in the literature. Each model can be found in the models folder. We mainly utilized open-source python
code & repositories, while making the appropriate changes to extend/correct the algorithms and to support the greek language.
Currently we have implemented the following algorithms:
- LexRank (paper): by adapting crabcamp's implementention.
- Biased LexRank(paper)
- Random Sentece: a random sentence extraction baseline
To summarize a dataset of legal judgements make sure the file used as input is either:
- a directory with 2 folders; one named
caselaw
containing the full text of the judgement, and one namedsummary
containg the summaries. Every file must by named using the template{index}.[json|csv]
such that caselaws and summaries that share the shame index belong to the same judgement. - a
.csv
file containing all the data.
In both cases, the columns containing the judgement's full-text and the summary must be named text
and summary
respectively.
💻 Finally, to use the summarization script run the following:
python3 summarize.py [-h] [-m [{lexrank,random_sentence, biased_lexrank}]] [-o [OUTPUT]] [--output_filetype {json,txt}] [-l LIMIT] file
📚 To get more information about summarize.py arguments run
python3 summarize.py --help
We utilize the Encoder-Decoder approach of Liu and Lapata, 2019, where both Encoder and Decoder are initialized with the Greek BERT weights.
To train/evaluate the model run:
bertLegalSum.py [-h] [--model_save_path MODEL_SAVE_PATH] [--model_load_path MODEL_LOAD_PATH] [--eval_output_file_path EVAL_OUTPUT_FILE_PATH] [--reorder] [--remove_info] [--put_categs_at_start] file_path {train,eval,both}
📚 To get more information about bert_legal_sum.py arguments run
python3 bertLegalSum.py --help
We evaluate our models using the ROUGE automated metrics used in the automated summarization literature.
- ROUGE (paper): by adapting Diego999's ROUGE script which reimplements the original ROUGE script (written in Pearl) in python. We make several extensions, including the ability to stem greek words and disable the deletion of greek characters.
Before running the evaluation script, make sure the summaries (both reference and generated) lie in seperate folders and that each summary is named using the template {index}.[.json|csv]
such tat the reference summaries and generated summaries that share the shame index summarize the same judgement. Also make sure, each summary file must have the summary under a column named summary
. In case both generated summary and reference summary lie in same folders, then the --summaries_together flag can be used.
📏 To use the evaluation script run the following:
python3 evaluate.py [-h] [-o [OUTPUT]] [-m [METRIC]] [-l LIMIT] [--quiet QUIET] reference_sum generated_sum
📚 To get more information about evaluate.py arguments run
python3 evaluate.py --help
📚 For more information on the data crawling, read the documentation in the dataset folder.
🔧 For more information on the crawler's settings and options, read the documentation in the dataset_crawler folder.
Data analysis (token-level statistics & Grusky's et al. 2018 Coverage and Density) can be obtained by the analysis.py file in the [dataset folder] the dataset folder.
In order to install most of the required packages run:
pip3 install -r requirements.txt
To install the rest of the required packages, namely the LexRank library and the ROUGE evaluation library, run:
pip3 install -e models/LexRank/lexrank_master
pip3 install -e evaluators/ROUGE/py-rouge
Note: This project was developed using Python 3.8.12