BiLSTM-CRF and DistilRoBERTa Models for Legal Named Entity Recognition

Project Description

This repository contains the final project for the course 'Advanced Natural Language Processing' of the M.Sc. Cognitive Systems: Language, Learning and Reasoning at Universität Potsdam. The project addresses SemEval-2023 Task 6: LegalEval, subtask B: Legal Named Entity Recognition (L-NER). You can find the paper presenting this task here. Contributors to this repository are Guillem Gili i Bueno, Yi-Sheng Hsu and Delfina Jovanovich Trakál.

In this project, we propose two models for L-NER: a bidirectional long short-term memory network with a conditional random field layer (BiLSTM-CRF) and a pretrained DistilRoBERTa model.
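To illustrate what the CRF layer adds on top of per-token scores, here is a minimal pure-Python sketch of Viterbi decoding. It is not the repository's implementation: the function name `viterbi_decode`, the tag set, and all scores are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the tag sequence with the highest emission + transition score."""
    # Best score of any path ending in each tag at the first token.
    scores = {tag: emissions[0][tag] for tag in tags}
    backpointers: List[Dict[str, str]] = []

    for emission in emissions[1:]:
        new_scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for tag in tags:
            # Previous tag that maximises path score plus transition score.
            best_prev = max(tags, key=lambda prev: scores[prev] + transitions[prev][tag])
            new_scores[tag] = scores[best_prev] + transitions[best_prev][tag] + emission[tag]
            pointers[tag] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Walk the backpointers from the best final tag to recover the path.
    best_tag = max(tags, key=lambda tag: scores[tag])
    path = [best_tag]
    for pointers in reversed(backpointers):
        best_tag = pointers[best_tag]
        path.append(best_tag)
    return list(reversed(path))


# Hypothetical tag set and scores (assumptions, not taken from the project).
tags = ["O", "B-COURT", "I-COURT"]
transitions = {
    "O": {"O": 0.0, "B-COURT": 0.0, "I-COURT": -10.0},  # O -> I is invalid in BIO
    "B-COURT": {"O": 0.0, "B-COURT": 0.0, "I-COURT": 1.0},
    "I-COURT": {"O": 0.0, "B-COURT": 0.0, "I-COURT": 0.0},
}
emissions = [
    {"O": 2.0, "B-COURT": 0.0, "I-COURT": 0.0},
    {"O": 0.0, "B-COURT": 1.0, "I-COURT": 1.5},
    {"O": 0.2, "B-COURT": 0.0, "I-COURT": 1.0},
]
decoded = viterbi_decode(emissions, transitions, tags)
```

Picking the best tag per token independently would label the second token I-COURT directly after O. The transition scores steer the decoder towards a well-formed B-COURT, I-COURT span instead.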


The packages required to run this project can be found in requirements.txt.

$ pip install -r requirements.txt

Make sure your Python version is compatible with PyTorch.


The data was collected by the creators of SemEval-2023 Task 6. It is divided into two categories, judgement and preamble, which differ in their entity types and entity frequencies. The .json files can be found under src/data. More details on the data extraction and annotation processes can be found in the task paper linked above.


Training Prerequisites

First, split the data into training, validation, and test sets:

$ python src/ --split_datasets

The resulting files are also saved under src/data.
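As a rough illustration of this step, the sketch below shuffles a list of records and cuts it into three parts. The function name `split_dataset`, the 80/10/10 proportions, and the fixed seed are assumptions made for the example; the split actually used by the repository may differ.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T],
    val_fraction: float = 0.1,   # assumed proportion, not taken from the project
    test_fraction: float = 0.1,  # assumed proportion, not taken from the project
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the items reproducibly and return (train, validation, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)

    validation = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, validation, test


# Stand-in for records loaded from one of the .json files.
records = list(range(100))
train, validation, test = split_dataset(records)
```

Seeding the shuffle keeps the three sets identical across runs, which matters when models trained at different times are compared on the same test data.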


The BiLSTM-CRF model requires pretrained word embeddings. Download the pretrained GloVe embeddings:

$ python src/ --download_glove
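GloVe vectors are distributed as plain text, with one word per line followed by its space-separated vector components. The following sketch shows how such a file can be read into a dictionary. The function name `load_glove` and the three-dimensional sample vectors are illustrative assumptions, not the project's loader.

```python
import io
from typing import Dict, List, TextIO


def load_glove(handle: TextIO) -> Dict[str, List[float]]:
    """Parse GloVe text format: a word followed by its vector components."""
    embeddings: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# A tiny in-memory sample stands in for the downloaded embeddings file.
sample = io.StringIO("court 0.1 -0.2 0.3\njudge 0.4 0.5 -0.6\n")
vectors = load_glove(sample)
```

With a real file, the same function can be called on `open(path, encoding="utf-8")`.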


The RoBERTa model fine-tunes the pretrained distilroberta-base checkpoint on our data. The code in src/ downloads this checkpoint automatically from Hugging Face, so there is no need to run any explicit command. Note, however, that the download may take a while the first time the code is run.
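For orientation, this is roughly how a checkpoint is fetched with the Hugging Face transformers library. The helper name `load_token_classifier` and the `num_labels` argument are assumptions for this sketch; how the repository's own code loads the model is not shown here.

```python
MODEL_NAME = "distilroberta-base"


def load_token_classifier(num_labels: int):
    """Download (or reuse from cache) the tokenizer and a token classification model."""
    # Imported here so that defining the helper does not require transformers.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # num_labels sizes the classification head; it depends on the tag set used.
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_NAME, num_labels=num_labels
    )
    return tokenizer, model
```

The first call downloads the weights; later calls reuse the local cache, which is why only the first run is slow.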

It is also worth noting that the batch sizes for RoBERTa are hardcoded, since we had to work within the limits of our GPU (NVIDIA GeForce GTX 1650). The current batch sizes are 4 for training and 48 for validation. They are declared at the top of src/ as BATCH_SIZE_TRAIN_CONCURRENT and BATCH_SIZE_VALIDATE_CONCURRENT. Feel free to lower them if you run out of GPU memory, or raise them if you want training to run faster.
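To make the trade-off concrete, the sketch below shows how the batch size determines the number of steps per epoch. The two constants carry the values stated above; the helpers `iter_batches` and `steps_per_epoch` and the example count of 100 are hypothetical and only serve the illustration.

```python
import math
from typing import Iterator, List, Sequence

BATCH_SIZE_TRAIN_CONCURRENT = 4
BATCH_SIZE_VALIDATE_CONCURRENT = 48


def iter_batches(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def steps_per_epoch(n_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once."""
    return math.ceil(n_examples / batch_size)


# With 100 examples (an arbitrary number chosen for the example):
train_steps = steps_per_epoch(100, BATCH_SIZE_TRAIN_CONCURRENT)
validation_steps = steps_per_epoch(100, BATCH_SIZE_VALIDATE_CONCURRENT)
```

Smaller batches need less memory per step but more steps per epoch, which is the cost of fitting training onto a small GPU.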


Trained models are saved in the folder src/generated_models and plots in the folder src/plots. For RoBERTa, the --round parameter selects where the plots are saved: the values 1, 2, or any other value place them in the src/plots subfolders round1_roberta, round2_roberta, or other, respectively.

Select the model to train with either the --bilstm_crf or the --roberta argument. For training, specify the number of epochs, the batch size (BiLSTM-CRF only), and the learning rate with the parameters --epochs, --batch_size, and --lr. Choose the judgement or the preamble dataset with the argument --dataset. Here are the base examples:

$ python3 src/ --bilstm_crf --epochs 25 --batch_size 256 --lr 0.05 --dataset judgement
$ python3 src/ --roberta  --epochs 10 --lr 0.00005 --dataset preamble --round 1
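The flags used in these commands could be declared with argparse along the following lines. This is a sketch only: the argument types, the mutually exclusive model group, and the restricted dataset choices are assumptions, and the repository's actual parser may be set up differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the training flags described above (types are assumed)."""
    parser = argparse.ArgumentParser(description="Train an L-NER model")

    # Assumption: only one model type can be selected per run.
    model_group = parser.add_mutually_exclusive_group()
    model_group.add_argument("--bilstm_crf", action="store_true")
    model_group.add_argument("--roberta", action="store_true")

    parser.add_argument("--epochs", type=int)
    parser.add_argument("--batch_size", type=int)  # BiLSTM-CRF only
    parser.add_argument("--lr", type=float)
    parser.add_argument("--dataset", choices=["judgement", "preamble"])
    parser.add_argument("--round")  # RoBERTa only: selects the plot subfolder
    return parser


# Parse the first base example without touching sys.argv.
command = "--bilstm_crf --epochs 25 --batch_size 256 --lr 0.05 --dataset judgement"
args = build_parser().parse_args(command.split())
```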

Testing and Evaluation

Run $ python src/ --evaluate_model to test and evaluate either model on the judgement or preamble dev data. Specify which model to evaluate after the argument --model, and on which dataset to test (judgement or preamble). Performance is reported as F1 score. Here is an example:

$ python src/ --evaluate_model bilstm_crf.judgement.e25.bs256.lr0.05 --model judgement
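As a reminder of what the metric measures, F1 is the harmonic mean of precision and recall. The sketch below computes it over sets of entity spans. Whether the evaluation code scores whole entities or individual tokens is not stated here, so the span-level view, the function name `f1_score`, and the example spans are assumptions.

```python
from typing import Set, Tuple

# (start token, end token, label): a hypothetical span representation.
Span = Tuple[int, int, str]


def f1_score(gold: Set[Span], predicted: Set[Span]) -> float:
    """Harmonic mean of precision and recall over exactly matching spans."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0  # also covers empty gold or predicted sets
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative spans only; the labels are not taken from the dataset files.
gold = {(0, 2, "COURT"), (5, 6, "JUDGE"), (9, 10, "DATE")}
predicted = {(0, 2, "COURT"), (5, 6, "JUDGE")}
score = f1_score(gold, predicted)
```

Here both predictions are correct (precision 1.0) but one gold entity is missed (recall 2/3), giving an F1 of about 0.8.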

Reproducing our results


You may need to give permissions to your filesystem to run the scripts:

$ chmod 755
$ chmod 755

To replicate the models and plots from the first round of experiments, where we test different learning rates for 10 epochs each, run the following script. Note that this takes a long time:

$ ./

This took 2 hours for the preamble models and 8 hours for the judgement models on an NVIDIA GeForce GTX 1650, so you may want to edit the script to train for just 5 epochs.

To replicate the models and plots from the second round of experiments (where we only train the 2 best models, one for preamble and one for judgement):

$ ./

This took an hour on the NVIDIA GeForce GTX 1650.


You may need to give permissions to your filesystem to run the scripts:

$ chmod 755

Then simply run the following script:

$ ./

The script splits the data, downloads the GloVe embeddings, and trains the models with the hyperparameters we mainly used: epochs=25, batch_size=256, lr=0.01 (preamble) or 0.05 (judgement).

It takes around 7 hours to train the model on both datasets on a MacBook Pro (M1, 2021). You can also edit the script to make adjustments such as reducing the number of epochs. Preliminary results for this model can usually be observed after around 15 epochs.


This part tests a model and writes the resulting CSV files to src/evaluation_logs/. Once again, you may need to give permissions to your filesystem to run the scripts:

$ chmod 755

Then run it. It generates the CSV files and prints the results to the terminal:

$ ./


  1. Advanced: Making Dynamic Decisions and the Bi-LSTM CRF | PyTorch Tutorials
  2. F1-Score | Hugging Face evaluation Library
  3. Transformer Token Classification | Hugging Face Transformer Token Classification
  4. pytorch-RoBERTa-named-entity-recognition | Kaggle RoBERTa Model