Danish Legal Language Models
Available Language Models for Danish
This model is pre-trained on a combination of the Danish part of the MultiEURLEX (Chalkidis et al., 2021) dataset, comprising 65k EU laws, and two subsets (retsinformationdk, retspraksis) of the Danish Gigaword Corpus (Derczynski et al., 2021), comprising legal proceedings. It achieves the following results on the evaluation set.
| Model Name | Accuracy | Loss |
| --- | --- | --- |
| Maltehb/danish-bert-botxo | 22.3 | 7.038 |
| coastalcph/danish-legal-lm-base | 84.8 | 0.651 |
| coastalcph/danish-legal-bert-base | 80.1 | 0.878 |
| coastalcph/danish-legal-bert-base | 82.5 | 0.768 |
| coastalcph/danish-legal-xlm-base | 83.1 | 0.727 |
| Model Name | EURLEX Val. | EURLEX Test |
| --- | --- | --- |
| Maltehb/danish-bert-botxo | 73.7 / 42.8 | 67.6 / 38.2 |
| coastalcph/danish-legal-lm-base | 75.1 / 46.5 | 69.1 / 41.9 |
| coastalcph/danish-legal-bert-base | 75.0 / 50.4 | 68.9 / 44.3 |
| coastalcph/danish-legal-xlm-base | TBA | TBA |
| coastalcph/danish-legal-longformer-base | 75.7 / 52.9 | 69.6 / 47.0 |
| coastalcph/danish-legal-longformer-base + SD Penalty (Pezeshki et al., 2020) | 76.1 / 52.9 | 69.9 / 47.0 |
The two best models (coastalcph/danish-legal-longformer-base, coastalcph/danish-legal-longformer-base-sd) are available on the HuggingFace Hub, with instructions on how they can be used as a text classifier or feature extractor.
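As a rough illustration of the feature-extractor use case, the sketch below loads one of the Hub checkpoints with the `transformers` library and mean-pools the final hidden states into a single document vector. The helper names `mean_pool` and `embed` are hypothetical, and mean pooling is an assumption made here for illustration; the model cards on the Hub may recommend a different recipe.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the vectors of the tokens whose mask value is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed(text: str,
          model_name: str = "coastalcph/danish-legal-longformer-base") -> List[float]:
    """Return one vector for `text` (hypothetical helper; downloads the model)."""
    # Heavy imports stay local so mean_pool can be used without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden)

    return mean_pool(hidden[0].tolist(), inputs["attention_mask"][0].tolist())
```

Calling `embed("...")` with a Danish legal text would download the checkpoint on first use. For the text-classifier use case, `AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=...)` is the usual `transformers` entry point, with the label count depending on the task.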
export PYTHONPATH=.
python src/mod_teacher_model.py --teacher_model_path coastalcph/danish-legal-lm-base --student_model_path coastalcph/danish-legal-lm-base
Longformerize pre-trained RoBERTa LM
export PYTHONPATH=.
python src/longformerize_model.py --roberta_model_path coastalcph/danish-legal-lm-base --max_length 2048 --attention_window 128
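The internals of `src/longformerize_model.py` are not shown here, so the following is only a sketch of what a RoBERTa-to-Longformer conversion commonly involves, not a description of this script. It assumes the widely used recipe of tiling the learned position embeddings up to the new maximum length, and it compares the rough number of attention scores per head under full and windowed attention for the flag values above. All function names are hypothetical, and the sketch ignores the extra offset rows in RoBERTa's position table as well as any global-attention tokens.

```python
from typing import List


def extend_position_embeddings(pos_emb: List[List[float]],
                               new_len: int) -> List[List[float]]:
    """Tile a learned position-embedding table until it has new_len rows."""
    if not pos_emb:
        raise ValueError("pos_emb must contain at least one row")
    old_len = len(pos_emb)
    # Row i of the new table is a copy of row (i mod old_len) of the old one.
    return [list(pos_emb[i % old_len]) for i in range(new_len)]


def full_attention_pairs(seq_len: int) -> int:
    """Every token attends to every token."""
    return seq_len * seq_len


def local_attention_pairs(seq_len: int, window: int) -> int:
    """Every token attends to roughly `window` neighbours (edges ignored)."""
    return seq_len * min(window, seq_len)


if __name__ == "__main__":
    toy_table = [[0.1, 0.2], [0.3, 0.4]]  # 2 positions, 2 dimensions
    print(len(extend_position_embeddings(toy_table, 5)))
    # Ratio of full to windowed attention for --max_length 2048 --attention_window 128
    print(full_attention_pairs(2048) // local_attention_pairs(2048, 128))
```

Under these simplifications, windowed attention at length 2048 with a window of 128 computes about one sixteenth of the scores that full attention would, which is what makes the longer input length practical.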