Script and results on eurlex
Glaciohound opened this issue · 3 comments
Hello! Thanks for this great repository. I have tried experiments on many of its subtasks and it works beautifully.
The problem is that when I try to reproduce the results on EUR-LEX using run_eurlex.sh, it fails to give results similar to (or anywhere near) the ones in the paper:
VALIDATION | TEST
bert-base-uncased: MICRO-F1: 69.7 ± 0.1 MACRO-F1: 32.8 ± 0.4 | MICRO-F1: 63.1 MACRO-F1: 30.8
(I also tried changing the model to legal-base-uncased and changing the number of epochs from 2 to 20, but these attempts failed as well.)
Could you take a look into this and give some suggestions?
A more detailed log for one of the 5 seeds is as follows:
...
[INFO|trainer.py:1419] 2022-06-27 05:09:06,003 >> ***** Running training *****
[INFO|trainer.py:1420] 2022-06-27 05:09:06,003 >> Num examples = 55000
[INFO|trainer.py:1421] 2022-06-27 05:09:06,003 >> Num Epochs = 2
[INFO|trainer.py:1422] 2022-06-27 05:09:06,003 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1423] 2022-06-27 05:09:06,003 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1424] 2022-06-27 05:09:06,003 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1425] 2022-06-27 05:09:06,003 >> Total optimization steps = 13750
{'loss': 0.1809, 'learning_rate': 2.890909090909091e-05, 'epoch': 0.07}
{'loss': 0.1112, 'learning_rate': 2.7818181818181818e-05, 'epoch': 0.15}
{'loss': 0.0966, 'learning_rate': 2.6727272727272728e-05, 'epoch': 0.22}
{'loss': 0.0857, 'learning_rate': 2.5636363636363635e-05, 'epoch': 0.29}
{'loss': 0.0784, 'learning_rate': 2.454545454545455e-05, 'epoch': 0.36}
{'loss': 0.072, 'learning_rate': 2.3454545454545456e-05, 'epoch': 0.44}
{'loss': 0.0676, 'learning_rate': 2.2363636363636366e-05, 'epoch': 0.51}
{'loss': 0.0663, 'learning_rate': 2.1272727272727273e-05, 'epoch': 0.58}
{'loss': 0.0632, 'learning_rate': 2.0181818181818183e-05, 'epoch': 0.65}
{'loss': 0.0603, 'learning_rate': 1.909090909090909e-05, 'epoch': 0.73}
{'loss': 0.0593, 'learning_rate': 1.8e-05, 'epoch': 0.8}
{'loss': 0.0571, 'learning_rate': 1.6909090909090907e-05, 'epoch': 0.87}
{'loss': 0.0551, 'learning_rate': 1.5818181818181818e-05, 'epoch': 0.95}
50%|███████████████████████████████████████████████████████████████████ | 6875/13750 [14:19<14:12, 8.07it/s]
[INFO|trainer.py:622] 2022-06-27 05:23:25,910 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2590] 2022-06-27 05:23:25,913 >> ***** Running Evaluation *****
[INFO|trainer.py:2592] 2022-06-27 05:23:25,914 >> Num examples = 5000
[INFO|trainer.py:2595] 2022-06-27 05:23:25,914 >> Batch size = 8
{'eval_loss': 0.06690910458564758, 'eval_macro-f1': 0.26581931249101237, 'eval_micro-f1': 0.6573569918647109, 'eval_runtime': 25.2148, 'eval_samples_per_second': 198.296, 'eval_steps_per_second': 24.787, 'epoch': 1.0}
[INFO|trainer.py:2340] 2022-06-27 05:23:51,131 >> Saving model checkpoint to logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-6875
[INFO|configuration_utils.py:446] 2022-06-27 05:23:51,134 >> Configuration saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-6875/config.json
[INFO|modeling_utils.py:1542] 2022-06-27 05:23:52,343 >> Model weights saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-6875/pytorch_model.bin
[INFO|tokenization_utils_base.py:2108] 2022-06-27 05:23:52,345 >> tokenizer config file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-6875/tokenizer_config.json
[INFO|tokenization_utils_base.py:2114] 2022-06-27 05:23:52,346 >> Special tokens file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-6875/special_tokens_map.json
{'loss': 0.0546, 'learning_rate': 1.4727272727272728e-05, 'epoch': 1.02}
{'loss': 0.0531, 'learning_rate': 1.3636363636363637e-05, 'epoch': 1.09}
{'loss': 0.0518, 'learning_rate': 1.2545454545454545e-05, 'epoch': 1.16}
{'loss': 0.0521, 'learning_rate': 1.1454545454545455e-05, 'epoch': 1.24}
{'loss': 0.0497, 'learning_rate': 1.0363636363636364e-05, 'epoch': 1.31}
{'loss': 0.0481, 'learning_rate': 9.272727272727273e-06, 'epoch': 1.38}
{'loss': 0.0487, 'learning_rate': 8.181818181818181e-06, 'epoch': 1.45}
{'loss': 0.0488, 'learning_rate': 7.090909090909091e-06, 'epoch': 1.53}
{'loss': 0.0477, 'learning_rate': 6e-06, 'epoch': 1.6}
{'loss': 0.0476, 'learning_rate': 4.90909090909091e-06, 'epoch': 1.67}
{'loss': 0.047, 'learning_rate': 3.818181818181818e-06, 'epoch': 1.75}
{'loss': 0.0471, 'learning_rate': 2.7294545454545455e-06, 'epoch': 1.82}
{'loss': 0.0462, 'learning_rate': 1.6385454545454545e-06, 'epoch': 1.89}
{'loss': 0.0466, 'learning_rate': 5.476363636363636e-07, 'epoch': 1.96}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13750/13750 [29:11<00:00, 7.96it/s]
[INFO|trainer.py:622] 2022-06-27 05:38:17,987 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2590] 2022-06-27 05:38:17,989 >> ***** Running Evaluation *****
[INFO|trainer.py:2592] 2022-06-27 05:38:17,989 >> Num examples = 5000
[INFO|trainer.py:2595] 2022-06-27 05:38:17,989 >> Batch size = 8
{'eval_loss': 0.06163998320698738, 'eval_macro-f1': 0.3223906812379972, 'eval_micro-f1': 0.6903704623792815, 'eval_runtime': 24.2671, 'eval_samples_per_second': 206.041, 'eval_steps_per_second': 25.755, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13750/13750 [29:36<00:00, 7.96it/s]
[INFO|trainer.py:2340] 2022-06-27 05:38:42,258 >> Saving model checkpoint to logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750
[INFO|configuration_utils.py:446] 2022-06-27 05:38:42,261 >> Configuration saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750/config.json
[INFO|modeling_utils.py:1542] 2022-06-27 05:38:43,511 >> Model weights saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750/pytorch_model.bin
[INFO|tokenization_utils_base.py:2108] 2022-06-27 05:38:43,513 >> tokenizer config file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750/tokenizer_config.json
[INFO|tokenization_utils_base.py:2114] 2022-06-27 05:38:43,513 >> Special tokens file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750/special_tokens_map.json
[INFO|trainer.py:1662] 2022-06-27 05:38:46,057 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:1727] 2022-06-27 05:38:46,057 >> Loading best model from logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750 (score: 0.6903704623792815).
{'train_runtime': 1781.228, 'train_samples_per_second': 61.755, 'train_steps_per_second': 7.719, 'train_loss': 0.06421310944990678, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13750/13750 [29:41<00:00, 7.72it/s]
[INFO|trainer.py:2340] 2022-06-27 05:38:47,236 >> Saving model checkpoint to logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5
[INFO|configuration_utils.py:446] 2022-06-27 05:38:47,261 >> Configuration saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/config.json
[INFO|modeling_utils.py:1542] 2022-06-27 05:38:48,560 >> Model weights saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/pytorch_model.bin
[INFO|tokenization_utils_base.py:2108] 2022-06-27 05:38:48,562 >> tokenizer config file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/tokenizer_config.json
[INFO|tokenization_utils_base.py:2114] 2022-06-27 05:38:48,563 >> Special tokens file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/special_tokens_map.json
***** train metrics *****
epoch = 2.0
train_loss = 0.0642
train_runtime = 0:29:41.22
train_samples = 55000
train_samples_per_second = 61.755
train_steps_per_second = 7.719
06/27/2022 05:38:48 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:622] 2022-06-27 05:38:48,611 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2590] 2022-06-27 05:38:48,620 >> ***** Running Evaluation *****
[INFO|trainer.py:2592] 2022-06-27 05:38:48,620 >> Num examples = 5000
[INFO|trainer.py:2595] 2022-06-27 05:38:48,620 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 625/625 [00:22<00:00, 27.85it/s]
***** eval metrics *****
epoch = 2.0
eval_loss = 0.0616
eval_macro-f1 = 0.3224
eval_micro-f1 = 0.6904
eval_runtime = 0:00:22.48
eval_samples = 5000
eval_samples_per_second = 222.372
eval_steps_per_second = 27.796
06/27/2022 05:39:11 - INFO - __main__ - *** Predict ***
[INFO|trainer.py:622] 2022-06-27 05:39:11,101 >> The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2590] 2022-06-27 05:39:11,106 >> ***** Running Prediction *****
[INFO|trainer.py:2592] 2022-06-27 05:39:11,106 >> Num examples = 5000
[INFO|trainer.py:2595] 2022-06-27 05:39:11,106 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 625/625 [00:22<00:00, 27.64it/s]
***** predict metrics *****
predict_loss = 0.0712
predict_macro-f1 = 0.2969
predict_micro-f1 = 0.6196
predict_runtime = 0:00:22.44
predict_samples = 5000
predict_samples_per_second = 222.741
predict_steps_per_second = 27.843
...
Hi @Glaciohound,
I'm trying to figure out what could go wrong. Here are the results for two of the runs (seeds 1 and 2) with nlpaueb/legal-bert-base-uncased:
{
"epoch": 10.0,
"eval_loss": 0.06902255862951279,
"eval_macro-f1": 0.6179554031391897,
"eval_micro-f1": 0.777115002244885,
"eval_r-precision": 0.7927199422799421,
"predict_loss": 0.08612240850925446,
"predict_macro-f1": 0.5623417257812284,
"predict_micro-f1": 0.721954706983199,
"predict_r-precision": 0.7462810028860029,
"train_loss": 0.03723958882418546,
"train_runtime": 11224.1911,
}
{
"epoch": 10.0,
"eval_loss": 0.07012862712144852,
"eval_macro-f1": 0.6059832855751778,
"eval_micro-f1": 0.7745175560447076,
"eval_r-precision": 0.7841969047619048,
"predict_loss": 0.08702316880226135,
"predict_macro-f1": 0.5589387715780068,
"predict_micro-f1": 0.7222667623106461,
"predict_r-precision": 0.7325598051948052,
"train_loss": 0.03719398325486617,
"train_runtime": 11213.1767,
}
Across the five seeds, the model in our experiments stopped after [10,10,13,10,11] epochs*, so the model in the log you presented is severely under-trained (under-fit), i.e., it barely "learned" how to resolve the most frequent classes (~60-70% micro-F1) and does much worse on the rest of the infrequent classes (~30% macro-F1).
* In the case of bert-base-uncased, the total training epochs were [16,13,13,8,9].
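For reference, these variable epoch counts come from stopping early on the validation score rather than training for a fixed 2 epochs. Below is a minimal sketch of such a setup with the HuggingFace Trainer; the patience value, the metric implementation, and the placeholder names (model, train_dataset, eval_dataset) are illustrative assumptions, not the exact configuration of run_eurlex.sh.

```python
# Illustrative sketch only: train for up to 20 epochs and let early stopping
# halt the run, keeping the checkpoint with the best validation micro-F1.
import numpy as np
from sklearn.metrics import f1_score
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments


def compute_metrics(eval_pred):
    # Multi-label prediction: a label is predicted when sigmoid(logit) > 0.5,
    # i.e. when the raw logit is positive.
    preds = (np.asarray(eval_pred.predictions) > 0).astype(int)
    labels = np.asarray(eval_pred.label_ids).astype(int)
    return {
        "micro-f1": f1_score(labels, preds, average="micro", zero_division=0),
        "macro-f1": f1_score(labels, preds, average="macro", zero_division=0),
    }


training_args = TrainingArguments(
    output_dir="logs/eurlex/bert-base-uncased/seed_1",  # hypothetical path
    num_train_epochs=20,               # upper bound; early stopping ends the run sooner
    evaluation_strategy="epoch",       # `eval_strategy` in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="micro-f1",  # matches the eval_micro-f1 key in the log above
    greater_is_better=True,
)

trainer = Trainer(
    model=model,                  # multi-label BertForSequenceClassification, defined elsewhere
    args=training_args,
    train_dataset=train_dataset,  # tokenized EUR-LEX train split, defined elsewhere
    eval_dataset=eval_dataset,    # tokenized EUR-LEX validation split, defined elsewhere
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
)
trainer.train()
```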
Can you please attach the log from training the model for 20 epochs? I'll try to find time to rerun the experiments with the latest version of the code, but it's very unlikely that there is a bug, since others have already replicated the experiments with very similar results.
Thank you, I will give it a try ^_^
Hi @Glaciohound, there was a major bug in data loading and the label list under consideration.
In the released HuggingFace data loader, all 127 labels are pre-defined, lexically ordered based on their EUROVOC IDs (https://github.com/huggingface/datasets/blob/1529bdca496d2180bc2af6e1607dd0708438b873/datasets/lex_glue/lex_glue.py#L48).
Then, as you mentioned, the EUR-LEX training script considers the first 100 labels, instead of the most frequent ones based on the training label distribution (lex-glue/experiments/eurlex.py, line 214 at commit d640bfc).
In the original experiments we used custom data loaders, but we later built and released the HuggingFace data loader without noticing this “stealthy” bug...
Permanent Bug Fix
I have already made a pull request to fix this issue on the data loader (huggingface/datasets#5048).
Temporary Bug Fix
Until this happens (early next week), you can also replicate the results by manually defining the label list based on the 100 most frequent labels, i.e., by replacing line 214 of lex-glue/experiments/eurlex.py (at commit d640bfc) with this line of code:
labels = [119, 120, 114, 90, 28, 29, 30, 82, 87, 8, 44, 31, 33, 94, 22, 14, 52, 91, 92, 13, 89, 86, 118, 93, 12, 68, 83,
98, 11, 7, 32, 115, 96, 79, 116, 106, 81, 75, 117, 112, 59, 6, 77, 95, 72, 108, 60, 99, 74, 24, 27, 34, 58,
66, 84, 61, 16, 107, 20, 43, 97, 105, 76, 67, 80, 57, 63, 37, 36, 85, 5, 109, 69, 38, 78, 39, 49, 23, 42, 100,
17, 70, 9, 51, 113, 103, 102, 110, 0, 41, 111, 101, 35, 64, 10, 121, 21, 26, 71, 122]
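For completeness, a quick way to sanity-check this list is to recount the label frequencies on the training split yourself. The sketch below is an assumption-based illustration, not code from the repository: it presumes the HuggingFace datasets library and that the eurlex config exposes an integer labels sequence per document.

```python
# Rough sanity check: recompute the 100 most frequent EUROVOC label IDs from
# the EUR-LEX training split and compare them with the hard-coded list above.
from collections import Counter

from datasets import load_dataset

train_split = load_dataset("lex_glue", "eurlex", split="train")

label_counts = Counter()
for document_labels in train_split["labels"]:
    label_counts.update(document_labels)

# most_common(100) orders by frequency, like the manual list above.
top_100 = [label_id for label_id, _ in label_counts.most_common(100)]
print(sorted(top_100))  # should cover the same 100 IDs as the manual list
```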