Script and results on eurlex
Glaciohound opened this issue · 3 comments
Hello! Thanks for this great repository. I have tried experiments on many of its subtasks and it works beautifully.
The problem is that when I try to reproduce the results on EUR-LEX using run_eurlex.sh, it fails to give results similar to (or anywhere near) the ones in the paper:
VALIDATION | TEST
bert-base-uncased: MICRO-F1: 69.7 ± 0.1 MACRO-F1: 32.8 ± 0.4 | MICRO-F1: 63.1 MACRO-F1: 30.8
(I also tried changing the model to legal-base-uncased and changing the number of epochs from 2 to 20, but these attempts failed as well.)
Could you take a look into this and give some suggestions?
A more detailed log for one of the 5 seeds is as follows:
...
[INFO|trainer.py:1419] 2022-06-27 05:09:06,003 >> ***** Running training *****
[INFO|trainer.py:1420] 2022-06-27 05:09:06,003 >> Num examples = 55000
[INFO|trainer.py:1421] 2022-06-27 05:09:06,003 >> Num Epochs = 2
[INFO|trainer.py:1422] 2022-06-27 05:09:06,003 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1423] 2022-06-27 05:09:06,003 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1424] 2022-06-27 05:09:06,003 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1425] 2022-06-27 05:09:06,003 >> Total optimization steps = 13750
{'loss': 0.1809, 'learning_rate': 2.890909090909091e-05, 'epoch': 0.07}
{'loss': 0.1112, 'learning_rate': 2.7818181818181818e-05, 'epoch': 0.15}
{'loss': 0.0966, 'learning_rate': 2.6727272727272728e-05, 'epoch': 0.22}
{'loss': 0.0857, 'learning_rate': 2.5636363636363635e-05, 'epoch': 0.29}
{'loss': 0.0784, 'learning_rate': 2.454545454545455e-05, 'epoch': 0.36}
{'loss': 0.072, 'learning_rate': 2.3454545454545456e-05, 'epoch': 0.44}
{'loss': 0.0676, 'learning_rate': 2.2363636363636366e-05, 'epoch': 0.51}
{'loss': 0.0663, 'learning_rate': 2.1272727272727273e-05, 'epoch': 0.58}
{'loss': 0.0632, 'learning_rate': 2.0181818181818183e-05, 'epoch': 0.65}
{'loss': 0.0603, 'learning_rate': 1.909090909090909e-05, 'epoch': 0.73}
{'loss': 0.0593, 'learning_rate': 1.8e-05, 'epoch': 0.8}
{'loss': 0.0571, 'learning_rate': 1.6909090909090907e-05, 'epoch': 0.87}
{'loss': 0.0551, 'learning_rate': 1.5818181818181818e-05, 'epoch': 0.95}
50%|███████████████████████████████████████████████████████████████████ | 6875/13750 [14:19<14:12, 8.07it/s]
[INFO|trainer.py:622] 2022-06-27 05:23:25,910 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2590] 2022-06-27 05:23:25,913 >> ***** Running Evaluation *****
[INFO|trainer.py:2592] 2022-06-27 05:23:25,914 >> Num examples = 5000
[INFO|trainer.py:2595] 2022-06-27 05:23:25,914 >> Batch size = 8
{'eval_loss': 0.06690910458564758, 'eval_macro-f1': 0.26581931249101237, 'eval_micro-f1': 0.6573569918647109, 'eval_runtime': 25.2148, 'eval_samples_per_second': 198.296, 'eval_steps_per_second': 24.787, 'epoch': 1.0}
[INFO|trainer.py:2340] 2022-06-27 05:23:51,131 >> Saving model checkpoint to logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-6875
[INFO|configuration_utils.py:446] 2022-06-27 05:23:51,134 >> Configuration saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-6875/config.json
[INFO|modeling_utils.py:1542] 2022-06-27 05:23:52,343 >> Model weights saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-6875/pytorch_model.bin
[INFO|tokenization_utils_base.py:2108] 2022-06-27 05:23:52,345 >> tokenizer config file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-6875/tokenizer_config.json
[INFO|tokenization_utils_base.py:2114] 2022-06-27 05:23:52,346 >> Special tokens file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-6875/special_tokens_map.json
{'loss': 0.0546, 'learning_rate': 1.4727272727272728e-05, 'epoch': 1.02}
{'loss': 0.0531, 'learning_rate': 1.3636363636363637e-05, 'epoch': 1.09}
{'loss': 0.0518, 'learning_rate': 1.2545454545454545e-05, 'epoch': 1.16}
{'loss': 0.0521, 'learning_rate': 1.1454545454545455e-05, 'epoch': 1.24}
{'loss': 0.0497, 'learning_rate': 1.0363636363636364e-05, 'epoch': 1.31}
{'loss': 0.0481, 'learning_rate': 9.272727272727273e-06, 'epoch': 1.38}
{'loss': 0.0487, 'learning_rate': 8.181818181818181e-06, 'epoch': 1.45}
{'loss': 0.0488, 'learning_rate': 7.090909090909091e-06, 'epoch': 1.53}
{'loss': 0.0477, 'learning_rate': 6e-06, 'epoch': 1.6}
{'loss': 0.0476, 'learning_rate': 4.90909090909091e-06, 'epoch': 1.67}
{'loss': 0.047, 'learning_rate': 3.818181818181818e-06, 'epoch': 1.75}
{'loss': 0.0471, 'learning_rate': 2.7294545454545455e-06, 'epoch': 1.82}
{'loss': 0.0462, 'learning_rate': 1.6385454545454545e-06, 'epoch': 1.89}
{'loss': 0.0466, 'learning_rate': 5.476363636363636e-07, 'epoch': 1.96}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13750/13750 [29:11<00:00, 7.96it/s]
[INFO|trainer.py:622] 2022-06-27 05:38:17,987 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2590] 2022-06-27 05:38:17,989 >> ***** Running Evaluation *****
[INFO|trainer.py:2592] 2022-06-27 05:38:17,989 >> Num examples = 5000
[INFO|trainer.py:2595] 2022-06-27 05:38:17,989 >> Batch size = 8
{'eval_loss': 0.06163998320698738, 'eval_macro-f1': 0.3223906812379972, 'eval_micro-f1': 0.6903704623792815, 'eval_runtime': 24.2671, 'eval_samples_per_second': 206.041, 'eval_steps_per_second': 25.755, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13750/13750 [29:36<00:00, 7.96it/s]
[INFO|trainer.py:2340] 2022-06-27 05:38:42,258 >> Saving model checkpoint to logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750
[INFO|configuration_utils.py:446] 2022-06-27 05:38:42,261 >> Configuration saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750/config.json
[INFO|modeling_utils.py:1542] 2022-06-27 05:38:43,511 >> Model weights saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750/pytorch_model.bin
[INFO|tokenization_utils_base.py:2108] 2022-06-27 05:38:43,513 >> tokenizer config file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750/tokenizer_config.json
[INFO|tokenization_utils_base.py:2114] 2022-06-27 05:38:43,513 >> Special tokens file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750/special_tokens_map.json
[INFO|trainer.py:1662] 2022-06-27 05:38:46,057 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:1727] 2022-06-27 05:38:46,057 >> Loading best model from logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/checkpoint-13750 (score: 0.6903704623792815).
{'train_runtime': 1781.228, 'train_samples_per_second': 61.755, 'train_steps_per_second': 7.719, 'train_loss': 0.06421310944990678, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13750/13750 [29:41<00:00, 7.72it/s]
[INFO|trainer.py:2340] 2022-06-27 05:38:47,236 >> Saving model checkpoint to logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5
[INFO|configuration_utils.py:446] 2022-06-27 05:38:47,261 >> Configuration saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/config.json
[INFO|modeling_utils.py:1542] 2022-06-27 05:38:48,560 >> Model weights saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/pytorch_model.bin
[INFO|tokenization_utils_base.py:2108] 2022-06-27 05:38:48,562 >> tokenizer config file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/tokenizer_config.json
[INFO|tokenization_utils_base.py:2114] 2022-06-27 05:38:48,563 >> Special tokens file saved in logs/062605_eurlex_original/eurlex/bert-base-uncased/seed_5/special_tokens_map.json
***** train metrics *****
epoch = 2.0
train_loss = 0.0642
train_runtime = 0:29:41.22
train_samples = 55000
train_samples_per_second = 61.755
train_steps_per_second = 7.719
06/27/2022 05:38:48 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:622] 2022-06-27 05:38:48,611 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2590] 2022-06-27 05:38:48,620 >> ***** Running Evaluation *****
[INFO|trainer.py:2592] 2022-06-27 05:38:48,620 >> Num examples = 5000
[INFO|trainer.py:2595] 2022-06-27 05:38:48,620 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 625/625 [00:22<00:00, 27.85it/s]
***** eval metrics *****
epoch = 2.0
eval_loss = 0.0616
eval_macro-f1 = 0.3224
eval_micro-f1 = 0.6904
eval_runtime = 0:00:22.48
eval_samples = 5000
eval_samples_per_second = 222.372
eval_steps_per_second = 27.796
06/27/2022 05:39:11 - INFO - __main__ - *** Predict ***
[INFO|trainer.py:622] 2022-06-27 05:39:11,101 >> The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2590] 2022-06-27 05:39:11,106 >> ***** Running Prediction *****
[INFO|trainer.py:2592] 2022-06-27 05:39:11,106 >> Num examples = 5000
[INFO|trainer.py:2595] 2022-06-27 05:39:11,106 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 625/625 [00:22<00:00, 27.64it/s]
***** predict metrics *****
predict_loss = 0.0712
predict_macro-f1 = 0.2969
predict_micro-f1 = 0.6196
predict_runtime = 0:00:22.44
predict_samples = 5000
predict_samples_per_second = 222.741
predict_steps_per_second = 27.843
...
Hi @Glaciohound,
I'm trying to figure out what could go wrong. Here are the results for two of the runs (seeds 1 and 2) with nlpaueb/legal-bert-base-uncased:
{
"epoch": 10.0,
"eval_loss": 0.06902255862951279,
"eval_macro-f1": 0.6179554031391897,
"eval_micro-f1": 0.777115002244885,
"eval_r-precision": 0.7927199422799421,
"predict_loss": 0.08612240850925446,
"predict_macro-f1": 0.5623417257812284,
"predict_micro-f1": 0.721954706983199,
"predict_r-precision": 0.7462810028860029,
"train_loss": 0.03723958882418546,
"train_runtime": 11224.1911,
}
{
"epoch": 10.0,
"eval_loss": 0.07012862712144852,
"eval_macro-f1": 0.6059832855751778,
"eval_micro-f1": 0.7745175560447076,
"eval_r-precision": 0.7841969047619048,
"predict_loss": 0.08702316880226135,
"predict_macro-f1": 0.5589387715780068,
"predict_micro-f1": 0.7222667623106461,
"predict_r-precision": 0.7325598051948052,
"train_loss": 0.03719398325486617,
"train_runtime": 11213.1767,
}
Across the five seeds, the model in our experiments stopped after [10,10,13,10,11] epochs*, so the model in the log you presented is severely under-trained (under-fit), i.e., it barely "learned" how to resolve the most frequent classes (~60-70% micro-F1) and does much worse on the rest of the infrequent classes (~30% macro-F1).
* In the case of bert-base-uncased, the total training epochs were [16,13,13,8,9].
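For reference, these variable epoch counts come from stopping early on the validation score rather than training for a fixed 2 epochs. Below is a minimal sketch of such a setup with the HuggingFace Trainer; the patience value, the metric implementation, and the placeholder names (model, train_dataset, eval_dataset) are illustrative assumptions, not the exact configuration of run_eurlex.sh.

```python
# Illustrative sketch only: train for up to 20 epochs and let early stopping
# halt the run, keeping the checkpoint with the best validation micro-F1.
import numpy as np
from sklearn.metrics import f1_score
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments


def compute_metrics(eval_pred):
    # Multi-label prediction: a label is predicted when sigmoid(logit) > 0.5,
    # i.e. when the raw logit is positive.
    preds = (np.asarray(eval_pred.predictions) > 0).astype(int)
    labels = np.asarray(eval_pred.label_ids).astype(int)
    return {
        "micro-f1": f1_score(labels, preds, average="micro", zero_division=0),
        "macro-f1": f1_score(labels, preds, average="macro", zero_division=0),
    }


training_args = TrainingArguments(
    output_dir="logs/eurlex/bert-base-uncased/seed_1",  # hypothetical path
    num_train_epochs=20,               # upper bound; early stopping ends the run sooner
    evaluation_strategy="epoch",       # `eval_strategy` in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="micro-f1",  # matches the eval_micro-f1 key in the log above
    greater_is_better=True,
)

trainer = Trainer(
    model=model,                  # multi-label BertForSequenceClassification, defined elsewhere
    args=training_args,
    train_dataset=train_dataset,  # tokenized EUR-LEX train split, defined elsewhere
    eval_dataset=eval_dataset,    # tokenized EUR-LEX validation split, defined elsewhere
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
)
trainer.train()
```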
Can you please attach the log from training the model for 20 epochs? I'll try to find time to rerun the experiments with the latest version of the code, but it's very unlikely that there is a bug, since others have already replicated the experiments with very similar results.
Thank you, I will give it a try ^_^
Hi @Glaciohound, there was a major bug in data loading and the label list under consideration.
In the released HuggingFace data loader, all 127 labels are pre-defined, lexically ordered based on their EUROVOC IDs (https://github.com/huggingface/datasets/blob/1529bdca496d2180bc2af6e1607dd0708438b873/datasets/lex_glue/lex_glue.py#L48).
Then, as you mentioned, the EUR-LEX training script considers the first 100 labels, instead of the most frequent ones based on the training label distribution (lex-glue/experiments/eurlex.py, line 214 at commit d640bfc).
In the original experiments we used custom data loaders, but we later built and released the HuggingFace data loader without noticing this “stealthy” bug...
Permanent Bug Fix
I have already made a pull request to fix this issue on the data loader (huggingface/datasets#5048).
Temporary Bug Fix
Until this happens (early next week), you can also replicate the results by manually defining the label list based on the 100 most frequent labels, i.e., by replacing line 214 of lex-glue/experiments/eurlex.py (at commit d640bfc) with this line of code:
labels = [119, 120, 114, 90, 28, 29, 30, 82, 87, 8, 44, 31, 33, 94, 22, 14, 52, 91, 92, 13, 89, 86, 118, 93, 12, 68, 83,
98, 11, 7, 32, 115, 96, 79, 116, 106, 81, 75, 117, 112, 59, 6, 77, 95, 72, 108, 60, 99, 74, 24, 27, 34, 58,
66, 84, 61, 16, 107, 20, 43, 97, 105, 76, 67, 80, 57, 63, 37, 36, 85, 5, 109, 69, 38, 78, 39, 49, 23, 42, 100,
17, 70, 9, 51, 113, 103, 102, 110, 0, 41, 111, 101, 35, 64, 10, 121, 21, 26, 71, 122]
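For completeness, a quick way to sanity-check this list is to recount the label frequencies on the training split yourself. The sketch below is an assumption-based illustration, not code from the repository: it presumes the HuggingFace datasets library and that the eurlex config exposes an integer labels sequence per document.

```python
# Rough sanity check: recompute the 100 most frequent EUROVOC label IDs from
# the EUR-LEX training split and compare them with the hard-coded list above.
from collections import Counter

from datasets import load_dataset

train_split = load_dataset("lex_glue", "eurlex", split="train")

label_counts = Counter()
for document_labels in train_split["labels"]:
    label_counts.update(document_labels)

# most_common(100) orders by frequency, like the manual list above.
top_100 = [label_id for label_id, _ in label_counts.most_common(100)]
print(sorted(top_100))  # should cover the same 100 IDs as the manual list
```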