huggingface/transformers

torchrun breaks with load_model_at_end and with metric_for_best_model=eval_f1 on question_answering example

godspeed5 opened this issue · 0 comments

System Info

  • transformers version: 4.41.0.dev0
  • Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.31
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: tried using ddp, but the setting is single system, multi-gpu

Who can help?

@muellerzr @pacman100 @ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. I clone the main branch of transformers and pip install -e . in the cloned transformers folder.
  2. I then run torchrun --nproc_per_node 2 run_qa.py --model_name_or_path google-bert/bert-base-uncased --dataset_name squad --do_train --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad --max_steps 20 --eval_steps 2 --save_steps 2 --save_total_limit 2 --load_best_model_at_end True --metric_for_best_model eval_f1 --max_eval_samples 20 --eval_strategy steps --save_strategy steps 2>&1 | tee scratch.log
  3. The code errors out with KeyError: 'eval_f1'.
  4. I believe this happens because the compute_metrics function computes the eval_f1 metric only on the main process, while the trainer's _save_checkpoint() method performs the best-model check on every process. A worker process whose metrics dict was never populated with eval_f1 reaches the check below and fails to find the key, raising the error. (
    if metrics is not None and self.args.metric_for_best_model is not None:
    )
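The hypothesized failure mode can be sketched in pure Python, with no actual torchrun launch (the per-rank metric dicts below are hypothetical stand-ins for what each process would hold):

```python
# Simulate the suspected race: compute_metrics runs on the main process
# only, so only rank 0's metrics dict contains "eval_f1", while every
# rank executes the best-model lookup in _save_checkpoint().
metrics_per_rank = {
    0: {"eval_loss": 0.9, "eval_f1": 81.2},  # main process: full metrics
    1: {"eval_loss": 0.9},                   # worker: eval_f1 missing
}

metric_to_check = "eval_f1"  # from --metric_for_best_model eval_f1

def save_checkpoint_check(metrics, metric_to_check):
    # Mirrors the guarded lookup quoted above: when both the metrics
    # dict and metric_for_best_model are set, index into the dict.
    if metrics is not None and metric_to_check is not None:
        return metrics[metric_to_check]  # KeyError on worker ranks

errors = {}
for rank, metrics in metrics_per_rank.items():
    try:
        save_checkpoint_check(metrics, metric_to_check)
    except KeyError as exc:
        errors[rank] = repr(exc)

print(errors)  # only the worker rank raises KeyError: 'eval_f1'
```

Rank 0 resolves the metric normally; the worker rank, which never ran compute_metrics, hits the KeyError, matching the traceback seen under torchrun.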

Expected behavior

Ideally, the code should run seamlessly under torchrun with no KeyError. The trainer should handle metrics computed only on the main process (as eval_f1 is here) just as it handles the multi-process metric computation done in other examples such as summarization.
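One possible mitigation (an illustration of the expected behavior, not the library's actual fix) is to look the metric up defensively so that ranks whose compute_metrics never ran skip the best-model comparison instead of crashing; the "eval_" prefixing mirrors what the trainer does when the configured metric name lacks it:

```python
def best_metric_or_none(metrics, metric_for_best_model):
    """Look up the best-model metric without raising.

    Returns None (skip the comparison) on processes whose metrics dict
    does not contain the key, instead of raising KeyError.
    """
    if metrics is None or metric_for_best_model is None:
        return None
    key = metric_for_best_model
    if not key.startswith("eval_"):
        key = f"eval_{key}"  # trainer prepends "eval_" when missing
    return metrics.get(key)  # None on ranks lacking the metric

# Main process finds the metric; a worker rank quietly skips.
assert best_metric_or_none({"eval_f1": 81.2}, "eval_f1") == 81.2
assert best_metric_or_none({"eval_loss": 0.9}, "eval_f1") is None
```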