torchrun breaks with load_best_model_at_end and metric_for_best_model=eval_f1 on the question_answering example
godspeed5 opened this issue · 0 comments
godspeed5 commented
System Info
- `transformers` version: 4.41.0.dev0
- Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.31
- Python version: 3.10.14
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.3
- Accelerate version: 0.29.3
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes, DDP via torchrun (single node, multi-GPU)
Who can help?
@muellerzr @pacman100 @ArthurZucker @younesbelkada
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- I clone the main branch of transformers and run `pip install -e .` in the cloned `transformers` folder.
- I then run:

```shell
torchrun --nproc_per_node 2 run_qa.py \
  --model_name_or_path google-bert/bert-base-uncased \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad \
  --max_steps 20 \
  --eval_steps 2 \
  --save_steps 2 \
  --save_total_limit 2 \
  --load_best_model_at_end True \
  --metric_for_best_model eval_f1 \
  --max_eval_samples 20 \
  --eval_strategy steps \
  --save_strategy steps 2>&1 | tee scratch.log
```
- The code errors out with `KeyError: 'eval_f1'`.
- I believe this happens because `compute_metrics` computes the `eval_f1` metric on one process only, while the trainer's `_save_checkpoint()` method looks the metric up on all processes; a non-main process reaches the lookup, doesn't find the key, and raises the error (see `src/transformers/trainer.py`, line 2820 at commit 1360801).
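For illustration, here is a minimal, self-contained sketch of the hypothesized failure mode. The `metrics_per_rank` dict and its contents are hypothetical stand-ins for what each rank holds after evaluation; the lookup at the bottom mirrors the shape of the best-metric check in `_save_checkpoint()` (reproduced from memory, so treat it as approximate):

```python
# Hypothetical per-rank metrics dicts after evaluation: rank 0 ran
# compute_metrics, the other rank only has the shared loss.
metrics_per_rank = {
    0: {"eval_loss": 1.23, "eval_f1": 81.5, "eval_exact_match": 74.0},
    1: {"eval_loss": 1.23},  # eval_f1 was never computed on this rank
}

metric_for_best_model = "eval_f1"

for rank, metrics in metrics_per_rank.items():
    # Every rank runs this check and indexes the dict directly, so any
    # rank missing the key raises immediately.
    metric_to_check = metric_for_best_model
    if not metric_to_check.startswith("eval_"):
        metric_to_check = f"eval_{metric_to_check}"
    metric_value = metrics[metric_to_check]  # KeyError: 'eval_f1' on rank 1
```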
Expected behavior
Ideally, the script should run seamlessly under torchrun with no KeyError. The trainer should be able to handle metrics computed on a single process (as `eval_f1` is here) as well as metrics computed on every process, as in other example workloads such as summarization.
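As one possible direction (not a confirmed fix), the check could be made rank-safe by broadcasting the metrics dict from the main process before any rank indexes it. The subclass below is a hedged sketch under that assumption; `MetricsBroadcastTrainer` and `_broadcast_metrics` are hypothetical names, and the sketch assumes torchrun has initialized a standard `torch.distributed` process group:

```python
import torch.distributed as dist
from transformers import Trainer


class MetricsBroadcastTrainer(Trainer):
    """Hypothetical workaround sketch: make rank 0's metrics visible everywhere."""

    def _broadcast_metrics(self, metrics):
        # Ship the metrics dict computed on rank 0 to all other ranks so the
        # best-metric lookup finds eval_f1 on every process instead of
        # raising KeyError on non-zero ranks.
        if dist.is_available() and dist.is_initialized():
            container = [metrics if dist.get_rank() == 0 else None]
            dist.broadcast_object_list(container, src=0)
            metrics = container[0]
        return metrics
```

A change along these lines would need to run just before the metric lookup; alternatively, the lookup itself could be guarded so that only the main process evaluates the best metric and then broadcasts the decision.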