huggingface/transformers

torchrun breaks with load_model_at_end and with metric_for_best_model=eval_f1 on question_answering example

godspeed5 opened this issue · 0 comments

System Info

  • transformers version: 4.41.0.dev0
  • Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.31
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: tried using ddp, but the setting is single system, multi-gpu

Who can help?

@muellerzr @pacman100 @ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. I clone the main branch of transformers and pip install -e . in the cloned transformers folder.
  2. I then run torchrun --nproc_per_node 2 run_qa.py --model_name_or_path google-bert/bert-base-uncased --dataset_name squad --do_train --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad --max_steps 20 --eval_steps 2 --save_steps 2 --save_total_limit 2 --load_best_model_at_end True --metric_for_best_model eval_f1 --max_eval_samples 20 --eval_strategy steps --save_strategy steps 2>&1 | tee scratch.log
  3. The code errors out with KeyError: 'eval_f1'.
  4. I believe this happens because the compute_metrics function computes the eval_f1 metric only on the main process, while the trainer's _save_checkpoint() method performs the best-model check on every process. A worker process whose metrics dict was never populated with eval_f1 reaches the check below and fails to find the key, raising the error. (
    if metrics is not None and self.args.metric_for_best_model is not None:
    )
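The hypothesized failure mode can be sketched in pure Python, with no actual torchrun launch (the per-rank metric dicts below are hypothetical stand-ins for what each process would hold):

```python
# Simulate the suspected race: compute_metrics runs on the main process
# only, so only rank 0's metrics dict contains "eval_f1", while every
# rank executes the best-model lookup in _save_checkpoint().
metrics_per_rank = {
    0: {"eval_loss": 0.9, "eval_f1": 81.2},  # main process: full metrics
    1: {"eval_loss": 0.9},                   # worker: eval_f1 missing
}

metric_to_check = "eval_f1"  # from --metric_for_best_model eval_f1

def save_checkpoint_check(metrics, metric_to_check):
    # Mirrors the guarded lookup quoted above: when both the metrics
    # dict and metric_for_best_model are set, index into the dict.
    if metrics is not None and metric_to_check is not None:
        return metrics[metric_to_check]  # KeyError on worker ranks

errors = {}
for rank, metrics in metrics_per_rank.items():
    try:
        save_checkpoint_check(metrics, metric_to_check)
    except KeyError as exc:
        errors[rank] = repr(exc)

print(errors)  # only the worker rank raises KeyError: 'eval_f1'
```

Rank 0 resolves the metric normally; the worker rank, which never ran compute_metrics, hits the KeyError, matching the traceback seen under torchrun.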

Expected behavior

Ideally, the code should run seamlessly under torchrun with no KeyError. The trainer should handle metrics computed only on the main process (as eval_f1 is here) just as it handles the multi-process metric computation done in other examples such as summarization.
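One possible mitigation (an illustration of the expected behavior, not the library's actual fix) is to look the metric up defensively so that ranks whose compute_metrics never ran skip the best-model comparison instead of crashing; the "eval_" prefixing mirrors what the trainer does when the configured metric name lacks it:

```python
def best_metric_or_none(metrics, metric_for_best_model):
    """Look up the best-model metric without raising.

    Returns None (skip the comparison) on processes whose metrics dict
    does not contain the key, instead of raising KeyError.
    """
    if metrics is None or metric_for_best_model is None:
        return None
    key = metric_for_best_model
    if not key.startswith("eval_"):
        key = f"eval_{key}"  # trainer prepends "eval_" when missing
    return metrics.get(key)  # None on ranks lacking the metric

# Main process finds the metric; a worker rank quietly skips.
assert best_metric_or_none({"eval_f1": 81.2}, "eval_f1") == 81.2
assert best_metric_or_none({"eval_loss": 0.9}, "eval_f1") is None
```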