huggingface/transformers

Issues occurring during parallel evaluation (using Trainer.evaluate)

psychocosine opened this issue · 0 comments

System Info

  • transformers version: 4.40.0
  • Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
  • Python version: 3.9.19
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 1.13.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

@muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When evaluating on multiple GPUs, the eval_preds argument passed to the compute_metrics function contains more samples than the original eval_dataset.

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        # In case the model returns more than the prediction logits
        if isinstance(preds, tuple):
            preds = preds[0]
        assert preds.shape[-1] == training_args.max_length
        # AssertionError: preds.shape[0] == 1024, but len(tokenized_datasets[-1]) == 1012
        assert preds.shape[0] == len(tokenized_datasets[-1])

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets[0].shuffle(seed=42).select(range(int(1e6))),
        eval_dataset={data_args.task_name: tokenized_datasets[-1]},
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        callbacks=[LoggerCallback, DenserEvalCallback],
    )

My training arguments are listed below:

n_gpus=2
per_device_train_batch_size=8
per_device_eval_batch_size=8
gradient_accumulation_steps=3
len(preds)=1024
len(tokenized_datasets[-1])=1012
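The mismatch above is consistent with the eval dataset being padded up to a whole number of global batches. A hedged back-of-the-envelope check, assuming the distributed sampler rounds the dataset up so that both processes receive equal, full batches:

```python
import math

n_gpus = 2
per_device_eval_batch_size = 8
dataset_len = 1012

global_batch = n_gpus * per_device_eval_batch_size  # 16 samples per step
n_batches = math.ceil(dataset_len / global_batch)   # 64 steps needed
padded_len = n_batches * global_batch
print(padded_len)  # 1024, matching len(preds) reported above
```

If this assumption holds, the 12 extra entries are duplicated samples added so the last global batch is full.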

Expected behavior

Everything works fine on a single GPU, but the assertion fails with multiple GPUs.
I launched the script with accelerate launch script.py.