Issues occurring during parallel evaluation (using Trainer.evaluate)
psychocosine opened this issue · 0 comments
psychocosine commented
System Info
- transformers version: 4.40.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
- Python version: 3.9.19
- Huggingface_hub version: 0.22.2
- Safetensors version: 0.4.3
- Accelerate version: 0.29.3
- Accelerate config: not found
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The length of the `eval_preds` parameter received in the `compute_metrics` function is different from the original length of `eval_dataset`.
```python
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]
    assert preds.shape[-1] == training_args.max_length
    # AssertionError here: preds.shape[0] = 1024, len(tokenized_datasets[-1]) = 1012
    assert preds.shape[0] == len(tokenized_datasets[-1])
```
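A workaround sketch that seems to avoid the assertion, assuming the surplus rows are padding samples duplicated by the distributed sampler and gathered at the end (`eval_len` is a stand-in I introduced for the true dataset length):

```python
def compute_metrics(eval_preds, eval_len=len(tokenized_datasets[-1])):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Assumption: the extra samples sit at the end of the gathered arrays,
    # so truncating restores the original dataset length.
    preds = preds[:eval_len]
    labels = labels[:eval_len]
    assert preds.shape[0] == eval_len
    # ... compute metrics on the truncated arrays ...
```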
```python
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets[0].shuffle(seed=42).select(range(int(1e6))),
    eval_dataset={data_args.task_name: tokenized_datasets[-1]},
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[LoggerCallback, DenserEvalCallback],
)
```
My training args are listed below:
```
n_gpus=2
per_device_train_batch_size=8
per_device_eval_batch_size=8
gradient_accumulation_steps=3
len(preds)=1024
len(tokenized_datasets[-1])=1012
```
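The mismatch is consistent with the evaluation dataset being padded up to a full final batch across both GPUs (a sketch of the arithmetic, assuming even-batch padding by the distributed sampler):

```python
import math

n_gpus = 2
per_device_eval_batch_size = 8
dataset_len = 1012

# Global evaluation batch across both processes
global_eval_batch = n_gpus * per_device_eval_batch_size        # 16
# Round the dataset up to a whole number of global batches
padded_len = math.ceil(dataset_len / global_eval_batch) * global_eval_batch
print(padded_len)  # 1024, matching len(preds)
```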
Expected behavior
Everything works fine when using a single GPU, but not with multiple GPUs.
I started my script by calling `accelerate launch script.py`.
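For reference, a minimal Accelerate sketch of the behavior I expected, where `Accelerator.gather_for_metrics` drops the samples duplicated for even sharding; the `model` and `eval_dataloader` objects here are hypothetical stand-ins, not from my script:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
# model and eval_dataloader are assumed to be defined elsewhere
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

all_preds = []
for batch in eval_dataloader:
    with torch.no_grad():
        logits = model(**batch).logits
    # gather_for_metrics is documented to drop the samples that were
    # duplicated to make the dataset evenly divisible across processes,
    # so the concatenated result should match the original dataset length
    all_preds.append(accelerator.gather_for_metrics(logits))
```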