pytorch/examples

FSDP T5 Example not working

YooSungHyun opened this issue · 6 comments

Context

  • Python version: 3.10
  • Operating System and version: ubuntu 22.04

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: FSDP T5 example
  • Link to code or data to repro [if any]:

Expected Behavior

Training runs and completes without errors.

Current Behavior

An error is raised and training stops.

Possible Solution

Steps to Reproduce

  1. Launch the FSDP T5 example as-is.
  2. The following error is raised (see the sketch below this list for where the keyword likely comes from):
     TypeError: T5Block.forward() got an unexpected keyword argument 'offload_to_cpu'
    ...
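
For what it's worth, the keyword seems to come from the example's activation-checkpointing setup rather than from the T5 model itself. Below is a rough sketch of that setup as I read it (the file layout and exact names are my reading of the example's policies module, not verified against current main), assuming a recent PyTorch where checkpoint_wrapper no longer accepts offload_to_cpu:

from functools import partial

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.t5.modeling_t5 import T5Block

# Older torch releases accepted offload_to_cpu= here. In recent releases
# that parameter is gone, so the keyword falls into **checkpoint_fn_kwargs
# and torch.utils.checkpoint.checkpoint forwards it to the wrapped module's
# forward, which is exactly the TypeError shown above.
non_reentrant_wrapper = partial(
    checkpoint_wrapper,
    offload_to_cpu=False,                         # <- the offending kwarg
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)

def apply_fsdp_checkpointing(model):
    # Wrap every T5Block in a non-reentrant activation-checkpoint wrapper.
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=non_reentrant_wrapper,
        check_fn=lambda submodule: isinstance(submodule, T5Block),
    )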

Failure Logs [if any]

Traceback (most recent call last):
  File "/data/bart/temp_workspace/examples/distributed/FSDP/T5_training.py", line 215, in <module>
    fsdp_main(args)
  File "/data/bart/temp_workspace/examples/distributed/FSDP/T5_training.py", line 148, in fsdp_main
    train_accuracy = train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=sampler1)
  File "/data/bart/temp_workspace/examples/distributed/FSDP/utils/train_utils.py", line 50, in train
    output = model(input_ids=batch["source_ids"],attention_mask=batch["source_mask"],labels=batch["target_ids"] )
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1706, in forward
    encoder_outputs = self.encoder(
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1110, in forward
    layer_outputs = layer_module(
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py", line 164, in forward
    return self.checkpoint_fn(  # type: ignore[misc]
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 458, in checkpoint
    ret = function(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: T5Block.forward() got an unexpected keyword argument 'offload_to_cpu'

I also hit this problem when saving and loading sharded checkpoints...
pytorch/pytorch#103627
How can I solve it?

I am also facing the same issue... Any solution to this?

Facing the same issue.

Facing the same issue.

Facing the same issue. Any solution to this? Thanks.

I fixed and merged this on main by disabling activation checkpointing in #1273.

This was done by changing the line below in distributed/FSDP/configs/fsdp.py:

- fsdp_activation_checkpointing: bool=True
+ fsdp_activation_checkpointing: bool=False

Will look for a proper fix next
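
One possible proper fix (an untested sketch, assuming the wrapper setup shown earlier in the issue body) would keep activation checkpointing enabled and simply drop the removed offload_to_cpu argument where the checkpoint_wrapper partial is built in the example's activation-checkpointing policy module:

- non_reentrant_wrapper = partial(
-     checkpoint_wrapper,
-     offload_to_cpu=False,
-     checkpoint_impl=CheckpointImpl.NO_REENTRANT,
- )
+ non_reentrant_wrapper = partial(
+     checkpoint_wrapper,
+     checkpoint_impl=CheckpointImpl.NO_REENTRANT,
+ )

With that change it should be possible to keep fsdp_activation_checkpointing: bool=True on recent PyTorch, since no unknown keyword is forwarded to T5Block.forward anymore.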