yl4579/StyleTTS2

FP8 Fine Tuning Crashes

Opened this issue · 1 comments

I get this error message if I set max_len to 300, or anything higher than 100 for that matter, whenever I try to train with FP8. I'm using cuda-12.4.0-2 and the nightly CUDA 12.4 PyTorch builds, and I have MS-AMP and TransformerEngine installed.

accelerate launch --mixed_precision=fp8 train_finetune.py --config_path ./Configs/config_ft.yml
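For reference, this is roughly how the FP8 path gets wired up when accelerate is driven from code instead of the CLI; a minimal sketch assuming a recent accelerate release that ships FP8RecipeKwargs (the explicit backend choice below is my own addition, not something from the StyleTTS2 scripts):

from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# Choose the FP8 backend explicitly: "te" routes through TransformerEngine,
# "msamp" routes through MS-AMP. With both installed, forcing one may help
# rule out a backend-selection problem.
fp8_kwargs = FP8RecipeKwargs(backend="te")
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[fp8_kwargs])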

RuntimeError: GET was unable to find an engine to execute this computation

[Screenshot: Error3-Copy]
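In case it's relevant, this is the quick environment check I'd run before digging further; it only uses standard torch calls, nothing StyleTTS2-specific. As far as I know, TransformerEngine's FP8 kernels want an Ada (sm_89) or Hopper (sm_90) GPU, and the "unable to find an engine" message comes from cuDNN failing to select a kernel for the requested computation:

import torch

# Report GPU compute capability plus the torch/CUDA/cuDNN versions in use.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
print(f"torch {torch.__version__}, CUDA {torch.version.cuda}, cuDNN {torch.backends.cudnn.version()}")
if (major, minor) < (8, 9):
    print("No native FP8 support on this GPU; fp8 mixed precision will likely fail.")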

Never mind, it was only crashing when I used virtual console mode. I switched to an xfce4 session and it doesn't crash anymore. I installed the stable version of TransformerEngine.

Edit: I reinstalled MS-AMP and I still get this error message, and then I reinstalled the stable version of TransformerEngine and still get the error message.

Traceback (most recent call last):
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/train_finetune.py", line 713, in <module>
    main()
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/train_finetune.py", line 460, in main
    g_loss.backward()
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/lib/python3.11/site-packages/torch/_tensor.py", line 520, in backward
    torch.autograd.backward(
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/lib/python3.11/site-packages/torch/autograd/__init__.py", line 288, in backward
    _engine_run_backward(
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/lib/python3.11/site-packages/torch/autograd/graph.py", line 767, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: GET was unable to find an engine to execute this computation
Traceback (most recent call last):
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
    simple_launcher(args)
  File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/lib/python3.11/site-packages/accelerate/commands/launch.py", line 688, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/fp8/bin/python3.11', 'train_finetune.py', '--config_path', './Configs/config_ft-Ellie-Up-FP8.yml']' returned non-zero exit status 1.

Edit: I got this error message. This was with the stable version of TransformerEngine.

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
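Rerunning the same launch with synchronous kernel launches, as the error message itself suggests, should give an accurate stack trace for the illegal memory access:

CUDA_LAUNCH_BLOCKING=1 accelerate launch --mixed_precision=fp8 train_finetune.py --config_path ./Configs/config_ft.yml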