NUS-HPC-AI-Lab/VideoSys

Failed to enable layernorm kernel

MrPeterJin opened this issue · 9 comments

First of all, I really appreciate your work! When I tried to use the framework to reproduce your results, I noticed that the layernorm kernel is not working on my side. Here is the log:

torchrun --standalone --nproc_per_node=1 scripts/dit/train_dit.py \
     --model DiT-XL/2 \
     --batch_size 2 \
     --enable_layernorm_kernel \
     --enable_flashattn \
     --mixed_precision bf16 \
     --num_classes 10
/home/cjinag/code/playground/ColossalAI/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/home/cjinag/code/playground/ColossalAI/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[03/20/24 15:13:27] INFO     colossalai - colossalai - INFO: /home/cjinag/code/playground/ColossalAI/colossalai/initialize.py:67     
                             launch                                                                                                  
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1                   
[2024-03-20 15:13:27] Experiment directory created at ./outputs/018-DiT-XL-2
[2024-03-20 15:13:39] Model params: 642.77 M
No ROCm runtime is found, using ROCM_HOME='/usr/local'
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.12372970581054688 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 0.14196491241455078 seconds
/home/cjinag/code/playground/ColossalAI/colossalai/nn/optimizer/hybrid_adam.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
Files already downloaded and verified
[2024-03-20 15:13:51] Dataset contains 50,000 images (./datasets)
[2024-03-20 15:13:51] Boost model for distributed training
[2024-03-20 15:13:51] Training for 1400 epochs...
[2024-03-20 15:13:51] Beginning epoch 0...
Epoch 0:   0%|                                                                                             | 0/25000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/cjinag/code/project/multimodal/OpenDiT/scripts/dit/train_dit.py", line 324, in <module>
    main(args)
  File "/home/cjinag/code/project/multimodal/OpenDiT/scripts/dit/train_dit.py", line 245, in main
    loss_dict = diffusion.training_losses(model, x, t, model_kwargs)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/diffusion/respace.py", line 90, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/diffusion/gaussian_diffusion.py", line 708, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/diffusion/respace.py", line 120, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cjinag/code/playground/ColossalAI/colossalai/booster/plugin/low_level_zero_plugin.py", line 65, in forward
    return super().forward(*args, **kwargs)
  File "/home/cjinag/code/playground/ColossalAI/colossalai/interface/model.py", line 25, in forward
    return self.module(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/models/dit/dit.py", line 213, in forward
    x = block(x, c)  # (N, T, D)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/models/dit/dit.py", line 58, in forward
    modulate(self.norm1, x, shift_msa, scale_msa, self.enable_modulate_kernel)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/modules/layers.py", line 33, in modulate
    x = norm_func(x.to(torch.float32)).to(dtype)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 323, in forward
    return fused_layer_norm(input, self.normalized_shape, self.eps, self.memory_efficient)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 203, in fused_layer_norm
    return FusedLayerNormFunction.apply(*args)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 149, in forward
    output, mean, invvar = fused_layer_norm_cuda.forward(input_, ctx.normalized_shape, ctx.eps)
RuntimeError: memory format option is only supported by strided tensors
[2024-03-20 15:13:58,980] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 155022) of binary: /home/cjinag/anaconda3/envs/opendit/bin/python
Traceback (most recent call last):
  File "/home/cjinag/anaconda3/envs/opendit/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/dit/train_dit.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-20_15:13:58
  host      : 191host040.mobilenet.cse.ust.hk
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 155022)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Could you please advise on possible solutions? Thanks!

It's a version mismatch problem with apex. Maybe your apex version is too old or too new. You can first disable the enable_layernorm_kernel arg to run the code.
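
For context, that flag essentially selects which LayerNorm implementation the DiT blocks use: apex's fused CUDA kernel when it is enabled, plain PyTorch otherwise. A simplified sketch of the idea (the helper name and signature here are illustrative, not the actual OpenDiT code):

import torch.nn as nn

def get_layernorm(hidden_size, eps, use_kernel):
    # Sketch: --enable_layernorm_kernel switches the norm to apex's fused
    # CUDA LayerNorm; without it, the standard PyTorch LayerNorm is used.
    if use_kernel:
        try:
            from apex.normalization import FusedLayerNorm
            return FusedLayerNorm(hidden_size, eps=eps)
        except ImportError:
            raise RuntimeError("apex is not installed, cannot enable the layernorm kernel")
    return nn.LayerNorm(hidden_size, eps=eps)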

When I disable the layernorm kernel, the code runs fine for me. However, I have reinstalled OpenDiT following the versions recommended in your README file, and this error still occurs. Are there any other possible reasons?

Sorry, no clues. I suppose it is about your environment and apex.
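
One way to check whether apex itself is broken in the environment, independent of OpenDiT, is a minimal standalone call to the fused kernel (a sketch; 1152 is just the DiT-XL/2 hidden size). If this already raises the same RuntimeError, the problem is the apex/torch build rather than the model code:

import torch
from apex.normalization import FusedLayerNorm

# Minimal repro attempt: run apex's fused LayerNorm on a plain contiguous
# CUDA tensor. A failure here points at the apex installation, not OpenDiT.
norm = FusedLayerNorm(1152).cuda()
x = torch.randn(2, 256, 1152, device="cuda", dtype=torch.float32)
print(norm(x).shape)  # expected: torch.Size([2, 256, 1152])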

Then may I have a reference for your environment settings (e.g. torch version, CUDA, etc.)? Your requirements.txt does not pin these versions... I suspect something changed in a newer version of PyTorch that causes this error.

We use CUDA 11.8 and torch 2.1.2. Good luck!

What is the cudnn version on your platform? Just call print(torch.backends.cudnn.version()) for the output.

cudnn 8.9.7
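
Putting the reported setup together (torch 2.1.2, CUDA 11.8, cuDNN 8.9.7), a local environment can be compared against it with standard torch attributes:

import torch

# Versions reported in this thread, for comparison:
# torch 2.1.2, CUDA 11.8, cuDNN 8.9.7 (printed as the integer 8907).
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
print("cudnn :", torch.backends.cudnn.version())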

I noticed that following the installation guidelines in your README file automatically updates torch and the other dependencies to their newest versions, which causes the version mismatch. So I think you probably need to pin the versions in your environment settings.
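
For anyone hitting the same issue, pinning could look like the following requirements snippet (a sketch: only torch==2.1.2 and CUDA 11.8 come from this thread; the cu118 wheel index is the standard PyTorch one and is an assumption about how torch was installed):

--extra-index-url https://download.pytorch.org/whl/cu118
torch==2.1.2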

Thanks for providing your settings. I was able to start training successfully.