Failed to enable layernorm kernel
MrPeterJin opened this issue · 9 comments
First of all, I really appreciate your work! When I tried to use the framework to reproduce your results, I noticed the layernorm kernel is not working on my side. Here is the log:
torchrun --standalone --nproc_per_node=1 scripts/dit/train_dit.py \
--model DiT-XL/2 \
--batch_size 2 \
--enable_layernorm_kernel \
--enable_flashattn \
--mixed_precision bf16 \
--num_classes 10
/home/cjinag/code/playground/ColossalAI/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
warnings.warn(
/home/cjinag/code/playground/ColossalAI/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
warnings.warn("`config` is deprecated and will be removed soon.")
[03/20/24 15:13:27] INFO colossalai - colossalai - INFO: /home/cjinag/code/playground/ColossalAI/colossalai/initialize.py:67
launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
[2024-03-20 15:13:27] Experiment directory created at ./outputs/018-DiT-XL-2
[2024-03-20 15:13:39] Model params: 642.77 M
No ROCm runtime is found, using ROCM_HOME='/usr/local'
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.12372970581054688 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 0.14196491241455078 seconds
/home/cjinag/code/playground/ColossalAI/colossalai/nn/optimizer/hybrid_adam.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
self._dummy_overflow_buf = torch.cuda.IntTensor([0])
Files already downloaded and verified
[2024-03-20 15:13:51] Dataset contains 50,000 images (./datasets)
[2024-03-20 15:13:51] Boost model for distributed training
[2024-03-20 15:13:51] Training for 1400 epochs...
[2024-03-20 15:13:51] Beginning epoch 0...
Epoch 0: 0%| | 0/25000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/cjinag/code/project/multimodal/OpenDiT/scripts/dit/train_dit.py", line 324, in <module>
main(args)
File "/home/cjinag/code/project/multimodal/OpenDiT/scripts/dit/train_dit.py", line 245, in main
loss_dict = diffusion.training_losses(model, x, t, model_kwargs)
File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/diffusion/respace.py", line 90, in training_losses
return super().training_losses(self._wrap_model(model), *args, **kwargs)
File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/diffusion/gaussian_diffusion.py", line 708, in training_losses
model_output = model(x_t, t, **model_kwargs)
File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/diffusion/respace.py", line 120, in __call__
return self.model(x, new_ts, **kwargs)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/cjinag/code/playground/ColossalAI/colossalai/booster/plugin/low_level_zero_plugin.py", line 65, in forward
return super().forward(*args, **kwargs)
File "/home/cjinag/code/playground/ColossalAI/colossalai/interface/model.py", line 25, in forward
return self.module(*args, **kwargs)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/models/dit/dit.py", line 213, in forward
x = block(x, c) # (N, T, D)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/models/dit/dit.py", line 58, in forward
modulate(self.norm1, x, shift_msa, scale_msa, self.enable_modulate_kernel)
File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/modules/layers.py", line 33, in modulate
x = norm_func(x.to(torch.float32)).to(dtype)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 323, in forward
return fused_layer_norm(input, self.normalized_shape, self.eps, self.memory_efficient)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 203, in fused_layer_norm
return FusedLayerNormFunction.apply(*args)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 149, in forward
output, mean, invvar = fused_layer_norm_cuda.forward(input_, ctx.normalized_shape, ctx.eps)
RuntimeError: memory format option is only supported by strided tensors
[2024-03-20 15:13:58,980] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 155022) of binary: /home/cjinag/anaconda3/envs/opendit/bin/python
Traceback (most recent call last):
File "/home/cjinag/anaconda3/envs/opendit/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/dit/train_dit.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-20_15:13:58
host : 191host040.mobilenet.cse.ust.hk
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 155022)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Please advise possible solutions. Thanks!
It's a version mismatch problem for apex. Maybe your apex version is too old or too new. You can first disable the enable_layernorm_kernel arg to run the code.
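If it helps to isolate the problem, here is a minimal standalone probe of apex's fused LayerNorm outside OpenDiT (a sketch, not code from the repo: 1152 is DiT-XL/2's hidden size, and the float32 round-trip mirrors the modulate() call in opendit/modules/layers.py). If this alone raises the same RuntimeError, the issue is in your apex/torch pairing rather than in the training script:

# Minimal standalone probe of apex's fused LayerNorm, independent of OpenDiT.
# Reproducing "memory format option is only supported by strided tensors"
# here would point at the apex build, not at OpenDiT.
import torch
from apex.normalization import FusedLayerNorm  # requires apex built with --cuda_ext

norm = FusedLayerNorm(1152).cuda()  # 1152 = DiT-XL/2 hidden size
x = torch.randn(2, 256, 1152, device="cuda", dtype=torch.bfloat16)
out = norm(x.to(torch.float32)).to(x.dtype)  # same float32 round-trip as modulate()
print(out.shape)  # expected: torch.Size([2, 256, 1152])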
When I disabled the layernorm kernel, the code ran fine for me. However, I reinstalled OpenDiT with the versions recommended in your README file and the error still occurs. Are there any other possible reasons?
Sorry, no clues. I suppose it is something about your environment and apex.
Then may I have a reference for your environment settings (e.g., torch version, CUDA, etc.)? Your requirements.txt does not pin these versions... I suspect a newer version of PyTorch may have changed something that causes this error.
We use CUDA 11.8 and torch 2.1.2. Good luck.
What is the cuDNN version on your platform? Just call print(torch.backends.cudnn.version()) and share the output.
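For a fuller picture, everything relevant can be dumped in one go (all of these are standard attributes of a CUDA build of PyTorch):

import torch

print("torch :", torch.__version__)
print("CUDA  :", torch.version.cuda)
print("cuDNN :", torch.backends.cudnn.version())
print("GPU   :", torch.cuda.get_device_name(0))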
cudnn 8.9.7
I noticed that following the installation guidelines in your README automatically updates torch and the other dependencies to their newest versions, which causes the version mismatch. So I think you probably need to pin the versions in your environment settings.
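After pinning the versions and rebuilding apex, a quick sanity check is to import apex's compiled extension directly. fused_layer_norm_cuda is the module name taken from the traceback above; this is just a sketch of the idea, not an official apex API:

# Verify apex's CUDA extension loads against the currently installed torch.
# fused_layer_norm_cuda is built when apex is installed with --cuda_ext;
# an ImportError or ABI mismatch here means apex must be rebuilt from source.
import torch
import fused_layer_norm_cuda  # noqa: F401

print("torch", torch.__version__, "/ CUDA", torch.version.cuda, "- apex extension OK")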
Thanks for providing your settings. I was able to start training successfully.