PaddlePaddle/PaddleNLP

[Bug]: llama model raises "Tensor need be reduced must not empty [Hint: Expected x.numel() > 0, but received x.numel():0 <= 0:0.]" when loss=0

dynamicheart opened this issue · 3 comments

Software Environment

- paddlepaddle-gpu: 
commit: 4ffb7da786cef844deb3cf8ad7f95d56000bd010
cuda: 12.0
cudnn: 8.9.1
- paddlenlp: 
commit: 74bb39b51bef45f32aee310efdb8994042c00bb3

Duplicate Issues

  • I have searched the existing issues

Error Description

[2024-03-05 08:06:28,678] [    INFO] - loss: 4.23760509, learning_rate: 2.999e-05, global_step: 2310, interval_runtime: 1.1534, interval_samples_per_second: 6.935981184579392, interval_steps_per_second: 0.866997648072424, epoch: 0.0229
[2024-03-05 08:06:29,834] [    INFO] - loss: 4.39690018, learning_rate: 2.999e-05, global_step: 2311, interval_runtime: 1.1555, interval_samples_per_second: 6.923501595186914, interval_steps_per_second: 0.8654376993983642, epoch: 0.0229
LAUNCH INFO 2024-03-05 08:06:34,816 Pod failed
LAUNCH ERROR 2024-03-05 08:06:34,817 Container failed !!!
Container rank 6 status failed cmd ['/usr/bin/python', '-u', 'run_pretrain.py', '--model_type', 'llama', '--model_name_or_path', 'facebook/llama-13b', '--tokenizer_name_or_path', 'facebook/llama-13b', '--input_dir', './data', '--output_dir', 'output/llama_hybrid', '--split', '949,50,1', '--max_seq_length', '2048', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--use_flash_attention', '1', '--use_fused_rope', '1', '--fuse_attention_ffn', '1', '--fuse_attention_qkv', '1', '--use_fused_rms_norm', '1', '--num_hidden_layers', '40', '--bf16', '--fp16_opt_level', 'O2', '--scale_loss', '1024', '--learning_rate', '0.00003', '--min_learning_rate', '0.000005', '--lr_scheduler_type', 'cosine', '--max_steps', '100000', '--save_steps', '100000', '--weight_decay', '0.01', '--warmup_ratio', '0.01', '--max_grad_norm', '1.0', '--logging_steps', '1', '--dataloader_num_workers', '1', '--sharding', 'stage2', '--eval_steps', '1000', '--report_to', 'visualdl', '--disable_tqdm', 'true', '--continue_training', '0', '--recompute', '0', '--do_train', '--device', 'gpu'] code 1 log output/llama_hybrid_log/workerlog.6 
env {'NV_LIBCUBLAS_VERSION': '12.0.1.189-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'COLORTERM': 'truecolor', 'NV_NVML_DEV_VERSION': '12.0.76-1', 'NV_CUDNN_PACKAGE_NAME': 'libcudnn8', 'GREP_COLOR': '1;31', 'TERM_PROGRAM_VERSION': '1.83.1', 'NV_LIBNCCL_DEV_PACKAGE': 'libnccl-dev=2.17.1-1+cuda12.0', 'NV_LIBNCCL_DEV_PACKAGE_VERSION': '2.17.1-1', 'HOSTNAME': 'szzj-isa-ai-peking-poc13.szzj.baidu.com', 'LANGUAGE': 'en_US.UTF-8', 'NVIDIA_REQUIRE_CUDA': 'cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471', 'NV_LIBCUBLAS_DEV_PACKAGE': 'libcublas-dev-12-0=12.0.1.189-1', 'NV_NVTX_VERSION': '12.0.76-1', 'NV_CUDA_CUDART_DEV_VERSION': '12.0.107-1', 'NV_LIBCUSPARSE_VERSION': '12.0.0.76-1', 'NV_LIBNPP_VERSION': '12.0.0.30-1', 'NCCL_VERSION': '2.17.1-1', 'PWD': '/host/PaddleNLP-XPU/llm/llama', 'NV_CUDNN_PACKAGE': 'libcudnn8=8.8.0.121-1+cuda12.0', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'WITH_AVX': 'ON', 'NV_NVPROF_DEV_PACKAGE': 'cuda-nvprof-12-0=12.0.90-1', 'NV_LIBNPP_PACKAGE': 'libnpp-12-0=12.0.0.30-1', 'NV_LIBNCCL_DEV_PACKAGE_NAME': 'libnccl-dev', 'GREP_OPTIONS': '--color=auto', 'VSCODE_GIT_ASKPASS_NODE': '/root/.vscode-server/bin/1.8.401.83.1.02/node', 'NV_LIBCUBLAS_DEV_VERSION': '12.0.1.189-1', 'NVIDIA_PRODUCT_NAME': 'CUDA', 'NV_LIBCUBLAS_DEV_PACKAGE_NAME': 'libcublas-dev-12-0', 'NV_CUDA_CUDART_VERSION': '12.0.107-1', 'HOME': '/root', 'LANG': 'en_US.UTF-8', 'NVIDIA_CUDA_END_OF_LIFE': '1', 'CUDA_VERSION': '12.0.0', 'NV_LIBCUBLAS_PACKAGE': 'libcublas-12-0=12.0.1.189-1', 'NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE': 'cuda-nsight-compute-12-0=12.0.0-1', 'ICODING_VERSION': '1.8.401.83.1.02', 'GIT_ASKPASS': '/root/.vscode-server/bin/1.8.401.83.1.02/extensions/git/dist/askpass.sh', 'CLICOLOR': '1', 'NV_LIBNPP_DEV_PACKAGE': 'libnpp-dev-12-0=12.0.0.30-1', 'GOROOT': '/usr/local/go', 'NV_LIBCUBLAS_PACKAGE_NAME': 'libcublas-12-0', 'NV_LIBNPP_DEV_VERSION': '12.0.0.30-1', 'VSCODE_GIT_ASKPASS_EXTRA_ARGS': '', 'WITH_GPU': 'ON', 'TERM': 'xterm-256color', 'NV_LIBCUSPARSE_DEV_VERSION': '12.0.0.76-1', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'NV_CUDNN_VERSION': '8.8.0.121', 'VSCODE_GIT_IPC_HANDLE': '/tmp/vscode-git-a504850b12.sock', 'SHLVL': '2', 'NV_CUDA_LIB_VERSION': '12.0.0-1', 'NVARCH': 'x86_64', 'CUDNN_VERSION': '8.9.1', 'NV_CUDNN_PACKAGE_DEV': 'libcudnn8-dev=8.8.0.121-1+cuda12.0', 'NV_CUDA_COMPAT_PACKAGE': 'cuda-compat-12-0', 'NV_LIBNCCL_PACKAGE': 'libnccl2=2.17.1-1+cuda12.0', 'LD_LIBRARY_PATH': '', 'NV_CUDA_NSIGHT_COMPUTE_VERSION': '12.0.0-1', 'NV_NVPROF_VERSION': '12.0.90-1', 'LC_ALL': 'en_US.UTF-8', 'VSCODE_GIT_ASKPASS_MAIN': '/root/.vscode-server/bin/1.8.401.83.1.02/extensions/git/dist/askpass-main.js', 'BROWSER': '/root/.vscode-server/bin/1.8.401.83.1.02/bin/helpers/browser.sh', 'PATH': '/root/.BCloud/bin:/root/.vscode-server/bin/1.8.401.83.1.02/bin/remote-cli:/root/.BCloud/bin:/root/.vscode-server/bin/1.8.401.83.1.02/bin:/root/.vscode-server/bin:/home/cmake-3.18.0-Linux-x86_64/bin:/usr/local/gcc-12.1/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/go/bin:/root/gopath/bin', 'NV_LIBNCCL_PACKAGE_NAME': 'libnccl2', 'NV_LIBNCCL_PACKAGE_VERSION': '2.17.1-1', 'DEBIAN_FRONTEND': 'noninteractive', 'OLDPWD': 
'/host/PaddleNLP-XPU', 'GOPATH': '/root/gopath', 'TERM_PROGRAM': 'vscode', 'VSCODE_IPC_HOOK_CLI': '/tmp/vscode-ipc-1f8e8da3-5315-4fd5-b7be-285e4dc98f23.sock', '_': '/usr/bin/python', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'egfwmz', 'PADDLE_MASTER': '10.93.234.25:45151', 'PADDLE_GLOBAL_SIZE': '8', 'PADDLE_LOCAL_SIZE': '8', 'PADDLE_GLOBAL_RANK': '6', 'PADDLE_LOCAL_RANK': '6', 'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '10.93.234.25:45158', 'PADDLE_TRAINER_ID': '6', 'PADDLE_TRAINERS_NUM': '8', 'PADDLE_RANK_IN_NODE': '6', 'PADDLE_TRAINER_ENDPOINTS': '10.93.234.25:45152,10.93.234.25:45153,10.93.234.25:45154,10.93.234.25:45155,10.93.234.25:45156,10.93.234.25:45157,10.93.234.25:45158,10.93.234.25:45159', 'FLAGS_selected_gpus': '6', 'PADDLE_LOG_DIR': '/host/PaddleNLP-XPU/llm/llama/output/llama_hybrid_log'}
LAUNCH INFO 2024-03-05 08:06:34,817 ------------------------- ERROR LOG DETAIL -------------------------
[2024-03-05 07:21:54,674] [    INFO] - ***** Running training *****
[2024-03-05 07:21:54,674] [    INFO] -   Num examples = 806,405
[2024-03-05 07:21:54,674] [    INFO] -   Num Epochs = 1
[2024-03-05 07:21:54,674] [    INFO] -   Instantaneous batch size per device = 1
[2024-03-05 07:21:54,674] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 8
[2024-03-05 07:21:54,674] [    INFO] -   Gradient Accumulation steps = 1
[2024-03-05 07:21:54,674] [    INFO] -   Total optimization steps = 100,000
[2024-03-05 07:21:54,674] [    INFO] -   Total num train samples = 800,000
[2024-03-05 07:21:54,676] [    INFO] -   Number of trainable parameters = 13,015,864,320 (per device)
I0305 07:21:56.126010 76258 custom_operator.cc:1296] register pir custom op :fused_rms_norm
I0305 07:21:56.126060 76258 custom_operator.cc:1296] register pir custom op :fused_rms_norm_grad
I0305 07:21:56.126178 76258 custom_operator.cc:1296] register pir custom op :fused_ln
I0305 07:21:56.126186 76258 custom_operator.cc:1296] register pir custom op :fused_ln_grad
Traceback (most recent call last):
  File "/host/PaddleNLP-XPU/llm/llama/run_pretrain.py", line 567, in <module>
    main()
  File "/host/PaddleNLP-XPU/llm/llama/run_pretrain.py", line 548, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/host/PaddleNLP-XPU/paddlenlp/trainer/trainer.py", line 890, in train
    dp_master_grad = (
  File "/host/PaddleNLP-XPU/paddlenlp/trainer/trainer.py", line 1900, in training_step
  File "/host/PaddleNLP-XPU/paddlenlp/trainer/trainer.py", line 1853, in compute_loss
    labels = (inputs.pop("start_positions"), inputs.pop("end_positions"))
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_stage2.py", line 190, in forward
    fw = self._layer(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/host/PaddleNLP-XPU/paddlenlp/transformers/llama/modeling.py", line 1611, in forward
    loss = self.criterion(logits, labels)
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/host/PaddleNLP-XPU/paddlenlp/transformers/llama/modeling.py", line 1427, in forward
    loss = paddle.mean(masked_lm_loss)
  File "/usr/local/lib/python3.10/dist-packages/paddle/tensor/stat.py", line 90, in mean
    return _C_ops.mean(x, axis, keepdim)
ValueError: (InvalidArgument) Tensor need be reduced must not empty.
  [Hint: Expected x.numel() > 0, but received x.numel():0 <= 0:0.] (at ../paddle/phi/kernels/funcs/reduce_function.h:1052)

LAUNCH INFO 2024-03-05 08:06:40,653 Exit code -15

Steps to Reproduce & Code

The error comes from these two lines in paddlenlp/transformers/llama/modeling.py (see the `loss = paddle.mean(masked_lm_loss)` call at line 1427 in the traceback above).

Because masked_lm_loss.numel() == 0, calling paddle.mean on it raises the error above. The reason loss is 0 appears to be that the softmax produces a one-hot tensor: the position of the target label gets probability 1 and every other position gets 0, so the cross-entropy for that token is exactly -log(1) = 0.

import numpy as np

def stable_softmax(x):
    z = x - np.max(x, axis=-1, keepdims=True)
    print("z", z)
    numerator = np.exp(z)
    print("numerator", numerator)
    denominator = np.sum(numerator, axis=-1, keepdims=True)
    print("denominator", denominator)
    softmax = numerator / denominator
    print("softmax", softmax)
    return softmax

# Example logits with one dominant value (8843.66); the others are thousands smaller.
x = [-2710.10620117, -2914.37866211, -5045.04443359, -4361.91601562, -459.57000732, 8843.65820312, -1871.62756348, 5447.12451172, -10947.22949219]
stable_softmax(x)

# z [-11553.76440429 -11758.03686523 -13888.70263671 -13205.57421874 -9303.22821044 0  -10715.2857666 -3396.5336914  -19790.88769531]
# numerator [0. 0. 0. 0. 0. 1. 0. 0. 0.]
# denominator [1.]
# softmax [0. 0. 0. 0. 0. 1. 0. 0. 0.]
# array([0., 0., 0., 0., 0., 1., 0., 0., 0.])

When the exponent passed to exp is very small (e.g. below -1000; float64 exp already underflows at around -745), the result is exactly 0.
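A minimal sketch of the resulting failure, assuming the two lines referenced above filter the per-token loss by value (something like masked_lm_loss[masked_lm_loss > 0]) before calling paddle.mean; numpy is used here only to illustrate the shapes, while the real error is raised by paddle.mean:

import numpy as np

# Hypothetical sketch, not the actual modeling.py code.
# If the softmax is exactly one-hot for every token, each per-token
# cross-entropy is -log(1.0) == 0, a value-based filter drops every element,
# and the reduction then sees an empty tensor.
per_token_loss = np.zeros(4, dtype=np.float32)   # every token predicted with probability 1.0
filtered = per_token_loss[per_token_loss > 0]    # empty array, shape (0,)
print(filtered.shape)                            # (0,)

# np.mean only warns and returns nan on an empty array; paddle.mean instead
# raises "Tensor need be reduced must not empty", matching the log above.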

References:

Thanks for the feedback. I looked into it, and the problem was introduced by this PR:

93e78c2#diff-99e104eff4c095428aa1cd5d186107ae22737297e8ec3b5c12cd138e69a79cb5

Please see whether the following implementation solves your problem:

masked_lm_loss = masked_lm_loss[masked_lm_labels != self.ignore_index]
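A quick numpy sketch of why masking by label rather than by loss value keeps the tensor non-empty (ignore_index = -100 below is only an illustrative assumption; use the model's actual self.ignore_index):

import numpy as np

ignore_index = -100  # assumption for this sketch
masked_lm_labels = np.array([523, 87, ignore_index, ignore_index])
masked_lm_loss = np.array([0.0, 0.0, 0.0, 0.0], dtype=np.float32)  # all per-token losses are 0, as in this issue

# Value-based filter (assumed previous behavior): drops everything when every loss is 0
print(masked_lm_loss[masked_lm_loss > 0].size)                # 0 -> paddle.mean would fail

# Label-based filter (suggested fix): keeps one entry per non-ignored label
print(masked_lm_loss[masked_lm_labels != ignore_index].size)  # 2 -> mean is well defined (0.0)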

@w5688414 OK. It looks like this should guarantee that masked_lm_loss is not an empty tensor, as long as the dataset is processed correctly. I'll try it later, but note that this issue does not reproduce reliably.