PaddlePaddle/PaddleSeg

Error on Ubuntu: "label expected >= 0 and < 2, or == 255, but got 89", even though the model config and the dataset are both fine.

lingdujunshang opened this issue · 3 comments

Search before asking

Describe the Bug

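I'm a long-time Paddle user. I followed the "Quick Start" section of the tutorial directly: the model config is the default one, the data is the public optic_disc_seg dataset, and the settings in configs/quick_start/pp_liteseg_optic_disc_512x512_1k.yml are exactly as in the tutorial.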

First attempt, on Ubuntu. The error output is as follows:
(paddle_seg) xuqing@dell-PowerEdge-R740:/projects/paddle_seg/PaddleSeg$ python tools/train.py --config configs/quick_start/pp_liteseg_optic_disc_512x512_1k.yml --save_interval 500 --do_eval --use_vdl --save_dir output
2024-08-02 10:12:37 [WARNING] Add the num_classes in train_dataset and val_dataset config to model config. We suggest you manually set num_classes in model config.
2024-08-02 10:12:38 [INFO]
------------Environment Information-------------
platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python: 3.9.19 (main, Apr 6 2024, 17:57:55) [GCC 11.4.0]
Paddle compiled with cuda: True
NVCC: Build cuda_11.8.r11.8/compiler.31833905_0
cudnn: 8.6
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: NVIDIA GeForce', 'GPU 1: NVIDIA GeForce']
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PaddleSeg: 0.0.0.dev0
PaddlePaddle: 2.6.1
OpenCV: 4.10.0

2024-08-02 10:12:38 [INFO]
---------------Config Information---------------
batch_size: 4
iters: 1000
train_dataset:
  dataset_root: data/optic_disc_seg
  mode: train
  num_classes: 2
  train_path: data/optic_disc_seg/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 512
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.5
    contrast_range: 0.5
    saturation_range: 0.5
    type: RandomDistort
  - type: Normalize
  type: Dataset
val_dataset:
  dataset_root: data/optic_disc_seg
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/optic_disc_seg/val_list.txt
optimizer:
  momentum: 0.9
  type: SGD
  weight_decay: 4.0e-05
lr_scheduler:
  end_lr: 0
  learning_rate: 0.01
  power: 0.9
  type: PolynomialDecay
loss:
  coef:
  - 1
  - 1
  - 1
  types:
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
    type: STDC2
  num_classes: 2
  type: PPLiteSeg

2024-08-02 10:12:38 [INFO] Set device: gpu
2024-08-02 10:12:38 [INFO] Use the following config to build model
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
    type: STDC2
  num_classes: 2
  type: PPLiteSeg
W0802 10:12:38.203576 221128 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.2, Runtime API Version: 11.8
W0802 10:12:38.203653 221128 gpu_resources.cc:164] device: 0, cuDNN Version: 8.6.
2024-08-02 10:12:38 [INFO] Loading pretrained model from https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
2024-08-02 10:12:38 [INFO] There are 265/265 variables loaded into STDCNet.
2024-08-02 10:12:38 [INFO] Use the following config to build train_dataset
train_dataset:
  dataset_root: data/optic_disc_seg
  mode: train
  num_classes: 2
  train_path: data/optic_disc_seg/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 512
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.5
    contrast_range: 0.5
    saturation_range: 0.5
    type: RandomDistort
  - type: Normalize
  type: Dataset
2024-08-02 10:12:38 [INFO] Use the following config to build val_dataset
val_dataset:
  dataset_root: data/optic_disc_seg
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/optic_disc_seg/val_list.txt
2024-08-02 10:12:38 [INFO] If the type is SGD and momentum in optimizer config, the type is changed to Momentum.
2024-08-02 10:12:38 [INFO] Use the following config to build optimizer
optimizer:
  momentum: 0.9
  type: Momentum
  weight_decay: 4.0e-05
2024-08-02 10:12:38 [INFO] Use the following config to build loss
loss:
  coef:
  - 1
  - 1
  - 1
  types:
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
/home/xuqing/projects/paddle_seg/lib/python3.9/site-packages/paddle/nn/layer/norm.py:824: UserWarning: When training, we now always track global mean and variance.
  warnings.warn(
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
    Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
Traceback (most recent call last):
  File "/home/xuqing/projects/paddle_seg/PaddleSeg/tools/train.py", line 219, in <module>
    main(args)
  File "/home/xuqing/projects/paddle_seg/PaddleSeg/tools/train.py", line 193, in main
    train(
  File "/home/xuqing/projects/paddle_seg/PaddleSeg/paddleseg/core/train.py", line 247, in train
    loss.backward()
  File "/home/xuqing/projects/paddle_seg/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/xuqing/projects/paddle_seg/lib/python3.9/site-packages/paddle/base/wrapped_decorator.py", line 26, in impl
    return wrapped_func(*args, **kwargs)
  File "/home/xuqing/projects/paddle_seg/lib/python3.9/site-packages/paddle/base/framework.py", line 593, in impl
    return func(*args, **kwargs)
  File "/home/xuqing/projects/paddle_seg/lib/python3.9/site-packages/paddle/base/dygraph/tensor_patch_methods.py", line 342, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
OSError: (External) CUDA error(719), unspecified launch failure.
  [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:265)
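OK, so this gives us two pieces of information. (a) "The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value": from this I infer that either the label settings are wrong or the labels in the dataset are wrong; either way, the labels in the dataset do not match the configured num_classes. (b) The error comes from a .cc or .cu file, so Python-level debugging probably won't solve it, and if the problem is on the C or C++ side, that would be a real hassle.

So let's look at the dataset first. I opened a random image in optic_disc_seg/Annotations: the background is 0 and the annotated region is red. That felt wrong, because the data format I remember training successfully with before looked like this: if my labels are background, vehicle, and person, then in an image the background pixels are 0, the vehicle pixels are 1, and the person pixels are 2. Fine, so I modified the data by hand with np.clip(image, 0, 1), since there are only two classes anyway, and ran it again. It still fails with the same error as above. Running on multiple GPUs with python -m paddle.distributed.launch tools/train.py fails with the same label mismatch error.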
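A minimal sketch for checking the annotation values directly rather than by eye: it lists the raw values stored in a label image and flags anything outside the range named in the error message. The helper names `invalid_label_values` and `check_label_file` are hypothetical, and the sketch assumes NumPy and Pillow are installed. Keep in mind that an image viewer renders a palette-mode PNG with its palette colors, so a region that looks red may still store the class index 1.

```python
import numpy as np


def invalid_label_values(label, num_classes, ignore_index=255):
    """Return the sorted label values that the loss would reject.

    A value is accepted if 0 <= value < num_classes or value == ignore_index.
    """
    values = np.unique(np.asarray(label))
    out_of_range = (values < 0) | (values >= num_classes)
    bad = values[out_of_range & (values != ignore_index)]
    return bad.tolist()


def check_label_file(path, num_classes, ignore_index=255):
    """Open one annotation file and report its Pillow mode and invalid values."""
    # Imported lazily so the array helper above works without Pillow.
    from PIL import Image

    with Image.open(path) as img:
        # For mode "P" (palette) images this yields the stored indices,
        # not the displayed colors.
        return img.mode, invalid_label_values(np.asarray(img), num_classes, ignore_index)
```

For example, `check_label_file("data/optic_disc_seg/Annotations/<file>.png", num_classes=2)` (the file name is a placeholder) should return an empty list of invalid values if the annotation only contains 0, 1, and 255.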

OK, besides Ubuntu, my local machine is also an option, so I ran it directly on Windows. To my surprise, it runs without any problem. The output is as follows:
(paddleseg) D:\env\paddleseg\PaddleSeg>python tools/train.py --config configs/quick_start/pp_liteseg_optic_disc_512x512_1k.yml --save_interval 500 --do_eval --use_vdl --save_dir output
2024-08-02 10:33:12 [WARNING] Add the num_classes in train_dataset and val_dataset config to model config. We suggest you manually set num_classes in model config.
2024-08-02 10:33:12 [INFO]
------------Environment Information-------------
platform: Windows-10-10.0.19041-SP0
Python: 3.9.2 (tags/v3.9.2:1a79785, Feb 19 2021, 13:44:55) [MSC v.1928 64 bit (AMD64)]
Paddle compiled with cuda: True
NVCC: Build cuda_11.7.r11.7/compiler.31294372_0
cudnn: 8.4
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: NVIDIA GeForce']
GCC: gcc (MinGW-W64 x86_64-posix-seh, built by Brecht Sanders) 11.3.0
PaddleSeg: 2.9.0
PaddlePaddle: 2.5.2
OpenCV: 4.8.1

2024-08-02 10:33:12 [INFO]
---------------Config Information---------------
batch_size: 4
iters: 1000
train_dataset:
  dataset_root: data/optic_disc_seg
  mode: train
  num_classes: 2
  train_path: data/optic_disc_seg/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 512
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.5
    contrast_range: 0.5
    saturation_range: 0.5
    type: RandomDistort
  - type: Normalize
  type: Dataset
val_dataset:
  dataset_root: data/optic_disc_seg
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/optic_disc_seg/val_list.txt
optimizer:
  momentum: 0.9
  type: SGD
  weight_decay: 4.0e-05
lr_scheduler:
  end_lr: 0
  learning_rate: 0.01
  power: 0.9
  type: PolynomialDecay
loss:
  coef:
  - 1
  - 1
  - 1
  types:
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
    type: STDC2
  num_classes: 2
  type: PPLiteSeg

2024-08-02 10:33:12 [INFO] Set device: gpu
2024-08-02 10:33:12 [INFO] Use the following config to build model
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
    type: STDC2
  num_classes: 2
  type: PPLiteSeg
W0802 10:33:12.746259 17620 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.7
W0802 10:33:12.746259 17620 gpu_resources.cc:149] device: 0, cuDNN Version: 8.4.
2024-08-02 10:33:13 [INFO] Loading pretrained model from https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
Connecting to https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
Downloading PP_STDCNet2.tar.gz
[==================================================] 100.00%
Uncompress PP_STDCNet2.tar.gz
[==================================================] 100.00%
2024-08-02 10:33:15 [INFO] There are 265/265 variables loaded into STDCNet.
2024-08-02 10:33:15 [INFO] Use the following config to build train_dataset
train_dataset:
  dataset_root: data/optic_disc_seg
  mode: train
  num_classes: 2
  train_path: data/optic_disc_seg/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 512
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.5
    contrast_range: 0.5
    saturation_range: 0.5
    type: RandomDistort
  - type: Normalize
  type: Dataset
2024-08-02 10:33:15 [INFO] Use the following config to build val_dataset
val_dataset:
  dataset_root: data/optic_disc_seg
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/optic_disc_seg/val_list.txt
2024-08-02 10:33:15 [INFO] If the type is SGD and momentum in optimizer config, the type is changed to Momentum.
2024-08-02 10:33:15 [INFO] Use the following config to build optimizer
optimizer:
  momentum: 0.9
  type: Momentum
  weight_decay: 4.0e-05
2024-08-02 10:33:15 [INFO] Use the following config to build loss
loss:
  coef:
  - 1
  - 1
  - 1
  types:
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
D:\env\paddleseg\lib\site-packages\paddle\nn\layer\norm.py:777: UserWarning: When training, we now always track global mean and variance.
  warnings.warn(
2024-08-02 10:33:20 [INFO] [TRAIN] epoch: 1, iter: 10/1000, loss: 1.2764, lr: 0.009919, batch_cost: 0.3724, reader_cost: 0.02240, ips: 10.7403 samples/sec | ETA 00:06:08
2024-08-02 10:33:21 [INFO] [TRAIN] epoch: 1, iter: 20/1000, loss: 0.2465, lr: 0.009829, batch_cost: 0.1303, reader_cost: 0.00000, ips: 30.6941 samples/sec | ETA 00:02:07
2024-08-02 10:33:22 [INFO] [TRAIN] epoch: 1, iter: 30/1000, loss: 0.2122, lr: 0.009739, batch_cost: 0.1300, reader_cost: 0.00000, ips: 30.7799 samples/sec | ETA 00:02:06
2024-08-02 10:33:24 [INFO] [TRAIN] epoch: 1, iter: 40/1000, loss: 0.2306, lr: 0.009648, batch_cost: 0.1301, reader_cost: 0.00010, ips: 30.7349 samples/sec | ETA 00:02:04
2024-08-02 10:33:25 [INFO] [TRAIN] epoch: 1, iter: 50/1000, loss: 0.1755, lr: 0.009558, batch_cost: 0.1303, reader_cost: 0.00000, ips: 30.7027 samples/sec | ETA 00:02:03
2024-08-02 10:33:26 [INFO] [TRAIN] epoch: 1, iter: 60/1000, loss: 0.1643, lr: 0.009467, batch_cost: 0.1300, reader_cost: 0.00000, ips: 30.7676 samples/sec | ETA 00:02:02
2024-08-02 10:33:28 [INFO] [TRAIN] epoch: 2, iter: 70/1000, loss: 0.1220, lr: 0.009377, batch_cost: 0.1393, reader_cost: 0.00931, ips: 28.7197 samples/sec | ETA 00:02:09

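I also tried other custom datasets in various ways (not COCO-style data yet; just ordinary image segmentation data where the annotations are PNG and the images are JPG) and got the same result. In summary, I have two suspicions:
1. The older version of PaddleSeg runs the pp_liteseg model without problems, while the new version (the one I git-cloned on August 1) has a problem reading the image data in Annotations: either a grayscale image is read as a three-channel image, or some image package returns different results on Windows and on Ubuntu.
2. PaddleSeg on Windows and PaddleSeg on Ubuntu behave differently. My tentative guess is that something in the dataset-reading code prevents the configured labels from being matched to the annotation values in the PNGs on Ubuntu.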
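One way to test both suspicions is to decode the same annotation file on each machine and compare a fingerprint of the resulting array. The sketch below is only an illustration: `array_fingerprint` and `fingerprint_label` are hypothetical helper names, it assumes NumPy and Pillow are installed, and it does not reproduce PaddleSeg's own loading code.

```python
import hashlib

import numpy as np


def array_fingerprint(arr):
    """Summarize a decoded image so two machines can compare results."""
    arr = np.ascontiguousarray(arr)
    return {
        "shape": arr.shape,                # (H, W) for single-channel, (H, W, 3) for RGB
        "dtype": str(arr.dtype),
        "unique": np.unique(arr).tolist(), # distinct pixel values in the array
        "sha256": hashlib.sha256(arr.tobytes()).hexdigest(),
    }


def fingerprint_label(path):
    """Decode one annotation with Pillow and fingerprint the raw array."""
    # Imported lazily so array_fingerprint works without Pillow.
    from PIL import Image

    with Image.open(path) as img:
        info = array_fingerprint(np.asarray(img))
        info["mode"] = img.mode            # e.g. "P", "L", or "RGB"
    return info
```

If the shape, unique values, and hash match on Windows and Ubuntu, the decoded labels are identical and the difference lies elsewhere (for example in the PaddlePaddle or PaddleSeg versions, which differ between the two runs above). A three-element shape or unique values other than 0, 1, and 255 would support the first suspicion.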

Environment

platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python: 3.9.19 (main, Apr 6 2024, 17:57:55) [GCC 11.4.0]
Paddle compiled with cuda: True
NVCC: Build cuda_11.8.r11.8/compiler.31833905_0
cudnn: 8.6
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: NVIDIA GeForce', 'GPU 1: NVIDIA GeForce']
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PaddleSeg: 0.0.0.dev0
PaddlePaddle: 2.6.1
OpenCV: 4.10.0

Note: Ubuntu 22.04.
Installed according to the tutorial, and the installation check (sh tests/install/check_predict.sh) passes.

Bug description confirmation

  • I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

  • I'd like to help by submitting a PR!

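Hello, I did not run into this problem when training with Paddle 2.6 on Ubuntu. You could re-download the data following docs/quick_start.md and try again. This error looks like a dataset problem.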
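After re-downloading, the whole dataset can be scanned before training. The following is a rough sketch, not part of PaddleSeg: `parse_list_line` and `scan_dataset` are hypothetical names, it assumes each line of the list file holds an image path and a label path separated by a single space and relative to dataset_root, and it needs NumPy and Pillow.

```python
import os

import numpy as np


def parse_list_line(line, dataset_root, separator=" "):
    """Split one list-file line into absolute-ish (image_path, label_path)."""
    parts = line.strip().split(separator)
    if len(parts) != 2:
        raise ValueError(f"expected 'image{separator}label', got: {line!r}")
    return os.path.join(dataset_root, parts[0]), os.path.join(dataset_root, parts[1])


def scan_dataset(list_path, dataset_root, num_classes=2, ignore_index=255):
    """Return {label_path: [invalid values]} for every annotation that has any."""
    # Imported lazily so parse_list_line works without Pillow.
    from PIL import Image

    problems = {}
    with open(list_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            _, label_path = parse_list_line(line, dataset_root)
            with Image.open(label_path) as img:
                values = np.unique(np.asarray(img))
            bad = [int(v) for v in values
                   if not (0 <= v < num_classes or v == ignore_index)]
            if bad:
                problems[label_path] = bad
    return problems
```

Calling `scan_dataset("data/optic_disc_seg/train_list.txt", "data/optic_disc_seg")` and getting an empty dict would suggest that the files on disk are fine and that the out-of-range values arise somewhere later in the pipeline.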