Error during training: RuntimeError: CUDA error: too many resources requested for launch

Question

Opened this issue 2 years ago · 3 comments

Full error output:

ReResNet Orientation: 8 Fix Params: False
2022-06-25 00:20:44,437 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.7.r11.7/compiler.31294372_0
GPU 0: NVIDIA GeForce RTX 2080 Ti
GCC: gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.5.0
OpenCV: 4.6.0
MMCV: 0.6.2
MMDetection: 1.1.0+258d792
MMDetection Compiler: GCC 11.2
MMDetection CUDA Compiler: 11.7
------------------------------------------------------------

2022-06-25 00:20:44,437 - mmdet - INFO - Distributed training: False
2022-06-25 00:20:44,437 - mmdet - INFO - Config:
/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/configs/dota/r50_dotav1.py
work_dir = 'work_dirs/r50_dotav1/'

# model settings
norm_cfg = dict(type='GN', num_groups=32, requires_grad=True)

model = dict(
    type='OrientedRepPointsDetector',
    pretrained='torchvision://resnet50', 
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        style='pytorch',
    ),
    neck=
        dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs=True,
        num_outs=5,
        norm_cfg=norm_cfg
        ),
    bbox_head=dict(
        type='OrientedRepPointsHead',
        num_classes=16,
        in_channels=256,
        feat_channels=256,
        point_feat_channels=256,
        stacked_convs=3,
        num_points=9,
        gradient_mul=0.3,
        point_strides=[8, 16, 32, 64, 128],
        point_base_scale=2,
        norm_cfg=norm_cfg,
        loss_cls=dict(type='FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, loss_weight=1.0),
        loss_rbox_init=dict(type='GIoULoss', loss_weight=0.375),
        loss_rbox_refine=dict(type='GIoULoss', loss_weight=1.0),
        loss_spatial_init=dict(type='SpatialBorderLoss', loss_weight=0.05),
        loss_spatial_refine=dict(type='SpatialBorderLoss', loss_weight=0.1),
        top_ratio=0.4,))
# training and testing settings
train_cfg = dict(
    init=dict(
        assigner=dict(type='PointAssigner', scale=4, pos_num=1),  # select only one positive sample per gt box
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    refine=dict(
        assigner=dict(
            type='MaxIoUAssigner', #pre-assign to select more samples for samples selection
            pos_iou_thr=0.1,
            neg_iou_thr=0.1,
            min_pos_iou=0,
            ignore_iof_thr=-1),
        allowed_border=-1,
        pos_weight=-1,
        debug=False))

test_cfg = dict(
    nms_pre=2000,
    min_bbox_size=0,
    score_thr=0.05,
    nms=dict(type='rnms', iou_thr=0.4),
    max_per_img=2000)

# dataset settings
dataset_type = 'DotaDatasetv1'
data_root = '/home/r/文档/WPW/Remote/DataSets/Dota-v1.5/' #'data/dataset_demo_split/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='CorrectRBBox', correct_rbbox=True, refine_rbbox=True),
    dict(type='PolyResize',
        img_scale=[(1333, 768), (1333, 1280)],
        keep_ratio=True,
        multiscale_mode='range',
        clamp_rbbox=False),
    dict(type='PolyRandomFlip', flip_ratio=0.5),
    #dict(type='HSVAugment', hgain=0.015, sgain=0.7, vgain=0.4),
    #dict(type='PolyRandomRotate', rotate_ratio=0.5, angles_range=180, auto_bound=False),
    dict(type='Pad', size_divisor=32),
    #dict(type='Poly_Mosaic_RandomPerspective', mosaic_ratio=0.5, ifcrop=True, degrees=0, translate=0.1, scale=0.2, shear=0, perspective=0.0),
    #dict(type='MixUp', mixup_ratio=0.5),
    dict(type='PolyImgPlot', img_save_path=work_dir, save_img_num=16, class_num=15, thickness=2),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024, 1024),
        flip=False,
        transforms=[
            dict(type='PolyResize', keep_ratio=True),
            dict(type='PolyRandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']), 
            dict(type='Collect', keys=['img']),
        ])
]

data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'trainval_split/' + 'trainval.json',
        img_prefix=data_root + 'trainval_split/' + 'images/',
        pipeline=train_pipeline,
        Mosaic4=False,
        Mosaic9=False,
        Mixup=False),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'trainval_split/' + 'trainval.json',
        img_prefix=data_root + 'trainval_split/' + 'images/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'test_split/' + 'test.json',
        img_prefix=data_root + 'test_split/' + 'images/',
        pipeline=test_pipeline))

evaluation = dict(interval=1, metric='bbox')
# optimizer
optimizer = dict(type='AdamW', lr=0.0001, betas=(0.9, 0.999), weight_decay=0.05,
                paramwise_cfg=dict(custom_keys={'absolute_pos_embed': dict(decay_mult=0.),
                                                 'relative_position_bias_table': dict(decay_mult=0.),
                                                 'norm': dict(decay_mult=0.)}))
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    step=[24, 32, 38])
checkpoint_config = dict(interval=20)
# yapf:disable
log_config = dict(
    interval=1,          # log once every n iterations
    hooks=[
        dict(type='TextLoggerHook')
    ])
# yapf:enable
# runtime settings
total_epochs = 40
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None#'work_dirs/orientedreppoints_r50_demo/latest.pth'
workflow = [('train', 1)]


2022-06-25 00:20:44,666 - mmdet - INFO - load model from: torchvision://resnet50
2022-06-25 00:20:44,779 - mmdet - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

loading annotations into memory...
Done (t=4.07s)
creating index...
index created!
2022-06-25 00:20:50,462 - mmdet - INFO - Start running, host: r@4508, work_dir: /home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/work_dirs/r50_dotav1
2022-06-25 00:20:50,462 - mmdet - INFO - workflow: [('train', 1)], max: 40 epochs
Traceback (most recent call last):
  File "tools/train.py", line 154, in <module>
    main()
  File "tools/train.py", line 143, in main
    train_detector(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/apis/train.py", line 105, in train_detector
    _non_dist_train(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/apis/train.py", line 244, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/r/miniconda3/envs/orientedreppoints/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/r/miniconda3/envs/orientedreppoints/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 34, in train
    outputs = self.batch_processor(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/apis/train.py", line 75, in batch_processor
    losses = model(**data)
  File "/home/r/miniconda3/envs/orientedreppoints/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r/miniconda3/envs/orientedreppoints/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/r/miniconda3/envs/orientedreppoints/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/models/detectors/base.py", line 147, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/models/detectors/orientedreppoints_detector.py", line 31, in forward_train
    losses = self.bbox_head.loss(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/models/anchor_heads/orientedreppoints_head.py", line 388, in loss
    cls_reg_targets_refine = refine_pointset_target(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/bbox/pointset_target.py", line 148, in refine_pointset_target
    all_proposal_weights, pos_inds_list, neg_inds_list, all_gt_inds) = multi_apply(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/utils/misc.py", line 24, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/bbox/pointset_target.py", line 190, in refine_pointset_target_single
    assign_result = bbox_assigner.assign(proposals, gt_rbboxes,
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/bbox/assigners/max_iou_assigner.py", line 80, in assign
    assign_result = self.assign_wrt_overlaps(overlaps, gt_labels)
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/bbox/assigners/max_iou_assigner.py", line 92, in assign_wrt_overlaps
    assigned_gt_inds = overlaps.new_full((num_bboxes,),
RuntimeError: CUDA error: too many resources requested for launch

Answer 1 · 2022-12-12T07:12:08.000Z

Has this been resolved? I'm running into the same problem.

Answer 2 · 2024-07-12T08:27:42.000Z

Did you manage to solve this? I've hit the same error.

Answer 3 · 2024-07-14T17:08:41.000Z

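As far as I remember, I fixed it by reinstalling the environment from scratch. It was a while ago, so I may not remember the details accurately.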
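
For anyone hitting the same error, it may be worth checking the CUDA versions in the environment dump before reinstalling. In the log above, PyTorch reports "CUDA Runtime 10.1", while NVCC and the "MMDetection CUDA Compiler" both report 11.7. This is not a confirmed diagnosis, but a mismatch like that is one plausible reason the compiled ops misbehave, and it would fit with a clean reinstall making the error go away. Below is a minimal sketch that pulls those three versions out of an mmdet-style environment dump and flags a mismatch. The function names `cuda_versions` and `has_cuda_mismatch` are made up for illustration and are not part of mmdet or PyTorch; the sample text is copied from the log in the question.

```python
import re


def cuda_versions(env_text):
    """Extract the CUDA versions reported in an mmdet-style environment dump.

    Hypothetical helper: the patterns below only match the exact line
    formats seen in the log in this thread.
    """
    patterns = {
        # CUDA runtime that the PyTorch binary was built against
        "pytorch_runtime": r"CUDA Runtime\s+(\d+\.\d+)",
        # CUDA toolkit found under CUDA_HOME
        "nvcc": r"NVCC: Build cuda_(\d+\.\d+)",
        # CUDA toolkit used to compile the mmdet extension ops
        "mmdet_ops": r"MMDetection CUDA Compiler:\s*(\d+\.\d+)",
    }
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, env_text)
        if match:
            found[name] = match.group(1)
    return found


def has_cuda_mismatch(versions):
    """Return True if the extracted versions are not all identical."""
    return len(set(versions.values())) > 1


# Lines copied from the environment info in the question.
ENV_DUMP = """
NVCC: Build cuda_11.7.r11.7/compiler.31294372_0
  - CUDA Runtime 10.1
MMDetection CUDA Compiler: 11.7
"""

versions = cuda_versions(ENV_DUMP)
print(versions)
print("mismatch:", has_cuda_mismatch(versions))
```

If the check reports a mismatch, rebuilding the environment so that PyTorch and the compiled ops use the same CUDA toolkit version is probably the first thing to try.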