johnolafenwa/deepstack-trainer

Getting RuntimeError "a view of a leaf Variable that requires grad is being used in an in-place operation"


Hi all,

I'm trying to train a custom model locally but run into the following issue. Can someone tell me what I'm doing wrong?

(base) C:\Utils\deepstack-trainer>python train.py --dataset-path C:\Movie\Deepstack\datasets\vehicles
Using torch 1.11.0 CUDA:0 (NVIDIA GeForce GTX 1650 Ti, 4095MB)

Namespace(model='yolov5m', classes='', dataset_path='C:\\Movie\\Deepstack\\datasets\\vehicles', hyp='data/hyp.scratch.yaml', epochs=300, batch_size=16, img_size=[640, 640], rect=False, resume=False, nosave=False, notest=False, noautoanchor=False, evolve=False, bucket='', cache_images=False, image_weights=False, device='', multi_scale=False, single_cls=False, adam=False, sync_bn=False, local_rank=-1, log_imgs=16, workers=8, project='train-runs/vehicles', name='exp', exist_ok=False, cfg='.\\models\\yolov5m.yaml', weights='yolov5m.pt', data={'train': 'C:\\Movie\\Deepstack\\datasets\\vehicles\\train', 'val': 'C:\\Movie\\Deepstack\\datasets\\vehicles\\test', 'nc': 5, 'names': ['vehicle/car', 'vehicle/truck', 'vehicle/bus', 'vehicle/train', '']}, total_batch_size=16, world_size=1, global_rank=-1, save_dir='train-runs\\vehicles\\exp4')
Start Tensorboard with "tensorboard --logdir train-runs/vehicles", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=5

                 from  n    params  module                                  arguments
  0                -1  1      5280  models.common.Focus                     [3, 48, 3]
  1                -1  1     41664  models.common.Conv                      [48, 96, 3, 2]
  2                -1  1     67680  models.common.BottleneckCSP             [96, 96, 2]
  3                -1  1    166272  models.common.Conv                      [96, 192, 3, 2]
  4                -1  1    639168  models.common.BottleneckCSP             [192, 192, 6]
  5                -1  1    664320  models.common.Conv                      [192, 384, 3, 2]
  6                -1  1   2550144  models.common.BottleneckCSP             [384, 384, 6]
  7                -1  1   2655744  models.common.Conv                      [384, 768, 3, 2]
  8                -1  1   1476864  models.common.SPP                       [768, 768, [5, 9, 13]]
  9                -1  1   4283136  models.common.BottleneckCSP             [768, 768, 2, False]
 10                -1  1    295680  models.common.Conv                      [768, 384, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1   1219968  models.common.BottleneckCSP             [768, 384, 2, False]
 14                -1  1     74112  models.common.Conv                      [384, 192, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1    305856  models.common.BottleneckCSP             [384, 192, 2, False]
 18                -1  1    332160  models.common.Conv                      [192, 192, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1   1072512  models.common.BottleneckCSP             [384, 384, 2, False]
 21                -1  1   1327872  models.common.Conv                      [384, 384, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   4283136  models.common.BottleneckCSP             [768, 768, 2, False]
 24      [17, 20, 23]  1     40410  models.yolo.Detect                      [5, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [192, 384, 768]]
Traceback (most recent call last):
  File "C:\Utils\deepstack-trainer\train.py", line 530, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "C:\Utils\deepstack-trainer\train.py", line 90, in train
    model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc).to(device)  # create
  File "C:\Utils\deepstack-trainer\models\yolo.py", line 96, in __init__
    self._initialize_biases()  # only run once
  File "C:\Utils\deepstack-trainer\models\yolo.py", line 151, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
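For reference: this looks like the known in-place-op incompatibility that newer PyTorch releases surfaced in yolov5-derived code. Upstream ultralytics/yolov5 fixed it by routing the bias edits through .data, so autograd no longer sees an in-place write to a view of a leaf tensor. A minimal sketch of that patch, assuming models/yolo.py here still matches the upstream code it was forked from:

def _initialize_biases(self, cf=None):  # cf is optional class frequency
    # math and torch are already imported at the top of models/yolo.py
    m = self.model[-1]  # Detect() module
    for mi, s in zip(m.m, m.stride):
        b = mi.bias.view(m.na, -1)  # e.g. conv.bias(255) to (3, 85)
        b.data[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
        b.data[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls
        mi.bias = torch.nn.Parameter(b.view(-1), requires_grad=True)

Wrapping the two += lines in a with torch.no_grad(): block would work equally well.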

Rather than patching the trainer, I tried an older version of PyTorch, as marcokloeckler suggested:

pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

but got another error:

python train.py --dataset-path C:\Movie\Deepstack\datasets\vehicles
Using torch 1.7.1+cu110 CUDA:0 (NVIDIA GeForce GTX 1650 Ti, 4095MB)

Namespace(model='yolov5m', classes='', dataset_path='C:\\Movie\\Deepstack\\datasets\\vehicles', hyp='data/hyp.scratch.yaml', epochs=300, batch_size=16, img_size=[640, 640], rect=False, resume=False, nosave=False, notest=False, noautoanchor=False, evolve=False, bucket='', cache_images=False, image_weights=False, device='', multi_scale=False, single_cls=False, adam=False, sync_bn=False, local_rank=-1, log_imgs=16, workers=8, project='train-runs/vehicles', name='exp', exist_ok=False, cfg='.\\models\\yolov5m.yaml', weights='yolov5m.pt', data={'train': 'C:\\Movie\\Deepstack\\datasets\\vehicles\\train', 'val': 'C:\\Movie\\Deepstack\\datasets\\vehicles\\test', 'nc': 5, 'names': ['vehicle/car', 'vehicle/truck', 'vehicle/bus', 'vehicle/train', '']}, total_batch_size=16, world_size=1, global_rank=-1, save_dir='train-runs\\vehicles\\exp6')
Start Tensorboard with "tensorboard --logdir train-runs/vehicles", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=5

                 from  n    params  module                                  arguments
  0                -1  1      5280  models.common.Focus                     [3, 48, 3]
  1                -1  1     41664  models.common.Conv                      [48, 96, 3, 2]
  2                -1  1     67680  models.common.BottleneckCSP             [96, 96, 2]
  3                -1  1    166272  models.common.Conv                      [96, 192, 3, 2]
  4                -1  1    639168  models.common.BottleneckCSP             [192, 192, 6]
  5                -1  1    664320  models.common.Conv                      [192, 384, 3, 2]
  6                -1  1   2550144  models.common.BottleneckCSP             [384, 384, 6]
  7                -1  1   2655744  models.common.Conv                      [384, 768, 3, 2]
  8                -1  1   1476864  models.common.SPP                       [768, 768, [5, 9, 13]]
  9                -1  1   4283136  models.common.BottleneckCSP             [768, 768, 2, False]
 10                -1  1    295680  models.common.Conv                      [768, 384, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1   1219968  models.common.BottleneckCSP             [768, 384, 2, False]
 14                -1  1     74112  models.common.Conv                      [384, 192, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1    305856  models.common.BottleneckCSP             [384, 192, 2, False]
 18                -1  1    332160  models.common.Conv                      [192, 192, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1   1072512  models.common.BottleneckCSP             [384, 384, 2, False]
 21                -1  1   1327872  models.common.Conv                      [384, 384, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   4283136  models.common.BottleneckCSP             [768, 768, 2, False]
 24      [17, 20, 23]  1     40410  models.yolo.Detect                      [5, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [192, 384, 768]]
Model Summary: 391 layers, 21501978 parameters, 21501978 gradients, 51.4 GFLOPS

Transferred 506/514 items from yolov5m.pt
Optimizer groups: 86 .bias, 94 conv.weight, 83 other
Scanning 'C:\Movie\Deepstack\datasets\vehicles\train.cache' for images and labels... 286 found, 1 missing, 0 empty, 0 corrupted: 100%|████████████| 287/287 [00:00<?, ?it/s]
Scanning 'C:\Movie\Deepstack\datasets\vehicles\test.cache' for images and labels... 64 found, 0 missing, 0 empty, 0 corrupted: 100%|████████████████| 64/64 [00:00<?, ?it/s]
OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
[the Error #15/Hint pair above was printed 8 times in a row; repetitions trimmed]

Analyzing anchors... anchors/target = 4.80, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to train-runs\vehicles\exp6
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  0%|          | 0/18 [00:00<?, ?it/s]
Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

Any ideas?

The same dataset trains fine on Google Colab.

Thanks,

Alex

OK, I ended up setting the environment variable KMP_DUPLICATE_LIB_OK=TRUE. Then I got another crash (memory-allocation related), so I set --batch-size 8 and now it seems to run.
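For anyone hitting the same OMP error, a minimal sketch of the workaround as code, placed at the very top of train.py so it runs before torch and numpy load their bundled OpenMP runtimes (setting it in the shell with set KMP_DUPLICATE_LIB_OK=TRUE before launching works the same way). Per Intel's hint above, this is an unsafe, unsupported workaround:

import os

# Must run before importing torch/numpy, each of which may ship its own
# libiomp5md.dll. This tells Intel OpenMP to tolerate the duplicate runtime;
# Intel warns it can crash or silently produce incorrect results.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

The later memory crash is presumably the 4 GB GTX 1650 Ti running out of VRAM with yolov5m at batch size 16, which would explain why halving the batch size got it training.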

Alex