RAM usage continues to grow and the training process stopped without error !!!
JinYAnGHe opened this issue · 25 comments
During training, RAM usage continues to grow. Finally, the training process stopped. Is this a bug?
2021-07-23 14:46:56 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2000/3905, mem: 1730Mb, iter_time: 0.165s, data_time: 0.001s, total_loss: 3.3, iou_loss: 1.7, l1_loss: 0.0, conf_loss: 1.2, cls_loss: 0.5, lr: 9.955e-03, size: 320, ETA: 16:27:19
2021-07-23 14:47:04 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2050/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 3.0, iou_loss: 1.8, l1_loss: 0.0, conf_loss: 0.7, cls_loss: 0.5, lr: 9.955e-03, size: 320, ETA: 16:27:11
2021-07-23 14:47:13 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2100/3905, mem: 1730Mb, iter_time: 0.165s, data_time: 0.001s, total_loss: 2.9, iou_loss: 1.8, l1_loss: 0.0, conf_loss: 0.6, cls_loss: 0.5, lr: 9.954e-03, size: 320, ETA: 16:27:02
2021-07-23 14:47:21 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2150/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 3.3, iou_loss: 1.9, l1_loss: 0.0, conf_loss: 1.0, cls_loss: 0.5, lr: 9.954e-03, size: 320, ETA: 16:26:54
2021-07-23 14:47:30 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2200/3905, mem: 1730Mb, iter_time: 0.177s, data_time: 0.002s, total_loss: 2.3, iou_loss: 1.3, l1_loss: 0.0, conf_loss: 0.6, cls_loss: 0.4, lr: 9.954e-03, size: 320, ETA: 16:26:51
2021-07-23 14:47:38 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2250/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 2.9, iou_loss: 1.7, l1_loss: 0.0, conf_loss: 0.7, cls_loss: 0.5, lr: 9.953e-03, size: 320, ETA: 16:26:43
2021-07-23 14:47:47 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2300/3905, mem: 1730Mb, iter_time: 0.168s, data_time: 0.001s, total_loss: 2.4, iou_loss: 1.5, l1_loss: 0.0, conf_loss: 0.5, cls_loss: 0.4, lr: 9.953e-03, size: 320, ETA: 16:26:36
------------------------stopped here--------------------
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:3D:00.0 Off | N/A |
| 28% 50C P2 109W / 250W | 2050MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:3E:00.0 Off | N/A |
| 27% 49C P2 103W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... On | 00000000:41:00.0 Off | N/A |
| 25% 48C P2 117W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... On | 00000000:42:00.0 Off | N/A |
| 28% 50C P2 113W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... On | 00000000:44:00.0 Off | N/A |
| 16% 27C P8 21W / 250W | 11MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... On | 00000000:45:00.0 Off | N/A |
| 28% 50C P2 110W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208... On | 00000000:46:00.0 Off | N/A |
| 24% 47C P2 95W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce RTX 208... On | 00000000:47:00.0 Off | N/A |
| 26% 49C P2 99W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
Is it that RAM keeps growing until it overflows, causing the process to stop?
Someone meets the same issue in #91. We have no idea what happened yet, as we cannot reproduce this bug. Could you set num_workers to 0 and try it again?
What's the version of your torch? And could you provide more information about your environment so we can reproduce this problem? Thx~
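In case it helps, a minimal sketch of the suggested test, assuming the standard YOLOX exp layout where `data_num_workers` controls the DataLoader worker count (adjust to your own exp file):

```python
# Sketch only: override the worker count in a custom exp so the dataloader
# runs in the main process, which helps rule out worker-side leaks.
from yolox.exp import Exp as MyExp


class Exp(MyExp):
    def __init__(self):
        super().__init__()
        self.data_num_workers = 0  # assumed attribute name from the base Exp
```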
I'm hitting the same problem: when training on a custom COCO-format dataset, memory usage keeps growing until it overflows. The environment is Ubuntu 18.04 with Anaconda, PyTorch 1.8.1 (py3.7_cuda10.2_cudnn7.6.5_0). The problem shows up with both single-GPU and 4-GPU training.
Thx too~
My ENV:
Ubuntu 16.04
CUDA 10.2
RTX 2080Ti
python 3.8
torch 1.8.0
torchvision 0.9.0
By the way:
Maybe you could provide a more compatible requirements.txt, listing the version of each package, the environments it is known to work in, and so on?
I'm also training on a custom COCO-format dataset, but it doesn't seem to be related to the dataset itself; #91 uses VOC.
I trained on a custom VOC-format dataset and hit the same problem; it stopped at 204/300.
OK, keep this issue open and we'll try to reproduce it first.
Thx for your advice. We are still working out the oldest working version of each package.
Thank you for providing such a great project. Looking forward to the good news.
@JinYAnGHe @Hiwyl @VantastyChen How much RAM do you have for training?
32 GB of RAM, batch size 32, with the yolox-l network. It uses about 14 GB at the start of training and then slowly increases.
@ruinmessi 128GB
We have reproduced this problem and are trying to fix it now.
Hey, guys! We haven't fixed the memory leak itself yet, but we have changed the multiprocessing backend from spawn to subprocess. In our tests, memory still increases during training, but the training process no longer crashes. So we recommend that you pull the latest update, reinstall YOLOX, and then retry your exp.
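For anyone unfamiliar with the two launch styles, here is a rough, hypothetical illustration of the difference; `train.py` and `--rank` are placeholders, not YOLOX's actual entry point or flags:

```python
import subprocess
import sys

import torch.multiprocessing as mp


def worker(rank: int) -> None:
    print(f"worker {rank} started")


def launch_with_spawn(num_gpus: int) -> None:
    # spawn: child processes are created from the parent interpreter and
    # inherit its Python-level state via pickling/re-import.
    mp.spawn(worker, nprocs=num_gpus)


def launch_with_subprocess(num_gpus: int) -> None:
    # subprocess: each rank is an independent Python process with its own
    # memory space, started from a fresh interpreter.
    procs = [
        subprocess.Popen([sys.executable, "train.py", f"--rank={rank}"])
        for rank in range(num_gpus)
    ]
    for p in procs:
        p.wait()
```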
Thx.
Do you have a specific setup in which this problem doesn't occur?
I use the nvcr.io/nvidia/pytorch:21.06-py3 Docker image and am able to reproduce the problem.
torch version: 1.9.0a0+c3d40fd
cuda: 11.3, V11.3.109
cudnn: 8.2.1
Sorry, guys... we found some errors in the above update and have to revert to the original spawn version. We are currently rewriting the whole dataloader to work around this bug.
@ruinmessi Locally, I reworked the COCODataset to create numpy arrays of all the info the COCO class provides up front. When I tested this version of the COCODataset by itself using this notebook: https://gist.github.com/mprostock/2850f3cd465155689052f0fa3a177a50, I saw that the original version "leaks" memory and that my numpy-array-based one does not. However, when I run it in YOLOX, I still get the memory leak. So maybe something downstream, like MosaicDetection, is also contributing to this problem?
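For reference, a minimal sketch of the idea (illustrative only, not the actual YOLOX COCODataset): copy everything the pycocotools COCO object provides into contiguous numpy arrays at construction time, so forked DataLoader workers don't trigger refcount-driven copy-on-write growth by touching lists of small Python dicts.

```python
import numpy as np
from torch.utils.data import Dataset


class ArrayBackedDataset(Dataset):
    """Stores COCO annotations as flat numpy arrays instead of Python objects."""

    def __init__(self, coco):
        recs, offsets = [], [0]
        for img_id in sorted(coco.getImgIds()):
            anns = coco.loadAnns(coco.getAnnIds(imgIds=[img_id]))
            for a in anns:
                x, y, w, h = a["bbox"]
                recs.append([x, y, x + w, y + h, a["category_id"]])
            offsets.append(len(recs))
        # One contiguous array: reading rows in a worker does not bump the
        # refcount of thousands of small Python objects.
        self.annos = np.asarray(recs, dtype=np.float32)
        self.offsets = np.asarray(offsets, dtype=np.int64)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        return self.annos[self.offsets[idx]:self.offsets[idx + 1]]
```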
First, thank you for your awesome work. And sorry to bother the author, but I'd like to know which part of the code caused the problem, since you said you found the reason.
@yonomitt @VantastyChen @JinYAnGHe Finally... we fixed the bug, and the training memory curve now looks like this (yolox-tiny with batch size 128 on 8 GPUs):
Well, we found two major issues. One is in our COCO dataset: we must load all the annotations during initialization. The other is the loss logging: we have to detach the loss and log its value rather than the tensor itself.
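To make the logging point concrete, a minimal sketch (the `log_history` name is illustrative, not the actual YOLOX meter):

```python
import torch

log_history = []


def log_iteration(loss: torch.Tensor) -> None:
    # Appending the live tensor would keep its whole computation graph alive
    # for every logged iteration; detach().item() stores a plain float instead.
    log_history.append({"total_loss": loss.detach().item()})


# Usage
pred = torch.randn(4, requires_grad=True)
loss = (pred ** 2).mean()
log_iteration(loss)
loss.backward()
```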
Thanks for your awesome work!
Thanks!
Could you kindly tell me how you implemented this part? I am getting stuck on this problem.
@GOATmessi7 I am facing a similar memory leak issue. Could you provide your solution here? Thanks