Megvii-BaseDetection/YOLOX

RAM usage continues to grow and the training process stopped without error !!!

JinYAnGHe opened this issue · 25 comments

During training, RAM usage continues to grow. Finally, the training process stopped. Is it a bug?

2021-07-23 14:46:56 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2000/3905, mem: 1730Mb, iter_time: 0.165s, data_time: 0.001s, total_loss: 3.3, iou_loss: 1.7, l1_loss: 0.0, conf_loss: 1.2, cls_loss: 0.5, lr: 9.955e-03, size: 320, ETA: 16:27:19
2021-07-23 14:47:04 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2050/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 3.0, iou_loss: 1.8, l1_loss: 0.0, conf_loss: 0.7, cls_loss: 0.5, lr: 9.955e-03, size: 320, ETA: 16:27:11
2021-07-23 14:47:13 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2100/3905, mem: 1730Mb, iter_time: 0.165s, data_time: 0.001s, total_loss: 2.9, iou_loss: 1.8, l1_loss: 0.0, conf_loss: 0.6, cls_loss: 0.5, lr: 9.954e-03, size: 320, ETA: 16:27:02
2021-07-23 14:47:21 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2150/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 3.3, iou_loss: 1.9, l1_loss: 0.0, conf_loss: 1.0, cls_loss: 0.5, lr: 9.954e-03, size: 320, ETA: 16:26:54
2021-07-23 14:47:30 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2200/3905, mem: 1730Mb, iter_time: 0.177s, data_time: 0.002s, total_loss: 2.3, iou_loss: 1.3, l1_loss: 0.0, conf_loss: 0.6, cls_loss: 0.4, lr: 9.954e-03, size: 320, ETA: 16:26:51
2021-07-23 14:47:38 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2250/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 2.9, iou_loss: 1.7, l1_loss: 0.0, conf_loss: 0.7, cls_loss: 0.5, lr: 9.953e-03, size: 320, ETA: 16:26:43
2021-07-23 14:47:47 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2300/3905, mem: 1730Mb, iter_time: 0.168s, data_time: 0.001s, total_loss: 2.4, iou_loss: 1.5, l1_loss: 0.0, conf_loss: 0.5, cls_loss: 0.4, lr: 9.953e-03, size: 320, ETA: 16:26:36
------------------------stopped here--------------------

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:3D:00.0 Off | N/A |
| 28% 50C P2 109W / 250W | 2050MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:3E:00.0 Off | N/A |
| 27% 49C P2 103W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... On | 00000000:41:00.0 Off | N/A |
| 25% 48C P2 117W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... On | 00000000:42:00.0 Off | N/A |
| 28% 50C P2 113W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... On | 00000000:44:00.0 Off | N/A |
| 16% 27C P8 21W / 250W | 11MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... On | 00000000:45:00.0 Off | N/A |
| 28% 50C P2 110W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208... On | 00000000:46:00.0 Off | N/A |
| 24% 47C P2 95W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce RTX 208... On | 00000000:47:00.0 Off | N/A |
| 26% 49C P2 99W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+

Is it that RAM keeps growing until it overflows, which causes the program to stop?
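
One way to check this is to log host memory next to the trainer output. The `mem` value in the log above stays at 1730Mb and appears to track GPU memory, so it would not show growth in host RAM. The sketch below is a minimal, Unix-only illustration using the standard `resource` module; `peak_rss_mb` is a hypothetical helper name and not part of YOLOX. It measures only the calling process, so memory held by running DataLoader workers is not included.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


held = []  # stand-in for state that is never released
for step in range(3):
    held.append(bytearray(b"x" * (8 * 1024 * 1024)))
    print(f"step {step}: peak host RSS = {peak_rss_mb():.0f} MiB")
```

Calling such a helper every few hundred iterations would show whether host RAM climbs steadily while the GPU figure stays flat.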

Someone hit the same issue in #91. We don't know what is happening yet because we cannot reproduce the bug. Could you set num_workers to 0 and try again?

I have tried. The result is the same as in #91: RAM still keeps growing, just more slowly than with num_workers = 4.

What's the version of your torch? And could you provide more information about your env so we can reproduce this problem? Thx~

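I ran into the same problem. When training on a custom COCO-format dataset, memory usage keeps growing until it overflows. My environment is Ubuntu 18.04 with Anaconda, PyTorch 1.8.1 (py3.7_cuda10.2_cudnn7.6.5_0). It happens with both a single GPU and four GPUs.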

THx too~
My ENV:
Ubuntu 16.04
CUDA 10.2
RTX 2080Ti
python 3.8
torch 1.8.0
torchvision 0.9.0

By the way:
Could you perhaps provide a more compatible requirement.txt that lists each package's version, the compatible environments, and so on?

I am also training on a custom COCO-format dataset, but it doesn't seem to be related to the dataset: #91 uses VOC.

Hiwyl commented

I hit the same problem when training on custom VOC-format data. It stopped at 204/300.

OK, we'll keep this issue open and try to reproduce it first.

Thx for your advice. We are still working out the oldest working version of each package.

Thank you for providing such a great project and waiting for the good news.

@JinYAnGHe @Hiwyl @VantastyChen How much RAM do you have for training?

32 GB of RAM, batch size 32, yolox-l network. Training starts at about 14 GB of usage, and then it slowly increases.

@ruinmessi 128GB

We have reproduced this problem and are trying to fix it now.

Hey, guys! We have not fixed the memory leak yet, but we changed the multiprocess backend from spawn to subprocess. In our tests, memory still increases during training, but the training process no longer crashes. We recommend pulling the latest update, reinstalling yolox, and retrying your experiment.

Thx.

Do you have a specific setup in which this problem doesn't occur?
I use the nvcr.io/nvidia/pytorch:21.06-py3 docker image and am able to reproduce the problem.
torch version: 1.9.0a0+c3d40fd
cuda: 11.3, V11.3.109
cudnn: 8.2.1

Sorry, guys. We found some errors in the update above and had to revert to the original spawn version. We are currently rewriting the whole dataloader to work around this bug.

@ruinmessi locally, I reworked the COCODataset to create numpy arrays of all the info the COCO class provides upfront. When I tested this version of the COCODataset by itself using this notebook:

https://gist.github.com/mprostock/2850f3cd465155689052f0fa3a177a50

I see that the original version "leaks" memory and that my numpy array-based one does not. However, when I run it in YOLOX, I still get the memory leak. So maybe something downstream like the MosaicDetection is also contributing to this problem?
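
For readers who want to see the idea without opening the notebook, here is a minimal sketch of packing annotations into numpy arrays upfront. It is not the code from the gist: `PackedAnnotations` and its row layout `[x1, y1, x2, y2, class_id]` are hypothetical names chosen for illustration. The usual explanation for why this helps, which is an assumption here and not something confirmed in this thread, is that forked DataLoader workers touch reference counts when they read Python lists and dicts, so shared pages get copied over time, whereas one numpy buffer is read without touching per-object counters.

```python
import numpy as np


class PackedAnnotations:
    """Hypothetical container: all boxes in one array plus per-image offsets."""

    def __init__(self, per_image_boxes):
        # per_image_boxes: one entry per image, each a list of
        # [x1, y1, x2, y2, class_id] rows (an assumed layout).
        counts = np.asarray([len(boxes) for boxes in per_image_boxes], dtype=np.int64)
        # offsets[i]:offsets[i + 1] is the slice of rows belonging to image i.
        self.offsets = np.zeros(len(counts) + 1, dtype=np.int64)
        self.offsets[1:] = np.cumsum(counts)
        rows = [row for boxes in per_image_boxes for row in boxes]
        self.boxes = np.asarray(rows, dtype=np.float32).reshape(-1, 5)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, index):
        start, end = self.offsets[index], self.offsets[index + 1]
        # Copy so that later augmentation cannot modify the shared buffer.
        return self.boxes[start:end].copy()


annotations = PackedAnnotations([[[0, 0, 10, 10, 1], [5, 5, 20, 20, 2]], [], [[1, 2, 3, 4, 0]]])
print(len(annotations), annotations[0].shape, annotations[1].shape)
```

A dataset's `__getitem__` could then read its labels from such a container and never touch the original COCO dictionaries after initialization.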

F0xZz commented

First, thank you for your awesome work. Sorry to bother you, but you said you found the cause of the problem. Which part of the code caused it?

@yonomitt @VantastyChen @JinYAnGHe Finally, we fixed the bug. The training memory curve now looks like this (yolox-tiny, batch size 128 on 8 GPUs):
[image: training memory curve]

Well, we found two major issues. One is in our COCO dataset: we must load all the annotations during initialization. The other is in the loss logging: we have to detach the loss and log its value instead of the tensor itself.
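
The second point can be illustrated with a minimal PyTorch sketch. This is not the YOLOX patch itself: the model, the `history` list, and the `loss_for_logging` helper are hypothetical stand-ins, and the sketch only assumes that the trainer keeps logged values around between iterations.

```python
import torch


def loss_for_logging(loss: torch.Tensor) -> float:
    """Return a plain Python float that holds no reference to the autograd graph."""
    return loss.detach().item()


model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
history = []  # hypothetical log buffer that outlives each iteration

for _ in range(3):
    inputs = torch.randn(8, 4)
    targets = torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Appending `loss` itself would keep each iteration's tensor, and whatever
    # it still references, alive for as long as `history` exists.
    history.append(loss_for_logging(loss))

print(history)
```

Storing floats means the log buffer grows by one small Python object per entry, regardless of what the loss tensor was attached to.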

F0xZz commented

Thanks for your awesome work!

Thanks!

Could you kindly tell me how you implemented this part?
I am stuck on this problem.

@GOATmessi7 I am facing a similar memory leak issue. Could you provide your solution here? Thanks