xlvector/learning-dl

Memory issue

Closed · 14 comments

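Hello! I tried training the CNN approach, added upper- and lowercase letters, and changed the output to 62 classes.
Memory filled up very quickly. It looks as if samples that have already been used are not being released. I have only just started using mxnet: do used samples need to be cleaned up explicitly, and where in the code should I change that? Thanks.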

Is it main memory (RAM) that is full, or GPU memory?

Thank you very much for replying so quickly! It is RAM that is full, and a lot of swap is in use as well.

How about pushing your program to one of your own projects so I can take a look?

I ran your program, and it seems to work fine:

2016-07-19 15:50:34,197 Epoch[0] Batch [50] Speed: 246.42 samples/sec Train-Accuracy=0.000000
2016-07-19 15:50:47,314 Epoch[0] Batch [100] Speed: 243.97 samples/sec Train-Accuracy=0.000000
2016-07-19 15:51:00,176 Epoch[0] Batch [150] Speed: 248.80 samples/sec Train-Accuracy=0.000000
2016-07-19 15:51:13,019 Epoch[0] Batch [200] Speed: 249.15 samples/sec Train-Accuracy=0.000000
2016-07-19 15:51:25,883 Epoch[0] Batch [250] Speed: 248.77 samples/sec Train-Accuracy=0.000000
2016-07-19 15:51:38,661 Epoch[0] Batch [300] Speed: 250.44 samples/sec Train-Accuracy=0.000000
2016-07-19 15:51:51,498 Epoch[0] Batch [350] Speed: 249.28 samples/sec Train-Accuracy=0.000000
2016-07-19 15:52:04,362 Epoch[0] Batch [400] Speed: 248.76 samples/sec Train-Accuracy=0.000000
2016-07-19 15:52:17,426 Epoch[0] Batch [450] Speed: 244.95 samples/sec Train-Accuracy=0.000000
2016-07-19 15:52:30,335 Epoch[0] Batch [500] Speed: 247.90 samples/sec Train-Accuracy=0.000000
2016-07-19 15:52:42,930 Epoch[0] Batch [550] Speed: 254.06 samples/sec Train-Accuracy=0.000000
2016-07-19 15:52:55,559 Epoch[0] Batch [600] Speed: 253.40 samples/sec Train-Accuracy=0.000000
2016-07-19 15:53:08,168 Epoch[0] Batch [650] Speed: 253.80 samples/sec Train-Accuracy=0.000000
2016-07-19 15:53:20,729 Epoch[0] Batch [700] Speed: 254.75 samples/sec Train-Accuracy=0.000313
2016-07-19 15:53:33,342 Epoch[0] Batch [750] Speed: 253.71 samples/sec Train-Accuracy=0.000937
2016-07-19 15:53:45,945 Epoch[0] Batch [800] Speed: 253.91 samples/sec Train-Accuracy=0.000313
2016-07-19 15:53:58,689 Epoch[0] Batch [850] Speed: 251.10 samples/sec Train-Accuracy=0.001875
2016-07-19 15:54:11,265 Epoch[0] Batch [900] Speed: 254.46 samples/sec Train-Accuracy=0.002812
2016-07-19 15:54:23,902 Epoch[0] Batch [950] Speed: 253.22 samples/sec Train-Accuracy=0.004375
2016-07-19 15:54:36,490 Epoch[0] Batch [1000] Speed: 254.22 samples/sec Train-Accuracy=0.005000
2016-07-19 15:54:49,109 Epoch[0] Batch [1050] Speed: 253.59 samples/sec Train-Accuracy=0.009687
2016-07-19 15:55:01,686 Epoch[0] Batch [1100] Speed: 254.44 samples/sec Train-Accuracy=0.009687
2016-07-19 15:55:14,250 Epoch[0] Batch [1150] Speed: 254.69 samples/sec Train-Accuracy=0.013125
2016-07-19 15:55:26,876 Epoch[0] Batch [1200] Speed: 253.45 samples/sec Train-Accuracy=0.017812
2016-07-19 15:55:39,485 Epoch[0] Batch [1250] Speed: 253.78 samples/sec Train-Accuracy=0.023750
2016-07-19 15:55:52,098 Epoch[0] Batch [1300] Speed: 253.72 samples/sec Train-Accuracy=0.025625
2016-07-19 15:56:04,701 Epoch[0] Batch [1350] Speed: 253.91 samples/sec Train-Accuracy=0.032813
2016-07-19 15:56:17,276 Epoch[0] Batch [1400] Speed: 254.48 samples/sec Train-Accuracy=0.037812
2016-07-19 15:56:29,908 Epoch[0] Batch [1450] Speed: 253.32 samples/sec Train-Accuracy=0.037812
2016-07-19 15:56:42,573 Epoch[0] Batch [1500] Speed: 252.68 samples/sec Train-Accuracy=0.047500
2016-07-19 15:56:55,174 Epoch[0] Batch [1550] Speed: 253.95 samples/sec Train-Accuracy=0.053437
2016-07-19 15:57:07,824 Epoch[0] Batch [1600] Speed: 252.97 samples/sec Train-Accuracy=0.054062
2016-07-19 15:57:20,399 Epoch[0] Batch [1650] Speed: 254.47 samples/sec Train-Accuracy=0.059687
2016-07-19 15:57:33,001 Epoch[0] Batch [1700] Speed: 253.95 samples/sec Train-Accuracy=0.068125
2016-07-19 15:57:45,606 Epoch[0] Batch [1750] Speed: 253.86 samples/sec Train-Accuracy=0.079375
2016-07-19 15:57:58,208 Epoch[0] Batch [1800] Speed: 253.92 samples/sec Train-Accuracy=0.080937
2016-07-19 15:58:10,817 Epoch[0] Batch [1850] Speed: 253.79 samples/sec Train-Accuracy=0.085312
2016-07-19 15:58:23,393 Epoch[0] Batch [1900] Speed: 254.46 samples/sec Train-Accuracy=0.085312
2016-07-19 15:58:36,003 Epoch[0] Batch [1950] Speed: 253.77 samples/sec Train-Accuracy=0.097812
2016-07-19 15:58:48,602 Epoch[0] Batch [2000] Speed: 253.99 samples/sec Train-Accuracy=0.107500
2016-07-19 15:59:01,209 Epoch[0] Batch [2050] Speed: 253.84 samples/sec Train-Accuracy=0.114687
2016-07-19 15:59:13,818 Epoch[0] Batch [2100] Speed: 253.79 samples/sec Train-Accuracy=0.115937
2016-07-19 15:59:26,415 Epoch[0] Batch [2150] Speed: 254.04 samples/sec Train-Accuracy=0.130625
2016-07-19 15:59:38,992 Epoch[0] Batch [2200] Speed: 254.43 samples/sec Train-Accuracy=0.123438
2016-07-19 15:59:51,615 Epoch[0] Batch [2250] Speed: 253.52 samples/sec Train-Accuracy=0.135000
2016-07-19 16:00:04,235 Epoch[0] Batch [2300] Speed: 253.56 samples/sec Train-Accuracy=0.129688
2016-07-19 16:00:16,868 Epoch[0] Batch [2350] Speed: 253.31 samples/sec Train-Accuracy=0.144375
2016-07-19 16:00:29,472 Epoch[0] Batch [2400] Speed: 253.90 samples/sec Train-Accuracy=0.139687
2016-07-19 16:00:42,225 Epoch[0] Batch [2450] Speed: 250.92 samples/sec Train-Accuracy=0.148750
2016-07-19 16:00:54,825 Epoch[0] Batch [2500] Speed: 253.97 samples/sec Train-Accuracy=0.163438
2016-07-19 16:01:07,417 Epoch[0] Batch [2550] Speed: 254.14 samples/sec Train-Accuracy=0.173750
2016-07-19 16:01:20,032 Epoch[0] Batch [2600] Speed: 253.67 samples/sec Train-Accuracy=0.159688
2016-07-19 16:01:33,207 Epoch[0] Batch [2650] Speed: 242.88 samples/sec Train-Accuracy=0.172813
2016-07-19 16:01:46,281 Epoch[0] Batch [2700] Speed: 244.76 samples/sec Train-Accuracy=0.180938
2016-07-19 16:01:59,411 Epoch[0] Batch [2750] Speed: 243.73 samples/sec Train-Accuracy=0.180938
2016-07-19 16:02:12,292 Epoch[0] Batch [2800] Speed: 248.43 samples/sec Train-Accuracy=0.210000
2016-07-19 16:02:25,260 Epoch[0] Batch [2850] Speed: 246.76 samples/sec Train-Accuracy=0.203750
2016-07-19 16:02:38,286 Epoch[0] Batch [2900] Speed: 245.67 samples/sec Train-Accuracy=0.205625

My memory usage has been growing ever since the run started. I'll post a few screenshots so you can see.

Isn't this running just fine?

Memory usage keeps shooting up. A little longer and it will be full, and then running one batch takes half an hour.

Judging from your log, isn't it 6 minutes per epoch?

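At the start, while memory has not run out yet, it is fast. Later, once memory is exhausted, every allocation has to go to swap, and it becomes extremely slow.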

It may be a problem with python-captcha. You can use https://pypi.python.org/pypi/memory_profiler to do a memory profile.
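For readers who want to try this kind of check without installing anything, here is a minimal sketch that uses the standard-library `tracemalloc` module as a stand-in for memory_profiler (a swapped-in technique, not the tool suggested above). The function `generate_sample` and the list `_leaked` are hypothetical placeholders for a sample generator that leaks; they are not code from this repository or from python-captcha. Note that `tracemalloc` only sees allocations made through Python's allocators, so a leak inside a C library might not show up here.

```python
import tracemalloc

_leaked = []  # hypothetical leak: references that are kept forever


def generate_sample(size=100_000):
    """Hypothetical stand-in for one sample-generation call that leaks."""
    data = bytearray(size)
    _leaked.append(data)  # the "bug": the buffer is never released
    return data


def top_growth(func, calls=50, limit=3):
    """Call func repeatedly; return the source lines whose allocations grew most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(calls):
        func()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to sorts by absolute size difference, largest first
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in top_growth(generate_sample):
        print(stat)
```

If the first line reported keeps growing in proportion to the number of calls, that line is a likely place to look for objects that are never freed.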

I pulled python-captcha out and ran it on its own, and it is indeed a memory leak! Thanks.
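An isolation test like the one described here can be sketched roughly as follows: call the suspect function in a loop, with no training code involved, and watch the process's peak resident memory. This is only an illustration under stated assumptions. `leaky_generate` is a hypothetical stand-in for the captcha call, not the real python-captcha API, and the `resource` module is available on Unix-like systems only.

```python
import resource

_kept = []  # hypothetical leak: every buffer stays referenced


def leaky_generate():
    """Hypothetical stand-in for the captcha call; keeps each 1 MB buffer alive."""
    _kept.append(b"x" * 1_000_000)


def peak_rss():
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_growth(func, calls=50, report_every=10):
    """Call func in a loop and record peak RSS at regular intervals."""
    readings = [peak_rss()]
    for i in range(1, calls + 1):
        func()
        if i % report_every == 0:
            readings.append(peak_rss())
    return readings


if __name__ == "__main__":
    print(measure_growth(leaky_generate))
```

Readings that keep climbing for as long as the loop runs point to a leak in the function under test; readings that level off after a warm-up suggest the growth comes from somewhere else.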

👍