santi-pdp/segan

How much memory and time is needed to train?

Mauker1 opened this issue · 4 comments

Hello,

First of all, congratulations on the excellent results with SEGAN.

I'm trying to run train_segan.sh using the test samples. My setup includes a single GeForce GTX 1060 with 6 GB of memory, but it was unable to run with a batch size of 100. So, how much memory does a GPU need to run the algorithm?

I tried it with a batch size of 70 and it ran, but it seems to be taking forever to finish.

I noticed that each log line starts with a two-number expression like "2/104250.0", as you can see in the image below. Does that mean I have 104250 batches to process?

[Screenshot: SEGAN training log output]

Hi there!

That big number is the total number of batches needed to complete your N epochs with batch size B (70 in your case). I see your mtime/batch is a bit high; are you sure you're using the GPU for the computations? As an example, look at my log with a batch size of 70:

0/59770 (epoch 0), d_rl_loss = 0.99999, d_fk_loss = 0.00000, g_adv_loss = 0.99866, g_l1_loss = 0.00000, time/batch = 3.94123, mtime/batch = 3.94123
1/59770 (epoch 0), d_rl_loss = 0.99865, d_fk_loss = 0.00000, g_adv_loss = 0.99692, g_l1_loss = 0.00000, time/batch = 1.82510, mtime/batch = 2.88316
2/59770 (epoch 0), d_rl_loss = 0.99692, d_fk_loss = 0.00000, g_adv_loss = 0.99504, g_l1_loss = 0.00000, time/batch = 0.94474, mtime/batch = 2.23702
3/59770 (epoch 0), d_rl_loss = 0.99504, d_fk_loss = 0.00001, g_adv_loss = 0.99309, g_l1_loss = 0.00000, time/batch = 1.00357, mtime/batch = 1.92866
4/59770 (epoch 0), d_rl_loss = 0.99310, d_fk_loss = 0.00001, g_adv_loss = 0.99108, g_l1_loss = 0.00000, time/batch = 1.01260, mtime/batch = 1.74545
5/59770 (epoch 0), d_rl_loss = 0.99108, d_fk_loss = 0.00002, g_adv_loss = 0.98899, g_l1_loss = 0.00000, time/batch = 1.01747, mtime/batch = 1.62412
6/59770 (epoch 0), d_rl_loss = 0.98900, d_fk_loss = 0.00003, g_adv_loss = 0.98682, g_l1_loss = 0.00000, time/batch = 1.00243, mtime/batch = 1.53530
7/59770 (epoch 0), d_rl_loss = 0.98683, d_fk_loss = 0.00004, g_adv_loss = 0.98456, g_l1_loss = 0.00000, time/batch = 1.00293, mtime/batch = 1.46876
8/59770 (epoch 0), d_rl_loss = 0.98458, d_fk_loss = 0.00006, g_adv_loss = 0.98222, g_l1_loss = 0.00000, time/batch = 1.00675, mtime/batch = 1.41742
9/59770 (epoch 0), d_rl_loss = 0.98223, d_fk_loss = 0.00008, g_adv_loss = 0.97980, g_l1_loss = 0.00000, time/batch = 0.99918, mtime/batch = 1.37560
10/59770 (epoch 0), d_rl_loss = 0.97982, d_fk_loss = 0.00010, g_adv_loss = 0.97731, g_l1_loss = 0.00000, time/batch = 1.00354, mtime/batch = 1.34178
11/59770 (epoch 0), d_rl_loss = 0.97731, d_fk_loss = 0.00013, g_adv_loss = 0.97469, g_l1_loss = 0.00000, time/batch = 1.00171, mtime/batch = 1.31344
12/59770 (epoch 0), d_rl_loss = 0.97470, d_fk_loss = 0.00016, g_adv_loss = 0.97199, g_l1_loss = 0.00000, time/batch = 1.01731, mtime/batch = 1.29066
13/59770 (epoch 0), d_rl_loss = 0.97200, d_fk_loss = 0.00020, g_adv_loss = 0.96918, g_l1_loss = 0.00000, time/batch = 1.02041, mtime/batch = 1.27135
14/59770 (epoch 0), d_rl_loss = 0.96919, d_fk_loss = 0.00024, g_adv_loss = 0.96625, g_l1_loss = 0.00000, time/batch = 1.01162, mtime/batch = 1.25404
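For reference, the denominator in that counter is just arithmetic over the training set. Here is a minimal sketch of the calculation, where the sample and epoch counts are placeholders (the real values depend on your dataset and training flags):

# Minimal sketch of where the "x/104250.0" denominator comes from.
# The sample and epoch counts below are placeholders, not the real values.
num_train_samples = 7000   # assumed number of training examples (chunks)
num_epochs = 1000          # assumed number of epochs
batch_size = 70

# Total batches across the whole run: batches per epoch times epochs.
total_batches = num_epochs * num_train_samples / batch_size
print('total batches to run:', total_batches)  # 100000.0 for these placeholder values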

Hello there!

Thanks for the quick reply :)

Indeed, my mtime/batch is way too high. When I tried to run it on my laptop (just for fun) I got an mtime of 240, so it's really odd to get 146 seconds per batch on a GeForce GTX 1060.

Is there a way I can verify whether it's really running on my GPU? I saw in GPU-Z that the GeForce's memory allocation was at ~5200 MB. I'll check the GPU usage as well and edit this post.

EDIT: Here's a GPU-Z Screenshot while SEGAN is running.

[Screenshot: GPU-Z sensors while SEGAN is running]

As you can see, the "GPU Load" sits at 0%–2% with only brief peaks at 100%, which is really weird, since I installed CUDA, cuDNN and tensorflow-gpu, and this is the only GPU in my system!

I'm starting to believe it's running on the CPU after all. Maybe this has something to do with my CUDA/cuDNN versions?

CUDA: 9.0
cuDNN (as seen in cudnn.h):

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5

Tensorflow: 1.5.0-rc1
Python: 3.6


I tried to run this sample code:

import tensorflow as tf

# Pin the constants and the matmul to the first GPU explicitly.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c))

And I got these messages as a result:

[[22. 28.]
 [49. 64.]]
2018-01-29 20:09:03.958810: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
2018-01-29 20:09:05.978486: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7845 pciBusID: 0000:01:00.0 totalMemory: 6.00GiB freeMemory: 4.96GiB
2018-01-29 20:09:05.978599: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
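That output only confirms that TensorFlow can see the GPU; it doesn't show where the ops actually run. A stricter check in TF 1.x is to enable device-placement logging; a minimal sketch (illustrative, not from the repo):

import tensorflow as tf

# Build a tiny graph and ask the session to report where each op is placed.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name='a')
b = tf.constant([[1.0, 1.0], [0.0, 1.0]], name='b')
c = tf.matmul(a, b)

# log_device_placement prints lines such as "MatMul: ... /device:GPU:0",
# making it explicit whether ops run on the GPU or fall back to the CPU.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))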

For reference, I was running this commit: lordet01@63d5df0

Hello again.

I managed to get it running with an mtime/batch of 2.6 seconds. The problem was quite simple.

I noticed that the script was processing two batches at a time and, as you said, it was taking too long to run. So I looked at main.py and noticed that this line: https://github.com/lordet01/segan/blob/master/main.py#L66 tests for 'cpu' in lower case.

Since I was running a newer version of TensorFlow, device types are now reported in upper case, e.g. 'CPU', and that was the problem. Because the script didn't detect that one of the devices was a CPU, the CPU was also included in the udevices list here: https://github.com/lordet01/segan/blob/master/main.py#L70

Bingo. That's why my GPU still had its memory allocated and showed brief processing "peaks": it finished its share of the work quickly and then waited for the CPU to finish its batch before moving to the next iteration. Once I changed that check from 'cpu' to 'CPU', the script correctly detected the CPU and only added GPU:0 to the udevices list, and now 100% of the training runs on the GPU.
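For anyone hitting the same issue, here is a minimal, illustrative sketch of a case-insensitive device filter (the udevices name mirrors the list mentioned above, but this is not the repo's actual main.py logic):

from tensorflow.python.client import device_lib

# Enumerate every device TensorFlow can see and keep only the non-CPU ones.
# Comparing device_type case-insensitively avoids the 'cpu' vs 'CPU'
# mismatch between TensorFlow versions described above.
udevices = []
for device in device_lib.list_local_devices():
    if device.device_type.lower() == 'cpu':
        continue  # skip the CPU so it does not throttle the GPU
    udevices.append(device.name)  # e.g. '/device:GPU:0'

# Fall back to the CPU only when there is no GPU at all.
if not udevices:
    udevices = ['/cpu:0']

print('Training devices:', udevices)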

Thank you very much for your time with this issue, it helped me a lot to track down what was wrong with it.

Hi, when I run the program it stops at 'Sampling some wavs to store sample references...' (model.py#L362). How can this be addressed?
Thanks