Training on Jetson NX
Hi,
I'm relatively new to training NNs, so I'm wondering whether I'm just underestimating the size of U-Nets, or overestimating the abilities of a Jetson NX, for training my dataset. It has 8GB of RAM. The jtop program from jetson_stats reports about 3GB for the CPU and 3.2GB for the GPU, though, and I don't see it use up more memory than that.
I've made changes to the kz-izbi-challenge notebook to train on images at 256×256 size, and I decreased the batch size to batch_size=1 to try to help it out.
But it seems to run out of memory after training for one epoch, regardless of steps_per_epoch:
```
ResourceExhaustedError: OOM when allocating tensor with shape[32,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node functional_1/concatenate_3/concat (defined at <ipython-input-24-2b12f42da9e0>:7) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_test_function_3313]
```
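For context, my training setup looks roughly like this. A sketch with placeholder data rather than my exact notebook, and the custom_unet arguments are what I believe the keras-unet package exposes, so treat the exact signature as an assumption:

```python
import numpy as np
import tensorflow as tf
from keras_unet.models import custom_unet

# Placeholder arrays standing in for the real images and masks
x = np.zeros((8, 256, 256, 1), dtype="float32")
y = np.zeros((8, 256, 256, 1), dtype="float32")
train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(1).repeat()

# 256x256 inputs, filters=64 as in the notebook
model = custom_unet(input_shape=(256, 256, 1), filters=64, num_layers=4)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Batch size 1 to keep the per-step footprint small; it still OOMs
# after the first epoch regardless of steps_per_epoch
model.fit(train_ds, steps_per_epoch=8, epochs=10)
```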
Any idea whether it's feasible to train a U-Net on a Jetson NX?
Thanks
Hi @javadan,
I'm not familiar with Jetson NX nor have I used it for training NNs.
It's possible that after the network itself is loaded into GPU memory there's simply no more space left for the images.
But it's also possible that some other process is hoarding resources and your training process can't access your GPU memory at its full capacity. I usually use tools like nvidia-smi to monitor my GPUs and check that no other processes are blocking me from using the GPU.
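You can also ask TensorFlow from inside the training process which devices it sees and, on newer versions, how much memory it is currently using. A rough sketch, assuming a TF 2.x build; get_memory_info only exists in fairly recent releases, hence the guard:

```python
import tensorflow as tf

# List the GPUs visible to this process
gpus = tf.config.list_physical_devices("GPU")
print("visible GPUs:", gpus)

# Newer TF releases (2.5+) can also report current/peak GPU memory use;
# older Jetson builds may not have this, hence the hasattr guard
if gpus and hasattr(tf.config.experimental, "get_memory_info"):
    print(tf.config.experimental.get_memory_info("GPU:0"))
```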
Thanks. The Jetsons don't have nvidia-smi; they have another tool, tegrastats, which doesn't seem to show a lot of useful info, though maybe the "IRAM" count is relevant. The jtop program is pretty nifty, but it also isn't helping me find the cause.
I'll have another look this evening.
Since the error looks like a GPU memory issue: do you have an idea what the GPU memory requirements might be for a 256×256 U-Net? According to jtop and the Jetson specs, I have about 8GB of RAM total, which gets split between CPU and GPU, showing 3.2GB of GPU memory.
From a thought-experiment perspective:
the 256×256 U-Net has "Trainable params: 31,030,593",
and each parameter should be a 32-bit (4-byte) float,
4 * 31,030,593 = 124,122,372 bytes,
so the weights alone shouldn't be using much more than ~120MB for the NN.
Even at 8 bytes per parameter that's only about 250MB, which is still well under 3.2GB.
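Spelling that out in plain Python (the 3x factor for Adam's moment buffers is my assumption about the optimizer the notebook uses):

```python
# Quick sanity check: float32 parameters are 4 bytes each
params = 31_030_593
weights_mib = params * 4 / 1024**2
print(f"weights alone: {weights_mib:.0f} MiB")               # ~118 MiB

# Adam keeps two extra float32 slots per parameter, so the training
# state is roughly 3x the weights -- still only a few hundred MiB
print(f"weights + Adam moments: {weights_mib * 3:.0f} MiB")  # ~355 MiB
```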
I thought that maybe, because I'm running the code inside an l4t-ml:r32.5.0-py3 Docker container, I might be constrained by Docker somehow, but it does seem like the container should have access to the host's resources.
I also thought maybe the Model.fit_generator method was deprecated because of some issue with it, but it does look like it just calls Model.fit under the hood.
Anyway, I'll continue this evening. Maybe I can find something similar to that GPU process list for the Jetson.
According to what you wrote, I think it should fit into memory. However, the problem might be the number of processes using that GPU. The way Keras works by default is that, once initialized, it reserves all the memory on a given GPU, so if another process initializes Keras "second", it will find little to no memory left and throw an OOM error.
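If that's what's happening, TensorFlow can be told to allocate GPU memory on demand instead of reserving it all up front. A minimal sketch, assuming a TF 2.x build; it has to run before any model or tensor is created:

```python
import tensorflow as tf

# Allocate GPU memory as needed instead of reserving it all at startup.
# Must be called before any tensors or models are created.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```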
Hmm, good to know. So I changed filters=64 to filters=32, and it's down to 7 million parameters now. This seems to be training OK. That's a good omen. What, in a nutshell, does filters=X do? The resolution of the convolutions? I'll leave it training overnight. Initial results are interesting, ghostly stuff. Thanks, you can close the issue.
Glad you figured out a way to make it work for you.
The filters parameter determines the number of convolutional kernels in the first conv block; the number of filters is then doubled for every subsequent down-sampling block (and halved again on the way back up). So this parameter strongly influences the total parameter count of the network, and thus its size in memory. It also influences the representational capacity of your network: more filters potentially means more patterns the network can recognize, but there's always a risk of the network starting to memorize features from the training images and overfitting on them.
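To see the effect on model size concretely, something along these lines should roughly reproduce the parameter counts mentioned in this thread (argument names as I recall them from the keras-unet README, so double-check against the package):

```python
from keras_unet.models import custom_unet

# Compare the parameter count for the two filter settings discussed above
for f in (32, 64):
    model = custom_unet(input_shape=(256, 256, 1), filters=f, num_layers=4)
    print(f"filters={f}: {model.count_params():,} params")
```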