remicres/sr4rs

train.py stops running (OOM during training)

Closed this issue · 2 comments

Hi again, @remicres
I encountered a bug when running train.py. I used 24 patches for training, with LR and HR patch sizes of 128 and 512 respectively. Is my patch size too large?
The training parameters are as follows.
[screenshot: training parameters]
The bug may be the one shown in the screenshot below.
[screenshot: error output]

2023-03-15 12:56:55.345324: W tensorflow/core/common_runtime/bfc_allocator.cc:474] ****************************************************************************************************
2023-03-15 12:56:55.345356: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_ops.cc:684 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[4,512,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
return fn(*args)
File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[4,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node dis/res_2x/conv2/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

     [[Mean_24/_343]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

(1) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[4,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node dis/res_2x/conv2/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored.

2023-03-15 12:56:55.913107: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
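For reference, the "report_tensor_allocations_upon_oom" hint in the traceback refers to the TF1-style RunOptions passed to Session.run(). Below is a minimal sketch of setting that flag; the tiny graph and feed are dummy placeholders for illustration, not sr4rs code:

```python
import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # the hint only applies in graph (non-eager) mode

# Ask TensorFlow to dump the list of live tensor allocations if an OOM occurs
run_options = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)

# Dummy graph standing in for the real training step in train.py (hypothetical)
x = tf.compat.v1.placeholder(tf.float32, shape=[None, 128, 128, 4], name="lr_patch")
y = tf.reduce_mean(
    tf.compat.v1.nn.conv2d(x, tf.ones([3, 3, 4, 8]), strides=[1, 1, 1, 1], padding="SAME")
)

with tf.compat.v1.Session() as sess:
    out = sess.run(
        y,
        feed_dict={x: np.zeros([1, 128, 128, 4], np.float32)},
        options=run_options,  # on OOM, prints which tensors hold the memory
    )
    print(out)
```

If the OOM is reproduced with this option set, TensorFlow lists the live allocations, which helps identify the layer responsible.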

Hello @Eliaukyxw,

OOM means out-of-memory.
You can try decreasing the network depth, the number of resblocks, or the batch size.
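For a rough sense of scale: the failing activation reported in the log, shape [4, 128, 256, 256] in float32, takes about 128 MiB on its own, and many such feature maps are kept alive during a training step. A small sketch of the arithmetic (the shape comes from the log above; the axis interpretation and scaling comments are assumptions for illustration):

```python
def tensor_mib(shape, bytes_per_elem=4):
    """Memory footprint of one dense float32 tensor, in MiB."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / 2**20

# Shape from the OOM message (assumed [batch, channels, height, width])
print(tensor_mib([4, 128, 256, 256]))  # 128.0 MiB for a single activation

# Activation memory grows linearly with batch size and with the number of
# feature maps, and quadratically with patch side length, which is why
# lowering the depth, the number of resblocks or the batch size helps.
print(tensor_mib([2, 128, 256, 256]))  # halved batch size     -> 64.0 MiB
print(tensor_mib([4, 64, 256, 256]))   # halved channel count  -> 64.0 MiB
```

In practice the same reduction is applied through the corresponding train.py options (check `python train.py --help` for the exact flag names).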

Hi Remi,
After decreasing the network depth, the program ran smoothly. Thanks a lot for your help!