remicres/sr4rs

train.py stops running (OOM during training)

Closed this issue · 2 comments

Hi again, @remicres
I encountered a bug when running train.py. I used 24 patches for training, with LR and HR patch sizes of 128 and 512 respectively. Is my patch size too large?
The training parameters are as follows.
[screenshot: training parameters]
The bug may be the one shown in the screenshot below.
[screenshot: error output]

2023-03-15 12:56:55.345324: W tensorflow/core/common_runtime/bfc_allocator.cc:474] ****************************************************************************************************
2023-03-15 12:56:55.345356: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_ops.cc:684 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[4,512,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
return fn(*args)
File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[4,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node dis/res_2x/conv2/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

     [[Mean_24/_343]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

(1) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[4,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node dis/res_2x/conv2/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored.

2023-03-15 12:56:55.913107: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
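For reference, the "report_tensor_allocations_upon_oom" hint in the traceback refers to the TF1-style RunOptions passed to Session.run(). Below is a minimal sketch of setting that flag; the tiny graph and feed are dummy placeholders for illustration, not sr4rs code:

```python
import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # the hint only applies in graph (non-eager) mode

# Ask TensorFlow to dump the list of live tensor allocations if an OOM occurs
run_options = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)

# Dummy graph standing in for the real training step in train.py (hypothetical)
x = tf.compat.v1.placeholder(tf.float32, shape=[None, 128, 128, 4], name="lr_patch")
y = tf.reduce_mean(
    tf.compat.v1.nn.conv2d(x, tf.ones([3, 3, 4, 8]), strides=[1, 1, 1, 1], padding="SAME")
)

with tf.compat.v1.Session() as sess:
    out = sess.run(
        y,
        feed_dict={x: np.zeros([1, 128, 128, 4], np.float32)},
        options=run_options,  # on OOM, prints which tensors hold the memory
    )
    print(out)
```

If the OOM is reproduced with this option set, TensorFlow lists the live allocations, which helps identify the layer responsible.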

Hello @Eliaukyxw,

OOM means out-of-memory.
You can try decreasing the network depth, the number of resblocks, or the batch size.
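For a rough sense of scale: the failing activation reported in the log, shape [4, 128, 256, 256] in float32, takes about 128 MiB on its own, and many such feature maps are kept alive during a training step. A small sketch of the arithmetic (the shape comes from the log above; the axis interpretation and scaling comments are assumptions for illustration):

```python
def tensor_mib(shape, bytes_per_elem=4):
    """Memory footprint of one dense float32 tensor, in MiB."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / 2**20

# Shape from the OOM message (assumed [batch, channels, height, width])
print(tensor_mib([4, 128, 256, 256]))  # 128.0 MiB for a single activation

# Activation memory grows linearly with batch size and with the number of
# feature maps, and quadratically with patch side length, which is why
# lowering the depth, the number of resblocks or the batch size helps.
print(tensor_mib([2, 128, 256, 256]))  # halved batch size     -> 64.0 MiB
print(tensor_mib([4, 64, 256, 256]))   # halved channel count  -> 64.0 MiB
```

In practice the same reduction is applied through the corresponding train.py options (check `python train.py --help` for the exact flag names).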

Hi Remi,
After decreasing the network depth, the program ran smoothly. Thanks a lot for your help!