Confusion using the plug-and-play data
Closed this issue · 5 comments
Hi,
I've cloned the repo and grabbed your trained models in an attempt to quickly see the demo running on my computer, but I'm getting an error and I'm not 100% sure I've understood how to place the data correctly:
get_train_batch c.TRAIN_DIR_CLIPS ../Data/.Clips/ c.NUM_CLIPS 0
Traceback (most recent call last):
File "avg_runner.py", line 186, in <module>
main()
File "avg_runner.py", line 182, in main
runner.train()
File "avg_runner.py", line 68, in train
batch = get_train_batch()
File "~/Adversarial_Video_Generation/Code/utils.py", line 127, in get_train_batch
path = c.TRAIN_DIR_CLIPS + str(np.random.choice(c.NUM_CLIPS)) + '.npz'
File "mtrand.pyx", line 1391, in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:15381)
ValueError: a must be greater than 0
I'm running the script by first cd-ing into Code, then running python avg_runner.py -l ../Models/Adversarial/model.ckpt-500000. I've added a print statement before the error line to see what the variables hold, and it looks like the .Clips folder is empty:
get_train_batch c.TRAIN_DIR_CLIPS ../Data/.Clips/ c.NUM_CLIPS 0
I've double checked and that seems to be the case:
> file ../Data/.Clips/
../Data/.Clips/: directory
> ls ../Data/.Clips/ | wc -w
0
I feel I'm missing something: should I have downloaded the contents of the .Clips folder (if so, from where?) or should the .Clips contents be generated?
How can I double-check and make sure I'm using the examples correctly?
I am using TensorFlow version 0.12.0 with GPU support in a virtual environment on OS X 10.11.5 with an NVIDIA GeForce GT 750M (2 GB VRAM), CUDA 8.0 and cuDNN 5.1 installed.
The first 3 levels of the repo look like this:
├── Code
│   ├── avg_runner.py
│   ├── constants.py
│   ├── constants.pyc
│   ├── d_model.py
│   ├── d_model.pyc
│   ├── d_scale_model.py
│   ├── d_scale_model.pyc
│   ├── g_model.py
│   ├── g_model.pyc
│   ├── loss_functions.py
│   ├── loss_functions.pyc
│   ├── loss_functions_test.py
│   ├── process_data.py
│   ├── tfutils.py
│   ├── tfutils.pyc
│   ├── tfutils_test.py
│   ├── utils.py
│   └── utils.pyc
├── Data
│   └── Ms_Pacman
│       ├── Test
│       └── Train
├── DataOld
│   └── Ms_Pacman
│       ├── Test
│       └── Train
├── LICENSE
├── Models
│   ├── Adversarial
│   │   ├── checkpoint
│   │   ├── model.ckpt-500000
│   │   └── model.ckpt-500000.meta
│   └── NonAdversarial
│       ├── checkpoint
│       ├── model.ckpt-1020000
│       └── model.ckpt-1020000.meta
├── Models.zip
├── Ms_Pacman.zip
├── README.md
├── Results
│   ├── Gifs
│   │   ├── 4_Comparison.gif
│   │   ├── 5_Comparison.gif
│   │   └── rainbow_NonAdv.gif
│   └── Summaries
│       ├── Adv-1
│       └── NonAdv-1
├── Save
│   ├── Images
│   │   └── Default
│   ├── Models
│   │   └── Default
│   └── Summaries
│       └── Default
└── deep_multi-scale_video_prediction_beyond_mean_square_error.pdf
Full output:
python avg_runner.py -l ../Models/Adversarial/model.ckpt-500000
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.dylib locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.dylib locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.dylib locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.1.dylib locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.dylib locally
c.TEST_DIR ../Data/Ms_Pacman/Test/
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] OS X does not support NUMA - returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GT 750M
major: 3 minor: 0 memoryClockRate (GHz) 0.9255
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.21GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
WARNING:tensorflow:From avg_runner.py:30 in __init__.: __init__ (from tensorflow.python.training.summary_io) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.FileWriter. The interface and behavior is the same; this is just a rename.
Init discriminator...
WARNING:tensorflow:From ~/Adversarial_Video_Generation/Code/d_model.py:92 in define_graph.: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
WARNING:tensorflow:From ~/Adversarial_Video_Generation/Code/d_model.py:93 in define_graph.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge.
Init generator...
WARNING:tensorflow:From ~/Adversarial_Video_Generation/Code/g_model.py:199 in define_graph.: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
WARNING:tensorflow:From ~/Adversarial_Video_Generation/Code/g_model.py:219 in define_graph.: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
WARNING:tensorflow:From ~/Adversarial_Video_Generation/Code/g_model.py:221 in define_graph.: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
WARNING:tensorflow:From ~/Adversarial_Video_Generation/Code/g_model.py:226 in define_graph.: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
WARNING:tensorflow:From ~/Adversarial_Video_Generation/Code/g_model.py:228 in define_graph.: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
WARNING:tensorflow:From ~/Adversarial_Video_Generation/Code/g_model.py:232 in define_graph.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge.
WARNING:tensorflow:From ~/Adversarial_Video_Generation/Code/g_model.py:233 in define_graph.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge.
Init variables...
Model restored from ../Models/Adversarial/model.ckpt-500000
get_train_batch c.TRAIN_DIR_CLIPS ../Data/.Clips/ c.NUM_CLIPS 0
Traceback (most recent call last):
File "avg_runner.py", line 186, in <module>
main()
File "avg_runner.py", line 182, in main
runner.train()
File "avg_runner.py", line 68, in train
batch = get_train_batch()
File "~/Adversarial_Video_Generation/Code/utils.py", line 127, in get_train_batch
path = c.TRAIN_DIR_CLIPS + str(np.random.choice(c.NUM_CLIPS)) + '.npz'
File "mtrand.pyx", line 1391, in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:15381)
ValueError: a must be greater than 0
I appreciate any tips or advice you can share.
Thank you,
George
Hey @orgicus – Thanks for the detailed info. Did you follow along with the usage instructions? (specifically step 3 about processing the data)
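For anyone hitting the same ValueError: it comes from np.random.choice being called with a clip count of 0, which happens when the clips directory has not been populated yet. Below is a minimal sketch of an early check, assuming the processed clips are stored as .npz files directly inside the clips directory (as the path built in the traceback suggests). The helper names count_clips and require_clips are hypothetical and are not part of the repo.

```python
import glob
import os


def count_clips(clips_dir):
    """Return the number of .npz files directly inside clips_dir."""
    return len(glob.glob(os.path.join(clips_dir, '*.npz')))


def require_clips(clips_dir):
    """Fail early with a readable message if no processed clips exist.

    Hypothetical helper: it only assumes clips are saved as <index>.npz.
    """
    num_clips = count_clips(clips_dir)
    if num_clips == 0:
        raise RuntimeError(
            'No .npz clips found in %r; run process_data.py first.' % clips_dir)
    return num_clips
```

Calling require_clips('../Data/.Clips/') before starting training would turn the NumPy error into a message that points at the missing processing step.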
Hi Matt, Thank you so much for getting in touch and sorry to take your time with this.
It might be a case of RTFM on my side 😊
Thank you for pointing me in the right direction.
I started this yesterday:
python process_data.py -t ../Data/Ms_Pacman/Train/ ../Data/.Clips/
Currently it's at "Processed 2799700 clips". I haven't passed --num-clips, so now I'm eagerly awaiting the 5000000 counter :))
Eventually the data processing completed and I started the avg_runner.py script, but after a full night of number crunching my 2 GB GPU ran out of memory:
I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 56 Chunks of size 256 totalling 14.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 31 Chunks of size 512 totalling 15.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 25 Chunks of size 1024 totalling 25.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 15 Chunks of size 2048 totalling 30.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3072 totalling 3.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 4096 totalling 16.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 6912 totalling 6.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 11776 totalling 11.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 13824 totalling 54.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 7 Chunks of size 38400 totalling 262.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 55296 totalling 162.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 61440 totalling 60.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 75264 totalling 367.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 131072 totalling 256.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 192000 totalling 937.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 245248 totalling 239.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 322560 totalling 315.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 360448 totalling 352.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 376320 totalling 1.08MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 524288 totalling 1.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 589824 totalling 576.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 662272 totalling 646.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 19 Chunks of size 1179648 totalling 21.38MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1409024 totalling 1.34MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 2686976 totalling 2.56MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3225600 totalling 3.08MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 9 Chunks of size 3276800 totalling 28.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3538944 totalling 3.38MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 4194304 totalling 4.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 4718592 totalling 36.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 5111808 totalling 4.88MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 6553600 totalling 18.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 12320768 totalling 11.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 13107200 totalling 100.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 19660800 totalling 18.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 29360128 totalling 28.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 429004800 totalling 409.13MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 827952128 totalling 789.60MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 1.45GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 1587499008
InUse: 1559270656
MaxInUse: 1586930688
NumAllocs: 9986112
MaxAllocSize: 1260182528
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***************************************xxxxxxxx************************************xxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 262.50MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[8,256,210,160]
Traceback (most recent call last):
File "avg_runner.py", line 186, in <module>
main()
File "avg_runner.py", line 182, in main
runner.train()
File "avg_runner.py", line 90, in train
self.test()
File "avg_runner.py", line 98, in test
batch, self.global_step, num_rec_out=self.num_test_rec)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 389, in test_batch
feed_dict=feed_dict)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[8,256,210,160]
[[Node: generator/scale_3/calculation/convolutions_1/Conv2D_2 = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](generator/scale_3/calculation/convolutions_1/Relu_1, generator/scale_3/setup/Variable_4/read)]]
Caused by op u'generator/scale_3/calculation/convolutions_1/Conv2D_2', defined at:
File "avg_runner.py", line 186, in <module>
main()
File "avg_runner.py", line 178, in main
runner = AVGRunner(num_steps, load_path, num_test_rec)
File "avg_runner.py", line 50, in __init__
c.SCALE_KERNEL_SIZES_G)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 48, in __init__
self.define_graph()
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 179, in define_graph
last_scale_pred_test)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 127, in calculate
preds, ws[i], [1, 1, 1, 1], padding=c.PADDING_G)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 396, in conv2d
data_format=data_format, name=name)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[8,256,210,160]
[[Node: generator/scale_3/calculation/convolutions_1/Conv2D_2 = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](generator/scale_3/calculation/convolutions_1/Relu_1, generator/scale_3/setup/Variable_4/read)]]
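The size in the OOM message can be sanity-checked with a little arithmetic. The sketch below assumes 4 bytes per element (the node is DT_FLOAT) and assumes the leading 8 in the reported shape is the batch dimension; tensor_mib is a hypothetical helper, not something from the repo or TensorFlow.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor with the given shape.

    4 bytes per element corresponds to 32-bit floats (DT_FLOAT).
    """
    num_elements = 1
    for dim in shape:
        num_elements *= dim
    return num_elements * bytes_per_element / float(1024 ** 2)


# Shape taken from the OOM message above.
full = tensor_mib([8, 256, 210, 160])
# Same activation if the leading dimension were halved (assumption: it is the batch).
half = tensor_mib([4, 256, 210, 160])

print('full: %.2f MiB, halved: %.2f MiB' % (full, half))
```

The first figure comes out to 262.50 MiB, which matches the "Ran out of memory trying to allocate 262.50MiB" line in the log, and halving the leading dimension halves that single allocation. Other tensors in the graph scale too, so this is only a rough guide to how much a smaller batch might save.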
Is there a way to "resume" the process from just before it crashed? :D
Hmm yeah, I trained this on 6 GB GPUs, so you might need to change the batch size or some other hyperparams to get it to work on 2 GB. You can load the last-saved version of your model by passing in its .ckpt file with the -l flag.
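To find which checkpoint to pass to -l, one option is to pick the file with the highest step number in the save directory. This is a rough sketch, assuming checkpoints are written as model.ckpt-<step> (the naming seen in the Models folder of the directory listing earlier in the thread) and that they land somewhere like ../Save/Models/Default/, which is a guess based on that same listing. latest_checkpoint_path is a hypothetical helper written in plain Python rather than a TensorFlow call.

```python
import os
import re

# Matches "model.ckpt-500000" but not "model.ckpt-500000.meta" or "checkpoint".
_CKPT_PATTERN = re.compile(r'^model\.ckpt-(\d+)$')


def latest_checkpoint_path(save_dir):
    """Return the path of the highest-step checkpoint in save_dir, or None."""
    best_step = -1
    best_name = None
    for name in os.listdir(save_dir):
        match = _CKPT_PATTERN.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_name = name
    if best_name is None:
        return None
    return os.path.join(save_dir, best_name)
```

The returned path could then be used as in the command from the original post, for example python avg_runner.py -l followed by that path.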
Thank you very much for the explanations, worked like a charm! ❤️