minimal example running on NCS
matpalm opened this issue · 14 comments
Happy days! I'm delighted you have got it working! I am currently collecting another dataset but hope to get it running on the NCS soon.
I finally got a chance to test it out tonight. It looks like it is working but I had to remove an encoding layer as my images are 640x480 and a 512 patch size is too big. I changed it to look like this:
input (?, 256, 256, 3) #196608
e1 (?, 127, 127, 16) #258064
e2 (?, 63, 63, 32) #127008
e3 (?, 31, 31, 64) #61504
e4 (?, 15, 15, 128) #28800
d1 (?, 31, 31, 64) #61504
d2 (?, 63, 63, 32) #127008
d3 (?, 127, 127, 16) #258064
logits (?, 127, 127, 1) #16129
Is that the correct way to do it?
Also I uncommented the modelTester code (I love me some stats) and got the following error:
full res test model...
WARNING: ncs_hacktastic
input (?, 480, 640, 3) #921600
e1 (?, 239, 319, 16) #1219856
e2 (?, 119, 159, 32) #605472
e3 (?, 59, 79, 64) #298304
e4 (?, 29, 39, 128) #144768
d1 (?, 59, 79, 64) #298304
d2 (?, 119, 159, 32) #605472
d3 (?, 239, 319, 16) #1219856
logits (?, 239, 319, 1) #76241
ValueError: logits and labels must have the same shape ((?, 239, 319, 1) vs (?, 127, 127, 1))
It seems to be picking up the image size, not the patch size. How do I best mix patches with the testing code?
yeah that (256,256) -> (127,127) all looks good.
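For anyone following along, the dumped sizes in both listings are consistent with each encoder layer being a 3x3 convolution (the kernel size appears in the variables listing later in this thread) applied with stride 2 and no padding. The stride and padding are inferred from the numbers, not read from the repo, and `conv_out` and `encoder_shapes` are hypothetical helper names. A minimal sketch that reproduces the e1..e4 rows:

```python
def conv_out(size, kernel=3, stride=2):
    """Output length along one axis of an unpadded (VALID) convolution."""
    return (size - kernel) // stride + 1


def encoder_shapes(height, width, filters=(16, 32, 64, 128)):
    """Return [(name, h, w, channels, element_count)] for each encoder layer.

    The filter counts are copied from the shape dumps in this thread.
    """
    shapes = []
    h, w = height, width
    for i, channels in enumerate(filters, start=1):
        h, w = conv_out(h), conv_out(w)
        shapes.append(("e%d" % i, h, w, channels, h * w * channels))
    return shapes


# 256x256 patch, then the full 480x640 image.
for row in encoder_shapes(256, 256):
    print(row)
for row in encoder_shapes(480, 640):
    print(row)
```

Running the same function on a planned input size before training shows what label size the logits will need to match.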
with respect to the (127,127) you're sadly hitting some hard coded stuff i have in there... it's this bit of code which is an explicit slice/reshape workaround for the size/shape of the 2d output being wrong
it's clumsy i know, but that could be configurable (until there's a fix..)
ahh, would it be quicker for me to just crop the test images to 127,127 or will it work if I change the shape of the output?
changing the code to match your size would probably be the quickest...
I finally got a bit of time this morning and managed to get it working from start to NCS finish! Unfortunately the results were not great. I went back to train.py and uncommented the test code to see how well the training was working. There was an issue with the training network set up for a certain patch size and the test network being used on the full image so I turned off the patches and changed the image shape in data.py to resize to 239x319. The network topology and labels now match up:
patch train model...
input (?, 480, 640, 3) #921600
e1 (?, 239, 319, 16) #1219856
e2 (?, 119, 159, 32) #605472
e3 (?, 59, 79, 64) #298304
e4 (?, 29, 39, 128) #144768
d1 (?, 59, 79, 64) #298304
d2 (?, 119, 159, 32) #605472
d3 (?, 239, 319, 16) #1219856
logits (?, 239, 319, 1) #76241
but when I run the training I get the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800
[[Node: Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32](arg1, Reshape_1/shape)]]
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,480,640,3], [?,239,319,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
so 240x320 is 76800 but I cannot see anywhere in the code where the tensor is being set to 19200 and I am starting to realise that tensorflow is difficult to debug to say the least!
Do you have any suggestions for seeing where this is getting set, or for debugging tensorflow models in general?
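One low-tech way to trace a bare element count back to a shape is to factor it under the known aspect ratio. This is plain arithmetic, not a TensorFlow debugging feature; `candidate_shapes` is a hypothetical helper, and the 4:3 ratio is assumed from the 640x480 images. A minimal sketch:

```python
def candidate_shapes(n_values, aspect_w=4, aspect_h=3):
    """Return (height, width) pairs where height * width == n_values
    and width:height equals aspect_w:aspect_h."""
    matches = []
    for h in range(1, int(n_values ** 0.5) + 1):
        if n_values % h == 0:
            w = n_values // h
            # Cross-multiply to compare ratios without floating point.
            if w * aspect_h == h * aspect_w:
                matches.append((h, w))
    return matches


print(candidate_shapes(19200))  # [(120, 160)]
print(candidate_shapes(76800))  # [(240, 320)]
print(candidate_shapes(76241))  # [] because 239x319 is not exactly 4:3
```

Here 19200 comes out as 120x160 and 76800 as 240x320, so both counts look like rescaled 480x640 images, not layer outputs such as 239x319.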
I thought it may have been the shape of my images so I resized them to match the patch size, I also resized my labels to 64x64 with nearest neighbour interpolation. I get the same error despite the dumped shapes of the models being identical
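Nearest-neighbour matters for labels because every output pixel is copied from one input pixel, so a binary mask stays binary; interpolating resizes would introduce in-between values. The sketch below is a pure-Python illustration of that idea on a list-of-lists mask. `resize_nearest` is a hypothetical helper, not the code used in data.py, and in practice an image library would do the resize.

```python
def resize_nearest(label, out_h, out_w):
    """Nearest-neighbour resize of a 2D list-of-lists label mask."""
    in_h, in_w = len(label), len(label[0])
    return [
        # Map each output pixel back to a source pixel by integer scaling.
        [label[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
small = resize_nearest(mask, 2, 2)
print(small)  # [[0, 1], [1, 0]]
```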
So it works with the patch flag but not without. This means it is either something wrong with my labels or i'm missing something in xys_iterator. I tried tfdbg but it is hard to see what is going on....
yeah, it's been a nightmare to debug... i've also made this repo more complicated than it needs to be because i've been conflating two things: 1) running a patch batched model with fixed sized inference to run on the NCS and 2) training patch based and running on arbitrary sized output for my meta learning experiments. i should really move 2) into its own repo since it requires different things from the data pipeline than 1) does.... but that's an aside...
are you trying to run with an output of (239,319) on the NCS? i recall having a problem where i couldn't get anything over (127,127) as output on the stick...
can you share a larger stack trace around the tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800
line ?
It seemed to compile and run with 239,319 but the output was a mess. I will resize it if I hit the same 127,127 limitation.
Here is the stack trace and some additional information. The tensor being passed has 19200 values (120 x 160) and the requested shape has 76800 (240 x 320), so I think the output of a particular layer might be the wrong size. I used the slim model analyzer to get more info but still cannot see anything wrong.
$ ./train.py --run $RUN --steps $STEPS --train-steps 1000 --train-image-dir $DATADIR/train/ --test-image-dir $DATADIR/test/ --label-dir $DATADIR/labels/ --no-use-batch-norm --no-use-skip-connections --width 640 --height 480 --label-rescale 0.25
opts Namespace(base_filter_size=8, batch_size=32, flip_left_right=False, height=480, label_dir='data/1850/labels/', label_rescale=0.25, learning_rate=0.001, no_use_batch_norm=True, no_use_skip_connections=True, patch_width_height=None, random_rotate=False, run='r2', secs=None, steps=2000, test_image_dir='data/1850/test/', train_image_dir='data/1850/train/', train_steps=1000, width=640)
len(rgb_filenames) 1401 NO CACHE
WARNING: ncs_hacktastic
patch train model...
input (?, 480, 640, 3) #921600
e1 (?, 239, 319, 16) #1219856
e2 (?, 119, 159, 32) #605472
e3 (?, 59, 79, 64) #298304
e4 (?, 29, 39, 128) #144768
d1 (?, 59, 79, 64) #298304
d2 (?, 119, 159, 32) #605472
d3 (?, 239, 319, 16) #1219856
logits (?, 239, 319, 1) #76241
Variables: name (type shape) [size]
train_test_model/e1/weights:0 (float32_ref 3x3x3x16) [432, bytes: 1728]
train_test_model/e1/biases:0 (float32_ref 16) [16, bytes: 64]
train_test_model/e2/weights:0 (float32_ref 3x3x16x32) [4608, bytes: 18432]
train_test_model/e2/biases:0 (float32_ref 32) [32, bytes: 128]
train_test_model/e3/weights:0 (float32_ref 3x3x32x64) [18432, bytes: 73728]
train_test_model/e3/biases:0 (float32_ref 64) [64, bytes: 256]
train_test_model/e4/weights:0 (float32_ref 3x3x64x128) [73728, bytes: 294912]
train_test_model/e4/biases:0 (float32_ref 128) [128, bytes: 512]
train_test_model/d1/weights:0 (float32_ref 3x3x64x128) [73728, bytes: 294912]
train_test_model/d1/biases:0 (float32_ref 64) [64, bytes: 256]
train_test_model/d2/weights:0 (float32_ref 3x3x32x64) [18432, bytes: 73728]
train_test_model/d2/biases:0 (float32_ref 32) [32, bytes: 128]
train_test_model/d3/weights:0 (float32_ref 3x3x16x32) [4608, bytes: 18432]
train_test_model/d3/biases:0 (float32_ref 16) [16, bytes: 64]
train_test_model/d4/weights:0 (float32_ref 3x3x16x1) [144, bytes: 576]
train_test_model/d4/biases:0 (float32_ref 1) [1, bytes: 4]
Total size of variables: 194465
Total bytes of variables: 777860
2018-09-25 07:46:07.989531: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-25 07:46:07.989915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.607
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.34GiB
2018-09-25 07:46:07.989926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-09-25 07:46:08.140031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-25 07:46:08.140055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-09-25 07:46:08.140059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-09-25 07:46:08.140224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10009 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800
[[Node: Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32](arg1, Reshape_1/shape)]]
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,480,640,3], [?,239,319,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer/_11 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_66_train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./train.py", line 96, in <module>
sess.run(train_op)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800
[[Node: Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32](arg1, Reshape_1/shape)]]
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,480,640,3], [?,239,319,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer/_11 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_66_train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Exit 1
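Reading the numbers in this trace: the command passes --label-rescale 0.25 with --width 640 --height 480. If the labels are scaled by that factor on each side (an assumption about data.py, not checked against the code), each label has 120 x 160 = 19200 values, which matches the "tensor with 19200 values" in the error, while the requested 76800 corresponds to a factor of 0.5. `label_values` below is a hypothetical helper that just does this arithmetic:

```python
def label_values(height, width, label_rescale):
    """Element count of a single-channel label after scaling each side
    by label_rescale (assumed behaviour, not taken from data.py)."""
    return int(height * label_rescale) * int(width * label_rescale)


for rescale in (0.25, 0.5):
    print(rescale, label_values(480, 640, rescale))

# The logits dumped above are (?, 239, 319, 1).
logits_values = 239 * 319
print("logits", logits_values)
```

Note that neither count equals the 76241 values of the 239x319 logits, so under these assumptions the labels would still need a resize or crop to line up with the model output.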
thanks for waiting jono, i still haven't had a chance to look at this yet... hopefully this afternoon the planets will align for some free time :D
No rush! I only get a chance to look at it at the weekend atm
I am going to try rewriting the code over the weekend to work with my images, is the NCS_POC still the latest version or should I be working off the master branch?
Yeah. I still haven't merged it back yet sorry (since it also needs some clean up) but it demonstrates the things I needed to do. Good luck!
Wow, super excited that you got this working on the NCS as well. At some point I'm going to try to get this running on our DepthAI platform, so that you can know the physical location of the bees in Cartesian coordinates (x, y, z) in centimeters and map their 3D flight patterns.