Reshape error during distributed execution
shiftone1001 opened this issue · 2 comments
Caused by op u'Reshape', defined at:
File "DeepFM.py", line 392, in <module>
tf.app.run()
File "/data/hadoop/local/usercache/test/appcache/application_5145270655_21212399/container_1569565_99362122/Python/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "DeepFM.py", line 326, in main
tf.estimator.train_and_evaluate(DeepFM, train_spec, eval_spec)
File "DeepFM.py", line 128, in model_fn
feat_ids = tf.reshape(feat_ids, shape=[-1, field_size])
InvalidArgumentError (see above for traceback): Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
[[Node: Reshape = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:chief/replica:0/task:0/device:CPU:0"](IteratorGetNext, Reshape/shape)]]
[[Node: gradients/Deep-part/deep_out/MatMul_grad/tuple/control_dependency_1_S313 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:CPU:0", send_device="/job:chief/replica:0/task:0/device:CPU:0", send_device_incarnation=-1178756093214127197, tensor_name="edge_1006_gradients/Deep-part/deep_out/MatMul_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:CPU:0"]()]]
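For anyone reading the message on the Reshape node: when one entry of the target shape is -1, that size is inferred by dividing the number of input elements by the product of the other entries, so the inference cannot work if any specified entry is zero. Below is a rough pure-Python sketch of that rule. It is a simplified model for illustration, not TensorFlow's actual implementation, and `infer_reshape` is a hypothetical helper name.

```python
def infer_reshape(num_elements, shape):
    """Resolve a single -1 entry in `shape` (simplified model of the rule)."""
    unknown = [i for i, dim in enumerate(shape) if dim == -1]
    if len(unknown) > 1:
        raise ValueError("only one dimension may be -1")
    if not unknown:
        # Nothing to infer; this sketch does not validate the element count here.
        return list(shape)

    # Product of all explicitly specified sizes.
    product = 1
    for dim in shape:
        if dim != -1:
            product *= dim

    # A zero among the specified sizes makes the division meaningless.
    if product <= 0:
        raise ValueError(
            "cannot infer the missing size unless all specified sizes are non-zero"
        )
    if num_elements % product != 0:
        raise ValueError(
            "%d elements cannot be split into rows of %d" % (num_elements, product)
        )

    resolved = list(shape)
    resolved[unknown[0]] = num_elements // product
    return resolved


if __name__ == "__main__":
    print(infer_reshape(256 * 39, [-1, 39]))
    try:
        infer_reshape(256 * 39, [-1, 0])
    except ValueError as err:
        print("error:", err)
```

In terms of the traceback, a target shape of `[-1, field_size]` can only be resolved when `field_size` is non-zero.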
I've uploaded a run_dist.sh script. Give it a try; it runs fine on my side.
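@lambdaji Thanks a lot! run_dist.sh runs fine in my local test.
The error occurred because, in the submit command below, field_size was not assigned correctly, so the default value 0 was used instead.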
tf_submit \
--data_dir=xx
--train_dir=xx
--command=Python/bin/python DeepFM.py --task_type=train --learning_rate=0.0005 --optimizer=Adam --num_epochs=1 --batch_size=256 --field_size=39 --feature_size=117581 --deep_layers=400,400,400 --dropout=0.5,0.5,0.5 --log_steps=1000 --num_threads=8
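Since the failure traced back to field_size silently falling back to 0, a start-up check could make this kind of mis-submission fail immediately with a readable message instead of surfacing later as a Reshape error. The sketch below is only an illustration under assumptions: it uses `argparse` as a stand-in for whatever flag library DeepFM.py actually uses, and `parse_model_flags` is a hypothetical helper, not part of the original script.

```python
import argparse


def parse_model_flags(argv):
    """Parse --field_size and reject the default value 0 (illustrative only)."""
    parser = argparse.ArgumentParser()
    # Default of 0 mirrors the behaviour described in this thread.
    parser.add_argument("--field_size", type=int, default=0)
    # Ignore all other flags so the sketch stays focused on field_size.
    args, _unknown = parser.parse_known_args(argv)

    if args.field_size <= 0:
        raise ValueError(
            "--field_size must be a positive integer, got %d; "
            "check that the flag actually reaches the script" % args.field_size
        )
    return args


if __name__ == "__main__":
    # Flag passed through correctly.
    print(parse_model_flags(["--field_size=39", "--batch_size=256"]).field_size)
    # Flag lost on the way to the script: fails fast.
    try:
        parse_model_flags(["--batch_size=256"])
    except ValueError as err:
        print("error:", err)
```

Printing the parsed flags at start-up on each worker is another cheap way to confirm that the values given at submit time are the ones the script receives.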