DeepRec-AI/DeepRec

No OpKernel was registered to support Op 'PreprocessingForward' Error for Multi Machine, Multi GPU

wangcaihua opened this issue · 0 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 20.04): Linux Ubuntu 20.04, Official GPU Image 2304
  • DeepRec version or commit id: deeprec2302
  • Python version: 3.8.10
  • Bazel version (if compiling from source): not compiling from source
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  • CUDA/cuDNN version: 11.6

Describe the current behavior
[1,9]:Traceback (most recent call last):
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
[1,9]: return fn(*args)
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
[1,9]: self._extend_graph()
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
[1,9]: tf_session.ExtendSession(self._session)
[1,9]:tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'PreprocessingForward' used by {{node input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward}}with these attrs: [rank=9, id_in_local_rank=0, num_ranks=16, num_gpus=16, Toffsets=DT_INT64, Tindices=DT_INT64, num_lookups=26, combiners=["mean", "mean", "mean", "mean", "mean", ..., "mean", "mean", "mean", "mean", "mean"], dimensions=[16, 16, 16, 16, 16, ..., 16, 16, 16, 16, 16], shard=[-1, -1, -1, -1, -1, ..., -1, -1, -1, -1, -1]]
[1,9]:Registered devices: [CPU, XLA_CPU]
[1,9]:Registered kernels:
[1,9]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,9]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT64]
[1,9]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT32]
[1,9]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
[1,9]:
[1,9]: [[input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward]]
[1,9]:
[1,9]:During handling of the above exception, another exception occurred:
[1,9]:
[1,9]:Traceback (most recent call last):
[1,9]: File "train.py", line 887, in <module>
[1,9]: main()
[1,9]: File "train.py", line 642, in main
[1,9]: train(sess_config, hooks, model, train_init_op, train_steps,
[1,9]: File "train.py", line 505, in train
[1,9]: with tf.train.MonitoredTrainingSession(
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 655, in MonitoredTrainingSession
[1,9]: return MonitoredSession(
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1085, in __init__
[1,9]: super(MonitoredSession, self).__init__(
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 800, in __init__
[1,9]: self._sess = _RecoverableSession(self._coordinated_creator)
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1282, in __init__
[1,9]: _WrappedSession.__init__(self, self._create_session())
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1287, in _create_session
[1,9]: return self._sess_creator.create_session()
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 953, in create_session
[1,9]: self.tf_sess = self._session_creator.create_session()
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 713, in create_session
[1,9]: return self._get_session_manager().prepare_session(
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 306, in prepare_session
[1,9]: sess.run(init_op, feed_dict=init_feed_dict)
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
[1,9]: result = self._run(None, fetches, feed_dict, options_ptr,
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
[1,9]: results = self._do_run(handle, final_targets, final_fetches,
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
[1,9]: return self._do_call(_run_fn, feeds, fetches, targets, options,
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
[1,9]: raise type(e)(node_def, op, message)
[1,9]:tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'PreprocessingForward' used by node input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [rank=9, id_in_local_rank=0, num_ranks=16, num_gpus=16, Toffsets=DT_INT64, Tindices=DT_INT64, num_lookups=26, combiners=["mean", "mean", "mean", "mean", "mean", ..., "mean", "mean", "mean", "mean", "mean"], dimensions=[16, 16, 16, 16, 16, ..., 16, 16, 16, 16, 16], shard=[-1, -1, -1, -1, -1, ..., -1, -1, -1, -1, -1]]
[1,9]:Registered devices: [CPU, XLA_CPU]
[1,9]:Registered kernels:
[1,9]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,9]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT64]
[1,9]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT32]
[1,9]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
[1,9]:
[1,9]: [[input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward]]
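
One thing that stands out in the traceback above: the registered devices are only CPU and XLA_CPU, while every registered kernel for 'PreprocessingForward' is for GPU. The following is a minimal sketch, assuming the error text keeps the format shown above, that parses such a message and lists the devices that have a kernel but are not visible to the process. The function name `diagnose_kernel_error` and the shortened `sample` text are hypothetical and only for illustration; this is not part of DeepRec or TensorFlow.

```python
import re


def diagnose_kernel_error(message):
    """Compare the devices a process registered with the devices that have a kernel."""
    # Drop mpirun-style rank prefixes such as "[1,9]:" from the log text.
    text = re.sub(r"\[\d+,\d+\]:", "", message)

    # Name of the op that has no matching kernel.
    op_match = re.search(r"support Op '([^']+)'", text)

    # Devices registered in this process, e.g. "[CPU, XLA_CPU]".
    dev_match = re.search(r"Registered devices: \[([^\]]*)\]", text)
    visible = set()
    if dev_match:
        visible = {d.strip() for d in dev_match.group(1).split(",") if d.strip()}

    # Devices for which at least one kernel is registered.
    kernel_devices = set(re.findall(r"device='([^']+)'", text))

    return {
        "op": op_match.group(1) if op_match else None,
        "visible_devices": sorted(visible),
        "kernel_devices": sorted(kernel_devices),
        "missing_devices": sorted(kernel_devices - visible),
    }


# Shortened excerpt of the log above (hypothetical sample input).
sample = """\
[1,9]:No OpKernel was registered to support Op 'PreprocessingForward'
[1,9]:Registered devices: [CPU, XLA_CPU]
[1,9]:Registered kernels:
[1,9]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,9]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
"""

report = diagnose_kernel_error(sample)
print(report)
```

For the excerpt above, `missing_devices` contains only `GPU`, which suggests that rank 9 did not register any GPU device when the session was created. Why that happens (for example driver, container, or device visibility settings on that machine) cannot be determined from the log alone.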
