DeepRec-AI/DeepRec

Cannot place the graph because a reference or resource edge connects colocation groups with incompatible resource devices

formath opened this issue · 3 comments

Describe the current behavior
TensorFlow's original categorical_column_with_hash_bucket works fine. However, replacing it with DeepRec's categorical_column_with_embedding causes the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 
Cannot place the graph because a reference or resource edge connects colocation groups with incompatible resource devices: /job:ps/task:8 vs /job:ps/task:7. The edge src node is model/input_layer/userid_embedding/embedding_weights/part_1 , and the dst node is save/SaveV2

Code to reproduce the issue
Demo code:

import time

import tensorflow as tf


class Model(object):
  
  def model_forward(self, features):
    embedding_input = []
    for feature_name in features.keys():
      filter_option = tf.CounterFilter(filter_freq=10)
      evict_opt = tf.GlobalStepEvict(steps_to_live=2000000)
      ev_opt = tf.EmbeddingVariableOption(filter_option=filter_option, evict_option=evict_opt)
      # DeepRec categorical_column_with_embedding will cause error
      hash_feature = tf.feature_column.categorical_column_with_embedding(feature_name, dtype=tf.string, partition_num=10, ev_option=ev_opt)
      # categorical_column_with_hash_bucket works fine
      # hash_feature = tf.feature_column.categorical_column_with_hash_bucket(feature_name, hash_bucket_size=10000, dtype=tf.string)
      emb_col = tf.feature_column.embedding_column(hash_feature, dimension=32, combiner='mean')
      feature_emb = tf.feature_column.input_layer({feature_name: features[feature_name]}, emb_col)
      embedding_input.append(feature_emb)
    embedding_input = tf.concat(embedding_input, axis=1)
    logit = mlp(embedding_input)
    return logit

  def train(self):
    self.cluster = tf.train.ClusterSpec({'chief': self.chief_hosts, 'ps': self.ps_hosts, 'worker': self.worker_hosts})
    self.cpu_device = '/job:%s/task:%s/cpu:0' % (self.job_name, self.task_index)
    self.param_server_device = tf.train.replica_device_setter(
        worker_device=self.cpu_device, cluster=self.cluster)
    if self.job_name == 'ps':
      with tf.device('/cpu:0'):
        self.server.join()
    elif self.job_name == 'worker':
      with tf.Graph().as_default():
        tf.set_random_seed(int(time.time()))
        with tf.device('/cpu:0'):
          train_iterator = some_tfrecord_dataset(...)     
          train_features, train_labels = train_iterator.get_next()   
        with tf.device(self.param_server_device):
          self.global_step = tf.train.get_or_create_global_step()
          pred = self.model_forward(train_features)
          train_loss = some_loss(pred, train_labels)
          opt = tf.train.AdamOptimizer(learning_rate=0.001)
          train_op = opt.minimize(train_loss, global_step=self.global_step)

        with tf.train.MonitoredTrainingSession(...) as sess:
          while True:
            try:
              sess.run(train_op)
            except tf.errors.OutOfRangeError:
              break

Hi @formath.
There are two things I want to confirm.

  1. It seems that you are using a DeepRec version older than 2302.
  2. You have customized the worker_device passed to tf.train.replica_device_setter, which may conflict with EV placement.

Could you please turn on logging of variable placement? And set the worker_device to '/job:worker/task:%d' % task_index instead; see the sketch below.
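Something like this, as a minimal sketch with standard TF 1.x APIs (task_index and cluster stand for the values already built in your train() method):

import tensorflow as tf

# 1. Log where every op ends up, so the colocation of the embedding
#    partitions and save/SaveV2 can be inspected.
sess_config = tf.ConfigProto(log_device_placement=True)

# 2. Use the bare task device (no explicit /cpu:0) as the worker_device.
worker_device = '/job:worker/task:%d' % task_index
param_server_device = tf.train.replica_device_setter(
    worker_device=worker_device, cluster=cluster)

# Then pass the config into the session, e.g.:
# with tf.train.MonitoredTrainingSession(..., config=sess_config) as sess:
#   ...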

@Mesilenceki

  1. I am still confirming the exact DeepRec version.
  2. For replica_device_setter, I used the default round-robin strategy.
  3. I set log_device_placement=True. When the ps num is 10, I get the error and the job exits before the full placement information is printed. From the limited log, I see that model/input_layer/userid_embedding/embedding_weights/part_0 and its save/SaveV2 are both placed on /job:ps/task:7, which meets expectations. However, model/input_layer/userid_embedding/embedding_weights/part_1 is placed on /job:ps/task:8, and I can't find the log of its save/SaveV2 placement. I guess it is placed on /job:ps/task:7, which violates the colocation constraint, so the error occurs. Also, it is not an EV problem, because the error still exists when I disable EV.
2023-09-24 14:54:23.876226: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_0/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_0' is a reference connection and already has a device field set to /job:ps/task:7
2023-09-24 14:54:23.876278: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_1/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_1' is a reference connection and already has a device field set to /job:ps/task:8
2023-09-24 14:54:23.876287: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_2/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_2' is a reference connection and already has a device field set to /job:ps/task:9
2023-09-24 14:54:23.876295: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_3/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_3' is a reference connection and already has a device field set to /job:ps/task:0
2023-09-24 14:54:23.876302: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_4/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_4' is a reference connection and already has a device field set to /job:ps/task:1
2023-09-24 14:54:23.876311: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_5/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_5' is a reference connection and already has a device field set to /job:ps/task:2
2023-09-24 14:54:23.876320: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_6/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_6' is a reference connection and already has a device field set to /job:ps/task:3
2023-09-24 14:54:23.876328: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_7/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_7' is a reference connection and already has a device field set to /job:ps/task:4
2023-09-24 14:54:23.876336: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_8/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_8' is a reference connection and already has a device field set to /job:ps/task:5
2023-09-24 14:54:23.876345: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_9/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_9' is a reference connection and already has a device field set to /job:ps/task:6

2023-09-24 14:54:23.923634: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'save/SaveV2' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_0' is a reference connection and already has a device field set to /job:ps/task:7
  4. When I set the ps num to 1, the job works fine and the full placement information is printed.
  5. Setting the worker_device to /job:worker/task:%d gives the same error as /job:worker/task:%d/cpu:0.

The reason is tf.train.Saver(sharded=False). Changing it to sharded=True fixed my problem.
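For anyone hitting the same error, here is a minimal sketch of the fix, assuming the sharded saver is handed to MonitoredTrainingSession through a Scaffold (server, the checkpoint directory, and train_op come from the training setup above). As far as I understand, with sharded=False a single save/SaveV2 op saves every variable, and the resource/reference edges from all embedding partitions pull it into one colocation group that cannot be satisfied when the partitions live on different ps tasks; with sharded=True each device writes its own shard.

import tensorflow as tf

# sharded=True: one SaveV2 per device instead of a single global one, so
# embedding partitions placed on different /job:ps tasks no longer have to
# be colocated with the same save op.
saver = tf.train.Saver(sharded=True)
scaffold = tf.train.Scaffold(saver=saver)

with tf.train.MonitoredTrainingSession(
    master=server.target,            # assumed: server from the setup above
    checkpoint_dir='/path/to/ckpt',  # assumed checkpoint location
    scaffold=scaffold) as sess:
  sess.run(train_op)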