Cannot place the graph because a reference or resource edge connects colocation groups with incompatible resource devices
formath opened this issue · 3 comments
formath commented
Describe the current behavior
TensorFlow's original categorical_column_with_hash_bucket
works fine. However, replacing it with DeepRec's categorical_column_with_embedding
causes an error like this:
tensorflow.python.framework.errors_impl.InvalidArgumentError:
Cannot place the graph because a reference or resource edge connects colocation groups with incompatible resource devices: /job:ps/task:8 vs /job:ps/task:7. The edge src node is model/input_layer/userid_embedding/embedding_weights/part_1 , and the dst node is save/SaveV2
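The error means a reference or resource edge forces two nodes into one colocation group, yet their assigned devices differ. A pure-Python sketch of that constraint (hypothetical `check_colocation` helper for illustration, not actual TensorFlow code):

```python
# Hypothetical sketch of TensorFlow's colocation constraint: nodes joined by
# a reference or resource edge must resolve to the same device.
def check_colocation(src_device: str, dst_device: str) -> None:
    """Raise if a reference/resource edge connects nodes on different devices."""
    if src_device != dst_device:
        raise ValueError(
            "Cannot place the graph because a reference or resource edge "
            "connects colocation groups with incompatible resource devices: "
            f"{src_device} vs {dst_device}"
        )

# The failing edge from the report: embedding part_1 on ps:8, SaveV2 on ps:7.
try:
    check_colocation("/job:ps/task:8", "/job:ps/task:7")
except ValueError as e:
    print(e)
```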
Code to reproduce the issue
Demo code:
class Model(object):
    def model_forward(self, features):
        embedding_input = []
        for feature_name in features.keys():
            filter_option = tf.CounterFilter(filter_freq=10)
            evict_opt = tf.GlobalStepEvict(steps_to_live=2000000)
            ev_opt = tf.EmbeddingVariableOption(filter_option=filter_option, evict_option=evict_opt)
            # DeepRec categorical_column_with_embedding will cause the error
            hash_feature = tf.feature_column.categorical_column_with_embedding(feature_name, dtype=tf.string, partition_num=10, ev_option=ev_opt)
            # categorical_column_with_hash_bucket works fine
            # hash_feature = tf.feature_column.categorical_column_with_hash_bucket(feature_name, hash_bucket_size=10000, dtype=tf.string)
            emb_col = tf.feature_column.embedding_column(hash_feature, dimension=32, combiner='mean')
            feature_emb = tf.feature_column.input_layer(features[feature_name], emb_col)
            embedding_input.append(feature_emb)
        embedding_input = tf.concat(embedding_input, axis=1)  # concatenate per-feature embeddings
        logit = mlp(embedding_input)
        return logit

    def train(self):
        self.cluster = tf.train.ClusterSpec({'chief': self.chief_hosts, 'ps': self.ps_hosts, 'worker': self.worker_hosts})
        self.cpu_device = '/job:%s/task:%s/cpu:0' % (self.job_name, self.task_index)
        self.param_server_device = tf.train.replica_device_setter(worker_device=self.cpu_device,
                                                                  cluster=self.cluster)
        if self.job_name == 'ps':
            with tf.device('/cpu:0'):
                self.server.join()
        elif self.job_name == 'worker':
            with tf.Graph().as_default():
                tf.set_random_seed(int(time.time()))
                with tf.device('/cpu:0'):
                    train_iterator = some_tfrecord_dataset(...)
                    train_features, train_labels = train_iterator.get_next()
                with tf.device(self.param_server_device):
                    self.global_step = tf.train.get_or_create_global_step()
                    pred = self.model_forward(train_features)
                    train_loss = some_loss(pred, train_labels)
                    opt = tf.train.AdamOptimizer(learning_rate=0.001)
                    train_op = opt.minimize(train_loss, global_step=self.global_step)
                with tf.train.MonitoredTrainingSession(...) as sess:
                    while True:
                        try:
                            sess.run(train_op)
                        except tf.errors.OutOfRangeError:
                            break
Mesilenceki commented
Hi @formath.
There are two things I want to confirm:
- It seems that you are using a DeepRec version older than 2302.
- You have customized the tf.train.replica_device_setter worker device, which may conflict with EV placement.
Could you please turn on logging of variable placement? And set the worker_device
to '/job:worker/task:%d' % task_index instead.
formath commented
- The DeepRec version is still being confirmed.
- For replica_device_setter, I used the default round-robin strategy.
- I set log_device_placement=True. When the PS num is 10, I got the error and the job exited before the whole placement information showed. From the limited log, I see model/input_layer/userid_embedding/embedding_weights/part_0 and its save/SaveV2 are both placed on /job:ps/task:7, which meets expectations. However, model/input_layer/userid_embedding/embedding_weights/part_1 is placed on /job:ps/task:8, but I can't find the log of its save/SaveV2 placement. I guess it is placed on /job:ps/task:7, which violates the colocation condition, so the error occurs. Also, it is not an EV problem, because the error still exists when I disable EV.
2023-09-24 14:54:23.876226: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_0/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_0' is a reference connection and already has a device field set to /job:ps/task:7
2023-09-24 14:54:23.876278: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_1/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_1' is a reference connection and already has a device field set to /job:ps/task:8
2023-09-24 14:54:23.876287: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_2/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_2' is a reference connection and already has a device field set to /job:ps/task:9
2023-09-24 14:54:23.876295: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_3/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_3' is a reference connection and already has a device field set to /job:ps/task:0
2023-09-24 14:54:23.876302: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_4/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_4' is a reference connection and already has a device field set to /job:ps/task:1
2023-09-24 14:54:23.876311: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_5/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_5' is a reference connection and already has a device field set to /job:ps/task:2
2023-09-24 14:54:23.876320: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_6/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_6' is a reference connection and already has a device field set to /job:ps/task:3
2023-09-24 14:54:23.876328: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_7/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_7' is a reference connection and already has a device field set to /job:ps/task:4
2023-09-24 14:54:23.876336: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_8/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_8' is a reference connection and already has a device field set to /job:ps/task:5
2023-09-24 14:54:23.876345: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_9/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_9' is a reference connection and already has a device field set to /job:ps/task:6
2023-09-24 14:54:23.923634: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'save/SaveV2' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_0' is a reference connection and already has a device field set to /job:ps/task:7
- When I set the PS num to 1, the job works fine and the whole placement information shows.
- '/job:worker/task:%d' gives the same error as '/job:worker/task:%d/cpu:0'.
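The log above shows the ten embedding partitions assigned to PS tasks 7, 8, 9, 0, 1, ..., 6, i.e. round-robin with a starting offset. A pure-Python sketch of that placement pattern (the `round_robin_placement` helper and its `start` parameter are illustrative assumptions, not DeepRec internals):

```python
# Sketch of round-robin variable placement across parameter servers, assuming
# the placer begins at some offset (7 in the log above) rather than task 0.
def round_robin_placement(num_partitions: int, num_ps: int, start: int = 0):
    """Map embedding partitions to /job:ps/task:N devices round-robin."""
    return {
        f"part_{i}": f"/job:ps/task:{(start + i) % num_ps}"
        for i in range(num_partitions)
    }

placement = round_robin_placement(num_partitions=10, num_ps=10, start=7)
print(placement["part_0"])  # /job:ps/task:7
print(placement["part_1"])  # /job:ps/task:8
```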
formath commented
The reason is tf.train.Saver(sharded=False).
Changing it to sharded=True
fixed my problem.
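For context, a rough sketch of why sharded=True helps: a non-sharded Saver routes every variable into a single SaveV2 op, which then has to colocate with partitions living on different PS tasks, while a sharded Saver builds one save op per device, so no cross-device reference edge is created. (Pure-Python illustration, not the actual tf.train.Saver implementation.)

```python
# Illustration (not the real tf.train.Saver): group variables into save ops.
def build_save_groups(var_devices: dict, sharded: bool):
    """Return {save_op_key: [variable names]} for a sharded or single-op saver."""
    if not sharded:
        # One SaveV2 op: every variable feeds the same node, so that node
        # would need to colocate with variables on different PS tasks.
        return {"single": sorted(var_devices)}
    groups = {}
    for name, device in var_devices.items():
        groups.setdefault(device, []).append(name)  # one save op per device
    return groups

vars_ = {"part_0": "/job:ps/task:7", "part_1": "/job:ps/task:8"}
print(build_save_groups(vars_, sharded=True))
```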