tensorflow/recommenders-addons

`de.shadow_ops.embedding_lookup()` is not thread-safe

alionkun opened this issue · 1 comment

Describe the Problem

I built a model based on de.keras.layers.BasicEmbedding() and everything worked fine in the training phase.
But when serving the trained SavedModel with TFServing, I ran into two issues:

  1. TFServing frequently reported a tensor shape mismatch.

    error msg: Input to reshape is a tensor with 60 values, but the requested shape has 228 ....

  2. The TFServing process occasionally dumped core.

Investigation

After double-checking my code and data without making any progress, I dove into the TFRA source code and found that the following code can introduce a race condition:

from tensorflow.python.framework import ops


def embedding_lookup(
    shadow,
    ids,
    partition_strategy=None,  # pylint: disable=unused-argument
    name=None,
    validate_indices=None,  # pylint: disable=unused-argument
):
  """
  Shadow version of dynamic_embedding.embedding_lookup. It uses an existing
  shadow variable to do the embedding lookup and stores the result. No
  by-product will be introduced in this call, so it can be decorated by
  `tf.function`.

  Args:
    shadow: A ShadowVariable object.
    ids: A tensor with any shape, of the same dtype as params.key_dtype.
    partition_strategy: Not used, kept for API compatibility with `nn.embedding_lookup`.
    name: A name for the operation.
    validate_indices: Not used, kept for API compatibility with `nn.embedding_lookup`.

  Returns:
    A tensor with shape [shape of ids] + [dim], where dim is equal to the
    value dim of params, containing the values from the params tensor(s)
    for keys in ids.
  """
  ids = ops.convert_to_tensor(ids)
  if shadow.ids.dtype != ids.dtype:
    raise ValueError('{} ids is not matched with ShadowVariable with ids'
                     ' {},'.format(ids.dtype, shadow.ids.dtype))
  with ops.name_scope(name, "shadow_embedding_lookup"):
    # shadow.ids is a single ResourceVariable shared by all callers; this
    # write is the racy step when multiple threads look up concurrently.
    with ops.control_dependencies([shadow._reset_ids(ids)]):
      return shadow.read_value(do_prefetch=True)

That is, the `shadow._reset_ids(ids)` call updates ShadowVariable.ids (a ResourceVariable) every time before the actual lookup, which is not thread-safe in a multi-threaded scenario, and neither is any API that depends on `de.shadow_ops.embedding_lookup()`.
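To illustrate the interleaving, here is a minimal, simplified stand-in (not TFRA code): `shared_ids` plays the role of the single ShadowVariable.ids buffer, and the batch sizes 60 and 228 are taken from the error message above.

import threading

shared_ids = []      # stands in for the single, shared ShadowVariable.ids
results = {}

def lookup(name, ids):
    global shared_ids
    shared_ids = ids                 # like shadow._reset_ids(ids): write shared state
    # another thread may overwrite shared_ids right here
    results[name] = len(shared_ids)  # like shadow.read_value(): read shared state

t_a = threading.Thread(target=lookup, args=("A", list(range(60))))
t_b = threading.Thread(target=lookup, args=("B", list(range(228))))
t_a.start(); t_b.start()
t_a.join(); t_b.join()

# If B's write lands between A's write and A's read, request A gets 228
# embeddings back while the rest of its graph still expects 60.
print(results)

In the real ops the window sits between the `_reset_ids` write and `read_value`, so under concurrent TFServing requests one request can read embeddings sized for another request's batch (the reshape error), or crash the process.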

A similar issue in DynamicEmbedding was fixed in #24, and a quick fix can be borrowed from there.

I have fixed and verified this issue in my environment and am willing to contribute the fix.

@Lifann @rhdong Can you please take a look at this?

Thanks for the feedback, @alionkun. One way to solve this is to use `sparse_variable.lookup(keys)` instead of `embedding_lookup(shadow, ids)` in the inference phase, or when exporting the inference model. Maybe it's possible to make this internal.
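A rough sketch of that suggestion for the serving path (hedged: the names `params` and "user_embedding" and the dimensions are placeholders, and the exact de.Variable.lookup behavior should be checked against the installed TFRA version):

import tensorflow as tf
import tensorflow_recommenders_addons as tfra

de = tfra.dynamic_embedding

# The dynamic-embedding table (placeholder name/dim for illustration).
params = de.get_variable(name="user_embedding", dim=16,
                         key_dtype=tf.int64, value_dtype=tf.float32)

@tf.function(input_signature=[tf.TensorSpec([None, None], tf.int64)])
def serve(ids):
  # Inference-only path: lookup() just reads the hash table, so there is no
  # shared ids buffer to reset and therefore nothing to race on.
  flat_ids = tf.reshape(ids, [-1])
  flat_values = params.lookup(flat_ids)
  out_shape = tf.concat([tf.shape(ids), [params.dim]], axis=0)
  return tf.reshape(flat_values, out_shape)

The trade-off is that this path is read-only: missing keys should come back as the table's default values, which is usually acceptable at inference time.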