tensorflow/recommenders-addons

`de.shadow_ops.embedding_lookup()` is not thread-safe

alionkun opened this issue · 1 comment

Describe the Problem

I built a model based on de.keras.layers.BasicEmbedding() and everything worked fine in the training phase.
But when serving the trained SavedModel with TFServing, I ran into two issues:

  1. TFServing frequently reported a tensor shape mismatch.

    error msg: Input to reshape is a tensor with 60 values, but the requested shape has 228 ....

  2. The TFServing process occasionally dumped core.

Investigation

After double-checking my code and data without making any progress, I dove into the TFRA source code and found that the following code can introduce a race condition:

from tensorflow.python.framework import ops


def embedding_lookup(
    shadow,
    ids,
    partition_strategy=None,  # pylint: disable=unused-argument
    name=None,
    validate_indices=None,  # pylint: disable=unused-argument
):
  """
  Shadow version of dynamic_embedding.embedding_lookup. It uses an existing
  shadow variable to do the embedding lookup and stores the result. No
  by-product will be introduced in this call, so it can be decorated by
  `tf.function`.

  Args:
    shadow: A ShadowVariable object.
    ids: A tensor with any shape, of the same dtype as params.key_dtype.
    partition_strategy: Not used, kept for API compatibility with `nn.embedding_lookup`.
    name: A name for the operation.
    validate_indices: Not used, kept for API compatibility with `nn.embedding_lookup`.

  Returns:
    A tensor with shape [shape of ids] + [dim], where dim is equal to the
    value dim of params, containing the values from the params tensor(s)
    for keys in ids.
  """
  ids = ops.convert_to_tensor(ids)
  if shadow.ids.dtype != ids.dtype:
    raise ValueError('{} ids is not matched with ShadowVariable with ids'
                     ' {},'.format(ids.dtype, shadow.ids.dtype))
  with ops.name_scope(name, "shadow_embedding_lookup"):
    # shadow.ids is a single ResourceVariable shared by all callers; this
    # write is the racy step when multiple threads look up concurrently.
    with ops.control_dependencies([shadow._reset_ids(ids)]):
      return shadow.read_value(do_prefetch=True)

That is, the `shadow._reset_ids(ids)` call updates ShadowVariable.ids (a ResourceVariable) every time before the actual lookup, which is not thread-safe in a multi-threaded scenario, and neither is any API that depends on `de.shadow_ops.embedding_lookup()`.
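To illustrate the interleaving, here is a minimal, simplified stand-in (not TFRA code): `shared_ids` plays the role of the single ShadowVariable.ids buffer, and the batch sizes 60 and 228 are taken from the error message above.

import threading

shared_ids = []      # stands in for the single, shared ShadowVariable.ids
results = {}

def lookup(name, ids):
    global shared_ids
    shared_ids = ids                 # like shadow._reset_ids(ids): write shared state
    # another thread may overwrite shared_ids right here
    results[name] = len(shared_ids)  # like shadow.read_value(): read shared state

t_a = threading.Thread(target=lookup, args=("A", list(range(60))))
t_b = threading.Thread(target=lookup, args=("B", list(range(228))))
t_a.start(); t_b.start()
t_a.join(); t_b.join()

# If B's write lands between A's write and A's read, request A gets 228
# embeddings back while the rest of its graph still expects 60.
print(results)

In the real ops the window sits between the `_reset_ids` write and `read_value`, so under concurrent TFServing requests one request can read embeddings sized for another request's batch (the reshape error), or crash the process.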

A similar issue in DynamicEmbedding was fixed in #24, and a quick fix can be borrowed from there.

I have fixed and verified this issue in my environment and am willing to contribute the fix.

@Lifann @rhdong Can you please take a look at this?

Thanks for the feedback, @alionkun. One way to solve this is to use `sparse_variable.lookup(keys)` instead of `embedding_lookup(shadow, ids)` in the inference phase, or when exporting the inference model. Maybe it's possible to make this internal.
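A rough sketch of that suggestion for the serving path (hedged: the names `params` and "user_embedding" and the dimensions are placeholders, and the exact de.Variable.lookup behavior should be checked against the installed TFRA version):

import tensorflow as tf
import tensorflow_recommenders_addons as tfra

de = tfra.dynamic_embedding

# The dynamic-embedding table (placeholder name/dim for illustration).
params = de.get_variable(name="user_embedding", dim=16,
                         key_dtype=tf.int64, value_dtype=tf.float32)

@tf.function(input_signature=[tf.TensorSpec([None, None], tf.int64)])
def serve(ids):
  # Inference-only path: lookup() just reads the hash table, so there is no
  # shared ids buffer to reset and therefore nothing to race on.
  flat_ids = tf.reshape(ids, [-1])
  flat_values = params.lookup(flat_ids)
  out_shape = tf.concat([tf.shape(ids), [params.dim]], axis=0)
  return tf.reshape(flat_values, out_shape)

The trade-off is that this path is read-only: missing keys should come back as the table's default values, which is usually acceptable at inference time.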