Describe the Problem
I built a model based on `de.keras.layers.BasicEmbedding()` and everything worked fine during training.
But when serving the trained SavedModel with TFServing, I ran into two issues:
- TFServing frequently reported a tensor shape mismatch, with error messages like: `Input to reshape is a tensor with 60 values, but the requested shape has 228 ...`
- The TFServing process occasionally crashed with a core dump.
Resolution
After double-checking my code and data without making any progress, I dived into the TFRA source code and found that the following code can introduce a race condition:
```python
def embedding_lookup(
    shadow,
    ids,
    partition_strategy=None,  # pylint: disable=unused-argument
    name=None,
    validate_indices=None,  # pylint: disable=unused-argument
):
  """
  Shadow version of dynamic_embedding.embedding_lookup. It uses an existing
  shadow variable to do the embedding lookup and stores the result, so no
  by-product is introduced in this call and it can be decorated by
  `tf.function`.

  Args:
    shadow: A ShadowVariable object.
    ids: A tensor of any shape, with the same dtype as params.key_dtype.
    partition_strategy: Not used, for API compatibility with `nn.embedding_lookup`.
    name: A name for the operation.
    validate_indices: Not used, just for compatibility with nn.embedding_lookup.

  Returns:
    A tensor with shape [shape of ids] + [dim], where dim is equal to the
    value dim of params, containing the values from the params tensor(s)
    for keys in ids.
  """
  ids = ops.convert_to_tensor(ids)

  if shadow.ids.dtype != ids.dtype:
    raise ValueError('{} ids is not matched with ShadowVariable with ids'
                     ' {},'.format(ids.dtype, shadow.ids.dtype))

  with ops.name_scope(name, "shadow_embedding_lookup"):
    with ops.control_dependencies([shadow._reset_ids(ids)]):
      return shadow.read_value(do_prefetch=True)
```
That is, the `shadow._reset_ids(ids)` call (L250 in the source) updates `ShadowVariable.ids` (typed `ResourceVariable`) every time before the actual lookup operation, which is not thread-safe in a multi-threaded scenario; the same applies to all APIs that depend on `de.shadow_ops.embedding_lookup()`.
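To make the failure mode concrete, here is a minimal, self-contained sketch of the same pattern (hypothetical names, not the TFRA code): two threads share one scratch variable, each assigning its own ids and then reading them back, so a thread can read the batch another thread just wrote.

```python
# Repro sketch of the race pattern (hypothetical, not TFRA code): a shared
# scratch variable is assigned and then read back, like the
# shadow._reset_ids(ids) + shadow.read_value() pair in the snippet above.
import threading

import numpy as np
import tensorflow as tf

# Shared scratch slot, analogous to ShadowVariable.ids; shape=None lets us
# assign id batches of different sizes, like variable-length requests.
scratch = tf.Variable(
    initial_value=tf.zeros([0], dtype=tf.int64),
    shape=tf.TensorShape(None),
    dtype=tf.int64,
)


def lookup(ids):
    # Step 1: write this request's ids into the shared variable
    # (the counterpart of shadow._reset_ids(ids)).
    scratch.assign(ids)
    # Step 2: read them back (the counterpart of shadow.read_value()).
    # Another thread may have overwritten the variable in between.
    return scratch.read_value()


def worker(size, mismatches):
    ids = tf.constant(np.arange(size), dtype=tf.int64)
    for _ in range(1000):
        out = lookup(ids)
        if int(out.shape[0]) != size:
            mismatches.append((size, int(out.shape[0])))


mismatches = []
threads = [threading.Thread(target=worker, args=(s, mismatches))
           for s in (60, 228)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Under contention this can observe mismatches: a thread that wrote 60 ids
# reads back 228 (or vice versa), mirroring the "60 values ... requested
# shape has 228" error reported by TFServing.
print(f"observed {len(mismatches)} shape mismatches, e.g. {mismatches[:3]}")
```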
A similar issue with `DynamicEmbedding` was fixed in #24, and a quick fix can be borrowed from it. I have solved and verified this issue in my environment and am willing to contribute the fix.
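For illustration, one minimal way to make the call thread-safe is sketched below (with the hypothetical names `_lookup_lock` and `locked_embedding_lookup`; the actual fix may differ, e.g. by removing the shared scratch variable entirely): serialize the reset-and-read pair with a lock so no other thread can overwrite `shadow.ids` between the two ops.

```python
# Hedged sketch of a lock-based workaround, not the actual patch.
import threading

from tensorflow_recommenders_addons import dynamic_embedding as de

_lookup_lock = threading.Lock()  # hypothetical module-level lock


def locked_embedding_lookup(shadow, ids, name=None):
    # Holding the lock across the whole reset-and-read pair guarantees that
    # the ids this thread writes into shadow.ids are the ones read back by
    # shadow.read_value(), at the cost of serializing concurrent lookups.
    with _lookup_lock:
        return de.shadow_ops.embedding_lookup(shadow, ids, name=name)
```

This trades lookup throughput for correctness; a lock-free fix would remove the shared mutable state instead.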
@Lifann @rhdong Could you please take a look at this?