AttributeError: 'NoneType' object has no attribute 'replace'
remko opened this issue · 13 comments
Hi,
We upgraded to google-cloud-ndb==1.10.0
(App Engine Standard app on Python 3.8), and started seeing the error below. It's not a consistent error. We didn't have problems with our previous version of NDB (1.5.2).
Can you tell what happened from the backtrace?
```
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_options.py", line 89, in wrapper
    return wrapped(*pass_args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/utils.py", line 114, in wrapper
    return wrapped(*args, **new_kwargs)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/utils.py", line 146, in positional_wrapper
    return wrapped(*args, **kwds)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/model.py", line 5381, in _put
    return self._put_async(_options=kwargs["_options"]).result()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 191, in result
    self.check_success()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 138, in check_success
    raise self._exception
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 311, in _advance_tasklet
    yielded = self.generator.throw(type(error), error, traceback)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/model.py", line 5442, in put
    ds_key = yield _datastore_api.put(ds_entity, kwargs["_options"])
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 311, in _advance_tasklet
    yielded = self.generator.throw(type(error), error, traceback)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_datastore_api.py", line 402, in put
    yield _cache.global_unlock_for_write(cache_key, lock)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 311, in _advance_tasklet
    yielded = self.generator.throw(type(error), error, traceback)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_cache.py", line 654, in global_unlock_for_write
    yield _update_key(key, new_value)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 315, in _advance_tasklet
    yielded = self.generator.send(send_value)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_cache.py", line 666, in _update_key
    value = new_value(old_value)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_cache.py", line 650, in new_value
    return old_value.replace(lock, b"")
AttributeError: 'NoneType' object has no attribute 'replace'
```
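For reference, the failing line in `_cache.py` assumes the cache still holds a value for the key. A minimal sketch of just that failure mode, with the surrounding cache machinery stripped away (only the `replace` call mirrors the library; the rest is illustrative):

```python
lock = b".lock1"

def new_value(old_value):
    # Equivalent of the line at _cache.py:650 -- assumes old_value is bytes.
    return old_value.replace(lock, b"")

# Normal case: the writer's lock token is stripped from the cached value.
assert new_value(b"00.lock1") == b"00"

# If the key has vanished from the cache, the read returns None and the
# same call raises the reported AttributeError.
try:
    new_value(None)
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'replace'
```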
FYI, a few minutes later, it happened again, on a different model.
Thank you for reporting. Do you have any way of seeing whether the request this occurred on took an unusually long time to complete? Longer than, say, 32 seconds?
Also, are you using the Redis or Memcached implementation for the global cache?
This is on Redis. Both requests were very fast (200ms-500ms).
Do you know if you're hitting your memory limit with Redis or if it's configured to evict keys? Any redis process restarts around that same time?
There's clearly a fault-tolerance issue that wasn't taken into account; I'm just trying to get a handle on the nature of it to see what we're up against.
I can't find anything interesting I'm afraid:
- This is a HA Cloud Memorystore instance (fresh from today as it happens), with the default configuration, so no eviction AFAIK.
- Memory usage is very low
- At the moment, the instance is used exclusively by NDB
- There have been no restarts, no uptime changes, no noticeable peaks in events/commands around the time
- The change has been deployed for 5 hours now, and it only happened twice so far, with only 7 minutes in between. Different handlers, different models, different instances.
One thing that may or may not be relevant: we switched from Basic Cloud Memorystore to an HA Cloud Memorystore around the same time as we did the NDB upgrade (which was hours before the error occurred), so I can't tell for sure that it wouldn't have happened with 1.5.2. However, I don't think it makes a difference to NDB whether this is HA or not.
Thanks for this info. I'll see if I can reproduce this.
Are you passing anything for `strict_write` when instantiating `RedisCache`?
No, just `RedisCache(client)`.
@chrisrossi I may have some more info from another occurrence. Here is what I see:

- There are 3 parallel requests that happen simultaneously and access the cached model:

  ```
  2021-07-25T02:57:10.706243Z GET 200 448 B 229 ms
  2021-07-25T02:57:10.709093Z GET 200 448 B 296 ms
  2021-07-25T02:57:10.709312Z GET 500 402 B 384 ms
  ```

  The last request fails with the above error.
- All 3 requests immediately load the model, and keep the model in memory until the end of the request.
- All 3 requests also update the model near the end of the request (probably not relevant, but I think it's an irregularly updated property 'lastAccessedAround' that would only be updated once if the requests ran sequentially; since all 3 calls think they need to update it, they all do).
- The failing update to the model (in the last request) happens after the first request finished updating the model.
- I think you can assume that the model was not cached before these 3 calls. (From the logs, the previous access is from before the global cache timeout, and all accesses should be logged.)
I looked at the other 2 occurrences, and they all exhibit the same pattern as above. One of them happens on only 2 requests that access the model:

```
2021-07-23T14:43:40.863671Z GET 200 701 B 317 ms
2021-07-23T14:43:40.864445Z GET 500 402 B 436 ms
```
- The model was (extremely likely) not cached before the first call
- Both requests happen simultaneously, read the model at start of request, keep it in memory, try to write at the end of the request, and the second fails.
So, it seems you can reproduce it with 2 well-timed requests.
The other occurrence even had 4 simultaneous requests accessing the model in flight, of which (only) the last failed.
Thank you, this is helpful!
I think I've managed to find the root cause of this. It is a bug with concurrent writes and the global cache:
Thread 1 and Thread 2 are both attempting to modify the same entity at the same time:

- Thread 1: Calls `global_lock_for_write`, writes `00.lock1`
- Thread 2: Calls `global_lock_for_write`, watches `00.lock1`
- Thread 1: Calls `global_unlock_for_write`, reads `00.lock1` for the old lock value, computes a new lock value of `00`
- Thread 2: Completes CAS operation, writing `00.lock1.lock2`
- Thread 1: Deletes the key, based on the computed `00` value, which means there are no locks left.
- Thread 2: Calls `global_unlock_for_write`, gets `None` for the old value; the above exception is raised.
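The interleaving above can be sketched as follows, using a plain dict in place of Redis and an explicit hook in place of real thread scheduling. The `00` prefix plus appended lock tokens mirrors the value format described; the helper names are hypothetical, not the library's actual API:

```python
cache = {}
PREFIX = b"00"

def lock_for_write(key, lock):
    # Append this writer's lock token, creating the value if absent.
    cache[key] = cache.get(key, PREFIX) + lock

def unlock_for_write(key, lock, interleave=None):
    old_value = cache.get(key)
    # The failing line: raises AttributeError when old_value is None.
    new_value = old_value.replace(lock, b"")
    if interleave:
        interleave()  # another writer acts between our read and our write
    if new_value == PREFIX:
        # "No locks left" -- decided from a now-stale read.
        cache.pop(key, None)
    else:
        cache[key] = new_value

key, lock1, lock2 = b"entity", b".lock1", b".lock2"

lock_for_write(key, lock1)  # Thread 1: value is 00.lock1

# Thread 2's CAS completes while Thread 1 is between its read and its delete,
# so Thread 1 deletes the key even though lock2 is still outstanding.
unlock_for_write(key, lock1, interleave=lambda: lock_for_write(key, lock2))

assert key not in cache       # Thread 2's lock was wiped out
try:
    unlock_for_write(key, lock2)  # Thread 2: old_value is None
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'replace'
```

Run sequentially (Thread 2 locking only after Thread 1 fully unlocks), the same calls complete cleanly; the error only appears with this interleaving.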