Possibility to get into perpetual "DEAD cache hit" - stale result state
Closed · 1 comment
I'm not positive this is a bug, but it has happened several times in development and I'm worried it could happen in production. Basically, something causes Celery to choke, and it discards or loses the task that updates the cache. From then on, the Job perpetually returns the stale result ("DEAD cache hit") until the cache is cleared.
I'm not familiar enough with the framework to know what the alternative would be, but it seems problematic: the "DEAD cache hit" message is only logged at DEBUG level, so one would be unlikely to notice that this had happened. If Celery or the task chokes on something and loses or intentionally discards a cacheback task, the app would never refresh the Job until the cache item is evicted from memory (which could be never).
I think this is a real issue - I've seen similar symptoms myself but haven't got around to investigating properly. The problem is that, to prevent dog-piling, the cache is set to a "dead" value to indicate that a refresh is in progress. However, if that refresh task dies for some reason then we get into a limbo state as you describe, where the cache is never refreshed.
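To make the limbo state concrete, here is a minimal in-memory sketch. It is not cacheback's actual code: `FakeCache`, `get_with_refresh`, the `DEAD` marker and the `(expiry, data, status)` tuple layout are all hypothetical names chosen for illustration, and the "queue" is just a callable that may silently drop the task.

```python
DEAD = "DEAD"    # hypothetical marker meaning "a refresh is supposedly in progress"
FRESH = "FRESH"  # hypothetical marker for a normally cached value


class FakeCache:
    """In-memory stand-in for a cache backend; entries never expire on their own."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


def get_with_refresh(cache, key, now, enqueue_refresh):
    """Return (status, data) for key, marking stale entries as DEAD."""
    entry = cache.get(key)
    if entry is None:
        return ("MISS", None)
    expiry, data, status = entry
    if status == DEAD:
        # Someone else is assumed to be refreshing; serve stale data.
        return ("DEAD cache hit", data)
    if now > expiry:
        # Mark as DEAD to prevent dog-piling, then ask for one refresh.
        cache.set(key, (expiry, data, DEAD))
        enqueue_refresh(key)
        return ("STALE", data)
    return ("HIT", data)


# Simulate a broker that accepts the task but never runs it.
dropped = []
cache = FakeCache()
cache.set("job", (100, "old", FRESH))

first = get_with_refresh(cache, "job", 150, dropped.append)
later = get_with_refresh(cache, "job", 10_000, dropped.append)
print(first, later, dropped)
```

In this model, the first stale read enqueues exactly one refresh. Because that refresh never completes, every later read takes the `DEAD` branch and nothing ever enqueues another one.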
I will alter the behaviour to use a timeout to avoid getting into limbo. Then, if Celery chokes, the refresh job will be triggered again once the timeout elapses.
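One way such a timeout could work, sketched under stated assumptions: the dead marker carries its own deadline, and a reader that finds the deadline passed re-triggers the refresh. This is an illustration of the idea only, not the change that was actually made; `fetch`, `refresh_timeout` and the `(expiry, data, dead_until)` tuple layout are hypothetical, and a plain dict stands in for the cache.

```python
def fetch(cache, key, now, enqueue_refresh, refresh_timeout=60):
    """Return (status, data); a dead marker is only honoured until it times out."""
    entry = cache.get(key)
    if entry is None:
        enqueue_refresh(key)
        return ("MISS", None)

    expiry, data, dead_until = entry
    if now <= expiry:
        return ("HIT", data)

    if dead_until is not None and now <= dead_until:
        # A refresh was requested recently; serve stale data and wait.
        return ("DEAD cache hit", data)

    # Either no refresh was requested yet, or the previous one timed out.
    # (Re-)mark the entry as dead with a new deadline and enqueue again.
    cache[key] = (expiry, data, now + refresh_timeout)
    enqueue_refresh(key)
    return ("STALE", data)


# Simulate a broker that loses every task it is given.
enqueued = []
cache = {"job": (100, "old", None)}

r1 = fetch(cache, "job", 150, enqueued.append)  # stale: first refresh requested
r2 = fetch(cache, "job", 180, enqueued.append)  # within the 60s window: dead hit
r3 = fetch(cache, "job", 211, enqueued.append)  # window passed: refresh requested again
print(r1, r2, r3, enqueued)
```

The dog-piling protection is kept, since at most one refresh is enqueued per `refresh_timeout` window, but a lost task now costs only one window of stale data instead of leaving the key stuck indefinitely.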