Queue stuck entries with single node system
sbrossie opened this issue · 4 comments
In situations where some notification entries are left in the `IN_PROCESSING` state, we typically have a reaper mechanism to re-dispatch such entries. However, in a single-node scenario we cannot re-dispatch to another node, and currently the code just prints a warning. Note that such situations could arise as a result of a shutdown that timed out or a node that was killed abruptly.
Such entries will never be picked up, since the query for ready entries only looks at the `AVAILABLE` state.
One proposal would be to have the reaper reset the state to `AVAILABLE` - in a multi-node scenario, this is effectively what it does when it re-dispatches to another node. The worst case is that such a notification gets dispatched twice, which is an expected edge case anyway.
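That reset idea can be sketched roughly as follows. This is a hypothetical illustration, not the actual `org.killbill.queue` API: the class, field, and method names are assumptions.

```java
// Hypothetical sketch (names are assumptions, not the actual org.killbill.queue
// API). The reaper flips stuck IN_PROCESSING rows back to AVAILABLE so that the
// regular dispatch query, which only selects AVAILABLE rows, can see them again.
public class ReaperResetSketch {
    enum State { AVAILABLE, IN_PROCESSING }

    static class Entry {
        State processingState;
        String processingOwner;

        Entry(State state, String owner) {
            this.processingState = state;
            this.processingOwner = owner;
        }
    }

    // Reset every stuck entry. At worst, a notification that is actually still
    // being processed gets dispatched twice - the expected edge case above.
    static int resetStuckEntries(Entry[] entries) {
        int reset = 0;
        for (Entry e : entries) {
            if (e.processingState == State.IN_PROCESSING) {
                e.processingState = State.AVAILABLE;
                e.processingOwner = null;
                reset++;
            }
        }
        return reset;
    }
}
```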
Actually, one issue is if the system gets really late: it would make no sense to re-dispatch (reap) entries on the same node, so the question is how we can differentiate between the 2 use cases:
- A few entries were left hanging in the `IN_PROCESSING` state and need to be reaped (case 1)
- The system is late and we potentially have a few (or many) late entries (case 2)

The reaper query would pick up entries where `processing_state` = `IN_PROCESSING` (case 1) but would also pick up entries where `processing_state` is null (case 2), so we have no way to discriminate unless we change the query.
Proposal: when we detect `entryIsBeingProcessedByThisNode`, we could check that the status is `IN_PROCESSING` and only re-dispatch (reap) those entries.
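A minimal sketch of that check, under stated assumptions: only `entryIsBeingProcessedByThisNode` comes from the discussion above; the method signature and string comparison are illustrative, not the actual Kill Bill code.

```java
// Hypothetical sketch of the proposed filter (names are assumptions). Only
// entries that are both owned by this node and explicitly IN_PROCESSING are
// reaped (case 1); rows whose processing_state is null are merely late
// (case 2) and are left alone.
public class ReapFilterSketch {
    static boolean shouldReap(String processingState, String processingOwner, String thisNode) {
        // Constant-first equals() so a null processing_state (case 2) is skipped.
        return "IN_PROCESSING".equals(processingState)
                && thisNode.equals(processingOwner);
    }
}
```

With this predicate, the late-entries case falls through naturally because `processing_state` is null for rows that were never claimed by any node.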
For reference, this is what such an entry looks like for case 1:

```
*************************** 1. row ***************************
                record_id: 71560
               class_name: org.killbill.billing.invoice.notification.NextBillingDateNotificationKey
               event_json: {"uuidKey":null,"uuidKeys":["abfe39d2-1f98-41fc-92f9-66891dd66835"],"targetDate":"2023-11-10T09:08:03.000Z","isDryRunForInvoiceNotification":false,"isRescheduled":false}
               user_token: 066bcb36-3f3e-4964-b7fc-2cb7fe1a7632
             created_date: 2023-11-09 09:08:05
           creating_owner: ip-172-31-64-209.us-west-2.compute.internal
         processing_owner: ip-172-31-64-209.us-west-2.compute.internal
processing_available_date: 2023-11-10 09:13:13
         processing_state: IN_PROCESSING
              error_count: 0
              search_key1: 16224
              search_key2: 5859
               queue_name: invoice-service:next-billing-date-queue
           effective_date: 2023-11-10 09:08:03
        future_user_token: 84b8497e-44cc-49cf-bff5-051c4fc066e5
```
I have built the killbill war from the branch containing this fix.
Could you please let me know what tests you ran to confirm the fix?
I need to do the same.