killbill/killbill-commons

Queue stuck entries with single node system

sbrossie opened this issue · 4 comments

In situations where some notification entries are left in the IN_PROCESSING state, we typically have a reaper mechanism to re-dispatch such entries. However, in a single-node scenario, we cannot re-dispatch to another node, and currently the code just prints a warning. Note that such situations could arise as a result of a shutdown that timed out or a node that was killed abruptly.

Such entries will not be picked up since the query for ready entries only looks at AVAILABLE state.

One proposal would be to have the reaper reset the state to AVAILABLE - in the multi-node scenario, this is what it does when it re-dispatches to another node. The worst case is that such a notification gets dispatched twice - which is an expected edge case anyway.
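A minimal sketch of that reset, assuming hypothetical names (this is not the actual killbill-commons API): the reaper moves a stuck IN_PROCESSING entry back to AVAILABLE so the normal dispatch query can pick it up again, which is why dispatch must tolerate an occasional duplicate delivery.

```java
// Hypothetical sketch of the proposed reaper reset; class and field names
// are illustrative, not taken from killbill-commons.
public class ReaperResetSketch {

    enum State { AVAILABLE, IN_PROCESSING }

    static class Entry {
        State processingState;
        String processingOwner;

        Entry(final State state, final String owner) {
            this.processingState = state;
            this.processingOwner = owner;
        }
    }

    // Reset a stuck entry so the regular "ready entries" query
    // (which only looks at AVAILABLE) can pick it up again.
    static void reap(final Entry entry) {
        if (entry.processingState == State.IN_PROCESSING) {
            entry.processingState = State.AVAILABLE;
            entry.processingOwner = null; // no longer claimed by any node
        }
    }

    public static void main(final String[] args) {
        final Entry stuck = new Entry(State.IN_PROCESSING, "node-1");
        reap(stuck);
        System.out.println(stuck.processingState); // AVAILABLE
    }
}
```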

However, one issue arises if the system gets really late: it would make no sense to re-dispatch (reap) entries on the same node, so the question is how we can differentiate between the two use cases:

  1. A few entries were left hanging in the IN_PROCESSING state and need to be reaped
  2. The system is running late and we potentially have a few (or many) late entries

The reaper query would pick up entries where processing_state=IN_PROCESSING (case 1) but would also pick up entries where processing_state is null (case 2), so we have no way to discriminate unless we change the query.

Proposal: When we detect that an entry is being processed by this node (entryIsBeingProcessedByThisNode), we could check that its status is IN_PROCESSING and only re-dispatch (reap) those entries.
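The proposed check could look roughly like this (a sketch with hypothetical names, not the actual Kill Bill code): only entries that this node claimed and left in IN_PROCESSING are reaped; late entries that were never claimed (processing_state is null) are left for the normal dispatch loop.

```java
// Hypothetical sketch of the proposed discrimination between the two cases;
// names are illustrative only.
public class ReaperSketch {

    enum ProcessingState { AVAILABLE, IN_PROCESSING }

    static final String THIS_NODE = "node-1";

    // Returns true when the reaper should reset the entry to AVAILABLE.
    static boolean shouldReap(final ProcessingState state, final String processingOwner) {
        // Case 1: claimed by this node and left hanging (e.g. after a
        // timed-out shutdown) -> reap it.
        // Case 2: system is merely late (state is null, never claimed)
        // -> do NOT reap; the regular "ready entries" query handles it.
        return state == ProcessingState.IN_PROCESSING
                && THIS_NODE.equals(processingOwner);
    }

    public static void main(final String[] args) {
        System.out.println(shouldReap(ProcessingState.IN_PROCESSING, "node-1")); // stuck on this node -> true
        System.out.println(shouldReap(null, null));                              // late, unclaimed -> false
        System.out.println(shouldReap(ProcessingState.IN_PROCESSING, "node-2")); // claimed by another node -> false
    }
}
```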

For reference, this is what such an entry looks like for case 1:

*************************** 1. row ***************************
                record_id: 71560
               class_name: org.killbill.billing.invoice.notification.NextBillingDateNotificationKey
               event_json: {"uuidKey":null,"uuidKeys":["abfe39d2-1f98-41fc-92f9-66891dd66835"],"targetDate":"2023-11-10T09:08:03.000Z","isDryRunForInvoiceNotification":false,"isRescheduled":false}
               user_token: 066bcb36-3f3e-4964-b7fc-2cb7fe1a7632
             created_date: 2023-11-09 09:08:05
           creating_owner: ip-172-31-64-209.us-west-2.compute.internal
         processing_owner: ip-172-31-64-209.us-west-2.compute.internal
processing_available_date: 2023-11-10 09:13:13
         processing_state: IN_PROCESSING
              error_count: 0
              search_key1: 16224
              search_key2: 5859
               queue_name: invoice-service:next-billing-date-queue
           effective_date: 2023-11-10 09:08:03
        future_user_token: 84b8497e-44cc-49cf-bff5-051c4fc066e5

I have built the killbill war from the branch containing this fix.
Could you please let me know what tests you ran to confirm the fix?
I need to do the same.