CDLUC3/ezid

[EPIC] Redesign Queue System for Improved Error Handling, Retry Logic, and Monitoring


Background

As described in #696, the current EZID queue system relies on daemons and background service scripts to execute tasks asynchronously. While functional, the system lacks robust error handling and retry mechanisms, so registrations can fail permanently without being logged or triggering notifications, and without the failure being surfaced to end users in the UI or in reports.

Objective

Redesign the queue system to implement improved retry logic and error logging, with notifications driven by the logged errors. Update the UI and reports to indicate task failures to end users.

Features

1. Retry Mechanism

  • Implement a configurable retry system for failed tasks
  • Define retry attempts, intervals, and backoff strategies for each category of failed task
  • Distinguish between temporary and permanent failures and handle appropriately
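
One way the configurable policy could look, as a minimal Python sketch. `RetryPolicy`, `POLICIES`, and every numeric value below are hypothetical and not taken from the EZID codebase. The sketch assumes exponential backoff with an upper bound; other strategies would fit the same shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical per-queue retry settings (illustrative names and defaults)."""

    max_attempts: int = 5          # total attempts before giving up
    base_delay_s: float = 60.0     # wait after the first failed attempt
    backoff_factor: float = 2.0    # multiplier applied after each further failure
    max_delay_s: float = 3600.0    # upper bound on any single wait

    def delay_for(self, attempts_made: int) -> float:
        """Seconds to wait before the next try, given attempts already made (>= 1)."""
        delay = self.base_delay_s * self.backoff_factor ** (attempts_made - 1)
        return min(delay, self.max_delay_s)

    def should_retry(self, attempts_made: int, is_transient: bool) -> bool:
        """Only transient failures are retried, and only while attempts remain."""
        return is_transient and attempts_made < self.max_attempts


# Illustrative per-queue configuration; real values would come from the
# review of failure scenarios listed under Dependencies.
POLICIES = {
    "crossref": RetryPolicy(max_attempts=6, base_delay_s=300.0),
    "datacite": RetryPolicy(),
    "search_indexer": RetryPolicy(max_attempts=3, base_delay_s=30.0, max_delay_s=600.0),
}
```

With the default policy, `POLICIES["datacite"].delay_for(3)` is 240 seconds, and permanent failures are never retried regardless of the attempt count.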

2. Error Logging

  • Improve error logging to capture detailed information about failures
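
As a rough illustration of what "detailed information" might mean in practice, the sketch below emits one JSON object per failure using only the Python standard library. The field names (`queue`, `identifier`, `operation`, `status`, `attempt`) and the helper `log_task_failure` are assumptions for illustration, not existing EZID code.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object for indexing and filtering."""

    # Hypothetical context fields, attached by callers through `extra=`.
    CONTEXT_FIELDS = ("queue", "identifier", "operation", "status", "attempt")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def log_task_failure(logger, *, queue, identifier, operation, status, attempt, exc=None):
    """Log one failed task with enough context to find it later."""
    logger.error(
        "task failed",
        extra={
            "queue": queue,
            "identifier": identifier,
            "operation": operation,
            "status": status,
            "attempt": attempt,
        },
        exc_info=exc,
    )
```

One JSON object per line keeps the records easy to ship to a search index and to aggregate by queue, status, or identifier.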

3. Queue Health Monitoring

  • Implement monitoring of queue health
  • Set up alerts for abnormal queue behavior or high failure rates, with rate limiting and aggregation to prevent excessive notifications
  • Create an OpenSearch dashboard to visualize and analyze task failures
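
The rate limiting and aggregation mentioned above could work roughly as follows. This is a sketch under stated assumptions: `AlertAggregator`, its thresholds, and the summary text are all hypothetical, and the caller passes the current time explicitly so the behavior is easy to test.

```python
from collections import Counter


class AlertAggregator:
    """Collect failures and emit at most one summary per queue per window."""

    def __init__(self, window_s: float = 900.0, min_failures: int = 10):
        self.window_s = window_s          # minimum seconds between alerts per queue
        self.min_failures = min_failures  # failures needed before alerting at all
        self._pending = Counter()         # queue name -> failures since last alert
        self._last_sent = {}              # queue name -> time of last alert

    def record_failure(self, queue: str, now: float):
        """Count one failure; return a summary string if an alert is due, else None."""
        self._pending[queue] += 1
        count = self._pending[queue]
        last = self._last_sent.get(queue)
        in_cooldown = last is not None and now - last < self.window_s
        if count < self.min_failures or in_cooldown:
            return None
        self._last_sent[queue] = now
        self._pending[queue] = 0
        return f"{queue}: {count} failed tasks since last alert"
```

Failures that arrive during the cooldown are not dropped; they are counted and reported in the next summary for that queue.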

4. UI and Reporting Changes

  • Update the record UI to indicate when a registration has failed so an end user can identify and intervene
  • Update batch reporting to allow for identification of failed registrations
  • Create an ongoing job that identifies failed registrations, similar to the link checker
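
The core of such a reporting job might be a grouping step like the one below. It is only a sketch: `QueueEntry` is a hypothetical stand-in for whatever the queue tables expose, and which statuses count as "failed" is an assumption that the failure review would need to settle.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QueueEntry:
    """Hypothetical, simplified view of one queued task."""

    identifier: str
    owner: str
    queue: str
    status: str


# Assumption: only FAILURE is treated as a failed registration here.
FAILED_STATUSES = {"FAILURE"}


def failed_registrations_by_owner(entries):
    """Group failed tasks by owner so each user can be shown their own failures."""
    report = defaultdict(list)
    for entry in entries:
        if entry.status in FAILED_STATUSES:
            report[entry.owner].append((entry.identifier, entry.queue))
    return dict(report)
```

The same grouping could feed both the batch report and a periodic per-user notification.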

Success Criteria

  • Reduced number of unaddressed permanent failures
  • Improved visibility into task state
  • Faster resolution of issues through notifications and better tooling

Dependencies

  • Review of current failure scenarios and error patterns
  • Definition of retry policies and notification levels
  • Logging changes needed to support new monitoring

Sketch of current task queue design:

flowchart TD
    Start([Start]) --> Queue[Task added to Queue]
    Queue --> AsyncProcess[Async Processing Daemon]
    AsyncProcess --> CheckStatus{Check Status}
    CheckStatus -->|UNSUBMITTED| Process[Process Task]
    CheckStatus -->|UNCHECKED| Verify[Verify Submission]
    Process --> Operation{Operation Type}
    Operation -->|Create| CreateOp[Create Operation]
    Operation -->|Update| UpdateOp[Update Operation]
    Operation -->|Delete| DeleteOp[Delete Operation]
    CreateOp --> AttemptSubmit[Attempt to Submit]
    UpdateOp --> AttemptSubmit
    DeleteOp --> AttemptSubmit
    AttemptSubmit --> SubmitResult{Submission Result}
    SubmitResult -->|Success| MarkSubmitted[Mark as SUBMITTED]
    SubmitResult -->|Ignored| MarkIgnored[Mark as IGNORED]
    SubmitResult -->|Failure| HandleFailure{Failure Type}
    HandleFailure -->|Temporary| MarkTransientFailure[Mark as TRANSIENT_FAILURE]
    HandleFailure -->|Permanent| MarkFailure[Mark as FAILURE]
    MarkSubmitted --> Verify
    Verify --> VerifyResult{Verify Result}
    VerifyResult -->|Success| MarkSuccess[Mark as SUCCESS]
    VerifyResult -->|Warning| MarkWarning[Mark as WARNING]
    VerifyResult -->|Failure| HandleFailure
    MarkSuccess --> UpdateStatus[Update Task Status]
    MarkWarning --> UpdateStatus
    MarkIgnored --> UpdateStatus
    MarkTransientFailure --> UpdateStatus
    MarkFailure --> UpdateStatus
    UpdateStatus --> NextTask[Move to Next Task]
    NextTask --> CheckStatus
    CheckStatus -->|All Processed| Sleep[Sleep]
    Sleep --> AsyncProcess

    subgraph QueueTypes[Queue Types]
        BinderQueue[Binder Queue]
        CrossrefQueue[Crossref Queue]
        DataciteQueue[Datacite Queue]
        SearchIndexerQueue[Search Indexer Queue]
    end
    Queue --> QueueTypes

    subgraph StatusTypes[Status Types]
        UNSUBMITTED[UNSUBMITTED]
        UNCHECKED[UNCHECKED]
        SUBMITTED[SUBMITTED]
        WARNING[WARNING]
        FAILURE[FAILURE]
        TRANSIENT_FAILURE[TRANSIENT_FAILURE]
        IGNORED[IGNORED]
        SUCCESS[SUCCESS]
    end

Possible redesign:

flowchart TD
    Start([Start]) --> Queue[Task added to Queue]
    Queue --> AsyncProcess[Async Processing Daemon]
    AsyncProcess --> CheckStatus{Check Status}
    CheckStatus -->|UNSUBMITTED or Retry| Process[Process Task]
    CheckStatus -->|UNCHECKED| Verify[Verify Submission]
    Process --> Operation{Operation Type}
    Operation -->|Create| CreateOp[Create Operation]
    Operation -->|Update| UpdateOp[Update Operation]
    Operation -->|Delete| DeleteOp[Delete Operation]
    CreateOp --> AttemptSubmit[Attempt to Submit]
    UpdateOp --> AttemptSubmit
    DeleteOp --> AttemptSubmit
    AttemptSubmit --> SubmitResult{Submission Result}
    SubmitResult -->|Success| MarkSubmitted[Mark as SUBMITTED]
    SubmitResult -->|Ignored| MarkIgnored[Mark as IGNORED]
    SubmitResult -->|Failure| HandleFailure{Failure Type}
    HandleFailure -->|Temporary| RetryMechanism[Retry Mechanism]
    HandleFailure -->|Permanent| MarkFailure[Mark as FAILURE]
    MarkSubmitted --> Verify
    Verify --> VerifyResult{Verify Result}
    VerifyResult -->|Success| MarkSuccess[Mark as SUCCESS]
    VerifyResult -->|Warning| MarkWarning[Mark as WARNING]
    VerifyResult -->|Failure| HandleFailure
    CheckStatus -->|All Processed| Sleep[Sleep]
    Sleep --> AsyncProcess

    subgraph RetryMechanism [Retry Mechanism]
        CheckRetryCount{Retry Count < Max}
        CheckRetryCount -->|Yes| ScheduleRetry[Schedule Retry]
        CheckRetryCount -->|No| MarkMaxRetriesReached[Mark Max Retries Reached]
        ScheduleRetry --> MarkTransientFailure[Mark as TRANSIENT_FAILURE]
    end

    subgraph Logging [Logging]
        LogSuccess[Log Success]
        LogWarning[Log Warning]
        LogIgnored[Log Ignored]
        LogRetryAttempt[Log Retry Attempt]
        LogFailure[Log Failure]
        LogMaxRetriesReached[Log Max Retries Reached]
    end

    subgraph StatusUpdate [Status Update]
        UpdateStatus[Update Task Status]
        NextTask[Move to Next Task]
    end

    MarkSuccess --> LogSuccess
    MarkWarning --> LogWarning
    MarkIgnored --> LogIgnored
    MarkTransientFailure --> LogRetryAttempt
    MarkFailure --> LogFailure
    MarkMaxRetriesReached --> LogMaxRetriesReached

    LogSuccess --> UpdateStatus
    LogWarning --> UpdateStatus
    LogIgnored --> UpdateStatus
    LogRetryAttempt --> UpdateStatus
    LogFailure --> UpdateStatus
    LogMaxRetriesReached --> UpdateStatus

    UpdateStatus --> NextTask
    NextTask --> CheckStatus

    subgraph QueueTypes [Queue Types]
        CrossrefQueue[Crossref Queue]
        DataciteQueue[Datacite Queue]
        SearchIndexerQueue[Search Indexer Queue]
    end
    Queue --> QueueTypes

    subgraph StatusTypes [Status Types]
        UNSUBMITTED[UNSUBMITTED]
        UNCHECKED[UNCHECKED]
        SUBMITTED[SUBMITTED]
        WARNING[WARNING]
        FAILURE[FAILURE]
        TRANSIENT_FAILURE[TRANSIENT_FAILURE]
        IGNORED[IGNORED]
        SUCCESS[SUCCESS]
    end

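
The Failure Type and Retry Mechanism branches of the redesign can be read as a small state transition. The Python sketch below is one possible interpretation, not existing EZID code: `Task` and `handle_failure` are hypothetical names, and `MAX_RETRIES_REACHED` is an assumed status, since the diagram has a "Mark Max Retries Reached" step but no matching entry under Status Types.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UNSUBMITTED = "UNSUBMITTED"
    TRANSIENT_FAILURE = "TRANSIENT_FAILURE"
    FAILURE = "FAILURE"
    # Assumption: not listed under Status Types in the diagram.
    MAX_RETRIES_REACHED = "MAX_RETRIES_REACHED"


@dataclass
class Task:
    """Hypothetical minimal task record."""

    identifier: str
    status: Status = Status.UNSUBMITTED
    retry_count: int = 0


def handle_failure(task: Task, is_transient: bool, max_retries: int = 3) -> Status:
    """Apply the diagram's failure branches and return the task's new status."""
    if not is_transient:
        # Permanent failure: no retry is scheduled.
        task.status = Status.FAILURE
    elif task.retry_count < max_retries:
        # Transient failure with retries left: schedule another attempt.
        task.retry_count += 1
        task.status = Status.TRANSIENT_FAILURE
    else:
        # Transient failure, but the retry budget is exhausted.
        task.status = Status.MAX_RETRIES_REACHED
    return task.status
```

In this reading, a task in `TRANSIENT_FAILURE` is picked up again through the "UNSUBMITTED or Retry" edge, while `FAILURE` and `MAX_RETRIES_REACHED` are terminal and are the states that should reach logs, alerts, and the UI.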