[EPIC] Redesign Queue System for Improved Error Handling, Retry Logic, and Monitoring
adambuttrick commented
Background
As described in #696, the current EZID queue system relies on daemons and background service scripts to execute tasks asynchronously. While functional, the system lacks robust error handling and retry mechanisms, so registrations can fail permanently without being logged or triggering notifications. End users see no indication of these failures in the UI or in reports.
Objective
Redesign the queue system to add retry logic and error logging, with notifications driven by the logged errors. Update the UI and reports to indicate task failures to end users.
Features
1. Retry Mechanism
- Implement a configurable retry system for failed tasks
- Define retry attempts, intervals, and backoff strategies for each category of failed task
- Distinguish between temporary and permanent failures and handle appropriately
2. Error Logging
- Improve error logging to capture detailed information about failures
3. Queue Health Monitoring
- Implement monitoring of queue health
- Set up alerts for abnormal queue behavior or high failure rates, with rate limiting and aggregation to prevent excessive notifications
- Create an OpenSearch dashboard to visualize and analyze task failures
4. UI and Reporting Changes
- Update the record UI to indicate when a registration has failed, so that an end user can identify the failure and intervene
- Update batch reporting to allow for identification of failed registrations
- Create an ongoing job to identify failed registrations, similar to the link checker
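To make the retry feature above more concrete, here is a minimal sketch of what a configurable, per-queue retry policy with exponential backoff might look like. Everything in it is an assumption for discussion: `RetryPolicy`, its field names, the `POLICIES` mapping, and all of the numeric values are hypothetical and not taken from the EZID codebase.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical per-queue retry settings; names and defaults are illustrative."""

    max_attempts: int = 5          # retries allowed before giving up
    base_delay_s: float = 60.0     # wait before the first retry
    backoff_factor: float = 2.0    # multiplier applied on each further retry
    max_delay_s: float = 3600.0    # upper bound on any single wait
    jitter: float = 0.1            # +/- fraction of the delay, to spread retries out

    def delay_for(self, attempt: int, rng: Optional[random.Random] = None) -> float:
        """Return the seconds to wait before retry number `attempt` (1-based)."""
        if attempt < 1:
            raise ValueError("attempt must be >= 1")
        delay = min(
            self.base_delay_s * self.backoff_factor ** (attempt - 1),
            self.max_delay_s,
        )
        if self.jitter:
            source = rng or random
            delay *= 1 + source.uniform(-self.jitter, self.jitter)
        return delay

    def should_retry(self, attempts_so_far: int, transient: bool) -> bool:
        """Only transient failures are retried, and only while attempts remain."""
        return transient and attempts_so_far < self.max_attempts


# Assumed example values, one policy per queue type; to be tuned after the
# review of current failure scenarios.
POLICIES = {
    "crossref": RetryPolicy(max_attempts=6, base_delay_s=300.0),
    "datacite": RetryPolicy(max_attempts=5, base_delay_s=60.0),
    "search_indexer": RetryPolicy(max_attempts=3, base_delay_s=10.0, max_delay_s=300.0),
}
```

Keeping the policy as plain data would let retry attempts, intervals, and backoff be adjusted per category of task without touching the daemon loop. How a failure gets classified as transient or permanent is left open here.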
Success Criteria
- Reduced number of unaddressed permanent failures
- Improved visibility into task state
- Faster resolution of issues through notifications and better tooling
Dependencies
- Review of current failure scenarios and error patterns
- Define retry policies and notification levels
- Logging changes needed to support new monitoring
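As a starting point for the notification work, the rate limiting and aggregation mentioned under Queue Health Monitoring could be sketched roughly as follows. This is an assumption-laden illustration: `AlertAggregator`, the 15-minute window, and the message format are all hypothetical, and actual delivery (email, OpenSearch alerting, etc.) is out of scope.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Optional


class AlertAggregator:
    """Counts failures per key and emits at most one summary per time window."""

    def __init__(
        self,
        window_s: float = 900.0,  # assumed quiet period between alerts
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_s = window_s
        self._clock = clock
        self._pending: Dict[str, int] = defaultdict(int)
        self._last_sent: Dict[str, float] = {}

    def record_failure(self, key: str) -> Optional[str]:
        """Count one failure; return a summary message if an alert is due, else None."""
        self._pending[key] += 1
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return None  # inside the quiet window: keep aggregating
        count = self._pending.pop(key)
        self._last_sent[key] = now
        return f"{key}: {count} failure(s) since last alert"
```

Calling `record_failure("datacite")` on every failed task would produce an immediate first alert, then one aggregated message per window. One limitation of this sketch is that failures counted during a quiet window are only reported when the next failure arrives, so a real implementation would probably also need a periodic flush.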
adambuttrick commented
Sketch of current task queue design:
flowchart TD
Start([Start]) --> Queue[Task added to Queue]
Queue --> AsyncProcess[Async Processing Daemon]
AsyncProcess --> CheckStatus{Check Status}
CheckStatus -->|UNSUBMITTED| Process[Process Task]
CheckStatus -->|UNCHECKED| Verify[Verify Submission]
Process --> Operation{Operation Type}
Operation -->|Create| CreateOp[Create Operation]
Operation -->|Update| UpdateOp[Update Operation]
Operation -->|Delete| DeleteOp[Delete Operation]
CreateOp --> AttemptSubmit[Attempt to Submit]
UpdateOp --> AttemptSubmit
DeleteOp --> AttemptSubmit
AttemptSubmit --> SubmitResult{Submission Result}
SubmitResult -->|Success| MarkSubmitted[Mark as SUBMITTED]
SubmitResult -->|Ignored| MarkIgnored[Mark as IGNORED]
SubmitResult -->|Failure| HandleFailure{Failure Type}
HandleFailure -->|Temporary| MarkTransientFailure[Mark as TRANSIENT_FAILURE]
HandleFailure -->|Permanent| MarkFailure[Mark as FAILURE]
MarkSubmitted --> Verify
Verify --> VerifyResult{Verify Result}
VerifyResult -->|Success| MarkSuccess[Mark as SUCCESS]
VerifyResult -->|Warning| MarkWarning[Mark as WARNING]
VerifyResult -->|Failure| HandleFailure
MarkSuccess --> UpdateStatus[Update Task Status]
MarkWarning --> UpdateStatus
MarkIgnored --> UpdateStatus
MarkTransientFailure --> UpdateStatus
MarkFailure --> UpdateStatus
UpdateStatus --> NextTask[Move to Next Task]
NextTask --> CheckStatus
CheckStatus -->|All Processed| Sleep[Sleep]
Sleep --> AsyncProcess
subgraph QueueTypes[Queue Types]
BinderQueue[Binder Queue]
CrossrefQueue[Crossref Queue]
DataciteQueue[Datacite Queue]
SearchIndexerQueue[Search Indexer Queue]
end
Queue --> QueueTypes
subgraph StatusTypes[Status Types]
UNSUBMITTED[UNSUBMITTED]
UNCHECKED[UNCHECKED]
SUBMITTED[SUBMITTED]
WARNING[WARNING]
FAILURE[FAILURE]
TRANSIENT_FAILURE[TRANSIENT_FAILURE]
IGNORED[IGNORED]
SUCCESS[SUCCESS]
end
adambuttrick commented
Possible redesign:
flowchart TD
Start([Start]) --> Queue[Task added to Queue]
Queue --> AsyncProcess[Async Processing Daemon]
AsyncProcess --> CheckStatus{Check Status}
CheckStatus -->|UNSUBMITTED or Retry| Process[Process Task]
CheckStatus -->|UNCHECKED| Verify[Verify Submission]
Process --> Operation{Operation Type}
Operation -->|Create| CreateOp[Create Operation]
Operation -->|Update| UpdateOp[Update Operation]
Operation -->|Delete| DeleteOp[Delete Operation]
CreateOp --> AttemptSubmit[Attempt to Submit]
UpdateOp --> AttemptSubmit
DeleteOp --> AttemptSubmit
AttemptSubmit --> SubmitResult{Submission Result}
SubmitResult -->|Success| MarkSubmitted[Mark as SUBMITTED]
SubmitResult -->|Ignored| MarkIgnored[Mark as IGNORED]
SubmitResult -->|Failure| HandleFailure{Failure Type}
HandleFailure -->|Temporary| RetryMechanism[Retry Mechanism]
HandleFailure -->|Permanent| MarkFailure[Mark as FAILURE]
MarkSubmitted --> Verify
Verify --> VerifyResult{Verify Result}
VerifyResult -->|Success| MarkSuccess[Mark as SUCCESS]
VerifyResult -->|Warning| MarkWarning[Mark as WARNING]
VerifyResult -->|Failure| HandleFailure
CheckStatus -->|All Processed| Sleep[Sleep]
Sleep --> AsyncProcess
subgraph RetryMechanism [Retry Mechanism]
CheckRetryCount{Retry Count < Max}
CheckRetryCount -->|Yes| ScheduleRetry[Schedule Retry]
CheckRetryCount -->|No| MarkMaxRetriesReached[Mark Max Retries Reached]
ScheduleRetry --> MarkTransientFailure[Mark as TRANSIENT_FAILURE]
end
subgraph Logging [Logging]
LogSuccess[Log Success]
LogWarning[Log Warning]
LogIgnored[Log Ignored]
LogRetryAttempt[Log Retry Attempt]
LogFailure[Log Failure]
LogMaxRetriesReached[Log Max Retries Reached]
end
subgraph StatusUpdate [Status Update]
UpdateStatus[Update Task Status]
NextTask[Move to Next Task]
end
MarkSuccess --> LogSuccess
MarkWarning --> LogWarning
MarkIgnored --> LogIgnored
MarkTransientFailure --> LogRetryAttempt
MarkFailure --> LogFailure
MarkMaxRetriesReached --> LogMaxRetriesReached
LogSuccess --> UpdateStatus
LogWarning --> UpdateStatus
LogIgnored --> UpdateStatus
LogRetryAttempt --> UpdateStatus
LogFailure --> UpdateStatus
LogMaxRetriesReached --> UpdateStatus
UpdateStatus --> NextTask
NextTask --> CheckStatus
subgraph QueueTypes [Queue Types]
CrossrefQueue[Crossref Queue]
DataciteQueue[Datacite Queue]
SearchIndexerQueue[Search Indexer Queue]
end
Queue --> QueueTypes
subgraph StatusTypes [Status Types]
UNSUBMITTED[UNSUBMITTED]
UNCHECKED[UNCHECKED]
SUBMITTED[SUBMITTED]
WARNING[WARNING]
FAILURE[FAILURE]
TRANSIENT_FAILURE[TRANSIENT_FAILURE]
IGNORED[IGNORED]
SUCCESS[SUCCESS]
end
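For discussion, the failure-handling branch of the redesign above (Failure Type, then Retry Count < Max, then the matching log entry) might translate into code along these lines. This is only a sketch under assumptions: `Task`, `handle_failure`, the logger name, and the `MAX_RETRIES_REACHED` status are hypothetical. The diagram shows a "Mark Max Retries Reached" step but does not list a corresponding status, so whether that should be a new status or a flavor of FAILURE is an open question.

```python
import logging
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger("ezid.queue")  # logger name is an assumption


class Status(Enum):
    UNSUBMITTED = "UNSUBMITTED"
    UNCHECKED = "UNCHECKED"
    SUBMITTED = "SUBMITTED"
    WARNING = "WARNING"
    FAILURE = "FAILURE"
    TRANSIENT_FAILURE = "TRANSIENT_FAILURE"
    IGNORED = "IGNORED"
    SUCCESS = "SUCCESS"
    # Hypothetical: not in the Status Types list of the diagram.
    MAX_RETRIES_REACHED = "MAX_RETRIES_REACHED"


@dataclass
class Task:
    identifier: str
    status: Status = Status.UNSUBMITTED
    retry_count: int = 0


def handle_failure(task: Task, transient: bool, max_retries: int = 3) -> Task:
    """Apply the Failure Type / Retry Mechanism branches and log the outcome."""
    if not transient:
        # Permanent failure: no retry, mark and log immediately.
        task.status = Status.FAILURE
        logger.error("permanent failure: %s", task.identifier)
    elif task.retry_count < max_retries:
        # Retries remain: schedule another attempt.
        task.retry_count += 1
        task.status = Status.TRANSIENT_FAILURE
        logger.warning(
            "retry %d/%d scheduled: %s", task.retry_count, max_retries, task.identifier
        )
    else:
        # Transient failure, but the retry budget is exhausted.
        task.status = Status.MAX_RETRIES_REACHED
        logger.error("max retries reached: %s", task.identifier)
    return task
```

With `max_retries=3`, a task that keeps failing transiently would be marked TRANSIENT_FAILURE three times and end up in the max-retries state on the fourth failure, at which point it would need to surface in the UI and reports.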