CDLUC3/ezid

[MAINTENANCE] proc-cleanup-async-queues improvements

Service/repository
EZID - proc-cleanup-async-queues.py script

Describe the current state/issue

As described in #723, the current proc-cleanup-async-queues.py script has limitations that prevent proper queue clearing. It is constrained by a batch size of 100, which appears to cause missed records when the processing time range contains more records than the batch size allows. Relatedly, the current 2-week processing window may not be optimal for cleanup. The script also lacks the flexibility to handle potential status changes in other queues and record types, which limits its adaptability to the task queue redesign in #707. Inadequate logging makes it difficult to identify issues or track unprocessed records.

Describe the desired state/solution
An optimized proc-cleanup-async-queues.py script that:

  1. Handles both "S" and "O" statuses for successful processing in the DataCite queue, accounting for the change that occurred on June 29, 2023.

  2. Includes logic to handle potential status changes in other queues to future-proof the script.

  3. Has an adjustable and optimized time window for processing, which can be easily configured based on each queue's needs.

  4. Currently the time window is the past week. Records are selected by updated timestamp and ordered by primary key in descending order, so the most recently updated records are deleted first. This may not have been the original intent. We probably want to keep the most recently processed records for 1-2 weeks before deleting them from the queues (confirm with the team).

  5. Processes all records within the time range, regardless of the number, either by handling larger batches or implementing an alternate means to ensure complete queue processing.

  6. Provides logging/reporting to identify unprocessed records.
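Items 1, 2 and 4 above can be sketched as a single eligibility check. This is only an illustration, not the script's actual code: the names `SUCCESS_STATUSES`, `RETENTION_SECONDS` and `is_deletable` are hypothetical, the only status codes taken from this issue are "S" and "O" for the DataCite queue, and the 2-week retention is an assumption pending team confirmation.

```python
import time

# Success statuses per queue. Only the DataCite entry comes from this issue;
# entries for other queues are deliberately left out until their codes are confirmed.
SUCCESS_STATUSES = {
    "datacite": {"S", "O"},
}

# Assumed retention period (2 weeks), expressed in seconds.
RETENTION_SECONDS = 14 * 24 * 60 * 60


def is_deletable(queue, status, update_time, now=None,
                 success_statuses=SUCCESS_STATUSES,
                 retention_seconds=RETENTION_SECONDS):
    """Return True only for successfully processed records older than the retention period."""
    if now is None:
        now = time.time()
    allowed = success_statuses.get(queue)
    if allowed is None:
        # Queue without configured success statuses: be conservative and keep the record.
        return False
    return status in allowed and update_time < now - retention_seconds
```

Keeping the status sets in one mapping means a future status change in another queue would only require a configuration edit, and an unconfigured queue is never cleaned by accident.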

Work order:
4. Has an adjustable and optimized time window for processing, which can be easily configured based on each queue's needs.
5. Keeps the most recently processed records in all statuses for 2 weeks before deleting the successfully processed records from the queues.
3. Processes all records within the time range, regardless of the number of qualified records.
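One possible way to process every record in the time range regardless of count is to page through the primary key in ascending order (keyset pagination) until a page comes back empty. The sketch below uses an in-memory list in place of the database; `iter_batches`, `make_fetch_page` and the row layout are hypothetical, ids are assumed to be positive integers, and this is not necessarily how the v2 script does it.

```python
def iter_batches(fetch_page, batch_size):
    """Yield successive batches until the source is exhausted.

    fetch_page(last_id, limit) must return at most `limit` rows with
    id > last_id, ordered by id ascending.
    """
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = fetch_page(last_id, batch_size)
        if not rows:
            return
        yield rows
        # Continue after the highest id seen so far.
        last_id = rows[-1]["id"]


def make_fetch_page(table, updated_from, updated_to):
    """Build an in-memory stand-in for a paged table query (illustration only)."""
    def fetch_page(last_id, limit):
        matching = [
            r for r in table
            if r["id"] > last_id and updated_from <= r["updateTime"] <= updated_to
        ]
        matching.sort(key=lambda r: r["id"])
        return matching[:limit]
    return fetch_page


# Usage: 319 matching rows are visited in full even with a batch size of 100.
table = [{"id": i, "updateTime": 1000 + i} for i in range(1, 320)]
fetch = make_fetch_page(table, updated_from=1000, updated_to=2000)
batch_sizes = [len(batch) for batch in iter_batches(fetch, batch_size=100)]
```

Because each page starts after the last id already seen, deleting queue entries between pages does not shift the remaining rows, unlike an offset-based loop.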

Checklist:

  • define time window from and to timestamps as optional command options
  • define batch size as an optional command option - find the best-performing batch size and set it as the default
  • log deleted identifiers (with queue type and status information) when running the async queue cleanup in debug mode
  • keep the most recently processed records in all statuses for 2 weeks before deleting the successfully processed records from the queues
  • process all records within the time range, regardless of the number of qualified records
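The command-line side of the checklist might look like the following sketch. The `--updated_from` and `--updated_to` options with YYYYMMDD values match the command used in the test run in this issue; `--batch_size`, `--debug`, the default of 100, the UTC interpretation of dates, and the helper names are assumptions made for illustration only.

```python
import argparse
import logging
from datetime import datetime, timezone

log = logging.getLogger("cleanup_sketch")  # hypothetical logger name


def yyyymmdd_to_epoch(value):
    """Parse a YYYYMMDD string as midnight UTC (assumed) and return epoch seconds."""
    dt = datetime.strptime(value, "%Y%m%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def build_parser():
    parser = argparse.ArgumentParser(description="async queue cleanup options (sketch)")
    # Optional time window; None means "fall back to the script's default window".
    parser.add_argument("--updated_from", type=yyyymmdd_to_epoch, default=None)
    parser.add_argument("--updated_to", type=yyyymmdd_to_epoch, default=None)
    # Hypothetical option; 100 is a placeholder until a best-performing size is measured.
    parser.add_argument("--batch_size", type=int, default=100)
    # Hypothetical flag enabling per-identifier logging.
    parser.add_argument("--debug", action="store_true")
    return parser


def log_deleted(queue, status, identifier):
    """Record one deleted identifier with its queue type and status at DEBUG level."""
    log.debug("deleted identifier=%s queue=%s status=%s", identifier, queue, status)


args = build_parser().parse_args(["--updated_from", "20220405", "--updated_to", "20220406"])
```

Converting the dates to epoch seconds at parse time lets the window be compared directly against integer columns such as updateTime.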

Tested the proc-cleanup-async-queues_v2.py script against the DEV database (10/23)

  • date range: between 20220405 and 20220406
  • records selected from the refIdentifier table: 319 records selected
  • matched records in the datacite queue:
    • before running script: 319
    • after running script: 319 (all with status="F")
  • matched records in the crossref queue:
    • before running the script: 319
    • after running the script: 0
  • matched records in the searchIndexer queue:
    • before running the script: 319
    • after running the script: 0
# before running new async script: 319
# after running new async script: 319
select id, identifier, from_unixtime(createTime), from_unixtime(updateTime) from ezidapp_refidentifier 
where from_unixtime(updateTime) between '2022-04-05' and  '2022-04-06';

# before running new async script: 95077 
# after running new async script: 95077 
select count(id) from ezidapp_refidentifier;

# before: 159035
# after: 159035
select count(seq) from ezidapp_datacitequeue;

# before: 159035
# after:  158716 (reduced 319)
select count(seq) from ezidapp_crossrefqueue;

# before: 774270
# after:  773951 (reduced 319)
select count(seq) from ezidapp_searchindexerqueue;

# 319 records remained after running the cleanup script; all with status=F
select *, from_unixtime(enqueueTime), from_unixtime(submitTime) from ezidapp_datacitequeue 
where refidentifier_id in (
select id from ezidapp_refidentifier 
where from_unixtime(updateTime) between '2022-04-05' and  '2022-04-06');

# before: 319; after: 0
select *, from_unixtime(enqueueTime), from_unixtime(submitTime) from ezidapp_crossrefqueue 
where refidentifier_id in (
select id from ezidapp_refidentifier 
where from_unixtime(updateTime) between '2022-04-05' and  '2022-04-06');

# before: 319; after: 0
select *, from_unixtime(enqueueTime), from_unixtime(submitTime) from ezidapp_searchindexerqueue 
where refidentifier_id in (
select id from ezidapp_refidentifier 
where from_unixtime(updateTime) between '2022-04-05' and  '2022-04-06');

python manage.py proc-cleanup-async-queues_v2 --updated_from 20220405 --updated_to 20220406
  • Deployed proc-cleanup-async-queues_v2.py on EZID-PRD Oct 30 with release v3.2.27