nasa/opera-sds-ops

[New Feature]: Automation to regularly clean-up docs from Elasticsearch

riverma opened this issue · 7 comments

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Related problems

We have an ever-increasing number of documents being stored in Elasticsearch that, left unchecked, can affect Elasticsearch stability and therefore operations.

Describe the feature request

We'd like automation (a script) that can be run at regular intervals to automatically clean up unnecessary documents in growing indices. "Unnecessary" is determined both by a document's age and by a specific field within the document. For example: documents in the job_status-current index that are over 30 days old and have a job-deduped status field value. We want the time window and the field matching to be easy to modify within the script so that OPS can adjust it over time, while avoiding the deletion of documents that may aid debugging down the line.

Here are some of the target indices and conditions we know of currently:

  • Mozart ES
    • worker_status-current
      • Document over 30 days old with a status field value of worker-heartbeat
    • task_status-current
      • Document over 30 days old with a status field value of task-succeeded, task-received, or task-sent
    • job_status-current
      • Document over 30 days old with a status field value of job-successful or job-deduped
    • event_status-current
      • Document over 30 days old with a status field value of clobber
  • Metrics
    • logstash-YYYY.MM.DD
      • Index over 30 days old
    • sdswatch-YYYY.MM.DD
      • Index over 30 days old
  • GRQ
    • grq_v2.0_l2_hls_s30
      • Document over 90 days old
    • grq_v2.0_l2_hls_l30
      • Document over 90 days old
    • grq_v1.0_l3_dswx_hls
      • Document over 90 days old
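
As a concrete illustration, here is roughly what one of these conditions translates to as an Elasticsearch `_delete_by_query` call. This is a minimal sketch, assuming a Python `requests` client, an `@timestamp` date field, and a keyword-mapped `status` field; the host URL is a placeholder and the field names should be adjusted to the actual index mappings:

```python
import requests

MOZART_ES = "http://localhost:9200"  # placeholder Mozart ES URL

# job_status-current: documents over 30 days old whose status is
# job-successful or job-deduped
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"lte": "now-30d"}}},
                {"terms": {"status": ["job-successful", "job-deduped"]}},
            ]
        }
    }
}

resp = requests.post(f"{MOZART_ES}/job_status-current/_delete_by_query", json=query)
resp.raise_for_status()
print(resp.json()["deleted"])  # Elasticsearch reports the number of docs deleted
```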

What should be the configuration items for the script?

  • Indices to target
  • Conditions per index

What should the script return?

  • List of document IDs of deleted documents

A more generic version of this ticket has also been created at: https://hysds-core.atlassian.net/browse/HC-454

The script should also return the number of deleted documents. Document IDs may not be as useful because they are just random strings. Perhaps it should also keep a running log of "datetime_executed, index, num_doc_deleted" in a text log/CSV file.
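
For illustration, that running log could be as simple as an appended CSV row per run. A minimal sketch (the file name and helper function are hypothetical):

```python
import csv
from datetime import datetime, timezone

def log_deletion(index: str, num_doc_deleted: int, path: str = "es_cleanup_log.csv"):
    """Append one 'datetime_executed, index, num_doc_deleted' row."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), index, num_doc_deleted]
        )
```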

@LalaP @niarenaw (and others), I have two recommended ways to encode this feature as a script. Which do you all prefer from an operations perspective?

  • 3 scripts, one per ES host (Mozart, GRQ, Metrics), each called without any command-line arguments because the delete conditions (specified above) are pre-set within the script. The conditions can be modified via variables inside the script.
  • 1 script that takes command-line arguments for the ES host, indices, and a list of conditions to delete by (document age, key/value field matches to search docs for), sketched below.
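
To make the second option concrete, here is a sketch of what its command-line interface might look like (all flag names here are hypothetical):

```python
import argparse

parser = argparse.ArgumentParser(description="Delete old ES documents by condition.")
parser.add_argument("--host", required=True, help="Elasticsearch base URL")
parser.add_argument("--index", action="append", required=True,
                    help="index to clean (repeatable)")
parser.add_argument("--older-than", default="30d",
                    help="document age cutoff, e.g. 30d or 90d")
parser.add_argument("--match", action="append", default=[],
                    help="field=value condition to match, repeatable")
args = parser.parse_args()

# Example invocation:
#   python es_cleanup.py --host http://mozart:9200 --index job_status-current \
#       --older-than 30d --match status=job-successful --match status=job-deduped
```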

This is basically a trade-off between simplicity of invocation and configurability at runtime. I'm aware there's already a purge feature within GRQ, so I'd also be interested to hear whether that feature already provides the configurability you need.

Thoughts?

Thanks @philipjyoon. A separate log file would be better for the detailed information. I'm thinking that listing the IDs as the script's output allows this tool to be piped to another that can do further analysis if need be (e.g. count the number of deleted docs, double-check they are in fact deleted, etc.).

@niarenaw recommends not specifying ES conditions as command-line arguments, since they can be highly complex. Also recommends using GRQ for surgical deletes if need be.

@riverma suggests using YAML/JSON config file(s) for any desired parameters - human-readable and version-controlled, making it easy to track/diff/debug issues.
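
For example, such a config file might look like the sketch below (the schema and keys are illustrative only, not a final design):

```yaml
# es_cleanup.yaml (hypothetical) -- one entry per ES host
mozart:
  url: http://mozart:9200
  indices:
    job_status-current:
      older_than: 30d
      status: [job-successful, job-deduped]
    task_status-current:
      older_than: 30d
      status: [task-succeeded, task-received, task-sent]
grq:
  url: http://grq:9200
  indices:
    grq_v2.0_l2_hls_s30:
      older_than: 90d
```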

Some possible updated solutions based on feedback from @niarenaw and @toandn-jpl.

Three possible automation solutions

A wrapper script

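A single script that reads a config file of indices and conditions and issues the corresponding delete_by_query calls per index. A minimal sketch, assuming the hypothetical YAML schema from the earlier comment, an `@timestamp` date field, and PyYAML installed:

```python
import requests
import yaml  # PyYAML

def cleanup(config_path: str = "es_cleanup.yaml"):
    with open(config_path) as f:
        config = yaml.safe_load(f)

    for host, spec in config.items():
        for index, cond in spec["indices"].items():
            # Always filter on document age; add the status filter if configured
            filters = [{"range": {"@timestamp": {"lte": f"now-{cond['older_than']}"}}}]
            if "status" in cond:
                filters.append({"terms": {"status": cond["status"]}})
            resp = requests.post(
                f"{spec['url']}/{index}/_delete_by_query",
                json={"query": {"bool": {"filter": filters}}},
            )
            resp.raise_for_status()
            print(host, index, resp.json()["deleted"])

# Note: the Metrics logstash-YYYY.MM.DD / sdswatch-YYYY.MM.DD cases delete whole
# indices by age (DELETE /<index>) rather than documents, and are not shown here.

if __name__ == "__main__":
    cleanup()
```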

cURL scripts

A set of bash scripts that encode cURL commands for our respective delete_by_query conditions

GRQ/Figaro trigger rules

Create custom GRQ/Figaro trigger rules that enact a particular delete_by_query action when a given condition arises, or when the OPS team chooses to run the action

Resolved.