[New Feature]: Automation to regularly clean up docs from Elasticsearch
riverma opened this issue · 7 comments
Checked for duplicates
Yes - I've already checked
Alternatives considered
Yes - and alternatives don't suffice
Related problems
We have an ever-increasing number of documents being stored in Elasticsearch that, left unchecked, can affect Elasticsearch stability and therefore operations.
Describe the feature request
We'd like automation (a script) that can be run at regular intervals to automatically clean up unnecessary documents in indices that are growing. "Unnecessary" is determined both by a document's age and by a specific field within that document: for example, documents in the job_status-current index whose status field is job-deduped, when the document is over 30 days old. We want the time window and the field matching to be easy to modify within the script so that OPS can adjust it over time, while avoiding deletion of documents that may aid in debugging down the line.
Here are some of the target indices and conditions we know of currently:
- Mozart ES
  - worker_status-current
    - Document over 30 days old and has status field value: worker-heartbeat
  - task_status-current
    - Document over 30 days old and has status field values: task-succeeded or task-received or task-sent
  - job_status-current
    - Document over 30 days old and has status field values: job-successful or job-deduped
  - event_status-current
    - Document over 30 days old and has status field value: clobber
- Metrics
  - logstash-YYYY.MM.DD
    - Index over 30 days old
  - sdswatch-YYYY.MM.DD
    - Index over 30 days old
- GRQ
  - grq_v2.0_l2_hls_s30
    - Document over 90 days old
  - grq_v2.0_l2_hls_l30
    - Document over 90 days old
  - grq_v1.0_l3_dswx_hls
    - Document over 90 days old
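As a sketch of what one of these conditions could translate to, here is a hedged example of building an Elasticsearch delete_by_query body for the job_status-current case. The age threshold and status values are parameters; the @timestamp date field is an assumption and the actual date field name in these indices may differ.

```python
import json

def build_delete_query(age_days, status_values):
    """Build a delete_by_query body matching documents older than
    `age_days` whose `status` field has one of `status_values`.
    Assumes a `@timestamp` date field; the real field name may differ."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Documents older than the age threshold.
                    {"range": {"@timestamp": {"lt": f"now-{age_days}d"}}},
                    # Documents whose status matches any listed value.
                    {"terms": {"status": status_values}},
                ]
            }
        }
    }

# Example: the job_status-current condition from the list above.
body = build_delete_query(30, ["job-successful", "job-deduped"])
print(json.dumps(body, indent=2))
```

The same builder would cover every per-document condition above just by swapping the parameters, which keeps the OPS-adjustable parts in one place.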
What should be the configuration items for the script?
- Indices to target
- Conditions per index
What should the script return?
- List of document IDs of deleted documents
A more generic version of this ticket has also been created at: https://hysds-core.atlassian.net/browse/HC-454
The script should also return the number of deleted documents. Document IDs may not be as useful because they are just random strings. Perhaps it should also keep a running log of "datetime_executed, index, num_doc_deleted" in a text log/CSV file.
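The running log could be appended as simple CSV rows along the lines suggested above. A minimal sketch, assuming a file path and column order that are placeholders, not a settled design:

```python
import csv
from datetime import datetime, timezone

def log_deletion(log_path, index, num_docs_deleted):
    """Append one 'datetime_executed, index, num_docs_deleted' row."""
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), index, num_docs_deleted]
        )

# Example: record one cleanup run against job_status-current.
log_deletion("es_cleanup_log.csv", "job_status-current", 42)
```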
@LalaP @niarenaw (and others), I have two recommended ways to encode this feature as a script. Which do you all prefer from an operations perspective?
- 3 scripts, one for each ES host (Mozart, GRQ, Metrics) that is called without any arguments in the command-line because each script has the delete conditions (specified above) pre-set within the scripts. The conditions can be modified within the script via variables.
- 1 script that takes command-line arguments for ES host, indices, and a list of conditions to delete by (age of document, key/val matching fields to search docs for)
This is basically a trade-off between simplicity of invocation and configurability at runtime. I'm aware there's already a purge feature within GRQ, so I'd be interested to hear whether that feature already provides the maximum configurability you all need too.
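For reference, option 2's command-line interface might look like the following sketch; the flag names and the way conditions are expressed are hypothetical, not a settled design:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Delete old ES documents matching age/status conditions.")
parser.add_argument("--host", required=True,
                    help="Elasticsearch base URL (Mozart, GRQ, or Metrics)")
parser.add_argument("--index", action="append", required=True,
                    help="Index to target (repeatable)")
parser.add_argument("--older-than-days", type=int, default=30,
                    help="Document age threshold in days")
parser.add_argument("--status", action="append", default=[],
                    help="status field value to match (repeatable)")

# Example invocation for the job_status-current condition.
args = parser.parse_args([
    "--host", "http://mozart-es:9200",
    "--index", "job_status-current",
    "--older-than-days", "30",
    "--status", "job-successful",
    "--status", "job-deduped",
])
print(args.index, args.older_than_days, args.status)
```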
Thoughts?
Thanks @philipjyoon. A separate log file would be better for the detailed information. I'm thinking that listing the IDs as a return value allows this tool to be piped to another that can do further detailed analysis if need be (e.g., count the number of deleted docs, double-check they are in fact deleted, etc.).
@niarenaw recommends not specifying ES conditions on the command-line as arguments as they can be highly complex. Also recommends that we can use GRQ for surgical deletes if need be.
@riverma suggests using YAML/JSON config file(s) for any desired parameters: human-readable, and under version control to track/diff/debug issues.
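A version-controlled config along those lines might look like this sketch. JSON is used here so the stdlib can parse it (YAML would work equally well); the keys and structure are hypothetical:

```python
import json

# Hypothetical config: per-host indices to target and delete conditions.
CONFIG = """
{
  "mozart": {
    "job_status-current": {"older_than_days": 30,
                           "status_values": ["job-successful", "job-deduped"]},
    "worker_status-current": {"older_than_days": 30,
                              "status_values": ["worker-heartbeat"]}
  },
  "grq": {
    "grq_v2.0_l2_hls_s30": {"older_than_days": 90}
  }
}
"""

config = json.loads(CONFIG)
# A cleanup script would iterate the config rather than hard-code conditions.
for host, indices in config.items():
    for index, cond in indices.items():
        print(host, index, cond["older_than_days"])
```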
Some possible updated solutions based on @niarenaw and @toandn-jpl feedback.

Three possible automation solutions:
- A wrapper script
- cURL scripts: a set of bash scripts that encode cURL commands for our respective delete_by_query conditions
- GRQ/Figaro trigger rules: custom GRQ/Figaro trigger rules that enact a particular delete_by_query action when a given condition arises, or when the OPS team chooses to run the action
Resolved.