ukncsc/lme

LME stops working when >90% of disk space is consumed.

karl-james opened this issue · 4 comments

This has just happened for the second time: the disk fills up, processing appears to stop (CPU usage returns to idle), and logging into the dashboard either times out or returns 'Kibana server is not ready'.

The first time this happened I extended the volume and everything sprang back into life, but obviously I can't keep doing that.

logstash log is full of these entries:
"reason"=>"index [winlogbeat-16.01.2022] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];"}}'

kibana log:
2022-01-17T14:43:03.998424133Z lme_kibana.1.x2ajp5ww95kx@ | {"type":"log","@timestamp":"2022-01-17T14:43:03+00:00","tags":["fatal","root"],"pid":8,"message":"Error: Unable to complete saved object migrations for the [.kibana] index. Please check the health of your Elasticsearch cluster and try again. Unexpected Elasticsearch ResponseError: statusCode: 429, method: PUT, url: /.kibana_7.16.3_001/_mapping?timeout=60s error: [cluster_block_exception]: index [.kibana_7.16.3_001] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];,\n at migrationStateActionMachine (/usr/share/kibana/src/core/server/saved_objects/migrationsv2/migrations_state_action_machine.js:164:13)\n at processTicksAndRejections (node:internal/process/task_queues:96:5)\n at async Promise.all (index 0)\n at SavedObjectsService.start (/usr/share/kibana/src/core/server/saved_objects/saved_objects_service.js:181:9)\n at Server.start (/usr/share/kibana/src/core/server/server.js:330:31)\n at Root.start (/usr/share/kibana/src/core/server/root/index.js:69:14)\n at bootstrap (/usr/share/kibana/src/core/server/bootstrap.js:120:5)\n at Command.<anonymous> (/usr/share/kibana/src/cli/serve/serve.js:229:5)"}
2022-01-17T14:43:04.001481853Z lme_kibana.1.x2ajp5ww95kx@ | {"type":"log","@timestamp":"2022-01-17T14:43:04+00:00","tags":["info","plugins-system","standard"],"pid":8,"message":"Stopping all plugins."}
2022-01-17T14:43:04.002075801Z lme_kibana.1.x2ajp5ww95kx@ | {"type":"log","@timestamp":"2022-01-17T14:43:04+00:00","tags":["info","plugins","monitoring","monitoring","kibana-monitoring"],"pid":8,"message":"Monitoring stats collection is stopped"}
2022-01-17T14:43:04.009529894Z lme_kibana.1.x2ajp5ww95kx@ | {"type":"log","@timestamp":"2022-01-17T14:43:04+00:00","tags":["info","savedobjects-service"],"pid":8,"message":"[.kibana_task_manager] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 20ms."}
2022-01-17T14:43:04.015588737Z lme_kibana.1.x2ajp5ww95kx@ | {"type":"log","@timestamp":"2022-01-17T14:43:04+00:00","tags":["info","savedobjects-service"],"pid":8,"message":"[.kibana_task_manager] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT. took: 6ms."}
2022-01-17T14:43:04.027903470Z lme_kibana.1.x2ajp5ww95kx@ | {"type":"log","@timestamp":"2022-01-17T14:43:04+00:00","tags":["info","savedobjects-service"],"pid":8,"message":"[.kibana_task_manager] OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT -> UPDATE_TARGET_MAPPINGS. took: 12ms."}
2022-01-17T14:43:04.030473248Z lme_kibana.1.x2ajp5ww95kx@ | {"type":"log","@timestamp":"2022-01-17T14:43:04+00:00","tags":["error","savedobjects-service"],"pid":8,"message":"[.kibana_task_manager] Unexpected Elasticsearch ResponseError: statusCode: 429, method: PUT, url: /.kibana_task_manager_7.16.3_001/_mapping?timeout=60s error: [cluster_block_exception]: index [.kibana_task_manager_7.16.3_001] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];,"}
2022-01-17T14:43:34.010645726Z lme_kibana.1.x2ajp5ww95kx@ | {"type":"log","@timestamp":"2022-01-17T14:43:34+00:00","tags":["warning","plugins-system","standard"],"pid":8,"message":"\"eventLog\" plugin didn't stop in 30sec., move on to the next."}
2022-01-17T14:43:34.248232854Z lme_kibana.1.x2ajp5ww95kx@ |
2022-01-17T14:43:34.248274831Z lme_kibana.1.x2ajp5ww95kx@ | FATAL Error: Unable to complete saved object migrations for the [.kibana] index. Please check the health of your Elasticsearch cluster and try again. Unexpected Elasticsearch ResponseError: statusCode: 429, method: PUT, url: /.kibana_7.16.3_001/_mapping?timeout=60s error: [cluster_block_exception]: index [.kibana_7.16.3_001] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];,
2022-01-17T14:43:34.248284796Z lme_kibana.1.x2ajp5ww95kx@ |

and similar in elasticsearch log:
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
2022-01-17T15:07:34.001125760Z lme_elasticsearch.1.1wyo17pbw5uw@ | "at java.lang.Thread.run(Thread.java:833) [?:?]"] }
2022-01-17T15:07:38.634370947Z lme_elasticsearch.1.1wyo17pbw5uw@ | {"type": "server", "timestamp": "2022-01-17T15:07:38,633Z", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "loggingmadeeasy-es", "node.name": "es01", "message": "high disk watermark [90%] exceeded on [5gUa6tZVQgiR68sHtL5dbQ][es01][/usr/share/elasticsearch/data/nodes/0] free: 52.5gb[5.2%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete", "cluster.uuid": "CPpiEr30RyaGluifmhyF7Q", "node.id": "5gUa6tZVQgiR68sHtL5dbQ" }
2022-01-17T15:08:38.654877464Z lme_elasticsearch.1.1wyo17pbw5uw@ | {"type": "server", "timestamp": "2022-01-17T15:08:38,654Z", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "loggingmadeeasy-es", "node.name": "es01", "message": "high disk watermark [90%] exceeded on [5gUa6tZVQgiR68sHtL5dbQ][es01][/usr/share/elasticsearch/data/nodes/0] free: 52.5gb[5.2%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete", "cluster.uuid": "CPpiEr30RyaGluifmhyF7Q", "node.id": "5gUa6tZVQgiR68sHtL5dbQ" }
2022-01-17T15:09:38.679334327Z lme_elasticsearch.1.1wyo17pbw5uw@ | {"type": "server", "timestamp": "2022-01-17T15:09:38,678Z", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "loggingmadeeasy-es", "node.name": "es01", "message": "high disk watermark [90%] exceeded on [5gUa6tZVQgiR68sHtL5dbQ][es01][/usr/share/elasticsearch/data/nodes/0] free: 52.5gb[5.2%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete", "cluster.uuid": "CPpiEr30RyaGluifmhyF7Q", "node.id": "5gUa6tZVQgiR68sHtL5dbQ" }
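The 90% figure in these messages is Elasticsearch's default high disk watermark; at the 95% flood-stage watermark indices are marked read-only, which is the cluster_block_exception shown above. A quick way to confirm how full the data node is and which thresholds are in force, assuming the elastic superuser and Elasticsearch reachable on https://localhost:9200 (adjust host, port and credentials to your deployment; curl will prompt for the password):

  # Per-node disk usage, free space and shard counts
  curl -k -u elastic "https://localhost:9200/_cat/allocation?v"

  # Effective disk-watermark settings, including defaults
  curl -k -u elastic "https://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk"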

This was a clean installation in mid-December, so I assume everything's current.
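As a side note on recovery: once space has been freed (or the volume extended), Elasticsearch 7.4 and later should lift the read-only-allow-delete block automatically when usage drops back below the high watermark, but it can also be cleared by hand. A minimal sketch under the same localhost/elastic assumptions as above, not an LME-specific procedure:

  # Show which indices currently carry the block
  curl -k -u elastic "https://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks"

  # Clear the block on all indices once disk usage is back below the flood-stage watermark
  curl -k -u elastic -X PUT "https://localhost:9200/_all/_settings" \
    -H "Content-Type: application/json" \
    -d '{"index.blocks.read_only_allow_delete": null}'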

Hi,

Are your retention settings correct, as shown here?

Thanks,
Duncan

Hi @duncan-ncc, just a heads up that for Kibana as shipped with LME 0.4 the instructions at https://github.com/ukncsc/lme/blob/master/docs/retention.md have outdated screenshots. If I get the chance I might be able to help update them, but I don't know when that will be.

It appears you need to browse to https://YOUR-LME-KIBANA-INSTANCE/app/management/data/index_lifecycle_management/policies/edit/lme_ilm_policy; the relevant setting is in the delete phase, under "Move data into phase when: X days old" (31 in my case).
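For reference, the same change can be made without the UI through the ILM API. This is a hedged sketch only, again assuming local elastic credentials: fetch the current policy first, then re-submit the whole body with the delete phase adjusted, because a PUT replaces the entire policy rather than patching it.

  # Inspect the policy LME installed
  curl -k -u elastic "https://localhost:9200/_ilm/policy/lme_ilm_policy"

  # Re-submit it with the delete phase set to 31 days (illustrative body;
  # keep the other phases exactly as returned by the GET above)
  curl -k -u elastic -X PUT "https://localhost:9200/_ilm/policy/lme_ilm_policy" \
    -H "Content-Type: application/json" \
    -d '{
      "policy": {
        "phases": {
          "hot": { "actions": {} },
          "delete": { "min_age": "31d", "actions": { "delete": {} } }
        }
      }
    }'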

I have the opposite problem: I have a second disk on my server (to prevent the OS volume filling up and to make expansion easier), but the deploy script only detects the OS volume. I edited the docker-compose-stack.yml and docker-compose-stack-live.yml files, but there is no reference to RETENTION in them.

Looking at deploy.sh, the script seems to have moved on from the documentation: it no longer changes that file but instead calculates 80% of the disk space and appears to use that value as the number of retention days in the policy, rather than as a disk-usage limit. I have edited the days manually (https://yourserver/app/management/data/index_lifecycle_management/policies) for now, but it looks as though a hot-phase rollover at set shard sizes would achieve the same effect.
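On the rollover idea: a hot phase that rolls indices over at a fixed primary shard size, combined with a delete phase, bounds disk usage by size rather than purely by age. A sketch of what the hot phase could look like in the policy body shown earlier, with illustrative 50gb / 30d limits rather than LME defaults:

  "hot": {
    "actions": {
      "rollover": {
        "max_primary_shard_size": "50gb",
        "max_age": "30d"
      }
    }
  }

Bear in mind that rollover only takes effect when writes go through an index alias or data stream; with the daily winlogbeat indices shown in the logs above, it is the delete phase that actually frees space.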


Closed due to project archive