status-im/infra-nimbus

ElasticSearch cluster throwing "Data too large" errors

jakubgs opened this issue · 12 comments

Nimbus logs cluster nodes have been throwing errors like these:

org.elasticsearch.transport.RemoteTransportException: [store-02.he-eu-hel1.logs.nimbus][172.17.1.4:9300][internal:cluster/shard/failure]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:cluster/shard/failure] would be [50337954328/46.8gb], which is larger than the limit of [49061245747/45.6gb], real usage: [50337921000/46.8gb], new bytes reserved: [33328/32.5kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=33328/32.5kb, model_inference=0/0b, eql_sequence=0/0b, accounting=27200236/25.9mb]

This is also causing Kibana queries to fail. It is most probably not a scaling issue, but a configuration tuning issue.
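
To see how close each node's parent breaker is to its limit (and how often it has tripped), the breaker section of the node stats API can be queried; a quick sketch, with filter_path only there to trim the response:

    curl -s 'localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.parent&pretty'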

It seems node store-02 can't recover after the failure. During recovery the heap usage goes above the 45.6 GB limit (46.3 GB). The limit is calculated from the JVM options and enforced by the circuit breaker mechanism: the parent breaker limit defaults to 95% of the JVM heap, which with the current ~48.1 GB heap gives the 45.6 GB seen in the errors. According to Elasticsearch, the problem is at least on node store-03.

    "reason": "ALLOCATION_FAILED",
    "at": "2022-05-04T21:10:24.157Z",
    "failed_allocation_attempts": 5,
    "details": "failed shard on node [m4r7OjyRQTitiTDsFMAZTQ]: failed recovery, failure RecoveryFailedException[[.kibana-event-log-7.14.2-000004][0]: Recovery failed from {store-01.he-eu-hel1.logs.nimbus}{zZNgFeSBRvmsmDd2nCVoMA}{K2QYv8AFQBSNHi073dHJ2g}{10.14.0.45}{10.14.0.45:9300}{cdfhilmrstw}{ml.machine_memory=67362332672, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=51643416576, transform.node=true} into {store-03.he-eu-hel1.logs.nimbus}{m4r7OjyRQTitiTDsFMAZTQ}{aGu44XRyQPGB_IVwL_VoqA}{10.14.0.44}{10.14.0.44:9300}{cdfhilmrstw}{ml.machine_memory=67362332672, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=51643416576}]; nested: RemoteTransportException[[store-01.he-eu-hel1.logs.nimbus][172.17.2.2:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [49771669956/46.3gb], which is larger than the limit of [49061245747/45.6gb], real usage: [49771668192/46.3gb], new bytes reserved: [1764/1.7kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=1764/1.7kb, model_inference=0/0b, eql_sequence=0/0b, accounting=26278868/25mb]]; ",
    "last_allocation_status": "no_attempt"
  },
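
Details like the above presumably come from the cluster allocation explain API (the unassigned_info section); a sketch of the query for the failing Kibana shard, with the primary flag guessed:

    curl -s -X GET -H 'Content-Type: application/json' \
      localhost:9200/_cluster/allocation/explain -d '
    {
      "index": ".kibana-event-log-7.14.2-000004",
      "shard": 0,
      "primary": false
    }'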

As a temporary solution I'd recommend increasing the JVM Xmx (there is free physical memory on the host), restarting ES, waiting for recovery, and then bringing the old values back.
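
Note that the shard above already hit failed_allocation_attempts: 5, so Elasticsearch won't retry it on its own; after raising the heap and restarting, the failed allocations need a nudge. Roughly (not taken from an actual runbook):

    # Retry shards that exhausted their allocation attempts.
    curl -s -X POST 'localhost:9200/_cluster/reroute?retry_failed=true'

    # Watch ongoing recoveries until the cluster goes green.
    curl -s 'localhost:9200/_cat/recovery?active_only=true&v'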

The current memory ratio is 80%:

# JVM memory settings
es_jvm_mem_ratio: 0.8
es_jvm_mem: '{{ (ansible_memtotal_mb * es_jvm_mem_ratio) | round(0, "floor") }}'
es_jvm_min_heap: '{{ es_jvm_mem | int }}m'
es_jvm_max_heap: '{{ es_jvm_mem | int }}m'
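
The heap that actually ends up applied on each node can be sanity-checked with the cat nodes API (column names as in the 7.x docs):

    curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent,ram.max'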

I think on a host with 64 GB of RAM it should be fine to go to 90% or even 95%.

Also, the fact that such a big data set was created indicates that we might need to filter the incoming data more rigorously.

I bet there's one (or a few) very big indices compared to the rest. In case recovery is not doable, we can just purge the cluster and re-create it from scratch. The only victim I would kinda care about is the Kibana config, which includes the dashboards.

Increasing Xmx to 95% allowed the cluster to sync. After that I rolled the changes back to 80%. The cluster is green (see below):

{
  "status" : "green",
  "active_primary_shards" : 73,
  "active_shards" : 112,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0
}
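
(That output is a trimmed cluster health response; something like this reproduces it:)

    curl -s 'localhost:9200/_cluster/health?filter_path=status,active_primary_shards,active_shards,relocating_shards,initializing_shards,unassigned_shards,delayed_unassigned_shards&pretty'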

I'll keep this issue open and perform further diagnostics later.

I've added some rules to Logstash config to lower the volume of logs coming into the cluster:

    # Drop these to reduce size of indices in ES cluster.
    if [json][msg] in [
        "Attestation resolved",
        "Attestation received",
        "Aggregate received",
        "Sync committee message received",
        "Sync committee message resolved",
        "Block backfilled",
        "Peer's syncing status"
    ] {
      drop { id => "beacon-node-spam" }
    }
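
One way to confirm the rule is actually matching is Logstash's own node stats API, where the drop filter shows up under the beacon-node-spam id (assuming the default monitoring API on port 9600 is enabled on the Logstash hosts, and jq is available):

    curl -s localhost:9600/_node/stats/pipelines | \
      jq '.pipelines[].plugins.filters[] | select(.id == "beacon-node-spam") | .events'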

The current size of the indices is absurd:

admin@store-01.he-eu-hel1.logs.nimbus:~ % c localhost:9200/_cat/indices/logstash\*\?s=index   
green open logstash-2022.05.11 EolbCKoORR-PQcs85JQZAw 1 0  55220562 0   23gb   23gb
green open logstash-2022.05.12 wk46PRGkTUqB8F7ikRY32g 1 0  60501537 0 26.3gb 26.3gb
green open logstash-2022.05.13 K_Z9ablhSU-Qb7zFzVlzZg 1 0  61928234 0 26.8gb 26.8gb
green open logstash-2022.05.14 FDCrrz35QMuZ6KBIIXsckw 1 0 116341863 0 50.7gb 50.7gb
green open logstash-2022.05.15 Y1mjFmGRTBiTiHz5gNR72Q 1 0 141828405 0 61.5gb 61.5gb
green open logstash-2022.05.16 c3Wglp3WQDC5DbvizjV5hA 1 0  54760136 0 22.9gb 22.9gb

We should clean that up a bit.
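
A terms aggregation on the message field shows which log lines dominate an index; a sketch, assuming the Logstash template maps json.msg with a .keyword subfield:

    curl -s -H 'Content-Type: application/json' 'localhost:9200/logstash-2022.05.15/_search?size=0' -d '
    {
      "aggs": {
        "top_messages": {
          "terms": { "field": "json.msg.keyword", "size": 10 }
        }
      }
    }'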

That's a lot of garbage:

 > ./elasticsearch/esclean.py -q 'json.msg="Sync committee message.*"'       
Cluster: nimbus-logs-search
logstash-2022.05.11    count: 11131606
logstash-2022.05.12    count: 10148071
logstash-2022.05.13    count: 10609753
logstash-2022.05.14    count: 22723827
logstash-2022.05.15    count: 27097499
logstash-2022.05.16    count: 10898201

 > ./elasticsearch/esclean.py -q "json.msg='Peer\'s syncing status'" 
Cluster: nimbus-logs-search
logstash-2022.05.11    count:  2046073
logstash-2022.05.12    count:  3477558
logstash-2022.05.13    count:  3448888
logstash-2022.05.14    count:  5028311
logstash-2022.05.15    count:  5625083
logstash-2022.05.16    count:  1476804

I'm going to clean that up using the esclean.py script.
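
(The script is presumably just a wrapper around the delete-by-query API; the equivalent raw call for one index would look roughly like this, purely illustrative:)

    curl -s -X POST -H 'Content-Type: application/json' \
      'localhost:9200/logstash-2022.05.11/_delete_by_query?conflicts=proceed' -d '
    {
      "query": {
        "match_phrase": { "json.msg": "Sync committee message received" }
      }
    }'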

Cleaned up a bit:

 > ./elasticsearch/esclean.py -q 'json.msg="Sync committee message.*"' -d
Cluster: nimbus-logs-search
logstash-2022.05.11    count: 11131606
logstash-2022.05.11    Deleted:   11131606 Failed: 0
logstash-2022.05.12    count: 10148071
logstash-2022.05.12    Deleted:   10148071 Failed: 0
logstash-2022.05.13    count: 10609753
logstash-2022.05.13    Deleted:   10609753 Failed: 0
logstash-2022.05.14    count: 22723827
logstash-2022.05.14    Deleted:          0 Failed: 0
logstash-2022.05.15    count: 27097499
logstash-2022.05.15    Deleted:          0 Failed: 0
logstash-2022.05.16    count: 10905627
logstash-2022.05.16    Deleted:          0 Failed: 0

And others, which resulted in:

admin@store-01.he-eu-hel1.logs.nimbus:~ % c localhost:9200/_cat/indices/logstash\*\?s=index
green open logstash-2022.05.11 EolbCKoORR-PQcs85JQZAw 1 0  42042883 461128   18gb   18gb
green open logstash-2022.05.12 wk46PRGkTUqB8F7ikRY32g 1 0  46875908 983701 20.8gb 20.8gb
green open logstash-2022.05.13 K_Z9ablhSU-Qb7zFzVlzZg 1 0  47869593      0 21.2gb 21.2gb
green open logstash-2022.05.14 FDCrrz35QMuZ6KBIIXsckw 1 0  83308462      0 34.9gb 34.9gb
green open logstash-2022.05.15 Y1mjFmGRTBiTiHz5gNR72Q 1 0 101916235      0 41.8gb 41.8gb
green open logstash-2022.05.16 c3Wglp3WQDC5DbvizjV5hA 1 0  76990942  71403 30.8gb 30.8gb
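
Note the non-zero docs.deleted column: delete-by-query only marks documents as deleted, and the disk space comes back as segments get merged. If the space is needed sooner, an expunge-only force merge can be run off-peak (sketch):

    curl -s -X POST 'localhost:9200/logstash-2022.05.11,logstash-2022.05.12/_forcemerge?only_expunge_deletes=true'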

I'm still seeing these errors:

{
  "statusCode": 429,
  "error": "Too Many Requests",
  "message": "[all shards failed: search_phase_execution_exception: [circuit_breaking_exception] Reason: [parent] Data too large, data for [indices:data/read/search[phase/query]] would be [49261646014/45.8gb], which is larger than the limit of [49061245747/45.6gb], real usage: [49261645336/45.8gb], new bytes reserved: [678/678b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=678/678b, model_inference=0/0b, eql_sequence=0/0b, accounting=8622004/8.2mb]]: all shards failed"
}

Found something interesting:

Once you cross that magical ~32 GB boundary, the pointers switch back to ordinary object pointers. The size of each pointer grows, more CPU-memory bandwidth is used, and you effectively lose memory. In fact, it takes until around 40–50 GB of allocated heap before you have the same effective memory of a heap just under 32 GB using compressed oops.

The moral of the story is this: even when you have memory to spare, try to avoid crossing the 32 GB heap boundary. It wastes memory, reduces CPU performance, and makes the GC struggle with large heaps.

https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html#compressed_oops

Switched to using 32 GB heap for now: c474a331

# JVM memory settings
es_jvm_min_heap: '32g'
es_jvm_max_heap: '32g'
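
It's worth double-checking that the new heap actually keeps compressed oops: on most JVMs the cutoff is slightly below 32 GB, so an Xmx of exactly 32g can land just past it. The nodes info API reports this directly (field name as I remember it from the 7.x response):

    curl -s 'localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.using_compressed_ordinary_object_pointers&pretty'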

I haven't seen this issue in a while.