Content Node Always Down

Question

aryamanvinchhi opened this issue 3 months ago · 5 comments

I have a content node that is constantly down (it keeps restarting every 30 minutes or so). The logs look mostly fine, but I did notice the message quoted at the end of this post.

Steps to reproduce:
Nothing specific: I created a cluster, ingested documents, and now one node is struggling.

Any ideas on how to debug or proceed here? I also tried replacing the node (no data loss since the data is persisted on a mount) but the problem still exists.

"terminate called after throwing an instance of search::chunkException
terminate called recursively
incremented restart penalty to 14 seconds"

Answer 1 · 2024-09-19T13:22:00.000Z

Version 8.270.8

Answer 2 · 2024-09-19T15:42:36.000Z

Quick correction - the pod itself does not restart but it is the vespa-proton indexing service that keeps starting again and again. From what I understand, this is actually not an issue but expected behavior.

I tried stopping and starting services again, but the node continues to show a "Connection reset" error on the cluster controller page. The restart penalty is up to 1800 seconds now.
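For anyone following the same symptom, a small script can show how often the process aborts and how the restart penalty grows over time. This is only a minimal sketch: it assumes the log contains the exact phrases quoted in the question ("terminate called after throwing" and "incremented restart penalty to N seconds"). The function name `summarize` and the sample lines are illustrative, and the location of the log file is deployment-specific and not given in this thread.

```python
import re

# Patterns taken from the messages quoted in the question; treat them as
# assumptions about the log wording, not as a documented log format.
PENALTY_RE = re.compile(r"incremented restart penalty to (\d+) seconds")
TERMINATE_RE = re.compile(r"terminate called after throwing")


def summarize(lines):
    """Return (number of abort messages, list of penalty values in order)."""
    terminate_count = 0
    penalties = []
    for line in lines:
        if TERMINATE_RE.search(line):
            terminate_count += 1
        match = PENALTY_RE.search(line)
        if match:
            penalties.append(int(match.group(1)))
    return terminate_count, penalties


# Sample input: the three lines quoted in the question.
SAMPLE = [
    "terminate called after throwing as instance of search::chunkException",
    "terminate called recursively",
    "incremented restart penalty to 14 seconds",
]

print(summarize(SAMPLE))
```

Passing an open file object instead of `SAMPLE` works the same way, since the function only iterates over lines. A steadily growing list of penalties alongside repeated abort messages suggests the service is failing on every start rather than intermittently.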

Answer 3 · 2024-09-20T13:57:16.000Z

The document store data is corrupt for some reason (corruption, incomplete write, bug). We would be interested in looking at it, but I think that will be hard for non-technical reasons, and you are also on a quite old version.

Unless you have configured redundancy 1, the data will already have been restored as secondary copies on the other nodes, so you can get out of this situation by deleting the data on this node.
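Before deleting anything, it may be worth confirming the configured redundancy. The following is a rough sketch, assuming the application's services.xml declares redundancy as a `<redundancy>` element directly under each `<content>` element with no XML namespace; other layouts are not handled. The helper names are hypothetical. It checks the configured value only, not whether the other nodes actually hold up-to-date copies.

```python
import xml.etree.ElementTree as ET


def redundancy_values(services_xml):
    """Map each content cluster id to its configured redundancy.

    Assumes <content id="..."><redundancy>N</redundancy>...</content>.
    Clusters without that element are left out of the result.
    """
    root = ET.fromstring(services_xml)
    values = {}
    for content in root.iter("content"):
        node = content.find("redundancy")
        if node is not None and node.text:
            values[content.get("id", "?")] = int(node.text.strip())
    return values


def redundancy_allows_wipe(services_xml):
    """True only if every content cluster found has redundancy of at least 2."""
    values = redundancy_values(services_xml)
    return bool(values) and all(v >= 2 for v in values.values())


# Illustrative input; the cluster id "docs" is made up for this example.
EXAMPLE = """
<services version="1.0">
  <content id="docs" version="1.0">
    <redundancy>2</redundancy>
  </content>
</services>
"""

print(redundancy_values(EXAMPLE), redundancy_allows_wipe(EXAMPLE))
```

If the check returns False, or no redundancy element is found, deleting the node's data risks losing the only copy, so it is safer to stop and ask first.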

Answer 4 · 2024-09-25T12:39:28.000Z

In Vespa 8.413.11 we have extended the chunk exception with more details (#32452) that will be logged if something similar happens again.

Please upgrade to the newest version and report back.

Answer 5 · 2024-09-26T13:21:24.000Z

Sounds great, thank you!