vespa-engine/vespa

Content Node Always Down

aryamanvinchhi opened this issue · 5 comments

I have a content node that is constantly down (it keeps restarting every 30 min or so). The logs look mostly fine, but I did notice the message quoted below.

Steps to reproduce:
Nothing specific here: I created a cluster, ingested documents, and now find that one node is struggling.

Any ideas on how to debug or proceed here? I also tried replacing the node (no data loss since the data is persisted on a mount) but the problem still exists.

"terminate called after throwing an instance of search::chunkException
terminate called recursively
incremented restart penalty to 14 seconds"
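
To see how often the service is crashing and how far the restart penalty has climbed, the log can be scanned for these two messages. This is a minimal sketch, not a Vespa tool: the patterns are copied from the messages quoted above, the function name summarize_restarts is hypothetical, and the exact wording in a real log file may differ.

```python
import re

# Patterns based on the messages quoted in this issue (assumed wording).
CRASH_RE = re.compile(r"terminate called after throwing", re.IGNORECASE)
PENALTY_RE = re.compile(r"incremented restart penalty to (\d+) seconds")


def summarize_restarts(log_lines):
    """Count crash messages and collect restart penalty values in log order."""
    crashes = 0
    penalties = []
    for line in log_lines:
        if CRASH_RE.search(line):
            crashes += 1
        match = PENALTY_RE.search(line)
        if match:
            penalties.append(int(match.group(1)))
    return {
        "crashes": crashes,
        "penalties": penalties,
        "max_penalty": max(penalties, default=0),
    }


# Sample input built from the lines quoted in this issue.
sample_log = [
    "terminate called after throwing an instance of search::chunkException",
    "terminate called recursively",
    "incremented restart penalty to 14 seconds",
]
summary = summarize_restarts(sample_log)
print(summary)
```

A steadily growing list of penalties together with a matching number of crash messages would suggest the same failure is repeating on every start, rather than an occasional crash.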

Version 8.270.8

Quick correction - the pod itself does not restart but it is the vespa-proton indexing service that keeps starting again and again. From what I understand, this is actually not an issue but expected behavior.

I tried stopping and starting services again, but the node continues to show a "Connection reset" error on the cluster controller page. The restart penalty is up to 1800 seconds now.

The document store data is corrupt for some reason (corruption, incomplete write, bug). We would be interested in looking at it, but I think that will be hard for non-technical reasons, and you are also on quite an old version.

Unless you have configured redundancy 1, the data will already have been restored as secondary copies on the other nodes, so you can get out of this situation by deleting the data on this node.
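
Before deleting anything, it is worth confirming the configured redundancy. A rough sketch of that check follows, assuming the application package's services.xml declares redundancy in a redundancy element directly under each content element. The sample XML and the function names are hypothetical illustrations, not taken from this cluster.

```python
import xml.etree.ElementTree as ET

# Hypothetical services.xml fragment; a real file has more elements.
SAMPLE_SERVICES_XML = """
<services version="1.0">
  <content id="docs" version="1.0">
    <redundancy>2</redundancy>
  </content>
</services>
"""


def content_redundancy(services_xml):
    """Map each content cluster id to its configured redundancy (None if absent)."""
    root = ET.fromstring(services_xml)
    result = {}
    for content in root.iter("content"):
        node = content.find("redundancy")  # assumed element name
        value = int(node.text.strip()) if node is not None and node.text else None
        result[content.get("id")] = value
    return result


def has_other_copies(redundancy):
    """True only when redundancy is known and greater than 1."""
    return redundancy is not None and redundancy > 1


redundancy_by_cluster = content_redundancy(SAMPLE_SERVICES_XML)
for cluster_id, redundancy in redundancy_by_cluster.items():
    print(cluster_id, redundancy, has_other_copies(redundancy))
```

If the check reports a redundancy of 1 or cannot find a value, the data on this node may be the only copy, so it should not be deleted on the basis of this sketch alone.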

In Vespa 8.413.11 we have extended the chunk exception with more details (#32452) that will be logged if something similar happens again.

Please upgrade to the newest version and report back.

Sounds great, thank you!