CardanoSolutions/kupo

Most recent tip of node is sometimes stuck

nielstron opened this issue · 6 comments

What Git revision / release tag are you using?

$  kupo --version
v2.5.0

Describe what the problem is?

Occasionally, kupo gets "stuck", i.e. it stops making progress updating the database. This often happens after rollbacks or garbage collection on larger kupo instances. Most of the time, this can be caught by querying /health and comparing the most recent node tip with the most recent checkpoint.

Recently I have noticed that it may also happen that kupo reports most recent node tip == most recent checkpoint, yet neither of them is the actual current node tip; both may be 1000 slots or more behind. This is problematic since, based on my reading of the docs, the difference between the two values is what should indicate an inconsistency, so when they are equal the instance looks healthy even though it is not.
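To illustrate, a minimal watchdog along these lines could look like the sketch below. The port, the 600-slot tolerance, and the mainnet Shelley offset (slot 4492800 at unix time 1596059091, one slot per second since) are assumptions for illustration, not something kupo itself exposes.

#!/bin/sh
# Sketch: flag a kupo instance as stuck when the checkpoint lags the reported
# node tip, or when the reported node tip itself stops tracking the wall clock.
health=$(curl -sf http://localhost:6666/health) || { echo "kupo unreachable"; exit 1; }
checkpoint=$(printf '%s\n' "$health" | awk '/^kupo_most_recent_checkpoint/ { print $2 }')
node_tip=$(printf '%s\n' "$health" | awk '/^kupo_most_recent_node_tip/ { print $2 }')
[ -n "$checkpoint" ] && [ -n "$node_tip" ] || { echo "metrics missing from /health"; exit 1; }
# Approximate the current mainnet slot from the clock (assumed constants:
# Shelley started at slot 4492800 / unix time 1596059091, one slot per second).
expected_tip=$(( $(date +%s) - 1596059091 + 4492800 ))
echo "checkpoint=$checkpoint node_tip=$node_tip expected~=$expected_tip"
# 600 slots (~10 minutes) is an arbitrary tolerance for this sketch.
if [ $(( node_tip - checkpoint )) -gt 600 ]; then echo "WARN: checkpoint lagging behind node tip"; fi
if [ $(( expected_tip - node_tip )) -gt 600 ]; then echo "WARN: reported node tip looks stale"; fi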

What should be the expected behavior?

Either the reported tip should always point to the actual current tip of the Cardano node, or the docs should state that this value may differ from the actual tip of the node (otherwise I am not sure what "Absolute slot number of the current tip of the node." refers to).

If applicable, what are the logs from the server around the occurrence of the problem?

The stuck kupo instance runs on mainnet with a wildcard match (*) and --prune-utxo. The non-stuck instance matches only a few addresses, also with --prune-utxo.
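For completeness, the stuck instance is started roughly as follows. Only --match "*", --prune-utxo, and the --gc-interval discussed further down reflect the actual configuration; the paths, working directory, host, and --since value are placeholders.

kupo \
  --node-socket /path/to/cardano-node/node.socket \
  --node-config /path/to/cardano-node/config.json \
  --workdir /var/lib/kupo \
  --since origin \
  --match "*" \
  --prune-utxo \
  --gc-interval 86400 \
  --host 127.0.0.1 \
  --port 6666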

# instance 1 of kupo (not stuck)
$ curl http://localhost:1449/health
# TYPE kupo_connection_status gauge
kupo_connection_status  1.0
# TYPE kupo_most_recent_checkpoint counter
kupo_most_recent_checkpoint  107100805
# TYPE kupo_most_recent_node_tip counter
kupo_most_recent_node_tip  107100805
# instance 2 of kupo (stuck)
$ curl http://localhost:6666/health
# TYPE kupo_connection_status gauge
kupo_connection_status  1.0
# TYPE kupo_most_recent_checkpoint counter
kupo_most_recent_checkpoint  107091824
# TYPE kupo_most_recent_node_tip counter
kupo_most_recent_node_tip  107091824

Note that the "stuck" readings above were taken just before a restart. After the restart, /health correctly returned the current node tip, and the most recent checkpoint showed that the database was lagging behind.

Oct 30 05:21:53 ns366159 bash[604887]: {"severity":"Notice","timestamp":"2023-10-30T05:21:53.060725079Z","thread":"11","message":{"Consumer":{"tag":"ConsumerRollBackward","point":107077019}},"version":"v2.5.0"}
Oct 30 07:22:34 ns366159 bash[604887]: {"severity":"Notice","timestamp":"2023-10-30T07:22:34.777298117Z","thread":"11","message":{"Consumer":{"tag":"ConsumerRollBackward","point":107084260}},"version":"v2.5.0"}
Oct 30 08:02:25 ns366159 bash[604887]: {"severity":"Notice","timestamp":"2023-10-30T08:02:25.023006211Z","thread":"11","message":{"Consumer":{"tag":"ConsumerRollBackward","point":107086643}},"version":"v2.5.0"}
Oct 30 08:28:48 ns366159 bash[604887]: {"severity":"Notice","timestamp":"2023-10-30T08:28:48.759648272Z","thread":"13","message":{"Gardener":{"tag":"GardenerBeginGarbageCollection","pruning":"inputs"}},"version":"v2.5.0"}
Oct 30 08:35:13 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T08:35:13.950278741Z","thread":"1277002","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 08:39:14 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T08:39:14.019618461Z","thread":"1277770","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 08:39:14 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T08:39:14.019775963Z","thread":"1277679","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 08:39:14 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T08:39:14.019920442Z","thread":"1277638","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 08:39:14 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T08:39:14.020023655Z","thread":"1277633","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 08:39:14 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T08:39:14.020095894Z","thread":"1277629","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 08:39:14 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T08:39:14.020194269Z","thread":"1277627","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 08:43:37 ns366159 bash[604887]: {"severity":"Notice","timestamp":"2023-10-30T08:43:37.574315196Z","thread":"11","message":{"Consumer":{"tag":"ConsumerRollBackward","point":107088571}},"version":"v2.5.0"}
Oct 30 08:49:14 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T08:49:14.688130765Z","thread":"1279499","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 08:59:18 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T08:59:18.496097533Z","thread":"1280988","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 09:03:19 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T09:03:19.303274302Z","thread":"1281518","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 09:11:19 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T09:11:19.52097517Z","thread":"1282596","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 09:15:20 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T09:15:20.73444377Z","thread":"1283178","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 09:23:23 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T09:23:23.045698368Z","thread":"1284479","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 09:27:23 ns366159 bash[604887]: {"severity":"Error","timestamp":"2023-10-30T09:27:23.168315543Z","thread":"1285119","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 09:28:42 ns366159 bash[604887]: {"severity":"Notice","timestamp":"2023-10-30T09:28:42.124875251Z","thread":"13","message":{"Gardener":{"tag":"GardenerExitGarbageCollection","totalPrunedRows":251491}},"version":"v2.5.0"}
Oct 30 09:28:42 ns366159 bash[604887]: {"severity":"Notice","timestamp":"2023-10-30T09:28:42.124951318Z","thread":"13","message":{"Gardener":{"tag":"GardenerBeginGarbageCollection","pruning":"binary_data"}},"version":"v2.5.0"}
Oct 30 12:02:30 ns366159 systemd[1]: Stopping kupo service...
Oct 30 12:02:31 ns366159 systemd[1]: kupo-mainnet.service: Deactivated successfully.
Oct 30 12:02:31 ns366159 systemd[1]: Stopped kupo service.
Oct 30 12:02:31 ns366159 systemd[1]: kupo-mainnet.service: Consumed 2h 47min 21.495s CPU time.
Oct 30 12:02:31 ns366159 systemd[1]: Started kupo service.
Oct 30 12:02:32 ns366159 bash[2732457]: {"severity":"Notice","timestamp":"2023-10-30T12:02:32.20586506Z","thread":"9","message":{"Http":{"tag":"HttpServerListening","host":"127.0.0.1","port":6666}},"version":"v2.5.0"}
Oct 30 12:02:32 ns366159 bash[2732457]: {"severity":"Notice","timestamp":"2023-10-30T12:02:32.211187813Z","thread":"11","message":{"Consumer":{"tag":"ConsumerRollBackward","point":107091824}},"version":"v2.5.0"}
Oct 30 12:06:32 ns366159 bash[2732457]: {"severity":"Error","timestamp":"2023-10-30T12:06:32.270548979Z","thread":"62","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}
Oct 30 12:10:34 ns366159 bash[2732457]: {"severity":"Error","timestamp":"2023-10-30T12:10:34.78945036Z","thread":"791","message":{"Http":{"tag":"HttpUnexpectedError","hint":"Thread killed by timeout manager"}},"version":"v2.5.0"}

Re-querying /health after the restart:

$ curl http://localhost:6666/health
# TYPE kupo_connection_status gauge
kupo_connection_status  1.0
# TYPE kupo_most_recent_checkpoint counter
kupo_most_recent_checkpoint  107091840
# TYPE kupo_most_recent_node_tip counter
kupo_most_recent_node_tip  107101044

From the logs, it looks to me like kupo gets seriously stuck when pruning binary data: the binary_data garbage collection that began at 09:28 was still running when the service was stopped at 12:02, more than three hours after the whole GC pass started at 08:28.

It is fine if kupo lags behind sometimes due to garbage collection. However, it is an issue if the most recent node tip is not updated during this time.
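For scale, the post-restart /health numbers above already put the checkpoint roughly 9,200 slots behind the reported node tip, i.e. about two and a half hours at one slot per second on mainnet:

# 107101044 - 107091840 = 9204 slots ~= 153 minutes
echo "lag: $(( (107101044 - 107091840) / 60 )) minutes behind the node tip"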

KtorZ commented

Please upgrade to >= 2.6.0; this was addressed in 2.6.0 (see the CHANGELOG).

KtorZ commented

Out of curiosity, what interval do you use for --gc-interval?

nielstron commented

Thanks, I saw the note in the changelog but was not sure whether it also applies to garbage collection.

The GC interval is set to 24h (86400 seconds) because we figured each pass would take at least 10-20 minutes, which is far too long for a 60-minute interval.

KtorZ commented

24h might be a bit much to be honest. I'd suggest leaving the default of 60min; it should not take 10-20 minutes, and the GC is done incrementally without disturbing either the main writer (fetching blocks from the network) or the concurrent reader clients. So in principle, there's no reason to delay it.
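One way to sanity-check how long each GC pass actually takes is to pair up the Gardener begin/exit entries in the logs, e.g. (unit name taken from the journal excerpts above):

journalctl -u kupo-mainnet.service --no-pager \
  | grep -E 'GardenerBeginGarbageCollection|GardenerExitGarbageCollection'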

nielstron commented

Trying this out now. Thanks for the patch!