prune progress intermittently not updating

Question

prune progress intermittently not updating

Opened this issue 5 months ago · 0 comments

taratorio commented 5 months ago

System information

Erigon version: v2.60.0

OS & Version: Linux

Commit hash: 88680ca

Erigon Command (with flags/config):

nohup ./erigon/build/bin/erigon \
--datadir=/mnt/disks/ssd1/bor-mainnet \
--chain=bor-mainnet \
--bor.heimdall=http://localhost:1317 \
--prune=hrtc \
--metrics=true \
--pprof=true \
--db.size.limit=16TB

Consensus Layer: N/A (Polygon with Heimdall)

Consensus Layer Command (with flags/config): N/A

Chain/Network: Polygon

Expected behaviour

Prune progress gets updated after an iteration of the sync loop

Actual behaviour

After running several iterations of the stage loop, I noticed that the prune progress was not getting update for stage LogIndex.

The pattern was:

stage loop iteration 1 runs Forward and then Prune: the logs for LogIndex stage the logs indicate that pruneFrom=0 which is correct the first time
stage loop iteration 2 runs Forward and then Prune: the logs LogIndex stage the logs indicate that pruneFrom=0 which is not correct this time as in the previous iteration the stage got pruned to block ~56 million
stage loop iteration 3 again shows pruneFrom=0

Iteration 1:

INFO] [05-04|03:45:38.457] [13/15 LogIndex Prune]                   table=TransactionLog block=55878394 pruneFrom=0 pruneTo=56400564
[INFO] [05-04|03:45:59.917] [mem] memory stats                       Rss=29.4GB Size=0B Pss=29.4GB SharedClean=1.6MB SharedDirty=0B PrivateClean=21.3GB PrivateDirty=8.1GB Referenced=28.5GB Anonymous=8.1GB Swap=0B alloc=6.8GB sys=13.0GB
[INFO] [05-04|03:46:08.445] [13/15 LogIndex Prune]                   table=TransactionLog block=55880273 pruneFrom=0 pruneTo=56400564
[INFO] [05-04|03:46:38.470] [13/15 LogIndex Prune]                   table=TransactionLog block=55883020 pruneFrom=0 pruneTo=56400564
...
[INFO] [05-04|05:37:08.444] [13/15 LogIndex Prune]                   table=TransactionLog block=56399565 pruneFrom=0 pruneTo=56400564
[INFO] [05-04|08:15:23.719] [13/15 LogIndex] Prune done              in=4h30m15.276317608s

Iteration 2:

[INFO] [05-05|09:56:15.329] [13/15 LogIndex Prune]                   table=TransactionLog block=55878398 pruneFrom=0 pruneTo=56474130
[INFO] [05-05|09:56:45.329] [13/15 LogIndex Prune]                   table=TransactionLog block=55881003 pruneFrom=0 pruneTo=56474130
[INFO] [05-05|09
...
[INFO] [05-05|12:04:45.329] [13/15 LogIndex Prune]                   table=TransactionLog block=56473500 pruneFrom=0 pruneTo=56474130
[INFO] [05-05|12:04:56.720] [13/15 LogIndex Prune] Flushed buffer file name=erigon-sortable-buf-2775363091
[INFO] [05-05|14:47:28.881] [13/15 LogIndex] Prune done              in=4h51m43.553679824s

Iteration 3:

[INFO] [05-06|09:29:48.991] [13/15 LogIndex Prune]                   table=TransactionLog block=55878461 pruneFrom=0 pruneTo=56523622
...
[INFO] [05-06|11:46:49.004] [13/15 LogIndex Prune]                   table=TransactionLog block=56477434 pruneFrom=0 pruneTo=56523622
...
<I interrupted and killed the node here as I noticed this weirdness>

In the code pruneFrom = s.PruneProgress which gets persisted at the end of the stage using s.DoneAt(pruneTo). So iteration 2 should have had pruneFrom = 56400564, iteration 3 should have had pruneFrom = 56474130

In addition, integration print_stages also displayed prune_at for LogIndex equal to 0, although the prune had run through a few iterations already and it was past 0.

Note: after restarting Erigon it looks like the issue disappeared and the prune progress magically got updated after the first iteration of the stage loop. This makes me wonder if this may be some weird MDBX commit issue?

Iteration 1:

[INFO] [05-06|22:34:38.127] [13/15 LogIndex Prune]                   table=TransactionLog block=55878402 pruneFrom=0 pruneTo=56553775
...
[INFO] [05-07|00:57:08.199] [13/15 LogIndex Prune]                   table=TransactionLog block=56553106 pruneFrom=0 pruneTo=56553775
[INFO] [05-07|03:43:57.004] [13/15 LogIndex] Prune done              in=5h9m48.879256096s

Iteration 2:

[INFO] [05-07|15:50:05.062] [13/15 LogIndex Prune]                   table=TransactionLog block=56556660 pruneFrom=56553775 pruneTo=56583214
...
[INFO] [05-07|15:55:05.062] [13/15 LogIndex Prune]                   table=TransactionLog block=56581649 pruneFrom=56553775 pruneTo=56583214
[INFO] [05-07|16:35:27.706] [13/15 LogIndex] Prune done              in=45m52.646752289s

Here, as we can see pruneFrom = 0 in iteration 1 and then in iteration 2 it is correctly reflected as pruneFrom = 56553775.
Additionally, the output of integration print_stages was correctly reflecting the prune_at progress.
The fact this worked fine straight after restart makes me think it may be related to some flushing/committing issue.

Steps to reproduce the behaviour

setup a pruned node
stop node
can run integration stage_log_index --resetPruneAt=true --datadir=... --chaindata=... to simulate restarting to 0
start node
explore logs (if you are lucky you will reproduce, may be an intermittent issue)