prune progress intermittently not updating
Opened this issue · 0 comments
System information
Erigon version: v2.60.0
OS & Version: Linux
Commit hash: 88680ca
Erigon Command (with flags/config):
nohup ./erigon/build/bin/erigon \
--datadir=/mnt/disks/ssd1/bor-mainnet \
--chain=bor-mainnet \
--bor.heimdall=http://localhost:1317 \
--prune=hrtc \
--metrics=true \
--pprof=true \
--db.size.limit=16TB
Consensus Layer: N/A (Polygon with Heimdall)
Consensus Layer Command (with flags/config): N/A
Chain/Network: Polygon
Expected behaviour
Prune progress gets updated after an iteration of the sync loop
Actual behaviour
After running several iterations of the stage loop, I noticed that the prune progress was not getting update for stage LogIndex
.
The pattern was:
- stage loop iteration 1 runs
Forward
and thenPrune
: the logs forLogIndex
stage the logs indicate thatpruneFrom=0
which is correct the first time - stage loop iteration 2 runs
Forward
and thenPrune
: the logsLogIndex
stage the logs indicate thatpruneFrom=0
which is not correct this time as in the previous iteration the stage got pruned to block ~56 million - stage loop iteration 3 again shows
pruneFrom=0
Iteration 1:
INFO] [05-04|03:45:38.457] [13/15 LogIndex Prune] table=TransactionLog block=55878394 pruneFrom=0 pruneTo=56400564
[INFO] [05-04|03:45:59.917] [mem] memory stats Rss=29.4GB Size=0B Pss=29.4GB SharedClean=1.6MB SharedDirty=0B PrivateClean=21.3GB PrivateDirty=8.1GB Referenced=28.5GB Anonymous=8.1GB Swap=0B alloc=6.8GB sys=13.0GB
[INFO] [05-04|03:46:08.445] [13/15 LogIndex Prune] table=TransactionLog block=55880273 pruneFrom=0 pruneTo=56400564
[INFO] [05-04|03:46:38.470] [13/15 LogIndex Prune] table=TransactionLog block=55883020 pruneFrom=0 pruneTo=56400564
...
[INFO] [05-04|05:37:08.444] [13/15 LogIndex Prune] table=TransactionLog block=56399565 pruneFrom=0 pruneTo=56400564
[INFO] [05-04|08:15:23.719] [13/15 LogIndex] Prune done in=4h30m15.276317608s
Iteration 2:
[INFO] [05-05|09:56:15.329] [13/15 LogIndex Prune] table=TransactionLog block=55878398 pruneFrom=0 pruneTo=56474130
[INFO] [05-05|09:56:45.329] [13/15 LogIndex Prune] table=TransactionLog block=55881003 pruneFrom=0 pruneTo=56474130
[INFO] [05-05|09
...
[INFO] [05-05|12:04:45.329] [13/15 LogIndex Prune] table=TransactionLog block=56473500 pruneFrom=0 pruneTo=56474130
[INFO] [05-05|12:04:56.720] [13/15 LogIndex Prune] Flushed buffer file name=erigon-sortable-buf-2775363091
[INFO] [05-05|14:47:28.881] [13/15 LogIndex] Prune done in=4h51m43.553679824s
Iteration 3:
[INFO] [05-06|09:29:48.991] [13/15 LogIndex Prune] table=TransactionLog block=55878461 pruneFrom=0 pruneTo=56523622
...
[INFO] [05-06|11:46:49.004] [13/15 LogIndex Prune] table=TransactionLog block=56477434 pruneFrom=0 pruneTo=56523622
...
<I interrupted and killed the node here as I noticed this weirdness>
In the code pruneFrom = s.PruneProgress
which gets persisted at the end of the stage using s.DoneAt(pruneTo)
. So iteration 2 should have had pruneFrom = 56400564
, iteration 3 should have had pruneFrom = 56474130
In addition, integration print_stages
also displayed prune_at
for LogIndex
equal to 0, although the prune had run through a few iterations already and it was past 0.
Note: after restarting Erigon it looks like the issue disappeared and the prune progress magically got updated after the first iteration of the stage loop. This makes me wonder if this may be some weird MDBX commit issue?
Iteration 1:
[INFO] [05-06|22:34:38.127] [13/15 LogIndex Prune] table=TransactionLog block=55878402 pruneFrom=0 pruneTo=56553775
...
[INFO] [05-07|00:57:08.199] [13/15 LogIndex Prune] table=TransactionLog block=56553106 pruneFrom=0 pruneTo=56553775
[INFO] [05-07|03:43:57.004] [13/15 LogIndex] Prune done in=5h9m48.879256096s
Iteration 2:
[INFO] [05-07|15:50:05.062] [13/15 LogIndex Prune] table=TransactionLog block=56556660 pruneFrom=56553775 pruneTo=56583214
...
[INFO] [05-07|15:55:05.062] [13/15 LogIndex Prune] table=TransactionLog block=56581649 pruneFrom=56553775 pruneTo=56583214
[INFO] [05-07|16:35:27.706] [13/15 LogIndex] Prune done in=45m52.646752289s
Here, as we can see pruneFrom = 0
in iteration 1 and then in iteration 2 it is correctly reflected as pruneFrom = 56553775
.
Additionally, the output of integration print_stages
was correctly reflecting the prune_at
progress.
The fact this worked fine straight after restart makes me think it may be related to some flushing/committing issue.
Steps to reproduce the behaviour
- setup a pruned node
- stop node
- can run
integration stage_log_index --resetPruneAt=true --datadir=... --chaindata=...
to simulate restarting to 0 - start node
- explore logs (if you are lucky you will reproduce, may be an intermittent issue)