litecoin-project/litecoin

Unable to start Litecoin via snapshot since upgrading from 0.18.3 to 0.21.3

samclusker opened this issue · 1 comment

Issue:

Running: RPC Node

In 0.18.3 we could take a snapshot of the data disk (a separately attached disk) and use that snapshot to deploy a new machine that syncs much faster. This is a pretty standard approach.

Since upgrading to 0.21.3, disk snapshots now fail: the litecoind process requires a reindex of blocks in all circumstances.

I ended up syncing from scratch to get a healthy node, then snapshotted the disk and attempted to start a fresh node from that snapshot, but it fails. Reverting the upgrade on a fresh node and snapshotting again resolves the issue. I'm reporting this as a bug since the behaviour is only evident post-upgrade.
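For reference, the restore side of that workflow looks roughly like the sketch below. The disk, snapshot, instance names and zone are placeholders, not our real tooling:

```python
import subprocess

# Placeholder names/zone for illustration only.
ZONE = "europe-west1-b"
SNAPSHOT = "litecoin-data-snap"       # snapshot taken from the healthy, synced node
NEW_DISK = "litecoin-data-restored"
NEW_INSTANCE = "litecoin-node-2"

def gcloud(*args):
    """Run a gcloud compute command and fail loudly if it errors."""
    subprocess.run(["gcloud", "compute", *args, f"--zone={ZONE}"], check=True)

# Create a fresh persistent disk from the snapshot and attach it to the new VM.
gcloud("disks", "create", NEW_DISK, f"--source-snapshot={SNAPSHOT}")
gcloud("instances", "attach-disk", NEW_INSTANCE, f"--disk={NEW_DISK}")

# The new machine then mounts the disk and starts litecoind against the copied
# datadir; on 0.18.3 this synced the remaining blocks, on 0.21.3 it aborts as shown below.
```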

Expected Behaviour:
The node rewinds a few blocks and then syncs the final blocks missing from the snapshot.

Actual Behaviour:
The node rewinds several blocks before failing and requiring a reindex:

litecoin-1  | 2024-09-21T14:05:46Z Verifying last 24 blocks at level 3
litecoin-1  | 2024-09-21T14:05:46Z [0%]...ERROR: DisconnectBlock(): Failed to disconnect MWEB block
litecoin-1  | 2024-09-21T14:05:46Z ERROR: VerifyDB(): *** irrecoverable inconsistency in block data at 2759027, hash=ed909be0679ff0c2f8ba953a5885d29cbea87ffd4c9fd2dc50311d04b2a1419e
litecoin-1  | 2024-09-21T14:05:46Z : Corrupted block database detected.
litecoin-1  | Please restart with -reindex or -reindex-chainstate to recover.
litecoin-1  | : Corrupted block database detected.
litecoin-1  | Please restart with -reindex or -reindex-chainstate to recover.
litecoin-1  | 2024-09-21T14:05:46Z Aborted block database rebuild. Exiting.

This suggests the final blocks may be the issue. As one troubleshooting step, I attempted to remove any block files that were dated just before the snapshot was taken.
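The files I removed were identified with a quick check along these lines (the datadir path, snapshot timestamp, and window are illustrative only):

```python
from datetime import datetime, timezone
from pathlib import Path

# Illustrative values only; adjust to the real datadir and snapshot time.
BLOCKS_DIR = Path("/data/litecoin/blocks")
SNAPSHOT_TIME = datetime(2024, 9, 21, 14, 0, tzinfo=timezone.utc)
WINDOW_HOURS = 1  # flag block files written in the hour before the snapshot

for f in sorted(BLOCKS_DIR.glob("*.dat")):  # covers blk*.dat and rev*.dat
    mtime = datetime.fromtimestamp(f.stat().st_mtime, tz=timezone.utc)
    hours_before = (SNAPSHOT_TIME - mtime).total_seconds() / 3600
    if 0 <= hours_before <= WINDOW_HOURS:
        print(f"{f.name}  last modified {mtime.isoformat()}")
```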

Reproducing Issue:

Since the upgrade the issue is consistently reproducible: the 0.21.3 main node is fully synced. On multiple occasions I have shut down the process before taking a snapshot, to rule out snapshot corruption, but every snapshot fails in the same way.

We're running a GCP VM with an attached persistent disk mounted as a volume. This disk is snapshotted for use by other nodes.
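Each snapshot was taken with the node stopped, roughly like this (the compose service and disk names are placeholders):

```python
import subprocess

# Placeholder names for illustration only.
SERVICE = "litecoin"            # docker compose service running litecoind
DATA_DISK = "litecoin-data"     # attached persistent disk holding the datadir
ZONE = "europe-west1-b"

# 1. Stop litecoind cleanly so the block/chainstate databases are flushed to disk.
subprocess.run(["docker", "compose", "stop", SERVICE], check=True)

# 2. Snapshot the persistent disk while the node is down.
subprocess.run(
    ["gcloud", "compute", "disks", "snapshot", DATA_DISK,
     f"--zone={ZONE}", "--snapshot-names=litecoin-data-snap"],
    check=True,
)

# 3. Bring the original node back up.
subprocess.run(["docker", "compose", "start", SERVICE], check=True)
```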

Version Used and Build Method:
0.21.3 built from source within a Dockerfile: https://github.com/flare-foundation/connected-chains-docker/blob/main/images/litecoind/Dockerfile

System Details
GCP e2 virtual machine using a balanced persistent disk
Ubuntu 22.04 OS
Running with Docker