MetisProtocol/metis-replica-node

Node stops syncing

Closed this issue · 20 comments

What Git revision / release tag are you using?

20240314-2

What is your hardware specification?

Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz

What Docker version are you using?

Docker version 23.0.1, build a5ee5b1

What Linux distribution are you using?

Ubuntu 22.04.2 LTS

Describe what the problem is?

I upgraded my metis-replica-node from 20220815 to 20240314-2. After the upgrade, the node stopped syncing.

l2geth log

INFO [03-16|12:31:27.942] Unlocked account                         address=0x00000398232E2064F896018496b4b44b3D62751F
INFO [03-16|12:31:27.942] Transaction pool price threshold updated price=0
INFO [03-16|12:31:27.943] Transaction pool price threshold updated price=0
ERROR[03-16|12:31:30.270] Invalid raw tx meta                      number=15238366 err=EOF
INFO [03-16|12:31:37.192] Block synchronisation started
INFO [03-16|12:31:37.192] Mining aborted due to sync
ERROR[03-16|12:31:45.171] handler blocksBeforeInsert get equeue err LOG15_ERROR= LOG15_ERROR="Normalized odd number of arguments by adding nil"
WARN [03-16|12:31:45.171] Synchronisation failed, retrying         err="element not found"

Then I set up another node, synchronised from zero, and it panicked:

INFO [03-16|12:32:04.075] Mining aborted due to sync
panic: Refund counter below zero (gas: 4800 > refund: 0)

goroutine 997 [running]:
github.com/ethereum-optimism/optimism/l2geth/core/state.(*StateDB).SubRefund(0xc0028627e0, 0x12c0)
	/l2geth/core/state/statedb.go:224 +0xdc
github.com/ethereum-optimism/optimism/l2geth/core/vm.init.makeGasSStoreFunc.func12(0xc003174000, 0xc000000a80, 0xc000000a80?, 0x12fcb5b?, 0xc003e24c30?)
	/l2geth/core/vm/operations_acl.go:73 +0x479
github.com/ethereum-optimism/optimism/l2geth/core/vm.(*EVMInterpreter).Run(0xc003178000, 0xc000000a80, {0xc00719fd84, 0x44, 0x7c}, 0x0)
	/l2geth/core/vm/interpreter.go:261 +0x86e
github.com/ethereum-optimism/optimism/l2geth/core/vm.run(0xc003174000, 0xc000000a80, {0xc00719fd84, 0x44, 0x7c}, 0x61?)
	/l2geth/core/vm/evm.go:75 +0x3a2
github.com/ethereum-optimism/optimism/l2geth/core/vm.(*EVM).Call(0xc003174000, {0x12f98e0, 0xc000000900}, {0xde, 0xad, 0xde, 0xad, 0xde, 0xad, 0xde, ...}, ...)
	/l2geth/core/vm/evm.go:256 +0x89c
github.com/ethereum-optimism/optimism/l2geth/core/vm.opCall(0xc003174000?, 0xc003178000, 0xc000000900, 0xc002d15c40, 0xc003ca0de0)
	/l2geth/core/vm/instructions.go:771 +0x30b
github.com/ethereum-optimism/optimism/l2geth/core/vm.(*EVMInterpreter).Run(0xc003178000, 0xc000000900, {0xc003a6cc7e, 0xa4, 0x182}, 0x0)
	/l2geth/core/vm/interpreter.go:277 +0xa96
github.com/ethereum-optimism/optimism/l2geth/core/vm.run(0xc003174000, 0xc000000900, {0xc003a6cc7e, 0xa4, 0x182}, 0x5b?)
	/l2geth/core/vm/evm.go:75 +0x3a2
github.com/ethereum-optimism/optimism/l2geth/core/vm.(*EVM).Call(0xc003174000, {0x12f98e0, 0xc000000600}, {0x5a, 0xb3, 0x90, 0x8, 0x48, 0x12, 0xe1, ...}, ...)
	/l2geth/core/vm/evm.go:256 +0x89c
github.com/ethereum-optimism/optimism/l2geth/core/vm.opCall(0xc003174000?, 0xc003178000, 0xc000000600, 0xc002d15120, 0xc0034a10c8)
	/l2geth/core/vm/instructions.go:771 +0x30b
github.com/ethereum-optimism/optimism/l2geth/core/vm.(*EVMInterpreter).Run(0xc003178000, 0xc000000600, {0xc001d16a00, 0x124, 0x124}, 0x0)
	/l2geth/core/vm/interpreter.go:277 +0xa96
github.com/ethereum-optimism/optimism/l2geth/core/vm.run(0xc003174000, 0xc000000600, {0xc001d16a00, 0x124, 0x124}, 0x5e?)
	/l2geth/core/vm/evm.go:75 +0x3a2
github.com/ethereum-optimism/optimism/l2geth/core/vm.(*EVM).Call(0xc003174000, {0x12f9100, 0xc001401140}, {0x2d, 0x4f, 0x78, 0x8f, 0xdb, 0x26, 0x2a, ...}, ...)
	/l2geth/core/vm/evm.go:256 +0x89c
github.com/ethereum-optimism/optimism/l2geth/core.(*StateTransition).TransitionDbWithBlockNumber(0xc0027c5650, 0x0)
	/l2geth/core/state_transition.go:303 +0xb1e
github.com/ethereum-optimism/optimism/l2geth/core.(*StateTransition).TransitionDb(...)
	/l2geth/core/state_transition.go:238
github.com/ethereum-optimism/optimism/l2geth/core.ApplyMessage(0x107c1a0?, {0x1307de0?, 0xc00285cea0?}, 0x78e6265f41b5b63a?)
	/l2geth/core/state_transition.go:164 +0x25
github.com/ethereum-optimism/optimism/l2geth/core.precacheTransaction(_, {_, _}, _, _, _, _, _, {0x0, {0x0, ...}, ...})
	/l2geth/core/state_prefetcher.go:83 +0x225
github.com/ethereum-optimism/optimism/l2geth/core.(*statePrefetcher).Prefetch(_, _, _, {0x0, {0x0, 0x0}, 0x0, 0x0, {{0x0, 0x0, ...}, ...}, ...}, ...)
	/l2geth/core/state_prefetcher.go:64 +0x2b4
github.com/ethereum-optimism/optimism/l2geth/core.(*BlockChain).insertChainWithFuncAndCh.func2({0xc1758232814169b6?, 0x3f376ab0e?, 0x1a598c0?})
	/l2geth/core/blockchain.go:2083 +0x149
created by github.com/ethereum-optimism/optimism/l2geth/core.(*BlockChain).insertChainWithFuncAndCh in goroutine 889
	/l2geth/core/blockchain.go:2081 +0x1a8a
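
For context on where this comes from: the trace ends in StateDB.SubRefund (core/state/statedb.go), which panics whenever the EVM asks to remove more gas from the refund counter than it currently holds. Below is a stripped-down sketch of that invariant (paraphrased from upstream go-ethereum, which l2geth is forked from; the real method also journals the change), not the exact l2geth source:

```go
package main

import "fmt"

// Minimal stand-in for the refund counter kept by l2geth's StateDB.
type stateDB struct {
	refund uint64
}

// AddRefund credits gas to the refund counter (e.g. when an SSTORE clears a slot).
func (s *stateDB) AddRefund(gas uint64) { s.refund += gas }

// SubRefund removes gas from the refund counter and panics if the counter
// would go below zero -- the same check that produces the message in the
// trace above.
func (s *stateDB) SubRefund(gas uint64) {
	if gas > s.refund {
		panic(fmt.Sprintf("Refund counter below zero (gas: %d > refund: %d)", gas, s.refund))
	}
	s.refund -= gas
}

func main() {
	s := &stateDB{}
	// Subtracting a refund that was never credited reproduces the exact
	// panic string seen in the logs.
	s.SubRefund(4800)
}
```

The subtraction itself is issued by the SSTORE gas logic in core/vm/operations_acl.go (visible in the trace) while the block is being re-executed, which is why the panic appears during sync rather than at startup.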

If applicable, what are the logs from the server around the occurrence of the problem?

Can you provide a link to the available snapshots?

Hi, we are investigating this error; please use the snapshot we provided for now.

Did you provide a new snapshot that's current? The original snapshot provided is over two weeks old and this same error occurs on blocks from about a week ago when trying to sync.

Also, this is not a great way to provide snapshots as the volume is rehydrated from S3 as it's used. Making a tar ball and sticking it in S3 would actually be more efficient.

Our node is not running on AWS; can you provide a link to a snapshot that is available for download?

Hi, the l2geth snapshot is just for a fresh setup; you can use your previous l2geth data if you have run a legacy replica node.

The l2geth snapshot is too large to package, compress, and upload to object storage like S3.

The l1dtl snapshot is stored on S3; please refer to the latest release for it.

The issue persists even with fully-synced L1 DTL and --gcmode=archive for L2 Geth:

INFO [03-17|10:13:05.593] sync from other node                     index=2310344 hash=0xe100193ed4eb54090a639b30d53398c033ffd303ee2c2a4b38cd95731eb7df19                                          
INFO [03-17|10:13:05.593] sync from other node applyTransactionToTip finish current latest=2310344                                                                                                
panic: Refund counter below zero (gas: 15000 > refund: 8400)    

Hi, the l2geth snapshot is just for a fresh setup; you can use your previous l2geth data if you have run a legacy replica node.

The only reason anyone is using your snapshot is because you specifically stated l2geth data from legacy nodes won't work. Regardless, that doesn't work either. Even with GCMODE=archive. Any syncing method results in the error above.

The l2geth snapshot is too large to package, compress, and upload to object storage like S3.

Both the l2geth data we have locally and the data you provided in your snapshot are only ~300 GB. That is extremely small as far as chain data goes.

What do you need to help you replicate, acknowledge, and resolve this issue? We can provide anything.

The truth is:

  • The snapshot is from a legacy replica.
  • It was taken at a block height from two weeks ago.
  • Our RPC instances are using the snapshot.
  • I didn't see any errors.

Please provide the l2 geth snapshot - it will solve all issues.

@ericlee42 Are your RPC instances using L1DTL? Or still L2?

We have multiple healthy replicas running with L2. The new L1 sync requirement is the issue. Trying that again now for the 82nd, 83rd, and 84th time with a few different tweaks using our L2geth chaindata.

There is no l2dtl in the new replica; you must not use the l2dtl's data and config.

We provided the latest l1dtl snapshot yesterday; you can use it to spin up your l1dtl service, and it should catch up very quickly.

You can use the legacy replica's l2geth data; you just need to bump the image version.

Yeah, that's the problem. You're saying that, but your L1 configs and data don't work.

A question was asked above that you may have missed; please answer it.

The new replica node should use l1dtl.

But is that what you are doing?

I am unable to sync l2geth from scratch with --gcmode=archive using the new l1dtl service.
This error persists:
Refund counter below zero (gas: 4800 > refund: 0)

I am unable to sync l2geth from scratch with --gcmode=archive using the new l1dtl service.

We have provided an AWS EBS snapshot; you can use it to start l2geth if you did not run a replica node before.

Refer to the README for the details.

We have provided an AWS EBS snapshot; you can use it to start l2geth if you did not run a replica node before.

Since replica-node data from 5 nodes and 3 days' worth of backups doesn't work, I'll give this a shot as well. It only takes about 3-5 hours to rehydrate the EBS snapshot to make it usable.

First: using the latest l1dtl snapshot and the old l2geth data, you can restore synchronisation.

  1. Syncing l2geth from zero against l1dtl will panic, even if l2geth has gcmode=archive set.
  2. The old l2dtl and l2geth data will not synchronise with the latest image; please replace the dtl data with the l1dtl snapshot.

Yes. Nothing works. Thank you for confirming?

So, for anyone who finds this prematurely closed issue hoping for a solution, here's what worked for us:

  • The latest L1DTL snapshot provided by the Metis team (the original one has something wrong with it).
  • A clean snapshot of l2geth taken a few hours before the upgrade. The more it has to sync, the greater the chance of problems; anything taken after the upgrade won't sync at all. (We had to provide this from our own l2geth backups.)
  • gcmode must match whatever data you're using: if your backup came from a full node, use full. (The snapshots Metis provides also appear to be full, not archive.)
  • Either do not start l2geth until l1dtl is fully in sync, or stop it after the compose stack brings it up and replace the chaindata then.