0xPolygonHermez/zkevm-prover

OOCS when reprocessFullBatch

Opened this issue · 4 comments

ENV:

1 synchronizer + 1 eth-tx-manager + 1 sequencer +1 sequence-sender + 1 l2gaspricer + 1aggregator + 1 rpc

the sync, seq, rpc and agg use the same executor and hashdb.

Describe:

when we do pressure test, some batchs run ok but when do reprocessFullBatch for the same batch, some may arise ooc error. One of OOCS log (prover.log) is:

20231020_053312_071273 7df4d2b eff9640 --> EXECUTOR_PROCESS_BATCH starting...
20231020_053312_071324 7df4d2b eff9640 ExecutorServiceImpl::ProcessBatch() got externalRequestId= sequencerAddr=f39fd6e51aad88f6f4ce6ab8827279cfffb92266 batchL2DataLength=498 batchL2Data=0xf901ad80843b9aca0184...84b80bf12f0d841e1cff oldStateRoot=6b500e4cfd4745ccc9d6fa7c790342f27e9c40b5218f0279f5c0e0c6277e4ab2 oldAccInputHash=0 oldBatchNum=9 chainId=1001 forkId=5 globalExitRoot=ad3228b676f7d3cd4284a5443f17f1962b36e491b30a40b2405849e597ba5fb5 timestamp=1697779988 from=0xDC58E4c5617E5bdE4507c014190BB0Cb66bBaD2C bUpdateMerkleTree=0 bNoCounters=0 traceConfig=bEnabled=0,bDisableStorage=0,bDisableStack=0,bEnableMemory=0,bEnableReturnData=0,txHashToGenerateExecuteTrace=,txHashToGenerateCallTrace=
20231020_053312_071328 7df4d2b eff9640 --> PROVER_PROCESS_BATCH starting...
20231020_053312_071330 7df4d2b eff9640 Prover::processBatch() timestamp=20231020_053312_071304 UUID=d9addd02-a808-4168-963c-73c8f184a808
20231020_053312_071332 7df4d2b eff9640 Executor::process_batch() fork 5 generated
20231020_053313_524163 7df4d2b eff9640 FullTracer::onError() error=OOCS zkPC=4659 rom= jmpAddr=4690 JMP=1 useJmpAddr=1 cmdBefore[0]={ op=functionCall funcName=eventLog params[0]={ op=getVar varName=onError } params[1]={ op=getVar varName=OOCS }} fileName=utils.zkasm line=829 lineStr=    $${eventLog(onError, OOCS)} :JMP(handleBatchError)
20231020_053313_526934 7df4d2b eff9640 --> STATE_MANAGER_FLUSH starting...
20231020_053313_528140 7df4d2b eff9640 State Manager calls: Duration=1449283us time/duration=56% flush success=1199us(14%)(0%)=1times=1199000ns/time semi flush=2211us(26%)(1%)=1times=2211000ns/time read success=40856us(496%)(28%)=116810times=349ns/time read not found=6207us(75%)(4%)=20335times=305ns/time deleteNodes=4329us(52%)(2%)=13854times=312ns/time write=17963us(218%)(12%)=21723times=826ns/time setStateRoot=9573us(116%)(6%)=3374times=2837ns/time TOTAL=82338us(1000)(56%)=176098times=467ns/time
20231020_053313_535515 7df4d2b eff9640 <-- STATE_MANAGER_FLUSH done: 0.008576 s
20231020_053313_535527 7df4d2b eff9640 Database map: dbMetrics MT.size=0 cached=137207times=44646us=0us/time db=0times=0us=0us/time cacheHitRatio=100% PROGRAM.size=0 cached=356times=1582us=4us/time db=0times=0us=0us/time cacheHitRatio=100% GET_TREE 0times=0us=0us/time=0fields=0.000000fields/time=0us/field
20231020_053313_535531 7df4d2b eff9640 main_exec_generated_fast() done lastStep=7968537 (94.992363%)
20231020_053313_542064 7df4d2b eff9640 <-- PROVER_PROCESS_BATCH done: 1.470733 s
20231020_053313_547196 7df4d2b eff9640 ExecutorServiceImpl::ProcessBatch() returns result=1 old_state_root=6b500e4cfd4745ccc9d6fa7c790342f27e9c40b5218f0279f5c0e0c6277e4ab2 new_state_root=0x6b500e4cfd4745ccc9d6fa7c790342f27e9c40b5218f0279f5c0e0c6277e4ab2 new_acc_input_hash=0x52fa3112cbba5ddc78bd76c7f059dbfd70712771f9f02c0b5f70ab80a4adf5f2 new_local_exit_root=0x0000000000000000000000000000000000000000000000000000000000000000 old_batch_num=9 steps=7968537 gasUsed=16496508 counters.keccakF=457 counters.poseidonG=156232 counters.paddingPG=43170 counters.memAlign=5345 counters.arith=17271 counters.binary=354738 flush_id=38 last_sent_flush_id=37 externalRequestId= nTxs=1 tx[0].hash=0x809e2eefff63c3168e1d1c0d79f31a2ef7468c92bd0825fb14b56817eb3c7b09 stateRoot=0x6b500e4cfd4745ccc9d6fa7c790342f27e9c40b5218f0279f5c0e0c6277e4ab2 gasUsed=16496508 gasLeft=3503492 gasUsed+gasLeft=20000000 gasRefunded=0 result=OOCS
20231020_053313_547205 7df4d2b eff9640 <-- EXECUTOR_PROCESS_BATCH done: 1.475931 s
20231020_053313_547717 7df4d2b eff9640 ExecutorServiceImpl::ProcessBatch() done counter=37 B=498 TX=1 gas=16496508 time=1.475931 TP=337.414147B/s=0.677538TX/s=11177018.437854gas/s=33125.518072gas/B totalTP(10s)=4127.354919B/s=0.769240TX/s=2526335.830189gas/s=612.095611gas/B totalTP(ever)=725.899942B/s=0.125029TX/s=284642.558892gas/s=392.123683gas/B totalTime=6.251408 filedesc=73

One interesting thing is that, when OOCS happend, we will always get:

20231020_053313_535527 7df4d2b eff9640 Database map: dbMetrics MT.size=0 cached=137207times=44646us=0us/time db=0times=0us=0us/time **cacheHitRatio=100%** PROGRAM.size=0 cached=356times=1582us=4us/time db=0times=0us=0us/time **cacheHitRatio=100%** GET_TREE 0times=0us=0us/time=0fields=0.000000fields/time=0us/field

The MT cacheHitRatio and program cacheHitRatio will always be 100%, but on the contrary, not true.

Problem:

Maybe the HashDB has been polluted when we run multi executor concurrent.

What Prover version are you using? Can you provide more information about the test? Do all test inputs have their required DB entries, or do you depend on an external DB to have some data? OOCS errors are triggered by the ROM, so it is very strange that sometimes they happen, and sometimes they don't. @louisliu2048

hermeznetwork/zkevm-prover:v2.2.1
is a test on xen protocol,
no required db entries
and no depend on external DB

What Prover version are you using? Can you provide more information about the test? Do all test inputs have their required DB entries, or do you depend on an external DB to have some data? OOCS errors are triggered by the ROM, so it is very strange that sometimes they happen, and sometimes they don't. @louisliu2048

Yes, and I don't think is ROM's bug that lead to this OOCS error, I think there will be some logical bugs when we run executor concurrent -- they share the same hashDB instance.

We have provided doc to reproduce this error to your colleages, and we can reproduce this OOCS error every time. Hop you can reproduce it and find out the reason behind it.