Miner bug: `CommitFailed` with nil CommR causes panic
rvagg commented
From slack: https://filecoinproject.slack.com/archives/CPFTWMY7N/p1723758975588969
2024-08-16T05:33:52.389+0800 ERROR evtsm go-statemachine@v1.0.3/machine.go:116 executing step: panic: runtime error: invalid memory address or nil pointer dereference
goroutine 35499 [running]:
github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).Plan.func1.1()
/home/runner/work/lotus/lotus/lotus/storage/pipeline/fsm.go:48 +0x7b
panic({0x4b32940?, 0xa1df940?})
/opt/hostedtoolcache/go/1.21.12/x64/src/runtime/panic.go:914 +0x21f
github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).checkCommit(_, {_, _}, {{0xc01e075940, 0xc}, 0x226c, 0xd, 0x66ad4e8e, {0xc03fea1220, 0x1, ...}, ...}, ...)
/home/runner/work/lotus/lotus/lotus/storage/pipeline/checks.go:240 +0x43b
github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).handleCommitFailed(_, {{_, _}, _}, {{0xc01e075940, 0xc}, 0x226c, 0xd, 0x66ad4e8e, {0xc03fea1220, ...}, ...})
/home/runner/work/lotus/lotus/lotus/storage/pipeline/states_failed.go:333 +0x44c
github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).Plan.func1({{_, _}, _}, {{0xc01e075940, 0xc}, 0x226c, 0xd, 0x66ad4e8e, {0xc03fea1220, 0x1, ...}, ...})
/home/runner/work/lotus/lotus/lotus/storage/pipeline/fsm.go:63 +0xd6
reflect.Value.call({0x4a9fc80?, 0xc016ce1fc0?, 0x55eefc?}, {0x5013777, 0x4}, {0xc029d17f98, 0x2, 0x5559c5?})
/opt/hostedtoolcache/go/1.21.12/x64/src/reflect/value.go:596 +0xce7
reflect.Value.Call({0x4a9fc80?, 0xc016ce1fc0?, 0x4c203a7261762076?}, {0xc029d17f98?, 0x434152545f425553?, 0x454352554f535245?})
/opt/hostedtoolcache/go/1.21.12/x64/src/reflect/value.go:380 +0xb9
github.com/filecoin-project/go-statemachine.(*StateMachine).run.func3()
/home/runner/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.3/machine.go:113 +0x269
created by github.com/filecoin-project/go-statemachine.(*StateMachine).run in goroutine 19110
/home/runner/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.3/machine.go:109 +0x656
It fails here: https://github.com/filecoin-project/lotus/blob/v1.28.2/storage/pipeline/checks.go#L240
User has a `CommitFailed` sector that won't go away:
2024-08-03 18:37:29 +0800 CST: [event;sealing.SectorForceState] {"User":{"State":"PreCommit2"}}
5489. 2024-08-03 18:37:29 +0800 CST: [event;sealing.SectorForceState] {"User":{"State":"PreCommit2"}}
5490. 2024-08-03 18:37:29 +0800 CST: [error;*xerrors.wrapError] state machine error: running planner for state Committing failed: planCommitting got event of unknown type sealing.SectorRetrySealPreCommit1, events: [{User:{}} {User:sector had nil commR or commD} {User:{State:PreCommit2}} {User:{State:PreCommit2}}]
5491. 2024-08-03 18:37:29 +0800 CST: [event;sealing.SectorRetrySealPreCommit1] {"User":{}}
5492. 2024-08-03 18:37:29 +0800 CST: [event;sealing.SectorCommitFailed] {"User":{}}
sector had nil commR or commD
5493. 2024-08-03 18:37:29 +0800 CST: [event;sealing.SectorForceState] {"User":{"State":"PreCommit2"}}
5494. 2024-08-03 18:37:29 +0800 CST: [event;sealing.SectorForceState] {"User":{"State":"PreCommit2"}}
5495. 2024-08-03 18:37:29 +0800 CST: [error;*xerrors.wrapError] state machine error: running planner for state Committing failed: planCommitting got event of unknown type sealing.SectorRetrySealPreCommit1, events: [{User:{}} {User:sector had nil commR or commD} {User:{State:PreCommit2}} {User:{State:PreCommit2}}]
5496. 2024-08-03 18:41:20 +0800 CST: [event;sealing.SectorCommitFailed] {"User":{}}
sector had nil commR or commD
5497. 2024-08-03 18:41:20 +0800 CST: [event;sealing.SectorRetryWaitSeed] {"User":{}}
5498. 2024-08-03 18:41:20 +0800 CST: [event;sealing.SectorSeedReady] {"User":{"SeedValue":"zsVNjgIhKPS1dWJybMxGVmVb/XM5nl5325FOAjZwB9s=","SeedEpoch":4145422}}
5499. 2024-08-03 18:41:20 +0800 CST: [event;sealing.SectorCommitFailed] {"User":{}}
sector had nil commR or commD
When this comes into the pipeline, it hits `handleCommitting` and fails here: https://github.com/filecoin-project/lotus/blob/v1.28.2/storage/pipeline/states_sealing.go#L583-L585
It then moves into `handleCommitFailed`, which tries to dereference the nil CommR: https://github.com/filecoin-project/lotus/blob/v1.28.2/storage/pipeline/states_failed.go#L333
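So the sector never gets resolved: each pass through `handleCommitFailed` panics on the nil CommR before it can pick a recovery state.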