filecoin-project/lotus

Miner bug: `CommitFailed` with nil CommR causes panic


From Slack: https://filecoinproject.slack.com/archives/CPFTWMY7N/p1723758975588969

2024-08-16T05:33:52.389+0800	ERROR	evtsm	go-statemachine@v1.0.3/machine.go:116	executing step: panic: runtime error: invalid memory address or nil pointer dereference
goroutine 35499 [running]:
github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).Plan.func1.1()
	/home/runner/work/lotus/lotus/lotus/storage/pipeline/fsm.go:48 +0x7b
panic({0x4b32940?, 0xa1df940?})
	/opt/hostedtoolcache/go/1.21.12/x64/src/runtime/panic.go:914 +0x21f
github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).checkCommit(_, {_, _}, {{0xc01e075940, 0xc}, 0x226c, 0xd, 0x66ad4e8e, {0xc03fea1220, 0x1, ...}, ...}, ...)
	/home/runner/work/lotus/lotus/lotus/storage/pipeline/checks.go:240 +0x43b
github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).handleCommitFailed(_, {{_, _}, _}, {{0xc01e075940, 0xc}, 0x226c, 0xd, 0x66ad4e8e, {0xc03fea1220, ...}, ...})
	/home/runner/work/lotus/lotus/lotus/storage/pipeline/states_failed.go:333 +0x44c
github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).Plan.func1({{_, _}, _}, {{0xc01e075940, 0xc}, 0x226c, 0xd, 0x66ad4e8e, {0xc03fea1220, 0x1, ...}, ...})
	/home/runner/work/lotus/lotus/lotus/storage/pipeline/fsm.go:63 +0xd6
reflect.Value.call({0x4a9fc80?, 0xc016ce1fc0?, 0x55eefc?}, {0x5013777, 0x4}, {0xc029d17f98, 0x2, 0x5559c5?})
	/opt/hostedtoolcache/go/1.21.12/x64/src/reflect/value.go:596 +0xce7
reflect.Value.Call({0x4a9fc80?, 0xc016ce1fc0?, 0x4c203a7261762076?}, {0xc029d17f98?, 0x434152545f425553?, 0x454352554f535245?})
	/opt/hostedtoolcache/go/1.21.12/x64/src/reflect/value.go:380 +0xb9
github.com/filecoin-project/go-statemachine.(*StateMachine).run.func3()
	/home/runner/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.3/machine.go:113 +0x269
created by github.com/filecoin-project/go-statemachine.(*StateMachine).run in goroutine 19110
	/home/runner/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.3/machine.go:109 +0x656

The panic is raised in `checkCommit`, here: https://github.com/filecoin-project/lotus/blob/v1.28.2/storage/pipeline/checks.go#L240

The user has a sector stuck in `CommitFailed` that won't go away:

 2024-08-03 18:37:29 +0800 CST:  [event;sealing.SectorForceState]        {"User":{"State":"PreCommit2"}}
5489.   2024-08-03 18:37:29 +0800 CST:  [event;sealing.SectorForceState]        {"User":{"State":"PreCommit2"}}
5490.   2024-08-03 18:37:29 +0800 CST:  [error;*xerrors.wrapError]      state machine error: running planner for state Committing failed: planCommitting got event of unknown type sealing.SectorRetrySealPreCommit1, events: [{User:{}} {User:sector had nil commR or commD} {User:{State:PreCommit2}} {User:{State:PreCommit2}}]
5491.   2024-08-03 18:37:29 +0800 CST:  [event;sealing.SectorRetrySealPreCommit1]       {"User":{}}
5492.   2024-08-03 18:37:29 +0800 CST:  [event;sealing.SectorCommitFailed]      {"User":{}}
        sector had nil commR or commD
5493.   2024-08-03 18:37:29 +0800 CST:  [event;sealing.SectorForceState]        {"User":{"State":"PreCommit2"}}
5494.   2024-08-03 18:37:29 +0800 CST:  [event;sealing.SectorForceState]        {"User":{"State":"PreCommit2"}}
5495.   2024-08-03 18:37:29 +0800 CST:  [error;*xerrors.wrapError]      state machine error: running planner for state Committing failed: planCommitting got event of unknown type sealing.SectorRetrySealPreCommit1, events: [{User:{}} {User:sector had nil commR or commD} {User:{State:PreCommit2}} {User:{State:PreCommit2}}]
5496.   2024-08-03 18:41:20 +0800 CST:  [event;sealing.SectorCommitFailed]      {"User":{}}
        sector had nil commR or commD
5497.   2024-08-03 18:41:20 +0800 CST:  [event;sealing.SectorRetryWaitSeed]     {"User":{}}
5498.   2024-08-03 18:41:20 +0800 CST:  [event;sealing.SectorSeedReady] {"User":{"SeedValue":"zsVNjgIhKPS1dWJybMxGVmVb/XM5nl5325FOAjZwB9s=","SeedEpoch":4145422}}
5499.   2024-08-03 18:41:20 +0800 CST:  [event;sealing.SectorCommitFailed]      {"User":{}}
        sector had nil commR or commD

When this sector comes into the pipeline, it hits `handleCommitting` and fails with `sector had nil commR or commD` here: https://github.com/filecoin-project/lotus/blob/v1.28.2/storage/pipeline/states_sealing.go#L583-L585

It then moves into `handleCommitFailed`, which calls `checkCommit`, which in turn tries to dereference the nil CommR: https://github.com/filecoin-project/lotus/blob/v1.28.2/storage/pipeline/states_failed.go#L333

Because the handler panics before it can pick a recovery path, the sector never gets resolved.