filecoin-project/boost

[Support Ticket]: Boost stops accepting deals, performing retrievals, deals are stuck in 'Announcing' state, 'Piece Doctor' page does not load.

Shekelme opened this issue · 21 comments

Boost component

  • boost daemon - storage providers
  • boost client
  • boost UI
  • boost data-transfer
  • boost index-provider
  • booster-http
  • booster-bitswap
  • LID Database - Yugabyte/LevelDB
  • boostd-data
  • Other

Boost Version

boost version 2.1.1+mainnet+git.e9d18ac
boostd version 2.1.1+mainnet+git.e9d18ac
boostd-data version 1.5.1+mainnet+git.e9d18ac
booster-bitswap version 2.1.1+mainnet+git.e9d18ac

Describe the problem

After some period of more or less successful operation (it can be several hours, or a day or two), something goes wrong: boostd stops accepting deals and performing retrievals, deals get stuck in the 'Announcing' state, and the 'Piece Doctor' page does not load.

Yugabyte setup details

YugabyteDB v2.18.2.1 (replication factor 3, 3 nodes)
YugabyteDB nodes are running on Lenovo P620 workstations. HW specs of 3 hosts:

| Host | OS | YB NVMe | FS | CPU | RAM | Role |
| --- | --- | --- | --- | --- | --- | --- |
| len4 | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3955WX 16-Cores | 512GB DDR4-3200 ECC | PR2/C2 worker |
| len91 | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 256GB DDR4-3200 ECC | lotus-miner |
| len101 | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3995WX 64-Cores | 1024GB DDR4-3200 ECC | RU/PC2 worker |
In addition to the `lotus` software (on all servers) and `boost` (on len91), `subspace` nodes are also installed on the two workers, but separate drives are allocated for Yugabyte and CPU resources are not even half occupied. Sealing is carried out intermittently, from time to time.

How does your boostd-data talk to Yugabyte?

I use a boostd-data systemd service:

[Unit]
Description=boostd-data
After=network-online.target
After=yb-tserver.service
Requires=network-online.target
Wants=yb-tserver.service

[Service]
ExecStart=/usr/local/bin/boostd-data run yugabyte --hosts 192.168.11.91,192.168.11.14,192.168.11.101 --connect-string="postgresql://yugabyte:yugabyte@192.168.11.91:5433?sslmode=disable" --addr 0.0.0.0:8044
Restart=always
RestartSec=5
TimeoutStartSec=20
TimeoutStopSec=20
User=adm
Group=adm
Type=simple


[Install]
WantedBy=multi-user.target

I've also tried another option, `?sslmode=disable&load_balance=true`, with the same result.
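
For reference, the alternative invocation looked like this, with only the connection string's query parameters changed (same hosts and credentials as above):

```shell
# Same boostd-data invocation, with YugabyteDB load balancing enabled via the
# connection string; only the query parameters of --connect-string differ.
/usr/local/bin/boostd-data run yugabyte \
  --hosts 192.168.11.91,192.168.11.14,192.168.11.101 \
  --connect-string="postgresql://yugabyte:yugabyte@192.168.11.91:5433?sslmode=disable&load_balance=true" \
  --addr 0.0.0.0:8044
```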

Boostd logs and boostd-data logs

boostd-data.log
The boostd-data log is attached. The Boost upgrade to v2.1.1 was carried out on 11.12.2023.
Boostd log: please download it from here: https://disk.yandex.ru/d/ugDMZaikacDfVQ

Recent error, warning and INFO logs from Yugabyte

Attached.
yb-master-tserver-logs.zip

Logging Information

Logs have been attached as files.

Repro Steps

  1. Install lotus-miner v1.25.1 + boost v2.1.1 + YugabyteDB v2.18.2.1 (replication factor 3, 3 nodes)
  2. Start all processes
  3. Wait for the described symptoms to appear.

Boost goroutine dump attached:
boost-15-12-2023.zip

There are at least three miners affected by this problem: two big ones (YugabyteDB; though on the second miner there are no errors or timeouts in the current boostd-data log), where I can open the "Local Index Directory" page but not "Piece Doctor", and a third, small miner (LevelDB) where I can't even open the "Local Index Directory", with no errors or timeouts in the current boostd-data log.

Do deals get "unstuck" from Announcing after boostd restart?

I don't remember trying that; however, when restarting the boostd-data service, the Local Index Directory page does not start displaying again until you restart the entire set of Boost software.
I just checked this now.
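
To be concrete, restarting "the entire set of Boost software" here means restarting boostd on top of the boostd-data index store; a minimal sketch, assuming both run as systemd units (the unit names are assumptions, substitute whatever your units are actually called):

```shell
# Restarting only the index store is not enough for the Local Index Directory
# page to come back; boostd has to be restarted as well.
# Unit names are assumptions; adjust them to your own setup.
sudo systemctl restart boostd-data.service
sudo systemctl restart boostd.service
```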


Do deals get unstuck from the announcing state then?

There is no way to verify this yet. Let's wait for some deals...

What I mean is, as per your description:

> deals are stuck in 'Announcing' state, 'Piece Doctor' page does not load

I wonder if those get unstuck after restart?

Yes, after stopping and starting the Boost processes (we do not touch the YugabyteDB services), the deals get unstuck and become 'Proving' after a while.

After yesterday's restart, the problem recurred. No new deals have been received yet. I would be glad for advice on which logs to look at.
For which log subsystems should I set 'set-level' to a more detailed level?

If no new deals came in - what did get stuck in announcing then?

@Shekelme I suggest looking into the debug logs for piecedirectory, specifically the log lines that start with `add index: ...` and `podsi: ...`. Both have a breakdown of time / concurrency settings that should help us understand the problem further.
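
As a quick way to pull those lines out once debug logging is enabled, here is a minimal sketch, assuming boostd's output is being captured to a plain log file (the `boostd.log` path is just a placeholder):

```shell
# Extract the piecedirectory timing lines mentioned above.
# boostd.log is a placeholder; point it at wherever your boostd output is written.
grep -E 'add index:|podsi:' boostd.log | tail -n 100
```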

> If no new deals came in - what did get stuck in announcing then?

As you may have noticed, the title of this issue includes more than just this problem. So, with the exception of deals stuck in this status (because there were simply no new deals), the rest of the symptoms are present at the moment.
[screenshots attached]
If possible, someone could send a deal to f0187709 while it is exposed to this problem, and we will see what happens.

@Shekelme I want to disable the piece doctor on your system and check how it behaves after that. Can you please do the following?

  1. `cd` into your boost checkout and confirm that you are running v2.1.1.
  2. Create a new test branch: `git checkout -b test/pdoff`.
  3. Edit the file `node/builder.go`, search for the line `Override(StartPieceDoctorKey, modules.NewPieceDoctor),`, and comment it out. Then try to rebuild the binary (a quick check is sketched after this list).
  4. If the binary builds without error, deploy it and check the LID pages and other things.
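
As a sanity check that the edit in step 3 took effect before rebuilding, a minimal sketch (run from the root of the boost checkout):

```shell
# Confirm the Piece Doctor override is now commented out.
# Among the matches, the Override(...) call should now be prefixed with "//":
grep -n 'StartPieceDoctorKey' node/builder.go
# expected to include a line like (line number will differ):
#   NNN:    // Override(StartPieceDoctorKey, modules.NewPieceDoctor),
```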

Done.

admfc@lenovo91:~/boost$ git status
HEAD detached at v2.1.1
nothing to commit, working tree clean
-----compilation-----
boostd version 2.1.1+mainnet+git.e9d18ac.dirty

[screenshot attached]

@Shekelme By the screenshot, I am assuming it worked. If that is not the case, please let me know. If it is working as you expected, then please observe deals, indexing, retrievals, etc. for the next week and let us know.

It worked after restarting the processes, but this morning the problem occurred again.

> @Shekelme I suggest looking into the debug logs for piecedirectory, specifically the log lines that start with `add index: ...` and `podsi: ...`. Both have a breakdown of time / concurrency settings that should help us understand the problem further.

BTW, nothing happens when I set `boostd log set-level piecedirectory=debug`.

Now I have set `*=debug`. I am attaching the log entries for the last 10 minutes, as well as a goroutine dump.
boost-21-12-2023.dump.zip
boost_go.zip

Also, I don't know if this is related or not, but today I found this entry in the log:

2023/12/21 09:10:14 graphql: panic occurred: runtime error: invalid memory address or nil pointer dereference
goroutine 179363120 [running]:
github.com/graph-gophers/graphql-go/log.(*DefaultLogger).LogPanic(0xc01b8ab388?, {0x74fe9e8?, 0xc03af20d80}, {0x4f695a0, 0xae63a80})
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/log/log.go:21 +0x65
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection.func2.1()
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:187 +0x89
panic({0x4f695a0, 0xae63a80})
	/snap/go/current/src/runtime/panic.go:884 +0x213
github.com/filecoin-project/boost/gql.(*resolver).IpniDistanceFromLatestAd(0xc00d09e240, {0x74fe9e8, 0xc03af20e10}, {{0xc039b6a280?, 0x4?}, {0xc039b6a2c0?, 0x2?}})
	/home/admfc/boost/gql/resolver_ipni.go:221 +0x1ba
reflect.Value.call({0x5439aa0?, 0xc00d09e240?, 0x81f0d6?}, {0x54796d5, 0x4}, {0xc03af20e40, 0x2, 0x87b30e?})
	/snap/go/current/src/reflect/value.go:586 +0xb07
reflect.Value.Call({0x5439aa0?, 0xc00d09e240?, 0xc012267706?}, {0xc03af20e40?, 0x545b320?, 0x0?})
	/snap/go/current/src/reflect/value.go:370 +0xbc
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection.func2(0x0?, {0x74fe9e8?, 0xc03af20d80?}, 0xc015d072d8, 0xc01146ccc0, 0xc01b8abec0, {0x74fe9e8?, 0xc03af20e10})
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:211 +0x4c6
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection({0x74fe9e8, 0xc03af20d80}, 0xc031e18400, 0x90?, 0xc01146ccc0, 0x1?, 0x1)
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:231 +0x1ce
github.com/graph-gophers/graphql-go/internal/exec.(*Request).execSelections.func1(0xc01146ccc0)
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:81 +0x1b8
created by github.com/graph-gophers/graphql-go/internal/exec.(*Request).execSelections
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:77 +0x1d0

context: context.Background.WithValue(type *http.contextKey, val <not Stringer>).WithValue(type *http.contextKey, val 192.168.11.91:8081).WithCancel.WithCancel.WithValue(type opentracing.contextKey, val <not Stringer>)

These two patches might help speed things up:
#1855
#1853

You can only speed up something that at least somehow works, which is not what I observe on my miners. :(
I will try downgrading Node.js to 16.x.

No, it looks like the problem is neither with the Node.js version nor with YugabyteDB, because I tried downgrading Node.js on the miner with LevelDB, and after about 1.5 days the problem appeared there again.
I need your ideas, gentlemen.