[Support Ticket]: Boost stops accepting deals, performing retrievals, deals are stuck in 'Announcing' state, 'Piece Doctor' page does not load.
Shekelme opened this issue · 21 comments
Boost component
- boost daemon - storage providers
- boost client
- boost UI
- boost data-transfer
- boost index-provider
- booster-http
- booster-bitswap
- LID Database - Yugabyte/LevelDB
- boostd-data
- Other
Boost Version
boost version 2.1.1+mainnet+git.e9d18ac
boostd version 2.1.1+mainnet+git.e9d18ac
boostd-data version 1.5.1+mainnet+git.e9d18ac
booster-bitswap version 2.1.1+mainnet+git.e9d18ac
Describe the problem
After running more or less normally for a while (anywhere from a few hours to a day or two), something goes wrong: boostd stops accepting deals and serving retrievals, deals get stuck in the 'Announcing' state, and the 'Piece Doctor' page does not load.
Yugabyte setup details
YugabyteDB v2.18.2.1 (replication factor 3, 3 nodes)
YugabyteDB nodes are running on Lenovo P620 workstations. HW specs of 3 hosts:
Host | OS | YB NVME | FS | CPU | RAM | Role |
---|---|---|---|---|---|---|
len4 | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3955WX 16-Cores | 512GB DDR4-3200 ECC | PR2/C2 worker |
len91 | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 256GB DDR4-3200 ECC | lotus-miner |
len101 | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3995WX 64-Cores | 1024GB DDR4-3200 ECC | RU/PC2 worker |
How does your boostd-data talk to Yugabyte?
I run boostd-data as a systemd service:
[Unit]
Description=boostd-data
After=network-online.target
After=yb-tserver.service
Requires=network-online.target
Wants=yb-tserver.service
[Service]
ExecStart=/usr/local/bin/boostd-data run yugabyte --hosts 192.168.11.91,192.168.11.14,192.168.11.101 --connect-string="postgresql://yugabyte:yugabyte@192.168.11.91:5433?sslmode=disable" --addr 0.0.0.0:8044
Restart=always
RestartSec=5
TimeoutStartSec=20
TimeoutStopSec=20
User=adm
Group=adm
Type=simple
[Install]
WantedBy=multi-user.target
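For completeness, here is how such a unit is typically activated, assuming it is installed as /etc/systemd/system/boostd-data.service (the path is an assumption; adjust to your layout):
```sh
# Assumption: the unit above is saved as /etc/systemd/system/boostd-data.service
sudo systemctl daemon-reload
sudo systemctl enable --now boostd-data.service
journalctl -u boostd-data -f    # follow the boostd-data logs
```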
I've also tried another option, ?sslmode=disable&load_balance=true, with the same result.
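For reference, with the load-balancing option the ExecStart line looks like this (same hosts and credentials as above; only the query string differs):
```
ExecStart=/usr/local/bin/boostd-data run yugabyte --hosts 192.168.11.91,192.168.11.14,192.168.11.101 --connect-string="postgresql://yugabyte:yugabyte@192.168.11.91:5433?sslmode=disable&load_balance=true" --addr 0.0.0.0:8044
```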
Boostd logs and boostd-data logs
boostd-data.log
The boostd-data log is attached. The upgrade to Boost v2.1.1 was carried out on 11.12.2023.
Boostd log - please download from here: https://disk.yandex.ru/d/ugDMZaikacDfVQ
Recent error, warning and INFO logs from Yugabyte
Attached.
yb-master-tserver-logs.zip
Logging Information
Logs have been attached as files
Repo Steps
- Install lotus-miner v1.25.1 + boost v2.2.1 + YugabyteDB v2.18.2.1 (replication factor 3, 3 nodes)
- Start all processes
- Wait for the described symptoms to appear.
...
Boost goroutines attached
boost-15-12-2023.zip
At least three miners are affected by this problem: two big ones (YugabyteDB; on the second miner there are no errors or timeouts in the current boostd-data log) where I can open the "Local Index Directory" page but not "Piece Doctor", and a third small miner (LevelDB) where I can't even open the "Local Index Directory" and there are no errors or timeouts in the current boostd-data log.
Do deals get "unstuck" from Announcing after boostd
restart?
I don't remember trying to do this; however, when restarting the boostd-data service, the Local Index Directory page does not start displaying until you restart the entire set of Boost software. I checked that just now.
> I don't remember trying to do this; however, when restarting the boostd-data service, the Local Index Directory page does not start displaying until you restart the entire set of Boost software. I checked that just now.
Do deals get unstuck from the announcing state then?
There is no way to verify this yet. Let's wait for some deals...
What I mean is, as per your description:
> deals are stuck in 'Announcing' state, 'Piece Doctor' page does not load
I wonder if those get unstuck after restart?
Yes, after stopping and starting the Boost processes (we do not touch the YugabyteDB services), the deals get unstuck and become 'Proving' after a while.
After yesterday's restart, the problem recurred. No new deals have been received yet. I would appreciate advice on which logs to look at.
For which log subsystems should I use 'set-level' to get more detailed output?
If no new deals came in, what did get stuck in Announcing then?
@Shekelme I suggest looking into the debug logs for piecedirectory. The log lines that start with `add index: ...` and `podsi: ...` will both have a breakdown of time / concurrency settings that should help us understand the problem further.
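To surface those lines, the piecedirectory subsystem can be switched to debug logging, for example:
```sh
boostd log set-level piecedirectory=debug
```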
> If no new deals came in, what did get stuck in Announcing then?
As you may have noticed, the title of this issue includes more than just this problem. So, with the exception of deals stuck in this status (because there were simply no new deals), the rest of the symptoms are present at the moment.
If possible, someone could send a deal to f0187709 while it is exposed to this problem, and we'll see what happens.
@Shekelme I want to disable the piece doctor on your system and check how it behaves after that. Can you please do the following (see the sketch after this list)?
- `cd boost` and confirm that you are running v2.1.1
- Create a new test branch: `git checkout -b test/pdoff`
- Edit the file `node/builder.go`, search for the line `Override(StartPieceDoctorKey, modules.NewPieceDoctor),` and comment it out. Then try to rebuild the binary.
- If the binary builds without error, deploy this binary and check the LID pages and other things.
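A minimal sketch of that procedure, assuming the usual Makefile-based rebuild (the `make build` target is an assumption; use whatever you normally build boostd with):
```sh
cd boost
git describe --tags             # confirm you are on v2.1.1
git checkout -b test/pdoff
# In node/builder.go, comment out the line:
#   Override(StartPieceDoctorKey, modules.NewPieceDoctor),
make build                      # assumption: rebuilds the boostd binary without the piece doctor
```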
@Shekelme By the screenshot, I am assuming it worked. If that is not the case, please let me know. If it is working as you expected, then please observe deals, indexing, retrievals, etc. for the next week and let us know.
It worked after restarting the processes, but this morning the problem occurred again.
> @Shekelme I suggest looking into the debug logs for piecedirectory. The log lines that start with `add index: ...` and `podsi: ...` will both have a breakdown of time / concurrency settings that should help us understand the problem further.
BTW, nothing happens when I set `boostd log set-level piecedirectory=debug`.
Now I have set the level to `*=debug`. I am attaching the log entries for the last 10 minutes, as well as a goroutines dump.
boost-21-12-2023.dump.zip
boost_go.zip
Also, I don't know if this is related or not, but today I found this entry in the log:
2023/12/21 09:10:14 graphql: panic occurred: runtime error: invalid memory address or nil pointer dereference
goroutine 179363120 [running]:
github.com/graph-gophers/graphql-go/log.(*DefaultLogger).LogPanic(0xc01b8ab388?, {0x74fe9e8?, 0xc03af20d80}, {0x4f695a0, 0xae63a80})
/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/log/log.go:21 +0x65
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection.func2.1()
/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:187 +0x89
panic({0x4f695a0, 0xae63a80})
/snap/go/current/src/runtime/panic.go:884 +0x213
github.com/filecoin-project/boost/gql.(*resolver).IpniDistanceFromLatestAd(0xc00d09e240, {0x74fe9e8, 0xc03af20e10}, {{0xc039b6a280?, 0x4?}, {0xc039b6a2c0?, 0x2?}})
/home/admfc/boost/gql/resolver_ipni.go:221 +0x1ba
reflect.Value.call({0x5439aa0?, 0xc00d09e240?, 0x81f0d6?}, {0x54796d5, 0x4}, {0xc03af20e40, 0x2, 0x87b30e?})
/snap/go/current/src/reflect/value.go:586 +0xb07
reflect.Value.Call({0x5439aa0?, 0xc00d09e240?, 0xc012267706?}, {0xc03af20e40?, 0x545b320?, 0x0?})
/snap/go/current/src/reflect/value.go:370 +0xbc
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection.func2(0x0?, {0x74fe9e8?, 0xc03af20d80?}, 0xc015d072d8, 0xc01146ccc0, 0xc01b8abec0, {0x74fe9e8?, 0xc03af20e10})
/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:211 +0x4c6
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection({0x74fe9e8, 0xc03af20d80}, 0xc031e18400, 0x90?, 0xc01146ccc0, 0x1?, 0x1)
/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:231 +0x1ce
github.com/graph-gophers/graphql-go/internal/exec.(*Request).execSelections.func1(0xc01146ccc0)
/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:81 +0x1b8
created by github.com/graph-gophers/graphql-go/internal/exec.(*Request).execSelections
/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:77 +0x1d0
context: context.Background.WithValue(type *http.contextKey, val <not Stringer>).WithValue(type *http.contextKey, val 192.168.11.91:8081).WithCancel.WithCancel.WithValue(type opentracing.contextKey, val <not Stringer>)
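For context, the panic is a nil pointer dereference inside the IpniDistanceFromLatestAd GraphQL resolver. A minimal, hypothetical Go sketch of that failure mode and the kind of guard that avoids it (none of these names are Boost's actual types; it only illustrates the class of bug):
```go
package main

import (
	"errors"
	"fmt"
)

// ad is a hypothetical stand-in for an IPNI advertisement record.
type ad struct {
	Height int64
}

// distanceFromLatest would panic if it blindly dereferenced a nil *ad, which
// is the class of bug the stack trace above points at; guarding the nil case
// turns the panic into an error the GraphQL layer can report instead.
func distanceFromLatest(latest, current *ad) (int64, error) {
	if latest == nil || current == nil {
		return 0, errors.New("no advertisement available yet")
	}
	return latest.Height - current.Height, nil
}

func main() {
	d, err := distanceFromLatest(nil, &ad{Height: 3})
	fmt.Println(d, err) // prints: 0 no advertisement available yet
}
```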
You can only speed up something that at least works somehow, and that is not what I observe on my miners. ((
I will try to downgrade Node.js to 16.x
No, it looks like the problem is neither with the Node.js version nor in YugabyteDB, because I tried downgrading Node.js on the miner with LevelDB, and after about 1.5 days the problem appeared there again.
I need your ideas, gentlemen.