filecoin-project/boost

[Support Ticket]: Boost stops accepting deals, performing retrievals, deals are stuck in 'Announcing' state, 'Piece Doctor' page does not load.

Shekelme opened this issue · 21 comments

Boost component

  • boost daemon - storage providers
  • boost client
  • boost UI
  • boost data-transfer
  • boost index-provider
  • booster-http
  • booster-bitswap
  • LID Database - Yugabyte/LevelDB
  • boostd-data
  • Other

Boost Version

boost version 2.1.1+mainnet+git.e9d18ac
boostd version 2.1.1+mainnet+git.e9d18ac
boostd-data version 1.5.1+mainnet+git.e9d18ac
booster-bitswap version 2.1.1+mainnet+git.e9d18ac

Describe the problem

After some period of more or less successful operation (it can be several hours, or a day or two), something goes wrong: boostd stops accepting deals and performing retrievals, deals get stuck in the 'Announcing' state, and the 'Piece Doctor' page does not load.

Yugabyte setup details

YugabyteDB v2.18.2.1 (replication factor 3, 3 nodes)
YugabyteDB nodes are running on Lenovo P620 workstations. HW specs of 3 hosts:

| Host | OS | YB NVMe | FS | CPU | RAM | Role |
| --- | --- | --- | --- | --- | --- | --- |
| len4 | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3955WX 16-Cores | 512GB DDR4-3200 ECC | PR2/C2 worker |
| len91 | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 256GB DDR4-3200 ECC | lotus-miner |
| len101 | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3995WX 64-Cores | 1024GB DDR4-3200 ECC | RU/PC2 worker |
In addition to the `lotus` software (on all servers) and `boost` (on len91), `subspace` nodes are also installed on the two workers, but separate drives are allocated for Yugabyte and CPU resources are not even half occupied. Sealing is carried out intermittently, from time to time.

How does your boostd-data talk to Yugabyte?

I use a boostd-data systemd service:

[Unit]
Description=boostd-data
After=network-online.target
After=yb-tserver.service
Requires=network-online.target
Wants=yb-tserver.service

[Service]
ExecStart=/usr/local/bin/boostd-data run yugabyte --hosts 192.168.11.91,192.168.11.14,192.168.11.101 --connect-string="postgresql://yugabyte:yugabyte@192.168.11.91:5433?sslmode=disable" --addr 0.0.0.0:8044
Restart=always
RestartSec=5
TimeoutStartSec=20
TimeoutStopSec=20
User=adm
Group=adm
Type=simple


[Install]
WantedBy=multi-user.target

I've also tried another option, `?sslmode=disable&load_balance=true`, with the same result.
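
For reference, the alternative invocation looked like this, with only the connection string's query parameters changed (same hosts and credentials as above):

```shell
# Same boostd-data invocation, with YugabyteDB load balancing enabled via the
# connection string; only the query parameters of --connect-string differ.
/usr/local/bin/boostd-data run yugabyte \
  --hosts 192.168.11.91,192.168.11.14,192.168.11.101 \
  --connect-string="postgresql://yugabyte:yugabyte@192.168.11.91:5433?sslmode=disable&load_balance=true" \
  --addr 0.0.0.0:8044
```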

Boostd logs and boostd-data logs

boostd-data.log
The boostd-data log is attached. The Boost upgrade to v2.1.1 was carried out on 11.12.2023.
Boostd log: please download it from here: https://disk.yandex.ru/d/ugDMZaikacDfVQ

Recent error, warning and INFO logs from Yugabyte

Attached.
yb-master-tserver-logs.zip

Logging Information

Logs have been attached as files.

Repro Steps

  1. Install lotus-miner v1.25.1 + boost v2.1.1 + YugabyteDB v2.18.2.1 (replication factor 3, 3 nodes)
  2. Start all processes
  3. Wait for the described symptoms to appear.

Boost goroutine dump attached:
boost-15-12-2023.zip

There are at least three miners affected by this problem: two big ones (YugabyteDB; though on the second miner there are no errors or timeouts in the current boostd-data log), where I can open the "Local Index Directory" page but not "Piece Doctor", and a third, small miner (LevelDB) where I can't even open the "Local Index Directory", with no errors or timeouts in the current boostd-data log.

Do deals get "unstuck" from Announcing after boostd restart?

I don't remember trying that; however, when restarting the boostd-data service, the Local Index Directory page does not start displaying again until you restart the entire set of Boost software.
I just checked this now.
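
To be concrete, restarting "the entire set of Boost software" here means restarting boostd on top of the boostd-data index store; a minimal sketch, assuming both run as systemd units (the unit names are assumptions, substitute whatever your units are actually called):

```shell
# Restarting only the index store is not enough for the Local Index Directory
# page to come back; boostd has to be restarted as well.
# Unit names are assumptions; adjust them to your own setup.
sudo systemctl restart boostd-data.service
sudo systemctl restart boostd.service
```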


Do deals get unstuck from the announcing state then?

There is no way to verify this yet. Let's wait for some deals...

What I mean is, as per your description:

> deals are stuck in 'Announcing' state, 'Piece Doctor' page does not load

I wonder if those get unstuck after restart?

Yes, after stopping and starting the Boost processes (we do not touch the YugabyteDB services), the deals get unstuck and become 'Proving' after a while.

After yesterday's restart, the problem recurred. No new deals have been received yet. I would be glad for advice on which logs to look at.
For which log subsystems should I set 'set-level' to a more detailed level?

If no new deals came in - what did get stuck in announcing then?

@Shekelme I suggest looking into the debug logs for piecedirectory, specifically the log lines that start with `add index: ...` and `podsi: ...`. Both have a breakdown of time / concurrency settings that should help us understand the problem further.
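
As a quick way to pull those lines out once debug logging is enabled, here is a minimal sketch, assuming boostd's output is being captured to a plain log file (the `boostd.log` path is just a placeholder):

```shell
# Extract the piecedirectory timing lines mentioned above.
# boostd.log is a placeholder; point it at wherever your boostd output is written.
grep -E 'add index:|podsi:' boostd.log | tail -n 100
```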

> If no new deals came in - what did get stuck in announcing then?

As you may have noticed, the title of this issue includes more than just this problem. So, with the exception of deals stuck in this status (because there were simply no new deals), the rest of the symptoms are present at the moment.
[screenshots attached]
If possible, someone could send a deal to f0187709 while it is exposed to this problem, and we will see what happens.

@Shekelme I want to disable the piece doctor on your system and check how it behaves after that. Can you please do the following?

  1. `cd` into your boost checkout and confirm that you are running v2.1.1.
  2. Create a new test branch: `git checkout -b test/pdoff`.
  3. Edit the file `node/builder.go`, search for the line `Override(StartPieceDoctorKey, modules.NewPieceDoctor),`, and comment it out. Then try to rebuild the binary (a quick check is sketched after this list).
  4. If the binary builds without error, deploy it and check the LID pages and other things.
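
As a sanity check that the edit in step 3 took effect before rebuilding, a minimal sketch (run from the root of the boost checkout):

```shell
# Confirm the Piece Doctor override is now commented out.
# Among the matches, the Override(...) call should now be prefixed with "//":
grep -n 'StartPieceDoctorKey' node/builder.go
# expected to include a line like (line number will differ):
#   NNN:    // Override(StartPieceDoctorKey, modules.NewPieceDoctor),
```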

Done.

admfc@lenovo91:~/boost$ git status
HEAD detached at v2.1.1
nothing to commit, working tree clean
-----compilation-----
boostd version 2.1.1+mainnet+git.e9d18ac.dirty

[screenshot attached]

@Shekelme By the screenshot, I am assuming it worked. If that is not the case, please let me know. If it is working as you expected, then please observe deals, indexing, retrievals, etc. for the next week and let us know.

It worked after restarting the processes, but this morning the problem occurred again.

> @Shekelme I suggest looking into the debug logs for piecedirectory, specifically the log lines that start with `add index: ...` and `podsi: ...`. Both have a breakdown of time / concurrency settings that should help us understand the problem further.

BTW, nothing happens when I set `boostd log set-level piecedirectory=debug`.

Now I have set `*=debug`. I am attaching the log entries for the last 10 minutes, as well as a goroutine dump.
boost-21-12-2023.dump.zip
boost_go.zip

Also, I don't know if this is related or not, but today I found this entry in the log:

2023/12/21 09:10:14 graphql: panic occurred: runtime error: invalid memory address or nil pointer dereference
goroutine 179363120 [running]:
github.com/graph-gophers/graphql-go/log.(*DefaultLogger).LogPanic(0xc01b8ab388?, {0x74fe9e8?, 0xc03af20d80}, {0x4f695a0, 0xae63a80})
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/log/log.go:21 +0x65
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection.func2.1()
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:187 +0x89
panic({0x4f695a0, 0xae63a80})
	/snap/go/current/src/runtime/panic.go:884 +0x213
github.com/filecoin-project/boost/gql.(*resolver).IpniDistanceFromLatestAd(0xc00d09e240, {0x74fe9e8, 0xc03af20e10}, {{0xc039b6a280?, 0x4?}, {0xc039b6a2c0?, 0x2?}})
	/home/admfc/boost/gql/resolver_ipni.go:221 +0x1ba
reflect.Value.call({0x5439aa0?, 0xc00d09e240?, 0x81f0d6?}, {0x54796d5, 0x4}, {0xc03af20e40, 0x2, 0x87b30e?})
	/snap/go/current/src/reflect/value.go:586 +0xb07
reflect.Value.Call({0x5439aa0?, 0xc00d09e240?, 0xc012267706?}, {0xc03af20e40?, 0x545b320?, 0x0?})
	/snap/go/current/src/reflect/value.go:370 +0xbc
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection.func2(0x0?, {0x74fe9e8?, 0xc03af20d80?}, 0xc015d072d8, 0xc01146ccc0, 0xc01b8abec0, {0x74fe9e8?, 0xc03af20e10})
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:211 +0x4c6
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection({0x74fe9e8, 0xc03af20d80}, 0xc031e18400, 0x90?, 0xc01146ccc0, 0x1?, 0x1)
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:231 +0x1ce
github.com/graph-gophers/graphql-go/internal/exec.(*Request).execSelections.func1(0xc01146ccc0)
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:81 +0x1b8
created by github.com/graph-gophers/graphql-go/internal/exec.(*Request).execSelections
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:77 +0x1d0

context: context.Background.WithValue(type *http.contextKey, val <not Stringer>).WithValue(type *http.contextKey, val 192.168.11.91:8081).WithCancel.WithCancel.WithValue(type opentracing.contextKey, val <not Stringer>)

These two patches might help speed things up:
#1855
#1853

You can only speed up something that at least somehow works, which is not what I observe on my miners. :(
I will try downgrading Node.js to 16.x.

No, it looks like the problem is neither with the Node.js version nor with YugabyteDB, because I tried downgrading Node.js on the miner with LevelDB, and after about 1.5 days the problem appeared there again.
I need your ideas, gentlemen.