Poor performance of `Filecoin.StateGetBeaconEntry` for old beacons
Closed this issue · 11 comments
Checklist
- This is not a security-related bug/issue. If it is, please follow please follow the security policy.
- I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
- I am running the
Latest release
, the most recent RC(release canadiate) for the upcoming release or the dev branch(master), or have an issue updating to any of these. - I did not make any code changes to lotus.
Lotus component
- lotus daemon - chain sync
- lotus fvm/fevm - Lotus FVM and FEVM interactions
- lotus miner/worker - sealing
- lotus miner - proving(WindowPoSt/WinningPoSt)
- lotus JSON-RPC API
- lotus message management (mpool)
- Other
Lotus Version
❯ ./lotus version
Daemon: 1.29.2-rc1+calibnet+git.4004ca691+api1.5.0
Local: lotus version 1.29.2-rc1+calibnet+git.4004ca691
Repro Steps
curl --silent -X POST -H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","id":2,"method":"Filecoin.StateGetBeaconEntry","params": [ 1 ] }' \
"http://127.0.0.1:1234/rpc/v1"
First run takes well over a minute (91s on my machine)
Describe the Bug
It is a regression from old performance (pre #12414 and pre change that lead to creation of this issue), where it took significantly less time to get that beacon information (the results are identical).
Logging Information
Lotus v1.29.2-rc1
❯ time curl --silent -X POST -H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","id":2,"method":"Filecoin.StateGetBeaconEntry","params": [ 1 ] }' \
"http://127.0.0.1:1234/rpc/v1"
{"id":2,"jsonrpc":"2.0","result":{"Round":2396511,"Data":"okAr/DU2fJ9WFpyqunhK3kkI6Z6dG5uZqBk+kfcEYXSTYIbHLL/y5ka0ZRopqAdEGAfODlP4wUaYnWh0c1COMWBtkfFGXBWGc15oU8LTDs2oZB9Sr/DNkfTQLJD9BtcA"}}
________________________________________________________
Executed in 184.56 secs fish external
usr time 7.72 millis 0.00 micros 7.72 millis
sys time 4.42 millis 566.00 micros 3.86 millis
Lotus v1.28.0
in comparison:
❯ time curl --silent -X POST -H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","id":2,"method":"Filecoin.StateGetBeaconEntry","params": [ 1 ] }' \
"http://127.0.0.1:1234/rpc/v1"
{"id":2,"jsonrpc":"2.0","result":{"Round":2396511,"Data":"okAr/DU2fJ9WFpyqunhK3kkI6Z6dG5uZqBk+kfcEYXSTYIbHLL/y5ka0ZRopqAdEGAfODlP4wUaYnWh0c1COMWBtkfFGXBWGc15oU8LTDs2oZB9Sr/DNkfTQLJD9BtcA"}}
________________________________________________________
Executed in 277.03 millis fish external
usr time 6.21 millis 0.00 micros 6.21 millis
sys time 4.93 millis 771.00 micros 4.16 millis
OK, wow, that's worth a look, but I suppose it's to do with tipset lookup
I can't replicate this unfortunately @LesnyRumcajs, first run or otherwise:
$ lotus -version
lotus version 1.29.2-rc1+mainnet+git.4004ca691
$ uname -a
Linux fletcher 6.8.0-40-generic #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":[1],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
81086
Took: 0:00.09s
In fact, I can run it with any random epoch and it's all pretty quick:
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
2679701
Took: 0:00.04s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
2498234
Took: 0:00.04s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
1920535
Took: 0:00.05s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
3723317
Took: 0:00.02s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
3639175
Took: 0:00.02s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
2753507
Took: 0:00.04s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
851290
Took: 0:00.08s
I had to upgrade to rc1 to get this going, so it's fairly soon after running. And I'm running splitstore on this so only have a ~170G blockstore. But it looks like you're running calibnet so I imagine it's not huge, or is it a non-splitstore calibnet node perhaps? Or is this a particularly slow machine? What might be the difference here?
ahh, restart my node and call it as soon as it's live:
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":[1],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
81086
Took: 4:05.58s
I must have waited too long after restart the first time and something had the chance to prime the tipset cache.
Yeah, it's the issue after startup. For context, the setup in Forest's parity tests is:
- Wait for Lotus and Forest to sync (in parallel)
- Call methods and compare results.
Given that, Lotus started to timeout (over 120s on a regular GHA runner) for this method since v1.29.2-rc1
, which was fine in the past (excluding some minor versions in-between where the method was broken).
The conclusion from a brief investigation and internal chat about this is that it comes down to hydrating the tipset cache, which is only done on demand. So if your call to fetch tipset 1
is early enough then you're going to trigger that process and it involves a full walk-back from the known-good head and a caching of a skip-list in the process. Once that's done it's fairly speedy but of course you have the problem of wanting to do it pretty early.
I'm not sure what we can do about this in the short term, but given that mainnet is now 4.3M tipsets long, this is one of those growth problems that probably should be looked at soon. Fortunately though we don't have a lot of users reaching that far back to early tipsets after the start of their node. Your case is pretty unique.
A possible solution might be to lump this into the ChainIndexer work, where we're rolling up the 3 (optional) sqlite indexes of chain data into a single entity, which can be extended to index other parts of the chain, or state. Having to re-parse the entire chain each restart isn't ideal, but having a trusted index that could be used immediately and checked occasionally, might be a good approach here.
For now, could you cheat a bit on this one? If you're doing a whole bunch of compatibility checks, could you send a request for the epoch 1 beacon at the beginning of your checks and let it run in the background (ignoring the result but letting it process) while you do the rest of the checks and then do the beacon checks right at the end? So you get lotus starting on that tipset walk process in parallel to the other tests you're running?
I mean, it's fine, our current workaround is to just check a more recent epoch.
Could you please help me understand why the current solution takes so much time, whereas the previous one (v1.28.0) didn't? I'm curious, because a similar change might need to be implemented in Forest as well (or not), which does this, currently, relatively fast (code for reference)
I think you're doing a call out to drand here, right? https://github.com/ChainSafe/forest/blob/5fd1ec506ee7db4d83fa8aca14338429bf11938c/src/rpc/methods/state.rs#L1755
Previously Lotus did this, but the client configs were removed for mainnet, so we only have client configs for quicknet.
There's a couple of issues with this:
It doesn't scale, because it requires drand to be maintaining those networks in perpetuity, which isn't the case. Prior to Filecoin mainnet launch (proper), between epochs 0 and 51000
, we used the "incentinet" drand network, which has apparently been down for years now, so requesting those beacons isn't going to work anyway (and the incentinet client configs were removed long ago from Lotus anyway, so asking for those epochs would have returned an error like the one you experienced that raised this issue in the first place!). You don't notice this in your testing because CalibrationNet is newer than incentinet so only uses drand mainnet and now quicknet. Eventually drand mainnet is going to go away too, so those beacons will stop responding too.
The other problem is that we have a mismatch on some epochs between the drand epoch and the filecoin epoch. It only got super strict by NV15, and during NV14 it was off-by-one. So we've had to decide what StateGetBeaconEntry
is supposed to do - get the precise drand beacon entry for that filecoin epoch? Or get that drand beacon entry that would have been used for that epoch because it got baked into the blocks? We've chosen to go with that approach, because it also matches what would be used if you were to use get_beacon_randomness
from a smart contract. The exact historical behaviour was summarised in filecoin-project/FIPs#1054 after I ended up having to work through it all.
So, if you want to match Lotus precisely on this API you're going to have to fish around in tipsets to find beacon entries using that same algorithm for historic look-ups. Then if you want to be even more comprehensive in testing compatibility you could look at null and non-null rounds within each of the main periods of beacon history: pre-nv13, nv13, nv14, post-nv14 and then post-quicknet which was 120 epochs after nv23. Probably unnecessary, however, given FIP-0095, it's now possible for smart contracts to reach back in history for these beacons, so it'll end up being a consensus issue if Forest is providing different entries to Lotus and someone writes a contract that goes back into these periods. (Also depends on how you're exposing the beacon lookups to that syscall, maybe that's different to how you're serving StateGetBeaconEntry
?).
I'd be interested to know if you come up with an efficient and dependable way to establish the epoch->tipset mapping, both on startup and continuously.