Lightwalletd can return incorrect consensus branch id of 00000000
ccjernigan opened this issue · 4 comments
What is the bug?
lightwalletd.testnet.electriccoin.co went down today. When it came back up, attempts to call GetLightdInfo on the server returned incorrect results.
grpcurl lightwalletd.testnet.electriccoin.co:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLightdInfo
{
  "version": "v0.4.13",
  "vendor": "ECC LightWalletD",
  "taddrSupport": true,
  "chainName": "test",
  "saplingActivationHeight": "280000",
  "consensusBranchId": "00000000",
  "gitCommit": "2d3943b8e995a3b2c5648ec9859dccc67c535386",
  "buildDate": "2022-07-22",
  "buildUser": "root",
  "estimatedHeight": "1984160",
  "zcashdBuild": "v5.3.1",
  "zcashdSubversion": "/MagicBean:5.3.1/"
}
One theory is that lightwalletd is having trouble syncing with zcashd, leaving the server in a bad state.
Additional context
This incorrect value causes the mobile wallet clients to fail syncing and the clients do not retry.
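As a rough sketch (in Go for concreteness, since lightwalletd itself is written in Go), a client could treat the all-zero branch ID as a transient condition and retry with backoff instead of failing outright. The fetchLightdInfo type and the retry policy below are illustrative assumptions, not the actual mobile SDK API; only the check against "00000000" comes from this report.

    package client

    import (
        "context"
        "fmt"
        "time"
    )

    // LightdInfo mirrors only the field this guard cares about; a real client
    // would use the generated walletrpc.LightdInfo message instead.
    type LightdInfo struct {
        ConsensusBranchId string
    }

    // fetchLightdInfo stands in for the CompactTxStreamer/GetLightdInfo call.
    type fetchLightdInfo func(ctx context.Context) (*LightdInfo, error)

    // waitForUsableBranchID polls GetLightdInfo until the server reports a real
    // consensus branch ID, retrying with backoff instead of aborting on "00000000".
    func waitForUsableBranchID(ctx context.Context, fetch fetchLightdInfo) (*LightdInfo, error) {
        backoff := time.Second
        for {
            info, err := fetch(ctx)
            if err == nil && info != nil && info.ConsensusBranchId != "" && info.ConsensusBranchId != "00000000" {
                return info, nil
            }
            select {
            case <-ctx.Done():
                return nil, fmt.Errorf("gave up waiting for a usable consensus branch ID: %w", ctx.Err())
            case <-time.After(backoff):
            }
            if backoff < time.Minute {
                backoff *= 2
            }
        }
    }

The same check could live wherever the client currently reads consensusBranchId before starting a sync.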
Solution
If lightwalletd is in a bad state, it should probably return an error instead of responding to the GetLightdInfo request. (If we do decide that returning 00000000 is what we want, then that behavior should be documented and the mobile clients updated to accommodate it.)
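As a minimal sketch of the error-returning option (again in Go): the type names and the getCachedInfo accessor below are stand-ins rather than lightwalletd's actual internals, and only the check against 00000000 comes from this report. status.Error and codes.Unavailable are the standard google.golang.org/grpc calls.

    package frontend

    import (
        "context"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // LightdInfo is a trimmed stand-in for the generated walletrpc.LightdInfo message.
    type LightdInfo struct {
        ConsensusBranchId string
        // other fields elided
    }

    // streamer is a stand-in for lightwalletd's gRPC service type; getCachedInfo
    // is a hypothetical accessor for whatever state the real handler reads.
    type streamer struct {
        getCachedInfo func() *LightdInfo
    }

    // GetLightdInfo returns a retryable gRPC error instead of a zeroed branch ID
    // when the backing zcashd has not yet provided a usable consensus branch ID.
    func (s *streamer) GetLightdInfo(ctx context.Context) (*LightdInfo, error) {
        info := s.getCachedInfo()
        if info == nil || info.ConsensusBranchId == "" || info.ConsensusBranchId == "00000000" {
            return nil, status.Error(codes.Unavailable,
                "lightwalletd is not yet synced with zcashd; consensus branch ID unknown")
        }
        return info, nil
    }

codes.Unavailable is conventionally treated as retryable by gRPC clients, which would also help with the no-retry behavior noted above.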
In the quoted output above, it's strange that the blockHeight field isn't present. Looking at the source code, it isn't conditional; it should always be displayed.
When I run that command now, blockHeight is shown, and when I rerun it a few seconds later, it's a few thousand higher (it's around 500k currently). So it appears to be syncing. Also, consensusBranchId is 76b809bb (not all zeros). Is it possible that you caught it just as it was starting up?
I don't understand why it would be syncing from scratch (sapling activation); the block cache files (by default at /var/lib/lightwalletd/db/) should allow it to restart quickly without having to sync the whole chain again. If it's running in Docker, it is necessary to map this directory as a volume so it's preserved across restarts. @mdr0id knows about this. Maybe that's not being done for the testnet instance?
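For example, assuming a plain docker run deployment (the host path and image reference are placeholders, not the actual testnet setup), the cache directory could be preserved with a bind mount:

    docker run -d -v /srv/lightwalletd/db:/var/lib/lightwalletd/db <lightwalletd image and usual flags>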
@LarryRuane It was returning this value for over 2 hours today, so it seems like a larger issue than a brief timing bug during server startup.
This was a result of lightwalletd (lwd) being in the timeout state after connecting to a zcashd node that was below the minimum cache height because it was still in initial block download (IBD). We are working to resolve the sync-time issue by using a snapshot.
Closing, feel free to reopen if still a problem.