zcash/lightwalletd

Server resets its latest block from time to time

hhanh00 opened this issue · 13 comments

What is the bug?

Occasionally, but not rarely, the server reports a very low height.
For example:

$ grpcurl mainnet.lightwalletd.com:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock
{
  "height": "622062",
  "hash": "vhlJK+qM9RkcHwfds7TYnTuoLtCwDaBOGvp6AQAAAAA="
}

Additional context
The height returned by the server info call is accurate though. It looks like lightwalletd is performing a deep reorg.
After a while, it gets back in sync.

skyl commented

I launched a server using master branch at db2795a
and I have yet to experience this. mainnet.lightwalletd.com:9067 seems to have been working for about ~24 hours now as well. I'm watching to see if this happens again .. I did observe it a couple of days ago on mainnet.lightwalletd.

» grpcurl mainnet.lightwalletd.com:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock
{
  "height": "1732043",
  "hash": "RgDuhFy+411fXeOkNlmZtSxZtdt07KJU9u1HAQAAAAA="
}
» grpcurl zuul.free2z.cash:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock
{
  "height": "1732043",
  "hash": "RgDuhFy+411fXeOkNlmZtSxZtdt07KJU9u1HAQAAAAA="
}
skyl commented

mainnet.lightwalletd.com is currently reset to a low height again

$ grpcurl mainnet.lightwalletd.com:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock
{
  "height": "488166",
  "hash": "VTzEONp5y4D7QBzt8RNgr3rVw8xMlaaRYHYvAAAAAAA="
}
$ grpcurl zuul.free2z.cash:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock

{
  "height": "1749543",
  "hash": "j6/Z2kpvQFLMe0F8gl1MJyXFX3HvEj0X6c60AQAAAAA="
}

Please post here the result of GetLightdInfo when you're seeing this problem.

I haven't seen this myself, and I can't think of what could cause this problem. Can you provide more detail? Let's say you were running GetLatestBlock from a loop every few seconds to get a stream of samples. Would it report the correct height many times in a row, and then on the very next sample, report a very low height, and then on the next sample, back to the correct height? Something like that?

Or, after reporting the very low height, would it slowly return to the correct height (like maybe advancing several hundred blocks per second)?

In other words, I'd like to understand the timing behavior. That would help narrow down the source of the problem. Thanks!

Sorry, I forgot to post GetLightInfod results though I tried it. The server height reported is the correct one, i.e., Getlastblock resets but not GetLightInfod.

Getlastblock indeed slowly goes back up as if block ingestion is repeated. Eventually, the server is back on track. However the process takes a few hours.

It is hard to tell when it happens and it seems to fix itself after a while.

However, wallet users are reporting lost funds (they are not lost) because wallets think they are synced but actually haven't processed the later blocks.

The server height reported is the correct one, i.e., Getlastblock resets but not GetLightInfod

Okay, but I was asking for the GetLightdInfo output to see which version was running. Oh but I just ran it myself (should have thought of that!), and I see it's now at v0.4.13, which is the latest version.

My guess is that this is some sort of infrastructure problem, rather than in the lightwalletd code itself. We (ECC) don't run the mainnet.lightwalletd.com instance, so it's unclear what they're doing. The lightwalletd server creates a (compact) block cache in the form of two disk files, here is the default location (for mainnet):

$ ls -l /var/lib/lightwalletd/db/main
total 2358072
-rw-r--r-- 1 larry larry 2409336991 Jul 22 12:12 blocks
-rw-r--r-- 1 larry larry    5310024 Jul 22 12:12 lengths
$ 

If something outside of lightwalletd deletes or corrupts these files, the result would be what you're seeing. There's no way for us to know if that's actually what's happening, I'm just saying it's something like this. Maybe the file system holding these files runs out of space? Although I think lightwalletd would just halt; I don't know why it would back up many blocks.

Who runs that instance, isn't it Nighthawk wallet? Do they have a GitHub repo that you can open a ticket on for this?

I contacted him on telegram and he assured me that the ECC was working actively on it as a top priority. I thought that you work at the ECC?

I contacted him on telegram and he assured me that the ECC was working actively on it as a top priority. I thought that you work at the ECC?

Yes, I do, but I'm not aware of any bug that makes the block height back up like that. There have been a few fixes lately, maybe you can try it again and make sure it's still happening (if you don't mind).

From a client's point of view, it happens randomly. But when it does, there is no clear indication that something is wrong. I could check both GetLastBlock and GetLightdInfo to see if they match but it seems to be a kludge since it is not supposed to happen anyway.

Btw, a user reported an issue 2 days ago, so I think it's still happening.

You definitely should not need to call both, I'll dig into this and try to figure out what's going on.

I just noticed that mainnet.lightwalletd.com seems to be down. I was going to say that I haven't had any luck seeing this problem. For the last several days, I've been continuously running this command:

while sleep 10;do grpcurl mainnet.lightwalletd.com:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock ;done 2>/dev/null | jq '.height|tonumber|select(. < 1753466)'

This checks every 10 seconds to see if the reported block height drops below 1753466 (just what the tip height happened to be when I started the script), and it hasn't printed anything (meaning no block was reported below that height). But that server has been down, since I don't know when. I just started the script to my local instance, just to see if I can reproduce it that way.

Update, it's been running for about 16 hours, and no block height resets have occurred.

There's been some speculation that this could have been caused by #408 (which should be fixed by #409, awaiting review), the idea being that if the lightwalletd doesn't cleanly shut down, the blocks cache could be in a strange state that required resynchronizing from a long way back in the chain. This is superficially plausible. But I don't think it's right. I just killed my lightwalletd without giving it a chance to cleanly shut down, and when I restarted it, it didn't reset (resync many blocks), it picked up exactly where it was when I killed it. (And, knowing how the code works, this is what I would expect.)

Is this problem still being seen? I haven't made any progress in reproducing it.