cosmos/relayer

Reached max retries querying for block

gaia opened this issue · 10 comments

gaia commented

On v2.5.1, and at least v2.5.0 as well, I have synced private Osmosis and Celestia RPCs being queried, but the relayer shows "Reached max retries querying for block" even for very recent blocks (and the nodes keep at least 2 weeks of blocks before pruning).

hey thanks for opening the issue!

we aren't seeing the same behavior in our infra. is there anything unique about your setup, perhaps a load balancer or the need to use port forwarding, etc? could you also try configuring some of the publicly available endpoints from the chain registry to rule out this being an issue with your nodes?

gaia commented

no load balancers. I'm running

[image]

I tried using all external RPCs, and I still get the same error. But as before, only on celestia<>osmosis (while cosmoshub<>osmosis works fine, using the same osmosis RPC)

The warnings look like this (note the recent block heights):

Reached max retries querying for block, skipping {"chain_name": "celestia", "chain_id": "celestia", "height": 893372}
warn Reached max retries querying for block, skipping {"chain_name": "osmosis", "chain_id": "osmosis-1", "height": 14058148}

The error is intermittent: it is not shown for every single block in sequence. After a while, it starts to happen only on Celestia (with either the local or a 3rd party RPC).
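To check whether the warnings really shift to a single chain over time, the skipped heights can be tallied per chain from a saved log. This is only a rough sketch: it assumes each warning ends with the JSON fields exactly as pasted above, and the names `skipped_blocks` and `count_by_chain` are made up for illustration.

```python
import json
from collections import Counter

# Text that precedes the JSON fields in the pasted warnings.
MARKER = "Reached max retries querying for block, skipping "


def skipped_blocks(log_lines):
    """Yield (chain_id, height) for each 'max retries' warning found in the log."""
    for line in log_lines:
        _, found, tail = line.partition(MARKER)
        if not found:
            continue  # not one of the warnings we care about
        try:
            fields = json.loads(tail.strip())
        except json.JSONDecodeError:
            continue  # assumption violated: the line did not end with plain JSON
        yield fields.get("chain_id"), fields.get("height")


def count_by_chain(log_lines):
    """Count skipped blocks per chain_id."""
    return Counter(chain_id for chain_id, _ in skipped_blocks(log_lines))
```

Feeding it the lines of a log file (for example `count_by_chain(open("rly.log"))`, where the file name is hypothetical) gives a quick per-chain count to compare between runs.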

Does rly establish any connection that needs an inbound port, or use websockets? It's behind NAT at the router and NAT at the hypervisor (LXC/LXD).

PS: I can establish a websocket connection to a 3rd party using websocat fine

Would you mind giving me the exact query it is trying to do so that I can try it manually?

the log you shared has chain_name set to celestia, so it would seem that the Celestia RPC is the problematic one here. when you start the relayer, are you using the debug flag -d? mostly asking to see if some details related to the error are going unseen. i do remember an issue someone reported where the relayer was unable to sync with Celestia, and it was due to some configuration on the node, see #1383

the relayer does not use websockets, it just makes RPC calls to the configured node

if i'm not mistaken the logs you are seeing are related to the block_results RPC endpoint
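For trying the query by hand, something like the following could work. It is a sketch under assumptions, not the relayer's actual code: it calls the node's `block_results` RPC endpoint over plain HTTP for one height, and the helper names are invented here. The node address is whatever RPC address is configured for the chain.

```python
import json
import urllib.error
import urllib.request


def block_results_url(rpc_addr: str, height: int) -> str:
    """Build the block_results URL for one height."""
    return f"{rpc_addr.rstrip('/')}/block_results?height={height}"


def fetch_block_results(rpc_addr: str, height: int, timeout: float = 10.0) -> dict:
    """Query block_results once and return the decoded JSON-RPC response."""
    try:
        with urllib.request.urlopen(block_results_url(rpc_addr, height), timeout=timeout) as resp:
            body = resp.read()
    except urllib.error.HTTPError as exc:
        # The node may answer with an HTTP error status but still send a JSON error body.
        body = exc.read()
    return json.loads(body)


def describe(response: dict) -> str:
    """Summarize a JSON-RPC response as an error message or a success note."""
    if "error" in response:
        err = response["error"]
        return f"error: {err.get('message', '')} {err.get('data', '')}".strip()
    return "ok: node returned block results"
```

For example, `print(describe(fetch_block_results("http://localhost:26657", 893372)))` would query a local node, assuming it listens on port 26657. An error for a recent, unpruned height points at the node rather than the relayer.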

gaia commented

thanks, i will look into it again.

PS: port forwarding IS in use. there is NAT at the router to the public IP and also in the LAN IP of the host (since the relayer runs in an LXC container)

let me know what you turn up!

I'm thinking this is possibly a silent error on the Celestia node that is only logged at the debug level, which could stem from some configuration value that is specific to Celestia. From what you described, I don't think there is necessarily anything wrong with your relayer/node setup. Perhaps @agouin can take a peek at this and confirm that the system configuration you are using is fine?

gaia commented

i will run rly again in the near future and report back. for now I am running hermes.
you can however use our rpc node for testing. i can send you some TIA.

appreciate it! yeah if you wanna share your node i would be happy to try debugging this a bit when i have some extra cycles

gaia commented

happy to share. send me a DM on twitter (@wholesum), you are @Ethereal0ne, right?

The team did some debugging with your Celestia node between Celestia<>Osmosis and it turns out the node is currently configured to discard ABCI responses, which the relayer needs to work properly.

The same issue was described in #1383 and the solution is to go into the node's config and set the field discard_abci_responses = false. After that rly should have no problems connecting to the node and successfully relaying IBC packets.
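A quick way to check a node for this setting before pointing rly at it is to scan its config file for the field. The sketch below is an illustration only: the function name is invented, it does a simple line scan rather than a full TOML parse, and which file holds the setting (commonly `config/config.toml` under the node's home directory) is an assumption to verify for your node.

```python
import re

# Matches an uncommented `discard_abci_responses = true|false` line.
_SETTING = re.compile(r"^\s*discard_abci_responses\s*=\s*(true|false)\b", re.MULTILINE)


def discards_abci_responses(config_text: str):
    """Return True or False for the configured value, or None when the key is absent."""
    match = _SETTING.search(config_text)
    if match is None:
        return None  # key not set; the node falls back to its own default
    return match.group(1) == "true"
```

If this returns `True` for the node's config text, the field needs to be changed to `discard_abci_responses = false` as described above.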