Basic Fault Tolerance is not working
syspulse opened this issue · 6 comments
I have two upstream nodes configured as primary and secondary/fallback.
When the primary node is unavailable or fails at the network level, the secondary node is not tried.
Docker: emeraldpay/dshackle:0.13.1
Request:
curl -i localhost:8545/ethereum -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x01", true],"id":100}'
Log:
2023-01-05 14:06:08.196 | WARN | CompoundReader | Failed to read from io.emeraldpay.dshackle.upstream.ethereum.EthereumDirectReader$2@5f9adf81
reactor.core.Exceptions$RetryExhaustedException: Retries exhausted: 3/3
....
Caused by: io.emeraldpay.dshackle.upstream.rpcclient.JsonRpcException: op-2: Name or service not known
Configuration:
version: v1
proxy:
  host: 0.0.0.0
  port: 8545
  routes:
    - id: ethereum
      blockchain: ethereum
cluster:
  upstreams:
    - id: op1
      chain: ethereum
      role: primary
      options:
        disable-validation: true
      connection:
        ethereum:
          rpc:
            url: "http://op-2:8545"
    - id: op2
      chain: ethereum
      role: fallback
      options:
        disable-validation: true
      connection:
        ethereum:
          rpc:
            url: "https://mainnet.optimism.io"
Expected behavior: After the primary upstream fails, the secondary/fallback node is tried.
op-2: Name or service not known. Can you please check that Dshackle can resolve the hostname op-2?
It cannot, on purpose; that is the whole point. I'm testing a topology where the primary is unavailable due to networking issues.
I have identified the root of the issue.
When you set disable-validation: true, it essentially instructs Dshackle to always consider the upstream available. This disables all checks, making it a potentially dangerous and breaking option by itself. And it seems it is incompatible with the primary/fallback roles, as Dshackle will always attempt to use the primary, believing it never fails (even if the host cannot be resolved).
I am uncertain how to address this issue, as it is essentially the expected behavior. However, for problems like an unresolvable hostname, it seems illogical.
May I ask why you want to disable the validation? Is it because Dshackle is unable to validate Optimism? I have never tried using it with Optimism, so I am unsure if there are any differences. But maybe I can add an alternative option(s) to ensure compatibility with Optimism.
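To illustrate the interaction between disable-validation and the primary/fallback roles, here is a minimal Python sketch of role-based selection. It is not Dshackle's actual code: the names Upstream, effective_ok and pick are hypothetical, and the model assumes only what is stated in this thread, namely that a fallback is used when no primary is considered OK and that disable-validation makes an upstream count as always OK.

```python
from dataclasses import dataclass


@dataclass
class Upstream:
    id: str
    role: str                  # "primary" or "fallback"
    reachable: bool            # what a health check would observe
    disable_validation: bool = False

    def effective_ok(self) -> bool:
        # Assumption from the thread: with validation disabled, the
        # upstream is treated as always OK, whatever its real state.
        return True if self.disable_validation else self.reachable


def pick(upstreams):
    """Return the first upstream considered OK, trying primaries first."""
    for role in ("primary", "fallback"):
        for up in upstreams:
            if up.role == role and up.effective_ok():
                return up
    return None


# Same topology as the report: op1 cannot be resolved, op2 is healthy.
with_flag = [
    Upstream("op1", "primary", reachable=False, disable_validation=True),
    Upstream("op2", "fallback", reachable=True, disable_validation=True),
]
without_flag = [
    Upstream("op1", "primary", reachable=False),
    Upstream("op2", "fallback", reachable=True),
]

print(pick(with_flag).id)     # op1: the dead primary is still selected
print(pick(without_flag).id)  # op2: validation lets the fallback take over
```

In this model the fallback is only reached once the primary is seen as not OK, which can never happen while validation is disabled on the primary.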
I was under the impression that disable-validation was about checking whether the node is synced. I actually don't need nodes to be synced and checked with RPC calls.
I removed disable-validation and configured geth instances, and it seems to be working with this behavior:
- The first request is always quite slow, and I see this in the log: Multistream | State of ETH: height=17181425, status=[UNAVAILABLE/2], lag=[0, 0], weak=[op1, op2]
- When the primary node shuts down (clean TCP FIN), the request is still retried and there is a noticeable delay. The secondary node is always available.
How do I disable the in-memory cache?
disable-validation is more general; it is basically a shortcut that marks the upstream as always OK.
The first request seems to be slow because of [UNAVAILABLE/2], so it waits until one of the upstreams becomes available. I think the second case is a similar issue: Dshackle doesn't immediately learn that the upstream is down and keeps trying it for some time. I'm going to make a new release in the following week, and with the new release the process of discovering failed upstreams should be smoother.
The in-memory cache cannot be disabled now, and it's needed for some internal operations, though I think I can come up with some options to tune it. Is your main concern the memory used for caching?
The first request seems to be slow because [UNAVAILABLE/2] so it waits until one of the upstreams becomes available.
I start with all nodes available immediately (I can see successful checks for latest blocks on the network). I am not sure why it waits so long to be marked as Available.
The second case I can understand. I just hoped it would work like fast, fault-tolerant load balancing: no connection - go to the next, retry the connection in the background; timeout - go to the next, don't exhaust retries ;-)
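The behavior hoped for above could be sketched roughly as follows. This is a minimal Python illustration, not how Dshackle works: the Failover class and the dead and healthy functions are hypothetical, upstreams are plain callables, and a simple cooldown stands in for retrying the connection in the background.

```python
import time


class Failover:
    """Try upstreams in order; skip ones that failed recently."""

    def __init__(self, upstreams, cooldown=5.0, clock=time.monotonic):
        self.upstreams = upstreams      # list of (name, callable) pairs
        self.cooldown = cooldown        # seconds to skip a failed upstream
        self.clock = clock
        self.down_until = {}            # name -> time it may be tried again

    def call(self, request):
        now = self.clock()
        for name, send in self.upstreams:
            if name in self.down_until and self.down_until[name] > now:
                continue                # known bad, do not wait on it
            try:
                return name, send(request)
            except (ConnectionError, TimeoutError):
                # One failure is enough: mark it down and move on
                # instead of exhausting retries on the same upstream.
                self.down_until[name] = now + self.cooldown
        raise ConnectionError("no upstream available")


attempts = []                           # records every call made to op1


def dead(request):
    attempts.append("op1")
    raise ConnectionError("Name or service not known")


def healthy(request):
    return {"jsonrpc": "2.0", "id": request["id"], "result": "ok"}


lb = Failover([("op1", dead), ("op2", healthy)])
req = {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
first = lb.call(req)    # op1 fails once, op2 answers
second = lb.call(req)   # op1 is skipped without being contacted
print(first[0], second[0], len(attempts))
```

The failed primary costs one attempt, after which requests go straight to the fallback until the cooldown expires and the primary is probed again.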