Feature: Add retry on failed delivery if multiple addresses of an agent are registered
Closed this issue · 5 comments
Problem
Consider a scenario where an agent needs to be available at all times, but either:
- resides in an unstable network, or
- sometimes crashes due to external factors (hardware faults, vserver restrictions).
Desired state
- An agent A who tries to contact another agent B is given a list of endpoints (which represent multiple instances of the same agent B).
- Agent A chooses at random which endpoint to contact but, instead of failing and stopping, retries all available endpoints first.
- Each of these tries should be equipped with a timeout and a proper log message.
- If none of the endpoints are available the agent should notify the user, log its state, and not raise exceptions.
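The desired behaviour above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the uAgents implementation: `deliver_with_retry`, `send_fn`, and the default timeout are hypothetical names and values chosen for the example.

```python
import asyncio
import logging
import random
from typing import Awaitable, Callable, Sequence

logger = logging.getLogger("delivery")

# Hypothetical transport signature (an assumption for this sketch): takes an
# endpoint URL and a payload, returns True if the endpoint accepted the message.
SendFn = Callable[[str, bytes], Awaitable[bool]]


async def deliver_with_retry(
    endpoints: Sequence[str],
    payload: bytes,
    send_fn: SendFn,
    timeout: float = 5.0,
) -> bool:
    """Try every endpoint once, in random order; never raise on failure."""
    # Sampling the full list yields a random permutation of the endpoints.
    candidates = random.sample(list(endpoints), k=len(endpoints))
    for endpoint in candidates:
        try:
            # Each attempt gets its own timeout.
            if await asyncio.wait_for(send_fn(endpoint, payload), timeout):
                logger.info("Delivered via %s", endpoint)
                return True
            logger.warning("Endpoint %s rejected the message", endpoint)
        except asyncio.TimeoutError:
            logger.warning("Timed out after %.1fs contacting %s", timeout, endpoint)
        except Exception as exc:  # one bad endpoint must not abort the loop
            logger.warning("Failed to contact %s: %s", endpoint, exc)
    # Nothing worked: report the state instead of raising.
    logger.error("Unable to deliver: all %d endpoints failed", len(candidates))
    return False
```

The caller checks the boolean result to notify the user; no exception escapes the function even when every endpoint is down.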
Current state
At the moment we already have some of the required aspects implemented:
- multiple agent instances can be spun up which have access to the same wallet (and therefore have the same address).
- we can register multiple endpoints for one agent address within the almanac smart contract.
- upon agent address resolution (query of the almanac) we are given a list of addresses to choose from.
For more information see the following figure and please ask questions if something needs more clarification.
This makes sense to me. One option for implementing this would be:
- Add a new resolver, say `RobustResolver`, which returns a random set of endpoints rather than a single one, up to some limit. If the limit equals the total number of endpoints, then it obviously isn't random anymore.
- Update `Context.send()` to iterate through the list of endpoints returned by the resolver until successful or all endpoints were tried.
Does this sound like a reasonable implementation? Any concerns from anyone?
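As a rough illustration of the resolver half of this proposal, the selection step might look like the sketch below. `select_endpoints` and `max_endpoints` are hypothetical names; the real resolver classes in uAgents are not reproduced here.

```python
import random
from typing import List, Optional, Sequence


def select_endpoints(
    endpoints: Sequence[str],
    max_endpoints: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return up to max_endpoints distinct endpoints in random order."""
    rng = rng or random.Random()
    pool = list(endpoints)
    if max_endpoints is None:
        # No limit given: return every endpoint.
        limit = len(pool)
    else:
        # Clamp the limit to the range [0, number of endpoints].
        limit = max(0, min(max_endpoints, len(pool)))
    # Sampling without replacement; with limit == len(pool) the selection is
    # no longer random, only the order in which endpoints are tried.
    return rng.sample(pool, k=limit)
```

Passing `max_endpoints=1` returns a single randomly chosen endpoint, which would match picking one endpoint at random and not retrying.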
> - Add a new resolver, say `RobustResolver`, which returns a random set of endpoints rather than a single one, up to some limit. If the limit equals the total number of endpoints, then it obviously isn't random anymore.
- Why a new resolver and not make this standard behaviour when more than 1 endpoint is registered?
- Is it possible to combine several resolvers when creating an agent or do you need to choose one?
> - Why a new resolver and not make this standard behaviour when more than 1 endpoint is registered?

Yes, that's probably even better, but we could make the limit (number of endpoints to try) configurable. Setting this to one would effectively replicate the current behaviour.

> - Is it possible to combine several resolvers when creating an agent or do you need to choose one?

Not really, besides the `GlobalResolver`, which determines whether to call the `Almanac` or `NameService` resolver. Did you have a particular use case in mind?
Thanks for your inputs @jrriehl and @Dacksus.
I'd also like to see this become the standard behaviour when more than 1 endpoint is registered, since it wouldn't break or change any current implementations, assuming that most of the agents registered on the Almanac have only 1 endpoint associated (can we check that?). And even where multiple addresses exist, callers would be given one address at a time anyway.
> - Update `Context.send()` to iterate through the list of endpoints returned by the resolver until successful or all endpoints were tried.

I think I'd also tackle that in `Context.send_raw()`, specifically in uAgents/python/src/uagents/context.py (line 380 in bddc7b8), by returning a list from the `AlmanacResolver`: uAgents/python/src/uagents/resolver.py (line 129 in bddc7b8).
We would need to add a "number of retries" config option and potentially cap it at an internal maximum, for cases where someone tries to set up an agent farm with hundreds of agents or more.
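One way to picture the capped retry setting is the sketch below. Everything in it is an assumption made for illustration: the names `DeliveryConfig` and `HARD_MAX_ENDPOINT_ATTEMPTS`, the ceiling of 10, and the default of 1 are not taken from uAgents or from this thread.

```python
from dataclasses import dataclass

# Hypothetical internal ceiling; the thread does not specify a value.
HARD_MAX_ENDPOINT_ATTEMPTS = 10


@dataclass(frozen=True)
class DeliveryConfig:
    """User-facing retry setting, clamped to an internal maximum."""

    # Assumed default: 1 means a single endpoint is tried and there is no retry.
    max_attempts: int = 1

    def effective_attempts(self, registered_endpoints: int) -> int:
        """Number of endpoints that will actually be tried."""
        wanted = max(1, self.max_attempts)  # treat zero or negative values as 1
        return min(wanted, HARD_MAX_ENDPOINT_ATTEMPTS, registered_endpoints)
```

With these assumed values, a request for 500 attempts against 300 registered endpoints is reduced to 10, so a large agent farm cannot trigger hundreds of sequential delivery attempts for one message.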