fetchai/uAgents

Feature: Add retry on failed delivery if multiple addresses of an agent are registered


Problem

We have the following scenario where an agent is either:

  • residing in an unstable network, or
  • sometimes crashes due to external factors (hardware faults, vserver restrictions)

while at the same time it needs to be available at all times.

Desired state

  • An agent A that tries to contact another agent B is given a list of endpoints (which represent multiple instances of the same agent B).
  • Agent A chooses at random which endpoint to contact, but instead of failing and stopping on the first error, it tries all available endpoints before giving up.
  • Each attempt should have a timeout and produce a proper log message.
  • If none of the endpoints are available, the agent should notify the user, log its state, and not raise exceptions.
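
To make that contract concrete, here is a minimal sketch of the retry behaviour (the send_with_fallback name and the deliver callable are hypothetical; the actual transport is abstracted away):

import asyncio
import logging
import random

logger = logging.getLogger("agent")

async def send_with_fallback(endpoints, deliver, timeout_s: float = 5.0) -> bool:
    """Try every endpoint in random order; log failures, never raise."""
    # `deliver` is a coroutine function that posts the message to one endpoint
    for endpoint in random.sample(endpoints, k=len(endpoints)):
        try:
            await asyncio.wait_for(deliver(endpoint), timeout=timeout_s)
            logger.info("delivered via %s", endpoint)
            return True
        except (asyncio.TimeoutError, OSError) as err:
            logger.warning("delivery via %s failed: %s", endpoint, err)
    logger.error("all %d endpoints failed; message undelivered", len(endpoints))
    return False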

Current state

At the moment we already have some of the required aspects implemented:

  • multiple agent instances can be spun up which have access to the same wallet (and therefore have the same address).
  • we can register multiple endpoints for one agent address within the almanac smart contract.
  • upon agent address resolution (query of the almanac) we are given a list of addresses to choose from.

For more information, see the figure below, and please ask questions if something needs more clarification.

(Figure: redundancy_proposal)

This makes sense to me. One option for implementing this would be:

  • Add a new resolver, say RobustResolver, which returns a random set of endpoints rather than a single one, up to some limit. If the limit equals the total number of endpoints, the selection obviously isn't random anymore.
  • Update Context.send() to iterate through the list of endpoints returned by the resolver until delivery succeeds or all endpoints have been tried.

Does this sound like a reasonable implementation? Any concerns from anyone?
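
For illustration, a rough sketch of what such a RobustResolver could look like (the resolve() signature and the lookup callable are assumptions for this sketch, not the current uAgents API):

import random
from typing import Awaitable, Callable, List, Tuple

class RobustResolver:
    """Return up to `limit` registered endpoints in random order instead of one."""

    def __init__(self, lookup: Callable[[str], Awaitable[List[str]]], limit: int = 3):
        # `lookup` stands in for the existing almanac query, which already
        # returns every endpoint registered for a destination address
        self._lookup = lookup
        self._limit = limit

    async def resolve(self, destination: str) -> Tuple[str, List[str]]:
        endpoints = await self._lookup(destination)
        k = min(self._limit, len(endpoints))
        # random.sample also shuffles, so the caller can simply try the
        # returned endpoints in order
        return destination, random.sample(endpoints, k=k)

Setting limit=1 would reproduce today's single-endpoint behaviour.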

  • Add a new resolver, say RobustResolver, which returns a random set of endpoints rather than a single one, up to some limit. If the limit equals the total number of endpoints, the selection obviously isn't random anymore.
  • Why a new resolver and not make this standard behaviour when more than 1 endpoint is registered?
  • Is it possible to combine several resolvers when creating an agent or do you need to choose one?

  • Why a new resolver and not make this standard behaviour when more than 1 endpoint is registered?

Yes, that's probably even better, but we could make the limit (number of endpoints to try) configurable. Setting this to one would effectively replicate the current behaviour.

  • Is it possible to combine several resolvers when creating an agent or do you need to choose one?

Not really, besides the GlobalResolver which determines whether to call the Almanac or NameService resolver. Did you have a particular use case in mind?

Thanks for your inputs @jrriehl and @Dacksus.
I'd also like to see this become the standard behaviour when more than one endpoint is registered, as this feature wouldn't break or change any current implementations - assuming that most of the agents registered on the almanac only have one endpoint associated (can we check that?). And even if multiple addresses exist, they would be handed out one at a time anyway.

  • Update Context.send() to iterate through the list of endpoints returned by the resolver until delivery succeeds or all endpoints have been tried.

I think I'd also tackle that in Context.send_raw(), specifically in

destination_address, endpoint = await self._resolver.resolve(destination)

by having the AlmanacResolver return a list instead of the single weighted choice it currently makes here:
return destination, random.choices(endpoints, weights=weights)[0]

We would need to add a number-of-retries config and potentially cap it at an internal maximum for cases where someone sets up an agent farm with hundreds of agents or more.
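
As a sketch of how the sending side could then consume that list (deliver_envelope and MAX_TRIES are hypothetical names, and the plain JSON POST stands in for whatever send_raw() actually does):

import asyncio
import logging

import aiohttp

logger = logging.getLogger("context")

MAX_TRIES = 5  # internal cap so a huge endpoint list cannot stall the sender

async def deliver_envelope(endpoints: list, envelope_json: str, tries: int = 3) -> bool:
    """POST the envelope to each resolved endpoint until one accepts it."""
    attempts = min(len(endpoints), tries, MAX_TRIES)
    async with aiohttp.ClientSession() as session:
        for endpoint in endpoints[:attempts]:
            try:
                async with session.post(
                    endpoint,
                    data=envelope_json,
                    headers={"content-type": "application/json"},
                    timeout=aiohttp.ClientTimeout(total=5),
                ) as resp:
                    if resp.status == 200:
                        return True
                    logger.warning("endpoint %s returned %s, trying next", endpoint, resp.status)
            except (aiohttp.ClientError, asyncio.TimeoutError) as err:
                logger.warning("endpoint %s unreachable (%s), trying next", endpoint, err)
    logger.error("unable to deliver the envelope to any of %d endpoints", attempts)
    return False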

Some initial work here: #150. I still need to test it, but I'd be interested to hear whether this is close to what you had in mind.