cloudfoundry/bosh-agent

Agent could be blocked by blobstore access issues


The retry logic for blobstore operations introduced with #258 can block the agent from executing other tasks. This can occur when a blobstore operation is triggered from a synchronous task such as sync_dns_with_signed_url and there are blobstore access issues. In that case the agent retries the blobstore access for up to 15s and cannot start other tasks or respond to task status requests, which can lead to failed deployments or to agents showing as unresponsive in the bosh vms output. We need to rethink how to implement the retry logic without blocking the agent. Some ideas:

  • reduce the wait limit
  • retry only from async tasks
  • retry only on specific errors like network issues (see the sketch after this list)
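
A minimal sketch of the last idea in Go, assuming a hypothetical wrapper around a single blobstore operation (getFunc, isRetryable and withRetries are illustrative names, not agent code): only errors that look like transient network failures are retried, everything else fails fast.

```go
// Sketch of error-classified retries; not the agent's actual implementation.
package blobretry

import (
	"errors"
	"net"
	"time"
)

// getFunc stands in for a single blobstore Get/Put attempt (hypothetical).
type getFunc func() error

// isRetryable treats only network-level errors (timeouts, refused
// connections, DNS failures) as worth retrying.
func isRetryable(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr)
}

// withRetries runs op up to attempts times, waiting delay between tries,
// but only while the error is classified as retryable; all other errors
// fail fast so the agent is not kept busy.
func withRetries(op getFunc, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if !isRetryable(err) {
			return err // e.g. auth failures or missing blobs: no point retrying
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return err
}
```

Failing fast on non-network errors (auth failures, missing blobs) would keep the blocking time close to a single attempt for the most common failure modes.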

The retry mechanism was introduced because the agent was sometimes unable to establish a connection to S3 for a certain period of time. We don't have clear data on how long this was a problem. We could experiment with reducing the retries to 3 * 1s instead of 3 * 5s (see the sketch after this comment for the resulting worst-case blocking time).
The blobstore is accessed from a mixture of async and sync actions.
The sync_dns_with_signed_url action is synchronous; maybe it could also be async? Can this just be changed, or would it also require changes on the Director side? What would be the implications? I guess the sync DNS broadcasts can happen frequently.
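
To make the retry budget concrete, here is a rough sketch with hypothetical names (retryPolicy is not the agent's real configuration), assuming each failed attempt costs roughly the configured delay: 3 * 5s blocks a synchronous action for about 15s, 3 * 1s for about 3s.

```go
// Rough retry-budget arithmetic with hypothetical names; assumes each
// failed attempt costs roughly Delay (wait plus the attempt itself).
package main

import (
	"fmt"
	"time"
)

type retryPolicy struct {
	Attempts int
	Delay    time.Duration
}

// worstCaseBlocking approximates how long a synchronous action is held
// up when every attempt fails.
func (p retryPolicy) worstCaseBlocking() time.Duration {
	return time.Duration(p.Attempts) * p.Delay
}

func main() {
	current := retryPolicy{Attempts: 3, Delay: 5 * time.Second}
	proposed := retryPolicy{Attempts: 3, Delay: 1 * time.Second}
	fmt.Println("current policy blocks up to:", current.worstCaseBlocking())   // 15s
	fmt.Println("proposed policy blocks up to:", proposed.worstCaseBlocking()) // 3s
}
```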

Retrying or not retrying on specific errors would definitely be reasonable.

This issue was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days, it will automatically be closed. To prevent this from happening, remove the Stale label or comment below.

This issue was closed because it has been labeled Stale for 7 days without subsequent activity. Feel free to re-open this issue at any time by commenting below.

Still relevant.

@mvach Do you have any context on this?

mvach commented

Hi @rkoster,
so this issue still exists and could potentially occur. Internally the team will run a POC to get some insights into how to finally fix it. Max could offer timelines ;-)

mvach commented

We cannot reproduce this issue anymore and therefore should close it.

@mvach thanks for getting back on this.