bookingcom/shipper

Don't block when resolving chart versions

parhamdoustdar opened this issue · 0 comments

Right now, every call to repo.FetchChartVersions blocks for up to 2 seconds before timing out.

This caused an incident when we tried to take down one of the chart repositories that Shipper depends on. Because a large number of old releases were still around and still referred to this chart repository, the queues got clogged and we had to bring the chart repository back online.

Instead of having a delay, we should:

  • Change the repo to wait for a longer period (e.g. 10 seconds) the first time a new repo is encountered.
  • While processing a job, don't wait for resolution if it hasn't happened already. Instead, update the condition to say that we're waiting for the chart repository index to be fetched. That way we report the problem to the user and also move on to process releases that might be completely unaffected by this chart repository going down. In our incident, that would mean that new releases not using the older chart repository would still be handled just fine.

This would pretty much mean removing the select statement from repo.FetchChartVersions, moving it to repo.Start, and adding an argument like timeout to repo.Start.