slackhq/astra

Fail a recovery task after a hard timeout

bryanlb opened this issue · 2 comments

Description

Recovery tasks sometimes fail for a variety of reasons, such as expired data in Kafka or corrupt or missing partition metadata. Currently, a recovery task keeps spinning when this happens. Add a hard timeout to the recovery task, after which the assigned task fails.
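A minimal sketch of what a hard timeout could look like, using only `java.util.concurrent`. The class `RecoveryTimeoutSketch`, the `Outcome` enum, and `runWithHardTimeout` are hypothetical names for illustration and are not part of Astra; how the real recovery code is invoked is an assumption here.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not Astra code: bounds a recovery attempt with a hard timeout.
final class RecoveryTimeoutSketch {
  enum Outcome { SUCCEEDED, FAILED, TIMED_OUT }

  // recoveryWork stands in for the real recovery logic (an assumption);
  // it returns true when the recovery completed successfully.
  static Outcome runWithHardTimeout(Callable<Boolean> recoveryWork, Duration hardTimeout) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<Boolean> future = executor.submit(recoveryWork);
    try {
      // Block for at most hardTimeout instead of waiting forever.
      Boolean ok = future.get(hardTimeout.toMillis(), TimeUnit.MILLISECONDS);
      return Boolean.TRUE.equals(ok) ? Outcome.SUCCEEDED : Outcome.FAILED;
    } catch (TimeoutException e) {
      // Hard timeout reached: interrupt the worker and report the task as timed out.
      future.cancel(true);
      return Outcome.TIMED_OUT;
    } catch (ExecutionException e) {
      // The recovery work itself threw, e.g. because of bad partition metadata.
      return Outcome.FAILED;
    } catch (InterruptedException e) {
      // The calling thread was interrupted; restore the flag and give up.
      Thread.currentThread().interrupt();
      future.cancel(true);
      return Outcome.FAILED;
    } finally {
      executor.shutdownNow();
    }
  }
}
```

Note that `future.cancel(true)` only interrupts the worker thread, so this approach assumes the recovery work responds to interruption; a loop that ignores interrupts would keep running in the background.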

It may be better to mark the recovery task as failed, so that it is not assigned to another node again and an admin can take care of it manually.
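One way to picture this suggestion is a terminal `FAILED` status that the assignment logic skips. The sketch below is an illustration under assumptions: `RecoveryAssignmentSketch`, `Status`, `Task`, `nextAssignable`, and `markFailed` are hypothetical names and do not reflect Astra's actual metadata model.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical model, not Astra code: a failed task is never handed out again.
final class RecoveryAssignmentSketch {
  enum Status { PENDING, ASSIGNED, FAILED }

  static final class Task {
    final String name;
    Status status = Status.PENDING;

    Task(String name) {
      this.name = name;
    }
  }

  // Only PENDING tasks are eligible; FAILED tasks wait for an admin.
  static Optional<Task> nextAssignable(List<Task> tasks) {
    return tasks.stream().filter(t -> t.status == Status.PENDING).findFirst();
  }

  // Called when a task hits the hard timeout or fails outright.
  static void markFailed(Task task) {
    task.status = Status.FAILED;
  }
}
```

With this shape, a task that timed out stays visible for manual inspection but is excluded from automatic reassignment.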

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 30 days.