NTH drains nodes immediately after processing a scheduled event, instead of waiting for the scheduled time
cnmcavoy opened this issue · 2 comments
Describe the bug
When NTH runs in IMDS mode and processes a scheduled event for a node, it uses the InterruptionEvent start time, rather than the scheduled time, to calculate when the node should be drained of running pods. So when a node has scheduled EC2 maintenance, which is often far in the future (weeks or more), NTH sees the event and drains the node right after processing it.
Looking at the implementation, I think the problem lies in how the StartTime field is set when processing scheduled events:
    events = append(events, monitor.InterruptionEvent{
        EventID:      scheduledEvent.EventID,
        Kind:         monitor.ScheduledEventKind,
        Monitor:      ScheduledEventMonitorKind,
        Description:  fmt.Sprintf("%s will occur between %s and %s because %s\n", scheduledEvent.Code, scheduledEvent.NotBefore, scheduledEvent.NotAfter, scheduledEvent.Description),
        State:        scheduledEvent.State,
        NodeName:     m.NodeName,
        StartTime:    time.Now(),
        EndTime:      notAfter,
        PreDrainTask: preDrainFunc,
    })
}
Is time.Now() the accurate StartTime in this context? Should this be scheduledEvent.NotBefore?
I ask because the StartTime field is reused when the time to drain the node is calculated, and at that point it holds the time the event was processed:
// TimeUntilDrain returns the duration until a node drain should occur (can return a negative duration)
func (s *Store) TimeUntilDrain(interruptionEvent *monitor.InterruptionEvent) time.Duration {
    nodeTerminationGracePeriod := time.Duration(s.NthConfig.NodeTerminationGracePeriod) * time.Second
    drainTime := interruptionEvent.StartTime.Add(-1 * nodeTerminationGracePeriod)
    return time.Until(drainTime)
}
Possibly this is the same issue described in #858 but it's vague enough that I'm not certain.
Expected outcome
I expected NTH to drain the node at the scheduled time for the maintenance, not at the time that it received the event.
Environment
- NTH App Version: 1.20.0
- NTH Mode (IMDS/Queue processor): IMDS
- OS/Arch: Ubuntu 20.04
- Kubernetes version: 1.24.14
- Installation method: ArgoCD / Helm
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
This issue was closed because it has become stale with no activity.