replicatedhq/troubleshoot

Improve collectd collector performance by detecting hostPath mount errors

banjoh opened this issue · 1 comments

banjoh commented

Describe the rationale for the suggested feature.

Whenever the collectd collector runs, it mounts the /var/lib/collectd host path. If the path does not exist, the pod gets stuck in the ContainerCreating state until it is forcefully terminated. This wastes a lot of time: the collector runs for 90s unnecessarily.

Describe the feature

We need to figure out how to detect that a pod is failing because it cannot find the /var/lib/collectd directory, and then stop the collector gracefully.

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    59s                default-scheduler  Successfully assigned default/troubleshoot-copyfromhost-pgw7w-t2jsc to k3d-mycluster-server-0
  Warning  FailedMount  27s (x7 over 59s)  kubelet            MountVolume.SetUp failed for volume "host" : hostPath type check failed: /var/lib/collectd is not a directory

Additional context

This collector pod is launched using a DaemonSet, so there is a pod restart policy to consider. We do not want to set it to Never, because there may be legitimate intermittent conditions that stop the pod from starting.

I think we can check the pod's events for a FailedMount event and, if one is present, terminate the pod immediately. I have added this to the PR.
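
As a rough illustration of the event check, here is a minimal sketch. It is not the code from the PR. `podEvent` and `hostPathMountFailure` are hypothetical names, and `podEvent` is a trimmed-down stand-in for the Type, Reason and Message fields of a Kubernetes event, so the snippet has no dependencies. Matching on the "hostPath type check failed" text is an assumption based on the event output above; it is meant to avoid reacting to other, possibly intermittent, mount failures.

```go
package main

import "strings"

// podEvent is a hypothetical stand-in for the fields of a Kubernetes
// event that this check needs. Real code would read these from the
// events listed for the collector pod.
type podEvent struct {
	Type    string // "Normal" or "Warning"
	Reason  string // e.g. "Scheduled", "FailedMount"
	Message string
}

// hostPathMountFailure returns the first Warning event with reason
// FailedMount whose message reports a failed hostPath type check.
// The message substring is an assumption taken from the kubelet event
// shown in this issue, not a documented contract.
func hostPathMountFailure(events []podEvent) (podEvent, bool) {
	for _, ev := range events {
		if ev.Type == "Warning" && ev.Reason == "FailedMount" &&
			strings.Contains(ev.Message, "hostPath type check failed") {
			return ev, true
		}
	}
	return podEvent{}, false
}
```

The collector could call something like this while it waits for the pod to start, and stop waiting as soon as it returns true instead of running out the full timeout.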