coryodaniel/k8s

Timeout instead of reference when watching a specific resource

linkdd opened this issue · 7 comments

Hello,

I've been having some trouble using the K8s.Client.watch function:

operation = K8s.Client.list("tekton.dev/v1beta1", :pipelinerun, namespace: :all)
{:ok, ref} = K8s.Client.watch(conn, operation, stream_to: self())

But instead of receiving: {:ok, ref}
I receive: {:error, %HTTPoison.Error{id: nil, reason: :timeout}}

As if the runner does not perform an async request.
For what it's worth, using K8s.Client.run(conn, operation) returns the error as well.

But using kubectl get pipelinerun works fine.

I get this behavior only with this resource and not with the 3 other CRDs I'm watching.

NB: I have a large number of PipelineRun resources in the cluster (over 100). Could that be the problem?

Please read through #145, especially the lower part (#145 (comment)). Does it answer the question?

No, it does not. I expect the timeout to be received as a message in the process mailbox, not as a direct return value of the watch function.

I call this watch function within a GenServer and use the handle_info function to process both the events and the timeouts. This works fine for 3 CRDs.

But for the PipelineRun resources, instead of returning a reference and sending the timeout error as a message, I get the timeout directly.

To quote the #145 issue, my code fails before entering the loop_receive function.
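
Until the cause is found, the calling process can at least avoid crashing on the failed match. Here is a minimal sketch, not taken from the library: WatchStarter, the :retry_watch message and the retry delay are hypothetical names, and the watch call is passed in as a function so the snippet runs without a cluster.

```elixir
defmodule WatchStarter do
  @moduledoc """
  Hypothetical helper: starts a watch and treats a synchronous timeout
  as a retryable condition instead of crashing on a failed match.
  """

  # `watch_fun` is a zero-arity function standing in for
  # `fn -> K8s.Client.watch(conn, operation, stream_to: self()) end`.
  def start(watch_fun, retry_after_ms \\ 5_000) do
    case watch_fun.() do
      {:ok, ref} ->
        {:ok, ref}

      # Matches any map or struct with `reason: :timeout`,
      # which includes %HTTPoison.Error{reason: :timeout}.
      {:error, %{reason: :timeout}} ->
        # Ask the calling process to try again later.
        Process.send_after(self(), :retry_watch, retry_after_ms)
        {:retry, retry_after_ms}

      # Any other error is handed back to the caller unchanged.
      {:error, other} ->
        {:error, other}
    end
  end
end
```

In a GenServer, a handle_info(:retry_watch, state) clause would call WatchStarter.start/2 again. This only keeps the process alive; it does not explain why the request times out.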

Oh I see.

Any ideas on why this happens, or recommendations on how to avoid it?

No I really have no idea why this is happening. It is the API server that returns a timeout. I'd expect the answer somewhere in your cluster.

See if this command gives you the resource version of the list resource (using jq to remove verbose list of items, but you can skip that if you want):

kubectl get --raw /apis/tekton.dev/v1beta1/namespaces/default/pipelineruns | jq "del(.items)"

Ok the problem was indeed on my end.

FYI, I'm developing a Kubernetes Operator, Kubirds.

It provides some CRDs, especially the Unit CRD which creates a PipelineRun resource (from Tekton) according to a schedule (every 5 minutes in our case).

The Unit resource has an optional field, history, which tells our operator how many PipelineRun resources we want to keep. If that field is not specified, we assume we want to keep them all.

After a month on our preproduction environment, we had accumulated more than 12,000 PipelineRun resources. Assuming they hold 1MB of data each, the API Server was trying to load and return 12GB of data, which obviously times out.

This was making our gen_server process crash which was then getting restarted by the supervisor, every 5 seconds. This was also killing our API Server and making our k8s unreachable.

The fix is now easy: if the history field is not present, assume it's 0, not infinity.
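
Roughly, the new default looks like the sketch below. This is not the actual Kubirds code: HistoryPolicy and the flat map shape (a "history" key on the Unit spec, a "creationTimestamp" key on each run) are assumptions made for illustration.

```elixir
defmodule HistoryPolicy do
  # Number of PipelineRuns to keep for a Unit spec (a plain map here).
  # A missing, nil or invalid "history" value is treated as 0.
  def keep_count(unit_spec) do
    case Map.get(unit_spec, "history") do
      n when is_integer(n) and n >= 0 -> n
      _ -> 0
    end
  end

  # Splits runs into {keep, delete}, keeping the newest `keep_count` runs.
  # Each run is assumed to carry a "creationTimestamp" string in a uniform
  # UTC ISO 8601 format, so plain string ordering matches time ordering.
  def split(runs, unit_spec) do
    runs
    |> Enum.sort_by(& &1["creationTimestamp"], :desc)
    |> Enum.split(keep_count(unit_spec))
  end
end
```

With history set to 1, only the most recent run ends up in the keep list; with no history field at all, every run ends up in the delete list.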

TBH, I was kind of naively expecting there would be a pagination mechanism on the Kubernetes API to avoid such problems.
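
For what it's worth, the Kubernetes list API does accept limit and continue query parameters for chunked retrieval. The sketch below shows only the general pattern of following a continue token lazily. PagedList and fetch_page are hypothetical, and how a given client library exposes those parameters is not covered here.

```elixir
defmodule PagedList do
  # `fetch_page` is a hypothetical function: given a continue token (nil for
  # the first page), it returns {:ok, items, next_token}, where next_token is
  # nil on the last page. In a real client it would issue a list request
  # with the `limit` and `continue` query parameters.
  def stream(fetch_page) do
    Stream.resource(
      # Start marker, distinct from any real token.
      fn -> :start end,
      fn
        :done ->
          {:halt, :done}

        token ->
          cont = if token == :start, do: nil, else: token
          {:ok, items, next} = fetch_page.(cont)
          # Emit this page's items and remember where to resume.
          {items, next || :done}
      end,
      # Nothing to clean up in this sketch.
      fn _ -> :ok end
    )
  end
end
```

Because the stream is lazy, a consumer that only needs the first few items never requests the later pages.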

Yup, the fix is up and running, and the huge number of resources has been deleted. No more timeouts now. Closing this ticket.

Thank you for your time :)