
Func deploy does not wait for autoscaler to start new nodes

gabbler97 opened this issue · 4 comments

I use Knative exactly as described in the documentation:

 kn version
Version:      v1.12.0
Build Date:   2023-10-25 15:45:39
Git Revision: ae357368
Supported APIs:
* Serving
  - (knative-serving v1.12.0)
* Eventing
  - (knative-eventing v1.12.0)
  - (knative-eventing v1.12.0)
func version
I installed Istio with istioctl.
I run EKS 1.26.
I use the Cluster Autoscaler.
I have one node group without taints and another node group with a taint (reserved-mynodes: true).
When I deploy my functions and there are not enough resources in the cluster, the function's pod cannot be scheduled:

cd hello
func --namespace my-ns deploy --registry my-registry/knative-test-go
  Type     Reason            Age              From                Message
  ----     ------            ----             ----                -------
  Warning  FailedScheduling  76s              default-scheduler   0/5 nodes are available: 1 Too many pods, 1 node(s) had untolerated taint {reserved-mynodes: true}, 4 Insufficient cpu. preemption: 0/5 nodes are available: 1 Preemption is not helpful for scheduling, 4 No preemption victims found for incoming pod..

Sometimes the deployment simply times out after 120 seconds:

func --namespace my-ns deploy --registry my-registry/knative-test-go
Warning: function is in namespace 'my-ns', but requested namespace is 'my-ns'. Continuing with deployment to 'my-ns'.
Warning: namespace chosen is 'my-ns', but currently active namespace is 'default'. Continuing with deployment to 'my-ns'.
function up-to-date. Force rebuild with --build
Pushing function image to the registry "my-registry" using the "my-user" user credentials
⬆️  Deploying function to the cluster

Service output:

deploy error: knative deployer failed to wait for the Knative Service to become ready: timeout: service 'hello' not ready after 120 seconds
Error: knative deployer failed to wait for the Knative Service to become ready: timeout: service 'hello' not ready after 120 seconds

This is clearly caused by the cluster autoscaler: when the cluster lacks resources, it takes 2-3 minutes to bring up a new worker node. Once the function exists (in a failed state) and the new node is up, I can re-run the deployment without any issue:

func --namespace my-namespace deploy --registry my-artifactory/knative-test-go
Warning: namespace chosen is 'my-namespace', but currently active namespace is 'default'. Continuing with deployment to 'my-namespace'.
function up-to-date. Force rebuild with --build
Pushing function image to the registry "my-artifactory" using the "my-user" user credentials
⬆️  Deploying function to the cluster
✅ Function updated in namespace "my-namespace" and exposed at URL:

How can I increase the timeout? I found no --timeout flag or anything similar.
Should I look for a solution in the knative-eventing configuration?
Thank you very much in advance!

Hello everyone!
Any ideas?

Hello @gabbler97

I am sorry, but this timeout is not currently configurable.

I will add this request to our open issues backlog.

I would also post your question about Knative Serving in the Knative Serving channel on the CNCF Slack; you might get some help there.

In addition to a simple --timeout option, I would prefer that we detect when a new node is being allocated, inform the user, and automatically extend the timeout.

Dear @lkingland,
Thank you very much for your answer! :)

Hey @lkingland, I think we can achieve this by configuring the Kubernetes client to set up a watcher over the nodes and look for events. If a new worker node is being allocated, we increase the timeout.
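A minimal sketch of the classification step such a watcher would need, under one stated assumption: it looks at events on the function's pod rather than on Node objects, because the Cluster Autoscaler records a `TriggeredScaleUp` event on a pending pod when it starts adding a node, and `NotTriggerScaleUp` when it decides it cannot help (both reasons are worth checking against the autoscaler version in use). The `Event` struct and `Action` values are hypothetical simplifications of the Kubernetes `core/v1` Event, and mapping `NotTriggerScaleUp` to giving up is one possible policy, not a given.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a hypothetical, trimmed-down stand-in for the Kubernetes core/v1
// Event: only Reason, Source.Component and Message are kept.
type Event struct {
	Reason    string
	Component string
	Message   string
}

// Action says what the waiting deployer should do after an event.
type Action int

const (
	Keep   Action = iota // unrelated event: keep the current deadline
	Notify               // pod is unschedulable: tell the user why
	Extend               // autoscaler is adding a node: extend the timeout
	GiveUp               // autoscaler will not add a node: fail early
)

func (a Action) String() string {
	return [...]string{"keep", "notify", "extend", "give up"}[a]
}

// classify maps one event about the function's pod to an Action.
func classify(e Event) Action {
	switch {
	case e.Component == "cluster-autoscaler" && e.Reason == "TriggeredScaleUp":
		return Extend
	case e.Component == "cluster-autoscaler" && e.Reason == "NotTriggerScaleUp":
		return GiveUp
	case e.Reason == "FailedScheduling" && strings.Contains(e.Message, "Insufficient"):
		return Notify
	default:
		return Keep
	}
}

func main() {
	// Illustrative events, modeled on the FailedScheduling message in this issue.
	events := []Event{
		{Reason: "FailedScheduling", Component: "default-scheduler",
			Message: "0/5 nodes are available: 4 Insufficient cpu."},
		{Reason: "TriggeredScaleUp", Component: "cluster-autoscaler",
			Message: "pod triggered scale-up"},
	}
	for _, e := range events {
		fmt.Printf("%s/%s -> %v\n", e.Component, e.Reason, classify(e))
	}
}
```

With client-go, the input stream could come from `clientset.CoreV1().Events(namespace).Watch(ctx, metav1.ListOptions{...})`, for example filtered with a field selector on `involvedObject.name`; an `Extend` result would then be the signal to lengthen the wait.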