openfaas/of-watchdog

Support request on scaling from zero

t83714 opened this issue · 15 comments

If you invoke a function (using of-watchdog HTTP mode) after it has been scaled to zero, a new function pod is created successfully, but the call returns a Server returned unexpected status code: 500 - result.

If you look into the function pod log below:

Forking - node [bootstrap.js]
2020/03/10 11:23:19 Started logging stderr from function.
2020/03/10 11:23:19 Started logging stdout from function.
2020/03/10 11:23:19 OperationalMode: http
2020/03/10 11:23:19 Timeouts: read: 1m5s, write: 1m5s hard: 1m0s.
2020/03/10 11:23:19 Listening on port: 8080
2020/03/10 11:23:19 Writing lock-file to: /tmp/.lock
2020/03/10 11:23:19 Metrics listening on port: 8081
2020/03/10 11:23:21 Upstream HTTP request error: Post http://127.0.0.1:3000/: dial tcp 127.0.0.1:3000: connect: connection refused
2020/03/10 11:23:27 stdout: OpenFaaS Node.js listening on port: 3000

you can tell that of-watchdog throws an Upstream HTTP request error 6 seconds before the express Node.js server starts listening on port 3000.

Expected Behaviour

When invoking a function that has been scaled to zero, the function needs to be successfully executed and return the correct response.

Extra response time due to the pod initialisation is acceptable.

Current Behaviour

When invoking a function that has been scaled to zero, a Server returned unexpected status code: 500 - response is returned.

Possible Solution

Possible Solution 1:

  • Make of-watchdog retry the connection to the internal HTTP server (in HTTP mode) for a certain timeframe. During that timeframe, of-watchdog should not throw an Upstream HTTP request error (a rough sketch of this follows below).

Possible Solution 2:

  • Implement a readiness probe in addition to the current liveness probe (served at /_/health). Before the internal HTTP server is up, the readiness probe should not return a 200 status code.
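
For solution 1, the idea is roughly the following. This is only a standalone sketch to illustrate it, not of-watchdog's actual code; the upstream address, retry interval and deadline are made-up values:

// Sketch only: retry dialling the upstream function process for a bounded
// window instead of failing the first proxied request immediately.
package main

import (
    "fmt"
    "net"
    "time"
)

// waitForUpstream polls the upstream TCP address until it accepts a
// connection or the deadline passes.
func waitForUpstream(addr string, deadline time.Duration) error {
    end := time.Now().Add(deadline)
    for {
        conn, err := net.DialTimeout("tcp", addr, time.Second)
        if err == nil {
            conn.Close()
            return nil
        }
        if time.Now().After(end) {
            return fmt.Errorf("upstream %s not ready after %s: %v", addr, deadline, err)
        }
        time.Sleep(250 * time.Millisecond)
    }
}

func main() {
    // e.g. block the first proxied request until the function process is
    // listening on 127.0.0.1:3000, rather than returning a 500 straight away.
    if err := waitForUpstream("127.0.0.1:3000", 30*time.Second); err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("upstream ready, safe to proxy the request")
}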

Steps to Reproduce (for bugs)

  1. Turn on the scale-to-zero feature of faas-idler and deploy the function with the com.openfaas.scale.zero=true label
  2. Wait 30 mins and make sure the function pod is terminated by the faas-idler
  3. Invoke the function and watch the result

Context

This issue makes the scale-to-zero feature unusable, as the first request after scaling to zero will always fail.

Your Environment

CLI:
 commit:  ea687659ecf14931a29be46c4d2866899d36c282
 version: 0.11.8

Gateway
 uri:     http://127.0.0.1:8080
 version: 0.18.10
 sha:     80b6976c106370a7081b2f8e9099a6ea9638e1f3
 commit:  Update Golang versions to 1.12


Provider
 name:          openfaas-operator
 orchestration: kubernetes
 version:       0.14.1 
 sha:           e747b6ace86bc54184d899fa10cf46dada331af1

Hi, in order to look at this we will need to know your complete set-up and how we can reproduce the issue. I can't actually reproduce this, so I would think you may have missed some key information?

/set title: Support request on scaling from zero

/add label: support

@alexellis Thanks for your reply.
Here is the repo of our project (Branch openfaas):
https://github.com/magda-io/magda-ckan-connector/tree/openfaas

I tried to create a customised docker image to suit our project.
You don't have to build the project as the docker image is available from docker hub.
We deploy it like this (the helm template is here):

apiVersion: openfaas.com/v1
kind: Function
metadata:
  name: ckan-data-url-processor
  namespace: "openfaas-fn"
  labels: 
    magdaCategory: connectors
    magdaSubCategory: ckan-connector
    magdaType: data-url-processor
spec:
  name: ckan-data-url-processor
  handler: node bootstrap.js
  image: "docker.io/data61/magda-ckan-connector:0.0.57-0-openfaas"
  labels: 
    magdaCategory: connectors
    magdaSubCategory: ckan-connector
    magdaType: data-url-processor
    com.openfaas.scale.zero: "true"
  environment:
    handler: remoteDataUrl
    read_timeout: 65s
    write_timeout: 65s
    exec_timeout: 60s
    userId: 00000000-0000-4000-8000-000000000000

If you do need to build the docker image locally for testing, you can run the following in the cloned repo:

yarn install
yarn build
# assume you use minikube --- setup docker env vars
eval $(minikube docker-env)
yarn docker-build-local

To invoke the function

POST the following:

{
  "handler": "remoteDataUrl",
  "remoteDataUrl": "https://demo.ckan.org/dataset/databaza-ziadosti-o-dotacie-mk-sr-za-rok-2019_test2020_02_18_14_50"
}

to http://localhost:8080/function/ckan-data-url-processor

This assumes you have port-forwarded the OpenFaaS gateway to localhost:8080.
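
For example, an invocation could look like this (illustrative only; curl against the same URL works equally well):

// Example invocation: POST the payload above to the function route on the
// port-forwarded gateway and print the status and response body.
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "strings"
)

func main() {
    payload := `{
  "handler": "remoteDataUrl",
  "remoteDataUrl": "https://demo.ckan.org/dataset/databaza-ziadosti-o-dotacie-mk-sr-za-rok-2019_test2020_02_18_14_50"
}`

    res, err := http.Post(
        "http://localhost:8080/function/ckan-data-url-processor",
        "application/json",
        strings.NewReader(payload),
    )
    if err != nil {
        panic(err)
    }
    defer res.Body.Close()

    body, _ := ioutil.ReadAll(res.Body)
    fmt.Println(res.Status)
    fmt.Println(string(body))
}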

The function simply pulls data from a publicly available API, converts it to JSON and returns it.

Please let me know if you need more information to reproduce the issue. Thanks again!

@alexellis
I'm not familiar with Go, but quickly browsing the code, it looks like the _/health endpoint will return 200 immediately once of-watchdog's HTTP server starts listening?

See the line below: acceptingConnections is set to 1 here.

https://github.com/openfaas-incubator/of-watchdog/blob/dd56bb916862d17f72d52d82469c631713bdd414/main.go#L146

After that, the _/health endpoint will always return a 200 status code.

At this stage, the executor HTTP server may not be ready yet (although buildRequestHandler is called earlier).

Could we perhaps add some logic to forward the HTTP request to the executor HTTP server's _/health endpoint here, once acceptingConnections == 1 && lockFilePresent() are both true?

This would give us a chance to make sure the executor HTTP server is online before serving any function invocation requests.
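
Just to illustrate the idea (this isn't of-watchdog's actual code; in the real watchdog it would be combined with the existing acceptingConnections and lock-file checks, and the executor health URL below is an assumption):

// Sketch only: report healthy from the watchdog's health endpoint only once
// the executor (upstream) HTTP server answers its own health endpoint.
package main

import (
    "log"
    "net/http"
    "time"
)

// Hypothetical health endpoint exposed by the executor HTTP server.
const upstreamHealthURL = "http://127.0.0.1:3000/_/health"

// upstreamReady returns true when the executor answers its health endpoint with 200.
func upstreamReady() bool {
    client := http.Client{Timeout: time.Second}
    res, err := client.Get(upstreamHealthURL)
    if err != nil {
        return false
    }
    defer res.Body.Close()
    return res.StatusCode == http.StatusOK
}

func main() {
    http.HandleFunc("/_/health", func(w http.ResponseWriter, r *http.Request) {
        // Only report ready once the executor process is actually listening
        // and answering its own health check.
        if upstreamReady() {
            w.WriteHeader(http.StatusOK)
            return
        }
        w.WriteHeader(http.StatusServiceUnavailable)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}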

How long does it take until your function server is ready?

Would you be open to trying out a custom health check path? https://docs.openfaas.com/reference/workloads/

It should be proxied to your function code.

4 to 6 seconds.
From the logs, I can tell that of-watchdog's HTTP server is online and starts serving _/health requests within 1 second.

Would you be open to trying out a custom health check path?

Thanks a lot! Will give the custom health check path a try and report back 👍

@alexellis

I've added a /_/health endpoint to my express executor:

app.get("/_/health", (req, res) => {
    res.status(200).send("OK");
});

Then I deleted the function and re-deployed it with the com.openfaas.health.http.path: "/_/health" annotation:

apiVersion: openfaas.com/v1
kind: Function
metadata:
  name: ckan-data-url-processor
  namespace: openfaas-fn
  labels: 
    magdaCategory: connectors
    magdaSubCategory: ckan-connector
    magdaType: data-url-processor
spec:
  name: ckan-data-url-processor
  handler: node bootstrap.js
  image: "localhost:5000/data61/magda-ckan-connector:latest"
  labels: 
    magdaCategory: connectors
    magdaSubCategory: ckan-connector
    magdaType: data-url-processor
    com.openfaas.scale.zero: "true"
  annotations:
    com.openfaas.health.http.path: "/_/health"
  environment:
    handler: remoteDataUrl
    read_timeout: 65s
    write_timeout: 65s
    exec_timeout: 60s
    userId: 00000000-0000-4000-8000-000000000000
  limits:
    cpu: 100m
  requests:
    cpu: 50m
    memory: 30Mi

However, I still seem to get the same error.
The invocation request was received after 5s, but the express server was only up after 10s:

Forking - node [bootstrap.js]
2020/03/12 12:49:51 Started logging stderr from function.
2020/03/12 12:49:51 Started logging stdout from function.
2020/03/12 12:49:51 OperationalMode: http
2020/03/12 12:49:51 Timeouts: read: 1m5s, write: 1m5s hard: 1m0s.
2020/03/12 12:49:51 Listening on port: 8080
2020/03/12 12:49:51 Writing lock-file to: /tmp/.lock
2020/03/12 12:49:51 Metrics listening on port: 8081
2020/03/12 12:49:55 Upstream HTTP request error: Post http://127.0.0.1:3000/: dial tcp 127.0.0.1:3000: connect: connection refused
2020/03/12 12:50:00 stdout: OpenFaaS Node.js listening on port: 3000

It seems the custom health check path didn't work.
Is it something to do with how I set up the custom health check path?

You need to use a different path, otherwise you'll conflict with the watchdog's own /_/health endpoint.

Setting com.openfaas.health.http.path to a different path works!
I did need to set com.openfaas.health.http.initialDelay: "10s" as well to avoid the container being restarted.
Thanks a lot @alexellis 👍

What was the typical start-up time for your container?

Given your usage, you may benefit from joining the Insiders Program, from 5USD/mo. https://openfaas.com/support

I think the typical container start-up time is around 10s.
On closer inspection, I found that the long startup time may be caused by the require() calls in Node.js.
We use require() to load a lot of JSON files from another NPM package ---- that might be the culprit causing the delay.

Regarding the support service, we have just started to look at FaaS solutions and evaluate the suitability of OpenFaaS for our project.
Our project is an open-source project (i.e. we won't make any money from it and probably won't have the budget for the support service).
But I think if we do adopt OpenFaaS as part of our solution, many of our users will be interested in the support service.
I'm also quite happy to contribute to the project via feedback/bug reports or even some coding (I can do Node.js & Scala and am keen to pick up some Go).

Thanks again @alexellis for your help & well done on the OpenFaaS project 👍

Our project is an open-source project (i.e. we won't make any money from it and probably won't have the budget for the support service).

Just so that you're aware, OpenFaaS is also an open-source project which has no external funding. I didn't link you to a support program, but a landing page which has the newsletter subscription - since you are using OpenFaaS (it seems heavily), the least you can do is check it out and read the sample. If you benefited from me giving you free technical support here (which you did), then you will benefit from the weekly email updates on the project and ecosystem.