camunda-community-hub/zeebe-http-worker

Jobs are not executed after a while

urbanisierung opened this issue · 13 comments

Issue: Jobs that are already waiting when the worker boots up are executed. Once the worker is running and new instances are started, some jobs are still executed, but at some point no further jobs are activated and the instances stay pending.
Environment: Cloud and locally
Steps to reproduce locally:

cd operate
docker-compose up
java -jar zeebe-http-worker.jar
  • Deploy a workflow including a service task with type http and start some instances until jobs are not executed anymore. I used https://github.com/camunda-cloud/camunda-cloud-examples for this - in the example the service task calls the github api and updates the labels

    • The first time, this works like a charm
  • Wait two minutes

  • Start a new instance again: job is not executed

Might be related to camunda/camunda#3585

I still cannot reproduce this locally with a simple HTTP workflow, so it could be related to something else. I will try to figure out how to set up the cloud examples.

Further testing indicated that the problem is a request to GitHub that hangs and never finishes. Because the worker does not set a request timeout, it hangs forever. Since the worker is single-threaded and blocks on the HTTP request, that would explain why no more jobs are executed.
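To illustrate the blocking described above, here is a minimal sketch in plain Java. It is not the worker's actual code: `BlockedWorkerDemo` and `completedBehindHangingJob` are hypothetical names, and a latch that is never released stands in for the hanging HTTP call. It only shows how one stuck job starves everything queued behind it when there is a single execution thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BlockedWorkerDemo {

    /**
     * Submits one job that never returns, followed by quickJobs trivial jobs,
     * and reports how many trivial jobs finished within waitMillis.
     */
    static int completedBehindHangingJob(int threads, int quickJobs, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch neverReleased = new CountDownLatch(1); // stand-in for the hanging HTTP call
        CountDownLatch quickJobsDone = new CountDownLatch(quickJobs);
        AtomicInteger completed = new AtomicInteger();
        try {
            // The first job blocks its thread until it is interrupted.
            pool.submit(() -> {
                try {
                    neverReleased.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // These jobs are queued behind the hanging one.
            for (int i = 0; i < quickJobs; i++) {
                pool.submit(() -> {
                    completed.incrementAndGet();
                    quickJobsDone.countDown();
                });
            }
            quickJobsDone.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow(); // interrupts the hanging job so the JVM can exit
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // One thread: the hanging job occupies it, so no quick job runs.
        System.out.println("1 thread:  " + completedBehindHangingJob(1, 3, 300));
        // Two threads: the second thread drains the queue.
        System.out.println("2 threads: " + completedBehindHangingJob(2, 3, 2000));
    }
}
```

With a second thread the queued jobs still get through, but that only hides the problem: the hanging call keeps its thread occupied until something bounds how long it may take.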

OK - so a fix could be to have a request timeout.
Probably we should also think about making that worker multi-threaded?

Actually, it does have a timeout of 1 minute set, but it seems the HTTP client is not triggering it: https://github.com/zeebe-io/zeebe-http-worker/blob/86e4ddafe41335bf71291eb9bef2402a403d11db/src/main/java/io/zeebe/http/HttpJobHandler.java#L89

I was able to get it working by using sendAsync and doing a timed get call on the response future. The call then throws after a while and the worker handles the next request.

Probably we should also think about making that worker multi-threaded?

Yes, that would be good. The Zeebe job worker has a config option for it.

@berndruecker my quick fix was something like this in HttpJobHandler#handle

response = client.sendAsync(request, HttpResponse.BodyHandlers.ofString()).get(30, TimeUnit.SECONDS);

but I'm still not sure why the timeout of the Java HTTP client does not work.

Investigating a bit, it looks like the timeout above is a connection timeout, meaning it only applies to establishing the connection, even though the API docs say otherwise ("Sets a timeout for this request. If the response is not received within the specified timeout then an {@link HttpTimeoutException} is thrown"). However, I don't see a real downside to sending the request asynchronously and waiting for the Future for a limited amount of time, so I think we can simply change it this way, since it proved to solve the problem.
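For reference, the send-async-and-wait approach can be tried in isolation with a sketch like the following. It is not the code from HttpJobHandler: `TimedHttpCall`, `sendWithDeadline` and `callSlowServer` are hypothetical names, and the slow endpoint is a throwaway local server built with the JDK's `com.sun.net.httpserver`, standing in for the hanging GitHub request.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedHttpCall {

    /** Sends the request asynchronously and waits at most timeoutMillis for the response. */
    static String sendWithDeadline(HttpClient client, HttpRequest request, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        CompletableFuture<HttpResponse<String>> future =
                client.sendAsync(request, HttpResponse.BodyHandlers.ofString());
        try {
            return "status " + future.get(timeoutMillis, TimeUnit.MILLISECONDS).statusCode();
        } catch (TimeoutException e) {
            // Stop waiting so the caller can move on. Whether cancel also aborts
            // the underlying exchange depends on the JDK version.
            future.cancel(true);
            return "timed out";
        }
    }

    /** Starts a local server that delays its response, then calls it with a deadline. */
    static String callSlowServer(long serverDelayMillis, long timeoutMillis) {
        HttpServer server = null;
        try {
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/slow", exchange -> {
                try {
                    Thread.sleep(serverDelayMillis); // simulate a slow remote API
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/slow");
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            return sendWithDeadline(HttpClient.newHttpClient(), request, timeoutMillis);
        } catch (IOException | InterruptedException | ExecutionException e) {
            throw new IllegalStateException("demo call failed", e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(callSlowServer(1000, 200)); // response arrives after the deadline
        System.out.println(callSlowServer(0, 5000));   // response arrives well within the deadline
    }
}
```

The deadline here is enforced by the caller waiting on the future, so it does not depend on how the client itself interprets its timeout setting.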

Unfortunately I have to re-open this bug.

I can now see the timeouts that are implemented, but they don't help the worker to process all jobs again. When I restart the worker, all open jobs are processed again.

Setup: used the docker-compose env from the description, beside:

@urbanisierung: Damn it - can you share your workflow definition? Then I will try this locally.

@berndruecker here is the workflow definition I'm using:
editGithubIssue.zip

  • just one service task to patch a github issue
  • using payload:
const body = {
  labels: [Date.now() + ""]
};
  • with this payload a new label array will be set
  • you need a github token to get it running

steps:

  • start creating instances: all fine
  • stopping instance creation, waiting 1-2 min, then creating new instances again: jobs are no longer activated by the worker
  • observing timeout handling
  • restarting the worker helps to execute all open jobs

hope this helps!

@berndruecker here is a script to reproduce it:

#!/bin/bash

bpmn=editGithubIssue.bpmn
id=editGithubIssue

echo "Zeebe Status"
zbctl status --insecure

echo "Deploying workflow definition $bpmn"
zbctl deploy $bpmn --insecure

# $1 number of iterations
# $2 waiting time in seconds between iterations
create_instances() {
    i=0
    while [ "$i" -lt "$1" ]
    do
        now=$(date)
        echo "$i/$1 Creating new instance with label $now"
        zbctl create instance "$id" --variables "{\"body\":{\"labels\":[\"$now\"]}}" --insecure
        sleep "$2"
        # POSIX arithmetic expansion instead of the deprecated $[...] form
        i=$((i + 1))
    done
    done
}

create_instances 10 1

echo "Waiting 60sec"
sleep 60

create_instances 3 3

AWESOME - Thanks!

Updating the Java version fixed it.