Netflix/genie

jobs not killed after timeout

jelez opened this issue · 23 comments

jelez commented

genie-app-3.0.7.jar

When jobs time out, the job is marked FAILED in the Genie UI but the Unix process keeps running forever and the application keeps running on YARN. (Even if we kill the process from the UI, the status is FAILED.)

In the Genie log we repeatedly get:

2017-06-28 20:55:20.090  INFO 5502 --- [askScheduler-50] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[2d6d0716-5c26-11e7-a50e-0afab93cad90] has timed out

com.netflix.genie.common.exceptions.GenieTimeoutException: Job has exceeded its timeout time of 2017-06-28T17:36:15.792+0000
        at com.netflix.genie.core.util.UnixProcessChecker.checkProcess(UnixProcessChecker.java:79) ~[genie-core-3.0.7.jar!/:3.0.7]
        at com.netflix.genie.web.tasks.job.JobMonitor.run(JobMonitor.java:135) ~[genie-web-3.0.7.jar!/:3.0.7]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-4.2.8.RELEASE.jar!/:4.2.8.RELEASE]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_131]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_131]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_131]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]

However, it seems that the event listener never receives the event; we never see any messages logged by LocalJobKillServiceImpl.java.

Also, we never get the following message in the log for those jobs, though we do sometimes see it on other EC2 instances:

2017-06-28 18:17:16.845  INFO 5493 --- [askScheduler-42] c.n.genie.web.tasks.job.JobMonitor       : Job e4eb5eba-5c2b-11e7-9233-0a83bd41bebe has finished

Nevertheless, the "has finished" message suggests the wrong exception is being captured, because the job timed out; it did not simply finish.

jelez commented

Here is an instance where the job times out but is processed via the "has finished" path:

2017-06-28 18:17:11.835  INFO 5493 --- [askScheduler-42] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[e4eb5eba-5c2b-11e7-9233-0a83bd41bebe] has timed out

com.netflix.genie.common.exceptions.GenieTimeoutException: Job has exceeded its timeout time of 2017-06-28T18:17:09.622+0000
        at com.netflix.genie.core.util.UnixProcessChecker.checkProcess(UnixProcessChecker.java:79) ~[genie-core-3.0.7.jar!/:3.0.7]
        at com.netflix.genie.web.tasks.job.JobMonitor.run(JobMonitor.java:135) ~[genie-web-3.0.7.jar!/:3.0.7]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-4.2.8.RELEASE.jar!/:4.2.8.RELEASE]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_131]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_131]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_131]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]

2017-06-28 18:17:16.845  INFO 5493 --- [askScheduler-42] c.n.genie.web.tasks.job.JobMonitor       : Job e4eb5eba-5c2b-11e7-9233-0a83bd41bebe has finished

This could potentially be because the job finished before the next JobMonitor check.
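For reference, the polling pattern described above (a conceptual Python sketch of what we understand JobMonitor to do; the real implementation is Java inside Genie) does allow exactly this race: the process can exit between the check that logs "has timed out" and the next check, which then reports "has finished":

```python
import os
import time


def monitor(pid, timeout_epoch, check_delay=5.0):
    """Conceptual JobMonitor loop: poll a PID until it exits or times out."""
    while True:
        try:
            os.kill(pid, 0)  # signal 0 sends nothing; it only checks existence
        except ProcessLookupError:
            return "finished"  # the process exited before this check ran
        if time.time() > timeout_epoch:
            return "timed out"  # still alive past its deadline
        time.sleep(check_delay)  # window in which timeout and finish can race
```

Because the delay between checks is several seconds, a job that dies (or is killed) just after the timeout is detected will still be reported as "has finished" on the next tick.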

jelez commented

If it is helpful, we have the following settings:

    users:
      creationEnabled: false
      runAsUserEnabled: false

@jelez What OS are you running on?

Also, can you try with 3.0.8? @mprimi made some improvements to the run script in
#534 and we want to check whether it covers any of these cases for you.

jelez commented

It's Amazon Linux:
Linux ip-10-171-20-29 4.9.27-14.31.amzn1.x86_64 #1 SMP Wed May 10 01:58:40 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

jelez commented

Also, due to runAsUserEnabled = false we should not hit any "sudo" code paths.

How many threads do you have allocated to the thread pools? That was another thing we fixed in the default configuration of 3.0.8.

jelez commented

81, on a 40-CPU instance.

@jelez Can you provide repro steps and environment details so we can try to reproduce this on our end?

We have an integration test for the timeout but it may not be catching this case. We'll take a look but some repro steps would definitely help.

@jelez From the logs you pasted, it seems that JobMonitor emits a kill event. This should be consumed by LocalKillJobServiceImpl, which attempts the kill.
Can you provide any logs produced by this class?

This should give us some info on whether:

  • Kill service ever kicks into action
  • Kill service attempts to kill
  • Kill service succeeds (or at least it thinks it does)

There is a single debug line in this class, but seeing it would point us in a specific direction.
Alternatively, exceptions in the logs originating in this class would tell us what happened.

jelez commented

export JAVA_OPTS="-Djava.io.tmpdir=/opt/genie/tmp -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Addresses=true"
exec java ${JAVA_OPTS} -jar /opt/genie/lib/genie-app-3.0.7.jar --spring.config.location=/opt/genie/config/

genie runs as user hadoop

we run batch spark jobs to EMR 5.6.0 clusters. Spark 2.1.0
jobs are submitted via genie python client: nflx-genie-client (3.3.2)

Jobs run for about 2 minutes and exit.
The timeout in the Genie client is 300 sec (5 min).
In some cases, when there are many jobs, they stay in the YARN queue as accepted and Genie times them out.
This is where we see the problem described above: Genie times them out and we see them as FAILED in the UI (shouldn't we see them as TIMEOUT or something?).
On the OS we see the processes still running on the Genie nodes.
In YARN we see the applications still running.

jelez commented

@mprimi I agree; this is exactly my observation. I see JobMonitor throwing the exception repeatedly at intervals, but it is never captured by LocalKillJobServiceImpl, because I do not see any messages from it in the logs (as I mentioned earlier). Let me know if I can do anything to increase the verbosity level.
To me it seems the kill service is either not detecting the event or not running: there is not a single line mentioning LocalKillJobServiceImpl in the Genie log.

jelez commented

Also, even if we kill a job from the UI, it is killed but the status is FAILED, not KILLED.

It is hard for us to guess what is going on without knowing the details of your command, but the fact that we've never seen this in our (very busy) Genie clusters makes me suspect it is something about your command, environment, or configuration.

First, can you check various logfiles in the job folder (stdout, stderr, genie.log, genie.done, ...) to see if there's something interesting?

If nothing is there, could you try creating a very simple command, e.g. sleep {arg}, and experiment with that?
For example, you can try:

  • sleep 200 with a timeout of 10 -> verify timeout actually kills
  • sleep 200 and kill via UI/API
  • sleep 10 and verify it runs to completion
  • sleep foo and verify it fails
  • etc.

If you see something unexpected with such a simple command, it will tell us that something is off with your OS/environment/configuration, and we can chase down what.
If instead things work as expected using sleep, then we can dig into your command and see why it behaves differently (asynchrony, forking/detaching children maybe?).
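On the forking/detaching point: a classic way children become orphans is that a kill aimed at the run script's PID never reaches processes the script forked. A common mitigation (an illustrative sketch only, not a claim about how Genie launches jobs) is to start the job in its own process group and signal the whole group:

```python
import os
import signal
import subprocess

# Start the child in a new session/process group so one signal reaches every
# descendant (except any that call setsid themselves and detach again).
proc = subprocess.Popen(["sleep", "300"], preexec_fn=os.setsid)

# Later, to kill the job and everything it spawned:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)

# A negative return value means "killed by that signal number".
print("exit status:", proc.wait())  # prints: exit status: -15
```

A process that double-forks and calls setsid (as some YARN client launchers do) escapes even this, which would match the orphaned-on-YARN symptom.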

jelez commented

@mprimi ok, will let you know

And if you would like to increase verbosity, set:

logging:
  level:
    com:
      netflix:
        genie: DEBUG

jelez commented

@mprimi I tried to reproduce the issue manually but could not reproduce it on DEV. However, it definitely did happen, and we were able to see the "run" scripts in the OS process list.

Here are the details of how I tried to execute. The spark-submit command is the same as the one that was failing.

genie_test.py

from __future__ import print_function
import time
import sys
from pyspark.sql import SparkSession
if __name__ == "__main__":
    sleep_time = sys.argv[1]
    spark = SparkSession.builder.getOrCreate()
    print("Sleeping for: {} seconds".format(sleep_time))
    time.sleep(float(sleep_time))
    spark.stop()

Spark Submit Command
as taken from the genie run script

spark-submit --master yarn --deploy-mode client --name genie_test_dag --master yarn --deploy-mode client --conf spark.executor.memory=1G --conf spark.dynamicAllocation.enabled=false --conf spark.driver.memory=1G --conf spark.executor.cores=2 --conf spark.ui.port=9055 --conf spark.hadoop.fs.s3.consistent=false --conf spark.executor.instances=1 --py-files ./genie_test.py genie_test.py 3600

timeout: 60 sleep: 30 - let the job finish gracefully
status:SUCCEEDED, all seems OK here

timeout: 60 sleep: 3600 - Genie must timeout the job and kill it
on timeout the job was killed by Genie and died on YARN, but the Genie status is FAILED <-- is this correct?

	2017-06-30 00:11:00.105 DEBUG 2194 --- [askScheduler-35] c.n.genie.web.tasks.job.JobMonitor       : Job 81143562-5d28-11e7-ba09-06d9bdf7a6fa is still running...
	2017-06-30 00:11:05.118  INFO 2194 --- [askScheduler-35] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[81143562-5d28-11e7-ba09-06d9bdf7a6fa] has timed out
	com.netflix.genie.common.exceptions.GenieTimeoutException: Job has exceeded its timeout time of 2017-06-30T00:11:04.967+0000
			at com.netflix.genie.core.util.UnixProcessChecker.checkProcess(UnixProcessChecker.java:79) ~[genie-core-3.0.7.jar!/:3.0.7]
			at com.netflix.genie.web.tasks.job.JobMonitor.run(JobMonitor.java:135) ~[genie-web-3.0.7.jar!/:3.0.7]
			at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-4.2.8.RELEASE.jar!/:4.2.8.RELEASE]
			at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_131]
			at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_131]
			at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_131]
			at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_131]
			at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
			at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
			at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
	2017-06-30 00:11:05.119 DEBUG 2194 --- [taskExecutor-68] c.n.g.c.j.s.JpaJobSearchServiceImpl      : Called with id 81143562-5d28-11e7-ba09-06d9bdf7a6fa
	2017-06-30 00:11:05.121 DEBUG 2194 --- [taskExecutor-68] c.n.g.c.j.s.JpaJobSearchServiceImpl      : Called with id 81143562-5d28-11e7-ba09-06d9bdf7a6fa
	2017-06-30 00:11:10.130  INFO 2194 --- [askScheduler-35] c.n.genie.web.tasks.job.JobMonitor       : Job 81143562-5d28-11e7-ba09-06d9bdf7a6fa has finished
	2017-06-30 00:11:10.131 DEBUG 2194 --- [taskExecutor-77] c.n.g.c.s.impl.JobStateServiceImpl       : Successfully cancelled job task for job 81143562-5d28-11e7-ba09-06d9bdf7a6fa
	2017-06-30 00:11:10.132 DEBUG 2194 --- [taskExecutor-69] c.n.g.c.j.s.JpaJobSearchServiceImpl      : Called with id 81143562-5d28-11e7-ba09-06d9bdf7a6fa
	2017-06-30 00:11:10.134 DEBUG 2194 --- [taskExecutor-69] c.n.g.w.tasks.job.JobCompletionService   : Updating the status of the job.
	2017-06-30 00:11:10.135 DEBUG 2194 --- [taskExecutor-69] c.n.g.c.j.s.JpaJobPersistenceServiceImpl : Called with id: 81143562-5d28-11e7-ba09-06d9bdf7a6fa, exit code: 143, status: FAILED, status message: Job failed., std out size: 27, std err size 12246

timeout: 60 sleep: 3600
Kill the job from the Genie UI before the timeout --> the job is killed on YARN but the Genie status is FAILED <-- is this correct?

	2017-06-30 00:01:40.568 DEBUG 2194 --- [askScheduler-15] c.n.genie.web.tasks.job.JobMonitor       : Job 26c99b3e-5d27-11e7-ac8f-027b686e326a is still running...
	2017-06-30 00:01:42.388  INFO 2194 --- [nio-8080-exec-1] c.n.g.web.controllers.JobRestController  : [killJob] Called for job id: 26c99b3e-5d27-11e7-ac8f-027b686e326a. Forwarded from: http://genie.xyz.com/api/v3/jobs/26c99b3e-5d27-11e7-ac8f-027b686e326a
	2017-06-30 00:01:42.388  INFO 2194 --- [nio-8080-exec-1] c.n.g.web.controllers.JobRestController  : Job 26c99b3e-5d27-11e7-ac8f-027b686e326a is on this node. Attempting to kill.
	2017-06-30 00:01:42.389 DEBUG 2194 --- [nio-8080-exec-1] c.n.g.c.j.s.JpaJobSearchServiceImpl      : Called with id 26c99b3e-5d27-11e7-ac8f-027b686e326a
	2017-06-30 00:01:42.392 DEBUG 2194 --- [nio-8080-exec-1] c.n.g.c.j.s.JpaJobSearchServiceImpl      : Called with id 26c99b3e-5d27-11e7-ac8f-027b686e326a
	2017-06-30 00:01:43.218 DEBUG 2194 --- [nio-8080-exec-3] c.n.g.web.controllers.JobRestController  : [getJobStatus] Called for job with id: 26c99b3e-5d27-11e7-ac8f-027b686e326a
	2017-06-30 00:01:43.219 DEBUG 2194 --- [nio-8080-exec-3] c.n.g.c.j.s.JpaJobSearchServiceImpl      : Called with id 26c99b3e-5d27-11e7-ac8f-027b686e326a
	2017-06-30 00:01:45.579  INFO 2194 --- [askScheduler-15] c.n.genie.web.tasks.job.JobMonitor       : Job 26c99b3e-5d27-11e7-ac8f-027b686e326a has finished
	2017-06-30 00:01:45.580 DEBUG 2194 --- [taskExecutor-75] c.n.g.c.s.impl.JobStateServiceImpl       : Successfully cancelled job task for job 26c99b3e-5d27-11e7-ac8f-027b686e326a
	2017-06-30 00:01:45.581 DEBUG 2194 --- [taskExecutor-73] c.n.g.c.j.s.JpaJobSearchServiceImpl      : Called with id 26c99b3e-5d27-11e7-ac8f-027b686e326a
	2017-06-30 00:01:45.584 DEBUG 2194 --- [taskExecutor-73] c.n.g.w.tasks.job.JobCompletionService   : Updating the status of the job.
	2017-06-30 00:01:45.585 DEBUG 2194 --- [taskExecutor-73] c.n.g.c.j.s.JpaJobPersistenceServiceImpl : Called with id: 26c99b3e-5d27-11e7-ac8f-027b686e326a, exit code: 143, status: FAILED, status message: Job failed., std out size: 27, std err size 12246

timeout: 60 sleep: foo
pass bad argument and have it fail --> status:FAILED
all seems OK here
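One note on the FAILED-vs-KILLED question: the exit code 143 persisted in the logs above is the conventional 128 + signal-number encoding, i.e. death by SIGTERM (signal 15). That is at least consistent with the process having been signalled rather than failing on its own, which Genie then records as FAILED. A quick sanity check of the arithmetic:

```python
import signal

# Shells and the JVM report a death-by-signal exit status as 128 + signo.
exit_code = 143
signo = exit_code - 128
assert signo == signal.SIGTERM  # SIGTERM is 15 on Linux
print(f"exit code {exit_code} => killed by signal {signo} (SIGTERM)")
```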

Please let me know what information we can provide once this happens again. Would jstack or jmap dumps help?

To clarify, in cases of timeout or user kill, do you still see the runaway process (i.e. something that is supposed to be killed but is left running)?

As for final status FAILED when you would expect KILLED (by user or by monitor), there is a known issue here:

  • When Genie decides to kill, it sends a nice kill signal
  • If the process is still alive after 30 seconds, it kills with -9
    In the latter case (i.e. the process did not respond to the nice kill by shutting down within 30 seconds), the final status is FAILED rather than KILLED.

We are debating whether this makes sense semantically; it was introduced in a recent change, but that's a different story.
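The escalation described above can be sketched roughly as follows (illustrative Python assuming the 30-second grace period mentioned; the real logic lives in Genie's Java code):

```python
import os
import signal
import time


def kill_gracefully(pid, grace_seconds=30):
    """Send SIGTERM; escalate to SIGKILL if the process outlives the grace period."""
    try:
        os.kill(pid, signal.SIGTERM)  # the "nice" kill
    except ProcessLookupError:
        return  # already gone
    deadline = time.time() + grace_seconds
    while time.time() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the PID still exists
        except ProcessLookupError:
            return  # exited in response to SIGTERM
        time.sleep(0.5)
    try:
        os.kill(pid, signal.SIGKILL)  # kill -9: cannot be caught or ignored
    except ProcessLookupError:
        pass  # lost the race; it exited just before the escalation
```

Under this scheme, a job that ignores SIGTERM ends up on the SIGKILL path, and that is the branch that currently reports FAILED instead of KILLED.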

In your case, can we confirm that's what's going on by looking at genie.log?

I would still recommend stripping away Spark and testing with the simplest possible command first (/bin/sleep).
This would be interesting because if this doesn't work, then it means Genie is not behaving well.
If instead it works as expected, then something is going on with your Spark command.

jelez commented

@mprimi I am mostly concerned about orphaned EMR processes which Genie considers failed. We observed this on one Genie instance, and I assume it is a result of somehow overloading it. I will let you know how our tests go and whether we continue to observe this.

Indeed, that would be really bad.
Hence the suggestion to start with a very simple command like sleep.
If something as simple as sleep doesn't work, we dig down (Genie, OS, settings, ...).
If sleep works and your command doesn't then we dig up, i.e. look more into your command.
Even in the latter case, we may discover some new edge case that Genie does not currently handle and want to fix.

jelez commented

@mprimi As I mentioned, even the Spark command works as expected, so I do not see the point of trying /bin/sleep: it will work, and in most cases our setup works too.
It is in exceptional cases (which are frequent for us) that we see this behavior.

As a matter of fact, we just saw this happening at scale on Genie 3.0.8.
The processes below are from a Genie node, but the Genie UI has marked this job as FAILED.

 17227 ?        Ss     0:00 su -s /bin/sh -c exec "start-genie.sh" hadoop TERM=linux PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin PWD=/ ...
 17228 ?        Ssl   25:28  \_ java -Djava.io.tmpdir=/opt/genie/tmp -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Addresses=true -jar /opt/genie/lib/genie-app-3.0.8.jar --spring.config.location=/opt/genie/config/ TERM=linux SHELL=/bin/sh JAVA_OPTS=-Djava.io.tmpdir=/opt/genie/tmp -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Addresses=true USER=hadoop PATH=/usr/local/sbin:/usr/local/...
 56394 ?        Ss     0:00      \_ bash /opt/genie/tmp/jobs/8b4d26ec-5d0f-11e7-ab5f-0266396ed958/run TERM=linux SHELL=/bin/sh JAVA_OPTS=-Djava.io.tmpdir=/opt/genie/tmp -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Addresses=true USER=hadoop PATH=/usr/loc...
 56993 ?        Sl     5:00      |   \_ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.131-2.b11.30.amzn1.x86_64/jre/bin/java -cp /usr/lib/hadoop-lzo/lib/*:/opt/genie/tmp/jobs/8b4d26ec-5d0f-11e7-ab5f-0266396ed958/genie/applications/hadoop-2.7.3/dependencies/hadoop-2.7.3/...
 57489 ?        Sl     0:07      |       \_ python /mydir/myapp.py PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin:/opt/genie/tmp/jobs/8b4d26ec-5d0f-11e7-ab5f-0266396ed958/genie/applications/spark-2.1.0/dependenc...
 59183 ?        Rl   1537:08     |       \_ python -m xxx...
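For hunting down runaways like these, a small helper that walks /proc and lists every descendant of the Genie server PID (17228 in the tree above) makes it easy to spot job children that should already be dead. This is an illustrative, Linux-only sketch, not part of Genie:

```python
import os


def descendant_pids(root_pid):
    """Return all descendant PIDs of root_pid by walking /proc (Linux only)."""
    children = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # /proc/<pid>/stat layout: pid (comm) state ppid ...
        # comm may contain spaces, so parse from the last ')' to get ppid safely.
        ppid = int(stat[stat.rindex(")") + 2:].split()[1])
        children.setdefault(ppid, []).append(int(entry))
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found
```

Running this against the server PID after a job is marked FAILED would show immediately whether the run script or any of its children survived the kill.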

2017-06-29 21:11:26.592  INFO 17228 --- [io-8080-exec-16] c.n.g.web.controllers.JobRestController  : [submitJob] Called multipart method to submit job: {"id":"8b4d26ec-5d0f-11e7-ab5f-0266396ed958","created":null,"updated":null,"tags":[...
2017-06-29 21:11:26.592  INFO 17228 --- [io-8080-exec-16] c.n.g.web.controllers.JobRestController  : Saving attachments for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:26.596  INFO 17228 --- [io-8080-exec-16] c.n.g.c.s.i.JobCoordinatorServiceImpl    : Called to schedule job launch for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:26.603  INFO 17228 --- [io-8080-exec-16] c.n.g.c.s.i.JobCoordinatorServiceImpl    : Selecting cluster for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:26.611  INFO 17228 --- [io-8080-exec-16] c.n.g.c.s.i.JobCoordinatorServiceImpl    : Selected cluster Optional[j-XXXXXXXXXXXXX] for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:26.611  INFO 17228 --- [io-8080-exec-16] c.n.g.c.s.i.JobCoordinatorServiceImpl    : Selecting command attached to cluster j-XXXXXXXXXXXXX for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:26.613  INFO 17228 --- [io-8080-exec-16] c.n.g.c.s.i.JobCoordinatorServiceImpl    : Selected command Optional[spark-2.1.0-submit] for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:26.613  INFO 17228 --- [io-8080-exec-16] c.n.g.c.s.i.JobCoordinatorServiceImpl    : Selecting applications for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958 and command spark-2.1.0-submit
2017-06-29 21:11:26.614  INFO 17228 --- [io-8080-exec-16] c.n.g.c.s.i.JobCoordinatorServiceImpl    : Selected applications Optional[hadoop-2.7.3,spark-2.1.0] for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:26.624  INFO 17228 --- [io-8080-exec-16] c.n.g.c.s.i.JobCoordinatorServiceImpl    : Checking if can run job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] on this node
2017-06-29 21:11:26.624  INFO 17228 --- [io-8080-exec-16] c.n.g.c.s.i.JobCoordinatorServiceImpl    : Job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958 can run on this node as only 26624/30720 MB are used and requested 1024 MB
2017-06-29 21:11:26.624  INFO 17228 --- [io-8080-exec-16] c.n.g.c.s.i.JobCoordinatorServiceImpl    : Publishing job scheduled event for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:26.627  INFO 17228 --- [askScheduler-54] c.n.g.core.services.impl.LocalJobRunner  : Beginning local job submission for {"id":"8b4d26ec-5d0f-11e7-ab5f-0266396ed958","created":null,"updated":null,"tags":["...
2017-06-29 21:11:26.627  INFO 17228 --- [askScheduler-54] c.n.g.core.services.impl.LocalJobRunner  : Created job dir /opt/genie/tmp/jobs/8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:26.627  INFO 17228 --- [askScheduler-54] c.n.g.core.services.impl.LocalJobRunner  : Created run script /opt/genie/tmp/jobs/8b4d26ec-5d0f-11e7-ab5f-0266396ed958/run
2017-06-29 21:11:26.627  INFO 17228 --- [askScheduler-54] c.n.g.core.services.impl.LocalJobRunner  : Executing job workflow for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:26.627  INFO 17228 --- [askScheduler-54] .j.w.i.JobFailureAndKillHandlerLogicTask : Starting Job Failure and Kill Handler Task for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:26.627  INFO 17228 --- [askScheduler-54] .j.w.i.JobFailureAndKillHandlerLogicTask : Finished Job Failure and Kill Handler Task for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:26.627  INFO 17228 --- [askScheduler-54] c.n.g.c.j.w.impl.InitialSetupTask        : Starting Initial Setup Task for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:26.627  INFO 17228 --- [askScheduler-54] c.n.g.c.j.w.impl.InitialSetupTask        : Finished Initial Setup Task for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:26.627  INFO 17228 --- [askScheduler-54] c.n.g.c.jobs.workflow.impl.ClusterTask   : Starting Cluster Task for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:26.717  INFO 17228 --- [askScheduler-54] c.n.g.c.jobs.workflow.impl.ClusterTask   : Finished Cluster Task for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:26.717  INFO 17228 --- [askScheduler-54] c.n.g.c.j.workflow.impl.ApplicationTask  : Starting Application Task for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:27.265  INFO 17228 --- [askScheduler-54] c.n.g.c.j.workflow.impl.ApplicationTask  : Finished Application Task for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:27.265  INFO 17228 --- [askScheduler-54] c.n.g.c.jobs.workflow.impl.CommandTask   : Starting Command Task for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:27.280  INFO 17228 --- [askScheduler-54] c.n.g.c.jobs.workflow.impl.CommandTask   : Finished Command Task for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:27.280  INFO 17228 --- [askScheduler-54] c.n.g.core.jobs.workflow.impl.JobTask    : Starting Job Task for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:27.350  INFO 17228 --- [askScheduler-54] c.n.g.core.jobs.workflow.impl.JobTask    : Finished Job Task for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:27.350  INFO 17228 --- [askScheduler-54] c.n.g.c.j.workflow.impl.JobKickoffTask   : Starting Job Kickoff Task for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:27.351  INFO 17228 --- [askScheduler-54] c.n.g.c.j.workflow.impl.JobKickoffTask   : Finished Job Kickoff Task for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:27.351  INFO 17228 --- [askScheduler-54] c.n.g.core.services.impl.LocalJobRunner  : Finished Executing job workflow for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:27.351  INFO 17228 --- [askScheduler-54] c.n.g.core.services.impl.LocalJobRunner  : Saving job execution for job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:11:27.365  INFO 17228 --- [askScheduler-54] c.n.g.core.services.impl.LocalJobRunner  : Publishing job started event for job 8b4d26ec-5d0f-11e7-ab5f-0266396ed958
2017-06-29 21:11:27.366  INFO 17228 --- [taskExecutor-68] c.n.g.w.t.job.JobMonitoringCoordinator   : Scheduled job monitoring for Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958]
2017-06-29 21:26:31.423  INFO 17228 --- [askScheduler-77] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:26:36.448  INFO 17228 --- [askScheduler-66] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:26:41.474  INFO 17228 --- [taskScheduler-4] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:26:46.498  INFO 17228 --- [askScheduler-44] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:26:51.523  INFO 17228 --- [taskScheduler-9] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:26:56.550  INFO 17228 --- [askScheduler-49] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:01.573  INFO 17228 --- [askScheduler-67] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:06.598  INFO 17228 --- [askScheduler-63] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:11.620  INFO 17228 --- [askScheduler-54] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:16.644  INFO 17228 --- [askScheduler-40] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:21.667  INFO 17228 --- [askScheduler-81] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:26.689  INFO 17228 --- [askScheduler-57] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:31.705  INFO 17228 --- [askScheduler-13] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:36.727  INFO 17228 --- [askScheduler-30] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:41.752  INFO 17228 --- [askScheduler-38] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:46.781  INFO 17228 --- [askScheduler-50] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:51.810  INFO 17228 --- [askScheduler-80] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:27:56.835  INFO 17228 --- [askScheduler-46] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:28:01.862  INFO 17228 --- [askScheduler-27] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:28:06.889  INFO 17228 --- [askScheduler-41] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:28:11.915  INFO 17228 --- [askScheduler-52] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out
2017-06-29 21:28:16.939  INFO 17228 --- [taskScheduler-1] c.n.genie.web.tasks.job.JobMonitor       : Job Optional[8b4d26ec-5d0f-11e7-ab5f-0266396ed958] has timed out

Please let me know how you would approach this.

jelez commented

One thing I see repeated continuously in the logs are the lines below. I do not believe they have anything to do with this issue, but decided to post them here:

2017-06-29 03:52:24.518 ERROR 17228 --- [io-8080-exec-18] c.n.g.w.c.GenieExceptionMapper           : A job with id 5994121e-5c7e-11e7-a519-0afab93cad90 already exists
2017-06-29 03:52:24.531 ERROR 17228 --- [io-8080-exec-12] c.n.g.w.c.GenieExceptionMapper           : A job with id 55480882-5c7e-11e7-b9a5-06e23ce966f0 already exists
2017-06-29 03:52:24.537 ERROR 17228 --- [nio-8080-exec-2] c.n.g.w.c.GenieExceptionMapper           : A job with id 55472ee4-5c7e-11e7-b9a5-06e23ce966f0 already exists
2017-06-29 03:52:24.552 ERROR 17228 --- [io-8080-exec-19] c.n.g.w.c.GenieExceptionMapper           : A job with id 5b335e9a-5c7e-11e7-936f-065c7bc9931e already exists
2017-06-29 03:52:33.477 ERROR 17228 --- [io-8080-exec-18] c.n.g.w.c.GenieExceptionMapper           : Job 704aec26-5c7e-11e7-9e0e-02f230638ee6 can't run on this node 30720/30720 MB are used and requested 1024 MB
2017-06-29 03:52:33.543 ERROR 17228 --- [io-8080-exec-10] c.n.g.w.c.GenieExceptionMapper           : A job with id 704b4914-5c7e-11e7-9e0e-02f230638ee6 already exists
2017-06-29 03:52:33.546 ERROR 17228 --- [nio-8080-exec-4] c.n.g.w.c.GenieExceptionMapper           : Job 699b6342-5c7e-11e7-ab5b-0266396ed958 can't run on this node 30720/30720 MB are used and requested 1024 MB
2017-06-29 03:52:33.555 ERROR 17228 --- [io-8080-exec-17] c.n.g.w.c.GenieExceptionMapper           : Job 5ab1b8a4-5c7e-11e7-b9a5-06e23ce966f0 can't run on this node 30720/30720 MB are used and requested 1024 MB
2017-06-29 03:52:33.560 ERROR 17228 --- [nio-8080-exec-6] c.n.g.w.c.GenieExceptionMapper           : Job 5f0c32ee-5c7e-11e7-a519-0afab93cad90 can't run on this node 30720/30720 MB are used and requested 1024 MB

If this is not working reliably, we are definitely interested in figuring out what is going on and fixing it.
We have a large and busy deployment and are not seeing this issue; our tests also pass reliably. So we need your help, if you have the time, to debug.

There are 3 categories of things that are different about your deployment:

  1. OS and OS configuration
  2. Genie configuration
  3. Command and command binaries

In order to start isolating variables, I think /bin/sleep would be great because it's a known quantity and it basically takes number 3 above out of the picture.
If you're seeing this error with sleep, we can dig into what situation Genie is not handling correctly in your specific setup and forget about looking into Spark.

To save you some time, here are some curl commands that set up and launch a sleep command:

BASE_GENIE_URL='https://genie.example.com'

ADMIN_TOKEN=`echo ?`;

echo "Obtained token: $ADMIN_TOKEN"

# List jobs
curl -X GET "${BASE_GENIE_URL}/api/v3/jobs" -H "Authorization: Bearer ${ADMIN_TOKEN}" | jq

# Delete all commands, clusters, applications
curl "${BASE_GENIE_URL}/api/v3/commands/my-sleep-cmd/applications" -i -X DELETE -H "Authorization: Bearer ${ADMIN_TOKEN}"
curl "${BASE_GENIE_URL}/api/v3/clusters/dummy-cluster-id/commands" -i -X DELETE -H "Authorization: Bearer ${ADMIN_TOKEN}"
curl "${BASE_GENIE_URL}/api/v3/commands/my-sleep-cmd" -i -X DELETE -H "Authorization: Bearer ${ADMIN_TOKEN}"
curl "${BASE_GENIE_URL}/api/v3/clusters/dummy-cluster-id" -i -X DELETE -H "Authorization: Bearer ${ADMIN_TOKEN}"
curl "${BASE_GENIE_URL}/api/v3/applications/my-sleep-app" -i -X DELETE -H "Authorization: Bearer ${ADMIN_TOKEN}"

# Create an application
curl -v "${BASE_GENIE_URL}/api/v3/applications" -i  -H 'Content-Type: application/json' -H "Authorization: Bearer ${ADMIN_TOKEN}" -d '{
  "id" : "my-sleep-app",
  "created" : null,
  "updated" : null,
  "tags" : [ "type:test" ],
  "version" : "1.0.0",
  "user" : "genie",
  "name" : "test-app",
  "description" : "App that does nothing",
  "setupFile" : null,
  "configs" : null,
  "dependencies" : null,
  "status" : "ACTIVE",
  "type" : null
}'

# Create cluster
curl "${BASE_GENIE_URL}/api/v3/clusters" -i -X POST -H 'Content-Type: application/json' -H "Authorization: Bearer ${ADMIN_TOKEN}" -d '{
  "id" : "dummy-cluster-id",
  "created" : null,
  "updated" : null,
  "tags" : [ ],
  "version" : "2.7.1",
  "user" : "genie",
  "name" : "dummy-cluster",
  "description" : null,
  "setupFile" : null,
  "configs" : [ ],
  "status" : "UP"
}'

# Create a command (sleep)
curl -v "${BASE_GENIE_URL}/api/v3/commands" -i -H 'Content-Type: application/json' -H "Authorization: Bearer ${ADMIN_TOKEN}" -d '{
  "id" : "my-sleep-cmd",
  "created" : null,
  "updated" : null,
  "tags" : [ ],
  "version" : "1.0.0",
  "user" : "genie",
  "name" : "sleep",
  "description" : null,
  "setupFile" : null,
  "configs" : [ ],
  "status" : "ACTIVE",
  "executable" : "/bin/sleep",
  "checkDelay" : 10000,
  "memory" : null
}'

# Associate commands to the dummy-cluster
curl "${BASE_GENIE_URL}/api/v3/clusters/dummy-cluster-id/commands" -i -X PUT -H 'Content-Type: application/json' -H "Authorization: Bearer ${ADMIN_TOKEN}" -d '[ "my-sleep-cmd" ]'

# Associate my-sleep-cmd to my-app
curl "${BASE_GENIE_URL}/api/v3/commands/my-sleep-cmd/applications" -i -H 'Content-Type: application/json' -H "Authorization: Bearer ${ADMIN_TOKEN}" -d '[ "my-sleep-app" ]'

# Submit a job that uses my-sleep-cmd
curl -v "${BASE_GENIE_URL}/api/v3/jobs" -i -H 'Content-Type: application/json' -H "Authorization: Bearer ${ADMIN_TOKEN}" -d '{
  "id" : null,
  "created" : null,
  "updated" : null,
  "tags" : [ ],
  "version" : "1.0",
  "user" : "genie",
  "name" : "Test job",
  "description" : "Test job that just sleeps",
  "setupFile" : null,
  "commandArgs" : "120",
  "clusterCriterias" : [ {"tags" : ["genie.name:dummy-cluster"]} ],
  "commandCriteria" : [ "genie.name:sleep" ],
  "group" : null,
  "disableLogArchival" : true,
  "email" : null,
  "cpu" : null,
  "memory" : null,
  "timeout" : null,
  "dependencies" : [ ],
  "applications" : [ ]
}'

jelez commented

I will try to have someone recreate this on a cluster and start blasting it with sleeps. But again, we are seeing this issue only sporadically; as a matter of fact, we have not seen it in the last several days on the same machines where we used to see it, with practically no changes in the environment.