Cluster always self destructs after 30 minutes.

Question

torkjel opened this issue 7 years ago · 7 comments

The spydra/init-actions/self-destruct.sh script installs the self-destruct cron job on both the master and the 0th worker node. However, only the master receives the heartbeat updates, so the 0th worker always kills the cluster once the collector timeout is reached.
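To make the failure mode concrete, here is a toy simulation in Python. It is not Spydra's code: the `Node` class, the `first_self_destruct` function, the node names, and the 60-second tick are all hypothetical, and the 1800-second timeout is only taken from the 30 minutes in the title. It just models two nodes that each check for a stale heartbeat while heartbeats reach only one of them.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node running a self-destruct check."""
    name: str
    last_heartbeat: int = 0  # seconds since cluster start


def make_cluster():
    # Both nodes have the self-destruct job installed, as described in the report.
    return [Node("master"), Node("worker-0")]


def first_self_destruct(nodes, heartbeat_targets, timeout_s, tick_s=60, horizon_s=7200):
    """Return (time_s, node_name) of the first node to trigger, or None."""
    for now in range(0, horizon_s + 1, tick_s):
        # Heartbeats are delivered only to the nodes listed in heartbeat_targets.
        for node in nodes:
            if node.name in heartbeat_targets:
                node.last_heartbeat = now
        # Every node with the job installed checks whether its heartbeat is stale.
        for node in nodes:
            if now - node.last_heartbeat > timeout_s:
                return now, node.name
    return None


if __name__ == "__main__":
    # Heartbeats reach the master only: the worker's check eventually fires.
    print(first_self_destruct(make_cluster(), {"master"}, timeout_s=1800))
    # Heartbeats reach both nodes: nothing fires within the horizon.
    print(first_self_destruct(make_cluster(), {"master", "worker-0"}, timeout_s=1800))
```

In this model the cluster survives only if every node that runs the check also receives heartbeats, or if the check runs on the master alone.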

Answer 1 · 2018-01-11T08:39:26.000Z

Hi torkjel,

Thanks for reporting. I ran into a similar issue recently. Could you confirm whether this issue still appears if you pin the Dataproc image version to 1.1 by adding the following to the cluster section of your Spydra configuration file: "image-version": "1.1.24"?

Answer 2 · 2018-01-11T09:32:38.000Z

Hi Steffen,

I can't see how the image version would matter... There are two self-destruct cron jobs running, but Spydra only sends heartbeats to the one on the master. The DynamicSubmitter.Heartbeater code is not aware that a second self-destruct job exists on the 0th worker.

Answer 3 · 2018-01-11T09:40:49.000Z

Yeah, I agree with you, it looks like it is broken. Interestingly, it doesn't affect our jobs using the 1.1.24 image, but that might be caused by something else. We are looking into it.

Answer 4 · 2018-01-11T09:55:28.000Z

Maybe you're running with a large enough timeout so you don't encounter it that much? That's my workaround for now. Thanks for looking into it!

Answer 5 · 2018-01-12T14:12:09.000Z

@torkjel I just merged a fix for the issue. There's no release yet, but feel free to test it and let us know if it helps.

The caveat of the solution is that the total size of the metadata for a project and/or instance is limited to 512 KB, so in theory you can reach that limit by spawning a lot of clusters. I believe it's still a good enough solution for now and should work for most workloads.
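For a rough feel of how far that limit stretches, here is a back-of-the-envelope sketch in Python. It assumes each cluster adds one key/value pair to the metadata and that 512 KB means 512 * 1024 bytes. The `max_entries` helper and the entry sizes are hypothetical and are not taken from the fix itself.

```python
# Limit quoted above, read as 512 * 1024 bytes (an assumption).
METADATA_LIMIT_BYTES = 512 * 1024


def max_entries(key_bytes, value_bytes, limit_bytes=METADATA_LIMIT_BYTES):
    """Estimate how many key/value pairs of the given sizes fit under the limit."""
    entry_bytes = key_bytes + value_bytes
    if entry_bytes <= 0:
        raise ValueError("entry size must be positive")
    return limit_bytes // entry_bytes


if __name__ == "__main__":
    # Hypothetical sizes: a 64-byte key and a 32-byte value per cluster.
    print(max_entries(key_bytes=64, value_bytes=32))
```

With sizes in that range the estimate comes out in the thousands of entries, and any other metadata already stored would lower it.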

Answer 6 · 2018-01-16T15:03:20.000Z

@torkjel a fix was released in 0.3.15.

Answer 7 · 2018-02-28T17:19:05.000Z

Sorry for not getting back to this issue sooner. 0.3.15 works exactly as expected in this regard. Thanks!