graphops/launchpad-charts

Feat req: `terminationGracePeriodSeconds: 0` for non-query graph-nodes

aasseman opened this issue · 11 comments

Helps with the very annoying issue of graph-node ignoring SIGTERM: graphprotocol/graph-node#4712

Since there is no data risk with SIGKILLing block ingestor as well as index nodes, their terminationGracePeriodSeconds could be set to 0 without any downsides.

Not query nodes however, as it would kill currently processing GraphQL queries.

Hi @aasseman, sorry for the late reply. Still not sure if we'll change the graphNodeDefaults, but we will at least change our default indexer group and block ingestor groups to set it to 0.

In case you may not be aware, you can easily override that parameter on a per NodeGroup like so:

graphNodeGroups:
  index:
    terminationGracePeriodSeconds: "0"

or in your case it would make more sense to override it to "0" at graphNodeDefaults, and then set it for the query node groups to "60" or some other value you deem fit. (actually the env var overrides per Node Group working as intended was bugged for a bit, but fixed already on the latest stable chart some weeks back).

Oh good to know! Thanks for the tip!

I tried with graphNodeDefaults, and somehow that doesn't affect the statefulsets at all.
It works if I apply for the groups, but not with a 0 value (be it with or without quotes). I set to 1 and it works.

@aasseman can you provide a minimal reproducible example? (sorry to bother you) but it seems to work for me (by looking at templating results with different values) and it's hard (for me at least at the moment) to come up with reasons for templating to be working with "1" but not with "0", all else remaining equal. Or are you evaluating the "not working" by the workload not exhibiting the expected behavior even though the env var is being set correctly? Because in that case the issue may reside outside the helm chart templating part

Sure the 0 vs 1 doesn't really matter.
But setting the value through graphNodeDefaults does not work in the sense that it doesn't do anything at all to the statefulset.

By chance are you using the graph-node-0.4.3-canary.2 build? something more recent than latest stable. On that one (coming from this branch: #288) we were leaving the graphNodeDefaults at "60", but overriding it at the two default indexing node groups level (setting index and block-ingestor both to "0"). That means that on that build graphNodeDefaults would only impact the default query node group (or any of you own defined node groups) as that one isn't set at node group level (which takes precedence). Come to think of it, think we'll need a better solution for this particular case.

but on latest main (stable), using this values file:

graphNodeDefaults:
  terminationGracePeriodSeconds: "3"

I get:

$ helm template . -f test.yaml | grep Grace                                                                    ✭main 
      terminationGracePeriodSeconds: 3
      terminationGracePeriodSeconds: 3
      terminationGracePeriodSeconds: 3
$ helm version
version.BuildInfo{Version:"v3.13.3", GitCommit:"c8b948945e52abba22ff885446a1486cb5fd3474", GitTreeState:"clean", GoVersion:"go1.20.11"}

I'm on v0.4.0. I tested again to make sure. I set:

graphNodeDefaults:
  terminationGracePeriodSeconds: 1

And I get

terminationGracePeriodSeconds: 30

In the resulting statefulset.

It's not a big issue for me though, I'm fine with setting this in the groups.

@aasseman that explains it as v0.4.1 (https://github.com/graphops/launchpad-charts/releases/tag/graph-node-0.4.1) is the one that brings the fix I mentioned about this logic being broken for a while but having been fixed "a few weeks ago" (early April).
In the meantime I've been pondering on how to best address this while not making things behave in tricky/unexpected ways and think I'm going to add a "terminationGracePeriodSeconds_QueryNodes", and use that as the default for any node group that is meant for query nodes (what identifies what's a query node will be based on the env var for "role", which is mandatory anyway). That way people will be able to manage this use-case better and won't be as heavy on "hidden black magic that behaves in unexpected ways" as the alternatives I was coming up with before to try and improve this. (this will be done and released in the next hours)

@aasseman check this release out: https://github.com/graphops/launchpad-charts/releases/tag/graph-node-0.4.3

It has two different defaults you can set now for terminationGracePeriod, where one will affect query nodes specifically, and it also allows you to override "kind" and use a Deployment as per this request: #269.

Please let us know if it doesn't fix your issues :)

@aasseman check this release out: https://github.com/graphops/launchpad-charts/releases/tag/graph-node-0.4.3

It has two different defaults you can set now for terminationGracePeriod, where one will affect query nodes specifically, and it also allows you to override "kind" and use a Deployment as per this request: #269.

Please let us know if it doesn't fix your issues :)

Just tested, works great. Thanks!