getgauge/gauge

More intelligently detect if a plugin should be killed, rather than waiting for X seconds

Closed this issue · 14 comments

Sometimes, a Gauge plugin takes a long time to run. Gauge makes an effort to not freeze the user's machine and so attempts to preemptively kill a frozen plugin.

Expected behavior

Gauge checks if a plugin is still doing work, such as by examining the time stamp of log files. If there has been work in the last few seconds, assume that the plugin is still active.

Actual behavior

Gauge forcibly kills a plugin after the interval set by the plugin_kill_timeout property (default: 4000 ms, i.e. 4 seconds), regardless of whether the plugin is actively doing work. The code for this begins on this line in plugin/plugin.go.

This has especially caused issues with the HTML Report plugin on large projects (see HTML Report #120, #121, #153, as well as this project's #636). Theoretically, it could happen with any plugin, but so far, no users have reported that other plugins have run for longer than the default 4 seconds.

Gauge version

Gauge version: 1.0.0
Commit Hash: 5a99965

Plugins
-------
csharp (0.10.3)
html-report (4.0.5)
java (0.6.8)
json-report (0.2.1)
screenshot (0.0.1)
xml-report (0.2.0)
sriv commented

Thanks for bringing this up, @Thunderforge

Here is a proposal:

  • The plugins send a "keep-alive" ping back to gauge every N seconds (N can be X-1).
  • Gauge will reset the timer when it receives a "keep-alive" ping.
  • This should result in gauge killing the plugin only if it does not hear anything for over the plugin_kill_timeout interval.

Thoughts?

@sriv That sounds like an excellent way to do it. Obviously, it would mean updating all the plugins to send a "keep-alive" ping, but I think that in the long run, this is a better solution.

sriv commented

tech notes

There are (at least?) two possible ways to implement a "keep-alive":

Gauge core orchestrated

  1. gauge sends a "should I keep you alive?" request
  2. plugin responds with a "yes, please!"
  3. gauge resets the timer
  4. not done? - goto (1)
  5. the plugin is killed when it fails to respond to "should I keep you alive".

plugin initiated

  1. plugin knows it will be killed after the plugin_kill_timeout interval.
  2. plugin sends a "wait I am not done yet" request to gauge just before the timeout is about to expire.
  3. gauge resets the timeout timer.
  4. not done? - goto (2)

/cc @getgauge/core - please add your thoughts.

If we go with the Gauge core initiated approach, then plugins need to have additional metadata (e.g. capabilities) that Gauge should honour before sending these requests.

I prefer the first approach, where Gauge core sends a "should I keep you alive" request to the plugins. This seems more in line with our current communication approach, where communication goes from Gauge to the plugins and plugins only respond to requests from Gauge.
But this also means that the plugins need to keep listening for requests sent by Gauge. A single-threaded approach may not work here, as the plugin may not be able to respond to Gauge immediately.

The fix should be available in nightly >= 8-2-2019

@Thunderforge Did you get a chance to try this out after the fix?


Able to reproduce this with the below nightly version.
Commit Hash: 9328b57

Plugins
-------
html-report (4.0.9.nightly-2019-03-26)

@Debashis9012 I have not had a chance to test it, and likely won't be able to any time soon.

This seems to be working as expected.
Tried first with gauge versions

Gauge version: 1.0.4
Commit Hash: 3a9a647

Plugins
-------
html-report (4.0.8)
js (2.3.5.nightly-2019-02-25)
screenshot (0.0.1)

The suite had approximately 80 specs and 120 scenarios.
I changed the plugin_kill_timeout to 500 (0.5 seconds).
When gauge run was executed, the html-report plugin failed with: Plugin [Html Report] with pid [27204] did not exit after 0.50 seconds. Forcefully killing it.

Later changed the Gauge version to

Gauge version: 1.0.5.nightly-2019-04-04
Commit Hash: f41dccf

Plugins
-------
html-report (4.0.8)
js (2.3.5.nightly-2019-02-25)
screenshot (0.0.1)

When gauge run was executed, the html-report plugin succeeded in creating the reports.

@Thunderforge could you please give the fix version a try and let us know whether it's working for you or not?

@Debashis9012 Unfortunately, I am no longer in a position where I can test this. So please proceed with QA without me.

The fix should be available in nightly >= 26-4-2019

I tried running a large project on fresh machines (both Windows and Mac) against the fix version 26-4-2019.
Observations:

  1. The first run works absolutely fine, without any plugin kill timeout error.
    After the first run, I see the plugin kill timeout error on every run.
  2. If plugin_kill_timeout is set to 4000, which is the default value, then after execution completes the plugin takes the time it needs to generate the HTML report.
  3. If plugin_kill_timeout is set to any value other than the default, the error is thrown immediately.

Kindly find the below console output:

Plugin [Html Report] with pid [9527] did not exit after 4.00 seconds. Forcefully killing it.
Specifications: 961 executed    961 passed      0 failed        699 skipped
Scenarios:      1919 executed   1919 passed     0 failed        1401 skipped 

I don't see this issue with the latest (master) gauge and html-report.
So I am closing this issue for now.