kolide/launcher

Retry launching osquery instance on failure

RebeccaMahany opened this issue · 0 comments

When the osquery runner cannot launch an osquery instance, it currently returns an error, which shuts down launcher entirely.

Looking over the logs and past issues we've investigated, I see two primary errors: 1) "timeout waiting for osqueryd to create socket", indicating that the osquery process did not start up, and 2) "could not create an extension client", where the socket file does not exist or the connection is refused.

In both of these cases, restarting launcher is overkill, and can even be detrimental to solving the issue. For example, we sometimes see these errors when the current osquery version is old and not compatible with the current database; restarting launcher in this case is actively harmful because it resets the autoupdate delay, preventing a newer osquery version from being downloaded.

So! We want to change the runner behavior to repeatedly retry starting osquery instances and not exit from the runner.

  1. If osquery instance launch fails, retry launching the instance -- potentially with backoff
  2. If osquery instance launch fails, also consider triggering an autoupdate check for osquery
  3. Runner should still be responsive to calls to Shutdown