kolide/launcher

Desktop process stuck in state where all client requests return "connection refused"


On Dec 11, a device began repeatedly reporting "connection refused" when calling the /refresh endpoint of the desktop process for user 502. It stayed in this state through Dec 12, when launcher also "failed" to send notifications to user 502.

Notifications did go through for user 501. @James-Pickett suggests that launcher didn't handle user switching appropriately. When looking through similar logs in Cloud Log, we noticed that this "connection refused" error appears pretty regularly when a user exists with a UID greater than 501 -- i.e. on devices with multiple user accounts.

We should investigate how devices are getting stuck in this state, and figure out an appropriate way to remediate the issue.

Example log:

{
    "time":"2024-12-11T18:52:00.121731Z",
    "level":"ERROR",
    "source":{
        "function":"github.com/kolide/launcher/ee/desktop/runner.(*DesktopUsersProcessesRunner).refreshMenu",
        "file":"/Users/runner/work/launcher/launcher/ee/desktop/runner/runner.go",
        "line":553
    },
    "msg":"sending refresh command to user desktop process",
    "component":"desktop_runner",
    "uid":"502",
    "pid":28623,
    "path":"/usr/bin/sudo",
    "err":"Get \"http://unix/refresh\": dial unix /var/kolide-k2/k2device.kolide.com/desktop_502/desktop.sock_4944: connect: connection refused"
}

My hunch is that this is slightly misreported. I can't imagine the "connection refused" error itself is wrong, but I could believe that in addition to the 40+ connection refused errors, there are 40+ connections that should have been flagged as successes.

I noticed that the UID was 502, typically the second user created on macOS. I wonder if there is some user switching at play here that caused things to go haywire.

We've solved at least some of the mystery:

  • launcher attempts to send the notification to all user processes -- here, we had a process for the 501 user and the 502 user
  • sending the notification to the 501 user succeeded
  • sending the notification to the 502 user failed
  • launcher counts that as a failure to notify overall -- and so launcher retried in one minute

We will update to count the above scenario as successful rather than failed, which will fix the behavior that prompted filing this issue.

I'm leaving the issue open because I think it's still useful to track down why devices get stuck in a state with all desktop process requests for a particular user being "connection refused" -- like James mentioned, maybe something went wrong with user switching, and we had a desktop process still extant that should've been cleaned up. I'll edit the title and description accordingly.

nice sleuthing