google/gops

Gops hangs if there are too many Go processes to examine

siebenmann opened this issue · 17 comments

If you have more than ten (I believe) Go processes, gops will hang in goprocess/gp.go's FindAll() function. This happens because the 'for .. range pss { ... }' loop attempts to obtain a limitCh token before starting each goroutine, but the goroutines only release their token after they have sent their reply to the found channel. Since the found channel is unbuffered and is only read by the main code after the 'for .. range pss { ... }' loop finishes, this deadlocks whenever the goroutine limit is actually reached.
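Here is a stripped-down sketch of the pattern; process, allProcesses, and isGo are stand-ins for the real code, and the receive side is simplified, but the channel structure matches the description above:

```go
package main

const concurrencyProcesses = 10 // cap on concurrent isGo checks

type process struct{ pid int }

// Stand-ins for the real process listing and Go-binary detection.
func allProcesses() []process { return make([]process, 20) }
func isGo(p process) bool     { return true }

func findAll() []process {
	pss := allProcesses()
	found := make(chan process) // unbuffered: every send blocks until received
	limitCh := make(chan struct{}, concurrencyProcesses)

	for _, pr := range pss {
		limitCh <- struct{}{} // blocks forever once ten workers hold tokens
		go func(pr process) {
			defer func() { <-limitCh }() // token released only AFTER the send
			if isGo(pr) {
				found <- pr // blocks: main only receives after the loop ends
			}
		}(pr)
	}

	// With more than concurrencyProcesses Go processes, we never get here.
	var results []process
	for range pss {
		results = append(results, <-found)
	}
	return results
}

func main() { findAll() }
```

Run in isolation, this sketch even trips the runtime's "all goroutines are asleep - deadlock!" check; inside the real gops binary it simply hangs.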

I think either the entire 'for .. range pss { ... }' loop needs to be in a goroutine so that it doesn't block reading from the found channel, or the limitCh token must be released by the code before sending to found. The former change seems easier and I think it's correct.
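The latter would be a small change to the worker in the sketch above (again, just a sketch): give the token back as soon as the check is done, so a worker blocked on a send no longer occupies a slot.

```go
go func(pr process) {
	ok := isGo(pr)
	<-limitCh // release the token BEFORE the potentially blocking send:
	// at most concurrencyProcesses checks run concurrently, but a worker
	// waiting to deliver its result no longer holds up the main loop
	if ok {
		found <- pr
	}
}(pr)
```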

This is impacting me as well. Was able to get around it temporarily by upping the magic number to 100 without a problem, though that's obviously not the right fix.

@siebenmann @dbraley could you please check whether this PR fixes the issue?

Purely on inspection, I believe this will fix the issue I'm experiencing; however, I believe it will also reintroduce issue #118. Since limitCh is now sent to in a new goroutine that has no other purpose, there's nothing to stop a flood of goroutines calling isGo() and using too many file descriptors at once.

@siebenmann thanks for the insight. I have modified the code to use wg instead of limitCh.

The change introduced above (see #125) moves the whole for-loop into a goroutine (also suggested in the issue description). Ignoring the whitespace changes (git diff -b) is by far the best way to understand the diff.
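With the loop in its own goroutine and a sync.WaitGroup deciding when to close found, FindAll takes roughly this shape (a sketch reusing the stand-ins from the first snippet, not the exact diff):

```go
func findAll() []process {
	pss := allProcesses()
	found := make(chan process)
	limitCh := make(chan struct{}, concurrencyProcesses)

	var wg sync.WaitGroup
	wg.Add(len(pss))

	// Produce in a separate goroutine so that the main goroutine can
	// drain found while the workers are still running.
	go func() {
		for _, pr := range pss {
			limitCh <- struct{}{} // still bounds concurrent isGo calls
			go func(pr process) {
				defer wg.Done()
				defer func() { <-limitCh }()
				if isGo(pr) {
					found <- pr // ok: the receiver below is already active
				}
			}(pr)
		}
		wg.Wait()
		close(found) // lets the range below terminate
	}()

	var results []process
	for p := range found {
		results = append(results, p)
	}
	return results
}
```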

It seems to me the simplest fix would be to make found buffered, with size concurrencyProcesses.

I personally prefer to avoid buffered channels, but yeah -- it's possible as well.
UPD: See the comment below, where I explain why this actually isn't gonna work with a buffer size of concurrencyProcesses.

Actually, I did not initially notice the concurrencyProcesses buffer limit proposed by @josharian. Imho it isn't gonna help. Imagine we have more than concurrencyProcesses Go processes: after the first batch of concurrencyProcesses workers we will fill the output buffer and end up in exactly the same situation we are in right now. The only correct approach is a buffer sized for the future len(results), which is unknown; the nearest known upper bound is len(pss).

But by that point, receiving from found will have commenced. In any case, there's no need to speculate. That's what tests are for. :)

But reading from found happens after the main for-loop. It will never get there, because it will get stuck in the same way it gets stuck now.

I see now. Using len(pss) would work.
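For the record, the len(pss)-buffered variant needs only small changes to the original structure (a sketch, again using the stand-in names from above): the buffer has room for every possible result, so no send can ever block, and only the drain needs a wg.Wait and a close.

```go
found := make(chan process, len(pss)) // room for every possible result

var wg sync.WaitGroup
for _, pr := range pss {
	limitCh <- struct{}{}
	wg.Add(1)
	go func(pr process) {
		defer wg.Done()
		defer func() { <-limitCh }()
		if isGo(pr) {
			found <- pr // never blocks: capacity covers all sends
		}
	}(pr)
}

wg.Wait()    // every worker terminates, since no send can block
close(found) // so the drain below is finite
var results []process
for p := range found {
	results = append(results, p)
}
```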

@josharian I've refactored my CL and made it testable.
UPD: Removed "Please try with len(pss) -- it will hang." Of course it will work. Sorry, it seems I have problems finishing reading sentences today (-;

@josharian Added a fix-123-buffered branch. Please take a look and feel free to try it out yourself.
UPD: And yes, surely it would work with len(pss) as we've agreed earlier.

Oops, sorry, forgot to push the changes. Now it's there.

Also added a version which creates fewer goroutines and doesn't use buffered channels. Please take a look: https://github.com/yarcat/gops/blob/fix-123-no-buffers/goprocess/gp.go (the branch is still based on PR #125).
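In rough outline, such a version can be a fixed pool of concurrencyProcesses workers fed from an unbuffered jobs channel (a sketch with the stand-ins from above, not the branch's actual code):

```go
func findAll() []process {
	pss := allProcesses()
	jobs := make(chan process)
	found := make(chan process)

	var wg sync.WaitGroup
	wg.Add(concurrencyProcesses)

	// A fixed pool: exactly concurrencyProcesses worker goroutines,
	// no matter how many processes there are to inspect.
	for i := 0; i < concurrencyProcesses; i++ {
		go func() {
			defer wg.Done()
			for pr := range jobs {
				if isGo(pr) {
					found <- pr
				}
			}
		}()
	}

	// Feed jobs, then close found once every worker has drained them.
	go func() {
		for _, pr := range pss {
			jobs <- pr
		}
		close(jobs)
		wg.Wait()
		close(found)
	}()

	var results []process
	for p := range found {
		results = append(results, p)
	}
	return results
}
```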

Since this is impacting many users, instead of leaving comments on #124, I sent out #126.

Sorry for the delay! Can confirm this fixed the problem for me as well. Thanks @rakyll, @yarcat, @mseshachalam, & @josharian!