google/gops

Gops hangs if there are too many Go processes to examine

siebenmann opened this issue · 17 comments

If you have more than ten (I believe) Go processes, gops will hang in goprocess/gp.go's FindAll() function. This happens because the 'for .. range pss { ... }' loop attempts to obtain a limitCh token before starting each goroutine, but the goroutines only release their token after they have sent their reply to the found channel. Since the found channel is unbuffered and is only read by the main code after the 'for .. range pss { ... }' loop finishes, this deadlocks whenever the goroutine limit is actually reached.
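Here is a stripped-down sketch of the pattern; process, allProcesses, and isGo are stand-ins for the real code, and the receive side is simplified, but the channel structure matches the description above:

```go
package main

const concurrencyProcesses = 10 // cap on concurrent isGo checks

type process struct{ pid int }

// Stand-ins for the real process listing and Go-binary detection.
func allProcesses() []process { return make([]process, 20) }
func isGo(p process) bool     { return true }

func findAll() []process {
	pss := allProcesses()
	found := make(chan process) // unbuffered: every send blocks until received
	limitCh := make(chan struct{}, concurrencyProcesses)

	for _, pr := range pss {
		limitCh <- struct{}{} // blocks forever once ten workers hold tokens
		go func(pr process) {
			defer func() { <-limitCh }() // token released only AFTER the send
			if isGo(pr) {
				found <- pr // blocks: main only receives after the loop ends
			}
		}(pr)
	}

	// With more than concurrencyProcesses Go processes, we never get here.
	var results []process
	for range pss {
		results = append(results, <-found)
	}
	return results
}

func main() { findAll() }
```

Run in isolation, this sketch even trips the runtime's "all goroutines are asleep - deadlock!" check; inside the real gops binary it simply hangs.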

I think either the entire 'for .. range pss { ... }' loop needs to be in a goroutine so that it doesn't block reading from the found channel, or the limitCh token must be released by the code before sending to found. The former change seems easier and I think it's correct.
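The latter would be a small change to the worker in the sketch above (again, just a sketch): give the token back as soon as the check is done, so a worker blocked on a send no longer occupies a slot.

```go
go func(pr process) {
	ok := isGo(pr)
	<-limitCh // release the token BEFORE the potentially blocking send:
	// at most concurrencyProcesses checks run concurrently, but a worker
	// waiting to deliver its result no longer holds up the main loop
	if ok {
		found <- pr
	}
}(pr)
```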

This is impacting me as well. Was able to get around it temporarily by upping the magic number to 100 without a problem, though that's obviously not the right fix.

@siebenmann @dbraley could you please check whether this PR fixes the issue?

Purely on inspection, I believe this will fix the issue I'm experiencing; however, I believe it will also reintroduce issue #118. Since limitCh is now sent to in a new goroutine that has no other purpose, there's nothing to stop a flood of goroutines calling isGo() and using too many file descriptors at once.

@siebenmann thanks for the insight. I have modified the code to use wg instead of limitCh.

The change introduced above (see #125) moves the whole for-loop into a goroutine (also suggested in the issue description). Ignoring the whitespace changes (git diff -b) is by far the best way to understand the diff.
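With the loop in its own goroutine and a sync.WaitGroup deciding when to close found, FindAll takes roughly this shape (a sketch reusing the stand-ins from the first snippet, not the exact diff):

```go
func findAll() []process {
	pss := allProcesses()
	found := make(chan process)
	limitCh := make(chan struct{}, concurrencyProcesses)

	var wg sync.WaitGroup
	wg.Add(len(pss))

	// Produce in a separate goroutine so that the main goroutine can
	// drain found while the workers are still running.
	go func() {
		for _, pr := range pss {
			limitCh <- struct{}{} // still bounds concurrent isGo calls
			go func(pr process) {
				defer wg.Done()
				defer func() { <-limitCh }()
				if isGo(pr) {
					found <- pr // ok: the receiver below is already active
				}
			}(pr)
		}
		wg.Wait()
		close(found) // lets the range below terminate
	}()

	var results []process
	for p := range found {
		results = append(results, p)
	}
	return results
}
```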

It seems to me the simplest fix would be to make found buffered, with size concurrencyProcesses.

I personally prefer to avoid buffered channels, but yeah -- it's possible as well.
UPD: See the comment below, where I explain why this actually isn't gonna work with a buffer size of concurrencyProcesses.

Actually, I did not initially notice the concurrencyProcesses buffer limit proposed by @josharian. Imho it isn't gonna help. Imagine we have more than concurrencyProcesses Go processes: after the first batch of concurrencyProcesses workers we will fill the output buffer and end up in exactly the same situation we are in right now. The only correct approach is a buffer sized for the future len(results), which is unknown; the nearest known upper bound is len(pss).

But by that point, receiving from found will have commenced. In any case, there's no need to speculate. That's what tests are for. :)

But reading from found happens after the main for-loop. It will never get there, because it will get stuck in the same way it gets stuck now.

I see now. Using len(pss) would work.
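For the record, the len(pss)-buffered variant needs only small changes to the original structure (a sketch, again using the stand-in names from above): the buffer has room for every possible result, so no send can ever block, and only the drain needs a wg.Wait and a close.

```go
found := make(chan process, len(pss)) // room for every possible result

var wg sync.WaitGroup
for _, pr := range pss {
	limitCh <- struct{}{}
	wg.Add(1)
	go func(pr process) {
		defer wg.Done()
		defer func() { <-limitCh }()
		if isGo(pr) {
			found <- pr // never blocks: capacity covers all sends
		}
	}(pr)
}

wg.Wait()    // every worker terminates, since no send can block
close(found) // so the drain below is finite
var results []process
for p := range found {
	results = append(results, p)
}
```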

@josharian I've refactored my CL and made it testable.
UPD: Removed "Please try with len(pss) -- it will hang." Of course it will work. Sorry, it seems I have problems finishing reading sentences today (-;

@josharian Added a fix-123-buffered branch. Please take a look and feel free to try it out yourself.
UPD: And yes, surely it would work with len(pss) as we've agreed earlier.

Oops, sorry, forgot to push the changes. Now it's there.

Also added a version which creates fewer goroutines and doesn't use buffered channels. Please take a look: https://github.com/yarcat/gops/blob/fix-123-no-buffers/goprocess/gp.go (the branch is still based on PR #125).
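In rough outline, such a version can be a fixed pool of concurrencyProcesses workers fed from an unbuffered jobs channel (a sketch with the stand-ins from above, not the branch's actual code):

```go
func findAll() []process {
	pss := allProcesses()
	jobs := make(chan process)
	found := make(chan process)

	var wg sync.WaitGroup
	wg.Add(concurrencyProcesses)

	// A fixed pool: exactly concurrencyProcesses worker goroutines,
	// no matter how many processes there are to inspect.
	for i := 0; i < concurrencyProcesses; i++ {
		go func() {
			defer wg.Done()
			for pr := range jobs {
				if isGo(pr) {
					found <- pr
				}
			}
		}()
	}

	// Feed jobs, then close found once every worker has drained them.
	go func() {
		for _, pr := range pss {
			jobs <- pr
		}
		close(jobs)
		wg.Wait()
		close(found)
	}()

	var results []process
	for p := range found {
		results = append(results, p)
	}
	return results
}
```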

Since this is impacting many users, instead of leaving comments on #124, I sent out #126.

Sorry for the delay! Can confirm this fixed the problem for me as well. Thanks @rakyll, @yarcat, @mseshachalam, & @josharian!