test-bot failing with `brew doctor` errors and leftover kegs
apjanke opened this issue · 13 comments
Starting today, all the test bot runs for the main Homebrew are failing with `brew doctor` errors, complaining about missing dependencies and `/usr/local/sbin` not being on the path. Reruns are not fixing it.
#50546 and Homebrew/homebrew-science#3476 ("Jenkins (bot.brew.sh) fails sporadically") look related; those runs are also failing on `brew doctor` calling out missing dependencies.
The `brew doctor` errors in the build logs look like this:
Error Message
failed: brew doctor
Stacktrace
Please note that these warnings are just used to help the Homebrew maintainers
with debugging if you file an issue. If everything you use Homebrew for is
working fine: please don't worry and just ignore them. Thanks!
Warning: Homebrew's sbin was not found in your PATH but you have installed
formulae that put executables in /usr/local/sbin.
Consider setting the PATH for example like so
echo 'export PATH="/usr/local/sbin:$PATH"' >> ~/.bash_profile
Warning: Some installed formula are missing dependencies.
You should `brew install` the missing dependencies:
brew install epstool erlang fftw fontconfig freetype gd ghostscript gl2ps gmp gnuplot graphicsmagick arpack glpk hdf5 octave qhull qrupdate suite-sparse421 veclibfort imagemagick jpeg libpng libtiff libtool little-cms2 lua pcre plotutils pstoedit pyqt qscintilla2 qt sip szip tbb unixodbc wxmac xz
Run `brew missing` for more details.
The first failed PR from this set of failures was #50620.
The Jenkins build history says:
Failed Homebrew Pull Requests » mavericks #44050 10 hr broken since this build
Homebrew Core Pull Requests » el_capitan #2 11 hr broken since this build
Failed Homebrew Core Pull Requests » yosemite #2 11 hr broken since this build
Diagnosis
The `doctor` warnings for missing dependencies, stuff in `sbin`, and so on only get triggered if there are formulae installed. And there shouldn't be, at the start of a test run. But there are, plenty of them.
elcapitanvm:Cellar brew$ date
Sun Apr 3 04:33:11 BST 2016
elcapitanvm:Cellar brew$ brew ls
cairo fontforge libarchive pango
codequery freetype libevent pixman
confuse fwup libffi pkg-config
...
Looks like keg installations from a previous `test-bot` run are getting left over.
Work
51e4e64 was an attempt to fix this using `git clean`. Didn't work; reverted in e13ff62. Martin thinks the issue was that the `git` command is executed in the context of the tap the current job operates on, never for Homebrew/homebrew itself.
There's existing cleanup code at https://github.com/Homebrew/homebrew/blob/e13ff6294a0bbc28fc98d6f797b1b56ce93ecd67/Library/Homebrew/cmd/test-bot.rb#L633-L645 which already runs `git clean -ffdx`. It's probably not addressing this because it's run inside the job's tap and not core `$HOMEBREW_PREFIX`.
The cleanup code should be put in https://github.com/Homebrew/homebrew/blob/master/Library/Homebrew/cmd/test-bot.rb#L633-L645
It seems like the issue is that `--cleanup` only affects the repo it's currently under, which would often be the main Homebrew repo for core PRs. But it should really clean up both the repo for the target tap (which gets clean formula code) and the main Homebrew prefix repo (which blows away all the kegs in the `Cellar`).
That could explain why the failure for homebrew-science in #50546 is sporadic: if there are leftover kegs on the machine, the `test-bot --cleanup` run on the tap PR wouldn't clean them up, but the next test run for a core PR would. Now, after the core/formula separation, core formula PRs only end up cleaning the homebrew/core tap, and not `HOMEBREW_PREFIX`. Does that make sense?
Is the intent of `--cleanup` to always clean up `HOMEBREW_PREFIX` and remove installed formulae? If so, it seems like we could fix this by just doing all the `git` cleanup work on both `@repository` and `HOMEBREW_PREFIX` instead of just `@repository` like it does now.
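To make the "clean both" idea concrete, here is a minimal sketch in Ruby. It is not the actual `test-bot.rb` code: `repos_to_clean` is a hypothetical helper, and the paths are just the usual `/usr/local` layout. It only shows which directories the `git` cleanup would need to visit, collapsing the two into one when the repo under test is the prefix repo itself.

```ruby
require "pathname"

# Hypothetical helper (not part of test-bot.rb): returns the directories that
# the git cleanup would have to run in, without listing the same one twice.
def repos_to_clean(repository, prefix)
  [repository, prefix].map { |path| Pathname.new(path).cleanpath.to_s }.uniq
end

# A job whose repository is the prefix repo itself: one cleanup is enough.
puts repos_to_clean("/usr/local", "/usr/local/").inspect

# A job for a tap: both the tap and the prefix repo need cleaning.
puts repos_to_clean("/usr/local/Library/Taps/homebrew/homebrew-core", "/usr/local").inspect
```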
I have a theory: the `test-bot` sequence does `install` on the dependent formulae it finds using `brew uses`. And it later calls `uninstall` on them. But it does not call `uninstall` on their dependencies which were installed implicitly as a result of the `brew install $(brew uses foo)`. That leaves those kegs installed after the `test-bot` run unless cleanup was performed on the main `HOMEBREW_PREFIX` repo, which results in uninstalling all formulae.
And if some of those indirectly-installed dependencies themselves have dependencies on some of the formulae that `brew uninstall unchanged_dependencies` removed, they now have broken dependencies, and will show up in the next `brew doctor` run. Or if any of the indirectly-installed dependencies install anything to `sbin`, then you'll get that other `brew doctor` warning.
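Here is a toy model of that theory in Ruby, to check that the sequence really can leave broken kegs behind. Everything in it is an assumption made for illustration: the formula names are made up, and `install` and `uninstall` are reduced to "add the formula plus its recursive dependencies" and "remove only the named formula", which is a simplification of what `brew` does.

```ruby
require "set"

# Returns the formula plus everything it depends on, directly or indirectly.
def with_recursive_deps(graph, name, seen = Set.new)
  return seen if seen.include?(name)
  seen << name
  graph.fetch(name, []).each { |dep| with_recursive_deps(graph, dep, seen) }
  seen
end

# Simulates the install/uninstall sequence and returns the kegs left behind.
def leftover_kegs(graph, dependents)
  installed = Set.new
  # Installing each dependent also installs its dependencies implicitly.
  dependents.each { |formula| installed.merge(with_recursive_deps(graph, formula)) }
  # Uninstalling removes only the formulae that were named explicitly.
  dependents.each { |formula| installed.delete(formula) }
  installed
end

# Installed formulae that are missing at least one direct dependency.
def broken_kegs(graph, installed)
  installed.select { |formula| graph.fetch(formula, []).any? { |dep| !installed.include?(dep) } }
end

# Made-up graph: formula => direct dependencies.
graph = {
  "app-a" => ["lib-x"], # dependent found via `brew uses`
  "app-b" => [],        # dependent found via `brew uses`
  "lib-x" => ["app-b"]  # installed implicitly; itself depends on app-b
}

left = leftover_kegs(graph, ["app-a", "app-b"])
puts "left over: #{left.to_a.sort.join(', ')}"
puts "missing dependencies: #{broken_kegs(graph, left).sort.join(', ')}"
```

In this model `lib-x` stays installed after both dependents are removed, and it is also the keg that a later `brew doctor` would report as missing a dependency.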
Do we always want to uninstall all formulae after a `brew test-bot` run? Maybe we could just do `brew uninstall $(brew ls)` in `cleanup_before` and `cleanup_after` as a quick fix.
I think the tested formula's own dependencies will be left over on the machine after the `test-bot` run, too. But there's no chance of them getting their dependencies broken, because they won't depend on the tested formula or its dependents: that would cause a dependency cycle, which `brew` doesn't allow. So they'd only be a problem if they put stuff in `sbin`.
I'm thinking we still need to clean up tap files as well, to avoid interference between jobs for different taps. So how about running the commands below in the Jenkins config, similar to our Travis config?
git -C "$(brew --repo)" reset --hard origin/master
git -C "$(brew --repo)" clean -qxdff
Yeah, the `git reset`/`git clean` is better.
Wouldn't it make more sense to do those in `cleanup_before` and `cleanup_after` inside `test-bot`, though, so people running it interactively get similar results, and we rely less on the Jenkins configuration?
Yes, you are right.
I put in a fix in 1e92c7f which will do the `git clean` on the core repo as well as the tap being tested.
Some of its work is redundant: since the taps are under the core repo, a single `git clean -xdff` on the core should take care of all the taps too, I think. And some of the code could be refactored. But I wanted to make the minimal changes to get it working now.
Reopening until this is actually tested. (It got auto-closed when I pushed the commit.)
That didn't work. (Stored log here.)
It ran `git clean -ffdx` in the core repo in both of the `cleanup_*` methods. That broke the Homebrew installation because it blew away the taps under `Library/Taps`.
==> Updated Formulae
cassandra
+ brew test-bot --ci-pr
["--ci-pr", "--cleanup", "--junit", "--local"]
HEAD is now at 1e92c7f test-bot: have --cleanup clean core repo as well as tested tap
Already on 'master'
Removing Cellar/
...
Removing Library/Taps/
...
==> git reset --hard
Error: No such file or directory - /usr/local/Library/Taps/homebrew/homebrew-core
==> FAILED
/usr/local/Library/Homebrew/extend/pathname.rb:328:in `chdir': No such file or directory - /usr/local/Library/Taps/homebrew/homebrew-core (Errno::ENOENT)
from /usr/local/Library/Homebrew/extend/pathname.rb:328:in `cd'
from /usr/local/Library/Homebrew/cmd/test-bot.rb:155:in `block in run'
from /usr/local/Library/Homebrew/cmd/test-bot.rb:150:in `fork'
from /usr/local/Library/Homebrew/cmd/test-bot.rb:150:in `run'
from /usr/local/Library/Homebrew/cmd/test-bot.rb:687:in `test'
from /usr/local/Library/Homebrew/cmd/test-bot.rb:665:in `cleanup_after'
from /usr/local/Library/Homebrew/cmd/test-bot.rb:741:in `ensure in run'
from /usr/local/Library/Homebrew/cmd/test-bot.rb:741:in `run'
from /usr/local/Library/Homebrew/cmd/test-bot.rb:910:in `test_bot'
from /usr/local/Library/brew.rb:84:in `<main>'
Build step 'Execute shell' marked build as failure
It needs to be done either with a single `-f` or with `--exclude /Library/Taps`. I put in another commit 235d819 to do it the `--exclude /Library/Taps` way, keeping the extra `-f` in case there are other git-controlled droppings around.
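For context, per the `git clean` documentation a second `-f` is what lets it remove untracked nested git repositories, and the taps are exactly that, which is why `-ffdx` in the core repo wiped them out. A rough sketch of the exclude approach follows. It is not the code from 235d819: `git_clean_args` is a hypothetical helper that only builds the argument list, so it can be inspected without touching a real installation.

```ruby
# Hypothetical helper (not the code from 235d819): builds the `git clean`
# command for one repository. The exclusion is only added for the core repo,
# since that is the only place where Library/Taps lives.
def git_clean_args(repo, core_repo:, exclude: "/Library/Taps")
  args = ["git", "-C", repo, "clean", "-ffdx"]
  args << "--exclude=#{exclude}" if repo == core_repo
  args
end

core = "/usr/local"
tap = "/usr/local/Library/Taps/homebrew/homebrew-core"

# Print the commands instead of running them.
puts git_clean_args(core, core_repo: core).join(" ")
puts git_clean_args(tap, core_repo: core).join(" ")
```

The tap itself still gets the plain `-ffdx` treatment, so it ends up with clean formula code while the core repo loses the `Cellar` but keeps its taps.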
Testing this now.
Here's the Jenkins run with that change. Yosemite and El Capitan succeeded. Mavericks failed, but with an error in `brew test fasd`, which looks like an actual issue with that PR, not this `brew doctor` stuff.
I think this might be fixed. Going to requeue a couple additional runs, including alternating between different taps, to see if it's really working.
Reopening. (Adding the commit to `homebrew/brew` triggered the auto-closing. Might have to watch out for that if we're keeping them in parallel.)
Okay, I think this is fixed now. Incoming test jobs are running okay, and we've seen some from multiple taps.
Many thanks for figuring this out! 🙇 Yesterday night I was obviously too tired to get this right …