Homebrew/legacy-homebrew

test-bot failing with `brew doctor` errors and leftover kegs

apjanke opened this issue · 13 comments

Starting today, all the test-bot runs for the main Homebrew are failing with brew doctor errors, complaining about missing dependencies and /usr/local/sbin not being on the PATH. Reruns are not fixing it.
#50546 and Homebrew/homebrew-science#3476 ("Jenkins (bot.brew.sh) fails sporadically") look related; they're failing on brew doctor calling out missing dependencies there too.

The brew doctor errors in the build logs look like this.

Error Message

failed: brew doctor

Stacktrace

        Please note that these warnings are just used to help the Homebrew maintainers
with debugging if you file an issue. If everything you use Homebrew for is
working fine: please don't worry and just ignore them. Thanks!

Warning: Homebrew's sbin was not found in your PATH but you have installed
formulae that put executables in /usr/local/sbin.
Consider setting the PATH for example like so
    echo 'export PATH="/usr/local/sbin:$PATH"' >> ~/.bash_profile

Warning: Some installed formula are missing dependencies.
You should `brew install` the missing dependencies:

    brew install epstool erlang fftw fontconfig freetype gd ghostscript gl2ps gmp gnuplot graphicsmagick arpack glpk hdf5 octave qhull qrupdate suite-sparse421 veclibfort imagemagick jpeg libpng libtiff libtool little-cms2 lua pcre plotutils pstoedit pyqt qscintilla2 qt sip szip tbb unixodbc wxmac xz

Run `brew missing` for more details.  

The first failed PR from this set of failures was #50620.

The Jenkins build history says:

Failed  Homebrew Pull Requests » mavericks #44050  10 hr   broken since this build
Homebrew Core Pull Requests » el_capitan #2    11 hr   broken since this build
Failed  Homebrew Core Pull Requests » yosemite #2  11 hr   broken since this build

Diagnosis

The doctor warnings for missing dependencies, stuff in sbin, and so on only get triggered if there are formulae installed. And there shouldn't be, at the start of a test run. But there are, plenty of them.

elcapitanvm:Cellar brew$ date
Sun Apr  3 04:33:11 BST 2016
elcapitanvm:Cellar brew$ brew ls
cairo           fontforge       libarchive      pango
codequery       freetype        libevent        pixman
confuse         fwup            libffi          pkg-config
...

Looks like keg installations from a previous test-bot run are getting left over.
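A quick way to check for this before a run is to look for directories under the Cellar. Here is a minimal sketch of that check; `leftover_kegs` is a hypothetical helper (not something test-bot has), and the `/usr/local/Cellar` path assumes a default install.

```ruby
require "pathname"

# Hypothetical helper (not part of test-bot): returns the names of formulae
# that still have a directory under the given Cellar path.
def leftover_kegs(cellar)
  cellar = Pathname.new(cellar)
  return [] unless cellar.directory?

  # Each installed formula has one directory directly under the Cellar.
  cellar.children.select(&:directory?).map { |rack| rack.basename.to_s }.sort
end

# Assumes the default prefix; adjust the path for other installs.
leftovers = leftover_kegs("/usr/local/Cellar")
warn "Leftover kegs before test run: #{leftovers.join(", ")}" unless leftovers.empty?
```

An empty result at the start of a job would mean the doctor warnings about sbin and missing dependencies have nothing to trigger on.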

Work

51e4e64 was an attempt to fix this using git clean. Didn't work; reverted in e13ff62. Martin thinks the issue was that the git command is executed in the context of the tap the current job operates on, never for Homebrew/homebrew itself.

There's existing cleanup code at https://github.com/Homebrew/homebrew/blob/e13ff6294a0bbc28fc98d6f797b1b56ce93ecd67/Library/Homebrew/cmd/test-bot.rb#L633-L645 which already runs git clean -ffdx. It's probably not addressing this because it's run inside the job's tap and not core $HOMEBREW_PREFIX.

It seems like the issue is that --cleanup only affects the repo it's currently under - which would often be the main Homebrew repo for core PRs. But it should really clean up both the repo for the target tap (which gets clean formula code), and the main Homebrew prefix repo (which blows away all the kegs in the Cellar).

That could explain why the failure for homebrew-science in #50546 is sporadic: if there are leftover kegs on the machine, the test-bot --cleanup run on the tap PR wouldn't clean them up. But the next test run for a core PR would. But now after core/formula separation, core formula PRs are only ending up cleaning the homebrew/core tap, and not HOMEBREW_PREFIX. Does that make sense?

Is the intent of --cleanup to always clean up HOMEBREW_PREFIX and remove installed formulae? If so, it seems like we could fix this by doing all the git cleanup work on both @repository and HOMEBREW_PREFIX, instead of just @repository like it does now.
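Roughly what I have in mind, as a sketch rather than the actual test-bot code: work out the set of checkouts to clean, without visiting the same one twice when a core PR's repository already is the prefix. The helper name and its parameters are made up for illustration; `tap_repository` stands in for @repository and `prefix` for HOMEBREW_PREFIX.

```ruby
require "pathname"

# Hypothetical helper: the git checkouts a --cleanup pass would visit.
# Paths are normalized lexically so "/usr/local" and "/usr/local/" count once.
def repositories_to_clean(tap_repository, prefix)
  [Pathname.new(tap_repository), Pathname.new(prefix)].map(&:cleanpath).uniq
end

# Example with a tap job: both the tap and the prefix are listed.
repositories_to_clean("/usr/local/Library/Taps/homebrew/homebrew-science", "/usr/local").each do |repo|
  puts "would clean #{repo}"
end
```

This only builds the list; the git commands themselves would run once per entry.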

I have a theory: the test-bot sequence does install on the dependent formulae it finds using brew uses. And it later calls uninstall on them. But it does not call uninstall on their dependencies which were installed implicitly as a result of the brew install $(brew uses foo). That leaves those kegs installed after the test-bot run unless cleanup was performed on the main HOMEBREW_PREFIX repo, which results in uninstalling all formulae.

And if some of those indirectly-installed dependencies themselves depend on some of the formulae that brew uninstall unchanged_dependencies removed, they now have broken dependencies, and will show up in the next brew doctor run. Or if any of the indirectly-installed dependencies install anything to sbin, then you'll get that other brew doctor warning.

Do we always want to uninstall all formulae after a brew test-bot run? Maybe we could just do brew uninstall $(brew ls) in cleanup_before and cleanup_after as a quick fix.

I think the tested formula's own dependencies will be left over on the machine after the test-bot run, too. But there's no chance of them getting their dependencies broken, because they can't depend on the tested formula or its dependents: that would cause a dependency cycle, which brew doesn't allow. So they'd only be a problem if they put stuff in sbin.
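To make the theory concrete, here is a toy simulation. Everything in it is an assumption for illustration: the formula names and dependency edges are invented, and the install/uninstall order is only my reading of what test-bot does, not a copy of its code.

```ruby
require "set"

# Toy dependency graph. All formula names and edges are invented.
DEPS = {
  "tested"    => ["libx"],
  "dependent" => ["tested", "helper"],
  "helper"    => ["libx"],
  "libx"      => [],
}.freeze

# Installing a formula also installs everything it depends on, recursively.
def install(name, installed)
  return if installed.include?(name)

  DEPS.fetch(name).each { |dep| install(dep, installed) }
  installed << name
end

# Uninstalling removes only the named keg, never its dependencies.
def uninstall(name, installed)
  installed.delete(name)
end

# For each keg still installed, the dependencies that are no longer present.
def missing_dependencies(installed)
  installed.each_with_object({}) do |name, missing|
    gone = DEPS.fetch(name).reject { |dep| installed.include?(dep) }
    missing[name] = gone unless gone.empty?
  end
end

installed = Set.new
install("tested", installed)
install("dependent", installed) # implicitly installs "helper"
%w[dependent tested libx].each { |formula| uninstall(formula, installed) }

puts "left over: #{installed.to_a.inspect}"
puts "missing:   #{missing_dependencies(installed).inspect}"
```

In this toy run "helper" is the only keg left, and it reports "libx" as missing, which is the shape of the brew doctor warning above.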

I'm thinking we still need to clean up tap files as well, to avoid interference between jobs for different taps. So how about running the commands below in the Jenkins config, similar to our Travis config?

git -C "$(brew --repo)" reset --hard origin/master
git -C "$(brew --repo)" clean -qxdff

Yeah, the git reset/git clean is better.

Wouldn't it make more sense to do those in cleanup_before and cleanup_after inside test-bot, though, so people running it interactively get similar results, and we rely less on the Jenkins configuration?

Yes you are right.

I put in a fix in 1e92c7f which will do the git clean on the core repo as well as the tap being tested.

Some of its work is redundant: since the taps are under the core repo, a single git clean -xdff on the core should take care of all the taps too, I think. And some of the code could be refactored. But I wanted to make the minimal changes to get it working now.

Reopening until this is actually tested. (It got auto-closed when I pushed the commit.)

That didn't work. (Stored log here.)

The fix ran git clean -ffdx in the core repo in both of the cleanup_* methods. That broke the Homebrew installation because it blew away the taps under Library/Taps.

==> Updated Formulae
cassandra
+ brew test-bot --ci-pr
["--ci-pr", "--cleanup", "--junit", "--local"]
HEAD is now at 1e92c7f test-bot: have --cleanup clean core repo as well as tested tap
Already on 'master'
Removing Cellar/
...
Removing Library/Taps/
...
==> git reset --hard
Error: No such file or directory - /usr/local/Library/Taps/homebrew/homebrew-core
==> FAILED
/usr/local/Library/Homebrew/extend/pathname.rb:328:in `chdir': No such file or directory - /usr/local/Library/Taps/homebrew/homebrew-core (Errno::ENOENT)
    from /usr/local/Library/Homebrew/extend/pathname.rb:328:in `cd'
    from /usr/local/Library/Homebrew/cmd/test-bot.rb:155:in `block in run'
    from /usr/local/Library/Homebrew/cmd/test-bot.rb:150:in `fork'
    from /usr/local/Library/Homebrew/cmd/test-bot.rb:150:in `run'
    from /usr/local/Library/Homebrew/cmd/test-bot.rb:687:in `test'
    from /usr/local/Library/Homebrew/cmd/test-bot.rb:665:in `cleanup_after'
    from /usr/local/Library/Homebrew/cmd/test-bot.rb:741:in `ensure in run'
    from /usr/local/Library/Homebrew/cmd/test-bot.rb:741:in `run'
    from /usr/local/Library/Homebrew/cmd/test-bot.rb:910:in `test_bot'
    from /usr/local/Library/brew.rb:84:in `<main>'
Build step 'Execute shell' marked build as failure

It needs to be done either with a single -f or with --exclude /Library/Taps. I put in another commit, 235d819, to do it the --exclude /Library/Taps way, keeping the extra -f in case there are other git-controlled droppings around.
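For reference: git clean only removes untracked nested git repositories when -f is given twice, and patterns passed with -e/--exclude are still honored under -x. A sketch of how the arguments could differ between the core checkout and a tap follows; `git_clean_args` and its `core:` keyword are hypothetical names, and the exact pattern spelling is an assumption based on the description above, not a copy of the commit.

```ruby
# Hypothetical helper: builds the argv for `git clean` in a given checkout.
# `core` marks the main Homebrew checkout, whose Library/Taps directory
# holds the taps as nested git repositories.
def git_clean_args(repo, core:)
  args = ["git", "-C", repo.to_s, "clean", "-ffdx"]
  # Keep the taps in place when cleaning the core checkout.
  args << "--exclude=/Library/Taps" if core
  args
end

# Print the commands only; nothing is executed here.
puts git_clean_args("/usr/local", core: true).join(" ")
puts git_clean_args("/usr/local/Library/Taps/homebrew/homebrew-core", core: false).join(" ")
```

With the exclude in place the Cellar still gets removed from the core checkout, while the tap checkouts survive for the later git reset --hard step.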

Testing this now.

Here's the Jenkins run with that change. Yosemite and El Capitan succeeded. Mavericks failed, but with an error in brew test fasd, which looks like an actual issue with that PR, not this brew doctor stuff.

I think this might be fixed. Going to requeue a couple additional runs, including alternating between different taps, to see if it's really working.

Reopening. (Adding the commit to homebrew/brew triggered the auto-closing. Might have to watch out for that if we're keeping them in parallel.)

Okay, I think this is fixed now. Incoming test jobs are running okay, and we've seen some from multiple taps.

Many thanks for figuring this out! 🙇 Yesterday night I was obviously too tired to get this right …