haskell/cabal

`cabal test` freezes but `cabal test --show-details=streaming` works

jstolarek opened this issue · 46 comments

I'm running the testsuite for singletons package. When I run cabal test the testsuite freezes, but if I add --show-details=streaming it works perfectly and finishes after ~30seconds. This happens with cabal 1.20 and GHC 7.8.2.

@ttuegel do you mind taking a look?

@jstolarek Which operating system are you testing on?

This is 64-bit Linux

@jstolarek Could you strace it to get an idea of where it's frozen?

Sure. Log is about 1.5MB - where should I upload it?

Could you just paste the last 10 lines?

On Thu, Apr 24, 2014 at 11:41 AM, Jan Stolarek notifications@github.comwrote:

Sure. Log is about 1.5MB - where should I upload it?


Reply to this email directly or view it on GitHubhttps://github.com//issues/1810#issuecomment-41261417
.

Sure. Last 2K lines or so look identical:

wait4(26241, 0x7fffe3a2c4cc, 0, NULL)   = ? ERESTARTSYS (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = 61
wait4(26241, 0x7fffe3a2c4cc, 0, NULL)   = ? ERESTARTSYS (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = 61
wait4(26241, 0x7fffe3a2c4cc, 0, NULL)   = ? ERESTARTSYS (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = 61
wait4(26241, 0x7fffe3a2c4cc, 0, NULL)   = ? ERESTARTSYS (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
write(9, "\377\0\0\0\0\0\0\0", 8)       = 8
rt_sigreturn(0x2)                       = 61
wait4(26241,  <unfinished ...>

This is interrupted several times with

write(9, "\377\0\0\0\0\0\0\0", 8)       = 8
rt_sigreturn(0x2)                       = 61

And before all that start there's:

rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7fc530a1c2d0}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [QUIT], [], 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER, 0x7fc530a1c2d0}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
pipe([14, 15])                          = 0
rt_sigprocmask(SIG_BLOCK, [], [], 8)    = 0
timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
vfork()                                 = 26241
close(15)                               = 0
fcntl(14, F_SETFD, FD_CLOEXEC)          = 0
read(14, "", 4)                         = 0
close(14)                               = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
timer_settime(0x1, 0, {it_interval={0, 10000000}, it_value={0, 10000000}}, NULL) = 0
wait4(26241, 0x7fffe3a2c4cc, 0, NULL)   = ? ERESTARTSYS (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = 61

Hope that helps.

I'm experiencing the same issue here with Cabal 1.20. Here is a failed Travis CI build. Here is a source tree, which reproduces the issue.

@ttuegel have you been able to reproduce the issue or do you need help with that?

@tibbe I have reproduced @jstolarek's problem, but I haven't checked if it's the same as @nikita-volkov's. The freeze happens because rawSystemIOWithEnv (which we use to invoke the test executable) never finishes. Based on CPU usage, I think the tests are running, but deadlocking at the end. For example, the singletons tests run for 13s on my machine after last output; then CPU usage drops off. I cannot reproduce the problem with cabal-install's test suite, either. I would say it's a bug in how some test frameworks are doing output, but the test framework can't even tell what the --show-details= setting is. Thoughts?

So if this didn't happen in 1.18, where we didn't have streaming, then we must either be invoking the process in the same way, not waiting for it to finish in the same way, or not passing it handles that in the same state. Some printf style debugging might tell us.

This issue is present running the vty test suite under cabal 1.20. Using cabal 1.18 the issue is not present. Using "--show-details=streaming" does not appear to make a difference.

I attached gdb to the "stuck" process and acquired this backtrace:

(gdb) thread apply all bt

Thread 1 (Thread 0x7fe0b5f8f700 (LWP 9862)):
#0  0x00007fe0b488c263 in __select_nocancel () from /nix/store/pdskwizjw8ar31hql2wjnnx6g0s6xc50-glibc-2.19/lib/libc.so.6
#1  0x0000000000a502ed in awaitEvent ()
#2  0x0000000000a433b5 in scheduleWaitThread ()
#3  0x0000000000a488e7 in hs_main ()
#4  0x000000000040739f in main ()

Which is interesting: There is only a single thread and the thread appears to be just waiting on an event that never occurs.

Not all vty tests freeze. Out of ~14 test suites 4 freeze. These are a subset of those that use quickcheck. If the use of quickcheck is removed then the tests do not freeze.

I have started putting together a minimal repro case from the vty test suite. For the impatient:

I'll be pulling this code into a separate project when I get a chance.

I wonder if this is I/O manager related.

Here is a (almost) minimal project which reproduces the issue: https://github.com/coreyoconnor/cabal-issue-1810

$ cabal test with-quickcheck
Preprocessing library cabalIssue1810-0.1.0.0...
[1 of 1] Compiling Dummy            ( Dummy.hs, dist/build/Dummy.o )
In-place registering cabalIssue1810-0.1.0.0...
Preprocessing test suite 'with-quickcheck' for cabalIssue1810-0.1.0.0...
[1 of 2] Compiling Dummy            ( Dummy.hs, dist/build/Dummy.o )
In-place registering with-quickcheck-0.1.0.0...
[1 of 1] Compiling Main             ( dist/build/with-quickcheckStub/with-quickcheckStub-tmp/with-quickcheckStub.hs, dist/build/with-quickcheckStub/with-quickcheckStub-tmp/Main.o )
Linking dist/build/with-quickcheckStub/with-quickcheckStub ...
Running 1 test suites...
Test suite with-quickcheck: RUNNING...
^C^C

$ cabal test without-quickcheck
Preprocessing library cabalIssue1810-0.1.0.0...
[1 of 1] Compiling Dummy            ( Dummy.hs, dist/build/Dummy.o )
In-place registering cabalIssue1810-0.1.0.0...
Preprocessing test suite 'without-quickcheck' for cabalIssue1810-0.1.0.0...
[1 of 2] Compiling Dummy            ( Dummy.hs, dist/build/Dummy.o )
In-place registering without-quickcheck-0.1.0.0...
[1 of 1] Compiling Main             ( dist/build/without-quickcheckStub/without-quickcheckStub-tmp/without-quickcheckStub.hs, dist/build/without-quickcheckStub/without-quickcheckStub-tmp/Main.o )
Linking dist/build/without-quickcheckStub/without-quickcheckStub ...
Running 1 test suites...
Test suite without-quickcheck: RUNNING...
Test suite without-quickcheck: PASS
Test suite logged to: dist/test/cabalIssue1810-0.1.0.0-without-quickcheck.log
1 of 1 test suites (20 of 20 test cases) passed.

I am using 64bit linux and ghc 7.8.2.

Just ran into this. I think the problem is with how cabal install is saving output (stdout and stderr) from tests. When streaming is disabled, 'tee' dumps the process output into a pipe, but that pipe isn't read until after the process finishes. This means the pipe fills up and then the process just goes to sleep, waiting for the pipe to drain.

If you enable streaming, one of the branches in tee can keep reading from the pipe because it is always mirroring output to stdout.

@travitch I suspect you're right. We can always stream the log handle--the one that is saved now--to a temporary file and read it back when needed (that's what the log code did before, anyway).

Actually, tee isn't closing handles when it's done, even when you tell it to. That probably has something to do with it.

I take back my last comment. Of course tee doesn't close handles if the child process never finishes writing to them. I've tried streaming the log directly to file to test the "pipe doesn't drain" hypothesis, and it doesn't make any difference.

@coreyoconnor I just noticed that your example uses detailed-0.9 type test suites. That's very important because the test suite types are handled differently. It's also a surprise, because I didn't think QuickCheck supported being invoked in that way.

@jstolarek Are you also using detailed-0.9 test suites, rather than exitcode-stdio-1.0?

@ttuegel No, I'm only using exitcode-stdio-1.0.

What's interesting is that this bug keeps disappearing and reappearing. For example with the latest commit of singletons everything is fine. Note also that this is 100% reproducible, meaning that if it fails then it fails always and if it works then it also works always. There doesn't seem to be any random runtime factor involved.

@ttuegel Interesting! I was unaware that QuickCheck was not supported under detailed-0.9. vty has been using detailed + QuickCheck for quite some time without issue. I won't mind switching to exitcode-stdio via test-framework if detailed+QuickCheck is really never supposed to work.

@coreyoconnor To support detailed-0.9 tests, each test library (QuickCheck, HUnit, etc.) is supposed to provide values of type Distribution.TestSuite.Test. I see that you've written your own adapter. That's fine; it should work, but our goal is that you wouldn't need to. Short version: what you are experiencing is definitely a bug, but it's a different one. Would you mind filing a separate bug with your example?

@jstolarek Does it ever fail if all the tests pass? I have only been able to reproduce with test suites with at least one failing test.

@ttuegel I don't know, I haven't paid attention to this. At the moment I can only verify that it can work correctly with some tests failing.

Fixed in fb778e3

@snoyberg brought to my attention that the test suite for active continues to exhibit this problem, even with cabal-install-1.20.0.2. The test suite always freezes unless run with --show-details=streaming.

@snoyberg Could you check cabal --version? The fix for this bug made it into 1.20.0.2, but you need cabal-install-1.20.0.2 to also be built with Cabal-1.20.0.2.

+ cabal --version
cabal-install version 1.20.0.2
using version 1.20.0.0 of the Cabal library 

If that's the case, wouldn't it be better to have a lower bound on the Cabal dependency in cabal-install?

Actually, doing a sanity check right now... Cabal-1.20.0.2 isn't even available on Hackage.

@tibbe Can there be a Cabal-1.20.0.2 release to correspond to the recent cabal-install release? The fix for this bug was included in the 1.20 branch, but all the test runner code is in Cabal.

I would also recommend including a cabal-install-1.20.0.3 release that includes a lower bound on Cabal to ensure the bug fix gets included.

In case it's useful for you to know, I'm building a new cabal-install binary on the Stackage build server based on the code at the HEAD of the 1.20 branch. I should know in the next 24 hours or so if that solves the problem for all of Stackage.

I've put out a Cabal-1.20.0.1 with the missing patch. Sorry for the mess-up. I'll send out request for binaries so we can upload cabal-install binaries built against Cabal-1.20.0.1 and cabal-install-1.20.0.2.

Thanks @tibbe. Does this in any way affect the backported patches for cabal-install 1.18? If so, that should probably be brought up for the HP release.

1.18 is unaffected as this is a bug fix for a problem introduced in 1.20 when streaming test output was added.

I have a feeling something is still broken here. I kicked off a Stackage build on Friday, and on Sunday morning, the test suite is stalled running the following four tests:

  • FenwickTree-0.1.1
  • HCodecs-0.5
  • HTF-0.11.3.4
  • HTTP-4000.2.17

I'll try to provide more information, but I can tell you that removing --show-details=streaming caused Stackage to stall once again.

I cannot reproduce the test suite stalling in those four packages. The test suites finish for me with or without --show-details=streaming. I have tested the latest Git version and cabal-install-1.20.0.2/Cabal-1.20.0.1 on x86_64 Linux. Can you provide any more information about your test setup?

I can only reproduce this behavior when run from within Stackage, so it may be some kind of locking going on instead, I can't be certain. In any event, --show-details=streaming does work for me, which is now turned on in the Stackage code base.

The system I'm running this on in an Amazon EC2 small, running Ubuntu 14.04 64-bit.

Sorry guys, but this ticket needs to be re-opened. Testsuite for singletons package stalls again. Again, adding --show-details=streaming fixes the problem (and so does running the compiler binary).

I'm using GHC 7.8.3, Cabal 1.20.0.0, cabal-install 1.20.0.3 and latest singletons from the repo.

Is cabal unpack singletons && cd singletons-X.Y.Z.P && cabal test enough to reproduce? @ttuegel could you please take a look?

Steps to reproduce:

git clone https://github.com/goldfirere/singletons
cd singletons
git submodule init
git submodule update
cd th-desugar
cabal install
cd ..
cabal test

Note that we are currently suffering from a serious memory leak so compiling Data.Singletons.Prelude.List module will take Very Long Time.

You said you are using Cabal-1.20.0.0; the fix for this problem was in one of the point releases. I cannot reproduce with cabal-install-1.20.0.3 and Cabal-1.20.0.2.

@ttuegel you're right, upgrading to Cabal-1.20.0.2 solves the problem. I assumed that if I bootstrap cabal-install it installs the latest Cabal version (now I know it doesn't).

I think it should, but it looks like the bootstrap script didn't get updated with the point releases.

I can confirm this problem.
cabal version:

cabal-install version 1.22.9.0
using version 1.22.8.0 of the Cabal library 

ghc version: The Glorious Glasgow Haskell Compilation System, version 7.10.3

strace cabal test, with or without --show-details=streaming

wait4(20805, 0x7ffc180191bc, 0, NULL)   = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=0}} ---
rt_sigreturn({mask=[]})                 = 61
wait4(20805, 0x7ffc180191bc, 0, NULL)   = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=0}} ---
rt_sigreturn({mask=[]})                 = 61
wait4(20805, 0x7ffc180191bc, 0, NULL)   = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=0}} ---
rt_sigreturn({mask=[]})                 = 61
wait4(20805, 0x7ffc180191bc, 0, NULL)   = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=0}} ---
write(9, "\377\0\0\0\0\0\0\0", 8)       = 8
rt_sigreturn({mask=[]})                 = 61
wait4(20805, 

I 'm using detailed-0.9.

Had (Have) the same problem. Switching to stack using stack (>>stack test<<) works. Though this does not solve the issue it's still a workaround..