xd009642/tarpaulin

Instrumentation address clash errors and segfaults

Closed this issue · 83 comments

as seen in failing builds https://circleci.com/gh/holochain/holochain-rust/1514

using 0.6.11 and nightly-2018-12-26

i've tried various flags passed to tarpaulin, but builds always seem to fail with this error

also seeing "Error a segfault occured when executing test"

potentially related to #35 as we are using threads

So the instrumentation clash warnings may not be errors as such: multiple lines of code may map to the same address, and tarpaulin just warns on occurrences of that, which may affect the accuracy of results. And #35 has been partially mitigated by making --no-count the default, which disables the breakpoint on a line after that line is hit.

I'm cloning the repo and having a look into what's happening and will report back if I find anything/solve it.

The segfault also happens to me and here I thought my CI was going insane. See https://drone.exqa.de/Cogitri/tmplgen/89/3/3

@thedavidmeister I haven't managed to run your tests yet; I'm getting a compile failure I need to look into.
@Cogitri there appears to be a timeout on your test test_bad_env. I wonder if this might be leaving things in an odd state that causes the segfault in tests::test_crate_or_licensing. I'll need to look into it more.

hi, a bit more info based on my testing

there are 3 issues that started popping up:

  • the instrumentation address clash complaints
  • segfaults
  • threaded tests failing in tarpaulin that pass in regular cargo test

instrumentation clash could be fixed by disabling thinlto

segfaults could be fixed by disabling incremental compilation

still not sure about the different threading behaviour yet

ah no, i'm wrong, segfaults still happening :(

looks like the threading issues are partly our fault

it's trying to unwrap a None of something stateful that needs to be initialised in our setup

i have no idea why that would work fine in regular cargo test but fail with tarpaulin

oh, and it worked with tarpaulin about a month ago because we ran all our CI through it

so something changed, but also we're handling that change ungracefully at our end

ok, i have a plan to fix our side of things

just need a hand with the segfaults... i'm not even sure where/how to debug those, they appear in different places on CI vs. local for example

I'm trying to add cargo-tarpaulin to the xi-editor project and I also received a segfault: https://travis-ci.com/xi-editor/xi-editor/jobs/169995236.

PR here: xi-editor/xi-editor#1086

So I looked at tmplgen, being the smaller and simpler project, and the issue goes away when I replace rayon's par_iter with iter, making me think this might be another threading-based issue. This may be a fiddly one with a fair bit of investigation, so I'll keep you all posted.

Also, knowing these sorts of issues, there is always the chance it may disappear after another nightly upgrade...
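For anyone not familiar with the swap being described, here's a minimal illustration of the difference (my own sketch, not tmplgen's actual code): rayon's par_iter runs the closure on a pool of worker threads, while the plain iter keeps everything on the calling thread.

use rayon::prelude::*;

// Parallel version: rayon spawns worker threads behind the scenes,
// which is the behaviour that seemed to trip tarpaulin up.
fn double_parallel(items: &[u32]) -> Vec<u32> {
    items.par_iter().map(|x| x * 2).collect()
}

// Sequential version: same result, but single-threaded.
fn double_sequential(items: &[u32]) -> Vec<u32> {
    items.iter().map(|x| x * 2).collect()
}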

I replace rayon's par_iter with iter making me think this might be another threading based issue

Ah, yes, makes sense timing wise. Thanks for figuring that out!

Thanks for looking into this @xd009642! In case you're looking for a minimal repro, the snippet below seems to be enough to cause tarpaulin to segfault.

#![feature(async_await, await_macro, futures_api)]

#[test]
pub fn test() {
    futures::executor::ThreadPool::new();
}

@brunocodutra is that using futures-preview 0.3.0-alpha.12? Because I tried that minimal example and experienced no segfault on my system with the latest nightly.

0.3.0-alpha.12 yes.

Now that's interesting: this definitely triggers a segfault if I add it to reducer; in fact, I reduced it from a segfault I first observed in brunocodutra/reducer#11. However, I can indeed confirm it doesn't appear to cause any issues on an otherwise empty project.

I'll try a ground up approach this time, pulling in bits from reducer until I can reproduce it.

@xd009642 I think I got it, the issue seems to only manifest itself if there are at least two separate test cases spawning threads, so the snippet below triggers the segfault for me on an empty project, not on every execution, but very frequently.

#![feature(async_await, await_macro, futures_api)]

#[test]
pub fn a() {
    futures::executor::ThreadPool::new();
}

#[test]
pub fn b() {
    futures::executor::ThreadPool::new();
}

EDIT: the issue also manifests itself by spawning std::thread::Threads directly, albeit less frequently, so I assume the number of threads spawned is an important factor.
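For reference, a std-only variant of the repro (my own sketch, not the exact code behind that observation) that follows the same pattern of multiple tests each spawning threads:

use std::thread;

fn spawn_some_threads() {
    // A handful of short-lived threads per test; per the report above,
    // this is enough to trigger the problem occasionally.
    let handles: Vec<_> = (0..4).map(|_| thread::spawn(|| ())).collect();
    for handle in handles {
        handle.join().unwrap();
    }
}

#[test]
fn a() {
    spawn_some_threads();
}

#[test]
fn b() {
    spawn_some_threads();
}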

For some more context, using the current 0.7.0 release, part of my test suite segfaults when run on travis, but not on my local computer.

Both travis and my computer are running rustc 1.33.0-nightly (daa53a52a 2019-01-17).

The test file causing segfaults on travis spawns a thread with std::thread::spawn().

Same here. I only see the segfaults on travis.

I tried to do some bisecting on tarpaulin to see if this was a recently introduced issue; all that I can say is that version 0.6.8 already triggers a segfault on @brunocodutra 's example. (I couldn't try earlier versions because I can't simply upgrade them to cargo 0.32, which is required to compile the example using futures-preview.)

All this was done on rustc 1.33.0-nightly (01f8e25b1 2019-01-24), but tarpaulin 0.7.0 also displays the issue when compiled on stable rustc 1.32.0 (9fda7c223 2019-01-16).

I have something more regarding the repro. I can reproduce it with the following code that compiles on stable, using futures-executor = 0.2.1:

#[test]
pub fn a() {
    futures_executor::ThreadPool::new();
}

#[test]
pub fn b() {
    futures_executor::ThreadPool::new();
}

The interesting thing is that, if I run cargo tarpaulin (so using the stable compiler, rustc 1.32.0 (9fda7c223 2019-01-16)), I can't seem to trigger the segfault, while running cargo +nightly tarpaulin triggers the segfault roughly 1/4 of the time.

In both cases I use the cargo-tarpaulin binary built from rust stable, I only change the compiler used to compile the tests.

Actually, the segfault also occurs when running cargo +beta tarpaulin, so on rustc 1.33.0-beta.5 (1045131c1 2019-01-31).

It looks like something was introduced in rustc 1.33.0 that breaks tarpaulin.

Perhaps you could do a docker image based on rust:stable for the time being then, @xd009642?

There should currently be new images building for the stable and nightly compilers on the develop branch. Just waiting on dockerhub.

So as a slight update, I've added extra debug logging to help diagnose internal issues in tarpaulin, activated via the --debug flag (develop branch only for now). It can spew a ton of info, so anyone who wants to add anything to the issue should attach a file if it's long or post a link to a gist.

For all the examples on this issue which recreated the issue for me, occasionally I had a segfault and other times tarpaulin just ended up hanging and had to be killed. I've figured out what was causing it to hang! I was assuming anything that wasn't a trap or a segfault was an ignorable/forwardable signal. The times tarpaulin was hanging were because of a SIGILL - illegal instruction.

Hopefully, now I've got an area to focus in on it shouldn't take me too long to resolve this!

So I was debugging this to try and figure out what compiler change exactly caused it. Here are some notes that might help you:

  1. In @vberger's test case, there were 2 kinds of crashes I could find from the core dumps:
    1.1. Crashes in ThreadPoolBuilder::create. These crashes have this form:

    0x7fffffffcb40: 0x007d2210      0x00000000      0x00000001      0x00000000
    0x7fffffffcb50: 0xffffcff0      0x00000000      0xffffcff0      0x00000000
    0x7fffffffcb60: 0xffffcf90      0x00007fff      0x00000000      0x00000000
    0x7fffffffcb70: 0x007d0910      0x00000000      0x00000000      0x00000000
    
       0x00000000004196ca <+90>:    mov    %rdi,0xb8(%rsp)  
       ...
       0x00000000004199c7 <+855>:   mov    0xb8(%rsp),%rcx                                                        
    => 0x00000000004199cf <+863>:   mov    %rax,0x8(%rcx)  
    
    #0  0x00000000004199cf in futures_executor::thread_pool::ThreadPoolBuilder::create (self=0x7fffffffcf90)
        at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-executor-0.2.1/src/thread_pool.rs:262
    
    (gdb) x/8x $rcx
    0xffffcff0:     Cannot access memory at address 0xffffcff0
    

    i.e., it is trying to write to %rcx, which comes from %rdi, which should contain the struct return address, which should be 0x7fffffffcff0, but instead it is writing to 0xffffcff0 - i.e., the top 32 bits are truncated
    1.2. Stack misalignment crashes. These come from calling movaps when the stack is aligned to 8 (mod 16) bytes, with the misalignment appearing to come from tarpotest::b::{{closure}}.

  2. The "correct" disassembly for tarpotest::b appears to be this:

    Dump of assembler code for function tarpotest::b::{{closure}}:
       0x0000000000403860 <+0>:     push   %rax
       0x0000000000403861 <+1>:     mov    %rdi,(%rsp)
       0x0000000000403865 <+5>:     callq  0x403420 <tarpotest::b>
       0x000000000040386a <+10>:    callq  0x403880 <test::assert_test_result>
       0x000000000040386f <+15>:    pop    %rax
       0x0000000000403870 <+16>:    retq   
    
    Dump of assembler code for function tarpotest::b:
       0x0000000000403420 <+0>:     sub    $0x18,%rsp
       0x0000000000403424 <+4>:     mov    %rsp,%rdi
       0x0000000000403427 <+7>:     callq  *0x3c7cdb(%rip)        # 0x7cb108
    => 0x000000000040342d <+13>:    mov    %rsp,%rdi
       0x0000000000403430 <+16>:    callq  0x4036b0 <core::ptr::real_drop_in_place>
       0x0000000000403435 <+21>:    add    $0x18,%rsp
       0x0000000000403439 <+25>:    retq   
    

    Note that tarpotest::b::{{closure}} appears later in address space, but calls tarpotest::b. The breakpoint insertion code puts breakpoints at the start of both functions, resulting in this:

    Dump of assembler code for function tarpotest::b:
       0x0000000000403420 <+0>:     sub    $0x18,%rsp
       0x0000000000403424 <+4>:     int3   
       0x0000000000403425 <+5>:     mov    %esp,%edi
       0x0000000000403427 <+7>:     callq  *0x3c7cdb(%rip)        # 0x7cb108
    => 0x000000000040342d <+13>:    mov    %rsp,%rdi
       0x0000000000403430 <+16>:    callq  0x4036b0 <core::ptr::real_drop_in_place>
       0x0000000000403435 <+21>:    add    $0x18,%rsp
       0x0000000000403439 <+25>:    retq   
    
    Dump of assembler code for function tarpotest::b::{{closure}}:
       0x0000000000403860 <+0>:     int3
       0x0000000000403861 <+1>:     mov    %rdi,(%rsp)
       0x0000000000403865 <+5>:     callq  0x403420 <tarpotest::b>
       0x000000000040386a <+10>:    callq  0x403880 <test::assert_test_result>
       0x000000000040386f <+15>:    pop    %rax
       0x0000000000403870 <+16>:    retq   
    
  3. Breakpoint execution in tarpaulin should take the int3 trap, get to the tarpaulin code, then call the code with the original instruction restored.

  4. When I added a println! to log all the ptrace events, in the non-segfaulting runs I got a ptrace event for the breakpoint at 0x403424, but in the segfaulting runs, I didn't. Looking at memory at 0x403424, the 0xcc byte was there, so the breakpoint should have been hit.
    If this is indeed the case and execution somehow skips past the 0xcc, then the mov %esp,%edi would be executed, resulting in a stack pointer "truncated" to 32 bits. Similarly, if execution skips past the int3 at 0x403860, then push %rax wouldn't be executed, leaving the stack aligned to 8 (mod 16).

  5. When adding some debugging, I noticed that the value that goes to %rcx in the segfault case is already truncated when coming out of tarpotest::b, which increases my belief that the mov was done as a mov %esp,%edi.

Therefore, I suspect that there are some cases in which this code "skips" an int3 event, especially if it occurs quickly after another one (4 instructions afterwards!), and proceeds with execution of the program as if it wasn't there. I don't know ptrace well enough to know whether this can or can't happen.

So this may confirm some of my suspicions about the issue. Unfortunately, with ptrace whenever you continue or step it sends the continue signal to every thread of the process. I would expect continuing a thread on an INT3 to just stay on the INT3 and I've assumed the same for step. However, if step can increment a PC past an INT3 that could cause these issues...

If this was the case I would have expected to see these spurious failures earlier in tarpaulin's life so I'm still unsure about the cause. If that is the cause though it shouldn't be too hard to prove - in theory. I'm away from my main pc this weekend but I'll try and trial something when I get home on Sunday

So today I think I came up with a solution to this problem and potentially #207 and #35. But this is going to be a reasonable change in the state machine so may not drop this week.

Essentially, instead of getting the latest signal, handling it, collecting coverage then resuming I'm going to get all available signals and do two passes of the list.

More technical explanation for the interested below

Essentially, if we have an instruction whose opcode is more than 1 byte, e.g. MOV, and we instrument it, we write the interrupt instruction to the start of the word. Therefore MOV r/m8,r8, which is 88 /r, could become CC /r.

Then when we get the interrupt the program counter is at the /r, so we replace CC with the original instruction, move the program counter back 1, and then execute the original instruction with either a ptrace continue or a step.
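A minimal sketch of that dance (illustrative only, with stub functions standing in for the real ptrace wrappers; this is not tarpaulin's actual code):

const INT3: u8 = 0xCC;

struct Breakpoint {
    addr: u64,          // address of the instrumented instruction
    original_byte: u8,  // first byte of the original opcode, e.g. 0x88 for MOV r/m8,r8
}

// Stubs standing in for PEEKTEXT/POKETEXT/GETREGS/SETREGS/SINGLESTEP calls.
fn write_byte(_pid: i32, _addr: u64, _byte: u8) {}
fn read_pc(_pid: i32) -> u64 { 0 }
fn write_pc(_pid: i32, _pc: u64) {}
fn single_step(_pid: i32) {}

fn handle_trap(pid: i32, bp: &Breakpoint, count_hits: bool) {
    // The trap fires after the 0xCC byte executes, so the program counter
    // sits one byte past the start of the original instruction.
    debug_assert_eq!(read_pc(pid), bp.addr + 1);

    // Put the original first byte back so the instruction is valid again.
    write_byte(pid, bp.addr, bp.original_byte);

    // Rewind the program counter onto the start of the instruction.
    write_pc(pid, bp.addr);

    // Execute the original instruction.
    single_step(pid);

    // With --no-count (the default mentioned earlier) the breakpoint stays
    // disarmed after the first hit; re-arm it only if hit counts are wanted.
    if count_hits {
        write_byte(pid, bp.addr, INT3);
    }
}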

However, there's a problem with this. If the interrupt is hit by two threads simultaneously then, with the old approach, I'd re-enable the instruction, move the first thread back and then use ptrace to continue. But ptrace continues and steps are applied to all threads in the process. So now we have a second thread that's in the middle of an instruction, which then tries to execute an invalid instruction.

By calling waitpid until it stops giving me signals, however, I can build up a view of all the threads that have been trapped and when those traps would cause this issue, and solve it. In theory.
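As a rough illustration of that "collect everything pending, then act in passes" idea, here is a sketch with placeholder types and helpers rather than tarpaulin's real state machine:

// Placeholder event type; the real implementation would work with the
// WaitStatus values that waitpid returns.
enum Event {
    Breakpoint { thread: i32, addr: u64 },
    Other { thread: i32 },
}

// Hypothetical non-blocking wait; returns None once nothing more is pending.
fn try_wait() -> Option<Event> { None }

// More hypothetical helpers standing in for the real ptrace plumbing.
fn record_hit(_addr: u64) {}
fn prepare_resume(_thread: i32, _addr: u64) {}
fn resume_all() {}

fn process_pending() {
    // First, drain everything the kernel has queued so we know about every
    // stopped thread before resuming any of them.
    let mut events = Vec::new();
    while let Some(event) = try_wait() {
        events.push(event);
    }

    // Pass 1: record coverage for every breakpoint that was hit.
    for event in &events {
        if let Event::Breakpoint { addr, .. } = event {
            record_hit(*addr);
        }
    }

    // Pass 2: fix up each trapped thread (restore the instruction, rewind the
    // PC) before issuing a single resume, so no thread is left mid-instruction.
    for event in &events {
        if let Event::Breakpoint { thread, addr } = event {
            prepare_resume(*thread, *addr);
        }
    }
    resume_all();
}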

This is still just a working theory that I'm in the process of coding up to test. If anyone sees any mistakes in my reasoning or has any additional ideas feel free to lend your 2 cents 👍

Can't wait for the fix! In the meantime, I found a workaround by forcing execution to a single core using taskset -c 0 cargo tarpaulin ... or forcing Docker to a single core using docker run --cpuset-cpus="0" ...

It's much slower obviously, but still better than having no coverage report.

Okay, the fix is in develop. I've tested it with some futures and rayon stuff that must have been mentioned above because it's in my issues-190 folder 👍 @Cogitri there appear to be some other issues with tmplgen; I'll figure out what they are and open another issue.

For everyone else, the fix is on the develop branch; please try it out and report back, and if it's not fixed for everyone I'll reopen the issue.

Great, thanks for looking into it! :)

ZoeyR commented

I still see this on the most recent version from develop. You can look at a failed build here: https://dev.azure.com/dgriffen/nestor/_build/results?buildId=122.

@dgriffen so I've run tarpaulin on nestor about 10 times and haven't managed to recreate the issue. Could you add --debug to your tarpaulin call and include the debug logs?

I could reproduce it as well on the example described in #207 (comment) with the following logs:

[TRACE tarpaulin] Debug mode activated
[INFO tarpaulin] Running Tarpaulin
[INFO tarpaulin] Building project
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
[DEBUG tarpaulin] Processing tarpotest
[INFO tarpaulin] Launching test
[INFO tarpaulin] running /home/levans/test/tarpotest/target/debug/deps/tarpotest-c5aca069c10cd222
[TRACE tarpaulin] Source analysis for src/lib.rs
[TRACE tarpaulin] Ignorable lines: [1, 4, 5, 9]
[TRACE tarpaulin] Adding trace at address 0x403640 in src/lib.rs:2
[TRACE tarpaulin] Adding trace at address 0x403644 in src/lib.rs:3
[TRACE tarpaulin] Adding trace at address 0x403660 in src/lib.rs:7
[TRACE tarpaulin] Adding trace at address 0x403664 in src/lib.rs:8
[TRACE tarpaulin] Adding trace at address 0x403980 in src/lib.rs:7
[TRACE tarpaulin] Adding trace at address 0x403960 in src/lib.rs:2
[TRACE tarpaulin] Test PID is 16208
[TRACE tarpaulin] Caught inferior transitioning to Initialise state
[TRACE tarpaulin] Initialised inferior, transitioning to wait state
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403960
[TRACE tarpaulin] Incrementing hit count for trace
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403960
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403640
[TRACE tarpaulin] Incrementing hit count for trace
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403643
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403644
[TRACE tarpaulin] Incrementing hit count for trace
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403646
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16209
[TRACE tarpaulin] Result queue is [Stopped(Pid(16209), SIGSTOP)]
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16210
[TRACE tarpaulin] Result queue is [Stopped(Pid(16210), SIGSTOP)]
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16211
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16212
[TRACE tarpaulin] Result queue is [Stopped(Pid(16212), SIGSTOP)]
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403980
[TRACE tarpaulin] Incrementing hit count for trace
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP), Stopped(Pid(16211), SIGSTOP), PtraceEvent(Pid(16209), SIGTRAP, 6)]
[TRACE tarpaulin] Hit address 0x403980
[TRACE tarpaulin] Child exiting
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP), PtraceEvent(Pid(16212), SIGTRAP, 6)]
[TRACE tarpaulin] Hit address 0x403660
[TRACE tarpaulin] Incrementing hit count for trace
[TRACE tarpaulin] Child exiting
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16211), SIGTRAP, 6)]
[TRACE tarpaulin] Child exiting
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP), Exited(Pid(16212), 0)]
[TRACE tarpaulin] Hit address 0x403663
[TRACE tarpaulin] Exited Pid(16212) parent Pid(16208)
[TRACE tarpaulin] Result queue is [Exited(Pid(16211), 0)]
[TRACE tarpaulin] Exited Pid(16211) parent Pid(16208)
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16210), SIGTRAP, 6)]
[TRACE tarpaulin] Child exiting
[TRACE tarpaulin] Result queue is [Exited(Pid(16209), 0)]
[TRACE tarpaulin] Exited Pid(16209) parent Pid(16208)
[TRACE tarpaulin] Result queue is [Exited(Pid(16210), 0)]
[TRACE tarpaulin] Exited Pid(16210) parent Pid(16208)
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16213
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16214
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16215
[TRACE tarpaulin] Result queue is [Stopped(Pid(16214), SIGSTOP)]
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16216
[TRACE tarpaulin] Result queue is [Stopped(Pid(16216), SIGSTOP)]
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGSEGV)]
[TRACE tarpaulin] No action suggested to continue tracee. Attempting a continue
[ERROR tarpaulin] Failed to get test coverage! Error: Failed to run tests: A segfault occurred while executing tests
ZoeyR commented

@xd009642 sure. Just to make sure, did you run it on the PR branch that produces the issue?

Ah no I didn't. I got thrown by the master mentioned in the azure GUI, but that's just saying what the PR would be merging into.

I've reproduced with @vberger's example so I'll use that for development and then test on nestor to make sure that I can reproduce the issue and it's fixed 👍

Hi @xd009642, all my test cases pass when executing "cargo test", but when executing "cargo tarpaulin -v" some of them give the error "Error a segfault occured when executing test".
I am not able to infer the reason behind this error.
Please help me out ASAP so that I can continue my work.

I just felt it was necessary to comment on this :)

Please help me out ASAP so that I can continue my work.

I'm sure this is being worked on as best as possible in the limited time everyone participating in this FOSS project has, so sit tight and wait for a fix to arrive please.

Also, next time you report a bug, please include more info: What code is this being run on? What's the complete log of tarpaulin in debug mode? What version of tarpaulin are you using, and so on.

Well one pattern I've observed is that if I get a result queue of [Stopped(Pid(test_parent), SIGSTOP), PtraceEvent(Pid(child_thread), SIGTRAP, 6)] everything's good. If I get one that's just [PtraceEvent(Pid(child_thread), SIGTRAP, 6)] it consistently segfaults afterwards.

I'm still trying to figure out why that might be and how I can prevent it happening. But if anyone has any ideas from that, I'm open to anything.

These threads were also all spawned in a previous test and are only now running during the next test; I don't think that matters, but maybe?

this is beyond what i've done in rust soz!

The current development docker image with taskset -c 0 cargo tarpaulin ... as mentioned by @appaquet is working for me. One test is failing with a segfault (not when running cargo test) but a report file is written.

@xd009642 Unfortunately, I'm still getting the crash with the latest 0.8.0. Here's a gist that crashes almost all the time on my machine. On my bigger project, I could get 0.7.0 to work by pinning the process to 1 core as I mentioned before, but it now crashes with 0.8.0 with or without pinning.

@xd009642 is this supposed to be fixed now?

should i be re-testing against my codebase(s)?

The changes fixed it for some people, not others.

I released it before everyone was sorted because I'd gone a while without a release and there were other features, and also I'm in the process of changing job and moving house so things are pretty hectic right now!

Didn't realise I had a commit comment that closed this issue though!

np, thanks!

i'm still getting segfaults

getting a branch setup that has debug output in the CI to reference (using tarpaulin 0.8.0 currently)

holochain/holochain-rust#1405

I think there are far more minimal examples above but FWIW, I'm also experiencing segfaults on 0.8.0. Previously, I'd been running into tests timing out.

Logs with --debug:

https://travis-ci.org/apertium/apertium-stats-service/builds/535572221#L4883

@xd009642 I'm seeing these kinds of segfaults on HdrHistogram/HdrHistogram_rust#82 as well. To reproduce, ignore all the tests in tests/sync.rs that start with mt_ (they take a while to run), rename every fn to fn sync_, and run with

$ cargo tarpaulin -v -- sync --test-threads=1

to only run the tests in that file (it'd be handy to be able to give tarpaulin -t like cargo test). It segfaults regularly, usually either on no_block_empty_sync or sync_record_through. no_block_empty_sync run in isolation does not appear to segfault, so I suspect the guess further up that the issue arises when multiple tests spawn threads is accurate.

I also experience segfaults both locally and on azure with my build

https://dev.azure.com/toshi-search/toshi-search/_build/results?buildId=189&view=logs

djc commented

This also seems to happen in Quinn's coverage runs now that the rustc bugs crashing the test coverage compilation there have finally been fixed.

@xd009642 I think this is a pretty high priority issue. It is the thing that's preventing me from using tarpaulin in nearly all my crates at the moment.

I agree... Personally, I have switched to grcov for the time being... It's a bit trickier to get working, but it's stable once set up (see my script here and its usage in GitHub actions here).

I'll happily switch back to Tarpaulin for its simplicity once the issue has been fixed.

I know, but unfortunately I'm still stuck on how to actually solve this, plus a big job change and move have kept me busy irl 😢 I'll dedicate serious time to this this weekend and see if I can make any progress past my multiple not-working experiments.

@xd009642 It's okay, life happens, and your well-being is the most important! I'm sure if you gave some instructions for how others can help track this down, there are some people here who'd love to help :)

So you can't debug tarpaulin with gdb because ptrace can't trace ptrace. Using the --debug flag on tarpaulin and the dwarf dumps/disassembly from objdump on the test binary, try to reconstruct the behaviour of the program and work out where there could be essentially threading conflicts (because the binary is multithreaded but ptrace gives you a single-threaded view into it).

Another route is working out how kcov and tarpaulin differ and seeing if the differences could be the root of the problem. The developer wiki ptrace section may be of help, as may all the previous comments in this thread.

So I thought one of the patches to get coz-rs working might solve this (increasing the signal stack, as overflowing it causes segfaults). It didn't work, but the process of trying it led me to get some stats on how often this happens. I'm observing it in 11% of runs with the minimal example from before:

#![feature(async_await, await_macro, futures_api)]

#[test]
pub fn a() {
    futures::executor::ThreadPool::new();
}

#[test]
pub fn b() {
    futures::executor::ThreadPool::new();
}

Hopefully now, with a test for this in one of my local branches and vague statistics on how often it occurs, I'll be able to tell more easily when I'm on the right track! Current work is on the branch segfault-investigation.

It didn't make an improvement at all... To try and aid in debugging for me and anyone who wants to take a crack at it, I've gutted tarpaulin, keeping only the ptrace and statemachine stuff, and added some code to push events into a timeline of what happens and generate a timeline with gnuplot: https://github.com/xd009642/minitarp. It's still very much a WIP, but here's some example output with the futures project in tests/data/futures in the branch segfault-investigation.

Currently, the most interesting thing to me is why some threads appear but I don't appear to get a ptrace clone event 😕

[minitarp timeline plot: output]

X axis ticks are now in minitarp's graphing, and I've also put in more of the ptrace events, so you now get the WaitStatuses I get in order, the data interpreted from that, and the actions taken. Also, this blog post begins to hint at some of the difficulties faced with ptrace: https://www.cyphar.com/blog/post/20160703-remainroot-ptrace-hell

[minitarp timeline plot: output]

Part of me is hoping this is approaching the level of info for someone smarter than me to tackle this for hacktoberfest 😂 but I'm gonna soldier on regardless.

EDIT: Something else I just noticed: RIP 0x4043c0 is a line from test case b, and the previous two points are the step and continue for test a. I have test-threads set to 1 for the inferior test, so it's going from test case a to b in the tracing before the threads for test case a have been picked up.

So I just got output that passes with no segfaults for the first time! Still analysing results and figuring it out, but here's a picture of the run for interested parties.

[minitarp timeline plot of the passing run: output_pass]

You can also see the changes and test them here: #259. This so far hasn't failed once for me and I've done 100 runs of the minimal example with 4 tests creating threadpools! This previously failed 100% of the time for me (the 2-test version failed 10% of the time).

I'm now running tarpaulin on a project with 9 tests using thread pools 100 times. Once this is done if there are no failures I'll merge the branch into develop

And it passed! I'll keep it in develop until the 3rd and then I'll merge it, just to give anyone a chance to find issues. I'm very happy with this right now though; I'm seeing tarpaulin do things it's never done before 😄

🎉

Btw, do you plan to write some explanation of what the issue was? I (at least) would find this interesting to know. This is quite puzzling as it appears that the only significant change in #259 is the removal of RUST_TEST_THREADS=1 ... ?

@vberger I'll do a more thorough write up and consider posting it as a blog or just putting it here. But this fix is two parts: the first is the queue of pending wait signals that I did a few months ago, which fixed a big tarpaulin issue with multi-threading and is what allowed me to remove the test_threads=1 limitation. The test_threads 1 part is based on how libtest has changed its test runner, and I'll also go into details on that 😄

@xd009642 Could you also publish a new Docker image? That way I can start taking advantage of this immediately! :D

It's on the develop tag, master coming in tonight since there's been no reported regressions

New version 0.9.0 is being released as we speak and the docker image will be updated as part of that process once the really long travis build finishes!

It's been "fun" but it's now time to close this issue for good!

I have some bad news: tarpaulin 0.9 segfaults on several of wayland-rs's tests, including several test files that do not do any multithreading. I'll try to make some reasonably small example.

Here is a repro:

use std::sync::Arc;

fn do_it() {
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
}

#[test]
fn test_1() {
    do_it()
}

#[test]
fn test_2() {
    do_it()
}

Running tarpaulin 0.9 on this regularly (30-40% of the time) results in:

Error: "Failed to get test coverage!
Error: Failed to run tests:
Error running test - SIGILL raised in 40531"

And sometimes it also results in what appears to be a deadlock, or even the usual:

Error: "Failed to get test coverage!
Error: Failed to run tests: A segfault occurred while executing tests"

Increasing the number of Arc::new(...) in the body of the function seems to increase the chances of a segfault or SIGILL occurring.

clux commented

While there might be more issues with this ^, I just wanted to say thanks a lot for all the hard work on this release. This fixed all the issues I had with tarpaulin on a mid-sized project that had heavy use of rayon. It still crashes with --test-threads 1, but removing that flag actually worked 👍

Although this is being reopened, my build now works beautifully with the new change and my project makes some pretty extensive use of the futures ecosystem. Amazing work

https://dev.azure.com/toshi-search/toshi-search/_build/results?buildId=354

@vberger I'm getting much lower incidence rates for the failure (2/1000 attempts failed) and I've got 25 calls to Arc::new and 4 tests calling do_it. There's a chance this one may be trickier

Indeed, the reproducibility seems to depend on the computer running the test; on my other computer it is much harder to reproduce.

Still, I managed to reproduce it a few times with the do_it() function being simply

fn do_it() {
    for _ in 0..10000 {
        let _a = vec![0];
    }
}

and running it 8 times in parallel.

Given that there is now nothing especially multithreaded in these tests, I strongly suspect that tarpaulin somehow interferes with the behavior of the allocator. This possibly has the same cause as #264?

Hmm, this is interesting. In evmap, I now get:

cargo tarpaulin --features "" --out Xml
========================== Starting Command Output ===========================
/bin/bash --noprofile --norc /__w/_temp/57325ed4-7e46-4e9b-b76a-e249415e0eef.sh
[INFO tarpaulin] Running Tarpaulin
[INFO tarpaulin] Building project
[INFO tarpaulin] Launching test
[INFO tarpaulin] running /__w/1/s/target/debug/deps/lib-94fc5ea0bf28285e

running 23 tests
Error: "Failed to get test coverage! Error: Failed to run tests: Error running test - SIGILL raised in 369"

So no segfault any more, but an illegal instruction. That seems even weirder!

Inferno still seems to hit the segfault though. This is all running with the latest Docker image.

So there's a chance the SIGILL and SIGSEGV might be the same issue just manifesting in slightly different ways. I've got an experiment currently running to see if I've made any progress or not. If I have I'll push the branch and ask people here to test it out on their own projects 👀

Ok, so I ran on rust-evmap as @jonhoo mentioned it previously. With the change in the affinity branch I did 100 runs with no SIGILL or SIGSEGV. Running the current latest tarpaulin from crates.io, I can already see some failures before finishing the 100 runs.

So yeah if anyone wants to try it out and report back that would be appreciated!

Edit: results for the latest release: failed 11/100 times!

Another 100 runs on evmap and no failures. Now trying inferno (will edit with some results however far it gets before I shut down).

EDIT: There appear to be failing tests on inferno. Maybe something I'm missing dependency-wise? This kinda messed up my failure rate script, but when I did some runs without suppressing the output I didn't see any segfaults... @jonhoo do you mind trying it out and letting me know? Otherwise when it gets to the weekend I'll merge, as it doesn't have a detrimental effect.

@xd009642 I'm away at a conference, so it may be hard for me to find the time, but my guess is that the test failures are due to git submodules. Just run git submodule update --init and the tests should work! Also, awesome work on this. I'm so excited to see tarpaulin working again :D

I'll try that tonight. And have fun at the conference 👍

Yeah, that was it. And I'm doing a run of 100. With the latest master tarpaulin it segfaulted every time I tried. So far this fix has worked every time (but that's just 3 times). I'll let it get through all 100 runs, but if it passes that I'm merging and might do a release tonight 👀

Edit: Release tonight is coming; >1000 runs and not a single segfault or sigill!

It appears to be working! Thank you!

Nice! So as this issue ended up resolving two different issues, I'm closing it, and if there's another segfault or sigill a new issue should be opened 🎉