heisenbug: debug builds `Integer::repr_discr` has transient ICE on `compile-fail/enum-discrim-too-small2.rs`
pnkfelix opened this issue · 27 comments
My config.toml:
[rust]
codegen-units = 8
debug = true
optimize = true
debug-assertions = true
debuginfo = true
debuginfo-lines = true
And my (odd?) build invocation:
% time python /Users/fklock/Dev/Mozilla/rust-mirborrowck/.git/../x.py test src/tools/tidy && \
time python /Users/fklock/Dev/Mozilla/rust-mirborrowck/.git/../x.py build --stage 1 --incremental --keep-stage 0 src/libstd && \
time python /Users/fklock/Dev/Mozilla/rust-mirborrowck/.git/../x.py test --stage 1 src/test/{mir-opt,compile-fail,run-pass}
Results in stage 1 binary where we get weirdness on this compile-fail test:
% ./objdir-dbgopt/build/x86_64-unknown-linux-gnu/stage1/bin/rustc src/test/compile-fail/enum-discrim-too-small2.rs
error: literal out of range for i8
--> src/test/compile-fail/enum-discrim-too-small2.rs:18:11
|
18 | Ci8 = 223, //~ ERROR literal out of range for i8
| ^^^
|
note: lint level defined here
--> src/test/compile-fail/enum-discrim-too-small2.rs:11:9
|
11 | #![deny(overflowing_literals)]
| ^^^^^^^^^^^^^^^^^^^^
error: literal out of range for i16
--> src/test/compile-fail/enum-discrim-too-small2.rs:25:12
|
25 | Ci16 = 55555, //~ ERROR literal out of range for i16
| ^^^^^
error: literal out of range for i32
--> src/test/compile-fail/enum-discrim-too-small2.rs:32:12
|
32 | Ci32 = 3_000_000_000, //~ ERROR literal out of range for i32
| ^^^^^^^^^^^^^
error: internal compiler error: librustc/ty/layout.rs:553: Integer::repr_discr: `#[repr]` hint too small for discriminant range of enum `Ei64
note: the compiler unexpectedly panicked. this is a bug.
note: we would appreciate a bug report: https://github.com/rust-lang/rust/blob/master/CONTRIBUTING.md#bug-reports
note: rustc 1.25.0-dev running on x86_64-unknown-linux-gnu
thread 'rustc' panicked at 'Box<Any>', librustc_errors/lib.rs:508:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.
I am pretty sure this is specific to debug builds. At least, that's my current best guess as to why this hasn't been caught by bors.
The code layout.rs
looks like this:
if let Some(ity) = repr.int {
let discr = Integer::from_attr(tcx, ity);
let fit = if ity.is_signed() { signed_fit } else { unsigned_fit };
if discr < fit {
bug!("Integer::repr_discr: `#[repr]` hint too small for \
discriminant range of enum `{}", ty)
}
return (discr, ity.is_signed());
}
it should probably be signalling a proper error, or trusting that the lint will eventually fire and catch this, rather than just triggering a call to bug!
(since this does not represent a bug in the compiler itself, right?)
(It seems easy enough to add another LayoutError
variant and returning that from Integer::repr_discr
. Trying a build with that change now.)
Hmm. Apparently we have run-pass
test that is relying on Integer::repr_discr
call not erroring! Investigating...
(I am now suspecting that there is a Layout
bug lurking here...)
FYI the test in question is https://github.com/rust-lang/rust/blob/master/src/test/run-pass/discrim-explicit-23030.rs
I'm double checking what's going on in the layout code now, to see if its something obviously wrong with the these explicitly assigned values...
Update: Something might have been terribly wrong with my builds. Things went south in other very strange ways while I was working on this.
Yes, as far as I can tell, at least some of the problems I was observing seem to have gone away after x.py clean
... :(
Closing this ticket.
(while I am not yet willing to reopen this ticket, I am seeing this same scenario arise on two different computers, which makes me think that while it may be some transient issue, its something we still may need to look more carefully into. But whatever's going on, the fact that it does not consistently reproduce is sign of a deeper problem than what this bug description indicates...)
Update: to be clear: It consistently reproduces for some compiler builds. That is, I can re-run the test over and over and see it. The problem is that small unrelated changes to the rustc
source (e.g. sometimes just adding debug!
calls, or expanding an existing one) can cause the problem to go away, and (AFAICT) it is not trivial to bring the problem back.
... just saw it arise again (on a build that I am pretty sure was working a little while ago...). For the record, on my Linux box, this is the command line I am invoking for each build:
../x.py test src/tools/tidy && ../x.py build --incremental --keep-stage 0 --stage 1 src/libstd && ../x.py test --stage 1 src/test/{mir-opt,compile-fail,run-pass}
I am becoming tempted to start digging into why this keeps cropping up for me...
For the record, I am seeing cases where (presumably inlined) calls to Integer::fit_signed are returning the wrong value for edge cases.
That is, after instrumenting the panic message with more data, I get a message that min: -9223372036854775808
ends up being mapped to I128
by the aforementioned function.
And that should not be happening. (That is, it is presumably some sort of code generation bug.)
Reopening ticket. I'm seeing this on a build on my mac (it can crop up on either linux or mac) so I'm going to spend some time digging into the generated machine code and try to get to the bottom of what's going on.
Okay so here's what I've found after much investigation (and learning how to get Visual Studio Code to do various breakpoint machinations):
Stepping through the instructions, it seems like the essential problem is arising from a miscompiled instruction sequence corresponding to an inlined call to rustc::tylayout::Integer::fit_signed
. here's the definition for that method, for reference:
/// Find the smallest Integer type which can represent the signed value.
pub fn fit_signed(x: i128) -> Integer {
match x {
-0x0000_0000_0000_0080...0x0000_0000_0000_007f => I8,
-0x0000_0000_0000_8000...0x0000_0000_0000_7fff => I16,
-0x0000_0000_8000_0000...0x0000_0000_7fff_ffff => I32,
-0x8000_0000_0000_0000...0x7fff_ffff_ffff_ffff => I64,
_ => I128
}
}
The (lldb disassemble
'd) instruction sequence I am seeing executed in the erroneous case looks like this (where the input to the function is being passed via %r14
and %r15
in the context of where the call is inlined):
(executed instructions annotated with values in relevant registers as each is run)
instruction address | instruction | %rax | %rcx | %rdx | %rdi | %r12 | %r14 | %r15 | rflags | cs |
---|---|---|---|---|---|---|---|---|---|---|
0xffffffff80000000 |
0xffffffffffffffff |
0xffffffffffffffff |
0x0000000000000000 |
0x0000000000000017 |
0xffffffffffffffff |
0x8000000000000000 |
0x000000000000002b |
|||
0x1066ccb36 <+12534> |
movq %r15, %rax |
0x8000000000000000 |
||||||||
0x1066ccb39 <+12537> |
addq $0x80, %rax |
0x8000000000000080 |
0x0000000000000282 |
|||||||
0x1066ccb3f <+12543> |
movq %r14, %rcx |
0xffffffffffffffff |
||||||||
0x1066ccb42 <+12546> |
adcq $0x0, %rcx |
0xffffffffffffffff |
0x0000000000000286 |
|||||||
0x1066ccb46 <+12550> |
cmpq $0x100, %rax |
0x0000000000000a02 |
||||||||
0x1066ccb4c <+12556> |
sbbq $0x0, %rcx |
0xffffffffffffffff |
0x0000000000000286 |
|||||||
0x1066ccb50 <+12560> |
jae 0x1066ccbbf |
0x0000000000000286 |
||||||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
0x1066ccbbf <+12671> |
movq %r15, %rcx |
0x8000000000000000 |
||||||||
0x1066ccbc2 <+12674> |
addq $0x8000, %rcx |
0x8000000000008000 |
||||||||
0x1066ccbc9 <+12681> |
movq %r14, %rdx |
0xffffffffffffffff |
||||||||
0x1066ccbcc <+12684> |
adcq $0x0, %rdx |
0xffffffffffffffff |
||||||||
0x1066ccbd0 <+12688> |
movb $0x1, %al |
0x8000000000000001 |
||||||||
0x1066ccbd2 <+12690> |
cmpq $0x10000, %rcx |
0x0000000000000a06 |
||||||||
0x1066ccbd9 <+12697> |
sbbq $0x0, %rdx |
0xffffffffffffffff |
0x0000000000000286 |
|||||||
0x1066ccbdd <+12701> |
jb 0x1066ccbfd |
|||||||||
0x1066ccbdf <+12703> |
movl $0x80000000, %eax |
0x0000000080000000 |
||||||||
0x1066ccbe4 <+12708> |
addq %rax, %r15 |
0x8000000080000000 |
||||||||
0x1066ccbe7 <+12711> |
adcq $0x0, %r14 |
0xffffffffffffffff |
||||||||
0x1066ccbeb <+12715> |
shrdq $0x20, %r14, %r15 |
0xffffffff80000000 |
0x0000000000000287 |
|||||||
0x1066ccbf0 <+12720> |
shrq $0x20, %r14 |
0x00000000ffffffff |
0x0000000000000a07 |
|||||||
0x1066ccbf4 <+12724> |
orq %r15, %r14 |
0xffffffffffffffff |
0x0000000000000286 |
|||||||
0x1066ccbf7 <+12727> |
movb $0x2, %al |
0x0000000080000002 |
||||||||
0x1066ccbf9 <+12729> |
je 0x1066ccbfd |
0x0000000080000002 |
||||||||
0x1066ccbfb <+12731> |
movb $0x4, %al |
0x0000000080000004 |
The problem is that this is ending up with the value $0x4
(aka Integer::I128
) in the result register %al
, while this input should be yielding $0x3
(Integer::I64
).
The assembly code that I see via passing --emit=asm,...
to the relevant stage0/rustc
invocation (I'm debugging a stage1 build) is:
Portion of `--emit=asm` output
LBB99_376:
.loc 13 467 0
movq %r13, %rax
addq $128, %rax
movq %r14, %rcx
adcq $0, %rcx
cmpq $256, %rax
sbbq $0, %rcx
jae LBB99_383
Ltmp3275:
.loc 13 0 0 is_stmt 0
xorl %eax, %eax
jmp LBB99_386
LBB99_378:
Ltmp3276:
.loc 13 468 0 is_stmt 1
movq %r13, %rcx
addq $32768, %rcx
movq %r14, %rdx
adcq $0, %rdx
movb $1, %al
cmpq $65536, %rcx
sbbq $0, %rdx
jb LBB99_381
.loc 13 469 0
movl $2147483648, %ecx
addq %r13, %rcx
movq %r14, %rdx
adcq $0, %rdx
shrdq $32, %rdx, %rcx
shrq $32, %rdx
movb $2, %al
orq %rcx, %rdx
je LBB99_381
.loc 13 470 0
movabsq $-9223372036854775808, %rax
addq %rax, %r13
movq %r14, %rax
adcq $0, %rax
setne %al
addb $3, %al
Ltmp3277:
LBB99_381:
.loc 13 467 0
movq %r15, %rcx
addq $128, %rcx
movq %rdi, %rdx
adcq $0, %rdx
cmpq $256, %rcx
movl %esi, %r13d
sbbq $0, %rdx
jae LBB99_388
Ltmp3278:
.loc 13 0 0 is_stmt 0
xorl %ecx, %ecx
jmp LBB99_391
LBB99_383:
Ltmp3279:
.loc 13 468 0 is_stmt 1
movq %r13, %rcx
addq $32768, %rcx
movq %r14, %rdx
adcq $0, %rdx
movb $1, %al
cmpq $65536, %rcx
sbbq $0, %rdx
jb LBB99_386
.loc 13 469 0
movl $2147483648, %ecx
addq %r13, %rcx
movq %r14, %rdx
adcq $0, %rdx
shrdq $32, %rdx, %rcx
shrq $32, %rdx
movb $2, %al
orq %rcx, %rdx
je LBB99_386
.loc 13 470 0
movabsq $-9223372036854775808, %rax
addq %rax, %r13
adcq $0, %r14
setne %al
addb $3, %al
Note in particular that the dumped assembly includes (I think) comparisons against the large constant $-9223372036854775808
(-8000000000000000
in hex). The disassembled instruction stream has no occurrences of that constant that I can see, nor have I been able to identify any instruction sequences to construct the constant or otherwise implement the relevant comparison here.
In short, my theory is that, for some reason, LLVM is optimizing away the code generated for this match arm on the i128
input:
-0x8000_0000_0000_0000...0x7fff_ffff_ffff_ffff => I64,
but I believe this mal-optimization is occurring during LTO.
Its not clear to me whether its happening during ThinLTO or FatLTO. The stage0 binary that is used for this build of the stage1 rustc
is:
% ./build/x86_64-apple-darwin/stage0/bin/rustc --version
rustc 1.24.0-beta.1 (5b496b726 2018-01-02)
which I believe would be a build that predates #47548 but comes after #46382 ; so it is at least conceivable that this bug (which is unfortunately quite hard to reproduce/minimize) is a problem in ThinLTO alone...
This bug is especially scary to me given that we are currently making policy decisions about whether to turn ThinLTO on or off based solely on the compile-time/codesize impact, without any discussion of risk related to its correctness?
Of course I understand that compiler bugs are a fact of life (including in LLVM), but it might be better to have a uniform policy everywhere, so that when a bug does arise, it actually has pressure to get fixed? That is, if we're going to turn ThinLTO off in one context based on codesize/compile-time issues, then I might argue we should have it off everywhere just to keep bug investigation under control...
Nominating for discussion in the @rust-lang/compiler meeting. @pnkfelix can share what the heck is going on here.
Unfortunately I'll be on PTO during the meeting tomorrow.
I'm not sure what I have to share beyond the information I attempted to pack into my earlier comment.
(I'm also a little shocked that I seem to be the only one who is running into this on any sort of consistent basis, on both Linux and Mac. But that could well be because I was working for a couple of weeks against a baseline point on master that was still using ThinLTO by default, which may or may not be a key catalyst for reproducing this bug...)
(Indeed, it seems like updating to a newer master has made my problem go away again...)
I am seeing this problem too =) and I believe @spastorino was too. I'm going to try rebasing, but it's definitely worrisome.
@eddyb -- were you seeing this, or something else?
I don't think I've seen this one, no.
@nikomatsakis unsure if it's exactly the same thing or just related in some way. In my case 4 tests are consistently failing on my local machine.
I see failing run-pass/discrim-explicit-23030.rs
(seems like the same thing), run-pass/enum-discrim-autosizing.rs
(this one seems a bit different), run-pass/enum-discrim-range-overflow.rs
(this seems also a bit different) & run-pass/issue-15523-big.rs
(seems like the same thing).
You can check my log here https://gist.github.com/anonymous/f452914ff6e474c465ca02d28e628a07
@spastorino I wouldn't be surprised, given the nature of the machine code I was seeing, if all of those have the same root cause. (Which, the current theory goes, is due to ThinLTO mis-optimizing handling of 128-bit values...)
I encounter this from time to time. Rebasing has seemed to help though.
I'm seeing something similar with enum-discrim-too-small2.rs
failing in stage 1 but succeeding in stage 2, see #48326 (comment)
should we re-open this?
@nikomatsakis 👍 or open new issues 🙂
I just saw this once again with the enum-discrim-too-small2.rs
test (*) atop a branch that (once again) should not be affecting this, with stage0 like so:
% ./build/x86_64-unknown-linux-gnu/stage0/bin/rustc --version
rustc 1.25.0-beta.2 (1e8fbb143 2018-02-19)
(*) An earlier version of this comment erroneously claimed the issue was arising during bootstrap of the compiler itself. That was wrong.
And this time its on a system with a working copy of rr
, so I'm going to investigate...
I've been seeing this on https://github.com/Zoxc/rust/tree/run-fast-ice with x.py test src/test/run-pass
(on combined run-pass tests, not enum-discrim-too-small2.rs
)
The bug seems to have manifested itself fairly frequently when it was present. Now it is 5 years ago someone reported they encountered it. A generous (and IMHO reasonable) interpretation of that is that the heisenbug is gone and that this I-ICE issue can be closed.
Triage: No objections to the close proposal for two months. Closing!