linux-lts + Electron 1.3.2-3 blank screen
repsac-by opened this issue · 43 comments
Electron 1.3.2-3 launches with a white screen.
Atom 1.9.6-1 crashes at startup.
Same behavior on X11 and Wayland.
Mmm... how about electron -i?
@tensor5 it works:
$ electron -i
> process.versions
{ http_parser: '2.7.1',
node: '6.3.0',
v8: '5.2.361.43',
uv: '1.9.1',
zlib: '1.2.8',
ares: '1.11.0',
modules: '49',
openssl: '1.0.2h',
electron: '1.3.2',
'atom-shell': '1.3.2',
chrome: '52.0.2743.82' }
>
The problem is specific to linux-lts 4.4.16.
Electron 1.3.2-3 on linux 4.6.4 works.
That explains why I didn't hit the problem. Out of curiosity, did Chromium 52 work on linux-lts?
linux-lts + chromium 52.0.2743.85-2: all right.
You said that electron starts with a blank screen; by chance, are you able to open a developer console there with ctrl+shift+I?
ctrl+shift+I: nothing happens, just a blank screen
Tried linux-lts with electron-1.3.3-1 and I can confirm the issue. The output of dmesg:
[...]
[ 42.479909] traps: electron[1537] trap invalid opcode ip:18a4291 sp:7ffefb311030 error:0 in electron[400000+3b8b000]
[...]
It works with electron-1.3.2-2, which makes me think that it has something to do with 256d7b9.
@tensor5, in response to your question here: https://bugs.archlinux.org/task/50357#comment149928 I confirm experiencing the same, and I'm also using linux-lts.
I confirm, atom crashes at startup and electron alone starts blank. Works on non-lts kernel.
Workaround: install the latest version of the electron-prebuilt package from npm; atom will pick it up automatically after a shell restart.
$ npm install -g electron-prebuilt
I'm getting this bug with the linux-samus4 kernel; uname -r is 4.4.2-6ph. The workaround works here too. Can anyone explain why using the system toolchain would cause an illegal instruction, rather than the inverse? (Assuming it is 256d7b9.)
The bug seems to be there again in the stock Arch Linux kernel 4.7.
It works with 4.7.0-1-ck.
@stefanhusmann It works for me also on 4.7.0-1-ARCH. What architecture is your machine?
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
It sounds like we're invoking UB? The actual function that's giving the error looks like:
retq ; Previous function's return
crash_here: ; Note: named for convenience; no symbol is present in binary
ud2 ; We crash on this instruction
nopw %cs:0x0(%rax,%rax,1) ; For alignment?
nopl (%rax) ; For alignment?
retq
We crash on the ud2 instruction, which intentionally causes SIGILL for debugging purposes. gdb has trouble reading the stack, possibly(?) because crash_here never creates a stack frame.
EDIT: In case there's a correlation:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 61
Model name: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
Stepping: 4
CPU MHz: 2390.062
CPU max MHz: 3000.0000
CPU min MHz: 500.0000
BogoMIPS: 4788.56
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap xsaveopt
This is the diff between the .BUILDINFOs of electron-1.3.2-2 (my old working copy) and 1.3.2-3:
< pkgbuild_sha256sum = 3f266c8d8ceeeefcf72dd4ff585085159823e5b3181a7f916b73ffc25f8d7c09
---
> pkgbuild_sha256sum = 9855352b6780de0be00b6c9f6b435d50937a67709ae386872b3b947f1ca896fe
30c30
< installed = binutils-2.26.1-1
---
> installed = binutils-2.26.1-2
55c55
< installed = fakeroot-1.21-1
---
> installed = fakeroot-1.21-2
63c63
< installed = fontconfig-2.12.0-1
---
> installed = fontconfig-2.12.1-3
68,69c68,69
< installed = gcc-6.1.1-3
< installed = gcc-libs-6.1.1-3
---
> installed = gcc-6.1.1-4
> installed = gcc-libs-6.1.1-4
78c78
< installed = glibc-2.23-5
---
> installed = glibc-2.24-1
126c126
< installed = libcups-2.1.4-1
---
> installed = libcups-2.1.4-2
208c208
< installed = linux-api-headers-4.5.5-1
---
> installed = linux-api-headers-4.7-1
214,215c214,215
< installed = mesa-12.0.1-5
< installed = mesa-libgl-12.0.1-5
---
> installed = mesa-12.0.1-7
> installed = mesa-libgl-12.0.1-7
Maybe the updated linux-api-headers and glibc could play a role here.
I'm guessing glibc, but it's a guess; there are a lot of calls to munmap, mprotect, madvise, etc. in the general neighborhood [EDIT so I assume it's part of malloc, free, etc.]. I'm hand-decompiling right now, I'll update when done. Just for ref. though, alignment rules mean that 0x18a4291 is not the start of the crashing function; I think that it's 0x18a4260, but it might be 0x18a4240 or earlier. The stack frame got all screwed up, so...
EDIT 2 Is there any chance this is compiled with -fomit-frame-pointer? It sounds like upstream chromium is (ಠ_ಠ), so I assume this is too? When I crash, rbp is 4... (which explains the stack issues)
0x18a4240: push %rax
0x18a4241: callq 0x33afeb0 <munmap>
0x18a4246: test %eax,%eax
0x18a4248: jne 0x18a424c
0x18a424a: pop %rax
0x18a424b: retq
0x18a424c: ud2
0x18a424e: xchg %ax,%ax
0x18a4250: push %rax
0x18a4251: xor %edx,%edx
0x18a4253: callq 0x57a2e0 <mprotect@plt>
0x18a4258: test %eax,%eax
0x18a425a: jne 0x18a425e
0x18a425c: pop %rax
0x18a425d: retq
0x18a425e: ud2
0x18a4260: push %rax
0x18a4261: mov $0x3,%edx
0x18a4266: callq 0x57a2e0 <mprotect@plt>
0x18a426b: test %eax,%eax
0x18a426d: sete %al
0x18a4270: pop %rcx
0x18a4271: retq
0x18a4272: nopw %cs:0x0(%rax,%rax,1)
0x18a427c: nopl 0x0(%rax)
0x18a4280: push %rax
0x18a4281: mov $0x8,%edx
0x18a4286: callq 0x5725e0 <madvise@plt>
0x18a428b: test %eax,%eax
0x18a428d: jne 0x18a4291
0x18a428f: pop %rax
0x18a4290: retq
=> 0x18a4291: ud2
0x18a4293: nopw %cs:0x0(%rax,%rax,1)
0x18a429d: nopl (%rax)
0x18a42a0: retq
I'm translating this roughly to:
void foo(void *addr, size_t len) {
    // 0x18a4240: push %rax
    // 0x18a4241: callq 0x33afeb0 <munmap>
    // 0x18a4246: test %eax,%eax
    // 0x18a4248: jne 0x18a424c
    if(munmap(addr, len) == 0) {
        // 0x18a424a: pop %rax
        // 0x18a424b: retq
        return;
    }
    // 0x18a424c: ud2
    dieWithSIGILL();
}
// 0x18a424e: xchg %ax,%ax
void bar(void *addr, size_t len) {
    // 0x18a4250: push %rax
    // 0x18a4251: xor %edx,%edx
    // 0x18a4253: callq 0x57a2e0 <mprotect@plt>
    // 0x18a4258: test %eax,%eax
    // 0x18a425a: jne 0x18a425e
    if(mprotect(addr, len, 0) == 0) {
        // 0x18a425c: pop %rax
        // 0x18a425d: retq
        return;
    }
    // 0x18a425e: ud2
    dieWithSIGILL();
}
bool baz(void *addr, size_t len) {
    // 0x18a4260: push %rax
    // 0x18a4261: mov $0x3,%edx
    // 0x18a4266: callq 0x57a2e0 <mprotect@plt>
    // 0x18a426b: test %eax,%eax
    // 0x18a426d: sete %al
    // 0x18a4270: pop %rcx
    // lolwut why is rax getting moved to rcx?
    // 0x18a4271: retq
    return (mprotect(addr, len, 3) == 0);
}
// 0x18a4272: nopw %cs:0x0(%rax,%rax,1)
// 0x18a427c: nopl 0x0(%rax)
void xyzzy(void *addr, size_t len) {
    // 0x18a4280: push %rax
    // 0x18a4281: mov $0x8,%edx
    // 0x18a4286: callq 0x5725e0 <madvise@plt>
    // 0x18a428b: test %eax,%eax
    // 0x18a428d: jne 0x18a4291
    if(madvise(addr, len, 8) == 0) {
        // 0x18a428f: pop %rax
        // 0x18a4290: retq
        return;
    }
    // 0x18a4291: ud2
    dieWithSIGILL(); // This is where we die!
    // 0x18a4293: nopw %cs:0x0(%rax,%rax,1)
    // 0x18a429d: nopl (%rax)
    // 0x18a42a0: retq
    return; /* Except we return with a borked stack, so we're not returning to the
    caller... Instead, we're returning to the return value of whatever was called
    immediately before xyzzy. */
}
EDIT
The generated assembly looks more like Clang's than GCC's for the xyzzy function. I'm going to try ag-ing the chromium source tree for munmap and friends.
I'm in vendor/node/deps/v8/src/base/platform/platform-posix.cc. From my previous post:
- foo is OS::Free
- bar is OS::Guard
- baz is a mystery... it resembles OS::ProtectCode, but OS::ProtectCode passes PROT_READ | PROT_EXEC to mprotect, and as far as I can tell, 0x3 corresponds to PROT_READ | PROT_WRITE. There's not anywhere else I could find that'd be similar; this might actually be a root cause, if the kernel headers swapped the PROT_WRITE and PROT_EXEC flag values recently. (But I doubt they did...)
- xyzzy (our function of interest) is another enigma. It calls madvise, only mentioned in a completely separate file as part of icu-small, which has little to nothing to do with v8's sandboxing... Furthermore, it's called in the file with MADV_RANDOM, while in the core dump it's called with MADV_FREE. (MADV_FREE is not mentioned in electron's tree at all).
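The flag arithmetic above can be checked against the system headers instead of by eye. A minimal sketch, assuming a Linux target where PROT_READ, PROT_WRITE and PROT_EXEC are the usual single-bit values; describe_prot is a hypothetical helper, not something from the v8 or electron tree:

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper: renders an mprotect() protection mask as "RWX"-style
 * text using whatever PROT_* values the installed headers define. */
const char *describe_prot(int prot, char *buf, size_t n) {
    snprintf(buf, n, "%s%s%s",
             (prot & PROT_READ)  ? "R" : "-",
             (prot & PROT_WRITE) ? "W" : "-",
             (prot & PROT_EXEC)  ? "X" : "-");
    return buf;
}
```

With the usual Linux values, the 0x3 passed in baz decodes as read plus write, and the 0 passed in bar decodes as no access at all, which fits a guard page.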
At this point, I'd advise calling in a speciali- err, a v8 expert.
EDIT Searching the source tree for ud2, there are a few places in OpenSSL where it's used. None of them look close to icu-small though...
@remexre thanks for debugging work 👍
I will try compiling with gcc next time, that would explain why chromium is not affected by this bug.
No problem; I need more practice with assembly weirdness for machine architecture class 😛 I'm more than willing to help test, just comment here when you update the PKGBUILD.
I'm on 64-bit.
@stefanhusmann, could you post the output of lscpu and uname -r?
I do not have this bug with 4.7-1. However I got a totally different (unrelated) bug that my laptop display does not work at all. Oh boy...
@repsac-by, @pesho, @Frizi, could you all also post the lscpu and uname -r outputs?
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 21
Model: 2
Model name: AMD FX(tm)-8320 Eight-Core Processor
Stepping: 0
CPU MHz: 1700.000
CPU max MHz: 3500.0000
CPU min MHz: 1400.0000
BogoMIPS: 7049.48
Virtualization: AMD-V
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
4.7.0-1-ARCH
With 4.7.0-1-ARCH I have no problems.
For me (my lscpu is above):
Kernel | uname -r | Crash? |
---|---|---|
linux-samus4 | 4.4.2-6ph | Yes |
linux | 4.4.5-1-ARCH | Yes |
linux-lts | 4.4.16-1-lts | Yes |
linux | 4.5.0-1-ARCH | No |
linux | 4.7.0-1-ARCH | No |
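The split in the table falls between 4.4.x and 4.5.0. That is also where Linux gained MADV_FREE (the madvise man page lists it as available since 4.5), the advice value seen in the core dump. A rough sketch for classifying a uname -r string on that assumption; release_has_madv_free and running_kernel_has_madv_free are hypothetical names, and a distribution kernel with backports could make a pure version check misleading:

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: returns 1 if a `uname -r` style string is 4.5 or
 * newer, 0 if it is older, and -1 if it cannot be parsed. */
int release_has_madv_free(const char *release) {
    int major = 0, minor = 0;
    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return -1;
    return (major > 4 || (major == 4 && minor >= 5)) ? 1 : 0;
}

/* Same check against the running kernel; -1 if uname() fails. */
int running_kernel_has_madv_free(void) {
    struct utsname u;
    if (uname(&u) != 0)
        return -1;
    return release_has_madv_free(u.release);
}
```

Fed the five release strings from the table, the helper returns 0 for the three crashing kernels and 1 for the two working ones.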
I might try building a kernel 4.7.0 with the same .config as linux-lts, and vice versa, and see if that makes a difference. If not, I'm going to try to see which kernel version it is that causes the issue.
EDIT Apparently I forgot that .config changes between kernel versions... I'm going to try stepping backward through the prebuilt kernel releases until I get it. Also, someone on the LLVM cfe-dev mailing list suggested that I try compiling with -save-temps; apparently, it might be a call to __builtin_trap() that's doing it. So that's what I'll be trying next.
EDIT 2
ಠ_ಠ DEFINE_BOOL(hard_abort, true, "abort by crashing")
I'm guessing this gives a 0.5% performance increase when crashing? mutters darkly
@remexre sure:
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Model name: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
Stepping: 7
CPU MHz: 1269.531
CPU max MHz: 2900,0000
CPU min MHz: 800,0000
BogoMIPS: 3991.18
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid xsaveopt
# uname -r
4.4.16-1-lts
gg 1hr of compiling later, OOM... I should probably build outside of a memory-limited VM...
@repsac-by, @pesho, thanks! If we're getting the issue on AMD and Intel, across several microarchitectures, it's probably nothing related to an "actual" invalid opcode anywhere...
No good news with GCC either.
Is it related to one of our patches, then?
It may be, although I lean more towards the upgraded glibc. Upstream binaries are built using a sysroot, that could be the reason why they are not affected.
I'm setting up a build server, so that I will be able to handle rebuilds much more quickly.
@remexre I recompiled with debugging, and now I have this extra information:
Program terminated with signal SIGILL, Illegal instruction.
#0 0x00000000018a56d1 in WTF::decommitSystemPages(void*, unsigned long) ()
Does this tell you anything?
@repsac-by Thanks for pointing at that 👍, I'll include that patch in the next release.
For the record, this is the diff of /usr/include/bits/mman-linux.h between glibc 2.23-5 and 2.24-2:
83a84
> # define MADV_FREE 8 /* Free pages only if memory pressure. */
This line and the diff above explain why the older electron compiled with the older glibc worked.
Okay, I must've missed that somehow; I think ag only searches submodules if they're not .gitignore'd? Still not sure why this'd get a ud2, but if Atom works on old kernels again, close?
I guess ud2 is generated by RELEASE_ASSERT.
Right, because __builtin_trap(). I've still got ud2 internalized as "undefined behavior," rather than "probably but maybe also these other ten things." :P
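For anyone reading later, the assert-to-ud2 path is easy to reproduce in isolation. A small sketch, assuming GCC or Clang (both provide __builtin_trap); HARD_ASSERT is a hypothetical stand-in and not the actual RELEASE_ASSERT definition:

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for an assert that aborts by crashing. */
#define HARD_ASSERT(cond) do { if (!(cond)) __builtin_trap(); } while (0)

/* Runs HARD_ASSERT(ok) in a child process. Returns the number of the signal
 * that killed the child, 0 if it exited normally, -1 if fork/wait failed. */
int signal_from_hard_assert(int ok) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        HARD_ASSERT(ok);
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On x86-64 the trap is emitted as ud2, so a failed assertion should show up as SIGILL and the matching "trap invalid opcode" line in dmesg; other architectures may use a different instruction and signal.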
Feel free to reopen if the problem persists.