SRI-CSL/gllvm

Error from wrong header parsing when compiling CentOS kernel

DanielKriz opened this issue · 10 comments

When building the CentOS kernel (v. 4.18.0-193.el8) with gllvm, the build sometimes fails on a non-existent header (usually a single letter plus the .h file extension, for example r.h). I suspect that this could be caused by a parsing bug.

This kernel and its config were acquired using rhel-kernel-get.

Environment

  • Linux 64-bit, Fedora 34
  • go version go1.16.3 linux/amd64
  • the most recent version of gllvm

Example of error

fixdep: error opening file: r.h: No such file or directory
make[2]: *** [scripts/Makefile.build:313: arch/x86/crypto/aesni-intel_glue.o] Error 2
make[1]: *** [scripts/Makefile.build:553: arch/x86/crypto] Error 2
make: *** [Makefile:1069: arch/x86] Error 2

from

  gclang -Wp,-MD,arch/x86/crypto/.aesni-intel_glue.o.d -nostdinc -isystem /usr/lib64/clang/12.0.0/include -I./arch/x86/include -I./arch/x86/include/generated   -I./include/drm-backport -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -Qunused-arguments -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -Werror-implicit-function-declaration -Wno-format-security -std=gnu89 -no-integrated-as -fno-PIE -DCC_HAVE_ASM_GOTO -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mstack-alignment=8 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_AVX512=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mretpoline-external-thunk -fno-delete-null-pointer-checks -Wno-frame-address -Wno-int-in-bool-context -O2 -Werror -Wframe-larger-than=2048 -fstack-protector-strong -Wno-format-invalid-specifier -Wno-gnu -Wno-address-of-packed-member -Wno-tautological-compare -mno-global-merge -Wno-unused-const-variable -g -gdwarf-4 -pg -mfentry -DCC_USING_FENTRY -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fno-merge-all-constants -fno-stack-check -Werror=implicit-int -Werror=strict-prototypes -Werror=date-time -Werror=incompatible-pointer-types -fmacro-prefix-map=./= -Wno-initializer-overrides -Wno-unused-value -Wno-format -Wno-sign-compare -Wno-format-zero-length -Wno-uninitialized -Wno-pointer-to-enum-cast    -DKBUILD_BASENAME='"aesni_intel_glue"' -DKBUILD_MODNAME='"aesni_intel"' -c -o arch/x86/crypto/.tmp_aesni-intel_glue.o arch/x86/crypto/aesni-intel_glue.c

The interesting thing is that there is no header ending with r.h. Just to be sure, I checked all preceding gclang calls, and none of them has such a header either.

How to reproduce

This error usually happens when compiling the Linux kernel with multiple threads/cores (the -j option); it is almost guaranteed to happen at some point during compilation.
It rarely happens when no number of cores is specified; in that case it usually occurs only once per compilation.
When called with make -j1 CC=gclang (because, for example, the ninja build system needs to be called with -j1 to use just one core), it seems to be almost guaranteed to occur.

Is there a way to fix this?

I have searched high and low for this "race".

Usually restarting the build works. I have no idea what is going on.

I used the Go race detector. No luck (i.e. no races).

Searched through the code. Gave up in the end. It was very very annoying.

Restarting the build usually worked, but I have now run into something totally new: I get this error every time I start a new build (even after calling make clean and starting anew). I even tried downloading a fresh kernel.

fixdep: error opening file: o.h: No such file or directory
make[2]: *** [scripts/Makefile.build:313: arch/x86/kernel/crash_dump_64.o] Error 2
make[1]: *** [scripts/Makefile.build:553: arch/x86/kernel] Error 2
make: *** [Makefile:1069: arch/x86] Error 2

Full log of the same error, run with KBUILD_VERBOSE=1: error_crash_dump64.log
One quite unique specimen of this error is this one:

fixdep: error opening file: elper.h: No such file or directory
make[2]: *** [scripts/Makefile.build:313: arch/x86/crypto/cast6_avx_glue.o] Error 2
make[1]: *** [scripts/Makefile.build:553: arch/x86/crypto] Error 2
make: *** [Makefile:1069: arch/x86] Error 2

It is unusual because the name is more than one letter plus the .h file extension. Full log: error_elper.log

As you said, it was usually enough to start the build again, but now I get the error every few compiler calls (perhaps it could be a clang issue?).

Would you please give me some pointers to the gllvm source code and how the parsing works? I really want to help with this issue.

Edit: Another interesting thing: clang doesn't know the option --mfentry and I don't need it for my purposes, so I removed it from the makefile, and after that this error hasn't occurred. This suggests that it could actually be on clang's side.

Excellent. Thank you @DanielKriz! I will give you a tour later today.

Let's concentrate on gclang; gclang++ is almost identical.
The entry point is gllvm/cmd/gclang/main.go, which passes
all the work on to shared.Compile(args, "clang"), args here being
the command-line args not including gclang.

The parsing is done on line 63 of shared/compiler.go.
All the parsing is located in shared/parser.go.

The parser's job is to:

  1. figure out if we need to actually produce bitcode
  2. divide the options into link time, compile time etc ...

Note that there is no concurrency yet. Once we have parsed the cmds into a
ParserResult object we then decide what to do.

This is where the concurrency occurs, assuming we have to produce bitcode.
Lines 85 and 86 of shared/compiler.go are the two concurrent jobs that
produce the object file(s), and produce the bitcode file(s), respectively.
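
The two-job structure described above can be sketched roughly as follows. This is a minimal illustration, not the code in shared/compiler.go: the job type, runConcurrently, and the two placeholder closures are hypothetical names, and the real jobs would each launch a compiler process.

```go
package main

import (
	"fmt"
	"sync"
)

// job stands in for one compiler invocation (hypothetical type, not gllvm's).
type job func() error

// runConcurrently starts every job in its own goroutine, waits for all of
// them, and returns their errors in the same order as the jobs.
func runConcurrently(jobs ...job) []error {
	errs := make([]error, len(jobs))
	var wg sync.WaitGroup
	for i, j := range jobs {
		wg.Add(1)
		go func(i int, j job) {
			defer wg.Done()
			errs[i] = j() // each goroutine writes only its own slot
		}(i, j)
	}
	wg.Wait()
	return errs
}

func main() {
	// Placeholders for "produce the object file" and "produce the bitcode file".
	buildObject := func() error { fmt.Println("producing object file"); return nil }
	buildBitcode := func() error { fmt.Println("producing bitcode file"); return nil }

	for i, err := range runConcurrently(buildObject, buildBitcode) {
		fmt.Printf("job %d: err=%v\n", i, err)
	}
}
```

Inside the Go process the two jobs share nothing mutable in this sketch, so any interference between them would have to come from what the launched processes do on disk.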

The parser is long but pretty straightforward: it tries to do exact matches first, then
does some pattern matching. The parser grows as the command lines to clang grow.
You will see comments on the more obscure switches. The kernel of course is the mother lode
of obscure switches.
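
As a rough picture of the "exact matches first, then patterns" approach, here is a small self-contained sketch. It is not the code in shared/parser.go: flagKind, exactFlags, patternFlags, and classify are hypothetical names, and the handful of flags and patterns are chosen only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// flagKind is a hypothetical classification, not gllvm's actual type.
type flagKind int

const (
	compileFlag flagKind = iota
	linkFlag
	unknownFlag
)

// exactFlags holds flags that are recognized verbatim.
var exactFlags = map[string]flagKind{
	"-c":      compileFlag,
	"-pipe":   compileFlag,
	"-shared": linkFlag,
	"-static": linkFlag,
}

// patternFlags are tried in order when no exact match exists, so the more
// specific -Wl, pattern must come before the general -W pattern.
var patternFlags = []struct {
	re   *regexp.Regexp
	kind flagKind
}{
	{regexp.MustCompile(`^-Wl,.+$`), linkFlag},
	{regexp.MustCompile(`^-l.+$`), linkFlag},
	{regexp.MustCompile(`^-[DIWfm].+$`), compileFlag},
}

// classify looks for an exact match first and falls back to the patterns.
func classify(arg string) flagKind {
	if kind, ok := exactFlags[arg]; ok {
		return kind
	}
	for _, p := range patternFlags {
		if p.re.MatchString(arg) {
			return p.kind
		}
	}
	return unknownFlag
}

func main() {
	names := map[flagKind]string{compileFlag: "compile", linkFlag: "link", unknownFlag: "unknown"}
	for _, arg := range []string{"-c", "-mfentry", "-Wl,--as-needed", "-DCC_USING_FENTRY", "-pg"} {
		fmt.Printf("%-20s %s\n", arg, names[classify(arg)])
	}
}
```

A flag that falls through to unknown in a scheme like this is the kind of obscure switch that would need a new entry or a comment in the real parser.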

I am pretty sure, but you should check, that once created the ParserResult object pr is not mutated.

So really not a lot of room for parallel weirdness.

Note that you could pretty easily instrument the code to dump each pr object out to the log.
Something like:

LogWarning("pr: %v", pr)

say by adding this at line 64 of compiler.go.

By the way, do you mean that

Another interesting thing: clang doesn't know the option --mfentry and I don't need it for my purposes, so I removed it from the makefile, and after that this error hasn't occurred. This suggests that it could actually be on clang's side.

All errors disappear, or just one particular type of error?

Just this particular error, but after some time (kernel compilation takes very long) it threw that error again, only much later. Unfortunately, no matter how many times I restarted the build, the error persisted.
Edit: I also want to thank you for the hints.

Any progress on this mystery @DanielKriz?

My main suspect is clang, because I have two systems: Fedora 34 with clang 12 and Ubuntu 18.04 with clang 11. On Fedora the bug happens very often and even restarting the build doesn't help. On Ubuntu, on the other hand, the build progresses with 0-2 occurrences.

I am preparing some containers to try the build with different versions of clang, and I am also trying to understand the gllvm codebase and Go (I started learning it just because of this bug; it's a pretty neat language, I must say).

I will update you as I get all the logs from the containers.

Interesting. One possibility is that the two almost identical concurrent calls to clang somehow manage to interfere
with one another. Since they are separate processes, this must be via the filesystem or some other external state.
I wonder if they are writing/reading/removing the same auxiliary files.
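
One candidate for such a shared auxiliary file is the dependency file named by -Wp,-MD,arch/x86/crypto/.aesni-intel_glue.o.d in the failing command, since that is the file fixdep reads. Whether this is the actual cause is unconfirmed. A way to test the idea would be to run one of the two jobs without the dependency-file flags, along the lines of the sketch below. The helper stripDepFlags is hypothetical and not part of gllvm, and the flag list is an assumption that covers only the common GCC/clang spellings.

```go
package main

import (
	"fmt"
	"strings"
)

// stripDepFlags returns a copy of args without the flags that make the
// compiler write a dependency file (hypothetical helper, not part of gllvm).
func stripDepFlags(args []string) []string {
	out := make([]string, 0, len(args))
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "-MD" || a == "-MMD":
			continue // no argument to skip
		case a == "-MF" || a == "-MT" || a == "-MQ":
			i++ // these take a separate argument; skip it as well
			continue
		case strings.HasPrefix(a, "-Wp,-MD,") || strings.HasPrefix(a, "-Wp,-MMD,"):
			continue // preprocessor form used by the kernel build
		}
		out = append(out, a)
	}
	return out
}

func main() {
	// Abbreviated version of the failing command line from the report.
	args := []string{
		"-Wp,-MD,arch/x86/crypto/.aesni-intel_glue.o.d",
		"-nostdinc", "-c",
		"-o", "arch/x86/crypto/.tmp_aesni-intel_glue.o",
		"arch/x86/crypto/aesni-intel_glue.c",
	}
	fmt.Println(strings.Join(stripDepFlags(args), " "))
}
```

If the error stops appearing when only one of the two processes is allowed to write the .d file, that would support the shared-file explanation; if it persists, the interference is somewhere else.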