The kernel uses its own set of CFLAGS, named KCFLAGS.
As pointed out by codemac in this topic, one can simply export values for KCFLAGS and KCPPFLAGS before calling make to achieve the same result:
export KCFLAGS=' -march=znver3'
export KCPPFLAGS=' -march=znver3'
make all
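Before exporting a -march value, it can be worth confirming that the installed compiler accepts it, since support depends on the compiler version. The following is a minimal sketch, assuming a GCC- or Clang-compatible driver; the helper name march_supported and the CC override are illustrative only and are not part of the kernel build system.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the compiler accepts the given -march value.
# It asks the compiler to syntax-check an empty C input with that flag.
march_supported() {
    "${CC:-gcc}" -march="$1" -fsyntax-only -x c /dev/null 2>/dev/null
}

march=znver3
if march_supported "$march"; then
    echo "compiler accepts -march=$march"
else
    echo "compiler rejects -march=$march (or no compiler was found)" >&2
fi
```

Setting CC selects a different compiler for the check, for example `CC=clang march_supported znver3`.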
Once applied to a kernel source tree, the patches in this repo provide additional micro-architecture optimizations for the Linux kernel in three broad classes.
The patch named lite-more-uarches-for-kernel-6.8-rc4+.patch will add:
- x86-64-v2
- x86-64-v3
- x86-64-v4
When compiling the Generic x86-64 Processor family target, these are selectable under:
Processor type and features --->
Compiler Micro-Architecture Level
- x86-64-v2 brings support for vector instructions up to Streaming SIMD Extensions 4.2 (SSE4.2) and Supplemental Streaming SIMD Extensions 3 (SSSE3), the POPCNT instruction, and CMPXCHG16B.
- x86-64-v3 adds vector instructions up to AVX2, MOVBE, and additional bit-manipulation instructions.
- x86-64-v4 includes vector instructions from some of the AVX-512 variants.
Users of glibc 2.33 and above can see which level is supported by running one of the following:
/lib/ld-linux-x86-64.so.2 --help | grep supported
/lib64/ld-linux-x86-64.so.2 --help | grep supported
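Where the glibc loader is unavailable or older than 2.33, the level can be estimated from the CPU feature flags that Linux exposes in /proc/cpuinfo. The sketch below is an approximation under stated assumptions: it checks the commonly cited flag names for each level as Linux reports them (for example, abm for LZCNT), it does not verify the baseline level, and x86_64_level is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the highest x86-64 level implied by a
# space-separated list of CPU flags (as found in /proc/cpuinfo).
x86_64_level() {
    LC_ALL=C awk -v flags=" $1 " '
        # Return 1 only if every name in "list" appears in the flags string.
        function has_all(list,    n, names, i) {
            n = split(list, names, " ")
            for (i = 1; i <= n; i++)
                if (index(flags, " " names[i] " ") == 0) return 0
            return 1
        }
        BEGIN {
            level = 1
            if (has_all("cx16 lahf_lm popcnt sse4_1 sse4_2 ssse3")) level = 2
            if (level == 2 && has_all("avx avx2 bmi1 bmi2 f16c fma abm movbe xsave")) level = 3
            if (level == 3 && has_all("avx512f avx512bw avx512cd avx512dq avx512vl")) level = 4
            print (level == 1 ? "x86-64" : "x86-64-v" level)
        }'
}

# Query the first CPU listed; prints the baseline "x86-64" if no flags line is found.
x86_64_level "$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)"
```

Each level is only reported when all lower levels are also satisfied, which mirrors how the levels build on one another.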
The group of patches named more-uarches-for-kernel-*.patch, each unique to a particular version of the kernel, will add:
CPU Family | -march= | Min GCC Ver | Min Clang Ver |
---|---|---|---|
AMD Improved K8-family | k8-sse3 | 9.3 | 9.0 |
AMD K10-family | amdfam10 | 9.3 | 9.0 |
AMD Family 10h (Barcelona) | barcelona | 9.3 | 9.0 |
AMD Family 14h (Bobcat) | btver1 | 9.3 | 9.0 |
AMD Family 16h (Jaguar) | btver2 | 9.3 | 9.0 |
AMD Family 15h (Bulldozer) | bdver1 | 9.3 | 9.0 |
AMD Family 15h (Piledriver) | bdver2 | 9.3 | 9.0 |
AMD Family 15h (Steamroller) | bdver3 | 9.3 | 9.0 |
AMD Family 15h (Excavator) | bdver4 | 9.3 | 9.0 |
AMD Family 17h (Zen) | znver1 | 9.3 | 9.0 |
AMD Family 17h (Zen 2) | znver2 | 9.3 | 9.0 |
AMD Family 19h (Zen 3) | znver3 | 10.3 | 12.0 |
AMD Family 19h (Zen 4) | znver4 | 13.0 | 17.0 |
AMD Family 1Ah (Zen 5) | znver5 | 14.1 | ??? |
Intel Bonnell family Atom | bonnell | 9.3 | 9.0 |
Intel Silvermont family Atom | silvermont | 9.3 | 9.0 |
Intel Goldmont family Atom (Apollo Lake and Denverton) | goldmont | 9.3 | 9.0 |
Intel Goldmont Plus family Atom (Gemini Lake) | goldmont-plus | 9.3 | 9.0 |
Intel 1st Gen Core i3/i5/i7-family (Nehalem) | nehalem | 9.3 | 9.0 |
Intel 1.5 Gen Core i3/i5/i7-family (Westmere) | westmere | 9.3 | 9.0 |
Intel 2nd Gen Core i3/i5/i7-family (Sandybridge) | sandybridge | 9.3 | 9.0 |
Intel 3rd Gen Core i3/i5/i7-family (Ivybridge) | ivybridge | 9.3 | 9.0 |
Intel 4th Gen Core i3/i5/i7-family (Haswell) | haswell | 9.3 | 9.0 |
Intel 5th Gen Core i3/i5/i7-family (Broadwell) | broadwell | 9.3 | 9.0 |
Intel 6th Gen Core i3/i5/i7-family (Skylake) | skylake | 9.3 | 9.0 |
Intel 6th Gen Core i7/i9-family (Skylake X) | skylake-avx512 | 9.3 | 9.0 |
Intel 8th Gen Core i3/i5/i7-family (Cannon Lake) | cannonlake | 9.3 | 9.0 |
Intel 10th Gen Core i7/i9-family (Ice Lake) | icelake-client | 9.3 | 9.0 |
Intel Xeon (Cascade Lake) | cascadelake | 10.2 | 10.0 |
Intel Xeon (Cooper Lake) | cooperlake | 10.2 | 10.0 |
Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake) | tigerlake | 10.2 | 10.0 |
Intel 4th Gen 10nm++ Xeon (Sapphire Rapids) | sapphirerapids | 11.1 | 12.0 |
Intel 11th Gen i3/i5/i7/i9-family (Rocket Lake) | rocketlake | 11.1 | 12.0 |
Intel 12th Gen i3/i5/i7/i9-family (Alder Lake) | alderlake | 11.1 | 12.0 |
Intel 13th Gen i3/i5/i7/i9-family (Raptor Lake) | raptorlake | 13.0 | 15.0.5 |
Intel 5th Gen 10nm++ Xeon (Emerald Rapids) | emeraldrapids | 13.0 | ??? |
The same group of patches named above will also add the ability to compile by passing the '-march=native' option which, according to the GCC manual, "selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine. Using -march=native enables all instruction subsets supported by the local machine and will produce code optimized for the local machine under the constraints of the selected instruction set."
Users of Intel CPUs should select the 'Intel-Native' option and users of AMD CPUs should select the 'AMD-Native' option.
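To see which CPU type -march=native resolves to on the build machine, GCC can be asked to print its resolved target options. This is a small sketch that assumes GCC (the -Q --help=target form is a GCC feature and may not work with Clang); resolve_native_march is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the -march value that GCC selects for -march=native.
# -Q --help=target lists target options together with their resolved values.
resolve_native_march() {
    "${CC:-gcc}" -march=native -Q --help=target 2>/dev/null |
        LC_ALL=C awk '$1 == "-march=" && !seen++ { print $2 }'
}

# Prints an empty value if GCC is not installed.
echo "-march=native resolves to: $(resolve_native_march)"
```

The printed name can be compared against the -march= column of the table above to see which explicit option matches the local machine.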
The benchmark measured the time it took to make bzImage from the Linux kernel source, using a .config generated beforehand by make x86_64_defconfig.
Three separate test machines were evaluated:
- AMD Ryzen 9 5950X
- Intel i7-4790K
- Intel N100
Separate kernels were first compiled from source patched with more-uarches-for-kernel-6.8-rc4+.patch.
- Kernel 1 used the default menu config option for Processor family = Generic x86-64
- Kernel 2 used the menu config option for Processor family = AMD x86-64-v3 or Intel x86-64-v3
- Kernel 3 used the menu config option for Processor family = AMD Zen 3, Intel Haswell, or Intel Alder Lake
Each machine was booted into its first kernel and the make test was conducted. The next kernel was then installed, the machine was rebooted into it, and the make test was conducted again.
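The procedure on each kernel amounts to timing repeated builds. As a rough sketch only (this is not the make_bench.sh script from this repo; the bench helper, the PREP variable, and the CSV layout are assumptions made for illustration, and date +%s.%N assumes GNU date):

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and append "run,seconds" lines to a CSV.
# Usage: bench RUNS LOGFILE COMMAND [ARGS...]
# An optional untimed preparation step can be supplied through PREP.
bench() {
    bench_runs=$1
    bench_log=$2
    shift 2
    bench_i=1
    while [ "$bench_i" -le "$bench_runs" ]; do
        ${PREP:-:} >/dev/null 2>&1                # untimed; defaults to a no-op
        bench_start=$(date +%s.%N)
        "$@" >/dev/null 2>&1 || return 1          # stop on the first failed run
        bench_end=$(date +%s.%N)
        LC_ALL=C awk -v i="$bench_i" -v s="$bench_start" -v e="$bench_end" \
            'BEGIN { printf "%d,%.3f\n", i, e - s }' >> "$bench_log"
        bench_i=$((bench_i + 1))
    done
}

# Example invocation from the top of a configured kernel tree (not executed here):
#   PREP='make clean' bench 12 my_times.csv make -j"$(nproc)" bzImage
```

Keeping the clean step outside the timed region means only the build itself is measured.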
Consistently across all three test machines, the kernels built with the optimized processor family options introduced by the patch hosted in this repo completed the compile test faster than the kernel built with the default processor family option. The difference is small (<1%) but statistically significant as measured by this make compilation.
What does this mean for real-world usage? Maybe nothing. The intent was to see if something easily automated could show some value in applying the tunings. People have historically gravitated to compilation-based benchmarks, and that, coupled with ease of use, is why I settled on it. If someone has a good kernel-centric benchmark, I am interested to see a controlled comparison. Maybe something relating to system calls, context switching, or scheduler latency.
- All the assumptions for ANOVA are met:
  - Data are normally distributed
  - The population variances are fairly equal
- The box plot clearly shows significance for either pair-wise comparison with Generic x86-64
- Pair-wise analysis by Tukey-Kramer is shown for all pairs (see tables)
In other words, x86-64-v3 is significantly different from generic x86-64. The various subtargets are also significantly different from generic x86-64.
AMD Ryzen 9 5950X:
Processor family option | Mean compile time | Std dev | # of replicates |
---|---|---|---|
Generic x86-64 | 79.800 sec | 0.1076 sec | 12 |
AMD x86-64-v3 | 79.456 sec | 0.0772 sec | 12 |
AMD Zen 3 | 79.440 sec | 0.0912 sec | 12 |
Treatment pairs | Tukey HSD Q stat | Tukey HSD p-value | Tukey HSD inference |
---|---|---|---|
Generic x86-64 vs AMD x86-64-v3 | 12.8771 | 0.0010053 | p < 0.01 |
Generic x86-64 vs AMD Zen 3 | 13.4675 | 0.0010053 | p < 0.01 |
AMD x86-64-v3 vs AMD Zen 3 | 0.5904 | 0.8999947 | insignificant |
Intel i7-4790K:
Processor family option | Mean compile time | Std dev | # of replicates |
---|---|---|---|
Generic x86-64 | 344.280 sec | 0.6455 sec | 12 |
Intel x86-64-v3 | 342.035 sec | 0.4971 sec | 12 |
Intel Haswell | 342.189 sec | 0.2415 sec | 12 |
Treatment pairs | Tukey HSD Q stat | Tukey HSD p-value | Tukey HSD inference |
---|---|---|---|
Generic x86-64 vs Intel x86-64-v3 | 28.9652 | 0.0010053 | p < 0.01 |
Generic x86-64 vs Intel Haswell | 24.8335 | 0.0010053 | p < 0.01 |
Intel x86-64-v3 vs Intel Haswell | 4.1317 | 0.0167155 | p < 0.05 |
Intel N100:
Processor family option | Mean compile time | Std dev | # of replicates |
---|---|---|---|
Generic x86-64 | 589.457 sec | 0.1596 sec | 12 |
Intel x86-64-v3 | 589.217 sec | 0.1382 sec | 12 |
Intel Alder Lake | 588.797 sec | 0.1532 sec | 12 |
Treatment pairs | Tukey HSD Q stat | Tukey HSD p-value | Tukey HSD inference |
---|---|---|---|
Generic x86-64 vs Intel x86-64-v3 | 5.5076 | 0.0012818 | p < 0.01 |
Generic x86-64 vs Intel Alder Lake | 15.1600 | 0.0010053 | p < 0.01 |
Intel x86-64-v3 vs Intel Alder Lake | 9.6524 | 0.0010053 | p < 0.01 |
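The relative size of the effect can be recomputed from the mean compile times in the tables above. This is a minimal sketch; pct_faster is a hypothetical helper, and the inputs are the reported means in seconds.

```shell
#!/bin/sh
# Hypothetical helper: percent reduction in mean compile time relative to a baseline.
# Usage: pct_faster BASELINE_SECONDS OPTIMIZED_SECONDS
pct_faster() {
    LC_ALL=C awk -v base="$1" -v opt="$2" \
        'BEGIN { printf "%.3f\n", (base - opt) / base * 100 }'
}

# Generic x86-64 mean first, then the optimized kernel's mean.
echo "5950X    x86-64-v3:  $(pct_faster 79.800 79.456)%"     # 0.431%
echo "5950X    Zen 3:      $(pct_faster 79.800 79.440)%"     # 0.451%
echo "i7-4790K x86-64-v3:  $(pct_faster 344.280 342.035)%"   # 0.652%
echo "i7-4790K Haswell:    $(pct_faster 344.280 342.189)%"   # 0.607%
echo "N100     x86-64-v3:  $(pct_faster 589.457 589.217)%"   # 0.041%
echo "N100     Alder Lake: $(pct_faster 589.457 588.797)%"   # 0.112%
```

Every value is below 1%, in line with the size of the difference described earlier.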
All machines ran Arch Linux with all stock repo packages, with the exception of the kernel (see below). At the time of work, the following toolchain versions were used:
- binutils 2.43+r4+g7999dae6961-1
- gcc 14.2.1+r134+gab884fffe3fc-1
- gcc-libs 14.2.1+r134+gab884fffe3fc-1
- glibc 2.40+r16+gaa533d58ff-2
- linux-api-headers 6.10-1
The kernel packages were built from the official Arch Linux PKGBUILD for kernel version 6.10.10-arch1-1, applying the distro config and differing only by the modifications introduced by the aforementioned patch from this repo.
The benchmark compiled the vanilla Linux kernel version 6.10.10 and, as mentioned above, the .config used was generated by running make x86_64_defconfig.
- Bash script to run the benchmark: make_bench.sh
- Log file generated by script: results.csv
- Original author: jeroen AT linuxforge DOT net
- Link to original version: http://www.linuxforge.net/docs/linux/linux-gcc.php
- Box plot generated with statisty.app
- ANOVA stats generated with astatsa.com
Find support for older versions of the Linux kernel and of gcc in the outdated_versions directory.