Compiling 2WN29 binaries for old hardware
Closed this issue · 9 comments
Hi Kade,
From your build scripts for the current 2WN29 version, it seems you have mixed newer and newest instructions into all release targets?
Even for sse2 or sse3 binaries, or whatever, I see popcnt and bmi2, avx or avx2 instructions.
So none of the builds intended for old hardware might actually run there.
I would have built my own binary, but you also use a nightly rust, which I cannot use inside my msys2 environment.
It would be nice if you could build one which has no popcnt and at max sse4.1 instructions (old yorkshire core2) :)
Guenther (RWBC)
Hi Guenther! Thank you so much for letting me know and filing an issue.
From your build scripts for the current 2WN29 version, it seems you have mixed newer and newest instructions into all release targets? Even for sse2 or sse3 binaries, or whatever, I see popcnt and bmi2, avx or avx2 instructions.
Hm, this isn't supposed to be the case. For the build that goes up to SSE4.1, for example, there is
RUSTFLAGS='-C target-feature=+sse4.1,-sse4.2,-popcnt,-avx,-bmi1,-lzcnt,-bmi2,-avx2' cargo build --target x86_64-unknown-linux-gnu --release
RUSTFLAGS='-C target-feature=+sse4.1,-sse4.2,-popcnt,-avx,-bmi1,-lzcnt,-bmi2,-avx2' cargo build --target x86_64-pc-windows-gnu --release
which explicitly enables SSE4.1 (with +) but disables SSE4.2, popcnt, AVX, BMI1, LZCNT, BMI2, and AVX2 (with -). From my end, it looks okay: when I open up the binary with objdump -S expo-2WN29-intel-sse41 and search through it, I don't see any uses of the ymm registers or the popcnt instructions. Likewise with objdump -S expo-2WN29-core-penryn.
Could you confirm that you're unable to run these binaries? I only checked the Linux builds just now, so it's possible there's an issue with the Windows builds.
(old yorkshire core2)
The closest thing I can find online to Yorkshire is the Core 2 Penryn QC Yorkfield, which I think should be targeted by the expo-2WN29-core-penryn binary, but I can look into that.
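The objdump check described above can also be automated. The following is only a minimal sketch, assuming the default AT&T syntax of GNU objdump (tab-separated address, bytes, and instruction text, with registers written as %ymm0 and so on). The function name find_forbidden and the embedded sample lines are hypothetical: the sample is hand-written for illustration and is not output from any Expositor binary.

```rust
/// Hand-written sample in the tab-separated layout that `objdump -d` prints
/// (address, raw bytes, instruction). Illustrative only, not real output.
const SAMPLE: &str = "401000:\t48 89 e5\tmov    %rsp,%rbp\n\
401003:\tf3 48 0f b8 c7\tpopcnt %rdi,%rax\n\
401008:\tc5 fd 6f 07\tvmovdqa (%rdi),%ymm0\n";

/// Returns (1-based line number, instruction text) for every instruction
/// that uses popcnt or touches a ymm register.
fn find_forbidden(disassembly: &str) -> Vec<(usize, String)> {
    let mut hits = Vec::new();
    for (idx, line) in disassembly.lines().enumerate() {
        // The instruction text is the third tab-separated field; lines
        // without it (labels, byte continuations, source lines) are skipped.
        let instr = match line.split('\t').nth(2) {
            Some(text) => text.trim(),
            None => continue,
        };
        let mnemonic = instr.split_whitespace().next().unwrap_or("");
        // Prefix match so that suffixed forms of the mnemonic also count.
        if mnemonic.starts_with("popcnt") || instr.contains("%ymm") {
            hits.push((idx + 1, instr.to_string()));
        }
    }
    hits
}

fn main() {
    for (line_no, instr) in find_forbidden(SAMPLE) {
        println!("line {}: {}", line_no, instr);
    }
}
```

In practice the text would come from something like objdump -d expo-2WN29-intel-sse41 rather than from an embedded string. Matching on the mnemonic field instead of the whole line avoids false hits from symbol names that happen to contain "popcnt".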
Hi Kade,
Oh, I see you have - and + instructions (I wonder, though, why the - instructions cannot just be omitted, or would they silently be enabled if your compiling machine has them?). My bad, but really not a single binary runs here on my hardware, not even the sse2 one, unlike for the previous version.
For 2WQ23 every generic one up to sse41 and also the specific core-penryn runs! (just tested it again)
(and yorkshire was a typo, ofc I meant yorkfield)
Well, hopefully your current nightly rust hasn't dropped support for Win7 since February? I doubt it, though ;-)
Anyhow, something has changed in the build since the last version.
Guenther
I wonder though why the - instructions just cannot be omitted, or would they silently be taken, if your compiling machine has them?
I think you're right, and the - flags aren't necessary, but I just wanted to be extra careful.
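One way to see which features a given set of flags actually enables is to ask the compiler at build time. This is a small sketch, not part of Expositor; the function name compiled_features is made up for illustration. As far as I understand rustc, the enabled set comes from the target and the -C target-cpu / -C target-feature settings, not from the machine doing the compiling, unless target-cpu=native is used.

```rust
/// Reports, for a handful of x86 features, whether this binary was compiled
/// with them enabled. `cfg!(target_feature = "...")` is resolved at compile
/// time, so the result reflects RUSTFLAGS and not the CPU the program runs on.
fn compiled_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse2", cfg!(target_feature = "sse2")),
        ("sse4.1", cfg!(target_feature = "sse4.1")),
        ("sse4.2", cfg!(target_feature = "sse4.2")),
        ("popcnt", cfg!(target_feature = "popcnt")),
        ("avx", cfg!(target_feature = "avx")),
        ("avx2", cfg!(target_feature = "avx2")),
        ("bmi2", cfg!(target_feature = "bmi2")),
    ]
}

fn main() {
    for (name, enabled) in compiled_features() {
        println!("{:8} {}", name, if enabled { "enabled" } else { "disabled" });
    }
}
```

Building this once with plain cargo build and once with the RUSTFLAGS from the release script should show whether the - flags change anything for a given target-cpu.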
Not a single binary runs here on my hardware, not even the sse2 one, unlike for the previous version.
That's not good! My apologies – I'm really sorry about that.
I didn't change anything in the build-release script from 2WQ23, and I can't think of any code changes that should affect Windows compatibility, but it's possible I made a mistake somewhere, or, as you pointed out, it's quite possible that the problem is the result of changes to Rust.
Can you describe what symptoms you are seeing? Are any error messages printed?
In the meantime, I'll boot up my Windows machine and see if I can reproduce the problem.
Damned! :( It's the same issue I had with BlackMarlin (also Rust, you sure know it) since the last release (before this version all was ok when self compiling)
jnlt3/blackmarlin#93
I fired up real cmd as admin to see the error messages, otherwise the binaries just close too fast
and the windows diagnostic messages for the system are not very helpful.
Whatever, the real error message is very clear:
thread 'main' has overflowed its stack
C:\Downloads\windows\intel>cd generic
C:\Downloads\windows\intel\generic>dir
Volume in drive C is LIZARD
Volume Serial Number is 1CE7-94F3
Directory of C:\Downloads\windows\intel\generic
30.05.2022 20:00 <DIR> .
30.05.2022 20:00 <DIR> ..
30.05.2022 04:28 9.133.880 expo-2WN29-intel-avx.exe
30.05.2022 04:28 9.123.143 expo-2WN29-intel-avx2.exe
30.05.2022 04:27 8.968.713 expo-2WN29-intel-sse2.exe
30.05.2022 04:28 8.968.755 expo-2WN29-intel-sse3.exe
30.05.2022 04:28 8.960.656 expo-2WN29-intel-sse41.exe
30.05.2022 04:28 8.965.303 expo-2WN29-intel-sse42.exe
30.05.2022 04:28 8.968.740 expo-2WN29-intel-ssse3.exe
7 File(s), 63.089.190 bytes
2 Dir(s), 77.866.577.920 bytes free
C:\Downloads\windows\intel\generic>expo-2WN29-intel-sse2.exe
thread 'main' has overflowed its stack
So there seems to be a new incompatibility for my machine.
The good thing is if you solve it you might solve it for me and that BlackMarlin version too ;-)
Ah, yeah, a stack overflow. This is a mildly annoying difference between Linux and Windows that's come up before, when someone was unable to compile Expositor on their Windows machine:
Gabor Szots After that compiling went OK, apparently, [...] but running the generated engine resulted in this message:
thread 'main' has overflowed its stack
Kade The rust compiler uses an external linker; on my machine that's a GCC linker meant for cross-compiling, and on Windows that's an MSVC linker. The problem was that the MSVC linker has a different default value for the stack size and that value is too small, so I needed to explicitly set the stack size.
On Linux, the initial stack size is set by the kernel when the program is loaded (although this can be changed with ulimit). On Windows, I believe the initial stack size is specified by the executable.
(from this thread on the CCRL forums).
So my guess is that Expositor uses more stack space now than my GCC linker sets by default. Let me figure out how to pass the stack size to the linker, and I should have a working version for you soon.
Thank you for your help and patience!
Alright, it should be fixed! Can you try downloading the release again and test whether it works on your machine?
Thanks a lot, all ok now! Can you tell me the command for fixing the stack size (BlackMarlin)?
BTW this is around 9-10% faster than the corresponding binaries for 2WQ23.
sse41
uci
id name Expositor 2WN29
id author Kade
option name Hash type spin default 64 min 1 max 65536
option name Threads type spin default 1 min 1 max 240
option name Overhead type spin default 10 min 0 max 1000
option name Persist type check default true
uciok
go depth 16
info depth 1 seldepth 2 nodes 25 time 0 nps 220801 score cp 34 multipv 1 pv d2d4
info depth 2 seldepth 4 nodes 140 time 1 nps 179222 score cp 35 multipv 1 pv g1f3 e7e6
info depth 3 seldepth 10 nodes 1290 time 4 nps 328950 score cp 31 multipv 1 pv e2e4 c7c5 g1f3
info depth 4 seldepth 10 nodes 2751 time 9 nps 321074 score cp 39 multipv 1 pv e2e4 c7c5 g1f3 g8f6
info depth 5 seldepth 12 nodes 7721 time 22 nps 349470 score cp 35 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3
info depth 6 seldepth 16 nodes 25784 time 72 nps 355996 score cp 39 multipv 1 pv e2e4 c7c6 d2d4 d7d5 e4e5 c8f5
info depth 7 seldepth 16 nodes 36848 time 101 nps 366257 score cp 36 multipv 1 pv e2e4 c7c6 d2d4 d7d5 b1c3 d5e4 c3e4
info depth 8 seldepth 19 nodes 94842 time 256 nps 371166 score cp 41 multipv 1 pv d2d4 d7d5 c2c4 d5c4 e2e3 g8f6 f1c4 e7e6
info depth 9 seldepth 19 nodes 137871 time 365 nps 377275 score cp 32 multipv 1 pv d2d4 d7d5 c2c4 d5c4 e2e3 g8f6 f1c4 c7c5 b1c3
info depth 10 seldepth 20 nodes 247669 time 661 nps 374669 score cp 36 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 f8b4 d1a4 b8c6 a2a3
info depth 11 seldepth 28 nodes 403573 time 1074 nps 375880 score cp 38 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 f8b4 d1a4 b8c6 e2e3 c8d7
info depth 12 seldepth 28 nodes 744966 time 2009 nps 370896 score cp 28 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 f8b4 c4d5 e6d5 c1g5 h7h6
info depth 13 seldepth 28 nodes 1185756 time 3219 nps 368330 score cp 36 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 g2g3 c7c5 f1g2 d5c4 e1g1 b8c6 f3e5
info depth 14 seldepth 29 nodes 1874980 time 5096 nps 367897 score cp 35 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 g2g3 a7a5 f1g2 f8b4 c1d2 e8g8 a2a3 b4e7 e1g1
info depth 15 seldepth 33 nodes 3141962 time 8579 nps 366222 score cp 32 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 g2g3 d5c4 f1g2 c7c5 e1g1 b8c6 f3e5 c8d7 e5c6
info depth 16 seldepth 34 nodes 4949709 time 13651 nps 362577 score cp 32 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 g2g3 d5c4 f1g2 c7c5 e1g1 b8c6 f3e5 c8d7 e5c6 d7c6
bestmove d2d4
core-penryn
uci
id name Expositor 2WN29
id author Kade
option name Hash type spin default 64 min 1 max 65536
option name Threads type spin default 1 min 1 max 240
option name Overhead type spin default 10 min 0 max 1000
option name Persist type check default true
uciok
go depth 16
info depth 1 seldepth 2 nodes 25 time 0 nps 214971 score cp 34 multipv 1 pv d2d4
info depth 2 seldepth 4 nodes 140 time 1 nps 156764 score cp 35 multipv 1 pv g1f3 e7e6
info depth 3 seldepth 10 nodes 1290 time 4 nps 315397 score cp 31 multipv 1 pv e2e4 c7c5 g1f3
info depth 4 seldepth 10 nodes 2751 time 9 nps 308284 score cp 39 multipv 1 pv e2e4 c7c5 g1f3 g8f6
info depth 5 seldepth 12 nodes 7721 time 22 nps 348052 score cp 35 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3
info depth 6 seldepth 16 nodes 25784 time 73 nps 352385 score cp 39 multipv 1 pv e2e4 c7c6 d2d4 d7d5 e4e5 c8f5
info depth 7 seldepth 16 nodes 36848 time 106 nps 346683 score cp 36 multipv 1 pv e2e4 c7c6 d2d4 d7d5 b1c3 d5e4 c3e4
info depth 8 seldepth 19 nodes 94842 time 267 nps 355150 score cp 41 multipv 1 pv d2d4 d7d5 c2c4 d5c4 e2e3 g8f6 f1c4 e7e6
info depth 9 seldepth 19 nodes 137871 time 379 nps 363361 score cp 32 multipv 1 pv d2d4 d7d5 c2c4 d5c4 e2e3 g8f6 f1c4 c7c5 b1c3
info depth 10 seldepth 20 nodes 247669 time 680 nps 364094 score cp 36 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 f8b4 d1a4 b8c6 a2a3
info depth 11 seldepth 28 nodes 403573 time 1101 nps 366693 score cp 38 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 f8b4 d1a4 b8c6 e2e3 c8d7
info depth 12 seldepth 28 nodes 744966 time 2044 nps 364448 score cp 28 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 f8b4 c4d5 e6d5 c1g5 h7h6
info depth 13 seldepth 28 nodes 1185756 time 3268 nps 362892 score cp 36 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 g2g3 c7c5 f1g2 d5c4 e1g1 b8c6 f3e5
info depth 14 seldepth 29 nodes 1874980 time 5186 nps 361533 score cp 35 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 g2g3 a7a5 f1g2 f8b4 c1d2 e8g8 a2a3 b4e7 e1g1
info depth 15 seldepth 33 nodes 3141962 time 8688 nps 361663 score cp 32 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 g2g3 d5c4 f1g2 c7c5 e1g1 b8c6 f3e5 c8d7 e5c6
info depth 16 seldepth 34 nodes 4949709 time 13704 nps 361191 score cp 32 multipv 1 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 g2g3 d5c4 f1g2 c7c5 e1g1 b8c6 f3e5 c8d7 e5c6 d7c6
bestmove d2d4
Cheers,
Guenther
Thanks a lot, all ok now! Can you tell me the command for fixing the stack size (BlackMarlin)?
I added the -C link-arg=-Wl,--stack,16777216 argument to the Rust flags in the build-release script:
WINLINK='-C link-arg=-Wl,--stack,16777216'
FEATURES='-C target-cpu=penryn -Z tune-cpu=penryn -C target-feature=+sse4.1,-sse4.2,-popcnt,-avx,-bmi1,-lzcnt,-bmi2,-avx2'
RUSTFLAGS="$FEATURES $WINLINK" cargo build --target x86_64-pc-windows-gnu --release
If you're compiling on Windows, then the build.bat script should still work:
set RUSTFLAGS=-C target-cpu=native -C link-args=/STACK:16777216
cargo build --release
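For projects where changing the linker flags is inconvenient, a commonly used alternative is to run the deep work on a thread whose stack size is requested explicitly. To be clear, this is not what the build scripts above do; it is a sketch of a different technique, and run_with_big_stack and deep are hypothetical names used only for illustration.

```rust
use std::thread;

/// 16 MiB, the same value as the 16777216 passed to the linker above.
const STACK_SIZE: usize = 16 * 1024 * 1024;

/// Runs `work` on a freshly spawned thread with an explicitly requested
/// stack size, so the result does not depend on the size the linker
/// recorded for the main thread.
fn run_with_big_stack<F, T>(work: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(STACK_SIZE)
        .spawn(work)
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}

/// Stand-in for a stack-hungry recursive search: each frame keeps a
/// 4 KiB buffer alive, so 512 frames need roughly 2 MiB of stack.
fn deep(remaining: u32) -> u32 {
    let scratch = [0u8; 4096];
    // Keep the optimizer from discarding the buffer.
    std::hint::black_box(&scratch);
    if remaining == 0 {
        0
    } else {
        1 + deep(remaining - 1)
    }
}

fn main() {
    let frames = run_with_big_stack(|| deep(512));
    println!("recursed {} frames without overflowing", frames);
}
```

The trade-off is that the stack size lives in the source code instead of the build script, which may be easier to keep consistent across linkers and platforms.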
BTW this is around 9-10% faster than the corresponding binaries for 2WQ23.
Yes! That's a result of this fix:
• removed sequential dependencies in network evaluation that had a disastrous effect on performance
It's a mistake I'm very embarrassed to have made.
Thanks a lot, Kade! With your help, after adding the linker argument for the stack size, I could build a working binary for the current BlackMarlin dev version!