0.7.1 tests failing in ppc64le with g++14 -O2
Closed this issue · 32 comments
TL;DR full symptoms, gathered below:
- test failures with g++14 -O2 cross-compiled to ppc64le
- no failures with g++14 -O1 cross-compiled to ppc64le
- no failures with g++14 -O3 cross-compiled to ppc64le
- no failures with g++13 -O2 cross-compiled to ppc64le
- no failures with g++14 -O2 on x64
Original report
v0.7.1 is showing quite a few regressions on ppc64le
only:
The following tests FAILED:
1 - ryml-test-quickstart (Failed)
7 - ryml-test-parse_engine_6_qmrk (Failed)
8 - ryml-test-parse_engine_7_seqimap (SEGFAULT)
16 - ryml-test-tree (SEGFAULT)
18 - ryml-test-emit (Failed)
19 - ryml-test-style (Failed)
22 - ryml-test-json (Failed)
24 - ryml-test-merge (Failed)
28 - ryml-test-seq (SEGFAULT)
31 - ryml-test-map (SEGFAULT)
38 - ryml-test-scalar_null (Failed)
53 - ryml-test-anchor (Failed)
55 - ryml-test-number (Failed)
(I am not sure if I tested 0.7.0 thoroughly on non-x86_64
architectures or not.)
Originally posted by @musicinmybrain in #440 (comment)
@musicinmybrain Thanks for reporting.
From glancing at the logs I see this is also compiled with -O2
. I am guessing this is the same as #440, but now with the ppc64le frontend. ppc64le is tested in the CI of this project, and it succeeds with 0.7.1, but not with -O2
; only with -O0
and -O3
, as these are the cmake presets. (Of course, if it didn't succeed, it would not be released).
Two significant questions for reflection:
- what specific flag/flags triggered by
-O2
and not by either of-O3
or-O0
causes the problem? - what is the specific UB that these flags rely on, and which triggers the wrong codegen by the frontend?
Having an answer to 2 would help in revising/improving the code. However, while the answer to 1 can be found with some effort (given the constraints with this architecture), I'm not sure that the answer to 2 can be found. But possibly the answer to 1 will significantly narrow the search space.
Also, while we were hypothesizing in #440 that the cause for the problem is UB, it could also be that gcc has a bug in its frontend, and the bug manifests only with the flags from 1.
Chasing this bug will be significantly harder than #440, as I don't have access to ppc64le hardware, and any attempt from my side at diagnosing and fixing the problem must involve a CI roundtrip. Do you have access to ppc64le?
Chasing this bug will be significantly harder than #440, as I don't have access to ppc64le hardware, and any attempt from my side at diagnosing and fixing the problem must involve a CI roundtrip. Do you have access to ppc64le?
I don’t have general-purpose interactive shell access to real ppc64le
hardware.
I can do as many experimental RPM package builds in COPR as I want using real pp64le
hardware – see https://copr.fedorainfracloud.org/coprs/music/rapidyaml/packages/ – and so can anyone who makes a FAS account and is willing to express everything as an RPM package. (As a Fedora packager, I can also do “scratch builds” in real Fedora infrastructure, but this doesn’t help for testing coordinated updates of multiple packages like this – we use COPR for that!)
I can reproduce the error in an emulated ppc64le
mock
chroot on my own machine via qemu-user-static
. This does allow me to play around interactively, including choosing to enable network access and build things directly from git checkouts:
mock -r fedora-rawhide-ppc64le --enable-network --shell
Something like that is probably the most promising avenue if you want to try to play with this interactively yourself.
I’m happy to run tests with the tools at my disposal.
The flags corresponding to each of the optimization levels are documented at https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html. I wonder if this happens with -O1
…
For Fedora’s purposes, I could increase the optimization level to -O3
on ppc64le
as a workaround. We should respect the distribution’s default compiler flags in general, but can override them with a good justification. For -O3
, that would usually be a benchmark showing a significant gain, but this would be a good justification too. Of course, that’s a “magical” fix that could stop working at any time based on compiler implementation details, so I would view it as a temporary workaround.
I just confirmed that tests pass with -O1
for me. I’m checking -O3
now.
If there is a single flag that triggers test failures between -O1
and -O2
, we could probably find it by bisecting. (There are many fewer flags added from -O2
to -O3
than from -O1
to -O2
, but the flag that makes things work again is probably less revealing than the one that breaks them in the first place.) If triggering the problem requires several optimization flags acting together – which would not be surprising – it might be harder to understand exactly which flags are involved.
This actually fails with -O3
for me, too:
It’s possible that some of the other default flags in Fedora (including enabling LTO) are affecting this.
I am trying to create a Minimal Verifiable Example. Looking at the logs you originally posted, I see quickstart fails:
That's convenient to have a MVE!
So I edited the quickstart example to run only the function where this happens, sample_json()
, and tweaked the rarearchs workflow to try with -O0... all the way to -O3, with/without -g -DNDEBUG
. (Branch is fix/464_ub_ppc64le).
Unfortunately none of these fail :-/.
Can you try to get it to fail on your end? Relevant commands should be:
cd samples/add_subdirectory
mkdir -p build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS_RELEASE:STRING="...." # fill with appropriate options here
cmake --build build --config Release --target run
So I started by trying to imitate Fedora’s build flags as closely as I reasonably could, with the idea that if I could reproduce this with samples/add_subdirectory
, I would
$ mock -r fedora-rawhide-ppc64le --clean
$ mock -r fedora-rawhide-ppc64le -i git-core cmake gcc-c++ 'cmake(gtest)'
$ mock -r fedora-rawhide-ppc64le --shell --enable-network
<mock-chroot> sh-5.2# git clone https://github.com/biojppm/rapidyaml.git
<mock-chroot> sh-5.2# cd rapidyaml/
<mock-chroot> sh-5.2# git checkout fix/464_ub_ppc64le
<mock-chroot> sh-5.2# git submodule update --init --recursive
<mock-chroot> sh-5.2# cd samples/add_subdirectory
<mock-chroot> sh-5.2# CXXFLAGS="-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mcpu=power8 -mtune=power8 -fasynchronous-unwind-tables -fstack-clash-protection" cmake -S . -B build -DRYML_CXX_STANDARD=14
<mock-chroot> sh-5.2# cmake --build build -j16 --target run
[ 95%] Linking CXX executable ryml-quickstart
/builddir/rapidyaml/ext/c4core/src/c4/charconv.hpp:197:3: warning: type ‘c4::RealFormat_e’ violates the C++ One Definition Rule [-Wodr]
197 | } RealFormat_e;
| ^
/builddir/rapidyaml/ext/c4core/src/c4/charconv.hpp:209:3: note: an enum with different value name is defined in another translation unit
209 | } RealFormat_e;
| ^
/builddir/rapidyaml/ext/c4core/src/c4/charconv.hpp:190:5: note: name ‘FTOA_FLOAT’ is defined as 32-bit while another translation unit defines it as 8-bit
190 | FTOA_FLOAT = static_cast<std::underlying_type<std::chars_format>::type>(std::chars_format::fixed),
| ^
/builddir/rapidyaml/ext/c4core/src/c4/charconv.hpp:202:5: note: mismatching definition
202 | FTOA_FLOAT = 'f',
| ^
[ 95%] Built target ryml-quickstart
[100%] running: /builddir/rapidyaml/samples/add_subdirectory/build/ryml-quickstart
WTF1:
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
gmake[3]: *** [CMakeFiles/run.dir/build.make:71: CMakeFiles/run] Segmentation fault (core dumped)
I haven’t had a chance to try to minimize the compiler flags yet, but that looked interesting enough to go ahead and report it!
The warning persists when I drop -DRYML_CXX_STANDARD=14
, but goes away when I drop -flto=auto -ffat-lto-objects
. That make sense, because the compiler can’t detect ODR violations if we aren’t asking it to consider multiple translation units.
I’ve reduced the compiler flags to
<mock-chroot> sh-5.2# CXXFLAGS="-O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -m64 -mcpu=power8 -mtune=power8 " cmake -S . -B build
while still reproducing the segfault.
I assume I can reduce them further, but it takes a while to compile the example under emulation.
Same segfault down to CXXFLAGS="-O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong"
; dropping -fstack-protector-strong
gets me a test failure instead of a segfault:
[100%] running: /builddir/rapidyaml/samples/add_subdirectory/build/ryml-quickstart
WTF1:
{"": "","": "","": "","": ""}/builddir/rapidyaml/samples/quickstart.cpp:4293: ERROR: ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/builddir/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"": "","": "","": "","": ""}/builddir/rapidyaml/samples/quickstart.cpp:4299: ERROR: ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
ERROR: 2/3 checks failed.
If I restore the LTO flags (-flto=auto -ffat-lto-objects
), then the ODR violation warning reappears. If I also add std=c++11
so that the example is compiled as C++11 rather than the unspecified default (C++17 in practice), then I can get the test failures without the ODR violation warning. This makes sense because the different definitions of FTOA_FLOAT
were due to different C++ standards in different translation units. However, it also demonstrates that the ODR violation is not the cause of the test failure.
<mock-chroot> sh-5.2# CXXFLAGS="-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11" cmake -S . -B build
[ 95%] Linking CXX executable ryml-quickstart
/usr/bin/cmake -E cmake_link_script CMakeFiles/ryml-quickstart.dir/link.txt --verbose=1
/usr/bin/c++ -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11 "CMakeFiles/ryml-quickstart.dir/builddir/rapidyaml/samples/quickstart.cpp.o" -o ryml-quickstart ryml/libryml.a
gmake[3]: Leaving directory '/builddir/rapidyaml/samples/add_subdirectory/build'
[ 95%] Built target ryml-quickstart
/usr/bin/gmake -f CMakeFiles/run.dir/build.make CMakeFiles/run.dir/depend
gmake[3]: Entering directory '/builddir/rapidyaml/samples/add_subdirectory/build'
cd /builddir/rapidyaml/samples/add_subdirectory/build && /usr/bin/cmake -E cmake_depends "Unix Makefiles" /builddir/rapidyaml/samples/add_subdirectory /builddir/rapidyaml/samples/add_subdirectory /builddir/rapidyaml/samples/add_subdirectory/build /builddir/rapidyaml/samples/add_subdirectory/build /builddir/rapidyaml/samples/add_subdirectory/build/CMakeFiles/run.dir/DependInfo.cmake "--color="
gmake[3]: Leaving directory '/builddir/rapidyaml/samples/add_subdirectory/build'
/usr/bin/gmake -f CMakeFiles/run.dir/build.make CMakeFiles/run.dir/build
gmake[3]: Entering directory '/builddir/rapidyaml/samples/add_subdirectory/build'
[100%] running: /builddir/rapidyaml/samples/add_subdirectory/build/ryml-quickstart
./ryml-quickstart
WTF1:
{"": "","": "","": "","": ""}/builddir/rapidyaml/samples/quickstart.cpp:4293: ERROR: ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/builddir/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"": "","": "","": "","": ""}/builddir/rapidyaml/samples/quickstart.cpp:4299: ERROR: ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
ERROR: 2/3 checks failed.
Thanks for the investigation and the repro -- this is invaluable. That is exactly the same error as in your logs, so this means we have a reliable way to reproduce on a local machine.
I did not know about mock, and I am only somewhat familiar with qemu, but now there's a very good reason to pick these up. I'll report back when I'm able to reproduce on my end.
As for the ODR problem, let's track it in the linked issue. Thanks for reporting that one too!
Cannot reproduce it in ubuntu using powerpc64le-linux-gnu-g++ together with qemu-ppc64le-static...
Here's what I did, which succeeds:
set -xe ; \
cd ~/proj/rapidyaml/samples/add_subdirectory/ ; \
sys=powerpc64le ; \
cxxflags="-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11 -DNDEBUG" ; \
bdir=build/$sys ; \
mkdir -p $bdir ; \
cmake -S . -B $bdir -D CMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS_RELEASE:STRING="$cxxflags" \
--toolchain ../../.github/toolchains/linux-gnu-$sys.cmake ; \
cmake --build $bdir --target run -j --verbose
Also, adding -fstack-protector-strong
did not cause any segfault...
(BTW, I've tweaked the quickstart to make compilation faster; it now goes under 5s in my machine)
I tried further now, using a different ubuntu distro. No luck, could not reproduce it with any of those flags.
If the compiler flags work in one OS, but not in another OS, it is starting to look like something else in the environment, like compiler/linker version.
I don't have easy access to any RedHat distro to use mock
to try to reproduce. Is there a docker image that you recommend? Or another easy way?
FWIW in the previous comment I used ubuntu22.04, which has
root@bach:/src# powerpc64le-linux-gnu-g++ --version
powerpc64le-linux-gnu-g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
root@bach:/src# qemu-ppc64le-static --version
qemu-ppc64le version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.22)
now I used ubuntu18.04 which has
root@bach:/rapidyaml# powerpc64le-linux-gnu-g++ --version
powerpc64le-linux-gnu-g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
root@bach:/rapidyaml# qemu-ppc64le-static --version
qemu-ppc64le version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.42)
This does not confirm any version related suspicion.
FWIW, the prior setup was simply
$ apt-get install -y \
gcc-powerpc64le-linux-gnu \
g++-powerpc64le-linux-gnu \
qemu \
qemu-system \
qemu-user-static
and then running the command in the previous comment.
I also tried running all the tests you listed as failing. None of them failed...
( set -xe ; \
cxxflags="-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11 -DNDEBUG" ; \
cd /rapidyaml/ ; \
sys=powerpc64le ; \
bdir=build/$sys ; \
mkdir -p $bdir ; \
cmake -S . -B $bdir \
-D CMAKE_BUILD_TYPE=Release \
-DRYML_BUILD_TESTS=ON \
-DGIT=/usr/bin/git \\
-DCMAKE_CXX_FLAGS_RELEASE:STRING="$cxxflags" \
--toolchain .github/toolchains/linux-gnu-$sys.cmake ; \
for t in quickstart parse_engine_6_qmrk parse_engine_7_seqimap tree emit json merge seq map scalar_null anchor number ; do \
cmake --build $bdir -j --verbose --target ryml-test-$t \
&& qemu-ppc64le-static $bdir/test/ryml-test-$t \
|| echo "\n\n\nFAILED\n\n" ; \
done )
Hmm, maybe this one could be toolchain-specific after all.
I reduced the command line to CXXFLAGS="-O2 -g -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3" cmake -S . -B build
.
There are official Fedora docker images at https://hub.docker.com/_/fedora/, but I do not have any insight into combining docker with qemu
. I suppose you might be able to run mock
and qemu-user-static
inside a Docker container.
You could also use a VM, either installing from an ISO (https://fedoraproject.org/workstation/download) or using one of the Cloud Base Images (https://fedoraproject.org/cloud/download). You could use a ppc64le
VM if you know a way to run the whole VM under qemu
, or you could use mock
and qemu-user-static
inside the VM.
All of my tests for this particular issue have been in a Fedora Rawhide chroot, with GCC 14.1.1 and binutils 2.42.90. I’ll try the same process from #464 (comment) on some other releases and see if the issue can be reproduced there. Fedora 41 and 40 are both temporarily ahead of Rawhide with GCC 14.2.1; F41 has binutils 2.43 and F40 has binutils 2.41. Fedora 39 has GCC 13.3.1 and binutils 2.40. I can also try some enterprise distros, like CentOS Stream 10 with GCC 14.1.1 and binutils 2.41, Alma Linux 9 with GCC 11.4.1 and binutils 2.35.2, and Rocky Linux 8 with GCC 8.5.0 and binutils 2.30.
(I could try clang, too, but I’ll look at the various GCC’s first.)
Reproduced on:
- Fedora 41, GCC 14.2.1, binutils 2.43
- Fedora 40, GCC 14.2.1, binutils 2.41
Failed to reproduce on:
- Fedora 39, GCC 13.3.1, binutils 2.40
- Alma Linux 9, GCC 11.4.1, binutils 2.35.2
It’s looking very likely that this is a regression specific to GCC 14. Whether it’s a compiler bug / miscompilation – which is very possible – or perhaps a problem in rapidyaml revealed by a new optimization, I couldn’t say.
Made some progress: was able to reproduce with a gcc14.2 cross-compiler, and unable to reproduce with gcc13.2. Using just -O2
was enough to trigger the problem.
I did so by getting a docker image with a gcc14(ubuntu24.10) and gcc13(ubuntu24.04) cross-compiler:
#img=ubuntu:24.04 # uses gcc13.2
img=ubuntu:24.10 # uses gcc14.2
docker pull $img ; docker run --rm -it --network host -v ~/proj:/proj $img
... then inside the container:
apt-get update ; apt-get install -y \
g{++,cc}-powerpc64le-linux-gnu \
build-essential \
cmake \
git \
qemu-user-static
# define a function to try with different compiler flags
function try464()
{
cxxflags="$1"
cmkflags="$2"
( set -xe ; \
cd /proj/rapidyaml/samples/add_subdirectory/ ; \
sys=powerpc64le ; \
bdir=build/$sys-docker ; \
mkdir -p $bdir ; \
cmake -S . -B $bdir \
--toolchain ../../.github/toolchains/linux-gnu-$sys.cmake \
-DCMAKE_BUILD_TYPE=Release \
-DGIT=git \
$cmkflags \
-DCMAKE_CXX_FLAGS_RELEASE:STRING="$cxxflags" \
; \
cmake --build $bdir --target run -j --verbose )
}
... and finally:
$ try464 "-O2" # fails with gcc14, succeeds with gcc13
$ try464 "-O2" "-DRYML_DBG=ON" # succeeds
$ try464 "-O1" # succeeds
$ try464 "-O3" # succeeds
FWIW, here's the error. Notice it's missing the keys:
[100%] running: /proj/rapidyaml/samples/add_subdirectory/build/powerpc64le-docker/ryml-quickstart
qemu-ppc64le-static /proj/rapidyaml/samples/add_subdirectory/build/powerpc64le-docker/ryml-quickstart
WTF1:
{"": "a deer, a female deer","": "a drop of golden sun","": "a name, I call myself","": "a long long way to go"}
/proj/rapidyaml/samples/quickstart.cpp:4293: ERROR: ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/proj/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"": "a deer, a female deer","": "a drop of golden sun","": "a name, I call myself","": "a long long way to go"}
/proj/rapidyaml/samples/quickstart.cpp:4299: ERROR: ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
ERROR: 2/3 checks failed.
make[3]: *** [CMakeFiles/run.dir/build.make:71: CMakeFiles/run] Error 2
EDIT: @musicinmybrain I looked again the error you posted above; notice how it's different:
./ryml-quickstart
WTF1:
{"": "","": "","": "","": ""}
/builddir/rapidyaml/samples/quickstart.cpp:4293: ERROR: ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/builddir/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"": "","": "","": "","": ""}
/builddir/rapidyaml/samples/quickstart.cpp:4299: ERROR: ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
ERROR: 2/3 checks failed.
Nice! At least the association with GCC 14 seems to be consistent, and we have a cross-distro reproduction.
A further finding: this seems to be related with the compiler frontend. I tried with both gcc14.1 and 14.2 for x64, and was not able to reproduce.
I narrowed down the quickstart error to this function:
template<class EventHandler>
csubstr ParseEngine<EventHandler>::_maybe_filter_key_scalar_dquot(ScannedScalar const& C4_RESTRICT sc)
{
csubstr maybe_filtered = sc.scalar;
if(sc.needs_filter)
{
if(m_options.scalar_filtering())
{
maybe_filtered = _filter_scalar_dquot(sc.scalar);
}
else
{
_c4dbgp("dquo scalar left unfiltered");
m_evt_handler->mark_key_scalar_unfiltered();
}
}
else
{
_c4dbgp("dquo scalar doesn't need filtering");
}
return maybe_filtered;
}
When this function is called with sc.needs_filter
set to false, it is returning an empty string in maybe_filtered
, even though the input string at sc.scalar
is non-empty. This was causing the empty keys/values in the paste above.
As with #440, enabling RYML_DBG (or adding any manual prints inside the function) made the problem go away. There were also other weird behaviours I observed during debugging, and I can tell more about them if you wish.
But eventually I tried the following:
template<class EventHandler>
csubstr ParseEngine<EventHandler>::_maybe_filter_key_scalar_dquot(ScannedScalar const& C4_RESTRICT sc)
{
- csubstr maybe_filtered = sc.scalar;
if(sc.needs_filter)
{
if(m_options.scalar_filtering())
{
- maybe_filtered = _filter_scalar_dquot(sc.scalar);
+ return _filter_scalar_dquot(sc.scalar);
}
else
{
_c4dbgp("dquo scalar left unfiltered");
m_evt_handler->mark_key_scalar_unfiltered();
}
}
else
{
_c4dbgp("dquo scalar doesn't need filtering");
}
- return maybe_filtered;
+ return sc.scalar;
}
Which enabled the quickstart test to succeed.
After that, I did the same to every other _maybe_filter_*()
function, and ran the test suite. Doing so enabled every test to pass 🤨.
I'm pushing the changeset now; please try on your end, and let me know what happens.
I repeated the exact steps in #464 (comment), including the “unminimized” CXXFLAGS
etc., and all seems well: the ODR violation warning is absent, and the test passes:
[100%] running: /builddir/rapidyaml/samples/add_subdirectory/build/ryml-quickstart
WTF1:
{"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"}
/builddir/rapidyaml/samples/quickstart.cpp:4293: OK! ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/builddir/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"}
/builddir/rapidyaml/samples/quickstart.cpp:4299: OK! ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
SUCCESS!
WTF1:
{"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"}
/builddir/rapidyaml/samples/quickstart.cpp:4293: OK! ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/builddir/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"}
/builddir/rapidyaml/samples/quickstart.cpp:4299: OK! ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
SUCCESS!
[100%] Built target run
Next I’ll try applying master...fix/464_ub_ppc64le as a patch to 0.7.1, and double-check that the full test suite passes on all architectures.
Next I’ll try applying master...fix/464_ub_ppc64le as a patch to 0.7.1, and double-check that the full test suite passes on all architectures.
That worked. I also managed a successful test-build of jsonnet, which is the only thing in Fedora that currently depends on rapidyaml. It would be nice to really understand the root cause, but at least everything seems to be working.
Thanks for taking the time to investigate all of these reports. Are you planning a release with the fixes, or should I patch 0.7.1?
Don't patch, I'll cleanup this branch and make a new release.
I'm somewhat bothered by this situation (eg, this issue, and #440, and the previous problem referred therein).
The functions I now had to change look innocuous, and their call sites also look innocuous. It is remarkable that the optimizer is choking on this code, and it is even more remarkable that -O3
works while -O2
doesn't.
Even if UB is the ultimate reason, I find it hard to surmise how that is so. So do the many static analysis tools, and none of them point at anything suspicious.
Given that this issue is so specific (ie only gcc14+ppc64le+-O2), having a test in the CI does not make much sense (or otherwise the already gigantic size of this project's CI would balloon). So I'm just going to do the changes and merge them in.
If this turns out to be a bug/regression in the ppc64le
frontend for GCC, it would not be the first one I’ve seen, e.g. https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=7dd3b2b09cbeb6712ec680a0445cb0ad41070423 which showed up in the tests for https://github.com/shibatch/sleef. Hard to characterize exactly what’s happening and distill it into something reportable, though.
The functions I now had to change look innocuous, and their call sites also look innocuous. It is remarkable that the optimizer is choking on this code, and it is even more remarkable that
-O3
works while-O2
doesn't.
I remember that you said -O3
worked in your CI. It didn’t work in my local tests, though.
Oh, right, the segfault. Well, I did try with -O3
only, but that was working.
I guess I'll try again with the full flags.
Did you try -O3
with this changeset? If not, can you try it?
I just tried again on the tip of the branch with gcc 14.2. All of these succeed:
try464 "-O2"
try464 "-O3"
try464 "-O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong"
try464 "-O3 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong"
Since #467 is a proven fix to at least some of the problems here, I'm merging it now; doing so will close the issue. Feel free to reopen if the segfault persists.
Did you try
-O3
with this changeset? If not, can you try it?
I just now tried #464 (comment) again, with the following differences:
- Used current
master
, 346c0c1 - Did not pass
-DRYML_CXX_STANDARD=14
- Changed
-O2
to-O3
All the tests passed.
@musicinmybrain v0.7.2 is now available. Thanks for your help in chasing this issue.
@musicinmybrain v0.7.2 is now available. Thanks for your help in chasing this issue.
Thank you! This seems to be working well for me, so I’m going to update the rapidyaml
package in Fedora from 0.6.0 to 0.7.2 in Fedora Rawhide (future Fedora 42) and in the upcoming Fedora 41. This will also be the initial version (and probably the only version due to ABI compatibility) of rapidyaml
in EPEL10.