Lifetime bug in exec::__final::__operation_state<...>::__t::__store_result_and_start_next_op
Closed this issue · 4 comments
The function exec::__final::__operation_state<...>::__t::__store_result_and_start_next_op
in <exec/finally.hpp>
has a lifetime bug where a sender is used after its lifetime has ended.
stdexec/include/exec/finally.hpp
Lines 219 to 222 in 9428f35
_FinalSender& __final = std::get_if<0>(&__op_)->__sndr_;
__final_op_t& __final_op = __op_.template emplace<1>(__conv{[&] {
return stdexec::connect(static_cast<_FinalSender&&>(__final), __final_receiver_t{this});
}});
__final
is a reference to a sender that lives within the active member at index 0
of the std::variant
__op_
. The call to __op_.emplace<1>
changes the active member of the variant, which destroys the currently active member before doing anything else, ending the lifetime of the sender that __final
refers to. The use of __final
in the call to stdexec::connect
inside the lambda happens after the destructor has been called and after that memory might have been overwritten by something else.
This bug was found by compiling stdexec with NVC++ with GCC 14.1. The implementation of std::variant
changed in GCC 14. When that version of std::variant
is compiled with NVC++, the memory of the std::variant
is zeroed out whenever the active member of the variant is changed, in between the destructor of the old active member and the constructor of the new active member. That wipes out the value of the sender that __final
refers to, effectively passing an empty sender to stdexec::connect
, which later results in a seg fault.
I don't really know whether or not NVC++ should be zeroing out that memory. GCC does not. I am still investigating that. But even if NVC++ is wrong, the stdexec code is still wrong to reference an object after it has been destroyed.
The stdexec code should be fixed so that the sender is moved out of the variant before the variant is changed, rather than trying to reference it within the variant.
It took three people many days worth of debugging to find the cause of the seg fault. This was not an easy bug to find.
oh wow. thank you so much for finding this and for this detailed bug report. amazing that clang's sanitizers didn't catch this.
I was curious why the sanitizers didn't catch this. I changed MSAN to poison memory when the variant is emplaced into, which mimics what NVC++'s variant does as reported by @dkolsen-pgi), and I can get MSAN to report this.
For libc++ at least, this code hit the __destroy
which does not call any destructor. MSAN does poison memory upon destruction, but since there is not destructor call, MSAN is never notified the bytes should be marked uninitialized. https://github.com/llvm/llvm-project/compare/main...ccotter:llvm-project:msan-libcxx?expand=1 is what I used locally.
In general though, I think MSAN will have trouble if the emplaced value into the variant writes to the same bytes of the "other" variant index, which thankfully didn't end up happening here.
Thanks for the analysis, @ccotter!