Unpacking float32 does not preserve signaling NaN, unlike float64

Question

Unpacking float32 does not preserve signaling NaN, unlike float64

fumoboy007 opened this issue a year ago · 10 comments

Describe the bug
float32/float64 NaNs have a quiet vs. signaling state. This state is encoded using a couple bits in the significand.

Unpacking a float64 signaling NaN using msgpack-c 6.0.0 preserves the quiet vs. signaling state. However, unpacking a float32 signaling NaN always results in a quiet NaN even if the packed value is a signaling NaN. The behavior is inconsistent between float32 and float64.

To Reproduce

static bool is_quiet_nan(double nan_val) {
    uint64_t bit_pattern = reinterpret_cast<uint64_t&>(nan_val);
    int is_quiet_bit_index = DBL_MANT_DIG - 2;
    return (bit_pattern >> is_quiet_bit_index) & 1;
}

TEST(MSGPACKC, simple_buffer_float_signaling_nan)
{
    if (!numeric_limits<float>::has_signaling_NaN) {
        return;
    }

    float val = numeric_limits<float>::signaling_NaN();

    msgpack_sbuffer sbuf;
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer pk;
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_float(&pk, val);

    msgpack_zone z;
    msgpack_zone_init(&z, 2048);
    msgpack_object obj;
    msgpack_unpack_return ret =
        msgpack_unpack(sbuf.data, sbuf.size, NULL, &z, &obj);

    ASSERT_EQ(MSGPACK_UNPACK_SUCCESS, ret);
    ASSERT_EQ(MSGPACK_OBJECT_FLOAT32, obj.type);

    EXPECT_TRUE(isnan(obj.via.f64));
    EXPECT_FALSE(is_quiet_nan(obj.via.f64));
#if defined(MSGPACK_USE_LEGACY_NAME_AS_FLOAT)
    EXPECT_TRUE(isnan(obj.via.dec));
    EXPECT_FALSE(is_quiet_nan(obj.via.dec));
#endif // MSGPACK_USE_LEGACY_NAME_AS_FLOAT

    msgpack_zone_destroy(&z);
    msgpack_sbuffer_destroy(&sbuf);
}

TEST(MSGPACKC, simple_buffer_double_signaling_nan)
{
    if (!numeric_limits<double>::has_signaling_NaN) {
        return;
    }

    double val = numeric_limits<double>::signaling_NaN();

    msgpack_sbuffer sbuf;
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer pk;
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_double(&pk, val);

    msgpack_zone z;
    msgpack_zone_init(&z, 2048);
    msgpack_object obj;
    msgpack_unpack_return ret =
        msgpack_unpack(sbuf.data, sbuf.size, NULL, &z, &obj);

    ASSERT_EQ(MSGPACK_UNPACK_SUCCESS, ret);
    ASSERT_EQ(MSGPACK_OBJECT_FLOAT64, obj.type);
    ASSERT_EQ(MSGPACK_OBJECT_FLOAT, obj.type);
#if defined(MSGPACK_USE_LEGACY_NAME_AS_FLOAT)
    ASSERT_EQ(MSGPACK_OBJECT_DOUBLE, obj.type);
#endif // MSGPACK_USE_LEGACY_NAME_AS_FLOAT

    EXPECT_TRUE(isnan(obj.via.f64));
    EXPECT_FALSE(is_quiet_nan(obj.via.f64));
#if defined(MSGPACK_USE_LEGACY_NAME_AS_FLOAT)
    EXPECT_TRUE(isnan(obj.via.dec));
    EXPECT_FALSE(is_quiet_nan(obj.via.dec));
#endif // MSGPACK_USE_LEGACY_NAME_AS_FLOAT

    msgpack_zone_destroy(&z);
    msgpack_sbuffer_destroy(&sbuf);
}

Expected behavior
Either both tests should pass or both tests should fail. Currently, simple_buffer_double_signaling_nan passes but simple_buffer_float_signaling_nan fails.

Root cause
The unpacker returns float32 values as double. This conversion implicitly causes the float32 signaling NaN to become a float64 quiet NaN.

Possible solution
The unpacker could return float32 values as float instead of converting them to double. This might be a breaking change because callers may need to update their usage to check a different msgpack_object_union field. However, in addition to resolving this issue, it could also be more convenient for callers because they wouldn’t need to cast the double field back to float.

Answer 1 · 2023-08-27T11:22:18.000Z

Thank you for reporting the issue.
I got confused. The code seems that C not C++. But your link to Root cause is C++ code.
Now C and C++ are completely different branch and different implementation.

Which one has the isssue C or C++, or both?
It seems that code is for C.

Answer 2 · 2023-08-27T12:53:49.000Z

I study about signaling/quiet NaN. Now I understand the issue.
It is common in C and C++.

I consider fixing the issue with minimal code change.

This code demonstrates the issue (signaling NaN becomes quiet NaN when convert from float to double), and another possible fix.
https://godbolt.org/z/Yehe6GMGG

Your link indicates the issue is caused by the following function. I agree.

inline void unpack_float(float d, msgpack::object& o)
{ o.type = msgpack::type::FLOAT; o.via.f64 = d; }

When I apply the minimal fix, then the code would be as follows:

inline void unpack_float(float d, msgpack::object& o)
{ 
    o.type = msgpack::type::FLOAT; 
    o.via.f64 = d;
    
    if (isnan(d)) { // could be `if (d != d) { ` for porablility
        if (!is_quiet_nan(d)) {
            clear_quiet(o.via.f64);
        }
    }
}

What do you think ?

Bitimage print version: https://godbolt.org/z/rqaPx4b34

Answer 3 · 2023-08-28T03:16:13.000Z

I ran my poc code on windows/msvc2019.

Here is the result:

has check
float has_quiet_NaN      :1
float has_signaling_NaN  :1
double has_quiet_NaN     :1
double has_signaling_NaN :1

float quiet/signaling check
is_quiet_nan(fqn)        :1
is_quiet_nan(fsn)        :1

double quiet/signaling check
is_quiet_nan(dqn)        :1
is_quiet_nan(dsn)        :1

float to double converted quiet/signaling check
isnan(dqn)               :1
isnan(dsn)               :1
is_quiet_nan(dqn)        :1
is_quiet_nan(dsn)        :1 invalid
bitimage fsn             :01111111110000000000000000000001
bitimage fixed_dsn before:0111111111111000000000000000000000100000000000000000000000000000
bitimage fixed_dsn after :0111111111111000000000000000000000100000000000000000000000000000
isnan(fixed_dsn)         :1
is_quiet_nan(fixed_dsn)  :1 valid

Linux result:

has check
float has_quiet_NaN      :1
float has_signaling_NaN  :1
double has_quiet_NaN     :1
double has_signaling_NaN :1

float quiet/signaling check
is_quiet_nan(fqn)        :1
is_quiet_nan(fsn)        :0

double quiet/signaling check
is_quiet_nan(dqn)        :1
is_quiet_nan(dsn)        :0

float to double converted quiet/signaling check
isnan(dqn)               :1
isnan(dsn)               :1
is_quiet_nan(dqn)        :1
is_quiet_nan(dsn)        :1 invalid
bitimage fsn             :01111111101000000000000000000000
bitimage fixed_dsn before:0111111111111100000000000000000000000000000000000000000000000000
bitimage fixed_dsn after :0111111111110100000000000000000000000000000000000000000000000000
isnan(fixed_dsn)         :1
is_quiet_nan(fixed_dsn)  :0 valid

float signaling NaN


SExp     Fraction(Significand)
01111111110000000000000000000000 Windows quiet     NaN
01111111110000000000000000000000 Linux   quiet     NaN
01111111110000000000000000000001 Windows signaling NaN
01111111101000000000000000000000 Linux   signaling NaN

if Exponent is all 1
   if Fraction is 0
      infinity (Sign is still enabled)
   else
      NaN *1
else
   normal value

*1

It seems that only Fraction != 0 is defined.
So there are many variations.

I think that your solution, introducing float type to object union, is better than my approach.

However, there is still problem.

Create signaling NaN on Linux
pack it
send it to Windows
unpack it on Windows, then got quiet NaN *2

similarlly

Create signaling NaN on Windows
pack it
send it to Linux
unpack it on Linux, then got quiet NaN *2

*2

I'm still not sure how to distinguish quiet or signal.

I think that the following code is not portable. Works well on Linux but doesn't work on Windows:

inline bool is_quiet_nan(float nan_val) {
    std::uint32_t bit_pattern = reinterpret_cast<std::uint32_t&>(nan_val);
    int is_quiet_bit_index = FLT_MANT_DIG - 2;
    return (bit_pattern >> is_quiet_bit_index) & 1;
}

Answer 4 · 2023-08-28T03:23:07.000Z

Perhaps msgpack-c could have a database of double/float signaling/quiet NaN bit representation on common operating systems, like Linux, macosx, Windows.
If they doesn't have conflicts, then unpack function could treat them well.

Conflict means OS-A quiet NaN bit representation is the same as OS-B signaling NaN bit representation.

Is there any information about quiet/signaling NaN bit repserentation ?

Answer 5 · 2023-08-28T04:01:01.000Z

I testes on msvc2022 too.
Then got the same result as msvc2019.

I found a document:
https://learn.microsoft.com/en-us/cpp/build/ieee-floating-point-representation?view=msvc-170#nan---not-a-number

Quiet NaNs have a leading one in the significand, and get propagated through an expression. They represent an indeterminate value, such as the result of dividing by infinity, or multiplying an infinity by zero. Signaling NaNs have a leading zero in the significand.

SExp     Fraction(Significand)
01111111110000000000000000000000 Windows quiet     NaN
01111111110000000000000000000001 Windows signaling NaN

It seems that the document and result are not mached.

Answer 6 · 2023-08-28T08:41:12.000Z

I wrote a PR to fix this issue on C++.
Then some of CI doesn't pass. They are 32bit built using -m32 option.
So I've checked the behavior using godbolt.

32bit

https://godbolt.org/z/8ETYd3Gz4

                         :SExp     Fraction(Significand)
bitimage fqn             :01111111110000000000000000000000
bitimage fsn             :01111111111000000000000000000000
                         :SExp        Fraction(Significand)
bitimage dqn             :0111111111111000000000000000000000000000000000000000000000000000
bitimage dsn             :0111111111111100000000000000000000000000000000000000000000000000

64bit

https://godbolt.org/z/aa8Wxoaao

                         :SExp     Fraction(Significand)
bitimage fqn             :01111111110000000000000000000000
bitimage fsn             :01111111101000000000000000000000
                         :SExp        Fraction(Significand)
bitimage dqn             :0111111111111000000000000000000000000000000000000000000000000000
bitimage dsn             :0111111111110100000000000000000000000000000000000000000000000000

64 bit is expected behavior. The left most bit of Fraction(Significand) indicates quiet(1) or signaling(0).
But 32 bit is both 1. But the 2nd left most bit seems to indicate quiet(0) or signaling(1).
It is very inconsintent for me.
Am I missing something?
My printing way or reinterrupt_cast would be undefined behavior ??

Answer 7 · 2023-08-28T09:04:34.000Z

https://en.wikipedia.org/wiki/IEEE_754#2019

Binary

...
For NaNs, quiet NaNs and signaling NaNs are distinguished by using the most significant bit of the trailing significand field exclusively,[d] and the payload is carried in the remaining bits.

Perhaps the encoding rule for signaling/quiet NaN is very recently decided. If it is, there are various kinds of encoding...
So msgpack-c pack/unpack using its platform one. So introducing f32 is good.

How do I write test for that ...

try rising signal ??

Answer 8 · 2023-08-28T09:34:46.000Z

I checked signal using fetestexcept.

signaling NaN with floating operation results:

Compiler	arch	float	double	godbolt
clang++	64	raised	raised	https://godbolt.org/z/nP697jvTG
clang++	32	not raised	not raised	https://godbolt.org/z/hhs56dnnz
g++	64	raised	raised	https://godbolt.org/z/x18svh5Yr
g++	32	not raised	not raised	https://godbolt.org/z/384YddYqn

arch	type	Fraction(Significand) MSB2bit
64bit	signaling	01
64bit	quiet	10
32bit	signaling	11
32bit	quiet	10

Answer 9 · 2023-08-28T23:41:17.000Z

Thanks for looking into this @redboltz!

I think the Windows issue you encountered may just be an issue with the Windows C++ standard library implementation: https://developercommunity.visualstudio.com/t/stdnumeric-limitssignaling-nan-returns-quiet-nan/155064. Perhaps a floating point operation that generates an sNAN still has the correct bit pattern even if numeric_limits::signaling_NaN() does not?

Is the 32-bit CPU issue you encountered the same as the above Windows issue?

Your solution of clearing the is_quiet bit seems like it should work well. That being said, introducing an f32 field seems like the cleanest solution to me.

Answer 10 · 2023-08-28T23:53:23.000Z

Actually, the Windows issue is a 32-bit CPU issue, not the other way around! More details here.

From what I understood, if a function returns an sNAN, it will always get converted to a qNAN (and raise a floating point exception) because the calling conventions on x86 specify that floating point values are returned in the x87 registers, which are inside the FPU. Hence, numeric_limits::signaling_NaN() always returns a qNAN on x86.