boostorg/locale

utf8_codecvt fails when UTF-16 input ends with surrogate

Flamefire opened this issue · 3 comments

I found this bug in Boost.Nowide, which uses an older copy of the generic_codecvt code used here, but the issue is still present in this library.

Situation:

The solution should be to also check the state.
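As a rough illustration of what checking the state could look like, here is a minimal standalone UTF-16 to UTF-8 converter. This is not the Boost.Locale or Boost.Nowide code: `conv_result`, `utf16_state`, `encode_utf8` and `utf16_to_utf8` are names made up for this sketch. It assumes the converter consumes a leading surrogate and remembers it in the state, so once the input runs out it has to look at the state to decide between ok and partial.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result codes, mirroring std::codecvt_base::result.
enum class conv_result { ok, partial, error };

// Hypothetical conversion state: the pending leading surrogate, or 0 if none.
struct utf16_state {
    char16_t pending = 0;
};

inline bool is_lead(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_trail(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Writes cp as UTF-8 into [to, to_end). Returns the number of bytes written,
// or 0 if the code point does not fit into the remaining output space.
inline std::size_t encode_utf8(char32_t cp, char* to, char* to_end)
{
    assert(cp <= 0x10FFFF);
    const std::size_t n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (static_cast<std::size_t>(to_end - to) < n)
        return 0;
    if (n == 1) {
        to[0] = static_cast<char>(cp);
        return n;
    }
    // Lead byte: n high bits set, followed by the top bits of cp.
    static const unsigned char lead_mark[] = {0, 0, 0xC0, 0xE0, 0xF0};
    to[0] = static_cast<char>(lead_mark[n] | (cp >> (6 * (n - 1))));
    // Continuation bytes: 10xxxxxx, six bits each.
    for (std::size_t i = 1; i < n; ++i)
        to[i] = static_cast<char>(0x80 | ((cp >> (6 * (n - 1 - i))) & 0x3F));
    return n;
}

conv_result utf16_to_utf8(utf16_state& st,
                          const char16_t* from, const char16_t* from_end,
                          const char16_t*& from_next,
                          char* to, char* to_end, char*& to_next)
{
    from_next = from;
    to_next = to;
    while (from_next != from_end) {
        const char16_t c = *from_next;
        char32_t cp;
        if (st.pending) {
            // A leading surrogate was consumed earlier; c must complete the pair.
            if (!is_trail(c))
                return conv_result::error;
            cp = 0x10000 + ((static_cast<char32_t>(st.pending) - 0xD800) << 10)
                 + (c - 0xDC00);
        } else if (is_lead(c)) {
            // Consume the leading surrogate and remember it in the state.
            st.pending = c;
            ++from_next;
            continue;
        } else if (is_trail(c)) {
            return conv_result::error; // Unpaired trailing surrogate
        } else {
            cp = c;
        }
        const std::size_t n = encode_utf8(cp, to_next, to_end);
        if (n == 0)
            return conv_result::partial; // Output buffer is full
        to_next += n;
        st.pending = 0;
        ++from_next;
    }
    // The state check: all input is consumed, but if a leading surrogate is
    // still pending the conversion is incomplete, so report partial, not ok.
    return st.pending ? conv_result::partial : conv_result::ok;
}
```

With this sketch, converting a lone 0xD800 consumes the code unit, writes nothing and returns partial. A second call with 0xDC00 and the same state then writes the four UTF-8 bytes for U+10000 and returns ok.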

Interestingly, the C++11 codecvt from GCC behaves differently: it does not consume any input and returns ok: https://godbolt.org/z/nASHeL

However, https://en.cppreference.com/w/cpp/locale/codecvt/out mentions:

When performing N:M conversions, this function may return std::codecvt_base::partial after consuming all source characters (from_next == from_end). This means that another internal character is needed to complete the conversion (e.g. when converting UTF-16 to UTF-8, if the last character in the source buffer is a high surrogate).

But this is a "may".

Interestingly, the C++11 codecvt from GCC behaves differently: it does not consume any input and returns ok: https://godbolt.org/z/nASHeL

The GCC codecvt facets have some active bugs, so don't take them as a reference.

Regarding this bug: if the UTF-16 string ends with a leading surrogate, partial should be returned, and if it ends with an unpaired trailing surrogate, error should be returned. It would be more helpful if you posted code that reproduces the bug.
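The rule above can be written down as a small helper, mostly to make the expected results unambiguous in a test. `expected_result_for_tail` is a made-up name, not part of Boost.Locale or the standard library. The sketch assumes that everything before the last one or two code units is well-formed and that the output buffer is large enough.

```cpp
#include <cassert>
#include <locale> // std::codecvt_base

inline bool is_lead_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_trail_surrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical helper: the result a UTF-16 -> UTF-8 conversion of
// [from, from_end) should report, judged only by how the input ends.
std::codecvt_base::result expected_result_for_tail(const char16_t* from,
                                                   const char16_t* from_end)
{
    assert(from <= from_end);
    if (from == from_end)
        return std::codecvt_base::ok;

    const char16_t last = from_end[-1];
    if (is_lead_surrogate(last))
        return std::codecvt_base::partial; // Needs a trailing surrogate to complete
    if (is_trail_surrogate(last)) {
        const bool paired = (from_end - from >= 2) && is_lead_surrogate(from_end[-2]);
        return paired ? std::codecvt_base::ok : std::codecvt_base::error;
    }
    return std::codecvt_base::ok;
}
```

For example, an input consisting only of 0xD800 maps to partial, an input consisting only of 0xDC00 maps to error, and the pair 0xD800 0xDC00 maps to ok.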

From my test suite (slightly adapted):

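            // cvt_type and cvt come from the test suite; cvt_type is assumed here to be
            // something like boost::locale::utf8_codecvt<char16_t>, and cvt an instance of it.
            char buf[4] = {};
            char* const to = buf;
            char* const to_end = buf + 4;
            char* to_next = to;
            const char16_t* err_utf = u"\xD800"; // Lone leading (high) surrogate at the end of the input
            std::mbstate_t mb = std::mbstate_t();
            const char16_t* from = err_utf;
            const char16_t* from_end = from + 1;
            const char16_t* from_next = from;
            cvt_type::result res = cvt.out(mb, from, from_end, from_next, to, to_end, to_next);
            // Expected per the discussion above: res == cvt_type::partial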

FWIW: the OP above contains a full explanation of what happens where. Follow it to verify that partial is indeed not returned because the state is not checked.