Failing to read complex Unicode string embedded in JSON
0xg0nz0 opened this issue · 3 comments
### Description
I tried to load urltestdata.json in nlohmann-json, and got:
parse error at line 4853, column 45: syntax error while parsing value - invalid string: surrogate U+D800..U+DBFF must be followed by U+DC00..U+DFFF; last read: '"http://example.com/\uD800\uD801'
But this is the official WHATWG URL validation test set, and multiple JSON validators that I tried online accept it.
### Reproduction steps
A simple parse of the above file reproduces it:
// Locate the downloaded fixture relative to this test source file.
std::filesystem::path testSourceLocation(__FILE__);
auto buildDir = testSourceLocation.parent_path() / "../../build";
auto testFixturePath = buildDir / "urltestdata.json";
std::ifstream testFixtureIn(testFixturePath);
// This call throws the parse error quoted above.
nlohmann::json testFixtureJson = nlohmann::json::parse(testFixtureIn);
### Expected vs. actual results
I expected this file to parse without errors.
### Minimal code example
See above.
### Error messages
As per above:
parse error at line 4853, column 45: syntax error while parsing value - invalid string: surrogate U+D800..U+DBFF must be followed by U+DC00..U+DFFF; last read: '"http://example.com/\uD800\uD801'
### Compiler and operating system
Ubuntu 22.04 (Noble) with gcc 13.2
### Library version
3.11.3 (vcpkg)
### Validation
- [ ] The bug also occurs if the latest version from the [`develop`](https://github.com/nlohmann/json/tree/develop) branch is used.
- [ ] I can successfully [compile and run the unit tests](https://github.com/nlohmann/json#execute-unit-tests).
Note I do have a hacky workaround for this in my CMakeLists.txt, which nicely demonstrates that it really is just that one test case which appears to be causing issues for nlohmann-json:
# Download the WPT horror show of URL conformance tests
set(JSON_URL "https://raw.githubusercontent.com/web-platform-tests/wpt/master/url/resources/urltestdata.json")
set(JSON_DEST "${CMAKE_BINARY_DIR}/urltestdata.json")
set(JSON_FILE "urltestdata.json")

# Download the JSON file at configure time
file(DOWNLOAD ${JSON_URL} ${JSON_DEST}
     STATUS download_status)

# Check if the download was successful
list(GET download_status 0 status_code)
if(status_code EQUAL 0)
    message(STATUS "Downloaded ${JSON_FILE} from GitHub")
else()
    message(FATAL_ERROR "Download of ${JSON_FILE} failed: ${download_status}")
endif()

# Remove bad test cases from the JSON file (lines 4852-4866 hold the offending entry)
message(STATUS "Removing bad test case from ${JSON_FILE} with sed")
execute_process(COMMAND sed -i 4852,4866d ${JSON_DEST}
                RESULT_VARIABLE sed_result
                ERROR_VARIABLE sed_error)

# Check if the sed command was successful
if(NOT sed_result EQUAL "0")
    message(FATAL_ERROR "Failed to execute sed command: ${sed_error}")
endif()
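The hard-coded line range in the `sed` call will go stale whenever the upstream file changes. As a rough sketch only (not part of nlohmann-json, and using the hypothetical helper names `parse_hex4` and `find_unpaired_surrogates`), the offending lines could instead be located by scanning the raw text for `\uXXXX` escapes that form unpaired surrogates. It assumes the input is otherwise well-formed JSON, so backslashes only occur inside strings:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: read the four hex digits at text[pos..pos+3].
// Returns std::nullopt if they are missing or not hexadecimal.
inline std::optional<unsigned> parse_hex4(std::string_view text, std::size_t pos) {
    if (pos + 4 > text.size()) return std::nullopt;
    unsigned value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = text[pos + i];
        unsigned digit = 0;
        if (c >= '0' && c <= '9') digit = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<unsigned>(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F') digit = static_cast<unsigned>(c - 'A') + 10;
        else return std::nullopt;
        value = value * 16 + digit;
    }
    return value;
}

// Hypothetical helper: return the 1-based line numbers that contain a \uXXXX
// escape which is a surrogate without a valid partner. Each line is listed once.
inline std::vector<std::size_t> find_unpaired_surrogates(std::string_view text) {
    std::vector<std::size_t> lines;
    std::size_t line = 1;
    auto record = [&lines](std::size_t ln) {
        assert(lines.empty() || lines.back() <= ln);  // line numbers only grow
        if (lines.empty() || lines.back() != ln) lines.push_back(ln);
    };
    auto is_high = [](unsigned u) { return u >= 0xD800 && u <= 0xDBFF; };
    auto is_low = [](unsigned u) { return u >= 0xDC00 && u <= 0xDFFF; };

    for (std::size_t i = 0; i < text.size();) {
        if (text[i] == '\n') { ++line; ++i; continue; }
        if (text[i] != '\\') { ++i; continue; }
        // A backslash starts an escape; only \uXXXX is of interest here.
        if (i + 1 >= text.size() || text[i + 1] != 'u') { i += 2; continue; }
        const auto unit = parse_hex4(text, i + 2);
        if (!unit) { i += 2; continue; }  // malformed escape: leave it to the real parser
        if (is_high(*unit)) {
            // A high surrogate must be followed immediately by \uDC00..\uDFFF.
            bool paired = false;
            if (i + 7 < text.size() && text[i + 6] == '\\' && text[i + 7] == 'u') {
                const auto next = parse_hex4(text, i + 8);
                paired = next.has_value() && is_low(*next);
            }
            if (paired) { i += 12; continue; }
            record(line);
        } else if (is_low(*unit)) {
            record(line);  // low surrogate with no preceding high surrogate
        }
        i += 6;
    }
    return lines;
}
```

Calling `find_unpaired_surrogates` on the contents of `urltestdata.json` should list the line reported in the parse error, which a build step could then use instead of fixed line numbers.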
The error message states the problem: `\uD800\uD801` is an invalid surrogate pair.
- `\uD800` is a high-surrogate code unit, as it is in the range `D800..DBFF`.
- `\uD801` is not a low-surrogate code unit, as it is not in the range `DC00..DFFF`.
This check is implemented in https://github.com/nlohmann/json/blob/develop/include/nlohmann/detail/input/lexer.hpp#L331.
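For illustration, the range check described above can be sketched as follows. This is a standalone approximation, not the code from `lexer.hpp`; the function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// High (leading) surrogate: U+D800..U+DBFF.
constexpr bool is_high_surrogate(std::uint32_t unit) {
    return unit >= 0xD800 && unit <= 0xDBFF;
}

// Low (trailing) surrogate: U+DC00..U+DFFF.
constexpr bool is_low_surrogate(std::uint32_t unit) {
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Two consecutive \uXXXX escapes form a valid pair only as high followed by low.
constexpr bool is_valid_surrogate_pair(std::uint32_t first, std::uint32_t second) {
    return is_high_surrogate(first) && is_low_surrogate(second);
}

// Combine a valid pair into the code point it encodes (U+10000..U+10FFFF).
inline std::uint32_t combine_surrogates(std::uint32_t high, std::uint32_t low) {
    assert(is_valid_surrogate_pair(high, low));  // precondition
    return 0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);
}

// The pair from the failing test case: both units are high surrogates.
static_assert(is_high_surrogate(0xD800), "D800 is a high surrogate");
static_assert(!is_low_surrogate(0xD801), "D801 is not a low surrogate");
static_assert(!is_valid_surrogate_pair(0xD800, 0xD801), "so the pair is rejected");
```

By contrast, a well-formed pair such as `\uD83D\uDE00` passes the check, and `combine_surrogates(0xD83D, 0xDE00)` yields U+1F600.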
I don't know why other validators accept this JSON, but the escape sequence denotes an unpaired surrogate, which cannot be encoded as valid UTF-8.
Update: References: https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G2630