Chlumsky/json-cpp-gen

Get rid of sscanf / sprintf

Chlumsky opened this issue · 4 comments

Not only are these functions archaic and overly complex (format string parsing etc.) but I have found that sscanf in particular is extremely slow. This needs to be done away with ASAP. However, the implementation for floats is very complex. There should also be some way for the user to select an implementation or provide their own functions for number (de)serialization.

I have found that invoking sscanf within a long JSON string is extremely slow because (for whatever reason) it always reads the entire rest of the string. [1] This actually changes the worst case complexity of the parser from linear to quadratic! As a consequence, if a JSON file has a million numbers, it can literally take minutes to parse. Therefore, this issue is now considered critical.

As of 2529794, the most problematic uses of these functions have been eliminated. Integer parsing and serialization is now done with custom code, which is at most twice slower than the fastest known implementation. Parsing floating-point numbers is now done by the strtod C function, which still isn't ideal but it is orders of magnitude faster than sscanf. Floating-point serialization is unchanged for now.

So at this point, the main performance issue has been resolved, but it may still be good to eliminate the remaining uses of these archaic functions, or allow the user to provide alternatives. Here is the summary of remaining usages:

  • strtod, strtold in floating-point number parsing,
  • sprintf in floating-point number serialization,
  • sscanf in the parser's unescape function reads hexadecimal Unicode sequences - this could be replaced pretty easily with a bunch of extra code, but since it is called on a separate (short) buffer, the critical performance issue is not present.

sscanf removed from unescape in 9ff6393. strtod and strtold are portable and safe to use, so no need to remove them.

The last remaining usage of <cstdio> in floating-point value serialization has been resolved in fc06458 in the following way:

Instead of using sprintf directly, it is now called through the macro JSON_CPP_SERIALIZE_FLOAT, JSON_CPP_SERIALIZE_DOUBLE, or JSON_CPP_SERIALIZE_LONG_DOUBLE. Those are defined as

#ifndef JSON_CPP_SERIALIZE_DOUBLE
#include <cstdio>
#define JSON_CPP_SERIALIZE_DOUBLE(outBuffer, x) sprintf(outBuffer, "%.17g", x)
#endif

Therefore, the definition can be overridden by an earlier definition of these macros (e.g. via a file listed in includes), which can circumvent using the deprecated sprintf function with e.g.

#define JSON_CPP_SERIALIZE_DOUBLE(outBuffer, x) snprintf(outBuffer, sizeof(outBuffer), "%.17g", x)

Note that sizeof(outBuffer) can be used to determine the size of the buffer.

Alternatively, the user can provide a completely different implementation, but to be compatible, it must output exactly inf, -inf, and nan for the respective values.