nlohmann/json

Support for iterator-range parsing

theodelrieu opened this issue · 53 comments

Hi, it would be handy to add an overload to the parse method.

In my code, I deal a lot with vector<uint8_t> when communicating with my server.
However I sometimes have to send JSON, and I must convert the buffer to a std::string, which is unfortunate.

I know I could typedef basic_json and use std::basic_string<uint8_t> as StringType.
However, other parts of the code uses the std::string version, and I don't want to patch the whole code because of that.

We could add this overload to parse, which will behave like std::string iterator constructor:

template<typename InputIterator>
static basic_json parse(InputIterator begin, InputIterator end, parser_callback_t cb = nullptr);

// user code

auto buffer = receive_bytes();
auto json = json::parse(buffer.begin(), buffer.end());

What are your thoughts on that ?
If you agree with the idea, I can take care of the PR.

I think the current implementation would not benefit from an iterator-range access: Inside the parser, I employ a lexer which relies on access to the input via a pointer of type lexer_char_t* (unsigned char*). For std::string, this can be done with casting the result of c_str(), and std::istreams are copied line by line to a buffer of type std::string.

To process a std::vector<uint8_t>, such a pointer could be casted from data(). Maybe one could generalize the static basic_json parse(const string_t& s, const parser_callback_t cb = nullptr) function to

template<typename T>
static basic_json parse(const T& s, const parser_callback_t cb = nullptr)

where T has a data() member function (std::basic_string, std::vector, and std::array) and whose pointers to its value type can be casted to lexer_char_t*.

What do you think?

So only contiguous containers of 1 byte integrals are supported I believe? (for now at least)
I don't know if counting on the data() method is a good thing, what if an user gives a homemade vector, or a char[] ?

Maybe it's better to use methods defined in the Container Concept, and use the std::begin method. We could static_assert that the iterator type must be RandomAccessIterator to ensure the contiguity of the container.

It could look like this:

template<typename ByteContainer>
static basic_json parse(const ByteContainer& buf, const parser_callback_t cb = nullptr)
{
  // ensure std::begin(ByteContainer) is valid + retrieve iterator type
  using Traits = std::iterator_traits<decltype(std::begin(std::declval<ByteContainer>()))>;

  static_assert(sizeof(Traits::value_type) == 1, "");
  // There is no new tag for ContiguousIterator in C++17, won't need to change
  static_assert(std::is_same_v<Traits::iterator_category, std::random_access_category_tag>, "");

  // quite ugly..., but will work with containers, C arrays
  do_lexing(reinterpret_cast<const lexer_char_t*>(&(*std::begin(buf)));
}

I think it's reasonable to assume that the container is invalid if iterator_traits cannot be defined with it.
That's also an issue on LWG.

What's your opinion ?

By the way, we can check for contiguity with the Iterator overload too, which removes some declval horror, so I think the iterator prototype is still correct.

Maybe we could add both ?

This looks good!

So far, parsing a string or an input stream could rely on an end of string or end of file mark. For this overload, one would need to pass a length to the parser - I guess std::distance(std::begin(buf), std::end(buf)) would do the job.

Shall I open a feature branch for this?

Sure!

You want the user to give the length to parse?
Could you show me what prototype you have in mind?

If std::distance can find out the length, then there is no need for the user to provide the length. (Unless we want to allow for a totally generic parse(const char*).

I do not have a prototype in mind. I would just implement and integrate your proposal into a feature branch so we can find out whether we oversaw a dependency.

Oh ok, I misunderstood what you wanted.
If a user wants to use parse with a const char*, I think the contiguous iterator-range overload does the job pretty well

I started a feature branch and started with a constructor for the parser class that takes an iterator range, see https://github.com/nlohmann/json/blob/feature/iterator_range_parsing/src/json.hpp.re2c#L8195.

It currently only checks if the iterators are random-access. This is, however, not sufficient to guarantee contiguous storage, and I have not found a way to enforce this in C++11. However, first tests (see https://github.com/nlohmann/json/blob/feature/iterator_range_parsing/test/src/unit-class_parser.cpp#L748) show that parsing works with

  • std::vector
  • std::array
  • std::string
  • std::initializer_list
  • std::valarray
  • C-style arrays

The next step would be to add overloads for the parse function for the basic_json class.

Any comments?

This is, however, not sufficient to guarantee contiguous storage, and I have not found a way to enforce this in C++11.

This is not a static property, but a runtime property, unfortunately. For example, std::deque can be either contiguous or not depending on the size of the deque.

http://stackoverflow.com/questions/35004633/checking-if-a-sequence-container-is-contiguous-in-memory

@gregmarr Thanks for clarifying. So I think the only way to cope with this is to make clear in the documentation that the behavior is undefined for iterator ranges with non-contiguous storage. Or any better idea?

Probably. You could probably have an assert that you get the same address for begin iterator plus range length, and the end iterator.

I added an assertion to do the check from the Stack Overflow article:

int i = 0;
assert(std::accumulate(first, last, true, [&i, &first](bool res, decltype(*first) val)
{
    return res and (val == *(std::next(std::addressof(*first), i++)));
}));

(Any idea how to get the int i = 0; inside the lamdba? Making it a static variable did not work)

(One way would be to pass std::pair<bool, int> to std::accumulate and do use this type are result.)

I added a first draft of the deserialization function. Feedback is greatly appreciated. In particular, it would be nice to maybe weaken the preconditions.

The 1 byte precondition can be enforced at compile time.

static_assert(sizeof(std::iterator_traits<IteratorType>::value_type) == 1, "");

I don't know if asserting for non-empty range is a good choice, maybe returning a 'null' value would be better?

@theodelrieu Right, checking the size can be done at compile time, I will change this. For an empty range, I would rather throw the same assertion as the parser would, but this would require creating some fake data - returning null would be problematic, because it would hide potential parse errors.

I saw you throw some exceptions to indicate parse errors, we might reuse one of those exceptions in this case?
Since we do not assert when a parsing error is encountered (because of invalid json), and since an empty value is an invalid json, I think it would be consistent with the rest of the invalid json handling.

I see if I can find a nice way to throw a parser error exception for an empty range.

Apart of the assertions, would this fix your original need?

Yes!

By the way, you were talking earlier about enhancing the std::string overload.

template<typename T>
static basic_json parse(const T& s, const parser_callback_t cb = nullptr)

Do you have any plan to do so? It would be great when one wants to parse the whole buffer.
(This is a reason why the standard library's algorithms are a bit annoying, requesting iterators everytime).

I can create a new issue for that though.

What do you mean? An overload for a type T and call the parser with begin()/end()? Or specifically a parse function for std::string?

I'm talking about your first comment:

I think the current implementation would not benefit from an iterator-range access: Inside the parser, I employ a lexer which relies on access to the input via a pointer of type lexer_char_t* (unsigned char*). For std::string, this can be done with casting the result of c_str(), and std::istreams are copied line by line to a buffer of type std::string.

To process a std::vector<uint8_t>, such a pointer could be casted from data(). Maybe one could >generalize the static basic_json parse(const string_t& s, const parser_callback_t cb = nullptr) >function to

template
static basic_json parse(const T& s, const parser_callback_t cb = nullptr)

where T has a data() member function (std::basic_string, std::vector, and std::array) and whose pointers to its value type can be casted to lexer_char_t*.

What do you think?

@theodelrieu I just committed some more changes. There are now the following overloads for the parse() functions:

  1. const char*
  2. std::istream
  3. C-style array (elements must be 1 byte)
  4. iterator range (elements must be 1 byte, range must be contiguous)
  5. container (elements must be 1 byte, container storage must be contiguous; delegated to item 4)

The last two items allowed to clean up the interface: std::string and string_t can be parsed with item 5, and item 1 now allows to parse string literals without copying them into a container first.

Note that compared to your initial proposal, I now check whether the iterator types for item 4 and 5 inherit from std::random_access_iterator_tag rather than requiring that they are the same. The actual checks are done in item 4:

  • contiguous storage is checked with an assertion
  • the size check is done in a static assertion
  • if an empty range is passed, the parser behaves as if an empty string was passed ("unexpected end of input")

I was skeptical at first, but now I think this made a great addition to the parser. Please let me know if anything is missing - otherwise, I would close this issue and add the feature with the 2.0.3 release.

Great!

I just have a few questions left:

  • Is the '\0' mandatory for parse to work properly?
    We should remove this constraint, users should not have to take the '\0' into account IMHO
    Internally, you could rely on a pair pointer, size, instead of the '\0'

  • I think you can remove the parse(char const[]) overload if you modify the enable_if condition in parse(const ContiguousContainer&):

    typename std::iterator_traits<decltype(std::begin(std::declval<ContiguousContainer const>()))>::iterator_category>::value
  • Concerning the const char* overload, I think it's too much constrained.
    Why not keeping it generic?

    template<typename CharT, typename = typename std::enable_if<sizeof(CharT) == 1, float>::type>
    basic_json parse(const CharT* buffer, parser_callback_t cb = nullptr);

    And in this case, the '\0' is mandatory.
    This way I could use unsigned char* or int8_t*.

Without adding \0, I ran into invalid reads. I need to check in the debugger why the lexer seems to read one character past the given limit. I hope it's just a stupid error on my side.

I shall try the other two proposals.

Cool!

For the last proposal, I forgot to add an is_integral<> check.

I integrated the latter two proposals and now shall have a look at the null byte.

I fixed the lexer so trailing null bytes are no longer necessary.

The following test cases work:

SECTION("contiguous containers")
{
    SECTION("directly")
    {
        SECTION("from std::vector")
        {
            std::vector<uint8_t> v = {'t', 'r', 'u', 'e'};
            CHECK(json::parse(v) == json(true));
        }

        SECTION("from std::array")
        {
            std::array<uint8_t, 5> v { {'t', 'r', 'u', 'e'} };
            CHECK(json::parse(v) == json(true));
        }

        SECTION("from array")
        {
            uint8_t v[] = {'t', 'r', 'u', 'e'};
            CHECK(json::parse(v) == json(true));
        }

        SECTION("from std::string")
        {
            std::string v = {'t', 'r', 'u', 'e'};
            CHECK(json::parse(v) == json(true));
        }

        SECTION("from std::initializer_list")
        {
            std::initializer_list<uint8_t> v = {'t', 'r', 'u', 'e'};
            CHECK(json::parse(v) == json(true));
        }

        SECTION("empty container")
        {
            std::vector<uint8_t> v;
            CHECK_THROWS_AS(json::parse(v), std::invalid_argument);
        }
    }

    SECTION("via iterator range")
    {
        SECTION("from std::vector")
        {
            std::vector<uint8_t> v = {'t', 'r', 'u', 'e'};
            CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
        }

        SECTION("from std::array")
        {
            std::array<uint8_t, 5> v { {'t', 'r', 'u', 'e'} };
            CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
        }

        SECTION("from array")
        {
            uint8_t v[] = {'t', 'r', 'u', 'e'};
            CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
        }

        SECTION("from std::string")
        {
            std::string v = {'t', 'r', 'u', 'e'};
            CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
        }

        SECTION("from std::initializer_list")
        {
            std::initializer_list<uint8_t> v = {'t', 'r', 'u', 'e'};
            CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
        }

        SECTION("from std::valarray")
        {
            std::valarray<uint8_t> v = {'t', 'r', 'u', 'e'};
            CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
        }

        SECTION("with empty range")
        {
            std::vector<uint8_t> v;
            CHECK_THROWS_AS(json::parse(std::begin(v), std::end(v)), std::invalid_argument);
        }
    }
}

There is still an error building with MSVC (see https://ci.appveyor.com/project/nlohmann/json/build/1166) when the null byte is missing in a C-style array. I shall investigate this tomorrow (or would be happy for hints and advices).

                uint8_t v[] = {'t', 'r', 'u', 'e'};
                CHECK(json::parse(v) == json(true));

This is the one that's failing, and it's just a raw array of 4 integers. There is no null byte at the end to terminate the parse. You need to explicitly add the 0 element.

@gregmarr I saw this as well, but did not want to change anything tonight, because Valgrind and Clang's Address Sanitizers did not complain. And would std::distance(std::begin(v), std::end(v)) not be 4 and behave the same as a std::vector or std::array?

I'm assuming this is using the char * overload.

template<typename CharT, typename std::enable_if<
                 std::is_integral<CharT>::value and sizeof(CharT) == 1, int>::type = 0>
    static basic_json parse(const CharT* s,
                            const parser_callback_t cb = nullptr)

It's an integral type and its size is 1.

With Xcode Version 8.0 beta 2 (8S162m),

uint8_t v[] = {'t', 'r', 'u', 'e'};
json::parse(v);

uses the

static basic_json parse(const ContiguousContainer& c,
                            const parser_callback_t cb = nullptr)

overload. Maybe MSVC chooses differently. I shall try tomorrow. Good night!

I tried on VS2015, I think this is a MSVC bug.
It chooses indeed the const CharT* overload.

Here is a small test to reproduce

template <typename ContiguousContainer>
void f(ContiguousContainer &&) {
  std::cout << "first\n";
}

template <typename CharT>
void f(CharT const*) {
    std::cout << "second\n";
}

int main() {
  auto t = std::vector<int>{1,2,3};
  uint8_t tab[] = {'t', 'r', 'u', 'e'};
  char *test;
  static_assert(!std::is_pointer<decltype(tab)>::value, "");
  f(tab);
  f(t);
  f(test);
  return 0;

Here the first overload binds to everything and should always be taken, but MSVC considers the second to be a best match.

Here is a test with GCC.

I don't know if there is a workaround, but if it's a confirmed bug, we should report it to Microsoft.

I believe that MSVC is correct here. The second template is more specialized that the first, so it's selected.

http://en.cppreference.com/w/cpp/language/function_template

template<class T> void f(T);        // template #1
template<class T> void f(T*);       // template #2
template<class T> void f(const T*); // template #3
void m() {
  const int* p;
  f(p); // overload resolution picks: #1: void f(T ) [T = const int *]
        //                            #2: void f(T*) [T = const int]
        //                            #3: void f(const T *) [T = int]
// partial ordering
// #1 from transformed #2: void(T) from void(U1*): P=T A=U1*: deduction ok: T=U1*
// #2 from transformed #1: void(T*) from void(U1): P=T* A=U1: deduction fails
// #2 is more specialized than #1 with regards to T
// #1 from transformed #3: void(T) from void(const U1*): P=T, A=const U1*: ok
// #3 from transformed #1: void(const T*) from void(U1): P=const T*, A=U1: fails
// #3 is more specialized than #1 with regards to T
// #2 from transformed #3: void(T*) from void(const U1*): P=T* A=const U1*: ok
// #3 from transformed #2: void(const T*) from void(U1*): P=const T* A=U1*: fails
// #3 is more specialized than #2 with regards to T
// result: #3 is selected
// in other words, f(const T*) is more specialized than f(T) or f(T*)
}

I truly believe this is a MSVC bug:

Here we pass a uint8_t[] to the function, not a pointer.
The compiler should consider the forwarding reference overload the best match,
since the const CharT * one requires an array-to-pointer conversion.
Clang and GCC both choose the right overload.

I cannot find the exact quote on cppreference about T&& binding rules for now, I will try again later.

That's an interesting question. Template overload resolution is indeed a bit of black magic.

From the same page as above:

P is the function parameter type (ContiguousContainer&&, CharT const *)
A is the argument passed to the function (uint8_t[4])

Before deduction begins, the following adjustments to P and A are made:

  1. If P is not a reference type,
    a) if A is an array type, A is replaced by the pointer type obtained from array-to-pointer conversion;
  2. If P is an rvalue reference to a cv-unqualified template parameter (so-called "forwarding reference"), and the corresponding function call argument is an lvalue, the type lvalue reference to A is used in place of A for deduction (Note: this is the basis for the action of std::forward):

For the CharT const * overload, #1 applies, and for the ContiguousContainer && overload, #4 applies.
That gives A = uint8_t[4]& for the ContiguousContainer version, and A = uint8_t * for the CharT version.

ContiguousContainer is then deduced as uint8_t[4]&
CharT is then deduced as uint8_t.

The deduced template functions are then (after reference collapsing)
parse<uint8_t[4]>(ContiguousContainer&)
parse<uint_t>(CharT const *)

The latter still looks to be more specialized, as it is const qualified.

However, there are also rules about what happens when considering the functions in a different order would give different results. Maybe that's involved here, or perhaps the lack of two-phase lookup in templates in VC.

As a matter of fact, I committed a version where the two functions in question are in a different order: no change...

Indeed it is tricky, I think we need a language lawyer on this one, we can post an example code on StackOverflow?

For now, we can put back the C array overload, to see if it helps?

@theodelrieu It would be nice to have more opinions on this, so I'm fine with any code on Stack Overflow.

Meanwhile, I tried the following: I changed the CharT const* function to

    template<typename CharPT, typename std::enable_if<
                 std::is_pointer<CharPT>::value and
                 std::is_integral<typename std::remove_pointer<CharPT>::type>::value and
                 sizeof(std::remove_pointer<CharPT>) == 1, int>::type = 0>
    static basic_json parse(const CharPT s,
                            const parser_callback_t cb = nullptr)

Calling this function with json::parse("\"\\ud80c\\udc60\"") gives:

error: call to 'parse' is ambiguous

note: candidate function [with CharPT = const char *, $1 = 0]
    static basic_json parse(const CharPT s,
                      ^
../src/json.hpp:6071:23: note: candidate function [with ContiguousContainer = char [15], $1 = 0]
    static basic_json parse(const ContiguousContainer& c,
                      ^

Adding the C array overload (and removing the experiment from my previous comment)

template<class T, std::size_t N>
    static basic_json parse(T (&array)[N],
                            const parser_callback_t cb = nullptr)

makes it worse:

src/unit-class_parser.cpp:593:27: error: call to 'parse' is ambiguous
        CHECK_THROWS_WITH(json::parse("\"\\uD80C\\u0000\""),
                          ^~~~~~~~~~~

../src/json.hpp:5907:23: note: candidate function [with T = const char, N = 15]
    static basic_json parse(T (&array)[N],
                      ^
../src/json.hpp:5943:23: note: candidate function [with CharT = char, $1 = 0]
    static basic_json parse(const CharT* s,
                      ^
../src/json.hpp:6104:23: note: candidate function [with ContiguousContainer = char [15], $1 = 0]
    static basic_json parse(const ContiguousContainer& c,

I made another experiment, but now run in another strange MSVC compiler error (https://ci.appveyor.com/project/nlohmann/json/build/1168):

C:\projects\json\src\json.hpp(5944): error C2027: use of undefined type 'std::remove_pointer<_Uty>' [C:\projects\json\test\json_unit.vcxproj]

After googling a bit, this may be a known MSVC issue...

I am working on a workaround right now, I think we can help MSVC a little

My workaround compiles.
However, tests fail on friend std::istream& operator<<(basic_json& j, std::istream& i) invocation, but this might be unrelated to the current issue.

The following code is a POC, I hope we can tweak it to make it cleaner.

// Trick struct to help MSVC choose the good overload
template <bool B>
struct IsPointer{};

// Impl method for contiguous containers
template<typename ContiguousContainer,
            typename std::enable_if<
                         std::is_base_of<
                         std::random_access_iterator_tag,
                         typename std::iterator_traits<decltype(std::begin(std::declval<ContiguousContainer>()))>::iterator_category>::value
                     , int>::type = 0>
static basic_json parse_impl(const ContiguousContainer& c, IsPointer<false>, const parser_callback_t cb)
{
    return parse(std::begin(c), std::end(c), cb);
}

// IsPointer<true> prevents the implicit array-to-pointer conversion
template<typename CharT>
static basic_json parse_impl(const CharT* c, IsPointer<true>, const parser_callback_t cb)
{
     return parser(reinterpret_cast<const char*>(c), cb).parse();
}

// We have to have a parse_impl for streams too. If we keep the parse(istream&)
// and parse(istream&&) overloads, calling parse with a std::stringstream 
// will chose the parse(T&&) one
template <typename Stream, typename = typename std::enable_if<std::is_base_of<std::istream,
                           typename std::remove_reference<Stream>::type>::value, char>::type>
static basic_json parse_impl(Stream& s, IsPointer<false>, const parser_callback_t cb)
{
    return parser(s, cb).parse();
}

// Only 'parse' overload with two arguments, (we keep the iterator-range one)
template<class T>
static basic_json parse(T&& c,  const parser_callback_t cb = nullptr)
{
    return parse_impl(std::forward<T>(c),
                IsPointer<std::is_pointer<typename std::remove_reference<T>::type>::value>{},
                cb);
}

The IsPointer trick is ugly, especially since we have to give it to every parse_impl overload.
I would like to know if the tests failures are related to this workaround first.
Let me know what you think about it!

Do I understand correctly that you need the IsPointer thingy because MSVC fails using std::is_pointer inside the templates?

In parse(T&&), I use std::is_pointer, since there is only one overload of parse, T is deduced as uint8_t(&)[4].
So in this method, std::is_pointer returns false.
This leads to the construction of IsPointer<false>, which prevents MSVC to choose the pointer overload, since it requires IsPointer<true>.

EDIT: It also seems MSVC performs the array-to-pointer conversion before resolving the enable_if.
Which might explain why your experiments didn't work.
I'm not an expert on template resolution techniques though...

Just a question: Why do you need to wrap the result of std::is_pointer<typename std::remove_reference<T>::type>::value inside the type IsPointer - couldn't you just pass a bool to parse_ impl?

You mean something like : parse_impl<true>(stuff...) ?

The parse_impl definition would look like that:

template <bool IsPointer, typename ContiguousContainer>
basic_json parse_impl(ContiguousContainer&&);

But you cannot partially specialize methods, so I need to wrap the bool in an empty struct.
EDIT: the example code was misleading

I found a better way to fix the issue, your experiment failed to compile because of the last template line:

 template<typename CharPT, typename std::enable_if<
                 std::is_pointer<CharPT>::value and
                 std::is_integral<typename std::remove_pointer<CharPT>::type>::value and
                 sizeof(std::remove_pointer<CharPT>) == 1, int>::type = 0>
    static basic_json parse(const CharPT s,
                            const parser_callback_t cb = nullptr)
    {
        return parser(reinterpret_cast<const char*>(s), cb).parse();
    }

Changing to sizeof(typename std::remove_pointer<CharPT>::type) makes it compile.
The tests fail but this could be a mistake in the operator<< implementation

@theodelrieu I comitted your proposed change (6e6e1c9) and all tests succeed.

I rebuilt using ninja, and launched ninja test.
Everything passes, I guess I ran tests in a bad folder or something like that.

Well, I have no remarks left, how about you?

I think the branch can be merged, unless you have any further remarks.

I'm fine for merging it!

Commit 776880b merged the feature branch. This ticket can be closed and will be part of the 2.0.3 release.

Thanks everybody for the constructive discussions!