Support for iterator-range parsing
theodelrieu opened this issue · 53 comments
Hi, it would be handy to add an overload to the parse
method.
In my code, I deal a lot with vector<uint8_t>
when communicating with my server.
However I sometimes have to send JSON, and I must convert the buffer to a std::string
, which is unfortunate.
I know I could typedef basic_json and use std::basic_string<uint8_t>
as StringType
.
However, other parts of the code uses the std::string
version, and I don't want to patch the whole code because of that.
We could add this overload to parse
, which will behave like std::string
iterator constructor:
template<typename InputIterator>
static basic_json parse(InputIterator begin, InputIterator end, parser_callback_t cb = nullptr);
// user code
auto buffer = receive_bytes();
auto json = json::parse(buffer.begin(), buffer.end());
What are your thoughts on that ?
If you agree with the idea, I can take care of the PR.
I think the current implementation would not benefit from an iterator-range access: Inside the parser, I employ a lexer which relies on access to the input via a pointer of type lexer_char_t*
(unsigned char*
). For std::string
, this can be done with casting the result of c_str()
, and std::istream
s are copied line by line to a buffer of type std::string
.
To process a std::vector<uint8_t>
, such a pointer could be casted from data()
. Maybe one could generalize the static basic_json parse(const string_t& s, const parser_callback_t cb = nullptr)
function to
template<typename T>
static basic_json parse(const T& s, const parser_callback_t cb = nullptr)
where T
has a data()
member function (std::basic_string
, std::vector
, and std::array
) and whose pointers to its value type can be casted to lexer_char_t*
.
What do you think?
So only contiguous containers of 1 byte integrals are supported I believe? (for now at least)
I don't know if counting on the data()
method is a good thing, what if an user gives a homemade vector, or a char[]
?
Maybe it's better to use methods defined in the Container Concept, and use the std::begin
method. We could static_assert
that the iterator type must be RandomAccessIterator to ensure the contiguity of the container.
It could look like this:
template<typename ByteContainer>
static basic_json parse(const ByteContainer& buf, const parser_callback_t cb = nullptr)
{
// ensure std::begin(ByteContainer) is valid + retrieve iterator type
using Traits = std::iterator_traits<decltype(std::begin(std::declval<ByteContainer>()))>;
static_assert(sizeof(Traits::value_type) == 1, "");
// There is no new tag for ContiguousIterator in C++17, won't need to change
static_assert(std::is_same_v<Traits::iterator_category, std::random_access_category_tag>, "");
// quite ugly..., but will work with containers, C arrays
do_lexing(reinterpret_cast<const lexer_char_t*>(&(*std::begin(buf)));
}
I think it's reasonable to assume that the container is invalid if iterator_traits cannot be defined with it.
That's also an issue on LWG.
What's your opinion ?
By the way, we can check for contiguity with the Iterator overload too, which removes some declval
horror, so I think the iterator prototype is still correct.
Maybe we could add both ?
This looks good!
So far, parsing a string or an input stream could rely on an end of string or end of file mark. For this overload, one would need to pass a length to the parser - I guess std::distance(std::begin(buf), std::end(buf))
would do the job.
Shall I open a feature branch for this?
Sure!
You want the user to give the length to parse?
Could you show me what prototype you have in mind?
If std::distance
can find out the length, then there is no need for the user to provide the length. (Unless we want to allow for a totally generic parse(const char*)
.
I do not have a prototype in mind. I would just implement and integrate your proposal into a feature branch so we can find out whether we oversaw a dependency.
Oh ok, I misunderstood what you wanted.
If a user wants to use parse
with a const char*
, I think the contiguous iterator-range overload does the job pretty well
I started a feature branch and started with a constructor for the parser
class that takes an iterator range, see https://github.com/nlohmann/json/blob/feature/iterator_range_parsing/src/json.hpp.re2c#L8195.
It currently only checks if the iterators are random-access. This is, however, not sufficient to guarantee contiguous storage, and I have not found a way to enforce this in C++11. However, first tests (see https://github.com/nlohmann/json/blob/feature/iterator_range_parsing/test/src/unit-class_parser.cpp#L748) show that parsing works with
std::vector
std::array
std::string
std::initializer_list
std::valarray
- C-style arrays
The next step would be to add overloads for the parse
function for the basic_json
class.
Any comments?
This is, however, not sufficient to guarantee contiguous storage, and I have not found a way to enforce this in C++11.
This is not a static property, but a runtime property, unfortunately. For example, std::deque can be either contiguous or not depending on the size of the deque.
http://stackoverflow.com/questions/35004633/checking-if-a-sequence-container-is-contiguous-in-memory
@gregmarr Thanks for clarifying. So I think the only way to cope with this is to make clear in the documentation that the behavior is undefined for iterator ranges with non-contiguous storage. Or any better idea?
Probably. You could probably have an assert that you get the same address for begin iterator plus range length, and the end iterator.
I added an assertion to do the check from the Stack Overflow article:
int i = 0;
assert(std::accumulate(first, last, true, [&i, &first](bool res, decltype(*first) val)
{
return res and (val == *(std::next(std::addressof(*first), i++)));
}));
(Any idea how to get the int i = 0;
inside the lamdba? Making it a static variable did not work)
(One way would be to pass std::pair<bool, int>
to std::accumulate
and do use this type are result.)
I added a first draft of the deserialization function. Feedback is greatly appreciated. In particular, it would be nice to maybe weaken the preconditions.
The 1 byte precondition can be enforced at compile time.
static_assert(sizeof(std::iterator_traits<IteratorType>::value_type) == 1, "");
I don't know if asserting for non-empty range is a good choice, maybe returning a 'null' value would be better?
@theodelrieu Right, checking the size can be done at compile time, I will change this. For an empty range, I would rather throw the same assertion as the parser would, but this would require creating some fake data - returning null
would be problematic, because it would hide potential parse errors.
I saw you throw some exceptions to indicate parse errors, we might reuse one of those exceptions in this case?
Since we do not assert when a parsing error is encountered (because of invalid json), and since an empty value is an invalid json, I think it would be consistent with the rest of the invalid json handling.
I see if I can find a nice way to throw a parser error exception for an empty range.
Apart of the assertions, would this fix your original need?
Yes!
By the way, you were talking earlier about enhancing the std::string
overload.
template<typename T>
static basic_json parse(const T& s, const parser_callback_t cb = nullptr)
Do you have any plan to do so? It would be great when one wants to parse the whole buffer.
(This is a reason why the standard library's algorithms are a bit annoying, requesting iterators everytime).
I can create a new issue for that though.
What do you mean? An overload for a type T
and call the parser with begin()
/end()
? Or specifically a parse function for std::string
?
I'm talking about your first comment:
I think the current implementation would not benefit from an iterator-range access: Inside the parser, I employ a lexer which relies on access to the input via a pointer of type lexer_char_t* (unsigned char*). For std::string, this can be done with casting the result of c_str(), and std::istreams are copied line by line to a buffer of type std::string.
To process a std::vector<uint8_t>, such a pointer could be casted from data(). Maybe one could >generalize the static basic_json parse(const string_t& s, const parser_callback_t cb = nullptr) >function to
template
static basic_json parse(const T& s, const parser_callback_t cb = nullptr)where T has a data() member function (std::basic_string, std::vector, and std::array) and whose pointers to its value type can be casted to lexer_char_t*.
What do you think?
@theodelrieu I just committed some more changes. There are now the following overloads for the parse()
functions:
const char*
std::istream
- C-style array (elements must be 1 byte)
- iterator range (elements must be 1 byte, range must be contiguous)
- container (elements must be 1 byte, container storage must be contiguous; delegated to item 4)
The last two items allowed to clean up the interface: std::string
and string_t
can be parsed with item 5, and item 1 now allows to parse string literals without copying them into a container first.
Note that compared to your initial proposal, I now check whether the iterator
types for item 4 and 5 inherit from std::random_access_iterator_tag
rather than requiring that they are the same. The actual checks are done in item 4:
- contiguous storage is checked with an assertion
- the size check is done in a static assertion
- if an empty range is passed, the parser behaves as if an empty string was passed ("unexpected end of input")
I was skeptical at first, but now I think this made a great addition to the parser. Please let me know if anything is missing - otherwise, I would close this issue and add the feature with the 2.0.3 release.
Great!
I just have a few questions left:
-
Is the '\0' mandatory for
parse
to work properly?
We should remove this constraint, users should not have to take the '\0' into account IMHO
Internally, you could rely on a pairpointer, size
, instead of the '\0' -
I think you can remove the
parse(char const[])
overload if you modify theenable_if
condition inparse(const ContiguousContainer&)
:typename std::iterator_traits<decltype(std::begin(std::declval<ContiguousContainer const>()))>::iterator_category>::value
-
Concerning the
const char*
overload, I think it's too much constrained.
Why not keeping it generic?template<typename CharT, typename = typename std::enable_if<sizeof(CharT) == 1, float>::type> basic_json parse(const CharT* buffer, parser_callback_t cb = nullptr);
And in this case, the '\0' is mandatory.
This way I could useunsigned char*
orint8_t*
.
Without adding \0
, I ran into invalid reads. I need to check in the debugger why the lexer seems to read one character past the given limit. I hope it's just a stupid error on my side.
I shall try the other two proposals.
Cool!
For the last proposal, I forgot to add an is_integral<>
check.
I integrated the latter two proposals and now shall have a look at the null byte.
I fixed the lexer so trailing null bytes are no longer necessary.
The following test cases work:
SECTION("contiguous containers")
{
SECTION("directly")
{
SECTION("from std::vector")
{
std::vector<uint8_t> v = {'t', 'r', 'u', 'e'};
CHECK(json::parse(v) == json(true));
}
SECTION("from std::array")
{
std::array<uint8_t, 5> v { {'t', 'r', 'u', 'e'} };
CHECK(json::parse(v) == json(true));
}
SECTION("from array")
{
uint8_t v[] = {'t', 'r', 'u', 'e'};
CHECK(json::parse(v) == json(true));
}
SECTION("from std::string")
{
std::string v = {'t', 'r', 'u', 'e'};
CHECK(json::parse(v) == json(true));
}
SECTION("from std::initializer_list")
{
std::initializer_list<uint8_t> v = {'t', 'r', 'u', 'e'};
CHECK(json::parse(v) == json(true));
}
SECTION("empty container")
{
std::vector<uint8_t> v;
CHECK_THROWS_AS(json::parse(v), std::invalid_argument);
}
}
SECTION("via iterator range")
{
SECTION("from std::vector")
{
std::vector<uint8_t> v = {'t', 'r', 'u', 'e'};
CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
}
SECTION("from std::array")
{
std::array<uint8_t, 5> v { {'t', 'r', 'u', 'e'} };
CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
}
SECTION("from array")
{
uint8_t v[] = {'t', 'r', 'u', 'e'};
CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
}
SECTION("from std::string")
{
std::string v = {'t', 'r', 'u', 'e'};
CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
}
SECTION("from std::initializer_list")
{
std::initializer_list<uint8_t> v = {'t', 'r', 'u', 'e'};
CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
}
SECTION("from std::valarray")
{
std::valarray<uint8_t> v = {'t', 'r', 'u', 'e'};
CHECK(json::parse(std::begin(v), std::end(v)) == json(true));
}
SECTION("with empty range")
{
std::vector<uint8_t> v;
CHECK_THROWS_AS(json::parse(std::begin(v), std::end(v)), std::invalid_argument);
}
}
}
There is still an error building with MSVC (see https://ci.appveyor.com/project/nlohmann/json/build/1166) when the null byte is missing in a C-style array. I shall investigate this tomorrow (or would be happy for hints and advices).
uint8_t v[] = {'t', 'r', 'u', 'e'};
CHECK(json::parse(v) == json(true));
This is the one that's failing, and it's just a raw array of 4 integers. There is no null byte at the end to terminate the parse. You need to explicitly add the 0 element.
@gregmarr I saw this as well, but did not want to change anything tonight, because Valgrind and Clang's Address Sanitizers did not complain. And would std::distance(std::begin(v), std::end(v))
not be 4
and behave the same as a std::vector
or std::array
?
I'm assuming this is using the char * overload.
template<typename CharT, typename std::enable_if<
std::is_integral<CharT>::value and sizeof(CharT) == 1, int>::type = 0>
static basic_json parse(const CharT* s,
const parser_callback_t cb = nullptr)
It's an integral type and its size is 1.
With Xcode Version 8.0 beta 2 (8S162m),
uint8_t v[] = {'t', 'r', 'u', 'e'};
json::parse(v);
uses the
static basic_json parse(const ContiguousContainer& c,
const parser_callback_t cb = nullptr)
overload. Maybe MSVC chooses differently. I shall try tomorrow. Good night!
I tried on VS2015, I think this is a MSVC bug.
It chooses indeed the const CharT*
overload.
Here is a small test to reproduce
template <typename ContiguousContainer>
void f(ContiguousContainer &&) {
std::cout << "first\n";
}
template <typename CharT>
void f(CharT const*) {
std::cout << "second\n";
}
int main() {
auto t = std::vector<int>{1,2,3};
uint8_t tab[] = {'t', 'r', 'u', 'e'};
char *test;
static_assert(!std::is_pointer<decltype(tab)>::value, "");
f(tab);
f(t);
f(test);
return 0;
Here the first overload binds to everything and should always be taken, but MSVC considers the second to be a best match.
Here is a test with GCC.
I don't know if there is a workaround, but if it's a confirmed bug, we should report it to Microsoft.
I believe that MSVC is correct here. The second template is more specialized that the first, so it's selected.
http://en.cppreference.com/w/cpp/language/function_template
template<class T> void f(T); // template #1
template<class T> void f(T*); // template #2
template<class T> void f(const T*); // template #3
void m() {
const int* p;
f(p); // overload resolution picks: #1: void f(T ) [T = const int *]
// #2: void f(T*) [T = const int]
// #3: void f(const T *) [T = int]
// partial ordering
// #1 from transformed #2: void(T) from void(U1*): P=T A=U1*: deduction ok: T=U1*
// #2 from transformed #1: void(T*) from void(U1): P=T* A=U1: deduction fails
// #2 is more specialized than #1 with regards to T
// #1 from transformed #3: void(T) from void(const U1*): P=T, A=const U1*: ok
// #3 from transformed #1: void(const T*) from void(U1): P=const T*, A=U1: fails
// #3 is more specialized than #1 with regards to T
// #2 from transformed #3: void(T*) from void(const U1*): P=T* A=const U1*: ok
// #3 from transformed #2: void(const T*) from void(U1*): P=const T* A=U1*: fails
// #3 is more specialized than #2 with regards to T
// result: #3 is selected
// in other words, f(const T*) is more specialized than f(T) or f(T*)
}
I truly believe this is a MSVC bug:
Here we pass a uint8_t[]
to the function, not a pointer.
The compiler should consider the forwarding reference overload the best match,
since the const CharT *
one requires an array-to-pointer conversion.
Clang and GCC both choose the right overload.
I cannot find the exact quote on cppreference about T&&
binding rules for now, I will try again later.
That's an interesting question. Template overload resolution is indeed a bit of black magic.
From the same page as above:
P is the function parameter type (ContiguousContainer&&, CharT const *)
A is the argument passed to the function (uint8_t[4])
Before deduction begins, the following adjustments to P and A are made:
- If P is not a reference type,
a) if A is an array type, A is replaced by the pointer type obtained from array-to-pointer conversion; - If P is an rvalue reference to a cv-unqualified template parameter (so-called "forwarding reference"), and the corresponding function call argument is an lvalue, the type lvalue reference to A is used in place of A for deduction (Note: this is the basis for the action of std::forward):
For the CharT const * overload, #1 applies, and for the ContiguousContainer && overload, #4 applies.
That gives A = uint8_t[4]& for the ContiguousContainer version, and A = uint8_t * for the CharT version.
ContiguousContainer is then deduced as uint8_t[4]&
CharT is then deduced as uint8_t.
The deduced template functions are then (after reference collapsing)
parse<uint8_t[4]>(ContiguousContainer&)
parse<uint_t>(CharT const *)
The latter still looks to be more specialized, as it is const qualified.
However, there are also rules about what happens when considering the functions in a different order would give different results. Maybe that's involved here, or perhaps the lack of two-phase lookup in templates in VC.
As a matter of fact, I committed a version where the two functions in question are in a different order: no change...
Indeed it is tricky, I think we need a language lawyer on this one, we can post an example code on StackOverflow?
For now, we can put back the C array overload, to see if it helps?
@theodelrieu It would be nice to have more opinions on this, so I'm fine with any code on Stack Overflow.
Meanwhile, I tried the following: I changed the CharT const*
function to
template<typename CharPT, typename std::enable_if<
std::is_pointer<CharPT>::value and
std::is_integral<typename std::remove_pointer<CharPT>::type>::value and
sizeof(std::remove_pointer<CharPT>) == 1, int>::type = 0>
static basic_json parse(const CharPT s,
const parser_callback_t cb = nullptr)
Calling this function with json::parse("\"\\ud80c\\udc60\"")
gives:
error: call to 'parse' is ambiguous
note: candidate function [with CharPT = const char *, $1 = 0]
static basic_json parse(const CharPT s,
^
../src/json.hpp:6071:23: note: candidate function [with ContiguousContainer = char [15], $1 = 0]
static basic_json parse(const ContiguousContainer& c,
^
Adding the C array overload (and removing the experiment from my previous comment)
template<class T, std::size_t N>
static basic_json parse(T (&array)[N],
const parser_callback_t cb = nullptr)
makes it worse:
src/unit-class_parser.cpp:593:27: error: call to 'parse' is ambiguous
CHECK_THROWS_WITH(json::parse("\"\\uD80C\\u0000\""),
^~~~~~~~~~~
../src/json.hpp:5907:23: note: candidate function [with T = const char, N = 15]
static basic_json parse(T (&array)[N],
^
../src/json.hpp:5943:23: note: candidate function [with CharT = char, $1 = 0]
static basic_json parse(const CharT* s,
^
../src/json.hpp:6104:23: note: candidate function [with ContiguousContainer = char [15], $1 = 0]
static basic_json parse(const ContiguousContainer& c,
I made another experiment, but now run in another strange MSVC compiler error (https://ci.appveyor.com/project/nlohmann/json/build/1168):
C:\projects\json\src\json.hpp(5944): error C2027: use of undefined type 'std::remove_pointer<_Uty>' [C:\projects\json\test\json_unit.vcxproj]
After googling a bit, this may be a known MSVC issue...
I am working on a workaround right now, I think we can help MSVC a little
My workaround compiles.
However, tests fail on friend std::istream& operator<<(basic_json& j, std::istream& i)
invocation, but this might be unrelated to the current issue.
The following code is a POC, I hope we can tweak it to make it cleaner.
// Trick struct to help MSVC choose the good overload
template <bool B>
struct IsPointer{};
// Impl method for contiguous containers
template<typename ContiguousContainer,
typename std::enable_if<
std::is_base_of<
std::random_access_iterator_tag,
typename std::iterator_traits<decltype(std::begin(std::declval<ContiguousContainer>()))>::iterator_category>::value
, int>::type = 0>
static basic_json parse_impl(const ContiguousContainer& c, IsPointer<false>, const parser_callback_t cb)
{
return parse(std::begin(c), std::end(c), cb);
}
// IsPointer<true> prevents the implicit array-to-pointer conversion
template<typename CharT>
static basic_json parse_impl(const CharT* c, IsPointer<true>, const parser_callback_t cb)
{
return parser(reinterpret_cast<const char*>(c), cb).parse();
}
// We have to have a parse_impl for streams too. If we keep the parse(istream&)
// and parse(istream&&) overloads, calling parse with a std::stringstream
// will chose the parse(T&&) one
template <typename Stream, typename = typename std::enable_if<std::is_base_of<std::istream,
typename std::remove_reference<Stream>::type>::value, char>::type>
static basic_json parse_impl(Stream& s, IsPointer<false>, const parser_callback_t cb)
{
return parser(s, cb).parse();
}
// Only 'parse' overload with two arguments, (we keep the iterator-range one)
template<class T>
static basic_json parse(T&& c, const parser_callback_t cb = nullptr)
{
return parse_impl(std::forward<T>(c),
IsPointer<std::is_pointer<typename std::remove_reference<T>::type>::value>{},
cb);
}
The IsPointer trick is ugly, especially since we have to give it to every parse_impl overload.
I would like to know if the tests failures are related to this workaround first.
Let me know what you think about it!
Do I understand correctly that you need the IsPointer
thingy because MSVC fails using std::is_pointer
inside the templates?
In parse(T&&)
, I use std::is_pointer, since there is only one overload of parse
, T is deduced as uint8_t(&)[4]
.
So in this method, std::is_pointer
returns false.
This leads to the construction of IsPointer<false>
, which prevents MSVC to choose the pointer overload, since it requires IsPointer<true>
.
EDIT: It also seems MSVC performs the array-to-pointer conversion before resolving the enable_if.
Which might explain why your experiments didn't work.
I'm not an expert on template resolution techniques though...
Just a question: Why do you need to wrap the result of std::is_pointer<typename std::remove_reference<T>::type>::value
inside the type IsPointer
- couldn't you just pass a bool
to parse_ impl
?
You mean something like : parse_impl<true>(stuff...)
?
The parse_impl
definition would look like that:
template <bool IsPointer, typename ContiguousContainer>
basic_json parse_impl(ContiguousContainer&&);
But you cannot partially specialize methods, so I need to wrap the bool in an empty struct.
EDIT: the example code was misleading
I found a better way to fix the issue, your experiment failed to compile because of the last template line:
template<typename CharPT, typename std::enable_if<
std::is_pointer<CharPT>::value and
std::is_integral<typename std::remove_pointer<CharPT>::type>::value and
sizeof(std::remove_pointer<CharPT>) == 1, int>::type = 0>
static basic_json parse(const CharPT s,
const parser_callback_t cb = nullptr)
{
return parser(reinterpret_cast<const char*>(s), cb).parse();
}
Changing to sizeof(typename std::remove_pointer<CharPT>::type)
makes it compile.
The tests fail but this could be a mistake in the operator<<
implementation
@theodelrieu I comitted your proposed change (6e6e1c9) and all tests succeed.
I rebuilt using ninja, and launched ninja test.
Everything passes, I guess I ran tests in a bad folder or something like that.
Well, I have no remarks left, how about you?
I think the branch can be merged, unless you have any further remarks.
I'm fine for merging it!