Horribly inconsistent behavior between const/non-const reference in operator [] ()
kawing-chiu opened this issue · 29 comments
Things like
std::string tag;
try {
tag = json["tag"];
} catch (const std::domain_error &e) {
// handle it
}
works nicely and as expected. However, if json is a const reference, the above code instead fails with an assertion. Why is it designed like this? Wouldn't it be better and more consistent to at least throw a domain_error instead of failing with an assertion? It took me more than an hour to debug this issue when I passed the above json by const reference to another function to extract data...
Neither version of operator[](key) checks whether the given key exists; both only throw std::domain_error when called on a JSON value that is not of type object. Therefore, your code will - in both cases - only catch exceptions indicating that json is not a JSON object. The behavior differs when key is not stored in the object:
- The non-const version (1) silently converts a null value to an empty object, and (2) creates a null value at key if it does not exist, before returning a reference to the value at key. This reference is always valid because of step (2): there will always be a value at key.
- The const version returns a const reference to the value at key. If there is no value at key, we can neither alter the object nor return a "null reference". The behavior is undefined in this situation (basically, the code is equivalent to return *object.find(key);). An assertion checks this.
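A minimal sketch of this asymmetry (the include path is an assumption; adjust it to how you consume the single-header library):

#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main()
{
    json j;        // null value

    // Non-const access: null is silently converted to an object and a
    // null value is inserted at "tag", so the returned reference is valid.
    j["tag"];      // j is now {"tag": null}

    const json cj = json::object();

    // Const access to a missing key: the object cannot be altered, so the
    // behavior is undefined; in debug builds an assertion fires.
    // cj["tag"];  // would trigger the assertion
}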
I thought about the semantics for quite a while, as there is no similar operator[] for std::map that returns a const reference. However, there is at for checked access, so operator[] is for unchecked access. Checking whether key exists in the const version would introduce overhead, which would contradict the "don't pay for what you don't use" principle.
I would be happy to hear ideas on how to improve this situation.
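To make the intended split concrete, a sketch (at() throwing std::out_of_range here reflects the library's documented checked access; the include path is an assumption):

#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main()
{
    const json j = {{"tag", "demo"}};

    // Checked access: at() throws for a missing key.
    try {
        std::string s = j.at("missing");
    } catch (const std::out_of_range& e) {
        std::cout << e.what() << '\n';
    }

    // Unchecked access: operator[] does no lookup check, mirroring
    // std::vector::operator[]; a missing key here is the caller's bug.
    std::string tag = j["tag"];
}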
Hello!
I think I have a solution for how to at least throw a domain_error.
In this method (line 3616):
template<typename T>
const_reference operator[](T* key) const;
We should simply change this:
assert(m_value.object != nullptr);
assert(m_value.object->find(key) != m_value.object->end());
return m_value.object->find(key)->second;
to something like this:
assert(m_value.object != nullptr);
if(m_value.object->find(key) == m_value.object->end()) {
throw std::domain_error("something");
}
return m_value.object->find(key)->second;
It works pretty nicely for me :-)
That's exactly what nlohmann said he doesn't want to do:
Checking whether key exists in the const version would result in overhead which would contradict the "not pay what you don't use" principle.
Well, I think this problem is not too hard, but it might require some API change. To see what the appropriate behavior here is, let's learn from how others do it.
Since JSON originates from and is heavily used by high-level scripting languages, consider how Python and Javascript handle JSON (these two languages even have JSON functionality built in, and it works intimately with their built-in data structures).
[Javascript]
Here is the behavior of a Javascript object, which maps to a JSON object:
- Reading. Reading a nonexistent key does not throw an exception but returns undefined, and nothing is inserted into the object by the read:
>> o = {}
Object {}
>> o.x
undefined
>> o
Object {}
- Writing. Writing to a nonexistent key works as expected:
>> o.x = {123: '321'}
Object {123: "321"}
>> o.y = null
null
>> o
Object {x: Object, y: null}
- Note that Javascript has two distinct built-in types: undefined and null. undefined means the variable does not exist; null maps to JSON null.
[Python]
Here is the behavior of a Python dict, which maps to a JSON object:
- Reading. Reading a nonexistent key throws a KeyError exception, and the dict itself is of course not touched:
In [1]: d = {}
In [2]: d['x']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-58142e8c3daa> in <module>()
----> 1 d['x']
KeyError: 'x'
In [3]: d
Out[3]: {}
- Writing. Writing to a nonexistent key works as expected:
In [3]: d
Out[3]: {}
In [4]: d['x'] = {'10': 100}
In [5]: d['y'] = None
In [6]: d
Out[6]: {'x': {'10': 100}, 'y': None}
- Python has no equivalent of Javascript's undefined. Python's None type maps to JSON null.
Note that on reading, neither language touches the object at all.
Now return to the C++ world. I think these two principles are worth following:
- Const and non-const behavior should definitely match.
- A production-quality library should really get rid of assertions. See the C++ standard library for an example: grep for assert in the source of clang's libcxx and you will find that assertions are used only where exceptions are not available, or very sparingly for internal debugging purposes.
So from the above discussion, IMHO the best solution is to mimic Python's behavior, which is well tested, widely used, and already familiar to many programmers. There's really no need to invent a new behavior. We can just define some exception like key_error. (In Python's case, KeyError is a built-in standard exception, but there's no reason we must stick to built-in exceptions. For example, <future> defines a bunch of exceptions for use by std::future.) The reasons for using an exception when reading a nonexistent key include:
- Reading a nonexistent key is indeed an exceptional case, which maps well to exceptions conceptually.
- Using an exception avoids the overhead of doing unnecessary tests. JSON is a highly dynamic data format. A common use case of a JSON library is on a server that processes high-throughput JSON requests from the outside world. In this case, 99.99% of the requests are simply valid: every field exists as expected. The remaining 0.01% are malformed JSONs, maybe just a handful out of tens of thousands a day. So we certainly don't want to test every field for existence and type conformance on the 99.99% of valid requests. This is the exact reason why I abandoned rapid_json and came to nlohmann/json.
- Note that in the current state of the library, where an assertion failure or UB may occur, this amounts to forcing the user of the library to write all of those tests on all of the JSON fields themselves. So we come to the point again: assertions should really be eliminated in favor of exceptions in any production-ready library.
So, to sum up, IMHO we really should follow the behavior of Javascript or Python; I would choose Python's. Or we could implement both and provide a global config option to let the user choose which flavor to use.
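A hypothetical sketch of that idea; key_error and checked_get are illustrative names, not part of the library:

#include <stdexcept>
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Hypothetical exception type, analogous to Python's KeyError.
struct key_error : std::logic_error
{
    explicit key_error(const std::string& key)
        : std::logic_error("key '" + key + "' not found") {}
};

// Hypothetical checked read: throw instead of asserting on a missing key.
const json& checked_get(const json& j, const std::string& key)
{
    auto it = j.find(key);
    if (it == j.end())
    {
        throw key_error(key);
    }
    return *it;  // iterator dereference yields the mapped value
}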
Using an exception avoids the overhead of doing unnecessary tests. JSON is a highly dynamic data format. A common use case of a JSON library is on a server that processes high-throughput JSON requests from the outside world. In this case, 99.99% of the requests are simply valid: every field exists as expected. The remaining 0.01% are malformed JSONs, maybe just a handful out of tens of thousands a day. So we certainly don't want to test every field for existence and type conformance on the 99.99% of valid requests. This is the exact reason why I abandoned rapid_json and came to nlohmann/json.
This is exactly why the test isn't there.
Note that in the current state of the library, where an assertion failure or UB may occur, this amounts to forcing the user of the library to write all of those tests on all of the JSON fields themselves. So we come to the point again: assertions should really be eliminated in favor of exceptions in any production-ready library.
The assertions are removed when you build with NDEBUG, so there are no assertions when building for production. They're debugging aids. Therefore, ignore them.
If the user of the library needs to know whether or not the desired field is there because it's reading from a const object constructed from an unknown source, then it is up to the user of the library to do the test. Adding this test makes the 99.99% slower so that the 0.01% can be exactly the same speed, but with the test in the library instead of outside the library.
Const and non-const behavior should definitely match.
That is simply not possible. It would mean that you could not do this:
json j;
j["foo"] = "bar";
@gregmarr This is not what I mean. I mean the const part of the behavior should match, of course. In our specific case, that means the behavior on reading.
Removing the assertion via such a debug option is pointless, since what you get instead is undefined behavior; it's essentially the same. Failing with undefined behavior is even worse, IMHO. Have you read through my post? I did not advocate adding tests at all. Where did you get that impression from?
I confused your post with dka8's post.
What exactly are you proposing then?
@gregmarr I propose to leave everything as is:
- at for checked access
- operator[] for unchecked access
@kawing-chiu I am aware of how other languages cope with JSON, and the motivation for this project was "What if JSON were part of Modern C++?". Where possible, I tried to make JSON feel like part of the STL and basically combined the interfaces of std::vector and std::map. Both offer checked access to their elements with at. std::vector also has operator[] for both const and non-const access - in both cases without bounds checking. As std::map has no const operator[], I basically had three choices:
- not to offer a const operator[]
- copy the behavior of at and provide checked access
- mimic the behavior of vector and provide unchecked access
Option 1 makes no sense. There are so many Stack Overflow threads about people being confused that they cannot use operator[] on a const std::map that I did not follow this.
Option 2 makes more sense, but it would always check for an element, even when such a check is not needed. This, in my opinion, is not very C++-like.
Option 3 is documented as such. Just like dereferencing the past-the-end iterator or using an invalid vector index, the behavior is undefined. One could argue that calling find, checking whether it returns end(), and returning the reference if not does not really add overhead, but in the end it is a check that no one asked for when choosing operator[] rather than at.
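For reference, the caller-side check this design leaves to the user can be written with the public find()/end() interface (a sketch):

#include <nlohmann/json.hpp>

using json = nlohmann::json;

void process(const json& j)
{
    // The existence check lives only where the caller needs it, so code
    // that already knows the key is present pays nothing extra.
    auto it = j.find("tag");
    if (it != j.end())
    {
        const json& tag = *it;  // safe: the key exists
        // ... use tag ...
    }
}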
About assertions: I shall think about switching them off by default and only switching them on with a special preprocessor symbol. However, the current code only uses assertions in cases of otherwise undefined behavior.
There is no need for a special preprocessor symbol; assert comes with one as part of the standard, as I mentioned.
I agree with leaving things as is.
@gregmarr I agree. I think the main issue for @kawing-chiu is that they did not expect an assertion in this case. I do mention assertions in the README - I am not sure what I can do to avoid such confusion.
@nlohmann OK, now I see why. Actually, I think Option 1 is the best of the three, and it makes some sense. If we're to follow the STL's lead, then just follow it as closely as possible...
Another possibility I can think of is to define a subclass of std::logic_error and throw it, letting the user know what happened, instead of the assertion failure.
Either way, I think there should be a section dedicated to this in the README.
I think providing a function in the flavor of std::vector and offering unchecked access for objects as well is more consistent than not allowing const access through operator[] at all. Anyway, I think the larger issue is the assertions. I don't think throwing an exception is a solution, because this again would introduce a check...
There is currently a notes section in the README, reading:
- The code contains numerous debug assertions which can be switched off by defining the preprocessor macro NDEBUG, see the documentation of assert.
What would you like to add?
@nlohmann Exceptions are not like checks. They are zero-cost on the 'normal path'. And unlike assertions, they can be caught and won't make your server crash.
The problem is not at all whether we can turn assertions into undefined behavior with NDEBUG.
As an example: std::promise::get_future() uses various kinds of future_error exceptions to indicate something that should not be done by the user (e.g. the future associated with the promise can only be retrieved once). std::promise::get_future() will not just irresponsibly assert and die when the user retrieves the future twice.
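That standard-library precedent is easy to demonstrate:

#include <future>
#include <iostream>

int main()
{
    std::promise<int> p;
    std::future<int> f = p.get_future();

    try {
        // Retrieving the future a second time is a usage error that the
        // standard library reports via an exception, not an assertion.
        std::future<int> f2 = p.get_future();
    } catch (const std::future_error& e) {
        std::cout << e.code().message() << '\n';
    }
}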
With
I don't think throwing an exception is a solution, because this again would introduce a check...
I meant that one would still need code like
if (m_object->find(key) == m_object->end())
    throw std::logic_error(std::string(key) + " not found");
Or did you mean to have a different assertion macro that throws, so that the check would not be executed in production?
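That is, something along these lines (JSON_THROWING_ASSERT is a hypothetical name, just to make the question concrete):

#include <stdexcept>

// Hypothetical: throws in debug builds, compiles away under NDEBUG,
// so the production fast path stays unchecked.
#ifndef NDEBUG
#define JSON_THROWING_ASSERT(cond, msg)      \
    do {                                     \
        if (!(cond))                         \
            throw std::logic_error(msg);     \
    } while (0)
#else
#define JSON_THROWING_ASSERT(cond, msg) (void)0
#endif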
I don't understand the point. You already have that check in your assert():
assert(m_value.object->find(key) != m_value.object->end());
So isn't it just the same in terms of overhead?
OK, so you meant this:
Or did you mean to have a different assertion macro that throws, so that check would not be executed in production?
:)
Ok, so you propose to replace the vanilla assert with a function that throws some exception instead, and this function would be a no-op in production, right?
Well, not exactly. I mean we should use the exception for both development and production, like std::promise::get_future() does.
But since there's at()... What really needs to be done, I think, is to add a section to the README to clarify things and warn the user. At least say explicitly that [] may assert.
The code contains numerous debug assertions which can be switched off by defining the preprocessor macro NDEBUG, see the documentation of assert.
is not quite enough. 'There is a minefield.' But where are the mines?
Option 1 makes no sense. There a so many Stack Overflow threads about people being confused that they cannot use operator[] for const std::map that I did not follow this.
C++ developers are still likely to work with const std::map, and I don't think that using a different approach in this library makes them any less confused. I found a few caveats in my code where operator[] was used with const objects, and I'd much rather have compiler errors in such cases (and consistency with std::map).
std::vector also has operator[] for both const and nonconst access - in both cases without bounds checking.
At least the behavior of both operators is consistent, as neither of them can be used for inserting new elements.
I agree with @maksis. I think that is the exact reason why std::map does not have a const operator[].
The assertions are now mentioned in the README and the documentation of the function. Furthermore, the release notes for 2.0.3 will also mention the existence of assertions in the project and how to disable them.
I know I'm late to the discussion here, but I wanted to add some thoughts about this. I understand the idea: since const operator[] is unchecked for arrays, it should be fine to make it unchecked for maps too. But I don't think that's true. There are many reasons why operator[] is unchecked for arrays, and those reasons do not apply to maps.
First of all, indexing into an array is fundamentally a very cheap operation, possibly a single processor instruction. Adding a bounds check makes the operation many times slower. By contrast, looking up an entry in a string map is already an expensive operation involving O(log n) string comparisons; doing an extra it != container.end() check makes it only negligibly slower.
Second of all, the most common operation with input arrays is to iterate through them and access each element. With this usage pattern, it's easy to ensure that there is no out-of-bounds access. By contrast, with input maps, random access by key is more common than iteration. And with random access it's a lot more likely that out-of-bounds access is a real possibility that you have to deal with.
Third of all, everyone is already very used to the fact that indexing into arrays is unchecked, ever since the days of C. There is no corresponding expectation for maps. I think that when people write code using JSON objects in C++, they expect one of two things:
- JSON objects are like std::map, i.e. there is no const operator[]
- JSON objects are like dicts in Python etc., i.e. operator[] throws for keys that aren't found
I really don't think that people's first thought would be that operator[] for JSON objects behaves the same as it does for arrays, i.e. undefined behavior for out-of-bounds access. That, to me, is a fundamentally surprising thing about the interface of this library. Of course you can try to remedy a surprising interface with documentation, but it's still a problem.
I don't agree with the concern that it violates "pay only for what you use". Like I mentioned earlier, the cost of the check is negligible compared to the existing cost of the indexing operation. It's a small price to pay for eliminating a source of undefined behavior which, in my opinion, an inexperienced user is very likely to run into. Someone said that this check would make the 99.99% case slower for the sake of the 0.01% case. I strongly disagree with that analysis. How can it be that in 99.99% of cases, people are so sure about the source of their input that they're willing to invoke undefined behavior if they're wrong? In 99.99% of cases, the json that you're reading came from some external source, whether a config file or some other process. You really want to give that external source the power to cause a segfault in your application? I think in 99.99% of cases, people are going to want checked access. Or, if they're an inexperienced developer, they won't bother to do any checking thinking they don't need it, until the day that they're wrong and now they have a segfault to deal with.
In conclusion, what I want is for the const version of operator[] to have the same behavior as at, because operator[] has the nicer, more obvious syntax, so it should be the safe version. Adding this check would even be a backwards-compatible change, because it takes something that is undefined behavior today and makes it defined behavior.
While we're here, another possible behavior for const operator[] would be to keep an empty instance of basic_json as a static variable in the class and have const operator[] return a const reference to this instance if the lookup fails. This would be very useful for reading nested configs where all the fields are optional, e.g.:
void loadConfig(const json& config) {
const json& child = config["foo"];
a_ = child.value("a", 0);
b_ = child.value("b", true);
c_ = child.value("c", "bar");
}
In this case, since all the fields inside "foo" are optional, we can make "foo" itself optional as well.
But probably the less surprising behavior for const operator[] is to throw. The above behavior of returning a const reference to a dummy instance is useful in its own right, so it could be added as a new method, say child(key).
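A sketch of that idea as a free function (child is only the suggested name; nothing like this exists in the library):

#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Suggested helper: return a const reference to the value at key, or to
// a shared null instance when the key is missing, so optional subtrees
// like the "foo" above can be read without existence checks.
const json& child(const json& j, const std::string& key)
{
    static const json empty;  // null instance, lives for the whole program
    auto it = j.find(key);
    return it != j.end() ? *it : empty;
}

With this, loadConfig above would keep working even when "foo" is absent: child(config, "foo") simply yields the shared null value, and the value() calls fall back to their defaults.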
Adding a bounds check to it makes the operation many times slower
Actually, even for arrays it is only marginally slower, even for cache-friendly sequential access on modern architectures (pipelining and all that), and it is essentially free for non-cache-friendly random access.
I have to agree that undefined behavior should not be introduced without really good reasons, and if it can be eliminated at extremely negligible cost, it should be.
On that note, I'm a believer that asserts should be reserved for testing preconditions that are internal to the library, rather than for checking the validity of external inputs we have no control over - that's what domain_errors are for!
assert(does_not_pose_sql_injection_threat(user_input)) ;)
@eferreira The value() function seems similar to your proposed child function, right?
The value function returns a copy, right? The child function would return a const reference. I'd like a function that returns a reference to a child if it's there, otherwise it returns a reference to some dummy instance. Is it possible to achieve that with the value function?
Right, it returns a copy.