adlnet/xAPI-Spec

Duplicate Properties/keys in JSON

Closed this issue · 20 comments

Per the 5/18/16 meeting: given the wide-ranging behavior of JSON parsers in JavaScript, duplicate JSON properties are not easily detected, to the point that requiring the LRS to reject them is impractical. The original requirement was on the Learning Record Provider/Statement Creator. Expanding the LRS's role to rejection made sense at the time, but now that the implementation details are clearer, it makes more sense to just have a SHOULD (in regards to an LRS rejecting per Part 2, Section 2.2).

We also talked about the impact on Conformance Testing; perhaps there could be a separate certification for an LRS that is able to do Content Testing, rather than imposing too many restrictions.

For some background on JSON, it is defined by RFC 7159, which states the following for duplicate properties:

The names within an object SHOULD be unique.

Then if we go to RFC 2119, which defines SHOULD, we'll find the following:

This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

Changing this to a SHOULD for the LRS seems like the wrong move, regardless of how hard it is to implement. It weakens interoperability by removing part of the input validation. What is the expected behavior now? Providers that work with one LRS could now potentially fail with another LRS, or statements could be stored differently depending on how the LRS's JSON lib handles duplicate properties.

The specification currently states the following under the Statement Requirements:

A Statement MUST use each property no more than one time

This puts the requirement on the producer of the statements to ensure that they don't somehow generate a statement with duplicate properties within the serialized JSON object.

The problem with the current verbiage

The LRS MUST reject Statements

  • where a key occurs multiple times within an object.

is that it now places the burden of preventing duplicate properties on the LRS, where it can do nothing but reject the statement, when instead that statement should never have been produced in that way.

When working with languages that have JSON serializing and deserializing built into them, you do not have the ability to control how the JSON is deserialized. If I run JSON.parse("{ \"a\":\"hi\", \"a\": \"Hello\" }"), I get back an object containing { a: "Hello" }, because the first value is overwritten. You'll see the same behavior in other languages such as Python, Ruby, and Scala.
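Concretely, runnable in any modern JavaScript engine:

// The built-in parser silently collapses duplicate keys; the last value wins.
JSON.parse('{ "a": "hi", "a": "Hello" }'); // => { a: "Hello" }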

Because of this, it is not possible for an LRS to detect via its built-in parser that a duplicate property existed within an object; it would be blissfully unaware that it occurred. The only way to detect it would be to create a custom JSON deserializer which looks for duplicate keys and rejects the document as invalid (which, per the JSON specification referenced above, it isn't; it's just not recommended).
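For illustration, here is a minimal sketch of what such a detector would have to do. This is not a full JSON validator: it assumes the input is otherwise well-formed JSON, and it compares keys as raw escaped text (so "\u0061" and "a" count as different keys).

// Sketch of a duplicate-key scanner. Assumes otherwise well-formed JSON.
function hasDuplicateKeys(json) {
  var stack = []; // one Set of seen keys per open object; null for arrays
  var i = 0;
  while (i < json.length) {
    var c = json[i];
    if (c === '"') {
      // Consume the whole string literal, honoring backslash escapes.
      var start = ++i;
      while (i < json.length && json[i] !== '"') {
        if (json[i] === '\\') i++;
        i++;
      }
      var str = json.slice(start, i);
      i++; // skip the closing quote
      // Peek past whitespace; a ':' means this string was an object key.
      var j = i;
      while (j < json.length && /\s/.test(json[j])) j++;
      var keys = stack[stack.length - 1];
      if (json[j] === ':' && keys) {
        if (keys.has(str)) return true;
        keys.add(str);
      }
    } else {
      if (c === '{') stack.push(new Set());
      else if (c === '[') stack.push(null);
      else if (c === '}' || c === ']') stack.pop();
      i++;
    }
  }
  return false;
}

With this, hasDuplicateKeys('{ "a": "hi", "a": "Hello" }') returns true, where the built-in JSON.parse would happily return { a: "Hello" }.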

Additionally, in order to create a serialized JSON value which has duplicate properties, you would need to be using a non-standard JSON serialization method; as a very crude example, see below:

var statementString = "{";
statementString += "\"actor\": { \"openid\": \"1234567890...\" },";
statementString += "\"actor\": { \"openid\": \"0987654321...\" }";
// ... etc.
statementString += "}";

When building a value to serialize in code, you would be incapable of adding duplicate properties onto an object, because the assignments would just overwrite one another; see below for an example:

var statement = {};
statement.actor = { openid: '1234567890...' };
statement.actor = { openid: '0987654321...' };

These assignments occur on the same property name and just overwrite one another, resulting in an object that serializes to {"actor":{"openid":"0987654321..."}}.

Requiring an LRS to re-implement JSON parsing beyond what the JSON specification calls for puts an unnecessary strain on adoption of the xAPI specification for anyone creating an LRS. It also opens up many possibilities for further fragmenting LRS implementations, both through JSON not being interpreted the same way across systems and through additional edge cases resulting from incomplete and improper implementations of JSON deserialization.

Thanks for the detailed explanation @DavidTPate. It seems like the requirements are very clear, but apparently hard to implement.

I'd be interested to hear from other LRS vendors whether they conform to this requirement. I did a quick test with Watershed and Postman and it behaved in the way you described, taking the 2nd instance of the property without error. I'm not 100% sure if it was Postman or Watershed that dropped the first instance though. cc @bscSCORM @fugu13

Could an LRS test for this without writing its own JSON parser, by parsing the JSON, re-encoding it, and comparing the original and resulting string lengths, perhaps with some whitespace removal?

http://stackoverflow.com/a/31502096/4214694
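For example, something like this sketch (the helper name is made up; and note JSON.stringify may normalize numbers or escape sequences, e.g. 1.0 becomes 1, so a length difference is a heuristic rather than proof):

// Heuristic: JSON.parse drops all but one value for a duplicated key,
// so the re-encoded string comes out shorter once whitespace is ignored.
function mayHaveDuplicateKeys(rawBody) {
  var reencoded = JSON.stringify(JSON.parse(rawBody));
  // Strip ALL whitespace from both sides so formatting differences cancel out.
  var originalLength = rawBody.replace(/\s/g, '').length;
  var reencodedLength = reencoded.replace(/\s/g, '').length;
  return originalLength !== reencodedLength;
}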

Yeah, I'd expect it to be Watershed doing that @garemoko, as all Postman does is send the string value to the server (if you put invalid JSON in there, for example, Postman won't error, but your server will).

You couldn't reliably test just the string length, as whitespace is irrelevant (outside of values) in JSON but relevant for a string-length comparison. I.e., the following are all the same serialized JSON:

{\"actor\":{\"openid\":\"1234567890...\"}}
{ \"actor\" : { \"openid\" : \"1234567890...\" } }
       { \"actor\" : { \"openid\" : \"1234567890...\" }     }
{
  \"actor\": {
    \"openid\": \"1234567890...\"
  }
}
{
    \"actor\": {
        \"openid\": \"1234567890...\"
    }
}

Right, you'd have to remove either all the whitespace, or just the whitespace outside of values, from both strings. That's easy to do though. Not sure how much of an effect that would have on response time.

Right now, the position I'm taking is that the spec is clear and that LRS vendors can fulfil the requirement by stripping whitespace and comparing the string lengths of the original request content and the decoded-then-recoded JSON string. On that basis this issue can be closed without a spec change required. (Though we may want to open a conformance test issue.)

Anybody want to challenge that?

@garemoko, just to confirm, are you for keeping the requirement that an LRS MUST reject Statements with duplicate properties? If so, I second that.

There are a few things with this. You can't strip all whitespace, because whitespace within a string matters as well, so doing something like the following will not work:

// Incoming payload is: `{ "actor": { "name" : "Some Person", "openid" : "1234567890..." } }`
var payload = "{ \"actor\" : { \"name\" : \"Some Person\", \"openid\" : \"1234567890...\" } }";
var payloadNoWhitespace = payload.replace(/\s/g, ''); // This is now equal to `{"actor":{"name":"SomePerson","openid":"1234567890..."}}` notice that it's now `SomePerson` for the name, not `Some Person` so this will never match the deserialized then serialized JSON.

var deserializedThenReserializedPayload = JSON.stringify(JSON.parse(payload));

if (payloadNoWhitespace === deserializedThenReserializedPayload) {
  console.log('Success!'); // This will never be hit, since all whitespace has been removed
}

So to do this properly you have to remove all whitespace that isn't within quotes. There are a number of ways to do this, most of which are very inefficient. What we're going to do is use a lookahead that only matches whitespace followed by an even number of quote (") characters through the end of the string, i.e. whitespace outside of string values. We can do this with the following regular expression:

// Using the same payload in the previous example.
var payloadNoWhitespace = payload.replace(/\s(?=(?:[^"]*"[^"]*")*[^"]*$)/g, ''); // Now we come out with `{"actor":{"name":"Some Person","openid":"1234567890..."}}` which will match the deserialized-then-reserialized one.

Okay, how does this work with multi-line payloads (which are still valid)?

var payload = "{\n" +
"    \"actor\": {\n" +
"        \"name\": \"Some Person\",\n" +
"        \"openid\": \"1234567890...\"\n" +
"    }\n" +
"}";
var payloadNoWhitespace = payload.replace(/\s(?=(?:[^"]*"[^"]*")*[^"]*$)/g, ''); // Again we come out with `{"actor":{"name":"Some Person","openid":"1234567890..."}}` which will match the deserialized-then-reserialized one.

What about with localized values? We'll try it with some Arabic:

var payload = "{ \"actor\" : { \"name\" : \"هذا هو غبي\", \"openid\" : \"1234567890...\" } }";
var payloadNoWhitespace = payload.replace(/\s(?=(?:[^"]*"[^"]*")*[^"]*$)/g, ''); // We come out with `{"actor":{"name":"هذا هو غبي","openid":"1234567890..."}}` which will match the deserialized-then-reserialized one.

@DavidTPate I don't get why you can't strip all whitespace for the sake of comparison. Obviously you can't strip it from the version you store, but I'm just talking about what's used for comparison to detect duplicate properties. If the two statements match except for whitespace outside of values, they will still match when you strip all whitespace, including within values, from both.

So, to take your example:

// Incoming payload is: `{ "actor": { "name" : "Some Person", "openid" : "1234567890..." } }`
var payload = "{ \"actor\" : { \"name\" : \"Some Person\", \"openid\" : \"1234567890...\" } }";
var payloadNoWhitespace = payload.replace(/\s/g, ''); // This is now equal to `{"actor":{"name":"SomePerson","openid":"1234567890..."}}` notice that it's now `SomePerson` for the name, not `Some Person` so this will only match the deserialized then serialized JSON once that's had whitespace removed too.

var deserializedThenReserializedPayloadNoWhitespace = JSON.stringify(JSON.parse(payload)).replace(/\s/g, '');

if (payloadNoWhitespace === deserializedThenReserializedPayloadNoWhitespace) {
  console.log('Success!'); // This will now be hit, since all whitespace has been removed from both
}

@garemoko You could strip ALL whitespace, but the result is not syntactically correct in terms of what you actually want to store. I don't see any reason why you would want to strip ALL whitespace other than personal preference.

For me, the reason I want to avoid that is that global regular expressions are very slow, and parsing + stringifying the JSON unnecessarily is slow as well. When you're dealing with these things at scale, the time spent on these operations really adds up. No matter what, to store an xAPI statement you are going to have to parse the JSON at least once, so avoiding extra operations done just for personal preference is ideal, with the goal of improving performance by performing as few operations as necessary.

A small example like the one in my previous response will not show a big difference, but with statements that are much larger, such as the third one in Appendix A, the cost is much more noticeable, especially at the scale at which you might receive these statements.
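As a rough illustration of the cost (a sketch; the payload shape here is made up, and absolute numbers will vary by engine and statement size):

// Compare a plain parse against the extra stringify + regex-strip pass.
var big = JSON.stringify({
  statements: new Array(1000).fill({
    actor: { name: 'Some Person', openid: '1234567890...' }
  })
});

console.time('parse only');
JSON.parse(big);
console.timeEnd('parse only');

console.time('parse + stringify + strip whitespace outside quotes');
JSON.stringify(JSON.parse(big)).replace(/\s(?=(?:[^"]*"[^"]*")*[^"]*$)/g, '');
console.timeEnd('parse + stringify + strip whitespace outside quotes');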

Either way, it's not the place of the specification to tell you how to determine whether there are duplicate properties, only that you shouldn't allow duplicate properties.

Right - I'm just trying to tease out whether or not this requirement is an unnecessarily high processing burden on the LRS, using that approach as an example. Seems like you are saying it is too much processing and perhaps @ljwolford is saying it isn't? @ljwolford - is there anything the ADL LRS is doing to get around the issues @DavidTPate has outlined?

Reviewing my memories and older spec versions, I think the requirement is probably miswritten.

While we've occasionally allowed some slight things in 1.0.x versions that weren't entirely clear in 1.0.0, this is a fairly onerous additional requirement, and I can't see any hint of it in 1.0.0. As should be clear, the usual practice with JSON (which is all that might have been implied by 1.0.0) is to arbitrarily pick one of the values of the duplicated key and ignore the other(s), so the requirement added later does not follow our usual approach of only clarifying.

The 'must not include' language was added in 1.0.1, but not the LRS requirement. If I'm recalling correctly, this came after a discussion (does anyone else have a recollection?) where we agreed the situation was very esoteric: we'd be clear that people shouldn't do it, but we wouldn't require LRSs to do anything, since detecting it is nonstandard practice and onerous.

Then, the LRS requirement seems to have been added at 79bdb1e

That change has never made it into an official release of the spec, since 1.0.3 isn't out yet, and it was presumably based on the earlier (underspecified) text that, at least as far as I recall, was never intended to entail an LRS requirement.

In other words, I don't think this is a requirement, and I think we should revise the text to make it clear.

In other words, @garemoko, I do want to challenge that ;). The spec isn't clear, because the requirement has never been in the actual released spec, and there's no good basis in 1.0.0 for including it in any 1.0.x release.

@fugu13 good catch. Based on that, here's the PR: #580

So what are the next steps? Do we just remove that line? Do we recommend certain behavior in the event of multiple instances of a property (ignoring all but the first)? I'm in favor of SHOULD-ing some behavior in this scenario.

Sorry about another post asking for clarification, we're just trying to get this straight for our LRS Test work...

This behavior seems contradictory to the behavior being discussed in the stored validation discussion #931. In this case it seems that testing the incoming Statement for duplicate properties is frowned upon, yet in #931 the LRS is forced to validate a property it must overwrite.

June 1 call - remove the line. Add a description to say that if a property is included multiple times, which instance is kept is undefined.

Can close this now since #937 is merged.