eclipsesource/play-json-schema-validator

Incorrect resolution behavior for refs contained in cache

Closed this issue · 20 comments

I'm not quite sure, but it seems it is not possible to preload schemas for the SchemaValidator and still use relative referencing inside the schema. What I mean is, when I do

val validator = SchemaValidator().addSchema(schemaUrl.toString, schema)
validator.validate(schema, json)

the relative reference "$ref": "common.schema.json#contact" inside a subschema of the schema is not resolved. But when I pass the schema as a URL to the validate method, the reference is resolved correctly.

The thing is, it seems that when I pass the schema as a URL, it loads the schema file and parses it again, even though I had previously added the schema to the validator. I base my guess on the fact that passing the schema as a URL takes roughly twice as long as the other version.

It feels like the URL version gets the correct resolution scope, but the preloaded-schema version does not. This happens because the schema file is local and its location is only known at runtime, so it cannot be written directly into the schema as the id. A solution could be to provide a method for validating against a schema with a given id, for example

validator.validate(schema, schemaUrl.toString, json)
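To illustrate why the resolution scope matters here, the following is a minimal sketch of relative reference resolution using only java.net.URI. It is not the validator's actual implementation; RefResolution and resolveRef are hypothetical names, and the http://localhost URL is made up for the example.

```scala
import java.net.URI

object RefResolution {
  // Hypothetical helper: resolve a $ref against the current resolution
  // scope, if one is known. This only illustrates the general idea and is
  // not taken from play-json-schema-validator.
  def resolveRef(scope: Option[String], ref: String): Option[URI] = {
    val refUri = new URI(ref)
    if (refUri.isAbsolute) Some(refUri)                    // absolute refs need no scope
    else scope.map(base => new URI(base).resolve(refUri))  // relative refs need a base
  }
}
```

With a known scope such as http://localhost/schemas/main.schema.json, the ref common.schema.json#contact resolves to a sibling document. Without a scope, which is the situation of a preloaded schema that carries no id, there is nothing to resolve the relative ref against.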

But there is another problem here. As a workaround, I tried to modify the schema json after loading it, so that it has the correct id, then created the schema object from that. Unfortunately this causes the validation to run out of stack. I included the stack trace, but it alone may not be of much help.
stack-overflow.txt

Thanks for the report! I'll try to have a look this evening.

You are right about your assumption regarding the resolution scope. If you add a schema via addSchema, the resolution scope can't be inferred (that's why you need to specify the id via addSchema). But I guess your issue is more related to performance, since resolving the same URL twice shouldn't trigger parsing the schema twice. Instead, it should be fetched from the built-in cache, which gets filled once a schema has been retrieved via a $ref. So I think this issue is actually related to an error within the cache (and I think I already spotted the error), but please correct me if I am wrong.
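As a rough sketch of the caching behavior described above (one parse per document, no matter how many refs point into it), something like the following could be used. DocumentCache, its load function, and the loads counter are all hypothetical and are not the library's internal cache.

```scala
import java.net.URI
import scala.collection.mutable

// Hypothetical cache keyed by document URL with the fragment removed, so
// that several refs into the same document share one parsed copy.
final class DocumentCache[A](load: URI => A) {
  private val cache = mutable.Map.empty[URI, A]
  var loads = 0 // number of times `load` actually ran, for illustration only

  // Drop the "#..." part: it selects a location inside the document,
  // not a different document.
  private def withoutFragment(uri: URI): URI =
    new URI(uri.getScheme, uri.getSchemeSpecificPart, null)

  def get(ref: String): A = {
    val key = withoutFragment(new URI(ref))
    cache.getOrElseUpdate(key, { loads += 1; load(key) })
  }
}
```

Under this scheme, looking up common.schema.json#contact and then another fragment of the same document triggers a single load. A cache that keyed on the full ref, including the fragment, would reload the document for every distinct fragment.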

Yes, sounds about right. In the end, it seems an assumption on my part led me to use validator.validate with a pre-loaded schema, which cannot handle relative referencing (at least for now).

So using the validator.validate(url, instance) should not parse any schemas, if I have already added the root schemas and referenced schemas to the validator with correct URLs? Also, is the stack overflow also caused by the problem with the cache or is it totally unrelated?

Yes, that's correct. I'll try to add a fix on the weekend then.
Regarding the stack overflow: I don't think this is directly related to the cache, more likely it seems to happen during de-serialization of a JSON instance, but I can't tell for sure. Could you somehow make your changes available to me, e.g. by pushing your version of the modified validator together with your input data into a personal repo of yours and point me to that? That'd be great.

Fixed with 0.8.8

Thank you for the efforts!

I was just testing the validation with the new 0.8.8 version. It seems that validating is still much slower when passing the URL to validate than with the preloaded schema. Adding the schema to the validator by calling addSchema seems to have no effect in either case. In the schema I do not have any relative references that refer to schemas in other files. Also, if the root schema has an id (not the real URL of the schema, as it is not known prior to running the application), it fails to resolve any relative "#id-of-the-thing" references. It works if I do not specify any id for the root.

In my case the average timings are 9ms with the preloaded schema and 32ms with the URL (about 3.5x slower). Not sure why this is happening if the cache is working now. I have a feeling that it might be the discrepancy between the id and the real URL. In my earlier attempt to work around the cache problem, I patched the schema json with the correct id (the URL of the schema), but this caused the stack overflow. I guess I will try this again to see if there is a difference this time.

You're welcome. The fix for this particular issue should only be measurable if you have a schema with multiple refs pointing to the same document, but after reading your comment that's probably not the case for you. So perhaps the issue is another one: could you formulate test cases with your expected outcome (like you did with #99) and open respective issues? That would be great! I guess these would be:

  • no performance gain via addSchema in contrast to validate(URL, JsValue)
  • failing to resolve "#id-of-the-thing" in case the root schema has an id and was added via addSchema

Did I get this right?

Hi,

I'm not sure if it's the same problem, but I'm having trouble resolving external schemas. Here's the code (mind you, I'm a noob in Scala):

  private def trySchema(schemaName: String): JsResult[SchemaType] = {
    val schemaResource = getClass.getResource(s"/admin-ui-schemas/${schemaName}")

    try {
      JsonSource.schemaFromUrl(schemaResource)
    } catch {
      case e: MatchError => JsError("MatchError: " + e.toString)
    }
  }

  private def createValidator(dependencies: List[String]) =
    dependencies.foldLeft(SchemaValidator()) { (validator, depSchemaName) =>
      validator.addSchema(depSchemaName, trySchema(depSchemaName).get)
    }

  private def validateSchema(schemaName: String, dependencies: List[String]) = {
    trySchema(schemaName) match {
      case JsError(err) => ko(Json.prettyPrint(JsError.toJson(err)))
      case JsSuccess(schema, _) => {
        val jsonResource = getClass.getResource(s"/json-schema-examples/example.${schemaName}")
        val jsonTry = JsonSource.fromUrl(jsonResource)

        jsonTry match {
          case Failure(err) => ko(err.toString)
          case Success(json) =>
            val validator = createValidator(dependencies)

            validator.validate(schema, json) match {
              case err: JsError => ko(Json.prettyPrint(JsError.toJson(err)))
              case _ => ok
            }
        }
      }
    }
  }

Seems like addSchema doesn't help and validator fails to resolve the schema in dependencies:

[info]   + Validate common-schema.json
[error]   x Validate bot-achievement-schema.json
[error]    {
[error]      "obj.linked_ont_id" : [ {
[error]        "msg" : [ "Instance does not match all schemas." ],
[error]        "args" : [ {
[error]          "keyword" : "allOf",
[error]          "schemaPath" : "#/properties/linked_ont_id",
[error]          "instancePath" : "/linked_ont_id",
[error]          "value" : "584070107576906b9edbd76d",
[error]          "errors" : {
[error]            "/allOf/0" : [ {
[error]              "schemaPath" : "#/allOf/0/properties/linked_ont_id",
[error]              "errors" : { },
[error]              "keyword" : "$ref",
[error]              "resolutionScope" : "bot-achievement-schema.json",
[error]              "msgs" : [ "Could not resolve ref common-schema.json#/definitions/ont." ],
[error]              "value" : "584070107576906b9edbd76d",
[error]              "instancePath" : "/linked_ont_id"
[error]            } ]
[error]          }
[error]        } ]
[error]      } ]
[error]    } (JsonSchemaTest.scala:47)

(I'm using v0.8.8 for Scala 2.11)

Hi @alexkuz,
thanks for the report. Could you also post the schemas as well as the instances you are trying to validate, so that I can reproduce this and have a look at it?

Ok, I think I've found the issue, I'll try to come up with a fix soon.

@edgarmueller Thank you!

Turned out that it works when I use a valid URL in the schema id, so as a fix (dirty, but fine for tests) I just replace id -> "http://localhost/" + id in schema refs.
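A minimal sketch of that kind of rewrite, operating on plain id/ref strings, might look like the following. IdWorkaround and absolutize are hypothetical names, and leaving "#..." refs untouched is an assumption added here rather than something stated in the comment above.

```scala
import java.net.URI

object IdWorkaround {
  // Prefix taken from the workaround described above; presumably any valid
  // base URL would serve the same purpose.
  val Base = "http://localhost/"

  // Prefix relative ids/refs with the base URL. Refs that are already
  // absolute, and pure fragment refs such as "#/definitions/ont", are left
  // unchanged (the latter is an assumption of this sketch).
  def absolutize(idOrRef: String): String =
    if (idOrRef.startsWith("#") || new URI(idOrRef).isAbsolute) idOrRef
    else Base + idOrRef
}
```

Applied to both the schema ids and the refs that point at them, this keeps the two consistent, e.g. common-schema.json#/definitions/ont becomes http://localhost/common-schema.json#/definitions/ont.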

I'm now facing another problem though: I have really huge schemas, with lots of oneOf definitions, and if the object doesn't match the schema, it can take a really long time to process it (sometimes it seemingly takes forever). For now, I'm just using a timeout, as a valid object doesn't take too long to be validated, fortunately.
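One way such a timeout could be set up, using only the Scala standard library, is sketched below. TimedValidation and withTimeout are hypothetical names; the validate argument stands in for whatever call performs the validation.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object TimedValidation {
  // Runs `validate` on another thread and stops waiting after `limit`.
  // Note: the underlying computation is not cancelled on timeout; it keeps
  // running in the background until it finishes on its own.
  def withTimeout[A](limit: FiniteDuration)(validate: => A): Either[String, A] =
    try Right(Await.result(Future(validate), limit))
    catch {
      case _: TimeoutException => Left(s"validation did not finish within $limit")
    }
}
```

Usage would be along the lines of TimedValidation.withTimeout(5.seconds)(validator.validate(schema, json)). Because the abandoned validation still occupies a thread, this is better suited to tests than to a busy server.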

Can you point me to such a schema? It would be interesting to find out what takes so long.

I would prefer not to share it publicly since it's from a private project, but I can send it to you by email.

Sure, you can send it to the email address listed on my profile. Thank you!

This test should reflect the encountered error, correct? If not, please let me know. I've opened another issue for the performance problem.

That should do it, thank you!

Is it possible to have these fixes in v0.8.x? Our team is not ready yet to migrate to the next version :)

I'll need to backport them to the 2.11 branch which will involve a bit of work, since there were breaking changes in Play JSON 2.6, but yes, it's possible. I'll get to it probably over the weekend.

I did a 0.8.9 release, could you give it a try?

It works, thanks a lot!

Glad it worked, if there's anything else, let me know.