Error: Encountered unknown dialect 'https://json-schema.org/validation'
GabenGar opened this issue · 12 comments
Info
NodeJS - 16.20.0 (I know, more on that later)
json-schema - 1.5.1
NextJS - 13.4.19
Repro
This branch basically: https://github.com/GabenGar/todos/tree/unknown-dialect
git clone https://github.com/GabenGar/todos --branch unknown-dialect --single-branch
cd todos
npm run install-all
npm run build
And then get several Error: Encountered unknown dialect 'https://json-schema.org/validation' errors (I assume one per worker) during the page rendering stage.
Details
I know I don't meet the minimum Node.js requirement, but I've read all the discussions and it looks like it's only due to fetch becoming global in it. And I assume the error is caused by the package trying to fetch something and failing for whatever reason.
However I assumed by following these steps:
- initializing all schemas as a side effect
- writing a factory function to create a validator function
- call it when needed
I'd avoid any network calls at all.
But it crashing at the build step (when no validator functions are called/created) means it crashes in the init() function. I assumed that by importing from "@hyperjump/json-schema/draft-2020-12" I'd get all the parts already. What do I have to do to prevent the package from doing any network/fs calls and instead crash outright when it can't find something?
You're getting that error because you're trying to load a schema and haven't declared what dialect of JSON Schema the schema uses. The default is https://json-schema.org/validation and you've only loaded support for 2020-12. Therefore, you get the error that the dialect is unknown. There are no network calls happening in this situation.
To fix this problem, your schemas need to declare the dialect they use with $schema or pass the dialect in the addSchema function. The former is generally considered best-practice.
pass the dialect in the addSchema function
What is a retrievalUri in the argument and what do I have to put there if I only want to pass defaultDialectId? Also the error should be more explicit in saying it requires $schema key and found none or couldn't figure out the metaschema. I assumed the functions from "@hyperjump/json-schema/draft-2020-12" would automatically assume draft-2020-12 dialect when not provided, but I guess not.
The former is generally considered best-practice.
Don't know about that, it's mainly a noise in the schema collection derived from the same metaschema.
What is a retrievalUri in the argument and what do I have to put there if I only want to pass defaultDialectId?
You can read about the Retrieval URI concept here. The short version is that you would use the retrievalUri argument if your schema doesn't include $id. In other words, it's an alternate way of associating a schema to a URI. If you don't want to set the retrievalUri and do want to set the defaultDialectId, you can pass undefined for the retrievalUri.
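As a rough illustration of the rule described above, dialect selection can be modelled as "use $schema if present, otherwise the defaultDialectId argument, and fail if that dialect was never loaded". This is a simplified sketch, not the library's actual implementation; loadedDialects and determineDialect are hypothetical names, and the real default dialect handling differs in detail.

```javascript
// Simplified model of dialect selection; NOT the library's real code.
// `loadedDialects` and `determineDialect` are hypothetical names.
const DRAFT_2020_12 = "https://json-schema.org/draft/2020-12/schema";

// Pretend only 2020-12 support was loaded, as in the issue.
const loadedDialects = new Set([DRAFT_2020_12]);

function determineDialect(schema, defaultDialectId) {
  // `$schema` in the schema wins; otherwise fall back to the argument.
  const dialectId = schema.$schema ?? defaultDialectId;
  if (dialectId === undefined || !loadedDialects.has(dialectId)) {
    throw new Error(`Unknown dialect: ${dialectId}`);
  }
  return dialectId;
}

const bare = { type: "object" };
const declared = { $schema: DRAFT_2020_12, type: "object" };

// Dialect taken from the `$schema` keyword.
const fromKeyword = determineDialect(declared);
// Dialect taken from the argument because the schema declares none.
const fromArgument = determineDialect(bare, DRAFT_2020_12);

// With the real library, the second case corresponds to the call described
// in the thread: addSchema(bare, undefined, DRAFT_2020_12);
```

Calling determineDialect(bare) with no second argument throws in this model, which mirrors the situation that produced the original error.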
the error should be more explicit in saying it requires $schema key and found none or couldn't figure out the metaschema.
That's good feedback that the message isn't clear. The problem is that there are multiple reasons the dialect would be unknown and multiple ways to fix it. It's hard to fit all that in an Error message, but I'll see what I can do.
I assumed the functions from "@hyperjump/json-schema/draft-2020-12" would automatically assume draft-2020-12 dialect when not provided
I can see why that would be confusing. The way it works is that all functions work with any configured dialect. This allows for supporting multiple dialects and things like referencing a draft-07 schema from a 2020-12 schema. The dialect-specific imports load support for that dialect, but the functions it exposes are just the generic functions that work with any dialect. You can load as many dialects as you need and those functions will work with all of them. I considered separating the api from loading dialects, but I didn't want users to have to use two imports just to get started. You're the second person who's been confused by that, so maybe I made the wrong choice.
Don't know about that, it's mainly a noise in the schema collection derived from the same metaschema.
There are good reasons for it, but I won't get into that here except to say that we're making changes for the next version of JSON Schema that will render those "reasons" moot and it will make sense to not include the dialect in the schema anymore.
You can read about the Retrieval URI concept here. The short version is that you would use the retrievalUri argument if your schema doesn't include $id.
How alternate is it allowed to be?
Given these Retrieval URIs and their schemas:
https://example.com/schema/account
{ "$id": "https://example.com/schema/profile", "title": "Profile", "type":"object", "additionalProperties": false }
https://example.com/schema/profile
{ "$id": "https://example.com/schema/account", "title": "Account", "type":"object", "additionalProperties": false }
Where do https://example.com/schema/account and https://example.com/schema/profile point to?
$id is a complicated mess. The schema identifier is determined by resolving the $id against the retrieval URI. Since the $id is absolute, the resolved URI is the $id and the retrieval URI has no effect. You can reference the schema using either that identifier or the retrieval URI. If there's a conflict between a schema identifier and a retrieval URI, the schema identifier wins and the retrieval URI gets shadowed. So, in your example, the https://example.com/schema/account schema will point to the Account schema and the https://example.com/schema/profile schema will point to the Profile schema.
If you had only loaded the first schema, both https://example.com/schema/account and https://example.com/schema/profile would point to the Profile schema, but no matter which URI was used, https://example.com/schema/profile would be the base URI for resolving any references in the Profile schema.
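The resolution rule described above can be sketched with the standard URL class: the identifier is the $id resolved against the retrieval URI, so an absolute $id ignores the retrieval URI entirely. The helper name schemaIdentifier is hypothetical, and the sketch does not model the shadowing behaviour between identifiers and retrieval URIs.

```javascript
// Model of how a schema identifier is derived; `schemaIdentifier` is a
// hypothetical helper, not part of any library.
function schemaIdentifier(schema, retrievalUri) {
  // No `$id`: the retrieval URI is the identifier.
  if (schema.$id === undefined) return retrievalUri;
  // Otherwise resolve `$id` against the retrieval URI (standard URI resolution).
  return new URL(schema.$id, retrievalUri).href;
}

const profile = { $id: "https://example.com/schema/profile", title: "Profile" };

// Absolute `$id`: the retrieval URI has no effect on the result.
const absolute = schemaIdentifier(profile, "https://example.com/schema/account");

// Relative `$id`: resolved against the retrieval URI.
const relative = schemaIdentifier({ $id: "profile" }, "https://example.com/schema/account");

// No `$id`: the retrieval URI is used as-is.
const none = schemaIdentifier({ title: "Account" }, "https://example.com/schema/account");

console.log(absolute, relative, none);
```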
All that contributes to why I discourage using $id even though the spec encourages it. Just using retrieval URIs is much simpler, more natural, and easier to maintain. However, the retrievalUri argument is just simulating something like what would happen with a web request (http(s)://) or file system access (file://). This library allows you to not just simulate retrieval URIs, but to actually use them. In your case, I'd suggest using your schemas as files. They're files anyway and translating to use some arbitrary identifier is more work and error-prone. Your application code would work the same way it does now except you would pass a URI like file:///path/to/schema/account.schema.json to your createValidator function and you would no longer need an init function at all to load the schemas. The schemas are loaded directly from the filesystem. I'm working on an enhancement to allow you to use paths relative to the calling file so you don't have to write out the full file: URI, but for now it's usually pretty easy to generate those paths.
Actually, I just realized that that code is in a directory called "frontend", so this is probably running in a browser. The same concept applies, but instead of using file: URIs, you can use http(s): URIs (https://localhost:3000/schemas/account). Of course that requires you host those schemas at some URI in your website. Those schemas would otherwise be bundled in your JavaScript, so you're not exposing anything you wouldn't otherwise be exposing.
If, for some reason, you're still not comfortable with that solution, I'd still recommend using retrievalUris in addSchema to identify schemas rather than $id. It's simpler and it's all you need.
I forgot to mention, if you serve your schemas, you need to serve them with Content-Type: application/schema+json; schema="https://json-schema.org/draft/2020-12/schema". The schema parameter sets the default dialect so you don't need to use $schema.
If you use this approach on the filesystem, there's no alternative to using $schema in every schema.
If you had only loaded the first schema, both https://example.com/schema/account and https://example.com/schema/profile would point to the Profile schema, but no matter which URI was used, https://example.com/schema/profile would be the base URI for resolving any references in the Profile schema.
This "sometimes kinda $id but not quite" behaviour doesn't sound too swell, as it introduces an order-dependent resolution result.
All that contributes to why I discourage using $id even though the spec encourages it.
How are you supposed to reference other schemas within schemas without "$id" value set?
Just using retrieval URIs is much simpler, more natural, and easier to maintain.
It isn't actually, as it assumes schemas can be downloaded with fetch/read from file system at runtime. And my situation is neither.
They're files anyway and translating to use some arbitrary identifier is more work and error-prone.
I consider this a boon, as it stops various implementations from "helpfully" assuming things and makes them crash at schema resolution time instead of some time after a fetch/fs call with a cryptic error message.
The same concept applies, but instead of using file: URIs, you can use http(s): URIs (https://localhost:3000/schemas/account). Of course that requires you host those schemas at some URI in your website.
Introducing an unknown number of waterfalling HTTP calls just to compile a validation function is a pretty bad idea for the same reason running browser ESM without bundling is bad. Considering that one of the videos on the JSON Schema YouTube channel said they had ~100 levels of nesting (although I only had ~5 in my private hello world repo), no static server will tolerate an additional 5-100 fetch calls on each page transition. It will either error out and break the whole chain anyway or shape traffic to the point that it results in a janky UX.
so this is probably running in a browser
It does but it's not relevant to the subject at hand. I just import the schema files as js modules which then get inlined into the bundle at build time, so for the purpose of the code I feed the schemas as js objects to the addSchema() functions. No intention of runtime or even build time fetching down the line.
How are you supposed to reference other schemas within schemas without "$id" value set?
You reference them by their retrieval URI. Think of an HTML document in a browser. The URI you use to retrieve the HTML is the base URI for the document. Any relative-reference URIs are resolved against that base URI and retrieved (usually with HTTP, but other URI schemes are usually supported as well). Referencing in a schema works exactly the same way, except you can manually determine how a URI resolves to a schema using the retrievalUri argument of the addSchema function, skipping normal URI scheme-based resolution such as making an HTTP request.
Using the retrievalUri argument instead of $id, you're still assigning identifiers to all of your schemas and referencing schemas the same way, you're just assigning that identifier in a different way. I prefer the retrieval URI approach to the $id approach because it's simple, is similar to how all other web technologies work, and the same pattern works in cases where you actually do want to retrieve schemas from the filesystem or the web.
In case it wasn't clear, although it's technically allowed by the spec and this library, there's no good reason to use both the retrievalUri argument and $id for the same schema. You should only use one at a time.
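To make the HTML analogy concrete, here is a minimal sketch of schemas identified only by retrieval URI, with a relative reference resolved against the URI of the schema that contains it. The registry, register, and resolveRef names are hypothetical and stand in for what addSchema and the library's reference resolution do; no network or filesystem access is involved.

```javascript
// Hypothetical in-memory registry keyed by retrieval URI; a sketch only.
const registry = new Map();

function register(schema, retrievalUri) {
  registry.set(retrievalUri, schema);
}

// Resolve a `$ref` the way a browser resolves a relative link:
// against the URI of the document that contains it.
function resolveRef(ref, baseUri) {
  const target = new URL(ref, baseUri).href;
  if (!registry.has(target)) {
    // Fail immediately instead of attempting any retrieval.
    throw new Error(`No schema registered for ${target}`);
  }
  return { uri: target, schema: registry.get(target) };
}

// Neither schema uses `$id`; the retrieval URI is the only identifier.
register(
  { type: "object", properties: { profile: { $ref: "./profile" } } },
  "https://example.com/schemas/account"
);
register(
  { type: "object", title: "Profile" },
  "https://example.com/schemas/profile"
);

const resolved = resolveRef("./profile", "https://example.com/schemas/account");
console.log(resolved.uri);
```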
Introducing an unknown number of waterfalling HTTP calls just to compile a validation function is a pretty bad idea
You're not wrong, but I think using appropriate HTTP cache headers for your schemas addresses most of this concern. Also, as I understand it, HTTP/2/3 multiplexing and connection reuse features make sending many requests not the same kind of performance concern that it used to be. I also find the claim of getting anywhere near 100 levels of nested references dubious and at best an extreme outlier, but if that's a situation where you find yourself, I agree this probably isn't the right approach.
In any case, I admit that my suggestion is more applicable to the filesystem than the web, which was what I thought was the case when I first mentioned it. At the least, you'd have to change the way you've organized things because there are different trade-offs in play.
I see that you're not comfortable with the approach of using normal URI scheme-based resolution. That's totally fine. This library supports multiple approaches and you're free to choose which works best for your situation.
I just released an update to improve the experience when not declaring a dialect. You'll now get an error with the following message,
Unable to determine a dialect for the schema. The dialect can be declared in a number of ways, but the recommended way is to use the '$schema' keyword in your schema.
You reference them by their retrieval URI. Think of an HTML document in a browser. The URI you use to retrieve the HTML is the base URI for the document. Any relative-reference URIs are resolved against that base URI and retrieved (usually with HTTP, but other URI schemes are usually supported as well). Referencing in a schema works exactly the same way, except you can manually determine how a URI resolves to a schema using the retrievalUri argument of the addSchema function, skipping normal URI scheme-based resolution such as making an HTTP request.
What is the source of truth for the retrieval URI? Also I am thinking of JSON schemas as a fancy input validation DSL, not as something related to web documents in a browser. retrieval URI forces the schemas themselves to be aware of retrieval specifics at declaration time, when the "real" URL will be known at best at build time.
Using the retrievalUri argument instead of $id, you're still assigning identifiers to all of your schemas and referencing schemas the same way, you're just assigning that identifier in a different way. I prefer the retrieval URI approach to the $id approach because it's simple, is similar to how all other web technologies work, and the same pattern works in cases where you actually do want to retrieve schemas from the filesystem or the web.
Clearly it's not "simple" because it is a source of confusion in this very issue right now. Also I wouldn't call anything related to HTTP/file systems simple. Especially file systems, since Windows and Linux don't agree even on basic things like case sensitivity and path separators, so it's easy to end up in a situation where the retrievalUri can resolve differently depending on host OS. It's much more simple to assume "$id" is just a JSON string which can be checked for strict equality in all languages, instead of relying on built-in url parser/http/fs capabilities. Also fragment being important for the schema resolution while being completely ignored on the web doesn't help with "it's just an URL, bro" argument.
You're not wrong, but I think using appropriate HTTP cache headers for your schemas addresses most of this concern. Also, as I understand it, HTTP/2/3 multiplexing and connection reuse features make sending many requests not the same kind of performance concern that it used to be. I also find the claim of getting anywhere near 100 levels of nested references dubious and at best an extreme outlier, but if that's a situation where you find yourself, I agree this probably isn't the right approach.
The majority of the internet runs on HTTP/1.1, and HTTP/2 has its own security issues (such as completely nullifying network jitter capabilities of TCP). Regardless of protocol, it's still megabytes of data which doesn't even need to be sent because it's known at build time.
I also find the claim of getting anywhere near 100 levels of nested references dubious and at best an extreme outlier, but if that's a situation where you find yourself, I agree this probably isn't the right approach.
That's not my claim, it's either in one of those interview videos on youtube channel or somewhere in json schema related discussions. I am merely at 5 levels, which would still upset nginx at default config.
I am thinking of JSON schemas as a fancy input validation DSL, not as something related to web documents in a browser.
I think of JSON Schema as a validation DSL as well, but it's also built on web technologies. It can be both.
Clearly it's not "simple" because it is a source of confusion in this very issue right now.
There were two separate questions on StackOverflow just in the last few days of people confused about why their file-relative references aren't working. I see this kind of question in some channel or another about once every two weeks on average. There are definitely a lot of people that expect this behavior and find the self identification concept foreign.
it's easy to end up in a situation where the retrievalUri can resolve differently depending on host OS
This isn't a problem because JSON Schema works with URIs, which are universal. This library translates the URI into a file system path appropriate to the environment it's running in.
It's much more simple to assume "$id" is just a JSON string which can be checked for strict equality in all languages
I see that self-identification makes more sense to you. That's fine. Feel free to use what's most comfortable for you.
fragment being important for the schema resolution while being completely ignored on the web doesn't help with "it's just an URL, bro" argument.
JSON Schema uses fragments exactly the same way they are used on the web. In HTML, you can set an "id" on an element and when used in a URI fragment, the browser moves to that position in the document. JSON Schema works the same way. The application/schema+json media type defines a behavior for the URI fragment including JSON Pointer support. The JSON Schema validator retrieves the whole schema document, but sets the view (so to speak) of the document (schema) to the sub-schema the fragment points to.
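The "retrieve the whole document, then move the view" behaviour can be sketched as follows: the URI is split into a document URI and a fragment, and the fragment is followed as a JSON Pointer (RFC 6901). The helper names followPointer and lookup and the documents map are hypothetical, and the sketch ignores plain-name (anchor) fragments.

```javascript
// Sketch: split a schema URI into document URI and fragment, then follow
// the fragment as a JSON Pointer (RFC 6901). Helper names are hypothetical.
function followPointer(document, pointer) {
  if (pointer === "") return document; // empty pointer = whole document
  return pointer
    .split("/")
    .slice(1) // drop the empty segment before the leading "/"
    .map((segment) => segment.replace(/~1/g, "/").replace(/~0/g, "~"))
    .reduce((node, key) => node[key], document);
}

function lookup(uri, documents) {
  const url = new URL(uri);
  const pointer = decodeURIComponent(url.hash.slice(1)); // strip the "#"
  url.hash = "";
  // The whole document is looked up; the fragment only selects a view into it.
  return followPointer(documents.get(url.href), pointer);
}

const documents = new Map([
  ["https://example.com/schemas/account", {
    type: "object",
    $defs: { name: { type: "string", minLength: 1 } }
  }]
]);

const subschema = lookup("https://example.com/schemas/account#/$defs/name", documents);
const whole = lookup("https://example.com/schemas/account", documents);
```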
I suggest we close this issue at this point. The original issue of the confusing error message has been addressed and we're off on a tangent. I'm sorry you didn't find my suggestion helpful. Please do continue to use self-identification if that's what you're most comfortable with. That functionality is present and fully functional. I will, however, continue to provide URI scheme-based retrieval for those who find that more useful. If you want to be 100% sure you don't accidentally bump into scheme-based retrieval, I suggest using a URI scheme other than file: or http(s): such as urn:, tag:, or even something custom. Those of us working on the JSON Schema spec have been discussing the idea of promoting the use of schema: URIs rather than URIs that imply a location that might not exist.
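One way to enforce that last suggestion in application code is a small guard that rejects any schema identifier whose scheme implies a location. This is only a sketch; assertNotRetrievable and the example identifiers are hypothetical, and the set of schemes to reject is an assumption based on the ones named above.

```javascript
// Hypothetical guard: reject identifiers whose scheme implies a location.
const LOCATION_SCHEMES = new Set(["file:", "http:", "https:"]);

function assertNotRetrievable(uri) {
  // `new URL` throws on strings that are not absolute URIs.
  const { protocol } = new URL(uri);
  if (LOCATION_SCHEMES.has(protocol)) {
    throw new Error(`${uri} could trigger scheme-based retrieval`);
  }
  return uri;
}

// Identifiers that cannot be mistaken for something to fetch or read.
const accountId = assertNotRetrievable("urn:example:schema:account");
const profileId = assertNotRetrievable("tag:example.com,2023:schema/profile");
```

Running every identifier through such a check before registering a schema turns an accidental http(s): or file: identifier into an immediate error.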