opis/json-schema

URI format check does not validate non-english alphabets

Closed this issue · 4 comments

We are using version 1.0.19 for backend JSON schema validation.

A simplified example schema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "sample.json",
  "title": "A url",
  "type": "string",
  "format": "uri"
}

and some strings to check against the schema above:

  • http://example.org/German_Alphabet_äüöß
  • http://example.org/dance/Rock´n´Roll

The problem is in URI.php, line 36:

const PATH_REGEX = '/^(?:(%[0-9a-f]{2})|[a-z0-9\/:@\-._~\!\$&\'\(\)*+,;=])*$/i';

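The regex pattern above only accepts ASCII letters, digits, percent-encoded octets, and a fixed set of punctuation. It does not match letters outside a-z/A-Z such as ä, ü, ö, ß, or the character ´.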
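To make the behaviour easier to reproduce outside PHP, here is a minimal sketch that ports the pattern to Python. The names `path_is_valid`, `samples`, and `results` are made up for illustration, and `re.ASCII` is an assumption meant to approximate PHP's byte-wise matching (the pattern has no `/u` modifier); this is not the library's own code.

```python
import re

# Python port of PATH_REGEX from URI.php (unneeded escapes dropped).
# re.ASCII keeps the case-insensitive flag limited to ASCII letters,
# which is an assumption made to approximate PHP's byte-wise matching.
PATH_REGEX = re.compile(
    r"^(?:(%[0-9a-f]{2})|[a-z0-9/:@\-._~!$&'()*+,;=])*$",
    re.IGNORECASE | re.ASCII,
)


def path_is_valid(path: str) -> bool:
    """Return True if the path consists only of characters the pattern allows."""
    return PATH_REGEX.match(path) is not None


samples = [
    "/German_Alphabet",                            # plain ASCII
    "/German_Alphabet_äüöß",                       # raw non-ASCII letters
    "/dance/Rock´n´Roll",                          # raw acute accent (U+00B4)
    "/German_Alphabet_%C3%A4%C3%BC%C3%B6%C3%9F",   # percent-encoded UTF-8
]

# Map each sample path to the verdict of the ported pattern.
results = {path: path_is_valid(path) for path in samples}
```

In this sketch only the plain ASCII path and the percent-encoded path are accepted; both raw non-ASCII paths are rejected.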

That's an IRI, not a URI.

RFC 3986 explains which characters are allowed in a path:
https://tools.ietf.org/html/rfc3986#section-3.3
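If the schema has to keep `"format": "uri"`, one option is to percent-encode the non-ASCII characters before validation, along the lines of the IRI-to-URI mapping in RFC 3987 section 3.1. The following is a rough sketch using only the Python standard library; `iri_to_uri` is a hypothetical helper that handles only the path component (a non-ASCII host would need IDNA handling, and query and fragment are passed through unchanged).

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in the path: the RFC 3986 path characters
# plus "%" so that existing percent-escapes are not encoded twice.
PATH_SAFE = "/:@-._~!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII path characters as UTF-8 octets (path only)."""
    parts = urlsplit(iri)
    encoded_path = quote(parts.path, safe=PATH_SAFE)
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


german = iri_to_uri("http://example.org/German_Alphabet_äüöß")
# german == "http://example.org/German_Alphabet_%C3%A4%C3%BC%C3%B6%C3%9F"

dance = iri_to_uri("http://example.org/dance/Rock´n´Roll")
# dance == "http://example.org/dance/Rock%C2%B4n%C2%B4Roll"
```

The encoded forms contain only ASCII characters and percent-escapes, which is what the path pattern quoted earlier accepts.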

Hi @sorinsarca,

Thank you very much for your explanation. I changed the format from URI to IRI in my JSON schema, and all German URLs are now validated as expected. However, the following URL is still rejected as an invalid IRI:

https://www.oncampus.de/weiterbildung/moocs/Rock´n´Roll

Doesn't the IRI format actually allow the character ´?
Common browsers can resolve URLs containing that character, so there must be a defined format that includes it as well.

Thanks for your help in advance!
Beni

I'll have to look into that.
Meanwhile, you can also check the IRI spec to see whether that character is allowed: https://tools.ietf.org/html/rfc3987

It seems so at first glance:

2.1 Summary of IRI Syntax: IRIs are defined similarly to URIs in [RFC3986], but the class of
unreserved characters is extended by adding the characters of the UCS
(Universal Character Set, [ISO10646]) beyond U+007F, subject to the
limitations given in the syntax rules below and in section 6.1.

´ (U+00B4) belongs to the "C1 Controls and Latin-1 Supplement" block of ISO 10646, which also contains the German-specific letters.
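The claim can be checked against the `ucschar` production in the RFC 3987 grammar. The sketch below hard-codes the code point ranges from that production; `UCSCHAR_RANGES` and `is_ucschar` are names made up for this example. It tests the grammar only, not the recommendations in section 6.1, and it says nothing about how opis/json-schema implements its IRI format.

```python
import unicodedata

# Code point ranges of the "ucschar" production in RFC 3987, section 2.2.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 to 13: %x10000-1FFFD through %xD0000-DFFFD.
UCSCHAR_RANGES += [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
# Plane 14 starts later: %xE1000-EFFFD.
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(ch: str) -> bool:
    """Return True if the single character falls in one of the ucschar ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in UCSCHAR_RANGES)


acute = "\u00b4"
acute_name = unicodedata.name(acute)   # Unicode name of the character
acute_allowed = is_ucschar(acute)      # U+00B4 lies within %xA0-D7FF
german_allowed = all(is_ucschar(c) for c in "äüöß")
```

Under this reading of the grammar, U+00B4 falls in the first range (`%xA0-D7FF`), just like ä, ü, ö, and ß.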

6.1 Limitations on UCS Characters Allowed in IRIs: The UCS contains many areas of characters for which there are
strong visual look-alikes. Because of the likelihood of
transcription errors, these also should be avoided. This
includes the full-width equivalents of Latin characters,
half-width Katakana characters for Japanese, and many others. It
also includes many look-alikes of "space", "delims", and
"unwise", characters excluded in [RFC3491]