URI format check does not validate non-English alphabets

Question

URI format check does not validate non-English alphabets

Closed this issue 4 years ago · 4 comments

We are using version 1.0.19 for backend JSON schema validation.

A simplified example schema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "sample.json",
  "title": "A url",
  "type": "string",
  "format": "uri"
}

and some strings to check against the schema above:

http://example.org/German_Alphabet_äüöß
http://example.org/dance/Rock´n´Roll

The problem is in URI.php, line 36:

const PATH_REGEX = '/^(?:(%[0-9a-f]{2})|[a-z0-9\/:@\-._~\!\$&\'\(\)*+,;=])*$/i';

The regex pattern above does not match letters outside a-z and A-Z (such as ä, ü, ö, ß) or characters such as ´.
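To make the behaviour easy to reproduce, here is a minimal sketch that ports the quoted pattern to Python. The port, the `path_is_valid` helper and the sample list are assumptions for illustration only, not code from the library; `re.ASCII` is added so that case-insensitive matching stays limited to ASCII letters, as in the PHP original.

```python
import re

# Python port of the PATH_REGEX constant quoted above (illustrative only).
PATH_REGEX = re.compile(
    r"^(?:(%[0-9a-f]{2})|[a-z0-9/:@\-._~!$&'()*+,;=])*$",
    re.IGNORECASE | re.ASCII,
)


def path_is_valid(path: str) -> bool:
    """Return True if every character is allowed or part of a %XX escape."""
    return PATH_REGEX.match(path) is not None


samples = [
    "/German_Alphabet_äüöß",                       # raw non-ASCII letters
    "/dance/Rock´n´Roll",                          # raw acute accent (U+00B4)
    "/German_Alphabet_%C3%A4%C3%BC%C3%B6%C3%9F",   # percent-encoded UTF-8
]

for path in samples:
    print(path, "->", path_is_valid(path))
```

Only the percent-encoded path should be reported as valid, which matches what we observe with the two strings above.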

Answer 1 · 2021-04-14T19:09:12.000Z

That's an IRI, not a URI.

RFC 3986 explains which characters are allowed in a path:
https://tools.ietf.org/html/rfc3986#section-3.3
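If the strings have to pass a strict `uri` check, one option is to percent-encode the non-ASCII characters as UTF-8 before validating. The sketch below uses Python's standard `urllib.parse`; the `iri_to_uri` helper is a hypothetical name, and it assumes the host part is already ASCII (internationalized host names would need separate IDNA handling).

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters RFC 3986 already allows in a path, plus '%' so that existing
# escapes are not encoded a second time.
SAFE = "/:@-._~!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters in path, query and fragment."""
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,  # assumption: the host is plain ASCII
        quote(parts.path, safe=SAFE),
        quote(parts.query, safe=SAFE + "?"),
        quote(parts.fragment, safe=SAFE + "?"),
    ))


print(iri_to_uri("http://example.org/German_Alphabet_äüöß"))
print(iri_to_uri("http://example.org/dance/Rock´n´Roll"))
```

The resulting strings contain only ASCII characters and `%XX` escapes, so they fall within the path characters that RFC 3986 allows.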

Answer 2 · 2021-04-20T15:25:18.000Z

Hi @sorinsarca,

Thank you very much for your explanation. I changed the format from URI to IRI in my JSON schema. All German URLs are now validated as expected. However, the following URL is still considered invalid as an IRI:

https://www.oncampus.de/weiterbildung/moocs/Rock´n´Roll

Doesn't the IRI format actually include the character ´?
Common browsers can resolve that character, so there must be a defined format that includes it as well.

Thanks for your help in advance!
Beni

Answer 3 · 2021-04-20T15:28:32.000Z

I'll have to check into that.
Meanwhile, you can also check the IRI spec to see whether that character is allowed: https://tools.ietf.org/html/rfc3987

Answer 4 · 2021-04-20T17:04:11.000Z

It seems so at first glance ...

2.1 Summary of IRI Syntax: IRIs are defined similarly to URIs in [RFC3986], but the class of
unreserved characters is extended by adding the characters of the UCS
(Universal Character Set, [ISO10646]) beyond U+007F, subject to the
limitations given in the syntax rules below and in section 6.1.

´ (U+00B4) belongs to the "C1 Controls and Latin-1 Supplement" block of the ISO 10646 standard, the same block that contains the German-specific letters.
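As a quick cross-check, the following sketch tests code points against the `ucschar` ranges listed in the ABNF of RFC 3987 (section 2.2). The `is_ucschar` helper is a hypothetical name, and the sketch only looks at the syntax rule, not at the additional limitations of section 6.1.

```python
import unicodedata

# ucschar ranges from the RFC 3987 ABNF: three BMP ranges, then
# xxxx0000-xxxxFFFD for planes 1 to 13, then E1000-EFFFD.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(ch: str) -> bool:
    """Return True if the single character falls in an RFC 3987 ucschar range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


for ch in "\u00e4\u00fc\u00f6\u00df\u00b4":  # ä ü ö ß ´
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: ucschar={is_ucschar(ch)}")
```

By this check, ´ sits in the same first range (U+00A0 to U+D7FF) as ä, ü, ö and ß.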

6.1 Limitations on UCS Characters Allowed in IRIs: The UCS contains many areas of characters for which there are
strong visual look-alikes. Because of the likelihood of
transcription errors, these also should be avoided. This
includes the full-width equivalents of Latin characters,
half-width Katakana characters for Japanese, and many others. It
also includes many look-alikes of "space", "delims", and
"unwise", characters excluded in [RFC3491]
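One way to explore the look-alike characters mentioned in 6.1 is Unicode compatibility normalization (NFKC). Using NFKC as a look-alike detector is my own assumption, not something the quoted text prescribes, and the `compatibility_form` helper is a hypothetical name. The result does not by itself say whether a validator should reject a character.

```python
import unicodedata


def compatibility_form(ch: str) -> str:
    """Return the NFKC form of a character (its compatibility equivalent)."""
    return unicodedata.normalize("NFKC", ch)


samples = [
    "\uff21",  # FULLWIDTH LATIN CAPITAL LETTER A
    "\uff76",  # HALFWIDTH KATAKANA LETTER KA
    "\u00b4",  # ACUTE ACCENT
    "\u00e4",  # LATIN SMALL LETTER A WITH DIAERESIS
]

for ch in samples:
    folded = compatibility_form(ch)
    code_points = " ".join(f"U+{ord(c):04X}" for c in folded)
    changed = folded != ch
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {code_points} changed={changed}")
```

In this sketch the full-width and half-width samples fold to their ordinary counterparts, ´ folds to a space followed by a combining acute accent (U+0020 U+0301), and ä is left unchanged.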