Question: is the grammar 'round-trippable'?
jayaddison opened this issue · 12 comments
(I'm fairly naive to specifications and grammars, so please don't allow my questions / comments to slow down or distract from progress - it's been great to read this code and spec)
From reading a bit about formal grammars recently I learned about the 'round-trippable' property of some grammar specifications. Basically the idea is that you can parse an input, and then turn the parse tree back into the original input, unambiguously.
Although it might not be the first concern that springs to mind for URL grammars, I think that if it is possible to use a round-trippable grammar, it could eliminate an entire class of future information/communication-loss issues (and potentially even a few security issues as a result) that could spring up in actual implementations.
(I'm fairly naive to specifications and grammars, so please don't allow my questions / comments to slow down or distract from progress - it's been great to read this code and spec)
Thank you! I am quite happy with comments and remarks, it really helps to make things better.
It is also very nice to find that some people take an interest!
I think you refer to a property of the parser rather than the grammar. Specifically:
print (parse (url-string)) == url-string
Am I right?
The short answer is no, this property does not hold.
The parse function is lossy: the use of `/` vs. `\` in web- and file-URLs is not preserved in the model. It also does not store the number of slashes preceding a windows drive-letter.
Thus the following URL-strings have the same parse tree (which I currently call the 'URL model' in the specification).
`http:\\host\dir\file` and `http://host/dir/file`
`file:C:/path/to/file` and `file:/C:/path/to/file` and `file://C:/path/to/file`
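To make the lossiness concrete, here is a toy Python sketch (the `toy_parse` name and the simplified `(scheme, rest)` model are mine, not the spec's): like the parser described above, it normalises `\` to `/` and collapses the slashes before a Windows drive letter, so the URL-strings above all map to the same model.

```python
# Toy illustration, NOT the spec's parser: a simplified "parse" that is
# lossy in the two ways described above.
import re

def toy_parse(url):
    scheme, _, rest = url.partition(":")
    rest = rest.replace("\\", "/")             # '/' vs '\' is not preserved
    if scheme == "file":
        # the number of slashes before a drive letter like C: is not stored
        rest = re.sub(r"^/*([A-Za-z]:)", r"\1", rest)
    return (scheme, rest)

assert toy_parse("file:C:/path/to/file") \
    == toy_parse("file:/C:/path/to/file") \
    == toy_parse("file://C:/path/to/file")
assert toy_parse(r"http:\\host\dir\file") == toy_parse("http://host/dir/file")
```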
The print function is also lossy. It does an additional percent-encoding of certain code points.
However the parser does preserve the raw character data in e.g. `dir` and `file` tokens in the model.
The property does hold for strict (i.e. valid) URLs I believe.
And if I'm not mistaken, the following also holds:
print (parse (print (parse (url-string)))) == print (parse (url-string))
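A quick way to see why this weaker property can hold even though both functions are lossy is a toy sketch (the names and the percent-encoding rule here are my own simplification, not the spec's): parse normalises slashes, print percent-encodes spaces but leaves `%` alone, so a second parse/print pass changes nothing.

```python
# Toy sketch of  print(parse(print(parse(s)))) == print(parse(s)).
from urllib.parse import quote

def toy_parse(url):
    scheme, _, rest = url.partition(":")
    return (scheme, rest.replace("\\", "/"))   # lossy: '\' becomes '/'

def toy_print(model):
    scheme, rest = model
    # Encode e.g. spaces, but keep '%' safe so re-printing changes nothing.
    return scheme + ":" + quote(rest, safe="%/:")

for s in [r"http:\\host\a b", "http://host/x%20y"]:
    once = toy_print(toy_parse(s))
    twice = toy_print(toy_parse(once))
    assert once == twice            # print ∘ parse is idempotent
```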
Do you think it would be a good idea to add some notes about this?
A somewhat longer answer:
This is a nice question and actually the specification is not quite exact in this area, as it leaves the construction of the parse tree implicit.
Basically, I leave it to the reader to infer the parse function from the grammar, and figure out what needs to be preserved and/or dropped by looking at the model.
I'd be quite happy to hear ideas about how to make things more explicit and clear.
The common approach would be to specify, for each production rule, its 'semantics', which is just a function.
Then the grammar specifies the concrete syntax tree (CST) via the theory of CFGs, and the functions for each of the production rules together define an 'algebraic function' that maps the CST to what I now call the 'URL model' – which is basically just an abstract syntax tree for URLs.
I think I need such a thing anyway to specify the interpretation of IP addresses. But I haven't figured out how to concretely write that down yet.
<snip> ... if it is possible to use a round-trippable grammar ... <snip>
I think you refer to a property of the parser rather than the grammar. Specifically:
print (parse (url-string)) == url-string
Am I right?
Yep, I'd agree the ability to reconstruct the original input reliably would be a property of a parser.
My original issue description was a bit semantically muddled, and described reconstructing the original input unambiguously -- perhaps what I should have said is something more along the lines of (borrowing and extending your notation): `parse(render(parse(url-string))) == parse(url-string)  # for all valid url-string`.
The short answer is no, this property does not hold.
Thanks; that makes sense.
The property does hold for strict (i.e. valid) URLs I believe.
After URL normalization, it seems like that should be true - there might be some gaps prior to resolution though? Multiple dot-segment references, such as `x/././y`, for example, and the per-scheme format rules would be another area I'd look for ambiguity.
Do you think it would be a good idea to add some notes about this?
Potentially - although making sure those notes would be relevant and valuable to readers both seem important. Did you have anything in mind?
I'd be quite happy to hear ideas about how to make things more explicit and clear.
I'll try, although it's near the limits of my understanding and my sense is that other people have a better knowledge of parsing, grammars and web standards. Further discussion (if it seems like a useful topic to address) is probably a good start.
parse(render(parse(url-string))) == parse(url-string) # for all valid url-string.
I see. This does hold, also for invalid url-strings.
It does require the same parser mode to be used in the case of relative URLs.
The WHATWG Standard actually has this as an explicitly stated goal, however, their parse function also normalises. So that would become:
normalise . parse . print . normalise . parse (url-string) == normalise . parse (url-string)
I expose more of the structure of a URL than they do, so it's a stronger claim for me to make, but not doing so leads to issues for them as well (whatwg/url#574 is an example), so I hope the corner cases can be resolved; then I can make this claim as well.
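For illustration, the WHATWG-style equation above can be sketched with toy stand-ins (my own simplified functions; real normalisation does much more than lower-casing a scheme and fixing slashes):

```python
# Toy sketch of  normalise . parse . print . normalise . parse == normalise . parse
def parse(s):
    scheme, _, rest = s.partition(":")
    return (scheme, rest)

def normalise(model):
    scheme, rest = model
    return (scheme.lower(), rest.replace("\\", "/"))

def pr(model):                      # 'print' is a Python builtin, hence 'pr'
    return model[0] + ":" + model[1]

def np(s):                          # normalise . parse
    return normalise(parse(s))

s = r"HTTP:\\host\file"
assert np(pr(np(s))) == np(s)
```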
I suppose it would be good to mention such properties in my specification. Maybe I can add tests for it to the reference implementation too. There's a bunch of other gaps still that I need to fill in though.
I'll try, although it's near the limits of my understanding and my sense is that other people have a better knowledge of parsing, grammars and web standards. Further discussion (if it seems like a useful topic to address) is probably a good start.
You might be better than you think! Thinking that you don't know something can be a sign that you already have an intuitive understanding of the problems that need to be solved. That is a big thing.
I created issue #4 for any potential discussions about specifying the parse tree and parse function more rigorously.
I'll leave this thread for any discussion about formally mentioning properties such as the ones above.
The WHATWG Standard actually has this as an explicitly stated goal, however, their parse function also normalises. So that would become:
normalise . parse . print . normalise . parse (url-string) == normalise . parse (url-string)
That makes sense, thanks -- essentially, apply a one-way process to reach the 'normal form', and from that point, we can parse/print (similar to deserialize/serialize) that form reliably.
I expose more of the structure of a URL than they do, so it's a stronger claim for me to make, but not doing so leads to issues for them as well (whatwg/url#574 is an example), so I hope the corner cases can be resolved; then I can make this claim as well.
Noted, I'll read up on that issue.
You might be better than you think! Thinking that you don't know something can be a sign that you already have an intuitive understanding of the problems that need to be solved. That is a big thing.
Thanks :) I'm certainly glad to understand and learn some of the patterns used to formalize specifications, and to some extent also keen to raise awareness of where I think there may be gaps or opportunities.
I created issue #4 for any potential discussions about specifying the parse tree and parse function more rigorously.
I'll leave this thread for any discussion about formally mentioning properties such as the ones above.
For housekeeping purposes I'm considering closing a few issues I've opened and think are 'resolved' (you've covered the questions I had here, and more). Fortunately I don't think that prevents any further discussion should anyone on GitHub want to continue the conversation here.
I don't mind leaving issues open as a discussion thread either. If there would be a lot more activity then I'd have a better look at process, but I'm not too concerned about that now.
Now that I'm thinking about it, I might have a look at the discussions feature of GitHub. I think it is nice if there is a spot for questions rather than issues.
I think this is a good summary of what was talked about here (in case anyone finds it useful), considering `parse` to also normalize:
1. The following holds: `parse (print url) = url` for every normalized URL `url`.
2. The following holds: `print (parse str) ≠ str` for some string `str` (if `parse str` succeeds).
There are some corollaries, most of them trivial but not necessarily obvious, and still interesting:
3. From (1): `parse ∘ print ≗ id`.
4. Of course: `parse (print (parse (print url))) = parse (print url)` for every URL `url`.
5. But also: `parse (print (parse (print url))) = url` for every normalized URL `url`.
6. And likewise: `parse (print (parse str)) = parse str` for every string `str` (if `parse str` succeeds).
7. `parse (print url)` always succeeds.
In English:
1. Parsing a printed URL always round-trips.
2. Printing a parsed URL string does not, however.
3. Same as (1).
4. Parsing a printed URL round-trips the second time.
5. Parsing a printed URL twice also round-trips if the URL was normalized.
6. Same as (1), but for a parsed URL.
7. Printing a URL will always result in a parsable URL string.
Also note that, interestingly:
- Even if `parse` does not normalize, (4), (5), (6) and (7) are still valid (though not “corollaries” per se then).
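The properties above can also be checked mechanically. Here is a property-test sketch over toy stand-ins for `parse` and `print` (a hypothetical simplified model where a URL is a `(scheme, rest)` pair and parse normalises as it goes; none of this is the spec's actual implementation):

```python
# Toy normalising parse/print, used to spot-check properties (1), (2), (6), (7).
from urllib.parse import quote

def parse(s):
    scheme, sep, rest = s.partition(":")
    assert sep, "no scheme"
    # Normalising parse: '\' -> '/', percent-encode (keeping '%' safe).
    return (scheme, quote(rest.replace("\\", "/"), safe="%/:"))

def pr(model):                             # 'print' is a builtin, hence 'pr'
    scheme, rest = model
    return scheme + ":" + rest

url = parse("http://host/a b")             # a normalized URL (any parse output)
assert parse(pr(url)) == url               # (1): parsing a printed URL round-trips
s = r"http:\\host\a b"
assert pr(parse(s)) != s                   # (2): printing a parsed string need not
assert parse(pr(parse(s))) == parse(s)     # (6): same as (1), for a parsed string
parse(pr(url))                             # (7): printing always yields a parsable string
```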
Noting, of course, that this is all from what I was able to gather and understand. If there is anything wrong with any of this, feel free to let me know and I can update this post!
Very cool!
@alwinb: I am unsure whether I can edit the wiki. Maybe I am just missing something, but you might have to tick off the “Restrict editing to collaborators only” in the repository settings.
Thanks, can you edit it now?
Yeah, thanks! 👍 🎉 I made some miscellaneous formatting changes to test it out, and also added a link to the “Discussions” tab. Hopefully my changes seem fine enough!
I should probably include these in the specification page itself. And re-check them with each change.