Fix the parsing so that all test cases pass
Closed this issue ยท 28 comments
The current "check failures" cases are all valid URIs.
Ooops! It seems that I've incorrectly translated test suite - I thought that these test cases should fail by design so I've thoughtlessly copied them ;-) Sorry for this mess!
I will try to fix them this weekend!
Oh, no problem - they were failing, and I think they were misleading before your work too!
The parser has some mistake(s) in it, or maybe is incomplete even. I kept the cases around as I knew they'd need fixing some day.
To clarify: they never have passed, but should do if the parser matches the spec ๐
Yeah - it's clear now ;-) I'm not sure if I'm able to fix them, but I'm going try in my spare time.
Hi @garyb,
I want to fix one of the failing test cases - I mean parsing of /top_story.htm
scenario. I suspect that it should parse to:
(AbsoluteURI
Nothing
(HierarchicalPart
Nothing
(Just (Right (rootDir </> file "top_story.htm"))))
Nothing)
I've found that this could be fixed if we move //
prefix parsing/printing to the Authority
module. It seems that RFC references it this way.
One additional change which follows this modification is parsing of this test case:
"sql2:///?q=foo&var.bar=baz"
(Left
(URI
(Just (Scheme "sql2"))
(HierarchicalPart
(Just (Authority Nothing [Tuple (NameAddress "") Nothing]))
(Just (Left rootDir)))
(Just (Query (Tuple "q" (Just "foo") : Tuple "var.bar" (Just "baz") : Nil)))
Nothing))
In this situation Authority
contains empty host info which is quite consistent with for example uriparser
results:
$ uriparse "sql2:///?q=foo&var.bar=baz"
uri: sql2:///?q=foo&var.bar=baz
scheme: sql2
hostText:
.. pathSeg:
query: q=foo&var.bar=baz
absolutePath: false
$ uriparse "sql2:/?q=foo&var.bar=baz"
uri: sql2:/?q=foo&var.bar=baz
scheme: sql2
query: q=foo&var.bar=baz
absolutePath: true
Also cases like localhost:8000
has to be changed to //localhost:8000
which is again consistent with uriparser
results.
After this change localhost:8000
alone is parsed as schema + path and seems to be close to mailto:*
and tel:*
cases, but their are still failing due to path validation problems...
What do you think? Should I provide pull request?
Unfortunately my bramch (but also upstream version) does not handle this additional, example test case: sql2:/?q=foo&var.bar=baz
, but handles sql2:/test?q=foo&var.bar=baz
well. It seems that it is also related to path parsing when authority is missing. I not fully understand path parsing code so I need a bit more time to try to fix this.
I've added appropriate tests for these scenarios on my branch too.
Sorry, it seems that I've made a mistake - /top_story.html
is not an absolute uri - it should be probably parsed to relative-ref:
testIsoURIRef
"/top_story.htm"
(Right
(RelativeRef
(RelativePart
Nothing
(Just (Right (rootDir </> file "top_story.htm"))))
Nothing
Nothing))
I'm going to fix this on my branch ;-)
Yeah, sorry, if I'd seen this sooner I would have said I think relative-ref is the one :)
I've changed the above snippet to use rootDir
but it does not type check as RelativeRef
allows only URIPathRel
. I'm not sure if this is consistent with the spec. Sorry if this is stupid question, but could you explain why RelativeRef
allows only relative paths?
Or wait, I will go through the spec more rigorously and then ask questions ;-)
I think that example in spec suggests that /g
is relative reference which resolves to absolute path (it transforms base URI by replacing whole path
).
What do you think should be a representation of /top_story.html
?
I think maybe what you were saying earlier is right and RelativeRef
should probably allow both rel and abs paths.
I'm open to suggestions though! I definitely don't have the answers here, otherwise I'd have made an attempt to fix it long ago ๐ - that's part of why it's still in the state it is, I knew some thought would be needed to figure out what the expected representation for the failing cases would be.
Ok, thanks for clarification @garyb.
I can try to modify relative reference representation, so it can handle both cases. Should I just define RelativeRefPath
(and use it in RelativeRef
) insted of URIPathRel
as:
type RelativeRefPath = Either (URIPath Rel Unsandboxed) (URIPath Abs Unsandboxed)
With above representation we would end up with two levels of nested Either
's and I'm not sure if this is nice to work with.
I can try to use custom data type instead:
data RelativeRefPath
= AbsDir (Path Abs Dir Unsandboxed)
| AbsFile (Path Abs File Unsandboxed)
| RelDir (Path Rel Dir Unsandboxed)
| RelFile (Path Rel File Unsandboxed)
What do you think? Do you have any other propositions or suggestions? Should I try to implement this and we will see where it goes?
P.S.
Sorry that I'm bothering you with such details ;-)
We could just reuse URIPath
I think :)
Oh, I see what you mean about the nested Eithers though. Uhm... let's just go with that for now, and see what falls out.
Thinking a little bit further and analyzing other failing test cases I think that Abs
restinction also fails in case of HierarchicalPart
(so in case of AbsoluteURI
and URI
). Let's consider this scenario:
urn:example:animal:ferret:nose
which according to spec should be represented as URI
:
- scheme:
urn
- path:
example:animal:ferret:nose
It seems that this path value is not absolute. The same applies to mailto:*
, tel:*
where the only (in this context) significant restriction for path is that it should not begin with //
(so even: mailto:../relative
is correct uri with path value ../relative
). I think that we should probably extend HierarchicalPart
representation too. What do you think @garyb?
So finally the only distinction between URI
and RelativeRef
path types would be Sandboxed
and Unsandboxed
type...
I meant "Abs
restriction fails in case of HierarchicalPart
" - I've corrected this in my previous post ;-)
Let's think a bit more about this uri:
scheme:../relative
Do we want to parse and represent it?
Syntactically it is correct URI which just contains extra "dot segment" :-) It seems that strict reference resolution algorithm will accept it and transform to scheme:relative
(without even checking base URI, because scheme
is defined). It will transform it without throwing any errors, according to this rule:
The remove_dot_segments algorithm respects that hierarchy by
removing extra dot-segments rather than treat them as an error or
leaving them to be misinterpreted by dereference implementations.
So I'm not sure what do we really want in this case and how we should represent paths in URI
's (uris with defined scheme
)...
Maybe we should stick to Path Abs a Sandboxed
in this case, not accept pathological cases like described above and treat mailto:mail@example.com
as mailto:/mail@example.com
- there is separate question how to handle printing in this case. Or... we should drop sophisticated path representation and use just String
:-)
What do you think @garyb?
It seems that in case of HierarchicalPart
path has to be absolute if authority is present, but can be relative or absolute (but should not begin with //
) if authority is missing. So there is also no simple way to handle this in types.
As I need working parser/printer for my project and I'm not sure how to handle these corner cases in current representation I've simplified path representation to String
on my branch and fixed all failing tests there. Of course I really want to help fix these issues here but I'm not sure how to tackle them now...
After today's fixes I think the only thing that's missing is now is the ability for the parsers to deal with "non-file" paths, like the fred@example.com
, example:animal:ferret:nose
examples. Next time we make a breaking release I'll add a sum in the representation that allows for FS paths or these other more generic paths, where appropriate.
Cool!
Of course I have my doubts... I think that /top_story.htm
should be parsed as relative ref with absolute path and that Scheme
in URI
should be mandatory and we should drop Maybe
from there (isn't it the main distinction between URI
and RelativeRef
in spec?) .
I also think that absolute and relative path distinction doesn't hold in case of both relative ref and uri.
Sorry that I've hijacked this thread. Maybe I should drop some of my incorrect or irrelevant comments from the past...
Yeah I think you're right with all of those points actually - I was going to say /top_story.htm
seems equally valid as URI
or RelativeRef
, but checking the spec again you're absolutely right that scheme
should not be optional for URI
.
It even says in the textual description
The scheme and path components are required, though the path may be empty (no characters).
I didn't do a very good job of following the spec with this thing, did I?! ๐คฆโโ๏ธ
I think that it would much be better if this spec start with just single text input which would accept and parse UriRef
in question ;-)
... and I'm sure that we are going to provide such an input for them soon ;-)
@paluh I'm not sure if you still have an interest in this, but as per the other comment you probably just got notified on, the library has been redesigned quite a bit, and now also parses correctly ๐ / disallows invalid URIs. Thanks for your input on all this, and for the work you had done on it too - I took the approach of not integrating pathy into it anymore, as as you pointed out there are a bunch of "paths" in URIs that have nothing to do with file paths.