purescript-contrib/purescript-uri

Fix the parsing so that all test cases pass

Closed this issue ยท 28 comments

garyb commented

The current "check failures" cases are all valid URIs.

paluh commented

Ooops! It seems that I've incorrectly translated test suite - I thought that these test cases should fail by design so I've thoughtlessly copied them ;-) Sorry for this mess!
I will try to fix them this weekend!

garyb commented

Oh, no problem - they were failing, and I think they were misleading before your work too!

The parser has some mistake(s) in it, or maybe is incomplete even. I kept the cases around as I knew they'd need fixing some day.

garyb commented

To clarify: they never have passed, but should do if the parser matches the spec ๐Ÿ˜„

paluh commented

Yeah - it's clear now ;-) I'm not sure if I'm able to fix them, but I'm going try in my spare time.

paluh commented

Hi @garyb,

I want to fix one of the failing test cases - I mean parsing of /top_story.htm scenario. I suspect that it should parse to:

(AbsoluteURI
  Nothing
    (HierarchicalPart
      Nothing
      (Just (Right (rootDir </> file "top_story.htm"))))
    Nothing)

I've found that this could be fixed if we move // prefix parsing/printing to the Authority module. It seems that RFC references it this way.
One additional change which follows this modification is parsing of this test case:

"sql2:///?q=foo&var.bar=baz"
(Left
  (URI
    (Just (Scheme "sql2"))
    (HierarchicalPart
      (Just (Authority Nothing [Tuple (NameAddress "") Nothing]))
      (Just (Left rootDir)))
    (Just (Query (Tuple "q" (Just "foo") : Tuple "var.bar" (Just "baz") : Nil)))
    Nothing))

In this situation Authority contains empty host info which is quite consistent with for example uriparser results:

$ uriparse "sql2:///?q=foo&var.bar=baz"

  uri:          sql2:///?q=foo&var.bar=baz
  scheme:       sql2
  hostText:
   .. pathSeg:
  query:        q=foo&var.bar=baz
  absolutePath: false

$ uriparse "sql2:/?q=foo&var.bar=baz"

  uri:          sql2:/?q=foo&var.bar=baz
  scheme:       sql2
  query:        q=foo&var.bar=baz
  absolutePath: true

Also cases like localhost:8000 has to be changed to //localhost:8000 which is again consistent with uriparser results.
After this change localhost:8000 alone is parsed as schema + path and seems to be close to mailto:* and tel:* cases, but their are still failing due to path validation problems...

What do you think? Should I provide pull request?

paluh commented

Unfortunately my bramch (but also upstream version) does not handle this additional, example test case: sql2:/?q=foo&var.bar=baz, but handles sql2:/test?q=foo&var.bar=baz well. It seems that it is also related to path parsing when authority is missing. I not fully understand path parsing code so I need a bit more time to try to fix this.
I've added appropriate tests for these scenarios on my branch too.

paluh commented

Sorry, it seems that I've made a mistake - /top_story.html is not an absolute uri - it should be probably parsed to relative-ref:

 testIsoURIRef
   "/top_story.htm"
   (Right
     (RelativeRef
       (RelativePart
         Nothing
         (Just (Right (rootDir </> file "top_story.htm"))))
       Nothing
       Nothing))

I'm going to fix this on my branch ;-)

garyb commented

Yeah, sorry, if I'd seen this sooner I would have said I think relative-ref is the one :)

paluh commented

I've changed the above snippet to use rootDir but it does not type check as RelativeRef allows only URIPathRel. I'm not sure if this is consistent with the spec. Sorry if this is stupid question, but could you explain why RelativeRef allows only relative paths?

paluh commented

Or wait, I will go through the spec more rigorously and then ask questions ;-)

paluh commented

I think that example in spec suggests that /g is relative reference which resolves to absolute path (it transforms base URI by replacing whole path).

What do you think should be a representation of /top_story.html?

garyb commented

I think maybe what you were saying earlier is right and RelativeRef should probably allow both rel and abs paths.

I'm open to suggestions though! I definitely don't have the answers here, otherwise I'd have made an attempt to fix it long ago ๐Ÿ˜„ - that's part of why it's still in the state it is, I knew some thought would be needed to figure out what the expected representation for the failing cases would be.

paluh commented

Ok, thanks for clarification @garyb.

I can try to modify relative reference representation, so it can handle both cases. Should I just define RelativeRefPath (and use it in RelativeRef) insted of URIPathRel as:

type RelativeRefPath = Either (URIPath Rel Unsandboxed) (URIPath Abs Unsandboxed)

With above representation we would end up with two levels of nested Either's and I'm not sure if this is nice to work with.

I can try to use custom data type instead:

data RelativeRefPath
  = AbsDir (Path Abs Dir Unsandboxed)
  | AbsFile (Path Abs File Unsandboxed)
  | RelDir (Path Rel Dir Unsandboxed)
  | RelFile (Path Rel File Unsandboxed)

What do you think? Do you have any other propositions or suggestions? Should I try to implement this and we will see where it goes?

P.S.
Sorry that I'm bothering you with such details ;-)

garyb commented

Oh, I see what you mean about the nested Eithers though. Uhm... let's just go with that for now, and see what falls out.

paluh commented

Thinking a little bit further and analyzing other failing test cases I think that Abs restinction also fails in case of HierarchicalPart (so in case of AbsoluteURI and URI). Let's consider this scenario:

urn:example:animal:ferret:nose

which according to spec should be represented as URI:

  • scheme: urn
  • path: example:animal:ferret:nose

It seems that this path value is not absolute. The same applies to mailto:*, tel:* where the only (in this context) significant restriction for path is that it should not begin with // (so even: mailto:../relative is correct uri with path value ../relative). I think that we should probably extend HierarchicalPart representation too. What do you think @garyb?

paluh commented

So finally the only distinction between URI and RelativeRef path types would be Sandboxed and Unsandboxed type...

paluh commented

I meant "Abs restriction fails in case of HierarchicalPart" - I've corrected this in my previous post ;-)

paluh commented

Let's think a bit more about this uri:

scheme:../relative

Do we want to parse and represent it?

Syntactically it is correct URI which just contains extra "dot segment" :-) It seems that strict reference resolution algorithm will accept it and transform to scheme:relative (without even checking base URI, because scheme is defined). It will transform it without throwing any errors, according to this rule:

The remove_dot_segments algorithm respects that hierarchy by
removing extra dot-segments rather than treat them as an error or
leaving them to be misinterpreted by dereference implementations.

So I'm not sure what do we really want in this case and how we should represent paths in URI's (uris with defined scheme)...

Maybe we should stick to Path Abs a Sandboxed in this case, not accept pathological cases like described above and treat mailto:mail@example.com as mailto:/mail@example.com - there is separate question how to handle printing in this case. Or... we should drop sophisticated path representation and use just String :-)

What do you think @garyb?

paluh commented

It seems that in case of HierarchicalPart path has to be absolute if authority is present, but can be relative or absolute (but should not begin with //) if authority is missing. So there is also no simple way to handle this in types.
As I need working parser/printer for my project and I'm not sure how to handle these corner cases in current representation I've simplified path representation to String on my branch and fixed all failing tests there. Of course I really want to help fix these issues here but I'm not sure how to tackle them now...

garyb commented

After today's fixes I think the only thing that's missing is now is the ability for the parsers to deal with "non-file" paths, like the fred@example.com, example:animal:ferret:nose examples. Next time we make a breaking release I'll add a sum in the representation that allows for FS paths or these other more generic paths, where appropriate.

paluh commented

Cool!

Of course I have my doubts... I think that /top_story.htm should be parsed as relative ref with absolute path and that Scheme in URI should be mandatory and we should drop Maybe from there (isn't it the main distinction between URI and RelativeRef in spec?) .

I also think that absolute and relative path distinction doesn't hold in case of both relative ref and uri.

Sorry that I've hijacked this thread. Maybe I should drop some of my incorrect or irrelevant comments from the past...

garyb commented

Yeah I think you're right with all of those points actually - I was going to say /top_story.htm seems equally valid as URI or RelativeRef, but checking the spec again you're absolutely right that scheme should not be optional for URI.

garyb commented

It even says in the textual description

The scheme and path components are required, though the path may be empty (no characters).

I didn't do a very good job of following the spec with this thing, did I?! ๐Ÿคฆโ€โ™‚๏ธ

paluh commented

I think that it would much be better if this spec start with just single text input which would accept and parse UriRef in question ;-)

paluh commented

... and I'm sure that we are going to provide such an input for them soon ;-)

garyb commented

@paluh I'm not sure if you still have an interest in this, but as per the other comment you probably just got notified on, the library has been redesigned quite a bit, and now also parses correctly ๐Ÿ˜‰ / disallows invalid URIs. Thanks for your input on all this, and for the work you had done on it too - I took the approach of not integrating pathy into it anymore, as as you pointed out there are a bunch of "paths" in URIs that have nothing to do with file paths.

paluh commented

@garyb Thanks for info and for your great work!