NET-A-PORTER/scala-uri

Parse Errors on invalid URI - A bit to strict?

Closed this issue · 15 comments

When parsing an URI like such: http://localhost:9000/?foo=test&&bar=test. Scala URI throws a parse error. While the URI is technically invalid, I think its a bit strict and should be able to normalize these kind of errors. Lots of websites make that stupid mistake, so as a parser we should be able to handle it.

I'm going to look into the source and see if there is a way I can fix it, but I'm not sure approach you would like me to take.

Let me know what you think

Thanks for raising this - it seems reasonable to support those URLs. This should be available in 0.4.3-SNAPSHOT for you to test.

Current behaviour is to consider it as an query parameter with an empty name and empty value. This means the &&are maintained if you parse and then call toString:

scala> val uri = Uri.parse("http://localhost:9000/?foo=test&&bar=test")
uri: com.netaporter.uri.Uri = http://localhost:9000/?foo=test&&bar=test

If you want to remove it you can use uri.removeParams("")

scala> val uri2 = uri.removeParams("")
uri2: com.netaporter.uri.Uri = http://localhost:9000/?foo=test&bar=test

Let me know if that works for you

Wow, you are truly amazing. Thanks so much for providing such a quick fix for this. I'm going to deploy your new update and let you know how it goes. That behavior is exactly what I what expect and it will work perfectly for me.

Thanks again!

In production, I see another parse error that may want to get fixed. Any query param with spaces also doesn't get parsed. Here is the bad URL:

localhost:9000/mlb/2014/06/15/david-wrights-slump-continues-why-new-york-mets-franchise-third-baseman-must-be-gone-before-seasons-end/?utm_source=RantSports&utm_medium=HUBRecirculation&utm_term=MLBNew York MetsGrid

I'm going to take a look and see if I can fix this on my own. If you have any tips or comments, let me know.

Thanks,

Mike

I think this is another case of something that is technically not a valid URL (spaces should be encoded to %20 or +), but it something we should still support.

I think This URL would work in 0.4.2, but in the latest SNAPSHOT I've made the parser stricter as a result of the changes in #51.

I'm actually thinking I should revert the changes in #51. We should be able to do this, as the root cause of #51 has been fixed in parboiled2:

sirthias/parboiled2#78

Adding one more example of strict parsing.

localhost:9000/t?x=y%26

Will this be fixed in 0.4.2?

Any update on this? Will 0.4.3 be released anytime soon? Currently being troubled by one of those URLs with the dangling &...

Thanks. :-)

btd commented

Got the same problem. @theon could you clarify if 0.4.3 will be published? thanks

Hi, sorry for the delay.

I have reverted the changes from #51 that made the parser a lot stricter and published 0.4.3. Could you give it a try.

@ov7a @sauravshah @mbseid I added tests for your examples in 47e6353. Some of the example URIs didn't start with http://. These will now parse with 0.4.3 but without the protocol, URIs like localhost/xyz will be parsed as relative URIs with localhost as part of the path. What do you think about this? I am reluctant to change this as that is how they would be treated by a browser if put in a href, for example.

@theon: Definitely makes sense to me to behave as browsers do, especially if it's also in line with the standards.

And the dangling ampersand use-case seems to be working okay now. E.g., Uri.parse("https://hi.com/some-path?this=that&a=b&") does not lead to nasty exception. Thanks!

ov7a commented

@theon, thanks! I think browser-like behaviour in situations like this is the best one can do.

I brought another example. Originally, this was a link to facebook photo.

https://example.com/path/ms.c.eE~;SoMeThingLongLikeBase64~;MoReOfthis-;a~-~-.id.a.3450750937450/8375385797349587/?type=1&param=1

Invalid URI could not be parsed. 3 rules mismatched at error location:
  _uri / | / _abs_uri / optional / _authority / optional / _userInfo / capture / oneOrMore / ANY
  _uri / | / _abs_uri / optional / _authority / optional / _userInfo / optional / ":" / ':'
  _uri / | / _abs_uri / optional / _authority / optional / _userInfo / "@" / '@'
 at index 128: https://example.com/path/ms.c.eE~;SoMeThingLongLikeBase64~;MoReOfthis-;a~-~-.id.a.3450750937450/8375385797349587/?type=1&param=1

@ov7a Thanks for the report. This was cause by the matrix parameter support getting confused. I don't think many people use matrix params, so regret making it compulsory.

I've just published a new version 0.4.4 where matrix param support is now optional and disabled by default (README link above also updated). So if you upgrade to 0.4.4 that URL should now work without you having to make any changes.

Forgot to put the issue number in the commit. Commit is c9f632d

ov7a commented

@theon, wow! That was super fast! Thank yo very much! It works now.