NET-A-PORTER/scala-uri

A URISyntaxException is thrown parsing the following url: https://krownlab.com/products/hardware-systems/baldur/#baldur-top-mount#1

Closed this issue · 5 comments

scala> import com.netaporter.uri.Uri
import com.netaporter.uri.Uri

scala> Uri.parse("https://example.com/products/hardware-systems/blah/#blah-top-mount#1")
java.net.URISyntaxException: Invalid URI could not be parsed. Vector(RuleTrace(List(NonTerminal(Named(_uri),-66), NonTerminal(RuleCall,-66), NonTerminal(Sequence,-66), NonTerminal(FirstOf,-66), NonTerminal(Named(_abs_uri),-66), NonTerminal(RuleCall,-66), NonTerminal(Sequence,-66), NonTerminal(Optional,-15), NonTerminal(Named(_fragment),-15), NonTerminal(RuleCall,-15), NonTerminal(Sequence,-15), NonTerminal(Capture,-14), NonTerminal(ZeroOrMore,-14), NonTerminal(Sequence,0)),NotPredicate(Terminal(AnyOf(#)),1)), RuleTrace(List(NonTerminal(Named(_uri),-66), NonTerminal(RuleCall,-66), NonTerminal(Sequence,-66)),CharMatch(�))) at index 66: https://example.com/products/hardware-systems/blah/#blah-top-mount#1
  at com.netaporter.uri.parsing.UriParser$.parse(UriParser.scala:67)
  at com.netaporter.uri.Uri$.parse(Uri.scala:303)
  ... 43 elided

According to RFC 3986, Section 3 (Syntax Components), a URI may contain at most one '#', which marks the start of the fragment. A fragment is not permitted to contain a '#'.
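To make the rule concrete, here is a minimal sketch in plain Scala that checks for a raw '#' inside the fragment. It is not scala-uri's parser; `FragmentCheck`, `fragmentOf` and `hasRawHashInFragment` are hypothetical names made up for illustration, and the check only looks at '#' characters, nothing else in the grammar.

```scala
object FragmentCheck {
  // Everything after the first '#' is the candidate fragment (hypothetical helper).
  def fragmentOf(url: String): Option[String] = {
    val idx = url.indexOf('#')
    if (idx < 0) None else Some(url.substring(idx + 1))
  }

  // True when the fragment itself still contains a raw '#',
  // which the RFC 3986 fragment rule does not allow.
  def hasRawHashInFragment(url: String): Boolean =
    fragmentOf(url).exists(_.contains("#"))
}
```

For the URL from the stack trace above, `FragmentCheck.hasRawHashInFragment("https://example.com/products/hardware-systems/blah/#blah-top-mount#1")` returns `true`, because the fragment `blah-top-mount#1` contains a second '#'.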

Hi @evanbennett,

Thx for clarifying the background. We also use NET-A-PORTER's scala-uri to parse URLs that come from access logs, and we run into the same error because those URLs actually contain a second fragment separator. Most browsers seem to cope with this perfectly fine. Is there a way to configure a less restrictive mode to parse such URLs nonetheless?

Thx @theon, that was lightning fast.

theon commented

@christoph-buente, np!

I have made the parsing of fragments more permissive. It should now successfully parse these URLs. The change is published under version 0.4.16 of scala-uri. Give it a try and let me know if it works as expected. (It may take a couple of hours to reach Maven Central, but it is already available at https://oss.sonatype.org/content/repositories/releases/com/netaporter/scala-uri_2.11/0.4.16/)

The second # will be considered part of the fragment and as such will be URL encoded to %23 when you call .toString on the URL. E.g.

https://krownlab.com/products/hardware-systems/baldur/#baldur-top-mount#1

will become the valid URL:

https://krownlab.com/products/hardware-systems/baldur/#baldur-top-mount%231
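
For anyone who wants to see the effect without pulling in the new version, the transformation can be approximated in plain Scala. This is only a sketch of the behaviour described above, not the code in scala-uri; `FragmentEncoder` and `encodeExtraHashes` are hypothetical names, and it assumes the first '#' is the fragment separator and only touches later '#' characters.

```scala
object FragmentEncoder {
  // Treat the first '#' as the fragment separator and percent-encode
  // any further '#' inside the fragment as %23 (hypothetical helper).
  def encodeExtraHashes(url: String): String = {
    val idx = url.indexOf('#')
    if (idx < 0) url // no fragment, nothing to do
    else {
      val base = url.substring(0, idx)
      val fragment = url.substring(idx + 1)
      base + "#" + fragment.replace("#", "%23")
    }
  }
}
```

Calling `FragmentEncoder.encodeExtraHashes` on the krownlab URL above yields the `%231` form shown. With scala-uri 0.4.16 itself, the equivalent is `Uri.parse(...)` followed by `.toString`.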

Thx a million, @theon.