purescript-uri
A type-safe parser, printer, and ADT for URLs and URIs based on RFC 3986.
Installation
bower install purescript-uri
Getting started
The types and names here are a fairly faithful representation of the components described in the spec.
URI is for absolutely specified URIs that can also have path, query, and fragment (hash) parts.
AbsoluteURI is a variation on URI that drops the ability for the URI to carry a fragment.
RelativeRef is for relatively specified URIs that can also have path, query, and fragment (hash) parts.
URIRef is a combination of URI and RelativeRef, allowing the full range of representable URIs.
The absolute/relative terminology, when applied to URIs, does not relate to the paths that a URI may carry; it refers to whether the URI has a "scheme" or not. For example, http://example.com and file://../test.txt are absolute URIs, but //example.com and /test.txt are relative.
Assuming none of the unsafe-prefixed functions are used when constructing a URI, it should be impossible to construct an invalid URI using the types this library provides*. The slight downside is that the data structures are relatively complex, so that they admit only correct possibilities.
* Actually, there is one exception to that: IPv6Address is currently far too forgiving in what it allows. Contributions welcome!
URI component representations
Due to the differing needs of users of this library, the URI types are all parameterised so that custom representations can be used for parts of the URI. Take a look at the most heavily parameterised type, URIRef:
type URIRef userInfo hosts path hierPath relPath query fragment = ...
This allows us to provide hooks into the parsing and printing processes for a URI, so that types better suited to the intended use case can be used.
Taking userInfo as an example: according to the spec, the user-info part of an authority is just an arbitrary string of characters terminated by an @ before a hostname. An extremely common usage for this is the user:password scheme, so by leaving the choice of representation as a type variable we can switch it out for a type specifically designed to handle that (this library includes one, under URI.Extra.UserPassInfo).
App-specific URI type definitions
When using this library, you'll probably want to define type synonyms for the URIs that make sense for your use case. A URI type that uses the simple representations for each component will look something like this:
type MyURI = URIRef UserInfo (HostPortPair Host Port) Path HierPath RelPath Query Fragment
Along with these types, you'll want to define an options record that specifies how to parse and print the URIs. It looks like this:
options ∷ Record (URIRefOptions UserInfo (HostPortPair Host Port) Path HierPath RelPath Query Fragment)
options =
{ parseUserInfo: pure
, printUserInfo: identity
, parseHosts: HostPortPair.parser pure pure
, printHosts: HostPortPair.print identity identity
, parsePath: pure
, printPath: identity
, parseHierPath: pure
, printHierPath: identity
, parseRelPath: pure
, printRelPath: identity
, parseQuery: pure
, printQuery: identity
, parseFragment: pure
, printFragment: identity
}
As you can see by all the pure and identity, we're not doing a whole lot here. parseHosts is a bit of an exception, but that's just due to the way that case is handled (see later in this README for more details about that).
These types (UserInfo, HostPortPair, Host, etc.) are all provided by the library, and where necessary can only be constructed via a smart constructor. This ensures that percent-encoding is applied to characters where necessary, so that the constructed values will print as valid URIs.
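To show how the options record is used once it is defined, here is a minimal sketch that parses a string and prints it back. It assumes the module layout of the published library (URI.URIRef, URI.HostPortPair, and so on) and that runParser comes from Text.Parsing.Parser (newer releases of the parsing library call that module Parsing). The module name Example.RoundTrip and the function describe are hypothetical names used only for illustration.

```purescript
module Example.RoundTrip where

import Prelude

import Data.Either (Either(..))
-- Assumption: older purescript-parsing; newer releases export runParser from `Parsing`.
import Text.Parsing.Parser (runParser)
import URI.Fragment (Fragment)
import URI.HierarchicalPart (HierPath)
import URI.Host (Host)
import URI.HostPortPair (HostPortPair)
import URI.HostPortPair as HostPortPair
import URI.Path (Path)
import URI.Port (Port)
import URI.Query (Query)
import URI.RelativePart (RelPath)
import URI.URIRef (URIRef, URIRefOptions)
import URI.URIRef as URIRef
import URI.UserInfo (UserInfo)

type MyURI = URIRef UserInfo (HostPortPair Host Port) Path HierPath RelPath Query Fragment

-- The same "do nothing special" options as in the README text.
options :: Record (URIRefOptions UserInfo (HostPortPair Host Port) Path HierPath RelPath Query Fragment)
options =
  { parseUserInfo: pure
  , printUserInfo: identity
  , parseHosts: HostPortPair.parser pure pure
  , printHosts: HostPortPair.print identity identity
  , parsePath: pure
  , printPath: identity
  , parseHierPath: pure
  , printHierPath: identity
  , parseRelPath: pure
  , printRelPath: identity
  , parseQuery: pure
  , printQuery: identity
  , parseFragment: pure
  , printFragment: identity
  }

-- Hypothetical helper: report whether the input parsed as an absolute URI
-- (it has a scheme) or as a relative reference, then print it back.
describe :: String -> String
describe input =
  case runParser input (URIRef.parser options) of
    Left _ -> "invalid: " <> input
    Right uri@(Left _) -> "absolute: " <> URIRef.print options uri
    Right uri@(Right _) -> "relative: " <> URIRef.print options uri
```

Because URIRef is a sum of URI and RelativeRef, pattern matching on Left and Right is enough to tell the two cases apart.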
If we decided that we wanted to support user:password style user-info, we'd modify this by changing our type to use UserPassInfo:
type MyURI = URIRef UserPassInfo (HostPortPair Host Port) Path HierPath RelPath Query Fragment
And update our options to use the appropriate parse/print functions accordingly:
options ∷ Record (URIRefOptions UserPassInfo (HostPortPair Host Port) Path HierPath RelPath Query Fragment)
options =
{ parseUserInfo: UserPassInfo.parse
, printUserInfo: UserPassInfo.print
, ...
Writing custom component types
These parse/print functions all share much the same shape of signature. For the case in the previous example, they come out as:
parseUserInfo ∷ UserInfo → Either URIPartParseError UserPassInfo
printUserInfo ∷ UserPassInfo → UserInfo
So for each component, when a custom representation is plugged in through the options hooks, the parse function takes one of the library-provided component types and converts it into our new representation, and the print function converts it back to that simple type later.
Each of the library-provided component types has a toString function that extracts the inner value as a string after applying percent-decoding, and an unsafeToString that provides exactly the value that was parsed, preserving percent-encoding. Similarly, there's a fromString that performs the minimal amount of required percent-encoding for that part of the URI, and an unsafeFromString that performs no encoding at all.
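A small sketch of the four functions may help, using Fragment as the component. It assumes URI.Fragment exposes fromString, toString, unsafeFromString, and unsafeToString over plain String values; the module and value names are hypothetical.

```purescript
module Example.FragmentEncoding where

import URI.Fragment (Fragment)
import URI.Fragment as Fragment

-- Built from a raw string: a space is not a legal fragment character,
-- so fromString percent-encodes it.
fromRaw :: Fragment
fromRaw = Fragment.fromString "section one"

-- The stored form, with percent-encoding preserved.
encodedView :: String
encodedView = Fragment.unsafeToString fromRaw

-- The human-readable form, with percent-decoding applied.
decodedView :: String
decodedView = Fragment.toString fromRaw

-- Built from a string we have already encoded ourselves:
-- unsafeFromString stores it as-is, with no further encoding.
fromEncoded :: Fragment
fromEncoded = Fragment.unsafeFromString "section%20one"
```

The safe pair (fromString/toString) is the right default; the unsafe pair is for code that manages encoding itself.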
You may ask why it's ever useful to have access to the encoded values, or to be able to print without encoding, so here's a motivating example:
For the UserPassInfo example, the typical way of encoding a username or password that contains a colon is to use %3A (us:er becomes us%3Aer). This allows colons within the values to be recognised as independent from the colon separating username and password (us%3Aer:password).
According to the spec, encoding colons in this part of the URI is not required, so just using toString on us:er would give us back us:er, resulting in us:er:password, and we'd have no way of knowing where the user ends and where the password starts.
The solution when printing is to do some custom encoding that also replaces : with %3A for the user/password parts, and then joins them with the unencoded : afterwards. If we constructed the resulting UserInfo value with fromString, it would re-encode our already encoded user/password parts (giving us %253A instead of %3A), so we use unsafeFromString since we've done the encoding ourselves.
Similarly, when parsing these values back, we want to split on : and then percent-decode the user/password parts individually, so we need to use unsafeToString to ensure we get the encoded version.
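The printing half of this approach can be sketched roughly as follows. This is not the implementation in URI.Extra.UserPassInfo, only an illustration of the idea. It assumes the UserInfo functions operate on NonEmptyString, and encodePart and printUserPass are hypothetical names.

```purescript
module Example.UserPassPrint where

import Prelude

import Data.Maybe (Maybe)
import Data.String (Pattern(..), Replacement(..), replaceAll)
import Data.String.NonEmpty (NonEmptyString)
import Data.String.NonEmpty as NES
import URI.UserInfo (UserInfo)
import URI.UserInfo as UserInfo

-- Encode one part (user or password). fromString applies the standard
-- encoding for user-info but leaves ':' alone, so colons are escaped by
-- hand afterwards on the already-encoded text.
encodePart :: NonEmptyString -> String
encodePart =
  replaceAll (Pattern ":") (Replacement "%3A")
    <<< NES.toString
    <<< UserInfo.unsafeToString
    <<< UserInfo.fromString

-- Join the encoded parts with a literal ':' and wrap the result with
-- unsafeFromString so nothing gets encoded a second time.
-- The Maybe only comes from rebuilding a NonEmptyString from a String.
printUserPass :: NonEmptyString -> NonEmptyString -> Maybe UserInfo
printUserPass user password =
  UserInfo.unsafeFromString
    <$> NES.fromString (encodePart user <> ":" <> encodePart password)
```

Swapping unsafeFromString for fromString in printUserPass would reintroduce the double-encoding problem described above.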
Another example where this sort of thing might be useful is if you would like to encode/decode spaces in paths as + rather than %20. Having the ability to hook into the parse/print stage and choose to examine or print with or without percent encoding/decoding applied gives us the flexibility to produce and consume values exactly as we want, rather than the library attempting to know best in all cases.
Host parsing
The host printing/parsing setup is a little different. This is to accommodate something that lies outside of the RFC 3986 spec: multiple host definitions within a URI. The motivating case for this is things like connection strings for MongoDB, where host/port pairs can be defined separated by commas within a single URI:
mongodb://db1.example.net:27017,db2.example.net:2500/?replicaSet=test
This doesn't jibe with RFC 3986: a comma is allowed there as part of a hostname, but the multiple ports don't fit into the grammar. To get around this, the parsing of hosts is entirely handed over to the parseHosts parser in the options (for the other parameters, a normal function is run on a value that has already been parsed according to the spec).
For normal URIs, the HostPortPair parser/printer should serve well enough. It accepts functions to deal with the host and port parts, so those can be handled much like all the other options.
For URIs that are like the MongoDB connection string, this library provides URI.Extra.MultiHostPortPair. Given that both of these allow for custom Host / Port types, hopefully nobody else will need to write anything for the general host-section-parsing part!
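As a sketch of what a multi-host setup might look like, the options below swap HostPortPair for MultiHostPortPair. It assumes URI.Extra.MultiHostPortPair exposes a parser and print with the same argument shape as the HostPortPair ones used earlier; check the module documentation for the exact signatures. MongoURI and mongoOptions are hypothetical names.

```purescript
module Example.MultiHost where

import Prelude

import URI.Extra.MultiHostPortPair (MultiHostPortPair)
import URI.Extra.MultiHostPortPair as MultiHostPortPair
import URI.Fragment (Fragment)
import URI.HierarchicalPart (HierPath)
import URI.Host (Host)
import URI.Path (Path)
import URI.Port (Port)
import URI.Query (Query)
import URI.RelativePart (RelPath)
import URI.URIRef (URIRef, URIRefOptions)
import URI.UserInfo (UserInfo)

-- Same shape as MyURI, but the hosts slot can hold several host/port pairs.
type MongoURI = URIRef UserInfo (MultiHostPortPair Host Port) Path HierPath RelPath Query Fragment

mongoOptions :: Record (URIRefOptions UserInfo (MultiHostPortPair Host Port) Path HierPath RelPath Query Fragment)
mongoOptions =
  { parseUserInfo: pure
  , printUserInfo: identity
  -- Assumption: these mirror HostPortPair.parser / HostPortPair.print.
  , parseHosts: MultiHostPortPair.parser pure pure
  , printHosts: MultiHostPortPair.print identity identity
  , parsePath: pure
  , printPath: identity
  , parseHierPath: pure
  , printHierPath: identity
  , parseRelPath: pure
  , printRelPath: identity
  , parseQuery: pure
  , printQuery: identity
  , parseFragment: pure
  , printFragment: identity
  }
```

Only the hosts type parameter and the two host hooks change; every other field stays the same as in the single-host options.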
Further documentation
The tests contain many examples of URI constructions using the basic types this library provides.
Module documentation is published on Pursuit.