base64 handling of IRIs for Content State
azaroth42 opened this issue · 7 comments
I think we need to define what to do with IRIs in the Content State spec, given that they will need to be encoded into safe base64 at various points.
In particular, given the IRI: https://en.wiktionary.org/wiki/Ῥόδος
Is the process to base64 encode the UTF-8 representation of those characters directly, or is it to percent-encode the non-ASCII characters first and then base64 encode the result?
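To make the two options concrete, here is a minimal Python sketch that produces both candidate encodings for the example IRI. The variable names are mine, and the `safe` list is borrowed from the server-side snippet in this thread; treat it as an illustration of the question, not a proposed answer.

```python
import base64
from urllib.parse import quote, unquote

iri = "https://en.wiktionary.org/wiki/Ῥόδος"

# Option A: base64url-encode the UTF-8 bytes of the IRI directly.
direct = base64.urlsafe_b64encode(iri.encode("utf-8")).decode("ascii")

# Option B: percent-encode the non-ASCII characters first, then base64url-encode.
# The safe list is an assumption taken from the Python snippet in this thread.
percent_encoded = quote(iri, safe=",/?:@&=+$#")
via_percent = base64.urlsafe_b64encode(percent_encoded.encode("ascii")).decode("ascii")

# Each option decodes back to the original IRI with its own matching decoder.
back_direct = base64.urlsafe_b64decode(direct).decode("utf-8")
back_via_percent = unquote(base64.urlsafe_b64decode(via_percent).decode("ascii"))

print(direct)
print(via_percent)
```

Both options round-trip, but they give different strings for the same IRI, so encoder and decoder have to agree on one of them.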
https://base64url.herokuapp.com/
With "basic JavaScript encoding", this all works fine for a string like "hello world"; you can take it round the circle, it survives client and server encoding and decoding.
But if you try our friend Ῥόδος then the client side encoding to content state (a real use case, of course) breaks with:
InvalidCharacterError: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range.
You can fix this by checking the second box; this uses the logic at https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/btoa#unicode_strings
But then, server and client produce different contentState strings, and the algorithm starts to get a bit fragile.
The mechanism on that Mozilla page is not the only way of making the JavaScript string safe for btoa, but whichever one the client uses, a server-side implementation clearly has to do the same thing. The third box uses the built-in window.encodeURI / window.decodeURI on the client, and matches this on the server with
urllib.parse.quote(plain_text, safe=',/?:@&=+$#')
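As a rough illustration of why percent-encoding first avoids the btoa error, the Python sketch below uses a Latin-1 encode as a stand-in for btoa's one-byte-per-character limit. That stand-in is my assumption, not something btoa is specified in terms of. It also checks the six characters `; ! * ' ( )`, which encodeURI leaves unescaped but which are not in this `safe` list, so the two sides may produce different strings for input containing them.

```python
from urllib.parse import quote

text = "Ῥόδος"

# btoa rejects any character above U+00FF; encoding to Latin-1 hits the same limit.
try:
    text.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# After percent-encoding, only ASCII remains, so the Latin-1 limit no longer matters.
quoted = quote(text, safe=",/?:@&=+$#")
quoted_is_ascii = quoted.isascii()

# encodeURI does not escape these six characters, but quote() with this safe list does.
extra = ";!*'()"
quoted_extra = quote(extra, safe=",/?:@&=+$#")

print(fits_latin1, quoted_is_ascii)
print(quoted_extra)
```

Decoding still works either way, because unquote and decodeURI both turn those escapes back into the original characters. It only matters if client and server output need to be identical strings.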
Source is here - https://github.com/tomcrane/base64url
I've been a bit verbose and clunky with this code, to see what's going on.
Regarding padding: a base64url-encoded string is still valid with the padding removed, which avoids having "=" in the parameter value.
The spec (RFC 4648, section 5) says:
The pad character "=" is typically percent-encoded when used in an URI, but if the data length is known implicitly, this can be avoided by skipping the padding.
... I feel it looks a bit messy if our content state ends up still having % chars in it, so I have added a checkbox to show the non-padded version, which you can see also round-trips.
So - client side script:
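function encodeUriEncode(plainContentState, noPadding) {
    // Percent-encode first so the string is ASCII-only and safe for btoa.
    let uriEncoded = encodeURI(plainContentState);
    let base64 = btoa(uriEncoded);
    // base64ToBase64url and removePadding are helpers not shown in this comment.
    let base64url = base64ToBase64url(base64);
    if (noPadding) base64url = removePadding(base64url);
    return base64url;
}

function decodeUriEncode(base64url, noPadding) {
    // restorePadding and base64urlToBase64 are likewise helpers not shown here.
    if (noPadding) base64url = restorePadding(base64url);
    let base64 = base64urlToBase64(base64url);
    let decoded = atob(base64);
    return decodeURI(decoded);
}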
Server-side Python:
import base64
import urllib.parse

def encode_uri(plain_text, no_padding):
    # Percent-encode, leaving the characters in `safe` untouched.
    quoted = urllib.parse.quote(plain_text, safe=',/?:@&=+$#')
    binary = quoted.encode("UTF-8")
    base64url = base64.urlsafe_b64encode(binary)  # this is bytes
    utf8_decoded = base64url.decode("UTF-8")
    if no_padding:
        # remove_padding is a helper not shown in this comment
        utf8_decoded = remove_padding(utf8_decoded)
    return utf8_decoded

def decode_uri(content_state, no_padding):
    if no_padding:
        # restore_padding is a helper not shown in this comment
        content_state = restore_padding(content_state)
    binary = base64.urlsafe_b64decode(content_state)
    plain_text = binary.decode("UTF-8")
    unquoted = urllib.parse.unquote(plain_text)
    return unquoted
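The Python above calls remove_padding and restore_padding without showing them. One plausible way to write them is sketched below. These bodies are my assumption, not necessarily what the linked repository does. They rely on the fact that a padded base64 string always has a length that is a multiple of 4.

```python
import base64

def remove_padding(s: str) -> str:
    # Hypothetical helper: drop the trailing "=" pad characters.
    return s.rstrip("=")

def restore_padding(s: str) -> str:
    # Hypothetical helper: pad with "=" up to the next multiple of 4.
    return s + "=" * (-len(s) % 4)

# Exercise input lengths that yield two, one, and zero pad characters.
results = []
for raw in (b"a", b"ab", b"abc", b"abcd"):
    padded = base64.urlsafe_b64encode(raw).decode("ascii")
    stripped = remove_padding(padded)
    restored = restore_padding(stripped)
    results.append((raw, padded, stripped, restored))
    print(padded, stripped, restored)
```

Python's base64.urlsafe_b64decode raises an error on input whose padding is missing, which is why the decode side restores it first.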
Addressed in IIIF/api@23ede73
TSG call 2021-09-15: Couldn't think of any better spec than the above. Propose that this should be it for 1.0
This is auto-linked but dropping in an explicit link to this discussion:
IIIF/trc#79
Certainly for 1.0 👍
First TRC pass superseded by IIIF/api#2072