nolanlawson/blob-util

Clean up media type in blobToDataUrl

Treora opened this issue · 4 comments

Currently, blobToDataUrl simply puts blob.type into the URL (at this line), but it should be cleaned up to be valid:

  • blob.type may contain white space (e.g. I encountered a 'data:image/png; qs=0.7;base64,iVB…' from a Response.blob() in Firefox). I looked up its definition in the File API draft spec; which allows it to be any parseable mime type as defined in whatwg’s mime sniffing spec that bases on RFC 7231 §3.1.1.1 (HTTP media types) which indeed allows whitespace around semicolons. (confusingly however, that latter one is based on RFC 2046 which depends on RFC 2045, which in fact seems to not permit whitespace even though spaces do appear in its own examples… though I have not checked all its errata; in any case, fact is that white space can occur) And a data: url obviously does not permit white space (see its syntax definition in RFC2397).

  • As that data url spec (RFC2397) states, parameter values using quoted strings should instead be percentage-encoded:

    Attribute values in [RFC2045] are allowed to be either represented as
    tokens or as quoted strings. However, within a "data" URL, the
    "quoted-string" representation would be awkward, since the quote mark
    is itself not a valid urlchar. For this reason, parameter values
    should use the URL Escaped encoding instead of quoted string if the
    parameter values contain any "tspecial".

  • Perhaps any other caveats?

I may tackle this myself sometime. For now I thought to just raise the issue already and park my research notes here. :)

Would it be enough to URL-encode the type? I.e.

'data:' + encodeURIComponent(blob.type) + ';base64,' + base64String

?

It seems to me that if a Blob (in this case a File) has whitespace or funky characters in its type, then that's its type. Presumably the browser should treat that File exactly the same as if it were encoded into a data URI – if the browser massages the string or corrects for user errors, then great.

I don't believe it's the responsibility of blob-util to clean up all the ways that mime types can be messed up when coming from any server, but I do believe that blobToDataUrl should create a URL that is roughly equivalent to what you'd get from URL.createObjectURL() (i.e. a blob URL). So it seems to me that encoding the string should be sufficient.

OTOH if the blob.url already contains ;base64 in it, then I'm not sure there's much we can do except implement an entire mime parser, which I really don't think should be the responsibility of blob-util. 🙂

Would it be enough to URL-encode the type?

Unfortunately, no. That would also percent-encode characters of the media type that we should keep intact, such as the slash and semicolons. According to the data URL spec quoted above, and especially when following the reasoning of this recently suggested erratum, each parameter value that is a quoted string should be stripped of its quotes and then percent-encoded where needed.

This would indeed require some parsing (especially as a quoted value can again contain a quote if preceded by a backslash, according to the quoted-string syntax used in the media type spec (rfc 7231). But since it seems quotes can only appear either around or inside parameter values (but backslash-escaped in the latter case), I wonder if some regex might actually suffice. A quick attempt:

function reencodeMediatype(type) {
    return type.replace(
        /"((\\"|[^"])+)"/g, // find any string within quotes (possibly containing backslash-escaped quotes)
        (_, m) => encodeURIComponent(
            m.replace(/\\(.?)/g, '$1')
        ), // unescape all backslash-escaped characters; then percent-encode the value instead.
    );
}

Quite possibly I will have overlooked some complication; regexes do not usually suffice for correctly parsing things. In any case, I agree this should not be a responsability of blob-util. I hope somebody already packaged and published a correct implementation somewhere that could just be used here, or perhaps it’s time to make one. On NPM I can only find the modules attempting to do the reverse (e.g. parse-data-url, data-urls); not sure where else to search.

----

OTOH if the blob.url already contains ;base64 in it, …

(presuming you meant blob.type)

…then, firstly, it would not be a valid media type parameter anyway as it lacks a =; as the spec says:

The ";base64" extension is distinguishable from a content-type
parameter by the fact that it doesn't have a following "=" sign.

…and secondly, the current implementation would simply append a second ;base64, to the end, which presumably a receiving end would strip off to end up with the original (although invalid) type again. (a quick look at this library’s dataUrlToBlob() implementation reveals however that it throws away all of the media type parameters! And blindly assumes base64 encoding. Separate issue.)


Although making modules implement the details correctly seems worthwhile to avoid incompatibilities further down the line, I guess this might (at least for myself) not have the highest priority to dive even deeper into. Personally, I would probably just consider quoted media type parameters as unsupported for the time being, and fix the whitespace issue by just removing all whitespace from the type: blob.type.replace(/\s/g,'')

This sounds like a tricky issue. I guess since blob-util is modular, it may be worth fixing. But your workaround also works fine and it seems like a bit of a edge-case, so maybe that's okay. I'm a bit on-the-fence about this.

I'm also nervous about the proposed regex fix, unless it can really handle all possible edge cases (with tests to demonstrate 🙂).