Mimetypes as codes
Fstub42 opened this issue · 14 comments
Naming every possible type is already done with mime types.
I think it would make sense to use them as codes, instead of defining your own.
"image/jpeg" or "text/plain" make nice paths btw ;)
What are your thoughts on it, did I miss something?
Yeah, we could have a mime type based multicodec, something like:
/mime/<mime-type-here>
> Naming every possible type is already done with mime types.
Not everything is registered under mime. Also, mime is generally not specific enough: in most circumstances, application/json says very little about the encoded data. It would be nice to enable users to upgrade type annotations to signal more precise types. I'm not sure multicodec will help much here, but it might.
Hi @jbenet @Fstub42 , has there been more discussions or agreements for getting this supported?
I'm experimenting with cids and would like to potentially add support for mime-types.
Please let me know if you have any more info. I'd be up for sending a PR that we could work on toward such support.
Based on the information found at https://www.iana.org/assignments/media-types/media-types.xhtml, plus the suggestions in previous posts here, I'm thinking the following ranges can be reserved for the different mime types/subtypes (I put some examples I'm using to play with from the cids code):
// 0x1000 - 0x17ff (11 bits) reserved for application/* (there currently are ~1,300 subtypes)
multicodec.addCodec('mime/application/json', Buffer.from('1000', 'hex'));
multicodec.addCodec('mime/application/octet-stream', Buffer.from('1001', 'hex'));
multicodec.addCodec('mime/application/ld+json', Buffer.from('1002', 'hex'));
multicodec.addCodec('mime/application/rdf+xml', Buffer.from('1003', 'hex'));
// 0x1800 - 0x18ff (8 bits) reserved for audio/* (there currently are ~150 subtypes)
multicodec.addCodec('mime/audio/mp4', Buffer.from('1800', 'hex'));
// 0x1900 - 0x190f (4 bits) reserved for font/* (there currently are ~8 subtypes)
multicodec.addCodec('mime/font/ttf', Buffer.from('1900', 'hex'));
// 0x1910 - 0x197f (112 codes) reserved for image/* (there currently are ~60 subtypes)
multicodec.addCodec('mime/image/png', Buffer.from('1910', 'hex'));
// 0x1980 - 0x19cf (80 codes) reserved for message/* (there currently are ~18 subtypes)
multicodec.addCodec('mime/message/sip', Buffer.from('1980', 'hex'));
// 0x19d0 - 0x1a3f (112 codes) reserved for model/* (there currently are ~24 subtypes)
multicodec.addCodec('mime/model/3mf', Buffer.from('19d0', 'hex'));
// 0x1a40 - 0x1a8f (80 codes) reserved for multipart/* (there currently are ~13 subtypes)
multicodec.addCodec('mime/multipart/byteranges', Buffer.from('1a40', 'hex'));
// 0x1a90 - 0x1aff (112 codes) reserved for text/* (there currently are ~71 subtypes)
multicodec.addCodec('mime/text/html', Buffer.from('1a90', 'hex'));
multicodec.addCodec('mime/text/csv', Buffer.from('1a91', 'hex'));
multicodec.addCodec('mime/text/turtle', Buffer.from('1a92', 'hex'));
multicodec.addCodec('mime/text/xml', Buffer.from('1a93', 'hex'));
// 0x1b00 - 0x1b6f (112 codes) reserved for video/* (there currently are ~78 subtypes)
multicodec.addCodec('mime/video/JPEG', Buffer.from('1b00', 'hex'));
multicodec.addCodec('mime/video/mp4', Buffer.from('1b01', 'hex'));
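The proposed layout can also be written as a small lookup table, which makes the size of each range easy to check. This is only a sketch: `MIME_RANGES`, `rangeSize` and `topLevelTypeForCode` are hypothetical names, and the ranges are copied from the proposal above, not from the actual multicodec table.

```javascript
// Proposed (not adopted) code ranges, copied from the comment above.
const MIME_RANGES = [
  { type: 'application', start: 0x1000, end: 0x17ff },
  { type: 'audio',       start: 0x1800, end: 0x18ff },
  { type: 'font',        start: 0x1900, end: 0x190f },
  { type: 'image',       start: 0x1910, end: 0x197f },
  { type: 'message',     start: 0x1980, end: 0x19cf },
  { type: 'model',       start: 0x19d0, end: 0x1a3f },
  { type: 'multipart',   start: 0x1a40, end: 0x1a8f },
  { type: 'text',        start: 0x1a90, end: 0x1aff },
  { type: 'video',       start: 0x1b00, end: 0x1b6f },
];

// Number of codes available in a range (both ends inclusive).
function rangeSize(range) {
  return range.end - range.start + 1;
}

// Return the top-level type whose range contains `code`, or null.
function topLevelTypeForCode(code) {
  const hit = MIME_RANGES.find((r) => code >= r.start && code <= r.end);
  return hit ? hit.type : null;
}

console.log(topLevelTypeForCode(0x1a91)); // text
console.log(rangeSize(MIME_RANGES[3]));   // 112
```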
This is cool, thanks for giving it a push :)
It might be nicer to start with just one bucket of numbers. Most will inevitably run full anyway, so there's little sense in pushing the problem down a few years.
An approach that feels more accommodating for simple future change is to start with a single bucket that includes a snapshot of the whole mediatypes table, and then regularly add a new bucket with the mediatypes added in the meantime.
Mimetypes seem like a category of multicodecs that would be fine with fragmented numbers, i.e. they don't seem to benefit from being strictly consecutively numbered. (While something like the various variable-length multihash functions clearly do.)
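A minimal sketch of the bucket idea, assuming each bucket is just a base code plus an ordered list of media types. `makeBucket` is a hypothetical helper and both base codes are made up for illustration.

```javascript
// Assign consecutive codes to a list of media types, starting at `firstCode`.
function makeBucket(firstCode, mediaTypes) {
  return mediaTypes.map((name, i) => ({ name, code: firstCode + i }));
}

// Hypothetical initial snapshot and a later addition (base codes are made up).
const snapshot = makeBucket(0x1000, ['application/json', 'image/png', 'text/html']);
const later = makeBucket(0x3000, ['image/avif']);

// One lookup table built from all buckets; codes need not be contiguous.
const codeByName = new Map(
  [...snapshot, ...later].map(({ name, code }) => [name, code])
);

console.log(codeByName.get('image/png').toString(16));  // 1001
console.log(codeByName.get('image/avif').toString(16)); // 3000
```

Lookups go through the map, so nothing in the decoder depends on where a bucket starts or whether related types sit next to each other.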
> multicodec.addCodec('mime/video/JPEG', Buffer.from('1b00', 'hex'));
I was always under the impression that mimetypes were case-insensitive -- is that the case? Important question for decoding/encoding.
> It would be nice to enable users to upgrade type annotations to signal more precise types
@jbenet There's a longstanding convention for this, e.g. application/ld+json and application/epub+zip, and "more than 1000 occurrences" of + in the assignment table :)
It would also be interesting to look at the complete mediatype syntax (category/foo.example+foo or more) and make sure it's fine to use these in filesystem-ish paths.
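One way to check this is against the `restricted-name` grammar in RFC 6838, section 4.2, which limits type and subtype names to a leading alphanumeric character followed by up to 126 characters from a small set. The sketch below assumes that grammar; `isValidMediaType` and `toMimePath` are hypothetical helpers, not part of any multicodec library.

```javascript
// restricted-name from RFC 6838, section 4.2: one alphanumeric character
// followed by up to 126 characters from a small allowed set.
const RESTRICTED_NAME = /^[A-Za-z0-9][A-Za-z0-9!#$&\-^_.+]{0,126}$/;

// True if `mediaType` is exactly "type/subtype" with both parts valid.
function isValidMediaType(mediaType) {
  const parts = mediaType.split('/');
  return parts.length === 2 && parts.every((p) => RESTRICTED_NAME.test(p));
}

// Hypothetical helper: the multicodec-style path for a media type.
function toMimePath(mediaType) {
  if (!isValidMediaType(mediaType)) {
    throw new Error(`not a valid media type: ${mediaType}`);
  }
  return `/mime/${mediaType.toLowerCase()}`;
}

console.log(toMimePath('application/ld+json')); // /mime/application/ld+json
console.log(isValidMediaType('text/../etc'));   // false
```

Because each name must start with an alphanumeric character and cannot contain "/", a valid media type can never yield a "." or ".." path segment. Characters such as "#", "$" and "&" are still allowed, so they may need escaping in URLs or shells.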
Thanks @lgierth .
According to the RFC it seems you are right and they are case-insensitive. https://tools.ietf.org/html/rfc2045#section-5.1 says:
The type, subtype, and parameter names are not case sensitive. For
example, TEXT, Text, and TeXt are all equivalent top-level media
types.
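For encoding and decoding, this suggests normalizing names before any lookup. A minimal sketch, assuming a registry keyed by lowercased `type/subtype` strings; `codes` and `codeFor` are hypothetical, and the codes are the illustrative ones proposed earlier in this thread.

```javascript
// Hypothetical registry keyed by lowercased media type.
const codes = new Map([
  ['video/jpeg', 0x1b00],
  ['video/mp4', 0x1b01],
]);

// Look up a code, ignoring the case of the type and subtype.
function codeFor(mediaType) {
  return codes.get(mediaType.trim().toLowerCase());
}

console.log(codeFor('video/JPEG') === codeFor('Video/Jpeg')); // true
```

This only covers bare `type/subtype` strings. Parameter values (as opposed to parameter names) can be case-sensitive, so a string carrying parameters should not be lowercased wholesale.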
As for the ranges, I think you have a valid point. I was proposing to reserve enough bits to cope with quite a large addition to each of the types, but I wouldn't object to having them fragmented, as long as we at least get an initial bucket with the most common/popular ones all together now.
Since multicodec is represented with varint and MIME is a hierarchical classification system anyway, wouldn't it make sense to define one big range for MIME with a "prefix" (most significant septet of a 3-byte varint) followed by 7 bits each for type and subtype? That gives a range of 128 types and 128 subtypes for each type. Types with more subtypes can use multiple type septets. Squabbling over bit real estate is unnecessarily complex and less valuable than simpler decoding logic.
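To make the septet layout concrete, here is a sketch that packs a prefix, type and subtype into one integer and encodes it as an unsigned varint. `packMimeCode`, `unpackMimeCode` and `encodeVarint` are hypothetical helpers, and the field values are made up; no prefix has actually been assigned.

```javascript
// Pack a 7-bit prefix, type and subtype into one integer:
// prefix is the most significant septet, subtype the least.
function packMimeCode(prefix, type, subtype) {
  for (const v of [prefix, type, subtype]) {
    if (!Number.isInteger(v) || v < 0 || v > 0x7f) {
      throw new RangeError('each field must fit in 7 bits');
    }
  }
  return (prefix << 14) | (type << 7) | subtype;
}

// Split a packed code back into its three septets.
function unpackMimeCode(code) {
  return {
    prefix: (code >> 14) & 0x7f,
    type: (code >> 7) & 0x7f,
    subtype: code & 0x7f,
  };
}

// Unsigned varint: 7 bits per byte, least significant group first,
// high bit set on every byte except the last.
function encodeVarint(n) {
  const bytes = [];
  while (n >= 0x80) {
    bytes.push((n & 0x7f) | 0x80);
    n = Math.floor(n / 128);
  }
  bytes.push(n);
  return Buffer.from(bytes);
}

const code = packMimeCode(0x01, 0x02, 0x03); // made-up field values
console.log(encodeVarint(code).toString('hex')); // 838201
```

Note that a varint stores the least significant group first, so with this packing the subtype septet is the first byte on the wire and the prefix is the last. A layout where the prefix byte comes first on the wire would need the fields packed in the opposite order.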
Speaking of all this, isn't multicodec essentially a broader (though non-hierarchical) version of mimetypes with a binary encoding?
The missing piece is actually making the mapping.
Is it outside of the scope of this topic to suggest bijective improvements of MIME in deciding the mapping? For instance, in #84 @Stebalien mentions that application/* has 1500+ entries, which strongly suggests it's an overloaded category that needs to be split up. Also, the meaning of text/* is unclear, since it includes media which isn't displayed as textual (e.g. text/html) and many textual formats can be found under application/* (e.g. application/json).
Some ideas:
- A bit to denote whether a format is human-readable (not used for mapping)
- data/* as a catch-all instead of application/*
- maybe app/* to replace vnd.*?
- code/* category for source code
- exe/* for executable formats (.class, .exe, elf, .o)
- archive/* for file containers (.zip, .gz, .iso)
- file/* for filesystem data (ext* inodes, potentially useful for IPFS)
- block/* for data intended to be used in merkle-DAG structures (cryptocurrency blocks, IPLD objects)
Some of these may belong in their own multicodecs or be outright unnecessary. Also, none of the suggestions thus far offer a future-proof way of representing higher-level interpretations of a lower-level data format, e.g. application/*+xml. Naively they could just be treated as separate entries, but this discards potentially useful information. If we give the mime "namespace" 3-4 septets, I guess they could be category/format/type? With "type" being 0 for unspecified semantics. Or a second mime namespace could be added with a longer varuint representation. Heck, just adding another "prefix" for xml would get rid of half of application/* (mime/application/svg+xml vs mime-xml/svg or mime-xml/image/svg)
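The information carried by the "+" suffix is easy to extract mechanically, whatever code layout ends up being used. A minimal sketch: `splitMediaType` is a hypothetical helper and the field names `type`, `name` and `suffix` are chosen here for illustration only.

```javascript
// Split "type/subtype" into its parts, separating a structured syntax
// suffix (the text after the last "+" in the subtype) when present.
function splitMediaType(mediaType) {
  const [type, subtype = ''] = mediaType.toLowerCase().split('/');
  const plus = subtype.lastIndexOf('+');
  if (plus === -1) {
    return { type, name: subtype, suffix: null };
  }
  return {
    type,
    name: subtype.slice(0, plus),
    suffix: subtype.slice(plus + 1),
  };
}

console.log(JSON.stringify(splitMediaType('image/svg+xml')));
// {"type":"image","name":"svg","suffix":"xml"}
console.log(JSON.stringify(splitMediaType('text/plain')));
// {"type":"text","name":"plain","suffix":null}
```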
Redefining MIME types is likely way outside of the scope of this project. This project is primarily concerned with defining short "codes" for arbitrary things.
Not sure where to put this comment, whether in a new issue or elsewhere, but I'd like to see a mime-type column in the table. Its values should be unique, as codec parameters can be specified for things that are not. This would help ensure that duplicate entries are not added, and would also provide a way to automatically map from mime-types to multicodecs w/o people building their own [and possibly incorrect] tables of such.
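As a sketch of what such a column would enable, the snippet below builds a two-way index from table rows and rejects duplicate mime values. The row shape and `buildMimeIndex` are hypothetical, and the mime codes are the illustrative ones proposed earlier in this thread, not real assignments.

```javascript
// Hypothetical table rows; `mime` is the proposed extra column
// (null where no media type applies).
const rows = [
  { name: 'plaintextv2', code: 0x706c61, mime: null },
  { name: 'mime/text/html', code: 0x1a90, mime: 'text/html' },
  { name: 'mime/image/png', code: 0x1910, mime: 'image/png' },
];

// Build mime -> code and code -> mime maps, rejecting duplicate mime values.
function buildMimeIndex(tableRows) {
  const codeByMime = new Map();
  const mimeByCode = new Map();
  for (const { code, mime } of tableRows) {
    if (mime === null) continue;
    const key = mime.toLowerCase();
    if (codeByMime.has(key)) {
      throw new Error(`duplicate mime type in table: ${key}`);
    }
    codeByMime.set(key, code);
    mimeByCode.set(code, key);
  }
  return { codeByMime, mimeByCode };
}

const index = buildMimeIndex(rows);
console.log(index.codeByMime.get('image/png').toString(16)); // 1910
```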
Made this comment in the PR, but reposting here for visibility:
Looks like this effort has been stalled for a while, mostly due to concerns around the drastic increase in table size?
The readme of the project describes a first-come, first-served policy when it comes to adding new codecs, and I wonder if we could apply that here as well with mime types. I.e. maybe we can start with a small handful of the most commonly used MIME types on the internet today (say, this list), and then add more over time based on demand, instead of dumping in all known mime-types at once?
Is there some particular need for all the mime types to be in a contiguous block that I'm not aware of?
Just my two cents: it looks like concern over achieving the perfect encoding of mime types is what has stalled this (long overdue) work. I would suggest being much less precious about it, and just treating mime types like a legacy format.
The existing codec name plaintextv2 (0x706c61) is a conceptually different thing from the mime type text/plain. They signal different intents.
Content that has been encoded with the intent of running on an HTTP server (a legacy protocol in the context of multiformats) can and should use multicodec-encoded mime type mappings, because that is the nature of the content.
If in the future a more idiomatic multicodec schema is designed for stuff like images, then those codecs can be added in addition to the existing legacy mime-type mappings.
There's plenty of byte real-estate. No need to be precious.