apache/arrow-julia

Should extension metadata tag be more specific?

DrChainsaw opened this issue · 4 comments

The current implementation seems to use the metadata tag "ARROW:extension:name" for Julia-specific types (e.g. Symbols).

Shouldn't this instead be something like "ARROW:extension:arrow-julia.Symbol" to make collisions less likely? Maybe I misunderstand the extension documentation, but to me it seems like the tag name (and maybe also extension, docs are not super clear imo) is an example placeholder to be replaced with the actual name.

With the above change it would be possible to write tables from other languages that make use of custom types in more than one language. For example, I'm writing tables from the Java implementation, and in many cases I would like strings to be deserialized as symbols (e.g. for enums), but that non-specific tag might block similar optimizations for other implementations.

I guess it is not fun to change it due to backwards compatibility (we would probably need to support both tags for the foreseeable future), but it may be better to rip off the band-aid as quickly as possible should it turn out that a more specific tag is needed.

Just below they suggest name spacing the value passed there: https://arrow.apache.org/docs/format/Columnar.html#extension-types. That bit reads to me like it is not a placeholder, but rather the customization is in the value (not the key).

But the metadata is a dict, so the namespacing they suggest would be pointless if only applied to values since they will be overwritten.
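The overwrite concern can be sketched in a few lines of plain Python. This is only an illustration using an ordinary dict, not any Arrow library API; the key string is the one discussed above, while the two extension names (`JuliaLang.Symbol`, `myorg.Enum`) and the helper `namespace_of` are assumptions made up for the example.

```python
# Key discussed in this thread; the same string is used for every extension type.
EXTENSION_NAME_KEY = "ARROW:extension:name"

# Hypothetical namespaced extension names, chosen only for illustration.
JULIA_SYMBOL = "JuliaLang.Symbol"
OTHER_ENUM = "myorg.Enum"

field_metadata = {}
field_metadata[EXTENSION_NAME_KEY] = JULIA_SYMBOL
field_metadata[EXTENSION_NAME_KEY] = OTHER_ENUM  # replaces the previous value


def namespace_of(extension_name):
    """Return the part before the first dot (hypothetical helper)."""
    return extension_name.split(".", 1)[0]


# Only one extension name survives per field, but its namespace still
# tells a reader which project defined it.
print(field_metadata)
print(namespace_of(field_metadata[EXTENSION_NAME_KEY]))
```

So namespacing the value keeps names from different projects apart, but it does not let a single field carry two extension names at once.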

Maybe we can check against another implementation by seeing the metadata produced by an extension type there, e.g. following https://arrow.apache.org/docs/python/generated/pyarrow.ExtensionType.html#pyarrow.ExtensionType.extension_name.

It doesn't seem clear whether a column can have more than one extension type, though. It could be that there is only one key on purpose, so that different implementations can share that key to define an extension to the Arrow spec overall (e.g. if we all agree what a foo is, we define an extension name for that, serialize that metadata, and then read it in as a foo when possible). That might then mean your suggestion is that arrow-julia shouldn't be using "extension types" for metadata that is only used by that implementation, and should use other keys for that.
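A minimal sketch of that shared-name reading, in plain Python: every reader looks at the one key, converts the column if it recognizes the name, and otherwise keeps the stored values as they are. The registry, the name `example.foo`, the `Foo` class and `read_column` are all hypothetical and exist only for this example; no Arrow library is involved.

```python
from dataclasses import dataclass

EXTENSION_NAME_KEY = "ARROW:extension:name"


@dataclass(frozen=True)
class Foo:
    """Hypothetical type that several implementations agree on."""
    raw: str


# Hypothetical registry: extension name -> converter from storage values.
REGISTRY = {
    "example.foo": lambda values: [Foo(v) for v in values],
}


def read_column(values, field_metadata):
    """Convert values if the extension name is known, else return them unchanged."""
    name = field_metadata.get(EXTENSION_NAME_KEY)
    convert = REGISTRY.get(name)
    if convert is None:
        # Unknown or missing extension name: fall back to the stored values.
        return list(values)
    return convert(values)


known = read_column(["a", "b"], {EXTENSION_NAME_KEY: "example.foo"})
unknown = read_column(["a", "b"], {EXTENSION_NAME_KEY: "other.bar"})
plain = read_column(["a", "b"], {})
```

Under this reading a single key is enough, because the value selects the type and readers that do not know it simply ignore it.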

Would appreciate any feedback from someone who understands the spec better

After searching through the arrow repo for arrow:extension, it seems like you might be right. Here it is defined as a const in the cpp code, for example, and I could not find any trace of it being manipulated or changed anywhere (which would be quite a strange thing to do as well).

The C code seems to accept the metadata as a vector of pairs though, so it would in theory allow for multiple identical keys, but Python, Java and Julia use dicts, so there is no way to have duplicate keys through any of them.
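The difference between the two representations can be shown with plain Python containers. This is a sketch of the data-structure point only, not of any Arrow implementation, and the two extension names are made up for illustration.

```python
EXTENSION_NAME_KEY = "ARROW:extension:name"

# A sequence of pairs can hold the same key twice (hypothetical values).
pairs = [
    (EXTENSION_NAME_KEY, "JuliaLang.Symbol"),
    (EXTENSION_NAME_KEY, "myorg.Enum"),
]

# Converting to a dict keeps only the last value for a repeated key.
as_dict = dict(pairs)

print(len(pairs))    # both entries are still present in the pair list
print(len(as_dict))  # the duplicate key has been collapsed
```

Any reader that loads the metadata into a dict-like structure would therefore see at most one extension name per field, whatever the writer stored.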

It is a bit unclear to me what the ExtensionType stuff does in Arrow and what one gets for buying into it. However, most things point to it being a mechanism for your foo example.

I'm closing the issue now, since my initial understanding of the extension metadata was incorrect and I don't think any action is needed here. Just reopen it or open another issue if there is a concern about late conversion (e.g. String <-> Symbol) not fitting the definition of extensions.