mapping-commons/sssom-py

UX: Extra entries in `curie_map` of written TSV

Opened this issue · 4 comments

Overview

When I write SSSOM to TSV, I'd like it to only include entries in curie_map where the prefixes are actually used in 1 or more places in the mapping set. However, extra entries are appearing.

Case 1: When passing a Converter and metadata

I expected 4 entries, but got 10.

I also mentioned this in: #513. I don't necessarily mind these extra entries in there, as they are some popular and relevant namespaces. The extra ones I got were: owl, sssom, and oboInOwl, rdf, rdfs, and semapv. I'd suggest that we could possibly add some parameterization for this. Stick with the default of either including these important namespaces or not, and then a parameter to allow for the opposite. Also, IDK if this is really a sssom-py issue or a curies issue.

Case 2: When not passing metadata, but no Converter

I expected 4 entries, but got 1,547.

Possible solutions

Nico wrote:

...if the massive OBO context was used to infer prefixes, we should automatically call clean_prefix_map().

Additional details

FYI:

Results, based on various means:

  1. n=4. This is the amount of entries I want, and it is the amount that I get when using some custom code I wrote. ordo-icd11.sssom - joes ad hoc way.tsv.zip
  2. n=10. This is the amount of entries I get when first instantiating a Converter and passing that to MappingSetDataFrame. ordo-icd11.sssom - with converter.tsv.zip
  3. n=1,547. This is the amount of entries I get when instantiating a MappingSetDataFrame passing in my metadata but no Converter. ordo-icd11.sssom - no converter.tsv.zip
  1. owl, sssom, and oboInOwl, rdf, rdfs, and semapv are all built in prefixes and are therefore added by default (especially sssom and semapv which are nearly always needed.
  2. There is a method on MappingSetPrefixMap called clean_prefix_map() which removes all prefixes from the curie map that are not used. Try to see if it gets rid of some of the buildin ones?

For (2), right--I forgot to do msdf.clean_prefix_map().

Do we really want the default behavior to be, that when no converter is passed that we include all of these 1,547 entries? Not to mention the related sub-issues of #513: (2) Incorrect curie_map (it leaves out prefixes that are in metadata), and (3) UX: Should automatically instantiate Converter.

My preference would be that clean_prefix_map() should be automatic, and if we want to have a parameter that adds tons of namespaces, we can add that.

Do we really want the default behavior to be, that when no converter is passed that we include all of these 1,547 entries?

i. I think you are right. Can you update the OC to request that, if the massive OBO context was used to infer prefixes, we should automatically call clean_prefix_map()?

Incorrect curie_map (it leaves out prefixes that are in metadata),

ii. If this is true is a bug, add as action item to OC.

Should automatically instantiate Converter.

iii. This must already be the case..

(i) Done! (ii) Done!

(iii) It probably is, but let me rephrase: Should correctly instantiate Converter.
This is related to #513 so let me add what I mean there (this comment).