UX: Extra entries in `curie_map` of written TSV
Opened this issue · 4 comments
Overview
When I write SSSOM to TSV, I'd like it to only include entries in `curie_map` whose prefixes are actually used in one or more places in the mapping set. However, extra entries are appearing.
Case 1: When passing a `Converter` and `metadata`
I expected 4 entries, but got 10.
I also mentioned this in #513. I don't necessarily mind these extra entries in there, as they are popular and relevant namespaces. The extra ones I got were: `owl`, `sssom`, `oboInOwl`, `rdf`, `rdfs`, and `semapv`. I'd suggest that we could add some parameterization for this: pick a default (either include these important namespaces or not), and add a parameter to allow the opposite. Also, I don't know whether this is really a `sssom-py` issue or a `curies` issue.
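To make the parameterization idea concrete, here is a minimal sketch in plain Python. It is not sssom-py's or curies' API: `prune_prefix_map`, `keep_builtin`, and `BUILTIN_PREFIXES` are hypothetical names, and the sample prefix map and CURIEs are made up for illustration. The only thing taken from this issue is the list of six extra prefixes.

```python
from typing import Dict, Iterable, Set

# The six extra prefixes reported in this issue. Treating them as a fixed
# "built-in" set is an assumption made for this sketch.
BUILTIN_PREFIXES: Set[str] = {"owl", "sssom", "oboInOwl", "rdf", "rdfs", "semapv"}


def used_prefixes(curies: Iterable[str]) -> Set[str]:
    """Collect the prefix (text before the first colon) of every CURIE."""
    return {curie.split(":", 1)[0] for curie in curies if ":" in curie}


def prune_prefix_map(
    prefix_map: Dict[str, str],
    curies: Iterable[str],
    keep_builtin: bool = True,
) -> Dict[str, str]:
    """Return a copy of prefix_map restricted to the prefixes in use.

    With keep_builtin=True, built-in prefixes survive even when unused.
    """
    keep = used_prefixes(curies)
    if keep_builtin:
        keep |= BUILTIN_PREFIXES
    return {prefix: uri for prefix, uri in prefix_map.items() if prefix in keep}


# Illustrative sample data (not the actual mapping set from this issue).
prefix_map = {
    "ORDO": "http://www.orpha.net/ORDO/Orphanet_",
    "icd11.foundation": "http://id.who.int/icd/entity/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "semapv": "https://w3id.org/semapv/vocab/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
}
curies = [
    "ORDO:123",
    "icd11.foundation:456",
    "skos:exactMatch",
    "semapv:ManualMappingCuration",
]

strict = prune_prefix_map(prefix_map, curies, keep_builtin=False)
lenient = prune_prefix_map(prefix_map, curies, keep_builtin=True)
```

With this sample data, `strict` keeps only the four prefixes that appear in `curies`, while `lenient` additionally keeps `owl` and `rdfs`; the unused, non-built-in `MONDO` entry is dropped in both cases.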
Case 2: When passing `metadata`, but no `Converter`
I expected 4 entries, but got 1,547.
Possible solutions
...if the massive OBO context was used to infer prefixes, we should automatically call `clean_prefix_map()`.
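As a rough illustration of that proposal, the sketch below prunes the prefix map only when it had to be inferred from a fallback context. Everything here is hypothetical: `DEFAULT_CONTEXT` is a tiny stand-in for the large OBO context, and `resolve_prefix_map` is not a function in sssom-py or curies.

```python
from typing import Dict, Iterable, Optional

# Tiny hypothetical stand-in for the large default (OBO) context.
DEFAULT_CONTEXT: Dict[str, str] = {
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}


def resolve_prefix_map(
    curies: Iterable[str],
    prefix_map: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Choose the prefix map to write, pruning it only when it was inferred."""
    if prefix_map is not None:
        # The caller supplied an explicit map: return a copy, unchanged.
        return dict(prefix_map)
    # Nothing was supplied: fall back to the default context, then drop
    # every prefix that no CURIE in the mapping set actually uses.
    used = {curie.split(":", 1)[0] for curie in curies if ":" in curie}
    return {p: uri for p, uri in DEFAULT_CONTEXT.items() if p in used}


inferred = resolve_prefix_map(["MONDO:0000001", "HP:0000118"])
explicit = resolve_prefix_map(["MONDO:0000001"], prefix_map={"X": "http://example.org/X_"})
```

The design choice sketched here is that an explicitly supplied map is trusted as-is, while an inferred one is cleaned automatically, so a user who passes nothing does not end up with the whole default context in the output.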
Additional details
FYI:
- Here's the file I'm using to instantiate my `metadata`: icd11.sssom-metadata.yml.zip
- This is the PR where I'm creating this mapping set.
Results, based on various means:
- n=4. This is the number of entries I want, and it is the number that I get when using some custom code I wrote. ordo-icd11.sssom - joes ad hoc way.tsv.zip
- n=10. This is the number of entries I get when first instantiating a `Converter` and passing that to `MappingSetDataFrame`. ordo-icd11.sssom - with converter.tsv.zip
- n=1,547. This is the number of entries I get when instantiating a `MappingSetDataFrame`, passing in my `metadata` but no `Converter`. ordo-icd11.sssom - no converter.tsv.zip
- `owl`, `sssom`, `oboInOwl`, `rdf`, `rdfs`, and `semapv` are all built-in prefixes and are therefore added by default (especially `sssom` and `semapv`, which are nearly always needed).
- There is a method on `MappingSetPrefixMap` called `clean_prefix_map()` which removes all prefixes from the curie map that are not used. Try it to see if it gets rid of some of the built-in ones?
For (2), right: I forgot to call `msdf.clean_prefix_map()`.
Do we really want the default behavior to be that, when no `converter` is passed, we include all of these 1,547 entries? Not to mention the related sub-issues of #513: (2) Incorrect `curie_map` (it leaves out prefixes that are in `metadata`), and (3) UX: Should automatically instantiate `Converter`.
My preference would be that `clean_prefix_map()` should be automatic, and if we want to have a parameter that adds tons of namespaces, we can add that.
Do we really want the default behavior to be that, when no converter is passed, we include all of these 1,547 entries?
i. I think you are right. Can you update the OC to request that, if the massive OBO context was used to infer prefixes, we should automatically call `clean_prefix_map()`?
Incorrect `curie_map` (it leaves out prefixes that are in `metadata`),
ii. If this is true, it is a bug; add it as an action item to the OC.
Should automatically instantiate Converter.
iii. This must already be the case.
(i) Done! (ii) Done!
(iii) It probably is, but let me rephrase: Should correctly instantiate `Converter`.
This is related to #513 so let me add what I mean there (this comment).