mapping-commons/sssom

Similarity Score - usage question

Opened this issue · 7 comments

I'm assisting the development of some mappings within a set of ecosystem and land-use classifications.

The actual mappings are all done manually by subject-matter experts, so the mapping justification is semapv:ManualMappingCuration.

However, many of the mappings are partial in this sense: a class from the source scheme maps to n classes in the target scheme, in known proportions, e.g.

  • 30% of source:Class345 will correspond to target:ClassDFG
  • 10% of source:Class345 will correspond to target:ClassHJK
  • 60% of source:Class345 will correspond to target:ClassZXC

Is this where semantic_similarity_score comes in? Would it be correct to set

  1. predicate_id to skos:relatedMatch (or should it be skos:narrowMatch?)
  2. semantic_similarity_score to 0.3, 0.1, and 0.6 respectively

Or is this all application-dependent, i.e. is it up to us, since we are the ones who will be using the mappings?
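For concreteness, the three rows being asked about could be sketched as below. This only illustrates the question and is not an endorsed SSSOM usage: the column names are SSSOM slots, the `source:`/`target:` CURIEs are the placeholders from the example above, `skos:relatedMatch` is just one of the two predicates under consideration, and the helper names `build_rows` and `to_tsv` are hypothetical.

```python
import csv
import io

# Column names are SSSOM slots; the values restate the example above.
COLUMNS = [
    "subject_id",
    "predicate_id",
    "object_id",
    "mapping_justification",
    "semantic_similarity_score",
]

# Proportion of source:Class345 assigned to each target class.
PROPORTIONS = {
    "target:ClassDFG": 0.3,
    "target:ClassHJK": 0.1,
    "target:ClassZXC": 0.6,
}


def build_rows(subject_id, proportions, predicate_id="skos:relatedMatch"):
    """Return one mapping row (as a dict) per target class."""
    return [
        {
            "subject_id": subject_id,
            "predicate_id": predicate_id,
            "object_id": object_id,
            "mapping_justification": "semapv:ManualMappingCuration",
            "semantic_similarity_score": score,
        }
        for object_id, score in proportions.items()
    ]


def to_tsv(rows):
    """Serialise the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_tsv(build_rows("source:Class345", PROPORTIONS)))
```

A complete SSSOM/TSV file would also carry a metadata header (for example a curie_map), which is left out here.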

I have never wondered about this :D I am not entirely sure. What are the use cases for these kinds of alignments? I mean, what can you practically do with a 10% alignment?

Implementing this solution requires an intermediate assumption that links the membership estimates (0.3, 0.1, 0.6 in your example) to spatial expression. For example, a membership estimate of 0.3 means that, based on the information in the class descriptions, there is a subjective probability of 0.3 that source:Class345 belongs to target:ClassDFG (ideally, subjective probabilities should be estimated by averaging the estimates of several independent subject experts). To give this spatial expression in the way proposed, we need to assume that subjective probabilities of membership are directly related to spatial extent, i.e. membership = 0.3 means that 30% of the mapped extent of source:Class345 occurs [somewhere] within the mapped area of target:ClassDFG, but we do not know which 30% of Class345. In many cases, users may decide the assumption is reasonable for their application, though ideally it should be empirically evaluated with test data.
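The averaging step mentioned above could look roughly like this. It is a minimal sketch with made-up numbers: the three experts and their individual estimates are hypothetical, chosen only so that the means reproduce the 0.3 / 0.1 / 0.6 example, and `mean_membership` is an illustrative name.

```python
from statistics import fmean

# Hypothetical membership estimates from three experts (illustrative only).
EXPERT_ESTIMATES = {
    "expert_a": {"target:ClassDFG": 0.2, "target:ClassHJK": 0.1, "target:ClassZXC": 0.7},
    "expert_b": {"target:ClassDFG": 0.3, "target:ClassHJK": 0.1, "target:ClassZXC": 0.6},
    "expert_c": {"target:ClassDFG": 0.4, "target:ClassHJK": 0.1, "target:ClassZXC": 0.5},
}


def mean_membership(estimates):
    """Average each target class's membership estimate across experts.

    A class that an expert did not mention is treated as an estimate of 0.
    """
    classes = sorted({cls for per_expert in estimates.values() for cls in per_expert})
    return {
        cls: fmean([per_expert.get(cls, 0.0) for per_expert in estimates.values()])
        for cls in classes
    }


if __name__ == "__main__":
    for cls, value in mean_membership(EXPERT_ESTIMATES).items():
        print(f"{cls}\t{value:.2f}")
```

Whether a plain mean is the right way to pool expert judgements is itself an assumption; it is used here only because it is the simplest reading of "averaging".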

To provide a bit more context: the goal is to determine some property (function) of a spatial region, where we have a classification of the region using System 1, but the assessment requires its classification using System 2.

i.e.

  • we have a spatial region classified according to a term from System 1
  • to get an assessment of the region, we have some procedure that we can apply using terms from System 2
  • we know what proportions of the class from System 1 correspond to classes in System 2
  • so we compute an assessment based on the area (?) of the region multiplied by the proportion inferred to be in each class from System 2.

Using a (notional) example

Region q23w

  • has an area 230 sq.km
  • is classified as source:Class345

Using the proportions in the example above, this would mean that
  • 69 sq.km is inferred to be target:ClassDFG
  • 23 sq.km is inferred to be target:ClassHJK
  • 138 sq.km is inferred to be target:ClassZXC

so you do the assessment based on the latter three ...

(of course we don't know which 138 sq.km is ClassZXC, etc., but it is assumed to fall within Region q23w).
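The arithmetic for Region q23w can be written as a small sketch. It assumes, as discussed above, that the proportions translate directly into shares of area and that they sum to 1; the function name `apportion_area` is hypothetical.

```python
import math

# Proportions of source:Class345 falling in each target class (from the example).
PROPORTIONS = {
    "target:ClassDFG": 0.3,
    "target:ClassHJK": 0.1,
    "target:ClassZXC": 0.6,
}


def apportion_area(total_area, proportions):
    """Split a region's area across target classes by the given proportions."""
    if not math.isclose(sum(proportions.values()), 1.0):
        raise ValueError("proportions must sum to 1")
    return {cls: total_area * share for cls, share in proportions.items()}


if __name__ == "__main__":
    # Region q23w: 230 sq.km, classified as source:Class345.
    for cls, area in apportion_area(230, PROPORTIONS).items():
        print(f"{cls}: {area:.0f} sq.km")
```

The result says how much area is inferred for each target class, not where inside the region that area lies.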

This discussion is a tad out of my depth, I am sorry; I hope someone else from the @mapping-commons/sssom-core team can chip in and give feedback. Without understanding this exactly, I would say that such fuzzy matches are out of scope for SSSOM, but this does not have to keep you from using the semantic_similarity fields to record the information. In my view:

  1. confidence captures the level of certainty an agent has in the absolute truthfulness of the mapping (subject, predicate, object).
  2. semantic_similarity_score captures the result of a semantic similarity matching process that gave the mapping agent (the curator, or the tool) grounds to assert the mapping, whose truthfulness is still absolute (i.e. not fuzzy / partial).

Maybe however my internal and your internal model of your question only have a very low "semantic overlap" and what I am saying here is completely off topic 😛

My 2 cents:

  • It seems to me that using semantic_similarity_score for that purpose would be overloading its intended meaning. If you need to do it, I strongly recommend making sure you also fill the semantic_similarity_measure field to point to a resource that makes clear what kind of “score” is actually stored there (in fact, I personally think that semantic_similarity_score should never be used without an accompanying semantic_similarity_measure, no matter what).
  • This might be a case where the use of a non-standard field (“extension slot”) is warranted. We don’t recommend that, for the sake of interoperability, but since those mappings are apparently intended for internal use only (“we are the ones who will use the mappings”), this may be acceptable. The problem is that extension slots cannot really be used for now, because support for them is still experimental in SSSOM-Java and nonexistent in SSSOM-Py. (So if you prepare a mapping set with extension slots and then at some point process the set with SSSOM-Py, the extension slots will be lost.)
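The recommendation in the first point could be enforced with a small check. This is a minimal sketch, assuming the mapping set is SSSOM/TSV text whose metadata lines start with `#`; it uses only the Python standard library rather than SSSOM-Py, and the function name and the example.org measure IRI are hypothetical.

```python
import csv
import io


def rows_missing_measure(tsv_text):
    """Return (subject_id, object_id) pairs that have a score but no measure."""
    # Skip the commented metadata block that precedes the table.
    table_lines = [line for line in tsv_text.splitlines() if not line.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    missing = []
    for row in reader:
        score = (row.get("semantic_similarity_score") or "").strip()
        measure = (row.get("semantic_similarity_measure") or "").strip()
        if score and not measure:
            missing.append((row["subject_id"], row["object_id"]))
    return missing


# Hypothetical IRI describing what the score means (not a real resource).
MEASURE = "https://example.org/measures/areal-proportion"

SAMPLE = "\n".join(
    "\t".join(cells)
    for cells in [
        ["subject_id", "predicate_id", "object_id",
         "semantic_similarity_score", "semantic_similarity_measure"],
        ["source:Class345", "skos:relatedMatch", "target:ClassDFG", "0.3", MEASURE],
        ["source:Class345", "skos:relatedMatch", "target:ClassHJK", "0.1", ""],
        ["source:Class345", "skos:relatedMatch", "target:ClassZXC", "0.6", MEASURE],
    ]
)

if __name__ == "__main__":
    for subject_id, object_id in rows_missing_measure(SAMPLE):
        print(f"score without measure: {subject_id} -> {object_id}")
```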

“I would say that such fuzzy matches are out of scope for SSSOM”

Hmm. That would be disappointing. I doubt it is just a niche concern; it certainly isn't in linguistics. Partial matches are already supported by narrowMatch/broadMatch. We just have an assessment of the proportions of the extension of the source class that match the target classes.

I understand that semantic_similarity_score was devised to capture the result of an automated similarity assessment. But its semantics appear to match our application as well. The semantic_similarity_measure would be something like semapv:ManualMappingCuration again.

Perhaps we just adopt a local convention in the context of our project to use the slots in this way. But I thought it was worth canvassing this list to see if a similar use case had already been encountered.

cthoyt commented

I am not sure using SSSOM to describe the extent of overlap of regions is the right use of SSSOM. This seems like a more general kind of relationship rather than a mapping. From what I understand, the semantic similarity measurement should be something like "ontological similarity", like what https://github.com/related-sciences/nxontology implements, but it's understandable that this is up to interpretation, since the docs are completely empty for https://mapping-commons.github.io/sssom/semantic_similarity_score/