[Phaleron-SexEstimation] Necessity of data transformations in Sex Estimation module

Question

Closed this issue 2 years ago · 22 comments

In earlier communications about the sex-estimation module, @cuboideum and I decided on a data model. Speaking practically, the value relevant for a user of the module would not be the category label, as that would effectively be a description of a morphological state. Instead, of interest is the integer value arbitrarily assigned to that category label, either 1-3 or 1-5 depending on method, as well as 9 for unobservable. Thus the ontologies currently contain data transformation processes and their resulting data items, e.g. "NuchalCrestSexMorphologyNumericScore". Only this numeric data item would be recorded in AnthroGraph, see image below.

It was discussed during a meeting with the Phaleron group that entering the integer 9 for unobservable is undesirable, and the entry form should instead show the string "unobservable". This can be easily achieved by switching to the category labels. As I have noticed, reading the rdfs:label of the category labels (which are always just the integer save for "unobservable") results in SPARQL exporting the rdfs:label as an integer anyway, not as a string. This makes the conversion to integer unnecessary, and I could implement the following model in AnthroGraph:

My questions are as follows:

Is this method acceptable, or should the data transformation item be kept for the dataset in AnthroGraph? It should be possible to have the labels appear in the valuesetpattern but have the integers be saved in the dataset, or even have both saved.

The issue then is that the integers will be numeric data items while unobservable will still have to be saved as a category label, since we likely don't want a seemingly random 9 integer to show up in the data when you entered the string "unobservable". I really don't like the idea of mixing the two.

If we no longer use the data transformation, it would make sense to delete these from the ontologies, but I think it makes sense to keep them, as for some methods calculations are expected to be done (which we will not do in this implementation either way) and for the investigation model it would be necessary to show that there is a transformation process where the integer comes from; the calculation is not actually done with the category label. Agreed?

Answer 1 · 2022-09-02T09:04:51.000Z

(I made a small mistake in the graphs: the is_about statement for the ROI should obviously be connected to the morphology class, not the value specification.)

Answer 2 · 2022-09-05T08:19:38.000Z

The central point is that there exist two actions that are carried out by observers:

1. Make an observation
2. Code the observation according to the Phaleron scoring key

It makes sense to separate the two processes because the observations could be recoded according to a different code book for a different kind of investigation. Of course, it would be possible to recode the Phaleron codes directly, but I would argue that it is more important what observation a code represents, not what the coding is.

Answer 3 · 2022-09-05T08:28:21.000Z

The central point is that there exist two actions that are carried out by observers:

1. Make an observation
2. Code the observation according to the Phaleron scoring key

It makes sense to separate the two processes because the observations could be recoded according to a different code book for a different kind of investigation. Of course, it would be possible to recode the Phaleron codes directly, but I would argue that it is more important what observation a code represents, not what the coding is.

As a consequence, I would argue that "Walker's nuchal crest assigned morphology value specification" should have category labels that are represented by individuals that contain a definition of what observation would constitute this stage. If it was considered helpful to assign the label "Nuchal crest score 1", this would not be a numeric code for potential use in some database but an identifier for a standardised observation with published directives. A label referring to the nature of the observation would be more helpful, though.
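The separation of the two actions can be sketched roughly as follows. This is only an illustration in Python, not part of the ontology: the enum members stand in for the category-label individuals, `PHALERON_KEY` follows the 1-5 plus 9 coding mentioned in the question, and `COARSE_KEY` is an invented second code book that exists purely to show recoding.

```python
from enum import Enum


class NuchalCrestStage(Enum):
    """Step 1: what the observer saw. Labels are placeholders, not published wording."""
    STAGE_1 = "Nuchal crest score 1"
    STAGE_2 = "Nuchal crest score 2"
    STAGE_3 = "Nuchal crest score 3"
    STAGE_4 = "Nuchal crest score 4"
    STAGE_5 = "Nuchal crest score 5"
    UNOBSERVABLE = "unobservable"


# Step 2: a code book maps each observation to a code.
# Assumption: integers 1-5 plus 9 for unobservable, as described in the question.
PHALERON_KEY = {
    NuchalCrestStage.STAGE_1: 1,
    NuchalCrestStage.STAGE_2: 2,
    NuchalCrestStage.STAGE_3: 3,
    NuchalCrestStage.STAGE_4: 4,
    NuchalCrestStage.STAGE_5: 5,
    NuchalCrestStage.UNOBSERVABLE: 9,
}

# Hypothetical second code book for a different kind of investigation.
COARSE_KEY = {
    NuchalCrestStage.STAGE_1: "low",
    NuchalCrestStage.STAGE_2: "low",
    NuchalCrestStage.STAGE_3: "mid",
    NuchalCrestStage.STAGE_4: "high",
    NuchalCrestStage.STAGE_5: "high",
    NuchalCrestStage.UNOBSERVABLE: "n/a",
}


def code_observation(observation, code_book):
    """Apply a code book to a stored observation."""
    return code_book[observation]


observed = NuchalCrestStage.STAGE_4
print(code_observation(observed, PHALERON_KEY))  # 4
print(code_observation(observed, COARSE_KEY))    # high
```

Because the observation is stored, not the code, the same record can be recoded without going through the Phaleron integers first.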

Answer 4 · 2022-09-05T08:29:57.000Z

Rather than debating how to represent observations, I would question if the second action, coding the observation according to some scheme, needs to be modelled in detail. If the objective was to economise on triples, I would rather scrap that.

Answer 5 · 2022-09-06T09:22:59.000Z

The issue was discussed in an RDFBones/AnthroGraph meeting on 6 September 2022. The discussion is to be continued during the next meeting on 13 September 2022.

Answer 6 · 2022-09-13T17:38:46.000Z

The discussion on 13 Sept 2022 concluded that the value specifications are not categorical (though they could theoretically be understood as such), nor are they scalar items as they have no unit. According to the methodology they should rather be understood as a numeric estimate of sorts.
There is no established class in RDFBones or OBO for such an item.
The value specification class for this module will be renamed to 'assigned morphology numeric score value specification'. It is possible to save an integer to this proposed numeric estimate class, which was agreed to be sufficient to represent the intervals in the sex estimation methodologies.
However, the issue of the 'unobservable' datum then remains unresolved. It is not possible to add a null value to xsd:integer. An alternative must be found that allows the category label 'unobservable'.
This problem will be reviewed during the next meeting on 20 Sept 2022.
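To make the 'unobservable' problem concrete, here is a minimal Python sketch. It assumes only that the lexical form of xsd:integer is an optional sign followed by decimal digits; the function names and the choice of `None` for 'unobservable' are illustrative and do not describe how AnthroGraph handles form entries.

```python
import re
from typing import Optional

# Lexical form of xsd:integer: optional sign, then one or more digits.
XSD_INTEGER = re.compile(r"[+-]?[0-9]+")


def is_xsd_integer(lexical: str) -> bool:
    """True if the string could be stored as an xsd:integer literal."""
    return XSD_INTEGER.fullmatch(lexical) is not None


def parse_form_entry(entry: str) -> Optional[int]:
    """Hypothetical form handler: an int for a score, None for 'unobservable'."""
    if entry == "unobservable":
        return None  # has no place in an xsd:integer literal
    if not is_xsd_integer(entry):
        raise ValueError(f"not a valid score entry: {entry!r}")
    return int(entry)


print(is_xsd_integer("3"))               # True
print(is_xsd_integer("unobservable"))    # False
print(is_xsd_integer(""))                # False
print(parse_form_entry("unobservable"))  # None
```

The `None` branch is the part that has no counterpart in a purely numeric value specification, which is why a category label (or some other non-integer representation) is still needed.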

Answer 7 · 2022-09-21T10:50:16.000Z

@zarquon42b has found a potential method to represent null values which may be of use.

An issue has been opened on the OBI repository where this problem is discussed via a simplified version using the scalar value specification: obi-ontology/obi/issues/1599

Answer 8 · 2022-09-27T14:11:27.000Z

This topic was discussed again today in a meeting, taking into account the feedback received in obi issue #1599. Though the changes proposed for a failed planned process would fit this issue in theory, they do not help us with our dataset implementation issue.

@zarquon42b and I share the opinion that the easiest solution is to revert to categorical value specifications with added-on integer values. This is not perfect, but there appears to be no better alternative. As for the main question of this issue, I believe the data transformation processes are unnecessary. The integer value is inherent in the description of the morphology categories as per Klales, Walker, etc. There is therefore no need to transform them. @cuboideum do you agree?

Answer 9 · 2022-09-28T12:36:38.000Z

As a workaround for the current case, I do agree. But there are two points that I think are important:

1. This does not solve the problem of NA values with numeric values.
2. The 'score' categories are a completely different type of data items from the categories that have been defined so far. As their definition is dependent on the overall variability of the scored trait within the skeletal material under investigation, identical labels from different investigations are not compatible. It would be helpful to separate compatible and incompatible category labels in the RDFBones core ontology. But we will need a philosophical discussion about how this distinction can be framed.

Answer 10 · 2022-09-29T08:53:47.000Z

> As their definition is dependent on the overall variability of the scored trait within the skeletal material under investigation

This may be something we should ask the Phaleron group. I think this is explicitly mentioned in Walker, but I understood Phenice's descriptions to be population-independent. It would make sense for Walker to be more flexible since it concerns the skull, not the less variable pelvis. I'm not sure about Klales and Standards, I'd have to look at the publications again.
I think our time is better spent on the implementation aspect of this; the Phaleron group is much more knowledgeable on this subject and they may have adapted the methods. I'd recommend this as an issue for the next meeting.

Answer 11 · 2022-09-29T09:15:54.000Z

But what is the baseline for the expression? Traits can look very dissimilar in different populations. So for it to be universal, there should be objective criteria that identify each score which is then applicable to all humans.

Answer 12 · 2022-09-29T11:15:21.000Z

> Traits can look very dissimilar in different populations.

As I understand Phenice, they would dispute this very claim.
As I said, I think we should ask the Phaleron group, as I share your opinion that the descriptions offered in the publication cannot be understood universally.

Answer 13 · 2022-09-29T14:26:21.000Z

> As I understand Phenice, they would dispute this very claim.

Then go and have a look at cranial traits in indigenous Australians and compare that with traits in Europe.

Answer 14 · 2022-09-29T14:29:42.000Z

> Then go and have a look at cranial traits in indigenous Australians and compare that with traits in Europe.

Phenice concerns the coxal bones, not the cranium.

Answer 15 · 2022-09-29T14:30:16.000Z

Here is a paper I coauthored on a simple trait like frontal bone inclination in several populations, and you can see that male values in some populations correspond to female ones in others. Using the global range, almost all males in one population would be scored as feminine or vice versa.

Answer 16 · 2022-09-29T14:30:41.000Z

Ah, but this applies to all sexing traits, doesn't it?

Edit: I thought of the examples you showed regarding mastoid process prominence.

Answer 17 · 2022-09-29T14:38:17.000Z

To quote Phenice:

Table 2 indicates that there are some racial differences in the accuracy of the technique. Though these are not great, it should be born in mind that a slightly lower accuracy is to be expected when sexing remains of a population with which the researcher is less well acquainted.

I read this initially as meaning "it's good enough, you don't really have to worry about race". But it does effectively admit that the descriptions are context-dependent and you actually do have to worry about the population, so we should treat it that way.

Answer 18 · 2022-09-29T14:41:15.000Z

No, this is actually just a sorry excuse and supports my conjecture that the categories are not really defined properly and rather left to the "acquaintance of the researcher with the population".
It means that they are somewhat aware that their method is not for sexing humans but rather a specific population. There are no hard criteria, when to apply which scoring.

Answer 19 · 2022-09-29T14:47:23.000Z

Yes, agreed. I retract my thoughts on Phenice, I misinterpreted the text. So this problem applies to all sex estimations, and it will likely re-surface in age estimation due to environmental factors. As said, we should discuss this ontological aspect eventually.

Answer 20 · 2022-10-04T15:51:54.000Z

This issue was discussed with the Phaleron group today. The group understands the categories to be absolute descriptions; any researcher using them on any population sample should in theory come to the same categorical data item as any other researcher. Differences in population morphologies are addressed via altering the linear regression calculation, particularly for cranial traits, which are far less reliable. A researcher may use a reliable method as a benchmark to decide whether another method giving dubious results should be considered lower priority, as was done with the Walker method in Phaleron.

I therefore suggest handling the sex morphology intervals as categorical data items. They will have an integer value label attached in the following way:

I am as yet unsure how exactly to label the category labels, which is why they are inconsistent in the image above. A descriptive name is of course expected, but it is difficult to put the concept into a single rdfs:label. It will likely end up being something to the effect of "Morphology Interval 1" etc.

Answer 21 · 2022-10-11T10:53:42.000Z

This issue was discussed again today during an RDFBones group meeting. It was agreed that clarity of data has higher priority than reducing the number of triples in the dataset. The data transformation process from the categorical data item to the integer value should be included in the dataset for clarity.

I therefore propose the following scheme, exemplified here via Phenice's ischiopubic ramus ridge, where the data item derived from the transformation process is saved as a separate numeric data entity:

@cuboideum Is this acceptable? There may be a more fitting predicate to use than "is about" for the numeric data item, but as I recall, adding a second value specification would not be possible.

Answer 22 · 2022-10-18T16:12:54.000Z

This issue was discussed again today during an RDFBones group meeting. The proposed model is acceptable for the dataset. The ontology will include the data transformation processes. For clarity, an 'is about' statement is added from the numeric score to the ROI to ensure that the integer can more easily be linked to its corresponding specimen:
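For readers without access to the diagram, the agreed structure can be approximated in plain Python as a list of triples. Every identifier here (the `ex:` names, `has_value`, and the score 2) is a placeholder chosen for illustration; the actual class and property names are those in the RDFBones ontologies and may differ.

```python
# Each triple is (subject, predicate, object). All names are hypothetical.
TRIPLES = [
    # Categorical data item recorded by the observer.
    ("ex:ramusRidgeCategory", "rdf:type", "ex:CategoricalDataItem"),
    ("ex:ramusRidgeCategory", "is_about", "ex:roi1"),
    # Data transformation from the category to the integer score.
    ("ex:transformation1", "has_specified_input", "ex:ramusRidgeCategory"),
    ("ex:transformation1", "has_specified_output", "ex:ramusRidgeScore"),
    # Numeric score, linked directly to the ROI by the added 'is about'.
    ("ex:ramusRidgeScore", "rdf:type", "ex:NumericDataItem"),
    ("ex:ramusRidgeScore", "is_about", "ex:roi1"),
    ("ex:ramusRidgeScore", "has_value", 2),
]


def numeric_scores_about(triples, roi):
    """Return {data item: integer} for items that are directly about the ROI."""
    about_roi = {s for s, p, o in triples if p == "is_about" and o == roi}
    return {
        s: o
        for s, p, o in triples
        if p == "has_value" and s in about_roi
    }


print(numeric_scores_about(TRIPLES, "ex:roi1"))  # {'ex:ramusRidgeScore': 2}
```

With the direct 'is about' statement the lookup is a single hop from the ROI; without it, a query would have to walk back through the transformation process and the categorical data item to reach the specimen.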