ncss-tech/SoilTaxonomy

explainST for multiword/extragrade subgroups

brownag opened this issue · 7 comments

Since my updates to fix parsing for all subgroups in the #14 "rearrangement", explainST "works" for all modern taxa. However, it still does some funky stuff, I believe at least in part because of the specific definitions in the lookup table for formative elements.

For instance, "acr" is expected to be a great group element, so when it appears in extragrade subgroups, e.g. "acrudoxic plinthic ...", the formatting is wrong: "acr" matches "acrudoxic" and positions the empty space incorrectly. Multiword/extragrade subgroups will probably need some special handling to "explain" them and to define relevant explanations.

> cat(explainST("acrudoxic plinthic kandiudults"))
acrudoxic plinthic kandiudults
|         |        |    | |                                                                         
          presence of plinthite                                                                     
|                  |    | |                                                                         

                   |    | |                                                                         
                   presence of a kandic horizon                                                     
                        | |                                                                         
                        udic SMR                                                                    
                          |                                                                         
                          soils with an argillic or kandic horizon and a base saturation at pH 8.2 <35% at a depth of 180cm

I am fairly sure these types of things can be resolved with some careful manual review of "likely suspects" and random spot checks and mostly making new entries in the lookup table as appropriate to handle unintended matches (#5 related). E.g. "acrudoxic" actually comprises 3 formative elements within itself. Do we match them all and somehow concatenate? Or do we simply define "acrudoxic" as some aggregate formative element with a unique definition?

Adding an "acrudoxic" record to the subgroup formative element dictionary seems to "fix" the formatting, but does not automatically combine the result with the "plinthic" explanation:

acrudoxic plinthic kandiudults
          |        |    | |                                                                         
          presence of plinthite                                                                     
                   |    | |                                                                         
                   presence of a kandic horizon                                                     
                        | |                                                                         
                        udic SMR                                                                    
                          |                                                                         
                          soils with an argillic or kandic horizon and a base saturation at pH 8.2 <35% at a depth of 180cm

Need to figure out @dylanbeaudette's workflow and plans for additions and updates to these dictionaries.

There are at least three sub-issues here:

  • Should formative element dictionaries contain definitions for "atomic elements" (fragi and gloss are split) and / or "compound elements" (fragloss)? I'm not keen on re-implementing a full lexical parsing system for contractions / mixtures of atomic elements, so I'd vote for an entry that covers every unique case. This gets a little more complex at the subgroup level. At some future time we might try implementing a fully generic system for atomic elements.
  • Constraining formative element dictionary searches by level within the ST hierarchy. I haven't looked recently, but I suspect that the search / matching code is too generous. There should be no reason for a match in the GG dictionary when "explaining" the SG terms. For example, "acr" or "acro" should not trigger a match at the SG level (with the current dictionaries).
  • Multi-word SG "explaining". I know this works for many taxa, for example cat(explainST('abruptic haplic durixeralfs')), but I'll have to review the current implementation for the cases given in this issue.
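The second sub-issue (constraining searches by hierarchy level) can be sketched roughly as follows. This is only an illustration, not the package's actual implementation: `toy_dicts` and `match_elements()` are hypothetical names, the dictionaries are toy stand-ins for the real lookup tables, and the "acr" definition is a placeholder.

```r
# Toy dictionaries keyed by hierarchy level (assumption: NOT the package's real tables)
toy_dicts <- list(
  greatgroup = c(acr  = "<great group definition of acr>",  # placeholder text
                 kand = "presence of a kandic horizon"),
  subgroup   = c(plinthic = "presence of plinthite")
)

# Hypothetical helper: search only the dictionary that belongs to `level`
match_elements <- function(term, level, dicts = toy_dicts) {
  d <- dicts[[level]]
  if (is.null(d)) stop("unknown level: ", level)
  # literal (non-regex) substring test for each element of this level's dictionary
  hits <- vapply(names(d), function(el) grepl(el, term, fixed = TRUE), logical(1))
  d[hits]
}

# "acr" lives only in the great group dictionary, so a subgroup search ignores it
match_elements("acrudoxic", level = "subgroup")
match_elements("acrudoxic", level = "greatgroup")
```

With this layout, a spurious "acr" hit at the SG level can only happen if "acr" is actually present in the subgroup dictionary.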

Need to figure out @dylanbeaudette 's workflow and plans for additions and updates to these dictionaries

It is wide-open.

My original goal was to have a simple / concise definition for all taxa, starting with what I could glean from textbooks (Buol et al., Schaetzl and Anderson, etc.). The first draft can (should IMHO) specify simple (atomic) and compound formative elements, favoring longer matches ("acrudoxic" vs. "acr"). This specific example shouldn't be a problem because only the subgroup formative element dictionary will be used on the subgroup chunk of taxa text. Explanations will include a compact description of compound formative elements (mostly relevant at the subgroup).
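A minimal sketch of "favoring longer matches", assuming a single dictionary that holds both atomic and compound elements. `toy_subgroup` and `longest_match()` are hypothetical, and the bracketed definitions are placeholders rather than real dictionary entries.

```r
# Toy dictionary with an atomic and a compound element (assumption: illustrative only)
toy_subgroup <- c(
  acr       = "<definition of acr>",        # placeholder text
  acrudoxic = "<definition of acrudoxic>",  # placeholder text
  plinthic  = "presence of plinthite"
)

# Hypothetical helper: the longest dictionary element that `word` starts with
longest_match <- function(word, dict = toy_subgroup) {
  els  <- names(dict)
  hits <- els[startsWith(word, els)]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]
}

longest_match("acrudoxic")  # prefers "acrudoxic" over the shorter "acr"
```

Picking the longest candidate means a compound entry shadows its atomic parts whenever both are present, so no lexical parsing of contractions is needed.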

Long-term:

  • An extensible set of dictionaries that include at least 3 levels of verbosity and detail: general public, moderate soil knowledge, scientists / educators. An option or argument toggles the output.
  • A more general approach to assigning meaning, possibly transcending levels of the ST hierarchy. This isn't 100% feasible because some formative element definitions are tied to their position in the hierarchy.
  • A more visually concise mechanism for delivering all three (?) levels of detail. This could be in the form of HTML DIV styled by CSS, SVG, or an image. Ideas welcome.
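One way the "option or argument toggles the output" idea from the first bullet could look, as a rough sketch only. The column names (`public`, `intermediate`, `expert`), the `explain_element()` function, and the choice to file "presence of plinthite" under the intermediate level are all assumptions made for illustration; the bracketed strings are placeholders.

```r
# Toy dictionary with one column per verbosity level (assumption: hypothetical layout)
toy_dict <- data.frame(
  element      = "plinthic",
  public       = "<plain-language definition>",   # placeholder text
  intermediate = "presence of plinthite",
  expert       = "<full technical definition>",   # placeholder text
  stringsAsFactors = FALSE
)

# Hypothetical helper: `detail` selects which column is returned
explain_element <- function(element,
                            detail = c("intermediate", "public", "expert"),
                            dict = toy_dict) {
  detail <- match.arg(detail)           # validates the argument; first value is the default
  idx <- match(element, dict$element)
  if (is.na(idx)) return(NA_character_)
  dict[[detail]][idx]
}

explain_element("plinthic")
explain_element("plinthic", detail = "expert")
```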

I'm going to spend a little time right now looking at the non-exported functions related to explainST.

Made some minor upgrades / fixes to the "explanation" of multi-term SG taxa, and added a "?" placeholder for incomplete dictionary entries. Prior code would only flag missing (NA) entries, not empty strings.

cat(explainST('acrudoxic plinthic kandiudults'))
acrudoxic plinthic kandiudults
|         |        |    | |                                                                         
?                                                                                                   
          |        |    | |                                                                         
          presence of plinthite                                                                     
                   |    | |                                                                         
                   presence of a kandic horizon                                                     
                        | |                                                                         
                        udic SMR                                                                    
                          |                                                                         
                          soils with an argillic or kandic horizon and a base saturation at pH 8.2 <35% at a depth of 180cm
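
The NA vs. empty-string distinction can be illustrated with a small sketch. `placeholder_if_missing()` is a hypothetical helper, not the function used in the package; treating whitespace-only strings as blank is an extra assumption.

```r
# Hypothetical helper: substitute "?" for NA, empty, or whitespace-only definitions
placeholder_if_missing <- function(x, placeholder = "?") {
  x <- as.character(x)
  # is.na() catches NA; !nzchar(trimws()) catches "" and whitespace-only strings
  blank <- is.na(x) | !nzchar(trimws(x))
  x[blank] <- placeholder
  x
}

placeholder_if_missing(c("presence of plinthite", NA, ""))
```

A check based on is.na() alone would let the empty string through, which matches the gap described above.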

  • Should formative element dictionaries contain definitions for "atomic elements" (fragi and gloss are split) and / or "compound elements" (fragloss)? I'm not keen on re-implementing a full lexical parsing system for contractions / mixtures of atomic elements, so I'd vote for an entry that covers every unique case. This gets a little more complex at the subgroup level. At some future time we might try implementing a fully generic system for atomic elements.

I don't think a fully generic atomic system is worthwhile at this point--more or less same reason as with your comment below on context-dependent meanings of some elements.

  • Constraining formative element dictionary searches by level within the ST hierarchy. I haven't looked recently, but I suspect that the search / matching code is too generous. There should be no reason for a match in the GG dictionary when "explaining" the SG terms. For example, "acr" or "acro" should not trigger a match at the SG level (with the current dictionaries).

This is about as far as my reasoning went on it. My understanding is the searches are constrained. I didn't take the time to trace it to the origin of the spurious match but rather guessed what it was doing. Just saw your fix!

My original goal was to have a simple / concise definition for all taxa, starting with what I could glean from textbooks (Buol et al., Schaetzl and Anderson, etc.). The first draft can (should IMHO) specify simple (atomic) and compound formative elements, favoring longer matches ("acrudoxic" vs. "acr"). This specific example shouldn't be a problem because only the subgroup formative element dictionary will be used on the subgroup chunk of taxa text. Explanations will include a compact description of compound formative elements (mostly relevant at the subgroup).

That sounds good for a conceptual basis and I support all that.

Thinking back to where I was when I made this issue, I was referring more to the nuts and bolts: having a process for updating the dictionaries, and knowing what the metadata for each column in those tables is.

For instance in subgroup we have:

element, central, intergrade, extragrade, intragrade, derivation, connotation, simplified, link

However, I don't think those columns are currently defined in e.g. the man page for ST_formative_elements.

Essentially: what do they mean, and how do we know when/how to fill them in for the currently-blank ones?

Long-term:

  • An extensible set of dictionaries that include at least 3 levels of verbosity and detail: general public, moderate soil knowledge, scientists / educators. An option or argument toggles the output.

So, in the current datasets, from order to subgroup, we have:

  • derivation which is the etymology
  • the connotation is a fairly concise and easy to read technical meaning
  • we also have simplified which is generally not populated.

  • A more general approach to assigning meaning, possibly transcending levels of the ST hierarchy. This isn't 100% feasible because some formative element definitions are tied to their position in the hierarchy.

Good long term goal, and I think even context-dependent definitions of formative elements could probably be handled eventually.

  • A more visually concise mechanism for delivering all three (?) levels of detail. This could be in the form of HTML DIV styled by CSS, SVG, or an image. Ideas welcome.

There is a lot of potential there. This should probably be a separate issue. Does it relate to #10?

Great. Further discussion / planning would greatly benefit from pen and paper, or at least from being in front of a whiteboard.

I'd like to move "planning" content and ideas into a more permanent location / issues so that we can close this issue--the core problem is (I think) resolved.

I can confirm that 274 out of 275 multiword subgroups work as expected. The one that doesn't work is a subgroup in ST_unique_list that does not seem to exist in the 12th edition keys: "hydric pachic placudands"

I have committed two relevant files in d6cc06a and am closing this issue.