microbiomedata/DataHarmonizer

bugs in data.tsv (and upstream yaml) from use_modular_gd.py

Closed this issue · 6 comments

  • if the pattern looks like a list, make it an enumeration/pulldown

    • solved by processing MIxS after NMDC because the list-like patterns came from NMDC and get overwritten by MIxS?
    • they are still in MIxS string serialization
  • add hierarchical indentation of enumerated values

    • I have some old code for that somewhere
    • probably requires sem-sql code and database (vs a SPARQL solution)
    • use any ontologies besides EnvO?
    • see #58
  • Add support for partial date columns and time columns

  • align section composition and ordering with @mslarae13's Example Use tab

    • that info will be added to other more structured tabs
  • tidy the descriptions

  • add meanings for enumerated values

    • lookup with enum_annotator (see example in Makefile)
    • expose as Ontology IDs
    • some enum labels from MIxS have added parenthetic content that makes string matching difficult: meadows (grasses,alfalfa,fescue,bromegrass,timothy)
  • add more patterns based on

    • string_serialization
    • slot's range... pretty thorough at this point
  • populate examples column?

  • are terms being included even though they are marked skip in nmdc_biosample_slots?

  • Ontology ID

  • terse labels (from apparent prioritization of NMDC over MIxS annotations?)

  • parent classes with "https:" prefixes

    • long-term solution: align section composition and ordering below
    • short-term solution comes from full URLs (prefer prefixed) above
  • the number of required fields seems too low

    • added requirements from slot usages
  • elaborate on the use of regular expressions in the guidance column. Also include the string serialization?

  • Where is the default PV in the sample_type enum coming from?

    • @click.option('--default_data_status', default="default", show_default=True)
  • what does the Null values section in the double-click header help mean? see cidgoh#244

    • shows the contents of the data status column in data.tsv, which I was populating with --default_data_status
  • take advantage of min and max values for pH (anything else?)

  • whose id-like fields should be used: the ones from NMDC or the ones created by @mslarae13?

    • using identifiers from biosample_identification_slots

MIxS slots with NMDC URLs like https://microbiomedata/schema/mixs/ph

just put the MIxS term requests second in the tasks dict

This also mostly resolves

terse labels (from apparent prioritization of NMDC over MIxS annotations?)

Improvements to the NMDC terms could be made through changes to the nmdc schema, or through curation in the nmdc_biosample_slots tab (as we will probably do for the descriptions).

Ontology ID/full URLs (prefer prefixed)

example: https://microbiomedata/schema/ecosystem

would prefer for that to appear as nmdc:ecosystem

Current definition of the nmdc prefix:

nmdc:
    prefix_prefix: nmdc
    prefix_reference: https://microbiomedata/meta/
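
To make the mismatch concrete, here is a minimal sketch of URL-to-prefix contraction. The `contract` helper is hypothetical (it is not part of use_modular_gd.py or any library), and the `https://microbiomedata/schema/` reference in the second map is an assumption used only to show what would produce nmdc:ecosystem. With the prefix_reference shown above, the example URL does not start with the reference, so it stays a full URL.

```python
def contract(url: str, prefixes: dict[str, str]) -> str:
    """Return 'prefix:local' if url starts with a known reference, else url unchanged."""
    # Try the longest references first so the most specific prefix wins.
    ordered = sorted(prefixes.items(), key=lambda kv: len(kv[1]), reverse=True)
    for prefix, reference in ordered:
        if url.startswith(reference) and len(url) > len(reference):
            return f"{prefix}:{url[len(reference):]}"
    return url


# Prefix map as currently defined in the schema.
current = {"nmdc": "https://microbiomedata/meta/"}
# Hypothetical alternative reference, for illustration only.
hypothetical = {"nmdc": "https://microbiomedata/schema/"}

url = "https://microbiomedata/schema/ecosystem"
print(contract(url, current))       # https://microbiomedata/schema/ecosystem
print(contract(url, hypothetical))  # nmdc:ecosystem
```

Whether the fix belongs in the prefix definition or in the slot URIs is a schema decision; the sketch only shows why the current pair cannot contract.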

enums included so far, with enrichment success

  • cur_land_use_enum medium, especially if \(.*$ is removed before searching
  • drainage_class_enum poor
  • fao_class_enum good
  • profile_position_enum errors out on first term, backslope
  • soil_horizon_enum poor
  • tillage_enum errors out on first term, chisel
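
A minimal sketch of the label cleanup mentioned for cur_land_use_enum, using the same `\(.*$` pattern. The `clean_label` name is hypothetical, and this is not necessarily how enum_annotator should be fed; it only shows what the pattern removes from the MIxS label quoted in the task list.

```python
import re

# An opening parenthesis and everything after it, to the end of the string.
PAREN_TAIL = re.compile(r"\(.*$")


def clean_label(label: str) -> str:
    """Drop a trailing parenthetic note and any surrounding whitespace."""
    return PAREN_TAIL.sub("", label).strip()


label = "meadows (grasses,alfalfa,fescue,bromegrass,timothy)"
print(clean_label(label))  # meadows
```

Labels without parentheses pass through unchanged, so the cleanup can be applied to every permissible value before the search.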

Tabulation of ranges for NMDC and MIxS as-is only

To-do

external identifier       3
double                    1

enums

cur_land_use_enum         1
drainage_class_enum       1
fao_class_enum            1
profile_position_enum     1
soil_horizon_enum         1
tillage_enum              1

Handled already

string                   39
    xsd:token by default
quantity value           16
date                      3
    xsd:date
    see notes above regarding partial dates and times

New tabulations:

TABULATION OF SLOT RANGES, for prioritizing range->regex conversion
string                           38
quantity value                   16
external identifier               3
date                              3
oxygen_relationship_enum          1
storage_condt_enum                1
soil_horizon_enum                 1
sample_type_enum                  1
samp_biotic_relationship_enum     1
profile_position_enum             1
fao_class_enum                    1
growth_facility_enum              1
env_package_enum                  1
drainage_class_enum               1
cur_land_use_enum                 1
analysis_type_enum                1
tillage_enum                      1
dtype: int64
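
For reference, a tabulation of this kind can be sketched with the standard library alone. The output above looks like a pandas value_counts result, but that is an assumption; the slot names below are hypothetical placeholders, not real NMDC or MIxS slots.

```python
from collections import Counter

# Hypothetical slot -> range pairs, for illustration only.
slot_ranges = {
    "slot_a": "string",
    "slot_b": "string",
    "slot_c": "quantity value",
    "slot_d": "date",
    "slot_e": "string",
}


def tabulate(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


for name, count in tabulate(slot_ranges.values()):
    print(f"{name:<20}{count:>4}")
```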


TABULATION OF STRING SERIALIZATIONS, for prioritizing serialization->regex conversion
<none>                                                                                                 30
{PMID}|{DOI}|{URL}                                                                                     20
{text}                                                                                                  8
enumeration                                                                                             7
{float} {unit}                                                                                          4
{termLabel} {[termID]}                                                                                  4
{text};{float} {unit}                                                                                   3
{integer}                                                                                               2
{timestamp}                                                                                             2
{text}:{text}                                                                                           2
{text};{timestamp}                                                                                      1
[summit|shoulder|backslope|footslope|toeslope]                                                          1
{PMID}|{DOI}|{URL}|{text}                                                                               1
{termLabel} {[termID]}|{text}                                                                           1
{{text}|{float} {unit}};{float} {unit}                                                                  1
[O horizon|A horizon|E horizon|B horizon|C horizon|R layer|Permafrost]                                  1
{float}                                                                                                 1
{float} C                                                                                               1
{float} {float}                                                                                         1
{termLabel} {[termID]}; {timestamp}                                                                     1
{term}: {term}, {text}                                                                                  1
[Acrisols|Andosols|Arenosols|Cambisols|Chernozems|Ferralsols|Fluvisols|Gleysols|Greyzems|Gypsisols|     1
[very poorly|poorly|somewhat poorly|moderately well|well|excessively drained]                           1
[cities|farmstead|industrial areas|roads/railroads|rock|sand|gravel|mudflats|salt flats|badlands|pe     1
{boolean};{Rn/start_time/end_time/duration}                                                             1
HH:MM:SS                                                                                                1
{text};{float} {unit};{timestamp}                                                                       1
[drill|cutting disc|ridge till|strip tillage|zonal tillage|chisel|tined|mouldboard|disc plough]         1
dtype: int64
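
A rough sketch of the serialization->regex idea, including the "looks like a list" check from the first task. The placeholder fragments in `PLACEHOLDERS` are assumptions chosen for illustration, not the mapping used in use_modular_gd.py, and both function names are hypothetical. Serializations with placeholders the table does not cover (for example {timestamp} or {termLabel} {[termID]}) return None rather than a guessed pattern.

```python
import re

# Assumed regex fragments for a few placeholders; not the project's actual mapping.
PLACEHOLDERS = {
    "{float}": r"[-+]?\d+(\.\d+)?",
    "{integer}": r"[-+]?\d+",
    "{unit}": r"\S+",
    "{text}": r".+",
}


def list_values(serialization: str):
    """Return the values of a list-like pattern such as [a|b|c], else None."""
    s = serialization.strip()
    if s.startswith("[") and s.endswith("]") and "|" in s:
        return [value.strip() for value in s[1:-1].split("|")]
    return None


def to_regex(serialization: str):
    """Build an anchored regex, or return None if an unknown placeholder remains."""
    pieces = []
    # Split on {placeholder} tokens while keeping the tokens themselves.
    for part in re.split(r"(\{[^{}]*\})", serialization):
        if part in PLACEHOLDERS:
            pieces.append(PLACEHOLDERS[part])
        elif "{" in part or "}" in part:
            return None  # unknown or nested placeholder: do not guess
        else:
            pieces.append(re.escape(part))  # literal text such as ';' or ' C'
    return "^" + "".join(pieces) + "$"


print(list_values("[summit|shoulder|backslope|footslope|toeslope]"))
print(to_regex("{float} {unit}"))
print(to_regex("{termLabel} {[termID]}"))  # None
```

List-like serializations would become enumerations (pulldowns), and the rest would become patterns only when every placeholder is known.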

Looks like all tasks here are complete except for the one about hierarchical enums, which is covered by other issues. Closing.