turbomam/md2tsvs

ideal output for reviewing LinkML enums?

turbomam opened this issue · 3 comments

Especially enums that have meaning added with linkml_model_enrichment/annotators/enum_annotator.py

  1. Are there multiple enums with the same meaning?
    • If so, how should they be repaired? Make some of the enum names synonyms?
  2. Which enums have no meaning assigned?
  3. Which enum meanings are suspect because the name and the meaning-based description are lexically different? How large a difference is noteworthy?
    • A change in one letter or digit in a strain (organism) might indicate an entirely wrong meaning assignment
    • But meaning can be assigned based on a synonym, in which case the name and description could be entirely different
    • What string distance metric should we use? Cosine? SIFT4?
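The three review questions above could be sketched roughly as follows. This is only an illustration, not what enum_annotator.py or md2tsvs does: the `permissible_values` dict, the `EX:` CURIEs, the function names, and the 0.8 threshold are all made up for the example, and `difflib.SequenceMatcher` from the standard library stands in for whichever metric (cosine, SIFT4, ...) ends up being chosen.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical data shaped like a LinkML enum's permissible values:
# value name -> {"meaning": CURIE or None, "description": label for that CURIE}.
# The EX: identifiers are placeholders, not real ontology terms.
permissible_values = {
    "Escherichia coli": {"meaning": "EX:1", "description": "Escherichia coli"},
    "E. coli": {"meaning": "EX:1", "description": "Escherichia coli"},
    "Bacillus subtilis 168": {
        "meaning": "EX:2",
        "description": "Bacillus subtilis subsp. subtilis str. 168",
    },
    "Unknown strain X": {"meaning": None, "description": None},
}


def shared_meanings(pvs):
    """Question 1: meanings that are assigned to more than one name."""
    by_meaning = defaultdict(list)
    for name, pv in pvs.items():
        if pv.get("meaning"):
            by_meaning[pv["meaning"]].append(name)
    return {m: names for m, names in by_meaning.items() if len(names) > 1}


def missing_meanings(pvs):
    """Question 2: names with no meaning assigned."""
    return [name for name, pv in pvs.items() if not pv.get("meaning")]


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suspect_meanings(pvs, threshold=0.8):
    """Question 3: names whose description is lexically far from the name."""
    suspects = []
    for name, pv in pvs.items():
        description = pv.get("description")
        if pv.get("meaning") and description:
            score = similarity(name, description)
            if score < threshold:
                suspects.append((name, description, round(score, 2)))
    return suspects


if __name__ == "__main__":
    print("shared:", shared_meanings(permissible_values))
    print("missing:", missing_meanings(permissible_values))
    for row in suspect_meanings(permissible_values):
        print("suspect:", row)
```

A single threshold cuts both ways here: a synonym-based match such as "E. coli" scores low even though it is correct, while two strain names that differ by one digit score high even though the assignment may be wrong. A report that shows the raw score next to both strings, rather than a pass/fail flag, is probably easier to review.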
python linkml_model_enrichment/annotators/enum_annotator.py \
--modelfile availabilities.yaml \
--tabular_outputfile mapping.log \
--ontoprefix NCBITaxon \
--enum_list strained_enum \
--replaced_chars Z

Hopefully the enum names don't contain any Zs! I wanted to be sure not to drop _s or -s.

poetry run python md2tsvs/md2tsvs.py \
--mdfile ../synbio-schema/handcrafted/generated/docs/binomial_name_enum.md \
--distcol0 0 --distcol1 1 \
--static_table_num 1
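For discussion, here is a minimal sketch of the kind of output this could produce: read a pipe table from a markdown file, compare two columns, and write a TSV with an extra score column. It is not the actual md2tsvs.py code. The table layout, the function names, and the `col0`/`col1` parameters are assumptions, and cosine similarity over character bigrams is just one of the candidate metrics mentioned above.

```python
import csv
import io
import math
from collections import Counter


def parse_md_table(md_text):
    """Return the rows (header first) of the first pipe table in md_text."""
    rows = []
    for line in md_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if rows:
                break  # the first table has ended
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the |---|---| separator row under the header.
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


def bigrams(text):
    """Count overlapping two-character substrings, ignoring case."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a, b):
    """Cosine similarity of the bigram count vectors of a and b."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def table_to_tsv(rows, col0=0, col1=1):
    """Write rows as TSV text, appending the similarity of col0 vs col1."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header, *body = rows
    writer.writerow(header + ["cosine_similarity"])
    for row in body:
        score = cosine_similarity(row[col0], row[col1])
        writer.writerow(row + [f"{score:.3f}"])
    return out.getvalue()


# Made-up example table; the EX: CURIEs are placeholders.
example_md = """
| Text | Description | Meaning |
| --- | --- | --- |
| Escherichia coli | Escherichia coli | EX:1 |
| E. coli | Escherichia coli | EX:1 |
"""

if __name__ == "__main__":
    print(table_to_tsv(parse_md_table(example_md)))
```

Sorting the resulting TSV by the score column would put the most lexically different name/description pairs at the top for manual review.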

Note that some strains aren't getting valid meaning assignments because NCBI hasn't recorded many strains for the organism, like https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=4952&lvl=3&lin=f&keep=1&srchmode=1&unlock