plazi/ggxml2taxpub

use of Taxpub tags after import to SIBiLS


@tcatapano @jgobeill @pruch

Following issue https://github.com/plazi/eBioDiv/issues/28#issuecomment-1063968467 and the discussion at the tech meeting on March 10, the tags are used as follows in the EJT corpus https://github.com/plazi/ggxml2taxpub-treatments/tree/main/level1

image

the tags "563" are used for indexing and creating an article view (in fact a treatment centric view)

  • <tp:nomenclature> is needed to separate the subSubSection including the <tp:taxon-name> of the treatment.

  • <tp:treatment-sec type="reference_group"> is needed to separate all <tp:taxon-name> that are part of the synonymies

  • all other <tp:treatment-sec type=> (those that are not type=reference_group) are not separated. This comes at the cost that certain usages will not be possible, such as looking at the conservation level. At the same time, it makes it possible to state relationships between the nominate taxon and the other <tp:taxon-name> mentioned in the treatment

  • <tp:material-citation> is used
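To make the separation above concrete, here is a minimal sketch of how one treatment could be split into the three groups of taxon names. It is only an illustration, not the SIBiLS import code. Assumptions: the namespace URI bound to `tp:`, the attribute name `sec-type`, the made-up sample treatment, and the hypothetical helper names `names` and `split_treatment`.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI for the tp: prefix; verify against the real files.
TP = "http://www.plazi.org/taxpub"
NS = {"tp": TP}

# Made-up treatment, only for illustration.
SAMPLE = """<tp:taxon-treatment xmlns:tp="http://www.plazi.org/taxpub">
  <tp:nomenclature>
    <tp:taxon-name>Aus bus</tp:taxon-name>
  </tp:nomenclature>
  <tp:treatment-sec sec-type="reference_group">
    <tp:taxon-name>Aus cus</tp:taxon-name>
    <tp:taxon-name>Xus bus</tp:taxon-name>
  </tp:treatment-sec>
  <tp:treatment-sec sec-type="distribution">
    <p>Found together with <tp:taxon-name>Dus eus</tp:taxon-name>.</p>
  </tp:treatment-sec>
</tp:taxon-treatment>"""


def names(elem):
    """Return the whitespace-normalized text of every tp:taxon-name under elem."""
    return [
        " ".join("".join(n.itertext()).split())
        for n in elem.iter(f"{{{TP}}}taxon-name")
    ]


def split_treatment(xml_text):
    """Group taxon names: treated taxon, synonymy, and all other sections."""
    root = ET.fromstring(xml_text)
    result = {"treated": [], "synonymy": [], "other": []}
    nomenclature = root.find("tp:nomenclature", NS)
    if nomenclature is not None:
        result["treated"] = names(nomenclature)
    # Only direct child sections are visited, so nested sections are not counted twice.
    for sec in root.findall("tp:treatment-sec", NS):
        key = "synonymy" if sec.get("sec-type") == "reference_group" else "other"
        result[key].extend(names(sec))
    return result


fields = split_treatment(SAMPLE)
print(fields)
```

Keeping "other" as a single bucket mirrors the choice described above: names from all non-reference_group sections end up together, so their section of origin is lost.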

@jgobeill, @pruch please comment

Thank you for the notes @myrmoteras !

@patruch @jgobeill

A thought regarding the removal of <tp:treatment-sec sec-type>: the advantage of having a set of types is that it would allow more specific searches. For example, if we keep type=conservation, we could ask: which species have a certain conservation status? Which conservation statuses are available? The answer could then simply be what is in the tag.

Would that be helpful for the reuse of treatments in SIBiLS?

For me, the conservation case would be interesting for working with the Red List community at IUCN, the World Conservation Union.

The same applies to type=biology_ecology, which holds all the behavioral content of a treatment and might therefore make searching easier.
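As a rough illustration of what keeping the types would enable, the sketch below indexes section text by sec-type and treated taxon, so that a question like "which conservation statuses are available?" becomes a lookup. Everything here is an assumption for illustration: the namespace URI, the `sec-type` attribute name, the two made-up treatments, and the hypothetical function `index_by_sec_type`. It says nothing about how SIBiLS actually indexes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: namespace URI for the tp: prefix; verify against the real files.
NS = {"tp": "http://www.plazi.org/taxpub"}

# Two made-up treatments, only for illustration.
TREATMENTS = [
    """<tp:taxon-treatment xmlns:tp="http://www.plazi.org/taxpub">
      <tp:nomenclature><tp:taxon-name>Aus bus</tp:taxon-name></tp:nomenclature>
      <tp:treatment-sec sec-type="conservation"><p>Vulnerable.</p></tp:treatment-sec>
      <tp:treatment-sec sec-type="biology_ecology"><p>Nocturnal.</p></tp:treatment-sec>
    </tp:taxon-treatment>""",
    """<tp:taxon-treatment xmlns:tp="http://www.plazi.org/taxpub">
      <tp:nomenclature><tp:taxon-name>Cus dus</tp:taxon-name></tp:nomenclature>
      <tp:treatment-sec sec-type="conservation"><p>Least Concern.</p></tp:treatment-sec>
    </tp:taxon-treatment>""",
]


def text_of(elem):
    """Flatten an element to whitespace-normalized plain text."""
    return " ".join("".join(elem.itertext()).split())


def index_by_sec_type(xml_docs):
    """Map sec-type -> {treated taxon: section text}."""
    index = defaultdict(dict)
    for doc in xml_docs:
        root = ET.fromstring(doc)
        name = root.find("tp:nomenclature/tp:taxon-name", NS)
        taxon = text_of(name) if name is not None else "(unnamed)"
        for sec in root.findall("tp:treatment-sec", NS):
            # A later section of the same type overwrites an earlier one.
            index[sec.get("sec-type", "untyped")][taxon] = text_of(sec)
    return index


index = index_by_sec_type(TREATMENTS)
print(sorted(index))            # which section types exist
print(index["conservation"])    # conservation text per treated taxon
```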

@patruch @jgobeill

Here is another example, from the Handbook of the Mammals of the World:

https://tb.plazi.org/GgServer/taxPubL1/03C36F2EFFFB347EFF11441DF6EF0C5F

In this case, the book uses a set of additional types, such as "activity" or "breeding". These are essentially subtypes of "biology_ecology", and I wonder whether this is something to consider, maybe in a next phase? Maybe we could create a vocabulary of terms that we then use in SIBiLS?

@myrmoteras @patruch This is document representation. For the next prototype, we certainly will try different fields and representation... even if some will be redundant. But you'll be able to choose what representation is most useful.

Just a question: how do you choose these tags like "breeding" or "activity"? Is it something systematic or more ad hoc? For instance, "breeding" could be borrowed from the NCI Thesaurus's definition https://www.ebi.ac.uk/ols/ontologies/ncit/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCIT_C42877.

@patruch there are three answers:

  1. In most cases, we have a list of types that we use in our processing (see the table below).
  2. In the case of the Handbook of the Mammals of the World, for the ca. 6,500 mammal species we additionally use the types that are used throughout the book. They could be grouped as children of type=biology_ecology.
  3. Pensoft uses a different approach. They simply use as the sec-type whatever the author uses as the title of the section, so there is a huge number of ad hoc types.
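One way the three cases could be reconciled is a small vocabulary with parent links. The sketch below is only a thought experiment: the core list is taken from the types named in this thread, the parent assignments for "activity" and "breeding" follow the suggestion made for the Handbook, and the names `CORE_TYPES`, `PARENT`, `normalize` and `resolve` are hypothetical.

```python
# Types named in this thread; not an authoritative vocabulary.
CORE_TYPES = {
    "description", "materials_examined", "reference_group", "distribution",
    "discussion", "diagnosis", "multiple", "etymology", "biology_ecology",
    "notes", "key", "conservation",
}

# Hypothetical child -> parent links, following the Handbook suggestion.
PARENT = {
    "activity": "biology_ecology",
    "breeding": "biology_ecology",
}


def normalize(raw):
    """Lowercase a raw section title and join its words with underscores."""
    return "_".join(raw.strip().lower().replace("-", " ").split())


def resolve(raw):
    """Return (core_type, subtype) for a raw sec-type or section title.

    Titles that match neither list are kept under 'unmapped' for later review.
    """
    term = normalize(raw)
    if term in CORE_TYPES:
        return (term, None)
    if term in PARENT:
        return (PARENT[term], term)
    return ("unmapped", term)


print(resolve("Breeding"))
print(resolve("Biology Ecology"))
print(resolve("Habitat and host plants"))
```

Keeping the original term as the second element means an index could still offer the fine-grained subtype while searching on the core type.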

You can see the distribution of sec-types here:

| DocCount | SectType |
|---:|---|
| 739159 | nomenclature |
| 307445 | description |
| 293299 | materials_examined |
| 264425 | reference_group |
| 258774 | distribution |
| 153970 | discussion |
| 133457 | diagnosis |
| 107163 | multiple |
| 101206 | etymology |
| 36734 | biology_ecology |
| 29415 | notes |
| 18268 | key |

Note: nomenclature is not a sec-type; that row refers to the tp:nomenclature element.

source: sec-types ranking.csv

This is certainly something we should discuss, probably before we make everything accessible. Most of the terms could be mapped to one of the widely used ones, and at the same time we could introduce some hierarchy among the terms.