nfdi4plants/ARCTokenization

[Discussion] AnnotationTable.TokenizedAnnotationTable

HLWeil opened this issue · 4 comments

HLWeil commented

I think we should reconsider the current design of this type, as it's in kind of an awkward state:

Currently it is split into a list of IO columns and a list of Term columns. This is a problem in two ways, given the currently proposed state of the ARC specification 1.2:

  1. What about non-term and non-IO columns like Protocol REF?
  2. There MUST be at most 1 Input and 1 Output Column, so a list seems counterintuitive.

As an alternative to designing this in some specific way, we could also keep it more naive and just have a single list of columns (including terms, IOs and whatever else)?
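
To make the "naive" option concrete, here is a rough sketch of a single flat column list, written in Python purely for illustration (the library itself is F#). All names here (`ColumnKind`, `Column`, `TokenizedTable`, `validate`) are made up for this sketch and are not part of ARCTokenization or ARCtrl. The "at most one Input/Output" rule becomes a check on the list rather than something encoded in the type:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ColumnKind(Enum):
    # Hypothetical grouping; OTHER covers non-term, non-IO columns
    # such as Protocol REF.
    INPUT = "Input"
    OUTPUT = "Output"
    TERM = "Term"
    OTHER = "Other"


@dataclass
class Column:
    kind: ColumnKind
    header: str
    cells: List[str] = field(default_factory=list)


@dataclass
class TokenizedTable:
    # One flat list instead of separate IO and term lists.
    columns: List[Column] = field(default_factory=list)

    def of_kind(self, kind: ColumnKind) -> List[Column]:
        return [c for c in self.columns if c.kind is kind]

    def validate(self) -> List[str]:
        """Report violations of the 'at most 1 Input / 1 Output' rule."""
        errors = []
        for kind in (ColumnKind.INPUT, ColumnKind.OUTPUT):
            n = len(self.of_kind(kind))
            if n > 1:
                errors.append(
                    f"expected at most 1 {kind.value} column, found {n}"
                )
        return errors
```

The trade-off is that the cardinality constraint is only checked at runtime, but columns like Protocol REF fit in without a third list.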

#25

I am thinking of a complete rework of the parsing. I think we should use ARCtrl's composite column model.

iirc ARCtrl parses annotation tables like this:

  • pattern match and assign grouping
  • everything not assigned is Freetext

If that is true, then it should be easy to use for tokenization as well, by filling these composite columns with CvParams in an additional step.
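
A rough sketch of that two-step idea, in Python for illustration only. This is not ARCtrl's actual parser or API: the header patterns are a simplified assumption, and `Token` is a hypothetical stand-in for a CvParam, not the real type.

```python
import re
from typing import List, NamedTuple, Tuple

# Assumed, simplified header patterns; the real matching rules live in ARCtrl.
HEADER_PATTERNS = [
    ("Input", re.compile(r"^Input \[(?P<name>.+)\]$")),
    ("Output", re.compile(r"^Output \[(?P<name>.+)\]$")),
    ("Term", re.compile(
        r"^(Parameter|Characteristics?|Factor|Component) \[(?P<name>.+)\]$")),
    ("ProtocolREF", re.compile(r"^Protocol REF$")),
]


class Token(NamedTuple):
    # Hypothetical stand-in for a CvParam: group, term name, cell value.
    group: str
    name: str
    value: str


def classify(header: str) -> Tuple[str, str]:
    """Step 1: pattern match the header and assign a grouping."""
    for group, pattern in HEADER_PATTERNS:
        m = pattern.match(header)
        if m:
            # Patterns without a bracketed name fall back to the full header.
            return group, m.groupdict().get("name") or header
    # Everything that is not assigned is Freetext.
    return "Freetext", header


def tokenize(headers: List[str], rows: List[List[str]]) -> List[Token]:
    """Step 2: fill the grouped columns with tokens, one per cell."""
    groups = [classify(h) for h in headers]
    return [
        Token(group, name, cell)
        for row in rows
        for (group, name), cell in zip(groups, row)
    ]
```

For example, `classify("My notes")` falls through all patterns and comes back as `("Freetext", "My notes")`, so no column is ever dropped.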

Sounds good? @HLWeil

HLWeil commented

Yup that's pretty much it.

Sounds fine to me, provided that it doesn't fail in some specific cases, which should be checked. But as a starting point for getting your tokens for further use it should be good!

Closing this as we now use ARCtrl's ARCTable parser, whose output we then tokenize. See #48