[Discussion] AnnotationTable.TokenizedAnnotationTable
HLWeil opened this issue · 4 comments
I think we should reconsider the current design of this type, as it is in a somewhat awkward state. Currently it is split into a list of IO columns and a list of Term columns. This causes two problems with respect to the currently proposed state of the ARC specification 1.2:
- What about non-term and non-IO columns like Protocol REF?
- There MUST be at most 1 Input and 1 Output column, so a list seems counterintuitive.
As an alternative to designing this in some specific way, we could keep it more naive and just have a single list of columns (including terms, IOs and whatever else)?
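To make the "single list of columns" idea concrete, here is a minimal Python sketch. It is not the existing type and not ARCtrl code: `ColumnKind`, `Column` and `validate_columns` are hypothetical names, and the only rule it checks is the "at most 1 Input and 1 Output" constraint mentioned above.

```python
from dataclasses import dataclass
from enum import Enum


class ColumnKind(Enum):
    # Hypothetical classification, for illustration only.
    INPUT = "Input"
    OUTPUT = "Output"
    TERM = "Term"
    OTHER = "Other"  # e.g. Protocol REF


@dataclass
class Column:
    kind: ColumnKind
    header: str
    cells: list[str]


def validate_columns(columns: list[Column]) -> list[str]:
    """Return a list of problems; an empty list means the IO constraint holds."""
    problems = []
    for kind in (ColumnKind.INPUT, ColumnKind.OUTPUT):
        count = sum(1 for column in columns if column.kind is kind)
        if count > 1:
            problems.append(
                f"expected at most 1 {kind.value} column, found {count}"
            )
    return problems
```

With this shape the cardinality rule lives in a check rather than in the type itself, which is the trade-off of keeping the model naive.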
I am thinking of a complete rework of the parsing. I think we should use ARCtrl's composite column model.
IIRC, ARCtrl parses annotation tables like this:
- pattern match and assign grouping
- everything not assigned is Freetext
If that is true, then it should be easy to use for tokenization as well, by filling these composite columns with CvParams in an additional step.
Sounds good? @HLWeil
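A rough Python sketch of that two-step idea, purely as an illustration. It does not use ARCtrl or its API. The header patterns, `CompositeColumn`, `group_headers` and `tokenize` are all simplified assumptions, and the dictionaries returned by `tokenize` are only stand-ins for real CvParams.

```python
import re
from dataclasses import dataclass, field

# Assumed, simplified header patterns; the real rules live in ARCtrl
# and the ARC specification.
MAIN_PATTERN = re.compile(
    r"^(Input|Output|Parameter|Characteristic|Factor|Component)\s*\[(.+)\]$"
)
FOLLOWER_PATTERN = re.compile(r"^(Unit|Term Source REF|Term Accession Number)\b")
STANDALONE_HEADERS = {"Protocol REF"}


@dataclass
class CompositeColumn:
    kind: str   # e.g. "Parameter", "Input", "Protocol REF" or "Freetext"
    name: str   # text inside the brackets, or the raw header otherwise
    followers: list[str] = field(default_factory=list)
    accepts_followers: bool = False


def group_headers(headers: list[str]) -> list[CompositeColumn]:
    """Step 1: pattern match headers and group them; the rest is Freetext."""
    columns: list[CompositeColumn] = []
    for raw in headers:
        header = raw.strip()
        main = MAIN_PATTERN.match(header)
        if main:
            columns.append(
                CompositeColumn(main.group(1), main.group(2), accepts_followers=True)
            )
        elif header in STANDALONE_HEADERS:
            columns.append(CompositeColumn(header, header))
        elif (
            FOLLOWER_PATTERN.match(header)
            and columns
            and columns[-1].accepts_followers
        ):
            # Unit / term reference columns attach to the preceding main column.
            columns[-1].followers.append(header)
        else:
            columns.append(CompositeColumn("Freetext", header))
    return columns


def tokenize(columns: list[CompositeColumn]) -> list[dict[str, str]]:
    """Step 2: turn each composite column into a CvParam-like stand-in."""
    return [{"kind": column.kind, "name": column.name} for column in columns]
```

Keeping grouping and tokenization as separate steps means the grouping can be checked against tricky tables on its own before any tokens are produced.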
Yup that's pretty much it.
It sounds fine to me, provided that it doesn't fail in some specific cases, which should be checked. But as a starting point for getting your tokens for further use, it should be good!