Inconsistent annotations for LS numbers
Closed this issue · 2 comments
Validation issues:

```
ERROR: Sentence answers-20111108024148AAO8oFI_ans-0010 token 12 -- invalid X form '1'
ERROR: Sentence email-enronsent24_01-0014 token 5 -- invalid X form '20'
ERROR: Sentence email-enronsent24_01-0057 token 4 -- invalid X form '20'
ERROR: Sentence email-enronsent24_01-0114 token 4 -- invalid X form '20'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0007 token 1 -- invalid X form '1'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0011 token 1 -- invalid X form '2'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0017 token 1 -- invalid X form '3'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0021 token 1 -- invalid X form '4'
ERROR: Sentence answers-20111108073322AA27tkh_ans-0012 token 2 -- invalid X form '2'
```
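The check behind these errors can be approximated with a short standard-library sketch. This is not the official UD validator; it just scans CoNLL-U input for tokens tagged `UPOS=X` whose form is a bare number, and the sample sentence below is abridged and illustrative, not taken verbatim from the treebank:

```python
import re

def find_numeric_x_tokens(conllu_text):
    """Flag tokens tagged UPOS=X whose FORM is a bare number
    (e.g. list-item markers like '1' or '20')."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
        elif line and not line.startswith("#"):
            cols = line.split("\t")  # CoNLL-U is 10 tab-separated columns
            # cols[0] = ID, cols[1] = FORM, cols[3] = UPOS
            if len(cols) == 10 and cols[3] == "X" and re.fullmatch(r"\d+", cols[1]):
                hits.append((sent_id, cols[0], cols[1]))
    return hits

# Abridged, hypothetical sentence for illustration only:
sample = """# sent_id = answers-20111108090913AAf83Jh_ans-0007
1\t1\t1\tX\tLS\t_\t3\tdep\t_\t_
2\t.\t.\tPUNCT\t.\t_\t1\tpunct\t_\t_
3\tGo\tgo\tVERB\tVB\t_\t0\troot\t_\t_
"""
print(find_numeric_x_tokens(sample))
# [('answers-20111108090913AAf83Jh_ans-0007', '1', '1')]
```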
There are several issues here:

- These should be `NUM` instead of `X` to be consistent with the other LS annotations.
- They should be attached to the following sentence to be consistent with how the other LS+NUM tokens are grouped.
- The LS tokens are missing `NumType=Ord|NumForm=Digit` features -- there may be other cases like this.

Note: I'm using `NumType=Ord` here instead of `Card` as these are ordered values -- first, second, third, etc. -- not counted values.
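Concretely, a corrected line for one of these markers might look like the following sketch. The HEAD and DEPREL values here are illustrative, not taken from the treebank, and note that the CoNLL-U validator expects feature names in alphabetical order, so `NumForm` precedes `NumType` in the FEATS column:

```
1	2	2	NUM	LS	NumForm=Digit|NumType=Ord	3	dep	_	_
```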
Looking across the different treebanks, the EWT treebank is separating the `(1)`/`i)`/etc. into separate tokens, whereas GUM and GENTLE are keeping them as a single token. They are also keeping multi-section list items grouped, such as in `2.1.`. I don't think EWT has examples of that in its data set.
> These should be `NUM` instead of `X` to be consistent with the other LS annotations.
Thanks. A Grew-match query for these:
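The query itself isn't preserved above; a hypothetical Grew-match pattern that would find these tokens (assuming Grew's `re"…"` regex syntax for feature values) could look like:

```
pattern { N [upos=X, form=re"[0-9]+"] }
```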
See also #440
> They are also keeping multi-section list items grouped, such as in `2.1.`. I don't think EWT has examples of that in its data set.
See email-enronsent38_01-0002 and successive sentences. They are kept as one token.
> They should be attached to the following sentence to be consistent with how the other LS+NUM tokens are grouped.
Perhaps, but I'm guessing they were separated in the original text with newlines or something. Messing with the sentence boundaries is something I'm a little reluctant to do...let's move that discussion to #415.
> The LS tokens are missing `NumType=Ord|NumForm=Digit` features -- there may be other cases like this.
Will open a separate issue for this.