Misc. annotation errors and/or conversion script bugs
lgessler opened this issue · 6 comments
There are some annotations which I'm fairly sure are incorrect and are choking the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STREUSLE.) One or two might also indicate a bug in conllulex2json.py.
- "vs" mistagged as a noun--should be prep
AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': '`c', 'ss2': '`c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})
- ditto
AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': '`c', 'ss2': '`c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})
- Script complains about "to" in this snippet at ID=23. Not immediately clear to me what the issue is--perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, not sure about its upos. Snippet:
13 shit shit NOUN NN _ 16 obl:npmod _ _ _ _ _ _ _ _ _ _ _
14 this this PRON DT _ 16 nsubj _ _ _ _ _ _ _ _ _ _ _
15 can can AUX MD _ 16 aux _ _ _ _ _ _ _ _ _ _ _
16 end end VERB VB _ 4 parataxis _ _ _ _ _ _ _ _ _ _ _
17 right right ADV RB _ 18 advmod _ _ _ _ _ _ _ _ _ _ _
18 now now ADV RB _ 16 advmod _ _ _ _ _ _ _ _ _ _ _
19 if if SCONJ IN _ 21 mark _ _ _ _ _ _ _ _ _ _ _
20 I I PRON PRP _ 21 nsubj _ _ _ _ _ _ _ _ _ _ _
21 want want VERB VBP _ 16 advcl _ _ _ _ _ _ _ _ _ _ _
22 it it PRON PRP _ 21 obj _ _ _ _ _ _ _ _ _ _ _
23 to to ADP IN _ 21 obl _ _ _ _ _ `i `i _ _ _ _
24 . . PUNCT . _ 4 punct _ _ _ _ _ _ _ _ _ _ _
Error:
AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': '`i', 'ss2': '`i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})
Relevant span of code:
    if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
        ('ADP','P'),('ADV','P'),('SCONJ','P'),
        ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
        ('PART','POSS')}:
        # most often, the single-word lexcat should match its upos
        # check a list of exceptions
        mismatchOK = False
        if xpos=='TO' and lc.startswith('INF'):
            mismatchOK = True
        elif (xpos=='TO')!=lc.startswith('INF'):
            assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
            mismatchOK = True
- Originator as function:
(in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02)
AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})
- lexcat DISC with ADJ:
AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ
- "her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:
1 My my PRON PRP$ _ 2 nmod:poss _ _ _ _ _ SocialRel Gestalt _ _ _ _
2 grandma grandma NOUN NN _ 3 nsubj _ _ _ _ _ _ _ _ _ _ _
3 had have VERB VBD _ 0 root _ _ _ _ _ _ _ _ _ _ _
4 her she PRON PRP _ 3 iobj _ _ _ _ _ Possessor Possessor _ _ _ _
5 super super ADV RB _ 6 advmod _ _ _ _ _ _ _ _ _ _ _
6 thick thick ADJ JJ _ 8 amod _ _ _ _ _ _ _ _ _ _ _
7 floor floor NOUN NN _ 8 compound _ _ _ _ _ _ _ _ _ _ _
8 mats mat NOUN NNS _ 3 obj _ _ _ _ _ _ _ _ _ _ _
9 * * PUNCT NFP _ 8 punct _ _ _ _ _ _ _ _ _ _ _
10 over over ADP IN _ 13 case _ _ _ _ _ Locus Locus _ _ _ _
11 * * PUNCT NFP _ 13 punct _ _ _ _ _ _ _ _ _ _ _
12 the the DET DT _ 13 det _ _ _ _ _ _ _ _ _ _ _
13 accelerator accelerator NOUN NN _ 3 obl _ _ _ _ _ _ _ _ _ _ _
14 , , PUNCT , _ 3 punct _ _ _ _ _ _ _ _ _ _ _
Error:
AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON
- "NA" is misannotated--this is NA as in North America, i.e. a PROPN/NP, but it's lemmatized as "no", and its tags are weird.
AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})
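To make the "to" failure at ID=23 concrete, here is a minimal sketch that mirrors only the branches quoted from conllulex2json.py above. The function name `check_to_inf` and its return labels are hypothetical and exist only for illustration; the real script has further checks after the quoted span that this sketch does not reproduce. The second call assumes a PART upos alongside the TO xpos.

```python
# (upos, lexcat) pairs that the quoted snippet accepts without further checks.
ALLOWED_PAIRS = {
    ('NOUN', 'N'), ('PROPN', 'N'), ('VERB', 'V'),
    ('ADP', 'P'), ('ADV', 'P'), ('SCONJ', 'P'),
    ('ADP', 'DISC'), ('ADV', 'DISC'), ('SCONJ', 'DISC'),
    ('PART', 'POSS'),
}

def check_to_inf(upos, xpos, lc, lexlemma):
    """Hypothetical helper mirroring only the quoted branches.

    Returns one of: 'match', 'mismatch-ok', 'assertion-fails', 'falls-through'.
    """
    if upos == lc or (upos, lc) in ALLOWED_PAIRS:
        return 'match'
    if xpos == 'TO' and lc.startswith('INF'):
        return 'mismatch-ok'
    if (xpos == 'TO') != lc.startswith('INF'):
        # Exactly one of "xpos is TO" / "lexcat is INF*" holds;
        # the quoted code only tolerates this for infinitival "for".
        if upos in ('SCONJ', 'ADP') and lexlemma == 'for':
            return 'mismatch-ok'
        return 'assertion-fails'
    return 'falls-through'  # handled by checks outside the quoted span

# Token 23 as currently annotated: ADP/IN with lexcat INF.
as_annotated = check_to_inf('ADP', 'IN', 'INF', 'to')
# Same token with xpos TO (and, as an assumption, upos PART).
retagged = check_to_inf('PART', 'TO', 'INF', 'to')
print(as_annotated, retagged)
```

Under these assumptions the annotated token lands in the `elif` branch because its lexcat is INF while its xpos is IN, and the assertion there only passes for lemma "for".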
Thanks for catching these.
1-2. Yes, I would say "vs" is ADP or CCONJ depending on use.
3. Should be "if I want it to/PART/TO"
4. Please give more context, I don't know why "you" should be annotated as prepositional
5. See what discourse particle "like" is tagged as in STREUSLE/EWT? Probably ADV is better than ADJ for the UPOS.
re: 4.:
# sent_id = french-c02823ec-60bd-adce-7327-01337eb9d1c8-02
# text = Your point is : being ignorant makes it right ?
1 Your you PRON PRP$ _ 2 nmod:poss _ _ _ _ _ Originator Originator _ _ _ _
2 point point NOUN NN _ 3 nsubj _ _ _ _ _ _ _ _ _ _ _
3 is be VERB VBZ _ 0 root _ _ _ _ _ _ _ _ _ _ _
4 : : PUNCT : _ 3 punct _ _ _ _ _ _ _ _ _ _ _
5 being be AUX VBG _ 6 cop _ _ _ _ _ _ _ _ _ _ _
6 ignorant ignorant ADJ JJ _ 7 nsubj _ _ _ _ _ _ _ _ _ _ _
7 makes make VERB VBZ _ 3 ccomp _ _ _ _ _ _ _ _ _ _ _
8 it it PRON PRP _ 7 obj _ _ _ _ _ _ _ _ _ _ _
9 right right ADV RB _ 7 advmod _ _ _ _ _ _ _ _ _ _ _
10 ? ? PUNCT . _ 3 punct _ _ _ _ _ _ _ _ _ _ _
re: 5: I can't seem to find any in STREUSLE, but GUM tags them as UH/INTJ: https://corpling.uis.georgetown.edu/annis/#_q=Imxpa2UiIC4gIiwi&_c=R1VN&cl=5&cr=5&s=0&l=10
and EWT has one example also tagged as UH: https://corpling.uis.georgetown.edu/annis/#_q=Imxpa2UiIC4gIiwi&_c=VURfRW5nbGlzaC1FV1Q&cl=5&cr=5&s=0&l=10
OK actually on (5), the sentence is just "I was like ...", which GUM consistently tags as RP/ADP when it's a quotative "be like", e.g.:
- I was like/RP/ADP, I'll call you when I get home
EWT also seems to follow this except for one case where it's UH (results):
- He was like/UH/INTJ what ???
- I was like/RP/ADP Ummmm
- he was like/RP/ADP Oops
Other instances of "your" are Originator~>Gestalt, so I changed it to that for (4)
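For finding any remaining tokens with Originator as the function before rerunning the conversion, a rough scan like the following might help. It is only a sketch: `find_originator_functions` is a hypothetical helper, and it assumes the 19-column layout seen in the snippets above, with the function supersense in the 15th column, written either as `Originator` or `p.Originator`.

```python
def find_originator_functions(lines):
    """Yield (sent_id, token_id, form) for tokens whose function label is Originator.

    Assumes 19 whitespace- or tab-separated columns per token line,
    with the function (ss2) in the 15th column.
    """
    sent_id = None
    for line in lines:
        line = line.strip()
        if line.startswith('# sent_id = '):
            sent_id = line[len('# sent_id = '):]
            continue
        if not line or line.startswith('#'):
            continue
        cols = line.split('\t') if '\t' in line else line.split()
        if len(cols) != 19:
            continue  # skip anything that does not look like a token line
        if cols[14] in ('Originator', 'p.Originator'):
            yield sent_id, cols[0], cols[1]

# Two token lines copied from the sentence quoted earlier in this thread.
sample = """# sent_id = french-c02823ec-60bd-adce-7327-01337eb9d1c8-02
1 Your you PRON PRP$ _ 2 nmod:poss _ _ _ _ _ Originator Originator _ _ _ _
2 point point NOUN NN _ 3 nsubj _ _ _ _ _ _ _ _ _ _ _
""".splitlines()

hits = list(find_originator_functions(sample))
print(hits)
```

On the sample this flags only token 1 ("Your"), which is the one changed to Originator~>Gestalt.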