validate.py does not pick up presentential comments
beemorris opened this issue · 2 comments
UDPipe complains if there is a comment in front of a sentence, but the validator doesn't pick this up. Is this an issue with UDPipe (e.g. does the format allow pre-sentential comments?) or is it an issue with the validator? Here is an example:
# can't find this in common voice file
# sent_id = 349
# text = Uico a rak rang tuk.
# text_en = The dog was very fast.
1 Uico uico NOUN _ _ 4 nsubj _ dog
2 a a PRON _ Number=Sing|Person=3 4 expl _ 3SG
3 rak rak PART _ _ 4 discourse _ PERF
4 rang rang VERB _ _ 0 root _ fast
5 tuk tuk ADV _ _ 4 advmod _ very|SpaceAfter=No
6 . PUNCT PUNCT _ _ 4 punct _ _
# can't find this in common voice file
# sent_id = 350
# text = Mei a vun sen.
# text_en = The light immediately turned red.
1 Mei mei NOUN _ _ 4 nsubj _ light
2 a a PRON _ Number=Sing|Person=3 4 expl _ 3SG
3 vun vun ADV _ _ 4 advmod _ immediately
4 sen sen VERB _ _ 0 root _ red|SpaceAfter=No
5 . PUNCT PUNCT _ _ 4 punct _ _
Here is the output from UDPipe:
[Line 1814 Sent 214]: [L1 Format misplaced-comment] Spurious comment line. Comments are only allowed before a sentence.
Your examples look OK to me. All comments are presentential, i.e., they must occur before the line of the first token of the sentence. See the CoNLL-U format specification. I think UDPipe can read comments.
Or did you mean by "presentential" the fact that the "sent_id" comment is not the first comment? But that is formally okay as well. There must be just one "sent_id" comment but its relative position to other comments is not prescribed.
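To illustrate the "just one sent_id, position not prescribed" rule, here is a minimal Python sketch. It is not the code from validate.py; the function name `count_sent_ids` and the regular expression are assumptions made for this example. It counts the `sent_id` comments in each blank-line-separated block, regardless of where they sit among the other comments.

```python
import re

# Assumed pattern for illustration; the real validator may be stricter.
SENT_ID = re.compile(r"^#\s*sent_id\s*=\s*\S+")

def count_sent_ids(lines):
    """Return the number of sent_id comments in each blank-line-separated block."""
    counts = []
    current = 0
    seen_content = False  # True once the current block has any non-blank line
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current block, if there is one.
            if seen_content:
                counts.append(current)
            current, seen_content = 0, False
            continue
        seen_content = True
        if SENT_ID.match(stripped):
            current += 1
    if seen_content:
        counts.append(current)  # the last block may lack a trailing blank line
    return counts
```

A block whose first comment is a free-text note and whose second comment is `# sent_id = 349` yields a count of 1, which is what the rule requires.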
The error message you list (which btw looks quite like the output from validate.py :-)) could actually mean that the previous sentence was not followed by a blank line. In that case the script thinks it is still reading the previous sentence, and the current comment occurs in the middle or at the end of that sentence.
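A quick way to test this hypothesis on your own file is sketched below in Python. This is not how validate.py or UDPipe implement the check; `find_misplaced_comments` is a hypothetical helper written for this example. It assumes that any non-blank line not starting with `#` is a token line, and it reports every comment line that follows a token line without an intervening blank line.

```python
def find_misplaced_comments(lines):
    """Return 1-based numbers of comment lines that appear inside a sentence."""
    misplaced = []
    in_tokens = False  # True once the current sentence has at least one token line
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            # A blank line ends the sentence; comments are allowed again.
            in_tokens = False
        elif stripped.startswith("#"):
            if in_tokens:
                misplaced.append(lineno)
        else:
            # Assumption: anything else is a token line.
            in_tokens = True
    return misplaced
```

Run over the lines of the file (for instance `find_misplaced_comments(open(path, encoding="utf-8"))`, with `path` pointing to your .conllu file), the first reported line number should sit right after a sentence that is missing its terminating blank line.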