Problem with conllu_to_conllx.pl and restore_conllu_lines.pl files
alirezamshi-zz opened this issue · 2 comments
Hello,
I think there is a bug in conllu_to_conllx.pl and restore_conllu_lines.pl. Here is the command I ran for Swedish:
perl conllu_to_conllx.pl < sv_talbanken-ud-test.conllu > sv_talbanken-ud-test.conll
Then I converted it back to CoNLL-U format:
perl restore_conllu_lines.pl sv_talbanken-ud-test.conll sv_talbanken-ud-test.conllu > sv_talbanken-ud-test.conllu.merged
Then I ran the official UD evaluation script on "sv_talbanken-ud-test.conllu.merged" and "sv_talbanken-ud-test.conllu", but it crashed with the following error:
main.UDError: The concatenation of tokens in gold file and in system file differ!
First 20 differing characters in gold file: 'kbasbeloppetvidsamma' and system file: '_kbasbeloppetvidsamm'
The same thing happened with "tr-imst-ud-test.conllu" and "ru_syntagrus-ud-test.conllu".
conllu_to_conllx.pl < sv_talbanken-ud-test.conllu > test.conll
restore_conllu_lines.pl test.conll sv_talbanken-ud-test.conllu > test.conllu
conllu-align-tokens.pl sv_talbanken-ud-test.conllu test.conllu > /dev/null
Non-whitespace character mismatch. Gold line no. 86, offset 321, buffer 'sk'. System line no. 86, buffer 's_k'. at /net/work/people/zeman/unidep/tools/conllu-align-tokens.pl line 112.
Hmm, the problem is that restoring the extra CoNLL-U lines from the original CoNLL-U file is not enough. CoNLL-U allows words with spaces while CoNLL-X does not. Therefore, conllu_to_conllx.pl replaces word-internal spaces with underscores. But restore_conllu_lines.pl only adds the extra lines; it does not try to fix the FORM field of the token lines. In such cases, the output file is not even valid CoNLL-U because the # text line does not match the FORM column.
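To illustrate the missing step, here is a minimal Python sketch of what restoring the FORM column could look like. It is not the code of restore_conllu_lines.pl; the function name restore_forms is hypothetical, and the sketch assumes that the token lines of the CoNLL-X file correspond one-to-one, in order, to the regular token lines (plain integer IDs) of the original CoNLL-U file.

```python
def restore_forms(conllx_lines, conllu_lines):
    """Merge CoNLL-X token lines into the original CoNLL-U structure.

    Hypothetical helper: comments, blank lines, multiword-token ranges and
    empty nodes are copied from the original CoNLL-U; every regular token
    line is taken from the CoNLL-X input, but its FORM (column 2) is
    overwritten with the original FORM so that word-internal spaces return.
    """
    # Assumption: every non-blank CoNLL-X line is a token line.
    conllx_tokens = iter(line for line in conllx_lines if line.strip())
    merged = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Sentence separators and comment lines come from the original file.
        if not line or line.startswith("#"):
            merged.append(line)
            continue
        fields = line.split("\t")
        # IDs such as "3-4" (multiword token) or "5.1" (empty node)
        # have no counterpart in CoNLL-X, so keep the original line.
        if "-" in fields[0] or "." in fields[0]:
            merged.append(line)
            continue
        # Regular token: take the CoNLL-X line, restore the original FORM.
        system = next(conllx_tokens).rstrip("\n").split("\t")
        system[1] = fields[1]
        merged.append("\t".join(system))
    return merged
```

The sketch touches only the FORM column. If other columns (for example LEMMA) are also altered during conversion, they would need the same treatment, and the decision of which columns to trust from which file belongs to the script itself.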
I'll see whether I can make restore_conllu_lines.pl deal with this properly.