Error processing pre-tokenised FoLiA with untokenised parts
proycon opened this issue · 8 comments
I'm running into an error when processing pre-tokenised FoLiA:
mlp09$ frog --language=nld --skip=tmcpa /scratch/proycon/HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.folia.xml -X
Word(class='NUMBER',generate_id='HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.text.1.body.1.p.6.s.1',set='passthru') creation failed: Set 'passthru' is used but has no declaration for token-annotation
The error seems to occur on a paragraph that contains text but no sentences or words (so, unlike the others, it is untokenised). When I remove this paragraph, everything processes fine. It might be indicative of a more structural problem though, as the problem also occurs when I do not skip the tokeniser.
Hmm, seems like a logic error somewhere.
When skipping the tokenizer, 'passthru' should be selected for token-annotation.
You say the same problem arises when --skip=t is NOT selected? That would be odd, as passthru plays no role then.
Could you come up with a minimal example?
Yes, it arises both with and without --skip=t.
Minimal input (https://lst.science.ru.nl/~proycon/issue71_c.xml):
WARNING: cannot tokenize: issue71_c.xml. It has been processed with ucto before!
Falling back to passthru mode. (you might consider using --skip=t)
1 1893 1893 TW(hoofd,vrij) 0.992711 O
Word(class='NUMBER',generate_id='HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.text.1.body.1.p.6.s.1',set='passthru') creation failed: Set 'passthru' is used but has no declaration for token-annotation
Skipping tokeniser:
$ frog --language=nld --skip=tmcpa issue71_c.xml -X
frog-:Fri Jun 14 10:53:48 2019 Frogging issue71_c.xml
1 1893 1893 TW(hoofd,vrij) 0.992711 O
Word(class='NUMBER',generate_id='HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.text.1.body.1.p.6.s.1',set='passthru') creation failed: Set 'passthru' is used but has no declaration for token-annotation
When I remove the provenance information (https://lst.science.ru.nl/~proycon/issue71_d.xml) it does work.
The issue also persists if I assign the correct set definition for token annotation (https://lst.science.ru.nl/~proycon/issue71_e.xml); it seems the document was made with an older version of ucto that didn't point to a set definition URL yet.
I'm running into another, possibly related, problem when skipping the tokeniser on https://lst.science.ru.nl/~proycon/issue71_b.xml (and the same on https://lst.science.ru.nl/~proycon/issue71_g.xml, which differs only in using the old-style set definition for token annotation):
$ frog --language=nld --skip=tmcpa issue71_b.xml -X
frog 0.18 (c) CLTS, ILK 1998 - 2019
CLST - Centre for Language and Speech Technology,Radboud University
ILK - Induction of Linguistic Knowledge Research Group,Tilburg University
based on [ucto 0.17, libfolia 2.1, timbl 6.4.14, ticcutils 0.22, mbt 3.5]
frog-:config read from: /vol/customopt/lamachine16.dev/share/frog/nld/frog.cfg
frog-:configuration version = 0.12
frog-mblem-frog-mblem-:Initiating lemmatizer...
ucto:configured TEXTCAT( /vol/customopt/lamachine16.dev/share/ucto/textcat.cfg )
frog-tok-:Language List =[nld]
frog-tagger-tagger-:reading subsets from /vol/customopt/lamachine16.dev/share/frog/nld//subsets.cgn
frog-tagger-tagger-:reading constraints from /vol/customopt/lamachine16.dev/share/frog/nld//constraints.cgn
frog-NER-tagger-:READ /vol/customopt/lamachine16.dev/share/frog/nld//ners.known
frog-NER-tagger-:loaded 13 additional Named Entities files
frog-:Fri Jun 14 11:01:57 2019 Initialization done.
frog-:Fri Jun 14 11:01:57 2019 Frogging issue71_b.xml
frog-:problem frogging: issue71_b.xml
frog-:Class WORD is used but has no default declaration for token-annotation
frog-:Fri Jun 14 11:01:57 2019 Frog finished
Well, this is very frustrating...
The error is a consequence of the incremental processing of the FoLiA.
- We read the "header" of the document including metadata, annotation declarations etc.
- A token-annotation might be read, which then is the default for already existing annotations
- when --passthru is specified, a NEW token-annotation is added with the 'passthru' set
- now there is NO default token-annotation anymore!
- incremental parsing continues and at some point finds a <w> with a class. It USED to be in the default set, which is gone now!
- CRASH
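To make the sequence above concrete, here is a toy model in Python. It is only a sketch under one assumption: that an annotation type has a default set only while exactly one set is declared for it. The `Declarations` class and the set name `tokconfig-nld` are made up for illustration and are not libfolia's API.

```python
class Declarations:
    """Toy model of per-annotation-type set declarations (not libfolia's API)."""

    def __init__(self):
        # Maps an annotation type to the list of set names declared for it.
        self._sets = {}

    def declare(self, annotation_type, set_name):
        sets = self._sets.setdefault(annotation_type, [])
        if set_name not in sets:
            sets.append(set_name)

    def default_set(self, annotation_type):
        """Return the default set; assumed to exist only if exactly one set is declared."""
        sets = self._sets.get(annotation_type, [])
        if len(sets) == 1:
            return sets[0]
        raise LookupError(f"no default declaration for {annotation_type}")


def demo():
    decls = Declarations()
    # Step 1: the document header declares one token-annotation set (placeholder name).
    decls.declare("token-annotation", "tokconfig-nld")
    before = decls.default_set("token-annotation")
    # Step 2: passthru mode adds a second declaration for the same annotation type.
    decls.declare("token-annotation", "passthru")
    # Step 3: a later <w> without an explicit set can no longer be resolved.
    try:
        decls.default_set("token-annotation")
        after = "still has a default"
    except LookupError as err:
        after = str(err)
    return before, after


print(demo())
```

In this model the lookup succeeds before the second declaration and fails afterwards, which mirrors the "has no default declaration for token-annotation" message in the log.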
This is not easy to fix:
- adding the passthru (or any other) token-annotation at a later stage is not feasible. All annotations we are going to use need to be declared up front, because the parser uses them.
- removing ALL token-annotations before adding 'passthru' will work, as it introduces a new default, but is dubious
- we might pass the original annotations to the parser, but this needs a lot of work
- we might just ditch incremental parsing.
- we might just forbid tokenizing a file with already some tokenisation (not just by ucto)
This really needs some more thought.
I see the dilemma yes.
I wouldn't ditch incremental parsing, it's a nice asset you put a lot of time in.
The easiest solution is probably:
"we might just forbid tokenizing a file with already some tokenisation (not just by ucto)"
This sounds acceptable to me: either a document is tokenised or it isn't, and frog should be able to deal with both. But if it's a mix of tokenised and untokenised parts, then the burden shouldn't be on Frog or ucto to find the parts that are not tokenised; those should simply be skipped in processing (but not erased, of course).
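For anyone who wants to check their own input for such a mix before running frog, a rough sketch in Python follows. It is not part of Frog or ucto. It assumes that an untokenised paragraph is a <p> with a direct <t> child and no <w> descendants, and the sample document is a hand-made fragment, not one of the issue71 files.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "{http://ilk.uvt.nl/folia}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def untokenised_paragraphs(root):
    """Return the ids of <p> elements that carry text but contain no <w> elements."""
    result = []
    for p in root.iter(FOLIA_NS + "p"):
        has_words = p.find(".//" + FOLIA_NS + "w") is not None
        has_text = p.find(FOLIA_NS + "t") is not None  # direct <t> child only
        if has_text and not has_words:
            result.append(p.get(XML_ID))
    return result


# Hand-made fragment: one tokenised paragraph and one untokenised paragraph.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
  <text xml:id="doc.text">
    <p xml:id="doc.p.1">
      <s xml:id="doc.p.1.s.1">
        <w xml:id="doc.p.1.s.1.w.1" class="NUMBER"><t>1893</t></w>
      </s>
    </p>
    <p xml:id="doc.p.2"><t>Untokenised text.</t></p>
  </text>
</FoLiA>"""

print(untokenised_paragraphs(ET.fromstring(SAMPLE)))
```

A non-empty result means the document mixes tokenised and untokenised paragraphs, which is the situation that triggers the error reported here.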
I implemented a solution of sorts. All files are now handled by ucto and frog.
We need to check whether the output is correct or acceptable.
I still get the same error on issue71_c and issue71_e:
Word(class='NUMBER',generate_id='HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.text.1.body.1.p.6.s.1',set='passthru') creation failed: Set 'passthru' is used but has no declaration for token-annotation
The problem on issue71_b and issue71_f does seem solved now.
Yep, seems okay now!