Error processing pre-tokenised FoLiA with untokenised parts

Question

proycon opened this issue 6 years ago · 8 comments

I'm running into an error when processing pre-tokenised FoLiA:

mlp09$ frog --language=nld --skip=tmcpa /scratch/proycon/HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.folia.xml -X

Word(class='NUMBER',generate_id='HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.text.1.body.1.p.6.s.1',set='passthru') creation failed: Set 'passthru' is used but has no declaration for token-annotation

The error seems to occur on a paragraph that contains text but no sentences or words (so, unlike the others, it is untokenised). When I remove this paragraph, everything processes fine. It might be indicative of a more structural problem though, as the problem also occurs when I do not skip the tokeniser.

Answer 1 · 2019-06-14T07:48:01.000Z

Hmm, seems like a logic error somewhere.
When skipping the tokenizer, 'passthru' should be selected for token-annotation.

You say the same problem arises when --skip=t is NOT selected? That would be odd, as passthru plays no role then.

Could you come up with a minimal example?

Answer 2 · 2019-06-14T09:05:53.000Z

Yes, it arises both with and without --skip=t. Minimal input (https://lst.science.ru.nl/~proycon/issue71_c.xml):

WARNING: cannot tokenize: issue71_c.xml. It has been processed with ucto before! 
  Falling back to passthru mode. (you might consider using --skip=t)

1       1893    1893                    TW(hoofd,vrij)  0.992711        O

Word(class='NUMBER',generate_id='HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.text.1.body.1.p.6.s.1',set='passthru') creation failed: Set 'passthru' is used but has no declaration for token-annotation

Skipping tokeniser:

$ frog --language=nld --skip=tmcpa issue71_c.xml -X                                                                                                                       
frog-:Fri Jun 14 10:53:48 2019 Frogging issue71_c.xml
1       1893    1893                    TW(hoofd,vrij)  0.992711        O

Word(class='NUMBER',generate_id='HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.text.1.body.1.p.6.s.1',set='passthru') creation failed: Set 'passthru' is used but has no declaration for token-annotation

When I remove the provenance information (https://lst.science.ru.nl/~proycon/issue71_d.xml) it does work.

The issue also persists if I assign the correct set definition for token annotation (the document was made with an older version of ucto that didn't point to a set definition URL yet, it seems). (https://lst.science.ru.nl/~proycon/issue71_e.xml)

I'm running into another, possibly related, problem when skipping the tokeniser on https://lst.science.ru.nl/~proycon/issue71_b.xml (and the same on https://lst.science.ru.nl/~proycon/issue71_g.xml, which differs only in the old-style set definition for token annotation):

$ frog --language=nld --skip=tmcpa issue71_b.xml -X                                                                                                                             
frog 0.18 (c) CLTS, ILK 1998 - 2019
CLST  - Centre for Language and Speech Technology,Radboud University
ILK   - Induction of Linguistic Knowledge Research Group,Tilburg University
based on [ucto 0.17, libfolia 2.1, timbl 6.4.14, ticcutils 0.22, mbt 3.5]
frog-:config read from: /vol/customopt/lamachine16.dev/share/frog/nld/frog.cfg
frog-:configuration version = 0.12
frog-mblem-frog-mblem-:Initiating lemmatizer...
ucto:configured TEXTCAT( /vol/customopt/lamachine16.dev/share/ucto/textcat.cfg )
frog-tok-:Language List =[nld]
frog-tagger-tagger-:reading subsets from /vol/customopt/lamachine16.dev/share/frog/nld//subsets.cgn
frog-tagger-tagger-:reading constraints from /vol/customopt/lamachine16.dev/share/frog/nld//constraints.cgn
frog-NER-tagger-:READ  /vol/customopt/lamachine16.dev/share/frog/nld//ners.known
frog-NER-tagger-:loaded 13 additional Named Entities files
frog-:Fri Jun 14 11:01:57 2019 Initialization done.
frog-:Fri Jun 14 11:01:57 2019 Frogging issue71_b.xml
frog-:problem frogging: issue71_b.xml
frog-:Class WORD is used but has no default declaration for token-annotation
frog-:Fri Jun 14 11:01:57 2019 Frog finished

Answer 3 · 2019-06-14T11:08:49.000Z

Well, this is very frustrating...
The error is a consequence of the incremental processing of the FoLiA.

We read the "header" of the document, including metadata, annotation declarations etc.
A token-annotation declaration might be read, which is then the default for the already existing annotations.
When --passthru is specified, a NEW token-annotation is added with the 'passthru' set.
Now there is NO default token-annotation anymore!
Incremental parsing continues and at some point finds a <w> with a class. It USED to be in the default set, which is gone now!
CRASH
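
To make the sequence above concrete, here is a toy model in Python. It is only a sketch and not libfolia's actual API: the Declarations class, the parse_word helper and the set name 'tokconfig-nld' are all hypothetical, and it assumes that an annotation type has a default set only while exactly one set is declared for it.

```python
class Declarations:
    """Toy registry of annotation declarations (hypothetical, not libfolia)."""

    def __init__(self):
        self._sets = {}  # annotation type -> list of declared set names

    def declare(self, annotation_type, set_name):
        sets = self._sets.setdefault(annotation_type, [])
        if set_name not in sets:
            sets.append(set_name)

    def default_set(self, annotation_type):
        # Assumption: a default exists only when exactly one set is declared.
        sets = self._sets.get(annotation_type, [])
        return sets[0] if len(sets) == 1 else None


def parse_word(decls, word_class, set_name=None):
    """Resolve the set of a <w> element the way an incremental parser might."""
    if set_name is None:
        set_name = decls.default_set("token")
        if set_name is None:
            raise ValueError(
                f"Class {word_class} is used but has no default declaration "
                "for token-annotation"
            )
    return (word_class, set_name)


decls = Declarations()
decls.declare("token", "tokconfig-nld")   # hypothetical set read from the header
first = parse_word(decls, "WORD")         # resolves against the single default

decls.declare("token", "passthru")        # second set added for passthru mode
try:
    parse_word(decls, "WORD")             # <w> that relied on the old default
    failed = False
    message = ""
except ValueError as err:
    failed = True
    message = str(err)
```

In this model the first <w> resolves fine, while the second one fails as soon as 'passthru' has been declared next to the original set, which mirrors the "no default declaration" message in the log above.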

This is not easy to fix:

adding the passthru (or any other) token-annotation at a later stage is not feasible. All annotations we are going to use need to be declared and are used by the parser.
removing ALL token-annotations before adding 'passthru' will work, as it introduces a new default, but is dubious
we might pass the original annotations to the parser, but this needs a lot of work
we might just ditch incremental parsing.
we might just forbid tokenizing a file with already some tokenisation (not just by ucto)

This really needs some more thought.

Answer 4 · 2019-06-14T12:45:35.000Z

I see the dilemma yes.

I wouldn't ditch incremental parsing, it's a nice asset you put a lot of time in.

The easiest solution is probably:

we might just forbid tokenizing a file with already some tokenisation (not just by ucto)

This sounds acceptable to me: either a document is tokenised or it isn't, and Frog should be able to deal with both. But if it's a mix of tokenised and untokenised parts, then the burden shouldn't be on Frog or ucto to find out which parts are not tokenised; those should simply be skipped in processing (but not erased, of course).
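
For checking a document beforehand, a rough way to spot such mixed input is to list the paragraphs that carry text but contain no <w> elements. The sketch below uses only Python's standard library; the untokenised_paragraphs helper and the sample document are made up for illustration (the sample is trimmed and is not a complete, valid FoLiA file), and this is not how Frog or ucto decide what to skip.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "{http://ilk.uvt.nl/folia}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def untokenised_paragraphs(root):
    """Return the xml:id of every <p> that has a direct <t> but no <w> inside."""
    result = []
    for p in root.iter(FOLIA_NS + "p"):
        has_words = next(p.iter(FOLIA_NS + "w"), None) is not None
        has_text = p.find(FOLIA_NS + "t") is not None
        if has_text and not has_words:
            result.append(p.get(XML_ID))
    return result


# Hypothetical, trimmed sample: one tokenised and one untokenised paragraph.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
  <text xml:id="doc.text">
    <p xml:id="doc.p.1">
      <s xml:id="doc.p.1.s.1">
        <w xml:id="doc.p.1.s.1.w.1" class="NUMBER"><t>1893</t></w>
      </s>
    </p>
    <p xml:id="doc.p.2"><t>Dit is nog niet getokeniseerd.</t></p>
  </text>
</FoLiA>"""

root = ET.fromstring(SAMPLE)
skipped = untokenised_paragraphs(root)
print(skipped)
```

Any paragraph reported this way is a candidate for being left alone during processing.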

Answer 5 · 2019-06-14T14:00:00.000Z

I implemented a solution of sorts. All files are now handled by ucto and frog.
We need to check whether the output is correct or acceptable.

Answer 6 · 2019-06-14T16:31:10.000Z

I still get the same error on issue71_c and issue71_e:

Word(class='NUMBER',generate_id='HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.text.1.body.1.p.6.s.1',set='passthru') creation failed: Set 'passthru' is used but has no declaration for token-annotation

The problem on issue71_b and issue71_f does seem solved now.

Answer 7 · 2019-07-11T07:59:26.000Z

@proycon I assume this is fixed now?

Answer 8 · 2019-07-11T09:25:11.000Z

Yep, seems okay now!