Integrate @bertsky's tokenization repair?

Question

Closed this issue 5 years ago · 6 comments

So, @mikegerber do you want me to integrate my tokenization repair solution here as well? (Perhaps to be activated by an extra parameter?)

Answer 1 · 2019-11-29T12:16:17.000Z

@bertsky Please try to not mix different issues, I almost missed it :)

Answer 2 · 2019-11-29T12:16:41.000Z

I'm a bit busy at the moment, I'll have a look and then report back!

Answer 3 · 2019-11-29T12:39:45.000Z

> @bertsky Please try to not mix different issues, I almost missed it :)

I'm sorry.

> I'm a bit busy at the moment, I'll have a look and then report back!

Take your time.

It's not going to be as easy, though: In our scenario, we cannot re-use the validator report, because we also don't have the correct ordering (yet), at least on the document level. But the logic underneath it is the same: keep comparing the concatenation with the parent using the joiner or omitting it.
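As a rough illustration of that comparison logic, here is a minimal sketch. It assumes the children are already in the correct order, which is exactly what is not yet guaranteed in this scenario. `infer_joiners` is a hypothetical helper name, not code taken from the tokenization repair solution or from this processor.

```python
from typing import List, Optional


def infer_joiners(children: List[str], parent: str,
                  joiner: str = ' ') -> Optional[List[bool]]:
    """Walk through the parent text child by child.

    For each gap between two consecutive children, record whether the
    parent text contains the joiner there (True) or omits it (False).
    Returns None if the children cannot reproduce the parent text in
    the given order at all.
    """
    pos = 0
    gaps = []
    for i, child in enumerate(children):
        # The next child must appear at the current position.
        if not parent.startswith(child, pos):
            return None
        pos += len(child)
        if i < len(children) - 1:
            # Try the variant with the joiner first, else omit it.
            if parent.startswith(joiner, pos):
                gaps.append(True)
                pos += len(joiner)
            else:
                gaps.append(False)
    # The whole parent text must have been consumed.
    return gaps if pos == len(parent) else None
```

For the words `['1', '.']` and the line text `"1."`, this reports a single gap without a joiner, whereas a plain `' '.join(...)` comparison would simply fail.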

Answer 4 · 2019-11-29T13:29:09.000Z

Also from #4:

    ocrd-repair-inconsistencies -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE -O OCR-D-GT-PAGE-FIXED
    12:29:31.191 INFO processor.RepairInconsistencies - INPUT FILE 0 / PHYS_0017
    12:29:31.207 DEBUG processor.RepairInconsistencies - Resorting lines of page "PHYS_0017" region "tl_1" from ['w_w1aab1b1b2b1b1ab1', 'word_1478541234932_798', 'word_1478541234930_797'] to ['w_w1aab1b1b2b1b1ab1', 'word_1478541234932_798', 'word_1478541234930_797'] does not suffice to turn "Berliniſche Monatsſchrift ." into "Berliniſche Monatsſchrift."
    12:29:31.207 DEBUG processor.RepairInconsistencies - Resorting lines of page "PHYS_0017" region "tl_4" from ['word_1478541284648_806', 'word_1478541284647_805'] to ['word_1478541284648_806', 'word_1478541284647_805'] does not suffice to turn "1 ." into "1."
    12:29:31.208 DEBUG processor.RepairInconsistencies - Resorting lines of page "PHYS_0017" region "tl_5" from ['w_w1aab1b3b2b3b1ab1', 'w_w1aab1b3b2b3b1ac27', 'word_1478541289590_808', 'word_1478541289588_807'] to ['w_w1aab1b3b2b3b1ab1', 'w_w1aab1b3b2b3b1ac27', 'word_1478541289590_808', 'word_1478541289588_807'] does not suffice to turn "Beantwortung der Frage :" into "Beantwortung der Frage:"
    12:29:31.208 DEBUG processor.RepairInconsistencies - Resorting lines of page "PHYS_0017" region "tl_6" from ['w_w1aab1b3b2b3b3ab1', 'w_w1aab1b3b2b3b3ab9', 'word_1478541293583_810', 'word_1478541293581_809'] to ['w_w1aab1b3b2b3b3ab1', 'w_w1aab1b3b2b3b3ab9', 'word_1478541293583_810', 'word_1478541293581_809'] does not suffice to turn "Was iſt Aufklaͤrung ?" into "Was iſt Aufklaͤrung?"
    ...

Answer 5 · 2019-11-29T16:15:30.000Z

> It's not going to be as easy, though: In our scenario, we cannot re-use the validator report, because we also don't have the correct ordering (yet), at least on the document level. But the logic underneath it is the same: keep comparing the concatenation with the parent using the joiner or omitting it.

Since this processor is not about fixing tokenization or fixing text, it may be best to merely relax the concatenation test here. Instead of ...

    if sorted_lines_text == region_text:

...write...

    if sorted_lines_text == region_text or sorted_lines_text.replace('\n', '') == region_text.replace('\n', ''):

(And likewise at the word level, stripping ' ' instead of '\n'.)

That way, one can repair the XML ordering first and independently here, and can next attempt to repair tokenization. What do you think?
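To make the proposal concrete, the relaxed test could be factored into one helper that serves both levels. This is only a sketch: `texts_match` is a hypothetical name, and only `sorted_lines_text` and `region_text` correspond to variables in the snippet above.

```python
def texts_match(concatenated: str, parent: str, joiner: str) -> bool:
    """Relaxed concatenation test.

    Accept an exact match, or a match once the joiner has been removed
    from both sides, so that pure tokenization differences do not block
    the ordering repair.
    """
    if concatenated == parent:
        return True
    return concatenated.replace(joiner, '') == parent.replace(joiner, '')


# Region level: lines are joined with a newline.
#   texts_match(sorted_lines_text, region_text, '\n')
# Word level: words are joined with a space.
#   texts_match("1 .", "1.", ' ')
```

With this check, a case like `"1 ."` versus `"1."` from the log above would count as consistent for ordering purposes, leaving the stray space to a later tokenization repair.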

Answer 6 · 2019-11-29T17:52:01.000Z

#6 is along those lines.