lots of invalid files
Going over the PAGE files with a linter against the actual schema from Transkribus (they hijacked the 2013 namespace), or the upstream 2019 schema (where it applies), yields different sorts of errors on various files:
- `<Software> Transkribus <Software>`
- abuse of `/PcGts/Metadata/Comments` for recursive elements (despite being a simpleType)
- large negative `@points` (both x and y)
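For anyone who wants to locate the files affected by the negative `@points` problem without a full schema validator, here is a minimal sketch using only the Python standard library. The function names are made up for illustration, and it assumes that coordinates are integers and that the affected attributes sit on `Coords` or `Baseline` elements; it ignores the namespace so it works for both the 2013 and the 2019 variant.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Parse a PAGE @points string like "10,20 30,40" into (x, y) integer pairs."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def find_negative_points(xml_text):
    """Return (parent id, element name, offending points) for each
    Coords/Baseline element that has at least one negative coordinate."""
    root = ET.fromstring(xml_text)
    hits = []
    for parent in root.iter():
        for child in parent:
            if local_name(child.tag) not in ("Coords", "Baseline"):
                continue
            if "points" not in child.attrib:
                continue
            bad = [(x, y) for x, y in parse_points(child.get("points"))
                   if x < 0 or y < 0]
            if bad:
                hits.append((parent.get("id"), local_name(child.tag), bad))
    return hits
```

Running `find_negative_points` on the text of each PAGE file gives a per-segment list of offending vertices, which is enough to estimate how many files are affected before deciding on a repair strategy.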
Note that most PAGE files were produced with Aletheia, which comes from the inventors of the PAGE format. If there is anything wrong with such PAGE files, you should report it there.
Some PAGE files were produced by Transkribus. It is a known problem that such PAGE files are "special". See for example Transkribus/TranskribusCore#45, now moved to GitLab.
Yes, the 2019 files are all good, only the Transkribus ones are invalid. But it's no use looking back at the tool that generated them – this concerns the dataset alone. The first error is trivial to fix, and the second is a question of devising a good mapping scheme from the various descriptions in `Comments` to 2019 version `MetadataItem`s. But the third is not so trivial – it may not be enough to just impose the page frame as a coordinate boundary. I have noticed there are clear quality problems with the polygons themselves (invalidities and frequent inconsistencies between lines and regions), which often make it impossible to extract text line images correctly. And then there is still a problem with precision: often, ascenders, descenders or diacritics are not included in the polygons of the handwriting.
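To illustrate why imposing the page frame alone may not be enough, here is a small sketch of the naive fix: clamping every vertex into the image area. The helper names are hypothetical, and treating `width - 1` and `height - 1` as the largest valid coordinates is an assumption (the frame would come from the `imageWidth` and `imageHeight` attributes of `Page`). The second helper shows one way clamping can backfire: vertices that were all outside the frame collapse onto the border, leaving a degenerate polygon.

```python
def clamp_points(points, width, height):
    """Clamp each "x,y" pair of a PAGE @points string into the image area.

    Assumption: valid pixel coordinates run from 0 to width-1 / height-1.
    """
    clamped = []
    for pair in points.split():
        x, y = (int(v) for v in pair.split(","))
        x = min(max(x, 0), width - 1)
        y = min(max(y, 0), height - 1)
        clamped.append(f"{x},{y}")
    return " ".join(clamped)


def is_degenerate(points):
    """A polygon needs at least three distinct vertices to enclose any area."""
    return len(set(points.split())) < 3


# Vertices partly outside an 80x40 page are pulled back onto the border ...
fixed = clamp_points("-5,10 100,10 100,-3 -5,50", 80, 40)

# ... but a segment lying entirely outside collapses into a single point.
collapsed = clamp_points("-5,-3 -9,-1 -2,-7", 80, 40)
```

Clamping makes the coordinates schema-valid, but it does nothing about self-intersections, lines that leave their region, or polygons that cut off ascenders and descenders, so it can only be a first step.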
Again that's a known problem of Transkribus. Obviously Transkribus users don't care for correct boxes / polygons. Usually the baselines are better, but it looks like Transkribus does not fix the polygon dimensions when baselines are modified.
The same problems exist for all PAGE files of AustrianNewspapers (also produced by Transkribus).
And the bad news is that probably most transcriptions today are still done by people using Transkribus. Nevertheless I see no progress or improvement on that side.
I agree, the quality of the baselines is higher than that of the polygons in Transkribus/P2PaLA results, and yes, this applies to many other datasets, too. But still, we need to find a way around this ex post in the data.
I am thinking of looking at different implementations for polygonalization and then writing a dedicated tool for that (working in tandem with good binarization). W.r.t. region level we can either use the approach of ocrd_cis.ocropy.lines2regions, or use parts of ocrd_segment.repair to postprocess the annotated segments.
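As a point of comparison for such a tool, the crudest possible polygonalization just shifts the baseline up and down by fixed amounts. The sketch below is only that baseline-offset heuristic, not the planned tool and not what `ocrd_cis` or `ocrd_segment` do; the function name and the `above`/`below` parameters are hypothetical, and a real implementation would derive the extents from the binarized image instead.

```python
def polygon_from_baseline(baseline, above, below):
    """Build a line polygon from a PAGE baseline @points string.

    The baseline is shifted up by `above` pixels (room for ascenders) and
    down by `below` pixels (room for descenders). Image coordinates grow
    downwards, so "up" means subtracting from y. Negative y is cut at 0.
    """
    pts = [tuple(int(v) for v in pair.split(",")) for pair in baseline.split()]
    top = [(x, max(y - above, 0)) for x, y in pts]
    # Walk back along the baseline so the vertices form a closed outline.
    bottom = [(x, y + below) for x, y in reversed(pts)]
    return " ".join(f"{x},{y}" for x, y in top + bottom)


# Hypothetical values: 30 px above and 10 px below the baseline.
polygon = polygon_from_baseline("10,100 200,104", above=30, below=10)
```

Fixed offsets like these will still clip tall ascenders or overlap neighbouring lines, which is exactly why working in tandem with a good binarization seems necessary.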