ALTO output: Missing <SP> tags between <String> tags
jbarth-ubhd opened this issue · 24 comments
Perhaps this is not an error.
Kind regards,
J. Barth
Can you provide sample data and how you ran the tool?
I guess you output the ALTO files directly from ABBYY, because we don't yet provide a transformation from ABBYY to ALTO. Then this should be an example: https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/alto/417576986_0031.xml . AFAIK <SP> stands for space, and the file does validate in this form.
Yes, I'll try to find out if <SP> (=space) is really necessary between <String>s in ALTO.
I guess that it still validates without the SP tags. Moreover, most of the information (HPOS, WIDTH) can be calculated from the line above and below, but if the width of a space is important for some application, then it might be easier to have this data directly. I don't know what the VPOS information for a space says or whether it is also determined by some other values.
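To illustrate the point about calculating the space geometry from its neighbours, here is a minimal sketch. It assumes each <String> carries HPOS and WIDTH; the sample line and its coordinates are invented for illustration, and `implied_spaces` is a hypothetical helper, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical TextLine; the coordinates are made up for illustration.
LINE = """
<TextLine HPOS="100" VPOS="200" WIDTH="330" HEIGHT="40">
  <String HPOS="100" VPOS="200" WIDTH="120" HEIGHT="40" CONTENT="Missing"/>
  <String HPOS="235" VPOS="202" WIDTH="60" HEIGHT="38" CONTENT="SP"/>
  <String HPOS="310" VPOS="200" WIDTH="120" HEIGHT="40" CONTENT="tags"/>
</TextLine>
"""


def implied_spaces(line):
    """Return (HPOS, WIDTH) of the gap between each pair of adjacent Strings."""
    strings = [el for el in line if el.tag == "String"]
    gaps = []
    for left, right in zip(strings, strings[1:]):
        start = float(left.get("HPOS")) + float(left.get("WIDTH"))
        width = float(right.get("HPOS")) - start
        gaps.append((start, width))
    return gaps


gaps = implied_spaces(ET.fromstring(LINE))
print(gaps)  # [(220.0, 15.0), (295.0, 15.0)]
```

The sketch only covers the horizontal values. It gives no answer for VPOS, and the computed width becomes negative if two word boxes overlap.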
On ALTO 2.1 .xsd it looks like this:
<xsd:sequence maxOccurs="unbounded">
  <xsd:element name="String" type="StringType"/>
  <xsd:element name="SP" minOccurs="0"> ...
  </xsd:element>
</xsd:sequence>
So strictly speaking it seems that <SP> is not necessary, but the <sequence> seems to imply it.
but the <sequence> seems to imply it.
Not sure. I only see here that, if <SP> occurs, then it has to occur after a <String>.
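The reading "each <String> may be followed by at most one <SP>" can be checked mechanically. The following is a rough sketch based only on the XSD excerpt quoted in this thread; the full schema may allow further child elements that this check ignores, and `follows_excerpt` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET


def follows_excerpt(line):
    """True if the children form one or more 'String, optional SP' groups."""
    names = " ".join(child.tag for child in line)
    return re.fullmatch(r"String( SP)?( String( SP)?)*", names) is not None


without_sp = ET.fromstring("<TextLine><String/><String/></TextLine>")
with_sp = ET.fromstring("<TextLine><String/><SP/><String/></TextLine>")
sp_first = ET.fromstring("<TextLine><SP/><String/></TextLine>")

print(follows_excerpt(without_sp))  # True: SP has minOccurs="0"
print(follows_excerpt(with_sp))     # True
print(follows_excerpt(sp_first))    # False: SP may only follow a String
```

Under this reading, a line without any <SP> is as acceptable as one that alternates strictly.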
Here is an ALTO file generated with Tesseract (see tesseract-ocr/tesseract#2067). Another page was processed by ABBYY Finereader.
While ABBYY adds the <SP> tags, Tesseract (and ocr-fileformat) does not. As the <String> tags contain the surrounding box positions and the distance between two text boxes can be calculated without additional information, that looks sufficient at first glance. But without the <SP> the DFG viewer does not separate the words!
I am not sure whether this is a bug of the DFG viewer (and Kitodo Presentation) or whether ALTO requires explicit tags for the whitespace between words. Perhaps @sebastian-meyer or @cneud know the answer?
The ALTO documentation says "A TextBlock is divided into lines and those are divided into strings, spaces and hyphens". I don't read that as a strict requirement for spaces, and neither does the .xsd. It's clear that spaces are required if the strings are given without HPOS and WIDTH attributes, but I think they are redundant if those attributes are available.
The ALTO spec itself needs to clarify this issue.
Clemens has created an issue for that: altoxml/schema#54 (thank you).
Thanks for flagging this, I will put it on the agenda for our next ALTO board call which will be held November 29th.
To chip in, I've interpreted the standard to mean that the <SP><String> alternation is mandatory (sequence definition of <TextLine> contents) and that whitespace should never occur inside a <String>, and this is how I implemented it.
If a <String> never contains whitespace, then <SP> is completely redundant. Does ALTO allow overlapping words in a row? If yes, does that require a separating space with negative width? :-)
If a <String> never contains whitespace, then <SP> is completely redundant.
Why? Whitespace is a character like any other, and personally I would've decided to encode it explicitly using <String> if the standard didn't heavily imply that you shouldn't do that. Of course, you can throw away the data and let people compute inter-word spacing implicitly from the word bounding boxes, but it isn't as if tesseract, kraken or any other sequence-classification-based OCR engine doesn't output a label for whitespace (and the boundaries of that activation can almost certainly differ from the boundaries of the activations of the adjacent letters). I'd rather not throw away metadata that some weird subdiscipline in the humanities that only the 8 people participating in it have ever heard about might need.
Does ALTO allow overlapping words in a row?
ALTO luckily allows overlapping elements, in contrast to PageXML.
Then how would you encode two overlapping words if you are forced to put a <SP> between them?
Just have overlapping bounding boxes? Presumably there is still a reading order that determines the ordering of the <String> tags. But yeah, it helps that I decided a long time ago that words are a waaay too squishy concept and arbitrarily defined that anything bounded by whitespace is a separate word/segment for serialization purposes (not only for ALTO). Of course, if you want to encode a proper tokenization, this data model shouldn't be used. On the other hand, I'm of the firm conviction that starting to do that in a raw OCR serialization format is only going to lead to madness.
Just to follow up: I'm afraid a quick resolution is not really around the corner... The issue was discussed in the last ALTO board call, with the core elements of the discussion summarized here.
While the general feeling was that the use of <SP> is not mandatory, some more research into ALTO's history is required to determine the original authors' exact intentions.
An expansion of the <SP> tag with a width attribute has been identified among board members as a possibility to create more useful future applications for the <SP> tag.
If one really wants to be on the safe side, the quick solution right now would be to indeed include <SP> in the output of any ALTO export implementation, as it is also straightforward to remove in post-processing.
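As a rough illustration of the post-processing step mentioned above, the following sketch removes every <SP> element from a parsed document. `strip_sp` and the sample line are hypothetical; the tag comparison uses the local name, so it should also work on namespaced ALTO files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample line; attributes other than CONTENT are omitted.
DOC = """<TextLine>
  <String CONTENT="Missing"/><SP/>
  <String CONTENT="SP"/><SP/>
  <String CONTENT="tags"/>
</TextLine>"""


def strip_sp(root):
    """Remove all SP elements below (and directly inside) root; return the count."""
    removed = 0
    # Snapshot the traversal so that removing children cannot disturb it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag.rsplit("}", 1)[-1] == "SP":
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(DOC)
removed = strip_sp(root)
print(removed, [child.get("CONTENT") for child in root])
```

Going the other way, from a file without <SP> to one with it, needs a decision about the geometry of each space, which is why emitting the tags at export time is the safer default.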
ALTO's history is required to determine the original authors' exact intentions.
As a note, most of the character-based classification systems common at the time ALTO was originally specified didn't treat whitespace as a proper glyph, i.e. whitespace is just something bordered by other glyphs and is never seen by the classifier as such. This at least explains the existence of a separate <SP> tag.
Thank you, @cneud, @mittagessen and the ALTO board.
As the current DFG viewer expects the <SP> tags, I think that programs like ocr-transform should produce them, too. Pull request tesseract-ocr/tesseract#2117 adds the tags to Tesseract's new ALTO output, so that output is now compatible with the DFG viewer.
The addition of the <SP> should be handled upstream in the corresponding transformation. Currently, we use hocr2alto and page2alto. We can keep this issue here open as a reminder.
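For reference, a transformation could add the tags roughly as sketched below. This is not how hocr2alto, page2alto or Tesseract do it; `add_sp` is a hypothetical helper. It assumes integer pixel coordinates and copies VPOS from the word on the left, which is an arbitrary choice.

```python
import xml.etree.ElementTree as ET


def add_sp(line):
    """Insert an <SP> between adjacent <String> children of a TextLine."""
    children = list(line)
    for left, right in zip(children, children[1:]):
        if left.tag != "String" or right.tag != "String":
            continue  # already separated, or not a word pair
        hpos = int(left.get("HPOS")) + int(left.get("WIDTH"))
        sp = ET.Element("SP", {
            "HPOS": str(hpos),
            "VPOS": left.get("VPOS"),  # assumption: reuse the left word's VPOS
            "WIDTH": str(int(right.get("HPOS")) - hpos),
        })
        line.insert(list(line).index(right), sp)


# Hypothetical input; the coordinates are invented.
line = ET.fromstring(
    '<TextLine>'
    '<String HPOS="100" VPOS="200" WIDTH="120" CONTENT="Missing"/>'
    '<String HPOS="235" VPOS="202" WIDTH="60" CONTENT="SP"/>'
    '</TextLine>'
)
add_sp(line)
print(ET.tostring(line, encoding="unicode"))
```

Running `add_sp` a second time leaves the line unchanged, because no two <String> elements are adjacent any more.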
According to the ALTO XSD the SP tag is optional - minOccurs="0"
And I do not see a way to reliably calculate the HEIGHT/WIDTH/VPOS/HPOS attributes for the SP tag from the hOCR data.
IMHO - proper handling of optional SP tag should be fixed by DFG viewer.
If the <SP> is not mandatory, we have to "ignore" it in the styles of the fulltext view and always make a space after a <String>.
This is what I've done in the DFG-Viewer styles now. Please have a look at the current master of the DFG-Viewer at test.dfg-viewer.de.
Please compare the example from above in current master and in version 5.0 of DFG-Viewer and report change requests.
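The rule "ignore <SP> and always put a space after a <String>" can be sketched in a few lines. This is only an analogue in Python under that assumption, not the DFG-Viewer implementation, which applies the rule in its styles; `line_text` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def line_text(line):
    """Join the CONTENT of all String children with single spaces, ignoring SP."""
    words = [
        el.get("CONTENT", "")
        for el in line
        if el.tag.rsplit("}", 1)[-1] == "String"
    ]
    return " ".join(words)


with_sp = ET.fromstring(
    '<TextLine><String CONTENT="Missing"/><SP/><String CONTENT="tags"/></TextLine>'
)
without_sp = ET.fromstring(
    '<TextLine><String CONTENT="Missing"/><String CONTENT="tags"/></TextLine>'
)

print(line_text(with_sp))     # Missing tags
print(line_text(without_sp))  # Missing tags
```

With this approach, ABBYY-style and Tesseract-style lines yield the same text.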
@albig IMHO the second one seems better from user perspective - it is more readable/compact.
@albig IMHO the spacing looks better now (in master), but the linebreaks seem a bit random...