ALTO output: Missing <SP> tags between <String> tags

Question

jbarth-ubhd opened this issue 7 years ago · 24 comments

Perhaps this is not an error.
Kind regards,
J. Barth

Answer 1 · 2017-12-22T08:57:26.000Z

Can you provide sample data and how you ran the tool?

Answer 2 · 2017-12-22T09:03:38.000Z

I guess you output the ALTO files directly from ABBYY, because we don't yet provide a transformation from ABBYY to ALTO. Then this should be an example: https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/alto/417576986_0031.xml . The <SP> stands AFAIK for space and it does validate in this form.

Answer 3 · 2017-12-22T09:05:26.000Z

Yes, I'll try to find out if <SP> (=space) is really necessary between <String>s in ALTO.

Answer 4 · 2017-12-22T09:19:03.000Z

I guess that it still validates without the SP tags. Moreover, most of the information (HPOS, WIDTH) can be calculated from the line above and below, but if the width of a space is important for some application, then it might be easier to have this data directly. I don't know what the VPOS information for a space says or whether it is also determined by some other values.

Answer 5 · 2017-12-22T10:24:15.000Z

On ALTO 2.1 .xsd it looks like this:

  <xsd:sequence maxOccurs="unbounded">
    <xsd:element name="String" type="StringType"/>
    <xsd:element name="SP" minOccurs="0"> ...
    </xsd:element>
  </xsd:sequence>

So strictly speaking it seems that <SP> is not necessary, but the <sequence> seems to imply it.

Answer 6 · 2017-12-22T10:44:40.000Z

but the <sequence> seems to imply it.

Not sure. All I see here is that if <SP> occurs, it has to occur after a <String>.

Answer 7 · 2018-11-22T21:28:48.000Z

Here is an ALTO file generated with Tesseract (see tesseract-ocr/tesseract#2067). Another page was processed by ABBYY Finereader.

While ABBYY adds the <SP> tags, Tesseract (and ocr-fileformat) does not. As the <String> tags contain the surrounding box positions and the distance of two text boxes can be calculated without additional information, that looks sufficient at a first glance. But without the <SP> the DFG viewer does not separate the words!

I am not sure whether this is a bug of the DFG viewer (and Kitodo Presentation) or whether ALTO requires explicit tags for the whitespace between words. Perhaps @sebastian-meyer or @cneud know the answer?

Answer 8 · 2018-11-22T21:50:58.000Z

The ALTO documentation says "A TextBlock is divided into lines and those are divided into strings, spaces and hyphens". I don't read that as a strict requirement to include spaces, and neither does the .xsd. It's clear that spaces are required if the strings are given without HPOS and WIDTH attributes, but I think they are redundant if those attributes are available.
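To illustrate the redundancy argument, here is a minimal sketch of how the gap between two neighbouring <String> boxes could be derived from HPOS and WIDTH alone. The sample line, its coordinates, and the helper names are invented for illustration, and the snippet omits the XML namespace that real ALTO files declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free TextLine with made-up coordinates.
LINE = """<TextLine>
  <String CONTENT="Missing" HPOS="100" VPOS="50" WIDTH="80" HEIGHT="20"/>
  <String CONTENT="SP" HPOS="190" VPOS="50" WIDTH="25" HEIGHT="20"/>
  <String CONTENT="tags" HPOS="225" VPOS="50" WIDTH="45" HEIGHT="20"/>
</TextLine>"""


def local_name(element):
    """Return the tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def inter_word_gaps(text_line):
    """Return (hpos, width) of the gap after each String except the last."""
    strings = [e for e in text_line if local_name(e) == "String"]
    gaps = []
    for left, right in zip(strings, strings[1:]):
        # The gap starts where the left box ends ...
        gap_start = float(left.get("HPOS")) + float(left.get("WIDTH"))
        # ... and ends where the right box begins.
        gaps.append((gap_start, float(right.get("HPOS")) - gap_start))
    return gaps


gaps = inter_word_gaps(ET.fromstring(LINE))
print(gaps)  # [(180.0, 10.0), (215.0, 10.0)]
```

A negative width in this calculation would simply mean that the two boxes overlap horizontally.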

Answer 9 · 2018-11-23T10:40:53.000Z

The ALTO spec itself needs to clarify this issue.

Answer 10 · 2018-11-23T11:51:32.000Z

Clemens has created an issue for that: altoxml/schema#54 (thank you).

Answer 11 · 2018-11-23T11:57:55.000Z

Thanks for flagging this, I will put it on the agenda for our next ALTO board call which will be held November 29th.

Answer 12 · 2018-11-24T19:50:39.000Z

To chip in, I've interpreted the standard as saying that the <SP><String> alternation is mandatory (sequence definition of <TextLine> contents) and that whitespace should never occur inside a <String>, and this is how I implemented it.

Answer 13 · 2018-11-24T20:17:47.000Z

If a <String> never contains whitespace, then <SP> is completely redundant. Does ALTO allow overlapping words in a row? If yes, does that require a separating space with negative width? :-)

Answer 14 · 2018-11-24T20:39:51.000Z

If a <String> never contains whitespace, then <SP> is completely redundant.

Why? Whitespace is a character like any other and personally I would've taken the decision to encode it explicitly using <String> if the standard didn't heavily imply that you shouldn't do that. Of course, you can throw away the data and let people compute inter-word spacing implicitly from the word bounding boxes, but it isn't like tesseract, kraken or any other sequence classification based OCR engine doesn't output a label for whitespace (and the boundaries of that activation can almost certainly differ from the boundaries of the activations of the adjacent letters). I'd rather not throw away metadata that some weird subdiscipline in the humanities that only the 8 people participating in it have ever heard about might need.

Does ALTO allow overlapping words in a row?

ALTO luckily allows overlapping elements, in contrast to PageXML.

Answer 15 · 2018-11-24T21:06:24.000Z

Then how would you encode two overlapping words if you are forced to put a <SP> between them?

Answer 16 · 2018-11-24T23:12:32.000Z

Just have overlapping bounding boxes? Presumably there is still a reading order that determines the ordering of the <String> tags. But yeah, it helps that I decided a long time ago that words are a waaay too squishy concept and arbitrarily defined that anything bounded by whitespace is a separate word/segment for serialization purposes (not only for ALTO). Of course, if you want to encode a proper tokenization, this data model shouldn't be used. On the other hand, I'm of the firm conviction that starting to do that in a raw OCR serialization format is only going to lead to madness.

Answer 17 · 2018-12-13T16:00:43.000Z

Just to follow up: I'm afraid a quick resolution is not really around the corner. The issue was discussed in the last ALTO board call, with the core elements of the discussion summarized here.

While the general feeling was that the use of <SP> is not mandatory, some more research into ALTO's history is required to determine the original authors' exact intentions.

An expansion of the <SP> tag with a width attribute has been identified among board members as a possibility to create more useful future applications for the <SP> tag.

If one really wants to be on the safe side, the quick solution right now would be to indeed include <SP> in the output of any ALTO export implementation as it is also straightforward to remove in post-processing.
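As a rough sketch of the "straightforward to remove in post-processing" point, the following strips every <SP> element from a parsed ALTO fragment using only the Python standard library. The sample line and the function names are assumptions made for illustration; real files declare an ALTO namespace, which the helper tolerates by comparing local tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free TextLine that already contains an SP element.
LINE = """<TextLine>
  <String CONTENT="Missing" HPOS="100" WIDTH="80"/>
  <SP HPOS="180" WIDTH="10"/>
  <String CONTENT="tags" HPOS="190" WIDTH="45"/>
</TextLine>"""


def local_name(element):
    """Return the tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def strip_sp(root):
    """Remove every SP element below root in place; return the count."""
    removed = 0
    # Snapshot the tree first so that removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child) == "SP":
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(LINE)
removed = strip_sp(root)
print(removed, [local_name(e) for e in root])
```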

Answer 18 · 2018-12-13T16:16:40.000Z

ALTO's history is required to determine the original authors' exact intentions.

As a note, most of the character-based classification systems common at the time ALTO was originally specified didn't treat whitespace as a proper glyph, i.e. whitespace is just something bordered by other glyphs and is never seen by the classifier as such. This at least explains the existence of a separate <SP> tag.

Answer 19 · 2018-12-13T16:30:39.000Z

Thank you, @cneud, @mittagessen and the ALTO board.

As the current DFG viewer expects the <SP> tags, I think that programs like ocr-transform should produce them, too. Pull request tesseract-ocr/tesseract#2117 adds the tags to Tesseract's new ALTO output, so that output is now compatible with the DFG viewer.

Answer 20 · 2019-12-30T13:56:31.000Z

The addition of the <SP> should be handled upstream in the corresponding transformation. Currently, we use hocr2alto and page2alto. We can keep this issue here open as a reminder.
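For reference, a transformation step that adds the tags might look roughly like the sketch below. This is not the code of hocr2alto, page2alto, or the Tesseract pull request. It assumes that every <String> carries HPOS, WIDTH and VPOS, and the choice to copy VPOS from the left neighbour is an arbitrary assumption, as are the sample line and the function names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free TextLine without SP elements.
LINE = """<TextLine>
  <String CONTENT="Missing" HPOS="100" VPOS="50" WIDTH="80" HEIGHT="20"/>
  <String CONTENT="tags" HPOS="190" VPOS="50" WIDTH="45" HEIGHT="20"/>
</TextLine>"""


def local_name(element):
    """Return the tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def add_sp(text_line):
    """Insert an SP between adjacent String children; return the count."""
    children = list(text_line)
    inserted = 0
    for left, right in zip(children, children[1:]):
        if local_name(left) != "String" or local_name(right) != "String":
            continue  # e.g. an SP or HYP is already in between
        gap_start = float(left.get("HPOS")) + float(left.get("WIDTH"))
        gap_width = float(right.get("HPOS")) - gap_start
        # Reuse the namespace prefix (if any) of the neighbouring String.
        sp = ET.Element(left.tag[: -len("String")] + "SP")
        sp.set("HPOS", f"{gap_start:g}")
        sp.set("WIDTH", f"{gap_width:g}")
        sp.set("VPOS", left.get("VPOS"))  # assumption: same top edge as left
        sp.tail = left.tail  # keep the original indentation
        text_line.insert(list(text_line).index(left) + 1, sp)
        inserted += 1
    return inserted


sp_line = ET.fromstring(LINE)
inserted = add_sp(sp_line)
print(ET.tostring(sp_line, encoding="unicode"))
```

Running add_sp a second time on the same line inserts nothing, because no two <String> elements are adjacent any more.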

Answer 21 · 2020-01-02T16:07:00.000Z

According to the ALTO XSD the SP tag is optional - minOccurs="0"

And I do not see a way to reliably calculate the HEIGHT/WIDTH/VPOS/HPOS attributes for the SP tag from the hOCR data.

IMHO, proper handling of the optional SP tag should be fixed in the DFG viewer.

Answer 22 · 2020-01-03T14:47:41.000Z

If the <SP> is not mandatory, we have to "ignore" it in the styles of the fulltext view and always insert a space after a <String>.

This is what I've done in the DFG-Viewer styles now. Please have a look at the current master of the DFG-Viewer at test.dfg-viewer.de.

Please compare the example from above in current master and in version 5.0 of DFG-Viewer and report change requests.
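The "always insert a space after a <String>" rule can be sketched outside of the viewer as well. The snippet below is not the DFG-Viewer implementation (which works in the fulltext styles); it is a small Python illustration with invented sample lines, showing that a line yields the same text whether or not <SP> is present. Hyphenation (<HYP>) is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free lines: one with SP, one without.
WITH_SP = '<TextLine><String CONTENT="Missing"/><SP/><String CONTENT="tags"/></TextLine>'
WITHOUT_SP = '<TextLine><String CONTENT="Missing"/><String CONTENT="tags"/></TextLine>'


def local_name(element):
    """Return the tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def line_text(text_line):
    """Join the CONTENT of all String children with one space, ignoring SP."""
    words = [e.get("CONTENT", "") for e in text_line if local_name(e) == "String"]
    return " ".join(words)


text_with_sp = line_text(ET.fromstring(WITH_SP))
text_without_sp = line_text(ET.fromstring(WITHOUT_SP))
print(text_with_sp == text_without_sp, repr(text_with_sp))
```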

Answer 23 · 2020-01-03T15:55:31.000Z

@albig IMHO the second one seems better from a user perspective: it is more readable and compact.

Answer 24 · 2020-01-06T15:12:20.000Z

@albig IMHO the spacing looks better now (in master), but the linebreaks seem a bit random...