ALTO whitespace handling is inconsistent
jbaiter opened this issue · 1 comment
jbaiter commented
When an ALTO file does not explicitly denote whitespace with <SP>, the text for the whole snippet does not include whitespace, while the text for each individual region does:
{
  "text": "DieZahlderer,welchejeneSchreckens: zeitmitAugenſahen,inwelcher<em>Zittau</em>, <em>im</em>GefolgedesſiebenjährigenKrieges,den 23.Juli1757,aufdieſchre>li<ſteArt zerſtörtward,kannzwarnurnochklein",
  "score": 662.4285,
  "pages": [
    {
      "id": "p00000001",
      "width": 1269,
      "height": 1947
    }
  ],
  "regions": [
    {
      "ulx": 141,
      "uly": 720,
      "lrx": 989,
      "lry": 984,
      "text": "Die Zahl derer, welche jene Schreckens: zeit mit Augen ſahen, in welcher <em>Zittau</em>, <em>im</em> Gefolge des ſiebenjährigen Krieges, den 23. Juli 1757, auf die ſchre>li<ſte Art zerſtört ward, kann zwar nur noch klein",
      "pageIdx": 0
    }
  ],
  // ...
}
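For illustration, the difference between the two "text" values can be reproduced outside the plugin with a small Python sketch. This is not the plugin's actual code. The ALTO fragment, the namespace version, and the names `naive_text` and `line_text` are assumptions made up for this example. It only shows one plausible way to treat adjacent <String> elements as separate words when no <SP> sits between them:

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment: a TextLine whose String elements are NOT
# separated by explicit <SP/> elements (the case described in this issue).
ALTO_LINE = """
<TextLine xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <String CONTENT="Die"/>
  <String CONTENT="Zahl"/>
  <String CONTENT="derer,"/>
</TextLine>
""".strip()


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def naive_text(line: ET.Element) -> str:
    """Concatenate String contents and emit spaces only for <SP> elements."""
    parts = []
    for child in line:
        name = local_name(child.tag)
        if name == "String":
            parts.append(child.get("CONTENT", ""))
        elif name == "SP":
            parts.append(" ")
    return "".join(parts)


def line_text(line: ET.Element) -> str:
    """Like naive_text, but insert a space between two directly adjacent
    String elements when the file has no <SP> between them."""
    parts = []
    prev_was_string = False
    for child in line:
        name = local_name(child.tag)
        if name == "String":
            if prev_was_string:
                parts.append(" ")  # implicit whitespace
            parts.append(child.get("CONTENT", ""))
            prev_was_string = True
        elif name == "SP":
            parts.append(" ")  # explicit whitespace
            prev_was_string = False
        else:
            # Other children (e.g. HYP) are skipped in this sketch.
            prev_was_string = False
    return "".join(parts)


line = ET.fromstring(ALTO_LINE)
print(naive_text(line))  # DieZahlderer,
print(line_text(line))   # Die Zahl derer,
```

If a file does contain explicit <SP> elements, `line_text` does not add a second space, so both styles of ALTO yield the same output.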
Thanks to @ulb-sa-schmilj for reporting!
jbaiter commented
Couldn't reproduce this with the most recent version, closing until further notice.