UB-Mannheim/ocr-fileformat

TEI support?

Opened this issue · 14 comments

Question from the workshop: Can we also add transformation to/from TEI?

My first impression was that the TEI format is normally used in different applications. But I learned that it is also possible to add x-y coordinates of boxes in TEI. I haven't looked deeper into whether this is a suitable feature request...

I found an ALTO2TEI XSLT here: https://github.com/collex/typewright/blob/master/lib/saxon/AltoToTeiA.xsl (some fields are hardcoded for this project, and they mention another style sheet on which theirs is based).

Also http://able.myspecies.info/abbyy-xml-tei-xml (looks a little special at first glance...)

kba commented

TEI is quite a big standard, lots of different flavors, so there are probably a lot of ways to implement it.

It depends on what you want to achieve. If the primary goal is the transformation from TEI to ALTO for use in the DFG viewer, that reduces the complexity a lot because much data can simply be ignored.

We don't have any use case for this at the moment. Maybe we can just leave the issue open here and collect more information and any existing implementations whose code could be reused. BTW, I don't think the technical implementation would be difficult; the harder part is reading and understanding the format descriptions and testing with good examples.

There are a lot of transformation tools for TEI here: https://github.com/TEIC/Stylesheets but neither ALTO nor ABBYY is among them.

kba commented

Yes, let's keep this open and target the DFG viewer; that seems feasible.

cneud commented

See also this service which can convert various formats including ALTO to TEI: https://github.com/INL/OpenConvert

Thank you @kba, that looks interesting as well! Let me know when anyone wants to work on integrating any of these transformations into ocr-fileformat.

We don't have any use case for this at the moment.

Now we have a use case. We must convert 64833 TEI files (like this one) to ALTO for Kitodo Presentation / DFG Viewer.

A first attempt at writing an XSLT can be found here, but although it produces valid hOCR, the subsequent transformation to ALTO is not successful (most likely due to the lack of ocr_line elements in the hOCR file). I guess it would be possible to extract the document's line structure from jumps in the top-left coordinate of the words in a paragraph, but I don't see an easy way to do this in XSLT. So maybe there will be a Python script eventually...
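To make the line-detection idea concrete, here is a minimal Python sketch of that heuristic. It is not the script from this thread: the `Word` class, the `group_into_lines` and `line_bbox` helpers, and the `y_tolerance` threshold are all hypothetical names and values chosen for illustration. It assumes the words of a paragraph arrive in reading order with pixel coordinates, and it starts a new line whenever the top coordinate jumps by more than a fraction of the previous word's height or the left coordinate moves back to the left.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """Hypothetical word box: text plus top-left corner and size in pixels."""
    text: str
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def group_into_lines(words: List[Word], y_tolerance: float = 0.5) -> List[List[Word]]:
    """Split words (assumed to be in reading order) into lines.

    A new line starts when the top coordinate differs from the previous
    word's by more than y_tolerance * previous height, or when the left
    coordinate jumps back to the left of the previous word.
    The threshold is an assumption and will need tuning on real data.
    """
    lines: List[List[Word]] = []
    for word in words:
        if not lines:
            lines.append([word])
            continue
        prev = lines[-1][-1]
        jumped_down = abs(word.y - prev.y) > y_tolerance * prev.h
        jumped_back = word.x < prev.x
        if jumped_down or jumped_back:
            lines.append([word])
        else:
            lines[-1].append(word)
    return lines


def line_bbox(line: List[Word]) -> Tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1) enclosing all words of one line."""
    x0 = min(w.x for w in line)
    y0 = min(w.y for w in line)
    x1 = max(w.x + w.w for w in line)
    y1 = max(w.y + w.h for w in line)
    return x0, y0, x1, y1
```

The resulting groups and their bounding boxes could then be written out as ocr_line elements around the words before the hOCR to ALTO step. Skewed scans or multi-column paragraphs would likely need a more robust criterion than this simple threshold.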

Nice! @jmechnich Can you create a PR? Then it is easier to discuss this further. But I am quite happy with such an XSLT transformation, even if there are no ocr_line elements (AFAIK they are also missing in your TEI file).

Several years later... 😏

Hi all, is this still an open issue, given that the PR has been merged without further discussion?