benwbrum/fromthepage

AI text generation should create valid XML even if ALTO has angle brackets


One of the USDA's pages generates AI text with invalid XML markup. The cause appears to be Transkribus, which creates strange <INS> and <GAP> tags in the ALTO during HTR generation. When these are inserted into the transcription field unescaped, they result in invalid XML.

We should XML-escape the text extracted from the ALTO before inserting it into the transcription field.
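
To illustrate the kind of escaping meant here, a minimal sketch in Python. It is not FromThePage's code: the sample ALTO fragment and the `alto_to_escaped_text` and `local_name` helpers are hypothetical, and it assumes the stray tags arrive as literal text inside the `CONTENT` attributes of ALTO `String` elements.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical ALTO fragment: the HTR output stored literal "<INS>" and "<GAP>"
# inside CONTENT attributes (entity-escaped there, as well-formed ALTO requires).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Received"/><SP/><String CONTENT="&lt;INS&gt;"/><SP/><String CONTENT="seed"/>
    </TextLine>
    <TextLine>
      <String CONTENT="lot"/><SP/><String CONTENT="&lt;GAP&gt;"/><SP/><String CONTENT="R&amp;D"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_to_escaped_text(alto_xml):
    """Join each TextLine's words and escape &, < and > in the result."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # The parser has already decoded the attribute, so "&lt;INS&gt;"
        # comes back as the raw string "<INS>".
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        # Re-escape before the text goes into an XML transcription field.
        lines.append(escape(" ".join(words)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(alto_to_escaped_text(SAMPLE_ALTO))
```

With escaping applied, the bracketed text survives as character data (`&lt;INS&gt;`) instead of being read as an unclosed element, so the transcription stays well-formed.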