Text Cleanup

Attempts to fix common errors in OCR-scanned text.

Known Issues: Assumes UTF-8 input, even for XML files whose declaration specifies a different encoding.
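To illustrate the known issue, here is a minimal sketch of how input could be decoded according to the encoding named in an XML declaration instead of always assuming UTF-8. It is not taken from this tool; the function names `declared_xml_encoding` and `decode_input` are hypothetical, and the approach only covers ASCII-compatible encodings (a UTF-16 file, for example, would need byte order mark detection first).

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration, e.g.
#   <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or `default`.

    Hypothetical helper: it only inspects the first 200 bytes and
    assumes the declaration itself is written in ASCII-compatible bytes.
    """
    # Skip a UTF-8 byte order mark so the declaration is at the start.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    match = _XML_DECL.match(data[:200])
    return match.group(1).decode("ascii") if match else default


def decode_input(data: bytes) -> str:
    """Decode `data` with its declared encoding, falling back to UTF-8."""
    encoding = declared_xml_encoding(data)
    try:
        return data.decode(encoding)
    except LookupError:
        # The declared label is not a codec Python knows about.
        return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'.encode("latin-1")
    print(declared_xml_encoding(sample))
    print(decode_input(sample))
```

Input without an XML declaration falls through to the UTF-8 default, so plain OCR text would be handled as before.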