Awesome Historical Newspaper Analysis Tools and Literature

A curated list of awesome tools and literature for historical newspaper analysis, including data standardization, optical character recognition, document layout analysis, text enrichment, semantic segmentation, quality evaluation and natural language analysis.

Tools

Data standards

hOCR

hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information.

  • hocr-tools - Tools for manipulating hOCR files and evaluating OCR quality.

  • hocrjs - Visualization of hOCR files.

  • PAGEviewer - Visualization of page layout and OCR segmentation for PAGE XML, ALTO XML, FineReader XML and hOCR.
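
As a rough illustration of the hOCR description above: hOCR embeds OCR results in HTML, with recognized words typically marked as elements of class `ocrx_word` whose `title` attribute carries properties such as `bbox` and `x_wconf`. The sketch below reads those with the Python standard library only. The sample markup and the names `HocrWordParser`, `parse_title` and `extract_words` are made up for this example, and it assumes word elements are not nested inside one another.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Parse an hOCR title string such as 'bbox 10 20 110 50; x_wconf 93'."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        name, values = fields[0], fields[1:]
        if name == "bbox" and len(values) == 4:
            props["bbox"] = tuple(int(v) for v in values)
        elif name == "x_wconf" and values:
            props["x_wconf"] = float(values[0])
    return props


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence of each ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None      # word currently being read, if any
        self._current_tag = None  # tag name that opened the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._current is None and "ocrx_word" in classes:
            props = parse_title(attrs.get("title") or "")
            self._current = {
                "text": "",
                "bbox": props.get("bbox"),
                "conf": props.get("x_wconf"),
            }
            self._current_tag = tag

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumption: a word element contains no nested element of the same tag.
        if self._current is not None and tag == self._current_tag:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None
            self._current_tag = None


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment, for illustration only.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 10 20 300 50'>
    <span class='ocrx_word' title='bbox 10 20 110 50; x_wconf 93'>Daily</span>
    <span class='ocrx_word' title='bbox 120 20 300 50; x_wconf 71'>Gazette</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for word in extract_words(SAMPLE):
        print(word["text"], word["bbox"], word["conf"])
```

For real files, the tools listed above are likely a better starting point; this sketch only shows where the layout and confidence information lives in the markup.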

Optical character recognition

  • Tesseract OCR engine - Open-source C++ API and command-line tool. Provides basic layout analysis.

  • Ocrad - The GNU OCR.

Document layout analysis, text enrichment and semantic segmentation

Quality evaluation

  • Aletheia - Ground truth annotation tool.
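
Ground truth transcriptions are commonly compared against OCR output with a character error rate (CER): the edit distance between the two strings divided by the length of the ground truth. The following is a minimal sketch of that generic metric in plain Python. It is not taken from Aletheia or any other tool in this list, and the function names are chosen for this example.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth, ocr_output):
    """CER = edit distance / number of ground truth characters."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # One substituted character out of seven.
    print(character_error_rate("Gazette", "Gazctte"))
```

Note that CER can exceed 1.0 when the OCR output is much longer than the ground truth.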

Text analysis

  • scikit-learn - Well-documented general purpose ML library.

  • TidyText - Manipulation of text data (R package; comparable operations are straightforward in Python with pandas).

  • LdaSeqModel - Dynamic topic modeling in Python.

Literature