Post-process PageXML files to improve the reading order of regions

PageXML Reading Order Recalculation

This is a simple rule-based script for recalculating the region reading order of a PageXML file. It is meant to post-process the output of layout recognition performed with Transkribus or Loghi. More specifically, it is designed to correctly order one- or two-page scans that contain marginalia.

This script was developed for Het Utrechts Archief within the context of an internship.

What It Does

  • Extract Features: Parses XML files to extract image information (height, width) as well as text regions and their coordinates.
  • Calculate Reading Order: Uses extracted features to calculate the region reading order.
  • Update Files: Saves the new reading order into the PageXMLs.
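
As an illustration of the first and last steps, here is a minimal sketch using only the standard library. It is not the code in reorder.py: the function names `extract_features` and `write_reading_order`, the group id `ro_1`, and the inline sample document are all hypothetical. The sketch assumes the usual PAGE layout, where `Page` carries `imageWidth` and `imageHeight`, each `TextRegion` has a `Coords` element with a `points` attribute, and the reading order is stored as `RegionRefIndexed` entries inside `ReadingOrder/OrderedGroup`.

```python
import xml.etree.ElementTree as ET

# Tiny hypothetical PageXML document, used only for demonstration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="scan.jpg" imageWidth="2000" imageHeight="1500">
    <TextRegion id="r1"><Coords points="100,100 900,100 900,700 100,700"/></TextRegion>
    <TextRegion id="r2"><Coords points="1100,120 1900,120 1900,800 1100,800"/></TextRegion>
  </Page>
</PcGts>"""


def namespace_of(root):
    """Return the '{uri}' prefix the file itself uses (empty if it has none)."""
    return root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""


def extract_features(root):
    """Return (image width, image height, list of region bounding boxes)."""
    ns = namespace_of(root)
    page = root.find(f"{ns}Page")
    width, height = int(page.get("imageWidth")), int(page.get("imageHeight"))
    regions = []
    for region in page.iter(f"{ns}TextRegion"):
        points = region.find(f"{ns}Coords").get("points").split()
        # Each point is "x,y"; reduce the polygon to its bounding box.
        xs, ys = zip(*(map(int, p.split(",")) for p in points))
        regions.append({
            "id": region.get("id"),
            "left": min(xs), "right": max(xs),
            "top": min(ys), "bottom": max(ys),
        })
    return width, height, regions


def write_reading_order(root, ordered_ids):
    """Replace the ReadingOrder element with one listing ordered_ids in sequence."""
    ns = namespace_of(root)
    page = root.find(f"{ns}Page")
    old = page.find(f"{ns}ReadingOrder")
    if old is not None:
        page.remove(old)
    reading_order = ET.Element(f"{ns}ReadingOrder")
    group = ET.SubElement(reading_order, f"{ns}OrderedGroup", id="ro_1")  # hypothetical id
    for index, region_id in enumerate(ordered_ids):
        ET.SubElement(group, f"{ns}RegionRefIndexed",
                      index=str(index), regionRef=region_id)
    # Simplification: inserted as the first child of Page; a schema-strict
    # writer would need to respect the element order the PAGE schema defines.
    page.insert(0, reading_order)


root = ET.fromstring(SAMPLE)
width, height, regions = extract_features(root)
write_reading_order(root, [r["id"] for r in regions])
```

Reading the namespace from the root tag, rather than hard-coding one schema version, is a design choice made here so that the same sketch works on files declaring different PAGE schema versions.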

Requirements

Besides the Python standard library, you only need numpy and pandas. You can install these dependencies using pip:

pip install numpy pandas

Usage

Batch Reading Order Recalculation of PageXML files

The script processes all XML files located in a directory. To execute it, install all dependencies first and then run the following:

python reorder.py example_folder/page --overwrite

As arguments, specify the base directory containing the PageXML files (here example_folder/page), and add --overwrite if you wish to overwrite the existing files.

How It Works

The script uses simple logic based on the geometric properties of the regions and the page.

For a given scan, the script proceeds as follows:

  1. Determine the orientation (landscape = two pages, portrait = one page) from the image’s height and width. Depending on the orientation, the location of the book fold is estimated:
    • at the horizontal centre of the scan for landscape orientation
    • at the left edge (x = 0) for portrait orientation
  2. Each region is assigned 0 (left page) or 1 (right page), depending on where its own horizontal centre is located.
  3. The regions are ordered:
    • Left page to right page
      • Top to bottom
        • Left to right
  4. The script then iterates through all regions in this initial order, comparing each box with the one immediately following it in the ranking. It checks whether the following box (the candidate) might be a marginalium: the two boxes must be located on the same page, and the candidate must be vertically contained within the current box.
  5. It then confirms that the candidate is located to the left or right of the current box. The left edges are compared for the left condition and the right edges for the right condition, so overlapping boxes are handled correctly.
  6. If all these conditions apply, the ranks (indices) of the two boxes are swapped.
  7. If a swap occurs, the loop breaks and restarts with the new order. This is repeated until a full pass produces no more swaps, at which point the final reading order has been reached.
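
The steps above can be sketched in plain Python as follows. This is one reading of the description, not the code in reorder.py (which relies on numpy and pandas): the function names and the dictionary keys `left`, `right`, `top` and `bottom` are hypothetical, and boxes are assumed to be axis-aligned bounding boxes. One assumption is added on top of the description: the candidate must be strictly shorter than the current box, so that two boxes with identical vertical extent cannot swap back and forth indefinitely.

```python
def is_marginalium(cur, cand):
    """Check whether cand looks like a marginal note next to cur."""
    same_page = cur["page"] == cand["page"]
    # The candidate's vertical extent lies inside the current box's extent.
    contained = cand["top"] >= cur["top"] and cand["bottom"] <= cur["bottom"]
    # Added assumption: strict height difference, which guarantees termination.
    shorter = (cand["bottom"] - cand["top"]) < (cur["bottom"] - cur["top"])
    # Left edges decide "left of", right edges decide "right of".
    beside = cand["left"] < cur["left"] or cand["right"] > cur["right"]
    return same_page and contained and shorter and beside


def reading_order(width, height, regions):
    """Return region ids in the recalculated reading order."""
    # Landscape: two pages, fold at the centre. Portrait: one page, fold at x = 0.
    fold = width / 2 if width > height else 0
    boxes = []
    for r in regions:
        centre = (r["left"] + r["right"]) / 2
        # 0 = left page, 1 = right page.
        boxes.append({**r, "page": 1 if centre >= fold else 0})

    # Initial order: left page before right page, top to bottom, left to right.
    order = sorted(boxes, key=lambda b: (b["page"], b["top"], b["left"]))

    swapped = True
    while swapped:
        swapped = False
        for i in range(len(order) - 1):
            if is_marginalium(order[i], order[i + 1]):
                order[i], order[i + 1] = order[i + 1], order[i]
                swapped = True
                break  # restart the scan with the new order
    return [b["id"] for b in order]


# Hypothetical two-page scan: one main block and one marginal note per page.
regions = [
    {"id": "main_left", "left": 300, "right": 900, "top": 100, "bottom": 1200},
    {"id": "note_left", "left": 50, "right": 250, "top": 400, "bottom": 600},
    {"id": "main_right", "left": 1100, "right": 1700, "top": 100, "bottom": 1200},
    {"id": "note_right", "left": 1750, "right": 1950, "top": 300, "bottom": 500},
]
print(reading_order(2000, 1500, regions))
```

For this input the initial order is main_left, note_left, main_right, note_right; each note is then swapped in front of the main block that vertically contains it, giving note_left, main_left, note_right, main_right.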

Visualisation

You can visualise the calculated reading order path by running visualise.py with your base directory as its argument:

python visualise.py example_folder

Side-by-side comparisons of input images and visualised results can be found in the example_folder. The scans were processed using Loghi.

License

This project is licensed under the MIT License - see the LICENSE file for details.