Laypa: A Novel Framework for Applying Segmentation Networks to Historical Documents
HIP'23 paper: https://doi.org/10.1145/3604951.3605520
ArXiv paper: Coming soon!
Part of the Loghi pipeline
Laypa is a segmentation network whose goal is to find regions (paragraph, page number, etc.) and baselines in documents. The current approach uses a ResNet backbone and a feature pyramid head, which make pixel-wise classifications. The models are built using the detectron2 framework. The baseline and region classifications are then made available for further processing. This post-processing turns the classifications into instances, so that they can be used by other programs (OCR/HTR), either as masks or directly as pageXML.
Developed using the following software and hardware:
Operating System | Python | PyTorch | Cudatoolkit | GPU | CPU | Success |
---|---|---|---|---|---|---|
Ubuntu 22.04.4 LTS (Linux-6.5.0-28-generic-x86_64-with-glibc2.35) | 3.12.3 | 2.3.0 | 12.1 | NVIDIA GeForce RTX 3080 Ti Laptop GPU | 12th Gen Intel(R) Core(TM) i9-12900H | ✅ |
More coming soon
Run utils/collect_env_info.py to retrieve your environment information, and add it via pull request.
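As a rough illustration of where some of the table columns come from, the sketch below gathers the operating system, Python, and CPU entries with the standard library only. It is not the content of utils/collect_env_info.py; the function name `collect_basic_env_info` is hypothetical, and the PyTorch, Cudatoolkit, and GPU columns are left out because they would require PyTorch to be installed.

```python
import platform


def collect_basic_env_info() -> dict[str, str]:
    """Gather a few of the table columns using only the standard library."""
    return {
        # e.g. "Linux-6.5.0-28-generic-x86_64-with-glibc2.35" on the machine in the table
        "Operating System": platform.platform(),
        "Python": platform.python_version(),
        # platform.processor() can be empty on some systems
        "CPU": platform.processor() or "unknown",
    }


if __name__ == "__main__":
    for column, value in collect_basic_env_info().items():
        print(f"{column}: {value}")
```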
The recommended way of running Laypa is inside a conda environment. For easier compatibility, a method for building a Docker image is also provided.
To start, clone the GitHub repo to your local machine using either HTTPS:
git clone https://github.com/stefanklut/laypa.git
Or using SSH:
git clone git@github.com:stefanklut/laypa.git
And make laypa the working directory:
cd laypa
If not already installed, install either conda or miniconda (install instructions), or mamba (install instructions).
The required packages are listed in the environment.yml file. The environment can be created automatically using the following commands.
Using conda/miniconda:
conda env create -f environment.yml
Using mamba:
mamba env create -f environment.yml
When running Laypa, always activate the conda environment:
conda activate laypa
If not already installed, install the Docker Engine (install instructions). The docker image can most easily be built with the provided script.
Laypa now has a release on Docker Hub. Using the image name loghi/docker.laypa should pull the corresponding Laypa docker directly from Docker Hub. If this fails for some reason, it can be pulled manually from here. If it is outdated or requires changes to the source code, please try the Manual Installation.
Building the docker using the provided script:
./buildImage.sh <PATH_TO_LAYPA>
Or use the multistage build, which leaves out some profiler tools (the resulting image might be smaller):
./buildImage.multistage.sh <PATH_TO_LAYPA>
Manual docker install instructions (not recommended):
First copy the Laypa directory to a temporary docker directory:
tmp_dir=$(mktemp -d)
cp -r -T <PATH_TO_LAYPA> $tmp_dir/laypa
cp Dockerfile $tmp_dir/Dockerfile
cp _entrypoint.sh $tmp_dir/_entrypoint.sh
cp .dockerignore $tmp_dir/.dockerignore
Then build the docker image using the following command:
docker build -t loghi/docker.laypa $tmp_dir
Minikube install instructions:
Minikube is a local Kubernetes, allowing you to test the Laypa tools in a Kubernetes environment. If not already installed, start by installing minikube (install instructions).
If the docker images have already been built, minikube can run them straight away. To do so, start minikube without any special arguments:
minikube start
Afterwards the docker for Laypa can be added to the running minikube instance using the following command (assuming the Laypa docker was built under the name loghi/docker.laypa):
minikube image load loghi/docker.laypa
It is also possible to build the Laypa docker using the minikube docker instance. This means minikube will need access to the Laypa code. As it stands, this is currently still done using a copy command from the local storage. To do so, start minikube with the mount argument:
minikube start --mount
This will make the machine's file system available to minikube. Then ssh into the running minikube:
minikube ssh
Within the minikube ssh session, go to the location of Laypa. The host directory /home/<user> is mounted at minikube-host:
cd minikube-host/<PATH_TO_LAYPA>
And follow the instructions for installing a docker version of Laypa as described here.
When successful, the docker image should be available under the name loghi/docker.laypa. This can be verified using the following command:
docker image ls
Then check whether loghi/docker.laypa is present in the list of built images.
Some initial pretrained models can be found here.
The dataset used for training requires images combined with ground truth pageXML. The pageXML files need to be inside a page directory one level down from the images. The dataset can be split over multiple directories, with the image paths specified in a .txt file. The structure should look as follows:
training_data
├── page
│ ├── image1.xml
│ ├── image2.xml
│ ├── image3.xml
│ └── ...
├── image1.jpg
├── image2.jpg
├── image3.jpg
└── ...
Here the image and pageXML filename stems should match (image1.jpg <-> image1.xml). For the .txt based dataset, absolute paths to the images are recommended. The structure for the data used as validation is the same as that for training.
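Before training, it can help to check that every image has a matching pageXML file. The following is a minimal sketch of such a check, assuming the layout shown above; the helper name `find_missing_page_xml` and the set of image extensions are illustrative choices, not part of Laypa.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the files in your dataset
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def find_missing_page_xml(data_dir: Path) -> list[Path]:
    """Return the images in data_dir that have no page/<stem>.xml counterpart."""
    missing = []
    for image_path in sorted(data_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skips the page directory and any non-image files
        xml_path = data_dir / "page" / f"{image_path.stem}.xml"
        if not xml_path.is_file():
            missing.append(image_path)
    return missing
```

Calling `find_missing_page_xml(Path("training_data"))` returns an empty list when every image stem has a matching XML file.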
When running inference, the images you want processed should be in a single directory, with the images directly under the root folder as follows:
inference_data
├── image1.jpg
├── image2.jpg
├── image3.jpg
└── ...
Some datasets that should work with Laypa are listed below; some preprocessing may be required:
Three things are required to train a model using main.py.
- A config file. See configs/segmentation for examples of config files and their contents.
- Ground truth training data in the form of images and their corresponding pageXML.
- Ground truth validation data in the same form.
The training/validation data can be provided by giving either a .txt file containing image paths, the image paths themselves, or the path of a directory containing the images.
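A .txt file of image paths can be generated with a few lines of Python. The sketch below assumes the file holds one absolute image path per line, which is an assumption about the expected format; the function name `write_image_list` and the default extensions are hypothetical.

```python
from pathlib import Path


def write_image_list(image_dirs: list[Path], txt_path: Path, suffixes=(".jpg", ".png")) -> int:
    """Write the absolute path of every image found directly in image_dirs, one per line."""
    image_paths = [
        path.resolve()
        for image_dir in image_dirs
        for path in sorted(image_dir.iterdir())
        if path.suffix.lower() in suffixes
    ]
    txt_path.write_text("".join(f"{path}\n" for path in image_paths), encoding="utf-8")
    return len(image_paths)  # number of image paths written
```

This makes it possible to combine several data directories into a single file that is passed to the train or val argument.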
Required arguments:
python main.py \
-c/--config <CONFIG> \
-t/--train <TRAIN [TRAIN ...]> \
-v/--val <VAL [VAL ...]>
Optional arguments:
python main.py \
-c/--config CONFIG \
-t/--train TRAIN [TRAIN ...] \
-v/--val VAL [VAL ...] \
[--tmp_dir TMP_DIR] \
[--keep_tmp_dir] \
[--num-gpus NUM_GPUS] \
[--num-machines NUM_MACHINES] \
[--machine-rank MACHINE_RANK] \
[--dist-url DIST_URL] \
[--opts OPTS [OPTS ...]]
The optional arguments are shown using square brackets. The --tmp_dir parameter specifies a folder in which to store temporary files, while the --keep_tmp_dir parameter prevents the temporary files from being deleted after a run (mostly for debugging).
The remaining arguments are all for training with multiple GPUs or on multiple nodes. --num-gpus specifies the number of GPUs per machine. --num-machines specifies the number of nodes in the network. --machine-rank gives a node a unique number. --dist-url is the URL for the PyTorch distributed backend. The final parameter --opts allows you to change values specified in the config files. For example, --opts SOLVER.IMS_PER_BATCH 8 sets the batch size to 8.
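To illustrate what such an override does conceptually, the sketch below applies alternating KEY VALUE pairs to a nested dictionary. It is a plain-Python illustration only: Laypa builds on detectron2, which has its own config object for this, and the function name `apply_overrides` and the example config are hypothetical.

```python
import ast


def apply_overrides(config: dict, opts: list[str]) -> dict:
    """Apply alternating KEY VALUE pairs such as ["SOLVER.IMS_PER_BATCH", "8"] in place."""
    if len(opts) % 2 != 0:
        raise ValueError("--opts expects KEY VALUE pairs")
    for dotted_key, raw_value in zip(opts[0::2], opts[1::2]):
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node[name]  # raises KeyError if the section does not exist
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        try:
            node[leaf] = ast.literal_eval(raw_value)  # "8" becomes the integer 8
        except (ValueError, SyntaxError):
            node[leaf] = raw_value  # keep plain strings such as file paths
    return config


config = {"SOLVER": {"IMS_PER_BATCH": 2}, "MODEL": {"WEIGHTS": ""}}
apply_overrides(config, ["SOLVER.IMS_PER_BATCH", "8"])
print(config["SOLVER"]["IMS_PER_BATCH"])  # 8
```

Rejecting unknown keys, as done here, catches typos in the key name early instead of silently adding a new setting.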
As indicated by the trailing dots, multiple training sets can be passed to the training model at once. This can also be done by using the train argument multiple times. The .txt files can also be mixed with the directories. For example:
# Pass multiple directories at once
python main.py -c config.yml -t data/training_dir1 data/training_dir2 -v data/validation_set
# Pass multiple directories with multiple arguments
python main.py -c config.yml -t data/training_dir1 -t data/training_dir2 -v data/validation_set
# Mix training directory with txt file
python main.py -c config.yml -t data/training_dir -t data/training_file.txt -v data/validation_set
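Both calling styles can be supported by a single argparse definition. The sketch below shows one way to get this behavior with `nargs="+"` and `action="extend"` (available since Python 3.8); whether main.py is implemented this way is an assumption, and the parser here is only a hypothetical stand-in.

```python
import argparse

parser = argparse.ArgumentParser(description="Hypothetical stand-in for the training CLI")
parser.add_argument("-c", "--config", required=True)
# nargs="+" accepts several values per flag; action="extend" merges repeated flags
parser.add_argument("-t", "--train", nargs="+", action="extend", required=True)
parser.add_argument("-v", "--val", nargs="+", action="extend", required=True)

args = parser.parse_args(
    ["-c", "config.yml", "-t", "data/training_dir", "-t", "data/training_file.txt", "-v", "data/validation_set"]
)
print(args.train)  # ['data/training_dir', 'data/training_file.txt']
```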
To run the trained model on images without ground truth, the images need to be in a single directory. The output consists of either pageXML in the case of regions or a mask in the other cases. This mask can then be processed using other tools to turn the pixel predictions into valid pageXML (for example for baselines). As stated, the regions are turned into polygons for the pageXML within the program already.
How to run the Laypa inference individually is explained first; how to run it with the full scripts that include the conversion from images to pageXML comes after.
To just run the Laypa inference in run.py, you need three things:
- A config file. See configs/segmentation for examples of config files and their contents.
- The data, which can be provided by giving either a .txt file containing image paths, the image paths themselves, or the path of a directory containing the images.
- A location to which the processed files can be written. The directory will be created if it does not exist yet.
Required arguments:
python run.py \
-c/--config CONFIG \
-i/--input INPUT \
-o/--output OUTPUT
Optional arguments:
python run.py \
-c/--config CONFIG \
-i/--input INPUT \
-o/--output OUTPUT \
[--opts OPTS [OPTS ...]]
The optional arguments are shown using square brackets. The final parameter --opts allows you to change values specified in the config files. For example, --opts SOLVER.IMS_PER_BATCH 8 sets the batch size to 8.
List values have to be overridden by enclosing the whole list in quotes, like --opts PREPROCESS.REGION.RECTANGLE_REGIONS '["Photo"]'
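The reason for the outer quotes is shell word splitting: without them, the shell strips the inner double quotes before the program sees the value. The sketch below uses `shlex.split`, which follows POSIX shell quoting rules, to show what arrives in each case. Parsing the value with `ast.literal_eval` is only an illustration and not necessarily how Laypa reads it.

```python
import ast
import shlex

quoted = """--opts PREPROCESS.REGION.RECTANGLE_REGIONS '["Photo"]'"""
unquoted = """--opts PREPROCESS.REGION.RECTANGLE_REGIONS ["Photo"]"""

# With single quotes the list arrives intact as one argument
quoted_value = shlex.split(quoted)[-1]
print(quoted_value)                    # ["Photo"]
print(ast.literal_eval(quoted_value))  # ['Photo']

# Without them the inner double quotes are consumed by the shell
unquoted_value = shlex.split(unquoted)[-1]
print(unquoted_value)                  # [Photo]
```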
An example of how to call the run.py command is given below:
python run.py -c config.yml -i data/inference_dir -o results_dir
Examples of running the full pipeline (with processing of baselines) are present in the scripts directory. These files assume that the docker images for both Laypa and the loghi-tooling (Java post-processing) are available on your machine. The script will also try to verify this. The Laypa docker image needs to be built with the pretrained models included.
To run the scripts, only two things are needed:
- A directory with images to be processed.
- A location to which the processed files can be written. The directory will be created if it does not exist yet.
Required arguments:
./scripts/pipeline.sh <input> <output>
Optional arguments:
./scripts/pipeline.sh \
<input> \
<output> \
-g/--gpu GPU
The required arguments are shown using angle brackets. The --gpu parameter specifies which GPU(s) are accessible to the docker containers. The default is all.
The positional arguments input and output refer to the input and output directory. An example of running one of the pipelines is shown below:
./scripts/pipeline.sh inference_dir results_dir
The Flask Server is set up to run the inference code in a Kubernetes environment. To run the Flask API, run the start_flask.sh application with the environment variables set. These can generally be set when running a docker container, which can set the environment variables beforehand depending on the docker internal file structure. To quickly test locally, you can run the start_flask_local.sh application, which sets the environment variables at runtime.
The Flask server will run on port 5000 and can be called from outside using a curl command. When testing on localhost, the command will look as follows:
curl -X POST -F image=@<PATH_TO_IMAGE> -F identifier=<identifier> -F model=<MODEL_FOLDER_NAME> 'http://localhost:5000/predict'
The required form information is the image that should be processed (image), an identifier to differentiate multiple runs/tests (identifier), and which config and weights to use (model). The config and weights are saved in a folder, and the name of this folder is what needs to be provided. In this folder, the config should be named config.yml and the weight file should end in .pth.
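The same request can also be sent from Python. The sketch below builds the multipart form by hand with the standard library, using the three form fields and the localhost URL from the curl command above. The function names `build_multipart` and `predict` are hypothetical, and since the response format is not described here, the raw status and body are returned as-is.

```python
import mimetypes
import uuid
from pathlib import Path
from urllib import request


def build_multipart(fields: dict[str, str], image_path: Path) -> tuple[bytes, str]:
    """Encode text fields plus one image file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    lines: list[bytes] = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    mime = mimetypes.guess_type(image_path.name)[0] or "application/octet-stream"
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="image"; filename="{image_path.name}"'.encode(),
        f"Content-Type: {mime}".encode(),
        b"",
        image_path.read_bytes(),
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def predict(image_path: str, identifier: str, model: str, url: str = "http://localhost:5000/predict"):
    """POST the image to the predict endpoint and return (status code, raw response body)."""
    body, content_type = build_multipart({"identifier": identifier, "model": model}, Path(image_path))
    req = request.Request(url, data=body, headers={"Content-Type": content_type}, method="POST")
    with request.urlopen(req) as response:
        return response.status, response.read()
```

With the server running locally, `predict("image1.jpg", "test-run", "<MODEL_FOLDER_NAME>")` sends the same form data as the curl command.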
For a small tutorial using some concrete examples, see the tutorial directory.
The Laypa repository also contains a few tools used to evaluate the results generated by the model.
The first tool is a visual comparison between the predictions of the model and the ground truth. This is done as an overlay of the classes over the original image. The overlay class names and colors are taken from the dataset catalog. The tool to do this is visualization.py. The visualization has almost the same arguments as the training command (main.py).
Required arguments:
python tooling/visualization.py \
-c/--config CONFIG \
-i/--input INPUT [INPUT ...]
Optional arguments:
python tooling/visualization.py \
-c/--config CONFIG \
-i/--input INPUT [INPUT ...] \
[-o/--output OUTPUT] \
[--tmp_dir TMP_DIR] \
[--keep_tmp_dir] \
[--opts OPTS [OPTS ...]] \
[--sorted] \
[--save SAVE]
The optional arguments are shown using square brackets. The -o/--output parameter specifies the output directory for the visualization masks. The --tmp_dir parameter specifies a folder in which to store temporary files, while the --keep_tmp_dir parameter prevents the temporary files from being deleted after a run (mostly for debugging). The --opts parameter allows you to change values specified in the config files. For example, --opts SOLVER.IMS_PER_BATCH 8 sets the batch size to 8. The --sorted parameter sorts the images based on the order in the operating system. The --save parameter specifies what type of file the visualization should be saved as. The options are "pred" for the prediction, "gt" for the ground truth, "both" for both the prediction and the ground truth and "all" for all of the previous. If just --save is given the default is "all".
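A flag that takes an optional value, like --save, can be expressed in argparse with `nargs="?"` and `const`. The sketch below reproduces the described behavior under the assumption that the flag is defined this way; the parser is a hypothetical stand-in, not the actual visualization.py code.

```python
import argparse

parser = argparse.ArgumentParser(description="Hypothetical stand-in for the --save flag")
# nargs="?" makes the value optional; const is used when the flag is given without a value
parser.add_argument("--save", nargs="?", const="all", default=None, choices=["pred", "gt", "both", "all"])

print(parser.parse_args([]).save)                # None (flag not given)
print(parser.parse_args(["--save"]).save)        # all
print(parser.parse_args(["--save", "gt"]).save)  # gt
```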
Example of running visualization.py:
python tooling/visualization.py -c config.yml -i input_dir
visualization.py will then open a window with the prediction and the ground truth side by side (if the ground truth exists), allowing for easier comparison. The visualization masks are created in the same way the preprocessing converts pageXML to masks.
The second tool, validation.py, is used to get the validation scores of a model. This is done by comparing the prediction of the model to the ground truth. The validation scores are the Intersection over Union (IoU) and Accuracy (Acc) scores. The tool requires an input directory (--input) that contains a page folder. The page folder should contain the XML files with the ground truth baselines/regions. To run the validation tool use the following command:
Required arguments:
python tooling/validation.py \
-c/--config CONFIG \
-i/--input INPUT
Optional arguments:
```sh
python tooling/validation.py \
    -c/--config CONFIG \
    -i/--input INPUT \
    [--opts OPTS [OPTS ...]]
```
The optional arguments are shown using square brackets. The final parameter --opts allows you to change values specified in the config files. For example, --opts MODEL.WEIGHTS <PATH_TO_WEIGHTS> sets the path to the weights file. This needs to be done if the weights are not in the config file. Without MODEL.WEIGHTS the weights are taken from the config file. If the weights are not in the config file and not specified with MODEL.WEIGHTS, the program will return results for an untrained model.
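For readers unfamiliar with the two scores, the sketch below computes them for a single class on two small masks. It assumes the common semantic segmentation definitions: IoU is the overlap divided by the union, and per-class accuracy is the number of correctly predicted pixels divided by the number of ground truth pixels of that class. The exact computation in validation.py may differ, for example in how classes are averaged, and the function name `iou_and_accuracy` is hypothetical.

```python
def iou_and_accuracy(prediction, ground_truth, class_id):
    """Per-class IoU and accuracy for two equally sized 2D masks of class ids."""
    intersection = union = gt_pixels = 0
    for pred_row, gt_row in zip(prediction, ground_truth):
        for pred, gt in zip(pred_row, gt_row):
            in_pred, in_gt = pred == class_id, gt == class_id
            intersection += in_pred and in_gt  # pixel has the class in both masks
            union += in_pred or in_gt          # pixel has the class in at least one mask
            gt_pixels += in_gt
    iou = intersection / union if union else float("nan")
    accuracy = intersection / gt_pixels if gt_pixels else float("nan")
    return iou, accuracy


prediction = [[0, 1, 1], [0, 1, 0]]
ground_truth = [[0, 1, 0], [0, 1, 1]]
iou, accuracy = iou_and_accuracy(prediction, ground_truth, class_id=1)
print(iou)  # 0.5
```

Under these definitions IoU is symmetric in the two masks, while accuracy depends on which mask is treated as the ground truth.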
The third tool is a program to compare the similarity of two sets of pageXML. This can mean either comparing ground truth to predicted pageXML, or determining the similarity of two annotations by different people. This tool is the xml_comparison.py file. The comparison allows you to specify how regions and baselines should be drawn when creating the pixel masks. The pixel masks are then compared based on their Intersection over Union (IoU) and Accuracy (Acc) scores. For the sake of the Accuracy metric, one of the two sets needs to be specified as the ground truth set. So one set is passed as the ground truth directory (--gt) argument and the other as the input directory (--input) argument.
Required arguments:
python tooling/xml_comparison.py \
-g/--gt GT [GT ...] \
-i/--input INPUT [INPUT ...]
Optional arguments:
python tooling/xml_comparison.py \
-g/--gt GT [GT ...] \
-i/--input INPUT [INPUT ...] \
[-m/--mode {baseline,region,start,end,separator,baseline_separator}] \
[--regions REGIONS [REGIONS ...]] \
[--merge_regions [MERGE_REGIONS]] \
[--region_type REGION_TYPE [REGION_TYPE ...]] \
[-w/--line_width LINE_WIDTH]
The optional arguments are shown using square brackets. The --mode parameter specifies what type of prediction the model has to do. If the mode is region, the --regions argument specifies which regions need to be extracted from the pageXML (for example "page-number"). The --merge_regions argument then specifies whether any of these regions need to be merged. This could mean converting "insertion" into "resolution", since they refer to the same thing: resolution:insertion. The final region argument is --region_type, which can specify the region type of a region. In the other modes lines are used. The line arguments are --line_width, which specifies the line width, and --line_color, which specifies the line color.
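The merge step can be pictured as a lookup from a source region name to its target name. The sketch below handles only the target:source form shown in the example above; the full syntax accepted by --merge_regions may allow more than this, and the helper names `build_merge_map` and `merged_name` are hypothetical.

```python
def build_merge_map(specs: list[str]) -> dict[str, str]:
    """Turn "target:source" strings into a source -> target lookup."""
    merge_map = {}
    for spec in specs:
        target, separator, source = spec.partition(":")
        if not separator or not target or not source:
            raise ValueError(f"expected 'target:source', got {spec!r}")
        merge_map[source] = target
    return merge_map


def merged_name(region: str, merge_map: dict[str, str]) -> str:
    """Return the target name for a merged region, or the name itself otherwise."""
    return merge_map.get(region, region)


merge_map = build_merge_map(["resolution:insertion"])
print(merged_name("insertion", merge_map))    # resolution
print(merged_name("page-number", merge_map))  # page-number
```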
The final tool is a program for showing the pageXML as mask images. This can help with showing how the pageXML regions and baselines look. This can be done in gray scale, in color, or as a colored overlay over the original image. This tool is located in the xml_viewer.py file. It requires an input directory (--input) argument and an output directory (--output) argument.
Required arguments:
python tooling/xml_viewer.py \
-c/--config CONFIG \
-i/--input INPUT [INPUT ...] \
-o/--output OUTPUT [OUTPUT ...]
Optional arguments:
python tooling/xml_viewer.py \
-c/--config CONFIG \
-i/--input INPUT [INPUT ...] \
-o/--output OUTPUT [OUTPUT ...] \
[--opts OPTS [OPTS ...]] \
[-t/--output_type {gray,color,overlay}]
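The optional arguments are shown using square brackets. The parameter --opts allows you to change values specified in the config files. The --output_type parameter specifies which type of mask image to produce: gray scale (gray), color (color), or a colored overlay over the original image (overlay).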
Distributed under the MIT License. See LICENSE for more information.
This project was made while working at the KNAW Humanities Cluster Digital Infrastructure
Please report any bugs or errors that you find to the issues page, so that they can be looked into. First check whether an issue for the same problem/bug is already open. Feature requests should also be made through the issues page.
If you discover a bug or missing feature that you would like to help with please feel free to send a pull request.