PubLayNet_tfrecords

This repo contains scripts that convert the PubLayNet dataset to tfrecords for semantic segmentation. The tfrecords can be used to train and evaluate semantic segmentation neural networks for document structure extraction and document layout recognition.
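In a semantic-segmentation setup, each of PubLayNet's five layout categories maps to a pixel value in the label mask, with 0 reserved for background. A minimal sketch of that mapping, assuming the masks use the dataset's category ids directly as pixel values:

```python
# PubLayNet's five layout categories (ids from the dataset's COCO-style JSON).
# Assumption: the segmentation masks encode these ids directly as pixel
# values, with 0 reserved for background.
PUBLAYNET_CLASSES = {
    0: "background",
    1: "text",
    2: "title",
    3: "list",
    4: "table",
    5: "figure",
}

def class_name(pixel_value):
    """Map a mask pixel value back to its layout class name."""
    try:
        return PUBLAYNET_CLASSES[pixel_value]
    except KeyError:
        raise ValueError(f"unknown mask value: {pixel_value}")
```

For example, a mask pixel with value 4 marks that pixel as belonging to a table region.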

Style and Format

The style and formatting of the tfrecords follow the official semantic segmentation model released in TensorFlow's model repository (https://github.com/tensorflow/models/tree/master/research/deeplab). More specifically, the scripts released here follow the style and formatting of the PASCAL VOC dataset as used for DeepLab.
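In that style, each example is serialized as a `tf.train.Example` carrying the encoded page image alongside its encoded label mask. The key names and types below are an assumption based on DeepLab's PASCAL VOC conversion utilities, sketched as a plain mapping:

```python
# Hypothetical sketch of the per-example feature layout in DeepLab-style
# segmentation tfrecords (key names assumed from DeepLab's build_data
# conventions; verify against the actual scripts before relying on them).
SEG_EXAMPLE_FEATURES = {
    "image/encoded": "bytes",                     # JPEG bytes of the page image
    "image/filename": "bytes",                    # image file name
    "image/format": "bytes",                      # e.g. b"jpeg"
    "image/height": "int64",
    "image/width": "int64",
    "image/channels": "int64",
    "image/segmentation/class/encoded": "bytes",  # PNG bytes of the label mask
    "image/segmentation/class/format": "bytes",   # e.g. b"png"
}
```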

Instructions for Using the Repo

To use the code:

  1. Download the PubLayNet files from its official GitHub repo (https://github.com/ibm-aur-nlp/PubLayNet).
  2. Put train.json and dev.json in the PubLayNet_tfrecords/PubLayNet folder.
  3. Unzip the downloaded files and put each batch in its appropriate folder under PubLayNet_tfrecords/PubLayNet/RawImages/.
  4. In a terminal, navigate to ./PubLayNet_tfrecords.
  5. To create the segmentation mask PNG files, run python create_PubLayNet_segmentation_mask_png_files.py.
  6. To create the tfrecords, run python build_PubLayNet_tfrecords.py.
  7. The tfrecords will be saved in PubLayNet_tfrecords/PubLayNet/tfrecords.
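Conversion tools in the TensorFlow model repository typically shard their output across several files per split. A small sketch of the shard-naming convention (the `-%05d-of-%05d` pattern is an assumption carried over from TensorFlow's dataset tooling, not something this repo documents):

```python
def shard_filenames(split, num_shards):
    """Generate tfrecord shard names like train-00000-of-00004.tfrecord.

    Assumption: the build script follows TensorFlow's usual
    "<split>-<shard>-of-<total>" naming scheme for its output shards.
    """
    return [
        f"{split}-{i:05d}-of-{num_shards:05d}.tfrecord"
        for i in range(num_shards)
    ]
```

For example, `shard_filenames("train", 4)` yields `train-00000-of-00004.tfrecord` through `train-00003-of-00004.tfrecord`.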

About PubLayNet

PubLayNet is a large dataset of document images whose layouts are annotated with both bounding boxes and polygonal segmentations. The documents come from the PubMed Central Open Access Subset (commercial use collection). The annotations were generated automatically by matching the PDF and XML versions of the articles in the PubMed Central Open Access Subset. More details are available in the paper "PubLayNet: largest dataset ever for document layout analysis."
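In the COCO style, each polygonal segmentation is stored as a flat [x0, y0, x1, y1, ...] coordinate list, and rasterizing such polygons into per-pixel class masks is the core of the mask-generation step. A self-contained sketch using an even-odd point-in-polygon test (the actual script presumably rasterizes with an image library for speed; the function names here are illustrative):

```python
def point_in_polygon(x, y, poly):
    """Even-odd rule; poly is a flat [x0, y0, x1, y1, ...] coordinate list."""
    pts = list(zip(poly[0::2], poly[1::2]))
    inside = False
    j = len(pts) - 1
    for i in range(len(pts)):
        xi, yi = pts[i]
        xj, yj = pts[j]
        # Count edges crossed by a horizontal ray extending to the right.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside

def rasterize(poly, width, height, class_id):
    """Fill a width x height mask (list of rows) from one polygon,
    sampling each pixel at its center; background pixels stay 0."""
    return [
        [class_id if point_in_polygon(x + 0.5, y + 0.5, poly) else 0
         for x in range(width)]
        for y in range(height)
    ]
```

Writing such a mask out as a single-channel PNG gives exactly the per-pixel label images that the DeepLab-style tfrecords expect.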