PublayNetSharp

Extract and convert PubLayNet data to PageXml format

Related projects

PubLayNet dataset

PubLayNet is a large dataset of document images whose layout is annotated with both bounding boxes and polygonal segmentations. The source of the documents is the PubMed Central Open Access Subset (commercial use collection). The annotations are generated automatically by matching the PDF format and the XML format of the articles in the PubMed Central Open Access Subset. More details are available in the PubLayNet authors' paper, "PubLayNet: largest dataset ever for document layout analysis".

PubLayNet's repo

Steps

Compressed PDF document retrieval

Before running the extraction and conversion, you need to download the required PDF documents from the PubMed Central website. To find the location of each document processed in the PubLayNet dataset, match the image file names to the tar.gz file names in the database. The index of the Commercial Use Collection documents is available in both txt and csv formats.

The files can then be downloaded from PubMed Central's FTP service; see the PubMed Central FTP documentation for more information.

Example

In the PubLayNet sample data, PMC5491943_00004.jpg is the first processed image. The expected compressed file name is therefore PMC5491943.tar.gz. According to PubMed Central's index file, the location of this document is oa_package/32/31/PMC5491943.tar.gz.
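
The matching step in this example can be sketched in C# as follows. This is a minimal illustration and not part of PublayNetSharp: the PmcIndexLookup class and its method names are hypothetical, and the sketch assumes that each index line starts with the archive path (as in oa_package/32/31/PMC5491943.tar.gz), optionally followed by tab- or comma-separated columns.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of PublayNetSharp.
public static class PmcIndexLookup
{
    // Derives the archive name from an image file name,
    // e.g. "PMC5491943_00004.jpg" gives "PMC5491943.tar.gz".
    public static string ArchiveNameFor(string imageFileName)
    {
        string stem = Path.GetFileNameWithoutExtension(imageFileName);
        int underscore = stem.LastIndexOf('_');
        string articleId = underscore >= 0 ? stem.Substring(0, underscore) : stem;
        return articleId + ".tar.gz";
    }

    // Looks for the first index line whose first column ends with "/<archiveName>".
    // Assumption: the archive path is the first tab- or comma-separated column.
    public static bool TryFindLocation(
        IEnumerable<string> indexLines, string archiveName, out string location)
    {
        foreach (string line in indexLines)
        {
            string firstColumn = line.Split('\t', ',')[0].Trim();
            if (firstColumn.EndsWith("/" + archiveName, StringComparison.Ordinal))
            {
                location = firstColumn;
                return true;
            }
        }

        location = string.Empty;
        return false;
    }
}
```

With a local copy of the index, the lines could be supplied lazily through File.ReadLines so that the whole file is not loaded into memory at once.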

PDF extraction

You can use the following to extract all the compressed PDF documents in tarGzFolder into outputFolder:

TarGz.ExtractAll(tarGzFolder, outputFolder);

COCO data to PageXml

You can use the following to convert the COCO-formatted PubLayNet data to PageXml. One XML document is generated per page, but only if the corresponding PDF document is available in outputFolder:

CocoPageXml.Convert(jsonPath, outputFolder);
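
For orientation, the COCO input groups naturally by page, which is what makes a one-document-per-page output possible. The sketch below is not the CocoPageXml implementation; it only illustrates, with System.Text.Json, how annotations in a COCO file might be grouped by image. It assumes the standard COCO layout (an images array with id and file_name, and an annotations array with image_id, category_id and a bbox given as [x, y, width, height]). The CocoGrouping class and the Region record are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical types for illustration; not part of PublayNetSharp.
public record Region(int CategoryId, double X, double Y, double Width, double Height);

public static class CocoGrouping
{
    // Maps each image file name to the regions annotated on that page.
    public static Dictionary<string, List<Region>> GroupByImage(string cocoJson)
    {
        using JsonDocument doc = JsonDocument.Parse(cocoJson);
        JsonElement root = doc.RootElement;

        // First pass: remember which file name belongs to each image id.
        var fileNames = new Dictionary<long, string>();
        var regions = new Dictionary<string, List<Region>>();
        foreach (JsonElement image in root.GetProperty("images").EnumerateArray())
        {
            string fileName = image.GetProperty("file_name").GetString() ?? "";
            fileNames[image.GetProperty("id").GetInt64()] = fileName;
            regions[fileName] = new List<Region>();
        }

        // Second pass: attach every annotation to its page.
        foreach (JsonElement ann in root.GetProperty("annotations").EnumerateArray())
        {
            // Skip annotations that point at an unknown image.
            if (!fileNames.TryGetValue(ann.GetProperty("image_id").GetInt64(), out var fileName))
                continue;

            JsonElement bbox = ann.GetProperty("bbox");
            regions[fileName].Add(new Region(
                ann.GetProperty("category_id").GetInt32(),
                bbox[0].GetDouble(), bbox[1].GetDouble(),
                bbox[2].GetDouble(), bbox[3].GetDouble()));
        }

        return regions;
    }
}
```

This sketch parses the whole JSON string in memory, which is fine for the sample data; for the full annotation files a streaming reader would probably be a better fit.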

BobLd/PublayNetSharp