Parsing PDFs using YOLOV3

There exist many python librairies which enable the parsing of pdfs, Camelot is one of the best. Although it performs well on text, however, it struggles on tables specially the ones localized inside paragraphs.
Camelot offers the option of specifying the regions to parse through the variable table_areas="x1,y1,x2,y2" where (x1, y1) is left-top and (x2, y2) right-bottom in PDF coordinate space. When filled out, the result is significantly enhanced.

Explaining the basic idea

One way to automize the parsing of tables is to train an algorithm capable of returning the coordinates of the bounding boxe circling the table, as detailled in the following pipeline:

If the primitive pdf page is image-based, we can use ocrmypdf to turn into a text-based one in order to be able to get the text inside of the table. We, then, carry out the following operations:

  • Transform a pdf page into an image one using pdf2img
  • Use a trained algorithm to detect the regions of tables.
  • Normalize the bounding boxes, using the image dimension, which enables use to get the regions in the pdf space using the pdf dimensions obtained through PyPDF2.
  • Feed the regions to camelot and get the corresponding pandas dataframes.


When detecting a table in pdf image we expand the bounding boxe in order to guarante its full inclusion, as follows:

Tables detection

The algorithm which allows the detection of tables, is nothing but yolov3, I advise your to read my previous article about objects detection. We finetune the algorithm to detect tables and retrain all the architecture. To do so, we carry out the following steps:

  • Create a training database using Makesense a tool which enables an export in yolo's format:

  • Train a yolov3 repository modified to fit our purpose on AWS EC2, we get the following results:

Requirements

All python requirements are included in the file package.txt, all you need to do is run the following command line:

pip install -r packages.txt

Prediction

It is possible to make prediction on a pdf page using the following command line:

python predict_table.py --pdf_path pdfs/boeings.pdf --page 2

It takes two arguments:

  • pdf_path: where the original pdf file is located
  • page: the desired page to parse

Examples

NB: following the same steps, we can train the algorithms to detect any other object in a pdf page such as graphics and images which can be extracted from the image page.