python-pdf-extractor-library-benchmarking

A simple benchmark of PDF extractor libraries in Python, based on how accurately they extract text from a PDF.

About the project

There are many PDF extractor libraries for Python, each with its own pros and cons. Choosing one among so many can feel numbing. We can go to Google, search for "the best python library for extracting pdf", and get various results, but in the end it comes back to your needs. What exactly do you need? You need to extract information from the PDF, and usually that means the text. So which extraction library is the best one to choose? Here I propose a way to benchmark the libraries: by checking the accuracy of the words extracted from the PDF source.

Libraries

The Python libraries I used for this project are PyPDF, PDFMiner, and PyMuPDF for text extraction, and TextBlob for spell checking.

Sources

All the e-books I'm using to benchmark these libraries were collected from Project Gutenberg.

How I did the benchmarking

I extracted ten pages from each source using PyPDF, PDFMiner, and PyMuPDF. I wrote a class that feeds each library's result to TextBlob and checks whether each word is correct or misspelled. I count the misspelled words and use that number to calculate the accuracy of each library for each source.
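
The accuracy calculation described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual class: it assumes accuracy means the share of words that the spell checker does not flag, and it uses a tiny hard-coded word set (`KNOWN_WORDS`) as a stand-in for TextBlob so the example runs with the standard library only. The names `word_accuracy`, `toy_checker`, and the sample strings are all hypothetical.

```python
import string
from typing import Callable

def word_accuracy(text: str, is_correct: Callable[[str], bool]) -> float:
    """Return the fraction of words in `text` accepted by `is_correct`.

    Assumption: accuracy = correctly spelled words / total words.
    """
    # Split on whitespace and strip surrounding punctuation from each token.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not words:
        return 0.0
    misspelled = sum(1 for w in words if not is_correct(w))
    return (len(words) - misspelled) / len(words)

# Toy stand-in for a real spell checker such as TextBlob (assumption).
KNOWN_WORDS = {"the", "quick", "brown", "fox", "jumps"}

def toy_checker(word: str) -> bool:
    return word in KNOWN_WORDS

# Hypothetical outputs of two extractors for the same page.
samples = {
    "library_a": "The quick brown fox jumps.",
    "library_b": "The qu1ck brovvn fox jumps.",
}

scores = {name: word_accuracy(text, toy_checker) for name, text in samples.items()}
for name, score in scores.items():
    print(f"{name}: {score:.0%}")
```

Passing the checker in as a function keeps the scoring logic independent of any particular spell-checking library, so the same routine could be reused for each extractor's output.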

yafethtb/python-pdf-extractor-library-benchmarking
