/POC-dotnet-ExtractPdfContent

🔬 Proof of Concept of extracting content from PDF files using multiple PDF libraries

Primary language: C#. License: MIT.

PoC .NET - Extract PDF content




Libraries

Refer to this article: Reading a PDF in C# on .NET Core

The main goal of this PoC is to test the available options for reliably reading content from PDF files, and to replace the iTextSharp library currently in use, which targets .NET Framework.


Results

⚠️ DocNet

The results are not perfect, but they are usable. With the files tested, some extraction errors appeared that could be cleaned up with a few simple regular expressions when post-processing the text.
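The kind of regular-expression post-processing mentioned above could look like the sketch below. The specific errors seen in the test files are not listed here, so the three rules (rejoining words hyphenated across a line break, collapsing runs of spaces, collapsing runs of blank lines) are assumptions chosen for illustration, and `ExtractedTextCleaner` is a hypothetical helper, not part of DocNet.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: cleans up text after it has been extracted from a PDF.
// The patterns are illustrative assumptions; adjust them to the errors
// actually observed in your own files.
public static class ExtractedTextCleaner
{
    // A letter, a hyphen, a line break, then a lowercase letter: "extrac-\ntion".
    private static readonly Regex HyphenatedLineBreak =
        new Regex(@"(?<=\p{L})-\r?\n(?=\p{Ll})", RegexOptions.Compiled);

    // Two or more spaces, tabs, or non-breaking spaces in a row.
    private static readonly Regex HorizontalWhitespaceRun =
        new Regex(@"[ \t\u00A0]{2,}", RegexOptions.Compiled);

    // Three or more consecutive line breaks (two or more blank lines).
    private static readonly Regex BlankLineRun =
        new Regex(@"(\r?\n){3,}", RegexOptions.Compiled);

    public static string Clean(string rawText)
    {
        if (string.IsNullOrEmpty(rawText))
        {
            return string.Empty;
        }

        // Rejoin words that were split across lines.
        string text = HyphenatedLineBreak.Replace(rawText, string.Empty);
        // Collapse horizontal whitespace runs to a single space.
        text = HorizontalWhitespaceRun.Replace(text, " ");
        // Keep at most one blank line between paragraphs.
        text = BlankLineRun.Replace(text, "\n\n");

        return text.Trim();
    }
}
```

Calling `ExtractedTextCleaner.Clean(pageText)` on each page's text keeps the cleanup separate from the extraction step, so the same rules can be reused with any of the libraries compared here.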

❌ iTextSharp.LGPLv2.Core

Encoding issues. A simple PDF generated by the library itself can be read correctly, but another PDF that was tested came back with encoding problems.

✅ 🔝 PdfPig

PdfPig's output was 99.999% identical to the output of the old iTextSharp class (not the iTextSharp Core version). I will use PdfPig in my projects to replace the old library.
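For reference, a minimal sketch of page-by-page text extraction with PdfPig is shown below. It assumes the `PdfPig` NuGet package is installed; `PdfPigTextReader` and `ReadPages` are hypothetical names used for illustration and are not necessarily how this repository structures its code.

```csharp
using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Hypothetical wrapper around PdfPig; the class and method names are illustrative.
public static class PdfPigTextReader
{
    // Returns the extracted text of each page, in document order.
    public static IReadOnlyList<string> ReadPages(string pdfPath)
    {
        var pages = new List<string>();

        // PdfDocument is disposable, so the file handle is released on exit.
        using (PdfDocument document = PdfDocument.Open(pdfPath))
        {
            foreach (Page page in document.GetPages())
            {
                // Page.Text is the page's text as a single string.
                pages.Add(page.Text);
            }
        }

        return pages;
    }
}
```

`page.Text` is the simplest entry point. If word boundaries or reading order matter for your files, `page.GetWords()` gives word-level access and may be a better starting point.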

❌ PdfSharpCore

This library doesn't support text extraction yet.