The library works with only a limited set of PDFs, for two main reasons:
- The transformation matrix and the graphics state are not handled
- Fonts and encodings are not handled correctly
Extract tables (and paragraphs outside tables) from PDF files
(please read before use)
This software is released under the MIT license, but it uses iTextSharp v.4.1.6, which is released under the MPL/LGPL licenses. Before using this software you should also agree to the iTextSharp v.4.1.6 license. Also, take care if you upgrade iTextSharp, because newer versions are released under the AGPL.
PDF is a file format for describing device-independent page output. This project aims to retrieve text and tables from a PDF.
The main part is the Engine.
The Renderer is a debug window that helps you understand what is happening.
Call
var pages = ExtractText.Read(fileName);
to read all the pages.
Then, for every page, call
foreach (var page in pages)
{
    page.DetermineTableStructures();
    page.DetermineParagraphs();
    page.FillContent();
}
To check whether this has already been done for a page, use
Page.IsRefreshed
After that, you can access
Page.Contents
Contents is a collection of IPageContent items, ordered from the top of the page to the bottom.
An IPageContent can be a
- Paragraph that contains text (Content)
- Table that contains a matrix of text (Content[,])
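
As a rough sketch of how the contents could be walked, the example below tells paragraphs and tables apart and turns each into printable lines. The IPageContent, Paragraph and Table types here are hypothetical stand-ins declared only so the example compiles on its own; in real use they come from the library. It also assumes that a paragraph's Content is a string and that a table's Content is a string[,] indexed as [row, column], which this README does not confirm. The sample data in Main is made up.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins so the sketch compiles without the library.
// In real use, these types are provided by the Engine.
public interface IPageContent { }

public class Paragraph : IPageContent
{
    public string Content; // assumption: plain text
}

public class Table : IPageContent
{
    public string[,] Content; // assumption: indexed as [row, column]
}

public static class ContentDump
{
    // Turns a sequence of page contents into printable lines,
    // keeping the order in which the items were supplied.
    public static List<string> Describe(IEnumerable<IPageContent> contents)
    {
        var lines = new List<string>();
        foreach (IPageContent item in contents)
        {
            var paragraph = item as Paragraph;
            if (paragraph != null)
            {
                lines.Add("Paragraph: " + paragraph.Content);
                continue;
            }

            var table = item as Table;
            if (table == null) continue; // unknown content type: skip it

            // Emit one line per table row, with cells joined by " | ".
            for (int r = 0; r < table.Content.GetLength(0); r++)
            {
                var cells = new string[table.Content.GetLength(1)];
                for (int c = 0; c < cells.Length; c++)
                    cells[c] = table.Content[r, c];
                lines.Add("Table row " + r + ": " + string.Join(" | ", cells));
            }
        }
        return lines;
    }

    public static void Main()
    {
        // Made-up sample data standing in for Page.Contents.
        var contents = new List<IPageContent>
        {
            new Paragraph { Content = "Quarterly report" },
            new Table { Content = new[,] { { "Item", "Qty" }, { "Bolt", "4" } } }
        };

        foreach (string line in Describe(contents))
            Console.WriteLine(line);
    }
}
```

With the real library, the same Describe logic would be applied to Page.Contents once the page has been processed, after removing the stand-in type declarations.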