/pdftextractor

Primary LanguageGoMIT LicenseMIT

PDFtExtractor


This package performs a text extraction on PDF files, with a focus on performance. The function ExtractText takes two parameters; the path to the PDF file, and a boolean value defining whether to extract text from drawn images as well. It returns a byte slice of the text contents.

The current package has some issues when handling the PDF text objects, and does not properly decode the PDF glyphs

Usage

To use this package in your project, run the following command in your module:

go get -u github.com/syhv-git/pdftextractor

You will also need to download tesseract-ocr and libtesseract-dev (tesseract-ocr-dev for apk). You can add more language data to /usr/share/tesseract-ocr/$VERSION/tessdata/

There may be dependency issues with the Gosseract package. This will require the linux mint package from the same developer

Roadmap

  • Decode PDF string objects and extract the raw text

    • Narrowed issues down to font encoding and cmaps
    • Everything before decodeText() works as expected
  • Optimize the codebase

  • Test with PDFs containing images

  • Test with an Image based PDF file

  • Test interoperability with other PDF versions

  • Test various encoding types and font styles

  • Test various string object encodings