PDFtExtractor

This package extracts text from PDF files, with a focus on performance. The function ExtractText takes two parameters: the path to the PDF file, and a boolean that controls whether text is also extracted from images drawn on the page. It returns the extracted text as a byte slice.

The package currently has issues handling PDF text objects and does not yet decode PDF glyphs properly.

Usage

To use this package in your project, run the following command in your module:

go get -u github.com/syhv-git/pdftextractor

You will also need to install tesseract-ocr and libtesseract-dev (tesseract-ocr-dev when using apk). Additional language data can be added to /usr/share/tesseract-ocr/$VERSION/tessdata/

There may be dependency issues with the Gosseract package. Resolving them requires the Linux Mint package from the same developer.

Roadmap

Decode PDF string objects and extract the raw text
- Narrowed issues down to font encoding and cmaps
- Everything before decodeText() works as expected
Optimize the codebase
Test with PDFs containing images
Test with an Image based PDF file
Test interoperability with other PDF versions
Test various encoding types and font styles
Test various string object encodings
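The first roadmap item traces the decoding problem to font encodings and cmaps. The sketch below shows the general idea of mapping the character codes in a PDF hex string to Unicode through a ToUnicode-style table. It is not this package's decodeText() implementation. The type toUnicode and the function decodeHexString are hypothetical, and the sketch assumes fixed 2-byte character codes and a table that has already been parsed. Real fonts may use 1-byte codes, ranges, or other encodings.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// toUnicode is a simplified, hypothetical stand-in for a parsed ToUnicode
// CMap: it maps 2-byte character codes to Unicode text.
type toUnicode map[uint16]string

// decodeHexString decodes a PDF hex string such as "<0001 0002>" by
// splitting it into 2-byte codes and looking each one up in cm.
// Codes missing from the table become U+FFFD.
func decodeHexString(s string, cm toUnicode) (string, error) {
	s = strings.TrimSpace(s)
	s = strings.TrimPrefix(s, "<")
	s = strings.TrimSuffix(s, ">")
	// Whitespace inside a PDF hex string is ignored.
	s = strings.Join(strings.Fields(s), "")
	// With an odd number of digits, the missing final digit is taken as 0.
	if len(s)%2 == 1 {
		s += "0"
	}
	raw, err := hex.DecodeString(s)
	if err != nil {
		return "", err
	}
	if len(raw)%2 == 1 {
		return "", fmt.Errorf("expected 2-byte codes, got %d bytes", len(raw))
	}

	var b strings.Builder
	for i := 0; i < len(raw); i += 2 {
		code := uint16(raw[i])<<8 | uint16(raw[i+1])
		if text, ok := cm[code]; ok {
			b.WriteString(text)
		} else {
			b.WriteRune('\uFFFD')
		}
	}
	return b.String(), nil
}

func main() {
	// Made-up table for illustration; a real one comes from the font's
	// ToUnicode stream.
	cm := toUnicode{0x0001: "H", 0x0002: "i"}
	text, err := decodeHexString("<0001 0002>", cm)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(text) // prints "Hi"
}
```

Emitting U+FFFD for unmapped codes, rather than the raw bytes, makes gaps in the table visible in the output, which can help when testing different fonts and encodings.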
