pdf2msgpack

Efficiently export PDF content in an easily machine-readable format.

About

pdf2msgpack is designed to efficiently dump information from a PDF in a portable format which is easy to consume without heavyweight PDF libraries.

At the moment it dumps glyphs (a rectangle and text) and path information.

pdf2msgpack is Alpha Software

pdf2msgpack is used internally at The Sensible Code Company. It is still quite young, and the output format might change significantly.

Installation

To configure and build, run:

./waf configure
./waf build

There is an accompanying Dockerfile which enables building a static binary.

To build the static binary, make sure the submodules are checked out, then invoke make:

git submodule update --init
make

Running

./pdf2msgpack <pdf-file>

Output

pdf2msgpack writes its output as msgpack (MessagePack), a convenient, fast serialization format with libraries available in many languages.

Here is a description of the wire format.

At the top level, the document is not a single list. It is a sequence of consecutive objects written to the stream, so a reader must decode objects one after another until the stream ends.

document (consecutive objects): <wire version : int> <metadata> <page>...
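
As a rough illustration of consuming this stream, the sketch below splits the consecutive top-level objects into the wire version, the metadata, and the pages. It assumes the third-party `msgpack` Python package is installed; `split_document` and `read_document` are hypothetical helper names and are not part of pdf2msgpack.

```python
def split_document(objects):
    """Split decoded top-level objects into (wire_version, metadata, pages).

    `objects` is any iterable yielding the decoded objects in stream order.
    """
    stream = iter(objects)
    wire_version = next(stream)  # first object: wire version (int)
    metadata = next(stream)      # second object: metadata dict
    pages = list(stream)         # remaining objects: zero or more page dicts
    return wire_version, metadata, pages


def read_document(path):
    """Decode a file holding pdf2msgpack output (assumes the msgpack package)."""
    import msgpack  # third-party dependency: pip install msgpack

    with open(path, "rb") as f:
        # Unpacker yields each top-level object in turn, which suits a
        # stream of consecutive objects rather than one enclosing list.
        return split_document(msgpack.Unpacker(f, raw=False))
```

With `--meta-only`, the list of pages would simply come back empty, since no page objects follow the metadata.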

metadata (dict): {
  "Pages": int,
  "FileName": str,
  "XFA": dict,
  "EmbeddedFiles": [<embedfile>...],
  "FontInfo": [<fontinfo : dict>...],
  (other string fields supplied in PDF),
}

page (dict): {
  "Size": [<width : int>, <height : int>],
  "Glyphs": [<glyph>...],
  "Paths": [<pathitem>...],
  "Bitmap": [ [<width : int>, <height : int>], <pixeldata : uint8> ]
}

embedfile (list):
  [ <name : str>, <description : str>, <mime-type : str>,
    <created-date : str>, <modified-date : str>, <content : bin>]
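
Because an embedfile is a positional list, giving its fields names can make consuming code easier to read. The following is a minimal sketch that operates on an already-decoded metadata dict; `EmbeddedFile` and `embedded_files` are hypothetical names chosen for illustration.

```python
from collections import namedtuple

# Field order follows the embedfile list layout described above.
EmbeddedFile = namedtuple(
    "EmbeddedFile",
    ["name", "description", "mime_type", "created_date", "modified_date", "content"],
)


def embedded_files(metadata):
    """Return the embedded files in `metadata` as EmbeddedFile tuples.

    Returns an empty list when the key is missing or its value is nil
    (decoded as None), as happens without --embedded-files.
    """
    entries = metadata.get("EmbeddedFiles") or []
    return [EmbeddedFile(*entry) for entry in entries]
```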

pathitem (list): [<type : (EO_FILL|STROKE|FILL|SET_STROKE_COLOR|
                           SET_STROKE_WIDTH|SET_FILL_COLOR)>,
                  [<pathdata>...]]

pathdata (list): depending on pathitem type
  EO_FILL, FILL, STROKE: pathcoord
  SET_FILL_COLOR: [<r : uint8>, <g : uint8>, <b : uint8>]
  SET_STROKE_WIDTH: <width : float>

pathcoord (2 list or 6 list):
  point: [<x : float>, <y : float>]
  quadratic curve: [<a : float>, <b : float>, <c : float>, <d : float>, <e : float>, <f : float>]
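
Since the two kinds of pathcoord differ only in length, a consumer can tell them apart by counting elements. The sketch below shows one way to do that on decoded data; `classify_pathcoord` is a hypothetical helper, and the meaning of the six curve values is left uninterpreted because it is not spelled out here.

```python
def classify_pathcoord(coord):
    """Classify a decoded pathcoord by its length.

    Returns ("point", (x, y)) for a 2-element list and
    ("curve", (a, b, c, d, e, f)) for a 6-element list.
    """
    if len(coord) == 2:
        x, y = coord
        return ("point", (x, y))
    if len(coord) == 6:
        return ("curve", tuple(coord))
    raise ValueError(f"unexpected pathcoord length: {len(coord)}")
```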

fontinfo (dict): {<Name> <Type> <Encoding> <Embedded> <Subset> <ToUnicode>}

Options

Some of the options affect the wire format generated by the command.

  • --meta-only: omit page objects from the document.
  • --bitmap: generate a grayscale bitmap in each page object (otherwise absent).
  • --xfa: generate XML Forms Architecture (XFA) data in the metadata object (otherwise nil).
  • --embedded-files: generate embedded files in the metadata object (otherwise nil).
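
For scripted use, the options above can be assembled into a command line programmatically. This is only a sketch under two assumptions that are not stated in this README: that options are accepted before the PDF path, and that the msgpack output is written to standard output. `build_command` and `run_pdf2msgpack` are hypothetical names.

```python
import subprocess


def build_command(pdf_path, meta_only=False, bitmap=False, xfa=False,
                  embedded_files=False, binary="./pdf2msgpack"):
    """Build the argument list for a pdf2msgpack invocation.

    Assumption: options may be placed before the PDF path.
    """
    cmd = [binary]
    if meta_only:
        cmd.append("--meta-only")
    if bitmap:
        cmd.append("--bitmap")
    if xfa:
        cmd.append("--xfa")
    if embedded_files:
        cmd.append("--embedded-files")
    cmd.append(pdf_path)
    return cmd


def run_pdf2msgpack(pdf_path, **options):
    """Run the tool and return the raw bytes it emits.

    Assumption: the msgpack stream is written to standard output.
    """
    result = subprocess.run(build_command(pdf_path, **options),
                            check=True, stdout=subprocess.PIPE)
    return result.stdout
```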

Licence

pdf2msgpack is licensed under the GPLv2.