pdf2msgpack
is designed to efficiently dump information from a PDF in a portable format which is easy to consume without heavyweight PDF libraries.
At the moment it dumps glyphs (a rectangle and text) and path information.
pdf2msgpack
is used internally at The Sensible Code company. It's still quite young and the output might change significantly.
To configure and build, run:
./waf configure
./waf build
There is an accompanying docker file which enables building a static binary.
To build the static binary, just ensure submodules are checked out and
invoke make
.
git submodule update --init
make
./pdf2msgpack <pdf-file>
pdf2msgpack
writes its output to msgpack, which is a convenient, fast serialization format with libraries available in many languages.
Here is a description of the wire format.
At the top level the document is not a list but consecutive objects written to the stream.
document (consecutive objects): <wire version : int> <metadata> <page>...
metadata (dict): {
"Pages": int,
"FileName": str,
"XFA": dict,
"EmbeddedFiles": [<embedfile>...],
"FontInfo": [<fontinfo : dict>...],
(other string fields supplied in PDF),
}
page (dict): {
"Size": [<width : int>, <height : int>],
"Glyphs": [<glyph>...],
"Paths": [<pathitem>...],
"Bitmap": [ [<width : int>, <height : int>], <pixeldata : uint8> ]
}
embedfile (list):
[ <name : str>, <description : str>, <mime-type : str>,
<created-date : str>, <modified-date : str>, <content : bin>]
pathitem (list): [<type : (EO_FILL|STROKE|FILL|SET_STROKE_COLOR|
SET_STROKE_WIDTH|SET_FILL_COLOR)>,
[<pathdata>...]]
pathdata (list): depending on pathitem type
EO_FILL, FILL, STROKE: pathcoord
SET_FILL_COLOR: [<r : uint8>, <g : uint8>, <b : uint8>]
SET_STROKE_WIDTH: <width : float>
pathcoord (2 list or 6 list):
point: [<x : float>, <y : float>]
quadratic curve: [<a : float>, <b : float>, <c : float>, <d : float>, <e : float>, <f : float>]
fontinfo (dict): {<Name> <Type> <Encoding> <Embedded> <Subset> <ToUnicode>}
Some of the options affect the wire format generated by the command.
--meta-only
: changesdocument
to skip generation ofpage
objects.--bitmap
: generate grayscale bitmap inpage
object (otherwise absent)--xfa
: generate XML Forms Architecture (XFA) data inmetadata
object (otherwise nil)--embedded-files
: generate embedded files inmetadata
object (otherwise nil)
pdf2msgpack
is licensed under the GPLv2.