pdftojson is a pdftotext wrapper that generates JSON with bounding box data. It takes care of overlapping, duplicated characters, which often exist in MS-Word-generated PDF files with floating images and text.
Consider a PDF file, theFile.pdf, that contains such overlaps. Running pdftotext -bbox theFile.pdf on it would generate this:
...
<word xMin="103.320000" yMin="547.355700" xMax="152.368008" yMax="561.321720">(6)綠線</word>
<word xMin="155.880000" yMin="547.355700" xMax="176.846541" yMax="561.321720">G01</word>
<word xMin="155.880000" yMin="547.355700" xMax="162.867200" yMax="561.321720">G</word>
<word xMin="180.300000" yMin="547.355700" xMax="222.295867" yMax="561.321720">站延伸</word>
<word xMin="208.080000" yMin="547.355700" xMax="264.053062" yMax="561.321720">伸至大溪</word>
<word xMin="264.480000" yMin="547.355700" xMax="334.420485" yMax="561.321720">、龍潭先進</word>
<word xMin="320.340000" yMin="547.355700" xMax="348.294390" yMax="561.321720">進公</word>
<word xMin="124.680000" yMin="572.375700" xMax="166.675867" yMax="586.341720">共運輸</word>
<word xMin="152.700000" yMin="572.375700" xMax="222.644667" yMax="586.341720">輸系統發展</word>
<word xMin="208.440000" yMin="572.375700" xMax="278.395867" yMax="586.341720">展委託可行</word>
<word xMin="264.840000" yMin="572.375700" xMax="320.813062" yMax="586.341720">行性研究</word>
...
pdftotext does a great job of "undoing" the physical layout (columns, hyphenation, etc.) of a PDF document. However, its output contains some overlapping and duplicate words: above, "G" repeats the start of "G01", and "站延伸" overlaps "伸至大溪". PDF layout engines sometimes generate these quirks when images and text are mixed within a page.
On the other hand, pdftojson theFile.pdf could generate this:
...
{
"xMin": 103.32,
"xMax": 348.29439,
"yMin": 547.3557,
"yMax": 561.32172,
"text": "(6)綠線 G01 站延伸至大溪、龍潭先進公"
},
{
"xMin": 124.68,
"xMax": 320.813062,
"yMin": 572.3757,
"yMax": 586.34172,
"text": "共運輸系統發展委託可行性研究"
}
...
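To make the merge above more concrete, here is a minimal sketch of how overlapping words on one line could be combined. It is not pdftojson's actual implementation: the function names `overlapLength` and `mergeLine`, the tie-breaking rule, and the 1 pt `gapThreshold` for inserting a space are all assumptions chosen for illustration.

```javascript
// Illustrative sketch only; pdftojson's real merging logic may differ.

// Length of the longest suffix of `left` that is also a prefix of `right`.
function overlapLength(left, right) {
  for (let k = Math.min(left.length, right.length); k > 0; k--) {
    if (left.endsWith(right.slice(0, k))) return k;
  }
  return 0;
}

// Merge the words of one text line into a single bounding box.
// `gapThreshold` (in pt) is an assumed cutoff for inserting a space.
function mergeLine(words, gapThreshold = 1) {
  // Left to right; on equal xMin, the wider word comes first.
  const sorted = [...words].sort((a, b) => a.xMin - b.xMin || b.xMax - a.xMax);
  const merged = { ...sorted[0] };
  for (const word of sorted.slice(1)) {
    // Fully covered duplicate, e.g. "G" inside "G01".
    if (word.xMax <= merged.xMax) continue;
    if (word.xMin < merged.xMax) {
      // Partial overlap: drop the characters already present at the end.
      merged.text += word.text.slice(overlapLength(merged.text, word.text));
    } else {
      const gap = word.xMin - merged.xMax;
      merged.text += (gap > gapThreshold ? ' ' : '') + word.text;
    }
    merged.xMax = word.xMax;
    merged.yMin = Math.min(merged.yMin, word.yMin);
    merged.yMax = Math.max(merged.yMax, word.yMax);
  }
  return merged;
}

const line = mergeLine([
  { xMin: 180.3, xMax: 222.295867, yMin: 547.3557, yMax: 561.32172, text: '站延伸' },
  { xMin: 208.08, xMax: 264.053062, yMin: 547.3557, yMax: 561.32172, text: '伸至大溪' },
]);
console.log(line.text); // 站延伸至大溪
```

The sketch assumes the words have already been grouped by line, and that a duplicated character always appears at the end of one word and the start of the next. Real documents may need a more tolerant comparison.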
$ npm install pdftojson
pdftojson uses pdftotext. Please make sure pdftotext is available in PATH.
pdftojson is available as a command line tool and as a Node.js library.
# outputs some.json
$ pdftojson some.pdf
# converts pages 3 through 6 of some.pdf and outputs some.json
$ pdftojson -c "-f 3 -l 6" some.pdf
The library exposes a single function that takes the name of a PDF file and returns a promise.
import pdftojson from 'pdftojson';
pdftojson("./some.pdf").then((output) => {
  // output is a JavaScript array of page objects.
});
All numeric values are in points (pt; 1 pt = 1/72 inch).
[
  { // Page
    width: (Number) page width,
    height: (Number) page height,
    words: [
      {
        text: (String) the text enclosed in the bounding box,
        // All coordinates calculated from top-left corner of the page
        xMin: (Number) left edge of the bounding box,
        xMax: (Number) right edge of the bounding box,
        yMin: (Number) top edge of the bounding box,
        yMax: (Number) bottom edge of the bounding box
      }, // ...
    ]
  }, // ...
]
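As an illustration of working with this format, the sketch below shows two small helpers that operate on the output structure. Neither helper is part of pdftojson: `toPixels` and `wordsInRegion` are hypothetical names, and the sample page size is made up. The only fact relied on is that values are in pt with a top-left origin, as described above.

```javascript
// Hypothetical helpers for the output format; not part of pdftojson.
const POINTS_PER_INCH = 72;

// Scale a bounding box from pt to pixels at the given resolution (dpi).
function toPixels(box, dpi) {
  const scale = dpi / POINTS_PER_INCH;
  return {
    xMin: box.xMin * scale,
    xMax: box.xMax * scale,
    yMin: box.yMin * scale,
    yMax: box.yMax * scale,
  };
}

// Return the words lying entirely inside `region` (pt, top-left origin).
function wordsInRegion(page, region) {
  return page.words.filter((word) =>
    word.xMin >= region.xMin && word.xMax <= region.xMax &&
    word.yMin >= region.yMin && word.yMax <= region.yMax);
}

// Sample data shaped like the output; the page size is an assumption.
const pages = [{
  width: 595,
  height: 842,
  words: [
    { xMin: 103.32, xMax: 348.29439, yMin: 547.3557, yMax: 561.32172,
      text: '(6)綠線 G01 站延伸至大溪、龍潭先進公' },
    { xMin: 124.68, xMax: 320.813062, yMin: 572.3757, yMax: 586.34172,
      text: '共運輸系統發展委託可行性研究' },
  ],
}];

const hits = wordsInRegion(pages[0], { xMin: 100, xMax: 350, yMin: 540, yMax: 565 });
console.log(hits.map((word) => word.text));
```

Because yMin is the top edge, a region nearer the top of the page has smaller y values, which is the reverse of PDF's native bottom-left coordinate system.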