PDF information extraction library based on pdf.js and Node.js, with various output formats.
npm install -g pdf-gold-digger
pdfdig -i some_file.pdf
pdfdig -h
Example: pdfdig -i input-file -o output_directory -f json
--input or -i pdf file location (required)
--output or -o output directory (optional - default "out")
--debug or -d show debug information (optional - default "false")
--format or -f output format (optional - default "text") - one of "text", "json", "xml", "html"
--font or -t extract fonts as ttf files (optional)
--password or -p password for a protected pdf file (optional)
--help or -h display this help message
--version or -v display version information
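
The options above can also be assembled programmatically when pdfdig is called from a Node.js script. The following is a minimal sketch, not part of pdf-gold-digger itself: `buildPdfdigArgs` and `runPdfdig` are hypothetical helper names, and the code assumes that `--font` and `--debug` are bare switches that take no value and that `pdfdig` is installed globally and available on PATH.

```javascript
const { spawnSync } = require("child_process");

// Hypothetical helper: turns an options object into pdfdig arguments.
// Only flags from the option list above are used.
function buildPdfdigArgs(options) {
  if (!options || !options.input) {
    throw new Error("input is required (--input / -i)");
  }
  const args = ["-i", options.input];
  if (options.output) args.push("-o", options.output);
  if (options.format) {
    const allowed = ["text", "json", "xml", "html"];
    if (!allowed.includes(options.format)) {
      throw new Error(`unsupported format: ${options.format}`);
    }
    args.push("-f", options.format);
  }
  if (options.password) args.push("-p", options.password);
  // Assumption: --font and --debug are bare switches without a value.
  if (options.font) args.push("-t");
  if (options.debug) args.push("-d");
  return args;
}

// Hypothetical helper: runs pdfdig synchronously and returns its exit status.
// Assumes pdfdig is on PATH; on Windows a shell may be needed for npm shims.
function runPdfdig(options) {
  const result = spawnSync("pdfdig", buildPdfdigArgs(options), { stdio: "inherit" });
  if (result.error) throw result.error;
  return result.status;
}
```

Calling `runPdfdig({ input: "some_file.pdf", output: "out", format: "json" })` would then correspond to `pdfdig -i some_file.pdf -o out -f json`.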
git clone https://github.com/vane/pdf-gold-digger
sh demo.sh
and see results in the out directory
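
To get a quick overview of what a run produced, the output directory can be inspected from Node.js. The layout and file names inside the output directory are not described here, so this sketch makes no assumption about them and simply groups whatever files it finds by extension; `groupFilesByExtension` is a hypothetical helper name.

```javascript
const fs = require("fs");
const path = require("path");

// Recursively collects file paths under dir, grouped by lowercased extension.
// Files without an extension are collected under the key "(none)".
function groupFilesByExtension(dir) {
  const groups = {};
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Merge the results of the subdirectory into the current groups.
      const nested = groupFilesByExtension(full);
      for (const [ext, files] of Object.entries(nested)) {
        groups[ext] = (groups[ext] || []).concat(files);
      }
    } else {
      const ext = path.extname(entry.name).toLowerCase() || "(none)";
      groups[ext] = groups[ext] || [];
      groups[ext].push(full);
    }
  }
  return groups;
}
```

For example, `groupFilesByExtension("out")` returns an object whose keys are extensions and whose values are arrays of file paths.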
- extract text
- separate each page
- separate each line
- separate font information
- extract images
- output formats
- text: -f text (default)
- json: -f json
- xml: -f xml
- html: -f html
- specify output directory
- load pdf from remote location
- from url
- output to markdown format
- pack output to zip
- extract tables
- extract forms
- extract drawings
- extract text from glyphs
- ability to provide input file for glyph path to letter
- detect when unicode is not provided or mangled
- get bounding box from text and draw it on canvas
- use tesseract.js as optional fallback
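
For the item on detecting when unicode is not provided or mangled, one possible heuristic is to measure how many extracted characters fall into ranges that rarely appear in real text. This is only a sketch under stated assumptions, not how pdf-gold-digger does or will do it: `suspiciousCharRatio` and `looksMangled` are hypothetical names, the choice of "suspicious" ranges (replacement character, Basic Multilingual Plane private use area, control characters) is an assumption, and the 0.3 threshold is arbitrary.

```javascript
// Hypothetical heuristic: share of characters that suggest a failed
// glyph-to-unicode mapping. Returns a number between 0 and 1.
function suspiciousCharRatio(text) {
  let total = 0;
  let suspicious = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    total++;
    const isReplacement = cp === 0xfffd; // U+FFFD REPLACEMENT CHARACTER
    const isPrivateUse = cp >= 0xe000 && cp <= 0xf8ff; // BMP private use area
    // Control characters other than tab, line feed and carriage return.
    const isControl = cp < 0x20 && cp !== 0x09 && cp !== 0x0a && cp !== 0x0d;
    if (isReplacement || isPrivateUse || isControl) suspicious++;
  }
  return total === 0 ? 0 : suspicious / total;
}

// Flags text as probably mangled when the ratio reaches the threshold.
// The default threshold is an arbitrary assumption and would need tuning.
function looksMangled(text, threshold = 0.3) {
  return suspiciousCharRatio(text) >= threshold;
}
```

Text flagged this way could then be routed to one of the fallbacks listed above, such as a glyph path to letter mapping or OCR.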