About

This script extracts annotations (highlights, comments, etc.) from a PDF file and formats them as plain text.

The script uses colormath to identify the colors of highlights; see the wiki. The default template uses these colors to determine hierarchy and meaning.
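
As a rough illustration of how a highlight color could be mapped to a name, the sketch below picks the nearest entry from a small palette using plain Euclidean distance in RGB. This is a simplified stand-in, not the script's own method: the script relies on colormath, and the palette values and the function name `nearest_color_name` are assumptions made for this example.

```python
# Hypothetical palette; the names and values the script really uses may differ.
# Components are assumed to be RGB floats in the range 0..1.
PALETTE = {
    "yellow": (1.0, 1.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
    "red": (1.0, 0.0, 0.0),
}


def squared_distance(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_color_name(rgb, palette=PALETTE):
    """Return the name of the palette entry closest to rgb."""
    return min(palette, key=lambda name: squared_distance(rgb, palette[name]))


print(nearest_color_name((0.98, 0.93, 0.10)))
```

Distance in RGB is only a crude proxy for perceived color difference, which is presumably why a color library such as colormath is used instead.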

At present, the following annotations are supported:

Highlights without an attached comment are output first, as "highlights" with just the highlighted text included.
Highlights with an attached comment, and text annotations (not attached to any particular text/highlight), are output next, as "detailed comments".
Underline, strikeout, and squiggly underline annotations are output last, as "Nits", with or without an attached comment. The intention is to make it easy to separate formatting or grammatical corrections from more substantial comments about the content of the document.
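
The grouping described above can be sketched roughly as follows. The `Annotation` record, its field names, and the functions `classify` and `group` are hypothetical and exist only for this example; the `kind` values follow the standard PDF annotation subtype names, and the script's real data structures may look different.

```python
from collections import namedtuple

# Hypothetical record type, not the script's actual class.
Annotation = namedtuple("Annotation", ["kind", "page", "text", "comment"])

NIT_KINDS = {"Underline", "StrikeOut", "Squiggly"}


def classify(annot):
    """Return 'highlights', 'comments', 'nits', or None if unsupported."""
    if annot.kind in NIT_KINDS:
        return "nits"  # with or without an attached comment
    if annot.kind == "Highlight":
        return "comments" if annot.comment else "highlights"
    if annot.kind == "Text":
        return "comments"  # free-standing text annotation
    return None


def group(annots):
    """Sort annotations into the three output groups, skipping unsupported kinds."""
    groups = {"highlights": [], "comments": [], "nits": []}
    for annot in annots:
        name = classify(annot)
        if name is not None:
            groups[name].append(annot)
    return groups


sample = [
    Annotation("Highlight", 1, "key claim", None),
    Annotation("Highlight", 2, "odd result", "Check this number"),
    Annotation("StrikeOut", 3, "teh", None),
]
print({name: len(items) for name, items in group(sample).items()})
```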

For each annotation, the page number is given, along with the associated (highlighted/underlined) text, if any. Additionally, if the document includes outlines (aka bookmarks), such as those generated by the hyperref package, those are also used to identify the section of the document to which the annotation refers.
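
One way to relate an annotation to a section is to take the last outline entry that starts on or before the annotation's page. The sketch below assumes outlines are available as a list of `(page, title)` pairs sorted by page; that representation and the function name `section_for_page` are assumptions for illustration, and the script itself may resolve positions more precisely than whole pages.

```python
import bisect


def section_for_page(outlines, page):
    """Return the title of the last outline starting on or before page.

    outlines is a list of (page, title) tuples sorted by page.
    Returns None if the page precedes every outline.
    """
    start_pages = [start for start, _ in outlines]
    index = bisect.bisect_right(start_pages, page) - 1
    return outlines[index][1] if index >= 0 else None


outlines = [(1, "Introduction"), (4, "Methods"), (9, "Results")]
print(section_for_page(outlines, 5))
```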

See the wiki for more information.

Installation

 pip install pdfminer.six chardet six colormath Jinja2 pathlib
 python setup.py install

Usage

pdf-highlights.py FILE.PDF [> OUTPUT]

Dependencies

My own setup:

Python 3.6
chardet (3.0.4)
colormath (3.0.0)
Jinja2 (2.10)
pathlib (1.0.1)
pdfminer.six (20170720)
six (1.11.0)

Output formatting

There's a Jinja2 template you can adapt as you like. The script exposes the following data to the template:

highlights annotations
comments annotations
editing annotations
Author
Title
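
To give an idea of what a custom template might look like, here is a minimal sketch that renders a title line followed by the highlights. The variable names `title`, `author`, and `highlights`, and the `page` and `text` fields on each item, are assumptions for this example; check the bundled template for the names the script actually passes.

```python
from jinja2 import Template

# Hypothetical template; variable and field names are illustrative only.
TEMPLATE = Template(
    "{{ title }} ({{ author }})\n"
    "{% for h in highlights %}"
    "Page {{ h.page }}: {{ h.text }}\n"
    "{% endfor %}"
)

output = TEMPLATE.render(
    title="Sample document",
    author="A. Reader",
    highlights=[{"page": 2, "text": "key claim"}],
)
print(output)
```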