pandadoc: A Python repository from chris-mcdo

pandadoc: lightweight pandoc wrapper

An extremely lightweight pandoc wrapper for Python 3.8+.

Features:

Supports conversion between all formats that pandoc supports - markdown, HTML, LaTeX, Word, EPUB, PDF (output only), and more.
Output to raw bytes (binary formats - e.g. PDF), to str objects (text formats - e.g. markdown), or to file (any format).
pandoc errors are raised as (informative) exceptions.
Full flexibility of the pandoc command-line tool, and the same syntax. (See the pandoc manual for more information.)

Getting Started Guide

Installation

First, ensure pandoc is on your PATH. (In other words, install pandoc and add it to your PATH.)

Then install pandadoc from PyPI:

$ python -m pip install pandadoc

That's it.
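To confirm both prerequisites are in place, a quick check along these lines may help. This is only a sketch using the standard library; the helper name check_prerequisites is made up for illustration and is not part of pandadoc.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether pandoc is on PATH and pandadoc is importable."""
    return {
        # shutil.which returns the executable's full path, or None if absent.
        "pandoc": shutil.which("pandoc") is not None,
        # find_spec returns None when the package is not installed.
        "pandadoc": importlib.util.find_spec("pandadoc") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If either entry is reported as missing, revisit the corresponding installation step above.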

Usage

Convert a webpage (or file) to markdown, and store it as a Python str:

>>> import pandadoc
>>> input_file = "https://example.com/"
>>> # Or: input_file = "path/to/my/file.html"
>>> example_md = pandadoc.call_pandoc(
...    options=["-t", "markdown"], files=[input_file]
... )
>>> print(example_md)
<div>

# Example Domain

This domain is for use in illustrative examples in documents.
...

Now convert the markdown to RTF, and write it to a file:

>>> rtf_output_file = "example.rtf"
>>> pandadoc.call_pandoc(
...     options=["-f", "markdown", "-t", "rtf", "-o", rtf_output_file],
...     input_text=example_md,
... )
''

Notice that call_pandoc returns an empty string '' when a file output is used.

Looking at the output file:

{\pard \ql \f0 \sa180 \li0 \fi0 \outlinelevel0 \b \fs36 Example Domain\par}
{\pard \ql \f0 \sa180 \li0 \fi0 This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\par}
{\pard \ql \f0 \sa180 \li0 \fi0 {\field{\*\fldinst{HYPERLINK "https://www.iana.org/domains/example"}}{\fldrslt{\ul
More information...
}}}
\par}

Convert this RTF document to PDF, using xelatex with a custom font, and store the result as raw bytes:

>>> raw_pdf = pandadoc.call_pandoc(
...     options=["-f", "rtf", "-t", "pdf", "--pdf-engine", "xelatex", "--variable", "mainfont=Palatino"],
...     files=[rtf_output_file],
...     decode=False,
... )

(Note that PDF conversion requires a "PDF engine" to be installed - e.g. pdflatex, latexmk etc.)

Now you can send those raw bytes over a network, or write them to a file:

>>> with open("example.pdf", "wb") as f:
...     f.write(raw_pdf)
...
>>> # Finished
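Before sending or saving raw bytes, it can be worth checking that they look like a PDF at all. PDF files begin with the marker %PDF-, so a minimal check might look like the sketch below. The helper looks_like_pdf is hypothetical and not part of pandadoc, and it only inspects the header rather than validating the whole file.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker."""
    return data.startswith(b"%PDF-")


# Sample byte strings for illustration; in practice, pass the bytes
# returned by a conversion that used decode=False.
print(looks_like_pdf(b"%PDF-1.7\n..."))     # True
print(looks_like_pdf(b"# Example Domain"))  # False
```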

You can find more examples in the pandoc documentation.

Exceptions

If pandoc exits with an error, an appropriate exception is raised (based on the exit code):

>>> pandadoc.call_pandoc(
...     options=["-f", "markdown", "-t", "zzz"], # non-existent format
...     input_text=example_md,
... )
Traceback (most recent call last):
...
pandadoc.exceptions.PandocUnknownWriterError: Unknown output format zzz
>>> isinstance(pandadoc.exceptions.PandocUnknownWriterError(), pandadoc.PandocError)
True

You can find a full list of exceptions in the pandadoc.exceptions module.

Explanation

The pandoc command-line tool works like this:

pandoc [OPTIONS] [FILES]

In addition to the OPTIONS (documented in the pandoc manual), you can provide either some FILES, or some input text (via stdin).

The call_pandoc function of pandadoc works in a similar way:

The options argument contains a list of pandoc options. E.g. ["-f", "markdown", "-t", "html"].
The files argument is a list of file paths (or absolute URIs). E.g. ["path/to/file.md", "https://www.fsf.org"].
The input_text argument is used as text input to pandoc. E.g. "# Simple Doc\n\nA simple markdown document\n".
The timeout argument can be used to stop pandoc if it takes too long.
The decode argument determines whether the result should be decoded to a str (True by default) or left as raw bytes.
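To make the mapping concrete, here is a rough sketch of how arguments like these could be translated into a `pandoc [OPTIONS] [FILES]` invocation with the standard subprocess module. It is an illustration under stated assumptions, not pandadoc's actual implementation: the names build_command and run_pandoc are invented here, and error handling is left to subprocess rather than mapped to pandadoc's exceptions.

```python
import subprocess


def build_command(options, files=None):
    """Assemble the argument list for `pandoc [OPTIONS] [FILES]`."""
    return ["pandoc", *options, *(files or [])]


def run_pandoc(options, files=None, input_text=None, timeout=None, decode=True):
    """Run pandoc and return its stdout as str (decode=True) or bytes."""
    # Input text, if any, is passed to pandoc via stdin as UTF-8 bytes.
    stdin_data = input_text.encode("utf-8") if input_text is not None else None
    completed = subprocess.run(
        build_command(options, files),
        input=stdin_data,
        capture_output=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
        check=True,       # raises subprocess.CalledProcessError on failure
    )
    return completed.stdout.decode("utf-8") if decode else completed.stdout


# Show the command that would be run, without invoking pandoc.
print(build_command(["-f", "markdown", "-t", "html"], ["path/to/file.md"]))
```

Keeping the command assembly in its own function makes it easy to inspect exactly what would be passed to pandoc before running anything.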

Bugs/Requests

Please use the GitHub issue tracker to report bugs or request features.

Feedback is always appreciated.

License

Distributed under the MIT license.
