This repository contains a Docker image that generates previews and thumbnails from files in common office document
formats. It uses LibreOffice via unoconv for rendering and exposes several Celery tasks to access this functionality.
Documents are read, and previews and thumbnails are written, via PyFilesystem; support for accessing S3 object stores
via fs-s3 is included. The image is intended to be deployed with Kubernetes but can also be used with Docker.
When instantiating the image as a container, the mode the container should run in needs to be specified. There are two possible modes:
- celery-worker: In this mode a Celery worker is started which publishes four tasks with the following signatures:
  - unoconv.tasks.supported_import_format(*, mime_type: str = None, extension: str = None) -> bool
    Returns a boolean value indicating whether a document format is supported. Either mime_type or extension or both have to be set. The extension must include the leading dot.
  - unoconv.tasks.generate_preview_jpg(*, input_fs_url: str, input_file: str, output_fs_url: str, output_file: str, mime_type: str = None, extension: str = None, pixel_height: int = None, pixel_width: int = None, maintain_ratio: bool = False, quality: int = None, timeout: int = UNOCONV_DEFAULT_TIMEOUT)
    This task renders the first page (or slide) of a document as a JPEG image.
    - The document is read from input_fs_url:input_file and the JPEG image is written to output_fs_url:output_file.
    - mime_type and extension are interpreted just as with unoconv.tasks.supported_import_format. If extension is None the task tries to guess it from the supplied input_file name.
    - pixel_height and pixel_width specify the dimensions of the resulting image and are optional (i.e. they must either both be set or both be None). The behaviour is different when maintain_ratio is True, see below.
    - maintain_ratio activates automatic scaling of the image that preserves the aspect ratio. The image is scaled so that it fits into the bounding box given by pixel_height and pixel_width while preserving the aspect ratio. If maintain_ratio is True the image is rendered twice: once to determine the dimensions of the original document and a second time with the calculated dimensions applied.
    - quality determines the quality of the resulting JPEG image by tuning the compression algorithm. Valid values are between 1 (lowest quality, smallest file size) and 100 (highest quality, largest file size).
    - timeout specifies a timeout for the invoked unoconv command.
    Exceptions thrown:
    - ValueError: Input format is unsupported or the supplied dimensions are invalid
    - FileNotFoundError: Input file was not found
    - RuntimeError: All other cases
  - unoconv.tasks.generate_preview_png(*, input_fs_url: str, input_file: str, output_fs_url: str, output_file: str, mime_type: str = None, extension: str = None, pixel_height: int = None, pixel_width: int = None, maintain_ratio: bool = False, compression: int = None, timeout: int = UNOCONV_DEFAULT_TIMEOUT)
    This task works just like unoconv.tasks.generate_preview_jpg but generates a PNG image instead. It uses the compression parameter instead of the quality parameter to tune the image compression algorithm:
    - Valid values for compression are between 1 (lowest compression) and 9 (highest compression).
  - unoconv.tasks.generate_pdf(*, input_fs_url: str, input_file: str, output_fs_url: str, output_file: str, mime_type: str = None, extension: str = None, paper_format: str = None, paper_orientation: str = None, timeout: int = UNOCONV_DEFAULT_TIMEOUT)
    Again, this is similar to the last two tasks, but in this case a PDF document containing all pages (or slides) is generated. Instead of image dimensions and compression ratios, the paper_format and paper_orientation can be specified:
    - Valid values for paper_format depend on the LibreOffice version. Some valid values are A3, A4, A5, B4, B5, LETTER, and LEGAL.
    - Valid values for paper_orientation are PORTRAIT and LANDSCAPE.
    If paper_format is specified without a paper_orientation, LibreOffice assumes an orientation of PORTRAIT. So even when only paper_format is specified, both settings in the original document are overridden.
  To configure the Celery workers to connect to the Celery backends, the Celery configuration needs to be mounted as /celery-worker/config/celeryconfig.py inside the container. It contains configuration variable assignments as per the Celery documentation. To get the results of the tasks a result backend is needed.
  These tasks need to be called by name. It is possible to use send_task for this or to define a signature with one of the names above:
      from celery import Celery

      app = Celery()
      supported_import_format = app.signature('unoconv.tasks.supported_import_format')
      generate_preview_jpg = app.signature('unoconv.tasks.generate_preview_jpg')
      generate_preview_png = app.signature('unoconv.tasks.generate_preview_png')
      generate_pdf = app.signature('unoconv.tasks.generate_pdf')
- unoconv-listener: This mode starts unoconv as a server process inside the container. This container is optional, but as the startup of LibreOffice is expensive it speeds things up and uses fewer resources. If this container is not present, the celery-worker container starts up an unoconv server and LibreOffice instance by itself each time a request comes in and terminates it again when done.
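The celery-worker tasks described above only accept keyword arguments and are called by name. The following is a minimal client-side sketch, not part of this repository: the helper names preview_jpg_kwargs and submit_preview_jpg are hypothetical, and the checks merely mirror the parameter rules stated above (the worker remains the authority on what is valid). The app argument is assumed to be a Celery instance configured with the same broker as the worker.

```python
from typing import Any, Dict, Optional

# Task name as published by the celery-worker mode.
TASK_NAME = "unoconv.tasks.generate_preview_jpg"


def preview_jpg_kwargs(
    input_fs_url: str,
    input_file: str,
    output_fs_url: str,
    output_file: str,
    extension: Optional[str] = None,
    pixel_height: Optional[int] = None,
    pixel_width: Optional[int] = None,
    maintain_ratio: bool = False,
) -> Dict[str, Any]:
    """Build the keyword arguments for generate_preview_jpg (hypothetical helper)."""
    # The extension must include the leading dot.
    if extension is not None and not extension.startswith("."):
        raise ValueError("extension must include the leading dot")
    # pixel_height and pixel_width are either both set or both None.
    if (pixel_height is None) != (pixel_width is None):
        raise ValueError("set both pixel_height and pixel_width, or neither")
    return {
        "input_fs_url": input_fs_url,
        "input_file": input_file,
        "output_fs_url": output_fs_url,
        "output_file": output_file,
        "extension": extension,
        "pixel_height": pixel_height,
        "pixel_width": pixel_width,
        "maintain_ratio": maintain_ratio,
    }


def submit_preview_jpg(app, **kwargs):
    """Send the task by name; `app` is assumed to be a configured celery.Celery instance."""
    # The task signature is keyword-only, so everything is passed via kwargs.
    return app.send_task(TASK_NAME, kwargs=preview_jpg_kwargs(**kwargs))
```

Waiting for the outcome by calling get() on the returned result object requires a result backend, as noted above.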
The mode needs to be supplied as a single argument to the container's entry-point. In Kubernetes this is done via the
args option in container specifications. When using docker-compose or Docker Swarm this would be command.
To deploy docker-unoconv with Kubernetes it is best to use the provided Helm chart. It can be found in charts/unoconv.
If you're not using Helm, the manifest templates in charts/unoconv/templates will still be a good starting point
for building your own manifests.
The Helm chart comes with a few configuration options:
The unoconv listener can be disabled by setting containers.unoconvListener.enabled to false, but normally it
should always be enabled.
    containers:
      unoconvListener:
        enabled: true
The configuration for the Celery worker needs to be supplied under the key containers.celeryWorker.config. It is
injected into the container via a ConfigMap.
    containers:
      celeryWorker:
        config: |
          broker_url = 'amqp://guest:guest@rabbitmq:5672'
          result_backend = 'rpc://'
          tasks_queues = 'unoconv'
By default the deployment consists of five pods. Each Celery worker pod runs just one worker process, so the
workers need to be scaled by increasing the number of replicas. This can be done automatically by enabling the
horizontal autoscaler below.
    replicaCount: 5
With the standard settings the Helm chart will use the latest image. For production deployments it is recommended
to specify a release version instead of using latest. In that case the pullPolicy should be set to IfNotPresent.
    image:
      repository: elementalnet/unoconv
      tag: latest
      pullPolicy: Always
To access documents residing on a filesystem a data volume can be mounted into the Celery worker container:
    containers:
      celeryWorker:
        dataVolume:
          enabled: false
          # Mount path inside the Celery worker container
          mountPath: /data
          reference:
            persistentVolumeClaim:
              claimName: your-pvc
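With a data volume mounted at /data, a client has to turn a file path into the input_fs_url and input_file pair that the tasks expect. The following sketch shows one way to do that, assuming the worker accepts PyFilesystem's osfs:// URL scheme for the local filesystem; the helper name split_for_task is hypothetical and not part of this repository.

```python
import posixpath
from typing import Tuple


def split_for_task(path: str, mount_path: str = "/data") -> Tuple[str, str]:
    """Split an absolute path below mount_path into (fs_url, file) (hypothetical helper)."""
    norm = posixpath.normpath(path)
    root = posixpath.normpath(mount_path)
    # Reject paths outside the mounted volume, including "../" escapes.
    if norm == root or posixpath.commonpath([norm, root]) != root:
        raise ValueError(f"{path!r} is not a file below {mount_path!r}")
    # osfs:// is PyFilesystem's URL scheme for the local filesystem (assumed to be accepted here).
    return f"osfs://{root}", posixpath.relpath(norm, root)
```

The two returned values would then be passed as input_fs_url and input_file (or output_fs_url and output_file) when calling a task.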
It is also possible to specify resources. Currently both containers use the same resource allocation. This
might turn out to be suboptimal and separate resource specifications might be needed in the future. A horizontal
pod autoscaler can be enabled to adjust the number of replicas automatically.
    resources: {}
      # limits:
      #   cpu: 100m
      #   memory: 128Mi
      # requests:
      #   cpu: 100m
      #   memory: 128Mi
    horizontalPodAutoscaler:
      # Remember to set resources above if you enable this
      enabled: false
      minReplicas: 1
      maxReplicas: 10
      targetCPUUtilizationPercentage: 50
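To get a feeling for how these autoscaler values interact, the following sketch applies the scaling rule from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), clamped to minReplicas and maxReplicas. It is a simplification that ignores the autoscaler's tolerance and stabilization behaviour; the function name is hypothetical. CPU utilization is measured relative to the requested CPU, which is why resources have to be set for the autoscaler to work.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_percent: float,
    target_cpu_percent: float = 50,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Simplified horizontal pod autoscaler rule (illustrative only)."""
    # Scale the replica count by the ratio of observed to target utilization.
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))
```

For example, five replicas averaging 80 % CPU against a 50 % target would be scaled to eight replicas under this rule.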
The last three options relate to pod placement:
nodeSelector: {}
tolerations: []
affinity: {}
Please see tests/docker-compose.yaml for an example of how to use this image with Docker.
A pre-built Docker image is available on Docker Hub at https://hub.docker.com/r/elementalnet/unoconv. The current
master branch is available under the tags latest and master. Releases are available with their respective
version as the tag. All images are built automatically via Travis CI.
- During testing I've seen some crashes of unoconv which seem to be related to memory corruption. These are not directly reproducible and sometimes seem to correlate with crashes of LibreOffice. Stability increased after disabling the listener and thus using a new LibreOffice instance for each new task.
- Again during testing I've seen AMQP heartbeat failures when some tasks take over a second to complete. The workaround was to disable heartbeats with broker_heartbeat = None. I'm not sure if this is related to the test setup, the Celery configuration or a general problem.