README
This is the README file for the doc-converter
project. The
doc-converter
project implements an HTTP service, architected to run
in the Amazon AWS envrironment, for among other things, converting
.xls[x] and .doc[x] documents to PDF. This project will produce three
RPMs:
doc-converter-1.0.0-0.rpm
doc-converter-client-1.0.0-0.rpm
doc-converter-utils-1.0.0-0.rpm
...or some other versioned RPMs. Just to be clear, this project's targets are RPMs that you can then use to create a document conversion server and client. In other words, you build these RPMs then follow some additional instructions in order to create the server.
If you don't want to create your own customized RPMs you can try
to install the RPMs (no guarantees that they will actually work for you
however) from the quasi-official doc-converter
yum repository:
[doc-converter]
baseurl=http://doc-converter.s3-website-us-east-1.amazonaws.com
name=doc-converter
gpgcheck=0
enabled=1
...or if you want to descend into the hell that is
yum-config-manager
(hint: it is very buggy), then try this:
$ sudo yum-config-manager --add-repo http://doc-converter.s3-website-us-east-1.amazonaws.com
$ sudo yum-config-manager --save --setopt 'doc-converter.s3-website-us-east-1.amazonaws.com.name=doc-converter'
$ sudo yum-config-manager --save --setopt 'doc-converter.s3-website-us-east-1.amazonaws.com.gpgcheck=0'
$ sudo yum -y install doc-converter
$ sudo yum -y install doc-converter-client
P.S. The reason for setting name
in the repo config file is so
yum-config-manager
will add the gpgcheck
option. No attempt to set
gpgcheck
with out first having gpgcheck
in the file was successful
until I hit upon this work around.
doc-converter
Architecture
The doc-converter
service employs your S3 bucket as a document
repository. Both the doc-converter
client and server use the S3
bucket for storing documents. In ASCII art, the architecture looks
something like this:
+------<<< S3 bucket policy >>>-------+
| |
| s3://mybucket |
| |
| 10.0.1.81 |
_____o_____ +-------+ |
\ / .xls[x], ... | | -----o----
\ S3 / -----------> | EC2 o---{ IAM Role }
\ / <------------ | | ----------
\___/ .pdf, .png +-+-----+
^ / +-- LibreOffice
| / +-- ImageMagick
|.txt /////// +-- Apache
|.xls[x] /
|.doc[x] / GET/POST
| /
+--------+ / $ curl http://10.0.1.81/converter/mybucket/FE76C51A-872D-11E5-89DF-3E269020DED9
| | /
| Client |/
| |
+--------+ $ doc2pdf-client -b mybucket -h 10.0.1.81 test.xlsx
Your clients push documents to your S3 bucket (more on permissions you'll need for your bucket later) and then request a conversion service by making a request to an HTTP endpoint. The server then reads the bucket and converts the documents. The client can then retrieve the PDF and/or .png files from the S3 bucket.
Architectural Considerations and Alternatives
On possible way to implement this service takes advantage of the fact that S3 buckets can be configured to generate various events when objects are modified. The S3 service can generate:
- an SQS notification
- an SNS notification
- a Lambda event
In the the first two scenarios, we would need some sort of subscriber to those events, either a message queue handler of some kind that reads the SQS queue and processes the events or an endpoint that handles SNS notifications and takes an appropriate action. In either case, we would need to provision an EC2 instance to run our code.
Scenario 3 is clearly more appealing. That is, create a Lambda function that responds to the S3 event and then performs some magic on the documents. While this appears a bit more appealing architecturally, it introduces some complexity in figuring out exactly how a Lambda function can invoke LibreOffice!
In the end, I decided to implement a very simple store and request model. The client stores the document to a bucket shared between the client and server and then explicitly tells the server to do something. This explicit request, which can be made by the client after the document is pushed to the shared S3 bucket, is low-cost (as in nearly zero) as compared to generating an event placed on a queue that will need to be read at some frequency. SQS reads have a cost, albeit very small. However if you plan on running your service 24x7x365 you'll be probably be doing a lot of reads that come up empty.
The store and request model only requires a small Perl CGI that forks and executes LibreOffice from the command line. While a bit unsatisfying from the elegance perspective it seems to do the trick reliably, if not quickly.
While not sexy, it does support the use case I was actually interested in. My client creates a document in an S3 bucket for both potentially immediate needs and for archival requirements. In general I might not need the PDF or .png thumbnails right away so I'm not necessarily interested in waiting around for the server to complete those tasks.
In that case it would be sufficient for my client to store the file to S3 and then just request the conversion. Of course, I could opt to wait for the conversion as well. YMMV.
I note alternative architectures for file conversions in the event someone out there has a niftier solution.
Description
Behind the scenes, the document conversion process uses LibreOffice's
headless mode from the command line. While you can apparently run
LibreOffice as a server in headless mode, my experience with that
method has not been all that positive (concurrency issues that will
eventually crash LibreOffice - a bug has been filed against LO and
this may in fact be fixed in 5.2) , hence this HTTP based conversion
process pays the penalty of a fork and exec to execute
soffice
from the command line.
Once you install the doc-converter
RPM, take a peek at man doc-converter
for more details. You can also take a look at
/opt/libreoffice50/program/soffice --help
for more details on
converting documents.
Reporting Bugs, Carping or Making Constructive Suggestions
Fire away: Rob Lauer - rlauer6@comcast.net
I would be especially interested in anyone that has figure out whether:
- LibreOffice really can be used to support multiple connections
- Has gotten the pyuno interface to work successfully and reliably
- ...or has another method that makes more sense to create a reliable document creation service.
Requirements
Summary
- an EC2 instance running in a publicly available subnet
- an S3 bucket that is readable and writable by both the client and the server. This can be achived using an IAM role that you can assign to both the client and server instance
- LibreOffice 5
- RPM build tools (
rpm-build
) automake
autoconf
rpm-build
Details
An EC2 Instance
The server that is used for the document converter is assumed to be an
AWS EC2 instance prepared with the libreoffice-doc-converter.json
CloudFormation stack template.
A CloudFormation template (libreoffice-create-stack
) to create
such a stack is included as part of the doc-converter-client
RPM.
To use the stack creation script you'll want to make sure that:
-
...you're launching the stack in a subnet that has access to the internet. This is necessary so that the required assets can be retrieved and installed.
-
...you take a look at the defaults defined in the CloudFormation JSON template and make the necessary modifications or override them on the command line when you run the
libreoffice-create-stack
bash script. Pay attention to these parameters in the template:- Subnet
- Keyname
- Role
- Instance Type
usage: libreoffice-create-stack [options]
-e echo only - just echo the command
-h|? this
-i instance type (ex: t2.micro)
-I AMI (defaults to: ami-e3106686
-k keyname
-L LibreOffice version (5.1.0)
-l LibreOffice release (3)
-n server name (defaults to 'libreoffice-doc-converter')
-R role name (defaults to EC2 instance role)
-r region (defaults to us-east-1)
-S subnet (no default - required)
-s stack name (defaults to 'libreoffice-doc-converter')
-t template (defaults to 'libreoffice-doc-converter.json')
-v validate template only
-u url for the doc-converter RPM repo
- ...you have an IAM role configured that will allow your EC2 server instance
and your client to read & write to an S3 bucket. The bucket will be
used by your client and the
doc-converter
service to store documents. The policy for an appropriate IAM role might look like this:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1446041925000",
"Effect": "Allow",
"Action": [
"s3:ListAllMyBuckets"
],
"Resource": "arn:aws:s3:::*"
},
{
"Effect": "Allow",
"Action": [
"s3:ListObjects",
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::mybucket",
"arn:aws:s3:::mybucket/*"
]
}
]
}
This policy will allow your EC2 instance, and hence the
doc-converter
service to read and write to a specific directory (and
sub-directories) within your bucket.
See libreoffice-doc-converter.json
.
See libreoffice-create-stack
Assuming you've defined an IAM role called bucket-writer
, to create
an EC2 instance that will implement your document conversion service,
use the provided script that submits the CloudFormation stack for
creation. To use the script you should have the AWS CLI tools installed.
$ libreoffice-create-stack -h
$ libreoffice-create-stack -t /usr/share/doc-converter/libreoffice-doc-converter.json \
-i t2.micro \
-L 5.1.0 \
-l 3 \
-R bucket-writer \
-k mykey \
-S subnet-932594e4
-u http://doc-converter.s3-website-us-east-1.amazonaws.com
NOTE: YOU MUST HAVE PERMISSIONS TO ACTUALLY CREATE EC2 INSTANCES AND OTHER ASSETS IN ORDER TO RUN THE STACK CREATION SCRIPT.
Network Considerations
Subnet
The stack has to be launched in a subnet, but which one? The stack will not create it's own network or subnet, so you must provide the subnet id. The script will determine the VPC that you are launching into by looking at the subnet id. If you aren't using a VPC, then you'll have to muck with the CloudFormation script and work that out for yourself. You should be using a VPC ;-)
Security Group
By default the stack will create it's own security group. It will open up ingress to ports 80 and 22 only. It is assumed you are launching the instance in a public subnet that has access to the internet so an external public IP is provisioned by default. If you want to launch your server in a private subnet, you can, but your instance must have access to the internet via a NAT or gateway otherwise the instance will not be able to access the resources it needs to configure itself.
If you do not want to provision an external IP, then set the PublicIpEnabled value to "false". If you want just want to block in-bound traffic but still want your server in a Public subnet, then you might want to change the ingres rules.
"SecurityGroupIngress" : [
{
"IpProtocol" : "tcp",
"FromPort" : "80",
"ToPort" : "80",
"CidrIp" : "10.1.0.0/16"
},
{
"IpProtocol" : "tcp",
"FromPort" : "22",
"ToPort" : "22",
"CidrIp" : "10.1.0.0/16"
}]
Here, I'm only allowing traffic from within my 10.1.0.0/16 subnet to access my server.
LibreOffice 5
The stack creation process mentioned above will grab a version of
LibreOffice 5 and install that during the instance creation. The
LibreOffice 5 tar ball is retrieved from the LibreOffice website using
the last known good location based on the desired version. The
LibreOffice version that is downloaded by default is 5.1.0 release 3.
The process thinks the tarball for that release might be found at
http://download.documentfoundation.org
as
LibreOffice_5.1.0_Linux_x86-64_rpm.tar.gz
. And that may in fact be
true, but you can try playing with the options in the -L and -l
options of the libreoffice-create-stack
script if you want a
different version of LibreOffice or it has moved since this release.
Take a look at the libreoffice-download
script to see how it attepts
to download and eventually install LibreOffice if the stack creation
process is failing for you.
The location of these assets might change, so you might have to modify the CloudFormation specification. Again, it is suggested you take a peek at the CloudFormation template.
Apache 2.2+
An Apache web server, listening on port 80 is also created as part of the stack creation process. An Apache configuration file is created and installed as:
/etc/httpd/conf.d/doc-converter.conf
There's nothing special about the configuration or the use of Apache.
It was just quick and convenient. You can probably make the
doc-converter.cgi
script work in other environments.
Amazon::S3
A modified copy of the Amazon::S3
Perl module is included in this
project. I apologize in advance to those who are offended by the fact
that I did not simply push my change (send the session token in the
header) to that project, however that CPAN project appears to be
either deprecated or on life support in favor of more modern, but
far heavier versions of Perl S3 interfaces.
LibreOffice Fixups
A Perl script, fix-OOXMLRecalcMode
is executed as part of the stack
creation process. It sets a LibreOffice configuration value so
spreadsheets are recalculated when they are loaded. This is required
if you want your formulas in spreadsheets to be calculated before PDFs
are created.
ghostscript fixups (Japanese Fonts)
Some issues have been identified when attempting to create .png file
from PDFs that require Japanese fonts. I've pushed a fix to the
/etc/ghostscript/8.70/cidfmap.local
file, however if you really are
using Japanese fonts, this fix may not work for you. YMMV.
Getting Started
-
If you want to merely create a new
doc-converter
server instance read the section titled Creating a Document Conversion Server. -
If you want to hack on the project read the section titled Hacking on the Project
Creating a Document Conversion Server
You should be able to create a doc-converter
instance by simply
installing the client package and creating the stack using the
included CloudFormation template. The EC2 client from which you
create the stack should have the AWS CLI tools.
- Setup the yum repository
- Install the client package
- Set your credentials
- Submit the CloudFormation template
or
- Setup the yum repository
- Install the client package
- Use the AWS console to configure the CloudFormation template
The template can be found here:
http://doc-converter.s3.amazonaws.com/libreoffice-doc-converter.json
$ sudo yum-config-manager --nogpgcheck --add-repo http://doc-converter.s3-website-us-east-1.amazonaws.com
$ sudo yum --nogpgcheck -y install doc-converter-client
Make sure your AWS credentials are available either in your
~/.aws/config
file or as environment variables.
$ libreoffice-create-stack -k mac-book -R mybucket -i t2.micro \
-t /usr/share/doc-converter/libreoffice-doc-converter.json
You should see some output put that looks something:
{
"StackId": "arn:aws:cloudformation:us-east-1:106518701080:stack/libreoffice-doc-converter/287f64e0-8727-11e5-8263-50d5cd23c4c6"
}
Are we there yet, are we there yet?
$ aws cloudformation describe-stacks --region us-east-1 --stack-name libreoffice-doc-converter --query 'Stacks[0].StackStatus'
Hacking on the Project
Hacking on the project means working with autoconf
, bash
and Perl
scripts. If that makes your heart pump read on. Make sure you have
installed:
automake
autoconf
rpm-build
To get started, download the project from GitHub, unzip it, and try to build it. You might have some success. If not, it's most likely some toolchain issues you'll need to investigate. It's not rocket science though, so hang in there.
$ wget https://github.com/rlauer6/doc-converter/archive/master.zip
$ unzip master.zip
$ cd doc-converter-master
Building an RPM
Building the project RPMs can be done using the build
script
provided. Make sure you've installed autoconf
, automake
, and the
rpm-build
packages and have a working RPM build directory.
$ mkdir -p ~/rpm/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
$ echo "%_topdir /home/ec2-user/rpm" > ~/.rpmmacros
Now give it a go to create the RPMs.
$ ./build
If you want to build the RPMs and install them to your own S3 bucket that you've turned into a static website (and hence a yum repo), try this recipe.
$ aws s3 mb s3://mybucket
$ aws s3 cp index.html s3://mybucket/ --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers
$ aws s3 website s3://mybucket/ --index-document index.html \
--error-document error.html
$ ./build mybucket
If your default region was the us-east-1, then you can visit:
http://mybucket.s3-website-us-east-1.amazonaws.com
...and then you might want to configure the yum repository:
$ sudo yum-config-manager --add-repo http://mybucket.s3-website-us-east-1.amazonaws.com
$ yum --disablerepp="*" --enablerepo="mybucket.s3-website-us-east-1.amazonaws.com" list available
This probably should not be the same bucket you use for document conversion.
The CloudFormation template to create your doc-converter
server will
try to install the doc-converter
RPM package from a repo. In fact
it's the one that you specified using the -u option of the
libreoffice-create-stack
script, so the above recipe might be useful
to you if you want to create a repo that can be accessed by the
CloudFormation template.
$ libreoffice-create-stack -k mac-book -R bucket-writer -i t2.micro \
-t /usr/share/doc-converter/libreoffice-doc-converter.json \
-u http://mybucket.s3-website-us-east-1.amazonaws.com
Alternate Server Build Procedure
On the other hand you can just throw your hands up and say, "oh
bloody hell", and see what the dang CloudFormation template is trying
to do in that wonky UserData
section and do it all by hand.
1 "UserData" : { "Fn::Base64" : { "Fn::Join" : ["", [
2 "#!/bin/bash -v\n",
3 "\n",
4 "# Helper function\n",
5 "function error_exit\n",
6 "{\n",
7 " /opt/aws/bin/cfn-signal -e 1 -r \"$1\" '", { "Ref" : "WaitHandle" }, "'\n",
8 " exit 1\n",
9 "}\n",
10 "yum update -y\n",
11 "# Install packages\n",
12 "/opt/aws/bin/cfn-init -s ", { "Ref" : "AWS::StackId" }, " -r DocConverter ", "-c Config ",
13 " --region ", { "Ref" : "AWS::Region" }, " || error_exit 'Failed to run cfn-init'\n",
14 "\n",
15 "# setup doc-converter repo and install package\n",
16 "yum-config-manager --nogpgcheck --add-repo ", { "Ref" : "RPMUrl" },"\n",
17 "yum install --nogpgcheck -y doc-converter\n",
18 "# Adbobe Japanese font fix\n",
19 "cat /usr/share/doc-converter/cidfmap.local >> /etc/ghostscript/8.70/cidfmap.local\n",
20 "# download & install LibreOffice 5\n",
21 "/usr/bin/libreoffice-download -r ",{ "Ref" : "LoRelease"}, " -v ", {"Ref" : "LoVersion"}, " || error_exit 'LibreOffice download error'\n",
22 "/sbin/service httpd restart\n",
23 "\n",
24 "/opt/aws/bin/cfn-signal -e 0 -r \"Setup complete\" '", { "Ref" : "WaitHandle" }, "'\n"
In that case, you essentially need to do the following things:
- Create an EC2 instance
- Install all of the Required RPM packages as described in the
doc-converter.spec
file. - Install the
doc-converter
package (lines 16,17) - Optionally apply patches to the ghostscript files to handle Japanese fonts (line 19)
- Install LibreOffice 5.1 (line 21)
- Restart Apache
So, yeah, sure, you can do all those things manually, but the point of all of this automation is so that you can create multiple instances of the document converter. You might want to create a load balanced document conversion service someday. ;-)
Using the Service
Assuming you've stuck with us so far, and you have an EC2 server running, you'll want to try it out.
$ export DOC_CONVERTER_HOST=10.0.1.108
$ export DOC_CONVERTER_BUCKET=mybucket
$ /usr/libexec/doc2pdf-client test.xlsx
$ /usr/libexec/doc2pdf-client -h 10.0.1.108 -b mybucket test.xlsx
$ /usr/libexec/doc2pdf-client -t 70x90 -h 10.0.1.108 -b mybucket foo.doc
See man doc2pdf-client
for more details.
Oh Yeah, GUIDs
The doc-converter
service URI requires a document id which is, in fact a GUID.
https://en.wikipedia.org/wiki/Globally_unique_identifier
Your documents should be stored in the S3 bucket with a prefix that
consists of the GUID (8+4+4+4+12). The included client,
doc2pdf-client
, uploads documents to the bucket and creates a new
document identifier for you.
Example GUID:
3F2504E0-4F89-41D3-9A0C-0305E82C3301
Finally...
I think you have enough clues to proceed. Don't forget to read the man pages.
-
$ man doc2pdf-client
-
$ man doc-converter
If all else fails, drop me a note.