RFC - Roadmap for version 1
danfickle opened this issue · 20 comments
These are my thoughts on the issues that need to be addressed before version 1 is released, in no particular order:
- #161 - MathML support - COMPLETE
- NO-ISSUE-YET, COMPLETE - Entity support such as nbsp for HTML, InvisibleTimes for MathML and SVG entities. This is tricky as there is no way to programmatically inject entities using the Java XML parser.
I propose that we add a doctype dynamically to the start of the XML input, with the desired entities. However, this means we have to read XML input into a string, rather than just passing a file or input stream to the XML reader. The builder can be used to specify which entities to load.
Used custom doc types and an external entity resolver instead. DOCUMENT.
- #38 - Transforms. A few issues remaining to implement:
- Link placing doesn't take account of transforms.
- Translate is not implemented.
- Some work for transformed boxes in page margins.
- Testing. Do transforms of MathML, SVG and custom objects work?
- NO-ISSUE-YET - Logging / error handling overhaul - Currently error handling is ad hoc. For example, should we continue on a load failure or fatally throw? I propose to make this configurable by allowing the user to hook logging on a per-run basis and halt on any log message (which will be changed to enum constants) with a poison exception.
- #60 - CSS3 Columns - Currently implemented for text only. Need to debug to allow other box types in columns.
- #126 - Overflowing pages - Currently content that goes past the right margin is cut off silently. This is mostly a problem with tables. I propose a CSS property that allows cut off content to be printed on the next page. DOCUMENT.
- #204 - Multi run cache - Currently there is a multi-run cache hook method, but the objects stored may not be thread safe. This means it is unsuitable for many use-cases. Propose to remove all caches except font metrics cache.
- NO-ISSUE-YET - Per run cache - Need to make sure nothing is being placed into a PDF document more than once. For example, is an img from the img tag and a background image from the same url embedded twice?
- #83 - Unicode font justification fix - There is a fix in #143 but we are waiting for PDFBox 2.0.9 to implement it.
- #123 - RTL table layout - Altering table layout to correct RTL scares me, but there have been a couple of requests so we should try.
- NO-ISSUE-YET - Remove remnants of configuration class and move all config to builders. There are still some config values that are coming from various file locations.
- #145 - Padding with percentages not working - It appears that it is resolving padding percentage values with a zero base value.
- NO-ISSUE-YET - Make sure all dependencies are up to date. Do this after the test system is introduced.
- #208 - Semi-automatic testing. Propose some sort of semi-automatic testing with image diff. This would allow you to run it before and after changes to make sure nothing has been broken. Unfortunately, we can't have a single source of truth for reference results as, reportedly, font handling etc. can change slightly between JREs.
- NO-ISSUE-YET - Java2D cleanup. Make sure all Java2D functionality is in the Java2D module and delete broken code samples and tools. Also make sure Java2D RTL works.
- NO-ISSUE-YET - Documentation. Review and complete the template author's guide, integration guide, create comparison with other solutions such as Flying Saucer, headless-browsers, etc.
- #180 - Performance and memory improvements - IN PROGRESS.
- #143 - Other improvements from this pull-request.
- NO-ISSUE-YET - Floating elements escape elements with overflow:hidden set.
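The logging proposal in the list above (a per-run hook plus a poison exception) could look roughly like the following. This is only a sketch to make the idea concrete: `LogMessage`, `RunLogger`, `PoisonException` and the message constants are hypothetical names, not part of the library.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names exist in the library. */
final class RunLogSketch {
    /** Log messages as enum constants, as the roadmap item suggests. */
    enum LogMessage { LOAD_FAILURE, CSS_PARSE_WARNING, GENERAL_INFO }

    /** Unchecked "poison" exception that aborts the current run. */
    static final class PoisonException extends RuntimeException {
        final LogMessage trigger;
        PoisonException(LogMessage trigger) {
            super("Run halted on log message " + trigger);
            this.trigger = trigger;
        }
    }

    /** Per-run logger: forwards every message to the user's hook, then halts on selected ones. */
    static final class RunLogger {
        private final Consumer<LogMessage> hook;
        private final Set<LogMessage> haltOn;

        RunLogger(Consumer<LogMessage> hook, Set<LogMessage> haltOn) {
            this.hook = hook;
            this.haltOn = haltOn;
        }

        void log(LogMessage message) {
            hook.accept(message);
            if (haltOn.contains(message)) {
                throw new PoisonException(message);
            }
        }
    }

    /** Simulates a run that logs three messages; returns true if the run was halted. */
    static boolean runHalts(Set<LogMessage> haltOn, List<LogMessage> seen) {
        RunLogger logger = new RunLogger(seen::add, haltOn);
        try {
            logger.log(LogMessage.CSS_PARSE_WARNING);
            logger.log(LogMessage.LOAD_FAILURE);
            logger.log(LogMessage.GENERAL_INFO);
            return false;
        } catch (PoisonException e) {
            return true;
        }
    }
}
```

With `haltOn` containing only `LOAD_FAILURE`, the simulated run stops at the second message. With an empty set it continues to the end, which corresponds to the "continue on load failure" behaviour.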
Hopefully, most of the other open issues can wait for subsequent releases. NOTE: There will be several more release candidate versions before version 1.
I'd appreciate feedback from anybody, especially @rototor. Any other issues that need to be addressed before version 1?
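For the image-diff testing idea in #208, the comparison step might be sketched as below. It assumes the PDF pages have already been rendered to `BufferedImage` instances by some other means. The class name, method names and the per-channel tolerance are illustrative assumptions, the tolerance being one possible way to absorb small rendering differences between JREs.

```java
import java.awt.image.BufferedImage;

/** Hypothetical sketch of a tolerant pixel comparison; not the project's test code. */
final class ImageDiffSketch {
    /**
     * Counts pixels where any ARGB channel differs by more than the tolerance.
     * Returns -1 if the two images have different dimensions.
     */
    static long countDifferentPixels(BufferedImage a, BufferedImage b, int tolerance) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return -1;
        }
        long different = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (maxChannelDiff(a.getRGB(x, y), b.getRGB(x, y)) > tolerance) {
                    different++;
                }
            }
        }
        return different;
    }

    /** Largest absolute difference over the four 8-bit channels of two ARGB pixels. */
    private static int maxChannelDiff(int p, int q) {
        int max = 0;
        for (int shift = 0; shift <= 24; shift += 8) {
            int d = Math.abs(((p >>> shift) & 0xFF) - ((q >>> shift) & 0xFF));
            max = Math.max(max, d);
        }
        return max;
    }
}
```

A before/after run could then flag a page whenever the count is non-zero (or above some threshold), leaving the decision about whether the change is a regression to a human reviewer.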
The very first thing I would suggest: add instructions for non-Maven users to the Integration Guide. I myself use Ant (from NetBeans). This includes making available a full list of dependencies (PDFBox, and whatever miscellaneous .jar files are currently required, for example graphics2d, which took me by surprise during my initial tests).
Obviously we need to get v1 out first properly so we can have downloadable releases and the like.
Also I second the logging overhaul.
It seems that there are some issues with transparent embedded SVG: the transparent background is displayed in black. The same issue occurred when I used Batik to convert SVG to BMP myself. However, there is no problem when converting to PNG. Batik does not provide a converter for transforming SVG to BMP, so I wrote a custom one based on the one for transforming SVG to PNG. In the end, I decided to link the external PNG as a workaround.
@vipcxj Can you share the SVG which does not work for you with me? I would like to fix this bug (which is in https://github.com/rototor/pdfbox-graphics2d, as there the whole Graphics2d->PDF mapping is happening).
I will give you the SVG when I go to work tomorrow. It's a watermark.
Update: this is the content of the SVG.
<?xml version="1.0" encoding="UTF-8" ?>
<svg width="512" height="512" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<style type="text/css">text { fill: gray; font-family: Avenir, Arial, Helvetica, sans-serif; }</style>
<defs>
<pattern id="twitterhandle" patternUnits="userSpaceOnUse" width="400" height="200">
<text y="30" font-size="40" id="name">TEST WATERMARK</text>
</pattern>
<pattern xlink:href="#twitterhandle">
<text y="120" x="200" font-size="30" id="occupation">test watermark</text>
</pattern>
<pattern id="combo" xlink:href="#twitterhandle" patternTransform="rotate(-45)">
<use xlink:href="#name" />
<use xlink:href="#occupation" />
</pattern>
</defs>
<rect width="100%" height="100%" fill="url(#combo)" />
</svg>
@vipcxj I've just released pdfbox-graphics2d version 0.11 which fixes this problem. PdfBoxGraphics2D did not handle the PatternPaint of Batik SVG. You can manually depend on this version or wait till it is integrated here.
I would be delighted to see some improvement to my issue #119 ... the proposed workaround using floating containers works nine times out of ten, but not as perfect as everything else in this amazing project (at least for me).
Regards
Bigdatha
@achuinard uh... I do. And it's still quite popular here in Latin America.
Not everybody likes Maven or Eclipse, and there is nothing wrong with that.
Just wondering: has anyone done performance benchmarks? As there are quite a lot of us looking at this project as a long-term replacement for good ol' FS+iText, matching the performance of that should be a goal.
I've only done some quick testing with simple reports (basically tables, nothing fancy), and I've found openhtmltopdf to be as much as 50% slower than FS+iText, and I have no clue where the bottlenecks could be (here? in PDFBox?).
@dilworks This is likely caused by PDFBox or its dependency FontBox. Are you using many custom fonts? FontBox is a little bit slow when parsing fonts...
Well, my reports are very simple - I'm using the PDF defaults (Times, Helvetica), not even external ones!
Thanks @dilworks
You inspired me to create a large document and run VisualVM while it was processing. It immediately highlighted a silly bug in the BIDI splitter which is now fixed (above). This was taking well over half the run-time. The next culprit to look at is createInlineBox. Any ideas on why that is so slow?
Embarrassingly, the BIDI splitter should not even run when not configured, which I'll fix in a future commit.
Is there any reason why this can't run on the modern Google App Engine Standard env Java8? It removes a ton of restrictions from the older java7 environment (no more whitelist of jars, most APIs should work).
I'd be happy to test if nobody has.
@danfickle Good starting point. I did a few test runs with an 18-page test document (I will try to clean it of any proprietary/private info to provide a public test sample of the reports I generate) with nothing but default fonts and rather simple tables. I did 25 runs for each converter, measuring times (although not resource usage; we rarely generate hundreds-of-pages reports, so this represents one of my most frequent use cases).
Here are the test results:
benchmark_pdfgen.xlsx
So far, I've found the performance gap between FS and OH to be around 30%.
(LOL at GitHub that doesn't support OpenDocument documents!)
Now I'll try with the really heavy hundreds-of-pages CPU-draining workloads :)
Testcase:
testcase_fs_oh.tar.gz
Forgot to mention my setup: this is my dev laptop (a quite ancient Core 2 Duo P8600 with 6GB DDR2 RAM and 500GB of good ol' spinning rust storage) running both generators inside a J2EE container (WildFly 11.0).
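For anyone wanting to repeat this kind of comparison, a minimal timing harness might look like the sketch below. It is not the code behind the spreadsheet above: the class and method names are made up for illustration, and using the median is an assumption chosen to reduce the influence of JIT warm-up outliers.

```java
import java.util.Arrays;

/** Hypothetical timing harness; the converters themselves are passed in as Runnables. */
final class TimingSketch {
    /** Runs the task the given number of times and returns each run's elapsed time in milliseconds. */
    static double[] timeRuns(Runnable task, int runs) {
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return millis;
    }

    /** Median of the samples; less sensitive to a slow first run than the mean. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Relative slowdown of the candidate against the baseline, e.g. 0.30 means 30% slower. */
    static double relativeGap(double baselineMs, double candidateMs) {
        return (candidateMs - baselineMs) / baselineMs;
    }
}
```

Usage would be along the lines of `median(timeRuns(() -> convertWithFs(html), 25))` versus the same call for the other converter, where `convertWithFs` stands in for whatever code drives each library.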
All right then!
And once again, thanks for improving the library!
More docs please. More examples.
Great project though, works nicely for me. I would just like to know all the things I can do and, more importantly, can't do.
Implementing flexbox layout (#69) will be a huge improvement.