The Apache PDFBox library is an open source Java tool for working with PDF documents. It allows viewing PDF documents, creating new ones, manipulating existing documents and extracting their content. Apache PDFBox also includes several command-line utilities.
The project in this repository offers several versions of the PDFBox source code that can be compiled directly in Eclipse without using Maven. The source code version used here is pdfbox-2.0.23. The complete version (PDFBox-Complete) is an unmodified PDFBox with all packages. The other versions, kept in separate repositories for convenience, are modified versions that offer more capabilities, generally for more specific uses.
This is a compact version of PDFBox that is ready for compilation and execution. It contains new packages, notably for producing formatted PDF files from text files. The ShowJustifiedFormattedBook example shows how to proceed; it generates the file:
../output/org.apache.pdfbox.breakintolines.DocumentManager.Output.pdf
where output is a directory containing the default input and output files.
This is a complete, unmodified version of PDFBox that is ready for compilation and execution. It contains all necessary packages, including some that are not normally part of the PDFBox source code (for example, org.bouncycastle). If you are not using encryption, you can delete bouncycastle or simply not copy it to your project. In that case you must also delete the examples that use it, otherwise the source code will not compile.
The advantage of these repositories is that no build tool is required (no need for Maven), so one does not even need to be a programmer to compile and run the examples. The contents of these repositories can easily be compiled using Eclipse for Java, for example.
On startup, Eclipse always asks for the path of the Workspace. A Workspace is just a directory where projects are stored. It is highly recommended to enter a complete path, starting with the disk on which the projects are to be stored. Although moving a project from one place to another is not very difficult, it is easy to get lost with several workspaces on different disks or under different paths. Keeping all workspaces under the same directory is a good idea: it is easy to remember where they are, and backups are simpler. It is also recommended to maintain several workspaces instead of a single one holding every project. If possible, use one workspace per project, especially for a big project. Other projects can share the workspace of the main project if they are fairly small and tightly related to it.
After the Workspace directory is supplied and Eclipse opens, a Welcome tab shows up inside the Eclipse frame. This can be overwhelming for beginners: instead of explaining how to use Eclipse, and in particular how to dismiss the tab, it proposes a series of options. It is easier to ignore this page and dismiss it by clicking twice on the "Welcome" tab, as indicated in Fig. 1.
Figure 1 - Dismissing the Welcome tab
The standard way to create a new project in Eclipse is to click "File > New > Java Project" (or alternatively to press Alt-Shift-N). Just ignore the suggestions shown on the Welcome tab and inside the Package Explorer, and proceed as shown in Fig. 2.
Figure 2 - Creating a new project by clicking "File > New > Java Project"
In the newly opened window, one should type the name of the project and click Next, as shown in Fig. 3.
Figure 3 - Naming the project and clicking Next
One should then uncheck the box "Create module-info.java file" and click Finish, as shown in Fig. 4. Later on it will be necessary to see the hierarchy of the project. To expand it, click the ">" to the left of the project name in the Package Explorer, as indicated in step ③ of Fig. 4.
Figure 4 - ① Uncheck the box, ② Click Finish, ③ Expand hierarchy by clicking on >
Finally, one should be able to see the hierarchy of the packages added to the project. This presentation is selected as indicated in Fig. 5.
Figure 5 - Setting for showing package hierarchy by clicking "⋮ > Package Presentation > Hierarchical"
To open the directory where a project, a package or a file is located, there is a simple way in Eclipse. Select the file, project or package in question, then right-click and choose "Properties" (at the bottom of the menu). A window opens; clicking the icon on the right, as indicated in Fig. 6, opens a file explorer window (Windows Explorer when working on Windows) at the directory where the item is located.
Figure 6 - Properties window of directory src. On the right, where to click to open a file explorer window
In Fig. 6, src was clicked, which is the root of the source files in Eclipse. Once the file explorer opens, one can dismiss the Properties window.
After creating a new Java project in Eclipse (as shown in Fig. 2, 3, 4 and 5) and copying all the files under the directory src of the source code to the directory src of your project, just click on the project name or on src, and then on "File > Refresh". Eclipse will then start to compile the code. Once the code is compiled, you can run the examples. Whenever a project is refreshed in Eclipse, every new java file found is compiled, and the compiled file is saved in the bin directory under the same package hierarchy as the source code. A file that is not a java file is simply copied from src to bin. This is basically what happens when using Eclipse: any new Java file is compiled automatically on the fly. This simplicity, together with all the help offered while typing new code, is what makes the beauty of the Eclipse IDE. Classes, interfaces, enums and even new packages can be created automatically by right-clicking the package in which you want to create them and choosing New.
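The src-to-bin correspondence described above can be pictured with a small sketch in plain Java. It is not Eclipse code: the class and method names are hypothetical, and it covers only the simple case of one top-level class per source file (nested classes produce additional .class files).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class BinMapping {

    // Returns the location under bin that corresponds to a file under src.
    // Java sources map to a .class file; any other file keeps its name.
    public static Path outputFor(Path src, Path bin, Path file) {
        Path relative = src.relativize(file);          // e.g. org/Foo.java
        String name = relative.getFileName().toString();
        Path target = bin.resolve(relative);           // same package hierarchy
        if (name.endsWith(".java")) {
            String className = name.substring(0, name.length() - ".java".length()) + ".class";
            return target.resolveSibling(className);
        }
        return target;
    }

    public static void main(String[] args) {
        Path src = Paths.get("src");
        Path bin = Paths.get("bin");
        // Hypothetical file names, used only for illustration.
        System.out.println(outputFor(src, bin, Paths.get("src", "org", "example", "Foo.java")));
        System.out.println(outputFor(src, bin, Paths.get("src", "org", "example", "data.txt")));
    }
}
```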
PDFBox has a PDF file viewer at org.apache.pdfbox.debugger.PDFDebugger.java. It is recommended to use this viewer when starting to work with PDF files. With it one can inspect the internal structure of a PDF file and view its contents in several formats, besides just rendering it. To run this program from Eclipse, click on PDFDebugger.java and then on the play icon, as indicated below:
Figure 7 - Running the PDF Viewer: 1) Select file PDFDebugger.java and 2) Click on run
Once the program opens, one needs to supply a PDF file through the menu "File > Open...", which opens a standard file-browsing window. Once the file is loaded, it appears like this:
Figure 8 - PDF Viewer showing the rendering of the first page of the file
The main feature of this viewer, however, is its ability to show the real content of the PDF file. This is done by clicking the "+" to the left of the desired Page and then clicking Contents.
Figure 9 - PDF Viewer showing the contents of the first page of a file
This allows debugging a generated file to check that it corresponds to what was intended. The contents can be shown in the following formats: "Nice view" (the most convenient, although it may take a few seconds to process; here non-ASCII characters in strings are represented in octal), "Raw view" (the direct binary format after decompression, in which only ASCII characters are represented), and "Hex view" (the binary content shown in hexadecimal notation).
The examples are found in the package org.apache.pdfbox.examples.
When generating a PDF file containing text from scratch, the recommended example is ShowTextWithPositioning.java.
When using fonts, this example shows how to embed them in the PDF file. One can use either a font with an encoding vector (as seen with PDTrueTypeFont.load) or the GID directly with Type 0 fonts (as seen with PDType0Font.load). Type 0 fonts are more convenient because one does not have to think about an encoding vector, and they can deal with Unicode directly, provided the character is present in the font. The drawback is that each character in a string is stored in two bytes instead of just one in the PDF file.
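To make the two-byte cost concrete, here is a minimal sketch in plain Java (it does not use PDFBox) that writes a sequence of glyph IDs as two bytes each, high byte first, and prints them as a PDF literal string with octal escapes. The class name and the GID values are hypothetical; every font defines its own glyph numbering.

```java
import java.io.ByteArrayOutputStream;

public class TwoByteGlyphString {

    // Encodes each glyph ID as two bytes, high byte first.
    public static byte[] encode(int[] gids) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int gid : gids) {
            out.write((gid >> 8) & 0xFF);
            out.write(gid & 0xFF);
        }
        return out.toByteArray();
    }

    // Renders the bytes as a PDF literal string. For simplicity every byte
    // is written as a three-digit octal escape, even printable ones.
    public static String toOctalLiteral(byte[] bytes) {
        StringBuilder sb = new StringBuilder("(");
        for (byte b : bytes) {
            sb.append(String.format("\\%03o", b & 0xFF));
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        int[] gids = {43, 76, 3}; // made-up GIDs, for illustration only
        byte[] encoded = encode(gids);
        // Three glyphs take six bytes in the string.
        System.out.println(encoded.length + " bytes: " + toOctalLiteral(encoded));
    }
}
```

With a single-byte font and an encoding vector, the same three characters would occupy three bytes.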
This is illustrated by examining the PDF file generated by the example ShowTextWithPositioning.java. Its rendering, as seen in the PDF Viewer, is shown in Fig. 10.
Figure 10 - Rendering of file justify-example.pdf created by ShowTextWithPositioning.java
When examining the contents of the file in the PDF Viewer, as in Fig. 11, one can really grasp the differences.
Figure 11 - Contents of file justify-example.pdf created by ShowTextWithPositioning.java
In this example, texts are positioned using a matrix (Tm commands, as seen in Fig. 11).
However, this is quite cumbersome. If one needs to separate the next text by a custom space, it is better to use the Td command, as shown in Fig. 9. The x component of the Td command is just the distance between the beginning of the previous text and the beginning of the text that follows the Td; that is, it translates by a distance x from the beginning of the previous text. The y component of the Td command is simply zero when translating within the same line. In PDFBox this command is generated by calling the function newLineAtOffset of the class PDPageContentStream.
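The relative nature of Td can be sketched by building the operators by hand as plain strings. This is an illustration in plain Java, not PDFBox output: the class name, the font resource name /F1, the font size and the positions are assumptions, and the literal strings would be glyph IDs rather than readable text when a Type 0 font is used.

```java
import java.util.Locale;

public class TdSketch {

    // Builds a text object placing each word at an absolute x position on
    // one baseline. Each Td operand is relative to the start of the
    // previous text, so only differences are written out.
    // Words must not contain parentheses or backslashes (no escaping here).
    public static String build(String[] words, double[] startX, double baselineY) {
        StringBuilder sb = new StringBuilder("BT\n");
        sb.append("/F1 12 Tf\n"); // hypothetical font resource and size
        double previousX = 0;
        for (int i = 0; i < words.length; i++) {
            double dx = startX[i] - previousX;
            double dy = (i == 0) ? baselineY : 0; // zero while staying on the same line
            sb.append(String.format(Locale.ROOT, "%.2f %.2f Td (%s) Tj\n", dx, dy, words[i]));
            previousX = startX[i];
        }
        return sb.append("ET\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(build(new String[] {"Hello", "world"},
                               new double[] {72, 130}, 700));
    }
}
```

The second word starts at x = 130, yet the operand written is 58, the distance from the start of the first word. In PDFBox each of these Td lines would correspond to one call to newLineAtOffset with the same two values.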
As seen in Fig. 11, the second Tw command has no effect because of the use of Type 0 fonts. This example is very useful, especially to show what one should not do when using Type 0 fonts. Using a TJ command (the array version of the Tj command) seems to be the best idea for justifying text with Type 0 fonts. However, some details are not that good: the white space is represented in the string (it has GID \000\003, or simply 3); it occupies two bytes plus two parentheses and an extra space; the widths separating the words are in character coordinate space (thus having many more digits); the value is always the same (-3696.5562); and the negative sign not only occupies an extra byte but is also counterintuitive. In total, separating two words using TJ as shown in this example takes 16 characters. The method used in the file of Fig. 8 and 9 takes 15 characters to separate two words, including the extra Tj command. This is a bit more compact because spaces are not represented and displacements are smaller. In any case, the result is much simpler and more readable. One may still doubt the usefulness of Type 0 fonts at all, because strings in these fonts take twice as many bytes, but, again, there are more tricks that can be used.
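The many digits and the negative sign both follow from how numbers inside a TJ array are interpreted: they are expressed in thousandths of a unit of text space, scaled by the font size, and subtracted from the current position. The sketch below converts in both directions. The class name is hypothetical, a horizontal scaling of 100% is assumed, and the font size of 12 is only a guess, since the size used in the example is not stated here.

```java
public class TjAdjustment {

    // Converts a rightward gap (in text space units) into the number
    // that must appear in a TJ array to produce it.
    public static double toTjNumber(double gap, double fontSize) {
        return -gap * 1000.0 / fontSize;
    }

    // Converts a number from a TJ array back into the gap it produces.
    public static double toGap(double tjNumber, double fontSize) {
        return -tjNumber * fontSize / 1000.0;
    }

    public static void main(String[] args) {
        double fontSize = 12; // assumed, not taken from the example file
        System.out.println("gap for -3696.5562: " + toGap(-3696.5562, fontSize));
        System.out.println("TJ number for a gap of one font size: " + toTjNumber(fontSize, fontSize));
    }
}
```

Under that assumption the value -3696.5562 corresponds to a gap of roughly 44.36 units, which a Td command would express directly and with a positive sign.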
In English text the first byte of each character is very often a null byte, and this high redundancy can be removed by compression (simply changing the false value of this line to true, which enables compression of streams). Therefore, with compression, the use of Type 0 fonts is almost unnoticeable in the size of the file. However, one loses a convenient feature demonstrated in ShowTextWithPositioning.java, namely word spacing through the Tw commands.
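The effect of the null bytes on compressed size can be tried out with the JDK's Deflater, which implements the same Deflate algorithm that PDF stream compression relies on. This is a rough sketch, not a measurement of a real PDF: the class name is hypothetical, and plain character codes stand in for GIDs, each one simply prefixed with a zero byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class NullByteCompression {

    public static final String SAMPLE =
        "The Apache PDFBox library is an open source Java tool for working with "
        + "PDF documents. It allows creation of new PDF documents, manipulation of "
        + "existing documents and the ability to extract content from documents.";

    // Prefixes every byte with a zero byte, imitating two-byte codes
    // whose high byte is always null.
    public static byte[] widen(byte[] oneByte) {
        byte[] out = new byte[oneByte.length * 2];
        for (int i = 0; i < oneByte.length; i++) {
            out[2 * i + 1] = oneByte[i];
        }
        return out;
    }

    // Returns the number of bytes produced by Deflate for the given data.
    public static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] one = SAMPLE.getBytes(StandardCharsets.US_ASCII);
        byte[] two = widen(one);
        System.out.println("one byte per character: " + one.length
            + " raw, " + deflatedSize(one) + " deflated");
        System.out.println("two bytes per character: " + two.length
            + " raw, " + deflatedSize(two) + " deflated");
    }
}
```

Running it shows how much of the doubling survives compression for this sample; the exact figures depend on the text and on the compression level.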
GID is the glyph identification number. If you are not sure what GID means, you should download Glyph Inspector and opentype.js, placing glyph-inspector.html in some directory and opentype.js in the subdirectory dist of that directory. In other words, if you copy glyph-inspector.html to the directory test, your opentype.js should be at test/dist. Other files to put inside test/dist for Glyph Inspector to work properly: opentype.js.map, opentype.min.js, opentype.min.js.map, opentype.module.js and opentype.module.js.map.
When running the program in a browser, you will see the following screen:
The GID is the number from 0 to 99 shown in the grid of glyphs of the font, but, as one can easily notice, the GID can reach much higher values, such as 1293 for this particular font. Glyph IDs are 16-bit values, so they can potentially go up to 65534 (an OpenType font holds at most 65535 glyphs).