This is a talk about the structure of PDF files and how they can be created.
The slide deck for the talk can be created with:
cargo run create deck deck.pdf
This repository contains a number of commands for generating different PDF files. The following command will show you what PDFs can be generated.
```bash
cargo run create
Some tools used in the creation of this talk.
Useful for exploring the structure of a PDF on the command line
qpdf --json out.pdf | bat -l json
Show a pretty printed version
qpdf --qdf --object-streams=disable out.pdf -
https://github.com/yeslogic/allsorts-tools
Useful for exploring and subsetting fonts.
Print the character map
allsorts cmap --font input-font.ttf
Subset a font with characters from a block of text
WARNING: this seems to optimise away some information, use pyftsubset instead
allsorts subset \
--text "text to use for subsetting" \
input-font.ttf output-font.ttf
Useful for extracting fonts and images from PDF files.
List fonts and images
mutool info -FI input.pdf
Show internal PDF objects (can be browsed more easily with qpdf).
mutool show input.pdf <object id>
Extract all fonts and images into current directory.
mutool extract input.pdf
pip install fonttools
Subset a font with pyftsubset
pyftsubset input-font.ttf \
--gids=<glyph-ids> \
--layout-features='*' \
--glyph-names \
--symbol-cmap \
--notdef-glyph \
--notdef-outline \
--name-IDs='*' \
--recommended-glyphs \
--name-legacy \
--name-languages='*'
Name | Files |
---|---|
Minimal | PDF | qPDF |
Maximal streams and type0 font | PDF | qPDF |
Maximal no streams and ttf font | PDF | qPDF |
Maximal no streams, ttf font, no subsetting | PDF | qPDF |
Web page | PDF | qPDF |
Slide deck |
- Portable: independent of application software, hardware and operating system.
- Document: complete description of fixed-layout flat document.
- File: everything needed to present the document can be stored within a single file.
https://www.adobe.com/uk/acrobat/resources/pdf-timeline.html
- 1982 Postscript
- 1993 PDF 1.0 Text, images, pages, links, bookmarks.
- 1994 PDF 1.1 Passwords, encryption, device-independent colour.
- 1996 PDF 1.2 Forms, interactive elements, audio, compression, unicode, more advanced colour features.
- 2000 PDF 1.3 Digital signatures, ICC colour spaces, JavaScript, attachments, new annotation types.
- 2001 PDF 1.4 Accessibility (tagged PDF), format and scripting improvements.
- 2003 PDF 1.5 Multimedia objec streams (video, 3d models etc), improved encryption and format handling.
- 2004 PDF 1.6 3d artwork, OpenType font embedding, XFA (XML Forms Architecture), improved colour spaces and scripting.
- 2006 PDF 1.7 Stabilising. Improved 3d models, text, attachments, encryption and scripting.
- 2008 PDF 1.7E1
- 2008 PDF 1.7E3
- 2009 PDF 1.7E5 XFA 3.0
- 2009 PDF 1.7E6 XFA 3.1
- 2011 PDF 1.7E8 XFA 3.3
- 2008 PDF 1.7 (ISO 32000-1:2008)
- 2017 PDF 2.0 (ISO 32000-2:2020)
- Elimination of all proprietary elements
- Improvements across encryption, signatures, accessibility, rich media.
- Removes XFA.
- One minimal PDF with one page, containing some text and minimal metadata.
- One maximal PDF with multiple pages, text with different fonts, vector graphics and images.
- A web page rendered as a PDF.
- qpdf: useful for exploring PDF files on the command line. qpdf can render a PDF as a qpdf file which is a valid PDF file with additional whitespace and comments. It can also render the PDF tree as a JSON object.
- mutool: useful for extracting fonts and images from a PDF file. This is surprisingly difficult without a tool. It can be useful for checking for duplication or redundancy in the fonts. mutool also has commands to show the internal structure of a PDF file and modify it.
- allsorts: useful for exploring and subsetting fonts. It can be used to extract the character map of a font and subset it.
(See: Spec 7.5 File Structure)
- Header
%PDF-1.7
- Body
- List of indirect objects (Spec 7.3.10) in the PDF.
- Each object has an ID and may reference other objects by their ID.
- Object stream (stream) stream of non-stream objects, useful for compression
- Important objects
Catalog
(dictionary) the root of the object hierarchy.Info
(dictionary) metadata about the file such as author, title, subject etc.
- Cross-reference table
- Keeps track of the offsets of objects within the file, enabling random access to objects.
- Trailer
- Contains pointers to important objects
Size
total number of entries in the cross-reference table.Prev
points to the (previous?) cross-reference section when dealing with updated PDF files.Root
points to the catalog dictionary.Encrypt
points to the encryption dictionary. Only present if the file is encrypted.Info
points to the information dictionary.ID
two unique identifiers for the file. Mostly relevant for encryption.
- StartXRef
- Offset of the cross-reference table.
- EOF marker
%%EOF
- Catalog (dictionary) (Spec 7.7.2) the root of the object hierarchy, points to:
- Pages (dictionary) (Spec 7.7.3.2) a node in the page tree. Often just the root page node.
- Resources (dictionary) (Spec 7.8.3) a dictionary of resources that may be used in a content stream. For example
<</Font<</F1 2 0 R>>>>
maps the nameF1
to an indirect object reference2 0
- Page (dictionary) (Spec 7.7.3.3) a leaf in the page tree.
- Content Stream (stream or array) (Spec 7.8.2)
- Resources (dictionary) (Spec 7.8.3) a dictionary of resources that may be used in a content stream. For example
- Pages (dictionary) (Spec 7.7.3.2) a node in the page tree. Often just the root page node.
- See Table 50 in Spec 8.2 Graphics Objects for a list of operators that can be used in a content stream.
- See Spec 7.3.4 String objects for details of listeral strings
(...)
, hex strings<...>
, and name objects/...
.
Text extraction: Extracting text from a PDF can be difficult because the text is not stored in a simple way. Text is often shifted slightly which can be interpreted as space or vice versa.
Font subsetting: When merging PDF files the fonts can be duplicated or incorrectly subsetted which can result in bigger files. We see this with the PDF generator.