mbutterick/typesetting

PDF accessibility

violetcereza opened this issue · 12 comments

I know this project is in its early stages and I don't want to seem ungrateful for the open source contributions you've brought to the typography community, but I wanted to make sure this is on your radar for this project.

PDFs have a specification that allows them to annotate semantic structure ('tags') on top of whatever visual stuff they contain. This runs parallel to PDF bookmarks, which are helpful for jumping around the table of contents, but tags have even lower-level structure. (I don't know much about the technical stuff so correct me if I'm wrong.)

PDF tags make PDFs readable for screen readers, and are required within certain institutions that have rules around accessibility and equitable access. (For example, in the US, federal agencies must be "508 compliant.")
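To make "tags on top of the visual stuff" a bit more concrete, here is a minimal sketch of how a tagged paragraph might appear inside a PDF page's content stream. The `BDC`/`EMC` marked-content operators and the `MCID` property come from the PDF specification; the helper names (`escape_pdf_string`, `tagged_text`), the font resource name `F1`, and the coordinates are assumptions made up for illustration, and this is not how Quad or any particular tool generates its output.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def tagged_text(tag: str, mcid: int, text: str, x: int = 72, y: int = 720) -> str:
    """Wrap ordinary text-drawing operators in a marked-content section.

    BDC/EMC bracket the drawing operators. The MCID is the number a
    structure element elsewhere in the file uses to point at this content.
    """
    return "\n".join([
        f"/{tag} <</MCID {mcid}>> BDC",      # begin marked content, tagged as `tag`
        "BT",                                # begin text object
        "/F1 12 Tf",                         # font resource F1 (assumed to exist), 12 pt
        f"{x} {y} Td",                       # move to the text position
        f"({escape_pdf_string(text)}) Tj",   # show the string
        "ET",                                # end text object
        "EMC",                               # end marked content
    ])


print(tagged_text("P", 0, "Hello (tagged) world"))
```

The drawing operators in the middle are exactly what an untagged PDF would contain; the tag is only the wrapper around them, which is why the structure can be layered on without changing how the page looks.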

Word (and probably some other typesetting tools) generates tagged PDFs, but LaTeX has never really been able to do this. I think it would be an amazing contribution to the state of accessibility in academia if there were a typesetting tool that did generate tagged PDFs.

I am interested in helping with this, but I'm not sure if this project is in a state that's ready for collaboration yet. Also, I have experience in high-level programming languages but I would probably have to do some reading on PDF compilers.

I am open to looking at this, with the caveat that replacing LaTeX is not a design goal of Quad, so even in the best case, my net “contribution to the state of accessibility in academia” is likely zero.

Questions from someone who now knows slightly less than nothing about PDF tags from reading about them in the PDF Reference.

PDF tags seem to be a way for Adobe to shoehorn an affordance for XML- or HTML-style markup into the PDF format. It seems like an awkward fit, because the foundational idea of PDF is more of a drawing model (since PDF is derived from PostScript), not a structural model. BTW this is one of many reasons I have misgivings about the PDF format — it’s an excellent paper simulator, but continues to grow like the Winchester Mystery House into jobs it’s not suited for.

Moreover it seems like PDF tags depend on three ingredients:

  1. the insertion of tags in the source document (by an author manually, or by authoring software automatically).
  2. the parsing & interpretation of tags by PDF reader software.
  3. agreement between the authoring software and the reader software about the names of the tags and what they mean (that is, the same trouble that has existed between HTML pages and web browsers for 25 years).

If that much is accurate (I invite correction if not), then how has this worked in practice? For instance, what reader applications support PDF tags? You mention screen readers — are PDF tags core to what they do, or incidental? Are there certain tagging conventions that are observed? And then on the other side — what is the typical workflow for authors? Are PDF tags used for all documents, or mostly those that are being converted from HTML or XML? What tags are supported? (I don’t expect you to know all these answers, but if you have a link to other resources I could study, that would be helpful.)

What I’ve learned the hard way is that following a specification in the PDF Reference is pretty useless. In practice there are idiomatic expectations of the various programs that handle PDFs. So if you don’t implement a feature consistently with those expectations, it’s a waste.

I’m going to reframe this as a project.

Got it. I'm still meaning to gather more conclusive answers to your questions, but I apologize for the delay! Should I continue the discussion here if I have more information?

Sure, there’s no rush of course. You can reopen the issue when there’s actionable information available.

I'm sure you already know some of this, but I'm collecting resources here from my deep dive into PDF formatting. Once I feel like I have a base-level understanding of the format, I'll do more research on what tools you are using to output PDFs (Pango?) and the facilities available for including tag structure.

How to see the internal tagging structure in a PDF

  1. Open the print production tool in Adobe Acrobat Pro and click "Preflight"

  2. Click Options -> Browse Internal PDF Structure...

  3. Make sure you have selected the light bulb tab in the upper left corner, and browse StructTreeRoot. I have highlighted an "Artifact" tag in my example tagged PDF.

[Screenshot: StructTreeRoot in the internal structure browser, with an "Artifact" tag highlighted]

Here's an example of an Object under StructTreeRoot with alt text (for an image):
[Screenshot: an object under StructTreeRoot with alt text]
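For readers without Acrobat Pro, here is a rough sketch of what such an object looks like when written out as a PDF dictionary. The keys (`/Type`, `/S`, `/Alt`, `/P`, `/Pg`, `/K`) are the structure-element entries defined in the PDF specification; the function names and the object references `5 0 R` and `3 0 R` are hypothetical values picked for the example, not taken from my PDF. It also only handles plain ASCII alt text; other text would need a different string encoding.

```python
def pdf_string(text: str) -> str:
    """Format ASCII text as a PDF literal string, escaping special characters."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def figure_struct_elem(alt_text: str, parent_ref: str, page_ref: str, mcid: int) -> str:
    """Serialize a structure element that tags an image and carries alt text."""
    entries = [
        "/Type /StructElem",             # this dictionary is a structure element
        "/S /Figure",                    # structure type: a figure/image
        f"/Alt {pdf_string(alt_text)}",  # alternate description for screen readers
        f"/P {parent_ref}",              # parent element in the structure tree
        f"/Pg {page_ref}",               # page on which the content is drawn
        f"/K {mcid}",                    # MCID of the marked content on that page
    ]
    return "<< " + " ".join(entries) + " >>"


# Hypothetical object numbers, for illustration only.
print(figure_struct_elem("Bar chart of sales (2019)", "5 0 R", "3 0 R", 0))
```

The `/K` entry is what ties this element back to a marked-content section with the same MCID in the page's content stream.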

Resources on tagging in PDFs

I'm still working through this W3C document. Check out the code example under PDF21 for what tagging looks like.

Interesting background on PDF encoding

Adobe's standards document (linked from the W3C doc)
(check out section 14.8.4 "Standard Structure Types")
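As a rough illustration of how those standard structure types relate to the question of authoring software and readers agreeing on tag names: the PDF specification lets a document use its own tag names as long as a role map translates them to standard ones. Below is a minimal sketch of that lookup. `STANDARD_TYPES` is only a subset of the names listed in that section, the `resolve_tag` helper and the example role map are hypothetical, and following chains of mappings is a choice made for this sketch rather than a statement about what any reader actually does.

```python
# A subset of the standard structure types (not the full list).
STANDARD_TYPES = {
    "Document", "Part", "Art", "Sect", "Div",
    "P", "H", "H1", "H2", "H3", "H4", "H5", "H6",
    "L", "LI", "Lbl", "LBody",
    "Table", "TR", "TH", "TD",
    "Span", "Quote", "Link", "Figure", "Formula",
}


def resolve_tag(tag, role_map):
    """Map a tag name to a standard structure type, or None if it can't be mapped.

    `role_map` plays the part of the document's RoleMap dictionary:
    custom tag name -> another tag name (ideally a standard one).
    """
    seen = set()
    while tag not in STANDARD_TYPES:
        if tag in seen or tag not in role_map:
            return None          # circular mapping, or no mapping at all
        seen.add(tag)
        tag = role_map[tag]      # follow the mapping one step
    return tag


# Hypothetical custom tags an authoring tool might emit.
role_map = {"Heading1": "H1", "Title": "Heading1", "Para": "P"}
for name in ["P", "Para", "Title", "Sidebar"]:
    print(name, "->", resolve_tag(name, role_map))
```

A tag that resolves to `None` here is the case where the reader software has no agreed meaning to fall back on.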

If you could provide more information on how quad hooks into a lower-level drawing interface, that would be super helpful! I've been looking at your code and I can't find any references to Pango or anything similar.

I also understand that quad is in its early stages and the connections it has to the underlying graphics system may change, so perhaps this project should wait a bit.

The lower-level parts aren’t documented yet because they’re still in flux. I don’t use Pango, however, nor any other library — I make the PDFs from scratch.

Oh, I guess I misread your reference to Pango in the documentation! I will keep an eye on the codebase as you develop it. Please let me know if you want anything from me 😃

Most of my PDF-making code was ported from the pdfkit project a few years ago. In the past year, pdfkit has added support for Tagged PDF. So the most likely path to supporting Tagged PDF is to go back to pdfkit and port this new code into quad (or, more specifically, pitfall, which is the PDF-generating part of the library).