Hopding/pdf-lib

[Feature Request]: Use PDF as a template, similar to mail merge in word

Misiu opened this issue · 14 comments

Misiu commented

I'm looking for a client-side library that will allow me to create a preview of a document for a client.
The idea is to create a simple HTML form with a couple of inputs and a preview button that will generate a PDF based on a template.
I'm aware that I can load an existing PDF and add an overlay to it (as shown in the samples), but I'd like to replace text. For example, I'd like to replace {{name}} with John and {{surname}} with Smith.

I've searched the issues and found #33 and #137. As I understand it, your library doesn't support reading text, so please consider this a feature request.
With this one feature, your library would be an ideal solution for client-side PDF manipulation.

I am adding my vote here, as this is something that could be really handy for my use case ;)

Hello @Misiu! This is an interesting idea. I can certainly see its utility.

There is currently work being done by @cshenks to develop an AcroForms API (see here). When this is complete, it should be fairly straightforward to parse and modify the content of AcroForm text fields with pdf-lib. This seems pretty much like what you are describing here. Do you think this would address your use case?

@jtraulle @gustavodipietrodeus @DaveLo Since you've all expressed interest in this, I'd very much like to hear your thoughts as well.

@Hopding, I'm interested in this functionality for use in variable data printing. At my company we send a customized instruction booklet to customers.

Our current solution converts HTML to PDF, but we are reaching the point where development is constraining our design team, since every change means a lengthy rebuild.

Using this library, there are places where I can easily put dynamic objects in blank areas (images, barcodes, etc.), but there are other places where string interpolation would be hugely helpful (Hello, {{name}} welcome to {{service}}).

Misiu commented

@Hopding AcroForms will be useful, but I support @DaveLo's idea. I'd like to put placeholders in a PDF and replace them with content.
Now that I think about it, this won't be that easy. If a placeholder is replaced with longer content, the whole text must be reflowed (some of it might move to a new line).
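
To illustrate the reflow problem, here is a minimal sketch of a greedy word wrap. It is not pdf-lib code: `wrapText` and `fixedWidth` are hypothetical names, and the width function simply counts characters, whereas a real implementation would measure text with the actual font.

```typescript
// Hypothetical width function: assumes every character is equally wide.
// A real implementation would measure the text with the embedded font.
type Measure = (text: string) => number;

const fixedWidth: Measure = (text) => text.length;

// Greedy word wrap: keep adding words to the current line until the
// next word would push it past maxWidth, then start a new line.
function wrapText(text: string, maxWidth: number, measure: Measure = fixedWidth): string[] {
  const lines: string[] = [];
  let current = "";
  for (const word of text.split(/\s+/).filter((w) => w.length > 0)) {
    const candidate = current === "" ? word : current + " " + word;
    if (current !== "" && measure(candidate) > maxWidth) {
      lines.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current !== "") lines.push(current);
  return lines;
}

const template = "Hello {{name}}, welcome to our service";
const filled = template.replace("{{name}}", "Maximilian Alexander");

// With a 24-character line, the template fits on 2 lines,
// but the filled-in text needs 3 and every line break moves.
const before = wrapText(template, 24);
const after = wrapText(filled, 24);
console.log(before, after);
```

Even in this toy case, one longer value changes every line break after it, which is why a replacement cannot be treated as a purely local edit.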

I agree with @DaveLo and @Misiu that AcroForms do not serve the same purpose as placeholders. Placeholders would be more appropriate for my use case: searching for and replacing, on the client side, markers/placeholders that were placed in the PDF when it was rendered on the server 😉

Implementing this feature without using AcroForms presents three main challenges:

  1. Locating the placeholders. This requires pdf-lib to sift through all the content streams in a document and locate all the text-drawing operators. This wouldn't be too difficult to do. The challenging part is mapping the glyph IDs to Unicode text. This would be a significant undertaking. The PDF specification defines a ridiculous number of ways to store fonts and encode text. Writing code to support all of them is entirely possible; it would just take a lot of time and effort. The final step in this process is to process all the Unicode text and produce a list of words/sentences/paragraphs in the document. You might think this last step would be simple, but it is not. PDF does not store text in a structured format like HTML. It just says to draw characters at X/Y coordinates. So you'd need to convert these spatial coordinates to structured text.
  2. Encoding the replacement text. Presumably, you'd want this feature to automatically draw the replacement text in the same font as the placeholders. This is also much harder than you might expect. For example, the font the placeholders were drawn in might have been subsetted, meaning it might not support the replacement text. And even if it does, you'd need to extract all the font objects for the placeholder font and figure out how to encode the new text (because, again, the PDF spec allows all sorts of fonts and encodings).
  3. Laying out the new text block. As @Misiu mentioned, it's highly unlikely that the replacement text will have the same length as the placeholder text. This means that you'd need to handle laying out the text already present in the document, not just the placeholders. And not necessarily just the sentence or paragraph to which the placeholders belonged. If the replacement text is long enough, it might require other paragraphs to be laid out again. And what happens if you end up exceeding the page length? And this is assuming you're dealing with simple paragraphs of text. Many PDF documents have all sorts of fancy images and complicated layouts that would be extremely difficult to identify and handle automatically.
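
To give a feel for the last part of (1), here is a rough sketch of turning positioned text into lines. It assumes the glyphs have already been decoded to Unicode; the `TextRun` shape, the `groupIntoLines` name, and the 2-unit baseline tolerance are all made up for illustration and are not part of pdf-lib.

```typescript
// Hypothetical shape for one positioned piece of already-decoded text.
interface TextRun {
  text: string;
  x: number; // left edge, in page units
  y: number; // baseline, in page units (PDF y grows upward)
}

// Group runs whose baselines lie within `tolerance` of each other into
// lines, ordered top to bottom, with runs ordered left to right.
function groupIntoLines(runs: TextRun[], tolerance = 2): string[] {
  const sorted = [...runs].sort((a, b) => b.y - a.y || a.x - b.x);
  const lines: TextRun[][] = [];
  for (const run of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last[0].y - run.y) <= tolerance) {
      last.push(run);
    } else {
      lines.push([run]);
    }
  }
  // Joining with a single space is a simplification: a real extractor
  // would look at the horizontal gaps between runs to decide on spacing.
  return lines.map((line) =>
    line.sort((a, b) => a.x - b.x).map((r) => r.text).join(" ")
  );
}

// Runs arrive in drawing order, which need not match reading order.
const runs: TextRun[] = [
  { text: "welcome", x: 72, y: 688 },
  { text: "{{name}}", x: 110, y: 700.5 },
  { text: "Hello,", x: 72, y: 700 },
];
const lines = groupIntoLines(runs);
console.log(lines);
```

This only covers a single column of horizontal text. Multiple columns, rotated text, and tables would all need extra heuristics, which is where most of the difficulty lies.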

There are some shortcuts that could be taken if we placed some restrictions on the feature. For example, we could make (1) much easier if we required the placeholder text to be tagged with marked-content operators (see section 14.6, Marked Content, of the PDF spec). But this would require the placeholders to be created in a special way, so it wouldn't be able to identify arbitrary strings of text like {{foo}}.
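
As a rough sketch of why tagging helps: once a content stream has been decoded, a span wrapped in `BMC`/`BDC` ... `EMC` operators can be located without decoding any glyphs. The code below is an illustration only. The `/Placeholder` tag, the `MarkedSpan` shape, and `findMarkedContent` are hypothetical, and a regular expression is a stand-in for a proper content stream tokenizer (it does not handle nested marked content, nested dictionaries, or `EMC` appearing inside a string).

```typescript
interface MarkedSpan {
  tag: string;
  start: number; // index of the tag name in the stream
  end: number;   // index just past the closing EMC
  body: string;  // operators between the opening operator and EMC
}

// Find "/Tag BMC ... EMC" and "/Tag <<...>> BDC ... EMC" spans in an
// already-decoded content stream. Assumes `tag` is purely alphanumeric.
function findMarkedContent(stream: string, tag: string): MarkedSpan[] {
  const pattern = new RegExp(
    "/" + tag + "\\s+(?:BMC|(?:<<[^>]*>>|/\\w+)\\s+BDC)\\b([\\s\\S]*?)\\bEMC\\b",
    "g"
  );
  const spans: MarkedSpan[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(stream)) !== null) {
    spans.push({
      tag,
      start: match.index,
      end: match.index + match[0].length,
      body: match[1].trim(),
    });
  }
  return spans;
}

// A tiny hand-written stream: untagged text followed by a tagged placeholder.
const stream = [
  "BT /F1 12 Tf 72 700 Td (Hello, ) Tj ET",
  "/Placeholder <</MCID 0>> BDC",
  "BT /F1 12 Tf 110 700 Td (NAME) Tj ET",
  "EMC",
].join("\n");

const spans = findMarkedContent(stream, "Placeholder");
console.log(spans);
```

The `start` and `end` offsets are what a replacement step would need in order to swap the tagged operators for new drawing instructions.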

We could make (2) much easier as well, if we didn't try to automatically extract and reuse the font that the placeholders were drawn in. This step would be fairly straightforward if we required you to embed/provide your own font, just like you'd do for PDFPage.drawText.

But as for (3), I'm not too sure what could be done to simplify this. I'm open to ideas though! I'm sure other PDF libraries (such as iText or PDFBox) support text extraction and replacement in some form/fashion. So it'd be interesting to see how they handle this part.

@Hopding, I think restrictions make a ton of sense here. In general, forcing tradeoffs for VDP-style usage is reasonable. If the user needs full customization, then drawing text on the page in an empty block is a better, already available solution than variable interpolation.

  1. I'm probably not knowledgeable enough to speak to (1) very well, but would it make sense to define the whole text block as the marked content and then, once you pull the string out, use interpolation on the variable pieces?

  2. This is a perfectly reasonable ask. If you are automating document creation, you probably have the font accessible somewhere.

  3. This might be unworkable, but what if you forced a same-or-fewer character limit for the substitution? At least to start, this limits the complexity: you wouldn't have to worry about page overflow or interactions with existing image layouts, since the text would consume the same or less space.
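
A minimal sketch of suggestions 1 and 3 combined, assuming the text block has already been pulled out as a plain string. `fillSameOrShorter` is a hypothetical helper, not a pdf-lib function, and it uses character count as a stand-in for rendered width, which is only accurate for monospaced fonts.

```typescript
// Replace {{key}} placeholders in an extracted string, rejecting any
// value that has more characters than the placeholder it replaces.
function fillSameOrShorter(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (placeholder: string, key: string) => {
    if (!(key in values)) return placeholder; // leave unknown placeholders untouched
    const value = values[key];
    if (value.length > placeholder.length) {
      throw new RangeError('"' + value + '" is longer than ' + placeholder);
    }
    // Pad with spaces so the rest of the text stays where it was.
    return value + " ".repeat(placeholder.length - value.length);
  });
}

const source = "Hello, {{name}} welcome to {{service}}";
const result = fillSameOrShorter(source, { name: "John", service: "Acme" });
console.log(JSON.stringify(result));
```

Because every value is padded to the placeholder's length, the output always has the same number of characters as the input. With proportional fonts the check would have to compare measured widths instead, and the padding leaves visible gaps after short values.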

Hi guys, is this feature still in the works?

@kevin8479 This is not something I am actively working on. There are a number of other features in much higher demand that I'd need to finish before turning to this. But as always, if any enterprising individual would like to try implementing this themselves, I'm happy to provide advice and answer questions!

Same issue.
Any news or updates about this?

This issue is stale because it has been open 2 weeks with no activity. It will be closed in 2 days unless there is new activity. See MAINTAINERSHIP.md#issues for details.

Misiu commented

This is still a valid feature request. Maybe the stale bot could leave issues with the `feature-request` label alone?

Hey @Misiu 👋! I'm revamping how issues/discussions are handled on this repo (see MAINTAINERSHIP.md#issues for details). Going forward, issues will only be kept open for long periods of time if they have a clear path to implementation or somebody is actively working on them (or they're high-impact bugs).

This is definitely still a valid feature request, but it's been open for 2 years now with no clear path forward. Since nobody is working on it (or likely to be anytime soon), I don't think it needs to be tracked as an open issue anymore. However, since there's been some good discussion in this thread, I've added it to #998 so it doesn't get totally buried.

Closing this as its status is now being tracked on the roadmap.