foliojs/pdfkit

Text in PDF recognized as gibberish in any PDFium viewer due to invalid bfrange definitions in ToUnicodeMap

orzFly opened this issue · 1 comment

Bug Report

Description of the problem

pdfkit/lib/font/embedded.js

Lines 269 to 271 in 485b7e6

    1 beginbfrange
    <0000> <${toHex(entries.length - 1)}> [${entries.join(' ')}]
    endbfrange

Currently, our code writes all ToUnicodeMap entries as a single bfrange entry that runs from <0000> to the last glyph ID. This yields invalid text mapping in any PDFium-based viewer (and maybe others).

https://source.chromium.org/chromium/_/pdfium/pdfium.git/+/master:core/fpdfapi/font/cpdf_tounicodemap.cpp;l=171-172;drc=61bda438f9071586c92f8f626c29021524a8d0b0

    uint32_t lowcode = lowcode_opt.value();
    uint32_t highcode = (lowcode & 0xffffff00) | (highcode_opt.value() & 0xff);

Related Chromium bug: https://bugs.chromium.org/p/pdfium/issues/detail?id=1339#c1

The PDF spec doesn't give too much detail about beginbfrange. I looked around and found the doc below. Based on section 1.4.1 in that doc, the <19ff><1a00><63cf> beginbfrange entry is illegal. The first byte values should be the same for the two source range values in the entry.
https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

That link has since been moved or removed. I found another copy at http://www.audentia-gestion.fr/ADOBE/5411.ToUnicode.pdf
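One way to satisfy the same-first-byte rule described above is to split the entries into chunks of at most 256, so that each bfrange entry starts at a multiple of 256 and never crosses a first-byte boundary. The following is a minimal sketch under that assumption, not necessarily the fix that will land in pdfkit. `buildBfRanges` is a hypothetical name, and `toHex` here is a stand-in assumed to produce four lowercase hex digits.

```javascript
// Stand-in for the toHex helper used in embedded.js (assumed: 4 hex digits).
function toHex(num) {
  return `0000${num.toString(16)}`.slice(-4);
}

// Hypothetical helper: `entries` is an array of strings such as '<0041>',
// indexed by glyph ID, as in the template quoted at the top of this issue.
function buildBfRanges(entries, chunkSize = 256) {
  const ranges = [];
  for (let start = 0; start < entries.length; start += chunkSize) {
    const chunk = entries.slice(start, start + chunkSize);
    const end = start + chunk.length - 1;
    // start is a multiple of 256 and end < start + 256,
    // so both source codes share the same first byte.
    ranges.push(`<${toHex(start)}> <${toHex(end)}> [${chunk.join(' ')}]`);
  }
  return [`${ranges.length} beginbfrange`, ...ranges, 'endbfrange'].join('\n');
}

// 258 entries produce two ranges: <0000> <00ff> and <0100> <0101>.
const sampleEntries = Array.from({ length: 258 }, (_, i) => `<${toHex(0x41 + i)}>`);
console.log(buildBfRanges(sampleEntries).split('\n')[0]); // 2 beginbfrange
```

This sketch does not handle any limit a consumer may place on the number of entries in one beginbfrange block.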

Screenshots

  • Google Chrome 122.0.6261.69 Linux x86_64

  • Chromium 122.0.6261.69 (Official Build) Arch Linux (64-bit)

  • WPS Office for Linux 11.1.0.11698

  • Firefox (pdf.js) - CORRECT

  • Adobe Acrobat Reader 2023.008.20533 64-bit on Windows 11 - CORRECT

Code sample

https://replit.com/@orzFly/pdfkit-tounicode?v=1
test.pdf

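I used 258 glyphs in the document, so the range is written as <0000> <0101>. PDFium keeps only the last byte of the high code, which reduces it to <0001>, so only the first two glyphs (258 % 256 = 2) are correct, yielding "AB". All the rest are incorrect.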
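The arithmetic can be reproduced with a small sketch of the high-code computation from the PDFium lines quoted earlier. Only the masking expression comes from that snippet. The function name is hypothetical, and the sketch assumes PDFium then maps the inclusive range from the low code to the adjusted high code.

```javascript
// Hypothetical helper: how many codes PDFium would map for one bfrange entry,
// assuming it covers lowcode..effectiveHigh inclusive after the adjustment.
function pdfiumMappedCodeCount(lowcode, highcode) {
  // Same expression as in cpdf_tounicodemap.cpp: keep the upper bytes of
  // lowcode and only the last byte of highcode.
  const effectiveHigh = ((lowcode & 0xffffff00) | (highcode & 0xff)) >>> 0;
  return effectiveHigh >= lowcode ? effectiveHigh - lowcode + 1 : 0;
}

const glyphCount = 258;
// pdfkit writes <0000> <0101> for 258 entries (0x0101 = 257).
const mapped = pdfiumMappedCodeCount(0x0000, glyphCount - 1);
console.log(mapped); // 2
```

With 256 glyphs or fewer the high code already fits in one byte, which would explain why smaller documents are not affected.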

Your environment

  • pdfkit version: 0.12.3, or master
  • Node version: 12.22.9
  • Browser version:
    • Google Chrome 122.0.6261.69 Linux x86_64
    • WPS Office for Linux 11.1.0.11698
    • Chromium 122.0.6261.69 (Official Build) Arch Linux (64-bit)
  • Operating System: Linux x86_64

I have a possible fix and will send a pull request later. However, I am not sure how to add a unit test for this particular behavior.
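As a starting point for such a test, one option might be to scan the generated CMap text and flag every bfrange entry whose two source codes differ in anything other than the last byte. This is only a sketch: `findInvalidBfRanges` is a hypothetical helper, not part of pdfkit, and it assumes each range sits on its own line, as in the current template.

```javascript
// Hypothetical check: returns the bfrange lines whose low and high source
// codes differ in any byte other than the last one.
// Assumes one range per line between beginbfrange and endbfrange.
function findInvalidBfRanges(cmapText) {
  const invalid = [];
  const blockPattern = /beginbfrange([\s\S]*?)endbfrange/g;
  for (const [, body] of cmapText.matchAll(blockPattern)) {
    for (const line of body.split('\n')) {
      // Only the two codes at the start of the line are the source range.
      const match = /^\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>/.exec(line);
      if (!match) continue;
      const low = match[1].toLowerCase();
      const high = match[2].toLowerCase();
      // Everything except the final byte (two hex digits) must be identical.
      if (low.length !== high.length || low.slice(0, -2) !== high.slice(0, -2)) {
        invalid.push(line.trim());
      }
    }
  }
  return invalid;
}

const broken = '1 beginbfrange\n<0000> <0101> [<0041> <0042>]\nendbfrange';
const split = '2 beginbfrange\n<0000> <00ff> [<0041>]\n<0100> <0101> [<0042>]\nendbfrange';
console.log(findInvalidBfRanges(broken).length); // 1
console.log(findInvalidBfRanges(split).length); // 0
```

A unit test could then embed a font with more than 256 glyphs, capture the ToUnicode stream, and assert that this function returns an empty array.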