Text in PDF recognized as gibberish in any PDFium viewer due to invalid bfrange definitions in ToUnicodeMap
orzFly opened this issue · 1 comment
Bug Report
Description of the problem
Lines 269 to 271 in 485b7e6
Currently, our code generates all ToUnicodeMap entries on a single line. This yields invalid text mappings in any PDFium-based viewer (and maybe others).
uint32_t lowcode = lowcode_opt.value();
uint32_t highcode = (lowcode & 0xffffff00) | (highcode_opt.value() & 0xff);
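To make the effect of those two lines concrete, here is a minimal JavaScript sketch that mirrors the same arithmetic. The function name `pdfiumEffectiveRange` is made up for illustration and is not part of PDFium or pdfkit; it only reproduces the bit masking quoted above, assuming 2-byte source codes.

```javascript
// Hypothetical helper: mirrors the PDFium masking quoted above.
// Only the low byte of the high code survives; the upper bytes are
// taken from the low code.
function pdfiumEffectiveRange(lowcode, highcode) {
  const effectiveHigh = ((lowcode & 0xffffff00) | (highcode & 0xff)) >>> 0;
  return { lowcode, highcode: effectiveHigh };
}

// A single range covering 258 glyphs would be written as <0000> <0101>.
const whole = pdfiumEffectiveRange(0x0000, 0x0101);
console.log(whole); // { lowcode: 0, highcode: 1 }

// Number of codes the viewer actually maps for that range.
console.log(whole.highcode - whole.lowcode + 1); // 2
```

With a range of `<0000> <0101>`, the effective high code becomes `0x0001`, so only codes `0x0000` and `0x0001` get a mapping.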
Related Chromium bug: https://bugs.chromium.org/p/pdfium/issues/detail?id=1339#c1
The PDF spec doesn't give too much detail about beginbfrange. I looked around and found the doc below. Based on section 1.4.1 in that doc, the <19ff><1a00><63cf> beginbfrange entry is illegal. The first byte values should be the same for the two source range values in the entry.
https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf
The link above has since been moved or removed. I found another copy at http://www.audentia-gestion.fr/ADOBE/5411.ToUnicode.pdf
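The rule quoted above can be expressed as a small check. This is only a sketch: `isValidBfRange` is a hypothetical name, and treating "all bytes except the last must match" as the general form of the rule is my assumption (for 2-byte codes it is the same as "the first byte must match").

```javascript
// Hypothetical validator for the two source codes of a bfrange entry.
// Assumption: both codes are hex strings of equal length, and every byte
// except the last one must be identical.
function isValidBfRange(lowHex, highHex) {
  if (lowHex.length !== highHex.length) return false;
  const low = parseInt(lowHex, 16);
  const high = parseInt(highHex, 16);
  // Dropping the last byte leaves the prefix that must be shared.
  return low <= high && (low >>> 8) === (high >>> 8);
}

console.log(isValidBfRange('19ff', '1a00')); // false (0x19 vs 0x1a)
console.log(isValidBfRange('0000', '00ff')); // true
console.log(isValidBfRange('0000', '0101')); // false (0x00 vs 0x01)
```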
Screenshots
Code sample
https://replit.com/@orzFly/pdfkit-tounicode?v=1
test.pdf
I used 258 glyphs in the document, so only the first two (258 % 256 = 2) glyphs are correct, yielding "AB". All the rest are incorrect.
Your environment
- pdfkit version: 0.12.3, or master
- Node version: 12.22.9
- Browser version:
- Google Chrome 122.0.6261.69 Linux x86_64
- WPS Office for Linux 11.1.0.11698
- Chromium 122.0.6261.69 (Official Build) Arch Linux (64-bit)
- Operating System: Linux x86_64
I have a possible fix and will send a pull request later. However, I am not sure how to add a unit test for this particular behavior.
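For discussion, here is a rough sketch of the direction I have in mind; it is not the actual patch. The names `buildBfRanges` and `toHex4` are hypothetical, and I am assuming the entries are already formatted strings such as `<0041>`, indexed by glyph code starting at 0. The idea is to emit one bfrange per block of 256 codes so that both source codes in each range share the same first byte.

```javascript
// Hypothetical helpers, not pdfkit's real API.
const CODES_PER_RANGE = 256; // one range per distinct high byte

function toHex4(n) {
  return n.toString(16).padStart(4, '0');
}

// entries[i] is the (already formatted) Unicode mapping for code i.
function buildBfRanges(entries) {
  const ranges = [];
  for (let start = 0; start < entries.length; start += CODES_PER_RANGE) {
    const chunk = entries.slice(start, start + CODES_PER_RANGE);
    const end = start + chunk.length - 1;
    ranges.push(`<${toHex4(start)}> <${toHex4(end)}> [${chunk.join(' ')}]`);
  }
  return ranges;
}

// 258 dummy entries, as in the test document.
const sample = Array.from({ length: 258 }, () => '<0041>');
const lines = buildBfRanges(sample);
console.log(lines.length); // 2
console.log(lines[1]); // <0100> <0101> [<0041> <0041>]
```

A unit test could then check the generated ranges directly, for example asserting that the first two hex digits of both source codes match in every range, without needing to render the PDF in a viewer.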