yob/pdf-reader

Numerals read as `\u0000` when using font feature settings

SimonEggert opened this issue · 1 comment

First of all, thanks for the work and effort you've put into this great library!

Bug description

Numerals are not read correctly by PDF::Inspector::Text.analyze: they come back as \u0000 when the element is styled with font-feature-settings: 'tnum'. We generate the PDF with Gotenberg from HTML templates.

Minimal reproducible example

`<div>21.09.2023</div>` gets read as `21.09.2023`

while

`<div style="font-feature-settings: 'tnum'">21.09.2023</div>` gets read as `\u0000\u0000.\u0000\u0000.\u0000\u0000\u0000\u0000`.

PDFs

Here are two PDFs, one with the feature turned off and one with the feature turned on:
font_features_off.pdf
font_features_on.pdf
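
To rule out pdf-inspector as the cause, the same files can be read with pdf-reader directly. The following is only a minimal sketch: `show_nuls` and `page_texts` are hypothetical helper names made up for this example, and it assumes the pdf-reader gem is installed and that a PDF path is passed on the command line.

```ruby
# Make NUL characters visible, since terminals usually print nothing for them.
def show_nuls(text)
  text.gsub("\u0000", "\\u0000")
end

# Extract the text of every page with pdf-reader.
# The gem is required lazily so the helper above works without it.
def page_texts(path)
  require "pdf/reader"
  PDF::Reader.new(path).pages.map(&:text)
end

# Only runs when a path is given, e.g.:
#   ruby check_nuls.rb font_features_on.pdf
if __FILE__ == $PROGRAM_NAME && ARGV.any?
  page_texts(ARGV.first).each_with_index do |text, index|
    puts "page #{index + 1}: #{show_nuls(text)}"
  end
end
```

If the output for the features-on file contains `\u0000` where the digits should be, the problem is below pdf-inspector.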

Further information

The UNIX tool pdftotext is able to read both versions correctly so I think the PDF is alright.
The font in use is Barlow if that makes any difference.

Any help would be appreciated!

P.S.: I'll also open an issue regarding this problem over at https://github.com/prawnpdf/pdf-inspector so feel free to close this one if you think it should be handled there.

yob commented

Thanks for the clear report and simple test files.

Looking at font_features_on.pdf, it has a ToUnicode CMap that maps the glyph code of every numeral to the Unicode codepoint \u0000 (only the full stop, <00B4>, maps to a real character), and we're honoring it:

$ ruby -Ilib bin/pdf_object font_features_on.pdf 20
{:Filter=>:FlateDecode, :Length=>245}
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<<  /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
6 beginbfchar
<008E> <0000>
<008F> <0000>
<0090> <0000>
<0091> <0000>
<0097> <0000>
<00B4> <002E>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
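
To make the effect of that table concrete, here is a rough sketch that reads the bfchar section above into a Ruby hash. `parse_bfchar` is a hypothetical helper written for this comment, not pdf-reader's CMap parser; it assumes a single bfchar section and ignores bfrange entries.

```ruby
# The bfchar section copied from the ToUnicode CMap dump above.
CMAP_EXCERPT = <<~CMAP
  6 beginbfchar
  <008E> <0000>
  <008F> <0000>
  <0090> <0000>
  <0091> <0000>
  <0097> <0000>
  <00B4> <002E>
  endbfchar
CMAP

# Hypothetical helper: map each source glyph code to its UTF-8 string.
# Destination values in a ToUnicode CMap are UTF-16BE code units in hex.
def parse_bfchar(cmap_text)
  body = cmap_text[/beginbfchar(.*?)endbfchar/m, 1] || ""
  body.scan(/<(\h+)>\s*<(\h+)>/).each_with_object({}) do |(src, dst), map|
    utf16 = [dst].pack("H*").force_encoding("UTF-16BE")
    map[src.hex] = utf16.encode("UTF-8")
  end
end

mapping = parse_bfchar(CMAP_EXCERPT)
mapping.each do |code, char|
  puts format("<%04X> => %s", code, char.inspect)
end
```

Five of the six entries decode to a NUL character, so any extractor that trusts this table alone will lose the digits.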

However, the content stream is using the optional "marked content" operators (BDC, EMC), and I can see the real characters in there as literal strings in the ActualText entries ((2), (1), (9), etc.):

$ ruby -Ilib bin/pdf_object font_features_on.pdf 5
{:Filter=>:FlateDecode, :Length=>282}
.23999999 0 0 -.23999999 0 841.91998 cm
q
0 387.5 2479.1665 2732.0789 re
W* n
q
3.122376 0 0 3.122376 0 387.5 cm
1 1 1 RG 1 1 1 rg
/G3 gs
0 0 794 875 re
f
0 0 794 875 re
f
.1255 .1333 .1569 RG .1255 .1333 .1569 rg
BT
/P <</MCID 0 >>BDC
/Span<</ActualText (2) >> BDC
/F4 10.6599998 Tf
1 0 0 -1 0 11 Tm
<0090> Tj
EMC
/Span<</ActualText (1) >> BDC
5.6159058 0 Td <008F> Tj
EMC
5.6159058 0 Td <00B4> Tj
/Span<</ActualText (0) >> BDC
2.9624634 0 Td <008E> Tj
EMC
/Span<</ActualText (9) >> BDC
5.6159058 0 Td <0097> Tj
EMC
5.6159058 0 Td <00B4> Tj
/Span<</ActualText (2) >> BDC
2.9624634 0 Td <0090> Tj
EMC
/Span<</ActualText (0) >> BDC
5.6159058 0 Td <008E> Tj
EMC
/Span<</ActualText (2) >> BDC
5.6159058 0 Td <0090> Tj
EMC
/Span<</ActualText (3) >> BDC
5.6159058 0 Td <0091> Tj
EMC
EMC
ET
Q
Q

pdf-reader currently doesn't look at marked content. Maybe we should, and maybe this suggests marked content should take precedence over ToUnicode CMaps?
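
As a rough illustration of what "ActualText takes precedence" could mean, here is a toy sketch over the first few operators of the stream above. It is not how pdf-reader works today and not a proposed patch: `extract_text`, `TOKEN` and the `prefer_actual_text` flag are names invented for this example, the tokenizer is a regex rather than a real content-stream parser, and it assumes ActualText is a simple literal string with no escapes or UTF-16 encoding.

```ruby
# ToUnicode entries from the CMap dump above.
TO_UNICODE = {
  0x008E => "\u0000", 0x008F => "\u0000", 0x0090 => "\u0000",
  0x0091 => "\u0000", 0x0097 => "\u0000", 0x00B4 => "."
}.freeze

# Toy tokenizer: only the four constructs this example needs.
TOKEN = %r{
  /Span\s*<<\s*/ActualText\s*\((?<actual>[^)]*)\)\s*>>\s*BDC  # span carrying replacement text
  | (?<bdc>\bBDC\b)                                             # any other marked-content start
  | (?<emc>\bEMC\b)                                             # marked-content end
  | <(?<hex>\h{4})>\s*Tj                                        # show one 2-byte glyph code
}x

# Shortened excerpt of the content stream above: "2", "1", ".".
SAMPLE = <<~STREAM
  BT
  /P <</MCID 0 >>BDC
  /Span<</ActualText (2) >> BDC
  /F4 10.6599998 Tf
  1 0 0 -1 0 11 Tm
  <0090> Tj
  EMC
  /Span<</ActualText (1) >> BDC
  5.6159058 0 Td <008F> Tj
  EMC
  5.6159058 0 Td <00B4> Tj
  EMC
  ET
STREAM

def extract_text(stream, to_unicode, prefer_actual_text: true)
  out = +""
  stack = [] # one flag per open marked-content sequence; true means "glyphs replaced"
  stream.scan(TOKEN) do
    m = Regexp.last_match
    if m[:actual]
      # ActualText stands in for everything inside the span.
      out << m[:actual] if prefer_actual_text
      stack.push(prefer_actual_text)
    elsif m[:bdc]
      stack.push(false)
    elsif m[:emc]
      stack.pop
    elsif m[:hex] && stack.none?
      # Not inside a replaced span: fall back to the ToUnicode CMap.
      out << to_unicode.fetch(m[:hex].hex, "\uFFFD")
    end
  end
  out
end

puts extract_text(SAMPLE, TO_UNICODE).inspect
puts extract_text(SAMPLE, TO_UNICODE, prefer_actual_text: false).inspect
```

With the flag on, the digits come from ActualText and the full stop still comes from the CMap; with it off, the result matches the behaviour reported in this issue. Whether precedence should be unconditional or only apply when the CMap yields \u0000 is still an open question.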