pymupdf/PyMuPDF

font_family in page.get_text() dict at span level instead of font_name

SirishaGorasa opened this issue · 12 comments

Description of the bug

The span object in the page.get_text() dict reports the font family instead of the font name. This is problematic when trying to recreate the text, because the same PDF can contain different subset fonts under the same font family. Is there a way to get the original subset font name from get_text?

Page.get_fonts reports the exact name, but the font associated with a span is only the font family.

How to reproduce the bug

Traverse the page.get_text() dict down to span level. The font reported there is the font family rather than the original font name.

PyMuPDF version

1.23.x or earlier

Operating system

Windows

Python version

3.8

This method returns the font name! Using pymupdf.TOOLS.set_subset_fontnames(True) will return the subset prefix too.
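
A minimal sketch of that suggestion, assuming a recent PyMuPDF that imports as `pymupdf` (older releases use `import fitz` instead). The helper names `span_fonts` and `fonts_on_page` are hypothetical and not part of the library.

```python
def span_fonts(text_dict):
    """Collect the distinct span["font"] values from a page.get_text("dict") result."""
    fonts = set()
    for block in text_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                fonts.add(span["font"])
    return sorted(fonts)


def fonts_on_page(path, page_number=0):
    """Open a PDF and list the span fonts of one page, subset prefixes included."""
    import pymupdf  # imported here so span_fonts() also works without PyMuPDF

    # Keep the "ABCDEF+" subset prefix in span["font"].
    pymupdf.TOOLS.set_subset_fontnames(True)
    with pymupdf.open(path) as doc:
        return span_fonts(doc[page_number].get_text("dict"))
```

Calling `fonts_on_page("some.pdf")` (the path is a placeholder) should then list names such as "AFHYFG+Calibri" rather than just "Calibri" for subset fonts.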

BTW, please make sure to upgrade your Python version soon.
Version 3.8 will no longer be supported beginning with a release in October.
Ceasing support means we will no longer create wheels for it and will stop accepting issues.

Sure, thanks for your quick response. I will check this and let you know.

This worked. Thanks!
Can we get the encoding or the symbolic font name for each span? There can be different encodings defined for the same base font, so the symbolic font name would help in this case.

No, this is not possible. Fonts that have identical names, down to even the subset prefix "ABCDEF+", cannot be differentiated.

Can we get both the font name and the base font name from the span?
For example, for a span I need to have both
"font" : "Calibri" and "BaseFont" : "AFHYFG+Calibri".

If a font is a subset or not can be determined by whether there exists a prefix made of 6 uppercase characters followed by a "+".
There is no other information available.
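
The prefix rule above can be sketched as a small helper that splits a reported name into its subset tag and the remaining font name. `split_subset` is a hypothetical name used here for illustration, and it only applies the 6-uppercase-letters-plus-"+" rule from this thread.

```python
import re

# Six uppercase letters followed by "+" mark a subset font.
SUBSET_PREFIX = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset(font_name):
    """Return (prefix, name); prefix is None when the font is not a subset."""
    match = SUBSET_PREFIX.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name
```

With subset names enabled, `split_subset(span["font"])` gives both values asked for above: the name without the prefix, and the prefix to recombine into the full subset name.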

Is there a restriction on the number of characters in the subset font name?
For example:

The internal structure had the below as the subset font name
/BaseFont /ABCDFG+TimesNewRomanPSMT-BoldCond
and
TOOLS.set_subset_fontnames(True)
and
span["font"]
returned
ABCDFG+TimesNewRomanPSMT-BoldCo

The last two characters from the subset font name are missing.

Can you help me understand why this happened?

Yes, there is an in-built length restriction of 31 on the font name.
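
As an illustration of that limit, here is the arithmetic on the name from the question, plus a hypothetical check (`maybe_truncated`) for names that may have been cut. The value 31 is taken from this thread, not read from the library.

```python
MAX_FONT_NAME_LEN = 31  # limit as stated in this thread (assumption, not queried)


def maybe_truncated(span_font):
    """True if the reported name is long enough that it may have been cut off."""
    return len(span_font) >= MAX_FONT_NAME_LEN


full = "ABCDFG+TimesNewRomanPSMT-BoldCond"  # 33 characters in /BaseFont
reported = full[:MAX_FONT_NAME_LEN]         # what a 31-character limit leaves
print(len(full), reported)
```

A name of exactly 31 characters is not necessarily truncated, so the check can only flag candidates.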

Oh, is that so?

So even though the base font name in the internal structure has more than 31 characters, set_subset_fontnames(True) returns only the first 31 characters?

But what if the full-length base font name is needed?

No way to do this - sorry.

That's ok.

Appreciate your quick response.