font_family in page.get_text() dict at span level instead of font_name

Question

SirishaGorasa opened this issue 4 months ago · 12 comments

Description of the bug

The span object in page.get_text() reports the font family instead of the font name. This can be a problem when trying to recreate the text, because the same PDF can contain different subset fonts under the same font family. Please share how we can get the original subset font name from get_text().

Page.get_fonts() gives the exact name, but the font associated with a span is only the font family.

How to reproduce the bug

Traverse the page.get_text("dict") output down to the span level: the font reported there is the font family rather than the original font name.

PyMuPDF version

1.23.x or earlier

Operating system

Windows

Python version

3.8

Answer 1 · 2024-06-04T14:40:58.000Z

This method returns the font name! Using pymupdf.TOOLS.set_subset_fontnames(True) will return the subset prefix too.
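
As a minimal sketch of the suggestion above, the following enables subset prefixes and then walks the blocks, lines and spans of a page dict. The helper names `collect_span_fonts` and `fonts_on_first_page` are made up for this illustration, and the fallback to `import fitz` assumes an older release where the package is not yet importable as `pymupdf`.

```python
def collect_span_fonts(page_dict):
    """Return the set of font names found in a get_text("dict") result."""
    fonts = set()
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                fonts.add(span["font"])
    return fonts


def fonts_on_first_page(path):
    """Hypothetical helper: span font names, with subset prefix, for page 0."""
    try:
        import pymupdf
    except ImportError:
        import fitz as pymupdf  # assumption: older releases use this name

    # Ask for names like "ABCDEF+Calibri" instead of plain "Calibri".
    pymupdf.TOOLS.set_subset_fontnames(True)
    doc = pymupdf.open(path)
    try:
        return collect_span_fonts(doc[0].get_text("dict"))
    finally:
        doc.close()
```

Calling `fonts_on_first_page("input.pdf")` (the file name is a placeholder) should then list the names including any subset prefix.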

Answer 2 · 2024-06-04T14:44:23.000Z

BTW, please make sure to upgrade your Python version soon.
Version 3.8 will no longer be supported beginning with some release in October.
Ceasing support means we will no longer create wheels and will stop accepting issues.

Answer 3 · 2024-06-04T14:46:22.000Z

Sure, thanks for your quick response. I will check this and let you know.

Answer 4 · 2024-06-12T14:43:10.000Z

This worked. Thanks!
Can we get the encoding or the symbolic font name for each span? There can be different encodings defined for the same base font, so the symbolic font name would help in this case.

Answer 5 · 2024-06-12T14:49:02.000Z

No, this is not possible. Fonts that have identical names, even down to the subset prefix "ABCDEF+", cannot be differentiated.

Answer 6 · 2024-06-25T06:46:45.000Z

Can we get both the font name and the base font name from the span?
For example, for a span I need both
"font": "Calibri" and "BaseFont": "AFHYFG+Calibri".

Answer 7 · 2024-06-25T06:57:37.000Z

Whether a font is a subset can be determined by whether its name has a prefix of 6 uppercase characters followed by a "+".
There is no other information available.
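
A minimal sketch of that check, assuming subset prefixes were enabled with TOOLS.set_subset_fontnames(True) so that span["font"] carries the prefix. The function name `split_subset_name` and the returned keys are hypothetical, chosen to mirror the "font" / "BaseFont" pair asked for above.

```python
import re

# Six uppercase ASCII letters followed by "+" at the start of the name.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def split_subset_name(font_name):
    """Split a span font name into its plain and prefixed forms."""
    match = SUBSET_PREFIX.match(font_name)
    if match is None:
        # No prefix: the font is not recognizable as a subset.
        return {"font": font_name, "BaseFont": font_name, "is_subset": False}
    return {
        "font": font_name[match.end():],  # name without "ABCDEF+"
        "BaseFont": font_name,            # name as reported, prefix included
        "is_subset": True,
    }
```

For "AFHYFG+Calibri" this gives "font": "Calibri" and "BaseFont": "AFHYFG+Calibri". It still cannot tell apart two fonts whose names are identical including the prefix.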

Answer 8 · 2024-06-27T10:05:06.000Z

Is there a restriction on the number of characters in the subset font name?
For example:

The internal structure had the following subset font name:
/BaseFont /ABCDFG+TimesNewRomanPSMT-BoldCond
After calling
TOOLS.set_subset_fontnames(True)
span["font"]
returned
ABCDFG+TimesNewRomanPSMT-BoldCo

The last two characters of the subset font name are missing.

Can you help me understand why this happened?

Answer 9 · 2024-06-27T10:30:10.000Z

Yes, there is a built-in length restriction of 31 characters on the font name.
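
To illustrate the effect, here is a small sketch that flags span font names that may have been cut off and looks for full names they could belong to. It assumes the limit behaves like a plain cut after 31 characters, which matches the example reported above. `possibly_truncated` and `match_full_names` are hypothetical helpers, and the list of full names is assumed to come from somewhere else, for example the base font entries reported by Page.get_fonts().

```python
MAX_SPAN_FONT_LEN = 31  # limit stated above


def possibly_truncated(span_font):
    """A name that reaches the limit may have lost trailing characters."""
    return len(span_font) >= MAX_SPAN_FONT_LEN


def match_full_names(span_font, full_names):
    """Return the full names that are compatible with a span font name.

    Short names must match exactly. A name at the limit matches every
    full name that starts with it, so the result can be ambiguous.
    """
    if not possibly_truncated(span_font):
        return [name for name in full_names if name == span_font]
    return [name for name in full_names if name.startswith(span_font)]


full = "ABCDFG+TimesNewRomanPSMT-BoldCond"  # 33 characters
seen = full[:MAX_SPAN_FONT_LEN]             # "ABCDFG+TimesNewRomanPSMT-BoldCo"
candidates = match_full_names(seen, [full, "ABCDFG+Calibri"])
```

This is a heuristic only: if two fonts share the same first 31 characters, the span alone does not say which one was used.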

Answer 10 · 2024-06-27T10:34:59.000Z

Oh, is that so?

So even if the base font name in the internal structure is longer than 31 characters, set_subset_fontnames(True) truncates it to 31 characters?

But what if the full-length base font name is needed?

Answer 11 · 2024-06-27T11:25:21.000Z

No way to do this - sorry.

Answer 12 · 2024-06-27T11:26:58.000Z

That's ok.

Appreciate your quick response.