reutenauer/polyglossia

Explicit hyphens typeset as discretionary hyphens with LuaLaTeX and certain fonts

Closed this issue · 8 comments

While working with Sphinx, which uses Polyglossia when LuaLaTeX is selected, I stumbled onto an issue that totally baffles me. Certain fonts (notably Lato and Palatino Linotype) cause normal "explicit" hyphens to be typeset as Unicode soft/discretionary hyphens (U+00AD) instead of a normal hyphen (U+002D).

Here is a minimal working example. Although the document looks correct, the hyphens (and hyphenated words) are not searchable, at least not in Okular (Poppler-based).

\documentclass{report}
\usepackage{polyglossia}
\setmainlanguage{english}
\usepackage[default=true]{lato}
\begin{document}
The explicit-hyphen in this sentence will be typeset as a Unicode discretionary (soft) hyphen.
The automatic hyphen was also typeset in the same way.
\end{document}

Changing "lato" to "carlito", for example, makes the problem go away. Loading Palatino Linotype with \setmainfont shows a similar issue:

\documentclass{report}
\usepackage{polyglossia}
\setmainlanguage{english}
\usepackage{fontspec}
\setmainfont{Palatino Linotype}
\begin{document}
The explicit-hyphen in this sentence will be typeset as a Unicode discretionary (soft) hyphen.
The automatic hyphen was also typeset in the same way.
\end{document}

I tried various trace commands but did not notice anything obvious in the log file that would explain why this happens. This TeX Stack Exchange post seems to suggest that the wrong glyph is being selected for the generic "hyphen" character, but I do not understand exactly how, and besides, the issue is specific to XeTeX in that case.

Does the following work for you?

\documentclass{report}
\usepackage[luatexrenderer=Node]{polyglossia}
\setmainlanguage{english}
\usepackage[default=true]{lato}
\begin{document}
The explicit-hyphen in this sentence will be typeset as a Unicode discretionary (soft) hyphen.
The automatic hyphen was also typeset in the same way.
\end{document}

Yes, that works! Does this mean the bug might be in HarfBuzz then?
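One way to narrow this down further is to take Polyglossia out of the picture and select the renderer through fontspec directly. The sketch below is untested against this exact setup and rests on two assumptions: that fontspec's `Renderer` font feature selects the same shaping path as Polyglossia's `luatexrenderer` option, and that luaotfload can find the font under the name `Lato`. Compile it with lualatex.

```latex
% Polyglossia-free test case (a sketch, not a verified reproduction).
% Assumption: Renderer=HarfBuzz corresponds to the renderer used in the
% failing examples, and Renderer=Node to the one in the working example.
\documentclass{report}
\usepackage{fontspec}
% Assumption: the font is installed and known to luaotfload as "Lato".
\setmainfont{Lato}[Renderer=HarfBuzz] % change to Renderer=Node to compare
\begin{document}
The explicit-hyphen in this sentence is the character to search for in the PDF.
\end{document}
```

If the hyphen is unsearchable with `Renderer=HarfBuzz` and searchable with `Renderer=Node`, the behaviour would not depend on Polyglossia at all.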

Maybe. Do you experience the same thing with XeTeX? If not, maybe luaotfload adds some /ActualText spans that affect text extraction (copy and search) from the PDF when HarfBuzz is used.

BTW, the explicit hyphen in your example looks like a soft hyphen in the PDF (at least when I copy it).

I don't see any /ActualText spans in the uncompressed PDF.

Interestingly, XeTeX produces U+2010 (HYPHEN, the unambiguous hyphen that cannot be mistaken for a minus sign), which is more correct for explicit hyphens but definitely incorrect for discretionary hyphens. The result is still not searchable, though that is technically a fault of the PDF viewer.
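For anyone who wants to repeat the /ActualText check, a minimal sketch of how an uncompressed PDF can be produced with LuaLaTeX follows. It uses the LuaTeX `\pdfvariable` settings for stream and object compression; how the PDF in this thread was actually uncompressed is not stated, so this is only one possible approach.

```latex
% Sketch: write page content streams and objects uncompressed so the PDF
% can be inspected in a text editor. LuaTeX only; compile with lualatex.
\documentclass{report}
\pdfvariable compresslevel=0     % do not compress content streams
\pdfvariable objcompresslevel=0  % do not pack objects into object streams
\usepackage{polyglossia}
\setmainlanguage{english}
\usepackage[default=true]{lato}
\begin{document}
The explicit-hyphen in this sentence is the character to inspect.
\end{document}
```

Searching the resulting file for the string `ActualText` then shows whether any such spans were written. Embedded font data stays binary, but the text-showing operators in the page streams become readable.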

Then I'm really not sure. It could be a problem in luaotfload, HarfBuzz, or the font, or it might not be a bug at all. In any case, this is not a Polyglossia bug, so I'll close this ticket.

Feel free to reopen if you think otherwise.

Thank you very much for the hint on where to look :) I'll see if I can find out anything more.