Kozea/Pyphen

Surrounding Punctuation Affecting Hyphenation

Smylers opened this issue · 2 comments

A TLA (three-letter acronym) which isn't normally hyphenated can become hyphenated if it's surrounded by punctuation, such as parentheses or quote marks:

>>> import pyphen
>>> dic = pyphen.Pyphen(lang='en_GB')
>>> dic.inserted('LST')
u'LST'
>>> dic.inserted('(LST)')
u'(L-ST)'
>>> dic.inserted('TLA')
u'TLA'
>>> dic.inserted('(TLA)')
u'(T-LA)'
>>> dic.inserted('"TLA"')
u'"T-LA"'

Please ignore surrounding punctuation for the purpose of determining hyphenation points, so that terms which aren't normally hyphenated don't look wrong simply because they've been put in brackets.

Thanks.

As a quick workaround you might want to use something along these lines:

import re
import pyphen

# matches only runs of word characters, so surrounding characters like brackets are left out
word_detection_pattern = re.compile(r'\w{5,}', re.UNICODE) # 5+ word characters; \w also matches digits and underscores
lang = 'en_GB' # accent spoken in Europe, except for a small isle which struggles with adapting ;-)
hyphenator = pyphen.Pyphen(lang=lang)

# selects the HTML elements whose text will be hyphenated
language_annotated_text = 'body//*[ancestor-or-self::*/@lang and string-length(text()) > 5]'
for elem in dom_tree.xpath(language_annotated_text): # dom_tree: an already parsed HTML tree (e.g. lxml); elem is one element

    # injects hyphenated words
    elem.text = word_detection_pattern.sub(
        lambda matchobj: hyphenator.inserted(matchobj.group(0)),
        elem.text,
    )

    # … child.tails of elem omitted for brevity

See https://github.com/wmark/thot/blob/680beef953d3b3930520af738ff52b7779517858/src/thot/plugins/HtmlPostProcessing.py#L33

liZe commented

I definitely think that it's not pyphen's job to remove extra characters around words, even if it's quite simple to remove surrounding punctuation (as explained above). Because the right behaviour depends on the context (how do we manage language-specific rules, do we count the extra characters towards the minimum number of characters of the first/last syllable, etc.), this work has to be done in the software using pyphen.
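
A minimal sketch of what handling this in the calling software could look like. It assumes that "surrounding punctuation" means any leading or trailing non-word characters, which is only one possible rule and may not suit every language. The names `hyphenate_ignoring_punctuation` and `insert` are hypothetical and not part of pyphen; `insert` stands for whatever hyphenation callable the application uses.

```python
import re

# Splits a token into leading non-word characters, a core, and trailing
# non-word characters. Assumption: \W is an acceptable definition of
# "punctuation" for the text being processed.
_TOKEN = re.compile(r'^(\W*)(.*?)(\W*)$', re.UNICODE | re.DOTALL)


def hyphenate_ignoring_punctuation(token, insert):
    """Apply `insert` to the core of `token` only (hypothetical helper).

    `insert` is any callable taking a word and returning it with hyphens
    inserted. The outer punctuation is re-attached unchanged, so it never
    takes part in finding hyphenation points.
    """
    prefix, core, suffix = _TOKEN.match(token).groups()
    if not core:
        # The token is empty or consists of punctuation only.
        return token
    return prefix + insert(core) + suffix
```

With pyphen, `insert` would be `dic.inserted` from the session at the top of this issue, for example `hyphenate_ignoring_punctuation('(TLA)', dic.inserted)`. Since `dic.inserted('TLA')` is reported there as returning the word unchanged, the parentheses should no longer introduce a break. Punctuation inside the word, such as the apostrophe in `don't`, stays in the core and is passed through to the hyphenator as is.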