/hyphenator

Java 17 library for text hyphenation

Primary LanguageJavaGNU Lesser General Public License v2.1LGPL-2.1

Java CI Java GitHub

hyphenator

Simple implementation of text hyphenation using pattern approach described in Franklin Mark Liang's thesis "Word Hy-phen-a-tion by Com-put-er".

Hyphenator preserves the case of the original text and hyphenates words with non-alphabetic characters as long as they are not placed in the middle of the words.

Requirements

To use this tool you need a list of hyphenation patterns for the desired language. It can be downloaded from TeX hyphenation repository. Choose the *.pat.txt file.

Alternatively, you can download a dictionary file e.g. from LibreOffice repositories. In this case choose file with "hyph" prefix e.g. hyph_pl_PL.dic for Polish language. Make sure to remove the tags at the beginning of the file and only pass the patterns themselves to the Hyphenator.

UTF-8  <---- encoding info, use it to load file
LEFTHYPHENMIN 2  <------ this value can be passed to the Hyphenator as minLeadingLength
RIGHTHYPHENMIN 2  <------ this value can be passed to the Hyphenator as minTralingLength
.ć8    <--- pattern
.4ć3ć8
.ćł8
.2ć1ń8

Example usage

Adding dependency

Library is published to maven central and can be added to your project in the standard way:

<dependencies>
  ...
  <dependency>
    <groupId>io.github.nianna</groupId>
    <artifactId>hyphenator</artifactId>
    <version>1.0.1</version>
  </dependency>

</dependencies>

Hyphenation with default settings

Input text is automatically split into tokens. By default the first and last chunk after hyphenation must be at least 2 characters long. Space is used as word separator and hyphen as syllables separator.

List<String> patterns = ... // load the patterns from the patterns file
Hyphenator hyphenator = new Hyphenator(patterns);

HyphenatedText result = hyphenator.hyphenateText("Testing (automatic) HyPHeNAtioN by computer!");

System.out.println(result.read()); // prints "Test-ing (au-to-mat-ic) Hy-PHeN-AtioN by com-put-er!"

You can also hyphenate a single token:

HyphenatedToken result = hyphenator.hyphenateToken("Testing");

System.out.println(result.read("-")); // prints "Test-ing"
System.out.println(result.hyphenIndexes()); // prints [4]

Customizing hyphenation

To skip some hyphens you can specify the following properties while creating Hyphenator instance.

  • minLeadingLength (default: 2) - hyphen can be placed only after first minLeadingLength characters
  • minTrailingLength (default: 2) - hyphen can be placed only before last minTrailingLength characters
List<String> patterns = ... // load the patterns from the patterns file
HyphenatorProperties properties = new HyphenatorProperties(3, 4);
Hyphenator hyphenator = new Hyphenator(patterns, properties);

HyphenatedText result = hyphenator.hyphenateText("Testing (automatic) HyPHeNAtioN by computer!");

System.out.println(result.read()); // prints "Testing (auto-matic) HyPHeN-AtioN by com-puter!"

Customizing input word separator

To customize the separator on which text is supposed to be split into tokens pass the tokenSeparator argument to the Hyphenator.

List<String> patterns = ... // load the patterns from the patterns file
Hyphenator hyphenator = new Hyphenator(patterns, new HyphenatorProperties(), "|");

HyphenatedText result = hyphenator.hyphenateText("Testing|(automatic)|HyPHeNAtioN|by|computer!");

System.out.println(result.read()); // prints "Test-ing (au-to-mat-ic) Hy-PHeN-AtioN by com-put-er!"

Customizing output

To customize the word or syllables separator used for creating hyphenated text pass arguments to HyphenatedText::read method.

List<String> patterns = ... // load the patterns from the patterns file
Hyphenator hyphenator = new Hyphenator(patterns);

HyphenatedText result = hyphenator.hyphenateText("Testing (automatic) HyPHeNAtioN by computer!");

String hyphenatedText = result.read("|", "_");
System.out.println(hyphenatedText); // prints "Test_ing|(au_to_mat_ic)|Hy_PHeN_AtioN|by|com_put_er!"