/word-breaker

Application that split longer words into meaningful constituents.

Primary LanguageC#Apache License 2.0Apache-2.0

Word Breaker

Module for get meaningful parts of compound words.

Example

example

Algorithm

The algorithm is based on the comparison of words with the dictionary. A dictionary is a large set of words, it used for comparison. The minimum possible length for a subword is equal 3, to exclude extra iterations and non-valued subwords.

Simple:

for each compound {
    for each possible_word {
        if (compound.Contains(possible_word)) {
            results.Add(possible_word)
        }
    }
}

Advanced:

for each compound {
    for each character except the first one {
        get possible_words starting with this character
        for each possible_word {
            if (compound contains the word at the character’s position) {
                results.Add(possible_word)
            }
        }
    }
}

Data structure

The resulting data structure consists of word lists indexed by two properties:

  • Length of the words
  • Two characters of the words
Dictionary<int, Dictionary<string, List<string>>>

where the int is the word length, the string is the first two characters and the lists contain the suitable words.

Search by length and first two letters in HashTable minimize word list for Contains() method. This significantly speeds up the program.

Notes

  • The algorithm can be modified to get all subwords
  • The algorithm can also be modified to receive first shorter or longer subword
  • Be careful to work with umlaut
  • If you have any remarks, let me know