/prefix-search

Implement prefix search using ternary search tree

Primary LanguageC

Ternary Search Tree (tst)

A ternary search tree is a binary search tree (bst) with an additional node pointer.

There is a relative dearth of information available about ternary trees, and specifially, proper node-rotation when a node is deleted from the tree. The only examples for a node-delete found are simple deletes that leave the tree dirty by leaving a node, with or withoutsiblings, but having no valid middle (or equal) pointer.

With a binary search tree each node contains a left and right node-pointer so a binary choice controls traversal. Either a greater than or less than choice. As the result of a comparison between the node-data and a reference either the left or right node-pointer is used to further descend.

A ternary search tree adds a third (or middle) node. While you can still use the left-middle-right notation, ternary trees use a low-equal-high pointer verbiage. With each node holding a key ID, the node pointers for the this code uses a lokid, eqkid, and hikid pointer naming convention. If the difference between a refernence and the node key is negative (lower than), the lokid node is traversed, if they are equal, the eqkid node is followed or hikid is followed.

In addition to the node->key, the node used in this example adds a reference count (a node->refcnt) to the node data to track the number of occurrances for each word it holds. So for example, if using the tree to track the words in an editor buffer (where there may be multiple occurrences of 'the' or other common words), the node holding 'the' is not deleted, upon delete, until no other occurrences remain (i.e. the node->refcnt is zero).

Each individual node has the following form:

                              o
                              |-key
                              |-refcnt
                  ------------+------------
                  |lokid      |eqkid      |hikid
                  o           o           o

The string data (a pointer to a word, or a copy of the word itself) is stored in an additional special node following the node containing the last character (node->key) in the search path for the word, cast to type (node *) and stored as the node->eqkid pointer. Further, since this is the node after the last character, its key is the nul-character (decimal 0) just as you would expect when ending a string. Thus, the 'key's for each of the nodes that make up the search path of a word, are the letters in the word with the final node having a key 0 with either a pointer to the string (if stored in an external data structure) or an allocated copy of the string itself if the string is to be stored in the tree. (as in holding the words for an edit buffer, where the location/address for the string changes with each keypress). In either case, the traversal down nodes to the final node will have a form similar to the following for the word "cat":

                              o
                              |-c
                              |-0
                  ------------+------------
                  |lokid      |eqkid      |hikid
                  o           o           o
                              |-a
                              |-0
                          ----+----
                          |   |   |    note: any of the lokid or hikid nodes
                              o              can also have pointers to nodes
                              |-t            for words that "cat" or "ca" is
                              |-0            a partial prefix to.
                          ----+----
                          |   |   |
                              o
                              |-0
                              |-1    <== the refcnt is only relevant to the final node
                          ----+----
                          |   |   |
                        NULL  o  NULL
                            "cat"

The ternary tree has the same O(n) efficiency for insert and search as does a bst. The delete is only slightly worse due to the proper deletion of the chain of unique nodes that make a word and proper rotation to eliminate the final node containing siblings. Lookup times associated with loading the entire /usr/share/dict/words file and searching range between 0.000002 - 0.000014 sec. However, the prefix search ability offered by the ternary search tree sets it apart from virtually all other data stuctures. While Tri/Radix trees can perform as well, their memory requirements are often 20 times more than a ternary tree.

The benefit of a ternary tree for prefix searching of text lies in its ability to quickly traverse a tree of any size finding the node containing the last character in the wanted prefix. An in-order traversal of that node identifies all strings in the tree containing the prefix.