When you have to store a set of strings, what data structure should you use? You could use hash tables, which sprinkle the strings throughout an array. Access is fast, but information about relative order is lost. Another option is the use of binary search trees, which store strings in order, and are fairly fast. Or you could use digital search tries, which are lightning fast, but use lots of space.
In this article, we'll examine ternary search trees, which combine the time efficiency of digital tries with the space efficiency of binary search trees. The resulting structure is faster than hashing for many typical search problems, and supports a broader range of useful problems and operations. Ternary searches are faster than hashing and more powerful, too.
We described the theory of ternary search trees at a symposium in 1997 (see "Fast Algorithms for Sorting and Searching Strings," by J.L. Bentley and R. Sedgewick, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997). In this article, we'll concentrate on what the data structure offers working programmers. Algorithms in C, Third Edition, by Robert Sedgewick (Addison-Wesley, 1998) provides yet another view of ternary search trees. For more information (including all the code in this article and the driver software), refer to http://www.cs.princeton.edu/~rs/strings/ or see "Resource Center," page 3.
A Grove of Trees
Figure 1 is a binary search tree that represents 12 common two-letter words. For every node, all nodes down the left child have smaller values, while all nodes down the right child have greater values. A search starts at the root. To find the word "on," for instance, we compare it to "in" and take the right branch. We take the right branch at "of" and the left branch at "or," and then arrive at "on." Every comparison could access each character of both words.
Figure 1: A binary search tree for 12 words.
Digital search tries store strings character by character. Figure 2 is a tree that represents the same set of 12 words; each input word is shown beneath the node that represents it. (Two-letter words lead to prettier pictures; all structures that we'll see can store variable-length words.) In a tree representing words of lowercase letters, each node has 26-way branching (though most branches are empty, and not shown in Figure 2). Searches are very fast: A search for "is" starts at the root, takes the "i" branch, then the "s" branch, and ends at the desired node. At every node, we access an array element (one of 26), test for null, and take a branch. Unfortunately, search tries have exorbitant space requirements: Nodes with 26-way branching typically occupy 104 bytes, and 256-way nodes consume a kilobyte. Eight nodes representing the 34,000-character Unicode Standard would together require more than a megabyte!
Figure 2: A digital search trie for 12 words.
Ternary search trees combine attributes of binary search trees and digital search tries. Like tries, they proceed character by character. Like binary search trees, they are space efficient, though each node has three children, rather than two. A search compares the current character in the search string with the character at the node. If the search character is less, the search goes to the left child; if the search character is greater, the search goes to the right child. When the search character is equal, though, the search goes to the middle child, and proceeds to the next character in the search string.
Figure 3 is a balanced ternary search tree for the same set of 12 words. The low and high pointers are shown as solid lines, while equal pointers are shown as dashed lines. Each input word is shown beneath its terminal node. A search for the word "is" starts at the root, proceeds down the equal child to the node with value "s," and stops there after two comparisons. A search for "ax" makes three comparisons to the first letter ("a") and two comparisons to the second letter ("x") before reporting that the word is not in the tree.
Figure 3: A ternary search tree for 12 words.
The idea behind ternary search trees dates back at least as far as 1964. In "Randomized Binary Searching with Tree Structures" (Communications of the ACM, March 1964), H.A. Clampett sketched a primitive version of the structure. Computer scientists have proven many theorems about the trees; for instance, searching for a string of length k
in a ternary search tree with n
strings will require at most O
(log
n
+k
) comparisons.
The C Structure
As far as we can tell, previous authors have viewed ternary search trees as a theoretical structure for proving theorems. We were delighted to find that the trees also lead to practical computer programs. Although we've chosen to illustrate the data structure in C, we could have just as easily chosen C++, Java, or other languages.
Each node in a ternary search tree is represented by the following structure:
typedef struct tnode *Tptr; typedef struct tnode { char splitchar; Tptr lokid, eqkid, hikid; } Tnode;
The value stored at the node is splitchar
, and the three pointers represent the three children. The root of the tree is declared to be Tptr root
. We will represent every character in each string, including the null character that terminates it.
Membership Searching
We begin with a recursive version of the search
function. It returns 1 if string s
is in the subtree rooted at p,
and 0 otherwise; it is originally called as rsearch(root, s)
:
int rsearch(Tptr p, char *s) { if (!p) return 0; if (*s < p->splitchar) return rsearch(p->lokid, s); else if (*s > p->splitchar) return rsearch(p->hikid, s); else { if (*s == 0) return 1; return rsearch(p->eqkid, ++s); } }
The first if
returns 0 if the search has run off the end of the tree. The next two if
statements take the low and high branches as appropriate. The final else
branch returns 1 if the current character is the end-of-string character 0, and otherwise moves to the next input character and to the equal branch.
Some programmers might feel more comfortable with the iterative version of the search
function (which has only one argument):
int search(char *s) { Tptr p; p = root; while (p) { if (*s < p->splitchar) p = p->lokid; else if (*s == p->splitchar) { if (*s++ == 0) return 1; p = p->eqkid; } else p = p->hikid; } return 0; }
This is substantially faster on some compilers, and we'll use it in later experiments. On most optimizing compilers, though, the run time of the recursive version is within a few percent of the iterative version.
Both versions of the search use a pattern that we will see repeatedly: If the search character is less, go to lokid
; if it is greater, go to hikid
; if it is equal, go to the next character and eqkid
. The following code illustrates that shorter is not always cleaner, simpler, faster, or better:
int rsearch2(Tptr p, char *s) { return (!p ? 0 : ( *s == p->splitchar ? (*s ? rsearch2(p->eqkid, ++s) : 1) : (rsearch2(*s < p->splitchar ? p->lokid : p->hikid, s)) )); }