Inserting a New String
The insert
function inserts a new string into the tree, and does nothing if it is already present. We insert the string s
with the code root = insert(root, s);
. The first if
statement detects running off the end of new node, initializes it, and falls through to the standard case. Subsequent code takes the appropriate branch, but branches to eqkid
only if characters remain in the string.
Tptr insert(Tptr p, char *s) { if (p == 0) { p = (Tptr) malloc(sizeof(Tnode)); p->splitchar = *s; p->lokid = p->eqkid = p->hikid = 0; } if (*s < p->splitchar) p->lokid = insert(p->lokid, s); else if (*s == p->splitchar) { if (*s != 0) p->eqkid = insert(p->eqkid, ++s); } else p->hikid = insert(p->hikid, s); return p; }
This code is short but subtle, and worth careful study.
The resulting tree stores the strings themselves, but no other information. Real symbol tables store additional data with each string. To illustrate this problem, we'll store a pointer to every string in the tree; this data will be used by later search algorithms. We could add a new info
field to each node, but that would be wasteful. Instead, we'll exploit the fact that a node with a null splitchar
cannot have an eqkid
, and store the data in that field. We could make a union
that contains the two possible pointers, but that would be syntactically clumsy (we would have to refer to the union at each reference). We will instead use the sleazy hack of coercing a string into a Tptr
. The code inside *s == p->splitchar
test, therefore, becomes the following:
if (*s == 0) p->eqkid = (Tptr) insertstr; else p->eqkid = insert(p->eqkid, ++s);
Faster Insertion Functions
We can improve insertion in two different ways. We'll start by tuning the insertion code, and consider different orders in which we can insert the nodes into the tree.
A major cost of the insert
function is calling malloc
to create each node. The function insert2
in the demo program (available electronically) uses the ancient C technique of allocating a buffer of nodes and dealing them out as needed. Profiling shows that this eliminates most of the time spent in storage allocation. We measured the time to insert the 234,936 words in a dictionary file (one word per line, for a total of 2,486,813 characters). Table 1 lists the run times in seconds on three 200-MHz machines, with the compilers optimizing code for speed. The insert3
function (also available electronically) uses other common techniques to reduce the run time even further: transforming recursion to iteration, keeping a pointer to a pointer, reordering tests, saving a difference in a comparison, and splitting the single loop into two loops. These changes are beneficial on all machines, and substantial on the SPARC.
Table 1: Run times of insertion functions.
Better Insertion Orders
In what order should you insert the nodes into a tree? No matter in what order you insert the nodes, you end up with the same digital search trie -- the data structure is totally insensitive to insertion order. Binary search trees are at the opposite end of the spectrum: If you insert the nodes in a good order (middle element first), you end up with a balanced tree. If you insert the nodes in sorted order, the result is a long skinny tree that is very costly to build and search. Fortunately, if you insert the nodes in random order, a binary search tree is usually close to balanced.
Ternary search trees fall between these two extremes. You can build a completely balanced tree by inserting the median element of the input set, then recursively inserting all lesser elements and greater elements. A simpler approach first sorts the input set. The recursive build
function inserts the middle string of its subarray, then recursively builds the left and right subarrays. We use this method in our experiments; it is fast and produces fairly well-balanced trees. The cost of inserting all words in a dictionary with function insert3
is never more than about 10 percent greater than searching for all words. D.D. Sleator and R.E. Tarjan describe theoretical balancing algorithms for ternary search trees in "Self-Adjusting Binary Search Trees" (Journal of the ACM, July 1985).
Comparison to Other Data Structures
Symbol tables are typically represented by hash tables. We conducted a simple experiment to compare ternary search trees to that classic data structure; the complete code is available electronically.
To represent n
strings, our hash code uses a chained table of size tabsize=n
. The hash function, from Section 6.6 of B. Kernighan and D. Ritchie's The C Programming Language
, Second Edition (Prentice Hall, 1988), is reasonably efficient and produces good spread:
int hashfunc(char *s) { unsigned n = 0; for ( ; *s; s++) n = 31 * n + *s; return n % tabsize; }
The body of the search
function is as follows:
for (p = tab[hashfunc(s)]; p; p = p->next) if (strcmp(s, p->str) == 0) return 1; return 0;
For fair timing, we replaced the string comparison function str
cmp with inline code (so the hash and tree search
functions used the same coding style). We used the same dictionary to compare the performance of hashing and ternary trees. We first built the tree by inserting each word in turn (middle element first, then recurring). Finally, we searched for each element word in the input file. Table 2 presents the times in seconds for the two data structures across three machines. For both building and (successful) searching, the two structures have comparable search times.
Table 2: Run times of data structures.
Ternary search trees are usually noticeably faster than hashing for unsuccessful searches. Ternary trees can discover mismatches after examining only a few characters, while hashing always processes the entire key. On some data sets with very long keys and mismatches in the first few characters, ternary trees took less than one-fifth the time of hashing.
Functions insert3
and search
combine to yield a time-efficient symbol table. On one dictionary described at our web site, ternary search trees used about three times the space of hashing. An alternative representation of ternary search trees is more space efficient: When a subtree contains a single string, we store a pointer to the string itself (and each node stores three bits telling whether its children point to nodes or strings). This leads to code that is less efficient, but it reduces the number of tree nodes close to the space used by hashing.
We analyzed the worst-case performance of several aspects of ternary search trees in our theoretical paper. In "The Analysis of Hybrid Trie Structures" (Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998), J. Clement, P. Flajolet, and B. Valle analyze the expected performance of a broad class of hybrid trees that includes ternary trees, and describe extensive experiments to support their theory.
Other Operations on Ternary Trees
Most standard techniques on binary search trees can be applied immediately to their ternary cousins. For instance, we can print the strings in the tree in sorted order with a recursive traversal:
void traverse(Tptr p) { if (!p) return; traverse(p->lokid); if (p->splitchar) traverse(p->eqkid); else printf("%s/n", (char *) p->eqkid); traverse(p->hikid); }
Simple recursive searches can find the predecessor or successor of a given element or list all items in a given range. If we add a count
field to every node, we can quickly count the elements in a given range, count how many words begin with a given substring, or select the m
th largest element. Most of these operations require logarithmic time in a ternary tree, but linear time in a hash table.