Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Database

Ternary Search Trees


When you have to store a set of strings, what data structure should you use? You could use hash tables, which sprinkle the strings throughout an array. Access is fast, but information about relative order is lost. Another option is the use of binary search trees, which store strings in order, and are fairly fast. Or you could use digital search tries, which are lightning fast, but use lots of space.

In this article, we'll examine ternary search trees, which combine the time efficiency of digital tries with the space efficiency of binary search trees. The resulting structure is faster than hashing for many typical search problems, and supports a broader range of useful problems and operations. Ternary searches are faster than hashing and more powerful, too.

We described the theory of ternary search trees at a symposium in 1997 (see "Fast Algorithms for Sorting and Searching Strings," by J.L. Bentley and R. Sedgewick, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997). In this article, we'll concentrate on what the data structure offers working programmers. Algorithms in C, Third Edition, by Robert Sedgewick (Addison-Wesley, 1998) provides yet another view of ternary search trees. For more information (including all the code in this article and the driver software), refer to http://www.cs.princeton.edu/~rs/strings/ or see "Resource Center," page 3.

A Grove of Trees

Figure 1 is a binary search tree that represents 12 common two-letter words. For every node, all nodes down the left child have smaller values, while all nodes down the right child have greater values. A search starts at the root. To find the word "on," for instance, we compare it to "in" and take the right branch. We take the right branch at "of" and the left branch at "or," and then arrive at "on." Every comparison could access each character of both words.

Figure 1: A binary search tree for 12 words.

Digital search tries store strings character by character. Figure 2 is a tree that represents the same set of 12 words; each input word is shown beneath the node that represents it. (Two-letter words lead to prettier pictures; all structures that we'll see can store variable-length words.) In a tree representing words of lowercase letters, each node has 26-way branching (though most branches are empty, and not shown in Figure 2). Searches are very fast: A search for "is" starts at the root, takes the "i" branch, then the "s" branch, and ends at the desired node. At every node, we access an array element (one of 26), test for null, and take a branch. Unfortunately, search tries have exorbitant space requirements: Nodes with 26-way branching typically occupy 104 bytes, and 256-way nodes consume a kilobyte. Eight nodes representing the 34,000-character Unicode Standard would together require more than a megabyte!

Figure 2: A digital search trie for 12 words.

Ternary search trees combine attributes of binary search trees and digital search tries. Like tries, they proceed character by character. Like binary search trees, they are space efficient, though each node has three children, rather than two. A search compares the current character in the search string with the character at the node. If the search character is less, the search goes to the left child; if the search character is greater, the search goes to the right child. When the search character is equal, though, the search goes to the middle child, and proceeds to the next character in the search string.

Figure 3 is a balanced ternary search tree for the same set of 12 words. The low and high pointers are shown as solid lines, while equal pointers are shown as dashed lines. Each input word is shown beneath its terminal node. A search for the word "is" starts at the root, proceeds down the equal child to the node with value "s," and stops there after two comparisons. A search for "ax" makes three comparisons to the first letter ("a") and two comparisons to the second letter ("x") before reporting that the word is not in the tree.

Figure 3: A ternary search tree for 12 words.

The idea behind ternary search trees dates back at least as far as 1964. In "Randomized Binary Searching with Tree Structures" (Communications of the ACM, March 1964), H.A. Clampett sketched a primitive version of the structure. Computer scientists have proven many theorems about the trees; for instance, searching for a string of length k in a ternary search tree with n strings will require at most O(log n+k) comparisons.

The C Structure

As far as we can tell, previous authors have viewed ternary search trees as a theoretical structure for proving theorems. We were delighted to find that the trees also lead to practical computer programs. Although we've chosen to illustrate the data structure in C, we could have just as easily chosen C++, Java, or other languages.

Each node in a ternary search tree is represented by the following structure:

typedef struct tnode *Tptr; 
typedef struct tnode { 
   char splitchar; 
   Tptr lokid, eqkid, hikid; 
} Tnode; 

The value stored at the node is splitchar, and the three pointers represent the three children. The root of the tree is declared to be Tptr root. We will represent every character in each string, including the null character that terminates it.

Membership Searching

We begin with a recursive version of the search function. It returns 1 if string s is in the subtree rooted at p, and 0 otherwise; it is originally called as rsearch(root, s):

int rsearch(Tptr p, char *s) 

{   if (!p) return 0; 
    if (*s < p->splitchar) 
         return rsearch(p->lokid, s); 
    else if (*s > p->splitchar) 
        return rsearch(p->hikid, s); 
    else { 
        if (*s == 0) return 1; 
        return rsearch(p->eqkid, ++s); 
   } 
} 

The first if returns 0 if the search has run off the end of the tree. The next two if statements take the low and high branches as appropriate. The final else branch returns 1 if the current character is the end-of-string character 0, and otherwise moves to the next input character and to the equal branch.

Some programmers might feel more comfortable with the iterative version of the search function (which has only one argument):

int search(char *s) 
{   Tptr p; 
    p = root; 
    while (p) { 
       if (*s < p->splitchar) 
           p = p->lokid; 
       else if (*s == p->splitchar) { 
          if (*s++ == 0) 
              return 1; 
          p = p->eqkid; 
      } else 
          p = p->hikid; 
    } 
    return 0; 
} 

This is substantially faster on some compilers, and we'll use it in later experiments. On most optimizing compilers, though, the run time of the recursive version is within a few percent of the iterative version.

Both versions of the search use a pattern that we will see repeatedly: If the search character is less, go to lokid; if it is greater, go to hikid; if it is equal, go to the next character and eqkid. The following code illustrates that shorter is not always cleaner, simpler, faster, or better:

int rsearch2(Tptr p, char *s) 

{ return (!p ? 0 : ( 
        *s == p->splitchar 
       ? (*s ? rsearch2(p->eqkid, ++s) : 1) 
       : (rsearch2(*s < p->splitchar ? 
                p->lokid : p->hikid, s)) 
     )); 
} 


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.