Jason is a programmer at Maxis, where he can be contacted at [email protected].
The C++ Standard Template Library (STL) introduces the associative containers set, multiset, map, and multimap. The foundation of these classes is a balanced binary tree called a "red-black tree." In this article, I'll examine the nature of red-black trees and the specifics of the implementations used by the Hewlett-Packard and Silicon Graphics versions of the STL.
What are Red-Black Trees?
Central to understanding red-black trees is understanding 2-3-4 trees. 2-3-4 trees are balanced trees where each node contains one, two, or three data members and two, three or four links to child nodes. Each node has one link more than it has data members. 2-3-4 trees are sorted similarly to binary trees.
A 2-node has one data member and two links, a 3-node has two data members and three links and 4-node has three data members and four links (Figure 1).
The insertion rules for 2-3-4 trees enforce balancing. New data is always placed at the bottom of the tree. If the data is being added to a 2-node, the 2-node is converted to a 3-node. If the data is being added to a 3-node, the 3-node is converted to a 4-node. If the data is being added to a 4-node, the middle data member of the 4-node is promoted to its parent node (which may be a 2-, 3-, or 4-node itself), and the remaining two data members are split into two 2-nodes, one of which receives the new data and becomes a 3-node (Figure 2).
If the middle data member is promoted to a 4-node, the 4-node is split just as its child was and its middle data member is promoted. This process continues until a node other than a 4-node or the root node is encountered. If the root node is encountered and it is a 4-node, then it is split and the tree gains a level.
For more detailed descriptions of 2-3-4 trees, see Algorithms, Second Edition, by Robert Sedgewick (Addison-Wesley, 1988) and C++ Programmer's Guide to the Standard Template Library, by Mark Nelson (IDG Books Worldwide, 1995).
Red-Black Trees
The advantage of 2-3-4 trees is that they are intrinsically balanced. However, they are costly to implement and traverse directly. This is where red-black trees come in.
Red-black trees are binary trees that implement the rules for 2-3-4 trees by establishing colors (red and black) for the links between nodes. A red link represents a link between two data members within the same 3- or 4-node, and a black link represents a link between two different 2-, 3-, or 4-nodes. Figure 3 shows how red-black nodes are combined to create 2-, 3-, and 4-nodes.
No node may have a red link to both its parent and its child, and all direct paths from a given node down to a leaf must cross the same number of black links.
Recoloring and Rebalancing
Whenever a node is inserted or removed, the tree must be recolored and rebalanced to enforce the red-black rules.
New data is red-linked to a leaf or single-child node of a red-black tree. The leaf's color configuration is then examined to determine how the tree must be recolored and rebalanced. If the leaf is black-linked to its parent, it is a 2-node or the root of a 3-node, and the new data has converted it into a 3-node or 4-node. No further recoloring or rebalancing need be done.
If the leaf is red-linked to its parent, it is part of either a 3- or a 4-node. If the leaf's parent's other child (the new data's uncle) is black-linked to its parent, the node is a 3-node. If the uncle is red-linked to its parent, the node is a 4-node.
If the node is a 3-node, the leaf is rotated into place so it becomes the root of a subtree consisting of itself, the new data, and its former parent. If the node is a 4-node, the leaf's parent is recolored so it is red-linked to its parent and the leaf and its sibling are recolored so they are black-linked to their parent. The leaf can now receive the new data as a 2-node and become a 3-node. The color configuration of the leaf's parent's subtree must now be examined to determine if it must be recolored and rotated to accommodate the new red link. This process continues upward until a node other than a 4-node is encountered. When a node is removed, the tree is rebalanced and recolored using similar techniques.
Figures 4(a) through 4(c) illustrate the recoloring and rebalancing processes. For more details on red-black recoloring and rebalancing, see "Red-Black Trees," by Bruce Schneier (DDJ, April 1992; available electronically, see "Resource Center," page 3).
The HP Implementation
Both the HP and SGI versions of the STL implement red-black trees with a template class called rb_tree; see Example 1. Its parameters are:
- Key, the type used when comparing nodes.
- Value, the type that is actually stored in the node data.
- KeyOfValue, a function object that converts a given Value type into a Key type.
- Compare, a function object that takes two instances of Key and returns True if the first is less than the second.
The binary tree itself is composed of pointers to rb_tree_node structs. rb_tree_node has a color field, pointers to its parent, left child and right child nodes, and a value field. The color field indicates the color of the link between a node and its parent.
Node management has been optimized to reduce memory fragmentation. Freed nodes are kept in a free list for reuse. This behavior is wrapped by the get_node function, which gets an available node, and the put_node function, which adds a node to the free list when it is no longer needed. As with all STL containers, memory is allocated via the Allocator class, which can be specified by the user.
The rb_tree has three important data members:
- insert_always is a status variable that tells rb_tree whether multiple instances of the same key value are allowed. This variable is set by the constructor and is used by the STL to distinguish between set and multiset and between map and multimap. set and map can only have one occurrence of a particular key, whereas multiset and multimap can have multiple occurrences.
- root is the root node in the binary tree and it is set to NIL initially. NIL is not a zero pointer but rather a node with zero pointers in its parent, right child and left child links, and a black color code.
- header is a special node whose left child points to the leftmost child in the tree, whose right child points to the rightmost child in the tree and whose parent is the root.
Red-Black Traversal: The Iterator
Like all STL container classes, rb_tree uses an iterator to traverse its data. The rb_tree iterator has a single data member, node, which is a pointer to an rb_tree_node. The ++ and -- operators move this node pointer so that it points to the next higher and next lower valued nodes in the tree, respectively. The * operator returns the value field of the current node.
The rb_tree iterator comes in const and nonconst flavors. The begin() member of rb_tree returns an iterator initialized to point to the leftmost node of the tree. This will be, of course, the lowest valued node in the tree.
The end() member of rb_tree returns an iterator, which points to header that has no defined value and is therefore the logical choice for the end() iterator, which is supposed to point "one past" the end of the data structure.
Balancing Acts: Inserting and Erasing
The main node insertion routine is a private function called __insert; see Example 2. __insert performs no traversal but rather assumes that the first two parameters describe the precise location of the insertion. Other insert functions traverse the node tree before calling __insert to perform the actual insertion, recoloring, and balancing.
First, __insert gets a new node with get_node, sets its value to v, and attaches this new node to the left or right of y (x is likely to be one of the NIL children of y, but it doesn't have to be NIL if a node is being inserted in the middle of the tree).
Then, __insert checks the color codes for y and y's other child (x's sibling) to determine if the new node is being added to a 2-, 3-, or 4-node. Once this determination is made, the tree is recolored and rebalanced.
Subtree rebalancing is implemented by the rotate_left and rotate_right functions, which rotate a subtree about a given node.
erase(iterator position) is the primary node removal function. position is an iterator which points to the node to be eliminated. There are three possibilities for position:
- position could be a leaf (for instance, both its children are NIL), in which case it can simply be removed.
- Exactly one of position's children could be NIL, in which case the other child is relinked in position's place.
- Both of position's children could be non-NIL, in which case the leftmost child of position's right subtree is relinked in position's place.
After the node is removed from the tree, the tree is recolored and rebalanced.
The SGI Implementation
Like HP, SGI has developed a freely distributed version of the STL (see "The SGI Standard Template Library," by Matthew H. Austern, DDJ, August 1997). Based on the HP implementation, SGI's version has been restructured to work more effectively with real-world compilers.
Among the issues addressed by the SGI implementation are thread safety, improved memory utilization, default template parameters, efficient code generation, and inconsistencies between real-world compiler implementations and the C++ draft standard.
With regard to rb_tree, the most important modifications were made to the node and iterator structures and to the recoloring and rebalancing algorithms.
First, the definitions of the node and iterator structures have been moved from the rb_tree class scope to the global scope, eliminating the need for compilers to parse them as members of rb_tree.
Second, the node and iterator structs have been redefined as two-level class hierarchies, with __rb_tree_node inheriting from __rb_tree_node_base and __rb_tree_iterator inheriting from __rb_tree_base_iterator. Only the child types (__rb_tree_node and __rb_tree_iterator) are template types. The base types (__rb_tree_node_base and __rb_tree_base_iterator) are not.
The recoloring and rebalancing algorithms have been moved out of the rb_tree class and into the global scope. The functions are __rb_tree_rotate_left, __rb_tree_rotate_right, __rb_tree_rebalance, and __rb_tree_rebalance_for_erase. These functions all operate on the nontemplate __rb_tree_ node_base type. Iterator traversal is implemented in the nontemplate __rb_tree_ base_iterator struct by the increment() and decrement() member functions.
Moving most of the code, which actually implements the red-black algorithms into nontemplate functions and classes makes it easier for real-world compilers to generate more efficient code more quickly. This move also allows the STL to bypass many of the problems real-world compilers encounter when parsing complex nested template declarations.
The Front Men: Associative Containers
With all this functionality, why aren't rb_trees typically used outside the STL? Simply put, it's because rb_tree's useful functionality is presented by the set, multiset, map, and multimap containers.
set and multiset are trees where the keys and values are identical. set allows only one instance of a particular value type and multiset allows multiple instances.
map and multimap are trees where the keys and values are different. Key/value pairs are stored in the tree and retrieved by searching for a key. As with set and multiset, map only allows one instance of a given key value whereas multimap allows multiple instances.
Between these four templates, rb_tree's functionality is fully exploited, making direct utilization of rb_trees unnecessary. Understanding how rb_trees are implemented, however, makes it easier to understand the consequences involved in using the STL's powerful associative container classes.
DDJ
Copyright © 1998, Dr. Dobb's Journal