Pete Becker is a software developer at Dinkumware Ltd., where he works on standard library implementation and documentation for C, C++, and Java. He is Project Editor for the C++ Standard, and for several years has written a column for C/C++ Users Journal. He is currently writing a book about TR1. He can be contacted at [email protected].
One of the most commonly requested additions to the C++ Standard Library is hash tables. Properly used, they can provide significant speed improvements in searches. There was a proposal to add STL-style hash tables to the Standard Library in 1995; it was rejected because the target date for completion of the standard did not leave enough time for proper evaluation and adaptation of the proposed changes. TR1 provides hash tables under the names unordered_map, unordered_multimap, unordered_set, and unordered_multiset [1]. These template classes are defined in the headers <unordered_map> and <unordered_set>. Listing 1 is a synopsis of the header <unordered_map>. Listing 2 is a synopsis of the header <unordered_set>. In addition to these four containers, TR1 provides a set of specializations of a template class named hash that serves as the default hash functions for these containers. Listing 3 has a synopsis of this template class and its specializations, which are defined in the header <functional>.
These containers don't quite fit into the existing set of container requirements, so TR1 provides a new set of requirements for what it formally refers to as "unordered associative containers." Don't let the name confuse you: They are not more powerful forms of associative containers. Their requirements are slightly different from the existing requirements for associative containers, so the two kinds aren't interchangeable. We'll look at their differences in more detail a little later.
One of the biggest problems in designing a generic interface to a hash table is choosing the right set of parameters for tuning the operation of the resulting objects. With too few parameters, the template is too inflexible for broad use. With too many, it's too flexible, and it becomes hard to know what it's really doing. TR1's unordered containers can be tuned at compile time by selecting suitable type arguments for hashing data values and for comparing data values for equality. They can be tuned at runtime by setting the maximum allowable load factor, which determines how many objects can be inserted in the container before it reshuffles them into more buckets, and by calling the member function rehash, which forces that reshuffling.
Hash Tables
A hash table stores elements in buckets. Each bucket can hold zero or more elements. The hash table chooses a bucket for an element based on the hash value for that element. The hash value for an element is determined by passing that element to a hash function. If this is done right, a hash table can provide constant-time searches.
For example, the code in Listing 4 stores integer values in a hash table with five buckets, where each bucket is an object of type std::list<int>. The hash function, hash, takes an argument of type int, and simply converts it to a value of type size_t [2]. The function insert calls hash with the value to be inserted, reduces the result modulo to the size of the container, and uses the result as an index to determine which linked list to append the value to. As you can see, the time it takes to insert a new element into this hash table does not depend on how many elements are already in the list. Thus, insertion into this table is O(1).
Searching this table for a value is more complicated. The function contains calls hash with the target value, reduces the result modulo the size of the container, and uses the result as an index to determine which linked list to search. Having done that, it calls the standard algorithm find to look for the target value in that linked list. In looking for the target value, find starts at the beginning of the sequence of values managed by the linked list and walks through the sequence until it finds a matching value or reaches the end of the sequence [3]. Thus, the time needed to find an element is proportional to the number of elements in the linked list. That's not a problem if there are only a few elements in each linked list, but with a fixed number of linked lists, as in this particular version of a hash table, as we add elements to the table, the linked lists become proportionately longer. On average, each linked list will have n/M elements, where n is the number of elements in the table and M is the number of buckets. When n/M is large, the time needed to find an element is proportional to the number of elements in the table, so it is much longer than the constant-time lookup that hash tables are capable of.
The key to keeping hash table searches fast is to keep the number of elements in each bucket as low as is reasonable. Because the average number of elements in each bucket is n/M, this means that we need to keep that ratio low. To do this, as n increases, we also need to increase M. That is, we need to add more buckets to the table and redistribute the elements held in the table throughout the new set of buckets. This process, known as "rehashing," is so important to maintaining the speed of hash tables that TR1 hash tables are automatically rehashed whenever the average number of elements in each bucket exceeds the table's load factor.
Hash Tables as STL Containers
The C++ Standard defines four standard categories of container types: containers, reversible containers, sequence containers, and associative containers. If you're careful, these abstract categories allow you to design an application that can be implemented with one of several different container types, leaving you the flexibility to select the actual container to use later on, when you know more about the characteristics of the real-world data that the application will handle. If your application only uses operations defined for a particular container category, you can use any container type that satisfies the requirements for that category. For example, we saw last month that the TR1 template class array is a reversible container, but not a sequence container. It can be used in place of a list, a vector, or a deque in any application that relies only on the operations defined for a reversible container.
The C++ Standard Library has four template classes that satisfy the requirements for associative containers: map, multimap, set, and multiset. They differ in how they compare objects and in how they handle insertion of objects that compare as equal. Each element in a map and a multimap is a pair of objects; the first element in the pair is the key, which the container uses to determine whether two elements are equal; the second element in the pair is the value, which is not used in comparing elements. A map won't insert a new element whose key is already present; a multimap allows multiple elements with the same key. A set uses the entire element as the key; a multiset allows multiple elements with the same key.
To be an associative container, a data structure must meet all the requirements for a container as discussed last month, and it must meet several more. Rather than present them all here, Listing 5 has a synopsis of the template class multimap and its associated free functions; the other associative containers support the same interface [4].
TR1 adds a fifth container category. Unordered containers satisfy all of the requirements for containers except that they do not support any of the six comparison operators [5]. Unordered containers also provide a set of operations that is similar to the operations available for associative containers, as well as a set of operations for examining the contents of individual buckets and tuning the operation of the container.
In addition to the container requirements, unordered containers must provide the nested type names key_type and key_equal, which name the type of the container's key and the type of an object that can be used to compare objects of type key_type for equality. They have a default constructor, a copy constructor, and an assignment operator, as well as a templated constructor that takes two iterators that designate a sequence of values to be inserted into the container. Just as with ordered containers, you can insert a value t with the member function insert(t), you can insert with a hint using insert(q, t), where q is an iterator into the container, and you can insert a sequence of values with insert(i, j), where i and j are iterators that designate a sequence of values.
To remove elements, unordered containers provide the usual clear and erase member functions. You can call erase with a value to remove all elements whose key compares equal to that value. You can also call it with an iterator that designates the element to be removed. And, finally, you can call it with a pair of iterators that designates a range of elements within the container to be removed.
To find elements, unordered containers provide the member functions find, count, and equal_range. Each of them takes a key value to search for. The member function find returns an iterator that points to an element whose key compares equal to the key value, or an iterator equal to end() if no such element exists. The member function count returns the number of elements whose keys compare equal to the key value. The member function equal_range returns a pair of iterators that designate a range of elements within the container, all of whose keys compare equal to the key value [6].
Listing 6 is a synopsis of the template class unordered_multimap. If you compare that listing with the synopsis of the template class multimap in Listing 5, you'll see that they have quite a bit in common. The main differences are that unordered_multimap does not provide reverse iterators (that is, it's not a reversible container), and it has a handful of functions for tuning its operation.
Listing 7 shows some of the container operations for the template class unordered_multimap. Although I cautioned you earlier that unordered containers are not direct replacements for associative containers, in this code example, they are interchangeable. If you change the definition of the type table to use a multimap instead of an unordered_multimap, the program still runs correctly, although it might show the container's contents in a different order.
Next Time
Next time we'll look at the rest of the interface to unordered containers: the member functions for examining and tuning the organization of the container. We'll also look at more details of the four unordered containers in TR1.
References
- [1] The more obvious names, hash_set and the like, were rejected because they are already widely used for hash tables that are somewhat different from the hashed containers in TR1.
- [2] Despite what some compilers may tell you, this conversion is well defined and meaningful. If you get a warning for this code, either complain to the compiler writer or add a cast.
- [3] The actual algorithm used in the unordered containers is more sophisticated than this, but it still degenerates into linear time when the bins are overfilled.
- [4] As is often the case, some of the associative containers provide additional members that are not in the associative container requirements. In particular, the template class map provides operator[] and, in the next revision of the Standard, the member function at to look up stored values from key values.
- [5] The reason for this omission is that the unordered containers are unordered. Comparing two unordered containers for equality is expensive, and it is not at all clear what it would mean for one unordered container object to be less than another.
- [6] This requirement means that the simple implementation of a hash table in Listing 4 can't be used as an unordered container. It doesn't group elements that compare equal together, so there might not be a valid range that holds all elements that compare equal to a given key and no others. The TR1 unordered containers use a more sophisticated technique that groups equal elements together.