Factors That Impact Performance
But hash-based containers are not a silver bullet, and wall-clock time varies dramatically based on the hashing function used, the number of objects stored in the container, the time it takes to check values for equality, and other factors. Knowing how these factors interact is crucial to harnessing the performance improvements offered by hash containers.
To understand how each of these factors impacts performance, you have to understand how hash-based containers are implemented. Consider hash sets.
Hash sets can be thought of as arrays of linked lists. Each array element, denoted in Figure 1 as B0...B4, is a "bucket." When a new item is inserted into the hash set, the item's hash value is computed. Next, the bucket number is computed by taking the hash value modulo the number of buckets. In Figure 1, item A, whose hash value is 7, is stored in bucket 2 (hash value 7 modulo 5 buckets = bucket 2). When you call find(A) to check whether item A is in the hash set, the program computes A's hash value of 7 and looks in bucket 2 to see if A is there. Since there is only one element in bucket 2, this is relatively fast. But even if there is only one element in a bucket, find() still needs to check whether this is the particular element you are searching for, which involves a call to operator==. If equality comparison is slow, the performance of your hash container degrades.
Figure 1: Array elements.
Now suppose you insert another item, item Z, into your hash set. If item Z has a hash value of 12, and you do not increase the number of buckets in the hash set, then item Z also ends up in bucket 2 (Figure 2). When more than one item is in a hash bucket, find() may become slower. If you call find(Z), the program again looks in bucket 2, but it first has to examine item A before moving on to the next element in the linked list and telling you that Z is in fact in your hash set. Thankfully, as the number of items in your hash set grows, the number of buckets automatically increases. This reduces the chance of two items with different hash values ending up in the same bucket.
Figure 2: Inserting another item into a hash set.
But what if you had many items that had the same hash value, as in Listing One? When two different items hash to the same value, there is a "hash collision." These items always end up in the same bucket, no matter how many buckets you have. This significantly degrades the performance of the collection. If hash collisions happen often enough, finding items in the container effectively becomes O(n) instead of O(1), because you have to traverse the linked lists that form in the hash buckets.
struct terribleHasher {
    size_t operator()(const myClass& myObj) const {
        return 1;  // every object hashes to the same value
    }
};
Listing One
In addition to being relatively unique, hash values should also be very fast to compute. The SGI STL provides some basic hash functions. Most primitives hash to themselves: ints, chars, and longs are all simply cast to size_t, which is the return type for STL-compatible hash functions. The SGI STL also includes a function to hash char*s (C strings, for instance) that iterates over each character and applies a simple recursive formula.
Handling other simple types is nearly as easy. Just as with primitives, it makes sense to hash anything that can be quickly and uniquely cast to a size_t to its own value. This cast is trivial and therefore extremely fast. Since it is obviously unique, it should be preferred over other hash functions whenever possible. But rather than reimplementing this hashing concept for each type that can be cast to size_t, use the template in Listing Two.
template< typename T_TypeToHash >
struct SizeTCastHasher {
    size_t operator()( const T_TypeToHash& i_TypeToHash ) const {
        return size_t( i_TypeToHash );
    }
};
Listing Two
One use case for the SizeTCastHasher is hashing a pointer. For instance, suppose you are writing a program for an Internet advertising business. You have a class adFactory that creates objects of type myAdvertisement. The factory class also places the ads on websites by calling myWebsite.placeAd(); see Listing Three. The website keeps track of the ads that have been placed on it by putting each ad into a hash_set. Since each myAdvertisement is created at a new location in memory, you can use the address of the myAdvertisement object (the value of the pointer) as the key into the set, and cast the pointer to size_t when you need to calculate the hash value of the key. Of course, this restricts memory management somewhat, since two identical objects at different memory locations won't hash to the same value. Consequently, you must ensure that you have only one copy of a given myAdvertisement in memory. However, in exchange for satisfying this restriction, calling lookup functions like website.doWeShowAd() is now very fast, because the hash function is both computationally trivial and hashes each object to a unique value.
class adFactory {
    ...
    myAdvertisement* createAd(TAdType adType) {
        myAdvertisement* newAd = new myAdvertisement(adType);
        myWebsite* website = websiteFactory.getBestWebsite(newAd);
        website->placeAd(newAd);  // website is a pointer, so use ->
        return newAd;
    }
    ...
};

class myWebsite {
public:
    ...
    void placeAd(const myAdvertisement* const ad) {
        myAds.insert(ad);
    }
    bool doWeShowAd(const myAdvertisement* ad) const {
        return myAds.find(ad) != myAds.end();
    }
    ...
private:
    __gnu_cxx::hash_set< const myAdvertisement*,
                         SizeTCastHasher< const myAdvertisement* > > myAds;
};
Listing Three
When designing hash functions for more complex types, it is important to know as much as possible about your data. The range and distribution of the data are particularly important. For example, suppose that in the advertising application, you want to know how many people saw a given ad on a given website. One way to implement this would be a hash map that hashes an ad and a website to a size_t. Further suppose that each ad and website has a unique identifier associated with it. For instance, this could be a row number that our application assigned to the ad or website when it was loaded from a database. In this scenario, you would like the hash map to hash a pair of unsigned integers (the row numbers of the ad and the website) to another unsigned integer. If you assume that the IDs of the website and the ad are both smaller than 2^16, then you can pack both identifiers into a single size_t without any hash collisions. ShiftedPairHasher (in Listing Four) does exactly that.
struct ShiftedPairHasher {
    size_t operator()(std::pair<unsigned int, unsigned int> key) const {
        // Shift key.first into the high half of the word, then fold in
        // key.second.
        return (key.first << (sizeof(unsigned int) * 8 / 2)) ^ key.second;
    }
};
Listing Four