Hash Tables
Hash tables are a method of implementing the abstract data type Dictionary, which supports the following operations:
- CREATE - create an empty dictionary
- INSERT - insert an object into the dictionary
- LOOKUP - find an object in the dictionary or return an error
- REMOVE - remove an object from the dictionary.
The simplest implementation of a dictionary is a linear list. To insert an object, stick it in the first empty slot in the list. To find an object, look at each object in the list, succeed if you find it, and fail otherwise. To delete an object, mark its slot as empty. This is fine for small numbers of objects, but the time taken by the LOOKUP operation is proportional to the number of objects in the list.
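As a rough sketch of that baseline in C (string keys, a fixed-size array, and all the names are illustrative assumptions, not from the original):

    #include <string.h>

    #define TABLE_SIZE 100

    static char *list[TABLE_SIZE];             /* NULL marks an empty slot */

    /* INSERT: put the object in the first empty slot. */
    int insert(char *key)
    {
        for (int i = 0; i < TABLE_SIZE; i++)
            if (list[i] == NULL) {
                list[i] = key;
                return 1;
            }
        return 0;                              /* list is full */
    }

    /* LOOKUP: examine every slot; time grows with the number of objects. */
    int lookup(const char *key)
    {
        for (int i = 0; i < TABLE_SIZE; i++)
            if (list[i] != NULL && strcmp(list[i], key) == 0)
                return i;
        return -1;                             /* not found */
    }

    /* REMOVE: mark the object's slot empty. */
    void remove_key(const char *key)
    {
        int i = lookup(key);
        if (i >= 0)
            list[i] = NULL;
    }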
Hashing is one of the methods used to improve the time complexity of LOOKUP. Define a function that maps the data contained in an object into a table address. Instead of looking at every object in the table to see if an object is present, compute the address from the object you're seeking and look at that element of the table. Either it's there or it isn't. The time complexity of a hash LOOKUP is determined by the time it takes to compute the hash function, which is independent of the number of objects stored in the table.
The catch is that if you are using a finite table (which is generally the case!), more than one key can map to the same address. This duplicate mapping is called a collision. You can minimize collisions by increasing the size of the table; if the table is mostly empty, new objects are unlikely to collide with existing ones. But there will always be a practical limit to how much memory you can afford to waste. Therefore, the time complexity of a hashing algorithm is determined primarily by how efficiently it can resolve collisions.
Use a second function to handle collisions. Add the secondary hash value to the primary one, modulo the table size, to get a new address and, while a collision still exists, keep adding the secondary hash value to the current address (again wrapping around modulo the table size). Eventually, you will arrive at an empty slot in which to put your data.
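Here is a minimal sketch of such an INSERT in C, assuming string keys and that the primary and secondary hash values have already been computed; the names and the probe limit are illustrative, not part of the original description:

    #include <string.h>

    #define TABLE_SIZE 103                    /* a prime, for reasons discussed below */

    static char *table[TABLE_SIZE];           /* NULL marks an empty slot */

    /* INSERT: start at the primary address; on a collision, keep adding the
       secondary hash value (modulo the table size) until an empty slot appears. */
    int insert(char *key, unsigned primary, unsigned secondary)
    {
        unsigned addr = primary % TABLE_SIZE;

        for (int probes = 0; probes < TABLE_SIZE; probes++) {
            if (table[addr] == NULL) {        /* empty slot: claim it */
                table[addr] = key;
                return 1;
            }
            if (strcmp(table[addr], key) == 0)
                return 1;                     /* already present */
            addr = (addr + secondary) % TABLE_SIZE;
        }
        return 0;                             /* table is full */
    }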
The best hash function is the simplest one that sufficiently minimizes collisions. Knuth discusses several hash functions, but the best is based on modulo arithmetic and careful selection of the table size. If the table size is prime, the objects tend to be distributed evenly through the table. So your primary hash function becomes:
H1(data) = data MOD tablesize
or, more properly:
H1(data) = map(data) MOD tablesize
where map is some function that derives an integer from the data in the key.
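For string keys, a sketch of these two pieces might look like the following; the particular map shown (a running multiply-and-add over the characters) is just one illustrative choice:

    #define TABLE_SIZE 103                     /* a prime table size */

    /* map: derive an integer from the characters of the key */
    unsigned long map(const char *key)
    {
        unsigned long n = 0;
        while (*key)
            n = n * 31 + (unsigned char)*key++;
        return n;
    }

    /* H1: the primary hash function, map(data) MOD tablesize */
    unsigned h1(const char *key)
    {
        return (unsigned)(map(key) % TABLE_SIZE);
    }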
The secondary hash value must let the probe sequence reach every slot in the table. To guarantee this, make it a number relatively prime to the table size: then you can start at any address in the table and visit every other address by repeatedly adding in the secondary hash value.
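You can convince yourself of this with a short sketch like the one below, which counts how many distinct addresses are reached by repeatedly adding a step that is relatively prime to the table size (the specific numbers are illustrative):

    #include <stdio.h>

    int main(void)
    {
        const int tablesize = 103;             /* the table size */
        const int step = 45;                   /* relatively prime to 103 */
        int visited[103] = {0};
        int addr = 0, count = 0;

        do {                                   /* keep adding the step, wrapping around */
            if (!visited[addr]) {
                visited[addr] = 1;
                count++;
            }
            addr = (addr + step) % tablesize;
        } while (addr != 0);

        printf("distinct addresses visited: %d of %d\n", count, tablesize);
        return 0;
    }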
The best behavior of all comes from using twin primes -- two prime numbers p1 and p2 with p1 - p2 = 2 -- taking one as the table size and basing the secondary hash on the other. (Why this is the case isn't within the scope of this article; take it on faith or read Knuth.)
LOOKUP closely follows the logic of INSERT. The sequence of addresses produced by repeatedly adding the secondary hash value is called the collision list for a particular object. To find an object, look down its collision list until you find either the object or an empty slot.
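Putting the pieces together, one way to realize the twin-prime idea is sketched below, with 103 as the table size and 101 as the base of the secondary hash; table, map, and h1 are assumed to come from the earlier sketches, and the names are illustrative:

    #include <string.h>

    #define TABLE_SIZE 103                     /* paired with 101, its twin prime */

    extern char *table[TABLE_SIZE];            /* NULL marks an empty slot */
    unsigned long map(const char *key);        /* from the earlier sketch */
    unsigned h1(const char *key);              /* primary hash: map(key) MOD TABLE_SIZE */

    /* H2: the secondary hash; always nonzero so the probe sequence moves,
       and based on the twin prime 101 (TABLE_SIZE - 2) */
    unsigned h2(const char *key)
    {
        return 1 + (unsigned)(map(key) % (TABLE_SIZE - 2));
    }

    /* LOOKUP: follow the object's collision list until the object
       or an empty slot turns up. */
    int lookup(const char *key)
    {
        unsigned addr = h1(key);
        unsigned step = h2(key);

        for (int probes = 0; probes < TABLE_SIZE; probes++) {
            if (table[addr] == NULL)
                return -1;                     /* empty slot: not present */
            if (strcmp(table[addr], key) == 0)
                return (int)addr;              /* found */
            addr = (addr + step) % TABLE_SIZE;
        }
        return -1;                             /* probed every slot without success */
    }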