Hash table: Difference between revisions
imported>Nick Johnson (Initial) |
imported>Nick Johnson m (pigeon hole --> pigeonhole) |
||
Line 14: | Line 14: | ||
Since some key types are not [[enumerable]] (read: they have a larger [[cardinality]] than integers), the hash function of such keys must sometimes produce the same hash value for two distinct keys. | Since some key types are not [[enumerable]] (read: they have a larger [[cardinality]] than integers), the hash function of such keys must sometimes produce the same hash value for two distinct keys. | ||
Similarly, because hash tables execute on real computers, the array of value elements must be [[finite]]. Generally, hash algorithms will compute the hash of each key element [[modulo]] the size of the array. As a result, if N elements are added to a hash table that has an array of M elements, and if N > M, then by the [[ | Similarly, because hash tables execute on real computers, the array of value elements must be [[finite]]. Generally, hash algorithms will compute the hash of each key element [[modulo]] the size of the array. As a result, if N elements are added to a hash table that has an array of M elements, and if N > M, then by the [[pigeonhole principle]], some elements must be assigned to the same array indices. This situation is called a ''hash collision,'' since two keys collide on a paticular array element. A hash function that produces no collisions is called a [[perfect hash function]]. | ||
A programmer can minimize the occurrance of hash collisions by carefully selecting a hash function. Such a selection must be made against the type of the key elements, the size of the hash table, and typical use of the application. In all but a few special situations, it is impossible to totally avoid hash collisions. | A programmer can minimize the occurrance of hash collisions by carefully selecting a hash function. Such a selection must be made against the type of the key elements, the size of the hash table, and typical use of the application. In all but a few special situations, it is impossible to totally avoid hash collisions. |
Revision as of 15:29, 21 February 2007
In computer science, a hash table is an unordered data structure which provides O(1) insert or lookup. A hash table operates on pairs of keys and values, where the key may be the value.
Hash tables are often used to implement dictionaries or sets. Every internet router uses a hash table to correlate ranges of MAC addresses to route paths.
How a Hash Table Works
In short, a hash table is an array in which value elements are stored. The address within that array is computed from the value element by the hash function. Any type of data can be used as key or value elements, and the key and value elements may be the same. The only requirement is the existence of a hash function which maps objects of the key type to integers.
For instance, if we wanted to use a hash table to store telephone numbers, indexed against the owner's name, we would have strings as key elements, and numbers as value elements. Our hash function would, given a string, produce an integer. We would then store or retrieve that person's phone number of that index in the array of phone numbers. By using this approach, we avoid costly search algorithms that must check multiple elements to find the one in question. As a result, insertion, deletion, or lookup of an element in a hash table takes only O(1).
There are a few special cases.
Hash Collisions
Since some key types are not enumerable (read: they have a larger cardinality than integers), the hash function of such keys must sometimes produce the same hash value for two distinct keys. Similarly, because hash tables execute on real computers, the array of value elements must be finite. Generally, hash algorithms will compute the hash of each key element modulo the size of the array. As a result, if N elements are added to a hash table that has an array of M elements, and if N > M, then by the pigeonhole principle, some elements must be assigned to the same array indices. This situation is called a hash collision, since two keys collide on a paticular array element. A hash function that produces no collisions is called a perfect hash function.
A programmer can minimize the occurrance of hash collisions by carefully selecting a hash function. Such a selection must be made against the type of the key elements, the size of the hash table, and typical use of the application. In all but a few special situations, it is impossible to totally avoid hash collisions.
Several variants of hash tables exist which handle hash collisions in different ways.
- separate chaining---by far, the simplest---stores an array of linked lists of value elements. If two keys have the same hash value, then they are both inserted into the list. In the expected case, when each list contains no more than one element, then insertions, deletions and lookups are still O(1). If, on the other hand, the hash function has poor distribution, some lists may grow very long. In that case, insertions, deletions and lookups slow to O(n), where n is the length of the longest list.
- linear probing attempts to store the element in the next available array index, tested in sequential order (for instance, the nth, the n+1th, the n+2th, and so on). In that sense, linear probing is an extension to the hash function. Linear probing has two major problems: first, it tends to worsen the distribution of the hash function by creating clusters of occupied array indices, and second it gives the hash table a finite size since the number of array indices does not increase. In the worst case insertions, deletions and lookups slow to O(n), where n is the length of the longest cluster.
- quadratic probing is very similar to linear probing, except a different sequence is used to determine next available array index. Instead of testing them sequentially, quadratic probing tests the nth, the nth+1, the nth+2, then nth+4, the nth+9, and so on. Unlike linear probing, quadratic probing does not tend to produce clusters.
- double hashing stores a hash table of hash tables. This method is most effective when two orthogonal hash functions can be found, and is thus limited by problem domain.