Hash Table Probing

Probing Hash Table Implementation (Probing) (45 minutes) (Spring 2021)

Motivation

As mentioned, linear probing results in adjacent clusters of occupied hash indexes.

This results in a performance decrease, as those adjacent clusters must be linear searched during insert() and contains() operations.

Linear Probing

If h(k) is a hash function of some key/element k, linear probing can be formalized as:

hashIndex(k) = h(k) + i

Where i = probing iterations, and starts at 0, increasing by 1 each iteration.

You can linearly probe by some constant, maybe advancing 2, 5, or 100 hash indexes each iteraton, but this is still linear probing, just advanced by some constant c.

hashIndex(k) = h(k) + ci

Quadratic Probing

As mentioned, quadratic probing somewhat avoids the adjacent clusters that linear probing can create.

Quadratic probing can be formalized as :

hashIndex(k) = h(k) + (i * i) * c

or

hashIndex(k) = h(k) + ((i * i) * c0) + (i * c1)

Quadratic probing creates gaps between the adjacent clusters.

Adjacent clusters will still exist with quadratic probing, but since you are not linearly probing to the next adjacent hash index, the clusters will be less of a performance problem with quadratic probing, than with linear probing.

General Probing and Collision Resolution

The above formulas for hashIndex(k) can be generalized as:

hashIndex(k) = h(k) + f(i)

f(i), some formula based on the number of probing iterations, can be thought of as the Collision Resolution Strategy. Maybe it's advancing by some constant using linear probing, maybe it's advancing by squares using quadratic probing...whatever is happening, the goal of f(i) is to resolve collisions.

Double Hashing

Another probing or collision resolution strategy is double hashing. This is when a second hashing function is introduced, and multipled by the probing iteration. This even further reduces the adjacent clusters problem.

hashIndex(k) = h(k) + i * h2(k)

(where h(k) and h2(k) are two unique hash functions)

However, h2(k) must never evaluate to 0. This would mean just re-trying the same hash index!

If you are using double hashing, the hash table size should be a prime number.

A secondary hash function of h2(k) = R - k % R, where R is some prime that is smaller than the prime table size, may perform well.

Thanks

Thanks to Brian Foster for writing up these notes based on the video lectures.