Hash Set Introduction

Simple Hash Set Introduction and Implementation video (26 minutes) (Spring 2021)

Motivation

The Search Tree ADT (abstract data type) allowed various operations on a set of elements. The elements were maintained in a sorted order. This allowed operations like findMin(), findMax(), printInSortedOrder(), etc.

Search Tree ADT operations were generally O(log(n)), correlated to the height of the tree.

However, recall that array/vector indexed operations, like someVector[someIndex] are O(1), or constant time.

If we are willing to sacrifice the order maintained by a Search Tree, we can arrive at a data structure that gives set operations such as Insert, Remove, and Contains in O(1) or constant time.

This data structure is a Hash Table.

A hash table, ideally, is an array of some fixed size, which contains items.

Hashing and Hash Functions

The general idea of hashing, or hash functions, is to take a piece of data, and output some piece of smaller data. For example, if we were writing a hash function to hash names of people, which would be strings, a hash function could be "return the first character of the name".

So Edward would hash to E.

Mikel would hash to M.

Maya would hash to M.

etc.

If we were hashing numbers, a hash function could be "return the sum of all digits"

So 123 would hash to 6.

45 would hash to 9.

33 would also hash to 6.

etc.

Hash Table

We can combine the idea of an array/vector data structure, and the idea of hash functions, to create a Hash Table.

In a Hash Table, the output of a hash function is used to index into the array/vector data structure.

Since an array/vector has some fixed size (ignoring resize for a moment), and there are in theory limitless values/keys/hash ouputs to be indexed, it can often be good to use modulo in a hash function.

Eg, if a Hash Table array is of size 10, incorporating % 10 into a hash function will always give us indexes between 0 and 9.

Collisions

As mentioned above, a hash table will have some fixed size. There are infinite possible pieces of data to be placed in a Hash Table. (Names, numbers, some complex object type, etc.)

This means that there will almost always be collisions. A collision is when two separate pieces of data hash to the same index.

In the above "names" example, Mikel and Maya collided at index 'M'.

123 and 33 collided at index 6.

There are two main strategies to handle collisions in a Hash Table.

  1. Probing - If an index output by the hash function is taken, move to the next index. For example, if 33 hashes to 6, and index 6 is taken, maybe try index 7?

  2. Chaining - In our hash table array, maintain arrays at each index, instead of a single storage location. This way, if both 33 and 123 hash to 6, they can both be added to the array that exists at index 6 of the main Hash Table array.

With chaining, you are essentially splitting the collection of input data into smaller collections. Linear operations such as linear search, will be faster, in proportion to the reduced size of these collections. (Remember, hashing/hash function is O(1), a constant time operation.) If a hash table has 20 indexes, there may be, at most, a 20x performance increase, as the original collection is split into 20 pieces. This is ONLY IF hashing is random and uniform across the Hash Table container.

Design Concerns

The basic ideas behind the Hash Table ADT have been discussed above. For real implementations, a few design concerns remain.

  1. How to resize the Hash Table container?

  2. How to judge is a certain Hash function is good?

Thanks

Thanks to Brian Foster for writing up these notes based on the video lectures.