Bucket Hash Set: insert and contains

Simple Hash Set Implementation (insert and contains, with buckets) (28 minutes) (Spring 2021)

This lecture covers a chaining hash table. A hash index is calculated, and elements at that hash index are added to a set. This gives our hash table a "collection of collections", or "vector of sets" characteristic.

  vector<set<string> > names;

Load Factor

The number of buckets in a hash table needs to scale with the number of elements, n.

Load Factor is defined as the (number of elements) / (number of buckets).

For both chaining hash tables and probing hash tables, there is an optimal load factor, or load factor range.

Chaining hash tables should be kept to Load Factor < 1.

Probing hash tables should be kept to Load Factor < 0.5

The general process of filling a hash table is:

  1. Insert elements until Load Factor > Max Desired Load Factor

  2. Double the size of the hash table, and insert elements from the old, smaller hash table into the new, larger hash table.

This process, of increasing size and re-inserting elements, is called rehashing.

Rehash

Since the size of a hash table is increased (doubled) during a Rehash, elements may end up in different buckets.

This is because the hash index of an element is calculated by hashFunction(element) % tableSize. Since the tableSize changes, the hash index may change.

We can implement a rehash function modelled after std::unordered_map::rehash.

  void rehash(int new_num_buckets) {
    vector<set<string> > new_names;
    new_names.resize(new_num_buckets);
    for (const set<string>& bucket : names) {
      for (const string& name : bucket) {
        new_names[name_hash(name) % new_names.size()].insert(name);
      }
    }
    names = move(new_names);
  }

Contains

The general algorithm for contains() in a chaining hash table is:

  1. Get the hash index for the element/key that you're searching for

  2. If the collection at that hash index is not empty, iterate over the collection, looking for the element/key.

  3. "Not found" would be an empty hash index, or reaching the end() iterator, in the collection at that hash index.

  bool contains(const string& name) {
    const set<string>& subset = names[name_hash(name) % names.size()];
    return subset.find(name) != subset.end();
  }

Insert

The general algorithm for insert() in a chaining hash table is:

  1. Get the hash index for the element/key that you're inserting

  2. If the collection at that hash index does not contain the element/key, insert it.

  3. If the element/key is already in the collection at that hash index, there's nothing to do!

  bool insert(const string& name) {
    ++size_;
    if (size_ > names.size() * max_load_factor) {
      rehash(2 * names.size());
    }
    return names[name_hash(name) % names.size()].insert(name).second;
  }

Full Example Code

This is also available attached to the module.

#include <iostream>
#include <vector>
#include <string>
#include <set>
#include <utility>

using namespace std;


unsigned int name_hash(const string& name) {
  return hash<string>{}(name);
}

// We assume no name is empty! (In other words, "" is not a valid name)
class NamesHashSet {
  static constexpr double max_load_factor = 1.0;

  vector<set<string> > names;
  int size_;

public:
  NamesHashSet() : size_(0) {
    names.resize(1);
  }

  int size() {
    return size_;
  }

  bool contains(const string& name) {
    const set<string>& subset = names[name_hash(name) % names.size()];
    return subset.find(name) != subset.end();
  }

  bool insert(const string& name) {
    ++size_;
    if (size_ > names.size() * max_load_factor) {
      rehash(2 * names.size());
    }
    return names[name_hash(name) % names.size()].insert(name).second;
  }

  void rehash(int new_num_buckets) {
    vector<set<string> > new_names;
    new_names.resize(new_num_buckets);
    for (const set<string>& bucket : names) {
      for (const string& name : bucket) {
        new_names[name_hash(name) % new_names.size()].insert(name);
      }
    }
    names = move(new_names);
  }

  void print() {
    cout << '(';
    for (const auto& bucket : names) {
      cout << '[';
      for (const string& name : bucket) {
        cout << name << ", ";
      }
      cout << "], ";
    }
    cout << ')' << endl;;
  }
};



int main() {
  vector<string> names_vec = {"Alice", "Brock", "Mikel", "Maya", "Nana"};
    NamesHashSet names;
  for (const string& name : names_vec) {
    cout << "Inserting " << name << " with hash " << name_hash(name) << ":" << endl;
    names.insert(name);
    names.print();
  }
  for (const string& name : names_vec) {
    cout << "names.contains(\"" << name << "\") == " << names.contains(name) << endl;
  }
  for (const string& name : {"Not a name", "Coconuts"}) {
    cout << "names.contains(\"" << name << "\") == " << names.contains(name) << endl;
  }

  return 0;
}

Thanks

Thanks for Brian Foster for writing up these notes based on the video lectures.