Binary Search

If you haven't read the Linear Search notes, read those first.

Binary Search

Linear Search might be fast for small inputs but if you the input (the number of things you're searching through) becomes twice as big, it's going to take about twice as long to search. If the input gets K times as big, it's going to take about K times as long to search.

We can do better! With Binary Search and a sorted list, if we double the size of the input, it'll take the same amount of extra time no matter how big the input is.

For the rest of this page, we'll use an example where we have a list of words and we want to know if some word is in that list. This might be useful for checking if a word is an English word, like when you're typing and you see the red squiggly lines under the word because it's mispellled, or if you implemented a game like Scrabble and you want to know if a word is valid or not.

Idea

Let's walk through an example, searching for the word "olive" in this sorted list:

index	word
0	apple
1	banana
2	durian
3	lemon
4	lime
5	mango
6	orange
7	peach
8	pear
9	raspberry

Since the list is sorted, you know that if the word you're looking for is in the list, it must be between the first word (at index 0) and the last word (at index 9). Binary Search is a process of elimination. We can look in the middle and see if that middle word comes before or after "olive"... and just keep repeating this until we either find "olive" or eliminate all words in the list.

Note: In C++ and many programming languages, it is idiomatic to represent a range of integers by including the lower bound and excluding the upper bound!

At each step, you check the middle word in the range. Since we start with [0, 10), we first look at index 5: "mango".

lower bound	upper bound	middle index	middle word	what happens
0	10	(1 + 10) / 2 == 5	mango	"olive" comes after "mango" so we update the lower bound to be just past "mango"
6	10	(6 + 10) / 2 == 8	pear	"olive" comes before "pear" so we update the upper bound to be just before "pear"
6	8	(6 + 8) / 2 == 7	peach	"olive" comes before "peach" so we update the upper bound
6	7	(6 + 7) / 2 == 6	orange	"olive" comes before "orange" so we update the upper bound
6	6	(6 + 6) / 2 == 6	orange	now our range is empty (lower bound == upper bound), so we know "olive" is not in the list

Implementation

bool _contains_word_bs(
        const vector<string>& words, const string& word, int low, int high) {
    // Base case: Could not find the word
    if (low == high) {
        return false;
    }
    const int middle = low + (high - low) / 2;
    const string& mid_word = words[middle];
    if (mid_word == word) {
    // Base case: Found the word!
        return true;
    } else if (word < mid_word) {
        high = middle;
    } else {  // if (word > mid_word)
        low = middle + 1;
    }
  // Recursive case: We've narrowed down the search to the left or right half.
    return _contains_word_bs(words, word, low, high);
}

bool contains_word_bs(const vector<string>& words, const string& word) {
    return _contains_word_bs(words, word, 0, words.size());
}

std::binary_search, defined in the <algorithm> header is a more general version of binary search than contains_word_bs.

How fast is it?

binary_search_timed.cpp

#include <algorithm>
#include <chrono>
#include <iostream>
#include <fstream>
#include <random>
#include <string>
#include <vector>

using std::cout;
using std::endl;
using std::ifstream;
using std::string;
using std::vector;

vector<string> read_lines(const string& filepath, const long max_lines=-1) {
  ifstream f(filepath);
  vector<string> lines;
  for(string line; getline(f, line); ) {
    lines.push_back(line);
    if (max_lines > 0 && lines.size() >= max_lines) {
      break;
    }
  }
  return lines;
}

vector<string> get_random_words(const vector<string>& words, const int sample_size) {
    std::default_random_engine dre;
    std::uniform_int_distribution<int> uniform_dist(0, words.size() - 1);
    vector<string> random_words;
    for (int i = 0; i < sample_size; ++i) {
      random_words.push_back(words[uniform_dist(dre)]);
    }
    return random_words;
}

bool _contains_word_bs(
    const vector<string>& words, const string& word, int low, int high) {
  // Base case: Could not find the word
  if (low == high) {
    return false;
  }
  // Bonus question: We can also calculate `middle = (low + high) / 2`.
  //   When is calculating `middle = low + (high - low) / 2` better?
  const int middle = low + (high - low) / 2;
  const string& mid_word = words[middle];
  if (mid_word == word) {
    // Base case: Found the word!
    return true;
  } else if (word < mid_word) {
    high = middle;
  } else {  // if (word > mid_word)
    low = middle + 1;
  }
  // Recursive case: We've narrowed down the search to the left or right half.
  // Note that instead of recursing here, we could just as easily put the code
  //   in this function in a loop.
  return _contains_word_bs(words, word, low, high);
}

bool contains_word_bs(const vector<string>& words, const string& word) {
  return _contains_word_bs(words, word, 0, words.size());
}

// Returns the number of seconds to search for all `sample_words` in `words`.
double time_contains_word_bs(const vector<string>& words, const vector<string>& sample_words) {
  auto start_time = std::chrono::steady_clock::now();
  for (const string& w : sample_words) {
    contains_word_bs(words, w);
  }
  auto end_time = std::chrono::steady_clock::now();
  std::chrono::duration<double> elapsed_seconds = end_time - start_time;
  return elapsed_seconds.count();
}

// Returns the number of seconds to search for all `sample_words` in `words`.
double time_std_find(const vector<string>& words, const vector<string>& sample_words) {
  auto start_time = std::chrono::steady_clock::now();
  for (const string& w : sample_words) {
    std::binary_search(words.begin(), words.end(), w);
  }
  auto end_time = std::chrono::steady_clock::now();
  std::chrono::duration<double> elapsed_seconds = end_time - start_time;
  return elapsed_seconds.count();
}

int main(int argc, const char* argv[]) {
  const int sample_size = 100000;
  const char* words_filename = argc > 1 ? argv[1] : "words.txt";
  // Output a CSV that can be graphed.
  cout << "\"Number of words\",contains_word_bs,std::binary_search\n";
  for (int n = 1; n < 1000000; n *= 2) {
      vector<string> words = read_lines(words_filename, n);
      std::sort(words.begin(), words.end());
      vector<string> random_words = get_random_words(words, sample_size);
      cout
          << words.size() << ","
          << time_contains_word_bs(words, random_words) << ","
          << time_std_find(words, random_words) << '\n';
  }
  return 0;
}

$ clang++ -pedantic -Wall -lm -std=c++20 -o binary_search_timed binary_search_timed.cpp

$ ./binary_search_timed

"Number of words",contains_word_bs,std::binary_search
1,0.0056199,0.0173886
2,0.010135,0.024675
4,0.0137802,0.0302772
8,0.0197854,0.035572
16,0.0264409,0.0417622
32,0.0343236,0.0433392
64,0.0321582,0.039424
128,0.0349039,0.045379
256,0.0411828,0.0516638
512,0.0461718,0.0578878
1024,0.0534536,0.0639551
2048,0.0597787,0.0686268
4096,0.0674313,0.076616
8192,0.0732493,0.0819305
16384,0.0841993,0.0913133
32768,0.0985319,0.106697
65536,0.110652,0.122715
124536,0.123997,0.141723
124536,0.122756,0.139681
124536,0.115488,0.130762

Binary Search Timed

In practice, it looks like it's much faster than Linear Search, at least for larger inputs. It's possible that Binary Search is actually slower for small inputs but by eliminating half of the possible words at every iteration instead of only one word, even a slow Binary Search will beat a Linear Search on large inputs.

If Linear Search has to iterate for every word in the worse case scenario, how many times does Binary Search have to iterate? We know that in the base cases (when we narrow down the range to 1 word or 0 words), it only takes one iteration. We also know that for the general/recursive case, we cut the range of possible words in half. So basically, if there are N words, the most iterations we might do (the most recursive calls we might do) is the same as the number of times we can divide N by 2: log base 2 of N.