COMP 210: Data Structures

Hash Tables: Fast Lookup


Why Hash Tables Exist

Hash Tables aim to provide O(1) time complexity (on average) for the three basic data structures operations: insertion, deletion, and searching. They do this by converting the data item (or part of the data item), known as the key, directly into the address (or index) in the table where the data is stored.

A rough analogy is when you look up a word, like “table”, in a physical dictionary. You don’t start on page 1 of the dictionary and begin a linear search, looking at page after page for your word. You don’t even do a binary search, starting in the middle, comparing your word to what is on that page, deciding which half contains your word, comparing your word to what is on the page in the middle of that half, etc. Instead, you make an educated guess as to where a word starting with an “t” would appear in the dictionary, and start there. (Some dictionaries provide markers showing where each letter begins, in which case you could go straight to “T” without any guessing at all.)

Simple Example:

A slightly better simple example would be if you want to create a small phone book application where you store a table of names and phone numbers for your classmates, to help with coordinating group projects. (Names and phone numbers are key-value pairs in the context of a map or dictionary.)

You decide to use the first letter of each classmate’s name to get the index into your table. ‘A’ will correspond to index 0, ‘B’ to index 1, etc. This indicates a table of size 26, with an entry for each letter in the alphabet. You can easily calculate the index into the table from the first letter by subtracting the ASCII (or UTF-8) value of ‘A’:

    index = firstLetter - 'A';

This formula is called your hash function, and the result (your index) is called a hash code.

For example, you might have the following classmates and hash codes:

Hash Codes based on First Letter
Key (Classmate Name)     Hash Code (Char 1 - ‘A’)
Snoopy ‘S’ - ‘A’ = 18
Charlie Brown ‘C’ - ‘A’ = 2
Franklin ‘F’ - ‘A’ = 5
Peppermint Patty ‘P’ - ‘A’ = 15
Lucy ‘L’ - ‘A’ = 11

You can now store Snoopy’s name and phone number in your table at index 18, Charlie Brown’s at index 2, Franklin’s at index 5, etc. Inserting, deleting, or searching for an entry in the table requires calculating the index (O(1)) and accessing the table entry at that index (O(1)).

Collisions:

What if you now want to add Linus’s name and phone number to your table? The problem is that, given the way we identified the hash function, Linus’s name has the same hash code as Lucy’s name, but they can’t both be stored at index 11.


The Issues

This points out two issues with hash tables, in spite of the promise they give of fast (O(n)) insertion, deletion, and search.

You can reduce the likelihood of collisions by defining a good-sized table and by developing a good hash function that distributes your hash codes widely, but collisions can still occur, and any hash table implementation must address the possibility.


Addressing Collisions

There are two common approaches to Collision Resolution.

  1. Chaining
    Each entry in the hash table contains a reference to a linked list (or possibly a small BST) with all of the key-value pairs that hash to that entry. If the original table is big enough and the hash codes are distributed evenly enough, the length of any given linked list of collisions will be reasonably short, giving access performance that is still close to O(1).

  2. Open Addressing
    When a collision occurs, the system looks for another slot that is still open in the table. There are several approaches here. For example, one approach is to check each table entry sequentially, starting at the hash code index and adding 1 to the index until an open spot is found. This is called a step of 1; you could also choose other step sizes. A different approach is to apply a second hash function to the key to come up with the step size, and then repeatedly add that to the index to find an open slot.

     $index = originalHashFunction(key);$
     $step = secondaryHashFunction(key);$

    Regardless of how you choose the step, you start with the original hash code (index). If there is already another entry at table[index], look for an open entry at table[index + step], then at table[index + 2*step], etc.