COMP 210: Data Structures

More About Hash Tables


The Big Picture

Hash Tables aim to provide O(1) time complexity (on average) for the three basic data structure operations: insertion, deletion, and searching. They do this by using the data item (or part of the data item) as a key that can be converted directly into the address, or index, where the data is stored.

The Core Concept: Hashing

A hash table consists of two primary components:

  1. The Hash Function: The hash function is a mathematical function that takes an input key (e.g., a person’s name, a 16-digit product ID, or a string) and transforms it into a fixed, small, non-negative integer – the hash code or hash value – that can be used as an index into a table.

        Hash(Key) → Index

  2. The Array (The Table): This is the structure where the actual data (the value associated with the key) is stored. It could be a simple ArrayList with just the values, or it could be a map, i.e., a table containing key-value pairs. The hash codes calculated by the hash function are used as the indices into this array.

    The size of the table is “baked into” the hash function, since the resulting hash codes must be in the range 0 ≤ hashCode < sizeOfTable. A good hash function distributes the resulting hash codes (indices) as evenly as possible across the table size to minimize collisions.
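For example, a very simple hash function for strings might sum the character codes and reduce the sum mod the table size. Here is a minimal sketch in Java (the method name and the summing scheme are ours, for illustration only):

    // A minimal sketch of a simple (not especially good) string hash function:
    // sum the character codes, then take the remainder mod the table size,
    // so the result is always a valid index in 0 .. tableSize - 1.
    public static int hash(String key, int tableSize) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i);    // a char promotes to its integer (ASCII) value
        }
        return sum % tableSize;
    }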

Collisions

The key challenge in hash tables is that multiple keys may hash to the same index. This is called a collision. This becomes more likely as the number of data entries gets closer to the size of the table. (And becomes inevitable if the number of data entries exceeds the size of the table – see Separate Chaining below.)

Example:
    Hash("apple")  → 5
    Hash("grape")  → 5 (Collision!)

We cannot store both items at index 5 without overwriting one. Hash tables rely on Collision Resolution techniques to handle this.


Collision Resolution Techniques

There are two main approaches for resolving collisions:

  1. Separate Chaining (Open Hashing)
    Each entry in the hash table contains a reference to a secondary data structure, rather than a data item directly. (A short code sketch follows this list.)

    • Each slot in the main hash table array holds a pointer to a linked list (or occasionally, a small BST).

    • If two keys hash to index i, their data items are simply added to the linked list at array index i.

    If the original table is big enough and the hash codes are distributed evenly enough, the length of any given linked list of collisions will be reasonably short, giving access performance that is still close to O(1).

    Advantages: Simple to implement; never “runs out of room” unless memory is exhausted.
    Disadvantages: Requires extra space for the pointers/lists; lookup becomes O(N) in the worst case (if all elements hash to the same spot).

  2. Open Addressing (Closed Hashing)
    When a collision occurs, the system looks for an available empty slot in the main array itself, following a specific probing strategy.

    • Linear Probing: The table searches for the next available slot sequentially: index i + 1, then i + 2, i + 3, etc., wrapping around if necessary.

    • Quadratic Probing: The table searches slots i + 1², i + 2², i + 3², etc.

    • Double Hashing: A second hash function is used to determine the size of the step to take when probing.

    Advantages: Better memory utilization (no separate linked lists); faster iteration over the keys.
    Disadvantages: More complex deletion (you can’t just remove a node, you must mark it as “deleted”); suffers from clustering (long runs of occupied slots), which degrades performance.
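To make separate chaining concrete, here is a minimal sketch of insertion in Java, using the standard LinkedList class for the buckets (the class and method names are our own, not a library API):

    import java.util.LinkedList;

    // Sketch of a separate-chaining table of String keys.
    public class ChainedHashTable {
        private LinkedList<String>[] table;
        private int numItems = 0;

        @SuppressWarnings("unchecked")
        public ChainedHashTable(int size) {
            table = new LinkedList[size];
            for (int i = 0; i < size; i++) {
                table[i] = new LinkedList<>();   // every slot starts as an empty chain
            }
        }

        private int hash(String key) {
            int sum = 0;
            for (int i = 0; i < key.length(); i++) sum += key.charAt(i);
            return sum % table.length;
        }

        public void insert(String key) {
            table[hash(key)].addFirst(key);      // a collision just lengthens the chain
            numItems++;
        }
    }

For open addressing, the three probing strategies differ only in the index examined on the k-th probe after a collision at index i (a summary in our own notation, with M the table size):

    Linear:      (i + k) mod M,            k = 1, 2, 3, ...
    Quadratic:   (i + k²) mod M
    Double:      (i + k ⋅ h2(key)) mod M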

Load Factor and Rehashing

The performance of a hash table is critically dependent on its load factor α.

    α = Number of Items (N) / Size of Array (M)

If the load factor gets too high (typically above 0.7 for open addressing or above 1.0 for separate chaining), the chance of collisions increases dramatically, pushing the average time complexity closer to O(N). In other words, once a hash table that uses open addressing gets to be about 70% full, performance starts to degrade quickly. If the hash table uses chaining, on the other hand, it starts to slow down as the number of entries approaches the size of the table.
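For example, N = 7 items stored in an array of size M = 10 gives α = 7 / 10 = 0.7 – right at the point where an open-addressing table typically begins to degrade.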

When the load factor exceeds a threshold, the hash table performs rehashing:

  1. A new array, typically double the size of the old one, is allocated.

  2. Every single item from the old array is inserted into the new, larger table. Because the hash function depends on the table size, each item’s index must be recomputed, and items will generally land in different slots than before.

  3. The old, small array is discarded.

While rehashing is an O(N) operation, it happens so infrequently that the amortized time complexity for insertion remains O(1).
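As a sketch of how this might look in code, the ChainedHashTable above could check its load factor after every insertion (the threshold of 1.0 and the helper name are illustrative choices):

    // Sketch: rehash when the load factor exceeds 1.0 (a common chaining threshold).
    private void maybeRehash() {
        double alpha = (double) numItems / table.length;
        if (alpha > 1.0) {
            LinkedList<String>[] old = table;
            table = new LinkedList[old.length * 2];     // 1. allocate a larger array
            for (int i = 0; i < table.length; i++) {
                table[i] = new LinkedList<>();
            }
            for (LinkedList<String> chain : old) {      // 2. re-insert every item;
                for (String key : chain) {              //    hash() now uses the new
                    table[hash(key)].addFirst(key);     //    table.length, so indices change
                }
            }
        }                                               // 3. the old array becomes garbage
    }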


Simple Hash Table Example Using Separate Chaining

Imagine we have a small phone book application where we store names (the Key) and phone numbers (the Value).

We will implement the hash table using Separate Chaining, so each entry in the hash table will be a (possibly empty) linked list of entries that hash to that index.

Inserting Data

We want to insert the following data. Since we have not yet defined a hash function, we will use a simple one: sum the ASCII values of the first two characters of the key, then take the result mod the table size (here, 10):

    Index = (ASCII(char1) + ASCII(char2)) mod 10

We insert each data entry into the linked list at the index given by the hash function.

Key (Classmate Name)      Sum of ASCII Values    Hash Code (Sum % 10)
“Snoopy”                  83 + 110 = 193         3
“Charlie Brown”           67 + 104 = 171         1
“Franklin”                70 + 114 = 184         4
“Peppermint Patty”        80 + 101 = 181         1
“Lucy”                    76 + 117 = 193         3

By sheer bad luck, all of our keys hash to indices in the first half of our table, and we have two collisions (Snoopy / Lucy and Charlie Brown / Peppermint Patty) among only five data entries.

Addressing the Collisions

When we insert “Peppermint Patty”, we are inserting data into a linked list that already contains data for “Charlie Brown”. When we insert “Lucy”, we are inserting into a linked list that already contains “Snoopy”.

Final Hash Table:

Index     Data
0         empty
1         [Key: “Peppermint Patty”] → [Key: “Charlie Brown”]
2         empty
3         [Key: “Lucy”] → [Key: “Snoopy”]
4         [Key: “Franklin”]
5 .. 9    empty

Searching

Now, imagine we want to find the phone number for “Snoopy”:

  1. Hash the Key: Index = Hash(“Snoopy”) = 3.
  2. Go to Index 3: We land at the list starting with “Lucy”.
  3. Traverse the Chain: We check the first item. Is it “Snoopy”? No, it’s “Lucy.”
  4. Check Next: We move to the next item in the list. Is it “Snoopy?” Yes.
  5. Return Value: We return the associated phone number.

In this simple case, this search required one hash calculation (instant) and two comparisons (traversing the linked list). This demonstrates how collisions increase the complexity above the ideal O(1).
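In code, the search just described might look like the following, continuing the illustrative chained-table sketch from earlier but assuming the chains hold simple Entry objects (a hypothetical helper class pairing a name with a phone number) rather than bare strings:

    // Hypothetical helper pairing a key (name) with its value (number).
    private static class Entry {
        String name, number;
        Entry(String name, String number) { this.name = name; this.number = number; }
    }

    // Sketch: look up a phone number by name.
    public String get(String key) {
        for (Entry e : table[hash(key)]) {   // steps 1-2: hash the key, go to that index
            if (e.name.equals(key)) {        // steps 3-4: walk the chain, comparing keys
                return e.number;             // step 5: return the associated value
            }
        }
        return null;                         // reached the end of the chain: not found
    }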

Note: Based on the number of collisions we have encountered so far, you might already be wondering whether
    Index = (ASCII(char1) + ASCII(char2)) mod M
is a good hash function. It probably is not. If we were to add “Linus” to the list, for example, we would have a third data entry hashing to index 1. Adding “Sally”, on the other hand, would not result in a collision; “Sally” hashes to 0.

Key (Classmate Name)      Sum of ASCII Values    Hash Code (Sum % 10)
“Linus”                   76 + 105 = 181         1
“Sally”                   83 + 97 = 180          0

Linear Probing

In this method, the main array holds the data directly, and we search for the next empty slot within that array upon collision.

Let’s look at the same example that we used above, this time with linear probing.

Inserting Data

The data items hash to the exact same initial indices as in the previous example:

Key (Classmate Name)      Sum of ASCII Values    Hash Code (Sum % 10)
“Snoopy”                  83 + 110 = 193         3
“Charlie Brown”           67 + 104 = 171         1
“Franklin”                70 + 114 = 184         4
“Peppermint Patty”        80 + 101 = 181         1
“Lucy”                    76 + 117 = 193         3

The first three items are placed directly into their hashed indices:

Index     Content
0         empty
1         Charlie Brown
2         empty
3         Snoopy
4         Franklin
5 .. 9    empty

Inserting “Peppermint Patty” and “Lucy”, though, results in collisions.

Addressing the Collisions

Placing Peppermint Patty:

  1. Calculate Hash: “Peppermint Patty” hashes to index 1.
  2. Check Index 1: Slot 1 is occupied by “Charlie Brown” (Collision!).
  3. Probing (Linear): We check the next consecutive slot:
    • Check Index 2: Empty.
  4. Insert: “Peppermint Patty” is placed at index 2.

Placing Lucy:

  1. Calculate Hash: “Lucy” hashes to index 3.
  2. Check Index 3: Slot 3 is occupied by “Snoopy” (Collision!).
  3. Probing (Linear): We check the next consecutive slot:
    • Check Index 4: Occupied by “Franklin.”
    • Check Index 5: Empty.
  4. Insert: “Lucy” is placed at index 5.

Final Hash Table:

Index     Content
0         empty
1         Charlie Brown
2         Peppermint Patty
3         Snoopy
4         Franklin
5         Lucy
6 .. 9    empty

Stop and Think:

What would be the result of adding “Linus” and “Sally” to this hash table? Remember that the hash code for “Linus” is 1, while the hash code for “Sally” is 0. How many comparisons would be required for each?

Searching

When searching for a data entry, we follow the same steps: compute the hash code and check that index. If that slot does not hold the entry we are looking for, we compute the next index using the step, check that slot, and so on. If we reach an empty slot, the entry is not in the table and the search fails.
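Here is a minimal sketch of insertion and search with linear probing (step size 1) in Java, using the two-character hash from our example; the class layout and names are illustrative, and for simplicity it assumes the table never becomes completely full:

    // Sketch: open addressing with linear probing, step size 1.
    public class LinearProbingTable {
        private String[] table;

        public LinearProbingTable(int size) {
            table = new String[size];
        }

        private int hash(String key) {
            // Assumes keys have at least two characters.
            return (key.charAt(0) + key.charAt(1)) % table.length;
        }

        public void insert(String key) {
            int index = hash(key);
            while (table[index] != null) {           // collision: slot already occupied
                index = (index + 1) % table.length;  // try the next slot, wrapping around
            }
            table[index] = key;
        }

        public boolean contains(String key) {
            int index = hash(key);
            while (table[index] != null) {           // an empty slot means "not present"
                if (table[index].equals(key)) {
                    return true;
                }
                index = (index + 1) % table.length;
            }
            return false;
        }
    }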

Primary Clustering

This time our bad luck (or poor hash function) has led to five data entries being clustered together – indices 1, 2, 3, 4, and 5 are all full and adjacent. This is called Primary Clustering. A long cluster makes subsequent insertions and searches into that cluster much slower, degrading the average time complexity away from O(1).

In this case, inserting “Lucy” required three comparisons and searching for “Lucy” would require the same, because “Lucy” was forced two steps away from the original hash location due to the cluster.

Variations on Linear Probing

The simplest form of linear probing, which we used in our example above, uses a step size of 1: when we encounter a collision, we try the next slot. We could also use a different step, such as 2 or 3. We have to be careful, though, in choosing the step.

Consider what would happen with a step of 5 in our example above. The data for “Peppermint Patty” would be placed at index 6 and the data for “Lucy” at index 8 – a definite improvement. But if the next name also hashed to 1, we would check index 1 (collision), check index 6 (a second collision), and then be back at index 1 again, since (6 + 5) % 10 = 1. We would now be in a cycle, and would have to either throw an exception or rehash the table to a larger size. A step of 2 has a similar problem, although it takes more checks to arrive back where we started: starting from index 1, it visits only the odd indices. The problem is that 2 and 5 are both divisors of 10. If our step were 3, on the other hand, or any other number that is relatively prime to the size of the table, we would visit every index in the table before returning to our starting point.

For this reason, hash tables that use linear probing are generally created with prime sizes – a capacity of 101, for example, rather than 100. That way, any step size smaller than the capacity is relatively prime to the capacity.

Another variation that provides a more dynamic way to define the step is quadratic probing, where the offset grows with each probe attempt: we check i + 1², i + 2², i + 3², etc.


Double Hashing

Double Hashing is the most advanced of the standard probing techniques. It is highly effective because it uses a secondary hash function to calculate a key-specific step size, effectively preventing primary clustering.

To make Double Hashing mathematically robust, the table size (M) must be a prime number. For this example, we will use a table size of 11 (indices 0-10).

We will use a secondary hash function of the form h2(k) = R − (Sum mod R), where R is a prime smaller than M; in this example, R = 7. Note that the form of h2 is similar to the form of the primary hash function h1, but with a smaller prime modulus (R instead of M) and an extra subtraction. The subtraction changes the range of h2 from 0 ... (R − 1) to 1 ... R, guaranteeing that the step is never 0. A step of 0 would mean never stepping away from the collision, checking the same slot over and over again.
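A minimal sketch of the two hash functions and the resulting probe sequence in Java (constants and names are illustrative, matching the example below):

    // Sketch: double hashing with table size M = 11 and secondary prime R = 7.
    private static final int M = 11;
    private static final int R = 7;

    private static int sumChars(String key) {
        return key.charAt(0) + key.charAt(1);     // same two-character sum as before
    }

    public static int h1(String key) {
        return sumChars(key) % M;                 // home index, 0 .. M - 1
    }

    public static int h2(String key) {
        return R - (sumChars(key) % R);           // step size, 1 .. R (never 0)
    }

    // Index examined on the i-th probe (i = 0 is the home slot).
    public static int probe(String key, int i) {
        return (h1(key) + i * h2(key)) % M;
    }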

Inserting Data

First we calculate the initial indices and step sizes for our example data using the primary and secondary hash functions, with M = 11 and R = 7. Note that the change in table size (M) from 10 to 11 will change our initial hash codes and collisions. We will also include “Linus” and “Sally” in this example.

Key (Classmate Name)      Sum    h1(k) = Sum mod 11    Sum mod 7    h2(k) = 7 − (Sum mod 7)
Snoopy                    193    6                     4            3
Charlie Brown             171    6                     3            4
Franklin                  184    8                     2            5
Peppermint Patty          181    5                     6            1
Lucy                      193    6                     4            3
Linus                     181    5                     6            1
Sally                     180    4                     5            2

How do we use these formulas to place our data entries into the table, addressing collisions when needed?

  1. “Snoopy”: h1 = 6. Index 6 is empty. (Placed at 6)
  2. “Charlie Brown”: h1 = 6. Collision at 6 (Snoopy).   h2 = 4 (Step size).
    • Probe 1 (i = 1): (6 + 1 ⋅ 4) mod 11 = 10. Index 10 is empty. (Placed at 10)
  3. “Franklin”: h1 = 8. Index 8 is empty. (Placed at 8)
  4. “Peppermint Patty”: h1 = 5. Index 5 is empty. (Placed at 5)
  5. “Lucy”: h1 = 6. Collision at 6 (Snoopy).   h2 = 3 (Step size).
    • Probe 1 (i = 1): (6 + 1 ⋅ 3) mod 11 = 9. Index 9 is empty. (Placed at 9)
  6. “Linus”: h1 = 5. Collision at 5 (Peppermint Patty).   h2 = 1 (Step size).
    • Probe 1 (i = 1): (5 + 1 ⋅ 1) mod 11 = 6. Collision at 6 (Snoopy).
    • Probe 2 (i = 2): (5 + 2 ⋅ 1) mod 11 = 7. Index 7 is empty. (Placed at 7)
  7. “Sally”: h1 = 4. Index 4 is empty. (Placed at 4)
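Using the probe() sketch above, the placement of “Linus” can be checked directly: probe("Linus", 0) = 5, probe("Linus", 1) = 6, and probe("Linus", 2) = 7 – matching step 6 above.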

Final Hash Table:

Index     Content
0         empty
1         empty
2         empty
3         empty
4         Sally
5         Peppermint Patty
6         Snoopy
7         Linus
8         Franklin
9         Lucy
10        Charlie Brown

The Advantage: Scattered Probing

The key takeaway is that even though Snoopy, Charlie Brown, and Lucy all initially hashed to 6, they took different paths to find an empty slot. Snoopy took the slot at index 6, Charlie Brown had a collision at 6 but then stepped 4 to index 10, and Lucy had a collision at 6 but stepped 3 to get to index 9.

Double hashing makes it very likely that keys colliding at the same index follow distinct paths through the hash table when resolving a collision; two keys share the same probe sequence only if both hash functions agree on them (as they happen to do for “Snoopy” and “Lucy”, whose character sums are identical).

In theory, this behavior can more effectively scatter the colliding elements throughout the table. Even when the data are clustered, as they are in this example, the differing secondary hashes provide more opportunity for stepping over and out of a cluster. Fewer collisions keep insertion, deletion, and lookup times closer to O(1).