Computer Science II -- Week 10

Introduction

Note that the links from this page are to handouts that will be distributed the night of class.  Only print out those handouts if you could not attend class.

Main topics this week:

Hash Tables
Hash Functions
Load Factor
Collisions
Linear Probing
Quadratic Probing
Separate Chaining

Hash Tables

So far we've looked at various data structures for searching.  Linked lists were O(N), so we don't use them for large amounts of data.  Binary search trees were O(log N), but they're slow to insert because of the need to keep them balanced.

Hash tables are an attempt to provide searching in O(1) time.  This means that no matter how many elements you put into a hash table, it always takes the same amount of time to find any particular element.

First, let's look at the basic concept.  In a hash table you have a key (an index used to find a value) and a value (whatever data you want to store with the key).

Assume my key is an integer from 1 to 10.  If I store my data in a linked list, I'll need to search every element of the linked list to find the node that has data for any particular key.  But if I put my data in an array, like this:

int a[10];

Then I can use my key as an index into the array.  Now I no longer have to search every element to find the data for a particular key.  The key itself tells me exactly where to look in the array.  No matter how big I make my array, I can get to any item by using the key as an index.  This is O(1) searching.

I can get or set data like this:

a [key - 1]

Note that I have to subtract one from the key to get a valid array index (arrays start at 0, remember?).  This massaging of the key to get an array index is called a hash function.

A hash function is simply a function that accepts a key and returns an array index.   We could write the hash function for the above array like this:

int hash (int key)
{
    return key - 1;
}

And then we could access array elements like this:

a [hash (key)]

This combination of a hash function and an array for storing data is what makes a hash table.

Hash Functions

The hash function we saw above works fine if the key is a simple integer in an appropriate range for our array.  But what if the key is an integer between 1 and 1000?  We would need a different hash function:

int hash (int key)
{
    return key % 10;
}

The modulus operator will return the remainder after the key is divided by 10.   This will always be a number between 0 and 9, a perfect index into an array that has 10 elements.

What if our key is a string?  Remember that associative arrays could use strings as keys; there's no reason why a hash table shouldn't be able to do the same.  Again, we need a different hash function, one that will take a string and convert it into an array index.

Our first try at a hash function for strings might be to convert each letter into a number.  A would be 1, B would be 2, etc.  Then we could add up all the numbers, and come up with a single integer.  If we then use modulus to convert that integer into a range that's appropriate for our array, we would have an array index.

The main problem with that approach is that a hash function should make a good attempt at producing a unique array index for a unique key.  The above approach for strings would return the same array index for each of the following:

"abc"
"bca"
"cba"
"acb"

A better approach for strings is to take the position of each letter into account as well.

Load Factor

The load factor of a hash table is just a technical way of saying how full the hash table is.  If the hash table is empty, the load factor is 0.  If the hash table is full, the load factor is 1.  The load factor can be any number between 0 and 1 to indicate how full the hash table is.

Collisions

Consider the following hash function:

int hash (int key)
{
    return key % 10;
}

What happens if we try to put keys 11 and 21 into this hash table?  Both hash out to array index 1.  Obviously, both items cannot go into the same array element.   This is called a collision.

There are three main ways of handling collisions in a hash table.

Linear Probing

When you insert an item into a hash table, you call the hash function to convert the key into an array index.  A collision is when that array index already has an item.  

Let's call the original return from the hash function I.  The linear probing strategy says that if you find that array element I is occupied, then try array element I+1.  If that's occupied, try I+2, then I+3, etc.  You continue adding one until you find an empty array element. 

Let's look at how this works for some sample inserts:

insert     44
insert     144
insert     244

Now look at how you would find an existing piece of data in a hash table.  Follow the same strategy as for insert, but when you hit an empty array element, you know the data doesn't exist.

Linear probing has several disadvantages. 

Consider inserting 44, 144, and 244 into a hash table.  Now remove 144, leaving an empty array element at index 5.  Now try to find 244: the search starts at index 4, hits the empty element at index 5, and wrongly concludes that 244 isn't in the table, even though it's sitting at index 6.  Linear probing is only suitable for hash tables where you don't need to remove items.

Primary clustering: Consider the following sequence of inserts: 13, 25, 54, 93, 106.   As the load factor of the hash table rises, we have to search longer for an empty spot because values tend to cluster together. 

Also, as values cluster together, when we try to find a particular item, we have to look through items that could not possibly match.  Consider trying to find 93 after the above sequence of inserts.  We have to compare with 54 and 25, even though those two values did not hash out to the same value as 93. 

So as the load factor approaches 1, this tends to degrade the searching of a hash table into O(N)...no better than a linear search through an array.  If a hash table uses linear probing, it should also grow the array to keep the load factor low.

Quadratic Probing

Quadratic probing is similar to linear probing.  If the key hashes to index I and the array already has an element at index I, then you try I+1².  If that's occupied, you try I+2², I+3², I+4², etc.  Eventually (as long as the table isn't too full) you find an empty array element. 

Let's look at how this works for some sample inserts: 44, 144, 244, 344.

Quadratic probing tends to spread items throughout the array better than linear probing, and avoids the disadvantage of primary clustering.  When you try to find an item, you are more likely to be comparing only against other items that hashed out to the same index.

But we still have the problem of not being able to remove items from the hash table.

And as the load factor approaches 1, searching is less and less efficient.

Separate Chaining

Separate chaining takes a different approach.  Each element in the array is a linked list.  When collisions happen, you simply append an item to the linked list.

Let's look at how this works for some sample inserts: 44, 144, 244

This fixes the problem of not being able to find an item after another item is removed.

However, if all our items hash into the same array index, then searching the hash table degrades to searching a linked list.  To make separate chaining efficient, you need to limit the length of each linked list.  Say we limit the linked lists to a maximum of three elements.

When we want to insert a new item into an array element but the linked list there is already full, we double the size of the array and rehash all the items in the table.  Remember that the hash function uses the size of the array as the modulus, so once the array grows, items may hash to different spots.

Let's see how this works for these values, in an array that starts out at size 10: 44, 144, 214, 14

What happens if, when we double the size of the array and rehash, all the items in the linked list end up at the same spot?  We double the size of the array again and rehash again.

We can see this if we insert these values into a hash table that starts out at size 10: 14, 34, 114, 134

We may have to double and rehash many times before we finally succeed in adding the element we want to add.

Searching in separate chaining is O(1).  This is true even though we are searching through a linked list, because the size of the linked list does not depend on the number of items in the hash table.  There can only be a maximum of three items in any linked list, regardless of how many items are in the hash table.

This method of growing the hash table and rehashing items is often called a dynamic hash table.