Introduction to Algorithms
Hash Tables
CSE 680
Prof. Roger Crawfis
Motivation
Arrays provide an indirect way to access a set.
Many times we need an association between two
sets, or a set of keys and associated data.
Ideally we would like to access this data directly with
the keys.
We would like a data structure that supports fast
search, insertion, and deletion.
Do not usually care about sorting.
The abstract data type is usually called a Dictionary,
Map, or Partial Map.
float googleStockPrice = stocks["Goog"].CurrentPrice;
Dictionaries
What is the best way to implement this?
Linked Lists?
Double Linked Lists?
Queues?
Stacks?
Multiple indexed arrays (e.g., data[key[i]])?
To answer this, ask what the complexity of each
operation is:
Insertion
Deletion
Search
Direct Addressing
Let’s look at an easy case, suppose:
The range of keys is 0..m-1
Keys are distinct
Possible solution
Set up an array T[0..m-1] in which
T[i] = x if x ∈ T and key[x] = i
T[i] = NULL otherwise
This is called a direct-address table
Operations take O(1) time!
So what’s the problem?
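A minimal Python sketch of the direct-address table described above, assuming distinct integer keys in 0..m-1. The class name DirectAddressTable and its method names are illustrative, not part of the slides.

```python
class DirectAddressTable:
    """Direct-address table for distinct integer keys in 0..m-1."""

    def __init__(self, m):
        # One slot per possible key; None plays the role of NULL.
        # Note that this initialisation alone costs O(m) time and space.
        self.slots = [None] * m

    def insert(self, key, value):
        self.slots[key] = value      # O(1)

    def delete(self, key):
        self.slots[key] = None       # O(1)

    def search(self, key):
        return self.slots[key]       # O(1); None means "not present"


table = DirectAddressTable(8)
table.insert(3, "x3")
print(table.search(3))  # x3
table.delete(3)
print(table.search(3))  # None
```

Every operation is a single array access, which is where the O(1) bound comes from; the cost is hidden in the size of the array.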
Direct Addressing
Direct addressing works well when the
range m of keys is relatively small
But what if the keys are 32-bit integers?
Problem 1: direct-address table will have
2^32 entries, more than 4 billion
Problem 2: even if memory is not an issue, the
time to initialize the elements to NULL may be
Solution: map keys to smaller range 0..p-1
Desire p = O(m).
Hash Table
Hash Tables provide O(1) support for all
of these operations!
The key idea: rather than indexing an array
directly, index it through some function,
h(x), called a hash function.
myArray[ h(index) ]
Key questions:
What is the set that the x comes from?
What is h() and what is its range?
Hash Table
Consider this problem:
If I know a priori the p keys from some finite
set U, is it possible to develop a function
h(x) that will uniquely map the p keys onto
the set of numbers 0..p-1?
Hash Functions
In general a difficult problem. Try something simpler.
[Figure: the hash function h maps the actual keys K = {k1, ..., k5}, a subset of the universe of keys U, to table slots 0..p-1; here h(k2) = h(k5).]
Hash Functions
A collision occurs when h(x) maps two keys to the
same location.
[Figure: the same mapping from the actual keys K (a subset of the universe of keys U) to slots 0..p-1, with the collision h(k2) = h(k5) marked.]
Hash Functions
A hash function, h, maps keys of a given type to
integers in a fixed interval [0, N - 1]
Example:
h(x) = x mod N
is a hash function for integer keys
The integer h(x) is called the hash value of x.
A hash table for a given key type consists of
Hash function h
Array (called table) of size N
The goal is to store item (k, o) at index i = h(k)
Example
We design a hash table storing employee records
using their social security number, SSN, as the key.
SSN is a nine-digit positive integer.
Our hash table uses an array of size N = 10,000
and the hash function h(x) = last four digits of x.
index   key
0       (empty)
1       025-612-0001
2       981-101-0002
3       (empty)
4       451-229-0004
…
9997    (empty)
9998    200-751-9998
9999    (empty)
Example
Our hash table uses an array of size N = 100.
We have n = 49 employees.
Need a method to handle collisions.
As long as the chance for collision is low, we can
achieve this goal.
Setting N = 10,000 and looking at the last four
digits will reduce the chance of collision.
index   key
0       (empty)
1       025-612-0001
2       981-101-0002
3       (empty)
4       451-229-0004
…
9997    (empty)
9998    200-751-9998 and 176-354-9998 (collision)
9999    (empty)
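The SSN example can be sketched in a few lines of Python. This assumes N = 10,000 and h(x) = last four digits of x, as in the first example. Writing SSNs as strings and keeping a list per slot are choices made here only to make the collision visible.

```python
N = 10_000  # table size from the first example


def h(ssn: str) -> int:
    """Hash an SSN such as '025-612-0001' to its last four digits."""
    return int(ssn.replace("-", "")) % N


# A list per slot, only so that colliding keys can be shown side by side.
buckets = [[] for _ in range(N)]
for ssn in ("025-612-0001", "981-101-0002", "451-229-0004",
            "200-751-9998", "176-354-9998"):
    buckets[h(ssn)].append(ssn)

print(h("025-612-0001"))  # 1
print(buckets[9998])      # ['200-751-9998', '176-354-9998']
```

Two different SSNs share the last four digits 9998, so they land in the same slot.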
Collisions
Can collisions be avoided?
If my data is immutable, yes
See perfect hashing for the case where the set of keys is
static (not covered).
In general, no.
Two primary techniques for resolving
collisions:
Chaining – keep a collection at each key slot.
Open addressing – if the current slot is full
use the next open one.
Chaining
Chaining puts elements that hash to the
same slot in a linked list:
[Figure: keys from K (the actual keys, a subset of the universe of keys U) hashed into the table; keys that collide share one linked list: k1 → k4, k5 → k2 → k7, k3, k8 → k6. All other slots are empty.]
Chaining
How do we insert an element?
[Figure: the chained hash table, with chains k1 → k4, k5 → k2 → k7, k3, k8 → k6.]
Chaining
How do we delete an element?
Do we need a doubly-linked list for efficient delete?
[Figure: the chained hash table, with chains k1 → k4, k5 → k2 → k7, k3, k8 → k6.]
Chaining
How do we search for an element with a
given key?
[Figure: the chained hash table T, with chains k1 → k4, k5 → k2 → k7, k3, k8 → k6.]
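One possible answer to the three questions above, as a hedged Python sketch. It assumes Python's built-in hash() reduced mod the table size as the hash function, and uses a Python list per slot in place of a linked list. The class and method names are illustrative.

```python
class ChainedHashTable:
    """Hash table with chaining; each slot holds a list of (key, value) pairs."""

    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        # Assumed hash function: built-in hash() reduced mod the table size.
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self._chain(key):  # cost is proportional to chain length
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False
```

With a real linked list, deleting a node you already hold a pointer to is O(1) only if the list is doubly linked; deleting by key, as here, has to search the chain first either way.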
Open Addressing
Basic idea:
To insert: if slot is full, try another slot, …, until
an open slot is found (probing)
To search, follow same sequence of probes as
would be used when inserting the element
If we reach an element with the correct key, return it
If we reach a NULL pointer, the element is not in the table
Good for fixed sets (adding but no deletion)
Example: spell checking
Open Addressing
The colliding item is placed in a
different cell of the table.
No dynamic memory.
Fixed Table size.
Load factor: n/N, where n is the number
of items to store and N the size of the hash
table.
Clearly, n ≤ N, or n/N ≤ 1.
To get a reasonable performance, n/N<0.5.
Probing
The key question is what should the
next cell to try be?
Random would be great, but we need to
be able to repeat it.
Three common techniques:
Linear Probing (useful for discussion only)
Quadratic Probing
Double Hashing
Linear Probing
Linear probing handles collisions by placing the
colliding item in the next (circularly) available table cell.
Each table cell inspected is referred to as a probe.
Colliding items lump together, causing future
collisions to cause a longer sequence of probes.
Example:
h(x) = x mod 13
Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order
cell:  0  1  2  3  4  5  6  7  8  9 10 11 12
key:   -  - 41  -  - 18 44 59 32 22 31 73  -
Search with Linear Probing
Consider a hash table A that uses linear probing
get(k)
We start at cell h(k)
We probe consecutive locations until one of the
following occurs
An item with key k is found, or
An empty cell is found, or
N cells have been unsuccessfully probed
To ensure the efficiency, if k is not in the table, we want to
find an empty cell as soon as possible. The load factor can
NOT be close to 1.
Algorithm get(k)
  i ← h(k)
  p ← 0
  repeat
    c ← A[i]
    if c = ∅
      return null
    else if c.key() = k
      return c.element()
    else
      i ← (i + 1) mod N
      p ← p + 1
  until p = N
  return null
Linear Probing
Example:
h(x) = x mod 13
Insert keys 18, 41, 22, 44, 59, 32, 31, 73, 12, 20
in this order
cell:  0  1  2  3  4  5  6  7  8  9 10 11 12
key:  20  - 41  -  - 18 44 59 32 22 31 73 12
Search for key=20.
h(20) = 20 mod 13 = 7.
Go through rank 8, 9, …, 12, 0.
Search for key=15
h(15) = 15 mod 13 = 2.
Go through rank 2, 3 and return null.
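The example can be replayed with a short Python sketch, assuming h(x) = x mod 13 and a table that stores bare integer keys. The function names insert and probes_for are illustrative.

```python
N = 13
EMPTY = None


def h(k):
    return k % N


def insert(table, k):
    """Place k in the first empty cell at or (circularly) after h(k)."""
    i = h(k)
    for _ in range(N):
        if table[i] is EMPTY:
            table[i] = k
            return i
        i = (i + 1) % N
    raise RuntimeError("table is full")


def probes_for(table, k):
    """Search for k; return (found, list of cells probed in order)."""
    i = h(k)
    probed = []
    for _ in range(N):
        probed.append(i)
        if table[i] is EMPTY:
            return False, probed
        if table[i] == k:
            return True, probed
        i = (i + 1) % N
    return False, probed


table = [EMPTY] * N
for key in (18, 41, 22, 44, 59, 32, 31, 73, 12, 20):
    insert(table, key)

print(table)
# [20, None, 41, None, None, 18, 44, 59, 32, 22, 31, 73, 12]
print(probes_for(table, 20))  # (True, [7, 8, 9, 10, 11, 12, 0])
print(probes_for(table, 15))  # (False, [2, 3])
```

Finding key 20 takes seven probes because cells 5 through 12 form one cluster, which is the lumping effect described for linear probing.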
Updates with Linear Probing
To handle insertions and deletions, we introduce a
special object, called AVAILABLE, which replaces
deleted elements
remove(k)
We search for an entry with key k
If such an entry (k, o) is found, we replace it with the
special item AVAILABLE and we return element o
Have to modify other methods to skip available cells.
put(k, o)
We throw an exception if the table is full
We start at cell h(k)
We probe consecutive cells until one of the following occurs
A cell i is found that is either empty or stores AVAILABLE, or
N cells have been unsuccessfully probed
We store entry (k, o) in cell i
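A hedged Python sketch of put, get, and remove with the AVAILABLE marker. It assumes Python's built-in hash() reduced mod the capacity as h. Unlike the put described above, this version also overwrites the value when the key is already present, which is a design choice made for this sketch.

```python
AVAILABLE = object()  # sentinel that replaces a deleted entry


class LinearProbingMap:
    def __init__(self, capacity=13):
        self.cells = [None] * capacity   # None means "never used"

    def _probe(self, key):
        """Yield the cells h(k), h(k)+1, ... (circularly), at most N of them."""
        n = len(self.cells)
        i = hash(key) % n
        for _ in range(n):
            yield i
            i = (i + 1) % n

    def get(self, key):
        for i in self._probe(key):
            cell = self.cells[i]
            if cell is None:             # truly empty: key cannot be further on
                return None
            if cell is not AVAILABLE and cell[0] == key:
                return cell[1]
        return None

    def put(self, key, value):
        first_free = None
        for i in self._probe(key):
            cell = self.cells[i]
            if cell is None:
                if first_free is None:
                    first_free = i
                break
            if cell is AVAILABLE:
                if first_free is None:
                    first_free = i       # remember it, but keep looking for key
            elif cell[0] == key:
                self.cells[i] = (key, value)
                return
        if first_free is None:
            raise RuntimeError("table is full")
        self.cells[first_free] = (key, value)

    def remove(self, key):
        for i in self._probe(key):
            cell = self.cells[i]
            if cell is None:
                return None
            if cell is not AVAILABLE and cell[0] == key:
                self.cells[i] = AVAILABLE
                return cell[1]
        return None
```

Searches must step over AVAILABLE cells rather than stop at them; otherwise removing one key would hide every key that was inserted after it in the same cluster.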
Quadratic Probing
Primary clustering occurs with linear
probing because of the linear probe pattern:
if a bin is inside a cluster, then the next bin
must either:
also be in that cluster, or
expand the cluster
Instead of searching forward in a linear
fashion, try to jump far enough out of the
current (unknown) cluster.
Quadratic Probing
Suppose that an element should appear
in bin h:
if bin h is occupied, then check the following
sequence of bins:
h + 1^2, h + 2^2, h + 3^2, h + 4^2, h + 5^2, ...
h + 1, h + 4, h + 9, h + 16, h + 25, ...
For example, with M = 17:
Quadratic Probing
If one of h + i^2 falls into a cluster, this
does not imply the next one will
Quadratic Probing
For example, suppose an element was
to be inserted in bin 23 in a hash table
with 31 bins
The sequence in which the bins would
be checked is:
23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0
Quadratic Probing
Even if two bins are initially close, the
sequence in which subsequent bins are
checked varies greatly
Again, with M = 31 bins, compare the
first 16 bins which are checked starting
with 22 and 23:
22, 23, 26, 0, 7, 16, 27, 9, 24, 10, 29, 19, 11, 5, 1, 30
23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0
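The two sequences can be checked with a one-line generator. This sketch assumes the probe formula (h + i^2) mod M; the name quadratic_probes is illustrative.

```python
def quadratic_probes(h, M, count):
    """First `count` bins checked by quadratic probing, starting at bin h."""
    return [(h + i * i) % M for i in range(count)]


print(quadratic_probes(22, 31, 16))
# [22, 23, 26, 0, 7, 16, 27, 9, 24, 10, 29, 19, 11, 5, 1, 30]
print(quadratic_probes(23, 31, 16))
# [23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0]
```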
Quadratic Probing
Thus, quadratic probing solves the
problem of primary clustering
Unfortunately, there is a second problem
which must be dealt with
Suppose we have M = 8 bins:
1^2 ≡ 1, 2^2 ≡ 4, 3^2 ≡ 1 (mod 8)
In this case, we are checking bin h + 1
twice having checked only one other bin
Quadratic Probing
Unfortunately, there is no guarantee that
(h + i^2) mod M
will cycle through 0, 1, ..., M – 1
Solution:
require that M be prime
in this case, (h + i^2) mod M for i = 0, ..., (M - 1)/2
will cycle through exactly (M + 1)/2
values before repeating
Quadratic Probing
Example with M = 11:
0, 1, 4, 9, 16 ≡ 5, 25 ≡ 3, 36 ≡ 3
With M = 13:
0, 1, 4, 9, 16 ≡ 3, 25 ≡ 12, 36 ≡ 10, 49 ≡ 10
With M = 17:
0, 1, 4, 9, 16, 25 ≡ 8, 36 ≡ 2, 49 ≡ 15, 64 ≡ 13, 81 ≡ 13
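A small check of how many distinct offsets i^2 mod M exist, for the non-prime M = 8 and the primes used in the examples. The helper name distinct_offsets is an assumption of this sketch.

```python
def distinct_offsets(M):
    """Sorted set of the values i^2 mod M over i = 0..M-1."""
    return sorted({(i * i) % M for i in range(M)})


for M in (8, 11, 13, 17):
    offsets = distinct_offsets(M)
    print(M, len(offsets), offsets)
# 8 3 [0, 1, 4]
# 11 6 [0, 1, 3, 4, 5, 9]
# 13 7 [0, 1, 3, 4, 9, 10, 12]
# 17 9 [0, 1, 2, 4, 8, 9, 13, 15, 16]
```

For each prime M the count is (M + 1)/2, so quadratic probing reaches just over half of the bins; with M = 8 it reaches only three.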
Quadratic Probing
Thus, quadratic probing avoids primary
clustering
Unfortunately, we are not guaranteed
that we will use all the bins
In reality, if the hash function is
reasonable, this is not a significant
problem until the load factor λ approaches 1
Secondary Clustering
The phenomenon of primary clustering
will not occur with quadratic probing
However, if multiple items all hash to the
same initial bin, the same sequence of
numbers will be followed
This is termed secondary clustering
The effect is less significant than that of
primary clustering
Double Hashing
Use two hash functions
If M is prime, eventually will examine every
position in the table
double_hash_insert(K)
if(table is full) error
probe = h1(K)
offset = h2(K)
while (table[probe] occupied)
probe = (probe + offset) mod M
table[probe] = K
Double Hashing
Many of same (dis)advantages as linear
probing
Distributes keys more uniformly than
linear probing does
Notes:
h2(x) should never return zero.
M should be prime.
Double Hashing Example
h1(K) = K mod 13
h2(K) = 8 - K mod 8
we want h2 to be an offset to add
Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order
cell:  0  1  2  3  4  5  6  7  8  9 10 11 12
key:  44  - 41 73  - 18 32 59 31 22  -  -  -
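The example can be replayed by turning the double_hash_insert pseudocode into Python. This sketch assumes M = 13, h1(K) = K mod 13, h2(K) = 8 - (K mod 8), and a table that stores bare integer keys.

```python
M = 13


def h1(k):
    return k % M


def h2(k):
    return 8 - (k % 8)   # always in 1..8, so never zero


def double_hash_insert(table, k):
    if all(cell is not None for cell in table):
        raise RuntimeError("table is full")
    probe = h1(k)
    offset = h2(k)
    # M is prime and 0 < offset < M, so the probes visit every cell
    # and the loop ends as soon as a free one is reached.
    while table[probe] is not None:
        probe = (probe + offset) % M
    table[probe] = k
    return probe


table = [None] * M
for key in (18, 41, 22, 44, 59, 32, 31, 73):
    double_hash_insert(table, key)

print(table)
# [44, None, 41, 73, None, 18, 32, 59, 31, 22, None, None, None]
```

Keys 18, 44, and 31 all start at cell 5 but use offsets 6, 4, and 1 respectively, so they follow different probe sequences.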
Open Addressing Summary
In general, the hash function now takes two
arguments:
Key value
Probe number
h(k,p), p=0,1,...,m-1
Probe sequences
<h(k,0), h(k,1), ..., h(k,m-1)>
Should be a permutation of <0,1,...,m-1>
There are m! possible permutations
Good hash functions should be able to produce
all m! probe sequences
Open Addressing Summary
None of the methods discussed can generate
more than m^2 different probing sequences.
Linear Probing:
Clearly, only m probe sequences.
Quadratic Probing:
The initial key determines a fixed probe sequence,
so only m distinct probe sequences.
Double Hashing
Each possible pair (h1(k), h2(k)) yields a distinct
probe sequence, so m^2 permutations.
Choosing A Hash Function
Clearly, choosing the hash function well
is crucial.
What will a worst-case hash function do?
What will be the time to search in this case?
What are desirable features of the hash
function?
Should distribute keys uniformly into slots
Should not depend on patterns in the data
From Keys to Indices
A hash function is usually the composition of
two maps:
hash code map: key → integer
compression map: integer → [0, N - 1]
An essential requirement of the hash function
is to map equal keys to equal indices
A “good” hash function minimizes the
probability of collisions
Java Hash
Java provides a hashCode() method for the Object class, which
typically returns the 32-bit memory address of the object.
Note that this is NOT the final hash key or hash function.
This maps data to the universe, U, of 32-bit integers.
There is still a hash function for the HashTable.
Unfortunately, it is x mod N, and N is usually a power of 2.
The hashCode() method should be suitably redefined for structs.
If your dictionary is Integers with Java (or probably .NET), the
default hash function is horrible. At a minimum, set the initial
capacity to a large prime (but Java will reset this too!).
Note: we have access to the Java source, so we can determine this. My guess
is that it is just as bad in .NET, but we cannot look at the source.
Popular Hash-Code Maps
Integer cast: for numeric types with 32
bits or less, we can reinterpret the bits of
the number as an int
Component sum: for numeric types with
more than 32 bits (e.g., long and
double), we can add the 32-bit
components.
We need to do this to avoid all of our set of
longs hashing to the same 32-bit integer.
Popular Hash-Code Maps
Polynomial accumulation: for strings of
a natural language, combine the
character values (ASCII or Unicode)
a_0 a_1 ... a_(n-1) by viewing them as the
coefficients of a polynomial:
a_0 + a_1 x + ... + a_(n-1) x^(n-1)
Popular Hash-Code Maps
The polynomial is computed with Horner’s
rule, ignoring overflows, at a fixed value x:
a_0 + x (a_1 + x (a_2 + ... x (a_(n-2) + x a_(n-1)) ... ))
The choice x = 33, 37, 39, or 41 gives at
most 6 collisions on a vocabulary of 50,000
English words
Java uses 31.
Why is the component-sum hash code
bad for strings?
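A hedged sketch comparing the two hash codes on strings. It assumes x = 33 and simulates "ignoring overflows" by masking to 32 unsigned bits. The function names are illustrative.

```python
def polynomial_hash(s, x=33):
    """a_0 + a_1 x + ... + a_(n-1) x^(n-1) by Horner's rule, kept to 32 bits."""
    h = 0
    for ch in reversed(s):               # start from a_(n-1), work inward
        h = (h * x + ord(ch)) & 0xFFFFFFFF
    return h


def component_sum(s):
    """Sum of the character values, kept to 32 bits."""
    return sum(ord(ch) for ch in s) & 0xFFFFFFFF


for word in ("stop", "pots", "tops"):
    print(word, component_sum(word), polynomial_hash(word))
```

The component sum ignores character order, so anagrams such as "stop", "pots", and "tops" all receive the same code; the polynomial weights each position differently and separates them.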
Random Hashing
Random hashing
Uses a simple random number generation
technique
Scatters the items “randomly” throughout
the hash table
Popular Compression Maps
Division: h(k) = |k| mod N
the choice N = 2^m is bad because not all the bits are
taken into account
the table size N should be a prime number
certain patterns in the hash codes are propagated
Multiply, Add, and Divide (MAD):
h(k) = |ak + b| mod N
eliminates patterns provided a mod N ≠ 0
same formula used in linear congruential (pseudo)
random number generators
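A small sketch of both compression maps on hash codes that share a pattern (all multiples of 16). The constants a = 7 and b = 3 are arbitrary values chosen for illustration.

```python
def division(k, N):
    return abs(k) % N


def mad(k, N, a, b):
    assert a % N != 0, "a mod N must be nonzero"
    return abs(a * k + b) % N


codes = [16 * i for i in range(8)]   # hash codes whose low 4 bits are all zero

print([division(k, 16) for k in codes])  # [0, 0, 0, 0, 0, 0, 0, 0]
print([division(k, 13) for k in codes])  # [0, 3, 6, 9, 12, 2, 5, 8]
print([mad(k, 13, 7, 3) for k in codes])
```

With N = 16 only the low four bits of each code are used, so every key collides; with the prime N = 13 the same codes spread over eight different cells.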
The Division Method
h(k) = k mod m
In words: hash k into a table with m slots using
the slot given by the remainder of k divided by m
What happens to elements with adjacent
values of k?
What happens if m is a power of 2 (say 2^P)?
What if m is a power of 10?
Upshot: pick table size m = prime number
not too close to a power of 2 (or 10)
The Multiplication Method
For a constant A, 0 < A < 1:
h(k) = ⌊m (kA - ⌊kA⌋)⌋
What does this term represent?
The Multiplication Method
For a constant A, 0 < A < 1:
h(k) = ⌊m (kA - ⌊kA⌋)⌋
kA - ⌊kA⌋ is the fractional part of kA
Choose m = 2^P
Choose A not too close to 0 or 1
Knuth: Good choice for A = (√5 - 1)/2
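A minimal floating-point sketch of the multiplication method, assuming A = (√5 - 1)/2. Practical implementations usually use fixed-point integer arithmetic instead; the function name is illustrative.

```python
import math

A = (math.sqrt(5) - 1) / 2   # about 0.618


def multiplication_hash(k, m):
    frac = (k * A) % 1.0     # fractional part of kA
    return int(m * frac)     # floor, since m * frac is non-negative


print([multiplication_hash(k, 16) for k in (1, 2, 3)])  # [9, 3, 13]
print(multiplication_hash(123456, 2 ** 14))             # 67
```

Consecutive keys 1, 2, 3 land in cells 9, 3, 13, so adjacent key values are scattered rather than placed in adjacent slots.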
Recap
So, we have two possible strategies for
handling collisions.
Chaining
Open Addressing
We have possible hash functions that try
to minimize the probability of collisions.
What is the algorithmic complexity?
Analysis of Chaining
Assume simple uniform hashing: each
key in table is equally likely to be hashed
to any slot.
Given n keys and m slots in the table:
the load factor α = n/m = average # keys
per slot.
Analysis of Chaining
What will be the average cost of an
unsuccessful search for a key?
O(1 + α)
Analysis of Chaining
What will be the average cost of a
successful search?
O(1 + α/2) = O(1 + α)
Analysis of Chaining
So the cost of searching = O(1 + α)
If the number of keys n is proportional to
the number of slots in the table, what is
α?
A: α = O(1)
In other words, we can make the expected
cost of searching constant if we make
α constant
Analysis of Open Addressing
Consider the load factor, α, and assume each key is
uniformly hashed.
Probability that we hit an occupied cell is then α.
Probability that the next probe hits an occupied
cell is also α.
Will terminate if an unoccupied cell is hit: (1 - α).
From Theorem 11.6, the expected number of probes
in an unsuccessful search is at most 1/(1 - α).
Theorem 11.8: Expected number of probes in a
successful search is at most:
(1/α) ln(1/(1 - α))
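To get a feel for the two bounds, the sketch below evaluates them at a few load factors. It takes the unsuccessful-search bound to be 1/(1 - α) and the successful-search bound to be (1/α) ln(1/(1 - α)); the function names are illustrative.

```python
import math


def unsuccessful_probes(alpha):
    """Upper bound on expected probes in an unsuccessful search."""
    return 1 / (1 - alpha)


def successful_probes(alpha):
    """Upper bound on expected probes in a successful search."""
    return (1 / alpha) * math.log(1 / (1 - alpha))


for alpha in (0.25, 0.5, 0.75, 0.9):
    print(f"alpha={alpha:.2f}  "
          f"unsuccessful <= {unsuccessful_probes(alpha):.2f}  "
          f"successful <= {successful_probes(alpha):.2f}")
```

At α = 0.5 the bounds are 2 probes and about 1.39 probes; at α = 0.9 they grow to 10 and about 2.56, which is why open addressing is usually kept well below a full table.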