Understanding Hash Tables and Functions

The document provides an introduction to hash tables, discussing their motivation, structure, and operations such as insertion, deletion, and search. It explains the concept of hash functions, collisions, and techniques for collision resolution including chaining and open addressing. Additionally, it covers probing methods like linear and quadratic probing to manage collisions effectively.


Introduction to Algorithms

Hash Tables

CSE 680
Prof. Roger Crawfis
Motivation
 Arrays provide an indirect way to access a set.
 Many times we need an association between two
sets, or a set of keys and associated data.
 Ideally we would like to access this data directly with
the keys.
 We would like a data structure that supports fast
search, insertion, and deletion.
 Do not usually care about sorting.
 The abstract data type is usually called a Dictionary,
Map, or Partial Map.
 float googleStockPrice = stocks[“Goog”].CurrentPrice;
Dictionaries
 What is the best way to implement this?
 Linked Lists?
 Double Linked Lists?
 Queues?
 Stacks?
 Multiple indexed arrays (e.g., data[key[i]])?
 To answer this, ask what the complexity of the
operations is:
 Insertion
 Deletion

 Search
Direct Addressing
 Let’s look at an easy case, suppose:
 The range of keys is 0..m-1
 Keys are distinct

 Possible solution
 Set up an array T[0..m-1] in which
 T[i] = x if x ∈ T and key[x] = i
 T[i] = NULL otherwise

 This is called a direct-address table


 Operations take O(1) time!
 So what’s the problem?
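The direct-address table above can be sketched in a few lines. This is an illustrative Python sketch (the class and method names are my own, not from the slides), assuming distinct integer keys in the range 0..m-1:

```python
# A minimal direct-address table: one slot per possible key 0..m-1.
# Every operation is a single array access, hence O(1).

class DirectAddressTable:
    def __init__(self, m):
        self.table = [None] * m  # T[i] = x if key[x] = i, else None

    def insert(self, key, value):
        self.table[key] = value

    def delete(self, key):
        self.table[key] = None

    def search(self, key):
        return self.table[key]

t = DirectAddressTable(10)
t.insert(3, "three")
print(t.search(3))   # three
print(t.search(4))   # None
```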
Direct Addressing
 Direct addressing works well when the
range m of keys is relatively small
 But what if the keys are 32-bit integers?
 Problem 1: the direct-address table will have
2³² entries, more than 4 billion
 Problem 2: even if memory is not an issue, the
time to initialize the elements to NULL may be
prohibitive
 Solution: map keys to a smaller range 0..p-1
 Desire p = O(m).
Hash Table
 Hash Tables provide O(1) support for all
of these operations!
 The key idea is that rather than indexing an
array directly, we index it through some function,
h(x), called a hash function.
 myArray[ h(index) ]
 Key questions:
 What is the set that the x comes from?
 What is h() and what is its range?
Hash Table

 Consider this problem:


 If I know a priori the p keys from some finite
set U, is it possible to develop a function
h(x) that will uniquely map the p keys onto
the set of numbers 0..p-1?
Hash Functions
 In general a difficult problem. Try something simpler.

[Figure: keys k1–k5 from the universe U of keys mapped by h into table slots 0..p-1; note that h(k2) = h(k5).]
Hash Functions
 A collision occurs when h(x) maps two keys to the
same location.
[Figure: the same mapping, with the collision h(k2) = h(k5) highlighted.]
Hash Functions
 A hash function, h, maps keys of a given type to
integers in a fixed interval [0, N - 1]
 Example:
h(x) = x mod N
is a hash function for integer keys
 The integer h(x) is called the hash value of x.
 A hash table for a given key type consists of
 Hash function h
 Array (called table) of size N
 The goal is to store item (k, o) at index i = h(k)
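The hash function above is simple to write down. A minimal sketch for integer keys; the table size 13 is an arbitrary choice for illustration:

```python
# The slide's hash function for integer keys: h(x) = x mod N.
N = 13
def h(x):
    return x % N

# Item (k, o) is stored at index i = h(k).
print(h(18))  # 5
print(h(41))  # 2
print(h(31))  # 5 -- distinct keys can share a hash value: a collision
```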
Example
 We design a hash table storing employee
records using their social security number,
SSN, as the key.
 SSN is a nine-digit positive integer.
 Our hash table uses an array of size N = 10,000
and the hash function h(x) = last four digits of x.
[Table: slots 0–9999; e.g., 025-612-0001 at slot 1,
981-101-0002 at slot 2, 451-229-0004 at slot 4,
200-751-9998 at slot 9998; other slots empty.]
Example
 Suppose our hash table uses an array of size
N = 100 and we have n = 49 employees.
 We need a method to handle collisions.
 As long as the chance for collision is low, we
can achieve this goal.
 Setting N = 1000 and looking at the last four
digits will reduce the chance of collision.
[Table: as before; note that 200-751-9998 and
176-354-9998 share the last four digits, 9998 — a collision.]
Collisions
 Can collisions be avoided?
 If my data is immutable, yes
 See perfect hashing for the case where the set of keys is
static (not covered).
 In general, no.
 Two primary techniques for resolving
collisions:
 Chaining – keep a collection at each key slot.
 Open addressing – if the current slot is full,
use the next open one.
Chaining

 Chaining puts elements that hash to the
same slot in a linked list:
[Figure: table slots holding linked lists; k1 and k4 share one chain, k5, k2, and k7 another, k3 sits alone, and k8 and k6 share a chain.]
Chaining

 How do we insert an element?

[Figure: the same chaining table as above.]
Chaining
 How do we delete an element?
 Do we need a doubly-linked list for efficient delete?

[Figure: the same chaining table as above.]
Chaining

 How do we search for an element with a
given key?
[Figure: the same chaining table as above.]
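The insert, delete, and search questions above can be answered with a short sketch. This is illustrative Python, not the course's reference implementation; the class and method names are my own:

```python
# A chained hash table: each slot holds a Python list acting as the chain.
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        return key % self.m

    def insert(self, key, value):
        # O(1): append to the chain at the hashed slot
        self.slots[self._h(key)].append((key, value))

    def search(self, key):
        # O(chain length): scan the one chain the key can be in
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False

t = ChainedHashTable(13)
t.insert(18, "a")
t.insert(31, "b")    # 18 and 31 both hash to slot 5 and share a chain
print(t.search(31))  # b
```

Using a Python list for the chain sidesteps the doubly-linked-list question: deletion from the middle is handled by the list itself.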
Open Addressing
 Basic idea:
 To insert: if slot is full, try another slot, …, until
an open slot is found (probing)
 To search, follow the same sequence of probes as
would be used when inserting the element:
 If we reach an element with the correct key, return it
 If we reach a NULL pointer, the element is not in the table
 Good for fixed sets (adding but no deletion)
 Example: spell checking
Open Addressing
 The colliding item is placed in a
different cell of the table.
 No dynamic memory.
 Fixed Table size.

 Load factor: n/N, where n is the number
of items to store and N the size of the hash
table.
 Clearly, n ≤ N, or n/N ≤ 1.
 To get reasonable performance, n/N < 0.5.
Probing

 The key question is: what should the
next cell to try be?
 Random would be great, but we need to
be able to repeat it.
 Three common techniques:
 Linear Probing (useful for discussion only)
 Quadratic Probing

 Double Hashing
Linear Probing
 Linear probing handles collisions by placing the
colliding item in the next (circularly) available table cell.
 Each table cell inspected is referred to as a probe.
 Colliding items lump together, causing future
collisions to cause a longer sequence of probes.
 Example:
 h(x) = x mod 13
 Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order:

index: 0  1  2  3  4  5  6  7  8  9  10 11 12
key:         41        18 44 59 32 22 31 73
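The example above can be replayed in code. A minimal sketch of linear-probing insertion (assuming the table never fills, which holds for these eight keys in thirteen slots):

```python
# Linear probing insertion with h(x) = x mod 13, reproducing the slide's example.
N = 13
table = [None] * N

def insert(key):
    i = key % N
    while table[i] is not None:   # probe the next (circular) cell
        i = (i + 1) % N
    table[i] = key

for k in [18, 41, 22, 44, 59, 32, 31, 73]:
    insert(k)

print(table)
# [None, None, 41, None, None, 18, 44, 59, 32, 22, 31, 73, None]
```

Note how 44, 32, 31, and 73 all land one or more cells past their home slot: the cluster around slots 5–11 is exactly the lumping the slide describes.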
Search with Linear Probing
 Consider a hash table A that uses linear probing.
 get(k):
 We start at cell h(k).
 We probe consecutive locations until one of the
following occurs:
 An item with key k is found, or
 An empty cell is found, or
 N cells have been unsuccessfully probed.
 To ensure efficiency, if k is not in the table, we want to
find an empty cell as soon as possible. The load factor
can NOT be close to 1.

Algorithm get(k)
  i ← h(k)
  p ← 0
  repeat
    c ← A[i]
    if c = ∅
      return null
    else if c.key() = k
      return c.element()
    else
      i ← (i + 1) mod N
      p ← p + 1
  until p = N
  return null
Linear Probing
 Example:
 h(x) = x mod 13
 Insert keys 18, 41, 22, 44, 59, 32, 31, 73, 12, 20
in this order:

index: 0  1  2  3  4  5  6  7  8  9  10 11 12
key:   20       41        18 44 59 32 22 31 73 12

 Search for key = 20:
 h(20) = 20 mod 13 = 7.
 Go through ranks 8, 9, ..., 12, 0; find 20 at rank 0.
 Search for key = 15:
 h(15) = 15 mod 13 = 2.
 Go through ranks 2, 3 and return null.
Updates with Linear Probing
 To handle insertions and deletions, we introduce a
special object, called AVAILABLE, which replaces
deleted elements.
 remove(k):
 We search for an entry with key k.
 If such an entry (k, o) is found, we replace it with the
special item AVAILABLE and we return element o.
 Have to modify other methods to skip AVAILABLE cells.
 put(k, o):
 We throw an exception if the table is full.
 We start at cell h(k).
 We probe consecutive cells until one of the following
occurs:
 A cell i is found that is either empty or stores
AVAILABLE, or
 N cells have been unsuccessfully probed.
 We store entry (k, o) in cell i.
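The remove/put scheme above can be sketched with a tombstone sentinel standing in for AVAILABLE. The names are my own, and a real implementation would also track the load factor:

```python
# Open addressing with a tombstone marker (the slide's AVAILABLE object).
AVAILABLE = object()   # sentinel replacing deleted entries
N = 13
A = [None] * N

def put(key, value):
    i = key % N
    for _ in range(N):
        if A[i] is None or A[i] is AVAILABLE:   # reuse tombstoned cells
            A[i] = (key, value)
            return
        i = (i + 1) % N
    raise RuntimeError("table full")

def get(key):
    i = key % N
    for _ in range(N):
        if A[i] is None:
            return None            # truly empty cell: key is absent
        if A[i] is not AVAILABLE and A[i][0] == key:
            return A[i][1]
        i = (i + 1) % N            # skip AVAILABLE cells and mismatches
    return None

def remove(key):
    i = key % N
    for _ in range(N):
        if A[i] is None:
            return None
        if A[i] is not AVAILABLE and A[i][0] == key:
            value = A[i][1]
            A[i] = AVAILABLE       # leave a tombstone; don't break probe chains
            return value
        i = (i + 1) % N
    return None

put(18, "a")
put(31, "b")        # 31 also hashes to slot 5, probes to 6
remove(18)          # slot 5 becomes AVAILABLE
print(get(31))      # b -- found because the search skips the tombstone
```

If remove simply set the cell back to None, the search for 31 would stop at the empty slot 5 and wrongly report it missing; the tombstone is what keeps probe chains intact.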
Quadratic Probing

 Primary clustering occurs with linear
probing because of its fixed linear pattern:
 if a bin is inside a cluster, then the next bin
must either:
 also be in that cluster, or
 expand the cluster
 Instead of searching forward in a linear
fashion, try to jump far enough out of the
current (unknown) cluster.
Quadratic Probing

 Suppose that an element should appear
in bin h:
 if bin h is occupied, then check the following
sequence of bins:
h + 1², h + 2², h + 3², h + 4², h + 5², ...
i.e., h + 1, h + 4, h + 9, h + 16, h + 25, ...
 For example, with M = 17:
[Figure not reproduced.]
Quadratic Probing

 If one of h + i² falls into a cluster, this
does not imply the next one will
Quadratic Probing

 For example, suppose an element was


to be inserted in bin 23 in a hash table
with 31 bins
 The sequence in which the bins would
be checked is:
23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0
Quadratic Probing

 Even if two bins are initially close, the


sequence in which subsequent bins are
checked varies greatly
 Again, with M = 31 bins, compare the
first 16 bins which are checked starting
with 22 and 23:

22, 23, 26, 0, 7, 16, 27, 9, 24, 10, 29, 19, 11, 5, 1, 30
23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0
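The two probe sequences above are easy to regenerate; `quadratic_sequence` is a hypothetical helper name, not from the slides:

```python
# Quadratic probe sequence h + i^2 mod M, reproducing the slide's M = 31 example.
def quadratic_sequence(h, M, count):
    return [(h + i * i) % M for i in range(count)]

print(quadratic_sequence(22, 31, 16))
# [22, 23, 26, 0, 7, 16, 27, 9, 24, 10, 29, 19, 11, 5, 1, 30]
print(quadratic_sequence(23, 31, 16))
# [23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0]
```

Starting bins 22 and 23 are adjacent, yet after the first two probes their sequences share almost nothing: that divergence is what defeats primary clustering.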
Quadratic Probing

 Thus, quadratic probing solves the


problem of primary clustering
 Unfortunately, there is a second problem
which must be dealt with
 Suppose we have M = 8 bins:
1² ≡ 1, 2² ≡ 4, 3² ≡ 1 (mod 8)
 In this case, we are checking bin h + 1
twice having checked only one other bin
Quadratic Probing

 Unfortunately, there is no guarantee that
h + i² mod M
will cycle through 0, 1, ..., M – 1
 Solution:
 require that M be prime
 in this case, h + i² mod M for i = 0, ..., (M – 1)/2
will cycle through exactly (M + 1)/2
values before repeating
Quadratic Probing

 Example with M = 11:
0, 1, 4, 9, 16 ≡ 5, 25 ≡ 3, 36 ≡ 3
 With M = 13:
0, 1, 4, 9, 16 ≡ 3, 25 ≡ 12, 36 ≡ 10, 49 ≡ 10
 With M = 17:
0, 1, 4, 9, 16, 25 ≡ 8, 36 ≡ 2, 49 ≡ 15, 64 ≡ 13, 81 ≡ 13
Quadratic Probing

 Thus, quadratic probing avoids primary


clustering
 Unfortunately, we are not guaranteed
that we will use all the bins
 In reality, if the hash function is
reasonable, this is not a significant
problem until the load factor approaches 1
Secondary Clustering

 The phenomenon of primary clustering


will not occur with quadratic probing
 However, if multiple items all hash to the
same initial bin, the same sequence of
numbers will be followed
 This is termed secondary clustering
 The effect is less significant than that of
primary clustering
Double Hashing
 Use two hash functions
 If M is prime, eventually will examine every
position in the table
 double_hash_insert(K)
    if (table is full) error
    probe = h1(K)
    offset = h2(K)
    while (table[probe] occupied)
      probe = (probe + offset) mod M
    table[probe] = K
Double Hashing

 Many of same (dis)advantages as linear


probing
 Distributes keys more uniformly than
linear probing does
 Notes:
 h2(x) should never return zero.
 M should be prime.
Double Hashing Example
 h1(K) = K mod 13
 h2(K) = 8 − (K mod 8)
 we want h2 to be an offset to add
 Insert keys 18, 41, 22, 44, 59, 32, 31, 73 in this order:

index: 0  1  2  3  4  5  6  7  8  9  10 11 12
key:   44       41 73       18 32 59 31 22
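Running the pseudocode from the earlier slide on this example reproduces the final table. A Python sketch; note that h2(K) = 8 − (K mod 8) lies in 1..8, so the offset is never zero, and M = 13 being prime guarantees the probe walk visits every cell:

```python
# Double hashing insertion, reproducing the slide's example:
# h1(K) = K mod 13, h2(K) = 8 - (K mod 8).
M = 13
table = [None] * M

def h1(K): return K % M
def h2(K): return 8 - (K % 8)   # always in 1..8, never 0

def double_hash_insert(K):
    probe = h1(K)
    offset = h2(K)
    while table[probe] is not None:
        probe = (probe + offset) % M
    table[probe] = K

for K in [18, 41, 22, 44, 59, 32, 31, 73]:
    double_hash_insert(K)

print(table)
# [44, None, 41, 73, None, 18, 32, 59, 31, 22, None, None, None]
```

For instance 73 hashes to slot 8 (occupied), then steps by h2(73) = 7 through slots 2 and 9 (both occupied) before settling at slot 3.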
Open Addressing Summary
 In general, the hash function contains two
arguments now:
 Key value
 Probe number

h(k,p), p=0,1,...,m-1
 Probe sequences
<h(k,0), h(k,1), ..., h(k,m-1)>
 Should be a permutation of <0,1,...,m-1>
 There are m! possible permutations
 Good hash functions should be able to produce
all m! probe sequences
Open Addressing Summary
 None of the methods discussed can generate
more than m² different probe sequences.
 Linear Probing:
 Clearly, only m probe sequences.
 Quadratic Probing:
 The initial key determines a fixed probe sequence,
so only m distinct probe sequences.
 Double Hashing:
 Each possible pair (h1(k), h2(k)) yields a distinct
probe sequence, so m² sequences.
Choosing A Hash Function

 Clearly, choosing the hash function well
is crucial.
 What will a worst-case hash function do?
 What will be the time to search in this case?

 What are desirable features of the hash


function?
 Should distribute keys uniformly into slots
 Should not depend on patterns in the data
From Keys to Indices
 A hash function is usually the composition of
two maps:
 hash code map: key → integer
 compression map: integer → [0, N - 1]

 An essential requirement of the hash function


is to map equal keys to equal indices
 A “good” hash function minimizes the
probability of collisions
Java Hash
 Java provides a hashCode() method for the Object class, which
typically returns the 32-bit memory address of the object.
 Note that this is NOT the final hash key or hash function.
 This maps data to the universe, U, of 32-bit integers.
 There is still a hash function for the HashTable.
 Unfortunately, it is x mod N, and N is usually a power of 2.
 The hashCode() method should be suitably redefined for structs.
 If your dictionary is Integers with Java (or probably .NET), the
default hash function is horrible. At a minimum, set the initial
capacity to a large prime (but Java will reset this too!).

Note, we have access to the Java source, so we can determine this. My guess
is that it is just as bad in .NET, but we cannot look at the source.
Popular Hash-Code Maps

 Integer cast: for numeric types with 32


bits or less, we can reinterpret the bits of
the number as an int
 Component sum: for numeric types with
more than 32 bits (e.g., long and
double), we can add the 32-bit
components.
 We need to do this to avoid all of our set of
longs hashing to the same 32-bit integer.
Popular Hash-Code Maps

 Polynomial accumulation: for strings of
a natural language, combine the
character values (ASCII or Unicode)
a₀ a₁ ... aₙ₋₁ by viewing them as the
coefficients of a polynomial:
a₀ + a₁x + ... + aₙ₋₁xⁿ⁻¹
Popular Hash-Code Maps
 The polynomial is computed with Horner’s
rule, ignoring overflows, at a fixed value x:
a0 + x (a1 + x (a2 + ... x (an-2 + x an-1 ) ... ))
 The choice x = 33, 37, 39, or 41 gives at
most 6 collisions on a vocabulary of 50,000
English words
 Java uses 31.

 Why is the component-sum hash code


bad for strings?
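The two hash-code maps can be compared directly. A sketch with Java's multiplier x = 31 and a 32-bit mask standing in for int overflow; the anagram pair shows why the component sum is bad for strings:

```python
# Polynomial hash code via Horner's rule, with multiplier x = 31
# (Java's String.hashCode choice), masked to 32 bits to mimic int overflow.
def polynomial_hash(s, x=31):
    h = 0
    for ch in s:
        h = (h * x + ord(ch)) & 0xFFFFFFFF   # ignore overflow past 32 bits
    return h

# A component sum ignores character order, so all anagrams collide:
def component_sum(s):
    return sum(ord(ch) for ch in s)

print(component_sum("stop") == component_sum("pots"))      # True: collision
print(polynomial_hash("stop") == polynomial_hash("pots"))  # False
```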
Random Hashing

 Random hashing
 Uses a simple random number generation
technique
 Scatters the items “randomly” throughout

the hash table


Popular Compression Maps
 Division: h(k) = |k| mod N
 the choice N = 2^m is bad because not all the bits are
taken into account
 the table size N should be a prime number
 certain patterns in the hash codes are propagated
 Multiply, Add, and Divide (MAD):
 h(k) = |ak + b| mod N
 eliminates patterns provided a mod N ≠ 0
 same formula used in linear congruential (pseudo)
random number generators
The Division Method
 h(k) = k mod m
 In words: hash k into a table with m slots using
the slot given by the remainder of k divided by m
 What happens to elements with adjacent
values of k?
 What happens if m is a power of 2 (say 2^p)?
 What if m is a power of 10?
 Upshot: pick table size m = a prime number
not too close to a power of 2 (or 10)
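The power-of-2 question above can be answered concretely: k mod 2^p keeps only the low p bits, so keys that agree in those bits always collide. A sketch with made-up keys:

```python
# Why m = 2^p is a poor table size: k mod 2^p keeps only the low p bits,
# so keys differing only in high bits all land in the same slot.
m = 2 ** 4   # 16 slots

keys = [0b0000_0101, 0b0101_0101, 0b1111_0101]  # same low 4 bits
print([k % m for k in keys])   # [5, 5, 5] -- every key collides in slot 5

# A prime table size mixes in the high bits too:
m_prime = 13
print([k % m_prime for k in keys])   # [5, 7, 11]
```

The same argument applies to powers of 10: k mod 10^p keeps only the last p decimal digits.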
The Multiplication Method

 For a constant A, 0 < A < 1:
 h(k) = ⌊m (kA − ⌊kA⌋)⌋
What does the term (kA − ⌊kA⌋) represent?
The Multiplication Method

 For a constant A, 0 < A < 1:
 h(k) = ⌊m (kA − ⌊kA⌋)⌋
(kA − ⌊kA⌋) is the fractional part of kA
 Choose m = 2^p
 Choose A not too close to 0 or 1
 Knuth: Good choice is A = (√5 − 1)/2
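A sketch of the multiplication method with Knuth's suggested constant. Exact hash values depend on floating-point rounding, so no particular outputs are claimed here:

```python
# The multiplication method: h(k) = floor(m * frac(k * A)),
# with Knuth's suggestion A = (sqrt(5) - 1) / 2 and m = 2^p.
import math

A = (math.sqrt(5) - 1) / 2   # ~0.618, the golden ratio conjugate
m = 2 ** 10                  # m = 2^p is fine here, unlike the division method

def h(k):
    frac = (k * A) % 1.0     # fractional part of k*A
    return int(m * frac)

print([h(k) for k in [1, 2, 3, 4]])  # adjacent keys scatter widely
```

Unlike division, the value of m is not critical here, which is why a convenient power of 2 can be used.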
Recap

 So,
we have two possible strategies for
handling collisions.
 Chaining

 Open Addressing
 We have possible hash functions that try
to minimize the probability of collisions.

 What is the algorithmic complexity?


Analysis of Chaining

 Assume simple uniform hashing: each


key in table is equally likely to be hashed
to any slot.
 Given n keys and m slots in the table:
the load factor α = n/m = average # keys
per slot.
Analysis of Chaining

 What will be the average cost of an


unsuccessful search for a key?

 O(1+)
Analysis of Chaining

 What will be the average cost of a


successful search?

 O(1 + /2) = O(1 + )


Analysis of Chaining

 So the cost of searching = O(1 + α)
 If the number of keys n is proportional to
the number of slots in the table, what is α?
 A: α = O(1)
 In other words, we can make the expected
cost of searching constant if we make α
constant
Analysis of Open Addressing
 Consider the load factor, α, and assume each key is
uniformly hashed.
 Probability that we hit an occupied cell is then α.
 Probability that the next probe hits an occupied
cell is also α.
 Will terminate if an unoccupied cell is hit: (1 − α).
 From Theorem 11.6, the expected number of probes
in an unsuccessful search is at most 1/(1 − α).
 Theorem 11.8: Expected number of probes in a
successful search is at most:
(1/α) ln(1/(1 − α))
