Hashing PDF

The document provides an overview of hash tables and their main components. It discusses how a hash table uses a hash function to map keys to array indices in constant average time to allow for efficient search, insertion and deletion. The key aspects covered are: - Hash tables consist of an array and a hash function that maps keys to indices. Collisions occur when different keys map to the same index. - Collision resolution techniques include chaining, where collided keys are stored in linked lists at indices, and open addressing, where keys are stored elsewhere in the table. - Linear probing is one open addressing approach where the probe sequence is the initial index plus increments. This can lead to clustering which hurts performance.

Uploaded by

ImogenDyer


Hashing & Hash Tables

Cpt S 223. School of EECS, WSU

Overview

Hash Table Data Structure: Purpose

To support insertion, deletion and search in average-case constant time

Assumption: Order of elements irrelevant
==> data structure *not* useful if you want to maintain and retrieve some kind of an order of the elements

Hash function

Hash[ string key ] ==> integer value

Hash table ADT

Implementations, Analysis, Applications

Hash table: Main components

[Figure: a key (e.g., "john") is passed to the hash function, which produces the hash index h("john") into the hash table (implemented as a vector of TableSize entries, each holding a key and a value). How to determine the hash function?]

Hash Table

Hash table is an array of fixed size TableSize
  Each array element holds a key and a value

Array elements indexed by a key, which is mapped to an array index (0...TableSize-1)

Mapping (hash function) h from key to index
  E.g., h("john") = 3

Hash Table Operations

Insert
  T[h("john")] = <"john", 25000>
  (the hash key "john" goes through the hash function; <"john", 25000> is the data record)

Delete
  T[h("john")] = NULL

Search
  T[h("john")] returns the element hashed for "john"

What happens if h("john") == h("joe")?
  ==> "collision"

Factors affecting Hash Table Design

Hash function

Table size
  Usually fixed at the start

Collision handling scheme

Hash Function

A hash function is one which maps an element's key into a valid hash table index
  h(key) => hash table index

Note that this is (slightly) different from saying:
  h(string) => int
  Because the key can be of any type
  E.g., "h(int) => int" is also a hash function!

But also note that any type can be converted into an equivalent string form

Hash Function Properties

h(key) ==> hash table index

A hash function maps key to integer
  Constraint: Integer should be between [0, TableSize-1]

A hash function can result in a many-to-one mapping (causing collision)
  Collision occurs when hash function maps two or more keys to same array index

Collisions cannot be avoided, but their chances can be reduced using a good hash function

Hash Function Properties

A "good" hash function should have the properties:

1. Reduced chance of collision
   Different keys should ideally map to different indices
   Distribute keys uniformly over table

2. Should be fast to compute

Hash Function - Effective use of table size

Simple hash function (assume integer keys)
  h(Key) = Key mod TableSize

For random keys, h() distributes keys evenly over table

What if TableSize = 100 and keys are ALL multiples of 10?
  Better if TableSize is a prime number
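The question about multiples of 10 can be checked with a small sketch. It assumes non-negative integer keys; the names hashInt, distinctIndices and multiplesOfTen are illustrative and not from the slides. With TableSize = 100 the keys 0, 10, 20, ... reach only 10 of the 100 slots, while with the prime size 101 the first 100 multiples of 10 land in 100 different slots.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// h(Key) = Key mod TableSize, for non-negative integer keys.
std::size_t hashInt(unsigned key, std::size_t tableSize) {
    assert(tableSize > 0);
    return key % tableSize;
}

// Number of distinct table indices reached by a collection of keys.
std::size_t distinctIndices(const std::vector<unsigned>& keys, std::size_t tableSize) {
    std::set<std::size_t> used;
    for (unsigned k : keys) used.insert(hashInt(k, tableSize));
    return used.size();
}

// Hypothetical workload: the first `count` multiples of 10 (0, 10, 20, ...).
std::vector<unsigned> multiplesOfTen(unsigned count) {
    std::vector<unsigned> keys;
    for (unsigned i = 0; i < count; ++i) keys.push_back(10 * i);
    return keys;
}
```

Calling distinctIndices(multiplesOfTen(100), 100) and distinctIndices(multiplesOfTen(100), 101) shows the difference between the two table sizes.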

Different Ways to Design a Hash Function for String Keys

A very simple function to map strings to integers:
  Add up character ASCII values (0-255) to produce integer keys
  E.g., "abcd" = 97+98+99+100 = 394
  ==> h("abcd") = 394 % TableSize

Potential problems:
  Anagrams will map to the same index
    h("abcd") == h("dbac")
  Small strings may not use all of table
    Strlen(S) * 255 < TableSize
  Time proportional to length of the string
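A minimal sketch of this additive hash, to make the anagram problem concrete. The function name hashAdditive is an assumption for illustration; characters are summed as unsigned values in the range 0-255.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Adds up the character codes of the key and reduces the sum mod TableSize.
std::size_t hashAdditive(const std::string& key, std::size_t tableSize) {
    assert(tableSize > 0);
    std::size_t sum = 0;
    for (unsigned char c : key) sum += c;   // e.g. "abcd" -> 97+98+99+100 = 394
    return sum % tableSize;
}
```

Because addition ignores character order, hashAdditive("abcd", m) and hashAdditive("dbac", m) are equal for every table size m.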

Different Ways to Design a Hash Function for String Keys

Approach 2

Treat first 3 characters of string as base-27 integer (26 letters plus space)
  Key = S[0] + (27 * S[1]) + (27^2 * S[2])

Better than approach 1 because ... ?

Potential problems:
  Assumes first 3 characters randomly distributed
  Not true of English
    Apple
    Apply
    Appointment
    Apricot
    ==> collision

Different Ways to Design a Hash Function for String Keys

Approach 3

Use all N characters of string as an N-digit base-K number

Choose K to be prime number larger than number of different digits (characters)
  I.e., K = 29, 31, 37

If L = length of string S, then
$$h(S) = \left[ \sum_{i=0}^{L-1} S[L-i-1] \cdot 37^{i} \right] \bmod TableSize$$

Use Horner's rule to compute h(S)
Limit L for long strings

Problems:
  potential overflow
  larger runtime
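Horner's rule evaluates the polynomial above left to right with one multiply and one add per character. The sketch below is one possible way to write it, not the course's own code: the name hashHorner is illustrative, and reducing mod TableSize after every step is an added assumption to sidestep the overflow problem (it gives the same result as reducing once at the end, provided TableSize is small enough that hash * 37 + 255 fits in 64 bits).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Computes (sum of S[L-i-1] * 37^i) mod tableSize using Horner's rule.
std::uint64_t hashHorner(const std::string& key, std::uint64_t tableSize) {
    assert(tableSize > 0);
    std::uint64_t hash = 0;
    for (unsigned char c : key) {
        // Assumption: reduce at every step so the running value stays below tableSize.
        hash = (hash * 37 + c) % tableSize;
    }
    return hash;
}
```

Unlike the additive hash, character order matters here: "ab" gives 97*37 + 98 = 3687 before reduction, while "ba" gives 98*37 + 97 = 3723.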

Collision resolution techniques

Techniques to Deal with Collisions

Chaining
Open addressing
Double hashing
Etc.

Resolving Collisions

What happens when h(k1) = h(k2)?
  ==> collision!

Collision resolution strategies

Chaining
  Store colliding keys in a linked list at the same hash table index

Open addressing
  Store colliding keys elsewhere in the table

Chaining
Collision resolution technique #1

Chaining strategy: maintains a linked list at every hash index for collided elements

Hash table T is a vector of linked lists
  Insert element at the head (as shown here) or at the tail

Key k is stored in list at T[h(k)]

E.g., TableSize = 10
  h(k) = k mod 10
  Insert first 10 perfect squares
  Insertion sequence: { 0 1 4 9 16 25 36 49 64 81 }

Implementation of Chaining Hash Table

[Code listings; only the slide annotations are listed here]

Vector of linked lists (this is the main hashtable)
Current #elements in the hashtable
Hash functions for integers and string keys

This is the hashtable's current capacity (aka. "table size")
This is the hash table index for the element x

Duplicate check
Later, but essentially resizes the hashtable if it's getting crowded

Each of these operations takes time linear in the length of the list at the hashed index location

All hash objects must define == and != operators.
Hash function to handle Employee object type
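The annotations above can be illustrated with a short sketch of a chaining table. This is not the course's listing: it assumes non-negative int keys with h(k) = k mod TableSize, the class and member names are illustrative, and resizing is left out to keep the example small.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

class ChainingHashTable {
public:
    explicit ChainingHashTable(std::size_t tableSize = 10)
        : lists_(tableSize), size_(0) {
        assert(tableSize > 0);
    }

    // Linear in the length of the chain at the hashed index.
    bool contains(int x) const {
        const std::list<int>& chain = lists_[index(x)];
        return std::find(chain.begin(), chain.end(), x) != chain.end();
    }

    // Returns false if x is already present (duplicate check).
    bool insert(int x) {
        if (contains(x)) return false;
        lists_[index(x)].push_front(x);   // insert at the head of the chain
        ++size_;
        return true;
    }

    // Returns false if x was not in the table.
    bool remove(int x) {
        std::list<int>& chain = lists_[index(x)];
        std::list<int>::iterator it = std::find(chain.begin(), chain.end(), x);
        if (it == chain.end()) return false;
        chain.erase(it);
        --size_;
        return true;
    }

    std::size_t size() const { return size_; }                       // current #elements
    std::size_t chainLength(std::size_t i) const { return lists_[i].size(); }

private:
    // h(k) = k mod TableSize; assumes non-negative keys.
    std::size_t index(int x) const {
        assert(x >= 0);
        return static_cast<std::size_t>(x) % lists_.size();
    }

    std::vector<std::list<int> > lists_;   // vector of linked lists (the main table)
    std::size_t size_;
};
```

Inserting the first 10 perfect squares into a table of size 10 leaves chains of length 2 at indices 1, 4, 6 and 9, and empty chains at indices 2, 3, 7 and 8.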

Collision Resolution by Chaining: Analysis

Load factor λ of a hash table T is defined as follows:
  N = number of elements in T ("current size")
  M = size of T ("table size")
  λ = N/M ("load factor")
  i.e., λ is the average length of a chain

Unsuccessful search time: O(λ)
  Same for insert time

Successful search time: O(λ/2)

Ideally, want λ ≤ 1 (not a function of N)

Potential disadvantages of Chaining

Linked lists could get long
  Especially when N approaches M
  Longer linked lists could negatively impact performance

More memory because of pointers

Absolute worst-case (even if N << M):
  All N elements in one linked list!
  Typically the result of a bad hash function

Open Addressing
Collision resolution technique #2

Collision Resolution by Open Addressing

An "inplace" approach
  When a collision occurs, look elsewhere in the table for an empty slot

Advantages over chaining
  No need for list structures
  No need to allocate/deallocate memory during insertion/deletion (slow)

Disadvantages
  Slower insertion: May need several attempts to find an empty slot
  Table needs to be bigger (than chaining-based table) to achieve average-case constant-time performance
    Load factor λ ≈ 0.5

Collision Resolution by Open Addressing

A "probe sequence" is the sequence of slots examined in the hash table while searching for an element x
  h0(x), h1(x), h2(x), ...
  Needs to visit each slot exactly once
  Needs to be repeatable (so we can find/delete what we've inserted)

Hash function
  hi(x) = (h(x) + f(i)) mod TableSize
  f(0) = 0 ==> position for the 0th probe
  f(i) is "the distance to be traveled relative to the 0th probe position, during the ith probe".

Linear Probing

f(i) is a linear function of i, e.g., f(i) = i
  hi(x) = (h(x) + i) mod TableSize
  (ith probe index = 0th probe index + i)

Probe sequence: +0, +1, +2, +3, +4, ...

[Figure: linear probing. The 0th, 1st and 2nd probes land on occupied slots; the 3rd probe reaches an unoccupied slot, and x is populated there.]

Continue until an empty slot is found
#failed probes is a measure of performance

Linear Probing

f(i) is a linear function of i, e.g., f(i) = i
  hi(x) = (h(x) + i) mod TableSize
  (ith probe index = 0th probe index + i)

Probe sequence: +0, +1, +2, +3, +4, ...

Example: h(x) = x mod TableSize
  h0(89) = (h(89)+f(0)) mod 10 = 9
  h0(18) = (h(18)+f(0)) mod 10 = 8
  h0(49) = (h(49)+f(0)) mod 10 = 9 (X)
  h1(49) = (h(49)+f(1)) mod 10
         = (h(49)+ 1 ) mod 10 = 0

Linear Probing Example

Insert sequence: 89, 18, 49, 58, 69

[Figure: table contents over time, after each insert]

#unsuccessful probes: 7 total
Linear Probing: Issues

Probe sequences can get longer with time

Primary clustering
  Keys tend to cluster in one part of table
  Keys that hash into cluster will be added to the end of the cluster (making it even bigger)
  Side effect: Other keys could also get affected if mapping to a crowded neighborhood

Linear Probing: Analysis

Expected number of probes for insertion or unsuccessful search
$$\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$$

Expected number of probes for successful search
$$\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$$

Example (λ = 0.5)
  Insert / unsuccessful search: 2.5 probes
  Successful search: 1.5 probes

Example (λ = 0.9)
  Insert / unsuccessful search: 50.5 probes
  Successful search: 5.5 probes

Random Probing: Analysis

Random probing does not suffer from clustering

Expected number of probes for insertion or unsuccessful search:
$$\frac{1}{\lambda}\ln\frac{1}{1-\lambda}$$

Example
  λ = 0.5: 1.4 probes
  λ = 0.9: 2.6 probes
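The example figures quoted for linear and random probing can be reproduced by evaluating the closed-form expressions directly. The function names below are illustrative, and each function is only a transcription of a formula from the slides, valid for load factors strictly between 0 and 1.

```cpp
#include <cassert>
#include <cmath>

// Linear probing, insertion or unsuccessful search: (1/2) * (1 + 1/(1-lambda)^2)
double linearUnsuccessful(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    return 0.5 * (1.0 + 1.0 / ((1.0 - lambda) * (1.0 - lambda)));
}

// Linear probing, successful search: (1/2) * (1 + 1/(1-lambda))
double linearSuccessful(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}

// Random probing, the expression quoted on the slide: (1/lambda) * ln(1/(1-lambda))
double randomProbing(double lambda) {
    assert(lambda > 0.0 && lambda < 1.0);
    return (1.0 / lambda) * std::log(1.0 / (1.0 - lambda));
}
```

At λ = 0.9 the linear-probing insert cost is about 50.5 probes against roughly 2.6 for the random-probing expression, which is the gap that clustering introduces at high load.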

Linear vs. Random Probing

[Figure: # probes (fewer is good, more is bad) versus load factor, for linear probing and random probing. U - unsuccessful search, S - successful search, I - insert.]

Quadratic Probing

Avoids primary clustering

f(i) is quadratic in i, e.g., f(i) = i^2
  hi(x) = (h(x) + i^2) mod TableSize

Probe sequence: +0, +1, +4, +9, +16, ...

[Figure: quadratic probing. The 0th, 1st, 2nd and 3rd probes are shown at increasing distances from the 0th probe index, landing on occupied slots.]

Continue until an empty slot is found
#failed probes is a measure of performance

Quadratic Probing

Avoids primary clustering

f(i) is quadratic in i, e.g., f(i) = i^2
  hi(x) = (h(x) + i^2) mod TableSize

Probe sequence: +0, +1, +4, +9, +16, ...

Example:
  h0(58) = (h(58)+f(0)) mod 10 = 8 (X)
  h1(58) = (h(58)+f(1)) mod 10 = 9 (X)
  h2(58) = (h(58)+f(2)) mod 10 = 2

Quadratic Probing Example

Insert sequence: 89, 18, 49, 58, 69

[Figure: table contents after each insert, with the probe offsets +0^2, +1^2, +2^2 marked]

#unsuccessful probes: 5 total

Q) Delete(49), Find(69) - is there a problem?

Quadratic Probing: Analysis

Difficult to analyze

Theorem 5.1
  New element can always be inserted into a table that is at least half empty and TableSize is prime

Otherwise, may never find an empty slot, even if one exists

Ensure table never gets half full
  If close, then expand it

Quadratic Probing

May cause "secondary clustering"

Deletion
  Emptying slots can break probe sequence and could cause find to stop prematurely

Lazy deletion
  Differentiate between empty and deleted slot
  When finding, skip and continue beyond deleted slots
    If you hit a non-deleted empty slot, then stop find procedure, returning "not found"
  May need compaction at some time

Quadratic Probing: Implementation

[Code listings; only the slide annotations are listed here]

Lazy deletion
Ensure table size is prime

Find
  Skip DELETED; No duplicates
  Quadratic probe sequence (really)

Insert
  No duplicates

Remove
  No deallocation needed
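Since the listings themselves are not reproduced, here is a compact sketch of quadratic probing with lazy deletion that follows the annotations. It is an illustration under assumptions, not the course's implementation: keys are non-negative ints, all names are made up for the example, insert reuses DELETED slots, and the prime-size check and rehashing are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum SlotState { EMPTY, ACTIVE, DELETED };

struct Slot {
    int key;
    SlotState state;
    Slot() : key(0), state(EMPTY) {}
};

const std::size_t NOT_FOUND = static_cast<std::size_t>(-1);

class QuadraticProbingTable {
public:
    explicit QuadraticProbingTable(std::size_t tableSize) : slots_(tableSize) {
        assert(tableSize > 0);
    }

    bool contains(int x) const { return findActive(x) != NOT_FOUND; }

    bool insert(int x) {
        if (contains(x)) return false;                 // no duplicates
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            Slot& s = slots_[probe(x, i)];
            if (s.state != ACTIVE) {                   // EMPTY or DELETED slot can be used
                s.key = x;
                s.state = ACTIVE;
                return true;
            }
        }
        return false;                                  // probe sequence found no free slot
    }

    bool remove(int x) {
        const std::size_t pos = findActive(x);
        if (pos == NOT_FOUND) return false;
        slots_[pos].state = DELETED;                   // lazy deletion: no deallocation needed
        return true;
    }

private:
    // hi(x) = (h(x) + i^2) mod TableSize, with h(x) = x mod TableSize.
    std::size_t probe(int x, std::size_t i) const {
        assert(x >= 0);
        return (static_cast<std::size_t>(x) % slots_.size() + i * i) % slots_.size();
    }

    // Skips DELETED slots and stops at the first EMPTY slot.
    std::size_t findActive(int x) const {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            const std::size_t pos = probe(x, i);
            if (slots_[pos].state == EMPTY) return NOT_FOUND;
            if (slots_[pos].state == ACTIVE && slots_[pos].key == x) return pos;
        }
        return NOT_FOUND;
    }

    std::vector<Slot> slots_;
};
```

With a table of size 10 and the insert sequence 89, 18, 49, 58, 69, removing 49 marks slot 0 as DELETED rather than EMPTY, so a later find for 69 (probes 9, 0, 3) continues past slot 0 and still succeeds.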

Double Hashing: keep two hash functions h1 and h2

Use a second hash function for all tries i other than 0:
  f(i) = i * h2(x)

Good choices for h2(x)?
  Should never evaluate to 0
  h2(x) = R - (x mod R)
    R is prime number less than TableSize

Previous example with R=7
  h0(49) = (h(49)+f(0)) mod 10 = 9 (X)
  h1(49) = (h(49)+1*(7 - 49 mod 7)) mod 10 = 6
    where f(1) = 1*(7 - 49 mod 7)

Double Hashing Example


Double Hashing: Analysis

Imperative that TableSize is prime
  E.g., insert 23 into previous table

Empirical tests show double hashing close to random hashing

Extra hash function takes extra time to compute
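Both the R = 7 example and the warning about non-prime table sizes can be tried out with a short simulation. It is a sketch under assumptions: keys are non-negative ints, -1 marks an empty slot, the probe loop gives up after TableSize tries, and the names h2 and insertDouble are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = -1;   // assumption: keys are non-negative, so -1 can mark an empty slot

// Second hash function h2(x) = R - (x mod R); never evaluates to 0.
std::size_t h2(int x, int r) {
    return static_cast<std::size_t>(r - (x % r));
}

// Inserts key using hi(x) = (h(x) + i * h2(x)) mod TableSize with h(x) = x mod TableSize.
// Returns the number of failed probes, or -1 if no empty slot is reached in TableSize tries.
int insertDouble(std::vector<int>& table, int key, int r) {
    assert(key >= 0 && r > 0 && !table.empty());
    const std::size_t m = table.size();
    const std::size_t home = static_cast<std::size_t>(key) % m;
    const std::size_t step = h2(key, r);
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t slot = (home + i * step) % m;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With TableSize = 10 and R = 7, inserting 89, 18, 49, 58, 69 places them in slots 9, 8, 6, 3 and 0. Inserting 23 afterwards fails in this sketch even though the table is half empty: h(23) = 3 and h2(23) = 5, so the probes alternate between the occupied slots 3 and 8, which illustrates why a prime TableSize matters.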

Probing Techniques - review

[Figure: positions of the 0th, 1st, 2nd and 3rd tries in the table under linear probing, quadratic probing, and double hashing*]

*(determined by a second hash function)

Rehashing

Increases the size of the hash table when load factor becomes "too high" (defined by a cutoff)
  Anticipating that prob(collisions) would become higher

Typically expand the table to twice its size (but still prime)

Need to reinsert all existing elements into new hash table

Rehashing Example

h(x) = x mod 7, λ = 0.57
  Insert 23 ==> λ = 0.71
  Rehashing ==> h(x) = x mod 17, λ = 0.29

Rehashing Analysis

Rehashing takes time to do N insertions

Therefore should do it infrequently

Specifically
  Must have been N/2 insertions since last rehash
  Amortizing the O(N) cost over the N/2 prior insertions yields only constant additional time per insertion

Rehashing Implementation

When to rehash
  When load factor reaches some threshold (e.g., λ ≥ 0.5), OR
  When an insertion fails

Applies across collision handling schemes

Rehashing for Chaining

Rehashing for Quadratic Probing
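A minimal sketch of rehashing for a chaining table, assuming non-negative int keys and h(x) = x mod TableSize. The helper names isPrime, nextPrime and rehash are illustrative and the code is not the course's listing. The new size is the first prime at or above twice the old size, so a table of size 7 grows to 17 as in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Trial-division primality test; adequate for table sizes.
bool isPrime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime greater than or equal to n.
std::size_t nextPrime(std::size_t n) {
    while (!isPrime(n)) ++n;
    return n;
}

// Builds a table about twice as large (still prime) and reinserts every
// existing element, since each index depends on the table size.
std::vector<std::list<int> > rehash(const std::vector<std::list<int> >& old) {
    std::vector<std::list<int> > bigger(nextPrime(2 * old.size()));
    for (std::size_t i = 0; i < old.size(); ++i) {
        for (std::list<int>::const_iterator it = old[i].begin(); it != old[i].end(); ++it) {
            assert(*it >= 0);
            bigger[static_cast<std::size_t>(*it) % bigger.size()].push_front(*it);
        }
    }
    return bigger;
}
```

The reinsertion loop touches each of the N stored elements once, which is the O(N) cost that the analysis amortizes over the preceding insertions.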

Hash Tables in C++ STL

Hash tables were not part of the C++ Standard Library before C++11, which added std::unordered_set and std::unordered_map

Some implementations of STL have hash tables (e.g., SGI's STL)
  hash_set
  hash_map

Hash Set in STL

#include <hash_set>
#include <cstring>
#include <iostream>
using namespace std;

// Key equality test: compares C strings by content, not by pointer
struct eqstr
{
  bool operator()(const char* s1, const char* s2) const
  {
    return strcmp(s1, s2) == 0;
  }
};

// Template arguments of hash_set: Key, Hash fn, Key equality test
void lookup(const hash_set<const char*, hash<const char*>, eqstr>& Set,
            const char* word)
{
  hash_set<const char*, hash<const char*>, eqstr>::const_iterator it
    = Set.find(word);
  cout << word << ": "
       << (it != Set.end() ? "present" : "not present")
       << endl;
}

int main()
{
  hash_set<const char*, hash<const char*>, eqstr> Set;
  Set.insert("kiwi");
  lookup(Set, "kiwi");
}

Hash Map in STL

#include <hash_map>
#include <cstring>
#include <iostream>
using namespace std;

// Key equality test: compares C strings by content, not by pointer
struct eqstr
{
  bool operator() (const char* s1, const char* s2) const
  {
    return strcmp(s1, s2) == 0;
  }
};

int main()
{
  // Template arguments of hash_map: Key, Data, Hash fn, Key equality test
  hash_map<const char*, int, hash<const char*>, eqstr> months;
  // operator[] is internally treated like insert (or overwrite if key already present)
  months["january"] = 31;
  months["february"] = 28;
  // ...
  months["december"] = 31;
  cout << "january -> " << months["january"] << endl;
}
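For comparison, the same lookups can be written with the standard containers std::unordered_set and std::unordered_map (available since C++11). The sketch below is an adaptation, not part of the slides: it uses std::string keys, so no custom equality functor is needed, and the helper names makeMonths and isPresent are illustrative.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Builds a small month -> days map, mirroring the hash_map example.
std::unordered_map<std::string, int> makeMonths() {
    std::unordered_map<std::string, int> months;
    months["january"] = 31;    // operator[] inserts, or overwrites if the key is present
    months["february"] = 28;
    months["december"] = 31;
    assert(months.size() == 3);
    return months;
}

// Membership test, mirroring the lookup() function of the hash_set example.
bool isPresent(const std::unordered_set<std::string>& set, const std::string& word) {
    return set.find(word) != set.end();
}
```

Note that operator[] on a missing key inserts a value-initialized entry, so count() or find() is the safer choice when a lookup should not modify the map.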

Problem with Large Tables

What if hash table is too large to store in main memory?

Solution: Store hash table on disk
  Minimize disk accesses

But...
  Collisions require disk accesses
  Rehashing requires a lot of disk accesses

Solution: Extendible Hashing

Hash Table Applications

Symbol table in compilers

Accessing tree or graph nodes by name
  E.g., city names in Google maps

Maintaining a transposition table in games
  Remember previous game situations and the move taken (avoid re-computation)

Dictionary lookups
  Spelling checkers
  Natural language understanding (word sense)

Heavily used in text processing languages
  E.g., Perl, Python, etc.

Summary

Hash tables support fast insert and search
  O(1) average case performance
  Deletion possible, but degrades performance

Not suited if ordering of elements is important

Many applications

Points to remember - Hash tables

Table size prime

Table size much larger than number of inputs (to maintain λ closer to 0 or < 0.5)

Tradeoffs between chaining vs. probing

Collision chances decrease in this order: linear probing => quadratic probing => {random probing, double hashing}

Rehashing required to resize hash table at a time when λ exceeds 0.5

Good for searching. Not good if there is some order implied by data.
