DATA STRUCTURES AND
ALGORITHMS
Week 13: Hashing
Sohail Muhammad
Department of CS
Bahria University, Islamabad
Introduction To Searching
• Linear Search & Binary Search
• Locate an item by a sequence of comparisons
• Item being sought – repeatedly compared with items in the list
• Fast Searching
• Location of an item is determined directly as a function of the item itself
• No trial-and-error comparisons
• Ideally – Time required to locate an item is constant and does not depend
on the number of items stored
• Hash tables and hash functions
Why Hashing?
• The volume of stored content keeps growing, especially on the internet
• Finding anything becomes impractical unless data structures and algorithms
for storing and accessing data scale with it.
• Problem with traditional data structures like arrays and linked lists?
• Sorted array -> binary search -> time complexity = O(log n)
• Unsorted array -> linear search -> time complexity = O(n)
• Either case may not be desirable if we need to process a very large data set.
• Hashing is a technique that allows us to update and retrieve any entry in
constant time, O(1), on average. Constant-time or O(1) performance means that
the amount of time to perform the operation does not depend on the data
size n.
Applications of Hashing
• Compilers use hash tables to implement the symbol table (a data
structure to keep track of declared variables)
• Game programs
• Spell Checking
• Substring Pattern Matching
• Searching
• Document comparison
When not to use hashing?
• Hash tables are very good if there is a need for many searches in a
reasonably stable table
• Hash tables are not so good if there are many insertions and
deletions, or if table traversals are needed
• If there are more data than available memory then use a tree
• Also, hashing is very slow for any operations which require the entries
to be sorted
• e.g. Find the minimum key
A simple example – direct
hashing
• 7 integers, ranging 0-10 to be stored in a hash table:
• Key = {7, 3, 6, 4,9,1,5}
• The hash table can be implemented by:
• An integer array, table.
• Initialize each array element with some dummy value, like -1.
• Store value i at location table[i].
Index:   0   1   2   3   4   5   6   7   8   9  10
Value:  -1   1  -1   3   4   5   6   7  -1   9  -1
A simple example – direct
hashing
• To check whether a particular value, number, is stored in the hash table,
we only need to check whether:
table[number] == number
• Hash function:
• The function h defined by h(i) = i, which determines the location of an item i in
the hash table, is called a hash function
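The direct-hashing table described above can be sketched in C++ as follows. This is a minimal illustration, assuming keys lie in 0..10 and that -1 is never a valid key; the names `DirectTable`, `insert`, `contains`, `TABLE_SIZE` and `EMPTY` are hypothetical and do not come from the slides.

```cpp
#include <array>
#include <cassert>

const int TABLE_SIZE = 11;  // assumed: keys lie in 0..10
const int EMPTY = -1;       // dummy value marking an unused slot

struct DirectTable {
    std::array<int, TABLE_SIZE> table;

    // Every slot starts out holding the dummy value.
    DirectTable() { table.fill(EMPTY); }

    // h(i) = i, so key i is stored at table[i].
    void insert(int key) {
        assert(key >= 0 && key < TABLE_SIZE);  // key must be a valid index
        table[key] = key;
    }

    // A single array access answers the membership question.
    bool contains(int key) const {
        return key >= 0 && key < TABLE_SIZE && table[key] == key;
    }
};
```

After inserting the keys 7, 3, 6, 4, 9, 1 and 5, `contains(7)` is true and `contains(2)` is false, each decided by examining one location.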
A simple example – direct
hashing
• 7 integers, ranging 0 – 999 to be stored in a hash table
• The hash table can be implemented by:
• An integer array, table.
• Initialize each array element with some dummy value, like -1.
• Store value i at location table[i].
Index: 0 1 ... 997 998 999
A simple example – direct
hashing
• For the hash function h(i) = i:
• Time required to search the table for a given item is constant, only one
location needs to be examined
• Very Time efficient – not Space efficient at all
• 7 out of 1000 locations used – 993 unused locations
• Since it is possible to store 7 values in 7 locations, we can improve on
space utilization
Hash functions
• One possible hash function for the keys {7, 3, 6, 4, 9, 1, 5} could be:
h(i) = i modulo 7
• Or in C++ syntax:
int h(int i)
{
    return i % 7;
}
Index:  0  1  2  3  4  5  6
Value:  7  1  9  3  4  5  6
Hashing and hash functions
For a table of 25 locations, the hash function h(i) = i % 25 always
produces an integer in the range 0–24.
The integer 52 is thus stored in table[2], since:
h(52) = 52 % 25 = 2
Similarly, 129, 500 and 49 are stored in locations 4, 0 and
24 respectively.
Hash table:
Index:    0    1    2    3    4   ...   24
Value:  500   -1   52   -1  129   ...   49
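The placements above can be checked with a short C++ sketch. It assumes a table of 25 slots, non-negative integer keys, and -1 as the empty marker; `h`, `buildTable` and `EMPTY` are illustrative names, and collisions are deliberately not handled here.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // dummy value for unused slots (assumption)

// Modular hash: maps a non-negative key into 0 .. tableSize - 1.
int h(int key, int tableSize) {
    assert(tableSize > 0 && key >= 0);
    return key % tableSize;
}

// Places each key at its hashed index. A later key that hashes to an
// occupied slot simply overwrites it, since collisions are ignored here.
std::vector<int> buildTable(const std::vector<int>& keys, int tableSize) {
    std::vector<int> table(tableSize, EMPTY);
    for (int key : keys) {
        table[h(key, tableSize)] = key;
    }
    return table;
}
```

With a table size of 25, the keys 52, 129, 500 and 49 land at indices 2, 4, 0 and 24.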
Hash tables – formal definition
• The hash table structure is an array of some fixed size, containing the
items.
• A stored item needs to have a data member, called the key, that is
used in computing the index value for the item.
• The key could be an integer, a string, etc.
• e.g. a name or ID that is part of a large employee structure
• The size of the array is TableSize.
• The items stored in the hash table are indexed by values from
0 to TableSize - 1.
• Each key is mapped into some number in the range 0 to TableSize - 1.
• The mapping is implemented through a hash function.
Example Hash Table
Items -> key -> hash function -> index
Items: john 25000, phil 31250, dave 27500, mary 28200
Table (indices 0 to 9):
0
1
2
3  john 25000
4  phil 31250
5
6  dave 27500
7  mary 28200
8
9
A simple version for integer keys:
int hash(int key)
{
    return key % table_size;
}
void insert(int key)
{
    int h = hash(key);
    table[h] = key;
}
Example
• The simplest kind of hash table is an array of records.
• This example has 701 records.
[0] [1] [2] [3] [4] [5] ... [700]
An array of records
Example
• Each record has a special field, called its key.
• In this example, the key is the ID of an individual – a long
integer
[0] [1] [2] [3] [4] [5] ... [700]
Record at [4]: ID 506643548
Example
• The rest of the record has information about the person.
[0] [1] [2] [3] [4] [5] ... [700]
Record at [4]: ID 506643548, plus the other fields for that person
Example
• When a hash table is in use, some spots contain valid records, and
other spots are empty
[0] [1] [2] [3] [4] [5] ... [700]
Occupied: [1] Number 281942902, [2] Number 233667136, [4] Number 506643548, [700] Number 155778322
The other spots shown are empty.
Example: Inserting a New Record
• New record to insert: ID 580625685
• In order to insert a new record, the key must
somehow be converted to an array index
• The index is called the hash value of the key
[0] [1] [2] [3] [4] [5] ... [700]
Occupied: [1] Number 281942902, [2] Number 233667136, [4] Number 506643548, [700] Number 155778322
Example: Inserting a New Record
• New record to insert: ID 580625685
• Simplest hash function: ID mod 701
• 580625685 mod 701 = 3
[0] [1] [2] [3] [4] [5] ... [700]
Occupied: [1] Number 281942902, [2] Number 233667136, [4] Number 506643548, [700] Number 155778322
Example: Inserting a New Record
• The new record is inserted at location 3 in the hash table
[0] [1] [2] [3] [4] [5] ... [700]
Occupied: [1] Number 281942902, [2] Number 233667136, [3] ID 580625685 (new), [4] Number 506643548, [700] Number 155778322
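The record example can be sketched in C++ as below. This is only an illustration under stated assumptions: the `Record` layout, the use of -1 as the "empty" ID, and the function names `hashValue` and `insertRecord` are hypothetical, and a collision is reported rather than resolved.

```cpp
#include <cassert>
#include <string>
#include <vector>

const int CAPACITY = 701;  // number of records in the example table

struct Record {
    long long id = -1;   // key; -1 marks an empty spot (assumption)
    std::string info;    // the rest of the information about the person
};

// Simplest hash function from the example: ID mod 701.
int hashValue(long long id) {
    assert(id >= 0);
    return static_cast<int>(id % CAPACITY);
}

// Stores r at its hash value. Returns the index used, or -1 when that
// spot already holds a record (collision handling is out of scope here).
// The table is assumed to have exactly CAPACITY entries.
int insertRecord(std::vector<Record>& table, const Record& r) {
    int index = hashValue(r.id);
    if (table[index].id != -1) return -1;
    table[index] = r;
    return index;
}
```

For the IDs in the example, `hashValue` gives 1, 2, 4 and 700 for the stored records and 3 for the new record 580625685.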
Hash function examples for integer
keys
• Let us consider a hash table size = 1000
• Truncation: If students have a 9-digit identification number, take the
last 3 digits as the table position
• E.g. 925371622 becomes 622
• Folding: Split a 9-digit number into three 3-digit numbers, and add
them
• E.g. 925371622 becomes 925 + 371 + 622 = 1918
• Modular arithmetic: If the table size is 1000, truncation always stays
within the table range, but folding may not, so its result should be reduced mod 1000
• E.g. 1918 mod 1000 = 918 (1918 % 1000)
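Truncation and folding can be sketched in C++ as follows. This is a minimal illustration, assuming non-negative IDs and 3-digit groups; the function names `truncationHash` and `foldingHash` are hypothetical.

```cpp
#include <cassert>

// Truncation: keep only the last three digits of the ID.
int truncationHash(long long id) {
    assert(id >= 0);
    return static_cast<int>(id % 1000);
}

// Folding: split the ID into 3-digit groups, add the groups, then
// reduce the sum modulo the table size so it stays inside the table.
int foldingHash(long long id, int tableSize) {
    assert(id >= 0 && tableSize > 0);
    long long sum = 0;
    while (id > 0) {
        sum += id % 1000;  // lowest 3-digit group
        id /= 1000;        // drop that group
    }
    return static_cast<int>(sum % tableSize);
}
```

For 925371622, truncation gives 622, and folding gives 925 + 371 + 622 = 1918, which reduces to 918 for a table of size 1000.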
Hash function
• A hash function should be easy and fast to compute
• A hash function should scatter the data evenly throughout the hash
table.
• How well does the hash function scatter random data?
• How well does the hash function scatter non-random data?
• If the input keys are integers, then simply using Key mod TableSize is a
general strategy.
• If the keys are strings, the hash function needs more care
• First convert the string into a numeric value.
Hash functions for non-numeric keys
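• Add up the ASCII values of all characters of the key and take the result mod
the table size (shown here in Java)
// Sum the character codes of x and reduce the sum modulo the table size M.
static int h(String x, int M)
{
    char[] ch = x.toCharArray();
    int sum = 0;
    for (int i = 0; i < ch.length; i++)
        sum += ch[i];   // each char is promoted to its numeric code
    return sum % M;
}
• E.g. with M = 27: h("Apple") = (65 + 112 + 112 + 108 + 101) % 27 = 498 % 27 = 498 - 486 = 12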
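A C++ counterpart of the character-sum hash above might look like this. It is a sketch, assuming ASCII keys and a positive table size; the parameter name `tableSize` is illustrative.

```cpp
#include <cassert>
#include <string>

// Adds the character codes of the key and reduces the sum modulo the
// table size. Characters are read as unsigned so codes are never negative.
int h(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    int sum = 0;
    for (unsigned char ch : key) {
        sum += ch;
    }
    return sum % tableSize;
}
```

With a table size of 27, `h("Apple", 27)` is 498 % 27 = 12. Because addition ignores character order, keys that are anagrams of each other (for example "listen" and "silent") always hash to the same index, which is one reason string keys need more care.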
Collision
If, when an element is inserted, it hashes to the same value as
an already inserted element, then we have a collision and need
to resolve it.
Collision Resolution
Separate Chaining
• The idea is to keep a list of all elements that hash
to the same value.
• The array elements are pointers to the first nodes of the
lists.
• A new item is inserted at the front of the list.
• Advantages:
• Better space utilization for large items.
• Simple collision handling: searching linked list.
• Overflow: we can store more items than the hash table
size.
• Deletion is quick and easy: deletion from the linked list.
Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10
Each new key is inserted at the front of its list:
0 -> 0
1 -> 81 -> 1
2
3
4 -> 64 -> 4
5 -> 25
6 -> 36 -> 16
7
8
9 -> 49 -> 9
Operations
• Initialization:
• All entries are set to NULL
• Search:
• Locate the cell using hash function.
• Sequential search on the linked list in that cell.
• Insertion:
• Locate the cell using hash function.
• (If the item does not exist) insert it as the first item in the
list.
• Deletion:
• Locate the cell using hash function.
• Delete the item from the linked list.
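The four operations above can be sketched in C++ as follows. This is one possible arrangement, not the required lab solution: `ChainNode`, `ChainedTable`, `remove` and `chain` are hypothetical names, keys are assumed to be non-negative integers, and the hash function is assumed to be key mod table size.

```cpp
#include <cassert>
#include <vector>

struct ChainNode {
    int key;
    ChainNode* next;
};

class ChainedTable {
public:
    // Initialization: every cell starts as an empty list (null pointer).
    explicit ChainedTable(int size) : cells(size, nullptr) { assert(size > 0); }
    ~ChainedTable() {
        for (ChainNode* head : cells)
            while (head != nullptr) { ChainNode* n = head->next; delete head; head = n; }
    }
    ChainedTable(const ChainedTable&) = delete;
    ChainedTable& operator=(const ChainedTable&) = delete;

    // Search: locate the cell, then scan its list sequentially.
    bool search(int key) const {
        for (ChainNode* p = cells[index(key)]; p != nullptr; p = p->next)
            if (p->key == key) return true;
        return false;
    }
    // Insertion: if the key is not present, add it as the first item.
    void insert(int key) {
        if (search(key)) return;
        int i = index(key);
        cells[i] = new ChainNode{key, cells[i]};
    }
    // Deletion: unlink the node from the list in its cell.
    bool remove(int key) {
        ChainNode** link = &cells[index(key)];
        while (*link != nullptr) {
            if ((*link)->key == key) {
                ChainNode* dead = *link;
                *link = dead->next;
                delete dead;
                return true;
            }
            link = &(*link)->next;
        }
        return false;
    }
    // Helper for inspection: keys of one cell, front to back.
    std::vector<int> chain(int cell) const {
        std::vector<int> keys;
        for (ChainNode* p = cells[cell]; p != nullptr; p = p->next) keys.push_back(p->key);
        return keys;
    }

private:
    int index(int key) const {
        assert(key >= 0);
        return key % static_cast<int>(cells.size());
    }
    std::vector<ChainNode*> cells;
};
```

Inserting 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 into a table of size 10 leaves cell 1 holding 81 followed by 1, and cell 6 holding 36 followed by 16, because each new key goes to the front of its list.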
const int MAX = 10;   // table size (assumed value; the slide does not define MAX)

class Node {
public:
    int key;
    Node* next;
};

class hash {
public:
    Node* hashtable[MAX];

    // Constructor: initialize the array of pointers to NULL
    hash();
    // Hash function that generates the hash index
    int hashfunction(int key);
    // Insert a value in the hash table: create a node, store the value in
    // the node, and place the node at the appropriate location
    void insert(int k);
    // Display the complete data in the hash table
    void display();
};
Open addressing
• Separate chaining has the disadvantage of using linked lists.
• Requires the implementation of a second data structure.
• In an open addressing hashing system, all the data go inside the table.
• If a collision occurs, alternative cells are tried until an empty cell is
found.
Open Addressing
• There are three common collision resolution
strategies:
• Linear Probing
• Quadratic probing
• Double hashing
Load Factor
• Load factor = (current number of items) / tableSize
• It measures how full a hash table is.
• A hash table should not be too heavily loaded if we want good
performance from hashing.
• A good load factor is generally <= 0.5 in the case of open
addressing
• Only with separate chaining can the load factor be greater
than 1.
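The load factor is a one-line computation; the sketch below also encodes the open-addressing guideline from this slide. The names `loadFactor` and `needsLargerTable` are illustrative, and the 0.5 threshold is the rule of thumb quoted above rather than a fixed requirement.

```cpp
#include <cassert>

// Load factor = (current number of items) / tableSize.
double loadFactor(int itemCount, int tableSize) {
    assert(itemCount >= 0 && tableSize > 0);
    return static_cast<double>(itemCount) / tableSize;
}

// Open-addressing rule of thumb: keep the load factor at or below 0.5.
bool needsLargerTable(int itemCount, int tableSize) {
    return loadFactor(itemCount, tableSize) > 0.5;
}
```

For instance, 7 keys in a table of 7 slots give a load factor of 1.0, while 5 keys in 10 slots give 0.5.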
Linear Probing
• In linear probing, collisions are resolved by sequentially scanning an
array (with wraparound) until an empty cell is found.
• Thus, with h(i) = i % 25, once 77 collides with 52 at location 2, we simply put 77
in position 3.
Hash table:
Index:    0    1    2    3    4   ...   24
Value:  500   -1   52   77  129   ...   49
Linear Probing
Hash table:
Index:    0    1    2    3    4    5   ...   24
Value:  500   -1   52   77  129  102   ...   49
To insert 102, we follow the probe sequence consisting of
locations 2, 3, 4 and 5 to find the first available location and
thus store 102 in table[5].
Note: If the search reaches the end of the table, we continue
at the first location.
Linear Probing
• To determine if a specified value is in the hash table, we first apply
the hash function to compute the position at which this value
should be found.
• There can be one of the following cases:
• The location is empty: the value is not in the table
• The location contains the specified value: the search succeeds
• The location contains some other value
• Begin a circular linear search until either the item is found or we reach an empty
location or the starting location.
Let hash(x) be the slot index computed using the hash function and S be the table size.
If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
In general, the i-th probe examines slot (hash(x) + i) % S.
Linear Probing -- Example
• Example:
• Table size is 11 (0..10)
• Hash function: h(x) = x mod 11
• Insert keys: 20, 30, 2, 13, 25, 24, 10, 9
• 20 mod 11 = 9
• 30 mod 11 = 8
• 2 mod 11 = 2
• 13 mod 11 = 2 (occupied) -> 2+1 = 3
• 25 mod 11 = 3 (occupied) -> 3+1 = 4
• 24 mod 11 = 2 (occupied) -> 2+1, 2+2, 2+3 = 5
• 10 mod 11 = 10
• 9 mod 11 = 9 (occupied) -> 9+1, (9+2) mod 11 = 0
Resulting table ("-" marks an empty cell):
Index:   0   1   2   3   4   5   6   7   8   9  10
Value:   9   -   2  13  25  24   -   -  30  20  10
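The worked example above can be reproduced with a small C++ sketch of linear probing. It assumes non-negative integer keys, -1 as the empty marker, and h(x) = x mod table size; `linearInsert`, `linearSearch` and `EMPTY` are illustrative names, and deletion is not covered.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // marks an unused cell (assumption)

// Inserts key using linear probing with wraparound.
// Returns the index where the key was stored, or -1 if the table is full.
int linearInsert(std::vector<int>& table, int key) {
    const int size = static_cast<int>(table.size());
    assert(size > 0 && key >= 0);
    const int home = key % size;
    for (int i = 0; i < size; ++i) {
        const int pos = (home + i) % size;  // i-th probe
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}

// Searches along the same probe sequence. Stops at the key, at an empty
// cell, or after one full cycle. Returns the index or -1 if absent.
int linearSearch(const std::vector<int>& table, int key) {
    const int size = static_cast<int>(table.size());
    assert(size > 0 && key >= 0);
    const int home = key % size;
    for (int i = 0; i < size; ++i) {
        const int pos = (home + i) % size;
        if (table[pos] == key) return pos;
        if (table[pos] == EMPTY) return -1;
    }
    return -1;
}
```

Inserting 20, 30, 2, 13, 25, 24, 10, 9 into an empty table of size 11 places them at indices 9, 8, 2, 3, 4, 5, 10 and 0, matching the example.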
Linear Probing -- Clustering Problem
• One of the problems with linear probing is that table items tend to
cluster together in the hash table.
• i.e. table contains groups of consecutively occupied locations.
• This phenomenon is called primary clustering.
• Clusters can get close to one another, and merge into a larger
cluster.
• Thus, one part of the table might be quite dense, even though
another part has relatively few items.
• Primary clustering causes long probe searches, and therefore,
decreases the overall efficiency.
Clustering Problem
• As long as the table is big enough, a free cell can always be found, but
the time to do so can get quite large
• Larger table sizes are preferred
• Studies suggest the use of tables whose capacities are approx. 1.5 to
2 times the number of items that must be stored
Quadratic probing
• Quadratic probing eliminates the primary clustering problem
of linear probing.
• If the hash function evaluates to h and a search in cell
h is inconclusive, we try cells h + 1², h + 2², …, h + i².
• i.e. it examines cells 1, 4, 9 and so on away from the original
probe.
• Subsequent probe points are a quadratic number of
positions from the original probe point.
Quadratic Probing
• Quadratic probing: almost eliminates the clustering problem
• Steps to follow:
• Start from the original hash location i
• If the location is occupied, check locations i + 1², i + 2²,
i + 3², i + 4², ...
• Wrap around the table, if necessary.
Let hash(x) be the slot index computed using the hash function and S be the table size.
If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
Quadratic Probing -- Example
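• Table size is 11 (0..10)
• Hash function: h(x) = x mod 11
• Insert keys: 20, 30, 2, 13, 25, 24, 10, 9
• 20 mod 11 = 9
• 30 mod 11 = 8
• 2 mod 11 = 2
• 13 mod 11 = 2 (occupied) -> 2 + 1² = 3
• 25 mod 11 = 3 (occupied) -> 3 + 1² = 4
• 24 mod 11 = 2 (occupied) -> 2 + 1² = 3 (occupied), 2 + 2² = 6
• 10 mod 11 = 10
• 9 mod 11 = 9 (occupied) -> 9 + 1² = 10 (occupied), (9 + 2²) mod 11 = 2 (occupied),
(9 + 3²) mod 11 = 7
Resulting table ("-" marks an empty cell):
Index:   0   1   2   3   4   5   6   7   8   9  10
Value:   -   -   2  13  25   -  24   9  30  20  10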
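A C++ sketch of quadratic-probing insertion that reproduces the example above. It assumes non-negative integer keys, -1 as the empty marker, and h(x) = x mod table size; `quadraticInsert` and `EMPTY` are illustrative names.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // marks an unused cell (assumption)

// Inserts key using quadratic probing: the i-th probe examines
// (home + i*i) % size. Returns the index used, or -1 if no empty cell
// was found within size probes. The product i*i is assumed not to
// overflow int, which holds for tables of modest size.
int quadraticInsert(std::vector<int>& table, int key) {
    const int size = static_cast<int>(table.size());
    assert(size > 0 && key >= 0);
    const int home = key % size;
    for (int i = 0; i < size; ++i) {
        const int pos = (home + i * i) % size;
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

Unlike linear probing, this probe sequence does not visit every cell, so an insertion can fail even when empty cells remain. Keeping the table size prime and the table at most half full is the usual way to guarantee that a free cell is found.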
Double Hashing
• A second hash function is used to drive the collision
resolution.
• We apply a second hash function to x and probe at
a distance hash2(x), 2*hash2(x), … and so on.
• The function hash2(x) must never evaluate to zero
let hash(x) be the slot index computed using hash function.
If slot hash(x) % S is full, then we try (hash(x) + 1*hash2(x)) % S
If (hash(x) + 1*hash2(x)) % S is also full, then we try (hash(x) + 2*hash2(x)) % S
If (hash(x) + 2*hash2(x)) % S is also full, then we try (hash(x) + 3*hash2(x)) % S
Double Hashing
• Double hashing also reduces clustering.
• Idea: Increment using a second hash function h2. It should
satisfy:
h2(key) ≠ 0
h2 ≠ h1
• Probes the following locations (mod table size) until it finds an unoccupied place:
h1(key)
h1(key) + h2(key)
h1(key) + 2*h2(key)
...
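Double Hashing -- Example
• Example:
• Table size is 11 (0..10)
• Hash functions:
h1(x) = x mod 11
h2(x) = 7 - (x mod 7)
• Insert keys: 58, 14, 91
• 58 mod 11 = 3
• 14 mod 11 = 3 (occupied), h2(14) = 7 -> 3 + 7 = 10
• 91 mod 11 = 3 (occupied), h2(91) = 7 -> 3 + 7 = 10 (occupied), (3 + 2*7) mod 11 = 6
Resulting table ("-" marks an empty cell):
Index:   0   1   2   3   4   5   6   7   8   9  10
Value:   -   -   -  58   -   -  91   -   -   -  14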
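A C++ sketch of double hashing that reproduces the example above. It assumes non-negative integer keys, -1 as the empty marker, h1(x) = x mod table size, and the second hash h2(x) = 7 - (x mod 7) from the example; `doubleHashInsert` and `EMPTY` are illustrative names.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // marks an unused cell (assumption)

// Second hash function from the example: always in 1..7, never zero.
int h2(int key) {
    assert(key >= 0);
    return 7 - (key % 7);
}

// Inserts key using double hashing: the i-th probe examines
// (h1(key) + i * h2(key)) % size. Returns the index used, or -1 if no
// empty cell was found within size probes.
int doubleHashInsert(std::vector<int>& table, int key) {
    const int size = static_cast<int>(table.size());
    assert(size > 0 && key >= 0);
    const int home = key % size;  // h1(key)
    const int step = h2(key);
    for (int i = 0; i < size; ++i) {
        const int pos = (home + i * step) % size;
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

With a table size of 11, the keys 58, 14 and 91 all hash to 3 and end up at indices 3, 10 and 6. Because the step is between 1 and 7 and 11 is prime, the probe sequence eventually visits every cell of this table.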
class hash
{
public:
    int HashTable[MAX];
    // Hash function to generate the index: hash(key) = key % MAX
    int hashfunction(int key);
    // Accepts the hash table and the key to be inserted, and inserts the key at the
    // appropriate location in the table, using linear probing to resolve collisions.
    // Returns the index at which the key is inserted.
    int linear_probing(int HashTable[], int key);
    // Same as above, but uses quadratic probing to resolve collisions.
    int quadratic_probing(int HashTable[], int key);
    // Same as above, but uses double hashing to resolve collisions.
    int double_hashing(int HashTable[], int key);
};
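HINTS:
Quadratic probing can be implemented like this, where home = hashfunction(key)
and -1 marks an empty cell:
for (i = 0; i < MAX; i++) {
    pos = (home + i * i) % MAX;
    if (HashTable[pos] == -1) break;   // first free cell on the probe sequence
}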
Self Assessment
• Insert the given:
• keys = {“pineapple”, “grapefruit”, “apricot”, “coconut”},
• into a hash table of size 10, using a hash function:
• H(key) = key % tablesize
• To apply the given hash function, first convert the non-numeric key
values into numeric, by counting the number of characters in each
key (e.g. ‘sam’ has three characters, so its numeric value will be 3).
• In case of collision apply linear probing.
• Also compute the load factor of the given hash table.
Self Assessment
• Insert the keys={89, 18, 49, 58, 69} in a Hash table of size 10 using
the hash function H(key)=key % tablesize and each of the following
collision resolution techniques separately.
• Linear Probing
• Quadratic Probing
• Double Hashing using a second hash function H2(key) = 7 - (key % 7)
Summary
• The analysis shows us that the table size is not really important, but
the load factor is.
• For separate chaining, TableSize should be about as large as the number of
expected elements in the hash table.
• This keeps the load factor around 1.
• For open addressing, keep the load factor at or below about 0.5.
• TableSize should be prime for an even distribution of keys to hash table
cells.
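One way to follow the prime-size advice is to round a desired capacity up to the next prime. The helper below is a sketch using simple trial division, which is adequate for table sizes that fit in an int; `isPrime` and `nextPrime` are illustrative names.

```cpp
#include <cassert>

// Trial division: n is prime if no d with d*d <= n divides it.
// The test d <= n / d avoids overflow in d * d.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d <= n / d; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime greater than or equal to n.
int nextPrime(int n) {
    assert(n >= 0);
    int candidate = (n < 2) ? 2 : n;
    while (!isPrime(candidate)) {
        ++candidate;
    }
    return candidate;
}
```

For example, a requested size of 10 becomes 11 and a requested size of 700 becomes 701, the two table sizes used in the earlier examples.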