Hashing
• Find items with keys matching a given search key
– Given an array A containing n keys and a search key x, find the index i such that x = A[i]
– As in the case of sorting, a key could be part of a large record.
Hashing
• The search time of each algorithm discussed so far depends on the number n of elements in the collection S of data.
• A searching technique which is essentially independent of the number n is called hashing or hash addressing. The terminology we use in our presentation of hashing is oriented toward file management.
Hashing
• First of all, we assume that there is a file F of n
records with a set K of keys which uniquely
determine the records in F.
• Secondly, we assume that F is maintained in
memory by a table T of m memory locations and
that L is the set of memory addresses of the
locations in T.
• For notational convenience, we assume that the
keys in K and the addresses in L are (decimal)
integers.
Hashing
• The subject of hashing will be introduced by the following example:
– Suppose a company with 68 employees assigns a 4-digit employee number to each employee, which is used as the primary key in the company's employee file. We could, in fact, use the employee number as the address of the record in memory, so the search would require no comparisons at all. Unfortunately, this technique would require space for 10 000 memory locations, whereas space for fewer than 70 such locations would actually be used. Clearly, this tradeoff of space for time is not worth the expense.
Hashing
• The general idea of using the key to determine the address of a record is an excellent one, but it must be modified so that a great deal of space is not wasted. This modification takes the form of a function H from the set K of keys into the set L of memory addresses. Such a function
H: K → L
is called a hash function or hashing function.
• Unfortunately, such a function H may not yield distinct values: it is possible that two different keys k1 and k2 will yield the same hash address.
Hashing
• This situation is called a collision, and some method must be used to resolve it. Accordingly, the topic of hashing is divided into two parts: (1) hash functions and (2) collision resolution. We discuss these two parts separately.
Hash Functions
• The two principal criteria used in selecting a hash function H: K → L are as follows:
– First of all, the function H should be very easy and quick to compute.
– Second, the function H should, as far as possible, uniformly distribute the hash addresses throughout the set L so that there are a minimum number of collisions.
– Naturally, there is no guarantee that the second condition can be completely fulfilled without actually knowing beforehand the keys and addresses.
– However, certain general techniques do help. One technique is to "chop" a key k into pieces and combine the pieces in some way to form the hash address H(k). (The term "hashing" comes from this technique of "chopping" a key into pieces.)
Hash Functions
• Some popular hash functions are given below. We emphasize that each of these hash functions can be easily and quickly evaluated by the computer.
– Division method.
• Choose a number m larger than the number n of keys in K. (The
number m is usually chosen to be a prime number or a number
without small divisors, since this frequently minimizes the number
of collisions.) The hash function H is defined by
H(k) = k (mod m) or H(k) = k (mod m) + 1
Here k (mod m) denotes the remainder when k is divided by m.
The second formula is used when we want the hash addresses
to range from 1 to m rather than from 0 to m - 1.
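A minimal Python sketch of the division method follows; the function name and the defaults (such as m = 97) are ours for illustration, not part of the slides.

def hash_division(k, m=97, one_based=False):
    """Division method: H(k) = k mod m, or k mod m + 1 for addresses 1..m."""
    # m should be a prime (or at least have no small divisors) so that the
    # remainders spread evenly over the address range.
    return k % m + 1 if one_based else k % m

# e.g. hash_division(3205) == 4, since 3205 = 33*97 + 4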
Hash Functions
– Midsquare method.
• The key k is squared. Then the hash function H is defined by
H(k) = l
where l is obtained by deleting digits from both ends of k². We emphasize that the same positions of k² must be used for all of the keys.
– Folding method.
• The key k is partitioned into a number of parts, k1, k2, ..., kr, where each part, except possibly the last, has the same number of digits as the required address. Then the parts are added together, ignoring the last carry. That is,
H(k) = k1 + k2 + ··· + kr
where the leading-digit carries, if any, are ignored. Sometimes, for extra "milling," the even-numbered parts, k2, k4, ..., are each reversed before the addition.
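The midsquare and folding methods might be sketched in Python as below; the digit positions and the part size are parameters we chose for illustration, and they must be fixed once and used for every key.

def hash_midsquare(k, low=3, high=5):
    """Midsquare method: square k, then keep the digits of k**2 in
    positions low+1 .. high, counted from the right."""
    return (k * k // 10 ** low) % 10 ** (high - low)

def hash_folding(k, digits=2):
    """Folding method: chop k into `digits`-digit parts, add the parts,
    and drop any leading carry (i.e., reduce modulo 10**digits)."""
    total, base = 0, 10 ** digits
    while k > 0:
        total += k % base   # take the next part from the right
        k //= base
    return total % base     # ignoring the leading-digit carry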
Hash Functions: Example
– Division method.
• Choose a prime number m close to 99, such as m = 97. Then
H(3205) = 4, H(7148) = 67, H(2345) = 17
That is, dividing 3205 by 97 gives a remainder of 4, dividing 7148 by 97 gives a remainder of 67, and dividing 2345 by 97 gives a remainder of 17.
Hash Functions: Example
In the case that the memory addresses begin with 01 rather than 00, we choose the function H(k) = k (mod m) + 1 to obtain:
H(3205) = 4 + 1 = 5, H(7148) = 67 + 1 = 68, H(2345) = 17 + 1 = 18
– Midsquare method.
• The following calculations are performed:
k:    3205        7148        2345
k²:   10 272 025  51 093 904  5 499 025
H(k): 72          93          99
Observe that the fourth and fifth digits, counting from the right, are chosen for the hash address.
Hash Functions: Example
– Folding method.
• Chopping the key k into two parts and adding yields the following hash addresses:
H(3205) = 32 + 05 = 37, H(7148) = 71 + 48 = 19, H(2345) = 23 + 45 = 68
(For 7148, the sum 119 exceeds two digits, so the leading carry is ignored.)
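Assuming the hypothetical helpers hash_division, hash_midsquare and hash_folding sketched after the earlier slides, the example values above can be reproduced:

for k in (3205, 7148, 2345):
    print(k,
          hash_division(k),    # -> 4, 67, 17   (m = 97)
          hash_midsquare(k),   # -> 72, 93, 99  (4th and 5th digits of k**2)
          hash_folding(k))     # -> 37, 19, 68  (two-digit parts, carry dropped)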
Hash Tables
• When K is much smaller than U, a hash table
requires much less space than a direct-address
table
– Can reduce storage requirements to Θ(|K|)
– Can still get O(1) search time, but in the average case, not the worst case
Hash Tables
Idea:
– Use a function h to compute the slot for each key
– Store the element in slot h(k)
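A minimal Python sketch of this idea, with no collision handling yet; the class name, the table size, and the mod-based h are our own choices.

class NaiveHashTable:
    """Stores each element directly in slot h(key); collisions are ignored."""

    def __init__(self, m=13):
        self.m = m
        self.slots = [None] * m

    def h(self, key):
        return key % self.m              # any hash function could be used here

    def insert(self, key, value):
        # A second key with the same hash silently overwrites the first one!
        self.slots[self.h(key)] = (key, value)

    def search(self, key):
        entry = self.slots[self.h(key)]
        return entry[1] if entry is not None and entry[0] == key else None

The silent overwrite in insert is exactly the problem illustrated on the next two slides.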
Example: HASH TABLES
[Figure: keys k1, ..., k5 from the universe U of keys (K = the set of actual keys) are mapped by h into slots 0 .. m-1 of the table; note that h(k2) = h(k5).]
Do you see any problems
with this approach?
[Same figure: h(k2) = h(k5), so two different keys are mapped to the same slot. Collisions!]
Collisions
• Two or more keys hash to the same slot!!
• For a given set K of keys
– If |K| ≤ m, collisions may or may not happen,
depending on the hash function
– If |K| > m, collisions will definitely happen (i.e., there
must be at least two keys that have the same hash
value)
• Avoiding collisions completely is hard, even with
a good hash function
Collision Resolution
• Suppose we want to add a new record R with key k to our file F, but suppose the memory location with address H(k) is already occupied. This situation is called a collision.
• This subsection discusses two general ways of resolving collisions.
Collision Resolution
• The particular procedure that one chooses
depends on many factors.
• One important factor is the ratio of the number n of keys in K (which is the number of records in F) to the number m of hash addresses in L. This ratio, α = n/m, is called the load factor.
Handling Collisions
• We use the following methods:
– Chaining
– Open addressing
• Linear probing
• Quadratic probing
• Double hashing
Handling Collisions Using Chaining
• Idea:
– Put all elements that hash to the same slot into a
linked list
Insertion in Hash Tables
Alg.: CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(key[x])]
Searching in Hash Tables
Alg.: CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
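A possible Python rendering of the two chaining operations above; the class and method names are ours, and ordinary Python lists stand in for the linked lists.

class ChainedHashTable:
    """Collision resolution by chaining: one list ("chain") per slot."""

    def __init__(self, m=13):
        self.m = m
        self.table = [[] for _ in range(m)]

    def h(self, key):
        return key % self.m

    def insert(self, key, value):
        # CHAINED-HASH-INSERT: put the new element at the head of T[h(key)]
        self.table[self.h(key)].insert(0, (key, value))

    def search(self, key):
        # CHAINED-HASH-SEARCH: scan the chain stored in T[h(key)]
        for k, v in self.table[self.h(key)]:
            if k == key:
                return v
        return None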
Analysis of Hashing with Chaining:
Worst Case
• How long does it take to search for an element with a given key?
• Worst case:
– All n keys hash to the same slot, forming one long chain
– Worst-case time to search is Θ(n), plus the time to compute the hash function
Analysis of Hashing with Chaining:
Average Case
• Average case
– depends on how well the hash function distributes the n keys among the m slots
• Simple uniform hashing assumption:
– Any given element is equally likely to hash into any of the m slots (i.e., the probability of a collision, Pr(h(x) = h(y)), is 1/m)
• Length of the chain in slot j: nj = length of list T[j], for j = 0, 1, ..., m - 1
• Number of keys in the table: n = n0 + n1 + ··· + nm-1
• Average value of nj: E[nj] = α = n/m
Load Factor of a Hash Table
• Load factor of a hash table T:
α = n/m
– n = number of elements stored in the table
– m = number of slots
Analysis of Search in Hash Tables
• If m (the number of slots) is proportional to n (the number of elements in the table), then n = O(m) and
α = n/m = O(m)/m = O(1)
so searching the table takes constant time on average.
Open Addressing
• If we have enough contiguous memory to store all the keys (m > n), store the keys in the table itself, with no separate chains
• Each key is inserted by following its probe sequence h(k, p), p = 0, 1, ..., m - 1:
⟨h(k,0), h(k,1), ..., h(k,m-1)⟩
• Linear probing
• Quadratic probing
• Double hashing
Linear probing: Inserting a key
• Idea: when there is a collision, check the next available position in the table (i.e., keep probing):
h(k, i) = (h1(k) + i) mod m, i = 0, 1, 2, ...
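A sketch of linear-probing insertion under these assumptions; the function name is ours, and k mod m stands in for the auxiliary hash h1(k).

def linear_probe_insert(table, key):
    """Insert `key` into an open-addressed table (a Python list that uses
    None for empty slots) by probing h1(k), h1(k)+1, h1(k)+2, ... mod m."""
    m = len(table)
    h1 = key % m                      # stand-in for the auxiliary hash h1(k)
    for i in range(m):
        slot = (h1 + i) % m
        if table[slot] is None:       # first free slot found by the probe
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")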
Double Hashing
(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the increment for the probe sequence
h(k, i) = (h1(k) + i h2(k)) mod m, i = 0, 1, ...
• Initial probe: h1(k)
• Each later probe is offset by a further h2(k) (mod m), and so on
• Advantage: avoids clustering
• Disadvantage: harder to delete an element
• Can generate up to m² distinct probe sequences
Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k, i) = (h1(k) + i h2(k)) mod 13
• Insert key 14 into a table whose occupied slots are 1 (79), 4 (69), 5 (98), 7 (72), and 11 (50):
h(14, 0) = h1(14) = 14 mod 13 = 1, but slot 1 is occupied by 79
h(14, 1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5, but slot 5 is occupied by 98
h(14, 2) = (h1(14) + 2 h2(14)) mod 13 = (1 + 8) mod 13 = 9, and slot 9 is free, so 14 is stored there
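The probe sequence above can be checked with a short Python sketch; it is a direct transcription of this example, not library code.

def double_hash_probes(key, m=13):
    h1 = key % m                 # h1(k) = k mod 13
    h2 = 1 + key % 11            # h2(k) = 1 + (k mod 11)
    return [(h1 + i * h2) % m for i in range(m)]

table = [None] * 13
for slot, key in [(1, 79), (4, 69), (5, 98), (7, 72), (11, 50)]:
    table[slot] = key            # occupied slots taken from the example

for i, slot in enumerate(double_hash_probes(14)):
    if table[slot] is None:
        table[slot] = 14
        print(f"14 stored in slot {slot} after {i} collisions")  # slot 9, 2 collisions
        break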