Algorithms All Sortings
Introduction to Algorithms
In this section we shall see the definition of an algorithm, with simple examples.
Example 1: How to fill petrol in your car
1. Drive the car to a petrol station.
2. Stop the car beside a free petrol pump.
3. Switch off the engine.
4. Open the fuel tank cap.
5. Insert the pump nozzle into the tank (or ask the attendant to do so).
6. Fill the tank with the required quantity of petrol.
7. Replace the fuel tank cap.
8. Pay for the petrol.
9. Drive away.
A similar set of step-by-step instructions could be written for giving directions to a friend's house. Instructions like these are exactly what we mean by an algorithm
in computer programming. To make a computer useful in problem solving, we must give it the
problem as well as the technique to solve the problem. So by programming the computer
with various algorithms to solve problems, computers can be made "intelligent". Computers
are well-suited for solving tedious problems because of their speed and accuracy.
Much of the study of computer science is dedicated to finding efficient algorithms and
representing them so that computers can understand them. In our study of algorithms, we will
learn what defines an algorithm, algorithm design techniques, well-known algorithms and their
advantages.
Definition
In the introduction, we have given an informal definition of an algorithm as "a set of instructions
for solving a problem" and we illustrated this definition with instructions for filling petrol in a car
and giving directions to a friend's house. Although we can understand these simple algorithms,
they are far too ambiguous for a computer. For an algorithm to be executable by a computer, it
must have certain characteristics. We will state these characteristics in our formal definition of an
algorithm.
An algorithm is a well-ordered sequence of unambiguous and effectively computable
instructions that, when executed, produces a result and halts in a finite amount of time.
With the above definition, we can list five important characteristics of algorithms: they are
well-ordered; they have unambiguous operations; they have effectively computable operations;
they produce a result; and they halt in a finite amount of time. These five characteristics need a
little more explanation, so we will look at each one in detail.
Characteristics of Algorithms
a. Algorithms are well-ordered
Since an algorithm is a collection of instructions or operations, we must know the correct order in which to
execute the instructions. If the order is not clear, we may perform the wrong instruction or we
may be uncertain as to which instruction should be performed next. This is an important
characteristic because an algorithm can be executed by a computer only if it knows the exact
order of the steps to perform.
b. Algorithms have unambiguous operations
Each instruction in an algorithm must be sufficiently clear that it does not need to be
simplified any further. Given a list of numbers, we can easily order them from smallest to largest with the
simple instruction "Sort these numbers". However, a computer needs more detailed instructions to
sort numbers. It must be told how to search for the largest number, how to compare numbers,
and so on. The operation "Sort these numbers" is ambiguous to a computer
because the computer has no basic operation for sorting; the basic operations used for writing
algorithms must be ones the computer can carry out directly.
Consider, as a further example, the following "algorithm" for writing down all the even integers
greater than two:
1. Print 4
2. Print 6
3. Print 8
.
.
.
While our algorithm seems to be quite clear, we have two problems. First, the algorithm must
have an infinite number of steps because there are an infinite number of even integers greater
than two. Second, the algorithm will run forever. Both problems violate our definition, which requires that
an algorithm halt in a finite amount of time. Every algorithm must reach its end.
For further reading you may refer to the websites below.
http://mat.gsia.cmu.edu/classes/dynamic/dynamic.html
http://www.codechef.com/wiki/tutorial-dynamic-programming
Algorithm design techniques are common approaches to the construction of efficient solutions to
problems. Such methods are of interest because:
They can be translated into the common control and data structures provided by most
high-level languages.
The time and space requirements of the algorithms that result can be
precisely analyzed.
Even though more than one technique may be applicable for a certain problem, it is often the
case that an algorithm constructed by one approach is clearly superior to equivalent solutions
built using alternative techniques.
Following are the most important design techniques used to solve different types of problems:
Greedy algorithms
Divide-and-conquer
Dynamic programming
Example (the Greedy technique applied to the fractional knapsack problem): There are three objects A, B, and C, and the capacity of the kit bag is 20 Kg. The profits and weights of
the objects A, B, and C are (p1, p2, p3) = (25, 24, 15) and (w1, w2, w3) = (18, 15, 10).
We consider four feasible solutions. In the first feasible solution, we take fractions of the objects:
half of object A, one third of B and one fourth of C.
So total weight of the three objects taken into the kit bag is
18 * 1/2 + 15 * 1/3 + 10 * 1/4 = 9 + 5 + 2.5 = 16.5Kg,
which is less than the capacity of the kit bag (20), where 18, 15 and 10 are the weights of the
objects A,B and C respectively.
Total profit gained is 25 * 1/2 + 24 * 1/3 + 15 * 1/4 = 12.5 + 8 + 3.75 = $24.25, where 25, 24 and
15 are the profits of the objects A, B and C respectively.
Similarly, the profits earned in the remaining appropriate solutions are obtained like this.
2nd solution: 25 * 1 + 24 * 2/15 + 15 * 0 (object C is not taken) = 25 + 3.2 + 0 = $28.2
3rd solution: 25 * 0(object A is not taken) + 24 * 2/3 + 15 * 1 = 0 + 16 + 15 = $31
4th solution: 25 * 0(object A is not taken) + 24 * 1 + 15 * 1/2 = 0 + 24 + 7.5 = $31.5
It is clear that the 4th solution is the best among all of these, since it attains the
maximum profit. Using this exhaustive approach we can get the best solution (without
applying a Greedy algorithm), but it is time consuming. The same result can be achieved very easily
using the Greedy technique.
Applying the Greedy algorithm:
Arrange the objects in non-increasing order of the ratio pi/wi, i.e., p1/w1 >= p2/w2 >= ... >= pn/wn. This is our
selection criterion, and it helps in getting the best result, since it attains a balance between the rate at
which profit increases and the rate at which capacity is used. For the fractional knapsack problem, this
approach provides an optimal solution.
In the first stage, the entire object B can be taken into the kit bag, since its weight is 15 Kg
(less than the capacity of the kit bag, 20 Kg). So we gain a profit of $24, i.e., the profit of the
entire object B. In the second stage, only a fraction of object C, 10/2 = 5 Kg, is taken, because if we
took the entire object, the total weight of the objects collected in the kit bag
would become 15 + 10 = 25 Kg, which is beyond the capacity of the kit bag, 20 Kg.
Now the total weight of the objects in the kit bag is 20 Kg, and we cannot take any more objects
into the kit bag since that would lead to an infeasible solution. We have gained the maximum profit, i.e.,
$24 + $(15/2) = $24 + $7.5 = $31.5; we got the best solution directly using the Greedy technique.
Note: Greedy techniques are mainly used to solve optimization problems. They do not always
give the best solution. This will be discussed further in the introduction to another technique, "Dynamic
Programming".
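To make the selection rule concrete, here is a minimal Python sketch of the greedy method for the fractional kit-bag (knapsack) problem described above; the function name and data layout are illustrative rather than taken from the text.

def fractional_knapsack(profits, weights, capacity):
    # Consider objects in non-increasing order of profit/weight ratio,
    # taking each whole object while it fits and a fraction of the first one that does not.
    items = sorted(zip(profits, weights), key=lambda pw: pw[0] / pw[1], reverse=True)
    total_profit = 0.0
    remaining = capacity
    for p, w in items:
        if remaining <= 0:
            break
        take = min(w, remaining)          # whole object, or only the fraction that still fits
        total_profit += p * (take / w)
        remaining -= take
    return total_profit

# The example above: objects A, B, C with profits (25, 24, 15), weights (18, 15, 10), capacity 20
print(fractional_knapsack([25, 24, 15], [18, 15, 10], 20))   # prints 31.5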
For further reading you may refer to the websites below.
http://en.wikipedia.org/wiki/Analysis_of_algorithms
http://www.brpreiss.com/books/opus5/html/page58.html
Example
Binary search problem: Let A = {a1, a2, a3, ..., an} be a list of elements sorted in increasing order,
i.e., a1 <= a2 <= ... <= an. We need to determine whether a given element x is present in the list or
not. If it exists, return its position; otherwise return 0.
Applying Divide and Conquer
Determine the size of the list. If it is small (lower limit = upper limit, i.e., only one element),
compare it with the element to be searched for; if they are the same, we have found the solution.
Otherwise, if the list has more than one element, find the middle index. Divide the list into two
halves, such that the first sublist contains the elements from the first element to the element
before the middle index, and the second sublist contains the elements starting from the element
after the middle index to the end of the list. Compare the given element with the middle element.
We will find one of the three possible cases below.
1. If both are the same, return the middle index and stop the searching process.
2. If the element is smaller than the middle element, then the element can only exist in the
first half of the list (since the list is sorted). Apply the same searching
technique over the first sublist, which is a smaller instance of the main problem, until you
find the given element equal to the middle element, leading to a successful search, or an
empty list (the lower limit becomes greater than the upper limit or the upper limit becomes less
than the lower limit), leading to an unsuccessful search.
3. If the element is greater than the middle element, the element can only exist in the second
half of the list. Again apply the same divide and conquer strategy to get the solution.
For example, suppose we are searching for 88 in a sorted list of nine elements. At some stage, 88 is greater than the middle element of the current sublist, i.e., 69:
88 > 69
We need to confine the searching process to the second sublist:
Lower limit = 7 + 1 = 8, Upper limit = 9
Now the middle index is 8, and 88 matches the middle element:
88 = 88
So the position of 88 is 8.
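The search just traced can be written as a short Python sketch. The list values below are illustrative (only the position of 88 matches the trace above); positions are 1-based and 0 is returned for an unsuccessful search, as in the problem statement.

def binary_search(a, x):
    # a is sorted in increasing order; return the 1-based position of x, or 0 if absent
    low, high = 1, len(a)
    while low <= high:
        mid = (low + high) // 2          # middle index of the current sublist
        if a[mid - 1] == x:
            return mid
        elif x < a[mid - 1]:
            high = mid - 1               # confine the search to the first half
        else:
            low = mid + 1                # confine the search to the second half
    return 0                             # lower limit passed upper limit: unsuccessful search

print(binary_search([5, 12, 17, 23, 38, 44, 69, 88, 97], 88))   # prints 8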
Dynamic Programming
To see why a greedy choice can fail, consider a problem in which at each step we must pick one of several
numbers to add to a running total, and the goal is to reach the largest sum. Our greedy approach to obtaining an
optimal solution would be:
"At each step, choose the number that appears to be the optimal immediate choice."
This strategy chooses 12 instead of 3 at the second step, and will not reach the best solution,
which contains 99.
To solve these kinds of problems, we would need to check all possible decision
sequences and pick out the best one, but that is a time-consuming job. Dynamic
programming reduces this set of decision sequences by avoiding decision sequences that
cannot possibly be optimal. This is achieved by applying the "Principle of Optimality".
The Principle of Optimality states that an optimal sequence of decisions has the property that,
whatever the initial state and decision are, the remaining decisions must constitute an optimal
decision sequence with regard to the state resulting from the first decision.
Example: Travelling salesperson problem:
A salesperson has to make sales in n given cities. He must start from one city, visit
all the remaining cities exactly once, and return to the city from which he started.
This is considered one tour. We need to find an optimal tour, that is, a tour of minimum
cost.
Example: Let us consider one instance of the problem.
In the above digraph, the vertices represent the cities and the edge weights represent the costs.
The salesperson has to visit four cities v1, v2, v3 and v4. We take v1 as the starting city.
Below are the three possible tours we obtain (without applying Dynamic programming).
Let vk be the first vertex after v1 on an optimal tour; the subpath of that tour from vk to v1 must be the
shortest path from vk to v1 that passes through each of the other vertices exactly once. This is precisely the
Principle of Optimality applied to the tour.
The table below shows the costs of all possible tours starting from one city and visiting only the other
two cities.
Cost of the optimal tours starting from one city, visiting only two cities
The table below shows the costs of all possible tours starting from one city and visiting the other
three cities.
Cost of the optimal tour starting from one city, visiting the remaining three cities
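A minimal Python sketch of this tabulation (the Held-Karp dynamic programming method) is given below. Since the costs in the figure are not reproduced in the text, the cost matrix used here is purely illustrative.

from itertools import combinations

def tsp_dynamic(cost):
    # best[(S, k)] = cost of the cheapest path that starts at city 0,
    # visits every city in the set S exactly once, and ends at city k
    n = len(cost)
    best = {}
    for k in range(1, n):
        best[(frozenset([k]), k)] = cost[0][k]
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            S = frozenset(subset)
            for k in S:
                best[(S, k)] = min(best[(S - {k}, m)] + cost[m][k]
                                   for m in S if m != k)
    full = frozenset(range(1, n))
    # close the tour by returning to the starting city
    return min(best[(full, k)] + cost[k][0] for k in range(1, n))

cost = [[0, 10, 15, 20],
        [5, 0, 9, 10],
        [6, 13, 0, 12],
        [8, 8, 9, 0]]
print(tsp_dynamic(cost))   # prints 35, the cost of the optimal tour for this sample matrix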
Backtracking
A problem to be solved by this kind of state-space search is described by:
An initial state
Goal state(s)
A set of operators that transform one state into another. Each operator has
preconditions and postconditions.
A utility function, which evaluates how close a given state is to the goal state (optional).
The solving process is based on the construction of a state-space tree; its nodes represent
states, the root represents the initial state, and one or more leaves are goal states. Each edge is
labeled with some operator.
If a node b is obtained from a node a as a result of applying the operator O, then b is a child of a
and the edge from a to b is labeled with O.
The solution is obtained by searching the tree until a goal state is found.
Backtracking uses depth-first search, usually without a cost function. The main algorithm is as
follows:
1. Store the initial state in a stack.
2. While the stack is not empty, read a node from the stack, check whether it can be considered a goal
state and, if not, generate its children by applying the operators and store them on the stack.
The utility function is used to tell how close a given state is to the goal state and whether a given
state may be considered as a goal state or not.
If no children can be generated from a given node, then we backtrack - we read the next node
from the stack.
Example:
You are given two jugs, a 4-gallon one and a 3-gallon one. Neither jug has any
measuring marks on it. There is a tap that can be used to fill the jugs with water.
How can you get exactly 2 gallons of water into the 4-gallon jug?
We have to decide how to represent the problem states, the initial and final states, and the available operators.
For example:
1. Problem state: a pair of numbers (X, Y), where X is the amount of water in the 4-gallon jug A
and Y is the amount of water in the 3-gallon jug B.
Initial state: (0, 0)
Final state: (2, _), where "_" means "any quantity"
2. Available actions (operators): fill A from the tap, fill B from the tap, empty A, empty B,
pour water from A into B until B is full or A is empty, and pour water from B into A until A is full or B is empty.
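A small Python sketch of the state-space search for this example is shown below. It performs a depth-first search with an explicit stack, as in the backtracking outline above; the operator set used is the standard fill/empty/pour set assumed here.

def solve_jugs():
    # A state is a pair (x, y): x gallons in the 4-gallon jug A, y gallons in the 3-gallon jug B.
    def successors(x, y):
        return {
            (4, y), (x, 3),                       # fill A or B from the tap
            (0, y), (x, 0),                       # empty A or B
            (min(4, x + y), max(0, x + y - 4)),   # pour B into A
            (max(0, x + y - 3), min(3, x + y)),   # pour A into B
        }

    stack = [[(0, 0)]]                            # stack of partial paths from the initial state
    visited = {(0, 0)}
    while stack:
        path = stack.pop()
        x, y = path[-1]
        if x == 2:                                # goal state (2, _)
            return path
        for state in successors(x, y):
            if state not in visited:
                visited.add(state)
                stack.append(path + [state])
    return None

print(solve_jugs())   # prints one sequence of states leading to 2 gallons in the 4-gallon jug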
Asymptotic notations
Recurrence Relations
The amount of work an algorithm does usually depends on the size of its input. For example, when
reading a file, the number of read operations depends on the number of records in
the file.
If N, the size of the input, is the number of elements to be processed by an algorithm,
then the number of operations can be represented as a function of N: f(N) (sometimes we use
lower case n).
We can compare the complexity of two algorithms by comparing the corresponding functions.
Moreover, we are interested in what happens to these functions for large N, i.e. we are interested in
the asymptotic growth of these functions.
Classification of functions by their asymptotic growth
Each particular growing function has its own speed of growth. Some functions grow more slowly,
others grow faster.
The speed of growth of a function is called its asymptotic growth. We can compare functions by
studying their asymptotic growth.
Asymptotic notations
Given a function f(n), all other functions fall into three classes: functions that grow more slowly than
f(n), functions that grow at the same rate as f(n), and functions that grow faster than f(n).
Recurrence Relations
The most common methods for solving recurrence relations are:
Substitution method
Recursion tree method
Master's method
a. Substitution method
The premise for this type of solution is to continually substitute the recurrence relation on the
right hand side (i.e. substitutes a value into the original equation and then derives a previous
version of the equation). A variety of derivations should lead to a generalized form of the
equation that can be solved for a time constraint on the algorithm.
T(n) = T(n/2) + c
At this point n/2 is the substitution for n in the original equation:
T(n/2) = T((n/2)/2) + c
T(n/2) = T(n/4) + c
To derive the original equation, add c to both sides of the equation
T(n/2) + c = T(n/4) + c + c
Therefore:
T(n) = T(n/2) + c = T(n/4) + c + c
Now another equation has been derived, a new substitution that can be made in the original: n/4
for n
T (n/4) = T((n/4)/2) + c
T (n/4) = T(n/8) + c
T (n/4) + c + c = T(n/8) + c + c + c
T(n) = T(n/2) + c = T(n/4) + c + c = T(n/8) + c + c + c
In general, the basic equation is:
T(n) = T(n/2^k) + kc for k > 0
Assuming that n = 2^k allows for:
T(n) = T(n/2^k) + kc
T(n) = T(n/n) + (log n) * c
T(n) = T(1) + c log n
T(n) = b + c log n
T(n) = O(log n)
b. Recursion tree method
It is helpful for visualizing what happens when a recurrence is iterated. It diagrams the tree of
recursive calls and the amount of work done at each call. For instance, consider the recurrence
T(n) = 2T(n/2) + n^2.
The recursion tree for this recurrence is of the following form:
http://algs4.cs.princeton.edu/23quicksort/
1. Exchanging the Values of Two Variables
2. Counting
3. Summation of a Set of Numbers
4. Factorial Computation
5. Reversing the Digits of an Integer
In this section, we will exchange the values of two variables, i.e. swap the values
of the two variables, step by step. Working through the mechanism with a particular
example can be a useful way of detecting design faults.
Exchanging the Values of Two Variables
Problem Statement
Given two variables a and b, exchange their values. Suppose we start with the configuration in which
memory cell (variable) a contains the value 721 and memory cell (variable) b contains the value 463.
Algorithm development
Our task is to replace the contents of a with 463, and the contents of b with 721. In other words,
we want to end up with the configuration below:
Target Configuration
To change the value of a variable, we can use the assignment operator. As we want a to assume
the value currently belonging to b, and b the value belonging to a, we could perhaps make the
exchange with the following assignments:
a := b (1)
b := a (2)
where " := " is the assignment operator. In (1), " := " causes the value stored in memory cell b to
be copied into memory cell a.
Let us work through these two steps to make sure they have the desired effect.
We started out with the configuration
The assignment (1) has changed the value of a but has left the value of b untouched. Checking
against our target configuration, we see that a has assumed the value 463 as required. So far so
good! We must also check on b. When the assignment step (2), i.e. b := a, is made after executing
step (1), we end up with:
In executing step (2) a is not changed while b takes on the value that currently belongs to a. The
configuration that we have ended up with does not represent the solution we are seeking. The
problem arises because in making the assignment:
a:=b
we have lost the value that originally belonged to a ( i.e. 721 has been lost). It is this value that
we want b to finally assume. Our problem must therefore be stated more carefully as:
new value of a := old value of b;
new value of b := old value of a;
What we have done with our present proposal is to make the assignment
new value of b := new value of a
In other words, when we execute step (2) we are not using the value of a that will make things work
correctly, because a has already been changed.
To solve this exchange problem, we need to find a way of not destroying "the old value of a"
when we make the assignment
a := b
A way to do this is to introduce a temporary variable t and copy the original value of a into this
variable before executing step (1). The steps to do this are:
t := a;
a := b;
After these two steps, we have
We are better off than last time because now we still have the old value of a stored in t. It is this
value that we need for assignment to b. We can therefore make the assignment
b := t;
After execution of this step, we have
Rechecking with our target configuration, we see that a and b have now been interchanged as
required. The exchange procedure can now be outlined.
The use of an intermediate variable allows the exchange of two variables to proceed
correctly.
Working through the mechanism with a particular example can be a useful way of
detecting design faults.
The same mechanism is frequently needed to exchange two array elements
(e.g. a[i] and a[j]). The steps for such an exchange are:
t := a[i];
a[i] :=a[j];
a[j] := t;
Performance Analysis of exchange of two variables
In this algorithm, the use of an intermediate temporary variable allows the exchange of two
variables to proceed correctly.
This algorithm runs in O(1) time.
For further reading you may refer to the websites below.
webserver.ignou.ac.in/virtualcampus/adit/course/cst103/block3/unit2/cst103-bl3-u2-06.htm
Counting
Sometimes we need to count variables, values, objects, etc. in an application in order to
program it better. The counting concept is used in almost every basic application; for example,
counting the number of contacts in your cell phone and then calculating the memory they
occupy. There are many such applications where counting comes into the picture.
Problem Statement
Given a set of n students' examination marks (in the range 0 to 100), make a count of the number of
students that passed the examination. A pass is awarded for all marks of 50 and above.
Algorithm development
Counting mechanisms are very frequently used in computer algorithms. Generally, a count must
be made of the number of items in a set which possess some particular property or which satisfy
some particular condition or conditions. This class of problems is typified by the "examination
marks" problem.
As a starting point for developing a computer algorithm for this problem, we can consider how we
might solve a particular example by hand.
Suppose we are given the set of marks
55, 42, 77, 63, 29, 57, 89
To make a count of the passes for this set, we can start at the left, examine the first mark (i.e. 55),
see whether it is >= 50, and remember that one student has passed so far. The second mark is then
examined and no adjustment is made to the count. When we arrive at the third mark, we see it is
>= 50 and so we add one to our previous count. The process continues in a similar manner until all
marks have been tested.
In more detail, we have:
Order in which marks are examined
After each mark has been processed, the current count reflects the number of students who have
passed in the marks list so far encountered. We must now ask, how can the counting be
achieved? From our example above, we see that every time we need to increase the count , we
build on the previous value. That is,
current_count := previous_count + 1
When, for example, we arrive at mark 57, we have
previous_count := 3
current_count therefore becomes 4. Similarly, when we get to the next mark (i.e. 89)
current_count of 4 must assume the role of the previous_count. This means that whenever a new
current_count is generated, it must then assume the role of previous_count before the next mark
is considered. The two steps in this process can be represented by
current_count := previous_count + 1 (1)
previous_count := current_count (2)
These two steps can be repeatedly applied to obtain the count required. In conjunction with the
conditional test and input of the next mark we execute (1), followed by step (2), followed by step
(1), and followed by step (2) and so on.
Because of the way in which previous_count is employed in step (1), we can substitute the
expression for previous_count given by step (2) into step (1) to obtain the simpler expression
current_count := current_count + 1
The current_count on the RHS (right-hand side) of the expression assumes the role of
previous_count. As this statement involves an assignment rather than an equality (which would
be impossible), it is a valid computer statement. What it describes is the fact that the new value of
current_count is obtained by adding 1 to the old value of current_count. Viewing the mechanism in
this way makes it clear that the existence of the variable previous_count in its own right is
unnecessary. As a result, we have a simpler counting mechanism. The essential steps in our
pass-counting algorithm can therefore be summarized as:
while less than n marks have been examined do
(a) read next mark,
(b) if current mark satisfies pass requirement, then add one to count.
Before any marks have been examined, the count must have the value zero. To complete the
algorithm, the input of the marks and the output of the number of passes must be included. The
detailed algorithm is described below.
1. Prompt for and read the number of marks to be processed, n.
2. Initialize count to zero and the number of marks processed, i, to zero.
3. While i < n do
4. (a) read the next mark and add one to i,
5. (b) if the mark is a pass (i.e. 50 or more) then add one to count.
6. Write out the total number of passes.
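As a check on the design, here is a minimal Python sketch of the pass-counting loop; a list of marks stands in for the input of the marks one at a time.

def count_passes(marks):
    count = 0                          # number of passes encountered so far
    for mark in marks:                 # examine the n marks one at a time
        if mark >= 50:                 # a pass is awarded for marks of 50 and above
            count = count + 1          # current_count := current_count + 1
    return count

print(count_passes([55, 42, 77, 63, 29, 57, 89]))   # prints 5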
Performance Analysis
1. Initially, and each time through the loop, the variable count represents the number of
passes so far encountered. On termination (when i = n) count represents the total
number of passes in the set. Because i is incremented by 1 with each iteration,
eventually the condition i < n will be violated and the loop will terminate.
2. It was possible to use substitution to improve the original solution to the problem. The
simplest and most efficient solution to a problem is usually the best all-round solution.
Summation of a Set of Numbers
Often mathematical formulae require the addition of many variables. Summation (sigma)
notation is a convenient and simple form of shorthand used to give a concise expression for the
sum of the values of a variable.
Let x1, x2, x3, ..., xn denote a set of n numbers. x1 is the first number in the set and xi represents the ith
number in the set.
Using the summation concept, we can make the logic more accurate and efficient.
Problem Statement
Given a set of n numbers, design an algorithm that adds these numbers and returns the
resultant sum. Assume n is greater than or equal to zero.
Algorithm development
One of the most fundamental things that we are likely to do with a computer is to add a set of n
numbers. When confronted with this problem in the absence of a computer, we simply write the
numbers down one under the other and start adding up the right column. For example, consider
the addition of 421, 583 and 714:
421
583
714
----
1718
(We start by adding the right column, which gives 8, then carry into the next column, and so on.)
In designing a computer algorithm to perform this task, we must take a somewhat different
approach. The computer has a built-in device which accepts two numbers to be added, performs
the addition and returns the sum of the two numbers (see Fig. 2.28). In designing an algorithm to
add a set of numbers, a primary concern is the mechanism for the addition process. We will
concentrate first on this aspect of the problem before looking at the overall design.
The simplest way that we can instruct the computer's arithmetic unit to add a set of numbers is to
write down an expression that specifies the addition we wish to have performed. For our three numbers
mentioned previously, we could write
s := 421 + 583 + 714 (1)
The assignment operator causes the value resulting from the evaluation of the right-hand side of
statement (1) to be placed in the memory cell allocated to the variable s. Expression (1) will add the
three specific numbers as required. Unfortunately, it is capable of doing little else. Suppose we
wanted to sum three other numbers. For this task, we would need a new program statement. It
would therefore seem reasonable that all constants in expression (1) could be replaced by
variables.
We would then have
s := a + b + c (2)
Expression (2) adds any three numbers provided they have previously been assigned as the values of
a, b and c respectively. Expression (2), as the basis of a program for adding numbers, is more
general and more useful than expression (1). It still has a serious deficiency: it can only add sets
of three numbers. A fundamental goal in designing algorithms and implementing programs is to
make the programs general enough so that they will successfully handle a wide variety of input
conditions. That is, we want a program that will add any n numbers, where n can take on a wide
range of values. The approach we need to take to formulate an algorithm to add n numbers in a
computer is different from what we would do conventionally to solve the problem. Conventionally, we would
add all the numbers in one operation; on a computer we instead keep a running sum and interleave
the input of successive numbers with each iterative step. Our complete algorithm can be
outlined.
1. Prompt for and read the number of numbers to be summed, n.
2. Initialize the sum s to zero and the count of numbers read, i, to zero.
3. While i is less than n do
4. (a) read the next number and add one to i,
5. (b) compute the current sum by adding the number read to the most recent sum.
6. Write out the sum of the n numbers.
Performance Analysis
Initially, and each time through the loop, the sum s reflects the sum of the first i
numbers read. On termination (when i = n) s will represent the sum of the n numbers.
Because i is incremented by 1 with each iteration, eventually the condition i < n will be
violated and the loop will terminate.
The design makes no allowance for the accuracy of the resultant sum or
the finite size of numbers that can be accurately represented in the computer. An
algorithm that minimizes errors in summation does so by adding, at each stage, the two
smallest numbers remaining.
The obvious or direct solution to the problem is considerably different from the computer
solution. The requirement of flexibility imposes this difference on the computer solution.
Consideration of the problem at its lowest limit (i.e. n = 0) leads to a mechanism that can
be extended to larger values of n by simple repetition. This is a very common device in
computer algorithm design.
A program that reads and sums n numbers is not a very useful programming tool. A much more
practical implementation is a function that returns the sum of an array of n numbers. That is
function asum(var a : nelements; n : integer) : real;
var
  sum : real; {the partial sum}
  i : integer;
begin {compute the sum of n real array elements (n >= 0)}
  sum := 0.0;
  for i := 1 to n do
    sum := sum + a[i];
  asum := sum
end;
Applications
Average calculations, variance and least square calculations.
For further reading you may refer to the websites below.
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=binarySearch
Factorial Computation
The factorial of n is defined by 0! = 1 and
n! = n x (n - 1)! for n >= 1
Using this definition, we can write the first few factorials as:
1! = 1x0!
2! = 2x1!
3! = 3x2!
:
:
If we start with p = 0!, we can rewrite the first few steps in computing n! as:
p := 1         (1)             {p = 0!}
p := p * 1                     {p = 1!}
p := p * 2                     {p = 2!}
p := p * 3     (2 ... n + 1)   {p = 3!}
p := p * 4                     {p = 4!}
:
:
From step (2) onwards we are actually repeating the same process over and over. For the
general (i + 1)th step, we have
p := p * i          (i + 1)
This general step can be placed in a loop to iteratively generate n!. This allows us to take
advantage of the fact that the computer's arithmetic unit can only multiply two numbers at a time.
In many ways, this problem is very much like the problem of summing a set of n numbers
(algorithm 2.4.1.3). In the summation problem, we performed a set of additions, whereas in this
problem we need to generate a set of products. It follows from the general (i + 1)th step that all
factorials for n >= 1 can be generated iteratively. The instance where n = 0 is a special case which
must be accounted for directly by the assignment
p := 1 (by definition of 0!)
The central part of the algorithm for computing n! therefore involves a special initial step followed
by n iterative steps:
1. Treat 0! as a special case (p := 1).
2. Build each of the n remaining products p from its predecessor by an iterative process.
Algorithm description
1. Establish n, the factorial required (n >= 0).
2. Set product p for 0! (special case). Also set the product count i to zero.
3. While i is less than n do
4. (a) increase the product count i by 1,
5. (b) compute the ith product p by multiplying i by the most recent product.
6. Return the result n! (the final value of p).
This algorithm is most usefully implemented as a function that accepts as input a number n and
returns as output the value of n!.
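A minimal Python sketch of such a function is given below; it mirrors the special initial step for 0! followed by the n iterative multiplication steps.

def factorial(n):
    p = 1                              # p set for 0!, the special case
    for i in range(1, n + 1):          # n iterative steps: p := p * i
        p = p * i
    return p

print(factorial(0), factorial(5))      # prints 1 120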
Applications
Probability, statistical and mathematical computations.
Reversing the Digits of an Integer
Introduction
Reversing the digits of an integer plays an important role in hashing as well as in information
retrieval. Even in database operations, reversing has its role.
Let us discuss how we can reverse the digits of an integer using a step-by-step process, i.e. by
means of an algorithm.
Problem Statement
Design an algorithm that accepts a positive integer and reverses the order of its digits.
Algorithm development
Digit reversal is a technique that is sometimes used in computing to remove bias from a set of
numbers. It is important in some fast information-retrieval algorithms. A specific example clearly
defines the relationship of the input to the desired output. For example,
Input : 27953
Output : 35972
Although we might not know at this stage exactly how we are going to do this reversal, one thing
is clear: we are going to need to access the individual digits of the input number. As a starting point,
we will concentrate on this aspect of the procedure. The number 27953 is actually
2 x 10^4 + 7 x 10^3 + 9 x 10^2 + 5 x 10^1 + 3
To access the individual digits, it is probably going to be easiest to start at one end of the number
and work through to the other end. The question is, at which end should we start? Since, other than
visually, it is not easy to tell how many digits there are in the input number, it will be best to try to
establish the identity of the least significant digit (i.e. the rightmost digit). To do this, we need to
effectively "chop off" the least significant digit of the number. In other words, we want to end up
with 2795, with the 3 removed and identified.
We can get the number 2795 by integer division of the original number by 10,
i.e. 27953 div 10 = 2795
This chops off the 3 but does not allow us to save it. However, 3 is the remainder that results
from dividing 27953 by 10. To get this remainder we can use the mod function. That is,
27953 mod 10 = 3
Therefore, if we apply the following two steps
r := n mod 10 (1) => (r = 3)
n := n div 10 (2) => (n = 2795)
we get the digit 3, and the new number 2795. Applying the same two steps to the new value of n,
we can obtain the digit 5. We now have a mechanism for iteratively accessing the individual
digits of the input number.
Our next major concern is to carry out the digit reversal. When we apply our digit extraction
procedure to the first two digits, we acquire first the 3 and then 5. In the final output, they appear
as:
3 followed by 5 (or 35)
If the original number were 53, then we could obtain its reverse by first extracting the 3,
multiplying it by 10, and then adding 5 to give 35.
That is,
3 x 10 + 5 = 35
The last three digits of the input number are 953. They appear in the "reversed" number as 359.
Therefore, at the stage when we have the 35 and then extract the 9, we can obtain the
sequence 359 by multiplying 35 by 10 and adding 9.
That is,
35 x 10 + 9 = 359
Similarly,
359 x 10 + 7 = 3597
and
3597 x 10 + 2 = 35972
The last number obtained from the multiplication and addition process is the "digit reversed"
integer we have been seeking. On closely examining the digit extraction, and the reversal
process, it is evident that they both involve a set of steps that can be performed iteratively.
We must now find a mechanism for building up the "reversed" integer digit by digit. Let us
assume that the variable "dreverse" is to be used to build the reversed integer. At each stage in
building the reversed integer, its previous value is used in conjunction with the most recently
extracted digit.
Rewriting the multiplication and addition process we have just described in terms of the variable
dreverse, we get:
dreverse := (previous value of dreverse) * 10 + (most recently extracted rightmost digit)
Reversing the integer digit by digit
The steps that must be carried out repeatedly are therefore:
(a) extract the rightmost digit from the number being reversed and append this digit to the
right-hand end of the current reversed number representation;
(b) remove the rightmost digit from the number being reversed.
When we include input and output considerations and details on initialization and termination, we
arrive at the following algorithm description.
1. Establish n, the positive integer to be reversed.
2. Set the initial condition for the reversed integer, dreverse, to zero.
3. While the integer being reversed is greater than zero do
4. (a) use the remainder function to extract the rightmost digit of the number being
reversed,
5. (b) increase the reversed integer representation by a factor of 10 and add to it the most
recently extracted digit,
6. (c) use integer division by 10 to remove the rightmost digit from the number being reversed.
This algorithm is most suitably implemented as a function which accepts as input the integer to
be reversed and returns as output the integer with its digits reversed.
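A Python sketch of such a function is shown below, using // for integer division (div) and % for the remainder (mod).

def reverse_digits(n):
    dreverse = 0                       # the reversed integer, built up digit by digit
    while n > 0:
        r = n % 10                     # extract the rightmost digit (n mod 10)
        dreverse = dreverse * 10 + r   # append it to the reversed representation
        n = n // 10                    # remove the rightmost digit (n div 10)
    return dreverse

print(reverse_digits(27953))           # prints 35972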
Applications
Hashing and information retrieval, and database applications.
The Smallest Divisor of an Integer
Problem Statement
Given an integer n, devise an algorithm that will find its smallest exact divisor other than one.
Algorithm development
Taken at face value, this problem seems to be rather trivial. We can take the set of numbers 2, 3,
4, ..., n and divide each one in turn into n. As soon as we encounter a number in the set that
divides exactly into n, our algorithm can terminate, as we must have found the smallest exact
divisor of n. This is all very straightforward. The question that remains, however, is whether we can design a
more efficient algorithm.
As a starting point for this investigation, let us work out and examine the complete set of divisors
for some particular number. Choosing the number 36 as our example, we find that its complete set
of divisors is
{2, 3, 4, 6, 9, 12, 18}
We know that an exact divisor of a number divides into that number leaving no remainder. For
the exact divisor 4 of 36, we have 36 / 4 = 9.
That is, there are exactly 9 fours in 36. It also follows that the bigger number 9 also divides
exactly into 36. That is,
36 / 4 = 9
36 / 9 = 4
and 36 / (4 x 9) = 1
Similarly, if we choose the divisor 3, we find that it tells us there is a bigger number, 12, that is
also an exact divisor of 36. From this discussion, we can draw the conclusion that the exact divisors
of a number come in pairs.
Clearly, in this example we would not have to consider either 9 or 12 as potential candidates for
being the smallest divisor, because each is linked with another, smaller divisor. For our complete
set of divisors of 36, we see that:
Complete set of divisors of 36
From this set, we can see that the smallest divisor (2) is linked with the largest divisor (18), the
second smallest divisor (3) is linked with the second biggest divisor (12), and so on. Following this
line of reasoning, we can see that our algorithm can safely terminate when we have a pair of
factors that correspond to
(a) the biggest smaller factor s,
(b) the smallest bigger factor b.
Since s <= b and s x b = n, the smallest divisor need only be sought among the candidates 2, 3, ... up to
the square root of n; if none of these divides n exactly, then n itself is prime.
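A minimal Python sketch based on this observation is given below; the handling of a prime n (whose smallest divisor other than one is n itself) is an assumption added here, since the text breaks off before giving the final algorithm.

def smallest_divisor(n):
    # Try candidate divisors 2, 3, 4, ... but only up to the square root of n:
    # divisors pair up as (s, b) with s * b = n, so if no divisor is found by
    # then, n has no divisor other than 1 and itself.
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d = d + 1
    return n                           # n is prime

print(smallest_divisor(36), smallest_divisor(37))   # prints 2 37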
Greatest Common Divisor of Two Integers
Now let us extend this idea to the representation of two integers, 30 and 18, whose GCD x we
may be seeking.
Studying the diagram below (Fig: 2.32), we see that for the common-divisor situation both n and
m can be thought of as being divided up into segments of size x. When the two blocks n and m
are aligned from the left, the section AB of n must match the full length of m. The number n, however,
exceeds m by the length BC. The question is, what can we say about the segment BC?
If x is going to be an exact divisor of both n and m, and AB is exactly divided by x, then
BC must also be exactly divided up into segments of size x.
Having made this observation, our next problem is to try to work out how to find the largest of the
common divisors that n and m may share.
Considering the simpler problem first, we know that the greatest divisor of a single number is the
number itself (e.g. The greatest exact divisor of 30 is 30).
To try to answer this question, let us return to our specific problem of trying to find the GCD of 18
and 30. We have the situation shown in Fig: 2.33 below. The other piece of information available
to us is that the segment BC will need to be exactly divided by the GCD. Since 18 is not the
GCD, the number x we are seeking must be less than 18 and it must divide exactly into the
segment BC. The biggest number that divides exactly into BC must be 12, since BC is 12 itself. If
12 is to be the GCD of 18 and 30, it will have to divide exactly into both 18 and 30. In taking this
step, we have actually reduced our original problem to the smaller GCD problem involving 12
and 18, as shown in Fig: 2.34.
Applying a similar argument to our smaller problem, we find that since 12 is not a divisor of 18,
we are going to end up with a still smaller problem to consider. That is, we have the situation
shown in Fig: 2.35. With this latter problem, the smaller of the two numbers 6 and 12 (i.e. 6) is an
exact divisor of 12. Once this condition is reached, we have established the GCD for the current
problem and hence also for our original problem.
We can now summarize our basic strategy for computing the GCD of two numbers:
1. Divide the larger of the two numbers by the smaller number.
2. If the smaller number exactly divides into the larger number then
the smaller number is the GCD.
else
remove from the larger number the part common to the smaller number and
repeat the whole procedure with the new pair of numbers.
Our task now is to work out the details for implementing and terminating the GCD mechanism.
First let us consider how to establish if the smaller number exactly divides into the larger number.
Exact division can be detected by there being no remainder after integer division. The mod
function allows us to compute the remainder resulting from an integer division. We can use:
r := n mod m
provided we have initially ensured that n >= m. If r is zero, then m is the GCD. If r is not zero, then,
as it happens, it corresponds to the "non-common" part between n and m (e.g. 30 mod 18 = 12).
It is therefore our good fortune that the mod function gives us just the part of n we need for
solving the new smaller GCD problem.
Furthermore, r by definition must be smaller than m. What we need to do now is set up our
iterative construct using the mod function. To try to formulate this construct, let us return to our
GCD (18, 30) problem.
For our specific example, we have:
Our example suggests that with each reduction in the problem size the smaller integer assumes
the role of the larger integer and the remainder assumes the role of the smaller integer.
The reduction in problem size and role changing steps are carried out repeatedly. Our example
therefore suggests that the GCD mechanism can be captured iteratively with a loop of the form:
while GCD not found do
(a) get remainder by dividing the larger integer by the smaller integer;
(b) let the smaller integer assume the role of the larger integer;
(c) let the remainder assume the role of the smaller integer.
Now we must decide in detail how the loop should terminate. Our earlier discussion established
that the current divisor will be the GCD when it divides exactly into the integer that is larger (or
equal) member of the pair. The exact division will correspond to a zero remainder. It follows that
we can use this condition to terminate the loop. For example:
while the remainder is not equal to zero do
Algorithm description
1. Establish the two positive non-zero integers, smaller and larger, whose GCD is being
sought.
2. Repeatedly
(a) get the remainder from dividing the larger integer by the smaller integer;
(b) let the smaller integer assume the role of the larger integer;
(c) let the remainder assume the role of the smaller integer;
until a zero remainder is obtained.
3. Return the GCD of the original pair of integers (the divisor that produced the zero remainder).
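The algorithm description above corresponds to the following Python sketch of the method.

def gcd(larger, smaller):
    while smaller != 0:
        r = larger % smaller           # get the remainder
        larger = smaller               # the smaller integer assumes the role of the larger
        smaller = r                    # the remainder assumes the role of the smaller
    return larger                      # the divisor that produced the zero remainder

print(gcd(30, 18))                     # prints 6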
Applications
Reducing a fraction to its lowest terms.
Computing the Prime Factors of an Integer
Problem Statement
Every integer can be expressed as a product of prime numbers. Design an algorithm to compute
all the prime factors of an integer n.
Algorithm development
Examination of our problem statement suggests that
n = f1 x f2 x f3 x ... x fk, where n > 1 and f1 <= f2 <= ... <= fk
The elements f1, f2, f3, ...., fk are all prime numbers. Applying this definition to some specific
examples, we get:
8=2x2x2
12 = 2 x 2 x 3
18 = 2 x 3 x 3
20 = 2 x 2 x 5
60 = 2 x 2 x 3 x 5
An approach to solving this factorization problem that immediately comes to mind is to start with
the divisor 2 and repeatedly reduce n by a factor of 2 until 2 is no longer an exact divisor. We
then try 3 as a divisor and again repeat the reduction process and so on until n has been
reduced to 1.
Consider the case when n = 60.
Marking with an asterisk the unsuccessful attempts to divide, we have
We can make several observations from this example. Firstly, 2 is the only even number that we
need to try. In fact if we pay careful attention to our original definition, we will see that only prime
numbers should be considered as candidate divisors. Our present approach is to test a lot of
unnecessary divisors. From this, we can begin to see that generation of a set of prime numbers
should be an integral part of our algorithm. The prime factors that we need to try as divisors of n need be no
greater than √n. This suggests that we should perhaps produce a list of primes up to √n before going through the
process of trying to establish the prime factors of n.
Further thought reveals that there is a flaw in this strategy, which comes about because prime
factors may occur multiple times in the factorization we are seeking. A consequence of this is that in
precomputing all the primes up to √n, we may end up computing a lot more primes than are
needed as divisors for the current problem. (As an extreme example, if n were 1024, we would
calculate primes up to 32 whereas in fact the largest prime factor of 1024 is only 2.)
A better and more economical strategy is therefore to only compute prime divisors as they are
needed. For this purpose, we can include a modified version of the sieve of Eratosthenes that we
developed earlier. As in our earlier algorithm, as soon as we have discovered n is prime, we can
terminate. At this stage, let us review the progress we have made. The top-level description of
the central part of the algorithm is:
While "it has not been established that n is prime" do
begin
(a) if nxtprime is divisor of n then save nxtprime as a factor and
reduce n by "nxtprime"
else get next prime,
(b) try nxtprime as a divisor of n.
end
We now must work out how the "not prime" test for our outer loop should be implemented. The
technique we employed earlier was to use integer division and test for a zero remainder. Once
again this idea is applicable. We also know that as soon as the prime divisor we are using in our test
becomes greater than √n, the process can terminate.
Initially, when the prime divisors we are using are much less than √n, we know that the testing must
continue. In carrying out this process, we want to avoid having to calculate the square root of n
repeatedly. Each time we make the division
n div nxtprime (e.g. 60 div 2)
we know the process must continue until the quotient q resulting from this division is less than
"nxtprime".
At this point, we will have (nxtprime)^2 > n,
which will indicate that n is prime. The conditions for it not yet being established that n is prime
are therefore:
(a) exact division (i.e. r := n mod nxtprime = 0),
(b) quotient greater than the divisor (i.e. q := n div nxtprime > nxtprime).
The truth of either condition is sufficient to indicate that the factorization process must continue.
Now we need to explore how the algorithm will terminate. If we follow the factorization process
through for a number of examples, we find that there are two ways in which the algorithm can
terminate. One way for termination is where n is eventually reduced to 1. This can happen when
the largest prime factor is present more than once (e.g. in the case of 18, where the factors are
2 x 3 x 3).
The other possible situation is where we terminate with a prime factor that occurs only once (e.g.
the factors of 70 are 2 x 5 x 7). In this instance, we have a termination condition where n is > 1.
Therefore, after our loop terminates, we must check which termination condition applies and
adjust the prime factors accordingly. The only other considerations are the initialization conditions
and the dynamic generation of primes as required.
Since we have already considered the prime number generation problem before, we will assume
there is a function available which when given a particular prime as an argument returns the next
prime. The sieve of Eratosthenes can be readily adapted for this purpose and in the present
example, we will assume that the procedure eratosthenes which returns nxtprime is available.
The prime factorization algorithm can now be given in detail.
Algorithm description
1. Establish n, the number whose prime factors are sought.
2. Compute the remainder r and quotient q for the first prime, nxtprime = 2.
3. While it has not been established that n is prime do
(a) if nxtprime is an exact divisor of n then
(a.1) save nxtprime as a factor f,
(a.2) reduce n by nxtprime;
else
(a'.1) get the next biggest prime from the sieve of Eratosthenes;
(b) compute the next quotient q and remainder r for the current value of n and the
current prime divisor nxtprime.
4. If n is greater than 1 then
add n to the list as a prime factor f.
5. Return the prime factors f of the original number n.
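A Python sketch of the factorization process is shown below. For simplicity it tries successive integers 2, 3, 4, ... as divisors rather than generating primes with the sieve of Eratosthenes as the text suggests; composite trial divisors can never divide the reduced n, so the factors produced are still prime.

def prime_factors(n):
    factors = []
    d = 2                              # trial divisor (plays the role of nxtprime)
    while d * d <= n:                  # while it is not yet established that n is prime
        if n % d == 0:                 # d is an exact divisor of n
            factors.append(d)          # save d as a factor
            n = n // d                 # reduce n by d
        else:
            d = d + 1                  # get the next trial divisor
    if n > 1:                          # termination with a prime factor occurring once
        factors.append(n)
    return factors

print(prime_factors(60))               # prints [2, 2, 3, 5]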
Application
Factoring numbers with up to six digits.
1. Sorting by Selection
2. Sorting by Exchange
3. Sorting by Insertion
4. Sorting by Partitioning

Sorting by Selection
The selection sort algorithm starts by finding the minimum value in the array and moving it to the
first position. This step is then repeated for the second lowest value, then the third, and so on
until the array is sorted. This sort is known for its simplicity and is usually used for sorting only
small amounts of data.
Problem Statement
Given a randomly ordered set of n integers, sort them into non-decreasing order using the
selection sort.
Algorithm development
An important idea in sorting of data is to use a selection method to achieve the desired ordering.
In its simplest form at each stage in the ordering process, the next smallest value must be found
and placed in order.
Consider the unsorted array
What we are attempting to do is to develop a mechanism that converts the unsorted array to the
ordered configuration below;
Comparing the sorted and unsorted arrays, we see that one way to start off the sorting process
would be to perform the following two steps:
1. Find the smallest element in the array (here, the 3).
2. Place it in the first position, a[1].
By applying these steps we have certainly managed to get the 3 into position a[1], but in doing so we
have lost the 20 that was there originally. Also, we now have two 3's in the array, whereas we started
out initially with just one. To avoid these problems we will obviously have to proceed somewhat
differently. Before placing the 3 in position a[1], we will need to save the value that is already
there. A simple way to save it would be to store it in a variable.
The next question is, what can we do with the 20 once the 3 has been stored in position a[1]? We
need to put it back in the array, but the question is where. If we put it in position 2 we are not
much better off, because then we need to find a place for the 35. A careful examination of the
situation shown in figures Fig: 2.41 and 2.42 below reveals that we can in fact put it back in the
position where we found the smallest value. The 3 still in that position has already been relocated,
so no information is lost.
To achieve these two changes we need a mechanism that not only finds the minimum but also
remembers the location where the minimum is currently stored. That is, every time the
minimum is updated we must save its position:
min := a[1];
p := 1;
for j := 2 to n do
if a[j] < min then
begin
min := a[j];
p := j;
end
The 20 and the 3 can then be swapped using the following statements (note the 3 is saved in the
temporary variable min), which will need to be placed outside the loop for finding the minimum.
a[p] := a[1];
{puts 20 in array position p}
a[1] := min;
We now have a mechanism that "sorts" one element, but what we need is a mechanism that will
allow us to sort n elements. The same strategy can be applied to find and place the second
smallest element in its proper order. To find the second smallest element we will need to start
looking in the array at position 2, because if we started at position 1 we would again find the 3, which
would be of no help. The steps we need are therefore:
min := a[2];
p := 2;
for j := 3 to n do
Algorithm description
1. Establish the array a[1..n] of n elements to be sorted.
2. While there are still elements in the unsorted part of the array do
3. (a) find the minimum min and its location p in the unsorted part of the array a[i..n];
4. (b) exchange the minimum min in the unsorted part of the array with the first element a[i] of
the unsorted part.
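Putting the pieces together, a minimal Python sketch of the complete selection sort is shown below; it uses 0-based indices, and the sample data are illustrative.

def selection_sort(a):
    n = len(a)
    for i in range(n - 1):
        # find the minimum and its location p in the unsorted part a[i..n-1]
        p = i
        for j in range(i + 1, n):
            if a[j] < a[p]:
                p = j
        # exchange the minimum with the first element of the unsorted part
        a[i], a[p] = a[p], a[i]
    return a

print(selection_sort([20, 35, 18, 8, 14, 41, 3, 39]))   # prints [3, 8, 14, 18, 20, 35, 39, 41]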
The first time through the inner loop, (n - 1) comparisons are made; the second time, (n - 2); the third
time, (n - 3); and finally 1 comparison is made.
The number of comparisons is therefore always:
nc = (n - 1) + (n - 2) + (n - 3) + ... + 1 = n(n - 1)/2
For each element, the algorithm must check whether any of the values that follow it is smaller, which
requires up to (n - 1) comparisons. Thus it can easily be seen that the running time
complexity of selection sort is O(n^2).
But since it requires only O(n) swaps, i.e. a linear number of writes to memory, it is
optimal among sorting algorithms in that respect.
The worst case occurs if the array is already sorted in descending order. Nonetheless, the time
required by the selection sort algorithm is not very sensitive to the original order of the array to be
sorted: the comparison test in the inner loop is executed exactly the same number of times in every case.
Applications
Sorting small amounts of data; much more efficient methods are used for large data sets.
For further reading you may refer to the websites below.
www.dreamincode.net/forums/topic/10157-data-structures-in-c-tutorial/
Sorting by Exchange
With the data as it stands, there is very little order present. What we are always looking for in
sorting is a way of increasing the order in the array. We notice that the first two elements are "out
of order" in the sense that, no matter what the final sorted configuration is, 30 will need to appear
later than 12. If the 30 and 12 are interchanged, we will have increased the order in the data.
This leads to the configuration below:
After examining the new configuration we see that the order in the data can be increased further
by comparing and swapping the second and third elements. With this new change we get the
configuration:
The investigation we have made suggests that the order in the array can be increased using the
following steps:
For all adjacent pairs in the array do
if the current pair of elements is not in non-decreasing order then exchange the two
elements.
After applying this idea to all adjacent pairs in our current data set we get the configuration
below:
On studying the mechanism carefully, we see that it guarantees that the biggest element, 41, will
be forced into the last position in the array. In effect, the last element at this stage is "sorted". The
array, however, is still far from being sorted.
If we start at the beginning and apply the same mechanism again, we will be able to guarantee
that the second biggest value (i.e. 39) will be in the second last position in the array. In the
second pass through the array there is no need to involve the last element in the comparison of
adjacent pairs because it is already in its correct place. By the same reasoning, when a third
pass is made through the data, the last two elements are in their correct places and so they need
not be involved in the comparisons.
Our final consideration is to fill in the details of the inner loop to ensure that it only operates on the
"unsorted" part of the array. Adjacent pairs can be compared using a test of the form
if a[j] > a[j + 1] then "exchange pair"
After applying this idea to all adjacent pairs in our current data set we get the configuration
below:
There are still other refinements that we can make to this algorithm but we will leave these as
supplementary problems.
Algorithm description
1. Establish the array a[1..n] of n elements to be sorted.
2. While the array is still not sorted do
(a) for all adjacent pairs of elements in the unsorted part of the array do
(1.a) if the current pair is not in non-descending order then exchange the pair,
(1.b) note that an exchange has been made;
(b) reduce the unsorted part of the array by one element, since its largest element has now
moved to its correct position.
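The passes described above can be expressed as the following Python sketch; the flag records whether a pass made any exchange, so the process stops as soon as a pass leaves the array unchanged. The sample data are illustrative.

def exchange_sort(a):
    unsorted_length = len(a)
    swapped = True
    while swapped and unsorted_length > 1:
        swapped = False
        # compare adjacent pairs in the unsorted part of the array only
        for j in range(unsorted_length - 1):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]    # exchange the pair
                swapped = True
        unsorted_length -= 1                       # the largest unsorted element is now in place
    return a

print(exchange_sort([30, 12, 41, 5, 39, 16, 3, 20]))   # prints [3, 5, 12, 16, 20, 30, 39, 41]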
Sorting by Insertion
One way to choose the next item to be inserted into the ordered part would be to search the unordered part for its
smallest element, but such a mechanism is suggestive of a selection sort, where we selected the smallest element
in the unordered part and placed it on the end of the sorted part.
A simple, systematic alternative way to choose the next item to be inserted is to always pick the first element in
the unordered part (i.e. x in our example). We then need to appropriately insert x into the ordered part and, in the
process, extend the ordered section by one element. Diagrammatically, what we want to do is illustrated in
figures 2.50(a) and 2.50(b) below.
Initially the whole array is in an unordered state. For this starting condition x will need to be the
second element and a[1] the "ordered part".
The ordered part is extended by first inserting the second element, then the third element, and so
on. At this stage the outline for our insertion sort algorithm is
for i := 2 to n do
begin
(a) Choose next element for insertion (x := a[i])
(b) insert x into the ordered part of the array
end
To make room for x to be inserted, all elements greater than x need to be moved up one place
(the shaded area in figures 2.50(a) and 2.50(b)). Starting with j := i, the following steps will allow us to
move backwards through the array and accomplish the task.
While x < a [ j - 1 ] do
begin
a[j] := a[j - 1]
j := j - 1
end
As soon as we have established that x is greater than or equal to a[j-1], the loop will terminate and
we will know that x must be placed in position a[j] to satisfy the non-descending order
requirement. There is a problem of termination with this loop when x happens to be less than all
the elements a[1..i-1]. In this instance our loop will cause the subscript 0 (i.e. a[0]) to be referenced. We
must therefore protect against this problem. One way around the problem would be to include a
check on the subscript j used:
while ( j > 1 ) and ( x < a[j-1] ) do
This is going to mean a more costly test that will have to be executed very frequently. Another
approach we can take to terminate the loop correctly is to temporarily place x as a sentinel at
a[0] (or a[1]), which will force the loop to terminate. This is rather untidy programming. We
may therefore ask, is there any cleaner alternative? Our concern is always that we may
want to insert an element that is smaller than the element that at each stage occupies position
a[1]. If the minimum element were in position a[1], we would have no concern about the insertion
mechanism terminating.
This suggests that the easiest way to overcome our problem is to find the minimum and put it in
place before the insertion process begins. Once the minimum is in place, the first two elements must
be ordered and so we can start by inserting the third element. To test our design, Fig: 2.51 below
shows this mechanism applied to a specific example.
Algorithm description
1. Establish the array a[1..n] of n elements to be sorted.
2. Find the minimum and exchange it with the first element a[1] so that it can act as a sentinel.
3. For each element a[i], from i = 3 to n, do
4. (a) select x, the next element to be inserted (x := a[i]), and set j := i;
5. (b) while x is less than the preceding element a[j-1] do
6. (b.1) move a[j-1] up one position (a[j] := a[j-1]),
7. (b.2) move back one position (j := j-1);
8. (c) insert x in its correct place (a[j] := x).
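A Python sketch of the insertion sort designed above is shown below. Instead of the minimum-as-sentinel device, it simply adds the cheap guard j > 0 to the loop condition, which is the natural idiom in Python; the sample data are illustrative.

def insertion_sort(a):
    for i in range(1, len(a)):
        x = a[i]                      # the next element to be inserted
        j = i
        # move elements greater than x up one place to make room for x
        while j > 0 and x < a[j - 1]:
            a[j] = a[j - 1]
            j = j - 1
        a[j] = x                      # insert x into the ordered part
    return a

print(insertion_sort([30, 12, 41, 5, 39, 16, 3, 20]))   # prints [3, 5, 12, 16, 20, 30, 39, 41]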
Sorting by Partitioning
One way to try to exchange data over large distances is to compare and, if necessary, exchange the first
and last elements, then the second and second-last elements if they were out of order, and so on. Applying
this mechanism to the unsorted array listed below:
Having made this first pass through the array we must now ask, how to proceed next? The first
pass has provided us with a set of ordered pairs. One way to proceed might be to try to extend
these pairs into larger ordered segments.
Exploring this idea we find that it comes down to a merging operation and so loses the
characteristic of exchanging data over large distances. Moving in from both ends at first sight
seems like a good idea but simply exchanging pairs when they are out of order is not of much
help. What appears to be wrong with this strategy is that we have not reduced the size of the
problem to be solved, in any easily recognizable way.
We therefore have two alternatives: we can either change the way in which we process the
elements, or we can change the criterion used to exchange pairs of elements. If we decide to
pursue the latter course, we will need to find an alternative criterion for swapping pairs of
elements. What other criteria are there? This seems to be a hard problem to decide abstractly, so
let us refer back to our original data set to see if we can find any clues there.
Examining the data set carefully, we see that when the first and last elements are compared (i.e.
20 and 39) no exchange takes place.
However, 20 is the third largest element and so it should really be moved to the right-hand end of
the array as quickly as possible. There would be no point in exchanging 20 with 39 but it would
make a lot of sense to exchange 20 with the small value 3 in the seventh position in the array.
What does this approach suggest?
It has led us to the idea that it is probably a good strategy to move the bigger elements to the
right hand end as quickly as possible and at the same time to move the smaller elements to the
left hand end of the array.
Our problem is then to decide which elements are big and which elements are small in the general case. Reflecting for a moment, we realize that this problem does not have a general solution; we seem to be stuck! The only possibility left is to take a guess at the element that might allow us to distinguish between the big and the small elements. Ideally, after the first pass through the data, we would like to have all the big elements in the right half of the array and all the small elements in the left half of the array. This amounts to partitioning the array into two subsets.
For example:
We must now decide how to achieve this partitioning process. Following our earlier discussion,
the only possibility is to use an element from the array. This raises the question, is partitioning
the array compatible with exchanging data over large distances in the early stages of the sort?
To try to answer this question, as a test case we can choose 18 as the partitioning value, since it
is the fourth largest of the eight values. Carrying out this step the first thing we discover is that 20
should be in the partition for larger elements rather than in the partition for smaller elements.
That is,
If 20 is in the wrong partition, it implies that there must be a small element that is wrongly placed in the other partition. To make progress in partitioning the array, we need to place 20 in the partition for larger elements. The only satisfactory way to do this is to "make room" for it in the partition of larger elements by finding a "small" element in that partition and exchanging the pair. The 3 in position 7 is a candidate for this transfer of data.
If we move into the array from the right looking for small elements, and in from the left looking for large elements (where large and small are relative to 18), we will have a way of partitioning the array and, at the same time, exchanging data over large distances as we had originally set out to do. At this point we may use the partitioning method.
The basic steps in the partitioning algorithm are:
1.
Extend the two partitions inwards until a wrongly partitioned pair is encountered.
2.
Exchange the wrongly partitioned pair.
3.
Extend the two partitions inwards again until another wrongly partitioned pair is encountered.
4.
Repeat steps 2 and 3 until the two partitions meet or cross.
Applying this idea to the sample data set given above we get:
The partitioning method discussed above can take as many as n + 2 comparisons to partition n elements. This can be improved by replacing the loop test i < j with the test i < j - 1. For this new implementation, when termination occurs with i = j - 1, it is then necessary to perform an extra exchange outside the loop.
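A minimal Python sketch of this inward-scanning partition step (illustrative only; the routine name and the convention of passing the pivot value are my own choices) might be:

def partition(a, lower, upper, pivot):
    """Partition a[lower..upper] (inclusive) about the value pivot,
    which is assumed to occur somewhere in that segment."""
    i, j = lower, upper
    while i <= j:
        # extend the left partition inwards past the small elements
        while a[i] < pivot:
            i += 1
        # extend the right partition inwards past the large elements
        while a[j] > pivot:
            j -= 1
        if i <= j:
            # exchange the wrongly partitioned pair and move both scans on
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    # On return, a[lower..j] holds the smaller elements and a[i..upper]
    # the larger ones; the two scans have met or crossed.
    return i, j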
Unlike our earlier proposal, the smaller elements are now completely separated from the larger
elements. How can this result be exploited? We now have two partitions whose values can be
treated independently. That is, if we sort the first three values and then later sort the last five
values, the whole array will be completely ordered.
In sorting these two partitions, we want to again try to adhere to the idea of exchanging data over
large distances. Since the two partitions are independent, we can treat them as two smaller
problems that can be solved in a manner similar to the original problem.
Although for small data sets (like the one we have been considering) this looks clumsy, we can
imagine that for much larger sets it will provide an efficient way of transferring data over large
distances. What we must now establish is how this idea of partitioning and transferring data over large distances fits in with our goal of sorting the data. To investigate this relationship, we can again apply the
partitioning process to the two partitions derived from the original data set.
To make it easier to understand the mechanism, consider a choice of partitioning elements that allows us to divide the two partitions in half (that is, we can choose 8 for the left partition and 35 for the right partition). When the partitioning mechanism is applied to these partitions, with 8 and 35 as the partitioning elements, we get:
Notice from this example that an exchange should be made when we encounter elements greater than or equal to our partitioning value. We will come back to this later.
Examining the results of these latest partitioning steps, we see that there are now four partitions. We notice that these partitions are partially ordered (i.e. the elements in the first partition are smaller than the elements in the second partition, which are in turn smaller than the elements in the third partition, and so on).
If the partitions of size 2 and 3 are again trivially partitioned it will result in the array being sorted.
What we have learned from this consideration is that repeated partitioning gives us a way of moving data over long distances and, at the same time, when the process is taken to the limit, it leads to the complete data set being sorted.
We have therefore discovered that repeated partitioning can be used for sorting. Our task now is
to work out how to implement this mechanism. Before doing this we can summarize our progress
with the algorithm.
The basic mechanism we have evolved is:
while all partitions not reduced to size one do
Choose next partition to be processed.
Select a new partitioning value from the current partition.
Partition the current partition into two smaller partially ordered sets.
There are two major issues that must now be resolved. Firstly, we must find a suitable method for
selecting the partitioning values. Our earlier discussion led us to the conclusion that the best we
would be able to do was "guess" a suitable partitioning value. On average, if the data is random,
we may expect that this approach will give reasonable results. The simplest strategy for selecting
the partitioning value is to always use the first value in the partition.
A little thought reveals that this choice will be a poor one when the data is already sorted or when it is in reverse order. The best way to accommodate these cases is to use the value in the middle of the partition. This strategy will not alter the performance for random data, but it will result in much better performance for the more common special cases we have mentioned.
If lower and upper specify the array limits for a given partition then the index middle for the
middle element can be computed by averaging the two limits.
middle := (lower + upper) div 2
In our earlier example we saw how the original data was divided initially into two partitions that
needed to be processed further. Because only one partition at a time can be considered for
further partitioning, it will be necessary to save information on the bounds of the other partition so
that it can be partitioned later.
For example:
Further consideration of the repeated partitioning mechanism suggests that we will have to save
the bounds for a large number (rather than just one) of the partitions as the process extends to
smaller and smaller partitions. The easiest way to do this is to store the bounds in an array (which really functions as a stack). This raises the question of how much extra storage needs to be allocated for this array.
To answer this we will need to look at the worst possible partitioning situation. In the worst case,
we could end up with only one element in the smaller partition each time. This situation would
result in the need to store the bounds for n-1 partitions if we were sorting n elements. The
associated cost would be 2(n - 1) array locations.
To ensure that our algorithm will handle all cases correctly, it seems that we will need to include
the 2(n - 1) extra locations. This is a costly increase in storage and so we should see if there is any
way of reducing it.
To do this, we will need to take a closer look at the worst case situation where we have assumed
that the bounds for partitions of size 1 must be stored away each time. That is,
The problem arises here because we always (unknowingly) keep selecting the larger partition to
be processed next. If instead for this case we could control the situation so that the smaller
partition was processed next, then only two locations would be needed to save the larger
partition bounds for later processing.
This improvement comes about because "partitioning" the partition of size 1 leads to no extra
partitions being generated, before the larger partition is again processed. With this strategy, the
situation above is no longer the worst case. The worst case will now apply when the "biggest"
smaller partition is processed next at each stage. The biggest smaller partition is half the size of
the current partition.
If we start out with an array of size n to be sorted, it can be halved log2 n times before a partition of size 1 is reached. Therefore, if we adopt the strategy of always processing the smaller partition next, we will only need additional storage proportional to log2 n. This is much more acceptable even for very large n (that is, we will only need about twenty locations of partition storage to process an array of 2000). A simple test that determines where the partitions meet relative to the middle can decide which is the larger partition. That is,
if partitions meet to the left of the middle then
"Process left partition and save right's limits"
else
"Process right partition and save left's limits"
We can now again summarize the steps in our algorithm and incorporate the details of the partitioning-value selection and partition handling:
1. While all partitions are not reduced to size one, choose the next (smaller) partition to be processed.
2. Select the element in the middle of the partition as the partitioning value.
3. Partition the current partition into two smaller, partially ordered partitions.
4. Save the larger of the partitions from step (3) for later processing.
At this point in our development, the thing that is not very clear is how the algorithm is going to
terminate. In our earlier discussion we established that when the partitioning process is
repeatedly applied, we eventually end up with a single element to be "partitioned". In parallel with this process of reducing the partition size to 1, the larger of the two partitions is being "stacked" away for later processing. Figure 2.60 below illustrates the partitioning mechanism and the "stacking away" of the larger partition in each case.
When a partition of size 1 has been reached we have no alternative other than to start the
partitioning process on the partition that has most recently been stored away. To do this we must
remove its limits from the top of the stack. For our example, this will require that segment to be partitioned next.
Our task is to ensure that all partitions have been reduced to size 1. When all partitions have
been reduced to size 1 there will be no limits left on the stack for further processing. We can use
this condition to terminate the repeated partitioning mechanism. The mechanism can start with
the limits for the complete array on the stack. We then proceed with repeated partitioning until
the stack is empty. This can be signaled by maintaining a pointer to the "top-of-the-stack". When this array index is zero it will indicate that all partitions have been reduced to size one.
We have now explored the problem sufficiently to be able to provide a detailed description of the
algorithm. A detail of the partitioning mechanism follows:
Algorithm description
1. Establish the array a[1 .. n] to be sorted.
2. Place upper and lower limits for array on the stack and initialize pointer to top of stack.
3. While the stack is not empty do
(a) remove upper and lower limits of array segment from top of stack,
(b) while the current segment is not reduced to size one do
(b.1) select the element in the middle of the segment as the partitioning value,
(b.2) partition the current segment into two with respect to the current middle value,
(b.3) save the limits of the larger partition on the stack for later processing and continue with the smaller partition.
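Gathering these decisions together, a compact Python sketch of the complete mechanism (an illustrative reading of the description above, not the text's own code) could be:

def quick_sort(a):
    """Iterative quicksort using an explicit stack of partition limits.
    The smaller partition is always processed next, so the stack holds
    at most about log2(n) pairs of limits."""
    stack = [(0, len(a) - 1)]                 # limits for the complete array
    while stack:                              # repeat until the stack is empty
        lower, upper = stack.pop()
        while lower < upper:
            pivot = a[(lower + upper) // 2]   # middle element as partitioning value
            i, j = lower, upper
            while i <= j:                     # partition a[lower..upper] about pivot
                while a[i] < pivot:
                    i += 1
                while a[j] > pivot:
                    j -= 1
                if i <= j:
                    a[i], a[j] = a[j], a[i]
                    i += 1
                    j -= 1
            # Save the larger partition for later and carry on with the smaller.
            if (j - lower) < (upper - i):
                stack.append((i, upper))
                upper = j
            else:
                stack.append((lower, j))
                lower = i
    return a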
Performance Analysis
The quick sort algorithm is recognized as probably the most efficient of the internal sorting
algorithms. It gains its efficiency by keeping the number of data movement operations small.
Detailed analysis of the algorithm shows that the comparison step dominates.
In the average case, of the order of n log2 n comparisons are made for an array of n elements. The worst case, where the largest (or smallest) element is always selected as the partitioning value, requires of the order of n^2 comparisons.
The quick sort algorithm has a number of attractive attributes. It has the advantage over most other O(n log2 n) algorithms in that it only moves elements when it is absolutely necessary (that is, it can be said that it achieves its goal by applying the principle of least effort, a powerful and elegant strategy). This can be important where large amounts of data have to be moved around. In some cases this movement can be avoided by swapping pointers rather than the actual data.
Applications
Internal sorting of large data sets.
streams by repeatedly reading a block of input that fits in memory, a run, sorting it, then writing it
to the next stream. It merges runs from the two streams into an output stream. It then repeatedly
distributes the runs in the output stream to the two streams and merges them until there is a
single sorted output.
Problem Statement
Merge two sorted arrays of integers, both with their elements in ascending order into a single
ordered array.
Algorithm development
Merging two or more sets of data is a task that is frequently performed in computing. It is simpler
than sorting because it is possible to take advantage of the partial order in the data.
Examination of two ordered arrays should help us to discover the essentials of a suitable
merging procedure. Consider the two arrays:
A little thought reveals that the merged result should be as indicated below. The origins (array a
or b) are written above each element in the c array.
What can we learn from this example? The most obvious thing we see is that c is longer than
either a or b. In fact c must contain a number of elements corresponding to the sum of the
elements in a and b (i.e. n + m). Another thing that comes from this example is that to construct
c, it is necessary to examine all elements in a and b.
The example also indicates that to achieve the merge a mechanism is needed for appropriately
selecting elements from a and b. To maintain order in placing elements into c it will be necessary
to make comparisons in some way between the elements in a and the elements in b.
To see how this might be done let us consider the smallest merging problem (that of merging two
elements).
To merge the two one-element arrays all we need to do is select the smaller of the a and b
elements and place it in c. The larger element is then placed into c. Consider the example below:
The 8 is less than 15 and so it must go into c[1] first. The 15 is then placed in c[2] to give:
We should be able to make a similar start to our larger problem of merging m and n elements.
The comparison between a[1] and b[1] allows us to set c[1].
We then have:
After the 8 has been placed in c[1] we need a way of deciding which element must be placed
next in the c array. In our two-element merge the 15 was placed next in c. However, with our
larger example placing 15 in c[2] will leave the 11 out of order. It follows that to place the next
element in c we must compare the next element in b (i.e. 11) with the 15 and then place the smaller of these into c.
We can now start to see what is happening. In the general case the next element to be placed
into c is always going to be the smaller of the first elements in the parts of arrays a and b.
After the second comparison (i.e. a[1] < b[2]?) we have Fig. 2.66.(a), and by repeating the
process (i.e. comparing a[1] and b[3]) we obtain Fig. 2.66.(b). To keep track of the start of the
"yet-to-be-merged" parts of both the 'a' and 'b' arrays two index pointers i and j will be needed. At
the outset they must both have the value 1. As an element is selected from either 'a' or 'b' the
appropriate pointer must be incremented by 1.
This ensures that i and j are always kept pointing at the respective first elements of the yet-to-be-merged parts of both arrays. The only other pointer needed is one that keeps track of the number
of elements placed in the merged array to date. This pointer, denoted k, is simply incremented by
1 with each new element added to the c array.
If we follow the step-by-step merging process that we began above we eventually get to the
situation shown in Fig. 2.66.(c)
When we run out of elements to be merged in one of the arrays, we can no longer make comparisons between elements in the two arrays. We must therefore have a way of detecting when one of the arrays runs out. Which array runs out first depends on the particular data set.
One approach we can take with this problem is to include tests to detect when either array runs
out. As soon as this phase of the merge is completed another mechanism takes over which
copies the yet-to-be-merged elements into the c array. An overall structure we could use is:
1. while (i <= m) and (j <= n) do
a. compare a[i] and b[j], then merge the smaller member of the pair into the c array.
2. copy the yet-to-be-merged elements of whichever array has not run out into the c array.
This observation gives us the clue that we need to set up a simpler mechanism. If we ensure that
the largest element in the two arrays is present on the ends of both arrays then the last two
elements to be merged must be the last element in the a array and the last element in the b array.
With this situation guaranteed we no longer have to worry about which array runs out first. We
simply continue the merging process until (n + m) elements have been inserted in the c array.
The central part of this implementation could be:
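The figure containing this loop is not reproduced here, but one way to realize the idea in Python (an illustrative sketch in which an explicit sentinel value stands in for "the largest element at the end of both arrays") is:

def merge_with_sentinels(a, b):
    """Merge two ascending lists with a single loop.
    A sentinel larger than every real element is appended to both lists,
    so neither index can run out before all n + m elements are placed."""
    x = a + [float("inf")]
    y = b + [float("inf")]
    c = []
    i = j = 0
    for _ in range(len(a) + len(b)):
        if x[i] <= y[j]:
            c.append(x[i])
            i += 1
        else:
            c.append(y[j])
            j += 1
    return c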
This latest proposal, although a clean and simple implementation, can in certain circumstances do considerably more comparisons than are necessary.
In the algorithms (Fig. 2.67), have we made the best use of the information available? Since we
have access to the last elements of both arrays, it is easy to determine in advance which array
will be completed first in the merge and which array will need to be copied after the merge. With
these facts established we should be able to reduce the tests in the while-loop from two to one
and at the same time cut down on the number of comparisons. A comparison between a[m] and
b[n] will establish which array finishes merging first.
The central part of this implementation could be:
procedure merge
1. Establish the arrays a[1 .. m] and b[1 .. n].
2. If last a element less than or equal to last b element then
(a) merge all of a with b,
(b) copy rest of b,
else
(a') merge all of b with a,
(b') copy rest of a.
procedure merge-copy
1. Establish the arrays a[1 .. m] and b[1 .. n].
2. If last element of a less than or equal to first element of b then
(a) copy all of a into first m elements of c,
(b) copy all of b into c starting at m + 1,
else
(a') merge all of a with b into c,
(b') copy rest of b into c starting at position just past where merge finished.
procedure shortmerge
1. Establish the arrays a[1 .. m] and b[1 .. n] with a[m] <= b[n].
2. While all of the a array still not merged do
(a) if current a element less than or equal to current b element then
merge the current a element into c and advance the a index,
else
(a') merge the current b element into c and advance the b index.
3. Copy the rest of b into c.
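A Python sketch following the shape of these procedures (illustrative only; the names and details are my own) might be:

def short_merge(a, b):
    """Merge ascending lists a and b into a new list, assuming a[-1] <= b[-1],
    so that a is guaranteed to run out first and only one loop test is needed."""
    c = []
    i = j = 0
    while i < len(a):                  # all of a still not merged
        if a[i] <= b[j]:
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
    c.extend(b[j:])                    # copy the rest of b
    return c

def merge(a, b):
    """Decide which array will finish first, then delegate to short_merge."""
    if not a or not b:
        return list(a) + list(b)
    if a[-1] <= b[0]:                  # ranges do not overlap: just copy (merge-copy case)
        return list(a) + list(b)
    if a[-1] <= b[-1]:
        return short_merge(a, b)       # merge all of a with b, copy rest of b
    return short_merge(b, a)           # merge all of b with a, copy rest of a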
Performance Analysis
For two input arrays of sizes m and n, the number of comparisons varies from 2 to (n + m + 1) to
complete the task. When there is no overlap of the ranges of the two arrays the minimum
situation applies.
In considering the behavior of this algorithm we will focus on the procedure short-merge. On
completion of the ith step of the while-loop, the first (i - 1) elements of the a array will have been merged with the first (j - 1) elements of the b array.
The number of merged elements after the ith step will be i + j - 2. The first k - 1 elements of c after the ith step will be in non-descending order.
The loop terminates because with each iteration either i or j is incremented, and since it is guaranteed that there exists a j such that a[m] <= b[j] before short-merge is called, a condition will eventually be established in which i is incremented beyond m.
Applications
Sorting, tape storing and data processing
1. Linear Search
2. Binary Search
3. Hash Algorithm
Linear Search
Given a collection of objects, the goal of search is to find a particular object in this collection or to
return that the object does not exist in the collection. Often the objects have key values on which
one tries to search and data values which correspond to the information one wishes to retrieve
once an object is found. For example, a telephone book is a collection of names (on which one
searches) and telephone numbers (which correspond to the data being sought). Here we shall
consider only searching for key values (e.g., names) with the understanding that in reality, one
often wants the data associated with these key values. The collection of objects is often stored in
a list or an array.
Given a collection of n objects in an array A[1 .. n], the ith element A[i] corresponds to the key
value of the ith object in the collection. Often, the objects are sorted by key value (e.g., a phone
book), but this need not be the case.
Different search algorithms are required depending on whether or not the data is sorted. The input to a search algorithm is an array A of objects, n the number of objects, and x the key value being sought. In what follows, we describe these search algorithms.
Suppose that the given array was not necessarily sorted. This might correspond, for example, to
collection of exams which have not yet been sorted alphabetically. If a student wanted to obtain
her exam score, how could she do so? She would have to search through the entire collection of
exams, one-by-one, until her exam was found. This corresponds to the unordered linear search
algorithm.
Unordered-Linear-Search [A, n, x]
1. for i = 1 to n do
2.   if A[i] = x then
3.     return i
4.   else i = i + 1
5. return "x not found"
Note that in order to determine that an object does not exist in the collection, one needs to search through the entire collection.
Now consider the following array:
Consider executing the pseudo code Unordered-Linear-Search [A, 8, 54]. The variable i would
initially be set to 1, and since A[1] (i.e., 34) is not equal to x (i.e., 54), i would be incremented by 1
according to point 4 above. Since A[2] (i.e. 16) is not equal to x, i would again be incremented, and this would continue until i = 5. At this point, A[5] = 54 = x, so the loop would terminate and 5 would be
returned according to point 3 above.
Now consider executing the pseudo code Unordered-Linear-Search [A, 8, 53]. Since 53 does not
exist in the array, i would continue to be incremented until it exceeded the bounds of the for loop,
at which point the for loop would terminate. In this case, "x not found" would be returned according to point 5 above.
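In Python the same unordered linear search could be written as follows (an illustrative version; the helper name is mine and positions are reported 1-based to match the discussion):

def unordered_linear_search(A, x):
    """Return the 1-based position of x in A, or None if x is absent."""
    for i, value in enumerate(A, start=1):
        if value == x:
            return i
    return None                       # x not found

# Hypothetical eight-element array consistent with the walkthrough above
# (only A[1] = 34, A[2] = 16 and A[5] = 54 are given in the text):
# unordered_linear_search([34, 16, 7, 29, 54, 61, 3, 88], 54)  # -> 5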
Performance Analysis
The number of comparisons depends on where the record with the argument key appears in the table. If it appears in the first position, one comparison is needed; if it appears in the last position, n comparisons are needed. On average, (n + 1) / 2 comparisons are made. For an unsuccessful search, n comparisons are required.
Therefore, the number of comparisons in any case is O(n).
We can establish the relevant half by comparing the value sought with the value in the middle of the set. This single test will eliminate half of the values in the set from further consideration. We now have a problem only half the size of the original problem. Suppose it is established that the value we are seeking is in the second half of the list (e.g. somewhere between the (n/2)th value and the nth value).
Once again, it is either in the first half or the second half of this reduced set.
By examining first the (n/2)th value and then the (3n/4)th value we are able to eliminate from further consideration essentially three quarters of the values in the original data set in just two comparisons.
We can continue to apply this idea of reducing our problem by half at each comparison until we
encounter the value we are seeking or until we establish that it is not present in the set.
With the telephone directory problem, the latter task is just as easily accomplished. This is
commonly known as the divide-and-conquer strategy. The corresponding searching method we
are starting to develop is known as the binary search algorithm.
At this stage we have the general strategy:
repeatedly
"examine middle value of remaining data and on the basis of this comparison eliminate half of
the remaining data set"
until (the value is found or it is established that it is not present).
Let us now consider a specific example in order to find the details of the algorithm needed for
implementation.
To calculate the middle index this time we can subtract 8 from 15 (i.e. the amount discarded) to give 7, the number of values remaining. We can then divide this by 2 to give 3, which can be added to 9 to give 12. Studying the upper and lower index values, we see that if we add them and divide by 2 we end up with 12. Tests on several other values confirm that this method of calculating the middle works in the general case,
e.g. middle := (lower + upper) div 2
When a[12] is examined it is established that 44 is less than a[12]. It follows that 44 (if it is
present) must be in the range a[9]...a[11]. Accordingly the upper limit becomes one less than
the middle value, i.e.
upper := middle - 1
We then have:
From this we see that with each comparison either the lower limit is increased or the upper limit
is decreased. With the next comparison we find the value we are seeking and its position in the
array. That is, in just 3 comparisons we have located the value we wanted. It can be seen that an
additional comparison would have been required if the value we had been seeking were either
42 or 45.
Our algorithm must handle the situation where the value we are seeking is not present in the
array. When the value we are seeking is present, the algorithm terminates when the current
middle value is equal to the value sought. Clearly this test can never be true when the value
sought is not present. Some other condition must therefore be included to guarantee that the
algorithm will terminate.
To investigate termination when the value is not present, consider what happens when we are
searching for 43 rather than 44. The procedure progresses as before until we get the
configuration below:
The next comparison of 43 with a[middle] = a[9] tells us that the value we are seeking is above the middle value. We therefore get
lower := middle + 1 = 10
We then have lower = 10 and upper = 9. That is, lower and upper have crossed over. Another
investigation in searching for 45 indicates that lower once again becomes greater than upper.
When the value we are seeking is present, lower can become greater than upper only after the value sought has been found. Since all unsuccessful searches eventually pass through the stage
when upper = lower = middle (because of the way in which middle is calculated) we can use the
condition,
lower > upper
in conjunction with the equality test of the array value with the value sought, to terminate our algorithm. Before leaving this discussion, we should be sure that the algorithm terminates when the value we are seeking is less than the first value a[1] or greater than the last value a[n].
A further check for the special case when there is only one element in the array confirms that the
algorithm functions correctly and also terminates as required.
At this stage we have the following algorithm.
1. Establish ordered array size n and the value sought x.
2. Establish the ordered data set a [1....n].
3. Set upper and lower limits.
4. Repeatedly
(a). Calculate middle position of remaining array.
(b). if value sought is greater than middle value then
Adjust lower limit to one greater than middle,
else
Adjust upper limit to one less than middle
until value sought is found or lower becomes greater than upper.
5. Set found accordingly.
The termination condition for the algorithm above is somewhat clumsy. Furthermore, it is difficult to prove the algorithm correct. It is therefore useful to investigate whether or not there is a more straightforward implementation. In any particular search, the test of whether the current middle value (i.e. a[middle]) is equal to the value sought usually succeeds only within one step of where the algorithm would otherwise terminate (see the binary decision tree for an explanation). This means that if the algorithm can be formulated to terminate on a single condition it will lead to a more elegant and also a more efficient solution.
Careful examination of our original algorithm indicates that we cannot simply remove the test a[middle] = x, as this leads to instances where the algorithm will not terminate correctly. The problem arises because it is possible to move past a successful match. To obtain a better solution to the problem, we must therefore maintain conditions that prevent this bypass. A way to do this is to ensure that upper and lower close in on the target position in such a way that they do not cross over or move past a matching element.
In other words, if x is present in the array, we want the following condition to hold after each
iteration:
a[lower] <= x <= a[upper]
It follows that lower and upper will need to be changed in such a way so as to guarantee this
condition if x is present. If we can do this, we should be able to find a suitable termination
condition involving just lower and upper. Our starting configuration will be:
Beginning with:
middle : = (lower + upper) div 2
We can make the following conditional test in an effort to bring upper and lower closer together:
x > a[middle]
If this condition is true then x must be in the range a[middle + 1 .. upper] if it is present in the
array. It follows that with this condition true, we can make the assignment:
lower : = middle + 1
On the other hand, if this condition is not true (i.e. x <= a[middle]), x must be in the range a[lower..middle] if it is present in the array. The variable upper can therefore be reassigned as:
upper := middle
(Note that the assignment upper := middle - 1 is not made because the test x > a[middle] is not strong enough to discount the possibility that a[middle] = x.) We are now left with the question of deciding how this mechanism can be terminated.
The way in which lower and upper are changed means that, if the element x is present, either one of them could reach an array element equal to x first. If upper descends to an array value
equal to x first, then lower must increase until this element is reached from below. The
complementary situation will apply if lower reaches an array element equal to x first. These two
situations suggest that the termination condition:
lower = upper is probably the most appropriate if x is present.
The fact that lower is set to (middle + 1) rather than just middle guarantees that lower will be
increased after each pass through the loop where lower is reassigned. The guarantee that upper is decreased each time through the loop in which it is reassigned is more subtle, since it is always assigned the current middle value. The truncation caused by the integer division, i.e.
middle := (lower + upper) div 2
ensures that the middle value is always less than the current upper value. For example,
middle = (2 + 4) div 2 = 3 < 4
middle = (2 + 3) div 2 = 2 < 3
Because middle is decreased in this way, it follows that upper will also be decreased whenever it
is reassigned.
At this point we should also check the special cases when x is not present to ensure that the
algorithm always terminates correctly. The most important special cases are:
There is only one element in the array.
x is less than the first element in the array.
x is greater than the last element in the array.
x is in the range a[1..n] but is absent from the array.
A check of these cases reveals that the algorithm terminates correctly when the element x is not
present in the array. We can now, therefore, give a detailed description of our algorithm.
Algorithm description
1. Establish the array a[1 .. n] and the value x being sought.
2. Assign the upper and lower variables to the array limits.
3. While lower < upper do
(a) Compute the middle position of remaining array segment to be searched,
(b) if the value sought is greater than current middle value then
(b.1) adjust lower limit accordingly.
else
(b'.1) adjust upper limit accordingly.
4. If the array element at lower position is equal to the value sought then
(a) return found
else
(a') return not found
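A direct Python rendering of this final algorithm (illustrative; it uses 0-based indexing rather than the 1-based indexing of the description) is:

def binary_search(a, x):
    """Search the ordered list a for x; return its index or None if absent."""
    if not a:
        return None
    lower, upper = 0, len(a) - 1
    while lower < upper:
        middle = (lower + upper) // 2     # truncating integer division
        if x > a[middle]:
            lower = middle + 1            # x, if present, lies above the middle
        else:
            upper = middle                # x may still be at the middle position
    return lower if a[lower] == x else None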
Performance analysis of Binary search
The binary search algorithm in general offers a much more efficient alternative than the linear search algorithm. Its performance can best be understood in terms of the binary search tree. For an array of 15 elements the tree has the form shown below.
For this tree it can be seen that at most 4 comparisons are needed. In general no more than log2 n + 1 comparisons are required. This means that even for an array of one million entries only about twenty comparisons would be required. By comparison, a linear search requires on average n/2 comparisons.
The time complexity of binary search is therefore logarithmic, i.e. O(log n).
Suppose we were to focus on finding 44 (in the more usual context it would be some information
associated with 44 that we would be seeking).
Our earlier experience with the binary search illustrated how we could take advantage of the
order in an array of size n to locate an item in about log2 n steps. Exploitation of order just does
not seem to be a strong enough criterion to locate items much faster than this. The almost
unbelievable aspect of the faster method is the statement that "it usually only has to look at one,
two or three items before terminating successfully".
To try to make progress, let us focus on the extreme case where it only examines one item
before terminating. When first confronted with this idea it seems to be impossible-how could we
possibly find an item in one step?
To find 44 in our example we would have to somehow go "magically" to location 10 first. There does not seem to be any characteristic that 44 possesses that would indicate that it was stored in location 10. All we know about 44 is its magnitude.
We seem to have come to a dead end! How could 44's magnitude be of any help in locating its position in the array in just one step? Reflecting on this for a while, we come to the conclusion that only if the number 44 were stored in array location 44 would we be able to find it in one step.
Although the idea of storing each number in the array location dictated by its magnitude would
allow us to locate it or detect its absence in one step, it does not seem practical for general use.
For example, suppose we wanted to store and rapidly search a set of 15 telephone numbers
each of seven digits, e.g. 4971667. With the present scheme, this number would have to be
stored at array location 4971667. Allocating an array with such an enormous number of locations just to store and search for 15 numbers does not seem worthwhile. Before abandoning the idea of looking
for a better method let us take one last look at the "progress" we have made.
To make things easier let us return to our earlier example. We have a set of 15 numbers between 10 and 60. We could store and search for these elements in an array of size 60 and achieve one-step retrieval. The problem with this is that in the general case too much space would have to be used. This leads to the question: can we apply the same retrieval principles using a smaller storage space?
One response to this last question would be that we might proceed by "normalizing" the data. This would amount to applying a transformation to each number before we search for it (i.e. 60 could be transformed so that it becomes 15, 20 becomes 5, and so on, by dividing each value by 4 and rounding to the nearest integer).
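As a quick check of that transformation in Python (purely illustrative):

def normalize(key):
    """Map a key in the range 10..60 onto an index near 1..15
    by dividing by 4 and rounding, as described above."""
    return round(key / 4)

# normalize(60) -> 15, normalize(20) -> 5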
When we apply this normalizing transformation to our sample data set and round it off to the
nearest integer we find that some values share the same index in the range 1 to 15 while other
indices in this range are not used. Therefore, we see that by reducing the array size from the
magnitude range (60) to the occupancy range (15) we have introduced the possibility of multiple
occupancy.
Clearly, the less we reduce the range, the smaller will be the risk of multiple occupancy, or collisions as they are usually called.
This proposal that we have made seems to solve a part of our problem.
The normalized set of values would be:
It is easy to imagine situations where this normalizing scheme will introduce even more severe
collision situations (i.e. suppose the largest value was much bigger than all other values in the
set). These observations suggest that our normalizing approach is probably not the best one to
take.
Our most recent observations leave us with the question, is there any alternative pre-search
transformation that we can apply to our original data? What we desire of such a transformation is
that for our particular example it produces values in the range 1->15 and that it is not badly
affected by irregularities in the distribution of the original data. One such alternative
transformation would be to compute the values of the original set modulo 15 and add 1 (we will
assume that the numbers to be stored are all positive). Applying this transformation (which is
usually referred to as hashing) to our original data set we get:
This result is a little disappointing as there are still many collisions. We are, however, better off with this method because it can accommodate irregularities in the distribution better. We made
an observation earlier that the more we try to "compress" a data set the more likely it is that there
will be collisions.
To test this observation let us see if we can reduce the number of collisions by choosing to use
an array with say 19 locations rather than 15 (i.e. we will calculate our numbers modulo 19 rather
than 15 and drop the plus one). In most circumstances we would be prepared to concede the
use of 20% extra storage if it would give us very fast retrieval with very few collisions.
Once again there are a number of collisions but this time there are no multiple collisions.
Studying this table we see that 11 of the 15 values have been stored such that they can be
retrieved in a single step. We are therefore left with the question of what to do about the collisions.
If we can find a method of storing the values that collided so that they too can be retrieved
quickly then we should end up with an algorithm that has high average retrieval efficiency.
To study this problem let us return to our example. The situation as it stands is that there are a
number of "free" locations in the array which could be used to store the four elements that made
collisions. We may therefore ask the question for each individual element where is the best place
to store this element?
For example, we can ask where is the best place to store the 31 which collided with the element
in location 12. Studying this particular case we see that if 31 is placed in location 13 then we
would be able to locate it in two steps. The way we could do this would be to compute:
31 mod 19 -> location 12
We would then examine the element in location 12 and find that it is not 31. Having made this
finding we could move to the next position in the array (i.e. position 13) and find 31. We must
now ask can we generalize this idea?
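In Python terms the two probes for 31 would look roughly like this (a hypothetical 19-slot table is assumed, with 31 displaced to slot 13 by the earlier collision):

position = 31 % 19                  # -> 12, the home position for 31
# table[12] holds a different, colliding key, so probe forward one slot:
position = (position + 1) % 19      # -> 13, where 31 is found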
The elements that made collisions are underlined. The tail of each arrow indicates where an
element that collided should have been stored. Before proceeding further let us try to assess the
average number of steps to retrieve an element from this table assuming all elements are likely
to be retrieved with equal probability.
We have:
1.
2.
3.
Therefore, for this configuration an element will be located in 1 to 4 steps on average. This
seems very promising, particularly if such performance extends to the general case. Rather than
pursue this verification here, we will assume it to be true and continue with the development of a
hash searching algorithm that will match our proposed storage scheme.
To make things simpler, we will assume that the goal is to construct a hash searching function
that simply returns the position of the search key (a number) in the hash table (As we remarked
earlier, in practice we would not be interested just in the position of a key but rather in the
information associated with the key i.e. we might need to locate a person's address by doing a
hash search using his/her name).
Our discussion so far suggests that the steps in locating an item in the table should be relatively
straightforward. Our method for resolving collisions requires that we perform the following basic
steps:
1. Derive a hash value modulo the table size from the search key.
2. If the key is not at the hash value index in the array then
(a) perform a forward linear search from the current position in the array, modulo the array size.
For a search item key and a table size n we can accomplish step (1) using:
position : = key mod n
We can then use a test of the form:
if table[position] <> key then ...
to decide whether key is located at the index position in the array table.
To complete the development of the algorithm all that remains is to work out the details of the
search. In constructing the forward search all we need to do is examine successive positions in
the table taking note that the index will need to "wrap-around" when it gets to the end of the array
(see following fig: 2.84).
We can use the mod function to accomplish the wrap-around, that is,
position := (position + 1) mod n
An empty location can then be detected with a test of the form:
if table[position] = empty then ...
We might now propose the following search:
While key not found and current position not empty do
(a) Move to next location modulo table size.
This seems to satisfy our requirements. However, when we study the termination conditions
carefully we find that we have overlooked the possibility that the table could be full and the key
may not be present. In that case we have a potentially infinite loop. How can we avoid this?
Referring back to Fig. 2.84, we find that as soon as we arrive back at the position at which we
started we will know that the key is not present. An extra test seems to be needed.
We now have established three ways in which the search can terminate:
1. Successfully on finding the key.
2. Unsuccessfully at an empty location.
3. Unsuccessfully with a full table.
The need to test these three conditions as each new position is examined seems rather clumsy.
We might therefore ask, is there a way to reduce the number of tests? Our earlier use of
conditions for terminating a search may be relevant.
Following this line of reasoning we see that the first and third tests can be combined. We can
force the search to be successful after examining all elements in the table. The way we can do
this is by temporarily storing the key in the table at the original hashed position after finding that
the value there does not match the search key.
The original value can then be restored after the search is complete. A Boolean variable named
active which is set to false on either finding the key or on finding an empty location can be used
to terminate the search loop. A separate Boolean variable can be used to distinguish between
the termination conditions. This variable should not be set when the key is found only after a
complete search of the table.
Our development of the algorithm is now complete.
Algorithm description
1. Establish the hash table to be searched, the key sought, the empty condition
value and the table size.
2. Compute hash index for key modulo the table size.
3. Set Boolean variables to terminate search.
4. If key located at index position then
(a) set conditions for termination
else
(a') set criteria to handle full table condition.
5. While no termination condition satisfied do
(a) compute next index modulo table size.
(b) if key found at current position then
(b.1) set termination conditions and indicate found if valid
else
(b'.1) if table position empty then signal termination.
6. Remove sentinels and restore table.
7. Return result of search.
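A Python sketch of the search (illustrative only: it uses an explicit EMPTY marker and a simple probe counter in place of the sentinel refinement described above) could be:

EMPTY = None                            # marker for an unoccupied table slot (assumed)

def hash_search(table, key):
    """Search an open-addressed table (linear probing) for key.
    Returns the index at which key is stored, or None if it is absent."""
    n = len(table)
    position = key % n                  # hash index for key, modulo the table size
    for _ in range(n):                  # at most n probes: the table may be full
        if table[position] == key:
            return position             # successful search
        if table[position] is EMPTY:
            return None                 # an empty slot: key cannot be present
        position = (position + 1) % n   # next location, wrapping around the end
    return None                         # full table examined without finding key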
Performance Analysis
The performance of the hash searching algorithm can be characterized by the number of items in the table that need to be examined before termination. This performance is a function of the fraction of occupied locations (called the load factor α). It can be shown, after making certain statistical assumptions, that on average (1/2)[1 + 1/(1 - α)] locations will need to be examined in a successful search (e.g. for a table that is 80% full this will be fewer than three locations irrespective of table size). The cost of making an unsuccessful search is more expensive.
On average (1/2)[1 + 1/(1 - α)^2] locations must be examined before encountering an empty location
(e.g. for a table that is 80% full, this will amount to 13 locations).
Applications
Used for fast retrieval of data from both small and large tables.