Data Structures
and
Algorithms
in
Ada
by
Leon E. Winslow
Warning: Users may print one copy for personal use. Written permission is required in advance
for any other use. Without written permission, no other use is authorized.
3. Stacks
3.1 Stacks . . . . . . . . . . . . . . 2
3.1.1 Representations . . . . . . . . . . 2
3.1.1.1 Array Representation . . . . . . . 2
3.1.1.2 Linked Representation . . . . . . 5
3.1.2 Timing . . . . . . . . . . . . 7
3.1.3 Comparison of Array and Linked Representations . 7
3.2 Some Stack Examples . . . . . . . . . 9
3.3 Postfix Expressions . . . . . . . . . . 11
3.3.1 A Postfix Interpreter. . . . . . . . . 12
3.3.2 Translating Infix to Postfix . . . . . . . 14
3.4 Generic Implementations . . . . . . . . . 17
3.5 Exceptions . . . . . . . . . . . . 25
3.6 Iteration . . . . . . . . . . . . . 29
4. Queues and Pipes
4.1 Queues . . . . . . . . . . . . . 2
4.1.1 Representations . . . . . . . . . . 2
4.1.1.1 Array Representation . . . . . . . 2
4.1.1.2 Linked Representation . . . . . . 6
4.1.2 Timing . . . . . . . . . . . 8
4.1.3 Comparison of Array and Linked Representations . 8
4.2 Examples Using Queues . . . . . . . . . 10
4.3 Multiple Queues . . . . . . . . . . . 12
4.4 Implementing the Data Type: Queue . . . . . 18
4.4.1 The Package Specification . . . . . . . 20
4.4.2 Declaring Queues in a User Program . . . . 22
4.4.3 The Package Body . . . . . . . . . 24
4.4.4 Initialization. . . . . . . . . . 29
4.5 Pipes . . . . . . . . . . . . . . 35
4.5.1 Representations . . . . . . . . . . 37
4.5.2 Pipes and Filters . . . . . . . . . 41
4.5.3 An Ada Pipe Package . . . . . . . . 47
7. Trees
7.1 Trees . . . . . . . . . . . . . . 2
7.2 Binary Trees . . . . . . . . . . . . 8
7.2.1 Binary Search Trees . . . . . . . . 8
7.2.2 Binary Tree Operations . . . . . . . . 13
7.2.3 Representations . . . . . . . . . . 19
7.2.3.1 Linked Representation. . . . . . . 19
7.2.3.2 Array Representation . . . . . . . 25
7.2.3.3 Timing . . . . . . . . . . . 26
7.3 Using Trees . . . . . . . . . . . . 32
7.4 Ada Implementation . . . . . . . . . . 37
7.5 Threaded Trees . . . . . . . . . . . 43
7.5.1 Representations . . . . . . . . . . 45
7.5.1.1 Array Representation . . . . . . . 45
7.5.1.2 Linked Representation. . . . . . . 48
7.5.2 Timing and Comparisons . . . . . . . 49
7.6 N-Way Trees . . . . . . . . . . . . 51
7.6.1 N-Way Tree Operations. . . . . . . . 52
7.6.2 Representations . . . . . . . . . . 55
7.6.3 Timing . . . . . . . . . . . . 55
8. Graphs
8.1 Graphs . . . . . . . . . . . . . 2
8.2 Computer Representations . . . . . . . . 7
8.2.1 Adjacency Matrix Representation . . . . . 7
8.2.2 Adjacency List Representation . . . . . . 10
8.2.3 Timing . . . . . . . . . . . . 10
8.2.4 Ada Implementation . . . . . . . . 12
8.3 Paths . . . . . . . . . . . . . . 19
8.3.1 Breadth First Traversals . . . . . . . 19
8.3.2 Depth First Traversals . . . . . . . . 30
8.4 Weighted Graphs . . . . . . . . . . . 36
8.5 Topological Sort . . . . . . . . . . . 44
8.6 Complexity Theory . . . . . . . . . . 48
9. Searching
9.1 Overview . . . . . . . . . . . . . 2
9.2 Sequential Searching . . . . . . . . . 4
9.2.1 Optimizing Sequential Search Algorithms . . . 4
9.2.2 Search Time . . . . . . . . . . 6
9.2.3 Data Reorganization . . . . . . . . 8
9.2.4 Ada Implementation . . . . . . . . 11
9.3 Binary Searches . . . . . . . . . . . 14
9.3.1 Comparison of Sequential and Binary Searches . 16
9.3.2 Ada Implementation . . . . . . . . 18
9.4 Trees . . . . . . . . . . . . . . 21
9.4.1 Balanced Trees . . . . . . . . . . 21
9.4.2 Optimal Trees . . . . . . . . . . 24
9.4.3 Bottom Up Trees . . . . . . . . . 30
9.4.3.1 The 2-3 Tree . . . . . . . . 30
9.4.3.2 B Trees . . . . . . . . . 33
9.4.3.3 B+ Trees . . . . . . . . . 35
9.5 Tries . . . . . . . . . . . . . . 37
9.6 Hashing . . . . . . . . . . . . . 39
9.6.1 Hashing Functions . . . . . . . . . 42
9.6.2 Ada Implementation. . . . . . . . . 45
9.7 Perfect Hashing . . . . . . . . . . . 47
9.8 Multiple Key Searches. . . . . . . . . . 49
10. Sorting
10.1 Exchange Sorts . . . . . . . . . . . 2
10.1.1 Selection Sort . . . . . . . . . . 2
10.1.2 Insertion Sort . . . . . . . . . . 4
10.1.3 Bubble Sort . . . . . . . . . . 7
10.1.4 Quicksort . . . . . . . . . . . 9
10.2 Tree Methods . . . . . . . . . . . 15
10.2.1 Heap Sort . . . . . . . . . . . 15
10.2.2 Tournament Sort . . . . . . . . . 19
10.3 Merge Sort . . . . . . . . . . . . 22
10.4 Radix Sort . . . . . . . . . . . . 25
10.4.1 Least Significant Digit . . . . . . . . 26
10.4.2 Most Significant Digit . . . . . . . 27
10.5 Ada Sort Routines . . . . . . . . . . 29
10.6 Theory . . . . . . . . . . . . 31
10.7 Comparing the Sorting Methods . . . . . . . 32
A REVIEW
of
ADA STRUCTURES
Sooner or later every data structure must be translated into working code. Ada has only three
basic components for implementing a data structure: an array, a record, and an access variable,
so every data structure must eventually be implemented in terms of these three components.
This chapter is a quick review of these three components and their use in constructing elementary
data structures. It also discusses some of the difficulties in translating a data structure into
efficient, robust, general Ada code.
Ada Review
2.1.1. Records
A record is a way of grouping related data, generally a set of attributes or characteristics of a
particular object. For example, a person's name, age, and ID code might be grouped into a
record of the form:
type Person_Record is
record
Name : String( 1..20 );
Age : Integer range 1..99;
ID : Integer range 10000..99999;
end record;
One way to store this record in memory is as a two-column table:

   Component     Component
   Name          Value
   ---------     ---------
   Age           21
   Name          J Doe
   ID            12345
with one column for the component name and the second column for the component value. The
component names need not be in order. To determine the value of a particular component it is
necessary to search the first column for the component name and then retrieve the associated
value from the second column.
A second way is to store the record in a long string where each component name precedes the
component value. The component names do not even have to be in order; for example:
[Age:21;Name:J Doe;ID:12345]
is the same record as above with the components separated by semicolons and with each compo-
nent name separated from the corresponding component value by a colon. If one always stores
the components in a fixed order, the component names are not necessary. If one agrees to
always store the components in the order Name, Age, and then ID, separated by a semicolon,
then the above record can be stored in the form:
[J Doe;21;12345].
All of these methods work. Method 1 will be used later to implement other data structures
and Method 2 is used in, for example, Lisp. Both methods, however, are very slow and require
tedious searching to retrieve or alter a component value.
Ada normally stores record components sequentially in memory. Assuming a computer
memory with one address for each byte, the Person_Record above might be stored as follows:
Address Value
1,000 first letter of name,
1,001 second letter of name,
.... ....
1,019 last letter of name,
1,020 Age, and
1,024 ID code.
This memory layout allows twenty bytes for the person's name and four bytes for each integer
value. The exact amount of space used for the name is specified in the record. The exact
amount of space used for each integer value varies with the particular computer and compiler
used. The base address, the 1,000 in this case, is completely arbitrary.
This method is significantly faster than any of the storage methods presented above because the
compiler can generate code which directly addresses the desired component value.
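To make the contrast concrete, here is a minimal sketch (the procedure and variable names are ours) of what direct component addressing means; the offsets in the comments assume the byte layout shown above, and an actual compiler may pad components differently:

```ada
procedure Record_Demo is
   type Person_Record is
      record
         Name : String( 1..20 );
         Age  : Integer range 1..99;
         ID   : Integer range 10000..99999;
      end record;
   Person : Person_Record;
begin
   Person.Name := ( others => ' ' );  --bytes 0..19 of the record
   Person.Age  := 21;                 --a store at (base address + 20)
   Person.ID   := 12345;              --a store at (base address + 24)
end Record_Demo;
```

No searching is needed at run time: each component name is translated into a fixed offset at compile time.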
2.1.2. Arrays
Let us start with a one dimensional array and consider its basic design and implementation.
A single dimensional array is a list of items referred to by their location in the list. More
precisely, a subscript value is associated with each item in the array and the item is referred to by
its subscript value. For example, the array:
Quiz = ( 7, 10, 5 )
is a list with three entries, and, assuming the corresponding subscripts are 1, 2, and 3, then the
items in the array are referred to respectively as Quiz(1), Quiz(2), and Quiz(3). Thus the
location of the items in the list is called the subscript and there is one item in the list for each
distinct value of the subscript. The items in an array must all be of the same data type.
Arrays are normally declared in Ada as types; for example,
type Vector is
array (Integer range <>)
of Integer;
where the use of the three lines emphasizes the three basic parts of the declaration. In this case,
the name of the data type is Vector, the subscripts are integers (with any desired range of values)
and the items in the array are integers.
The individual arrays are then defined in terms of the general type:
Quiz : Vector( 1..3);
Grade : Vector( 1..20);
where Quiz has subscripts in the range 1 to 3 and Grade has subscripts in the range 1 to 20.
Subscripts can have any value allowed by the subscript range; for example, the subscript
range of Vector is Integer so any integer values can be used as subscripts. Thus:
V1 : Vector(-10..10); and
V2 : Vector(-100..-10);
are both valid vector declarations, the first with subscripts ranging from -10 to +10 and the
second with subscripts ranging from -100 to -10.
Arrays can also have multiple subscripts. To declare a two dimensional array called Grades,
we might use:
type Matrix is
array (Integer range<>, Integer range<> )
of Float;
Grades : Matrix( 1..2, 1..3 );
The first subscript, in this case, refers to the row and the second subscript to the column, so that,
for example, Grades(2,3) is the entry in the second row, third column of the array.
Arrays with more than two subscripts are possible, but this text only uses one and two
dimensional arrays. Arrays with enumerated variable subscripts are also possible, but the
subscripts in this text are almost always integers.
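As a hedged aside, an array with enumerated subscripts looks like this (the type and variable names here are ours, not the text's):

```ada
procedure Enum_Demo is
   type Day is ( Mon, Tue, Wed, Thu, Fri );
   type Hours_Vector is
      array ( Day )
      of Float;
   --Week(Wed) is 8.0; the subscripts are Day values, not integers.
   Week : Hours_Vector := ( Mon => 8.0, Tue => 7.5, others => 8.0 );
begin
   Week( Fri ) := 4.0;
end Enum_Demo;
```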
It is possible to store an array in an unordered table using the subscript for one column; thus
the vector:
Quiz = ( 7, 10, 5 )
might be stored in the form:
Subscript Value
2 10
3 5
1 7
Note that with this storage scheme the entries need not be in order. This storage scheme is slow
because to locate, say Quiz(2), the computer must first search the first column for the subscript 2
before it can retrieve the value of Quiz(2) from the second column. Variations on this method
are used in systems where the subscript values are determined at run time.
A faster storage scheme takes advantage of the fact that computer memory is normally given
sequential addresses with one address for each byte of memory storage. This makes it easy to
store a one dimensional array; for example, the array Quiz above might be stored in sequential
locations starting at address 1000 as follows (assuming it takes four bytes to store one integer
value):
Address Value
1,000 7
1,004 10
1,008 5
A two dimensional array is stored in the same sequential fashion, normally row by row.
For example, the array Grades declared above might be stored as:

   Address   Value
   1,000     Grades(1,1)
   1,004     Grades(1,2)
   1,008     Grades(1,3)
   1,012     Grades(2,1)
   1,016     Grades(2,2)
   1,020     Grades(2,3)
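This row-by-row layout follows from a simple address formula, which is why subscripted access is fast. For an array stored with B bytes per item:

```
Address( Row, Col ) = Base + ( (Row - First_Row) * Number_Of_Columns
                             + (Col - First_Col) ) * B

Check against the table:  Grades(2,3) is at
1,000 + ( (2-1)*3 + (3-1) ) * 4  =  1,000 + 20  =  1,020
```

which matches the table entry, so the compiler can compute the address of any item directly from its subscripts.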
To illustrate implementing object classes in Ada using records, let us start with a simple
object class: Time values measured in minutes and seconds. In other words, every Time value
consists of a pair of values, say 3:45, where the first value is the minutes part of the time and the
second value is the seconds part of the time. We will have only two operations, adding two
Time values and multiplying a Time value by an integer. This set of operations is deliberately
kept simple to keep the example simple, but more operations can be added at any time to a
working package, so this is no limitation on the final set of time operations.
One way to store a Time value is to use a record:
type Time_Type is
record
Minute : integer;
Second: integer range 0..59;
end record;
The two arithmetic operations are straightforward except for the fact that a Second value can
only go up to 59 before it "overflows" and increments the associated minute value. This means
that all sums and products must test for this "overflow" and treat it correctly.
An algorithm to multiply a time value by an integer is:
Multiply( Time, I )
Product <-- Time.Second * I
Quotient <-- Product / 60
Answer.Second <-- Product - 60*Quotient
Answer.Minute <-- Time.Minute * I + Quotient
Return Answer
end multiply
The corresponding Ada package specification is in Specification 2.2.1 and the package body is
in Program 2.2.1.
package Time_Package is

   type Time_Type is private;

   function "+" ( Time1, Time2 : Time_Type ) return Time_Type;
   --Returns the sum of two time values.

   function "*" ( Time : Time_Type; I : Integer ) return Time_Type;
   --Returns the time value multiplied by I.

private

   type Time_Type is
      record
         Minute : Integer;
         Second : Integer range 0..59;
      end record;

end Time_Package;
package body Time_Package is

   ---------------------------------------------------------------
   function "+" ( Time1, Time2 : Time_Type ) return Time_Type is
      Sum    : Integer;
      Answer : Time_Type;
   begin
      Sum := Time1.Second + Time2.Second;
      if Sum < 60
         then Answer.Second := Sum;
              Answer.Minute := Time1.Minute + Time2.Minute;
         else Answer.Second := Sum - 60;
              Answer.Minute := 1 + Time1.Minute + Time2.Minute;
      end if;
      return Answer;
   end "+";
   --------------------------------------------------------------
   function "*" ( Time : Time_Type; I : Integer ) return Time_Type is
      Product  : Integer;
      Quotient : Integer;
      Answer   : Time_Type;
   begin
      Product := Time.Second * I;
      Quotient := Product / 60;
      Answer.Second := Product - 60*Quotient;
      Answer.Minute := Time.Minute * I + Quotient;
      return Answer;
   end "*";

end Time_Package;
A user program can then contain declarations of the form:

R, S, T : Time_Type;
and operations of the form:
R := S + T;
R := S * 7;
to manipulate time values. Comparisons of the form:
if R = S or S < T
and iterative statements of the form:
while ( R /= S )
are also possible. This set of operations is particularly simple, but extensions are straightforward
and left for the exercises.
Exercises
To illustrate implementing object classes in Ada using arrays, let us start with a simple float
vector. The object values are vectors with float values and the operations are to add two vectors
and to multiply a constant times each of the items in a vector.
To be more precise, the vectors are declared using a type statement of the form:
type Float_Vector is
array ( Integer range <> )
of Float;
The standard vector sum operation takes two vectors with the same subscript range, say
Lengths1 : Float_Vector ( 1..3 );
Lengths2 : Float_Vector ( 1..3 );
and adds the vectors component by component to produce the sum. If the vectors have the
values:
Lengths1 = ( 4.0, 7.0, 3.0 )
Lengths2 = ( 2.0, 4.0, 5.0 )
then the component by component sum is:
Sum = ( 6.0, 11.0, 8.0 ).
A general algorithm to produce this sum is:
Sum( Left, Right )
   If Left and Right don't have the same subscript range, then error
   For each subscript I in the subscript range
      Answer(I) <-- Left(I) + Right(I)
   Return Answer
end sum
The product function multiplies each component of a vector by the same constant; in other
words, multiplying the constant 5 times the vector ( 3, 6, 9 ) produces the vector ( 15, 30, 45 ).
An algorithm is:

Multiply( Left, Right )
   For each subscript I in the subscript range of Right
      Answer(I) <-- Left * Right(I)
   Return Answer
end multiply
The package specification is in Specification 2.3.1. This specification has one new feature:
an exception. We discuss error handling in more detail in Chapter 3, but, for now, this package
raises an exception if the given error occurs.
The corresponding package body is in Program 2.3.1. Note the use of the array attributes
'First, 'Last, and 'Range to denote respectively the first subscript value of the array, the last
subscript value of the array, and the range of values of the array subscript. This ensures that the
package works for any user-declared subscript range.
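To see why the attributes matter, here is a small sketch (the names are ours) of a loop that is correct no matter what bounds the user declares:

```ada
procedure Attribute_Demo is
   type Float_Vector is
      array ( Integer range <> )
      of Float;
   V : Float_Vector ( -10..10 );
begin
   --Here V'First = -10, V'Last = 10, and V'Range = -10..10,
   --so the loop visits every item of V regardless of its bounds.
   for I in V'Range loop
      V(I) := Float( I );
   end loop;
end Attribute_Demo;
```

Hard-coding bounds such as `for I in 1..21 loop` would break as soon as a vector with different bounds was passed in; the attributes make the code bound-independent.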
As a second example, we also include an object class of two dimensional arrays called matri-
ces. The general form of the data is:
type Float_Matrix is
array ( Integer range <>, Integer range <> )
of Float;
The two operations are similar to those of a vector; the first operation produces the sum of two
matrices and the second operation multiplies every item in a matrix by a constant. The
algorithms are similar to the ones for vector except that the loops must run over two sets of
subscripts. In this case, we use 'Range(1) and 'Range(2) to denote the range of subscript values
of the first and second subscript respectively. Similarly, 'First(1), 'First(2), 'Last(1), and 'Last(2)
refer to the first and second subscript of the specified array. The package specification is in
Specification 2.3.2 and the package body in Program 2.3.2.
package Float_Vector_Package is
   type Float_Vector is
      array ( Integer range <> )
      of Float;
   Subscript_Mismatch : exception;
   function "+" ( Left, Right : Float_Vector ) return Float_Vector;
   function "*" ( Left : Float; Right : Float_Vector ) return Float_Vector;
end Float_Vector_Package;
package body Float_Vector_Package is

   function "+" ( Left, Right : Float_Vector ) return Float_Vector is
      Answer : Float_Vector ( Left'Range );
   begin
      --Left and Right must have the same subscripts.
      if (Left'First /= Right'First)
            or (Left'Last /= Right'Last) then
         raise Subscript_Mismatch;
      end if;
      --Add component by component.
      for I in Left'Range loop
         Answer(I) := Left(I) + Right(I);
      end loop;
      --Terminate.
      return Answer;
   end "+";
--------------------------------------------------------------
   function "*" ( Left : Float; Right : Float_Vector ) return Float_Vector is
      Answer : Float_Vector ( Right'Range );
   begin
      for I in Right'Range loop
         Answer(I) := Left * Right(I);
      end loop;
      return Answer;
   end "*";
end Float_Vector_Package;
package Float_Matrix_Package is
   type Float_Matrix is
      array ( Integer range <>, Integer range <> )
      of Float;
   Subscript_Mismatch : exception;
   function "+" ( Left, Right : Float_Matrix ) return Float_Matrix;
   function "*" ( Left : Float; Right : Float_Matrix ) return Float_Matrix;
end Float_Matrix_Package;
package body Float_Matrix_Package is

   function "+" ( Left, Right : Float_Matrix ) return Float_Matrix is
      Answer : Float_Matrix ( Left'Range(1), Left'Range(2) );
   begin
      --Left and Right must have the same subscripts.
      if (Left'First(1) /= Right'First(1))
            or (Left'Last(1) /= Right'Last(1))
            or (Left'First(2) /= Right'First(2))
            or (Left'Last(2) /= Right'Last(2)) then
         raise Subscript_Mismatch;
      end if;
      --Add component by component.
      for Row in Left'Range(1) loop
         for Col in Left'Range(2) loop
            Answer(Row,Col) := Left(Row,Col) + Right(Row,Col);
         end loop;
      end loop;
      return Answer;
   end "+";
   --------------------------------------------------------------
   function "*" ( Left : Float; Right : Float_Matrix ) return Float_Matrix is
      Answer : Float_Matrix ( Right'Range(1), Right'Range(2) );
   begin
      for Row in Right'Range(1) loop
         for Col in Right'Range(2) loop
            Answer(Row,Col) := Left * Right(Row,Col);
         end loop;
      end loop;
      return Answer;
   end "*";
end Float_Matrix_Package;
2.4. Pointers
Access variables can be used with any kind of data. They are probably most commonly used
with records, but they can be used with arrays or even other access variables. Thus, the
declarations:
type Float_Vector is
array ( Natural range <> )
of Float;
type Pointer_To_Vector is access Float_Vector;
define a type which can access a vector. The additional declaration:

P, Q, R : Pointer_To_Vector;

defines three access variables whose initial values are null. Statements of the form (the
subscript bounds shown are illustrative):

P := new Float_Vector( 1..100 );
Q := new Float_Vector( 0..10 );
R := P;

then, respectively:

1. allocate space for a vector with 100 entries and set P to the address of this space,
2. allocate space for a vector with 11 entries and set Q to the address of this space, and
3. set R to the address of the first vector.
Using Pointers
One of the most common uses of pointers in data structures is in linked lists. Assume we
have a list of integers that we need to store in the computer. The problem is that the size of the
list can vary drastically from one time to the next. If we store the list in an array, the array will
have to be large enough to store the longest possible list. Since the list can be very short in some
cases, allocating space for the longest possible list is very wasteful of space. One solution is to
only allocate space for exactly as many items as occur in the list. To do this we store each item
in a record which contains the item and a pointer to the next record:
   +--------+--------+
   |  Item  |  Next  |
   +--------+--------+
Such records are called nodes. For example, the list 27, 36, 22 would be stored in the form:
First --> [ 27 | ]--> [ 36 | ]--> [ 22 | Λ ]
where First is a pointer to the first record in the list, each node (except the last one) points to the
next node in the list, and the last node contains a null value for the pointer, denoted by the Greek
letter, Λ.
To use a linked list, we start with the basic declarations of a node and a pointer to a node:
type List_Node;
type Pointer is access List_Node;
type List_Node is
record
Item: Integer; --Item in list.
Next: Pointer; --Pointer to next node in list.
end record;
Once the list is constructed, it is straightforward to process the list. The basic code to print the
list, for example, is (assume Ptr is of type Pointer and Put will print an integer):
--Initialize a pointer.
Ptr := First;
--Print each item in the list.
while Ptr /= null loop
   Put( Ptr.Item );
   Ptr := Ptr.Next;
end loop;
Similarly, the basic code to sum the items in the list is:
--Initialize a pointer and the total.
Ptr := First;
Total := 0;
--Add each item in the list to the total.
while Ptr /= null loop
   Total := Total + Ptr.Item;
   Ptr := Ptr.Next;
end loop;
Comparing both code segments, it should be clear that the basic pattern for processing all of the
elements in a linked list is:
Initialize
   Ptr <-- First
   Initialize any other necessary variables
While Ptr /= null
   Process the node Ptr points to
   Ptr <-- Ptr.Next
To simplify the presentation, assume that all insertions of new items are made at the head of
the list; that is, if the list contains the two items: 22, 77 and a new item, 33, is inserted into the
list, then the list is 33, 22, 77. To make this insertion, it is necessary to allocate a new node,
insert the 33 in the new node, and then link the new node to the linked list. This can be done by
the code segment:

First := new List_Node'( Item => 33,
                         Next => First );

where the initial values of the node components are explicitly given in the allocation itself.
An Ada package specification for a particularly simple list package, one with only two
operations Insert and Print, is given in Specification 2.4.1. The corresponding package body is
in Program 2.4.1. The list is implemented in the package body using a linked list. Discussion of
more sophisticated implementation schemes is given in Section 2.6.2.
The major advantage of linked lists is that the programmer can allocate exactly the amount of
storage needed at the point that it is needed.
package Integer_Linked_List_Package is
   procedure Insert ( New_Data : in Integer );
   --Inserts New_Data at the head of the list.
   procedure Print;
   --Prints the whole list, one item per line.
end Integer_Linked_List_Package;
with Ada.Text_IO;
package body Integer_Linked_List_Package is
package Int_IO is
new Ada.Text_IO.Integer_IO ( Integer );
type List_Node;
type Pointer is access List_Node;
type List_Node is
   record
      Item : Integer; --Item in list.
      Next : Pointer; --Pointer to next node in list.
   end record;

First : Pointer := null; --Pointer to the first node in the list.
-------------------------------------------------------------
procedure Insert ( New_Data : in Integer ) is
begin
First := new List_Node'( Item => New_Data,
Next => First );
end Insert;
-------------------------------------------------------------
procedure Print is
   Ptr : Pointer;
begin
   Ptr := First;  --Initialize a pointer.
   while Ptr /= null loop
      Int_IO.Put( Ptr.Item );
      Ada.Text_IO.New_Line;
      Ptr := Ptr.Next;
   end loop;
end Print;

end Integer_Linked_List_Package;
Linked List Package Body
Program 2.4.1
Pointers can also be used to save space in large arrays of records when many of the records
are blank. To illustrate, assume we have to set up a program to handle airline flights. The
flights are numbered from 100 to 999 and for various reasons we want each flight accessed by its
flight number. So the obvious way to store the data is to set up an array with 900 records, one
for each flight. If each flight record has three components:
- a source (20 bytes),
- a destination (20 bytes), and
- kind of airplane (10 bytes)
then there are a total of 50 bytes per record, and 900 records at 50 bytes per record require a
total of 45,000 bytes of storage. On the other hand, if there are only a small number of
flights, say 30 or 40 flights, then most of the space for the 900 records is empty. One way to
save space is to use the array, but, instead of storing the records in the array, store only a pointer
to the record in the array. Declarations to accomplish this are:
type Flight_Record is
record
...details omitted...
end record;

type Record_Pointer is access Flight_Record;

type Flight_Vector is
   array ( Natural range <> )
   of Record_Pointer;
Flights : Flight_Vector (100..999);
With these declarations, Flights( Flight_Number ) is a pointer whose value is either null (there is
no such Flight_Number) or a pointer to a record containing the corresponding Flight_Record.
With these declarations, there is still a vector with 900 entries, only now each entry is only a
pointer. Assuming each pointer takes only four bytes (a typical value), then the vector requires
900 times 4 or 3600 bytes of storage. There is also a 50 byte record for each flight. If there are
40 flights then these records use 40 times 50 or 2000 bytes of storage. The total storage is the
sum of these two values 3600 + 2000 or a total of 5600 bytes. Compare this to the 45,000 bytes
necessary if each item in the vector is itself a record.
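The same arithmetic also gives a rough break-even point between the two layouts, using the figures above (4-byte pointers, 50-byte records, 900 flight numbers):

```
pointer version:  900*4 + 50*N  bytes, for N actual flights
record version:   900*50       = 45,000 bytes

900*4 + 50*N < 45,000   =>   N < 828
```

So the pointer scheme saves space unless nearly all 900 possible flight numbers are actually in use.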
The exact way that records and arrays are stored in memory was discussed in Section 2.1.
The thing that was not discussed is how or when the assignment of memory space to a given
record or array is made. In theory this can be done at several different times ranging from when
the program is compiled to when the program actually executes.
It is easy to assign memory space during compilation, but the result can be programs that
waste space and are inflexible. Consider, for example, a program with two large arrays,
which are never used simultaneously, so that, at least in theory, when the program is finished
with one array, it can use the same memory space for the second array. Generally speaking, the
compiler would allocate separate memory space for each array even though the arrays can share
the same memory space. This can lead to programs which use an unnecessarily large amount of
memory space. Similarly, if the program defines an array with 100 entries and a problem comes
along which needs 101 entries in the array, the program must first be altered and then recom-
piled and relinked before use.
Conversely, if all memory assignments are made during program execution, the program can
be very flexible and never use more memory than absolutely necessary. The difficulty is that
assigning memory space during program execution slows down the program execution, perhaps
drastically.
Ideally, the designer and programmer should be able to control when memory assignments
are made so as to optimize flexibility and speed. This can be done to some extent in Ada and
this section presents some techniques for doing this and taking advantage of this capability.
Variables declared in an Ada procedure or function are normally allocated memory space
when the procedure or function is invoked. Thus, given the procedure starting with the
statements:
procedure Sample is
X : Integer;
Vector : array (1..100) of Float;
space is allocated for X and Vector at the time Sample is invoked; that is, at execution time.
When the procedure Sample finishes executing, space for X and Vector is deallocated. If Sample
is the main procedure, then X and Vector are allocated memory space for the whole time the
program is executing. If, on the other hand, Sample is a subprogram (either a procedure or a
function), then space is allocated for X and Vector only while Sample is executing. If Sample is
invoked several times during program execution, then the variables X and Vector might be
assigned a different memory address for each invocation. This also implies that no memory
space is saved for X and Vector between invocations. Thus, if another vector is declared in a
different subprogram, the two vectors can share the same memory space provided the two
subprograms are never invoked simultaneously. With this storage allocation scheme, the
memory actually allocated at any given time is the minimum amount of memory possible given
this configuration of main program and subprograms.
Memory allocation for packages has two features. Packages are normally instantiated in a
procedure or function and the memory space is allocated for package global variables at the time
that the package is instantiated. The package memory space is deallocated at the time the
program or function which instantiated the package finishes execution. Procedures or functions
within the package can have their own local variables and space for these variables is allocated
when the subprogram is invoked and deallocated when the subprogram finishes execution.
There are times when the programmer wants more direct control over when memory is
allocated. One way to do this is to use a record and a pointer to the record. Recall that the
declarations:
type Flight_Record is
record
... component details not important at the moment...
end record;
FLT_125 : Flight_Record;
assign enough memory space to FLT_125 to store one copy of the record. If three variables are
declared to be of type Flight_Record, then space is allocated for three records.
On the other hand, the declarations:

type Flight_Record is
   record
      ... component details not important at the moment...
   end record;

type Record_Pointer is access Flight_Record;

P : Record_Pointer;

define an access type variable, P, which "points" at a node of type Flight_Record. No memory
space is allocated at this point for a Flight_Record. In fact, no space is allocated for a copy of
Flight_Record until an allocation statement like:
P := new Flight_Record;

is executed.
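A hedged sketch of this allocation in context (the procedure name, the sample component, and the instantiation name Free are all ours) also shows how the space can be explicitly released again:

```ada
with Ada.Unchecked_Deallocation;
procedure Allocation_Demo is
   type Flight_Record is
      record
         Kind_Of_Airplane : String( 1..10 );
      end record;
   type Record_Pointer is access Flight_Record;
   --Free returns a record's space to the allocator and sets the
   --pointer back to null.
   procedure Free is
      new Ada.Unchecked_Deallocation ( Flight_Record, Record_Pointer );
   P : Record_Pointer;
begin
   P := new Flight_Record;                  --space allocated here, at run time
   P.Kind_Of_Airplane := ( others => ' ' );
   Free( P );                               --space released; P is now null
end Allocation_Demo;
```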
It is important to separate the concept of a particular object class and its general representa-
tion data structure from the actual Ada package implementation. This section presents two
examples to illustrate some of the differences and some of the choices that must be made when
translating a basic concept into working Ada code.
_____________________________________________________________________
* While no manual can give a complete and final set of guidelines, this particular manual
distills the thinking of many leading practitioners.
As a first example of the choices involved in designing an Ada program to implement a data
type, consider a patient record containing a patient's name, age, and city, and two operations:
Get_Patient, which inputs a patient record, and Put_Patient, which prints out a patient record.
Recall that a data type (object class) is a set of values and some operations on those values,
so patient records with the two operations clearly form a possible data type. The obvious way to
implement a data type in Ada is to use a package. Thus, one way to implement a Patient_Record
data type is to use a package specification of the form (Text is a data type for processing textual
data; the complete description and implementation package are in Appendix B.)
with Text_Package;
package Patient_Record_Package is
type Patient_Record is
record
Name : Text_Package.Text; -- Patient's name.
Age : Integer range 1..99; -- Patient's age.
City : Text_Package.Text; -- Patient's city.
end record;

procedure Get_Patient ( Patient : out Patient_Record );
--Inputs one patient record.

procedure Put_Patient ( Patient : in Patient_Record );
--Prints one patient record.

end Patient_Record_Package;
This package allows the client or user program to declare as many Patient_Record variables as
it needs and then use the Get_Patient and Put_Patient operations to input and output the
records. Since the record declaration has no protection for individual record components, the
client or user program can access and alter any component in a patient record. This allows the
client or user program to process the patient records any way it wants to.
Note the way this package itself uses another package, Text_Package, to declare the data
types for the Name and City fields of a record. This Text_Package also contains input and
output operations as well as comparison operators and it also allows assignment of Text
variables so Text variables can be input, output, compared, and assigned values. Thus, Text
variables can be used as a standard data type in either a main program or in the definition of
another data type. That is true of any data definition, so, for example, the Patient_Records
defined above can themselves be used as a data type in other packages.
Since the implementation details of the Get_Patient and Put_Patient procedures are straight-
forward, we omit any discussion of them and concentrate on the specification portion of the
Patient_Record_Package.
The above specification does work and allows the client or user program to do anything it
wishes with the patient records and their components. There are times, however, when the
package designer wishes to limit client or user access to the record components. Patient records,
for example, are normally considered confidential; even a patient's name is confidential. Thus,
one might want to let the client or user program process the whole patient record but not be able
to access a patient's name, age, or address. To do this, the record is declared to be private. A
package specification to do this is:
with Text_Package;
package Patient_Record_Package is

   type Patient_Record is private;

   procedure Get_Patient ( Patient : out Patient_Record );
   procedure Put_Patient ( Patient : in Patient_Record );

private -- Declarations.

   type Patient_Record is
      record
         Name : Text_Package.Text; -- Patient's name.
         Age  : Integer range 1..99; -- Patient's age.
         City : Text_Package.Text; -- Patient's city.
      end record;

end Patient_Record_Package;
This package specification limits the client or user program to inputting, outputting, compar-
ing two records for equality, and assigning a record to a variable of patient type. In other words,
the client or user program cannot access any component of a record. This safeguards the record
from improper access to patient information.
One might even want to go further and specify that the patient records can only be input and
output. Even comparing two records for equality or copying a record from one place to another is
too dangerous to allow. To do this requires one change in the last package specification above.
The statement:
type Patient_Record is private;
in the specification above is replaced by:
type Patient_Record is limited private;
The addition of the word limited restricts the client or user program to only inputting and
outputting patient records. The client or user program, at least in theory, has no idea of what
information is stored in a patient record and has no access to this information and, in fact, cannot
even compare two records for equality nor copy a patient record from one place to another.
On occasion one wants to forbid access to some record components and allow access to other
record components. If, for example, we want to let the client or user program access only the
name component of a patient record, we would
1. make the whole record limited private, and
2. add another procedure to the package specification.
This new procedure would take a patient record and return the patient's name. In other words,
the client or user program cannot access a patient's name except through a package procedure or
function. The package designer then has complete control over which components a user can
access and which component values are protected from the user.
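A package specification of this kind can be sketched as follows. This is a sketch only: the name Get_Name, its profile, and the choice of a function rather than a procedure are illustrative assumptions, not taken from the text.

```ada
with Text_Package;
package Patient_Record_Package is
   type Patient_Record is limited private;
   procedure Get_Patient ( Patient : out Patient_Record );
   procedure Put_Patient ( Patient : in  Patient_Record );
   -- The only component a client can reach (illustrative name):
   function Get_Name ( Patient : in Patient_Record )
            return Text_Package.Text;
private
   type Patient_Record is
      record
         Name : Text_Package.Text;   -- Patient's name.
         Age  : Integer range 1..99; -- Patient's age.
         City : Text_Package.Text;   -- Patient's city.
      end record;
end Patient_Record_Package;
```

Because the type is limited private, a client can obtain a patient's name only through Get_Name; the age and city remain completely hidden.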
The whole point of this section is that once a data structure is designed, there are still choices
the designer must make before writing an Ada program. With records (or variables or arrays),
the designer must choose how much and what kind of access the user has to components of the
whole data structure. The next section shows that even more sophisticated choices must be
made.
The second example, a simple list of integers, illustrates some more choices that must be
made. In this case, the choices are somewhat dependent upon the particular application of the
final package.
To simplify the presentation, assume there is only one list operation, Insert, which adds an
item to the list. More operations could be added, but they would only obscure the discussion of
the major choices that need to be made.
To implement this object class as an Ada package, we first choose an implementation data
structure and develop an insertion algorithm. Assume the list is implemented using an array to
store the list:
List is record
Size : Natural := 0 --Number of items in list.
Array : array (1..Maximum_Size) --Array to store items in list.
end record
Insert ( Item )
If Size >= Maximum_Size
then Overflow error
else Size <-- Size + 1
Array( Size ) <-- Item
end insert
The next question is how to translate this data structure and algorithm into an equivalent Ada
package. There are several possible Ada implementations. Let us start with implementing the
data structure using a type record:
type Data_Array_Type is
array (Natural range 1..Maximum_Size)
of Integer;
type List_Node is
record
Size : Natural := 0; --Number of items in list.
Data_Array : Data_Array_Type;--Array to store list items.
end record;
There are still a number of choices to be made. Shall we implement one list (an object) or a
list data type (an object class) so that the user/client can declare as many lists as desired? There
is a time and a place for both cases, so we will present Ada implementations for both choices.
Let us start with implementing a single list. The Ada Quality and Style: Guidelines for
Professional Programmers, Version 02.01.01, Section 4.2.1, gives the following criteria for
package design:
- "Put only what is needed for the use of a package into its specification."
- "Avoid unnecessary visibility; hide the implementation details of a program from its users."
- "Objects which must persist should be declared in package bodies."
Translating these rules into a viable design suggests that:
1. No reference to the actual data scheme used to store the data need or should appear
in the package specification.
2. The Ada code used to define the array storage should be declared in the package body.
This suggests a package specification like the following:
package List_Package is
   procedure Insert ( Item : in Integer );
end List_Package;
where the specification contains no information about the actual implementation; that is, the data
declarations are left for the package body.
There are some minor improvements that can be made. First, this version uses the data type
Integer throughout the package. This is fine if we never intend or need to alter the package to
work for a list of something other than integers. If, for example, we later need a list of Float
numbers, the package could conceivably be altered easily, but the package is easier to alter if
the item data type occurs only once in the package; that is, declare a new data type, say
Data_Type, and use this type throughout the package. The specification for this case is:
package List_Package is
   subtype Data_Type is Integer;
   procedure Insert ( Item : in Data_Type );
end List_Package;
Now the list item data type can be altered by changing only a single entry right at the beginning
of the package specification.
A still better scheme is to realize that a list is essentially an Abstract Data Type (ADT)
because it doesn't matter what the data is, the list object class simply saves the data. The best
way to implement this version is to make the package generic as follows:
generic
   type Data_Type is private;
package Single_Generic_List is
   procedure Insert ( Item : in Data_Type );
end Single_Generic_List;
This gives a generic list, but it leaves the user/client program no way to control the maximum
size of the list. The easiest way to provide that control is to include the maximum size as one of
the generic parameters. The final result is in Program 2.6.2.1.
Since the list of items persists between list operations, this program declares the data struc-
ture in the package body. This also hides implementation details from the user and allows the
user to concentrate on the list itself and the list operations rather than the implementation details.
With this implementation, a user/client program can declare lists of arbitrary kinds. For
example, the statements:
package Integer_List is
new Single_Generic_List ( Data_Type => Integer,
Maximum_Size => 100 );
package Float_List is
new Single_Generic_List ( Data_Type => Float,
Maximum_Size => 500 );
generic
   type Data_Type is private;
   Maximum_Size : in Positive;
package Single_Generic_List is
   procedure Insert ( Item : in Data_Type );
end Single_Generic_List;
-----------------------------------------------------------------
-----------------------------------------------------------------
-- Package Body --
with Ada.Text_IO;
package body Single_Generic_List is
   type Data_Array_Type is
      array (Natural range 1..Maximum_Size)
         of Data_Type;
   Size       : Natural := 0;    --Number of items in list.
   Data_Array : Data_Array_Type; --Array to store list items.
   -----------------------------------------------------------------
   procedure Insert ( Item : in Data_Type ) is
   begin
      if Size >= Maximum_Size then
         Ada.Text_IO.Put_Line ( "List overflow" );
      else
         Size := Size + 1;
         Data_Array ( Size ) := Item;
      end if;
   end Insert;
end Single_Generic_List;
instantiate two lists: a list of up to 100 integers and a list of up to 500 float numbers. (Generics
and instantiations are covered again in more detail in Chapter 4.)
The next step is to create a list package that can handle multiple lists. The easiest way to do
this is to use the same technique used to specify the Time_Type in Section 2.2; that is, make the
list a record and move the record declaration into the package specification. Assuming, for the
moment, only integer lists, a typical package specification of this kind is:
package List_Package is
   type List is private;
   subtype Data_Type is Integer;
   procedure Insert ( List_Name : in out List;
                      Item      : in Data_Type );
private --Declarations
   type Data_Array_Type is
      array (Natural range 1..1000)
         of Data_Type;
   type List is
      record
         Size : Natural := 0;          --Number of items in list.
         Data_Array : Data_Array_Type; --Array to store list.
      end record;
end List_Package;
The Insert procedure now has a List_Name as a parameter; this allows the user/client program to
specify which list to insert the data into. For simplicity, all lists are given a maximum size of
1000 items.
With this package specification the user/client program can declare any number of lists; for
example,
with List_Package;
procedure Sample_Client_Program is
   List1, List2, List3 : List_Package.List;
The difficulty with this specification is that it violates the Ada Quality and Style guide
recommendations about hiding implementation details in the package body. If the definition of
the List record is moved to the package body, then the user must have some way of accessing a
particular list. To do this it is necessary to use an access variable to point at the desired list. The
basic specification scheme is then:
package List_Package is
type List is limited private;
private -- Declarations
type List_Node_Type;
type List is access List_Node_Type;
end List_Package;
where the actual declaration of the List_Node_Type is left for the package body.
Now the user/client program can still use statements like the following to declare as many
lists as desired:
with List_Package;
procedure Sample_Client_Program is
   List1, List2, List3 : List_Package.List;
The difference between this set of definitions and the previous ones is that now the data is stored
in the list package rather than in the client or user program.
There is a difficulty with this approach. Since List1, List2, and List3 are access types, they
are automatically initialized to null by the Ada compiler. This means there must be some way to
make these variables point at a particular List_Node_Type record. One way to do this is to add
another operation, say Create, to assign a particular List_Node_Type record to a specified list
name. The user/client program must then invoke this operation before using any other operation
on the list. This can be done, but for reasons of robustness (Ada Quality and Style guide Section
8.2), it is better to use the Controlled type (introduced in Chapter 4, Section 4.4.4) which will
correctly initialize the variables List1, List2, and List3 with no additional effort on the user/client's part.
Specification 2.6.2.2 and Program 2.6.2.2 contain a generic version of this scheme.
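As a preview of the Controlled approach (covered in Chapter 4, Section 4.4.4), the idea can be sketched as follows. This is a sketch only: the component name Node and the exact profile of Initialize are illustrative assumptions.

```ada
with Ada.Finalization;
package List_Package is
   type List is new Ada.Finalization.Controlled with private;
   procedure Insert ( List_Name : in out List;
                      Item      : in Integer );
private
   type List_Node_Type;
   type List_Node_Access is access List_Node_Type;
   type List is new Ada.Finalization.Controlled with
      record
         Node : List_Node_Access;   -- Attached by Initialize.
      end record;
   procedure Initialize ( List_Name : in out List );
   -- Initialize is invoked automatically whenever a List is
   -- declared, so List1, List2, etc. are attached to their
   -- List_Node_Type records with no effort by the user/client.
end List_Package;
```

The key point is that the compiler calls Initialize for every declared List, so no separate Create operation is needed and no list can be used uninitialized.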
This section has presented some of the possible Ada implementations of the same basic data
structure and algorithm. In other words, the simple data structure and insertion algorithm given
at the beginning of this section can be implemented several different ways in Ada. Each Ada
implementation has its own advantages and disadvantages which are independent of the basic
data structure and algorithm. The same statement can be made about many of the data structures
generic
   type Data_Type is private;
   Maximum_Size : in Positive;
package Multiple_Generic_Lists is
   type List is limited private;
   procedure Insert ( List_Name : in out List;
                      Item      : in Data_Type );
private
   type List_Node_Type;
   type List is access List_Node_Type;
end Multiple_Generic_Lists;
with Ada.Text_IO;
package body Multiple_Generic_Lists is
   type Data_Array_Type is
      array (Natural range 1..Maximum_Size)
         of Data_Type;
   type List_Node_Type is
      record
         Size : Natural := 0;          --Number of items in list.
         Data_Array : Data_Array_Type; --List of items.
      end record;
   ----------------------------------------------------------------
   ----------------------------------------------------------------
   procedure Insert ( List_Name : in out List;
                      Item      : in Data_Type ) is
   begin
      --If List_Name not yet attached to List_Node record, do it.
      if List_Name = null then
         List_Name := new List_Node_Type;
      end if;
      if List_Name.Size >= Maximum_Size then
         Ada.Text_IO.Put_Line ( "List overflow" );
      else
         List_Name.Size := List_Name.Size + 1;
         List_Name.Data_Array ( List_Name.Size ) := Item;
      end if;
   end Insert;
end Multiple_Generic_Lists;
presented in this book. For this reason, we will present the basic data structures and their
algorithms in some detail but, as a general rule, present only one of the many possible Ada
implementations of each data structure and its algorithms, leaving the other possible Ada
implementations for the reader. The interested reader is referred to the Ada Quality and Style
guide, Section 8.3.4, for a more detailed discussion of some of these points.
One of the goals of this book is to familiarize the reader with the various Ada implementa-
tion options possible and the pros and cons of each option. So, in the course of this book, each
of these options is covered in more detail with a discussion of the reasoning behind each option.
Exercises
1. Add the following operations to the Text package contained in Specification 2.6.1.1 and
Program 2.6.1.1:
a. concatenation,
b. an index function which returns the position of one Text value inside another
Text value,
c. a substitute operation which replaces one Text value by a second Text value
inside a third Text value,
d. a delete operation which deletes the first occurrence of one Text value from a
second Text value,
e. a function to convert a Text value into an Integer value,
f. a function to convert a Text value into a String value,
g. a function to convert an Integer value to a Text value,
h. a function to convert a String value to a Text value, and
i. a substring function which returns a specified substring of a given Text value. Be
sure to make allowances for various types of errors.
2. Extend the list packages in Specifications 2.6.2.1 and 2.6.2.2 to include a delete operation.
Stacks
STACKS
Stacks are widely used in many kinds of applications ranging from operating systems to
inventory management. They are also a particularly simple ADT and allow us to study some
ways of implementing ADTs without being confused by the details of the ADT itself.
This chapter presents first the general concept of stacks along with some algorithms for
implementing stacks and some uses of stacks. It then presents Ada code corresponding to the
algorithms.
The chapter also presents ways of implementing ADTs in Ada as generic packages and
concludes with coverage of Ada exceptions and iterators.
3.1. Stacks
Note that the three procedures, Push, Pop, and Clear, all modify data whereas the function,
Empty, only returns a value. A function should never have any side effects; that is, it should
never alter any data values. Thus, we will use a procedure any time a data value must be altered
and we will use a function only when nothing is altered; that is, when only a value is returned by
the function.
The stack differs from most of the data types described earlier in that we have not specified a
particular set of values. The stack has a specific set of operations, but the data items can be of
any type; what is important is that they are pushed onto and popped off of the stack. The data
items could be people's names, computer programs, or numbers of some type. The particular
type doesn't matter so long as the package pushes and pops them. Recall that an abstract data
type is a data type which has been abstracted from any particular data type or representation; that
is, it features the basic operations of an object class but is independent of any particular type of
data or any particular kind of representation. Thus the stack definition above qualifies as an
abstract data type.
3.1.1. Representations
There are a number of ways of representing a stack, but an efficient stack representation
should be directly in terms of one of the three basic components for implementing a data struc-
ture. Therefore, we give data specifications and algorithms for one representation based on
arrays and a second representation based upon linked lists (a combination of records and
pointers).
The array representation stores the items in an array such that the first item pushed onto the
stack is placed in Array(1), the second item pushed onto the stack is stored in Array(2), and so
forth. There is a variable, called Top, whose value is the number of items currently on the stack.
After executing the operations:
Push( A ), Push( B ), Push( C )
the array would be:
1 2 3 4 5 6 7 8
A B C ...
where the integers above the array indicate the subscript of the array component. In this case,
the value of Top is 3.
The basic steps for pushing an item onto the stack are:
Top <-- Top + 1
Array( Top ) <--- Item
Using an array, however, imposes an additional restriction. Since an array has a fixed amount of
storage space, it is necessary to ensure that the array has space to store another item. Adding a
check that the array still has room gives the algorithm:
If Array is full
then Overflow error
else Top <-- Top + 1
Array( Top ) <--- Item
Note that while the original stack ADT definition contained only one possible error (in the
Pop definition), because of the limitations of the array, it is necessary to add an additional error
in the Push routine. Such additional errors due to the representation are a necessary evil that, as
will be seen later, can complicate program development and modification.
To pop something from the stack, it suffices to pick off the item in Array( Top ) and decrease
Top by 1. Including a check for an empty stack gives the algorithm:
If Stack is empty
then Underflow error
else Save <-- Array( Top )
Top <-- Top - 1
Return( Save )
To test for an empty stack, it suffices to use:
Return( Top = 0 )
and to clear the stack we can use:
Top <-- 0
While the algorithms are developed individually, good software engineering practice requires
any object to be modularized; that is, the data and all of the operations on the data should be in a
single module or, in Ada terminology, in a package. Thus we need some algorithmic way to
describe a module. The format used in this book is illustrated in Module 3.1.2.1.1 for an array
implementation of a stack. First comes the data specification of those data items which are
common to all of the operations and persist from one operation invocation to the next. Next
comes the set of operation algorithms, one algorithm for each operation. Module 3.1.2.1.1 thus
contains the complete modular description for an array representation of a stack.
Note that one extra operation is included, a test for a full stack; it is included because it is
used in the push operation. Indeed, some experts consider the test for a full stack one of the
standard stack operations.
Data Specification
Maximum_Size : Positive Number;  --Maximum size of stack.
Top : Natural := 0;              --Number of items currently on stack.
Array : array( 1..Maximum_Size ) --Array to store the stack items.
Algorithms
Clear
Top <-- 0
end clear
Empty
Return( Top = 0 )
end empty
Full
Return( Top = Maximum_Size )
end full
Push( New_Data )
If Full
then
Overflow error
else
Top <-- Top + 1
Array( Top ) <-- New_Data
end push
Pop( The_Data )
If Empty
then
Underflow error
else
The_Data <-- Array( Top )
Top <-- Top - 1
end pop
The linked representation stores each item on the stack in a node with two fields:
| Item | Next |
where Item contains the item or value stored in the stack and Next contains a pointer to the node
containing the previous item on the stack. In record form the definition is:
Node is record
Item : ??? --Item in stack, any data type
Next : Pointer to Node --Pointer to node containing previous item on stack
end record
where the ??? indicates that the data type of Item is arbitrary.
There is also one additional value, a pointer, called Top, which points at the node currently
containing the item on the top of the stack.
After executing the operations:
Push( A ), Push( B ), Push( C )
the linked data structure would be:
Top --> [ C | ]--> [ B | ]--> [ A | Λ ]
where Top points at the node containing the item on top of the stack; this node in turn points to
the node containing the next item on the stack, and so on down to the bottom of the stack where
the value of the pointer is Λ to indicate the bottom of the stack.
To push a new item on top of the stack, we must first get a new, empty node, insert the value
of the new item into the node, and then link the node into the stack. If the stack is empty, the
algorithm is:
Top <-- new Node( Item => New_Data, Next => Λ )
When the stack is not empty the algorithm is:
Top <-- new Node( Item => New_Data, Next => Top )
Since the value of Top is null when the stack is empty, the single statement:
Top <-- new Node( Item => New_Data, Next => Top )
suffices for both cases.
Popping an item from the stack is straightforward:
If Stack is empty,
then Underflow error
else Save <-- Top.Item
Top <-- Top.Next
Return( Save )
Data Specification
Top : Pointer to Node; --Pointer to top item of stack.
Node is record
Item : ??? --Holds one item in stack, any data type.
Next : Pointer to Node --Pointer to next node in stack .
end record Node;
Algorithms
Clear
Top <-- null
end clear
Empty
Return ( Top = null )
end empty
Push( New_Data )
Top <-- new Node ( Item => New_Data, Next => Top )
end push
Pop( The_Data )
If Empty
then
Underflow error
else
The_Data <-- Top.Item
Top <-- Top.Next
end pop
3.1.2 Timing
Since all of the algorithms contain only assignment and if statements, the time to execute
each one of these algorithms is O(1). For uniformity, this is presented in the following table.
                   Representation
Operation       Array       Linked
Clear           O(1)        O(1)
Push            O(1)        O(1)
Pop             O(1)        O(1)
Empty           O(1)        O(1)
3.1.3 Comparison of Array and Linked Representations

Three major criteria for comparing different representations of an ADT are speed, space,
and flexibility. This section compares the array and linked representations of a stack using these
three criteria.
- The execution time of all of the operations is O(1) for both representations
which gives the impression that the two representations are equal in speed. This
is true except for one detail. The push operation in the linked representation
must get a new node each time it executes. This usually implies a call on the
operating system to allocate the space necessary for the node. Calls on the
operating system can be very time consuming and in practice can make the
linked representation much slower than the array representation.
- To speed up the push operation in the linked representation, various schemes for
reducing the number of calls on the operating system are used. Some Ada
compilers generate code which calls on the operating system for blocks of ten or
twenty nodes at a time, thus greatly reducing the number of calls on the operat-
ing system and greatly speeding up the execution time of the linked representa-
tion. This is nice when this occurs, but it is not something a designer should
depend upon. Another approach is to alter the pop operation so that every time
an item is popped from the stack the node that is freed up is placed in a linked
list. The push operation then uses empty nodes from this linked list whenever
possible instead of calling on the operating system. Provided the size of the
stack grows and shrinks more than once during the execution of the program,
reusing the nodes this way can increase the execution speed.
- The space used by the two representations depends upon several factors.
Clearly the array must be large enough to handle the worst case, so, under
normal circumstances the array representation uses the most space. The nodes
in a linked list representation, on the other hand, use extra space to store the
pointers. If the designer knows in advance the maximum number of items that
are actually in the stack at one time, then the array representation can use less
space. Obviously, a designer seldom knows in advance how many items are in
the stack at the same time, so, under normal circumstances, the linked list uses
the least amount of space.
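The node-reuse scheme described in the second point can be sketched as follows. This is a sketch only, assuming an integer stack; the package name is illustrative and the underflow/overflow checks are omitted for brevity.

```ada
package Linked_Stack is
   procedure Push ( New_Data : in  Integer );
   procedure Pop  ( The_Data : out Integer );
end Linked_Stack;

package body Linked_Stack is
   type Node;
   type Node_Access is access Node;
   type Node is
      record
         Item : Integer;
         Next : Node_Access;
      end record;
   Top       : Node_Access := null;  -- Top of the stack.
   Free_List : Node_Access := null;  -- Popped nodes saved for reuse.

   procedure Push ( New_Data : in Integer ) is
      N : Node_Access;
   begin
      if Free_List /= null then      -- Reuse a node when one is
         N         := Free_List;     -- available...
         Free_List := Free_List.Next;
      else                           -- ...otherwise ask the system
         N := new Node;              -- for a new one.
      end if;
      N.Item := New_Data;
      N.Next := Top;
      Top    := N;
   end Push;

   procedure Pop ( The_Data : out Integer ) is
      N : Node_Access := Top;
   begin
      The_Data  := N.Item;
      Top       := N.Next;
      N.Next    := Free_List;        -- Recycle the node instead of
      Free_List := N;                -- returning it to the system.
   end Pop;
end Linked_Stack;
```

Only the first push after the free list is exhausted pays the cost of an allocation; every pop makes a node available for the next push.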
Exercises
4. A priority stack is a stack where each item has a "priority" or rating and items are stored in
the stack from lowest priority to highest priority. Two items with the same priority are stored in
FILO order; that is, the first item pushed onto the stack is the last item popped from the stack.
Develop a module to implement a priority stack using:
a. an array representation, or
b. a linked representation.
5. A stack with refusal is one that can refuse to push any item satisfying some criteria. Develop
a module for such a stack using:
a. an array representation, or
b. a linked representation.
3.2 Some Stack Examples

Now that we know how to implement stacks, we are ready to consider some applications of
stacks. This section assumes the existence of a stack package and uses this package to simplify
the solutions of some typical stack problems; that is, problems whose solution depends upon a stack.
Example 3.2.1. Develop a stack-based algorithm to recognize a string consisting of n A's
in a row followed by n B's. (A recognizer outputs a Yes or a No depending on whether the
string matches the required pattern.) Note that n can be any value as long as there are the same
number of A's and B's.
The basic algorithm pushes each incoming "A" onto a stack and then, for each incoming "B,"
pops an "A" off the stack. The stack should be empty after the last "B" is processed and not
before. An algorithm is:
Initialize
Clear stack
Input: Char
Repeat for each 'A' (while Char = 'A' and not eof )
Push Char
Input: Char
end repeat
Repeat for each 'B' (while Char = 'B' and not eof and stack not Empty)
Pop Item
Input: Char
end repeat
Terminate
If Stack is not Empty and Char = 'B', then Pop Item from stack
If eof and Stack is Empty and Char = 'B'
then 'Yes'
else 'No'
end
The test for eof in the terminate section is necessary because of the way Ada tests for the end of
the file. To be more precise, Ada sets the end of file flag as soon as the last item in the file is
read. Thus, as soon as the last B is input, the algorithm exits the second loop without popping
the matching A from the stack. The first terminate statement then pops this last A off the stack,
if possible.
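The end-of-file behavior can be seen in an Ada sketch of the two input loops. This is a sketch only: it assumes a Stack_Package providing Clear, Push, Pop, and Empty for Character data, and all names are illustrative. (The exact moment End_Of_File becomes true also depends on line terminators in Text_IO.)

```ada
with Ada.Text_IO;
with Stack_Package;
procedure Recognize_AB is
   use Ada.Text_IO;
   Char : Character := ' ';
   Item : Character;
begin
   Stack_Package.Clear;
   Get ( Char );
   while Char = 'A' and not End_Of_File loop
      Stack_Package.Push ( Char );
      Get ( Char );                 -- End_Of_File becomes true as soon
   end loop;                        -- as the last character is read.
   while Char = 'B' and not End_Of_File
                    and not Stack_Package.Empty loop
      Stack_Package.Pop ( Item );
      Get ( Char );
   end loop;
   if not Stack_Package.Empty and Char = 'B' then
      Stack_Package.Pop ( Item );   -- Pop the A matching the last B.
   end if;
   if End_Of_File and Stack_Package.Empty and Char = 'B' then
      Put_Line ( "Yes" );
   else
      Put_Line ( "No" );
   end if;
end Recognize_AB;
```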
Example 3.2.2. Develop an algorithm to check an arithmetic expression for proper nesting of
two kinds of brackets, say () and []. Brackets are properly nested if, for each left bracket there
is a matching right bracket, and, while matched pairs can be inside of one another, say [()], no
overlapping such as [(]) is allowed. Also assume the last character in an expression is a
semicolon.
The basic algorithm pushes each open bracket, "(" or "[", onto the stack. Each closed bracket,
")" or "]", is matched to the item on the top of the stack; if the closed bracket and the open
bracket are not of the same type, then the expression is invalid. An algorithm is:
Initialize
Clear stack
End of expression <-- false
Error <-- false
Repeat for each Item (while not End of expression and not Error and not eof)
Input: Item
Do-One-Of
Item = '(' or '[' : Push Item
Item = ')' : Pop Token
Error <-- ( Token /= '(' )
Item = ']' : Pop Token
Error <-- ( Token /= '[' )
Item = ';' : End of expression <-- true
end do-one-of
end repeat
Terminate
If End of expression and no Error
then Output 'OK'
else Output 'No good'
end
Exercises
2. Develop an algorithm to check a string for a palindrome, a string which reads the same
backwards and forwards (ignoring punctuation, blanks, and capitalization). Some sample
palindromes are: "Mom", "Madam, I'm Adam", and "Able was I ere I saw Elba."
3. Develop an algorithm to check Ada programs for correct nesting of statements. Limit the
algorithm to working for assignment, if, and loop statements.
4. Extend the last exercise to "pretty print" the program it is examining. That is, the algorithm
should output the test program with proper indentation.
5. How many ways can the input letters A, B, C, D be permuted if each letter, as it is input, is
either output immediately or stored in a stack for later output (letters on the stack can be output
at any time)?
6. A company values its inventory on a LIFO basis; i.e., each item in inventory carries the
price it cost the company, and whenever the company sells an item it always sells the last item it
received of that type, so the company's gain is the difference between the selling price of the item
and what the company paid for the last copy of that item. Develop an algorithm to keep track
of the company's gains. Assume the company sells only one product and that the product is
always bought and sold in lots of one.
7. Extend Exercise 6 to
a. allow the product to be bought and sold in arbitrary quantities, and
b. allow more than one product.
3.3 Postfix Expressions

Most of us are familiar with the mathematical notation 5+3 or 5/3 where the operator (+, -, *,
or /) is placed between the two numbers. This notation, called infix notation because the opera-
tor is between the two numbers, is simple and useful. Its major disadvantage is that we must use
parentheses as the expressions become more complex. The expression 3/(2+5), for example,
cannot be written without the parentheses. Unfortunately, the more complicated our expression
manipulations become, the more of a nuisance parentheses become.
The Polish mathematician Lukasiewicz found a way around this problem by introducing
postfix notation, which eliminates the need for parentheses. In postfix, the operator always
comes after the two operands; thus, in postfix notation:
3 2 * = 6
3 2 + = 5
To evaluate a postfix expression we replace each occurrence of a triple of the form:
<number> <number> <operator>
by its value and keep on doing this until only the final result remains. For example,
2 5 + 3 * = 7 3 * = 21
2 5 + 3 9 + * = 7 12 * = 84
2 4 + 3 1 + * = 6 3 1 + * = 6 4 * = 24.
where at each step the leftmost triple is the one evaluated.
Postfix is conceptually simpler than infix because we don't have to worry about the parenthe-
ses; for example, the infix expression 3/(2+5) becomes 3 2 5 + / in postfix notation. Postfix also
greatly simplifies expression manipulation -- an important part of computer science.
Prefix notation accomplishes the same goal by placing the operator before the two operands.
Thus, in prefix notation:
+ 2 3 = 5
* 2 3 = 6
and so forth. It has the same advantages as postfix notation.
Lukasiewicz developed postfix notation to solve problems in mathematical logic, but the
concept is very useful in computer science. Most of our theory and techniques for compiler
writing are based upon using either postfix notation or prefix notation as a step halfway between
a high-level infix language and a binary code program. It is much easier to write two small
algorithms (the first to translate infix to prefix and the second to translate prefix to binary code)
than it is to write an algorithm to go directly from infix to binary code.
Between the two world wars, Poland was the home of a number of world-class mathematical
logicians. Lukasiewicz was a member of this group, and to honor him and his colleagues this
notation is often called "Polish notation," "reverse Polish," or some other term that includes the
word "Polish."
3.3.1 A Postfix Interpreter

The basic algorithm inputs a postfix expression one item at a time. Whenever it inputs a
number, it pushes the number onto a stack and, whenever it inputs an operator, it pops the top
two items from the stack, performs the desired operation on the two numbers, and pushes the
result back onto the stack. Thus, to evaluate the postfix expression 2,5,+, the algorithm pushes the 2
onto the stack, then pushes the 5 onto the stack, and, when it inputs the + sign, it pops the two
numbers from the stack, adds them and pushes the result, 7, back onto the stack. Note that this
process leaves the result on the stack.
A more detailed algorithm is:
Initialize
Clear stack
More-to-do <-- true
Repeat for each Item (while More-to-do and not eof)
    Input: Item
    Do-One-Of
        Item is a number : Push Item
        Item is an operator : Pop Right
                              Pop Left
                              Push ( Left Item Right )
        Item is ';' : More-to-do <-- false
    end do-one-of
end repeat
Terminate
Stack must contain exactly one item (the Answer) or else there is an Error
end
One of the interesting features of this algorithm is that it also can produce error messages for
invalid input. In particular, if the input expression contains too few operators, there will be
numbers left in the stack at the end. If the input expression contains too few numbers, the
algorithm will try to pop data from an empty stack. In either case, the input expression is wrong.
As written, the algorithm above assumes the stack always contains something to pop. This is
valid provided we use an exception to handle a stack underflow. Exceptions are covered in more
detail in Section 3.5; for now it suffices to assume that if a stack becomes empty, this whole
algorithm halts.
Many scientific calculators, such as those made by Hewlett-Packard, use reverse Polish
notation. The calculator itself uses an algorithm like the one above to evaluate these
expressions.
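The evaluation loop can be sketched in Ada. This is a sketch only: it assumes single-digit operands and, to keep the sketch self-contained, implements the stack as a simple array rather than using a stack package; all names are illustrative and the underflow checks are omitted.

```ada
-- Evaluates a postfix expression of single-digit numbers and the
-- operators + - * /, terminated by ';'.
with Ada.Text_IO;
procedure Evaluate_Postfix is
   use Ada.Text_IO;
   Stack : array (1..100) of Integer;
   Top   : Natural := 0;
   Ch    : Character;
   Left, Right : Integer;
begin
   loop
      Get ( Ch );
      exit when Ch = ';';
      if Ch in '0'..'9' then                      -- A number: push it.
         Top := Top + 1;
         Stack ( Top ) := Character'Pos ( Ch ) - Character'Pos ( '0' );
      elsif Ch = '+' or Ch = '-' or Ch = '*' or Ch = '/' then
         Right := Stack ( Top );                  -- Pop two operands,
         Left  := Stack ( Top - 1 );
         Top   := Top - 1;
         case Ch is                               -- apply the operator,
            when '+'    => Stack ( Top ) := Left + Right;
            when '-'    => Stack ( Top ) := Left - Right;
            when '*'    => Stack ( Top ) := Left * Right;
            when '/'    => Stack ( Top ) := Left / Right;
            when others => null;
         end case;                                -- result is now on top.
      end if;
   end loop;
   Put_Line ( "Answer =" & Integer'Image ( Stack ( Top ) ) );
end Evaluate_Postfix;
```

Note that the second operand popped is the left operand; this ordering matters for - and /.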
Exercises
3. Develop a program to translate a postfix input expression into a sequence of assembler state-
ments. You may assume the assembly language includes stack operations: Push, Addition,
Subtraction, Multiplication, and Division, where the four arithmetic operations are applied to the
top two items on the stack and the result left on the stack.
4. Develop a program to translate a postfix input expression into a sequence of assembler state-
ments. You may assume the assembly language is a three address machine; that is, every
machine instruction is in the form:
Op A_address B_address C_address
where the operation (Addition, Subtraction, Multiplication, or Division) is applied to the
contents of the first two addresses and the result stored in the third address.
5. Develop a postfix evaluator program to evaluate postfix expressions containing any combina-
tion of the operators: +, -, *, /, <, <=, >, >=, =, and /=.
Example 3.3.2.1. Develop an algorithm to translate infix expressions to postfix expressions. For
example, given the input A+B*C, the algorithm should output A, B, C, *, +. Assume for the
moment that the infix expressions contain no parentheses. Also assume each infix expression
ends with a semicolon.
The "secret" to translating infix to postfix is to use a stack and to assign a priority to each
operator. The priorities are:
Operator Priority
; 0
+,- 1
*,/ 2
** 3
Note that the higher priority operators are the operators executed first in an expression.
Now, as each token (number, data name, or operator) is input, it is processed as follows:
1. Numbers and data names are output.
2. Operators are pushed onto the stack, unless the operator on the top of the stack
has higher priority, in which case, the stack is popped and the popped operator
output.
This continues until the operator on the top of the stack has priority less than or equal to the new
operator. At this point, the new operator is pushed onto the stack.
As an example, assume we want to translate the infix expression:
2+3*5;
into the equivalent postfix expression. When the 2 is input, it is immediately output. Next the +
is input and pushed onto the stack. At this point we have:
Output = 2 and the stack = | + |
Next the 3 is input and immediately output so that at this point:
Output = 2, 3
When the * is input, its priority is compared to that of the operator on top of the stack. Since *
has higher priority than +, the * is pushed onto the stack giving the stack:
|*|
|+|
Next, the 5 is output as soon as it is input, so that the output at this point is:
Output = 2, 3, 5
Next the semicolon is input. Since the semicolon has lower priority than the *, the * is popped
from the stack and output giving:
Output = 2, 3, 5, *
Since the semicolon also has lower priority than the new top of the stack, +, the + is popped
from the stack and output giving:
Output = 2, 3, 5, *, +
Since the stack is now empty, the semicolon itself is output to produce the final postfix version:
2, 3, 5, *, +, ;
of the input, infix expression.
Before giving a detailed algorithm, it is worth noting that the algorithm executes faster if
each stack entry has two fields, one for the operator and a second for the priority of the operator.
This eliminates the need to continually reevaluate the priority of the item on the top of the stack.
We also assume the stack has a "peek" operation, one which returns the value of the top item on
the stack, but does not alter the value of the stack.
Since each item on the stack consists of an operator and a priority, the stack consists of
records and each record has two fields, one for the operator and one for the priority. The stack
then pushes or pops a whole record at a time. The algorithm will denote the record by enclosing
the operator and priority in a set of square brackets.
An algorithm is:
Initialize
Clear stack
Push [' ', -2]
More-to-do <-- true
Repeat for each Token in expression (while More-to-do and not eof)
Input: Token
Do-One-Of
Token = Number : Output Token
Token = +, -, *, /, ; : While Priority-of(Token) <= Priority-of(Peek)
Pop Operator
Output Operator
end while
If Token = ';'
then More-to-do <-- false
else Push [Token, Priority-of(Token)]
end do-one-of
end repeat
Terminate
Token must be a semicolon and the Stack must contain exactly one item [' ', -2]
or else there is an Error
end
Since the item pushed onto the stack during initialization has a lower priority than any operator,
it is impossible to underflow the stack in this algorithm. On the other hand, the algorithm
assumes that every input expression is a valid infix expression. It does no error checking of the
input expression.
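The translation algorithm above can also be sketched in Ada. Again this is only a sketch under assumptions not in the text: tokens are single characters, the input is the fixed string "2+3*5;", the stack is a plain array of records, and all names are invented for illustration. The sketch prints the postfix output as a string of characters rather than a comma-separated list.

```ada
with Ada.Text_IO;
procedure Infix_Demo is
   --  Each stack entry holds an operator and its priority, as in the text.
   type Entry_Type is record
      Op       : Character;
      Priority : Integer;
   end record;
   Stack : array (1 .. 10) of Entry_Type;
   Top   : Natural := 0;
   Input : constant String := "2+3*5;";

   function Priority_Of (C : Character) return Integer is
   begin
      case C is
         when '+' | '-' => return 1;
         when '*' | '/' => return 2;
         when others    => return 0;   --  the semicolon
      end case;
   end Priority_Of;
begin
   --  Initialize: push the sentinel [' ', -2].
   Top := 1;
   Stack (1) := (Op => ' ', Priority => -2);
   for I in Input'Range loop
      declare
         Token : constant Character := Input (I);
      begin
         if Token in '0' .. '9' then
            Ada.Text_IO.Put (Token);             --  numbers are output at once
         else
            --  Pop and output while the top has priority >= the new token.
            while Priority_Of (Token) <= Stack (Top).Priority loop
               Ada.Text_IO.Put (Stack (Top).Op);
               Top := Top - 1;
            end loop;
            exit when Token = ';';
            Top := Top + 1;
            Stack (Top) := (Token, Priority_Of (Token));
         end if;
      end;
   end loop;
   Ada.Text_IO.New_Line;                          --  prints 235*+
end Infix_Demo;
```

Tracing the sketch by hand reproduces the output sequence 2, 3, 5, *, + derived step by step above.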
Example 3.3.2.2. Extend the algorithm of Example 3.3.2.1 to translate infix expressions
containing parentheses.
This algorithm is essentially the same as the previous one. The major difference is that we
add two new cases in the Do-One-Of statement, one for "(" and the second for ")". If the Token
is "(" we simply push the "(" with priority -1 onto the Stack. If the Token is ")", then we pop
items off the stack and output them until we pop the matching "(". The algorithm is:
Initialize
Clear stack
Push [' ', -2]
More-to-do <-- true
Repeat for each Token in expression (while More-to-do and not eof)
Input: Token
Do-One-Of
Token = Number : Output Token
Token = +, -, *, /, ; : While Priority-of(Token) <= Priority-of(Peek)
Pop Operator
Output Operator
end while
If Token = ';'
then More-to-do <-- false
else Push [Token, Priority-of(Token)]
Token = ( : Push ['(', -1]
Token = ) : Pop Temp
Repeat while Priority-of(Temp) >= 0
Output Temp
Pop Temp
end repeat
If Temp /= '(', then error
end do-one-of
end repeat
Terminate
Token must be a semicolon and the Stack must contain exactly one item [' ',-2]
or else there is an Error
end
Since the first item pushed onto the stack in the initialize part of this algorithm has priority
-2, which is lower than the priority of any of the standard operators, this algorithm cannot
attempt to pop from an empty stack. For this reason, the algorithm has only very limited error
checking capabilities. It will process almost any input string containing integers and operators,
regardless of their format.
Exercises
1. Translate each of the following into both prefix and postfix notation:
a. 3+5 c. 3*5+2*4 e. (3+5) g. 3*(5+2)*4
5. Amend the algorithm in Example 3.3.2.1 so that numbers are treated the same as the +, -, *,
and / operators. To do this let a number have priority equal to 5 and show that the algorithm still
works if the numbers are lumped in with the operators; i.e., there is no need for the Do-One-Of
selection.
6. Assume that the algorithm in Example 3.3.2.1 is altered so that no record [' ', -2] is pushed
onto the stack during initialization. What effects does this have on the translation process?
7. Alter the algorithm in Example 3.3.2.2 so that ")" is included with the +, -, *, / case; in other
words, eliminate the special case for ")".
3.4 Generic Implementations
Many programmers assume that once the decision is made between using an array and a linked
representation, the Ada code is completely determined. This is false. There are a number of
further design decisions that must be made and each will affect the final Ada code. To mention
only a few possibilities:
1. Should the code implement a single stack (an object) or define a new data type
(an object class) so the user/client program can declare as many stacks as
desired?
2. Should the package be generic or non-generic?
3. How should the package handle errors?
These possibilities lead to a large number of different Ada packages all for the same ADT. For
example, to develop Ada code for a stack ADT, the designer must choose between:
1. an array or a linked list implementation,
2. a single object or an object class implementation,
3. generic or non-generic implementations, and
4. at least two error handling methods.
This gives a total of at least sixteen different Ada packages for the stack ADT.
As a general rule, there is no right or wrong choice for any of these possibilities. Or to be
more precise, the best choice depends upon the circumstances. What is right in one set of
circumstances may be completely wrong in a different set of circumstances. One of the goals of
this text is to introduce you to the possibilities and the factors that enter into the choice. We start
this process by making some choices as to how to implement a stack and developing the
corresponding Ada code.
The first choice we must make is between implementing a single stack (an object) or
implementing a stack data type (an object class) so that the package user can define as many stacks as
desired. The advantages of implementing a single stack include simpler and faster executing
code. There are problems, however, which require more than one stack, so it is sometimes
necessary to implement a stack data type. The techniques are slightly different and there is a
place for each kind of implementation, so we will present the single stack implementation in this
chapter and present the means for implementing a stack data type in the next chapter.
The data items in a stack can be of any data type; what is important is that the items are
pushed onto and popped off of the stack. The data items could be people's names, computer
programs, or numbers of some type. The particular type doesn't matter so long as the package
pushes and pops them. In other words, a stack is an ADT. Therefore, a natural choice for the
package implementation is a generic package so that the package user can specify any desired
data type at instantiation time.
While generic packages are a convenient and natural implementation of ADTs, they have
two major handicaps: speed and space. Generic packages tend to take more memory space and
execute much slower than the equivalent non-generic package. Thus, even though the
underlying object class is an ADT, non-generic packages are often used whenever execution speed or
memory space is important.
The following is an example of a generic stack package using an array implementation. The
maximum size of the stack and the data type of the items in the stack are specified as parameters
at the time the package is instantiated. We give first the package and then some sample
instantiations and usage.
The declaration of a generic package is very similar to the declaration of an ordinary
package. The only difference is the generic specification which precedes the usual package
specification. This generic specification contains a list of the variables whose values or data
types the user can specify at instantiation time.
To illustrate, the package specification for a simple generic stack package (based upon an
array implementation) can be:
generic
   type Data_Type is private;
   Maximum_Size : Positive;
package Stack_Package is
procedure Clear;
--Sets the stack to empty.
This package specification has two generic parameters, Data_Type and Maximum_Size,
which the user can specify at instantiation. The first parameter specifies the data type to be
stored in the stack and the second is a positive integer which specifies the maximum size of the
stack. Note that the actual data specification of the stack (the precise way the stack is
represented or stored) is not in the package specification.
Note also that data type Data_Type is private. This means that the package can only
perform assignments and equal and not equal comparisons on the data. That is, the
"privateness," is limited to the package. Back in the calling or invoking program, the data can be
processed any way the user desires. This assures the user that his data is only processed in
certain, simple ways in the package.
This differs from the usage of private in a non-generic package. In a non-generic package,
declaring something to be private limits the way the calling or invoking program can use the
data. In a generic package, on the other hand, declaring something to be private in the generic
portion of the specification limits the way the package can use the data. This distinction is
important.
This distinction is also a natural one. The user, for example, only wants the user data pushed
onto and popped off the stack. The user does not want the stack package to alter the user's data
in any way. Since the package is generic, declaring Data_Type as private guarantees the user
the package can only perform a few basic operations on the data.
Note that the actual data specification of the stack (the precise way the stack is represented or
stored) is NOT in the package specification. This is called information hiding and there are
several reasons for hiding implementation details. First, a stack is a stack is a stack and the exact
implementation data structure is not a part of a stack definition. In other words, a stack is an
ADT and the implementation should allow the user to use the stack in a way that is independent
of any particular representation; that is, a program can use a stack package without worrying
about the implementation data structure. Second, this is the ultimate in data protection. There is
no way the user can mess up the stack if it is stored in the package body and is accessible to the
user only by means of package operations. This relieves the user of any need to even think
about it. Third, and also important, we can later change the implementation details in the
package body without recompiling all the programs that use this package. When a large number
of programs all use the same package, this advantage can be significant. (See The Ada 95
Quality and Style: Guidelines for Professional Programmers, Version 01.00.01, Section 4.2.1
for further discussion of this point.)
To instantiate this package we create a new package with the desired features; for example,
the package:
with Stack_Package;
package Integer_Stack
is new Stack_Package( Data_Type => Integer,
Maximum_Size => 100 );
is now a stack package (object) that has a stack of up to 100 integers. Similarly, the package:
with Stack_Package;
package Float_Stack
is new Stack_Package( Data_Type => Float,
Maximum_Size => 300);
instantiates a stack object of up to 300 items of type Float. Finally, the packages:
with Name_Type_Package;
with Stack_Package;
package Name_Stack
is new Stack_Package(
Data_Type => Name_Type_Package.Name,
Maximum_Size => 50 );
instantiates a stack object of up to 50 items of type Name where the data type Name is defined
in Name_Type_Package and may even be an array or record. This allows records and even
arrays to be stored on the stack.
These new packages are used the same way any other package is used. In particular, to
invoke any of the stack operations, we use the appropriate package name followed by the
procedure name; for example:
Integer_Stack.Clear;
Float_Stack.Push ( X ); or
Name_Stack.Pop ( X );
The package body for an array implementation is sketched in Program 3.4.1. Since the
actual stack data specification is not in the package specification, it must be in the package body.
The first two lines declare the array used to store the stack and the next line defines the Top
variable. The remainder of the code is simply the Ada code corresponding to the algorithms in
Algorithm 3.1.2.1.1.
There is some question about whether or not the Top variable should be initialized to zero in
the declaration. One school of thought states that initialization is executable code and should not
be part of the declaration. Another school of thought states that every variable should be
initialized at the point where it is declared. This book takes the view that normally variables should be
initialized in the body of the code, not in a declaration, but this case is a definite exception. The
reason: the stack Clear operation is the only executable code that initializes the value of Top to zero.
If, for any reason, the user does not execute the Clear operation before executing any other stack
operation, the stack would not be initialized properly and the remainder of the program would be
useless. Therefore, to ensure that the stack is always properly initialized, we initialize the value
of Top to zero in the declaration section of the package body.
The package body for a linked implementation is sketched in Program 3.4.2. Since only a
single stack is being implemented, all of the data declarations for the stack are contained in the
package body. The actual code for the four operations is essentially the same as in the
algorithms in Module 3.1.2.2.1. (Note that the value of Top is initialized to null in the
declaration section. Why isn't this necessary in Ada and what are the advantages of doing it anyway?)
The package specification for the linked implementation, however, raises some questions. If
the package specification is independent of the actual implementation data structure, then it must
be identical for both the array and the linked implementations. This, however, requires
including the Maximum_Size variable in the linked implementation even though it is not used or
necessary in the linked implementation. On the other hand, using different package
specifications for the different implementations leads to difficulties in changing a program from using
one implementation to using another implementation of the same ADT. Neither choice is ideal.
Some authorities prefer to use a distinct package specification for each possible
implementation. This allows the designer to concentrate on one particular implementation and to ignore the
other implementation features. In fact, some insist that each implementation have a different
name which leaves no doubt as to the kind of implementation. Thus, a package containing an
array implementation of a stack might be named Array_Stack_Package or
Bounded_Stack_Package, while a package containing a linked implementation of a stack might be
named Linked_Stack_Package or Unbounded_Stack_Package. This way the reader of a
user/client program knows immediately what kind of implementation is being used.
The major argument for using the same package specification for both implementations is
that a user/client program can switch from one implementation to the other without having to
recompile, which can be a major advantage in a large program. The major disadvantage of
keeping the two specifications the same is that one or both implementations must implement
features that are only needed in the other implementation. Thus, keeping the same package
specification for both implementations requires including the Maximum_Size variable in the
generic part of the package specification of the linked implementation even though the linked
implementation never uses the variable Maximum_Size.
Note that deciding between these two approaches is itself another design decision.
Since the design and use of two distinct package specifications is straightforward and
obvious, this section will present a design where the two implementations use the same package
specification. To do this, we will include the variable Maximum_Size in the package
specification and ignore it in the package body of the linked implementation. Now, however, the
user/client may have to specify a value for the variable which is never used. One possible
solution is to give the variable a default value of some kind, say 1000. The package
specification would then be:
generic
   type Data_Type is private;
   Maximum_Size : Positive := 1000;
package Stack_Package is
...the operation definitions are the same as before...
end Stack_Package;
Now if the user omits the value of Maximum_Size from the instantiation, the value is
automatically set at 1000 or whatever default value is used.
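For example, assuming a specification with the default value as sketched above, an instantiation that omits Maximum_Size might look like this (the package name Default_Stack is our own choice for illustration):

```ada
with Stack_Package;
package Default_Stack is
   new Stack_Package ( Data_Type => Integer );
--  Maximum_Size is omitted, so it takes the default value, 1000.
```

A user who wants a different bound simply supplies Maximum_Size explicitly, exactly as in the earlier instantiations.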
There is one final, practical consideration. It is always a good idea to debug generic
packages in two stages. In the first stage, debug the package using Integer declarations instead
of generic declarations. Only when the Integer version is working and tested should you introduce
the generic declarations. If you try to combine the two stages, it is difficult to decide whether an
error is due to the basic procedures or to the generic features because the two can interact in
strange ways.
Exercises
4. Discussion Question: When, where, and under what circumstances should implementation
variables be initialized?
type Data_Array_Type is
array (1..Maximum_Size)
of Data_Type;
------------------------------------------------------------
Items : Data_Array_Type;    --Storage for the stack items.
Top   : Natural := 0;       --Top of stack; zero means empty.
------------------------------------------------------------
procedure Clear is
begin
Top := 0;
end Clear;
-------------------------------------------------------------
procedure Push (New_Data : in Data_Type) is
begin
if Top = Maximum_Size
then
...code omitted...
else
Top := Top + 1;
Items( Top ):= New_Data;
end if;
end Push;
-------------------------------------------------------------
procedure Pop (The_Data : out Data_Type) is
...code omitted...
end Pop;
-------------------------------------------------------------
function Empty return Boolean is
...code omitted...
end Empty;
end Stack_Package;
Program 3.4.1
Stack Package Body: Array Implementation
type Node;
type Link is access Node;
type Node is
record
Item: Data_Type; --Item in stack.
Next: Link := null; --Link to node containing
--previous item on stack.
end record;
----------------------------------------------------------
Top : Link := null;    --Top of stack; null means empty.
----------------------------------------------------------
procedure Clear is
begin
Top := null;
end Clear;
----------------------------------------------------------
procedure Push (New_Data : in Data_Type) is
begin
Top := new Node'( Item => New_Data, Next => Top );
end Push;
----------------------------------------------------------
procedure Pop (The_Data : out Data_Type) is
...code omitted...
end Pop;
----------------------------------------------------------
function Empty return Boolean is
...code omitted...
end Empty;
end Stack_Package;
Program 3.4.2
Stack Package Body: Linked Implementation
3.5. Exceptions
Handling errors always requires careful analysis and the best solution is highly dependent
upon the particular problem. Sometimes the best way to handle an error is to output an error
message and stop the program. This keeps the error from being propagated into bigger and
bigger errors. Sometimes, however, the error is not fatal and the programmer can handle it
provided some warning is issued. But in this case, the question is how to pass error information
back to the invoking program. Many languages use an extra parameter, an error parameter, to
pass error information back to the invoking program. The invoking program then must always
test this parameter to determine if an error occurred during the subprogram execution. This
method has its advantages and can be used in Ada, but Ada has another method that we want to
consider: the exception.
The exception in Ada is a particularly smooth way of treating all errors. It eliminates the
need for the continual testing. At the point where the error occurs, the programmer "raises an
exception" by using a raise statement. For example, to raise a stack overflow error in the stack
push operation we can do as follows:
procedure Push ( New_Data : in Data_Type ) is
begin
if Top >= Maximum_Size
then
raise Stack_Overflow; --This "raises" the exception
else
Top := Top + 1;
Items( Top ) := New_Data;
end if;
end Push;
Note: It is always good practice whenever possible to test for exceptions at the beginning of a
procedure before any calculations or alterations are made to any variables in the procedure or
package.
The general form of the raise statement consists of the word raise followed by the name of an
exception. For example,
raise Stack_Overflow;
raise Bad_Command;
raise Stack_Underflow;
are all valid raise statements provided the exception names have been previously declared; that
is, the declarations
Stack_Overflow : exception;
Bad_Command : exception;
Stack_Underflow : exception;
are included in the declaration section of the package specification. The following generic stack
package illustrates how to declare exceptions in a specification:
generic
   type Data_Type is private;
   Maximum_Size : Positive;
package Stack_Package is
procedure Clear;
--Sets the stack to empty.
Note that the comments following each routine include a list of the exceptions that can be raised
by that routine. The actual declaration of the exceptions is at the end of the specification with
each declaration followed by a comment describing the exception. This particular example is a
generic package, but exactly the same scheme is used for non-generic packages.
The advantage of the raise statement is that it is up to the package user to decide how to
handle the exception. If the user decides to ignore the exception, the computer will print an
error message and stop execution of the program as soon as the exception is raised. This means
the user does not even have to know the exception exists; the computer will still stop execution
as soon as the exception is raised.
On the other hand, if the user wants to do something special for the exception, that can be
done. The general form of the exception handler is a block in the form:
begin
any code which might raise the exception
exception
when Exception_Name => code to handle exception
end;
To illustrate the use of an exception handler, recall the algorithm of Example 3.2.2 which
verifies that the brackets (, ), [, and ] were correctly used in an expression.
Initialize:
Clear stack
End_of_Expression <-- false
Error <-- false
Terminate
If End of expression and no Error
then Output 'OK'
else Output 'No Good'
end
We can extend this algorithm to handle exceptions by adding to the end of the algorithm the
block:
exception
stack overflow ==> expression too long
stack underflow ==> too many right brackets
Note that these are all of the exceptions in the stack specification package.
Instantiating the generic stack package developed earlier (the one that uses exceptions for the
overflow and underflow errors) as a character stack, this algorithm becomes the Ada program in
Program 3.5.1.
Note that the main block of this program is an exception block. An exception handler can also
be used in any part of a program by enclosing that part of the program in a begin-end block.
When the exception is raised in the stack package, the computer works its way back up
through the sequence of calling programs until it finds a block with the correct exception. It
then executes the statement(s) associated with the appropriate exception. In the sample program
above, if an exception is raised in the stack package, the computer returns to the invoking or
calling program, Brackets, and executes the appropriate exception statement. Once one of the
begin
--Initialize
Item_Stack.Clear;
End_of_Expression := false;
Error := false;
--Terminate
if Error then Put("Bad expression"); end if;
if not Item_Stack.Empty then
Ada.Text_IO.Put( "Too many left brackets" );
end if;
exception
when Item_Stack.Stack_Overflow =>
Ada.Text_IO.Put( "Expression too long" );
when Item_Stack.Stack_Underflow =>
Ada.Text_IO.Put( "Too many right brackets" );
end Brackets;
Program 3.5.1
exception statement(s) are executed, the program exits the block. In the example above, exiting
the block means the program stops executing. If the block is inside a larger block, then the
program exits the inner block to continue executing the outer block.
If the computer works its way back up through the sequence of calling programs without
finding a block with the correct exception, then the computer outputs an error message and stops
executing the program.
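A small, self-contained sketch of this behavior (the procedure and exception names are invented for illustration): the exception is handled in the inner block, and execution then continues in the outer block.

```ada
with Ada.Text_IO;
procedure Nested_Blocks is
   Some_Error : exception;
begin
   begin                                   --  inner block
      raise Some_Error;
   exception
      when Some_Error =>
         Ada.Text_IO.Put_Line ("Handled in the inner block");
   end;
   --  Control resumes here after the inner handler finishes.
   Ada.Text_IO.Put_Line ("Outer block continues");
end Nested_Blocks;
```

If the inner handler were removed, Some_Error would propagate out of the inner block; since the outer block has no handler for it either, the program would stop with an error message, as described above.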
Exercises
1. What are some of the advantages of using an error parameter in a subprogram? What are
some of the advantages of using exceptions to handle errors in subprograms? Compare the two
methods of handling errors.
2. Determine the exceptions that should be included as part of the array operations:
a. find the minimum value in an array,
b. output the array,
c. compute the sum of the values in the array, and
d. count the number of times a given value occurs in the array.
3.6 Iteration
The stack is one particular way to store and retrieve a collection of items. The remainder of
this book contains many other ways of storing and retrieving a collection of items; such
collections are sometimes called containers. The array, linked list, and stack are all examples of
containers. While each distinct kind of container has its own way to store and retrieve data, they
all have some kind of operations to store and retrieve data.
Most containers also have some way to process all of the items in the container, an iteration
operation which applies the same function or procedure to every item in the container. The
iteration operation serves the same purpose as the for statement in processing the items in an
array. If, for example, we have a procedure, called Put_Item, to print one item, then the
statement:
Iterate (Put_Item)
will print every item in the container.
The basic iterate routine in the container module assumes:
1. the iterate routine has a subprogram as a parameter and
2. the parameter subprogram already exists and itself has one parameter, an item
in the container.
The iterate routine is included in the container module or package. The iterator invokes the
subprogram once for each item in the container; its general form is:
Iterate ( Subprogram )
Repeat for each Item in container
Execute subprogram (Item)
end repeat
To illustrate the whole process, assume we want to extend the stack package to include an
iterator. Ada indirectly allows subprograms to be passed as parameters. To be more precise,
one passes a pointer to the desired subprogram. Subprogram pointers are treated the same way
any other access data type is treated. Thus, the statements:
type Procedure_Access_Type is
access procedure ( Item : in Integer );
P : Procedure_Access_Type;
declare P to be a pointer to a procedure with one integer parameter. Similarly, the statements:
type Function_Access_Type is
access function ( Left, Right : Integer ) return Integer;
Q : Function_Access_Type;
declare Q to be a pointer to an integer valued function with two integer parameters. If we have
procedures, called P1 and P2, with one integer parameter and functions F1 and F2 with two
integer parameters, then the statements:
P := P1'Access;
P := P2'Access;
Q := F1'Access;
Q := F2'Access;
set P and Q to point to the specified subprograms. The term "Access" is an Ada attribute that
changes a procedure or function name into a pointer to that procedure or function. Thus, if P1 is
a procedure name, then P1'Access is a pointer to the procedure P1.
To use this capability, the Stack package will have to be extended to include the iterate
operation. The new version of the stack specification is in Specification 3.6.1. Note that it
contains two new statements, one to define the type Procedure_Access_Type and one to specify
the actual Iterate procedure.
An abbreviated version of the package body for an array implementation is in Program 3.6.1.
The actual Ada code is a straightforward rewrite of the algorithm above using the
Procedure_Pointer parameter to specify the procedure to be executed for each value of I.
generic
package Stack_Package is
type Procedure_Access_Type is
access procedure ( Item : in Data_Type );
procedure Clear;
--clears the stack to empty.
Specification 3.6.1
A Generic Stack with a Procedure Iterator
-- Array representation
type Data_Array_Type is
array (1..Maximum_Size)
of Data_Type;
---------------------------------------------------------------
---------------------------------------------------------------
... procedures Clear, Push and Pop along with...
... function Empty are the same as before...
...and omitted here...
---------------------------------------------------------------
procedure Iterate( Procedure_Pointer: Procedure_Access_Type )
is
begin
-- Repeat for each item in stack
for I in 1..Top loop
Procedure_Pointer (Items(I));
end loop;
end Iterate;
end Stack_Package;
Program 3.6.1
Stack Package Body with Iterator: Array Implementation
To illustrate the use of this iteration procedure, consider the following Ada procedure to print
one integer value:
with Ada.Text_IO;
with Ada.Integer_Text_IO;
procedure Put_Procedure ( Item : in Integer) is
begin
Ada.Integer_Text_IO.Put( Item );
Ada.Text_IO.New_Line;
end Put_Procedure;
This procedure can be invoked in any program using an integer stack. Assume first that we
have the object package:
with Stack_Package;
package Integer_Stack is
new Stack_Package( Data_Type => Integer,
Maximum_Size => 100);
Then the procedure:
with Integer_Stack;
with Put_Procedure;
procedure Sample is
P : Integer_Stack.Procedure_Access_Type;
..other declarations...
begin
P := Put_Procedure'Access;
Integer_Stack.Iterate ( P );
The result is a listing of all the items on the stack at the time the Iterate procedure is invoked.
Any other procedure can be substituted for the Put_Procedure; for example, the iteration
operation can be combined with a procedure to print only items with a value greater than 10, and the result
will be a list of only those items on the stack with a value greater than 10. It is also possible to
redefine either the procedure or the iteration operation so as to allow more parameters. More
examples are given in the exercises.
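For instance, a hedged sketch of such a filtering procedure (the name Put_If_Large and the use of Ada.Integer_Text_IO are our own choices, not the text's):

```ada
with Ada.Text_IO;
with Ada.Integer_Text_IO;
procedure Put_If_Large ( Item : in Integer ) is
begin
   if Item > 10 then                       --  print only the large items
      Ada.Integer_Text_IO.Put ( Item );
      Ada.Text_IO.New_Line;
   end if;
end Put_If_Large;
```

Invoking Integer_Stack.Iterate ( Put_If_Large'Access ) then lists only the items on the stack with a value greater than 10.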
It is also possible to develop an iteration operation which iterates over a function. The only
change necessary in Specification 3.6.1 and Program 3.6.1 is to change every reference to a
procedure into the corresponding reference to a function:
procedure Function_Iterate (
Function_Name : Function_Access_Type;
Result : in out Data_Type ) is
begin
for I in 1..Top loop
Result := Function_Name( Items(I), Result );
end loop;
end Function_Iterate;
Each time the Function_Iterate procedure invokes the Sum_Function, it passes as parameters the
value of the next item in the stack and the last value of Sum. The Sum_Function then returns
the sum of these two items which is then inserted into Result. The final value of Result is the
sum of the items in the stack.
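The Sum_Function itself is not shown in this excerpt; a minimal version matching the two-parameter profile that Function_Iterate expects would be:

```ada
function Sum_Function ( Item   : in Integer;
                        Result : in Integer ) return Integer is
begin
   --  Return the running total: the next stack item plus the sum so far
   return Item + Result;
end Sum_Function;
```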
To use this function in a program with an integer stack, assume that the Integer_Stack package is instantiated as:
with Stack_Package;
package Integer_Stack is
new Stack_Package( Data_Type => Integer,
Maximum_Size => 10);
Then the statements:

Sum := 0;
Integer_Stack.Function_Iterate( Sum_Function'Access, Sum );
suffice to perform the desired iteration. The first parameter, Sum_Function'Access, is used
directly rather than declaring a pointer variable, setting the pointer variable to Sum_Function'Ac-
cess, and then using the pointer variable as a parameter. The resulting code is shorter and clearer
than the alternative.
Other functions can of course be used; for example, iterating over a function that returns the
larger of two items will return the largest item in the stack.
The most obvious question at this point is why all of this bother: wouldn't it be simpler to
extend the stack package to include whatever particular procedure or function is necessary to
solve a given problem, say a procedure to print the items in the stack or a function to add the
items in the stack? This of course can be done and has often been done in the past. The diffi-
culty is that one ends up with either a large collection of stack packages, each one with a differ-
ent additional procedure or function, or a single stack package with tens or even hundreds of
different operations. The advantage of using an iteration operation is that only one additional
iteration procedure is necessary (or two if one includes one for procedures and one for functions)
and, with no changes or extensions to the stack package, the result can be used to obtain any
procedure or function of the items in the stack.
While we have used a stack to illustrate iteration, you should remember that an iterator can
be used with any container. Indeed, it is more useful with some of the containers presented later,
but any container, including stacks, can use an iterator.
Exercises
1. Assuming a stack package with a procedure iterator is available, develop procedures which
when invoked from the iterator procedure produce:
a. a list of all the items in the stack with a value greater than 10,
b. a new stack which is a copy of the old one, and
c. a new stack which is a copy of the old one less all items with a value greater than 10.
2. Assuming a stack package with a function iterator is available, develop functions which
when invoked from the iterator procedure produce:
a. a count of the number of items in the stack,
b. the maximum value on the stack, and
c. the average value on the stack (assuming the stack contains Float values).
3. Develop a stack package with a procedure iterator (one parameter) and a function iterator
(two parameters) assuming:
a. an array implementation or
b. a linked implementation.
What exceptions should be associated with these iterators? In particular, what should be done
about attempts to iterate over an empty stack?
4. Develop some procedures to be invoked from a procedure iterator which require more than
one parameter. Develop some functions to be invoked from a function iterator which require
more than two parameters.
Queues and Pipes
This chapter presents queues and pipes, two examples of first-come-first-served waiting
lines. Definitions, examples of their use, various representations, and implementations are
included.
It also presents another way of defining and implementing an ADT as an Ada data type so
that the user/client program can define and use as many queues as needed. This method is
presented in terms of queues, but it is also applicable to many other ADTs.
4.1. Queues

A queue is a waiting line: items are inserted at the rear and removed from the front, so items
are removed in the order in which they arrived. The basic queue operations are:

Enqueue( Data ) inserts the value of Data at the rear of the queue.
Dequeue( Data ) removes and returns the item at the front of the queue provided
the queue is not empty, otherwise it returns an error.
Clear sets the queue to empty.
Empty a Boolean function which is true if and only if the queue is empty.
This definition is independent of any particular type of data or any particular kind of repre-
sentation; that is, it is the definition of an ADT, a queue ADT. As usual, the three procedures,
Enqueue, Dequeue, and Clear, all modify data whereas the function, Empty, only returns a value
and does not modify any data.
4.1.1. Representations
There are two basic representation techniques for queues, one stores items in arrays and the
other stores items in a linked list. We cover each in turn.
4.1.1.1. Array Representation

A simple-minded array representation of a queue stores the items in an array so that the first
item in the queue is stored in Array(1), the second item in Array(2), and so forth. New items are
enqueued by storing them in the first empty location in the array. Items are dequeued by remov-
ing the item in Array(1) and sliding the rest of the items forward one position in the array so that
the new front of the queue is in Array(1). This data structure is easy to implement, but it
requires moving every item in the queue forward one position every time an item is dequeued.
This takes O(Size) time to execute and is much too slow for practical purposes.
A faster executing method is to use a circular array representation. Assume that we insert
the first item in Array(1), insert the second item in Array(2), and so forth until we insert an item
in Array(Last), the last location in the array. As we dequeue items, we first remove the item in
Array(1), then the item in Array(2), and so forth -- but where do we insert new items?
If we use a straightforward array representation, there is no place left to store the next item in
the queue. The array can only be used once and it has already been used. The obvious answer is
to start reusing the beginning of the array. After using Array(Last), we store the next item in
Array(1), provided this location is available again. This introduces a new problem: how do we
know that the beginning of the array is available again, especially as we start reusing the same
spaces over and over? A simple solution is to use a variable, called Size, to keep track of
the number of items currently in the queue. Whenever the value of Size is less than the number
of spaces in the array, there is room to enqueue another item.
It helps to follow an example or two by hand to see exactly how the items are stored in the
array. To start, we need two more variables, one called Front and one called Rear, to keep track
of the current location in the array of the front and rear of the queue.
The algorithms are cleaner if we use modular arithmetic and let the subscripts of the array
run from zero to (Maximum_Size - 1). (Ada includes modular arithmetic. See your Ada
manual.) Now assume we execute the operation, Enqueue( A ), and that the first item is inserted
in Array(1). After the operation, the array would appear as follows:

     0   1   2
         A

After Enqueue( B ) and then Enqueue( C ), the array becomes:

     0   1   2
         A   B

     0   1   2
     C   A   B

and after a Dequeue operation removes the first item, A, the array becomes:

     0   1   2
     C       B

with Front = 2, Rear = 0 and Size = 2. (Actually the value A would still be stored in Array(1),
but the drawing is clearer if we ignore this fact.)
Since the value of Size is now less than three, the number of locations in the array, we can
Enqueue a new item at the rear of the queue. Thus, after executing Enqueue( D ), the array
becomes:
0 1 2
C D B
An algorithm for enqueueing an item is:

If Array is full
then Overflow error
else Rear <-- (Rear + 1) mod (Maximum_Size)
Array( Rear ) <-- Item
Size <-- Size + 1
Note that, because the array is limited in size, we had to introduce an error, Overflow, that was
not included in the definition of the queue ADT. This is the first, and in the case of the queue,
the only representation-dependent error.
An algorithm for dequeueing an item is:
If Queue is Empty
then Underflow error
else Data <-- Array( Front )
Front <-- (Front + 1) mod ( Maximum_Size )
Size <-- Size - 1
Data Specification
Array : array ( 0 .. Maximum_Size - 1 ) of queue items
Front : subscript of the current front of the queue
Rear : subscript of the current rear of the queue
Size : number of items currently in the queue

Algorithms
Clear:
Front <-- 1
Rear <-- 0
Size <-- 0
end clear
Empty
Return( Size = 0 )
end empty

Full
Return( Size = Maximum_Size )
end full
Enqueue( New_Data )
If Array is full
then Overflow error
else Rear <-- (Rear + 1) mod( Maximum_Size )
Array(Rear) <-- New_Data
Size <-- Size + 1
end enqueue
Dequeue( The_Data )
If Queue is Empty
then Underflow error
else The_Data <-- Array( Front )
Front <-- (Front + 1) mod( Maximum_Size )
Size <-- Size - 1
end dequeue
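These algorithms translate almost directly into Ada. The following is a sketch, not the book's package; it fixes the item type to Integer and uses an Ada modular type for the subscripts so that the mod operations happen automatically (Circular_Queue and the exception names are assumed identifiers):

```ada
package Circular_Queue is
   procedure Enqueue ( New_Data : in Integer );
   procedure Dequeue ( The_Data : out Integer );
   Overflow_Error, Underflow_Error : exception;
end Circular_Queue;

package body Circular_Queue is
   Maximum_Size : constant := 100;
   type Index_Type is mod Maximum_Size;   --  subscripts 0 .. Maximum_Size - 1
   Items : array ( Index_Type ) of Integer;
   Front : Index_Type := 1;               --  as in the Clear algorithm above
   Rear  : Index_Type := 0;
   Size  : Natural    := 0;

   procedure Enqueue ( New_Data : in Integer ) is
   begin
      if Size = Maximum_Size then
         raise Overflow_Error;
      else
         Rear := Rear + 1;                --  wraps around automatically
         Items( Rear ) := New_Data;
         Size := Size + 1;
      end if;
   end Enqueue;

   procedure Dequeue ( The_Data : out Integer ) is
   begin
      if Size = 0 then
         raise Underflow_Error;
      else
         The_Data := Items( Front );
         Front := Front + 1;              --  wraps around automatically
         Size := Size - 1;
      end if;
   end Dequeue;
end Circular_Queue;
```

Because Index_Type is a modular type, adding one to Rear or Front wraps from Maximum_Size - 1 back to zero without an explicit mod operation.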
4.1.1.2. Linked Representation

The linked representation stores the queue as a linked list of nodes, where each node has two
fields: Item contains the item stored in the queue and Next contains a pointer to the next item in
the queue.
After executing the operations Enqueue( A ), Enqueue( B ), and Enqueue( C ), the data
structure would appear as follows, where Front is a pointer to the item at the front of the
queue and Rear is a pointer to the node containing the last item in the queue:
Front --> [ A | ]--> [ B | ]--> [ C | Λ ] <-- Rear
where, as usual, Λ denotes the null pointer marking the last node in the linked list.
The basic algorithm to dequeue a data item is:
If Queue is empty
then Underflow error
else Data <-- Front.Item
Front <-- Front.Next
Data Specification
Node is record
Item : ??? --Information in queue, any data type.
Next : Pointer to Node --Pointer to next entry in queue.
end record Node;
Algorithms
Clear:
Front <-- Λ
end clear

Empty:
Return( Front = Λ )
end empty
Enqueue( New_Data )
--Set up new node and link it in at the rear of the queue
New_Node <-- new Node( Item => New_Data, Next => Λ )
If Queue is empty
then Front <-- New_Node
else Rear.Next <-- New_Node
Rear <-- New_Node
end enqueue
Dequeue( The_Data )
If Queue is Empty
then Underflow error
else The_Data <-- Front.Item
Front <-- Front.Next
end dequeue
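The linked algorithms can also be sketched in Ada; note how Enqueue must update both Front and Rear (Linked_Queue and Underflow_Error are assumed identifiers):

```ada
package Linked_Queue is
   procedure Enqueue ( New_Data : in Integer );
   procedure Dequeue ( The_Data : out Integer );
   function  Empty return Boolean;
   Underflow_Error : exception;
end Linked_Queue;

package body Linked_Queue is
   type Queue_Node;
   type Node_Pointer is access Queue_Node;
   type Queue_Node is record
      Item : Integer;        --  information in the queue
      Next : Node_Pointer;   --  pointer to the next entry
   end record;

   Front : Node_Pointer := null;   --  null plays the role of Λ
   Rear  : Node_Pointer := null;

   function Empty return Boolean is
   begin
      return Front = null;
   end Empty;

   procedure Enqueue ( New_Data : in Integer ) is
      New_Node : constant Node_Pointer :=
        new Queue_Node'( Item => New_Data, Next => null );
   begin
      if Front = null then
         Front := New_Node;      --  queue was empty
      else
         Rear.Next := New_Node;  --  link after the old rear
      end if;
      Rear := New_Node;
   end Enqueue;

   procedure Dequeue ( The_Data : out Integer ) is
   begin
      if Front = null then
         raise Underflow_Error;
      else
         The_Data := Front.Item;
         Front := Front.Next;
      end if;
   end Dequeue;
end Linked_Queue;
```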
4.1.2. Timing
Since the algorithms contain only assignments and if statements, their execution times are all
O(1):
                    Representation
Operation        Array      Linked
Clear            O(1)       O(1)
Empty            O(1)       O(1)
Enqueue          O(1)       O(1)
Dequeue          O(1)       O(1)
Every conclusion made about the comparison of array and linked representations of a stack
remains true for the comparison of array and linked representations of a queue. In particular:
• Since the execution time is O(1) for all of the operations for both representa-
tions, the two methods are equally fast except for the time required to get a new
node in the linked representation of the Enqueue operation. The same speedup
techniques are possible, but generally, the array representation is significantly
faster than the linked representation.
• The space usage of the two representations depends upon several factors, but in
general, the linked representation uses less space.
• The linked representation is definitely the more flexible of the two
representations.
Exercises
3. How many ways can the input letters ABCD be permuted if each letter, as it is input, is either
output immediately or stored in a queue for later output?
5. Show that a circular array representation of a queue can be developed using any two of the
three variables Front, Rear and Size. Discuss the pros and cons of trading space for time.
6. A priority queue is an ADT which assumes each entry has a "priority" and the highest priority
item is always dequeued first. If two items with the same priority are in the queue, then they are
dequeued on a FIFO basis. Develop a priority queue assuming:
a. an array representation or b. a linked representation.
7. A queue with refusal returns the first item in the queue with value greater than a user-determined
value. If no value in the queue is greater than the specified value, then a failure occurs.
Develop a queue with refusal assuming:
a. an array representation or b. a linked representation.
8. Computer operating systems use multilevel queues. A two level queue, for example, has two
queues -- one with high priority and one with low priority. All high priority items are removed
before any low priority items are removed. Develop:
a. an abstract data type,
b. a set of algorithms and storage scheme, and
c. a package based on part (b)
for (1) a two level queue and (2) a multilevel queue.
9. A deque (pronounced "deck" or "DQ") is a double ended queue; that is, inserts and removals
can be made at either end of the queue. Develop
a. a deque ADT,
b. a set of algorithms and storage scheme for a deque based on:
i. an array,
ii. a linked structure, or
iii. a circular structure,
c. a package based on part (b) of this exercise.
4.2. Examples Using Queues

Queues are used for most problems that involve a first come, first served process. Queues,
for example, are a natural for modeling waiting lines. The heaviest user of queues on most
computers is the computer operating system. Operating systems use queues to store lists of tasks
to be executed such as things to be printed, records to be read, or jobs to be run.
Example 4.2.1. A trucking company loads shipments onto outgoing trucks in a first come, first
served order. To insure this is done, their computer system has the following input commands:
SHIP <ID #> <Weight> a shipment with the given ID number and weight is ready to ship.
TRUCK <Capacity> a truck that can carry the given weight is ready to load.
DONE there are no more commands.
When a truck is ready to load, the shipments are loaded onto the truck on a first come, first
served basis until the truck is full. Design this system.
Assume there is a queue and each SHIP command inserts the shipment's ID number and
weight into the queue. A TRUCK command takes shipments from the queue until the truck is
full. It simplifies the algorithm if we assume there is a PEEK operator, which returns the value
of the item at the front of the queue, but does not alter the queue. Also assume the queue
processes records with two fields, an ID field and a Weight field.
An algorithm is (the square brackets indicate a record):
Initialize
Clear Queue
More-to-do <-- true
Terminate:
If Command = Done and Queue is Empty and eof
then Output: Normal Termination
else Output: Abnormal Termination
end
Example 4.2.2. Design an algorithm to simulate the customers at Burger Bistro and determine
the average time to be served. Assume a customer enters Burger Bistro and joins a waiting
line. When the customer reaches the head of the line, a cashier fills the customer's order and
takes the customer's money. Also assume the probability of a customer entering Burger Bistro
during any given minute is 0.1 and assume it takes three minutes to serve a customer.
To solve this problem, we will simulate a large number of minutes, one minute at a time, using the following basic pattern.
During each minute, there is a 0.1 probability a customer enters, so we assume we have a
random number generator, RAND, which generates numbers between 0.0 and 1.0 with a uniform
random distribution. If the value of RAND is less than or equal to 0.1, we will assume a
customer has entered during this minute. If a customer does enter, then the time the customer
entered, that is, the current time, is inserted in a queue.
If, at the beginning of each minute, the cashier is not busy and the queue is not empty, then
the first entry in the queue is dequeued and the difference between this queue entry and the
current time is the waiting time for this customer.
Since it takes three minutes to serve a customer, we can determine if the cashier is busy by
using an integer variable, Busy, which will be set to 3 anytime the cashier starts to serve a
customer and which is decremented by 1 at the beginning of each minute (provided it is
currently greater than 0). After three minutes, the value of Busy will be back to zero and the
cashier can start to serve another customer.
A more detailed simulation is:
Initialize
Clear Queue
Busy <-- 0
Total wait time <-- 0
No of Customers <-- 0
Terminate
Output (Total wait time / No of Customers)
end
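The minute-by-minute pattern described above can be sketched in Ada as follows. Integer_Queue stands for an assumed queue-package instantiation holding customer arrival times; Simulate, Gen, and the 10,000-minute run length are illustrative choices:

```ada
with Ada.Text_IO;
with Ada.Numerics.Float_Random;
with Integer_Queue;   --  assumed queue-package instantiation for Integer
procedure Simulate is
   use Ada.Numerics.Float_Random;
   Gen             : Generator;      --  plays the role of RAND
   Busy            : Natural := 0;   --  minutes of service remaining
   Total_Wait      : Natural := 0;
   No_Of_Customers : Natural := 0;
   Arrival         : Integer;
begin
   Integer_Queue.Clear;
   Reset( Gen );
   for Time in 1 .. 10_000 loop
      if Busy > 0 then               --  one more minute of service is done
         Busy := Busy - 1;
      end if;
      if Random( Gen ) <= 0.1 then   --  a customer enters this minute
         Integer_Queue.Enqueue( Time );
      end if;
      if Busy = 0 and then not Integer_Queue.Empty then
         Integer_Queue.Dequeue( Arrival );
         Total_Wait      := Total_Wait + (Time - Arrival);
         No_Of_Customers := No_Of_Customers + 1;
         Busy := 3;                  --  serving takes three minutes
      end if;
   end loop;
   if No_Of_Customers > 0 then
      Ada.Text_IO.Put_Line
        ( "Average wait:" &
          Float'Image( Float( Total_Wait ) / Float( No_Of_Customers ) ) );
   end if;
end Simulate;
```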
Testing this simulation requires care and thoroughness. One way is to print out the values of
all the variables at the end of each minute for a large number of minutes and then, by hand,
verify each and every value. Can you devise a better way?
Exercises
1. A library has one copy of a book for a large number of students. When a student requests
the book, if the book is available, the student gets the book, otherwise the student's name is put
into a queue. Whenever a student returns the book, it is issued to the student at the front of the
queue. Develop a system to keep track of the book.
3. An airline keeps a list of passengers waiting for an empty seat on a flight. When a reserva-
tion is canceled, it is filled from a first come first served waiting list. Develop a program to
handle this list for a single flight.
4. A word processor is designed so that footnotes are inserted in the document at the point they
are referenced and when the word processor outputs the document, it collects all the footnotes
that appear on the same page and outputs them at the bottom of the page. Develop an algorithm
for outputting a page containing footnotes. Hint: Assume each footnote is on a separate line
and that each footnote line starts with a # symbol.
5. A store has a checkout line where customers wait to be checked out. If customers arrive with
a probability of 0.5 during any one minute interval and the actual average time to check out a
single customer is 0.75 minutes (normally distributed), then how long does the average customer
have to wait in the line? Use a simulation to generate your answer. (Assume you have a
function, called BELL, which generates normally distributed numbers.)
6. A company values their inventory on a FIFO basis; i.e., each item in inventory contains the
price it costs the company and whenever the company sells an item it always sells the earliest
item it received of that type. The company's gain is the difference between the selling price of
the item and what the company paid for that particular copy of the item. Develop an algorithm
to keep track of the company's gain. Assume the company sells only one product and that the
product is always bought and sold in lots of one.
4.3. Multiple Queues

Some problems require multiple queues. To handle these problems we must assume it is
possible to assign names to queues and that each queue operator now contains the name of the
queue. The new operators are:
Clear( Queue_Name ) sets the named queue to empty.
Empty( Queue_Name ) true if and only if the named queue is empty.
Enqueue( Queue_Name, Data ) inserts the value of Data at the rear of the named queue.
Dequeue( Queue_Name, Data ) removes the item at the front of the named queue into Data.
We will consider first some multiple queue examples and then in the next section how one can
implement and use multiple queues.
Example 4.3.1. A trucking company computer system has the following input commands:
SHIP <Destination> <Weight> a shipment with the given weight is ready for the given destination.
TRUCK <Destination> <Capacity> a truck for the given destination, able to carry the given weight, is ready to load.
When a truck is ready to load, the shipments are loaded onto the truck on a first-come-first-served
basis until the truck is full. Design this system assuming the company ships to Chicago,
Memphis, and New York.
We assume there are three queues, one for each possible destination. Call the queues
Chicago, Memphis, and New_York. A SHIP command inserts the shipment's weight into the
appropriate queue. A TRUCK command takes shipments from the queue until the truck is full.
It simplifies the algorithm if we assume there is a PEEK operator, which returns the value of the
item at the front of the queue, but does not alter the queue. Assuming this operator, a rough
algorithm is:
Initialize
Clear all three Queues
More-to-do <-- true
Terminate
end
Since the city names are also the destinations, the algorithm uses the city name to determine
which queue to use.
Example 4.3.2. Assume the Burger Bistro example in the previous section is expanded to
include a second queue. To be more precise, at the head of the first queue a cashier takes a
customer's order and money, but the customer then stands in a second line to fill the order.
Assume the time to fill an order is also 3 minutes. What is the average time from entering the
first queue to receiving the hamburgers?
We obviously need two queues, a cashier queue and a hamburger queue. Each new
customer first enters the cashier queue. When the customer reaches the head of the cashier
queue, he/she orders a meal, pays the cashier, and then enters the hamburger queue where the
customer eventually receives the meal.
We will use the identifiers Cashier Queue and Server Queue to represent the two queues.
We also use a Cashier Busy and a Server Busy to represent the two servers. The simulation is:
Initialize
Clear both Queues
Cashier Busy <-- 0
Server Busy <-- 0
Wait time <-- 0
No of Customers <-- 0
Terminate
Output (Wait time / No of Customers)
end
Since multiple queues are now possible, it is also possible to have records containing queues
and even an array of queues. The following example illustrates the use of an array of queues.
Example 4.3.3. For each book in the library, the library keeps a list of patrons waiting to check
out the book. In other words, if a patron wishes to check out a book which is already checked
out to another patron, the new patron's name is added to the waiting list. When a book is
returned, the library first checks to see if there is someone waiting to take out this book. If so,
the book goes to the patron waiting the longest. If no one is waiting for the book, it is put back
on the shelf. Develop a system to keep track of the books and the waiting lists.
Assume that each book has a unique book number and that each patron has a unique name.
Let the commands for the system be:
REQUEST <Book #> <Patron Name> request the specified book.
RETURN <Book #> returns the specified book.
where the book number and patron name are parameters of the system commands.
This problem is interesting because we need not two or three queues, but thousands of
queues, one for each book. The simplest way to do this is to set up an array of queues; to be
more precise, assume for simplicity that the library has one thousand books, numbered from 1 to
1000. Also assume we have the ability to define an array of queues; for example, let
Waiting: Array ( range 1..1000) of Queue
define an array containing 1000 queues, numbered from 1 to 1000. Now, statements such as:
For I = 1 to 1000
Clear ( Waiting(I) )
end for
will clear all of the queues. The statement:
Enqueue ( Waiting(3), "A" )
will insert the letter A in the third queue and the statement:
Dequeue ( Waiting(20), Data )
will dequeue the item at the head of the 20th queue into the variable Data.
The actual library problem needs one more array to keep track of whether or not each book
is currently in the library or checked out; let
In : Array ( range 1..1000) of Boolean
be this array.
An algorithm which uses these two arrays to process the commands is:
Initialize
For I = 1 to 1000 --clears queues and marks books as in the library
Clear ( Waiting(I) )
In(I) <-- true
end for
More-to-do <-- true
Terminate
end
To implement these kinds of algorithms in Ada we need the ability to define a new data type,
Queue, and then to declare items to be of type Queue; for example, the statements:
Chicago : Queue;
Memphis : Queue;
New_York : Queue;
define three queues. These data names then represent queues and can be passed as parameters in
procedure invocations. In other words, our four basic queue operations, Clear, Empty, Enqueue,
and Dequeue, can use these data names to specify which queue is to be processed. For example,
Clear( Chicago );
Enqueue( Memphis, Amount);
perform the desired operations on the desired queues.
To define arrays of queues, we can use a statement like the following:
type Queue_Array is
array ( Positive range <> )
of Queue;
We can then declare

Waiting : Queue_Array( 1 .. 1000 );

and use statements such as:

Clear( Waiting(3) );
Enqueue( Waiting(J), "A" );
where J is a positive valued integer, to perform the desired operations on the desired queues.
The mechanics of defining the Queue data type are left for the next section. The important
point at the moment is that we can define a new data type, Queue, and then use the new data
type the same way we would use any other data type.
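As a concrete sketch, assuming a Queue_Package along the lines of this chapter, exporting the limited private type Queue_Type and a Create operation (Library_Setup and the field names are illustrative choices), an array of queues can be declared and initialized like this:

```ada
with Queue_Package;   --  assumed: exports Queue_Type, Create, Clear, etc.
procedure Library_Setup is
   type Queue_Array is
      array ( Positive range <> ) of Queue_Package.Queue_Type;

   Waiting : Queue_Array ( 1 .. 1000 );        --  one waiting list per book
   Is_In   : array ( 1 .. 1000 ) of Boolean;   --  "In" is reserved in Ada
begin
   --  Create the queues and mark every book as in the library
   for I in Waiting'Range loop
      Queue_Package.Create( Waiting(I) );
      Is_In(I) := True;
   end loop;
end Library_Setup;
```

Note that the identifier In used in the pseudocode above is a reserved word in Ada, so the sketch renames the Boolean array Is_In.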
Exercises
1. A bookstore orders books for its customers. When a shipment of books is received, the store
fills orders on a first-come-first-served basis. Design a system to do this.
2. An airline keeps a list of passengers waiting for an empty seat on a flight. When a reserva-
tion is canceled, it is filled from the waiting list on a first come first served basis. Develop a
program to handle this list for several different flights.
3. A supermarket has a meat counter queue where customers wait to be served and a checkout
queue where customers wait to be checked out. Assume customers arrive at the meat queue with
a probability of 0.5 during any one minute interval and the actual average time to serve out a
single customer is 2 minutes (normally distributed). Assume also the customer goes immedi-
ately from the meat counter to the checkout counter where it takes 3 minutes to be checked out.
How long does it take the average customer from the beginning of the meat counter queue to
leaving the checkout counter? (Assume you have a function, BELL, which generates normally
distributed numbers.)
4. A company values their inventory on a FIFO basis; i.e., each item in inventory contains the
price it costs the company and whenever the company sells an item it always sells the earliest
item it received of that type. The company's gain is the difference between the selling price of
the item and what the company paid for that particular copy of the item. Develop an algorithm
to keep track of the company's gains. Assume the company sells three products and that each
product is always bought and sold in lots of one.
6. For each physician on its staff, a medical clinic has a queue of patients waiting to see the
physician. Develop a computer system to keep track of these patients.
4.4. Implementing the Data Type: Queue

In the last chapter we used a package to implement a single stack, that is, an object. The same
technique is used to implement a single queue. In this chapter, however, we want to demonstrate
how to implement a queue object class or data type. (The same technique can, of course, be used
to implement a stack data type.)
To define a new data type, Queue_Type, we must declare the queue type in an Ada package
specification. We did this in Chapter 2 with both vector and list ADTs by defining the types to
be arrays or records with certain attributes and certain operations on the arrays or records. To be
more precise, in Program 2.3.1 we inserted the complete specification of the data type
Float_Vector in the package specification and in Specification 2.6.2.1 we inserted the complete
specification of the limited private data type List in the package specification. We could do the
same thing with queues.
The difficulty with this approach is that the data is stored in the user's program. If the data is
declared to be private, the data is protected since the user cannot alter the data. Most experts,
however, prefer to store the data in the queue package body. The Ada Quality and Style
Manual, Section 4.2.1, goes further and states: "Avoid unnecessary visibility; hide the imple-
mentation details of a program unit from its users."
Some of the reasons behind this approach are similar to the ones given for storing a single
stack in the package body:
1. A queue is a queue is a queue and the exact implementation data structure is not
a part of a queue definition.
2. If the user does not know and cannot find out implementation details, then the
user generally concentrates on using the package and ignores low level, imple-
mentation details.
3. If the data specification must be changed (and this does happen from time to
time), then we must recompile every program that uses the package. This can
be very time consuming and tedious if the package is widely used. Removing
the implementation details from the package specification reduces the possibil-
ity of changing the package specification.
4. It saves space later on when we start putting stacks, queues, etc. in containers.
The question, then, is how to store multiple queues in the package body and let the user or
client software specify which queue is meant. The answer is to let the user have a pointer (or
more precisely, an access type variable) whose value points at a particular queue in the package
body. Each queue name now points to where the actual data is stored in the package body. In
graphical form:

     User program              Package body
     Q1 ---------------------> [ queue data ]
     Q2 ---------------------> [ queue data ]
     Q3 ---------------------> [ queue data ]
where each box in the package body indicates a queue. The queue names in the user program
are pointers to the place where the queue data is stored.
In simplified form, the corresponding Ada specification is:
package Queue_Package is
private --Declarations
type Queue_Node; --Queue data structure.
type Queue_Type is access Queue_Node; --Pointer to queue.
end Queue_Package;
This package specification has only a pointer to the data structure storing the queue itself. This
data structure is in the package body; in other words, the data type Queue_Type is actually a
pointer to a node in the package body and the exact way this node is defined depends upon the
implementation. The corresponding package body for an array implementation, for example,
starts out as follows:
package body Queue_Package is
... --Data structure and operation bodies.
end Queue_Package;
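The opening declarations of such a body might look like the following sketch (the field names and the Integer item type are assumptions; Ada permits an incomplete type declared in the private part of the specification to be completed in the body):

```ada
package body Queue_Package is

   Maximum_Size : constant := 1000;   --  size fixed here, in the body

   type Item_Array is array ( 1 .. Maximum_Size ) of Integer;

   --  Each created queue is one of these nodes; a Queue_Type value
   --  in the user program is a pointer to a node allocated here.
   type Queue_Node is record
      Items : Item_Array;
      Front : Positive := 1;
      Rear  : Natural  := 0;
      Size  : Natural  := 0;
   end record;

   --  ...bodies of the queue operations follow...

end Queue_Package;
```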
This design method eliminates any representation details from the package specification and
gives all of the advantages listed for keeping representation details from the package specifica-
tion. This also means the system designer can ignore representation details -- an important
simplifying factor in the design of large programs with thousands of details that must be kept in
mind. The biggest practical advantage is probably that altering the queue data representation
used by a particular program means only recompiling the queue package body, not recompiling
all of the programs that use the package.
Declaring queues as data types has an additional advantage. Since queues are now ordinary
variables, they can be used in arrays and records and passed as parameters. If the package is
designed this way, the invoking program or user contains only pointers to stacks or queues, not
the stacks or queues themselves so that they can be passed easily and efficiently. This may not
seem much of an advantage at the moment, but we will see later there are many advantages to
this capability.
Since the queue names are declared in the user/client program and the queue data is stored in
the package body, there must be some way to ensure that each queue name points to a particular
set of queue data. One way to do this is to add the operation:
Create ( The_Queue_Name )
which assigns a particular block of data storage space in the package body to the specified queue
name. Attempts to invoke other queue operations before the Create operation is executed can
cause difficulties. For now, we will raise an exception if a queue operation is invoked before the
Create operation is executed. We will also raise an exception if an attempt is made to create a
queue which has already been created. The question is treated in more depth in Section 4.4.4
(Initialization) and an alternative solution is presented there.
Actually implementing this approach requires a unified approach in the package specifica-
tion, the user program, and the package body. We next take up each of these three items in turn.
To simplify the presentation of the details and choices involved in the package specification,
we cover first the specification for an integer queue type and then generalize the specification to
a generic queue specification. The corresponding package bodies are covered later.
Specification 4.4.1.1 is an Ada package specification for an array representation of an integer
queue type; that is, a queue containing only integer values. Once this package specification is
compiled, user/client programs can declare items to be of type Integer_Queue. Note:
a. Type Integer_Queue is limited private; this insures that the package user cannot
alter the value of any queue name. In other words, the invoking program can
only pass queues as parameters in the specified operations.
b. There is no specification of the maximum size of the array used to store the
queue. This implies the size is fixed in the package body and cannot be altered
by the user/client program. This is a design decision and easily changed.
c. Each queue operation now contains a queue name as a parameter (this means the
invoking program can use multiple queues).
d. The package is designed to raise an exception any time an error occurs and has
four exceptions:
- no space left to insert a new item into the queue,
- attempt to dequeue from an empty queue,
- invoking an operation before the create operation is executed, and
- creating a queue already created.
e. The data type Integer_Queue is a pointer to the actual data structure,
Queue_Node, used to store the queue values. The detailed definition of
Queue_Node is left for the package body.
Queues and Pipes
package Integer_Queue_Package is

   type Integer_Queue is limited private;

   procedure Create ( The_Queue_Name : in out Integer_Queue );
   procedure Clear  ( The_Queue_Name : in Integer_Queue );
   function  Empty  ( The_Queue_Name : in Integer_Queue ) return Boolean;
   procedure Enqueue( The_Queue_Name : in Integer_Queue; New_Data : in  Integer );
   procedure Dequeue( The_Queue_Name : in Integer_Queue; The_Data : out Integer );

   Queue_Overflow        : exception; --No space left in queue.
   Queue_Underflow       : exception; --Dequeue from empty queue.
   Queue_Not_Created     : exception; --Operation before Create.
   Queue_Already_Created : exception; --Create executed twice.

private --Declarations
   type Queue_Node; --Queue data structure.
   type Integer_Queue is access Queue_Node; --Pointer to queue.
end Integer_Queue_Package;
Specification 4.4.1.2 generalizes the queue package specification to generic queues, where
the generic portion of the declaration includes Data_Type (the data type of the items in the
queue) and Maximum_Size (the maximum size of the queue). Note that:
a. The generic Data_Type is private; this means the package can do only two things
with Data_Type values: compare two values for equality or assign a value to a
Data_Type variable.
b. The user/client can now specify the value of Maximum_Size giving the user more
control over the data structure.
c. Maximum_Size is given a default value of 1000; that is, the value of
Maximum_Size is set to 1000 unless the invoking program instantiates it to a
different value.
d. The data type Queue_Type is declared as limited private to ensure that the user
program cannot alter the value of a queue name in any way or attempt to compare
two queues in any way. Queue names can only be passed as parameters in the
specified operations.
e. The data type Queue_Type is a pointer to the actual data structure, Queue_Node,
used to store the queue values. The detailed definition of Queue_Node is left for
the package body.
f. The same four exceptions are used as before.
The advantages of defining the queue type as a pointer (see Style Guide, Section 8.3.5)
include the fact that it simplifies including queues in records, defining arrays of queues, and
passing queues as parameters.
Once the generic queue package specification has been defined, it is possible to use multiple
queues in the same Ada program. The following sample program illustrates a typical application
program. Assume the following package instantiation:

with Queue_Package;
with Text_Package;
package Name_Queue is
   new Queue_Package( Data_Type => Text_Package.Text );
Also assume the program uses several queues, say Q1, Q2, ..., and the information stored in each
queue consists of people's names of type Text. (See Appendix B for a complete description and
implementation of data type Text.) The declaration part of the program might start as follows:
with Name_Queue;
with Text_Package;
procedure Sample is
Q1 : Name_Queue.Queue_Type; --First queue
Q2 : Name_Queue.Queue_Type; --Second queue
...rest of code omitted...
end Sample;
generic
   type Data_Type is private;
   Maximum_Size : in Positive := 1000;
package Queue_Package is

   type Queue_Type is limited private;

   procedure Create ( The_Queue_Name : in out Queue_Type );
   procedure Clear  ( The_Queue_Name : in Queue_Type );
   function  Empty  ( The_Queue_Name : in Queue_Type ) return Boolean;
   procedure Enqueue( The_Queue_Name : in Queue_Type; New_Data : in  Data_Type );
   procedure Dequeue( The_Queue_Name : in Queue_Type; The_Data : out Data_Type );

   Queue_Overflow        : exception; --No space left in queue.
   Queue_Underflow       : exception; --Dequeue from empty queue.
   Queue_Not_Created     : exception; --Operation before Create.
   Queue_Already_Created : exception; --Create executed twice.

private --Declarations
   type Queue_Node; --Queue data structure.
   type Queue_Type is access Queue_Node; --Pointer to queue.
end Queue_Package;
Note that Queue_Package is instantiated in the usual way and that Q1 and Q2 are declared to be
queues in the usual way.
To use the queue operations in the Sample program, we can use statements like the follow-
ing:
Name_Queue.Clear ( Q1 );
Name_Queue.Enqueue( Q2, Data );
Name_Queue.Empty ( Q2 );
Name_Queue.Dequeue( Q1, Data );
The astute reader might ask at this point why we did not use the same implementation
technique in Chapter 2 for the Float_Vector and Patient_Record types. One reason is speed;
using a pointer to access the node adds an extra layer of memory accessing and can significantly
slow down the system under some conditions. A second reason is the test for equality.
Recall that two access values are equal if and only if
they point to the same memory location. Since we never compare two queues for equality, this
presents no difficulties for queue implementations. Since we often compare, say, Patient_Record
values for equality, either we do not use access types for Patient_Record data types or we have to
add at least one more operation, a test for equality, to the package.
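To see the equality problem concretely, consider the following small sketch (the type and variable names here are invented for illustration, not taken from Chapter 2): two access values designating records with identical contents still compare unequal, because the pointers themselves are compared.

```ada
procedure Equality_Demo is
   type Patient_Record is
      record
         Id : Integer;
      end record;
   type Patient_Pointer is access Patient_Record;
   P1 : Patient_Pointer := new Patient_Record'( Id => 17 );
   P2 : Patient_Pointer := new Patient_Record'( Id => 17 );
begin
   --P1 = P2 is False: the two access values designate different
   --memory locations even though the records hold equal values.
   --P1.all = P2.all is True: this compares the records themselves.
   null;
end Equality_Demo;
```

A package that hides an access type behind a private type must therefore export its own equality test if clients need one.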
Since the data structure of Queue_Node is specified in the package body, no two package
bodies need use the same data structure. We will, in fact, give two data structures for this
package specification. The first will be a circular array implementation and the second will be a
linked implementation.
Program 4.4.3.1 contains the package body for the circular array implementation of a queue.
In this case, Queue_Node is a record containing the values of Front and Rear as well as the array
used to store the values in the queue. To be more precise:
type Queue_Node is
record
Size : Integer := 0; --Number of items in queue.
Front : Pointer := 1; --Pointer to first node in queue.
Rear : Pointer := 0; --Pointer to last node in queue.
Items : Data_Array_Type; --Array to hold circular queue.
end record;
where Data_Array_Type is an array type. Note also that the values of Size, Front, and Rear are
initialized in the definition to an empty queue.
Some other minor alterations in the queue algorithms are necessary when the queue package
is implemented for multiple queues. As noted earlier, while the user may declare some variable
to be a queue in the user program, this variable is only a pointer to a Queue_Node and the corre-
sponding Queue_Node does not exist until the package body creates it. It is possible, for
example, for a user to attempt to enqueue something into a Queue_Node which does not yet
exist.
The Create operation of course sets up a Queue_Node and links it to the specified
The_Queue_Name. Before the other operations can execute, they must make sure that the
Create operation has been executed. If the Create has not yet been executed, then an exception
must be raised. To do this, the first statement in the Clear, Empty, Enqueue, and Dequeue
routines should be the if statement:

if The_Queue_Name = null then
   raise Queue_Not_Created;
end if;
Obviously this approach informs the user programmer that the queue needs to be created before
proceeding, but it probably does this at the price of stopping program execution.
To illustrate, the Clear procedure with this new statement is:

procedure Clear( The_Queue_Name : in Queue_Type) is
begin
   if The_Queue_Name = null then
      raise Queue_Not_Created;
   end if;
   The_Queue_Name.Size  := 0;
   The_Queue_Name.Front := 1;
   The_Queue_Name.Rear  := 0;
end Clear;
The other queue procedures, Enqueue and Dequeue, are treated the same way.
There is an additional problem with the Empty function. Recall that all function parameters
must be in parameters. We could still raise the Queue_Not_Created exception in the Empty
function, but this in some sense violates our concept of a function. In this particular case, if we
examine the Empty function, it seems rather obvious that a queue which has not yet been created
must contain nothing and so must be empty. A new algorithm including this extension for an
array implementation is:
Empty( The_Queue_Name )
If The_Queue_Name = Λ
then Return ( true )
else Return ( The_Queue_Name.Size = 0 )
end
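In Ada, assuming the access-based Queue_Type of Specification 4.4.1.2, this algorithm might be written as follows; the short-circuit form or else guarantees the null test happens before the Size component is touched:

```ada
function Empty( The_Queue_Name : in Queue_Type )
   return Boolean is
begin
   --A queue which has never been created contains nothing,
   --so it is empty; otherwise test the item count.
   return The_Queue_Name = null
          or else The_Queue_Name.Size = 0;
end Empty;
```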
The package body for the array implementation is given in Program 4.4.3.1.
Program 4.4.3.2 contains the package body for the linked implementation. This implementa-
tion has a more complicated data structure than the circular array implementation. Queue_Node
in this case is a record containing the Front and Rear pointers. To be more precise:
type Queue_Node is
record
Front : Link := null; --Pointer to first node in queue.
Rear : Link := null; --Pointer to last node in queue.
end record;
Note that even though Ada automatically initializes all access variables to null, the values of
Front and Rear are explicitly initialized to null. The extra clarity of explicit initialization is
often a helpful documentation feature.
The actual nodes in the queue are defined separately as Node records. Each Node record is
in the form:
type Node is
record
Item : Data_Type; --Information field.
Next : Link := null; --Pointer to next node in queue.
end record;
As in the circular array implementation, the user program only has a pointer to the queue
node in the package body so it is possible for a user to try to execute some operation without
having created and initialized the Queue_Node. Again the other routines raise an exception if
this occurs.
An empty function for a linked implementation is:
Empty( The_Queue_Name )
If The_Queue_Name = Λ
then Return ( true )
else Return ( The_Queue_Name.Front = Λ )
end
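The corresponding Ada, again assuming the access-based Queue_Type, might be written as:

```ada
function Empty( The_Queue_Name : in Queue_Type )
   return Boolean is
begin
   --The or else prevents dereferencing a null queue name.
   return The_Queue_Name = null
          or else The_Queue_Name.Front = null;
end Empty;
```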
Since the linked implementation does not use Maximum_Size and Queue_Overflow, they
are simply ignored. They must be included in the package specification because the
array implementation needs them, but all other implementations are free to ignore them.
4.4.4 Initialization
One serious weakness of the last design is that the user/client program must Create every
queue before using it. It would greatly simplify the use of the package if there were
some way to automatically Create every queue at the beginning of the program. Ada 95 uses a
special procedure name, Initialize, to do this. To use this feature requires several changes
to the previous Ada queue package. First, we must add a procedure, called Initialize, to
the queue package. This procedure creates the queues in the usual way right after all the objects
in the program are given initial values. A typical Initialize procedure for the queue
package is:

procedure Initialize( The_Queue_Name : in out Queue_Type) is
begin
   The_Queue_Name.Ptr := new Queue_Node;
end Initialize;
Except for the use of the .Ptr (which will be explained shortly), this procedure is almost
identical to the Create routines in Programs 4.4.3.1 and 4.4.3.2. The major difference is that this
routine is automatically executed right after the user/client program objects are initialized, so the
package user does not have to worry about creating the queue objects before using them.
The long and the short of it is that, with this feature, there is no need for the Create routine or
the hassles associated with it.
Using this feature requires doing two things:
1. Replace the Create routine with the Initialize routine in the package
specification and body. To insure that no user/client program actually tries to
invoke the Initialize routine, it is placed in the private part of the package
specification.
2. The data type Queue_Type must be a special case of a Controlled type.
Controlled items can be initialized, finalized, and adjusted. This section only
considers initialization. (Finalize procedures are left for the exercises and adjust
procedures are skipped entirely.)
To make Queue_Type a Controlled type requires several additions and changes.
1. The package specification must begin with:

with Ada.Finalization;
use  Ada.Finalization;
3. When Queue_Type is declared in the private section, this same clause must be
used; that is, given the two declarations:

type Queue_Type is new Controlled with private;
   --In the visible part of the specification.

type Queue_Type is new Controlled with
   record
      Ptr : Queue_Pointer;
   end record;
   --In the private part of the specification.
This all sounds much worse than it is. The final result, the complete specification of the
queue package with initialization, is in Specification 4.4.4.1. Note that, since all queues are
automatically created, there is no need for the two exceptions for uncreated queues and queues
already created. This also simplifies the implementation because there is no need to check for
these cases. Carefully compare this specification to the one in Specification 4.4.1.2 and make
sure that you see which entries have been changed.
Since Queue_Type is now a record, every reference to a component of Queue_Node must
also contain the .Ptr; for example, the assignment:
Size <-- 0
becomes in Ada:
The_Queue_Name.Ptr.Size := 0;
Again, this makes the program look a bit messier than it actually is. The actual package body is
in Program 4.4.4.1. Again, carefully compare this program to the ones given earlier for a queue
package without initialization. The only changes are the addition of the Initialize procedure and
the inclusion of .Ptr in all references to components of Queue_Node.
with Ada.Finalization;
use  Ada.Finalization;
generic
   type Data_Type is private;
   Maximum_Size : in Positive := 1000;
package Queue_Package is

   type Queue_Type is new Controlled with private;

   procedure Clear  ( The_Queue_Name : in Queue_Type );
   function  Empty  ( The_Queue_Name : in Queue_Type ) return Boolean;
   procedure Enqueue( The_Queue_Name : in Queue_Type; New_Data : in  Data_Type );
   procedure Dequeue( The_Queue_Name : in Queue_Type; The_Data : out Data_Type );

   Queue_Overflow  : exception; --No space left in queue.
   Queue_Underflow : exception; --Dequeue from empty queue.

private --Declarations
   type Queue_Node; --Queue storage structure.
   type Queue_Pointer is access Queue_Node;
   type Queue_Type is new Controlled with
      record
         Ptr : Queue_Pointer;
      end record;
   procedure Initialize( The_Queue_Name : in out Queue_Type );
end Queue_Package;
package body Queue_Package is

subtype Pointer is Natural range 0 .. Maximum_Size;

type Data_Array_Type is
array (Pointer)
of Data_Type;
type Queue_Node is
record
Size : Natural := 0; --Number of items in Queue;
Front : Pointer := 1; --Pointer to first node in queue.
Rear : Pointer := 0; --Pointer to last node in queue.
Items : Data_Array_Type;--Array to hold circular queue.
end record;
--------------------------------------------------------------
--------------------------------------------------------------
procedure Initialize( The_Queue_Name : in out Queue_Type) is
begin
The_Queue_Name.Ptr := new Queue_Node;
end Initialize;
--------------------------------------------------------------
procedure Clear( The_Queue_Name : in Queue_Type) is
begin
The_Queue_Name.Ptr.Size := 0;
The_Queue_Name.Ptr.Front := 1;
The_Queue_Name.Ptr.Rear := 0;
end Clear;
--------------------------------------------------------------
function Empty( The_Queue_Name : in Queue_Type)
return Boolean is
begin
return ( The_Queue_Name.Ptr.Size = 0 );
end Empty;
--------------------------------------------------------------
procedure Enqueue( The_Queue_Name : in Queue_Type;
New_Data : in Data_Type) is
begin
--Check for exceptions.
if The_Queue_Name.Ptr.Size = Maximum_Size then
raise Queue_Overflow;
end if;
--Insert the new value at the rear of the circular array.
The_Queue_Name.Ptr.Rear :=
The_Queue_Name.Ptr.Rear mod Maximum_Size + 1;
The_Queue_Name.Ptr.Items( The_Queue_Name.Ptr.Rear ) := New_Data;
The_Queue_Name.Ptr.Size := The_Queue_Name.Ptr.Size + 1;
end Enqueue;
--------------------------------------------------------------
procedure Dequeue( The_Queue_Name : in Queue_Type;
The_Data : out Data_Type) is
begin
--Check for exceptions.
if The_Queue_Name.Ptr.Size = 0 then
raise Queue_Underflow;
end if;
--Remove the value at the front of the circular array.
The_Data := The_Queue_Name.Ptr.Items( The_Queue_Name.Ptr.Front );
The_Queue_Name.Ptr.Front :=
The_Queue_Name.Ptr.Front mod Maximum_Size + 1;
The_Queue_Name.Ptr.Size := The_Queue_Name.Ptr.Size - 1;
end Dequeue;
--------------------------------------------------------------
end Queue_Package;
Exercises
1. Why is Queue declared to be limited private in the queue package specification of Specifica-
tion 4.4.1.2?
2. Redefine the Create operator so that it includes the maximum size of the queue; for example,
Create( Q1, 50 ) defines queue Q1 to have a maximum size of 50 and Create( Q2, 100 ) defines
queue Q2 to have a maximum size of 100. What are the advantages and disadvantages of this
version of the Create operator?
3. Expand the code in Programs 4.4.3.1 and 4.4.3.2 into working Ada queue packages.
5. Discuss the pros and cons of designing a package so that the package continues to execute
correctly in spite of programming errors by the invoking program.
6. Assuming the existence of the multiple queue package, translate the algorithms developed in
Section 4.2 into working Ada programs.
7. Develop a multiple stack ADT and then implement a corresponding Ada package. Make
sure your specification is independent of the implementation data structure.
8. The linked implementation of a queue is rather slow if the enqueue routine has to create
many nodes. Develop a means of storing no longer needed nodes for reuse by the enqueue
routine. Would this affect the execution time of the package?
9. Develop a queue package which contains only one queue and that is in the package body.
10. Add the following operations to any of the queue packages above:
a. a peek,
b. a function iterator, and
c. a procedure iterator.
11. Develop queue specifications for array and linked representations where each specification is
independent of the other; that is, where each queue representation has its own specification.
Include as few parameters and exceptions as possible.
12. Ada contains a Finalize operation as well as an Initialize one. Finalize is executed just
before the variable goes out of existence and is used to clean up any loose ends. Expand one of
your ADT packages with initialization to include a Finalize routine which outputs: 'I am out of here'.
4.5. Pipes
A pipe, sometimes called a stream, is also a first-in-first-out (FIFO) list, but it differs from a
queue in two ways:
1. once an item is inserted in the pipe, it stays in the pipe, and
2. the operations allow us to process all of the items in the pipe over and over again.
A pipe can be thought of as a "pipeline" with natural FIFO behavior. In another sense, a
pipe corresponds to a list on a piece of paper: once an item is written down, it stays written
down, and we can go through the same list over and over again.
The basic pipe operations are:

Clear       --Remove all items from the pipe.
Empty       --True if the pipe contains no items.
Insert      --Add a new item at the rear of the pipe.
Open        --Prepare to process the items, starting at the front.
Get_Next    --Return the next item in the pipe.
End_of_Pipe --True if Get_Next has returned every item in the pipe.

The Clear, Empty, and Insert operations are obvious extensions of similar operations in a
queue.
The Get_Next operation is the basic operation for processing all of the items in the pipe, one
at a time. It is modeled on the Get operator for sequential files in that, each time Get_Next is
executed, it gets the next item in the pipe. Thus, the first time it is executed it returns the first
value in the pipe, the second time it is executed, it returns the second item in the pipe, and so
forth. The Open operation initializes the Get_Next operation to the first item in the pipe and the
End_of_Pipe operation determines if the Get_Next has input all of the values in the pipe.
Note the difference between the Empty and the End_of_Pipe operations. Empty is true only
until the first item is inserted in the pipe. End_of_Pipe becomes true only after Get_Next has
retrieved all of the items in the pipe and becomes false as soon as the pipe is reopened. Thus,
End_of_Pipe can be true several times for the same pipe during a program's execution.
To illustrate the use of these operations, assume we have filled the pipe with data and we
wish to process this data one item at a time. For simplicity, assume the pipe is a list of numbers
and we want to print only the positive numbers. Then we can use the algorithm:

Initialize
   Open: Pipe
Repeat for each item in pipe ( while not end of pipe )
   Get_Next( Pipe, Item )
   If Item > 0 then Output Item
end repeat
Terminate: end
Note how the Get_Next operator keeps getting the next item from the pipe until the
End_of_Pipe becomes true. The basic loop pattern is very similar to the one used to process
records in a sequential file.
As a second example, again assume the pipe is a list of numbers and we want to print the
sum of the numbers. Then we can use:

Initialize
   Open: Pipe
   Sum <-- 0
Repeat for each item in pipe ( while not end of pipe )
   Get_Next( Pipe, Item )
   Sum <-- Sum + Item
end repeat
Terminate
   Output Sum
end
These two examples are both special cases of the general pattern:

Initialize
   Open: Pipe
   Initialize any other necessary variables
Repeat for each item in pipe ( while not end of pipe )
   Get_Next( Pipe, Item )
   Process the item
end repeat
Terminate
   Wrap up processing and output any final results
end
Note how this pattern goes through the pipe one item at a time, starting at the beginning and
working through until the last item has been processed. The Get_Next and End_of_Pipe operations
were designed to make the processing logic of this pattern very straightforward. Their
design was modeled upon the Open, Get, and End_Of_File operators of sequential files.
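In Ada, the first example above might look like the following fragment; the instance name Number_Pipe and the variables Pipe and Item are illustrative assumptions, not part of any package defined so far:

```ada
--Print only the positive numbers in the pipe.
Number_Pipe.Open( Pipe );
while not Number_Pipe.End_of_Pipe( Pipe ) loop
   Number_Pipe.Get_Next( Pipe, Item );
   if Item > 0 then
      Put( Item );
      New_Line;
   end if;
end loop;
```

Note how closely the while loop mirrors the sequential file pattern: test End_of_Pipe, get an item, process it, repeat.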
We will see a number of uses of pipes in the next section. For now, we want to consider
their representation.
4.5.1. Representations
There are two basic representation techniques for pipes, one stores items in arrays and the
other stores items in a linked list.
The array representation stores the pipe values in an array and uses a pointer, Rear, to point
to the last item stored in the pipe. Rear is initially zero and to insert an item in the pipe we
simply:
If Array is not full then
Increment Rear by one
Array(Rear) <-- New value
The three operators to process a pipe all use a pointer or an index, Next_Get_Position, which
points at the location in the array of the next value to be returned by Get_Next. The Open
operator sets Next_Get_Position to 1 and the Get_Next operator is essentially:
If not End_of_Pipe then
Data <-- Array( Next_Get_Position )
Increment Next_Get_Position by one
and the End_of_Pipe operation is the Boolean:

Return( Next_Get_Position > Rear )

A complete version of these algorithms is in Module 4.5.2.1.
The linked representation of a pipe is very similar to the linked representation of a queue.
The pipe is stored in a linked list with each node containing an item and a pointer to the next
node. There is a pointer, Front, pointing to the first item in the pipe and a pointer, Rear, point-
ing to the last item in the pipe.
The pipe insertion algorithm is identical to the queue insertion algorithm. As in the array
representation, the three operators, Open, Get_Next, and End_of_Pipe all use the pointer
Next_Get_Position which points at the next item to be returned by Get_Next. The Open opera-
tion sets Next_Get_Position to point to the first item in the pipe. The heart of the Get_Next
algorithm is:
If not End_of_Pipe then
Data <-- Next_Get_Position.Item
Next_Get_Position <-- Next_Get_Position.Next
Data Specification

Maximum_Size : Integer --Maximum number of items in the pipe.
Array : array( 1 .. Maximum_Size ) of items --Holds the pipe values.
Rear : Integer --Index of last item in the pipe.
Next_Get_Position : Integer --Index of next item for Get_Next.

Algorithms
Clear:
Rear <-- 0
Next_Get_Position <-- 1
end clear
Empty
   Return( Rear = 0 )
end empty

Full
   Return( Rear = Maximum_Size )
end full
Insert( New_Data )
If Full
then Overflow error
else Rear <-- Rear + 1
Array(Rear) <-- New_Data
end insert
Open
   Next_Get_Position <-- 1
end open

End_of_Pipe
   Return( Next_Get_Position > Rear )
end end_of_pipe
Get_Next( The_Data )
If End_of_Pipe
then Error -- Beyond end of pipe or Not opened
else The_Data <-- Array(Next_Get_Position)
Next_Get_Position <-- Next_Get_Position + 1
end get_next
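Translated into Ada against the array representation, Get_Next might be written as follows. The names Pipe_Type, Items, and Pipe_Error are assumptions about the surrounding package; Pipe_Type is assumed to be an access type, as in the queue packages, so an in parameter still lets the procedure update the designated record:

```ada
procedure Get_Next( The_Pipe : in Pipe_Type;
                    The_Data : out Data_Type ) is
begin
   if The_Pipe.Next_Get_Position > The_Pipe.Rear then
      raise Pipe_Error; --Beyond end of pipe or not opened.
   end if;
   The_Data := The_Pipe.Items( The_Pipe.Next_Get_Position );
   The_Pipe.Next_Get_Position := The_Pipe.Next_Get_Position + 1;
end Get_Next;
```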
Data Specification
Front : Pointer to Node; --Points to first entry in pipe.
Rear : Pointer to Node; --Points to last entry in pipe.
Next_Get_Position : Pointer to Node; --Pointer to next item for Get_Next.
Node is record
Item : ??? --Information in pipe, any data type.
Next : Pointer to Node --Pointer to next entry in pipe.
end record Node;
Algorithms
Clear
   Front <-- null
   Rear <-- null
   Next_Get_Position <-- null
end clear

Empty
   Return( Front = null )
end empty
Insert( New_Data )
   --Set up new node
   New_Node <-- new Node( Item => New_Data, Next => null )
   --Link the new node to the rear of the pipe
   If Front = null
      then Front <-- New_Node
      else Rear.Next <-- New_Node
   Rear <-- New_Node
end insert
Open
   Next_Get_Position <-- Front
end open

End_of_Pipe
   Return( Next_Get_Position = null )
end end_of_pipe
Get_Next( The_Data )
If End_of_Pipe
then Error -- Beyond end of pipe or Not opened
else The_Data <-- Next_Get_Position.Item
Next_Get_Position <-- Next_Get_Position.Next
end get_next
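A corresponding Ada sketch of the linked Insert, assuming Link, Node, and an access-based Pipe_Type along the lines of the linked queue package:

```ada
procedure Insert( The_Pipe : in Pipe_Type;
                  New_Data : in Data_Type ) is
   New_Node : constant Link := new Node'( Item => New_Data,
                                          Next => null );
begin
   if The_Pipe.Front = null then
      The_Pipe.Front := New_Node;      --First item in the pipe.
   else
      The_Pipe.Rear.Next := New_Node;  --Link after current rear.
   end if;
   The_Pipe.Rear := New_Node;
end Insert;
```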
Exercises
1. Develop algorithms using the pipe operators for each of the following:
a. search a pipe for a specified item,
b. print a list of the items in a pipe,
c. print the maximum item in a pipe, and
d. copy one pipe into another pipe.
3. Compare Exercises 1.a and 2.b. Both exercises do approximately the same thing, so what
are the pros and cons of the two approaches?
7. The representations above always assume the pipe has been opened when executing a
Get_Next operation. Discuss methods for determining if a pipe has been opened before execut-
ing the Get_Next operation.
4.5.2. Pipes and Filters
One of the goals of computer science is to develop simple means to solve complicated
problems. One way to do this is to break the problem into small pieces -- pieces small enough
that each individual piece is easily solvable. Even better is to break the problem into pieces
for which we already have subprograms. Then we only have to link the subprograms together.
The first question is: How can we link the subprograms together? There are several ways,
but a simple one is to let each subprogram input a pipe and output either a pipe or a single, scalar
value. This way the output of any subprogram can be used as input by any other such subpro-
gram. Subprograms with this property are called filters.
As a first example, consider the problem of inputting a set of prices and outputting only
those prices less than the average price. This requires inputting the prices, computing their
average and outputting those below the average. We could write a single program to do this, but
the following approach uses filters to produce the same result. (The filters are in boldface and
the pipes in italics. Each filter inputs the line above itself and outputs the line below itself.)

Input file of prices
   Input_Filter
Price_Pipe
   Average_of_Pipe
Average Price
   Less_Than_Filter
List of prices less than the average price
Each individual filter is fairly easy to develop. Let's start with the first filter and work our
way through developing the filters one at a time.
Since we will be using several pipes, we also assume each pipe operation includes a pipe
name as a parameter.
The first filter needs to input numbers from a file and output a pipe containing the numbers.
A possible algorithm is:
Input_Filter( Out_Pipe )
   Initialize
      Clear: Out_Pipe
   Repeat for each number in the input file ( while not end of file )
      Input Number
      Insert( Out_Pipe, Number )
   end repeat
Terminate: end
Average_of_Pipe ( In_Pipe )
Initialize
Open: In_Pipe
Count <-- 0
Sum <-- 0
Repeat for each item in input pipe (while not end of pipe)
Get_Next( In_Pipe, Item )
Count <-- Count + 1
Sum <-- Sum + Item
end repeat
Terminate
If Count /= 0
then Return ( Sum / Count )
else Error -- Empty Pipe
end
and, finally, the filter to output items less than a specified value:

Less_Than_Filter ( In_Pipe, Number )
   Initialize
      Open: In_Pipe
Repeat for each item in input pipe (while not end of pipe)
Get_Next( In_Pipe, Item )
If Item < Number then Output Item
end repeat
Terminate: end
Given these three filters, an algorithm to produce a list of prices less than the average price
is:
Input_Filter( Price_Pipe )
Average <-- Average_of_Pipe ( Price_Pipe)
Less_Than_Filter ( Price_Pipe, Average )
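As one illustration of moving from pseudocode to Ada, the averaging filter might be written as follows. The names Pipe_Type, Open, Get_Next, End_of_Pipe, and the exception Empty_Pipe are assumptions about a pipe package instantiated for Float values; Pipe_Type is assumed to be an access type, so the in parameter still permits Open to update the pipe:

```ada
Empty_Pipe : exception; --Raised when asked to average no items.

function Average_of_Pipe( In_Pipe : in Pipe_Type ) return Float is
   Item  : Float;
   Count : Natural := 0;
   Sum   : Float   := 0.0;
begin
   Open( In_Pipe );
   while not End_of_Pipe( In_Pipe ) loop
      Get_Next( In_Pipe, Item );
      Count := Count + 1;
      Sum   := Sum + Item;
   end loop;
   if Count = 0 then
      raise Empty_Pipe;
   end if;
   return Sum / Float( Count );
end Average_of_Pipe;
```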
Even though in this case we had to develop the three filters, they were straightforward and
required no real effort. Once they are available, they can, of course, be used with no extra effort
to solve other problems.
As a second example to illustrate the technique, assume we have a long text, say a book, and
we want to know how often the author used each word in the book. This essentially requires
inputting the book one word at a time and counting how often each word is used. We could
write a single program to do this, but consider the following approach using filters. (The filters
are in boldface and the pipes in italics. Each filter inputs the line above itself and outputs the
line below itself.)
Book
   Filter to input book one word at a time and output it into a pipe
Pipe of words
   Filter to sort the words in a pipe
Pipe of sorted words
   Filter to count the occurrences of each word in a sorted pipe
List of words and their counts
A total of three filters suffice to do the job and each individual filter is fairly easy to develop.
We start with the first filter and work our way through, developing the filters one at a time.
The first filter needs to input words from text, so assume the Text package of Appendix B
has been expanded to input a single word from normal textual material. (This can be done by
expanding the Text_Package.Get procedure of the package in Appendix B so that it skips
punctuation.) An algorithm then is:
Input_Text_Filter ( Out_Pipe )
   Initialize
      Clear: Out_Pipe
   Repeat for each word in the text ( while not end of file )
      Get( Word )
      Insert( Out_Pipe, Word )
   end repeat
Terminate: end
The sort filter is omitted at the moment because we have not yet covered sorting. But,
assuming an array implementation of a pipe, bubble sort can certainly be used to sort an array
and, hence, a pipe.
A filter to count the occurrences of each word in a sorted pipe compares each incoming word
to a saved word. If the two are equal, a count is incremented by one; if the two differ, then the
saved word and its count are output and the saved word is changed to the new word. An
algorithm is:
Count_Successive_Items_Filter ( In_Pipe )
Initialize
Open: In_Pipe
Get_Next( In_Pipe, Saved_Word)
Count <-- 1
Repeat for each remaining word in In_Pipe ( while not end of pipe )
Get_Next( In_Pipe, Word )
If Word = Saved_Word
then Count <-- Count + 1
else Output: Saved_Word, Count
Saved_Word <-- Word
Count <-- 1
end repeat
Terminate
Output: Saved_Word, Count
end
Given these three filters, an algorithm to count how often each word occurs is:

Input_Text_Filter ( Words_Pipe )
Sort_Filter ( Words_Pipe, Sorted_List )
Count_Successive_Items_Filter ( Sorted_List )
Note that, as promised, each individual filter is straightforward, yet the combination of filters
can solve some rather large problems.
This technique seems to have originated in the early days of computing and is widely used in
several areas. Many COBOL accounting and inventory systems consist of a large collection of
"filters" where each filter is a COBOL program which inputs one or more files and outputs one
or more files. The whole collection of programs used one after the other, in the proper order,
generates the desired results. In fact, many COBOL systems for running a whole company are
based on this technique. These COBOL systems use sequential files rather than pipes, but the
analogies between pipes and sequential files are obvious.
The gains are also obvious. One large, say hundreds of thousands of lines, program can be
replaced by a collection of relatively small, say thousands of lines, programs. Each program
inputs a sequential file, processes it in some way, and outputs a new sequential file which can be
used by later programs. This way each individual program is much simpler to write and hence
more likely to be correct. One large programming job can also be split into several small
programming jobs which can be assigned to different programmers. Since each programmer is
working on an independent piece, they can work independently. (This is an important gain:
coordinating several programmers all working on the same large program is a nightmare.) It is
also much easier to alter a small program than a large program. Business programs need continual
updating to correspond to new laws, changes in management procedures, and many other
items. If a change is restricted so that it only affects one small program, the new version of the
program will be available much sooner and with much less worry about inadvertent changes to
other parts of the system.
Another big user of this technique is the UNIX operating system. The UNIX operating
system includes hundreds of filters already built into the system. All of the filters mentioned
above, for example, are included as part of the UNIX system. This way we often don't even
have to write a single line of code -- the filter is already there for our use. The end result is that
we can solve many complicated problems by using available filters.
Some programmers might object at this point that a system based on filters is slow. As one
example, the system to count how often each word occurs in a text uses three filters and, there-
fore, must make at least three passes through the data. It is possible to design a special purpose
program that would replace the three passes by two passes (one to input the data and one to
output the results). Thus filters can be significantly slower than other approaches.
The response to this objection is that filters trade computer effort for programming effort,
machine time for human time. It is much easier and faster to write three filters than one specialized
program. The filter-based system will be easier to alter and update and more likely to be
correct. These are significant advantages. The extra computer time needed to execute the filter-based
system is often (but not always) a small price to pay for these advantages.
Exercises
1. Expand the filters in the text into working Ada subprograms and produce a list containing
each word in a book and the number of times the word is used.
2. Expand the sample problem in the text above so that it outputs only the ten most often used
words in the book.
3. Develop a set of filters and a program to produce an index for a book; that is, a list of words
in the book and the pages on which each word occurs.
5. Use the filters of the previous exercise and those filters developed in the text above to
process a list of student grades and output:
a. only grades below 70, or
b. only those grades greater than the class average, or
c. the average of those grades greater than the class average, or
d. only those grades below the class average,
e. only grades between 70 and 80, or
f. the highest grade.
6. Develop filter(s) to sort on various fields of a record when the records are stored in a pipe.
Use these filter(s) to process a file of basketball player data. Assume each record in the file
contains a player's name, the name of a game, and the number of points the player scored in that
game. The final output should be:
a. the maximum number of points scored by a player in a game,
b. the total number of points scored by all players in all games,
c. the total number of points scored by each player in all games,
d. the total number of points scored in each game by all the players.
7. Develop filter(s) and a program to produce a list of customers who owe money and the
average amount owed per customer. Assume there is an input file containing each customer's
name and amount owed.
8. Given a collection of movie titles and for each title, the stars in that movie, produce for each
star a list of the movies the star has starred in.
9. For any of the problems above, compare the final product to a single program to accomplish
the same result. Which is faster? Easier to write? Easier to update? Easier to understand?
What changes in the original problem statement would change your conclusions?
10. Develop a special purpose program to count how often each word is used in a text and
compare your program to the filter based system in the text.
Since pipes and queues are very similar, the resulting pipe packages are very similar to the
queue packages covered earlier in this chapter. Specification 4.7.1 is a specification of a generic
pipe with initialization.
Translating Algorithms 4.5.2.1 and 4.5.2.2 into the corresponding package bodies is
straightforward and left for the exercises.
Exercises
2. Exercise 1 above gives a total of 32 different possible Ada packages corresponding to either
one of these modules. Which ones can be used with Specification 4.7.1?
package Pipe_Package is
private --Declarations
type Pipe_Node; --Pipe data structure.
type Pipe_Pointer is access Pipe_Node;
Sets and Bags
Everyday life seems to be a continuous list of lists. We have grocery lists, class lists, inventory
lists, most wanted criminal lists, best seller lists, stolen car lists, and so on forever it seems. The
set and bag ADTs and their extensions are used to model many of these lists.
A set is any collection of items as long as the collection contains no duplicates. A bag is any
collection of items (duplicates are allowed). In both cases, sets and bags, the order of the items is
immaterial.
Some basic unary set/bag operations are Clear, Empty, Insert, Delete, Is_In, and Iterate
(binary set/bag operations are covered in Section 5.4):
5.1.1. Representation
So far all of our representations have been either array representations or linked
representations. Sets and bags are the first ADTs for which there is another possible representation, in
this case, a bit-mapped representation. This section presents all three representations: array,
linked, and bit-mapped.
The only difference between the representation of sets and bags is that before inserting
something in a set, we must first check and make sure the item is not already in the set.
Otherwise, the set and bag representations are the same. For this reason, we give here only the
representation of bags and leave the representation of sets for the exercises.
If the bag currently contains the three items Abe, Beth, and Cathy, the array might look like:
    1      2      3      4     5
  Beth    Abe    Cathy   ...
where the items are stored in the first three locations of the array. Since a bag is unordered, the
values can be stored in the array in any order.
A new item is simply inserted in the first available location. An algorithm to insert a new
item in a bag implemented using an array is:
If Array is full,
then Overflow error
else Size <-- Size + 1
Array( Size ) <-- Data
The last two algorithms, Delete and Is_In, depend upon another procedure that searches
an array for a specified item. There are several possible algorithms for this search. We shall
use:
Initialize
I <-- 1
Repeat for each entry in set/bag (while I < Size and Array( I ) /= Item)
I <-- I + 1
end repeat
Terminate
Found <-- ( Array( I ) = Item )
This algorithm returns two values: a Boolean value, Found, and, if the item is found, its
Location. It has a loop that is repeated at most once for each item in the set, giving an execution
time of O( Size ). This implies, by the way, that every routine using the search routine (including
the set insert routine) must have an execution time of at least O( Size ).
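As a concrete illustration, the insertion and sequential search above can be sketched in Ada roughly as follows (Bag_Demo, Store, and the use of Integer items are illustrative choices, not from the text; a real module would hide Store and Size inside a package body):

```ada
with Ada.Text_IO;

procedure Bag_Demo is
   Maximum_Size : constant := 100;
   Store : array (1 .. Maximum_Size) of Integer;
   Size  : Natural := 0;

   --  Insert a new item in the first available location: O( 1 ).
   procedure Insert (Data : Integer) is
   begin
      if Size = Maximum_Size then
         raise Constraint_Error;        --  Overflow error
      end if;
      Size := Size + 1;
      Store (Size) := Data;
   end Insert;

   --  Sequential search: O( Size ).
   procedure Find (Item : Integer; Location : out Natural; Found : out Boolean) is
      I : Positive := 1;
   begin
      while I < Size and then Store (I) /= Item loop
         I := I + 1;
      end loop;
      Found    := Size > 0 and then Store (I) = Item;
      Location := (if Found then I else 0);
   end Find;

   Loc : Natural;
   OK  : Boolean;
begin
   Insert (7);  Insert (3);  Insert (7);   --  duplicates allowed in a bag
   Find (3, Loc, OK);
   Ada.Text_IO.Put_Line (Boolean'Image (OK) & Natural'Image (Loc));
end Bag_Demo;
```

The `Size > 0 and then` guard in Find avoids reading an uninitialized array slot when the bag is empty, a case the pseudocode leaves implicit.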
Data Specification
Algorithms
Clear
Size <-- 0
end clear
Empty
Return( Size = 0 )
end empty
Full
Return( Size = Maximum_Size )
end full
Insert( New_Data )
If Full
then Overflow error
else Size <-- Size + 1
Array( Size ) <-- New_Data
end insert
Find( The_Data )
Initialize
I <-- 1
Repeat for each entry in set/bag (while I < Size and Array( I ) /= The_Data)
I <-- I + 1
end repeat
Terminate
Found <-- (Array( I ) = The_Data)
If Found
then Location <-- I
else Location <-- 0
end find
Is_In( The_Data )
If Empty
then Underflow error
else Find ( The_Data, Location, Found )
Return( Found )
end is_in
Iterate( Subprogram )
Repeat for each item in bag( for I = 1 to Size )
Execute Subprogram( Array(I) )
end iterate
Module 5.1.1.1.1 contains a complete module, including error checking, for a bag using an
array representation. The set representation differs from the bag representation in Module
5.1.1.1.1 only in the Insert algorithm; otherwise the two are identical.
The linked representation of a set/bag is very similar to the linked representation of a stack or
a queue. In this case, the items in the set or the bag are stored in a linked list with a pointer,
called Head, which points at the first item in the linked list. As usual, let each node in the linked
list contain two fields, the Item field (which holds one item) and the Next field which contains a
pointer to the next node in the linked list.
If the set/bag currently contains the three items Abe, Beth, and Cathy, the data structure
might look like:
Head --> Beth --> Abe --> Cathy Λ
where, as usual, the last node in the linked list contains a Λ in the Next field. Since, as before,
both sets and bags are unordered, the values can be stored in the linked list in any order.
Since the order of the items does not matter in a set or in a bag, all insertions are done at the
beginning of the linked list. In more detail, to insert a new data item we first create a new node
and insert the new node at the beginning of the list; that is,
Head <-- new Node( Item => New_Data, Next => Head)
Of course, before inserting a new item in a set it is first necessary to make certain that the
item is not already in the set.
To clear the set or the bag to empty, it suffices to:
Head <-- null
To determine if the set or bag is empty, we can use:
Return ( Head = null ).
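A minimal Ada sketch of the linked insertion, Clear, and Empty operations (illustrative names; the diagram's Λ is Ada's null, and storage reclamation is ignored):

```ada
with Ada.Text_IO;

procedure Linked_Bag_Demo is
   type Node;
   type Node_Pointer is access Node;
   type Node is record
      Item : Integer;
      Next : Node_Pointer;
   end record;

   Head : Node_Pointer := null;

   --  Insert at the beginning of the list: O( 1 ).
   procedure Insert (New_Data : Integer) is
   begin
      Head := new Node'(Item => New_Data, Next => Head);
   end Insert;

   procedure Clear is
   begin
      Head := null;                    --  storage reclamation ignored here
   end Clear;

   function Empty return Boolean is
   begin
      return Head = null;
   end Empty;

begin
   Insert (1);  Insert (2);
   Ada.Text_IO.Put_Line (Boolean'Image (Empty));  --  FALSE
   Clear;
   Ada.Text_IO.Put_Line (Boolean'Image (Empty));  --  TRUE
end Linked_Bag_Demo;
```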
Data items are deleted from a set/bag by the usual linked list scheme of pointing around the
data item to be deleted. To delete Abe from the set above, for example, we can alter the pointer
in the previous node (the node containing Beth) to point to Cathy instead of Abe. This means
we need a find routine which will return a pointer to the node in front of the node to be deleted.
A sample find routine to return a pointer to the previous node is:
Initialize
P <-- Head
Last <-- null
Repeat for each entry in set/bag (while (P /= null) and then (Data_Item /= P.Item))
Last <-- P
P <-- P.Next
end repeat
Terminate
Found <-- (P /= null)
Previous <-- Last
where Previous now points to the item preceding the specified item in the linked list. Note
the use of the "and then" in the repeat while of this algorithm. This insures that the comparison:
Data_Item /= P.Item
is not performed if P is equal to null. In other words, if the first comparison is false, the second
comparison is skipped.
A delete algorithm which uses this find routine to locate the item to be deleted is:
Delete( Old_Data )
If Empty
then Underflow error
else Find( Old_Data, Previous, Found )
If Found
then If Previous = null
then Head <-- Head.Next
else Previous.Next <-- Previous.Next.Next
else Not Found error
end delete
5.1.1.3. Bit-Mapped Representation
Sets and bags are the first ADTs for which there is another possible representation, in this
case, a bit-mapped representation. One of the important operations for sets and bags is the
search operation, Is_In. This is a slow, O( Size ), operation in the standard array and linked
representations because it requires a sequential search.
Assume for the moment that the set items are the simple integers 1, 2, ..., Maximum_Size.
Then to implement the set, we can set up a Boolean array with the subscripts 1, 2, ...,
Maximum_Size and let the corresponding position of the array be either False (the item is not in
the set) or True (the item is in the set). To determine if item k is in the set, it suffices to test
Array(k). To insert item k in the set, it suffices to set Array(k) to True.
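In Ada this set representation reduces to a Boolean array (a sketch with illustrative names; in a real module the array would be hidden inside a package):

```ada
with Ada.Text_IO;

procedure Bit_Set_Demo is
   Maximum_Size : constant := 100;
   In_Set : array (1 .. Maximum_Size) of Boolean := (others => False);
begin
   In_Set (42) := True;                                  --  Insert item 42: O( 1 )
   Ada.Text_IO.Put_Line (Boolean'Image (In_Set (42)));   --  Is_In item 42: O( 1 )
end Bit_Set_Demo;
```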
A bit-mapped representation of a bag uses an integer array where each array entry contains a
count of how many copies of this item are in the bag. To determine if item k is in the bag, it
Data Specification
Node is record:
Item : ??? --Holds item in bag, any data type.
Next : Pointer to Node; --Pointer to next node in bag.
end record
Algorithms
Clear
Head <-- null
end clear
Empty
Return( Head = null )
end empty
Insert( New_Data )
Head <-- new Node( Item => New_Data, Next => Head )
end insert
Terminate
Found <-- (P /= null)
Previous <-- Last
end find
Is_In( The_Data )
Find( The_Data, Previous, Found )
Return( Found )
end is_in
Delete( Old_Data )
If Empty
then Underflow error
else Find( Old_Data, Previous, Found )
--Remove data (if found) from bag
If Found
then If Previous = null
then Head <-- Head.Next
else Previous.Next <-- Previous.Next.Next
else Not Found error
end delete
Iterate( Subprogram )
Initialize
P <-- Head
suffices to test Array(k) for a value greater than 0. To insert item k in the bag, it suffices to
increment the value of Array(k) by one.
For both sets and bags, the search time is O(1), a significant improvement over the other two
representations.
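A sketch of the count-array bag in Ada (illustrative names; Constraint_Error stands in for the module's Not Found error):

```ada
with Ada.Text_IO;

procedure Bit_Bag_Demo is
   Maximum_Size : constant := 100;
   Count : array (1 .. Maximum_Size) of Natural := (others => 0);

   procedure Insert (K : Positive) is
   begin
      Count (K) := Count (K) + 1;      --  O( 1 )
   end Insert;

   function Is_In (K : Positive) return Boolean is
   begin
      return Count (K) > 0;            --  O( 1 )
   end Is_In;

   procedure Delete (K : Positive) is
   begin
      if Count (K) = 0 then
         raise Constraint_Error;       --  Not Found error
      end if;
      Count (K) := Count (K) - 1;      --  O( 1 )
   end Delete;
begin
   Insert (5);  Insert (5);            --  two copies of item 5
   Delete (5);
   Ada.Text_IO.Put_Line (Boolean'Image (Is_In (5)));  --  TRUE: one copy left
end Bit_Bag_Demo;
```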
The bit-mapped representations of Is_In and Delete are much faster than the array or linked
representations of the same two operators, but the opposite is true of the Clear and Empty
operators: both operations must process the whole array. The Clear operation must initialize
every item in the array:
Repeat for each item in the array (I = 1 to Maximum_Size)
Array( I ) <-- 0 (for bag) or false (for set)
end repeat
A simple version of the Empty operation must test every item in the array:
Initialize
Empty <-- true
A slightly faster version of the Empty operation exits the loop as soon as a non-empty
location is found; for example, a set version is:
Initialize
I <-- 0
There is a way to speed up the bit map representation of the Empty operation by using an
additional variable to store the current value of the size of the set or bag. The details are left for
an exercise at the end of this section, but, under the right circumstances, the speed gain can be
significant.
Obviously the bit-mapped representation can be used for any set or bag of consecutive
integers or enumerated variables. Just as obviously, the bit-mapped representation is limited to
only consecutive integers or enumerated variables; but this case does occur from time to
time, and the speed gain is so large that the representation is worth having. Module 5.1.1.3.1
describes this representation for enumerated variables with First and Last used for the minimum
and maximum enumerated variable.
Data Specification
Algorithms
Clear
Repeat for each item in array (I = First to Last)
Array( I ) <-- 0
end repeat
end clear
Empty
Initialize
I <-- First
Repeat for each item in array (while (Array( I ) = 0) and I < Last)
I <-- I + 1
end repeat
Terminate
Return( (Array( I ) = 0) )
end empty
Is_In( The_Data )
Return ( Array( The_Data ) > 0 )
end is_in
Insert( New_Data )
Array( New_Data ) <-- Array( New_Data ) + 1
end insert
Delete( Old_Data )
If Array( Old_Data ) = 0
then Not Found error
else Array( Old_Data ) <-- Array( Old_Data ) - 1
end delete
Iterate( Subprogram )
Repeat for each item in bag ( for I = First to Last )
For J = 1 to Array(I): Execute Subprogram( I )
end repeat
end iterate
5.1.2. Timing
The times required to execute the bag operations for the various representations are:
Operation              Representation
                 Array       Linked      Bit-Mapped
Clear            O( 1 )      O( 1 )      O( Maximum_Size )
Empty            O( 1 )      O( 1 )      O( Maximum_Size )
Insert           O( 1 )      O( 1 )      O( 1 )
Is_In            O( Size )   O( Size )   O( 1 )
Delete           O( Size )   O( Size )   O( 1 )
The times to execute the set operations for the various representations are the same as those
for the bag except for the insertion operation.
Operation              Representation
                 Array       Linked      Bit-Mapped
Insert           O( Size )   O( Size )   O( 1 )
Every conclusion reached about the comparison of array and linked representations of stacks
and queues remains true for sets and bags. In particular:
- The big O execution times of the two representations are equal. The only
exception is that the time to get a new node in the linked representation of the
insert operation is significant. Again speedup techniques are possible,
but the array representation is considerably faster in general.
- The space utilization of the two representations depends upon several factors,
but, in general, the linked representation uses less space.
- The linked representation is the most flexible.
The bit-mapped representation:
- Is, in general, the fastest of the three representations.
- Must set aside space for every possible value in the set, so, in general, it uses
the most space.
- Is very inflexible and limited to only certain kinds of sets and bags; to be
more precise, it is limited to sets and bags of enumerated items.
Section 5.3 considers how the situation changes when ordered implementations are used.
The analysis is left for that section, but, for the moment, note that sequential array or
linked implementations of sets and bags are not necessarily the best or fastest method.
Exercises
2. Assuming the items in a bag are integers, write a function algorithm whose value is:
a. the sum of the items,
b. the sum of the positive items,
c. the count of the items,
d. the count of the positive items,
e. the average of the items,
f. the maximum of the items,
g. the minimum of the items,
h. the product of the items,
i. the number of times a given item occurs,
j. true if and only if the bag contains duplicate items, or
k. a count of the number of duplicate items
in the bag where the bag is implemented using:
A. an array representation,
B. a linked representation, or
C. a bit-mapped representation.
Assume each operation is implemented inside the bag module. For each algorithm, give the
execution time, in terms of big O. How would these functions have to be altered to work on a
set?
5. Rewrite the bit map representation to include a variable to count the total number of items in
the bag.
a. What is the speed of this new representation?
b. What is the cost of this speed gain?
6. The functions in Exercise 2 all require processing every item in the bag. We assumed in
Exercise 2 that these functions were included in the bag module. Redesign the functions
assuming the bag module contains Open, Get_Next, and End_of_Bag operations corresponding
to the similar operations in a pipe. Compare the two approaches.
7. Another way to implement the delete operation is to include a delete field in each entry. The
field is true if and only if the item is in the set or bag. Develop algorithms for clear, empty,
insertion, deletion, and is_in operations based upon this approach assuming:
a. an array, or
b. a linked
representation. What are the pros and cons of this approach?
8. Should the Is_In operator raise an exception or return a false when the set is empty? Why?
9. The bag representations in the text assume that duplicate items are stored in separate
locations. It is possible to add a count field to each location so that instead of adding a new
entry for a duplicate item, only the count field is incremented.
a. Develop algorithms for this version.
b. Compare this new representation to the previous representation for speed,
space requirements, and flexibility.
Sets and bags are often used any place that a list is used in everyday life. The following
example illustrates such a case.
Example 5.2.1. A school wants a computer program to manage a class list. To be more
precise, the school wants to be able to add names to a class list, drop names from the class list,
and produce an up to date class list upon demand. Develop an algorithm for this problem.
The first step in the solution is to specify the input in more detail. The following user
commands seem adequate for the job:
Initialize
Clear the set
More-to-do <-- true
Terminate: end
It is a minor modification to the above algorithm to save the student's address and major as
well as his/her name. Adding more fields to the item in a set is mostly a matter of extending the
set items from a single scalar value to a record with multiple fields. If the set is implemented as
a generic package, this extension is straightforward.
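For instance, assuming a generic set package with a private Item_Type formal, as in the generic packages of earlier chapters, the extension is just a different instantiation (the toy Set_Package below is a stand-in so the sketch compiles on its own):

```ada
procedure Class_Demo is
   --  A toy stand-in for the book's generic set package; only the
   --  generic formal part matters for this illustration.
   generic
      type Item_Type is private;
   package Set_Package is
      procedure Insert (New_Data : Item_Type);
   end Set_Package;

   package body Set_Package is
      procedure Insert (New_Data : Item_Type) is
      begin
         null;   --  representation details as in Module 5.1.1.1.1
      end Insert;
   end Set_Package;

   --  Extending the set items from a single scalar to a record:
   type Student is record
      Name  : String (1 .. 20);
      Major : String (1 .. 20);
   end record;

   package Class_List is new Set_Package (Item_Type => Student);
begin
   Class_List.Insert ((Name => (others => ' '), Major => (others => ' ')));
end Class_Demo;
```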
Sets and bags have a limited set of operations and, hence, they are limited to problems that
require only this small set of operations. However, many more problems can be solved by adding
one or more operations. Say, for example, that we wish to add a command to the above
example to tell us how many students are currently in the class. This requires adding one
operation (an operation to return the current size of the set or bag) to the set or bag package and
adding one line to the above algorithm.
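A sketch of that one-operation extension in Ada (illustrative names; the storage of the set items is omitted so only the added Current_Size operation shows):

```ada
procedure Roster_Demo is
   package Class_Set is
      procedure Insert (Name : String);
      function Current_Size return Natural;   --  the added operation
   end Class_Set;

   package body Class_Set is
      Size : Natural := 0;
      --  Storage of the names is omitted; only the size extension is shown.
      procedure Insert (Name : String) is
      begin
         Size := Size + 1;   --  a real set would first check for duplicates
      end Insert;

      function Current_Size return Natural is
      begin
         return Size;
      end Current_Size;
   end Class_Set;
begin
   Class_Set.Insert ("Abe");
   Class_Set.Insert ("Beth");
   pragma Assert (Class_Set.Current_Size = 2);
end Roster_Demo;
```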
Some problems require adding more than one operation to the set or bag, but the basic
principle is still the same regardless of the number of additional operations. If the operations are
carefully chosen, the additions to the set or bag package are straightforward. Such ADTs, for
obvious reasons, are called extended sets and extended bags.
The following problem illustrates an extended set.
Example 5.2.2. Assume a video store wants to keep track of how many times each tape is
rented.
This problem requires a set (why not a bag?) where each item in the set has two fields: the
name of the tape and a count of how many times the tape has been rented. Let the basic user
commands for the system be:
Note that the update operation is fairly dependent upon this particular application. While it is
possible to write a generic package including such an update capability, it is easier and more
straightforward to develop a non-generic package designed for this specific problem.
A possible main algorithm is then (the square brackets around [Name,0] indicate a record
containing these two values):
Initialize
Clear the set
More-to-do <-- true
Terminate: end
Note that the main algorithm is almost identical to the previous one; the major change is
including an update operation in the set package.
To understand the effects of changes in the problem on the choice of the ADT, let us
reconsider the tape store problem with some changes.
To begin with, the store now wants to keep track of whether each tape is in the store or
rented at the moment. To do this, we add a new field to each record; this field, called In_Out,
has only two possible values "in" or "out." We also need to add a new operation, RETURNED,
so the resulting user operations are:
The set package now needs two update operations, one for a RENTED update and one for a
RETURNED update. The most reasonable way to do this, as before, is to forget the use of a
generic package and develop an extended set package for this particular problem. Design, for
example, a package to include the operations:
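The operation list itself appeared in a figure; a package specification along the lines the text describes might look like the following sketch (all names here are illustrative, not the book's):

```ada
package Tape_Set is
   type Status is (In_Store, Rented);

   procedure Insert   (Title : String);   --  new tape: count 0, In_Store
   procedure Rented   (Title : String);   --  count + 1, status set to Rented
   procedure Returned (Title : String);   --  status set back to In_Store
   procedure List;                        --  print every tape, count, status
end Tape_Set;
```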
The main algorithm to use this extended set package is a minor alteration of the previous one
and left for the reader.
The main point of this last example is that when a program needs more than one search,
update, etc. from a set or bag, it is best to ignore any generic packages and develop an extended
set package for the problem at hand. This new package, of course, is a minor alteration of the
usual set package, but it is conceptually a new data type and its representation is a new package.
Exercises
1. Develop an algorithm to manage a customer list. Customers can be added and dropped and
upon command the system will produce a customer list.
2. Develop an algorithm to produce a list of all the words used in a given text.
4. Develop an algorithm to count the votes in an election. Assume that each vote consists of
one candidate's name.
5. A bank would like a program to keep track of the current balance for each of their customers.
The program must be able to alter the balance each time a customer makes a deposit or
withdrawal and upon request it must list all customers with less than $100 in their account.
6. The telephone company charges $0.10 per telephone call. Develop a program that will keep
track of the amount owed by each customer and which will upon command produce a bill for
each customer.
7. Given a data set where each item contains a student name and the name of a course that the
student is taking, develop a simple algorithm that will produce either all the students taking a
given course or all the courses taken by a given student.
8. A company would like a program that will upon request return the name of the owner of a
given piece of property or a list of all the property owned by a given person.
9. Develop a family tree program. One type of input entry consists of a person's name and the
names of the person's parents. The other type of input consists of requests for the names of the
parents, children, etc. of a given person.
For every ADT so far we have concentrated upon finding a representation that used the array
or linked structure and largely ignored the question of efficiency. Since the stack and the queue
primarily insert and retrieve items, there is limited room for improvement. (Although using a
circular array rather than a straight array for the queue makes a significant speed improvement.)
The set ADT, however, may have significant improvement possibilities in more than one
operation.
Three set operations, Insert, Delete, and Is_In, require a search of the set. Binary search is
much faster than sequential search, so using a data structure which allows binary search might
speed up all three of these operations. There are, however, some questions to consider first.
1. First, should the set be kept sorted all of the time or does it suffice to sort the
set before the search so that a binary search can be performed? The time to
do a sort, a minimum of O( Size * log2Size ), is clearly much longer than the
time to do a sequential search, so sorting only before a search is not
reasonable.
2. Second, keeping the items in the array sorted so that the binary search is
possible without first having to sort the items in the array raises a representation
question. Some might object that the items in a set are unordered so the
representation should store the items in some unordered fashion. The fact that
the items in a set are unordered, however, implies the implementation can
store the items in the array any way that is convenient.
3. If the items are always kept sorted in the array, then it is necessary to determine
if this has any effect on the other operations. Certainly the insertion
routine will have to insert a new item in the correct location in the array and
the deletion routine will have to preserve the order of the items.
To compare the ordered and unordered array representations, it is necessary to develop
algorithms and timing estimates for all of the operations and then to compare the results.
Let us start with the Find routine from Module 5.1.1.1.1 and change it to a binary search:
Find( The_Data )
Initialize
Low <-- 1
High <-- Size
Found <-- false
Repeat until found or no items remain (while Low <= High and not Found)
Middle <-- ( Low + High ) / 2
If The_Data < Array( Middle )
then High <-- Middle - 1
else If The_Data > Array( Middle )
then Low <-- Middle + 1
else Found <-- true
end repeat
Terminate
If Found
then Location <-- Middle
else Location <-- 0
end find
Note that this routine works correctly even when Size = 0. Its execution time is O( log2 Size ),
which is much faster than the O( Size ) execution time of a sequential search.
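The binary search Find translates to Ada roughly as follows (a self-contained sketch; Store and Size would normally live in the package body):

```ada
with Ada.Text_IO;

procedure Ordered_Set_Demo is
   Store : constant array (1 .. 5) of Integer := (2, 3, 5, 7, 11);
   Size  : constant Natural := 5;

   --  Binary search: O( log2 Size ), correct even when Size = 0.
   procedure Find (Item : Integer; Location : out Natural; Found : out Boolean) is
      Low    : Natural := 1;
      High   : Natural := Size;
      Middle : Natural := 0;
   begin
      Found := False;
      while Low <= High and not Found loop
         Middle := (Low + High) / 2;
         if Item < Store (Middle) then
            High := Middle - 1;
         elsif Item > Store (Middle) then
            Low := Middle + 1;
         else
            Found := True;
         end if;
      end loop;
      Location := (if Found then Middle else 0);
   end Find;

   Loc : Natural;
   OK  : Boolean;
begin
   Find (7, Loc, OK);
   Ada.Text_IO.Put_Line (Boolean'Image (OK) & Natural'Image (Loc));  --  TRUE 4
end Ordered_Set_Demo;
```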
Using this Find routine, the Is_In routine:
Is_In ( Item )
Find( Item, Location, Found)
Return Found
end
Insert ( New_Data )
Initialize
I <-- Size
Array( 0 ) <-- New_Data
Repeat for each item greater than New_Data (while New_Data < Array( I ) )
Array( I+1 ) <-- Array( I )
I <-- I - 1
end repeat
Terminate
If Array( I ) = New_Data
then Error -- Duplicate entry
else Array( I + 1 ) <-- New_Data
Size <-- Size + 1
end insert
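In Ada the same insertion might be sketched as below (illustrative names; note the extra I > 0 test, which keeps the sentinel in slot 0 from being reported as a duplicate when the new item is the smallest in the set):

```ada
with Ada.Text_IO;

procedure Ordered_Insert_Demo is
   Maximum_Size : constant := 100;
   Store : array (0 .. Maximum_Size) of Integer;   --  slot 0 is the sentinel
   Size  : Natural := 0;

   procedure Insert (New_Data : Integer) is
      I : Natural := Size;
   begin
      Store (0) := New_Data;                       --  sentinel stops the loop
      while New_Data < Store (I) loop
         Store (I + 1) := Store (I);               --  shift larger items up
         I := I - 1;
      end loop;
      if I > 0 and then Store (I) = New_Data then
         raise Constraint_Error;                   --  Duplicate entry
      end if;
      Store (I + 1) := New_Data;
      Size := Size + 1;
   end Insert;
begin
   Insert (5);  Insert (2);  Insert (9);
   for I in 1 .. Size loop
      Ada.Text_IO.Put (Integer'Image (Store (I)));
   end loop;
   Ada.Text_IO.New_Line;                           --  prints  2 5 9
end Ordered_Insert_Demo;
```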
This insertion algorithm leaves an empty slot in the array if the value of New_Data is a
duplicate of an item already in the set. We could add a check for duplicate data to the Terminate
portion of the algorithm and, if duplicate data is entered, move all the items back down into
their original positions before the insertion started, but this requires an additional O( Size )
time to execute. A faster scheme is to first use a binary search of the array to determine if the
value of New_Data is a duplicate item. If it is not a duplicate, then this insertion algorithm is
performed. The total execution time then is:
O( log2 Size ) if New_Data is a duplicate, and
O( log2 Size ) + O( Size ) if New_Data is not a duplicate.
This execution time is faster than the unordered array when a duplicate is found, but slower
when the value of the data is not a duplicate. This suggests a tradeoff that needs some thought.
When the set is small, say Size is five or ten, the extra search time matters in
theory but not in practice. When the set is larger, say Size is equal to
1000, the extra search time is lost in the overall insertion time.
Recall that the big O function is actually only defined in the limit; thus, in the limit:
O( log2 Size ) + O( Size ) = O( Size ).
Thus, the overall insertion time, even using a binary search to check for duplicates, is still:
O( Size ).
In other words, keeping the list ordered does not alter the big O execution time of the
insertion operation.
Similarly, a deletion operation which always keeps the array ordered requires moving all of
the data to cover the location of the item being deleted. The basic idea (ignoring errors for the
moment) is:
Delete( Old_Data )
Find( Old_Data, Location, Found )
If not Found
then Error
else --move data to cover deleted item
For I = Location to (Size-1)
Array( I ) <-- Array( I+1 )
end for
Size <-- Size - 1
The execution time of the Find routine is O( log2 Size ) and the execution time of the for loop
is O( Size ), so the total execution time is O( Size ), the same as for the unordered
implementation.
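A sketch of this deletion in Ada (illustrative names; for brevity the binary search Find is replaced here by a simple sequential scan, which does not change the O( Size ) total):

```ada
with Ada.Text_IO;

procedure Ordered_Delete_Demo is
   Maximum_Size : constant := 10;
   Store : array (1 .. Maximum_Size) of Integer := (2, 5, 9, others => 0);
   Size  : Natural := 3;

   procedure Delete (Old_Data : Integer) is
      Location : Natural := 0;
   begin
      for I in 1 .. Size loop          --  Find; the text uses a binary search
         if Store (I) = Old_Data then
            Location := I;
            exit;
         end if;
      end loop;
      if Location = 0 then
         raise Constraint_Error;       --  Not Found error
      end if;
      for I in Location .. Size - 1 loop   --  move data to cover the gap
         Store (I) := Store (I + 1);
      end loop;
      Size := Size - 1;
   end Delete;
begin
   Delete (5);
   for I in 1 .. Size loop
      Ada.Text_IO.Put (Integer'Image (Store (I)));
   end loop;
   Ada.Text_IO.New_Line;               --  prints  2 9
end Ordered_Delete_Demo;
```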
The clear operation is:
Size <-- 0
and the empty function is:
Return ( Size = 0 ).
In other words, these two operations are the same as before.
A detailed module is in Module 5.3.1.1.
Data Specification
Algorithms
Clear
Size <-- 0
end clear
Empty
Return( Size = 0 )
end empty
Full
Return( Size = Maximum_Size )
end full
Insert( New_Data )
If Full then Overflow error
Find ( New_Data, Location, Found )
If Found, then raise Duplicate Error
Initialize
I <-- Size
Array( 0 ) <-- New_Data
Repeat for each item greater than New_Data (while New_Data < Array( I ) )
Array( I+1 ) <-- Array( I )
I <-- I - 1
end repeat
Terminate
Array( I + 1 ) <-- New_Data
Size <-- Size + 1
end insert
Iterate( Subprogram )
Repeat for each item in set ( for I = 1 to Size )
Execute Subprogram( Array(I) )
end repeat
end iterate
Ordered Array Representation of a Set
Module 5.3.1.1
Terminate
If Found
then Location <-- Middle
else Location <-- 0
end find
Is_In( The_Data )
Find ( The_Data, Location, Found )
Return( Found )
end is_in
Operation          Implementation
              Unordered     Ordered
Clear         O( 1 )        O( 1 )
Empty         O( 1 )        O( 1 )
Insert        O( Size )     O( Size )
Delete        O( Size )     O( Size )
Is_In         O( Size )     O( log2 Size )
Clearly, the only significant speed difference is in the Is_In routine; the other routines have
the same big O times as before. In fact, Insert and Delete are a bit slower than before
because both now include a binary search that was not needed before. In other words, the ordered
array representation is probably only useful if the user makes a significant number of searches
for items; that is, uses the Is_In operation more often than the Insert and Delete operations.
Using an ordered implementation for a bag raises the same set of questions and requires the
same kind of analysis, algorithm development, and timing estimates for all of the operations.
Assuming the items in the array are kept ordered, we can use the same Find routine and Is_In
routine as the ones used above for the set implementation. (Why can the same Is_In routine be
used?)
The insertion routine is a bit different because it is no longer necessary to check for
duplicates. An algorithm is:
Initialize
I <-- Size
Array( 0 ) <-- New_Data
Repeat for each item greater than New_Data (while New_Data < Array( I ) )
Array( I+1 ) <-- Array( I )
I <-- I - 1
end repeat
Terminate
Array( I + 1 ) <-- New_Data
Size <-- Size + 1
end insert
The execution time of this algorithm is obviously O( Size ), which is much slower than the O( 1 )
required for the unordered implementation.
Since deletion requires finding the item to be deleted, the deletion routine is the same as for
the ordered set representation, and its execution time is O( Size ), which is the same as the
execution time for an unordered representation.
The Clear and Empty operations are the same as before.
The overall timings then are:
Operation          Implementation
              Unordered     Ordered
Clear         O( 1 )        O( 1 )
Empty         O( 1 )        O( 1 )
Insert        O( 1 )        O( Size )
Delete        O( Size )     O( Size )
Is_In         O( Size )     O( log2 Size )
This table shows the tradeoffs made when a bag is implemented using an ordered array rather
than an unordered one. When the value of Size is less than, say, ten, the two are about the same,
but as the magnitude of Size increases, the Insert operation becomes significantly slower and the
Is_In operation significantly faster. Which version is faster overall depends upon the number of
Insert invocations versus the number of Is_In invocations.
Comparing the results for both the set and the bag implementations, it is clear that, based
upon the operations above, neither the ordered nor the unordered implementation is always
significantly faster than the other. It depends upon how the ADT is to be used, the size
of the set or bag, and how often each operation is invoked.
The same kind of comparison can be done for the linked representation of sets and bags. Here
one assumes the items are stored in order in the linked nodes. Unfortunately, a binary search of a
linked list is not possible, so a sequential search is still necessary even though the items are
ordered. The only possible speed gain is that, if the item is not in the set or bag, the search can
stop as soon as it finds an item greater than the sought item.
Since insertion and deletion in a linked list require a pointer to the node before the one
containing the entry, a possible find algorithm is:
Initialize
P <-- Head
Last <-- Λ
Repeat for each entry in set/bag (while (P /= Λ) and then (P.Item < The_Data))
Last <-- P
P <-- P.Next
end repeat
Terminate
If P = Λ
then Found <-- false
else Found <-- ( The_Data = P.Item )
Previous <-- Last
end find
The big O execution time of this algorithm is O( Size ), but, on the average, it only searches half
of the linked list before it either finds the sought item or determines that the item is not in the list
and quits.
The Is_In, Clear, and Empty algorithms are the same as those used in the unordered repre-
sentation of a set, so the execution times for these operations are the same as before.
Ignoring for the moment possible errors, one possible insertion routine is:
Insert( New_Data )
Find( New_Data, Previous, Found )
If Previous = Λ
then Head <-- new Node( Item => New_Data, Next => Head )
else Previous.Next <-- new Node( Item => New_Data, Next => Previous.Next )
end insert
The execution time is the search time, O( Size ) plus the insertion time, O( 1 ), giving a total
execution time of O( Size ). In other words, the big O execution time is still O( Size ), but the
actual execution time is halved when the item is not in the set because the search can stop as
soon as it finds an item greater than the sought item.
Similarly ignoring possible errors for the moment, a possible deletion routine is:
Delete( Old_Data )
Find( Old_Data, Previous, Found )
If Previous = Λ
then Head <-- Head.Next
else Previous.Next <-- Previous.Next.Next
end delete
Again, the execution time is essentially determined by the search time, so the execution time
is O( Size ).
These results are combined in Module 5.3.2.1.
The following table compares the two implementations for execution speeds.
Operation          Implementation
              Unordered     Ordered
Clear         O( 1 )        O( 1 )
Empty         O( 1 )        O( 1 )
Insert        O( Size )     O( Size )
Delete        O( Size )     O( Size )
Is_In         O( Size )     O( Size )
While the big O values are the same, the ordered search is, on the average, twice as fast as
the unordered search.
The same analysis can be done for a bag implementation comparing ordered and unordered
linked representations. The only difference is that the bag does not need to check for duplicates
before inserting a new item. The big O execution times in this case are the same for both
representations.
A detailed module is in Module 5.3.2.1.
5.3.3. Timing
The times required to execute the set operations for the various representations are summa-
rized in the following table.
Operation      Ordered Array       Ordered Linked
Clear          O( 1 )              O( 1 )
Empty          O( 1 )              O( 1 )
Insert         O( Size )           O( Size )
Delete         O( Size )           O( Size )
Is_In          O( log2 Size )      O( Size )
Data Specification
Node is record:
Item : ??? --Holds item in set, any data type.
Next : Pointer to Node; --Pointer to next node in set.
end record
Algorithms
Clear
Head <-- Λ
end clear
Empty
Return( Head = Λ )
end empty
Insert( New_Data )
Find( New_Data, Previous, Found )
If Found
then Duplicate error
else If Previous = Λ
then Head <-- new Node ( Item => New_Data,
Next => Head )
else Previous.Next <-- new Node ( Item => New_Data,
Next => Previous.Next )
end insert
Iterate( Subprogram )
    Initialize
        P <-- Head
    Loop while P ≠ Λ
        Subprogram( P.Item )
        P <-- P.Next
end iterate
Find( The_Data, Previous, Found )
    Initialize
        Last <-- Λ
        P <-- Head
    Loop while P ≠ Λ and P.Item < The_Data
        Last <-- P
        P <-- P.Next
    Terminate
        If P = Λ
            then Found <-- false
            else Found <-- ( The_Data = P.Item )
        Previous <-- Last
end find
Is_In( The_Data )
Find( The_Data, Previous, Found )
Return( Found )
end is_in
Delete( Old_Data )
If Empty
then Underflow error
else Find( Old_Data, Previous, Found )
--Remove data (if found) from set
If Found
then If Previous = Λ
then Head <-- Head.Next
else Previous.Next <-- Previous.Next.Next
else Error -- Not Found
end delete
For the first time, there is a significant difference in the big O times between the array and
linked representations. The array representation is significantly faster at searching than the
linked representation. Comparing these results to those for the unsorted representations given in
Section 5.1, we can conclude:
- If it can be used, the bit-mapped representation is the fastest regardless of the
operations to be executed. If a bit-mapped representation cannot be used,
then the fastest method depends upon the operations to be executed. In
particular, the fastest method depends upon the ratio of the number of
insertions to the number of searches. An ordered array is significantly faster if the
number of searches exceeds the number of insertions. If the number of
insertions is greater than the number of searches for the general set or bag, then the
unsorted array representation is fastest. (The exact breakpoint between the
ordered and unsorted methods is examined in more detail in the chapter on
searching.)
- The linked representation probably uses the least amount of space.
- The linked representation is still the most flexible.
Exercises
1. Develop a table comparing the timing estimates for each possible representation of a bag.
2. The bag representations in the text assume that duplicate items are stored in separate
locations. It is possible to add a count field to each location so that, instead of adding a new
entry for a duplicate item, only the count field is incremented. Develop algorithms for this
version assuming the bag entries are stored in lexicographical order in:
a. an array or
b. a linked structure.
Compare this new representation to the previous representation for speed, space requirements,
and flexibility.
3. Consider an extended set to solve the problem given in Example 5.2.2. Develop
algorithms and timing estimates comparing ordered and unordered array implementations of
such an extended set.
The advantages of generic packages should be obvious by now. There is, however, another
feature we need to cover --- a feature which greatly expands the capability of generic packages.
Their very generality limits generic package bodies to a few basic operations, such as arithmetic
or comparisons. We do not have a method of applying general, data type dependent, functions
or procedures inside the package body.
The ordered set package furnishes an excellent example of the problem. To keep items
ordered, there must be some way to compare two items for order; that is, a less than operator to
determine if a < b. Recall that private types can only use assignment and test for equality;
limited private types do not even allow these kinds of tests. What is needed is some way of
telling a package exactly what function to use to compare two items. One way is to pass the
comparison function (a Boolean valued function to determine if one item is less than another) as
a parameter in the operation invocation. If only one operation uses the comparison, this is
reasonable. When three set operations, insertion, deletion, and search, all need to do compari-
sons and they all need to use the same comparison function, this method is unreasonable. A
more reasonable solution in this case is to make the comparison function one of the generic
parameters so that all three operations can use the same comparison function. This also elimi-
nates the need to worry about the user/client invoking different operations with different
comparison operators.
This section presents the use of functions or procedures as generic parameters. Generic
function or procedure parameter declarations are included along with all of the other generic
parameters at the beginning of the package specification between the key words generic and
package.
Generic function and procedure declarations all start with the key word with; thus, assuming
Sample is a procedure, the declaration:
with procedure Sample( Item : in Data_Type );
specifies that Sample will be a generic parameter procedure which the user can specify when the
package is instantiated. Similarly the declaration:
with function "<" ( Left, Right : in Data_Type ) return Boolean;
specifies that < is a function used as a generic parameter; one that will be specified at
instantiation time.
Specification 5.4.1 contains a sample ordered set package with a less than generic parameter.
The remainder of the package specification is standard. The body of the package is given in
Program 5.4.1; it is a routine coding of Module 5.3.1.1. Note how the less than function is used
in the Find just as if it were a normal less than function.
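Before looking at the full package, the idea can be seen in miniature. The following sketch declares a generic function with a comparison function as a generic parameter and uses it in the body; the names Smallest and Item are invented for the illustration and do not appear in the text's packages.

```ada
generic
   type Item is private;
   with function "<" ( Left, Right : Item ) return Boolean;
function Smallest ( A, B : Item ) return Item;

function Smallest ( A, B : Item ) return Item is
begin
   if A < B then        --Uses whatever "<" the client supplied.
      return A;
   else
      return B;
   end if;
end Smallest;
```

A client might instantiate it as, say, function Min is new Smallest ( Integer, "<" ); every call to Min then uses the supplied comparison.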
generic
    type Data_Type is private;
    with function "<" ( Left, Right : in Data_Type ) return Boolean;
    Maximum_Size : in Positive;
package Ordered_Set_Package is
type Procedure_Access_Type is
access procedure ( Item : in Data_Type );
private
type Set_Node;
type Set_Pointer is access Set_Node;
type Set_Node is
record
Size : Subscript := 0; --Number of items in set.
Data_Array : Data_Array_Type; --Items in set.
end record;
-------------------------------------------------------------
-------------------------------------------------------------
procedure Initialize( The_Set_Name : in out Set_Type ) is
begin
The_Set_Name.Ptr := new Set_Node;
end Initialize;
---------------------------------------------------------------
procedure Clear( The_Set_Name : in out Set_Type ) is
begin
The_Set_Name.Ptr.Size := 0;
end Clear;
---------------------------------------------------------------
function Empty( The_Set_Name : in Set_Type ) return Boolean is
begin
return The_Set_Name.Ptr.Size = 0;
end Empty;
-------------------------------------------------------------
function Full( The_Set_Name : in Set_Type ) return Boolean is
begin
return The_Set_Name.Ptr.Size = Maximum_Size;
end Full;
Package Body of an Ordered Set Package with Initialization
Implemented using an Array
Program 5.4.1
-------------------------------------------------------------
procedure Find( The_Set_Name : in     Set_Type;
                The_Data     : in     Data_Type;
                Location     :    out Subscript;
                Found        :    out Boolean ) is
    Bottom, Top, Middle : Subscript := 0;
    Temp_Found          : Boolean;
begin
--Initialize
Bottom := 1;
Top := The_Set_Name.Ptr.Size;
Temp_Found := false;
--Loop: binary search, using the generic "<" to compare items.
while Bottom <= Top and not Temp_Found loop
    Middle := ( Bottom + Top ) / 2;
    if The_Data < The_Set_Name.Ptr.Data_Array(Middle) then
        Top := Middle - 1;
    elsif The_Set_Name.Ptr.Data_Array(Middle) < The_Data then
        Bottom := Middle + 1;
    else
        Temp_Found := true;
    end if;
end loop;
--Terminate
if Temp_Found
then Location := Middle;
else Location := 0;
end if;
Found := Temp_Found;
end Find;
-------------------------------------------------------------
procedure Insert( The_Set_Name : in out Set_Type;
                  New_Data     : in     Data_Type ) is
    I : Subscript;
begin
--Test for exceptions
if Full (The_Set_Name) then
    raise Set_Overflow;
end if;
--Initialize
I := The_Set_Name.Ptr.Size;
The_Set_Name.Ptr.Data_Array(0) := New_Data;  --Sentinel stops the loop.
--Loop: shift each larger item up one position.
while New_Data < The_Set_Name.Ptr.Data_Array(I) loop
    The_Set_Name.Ptr.Data_Array(I + 1) := The_Set_Name.Ptr.Data_Array(I);
    I := I - 1;
end loop;
--Terminate
The_Set_Name.Ptr.Data_Array(I + 1) := New_Data;
The_Set_Name.Ptr.Size := The_Set_Name.Ptr.Size + 1;
end Insert;
-------------------------------------------------------------
function Is_In( The_Set_Name : in Set_Type;
                The_Data     : in Data_Type ) return Boolean is
    Location : Subscript;
    Found    : Boolean;
begin
Find( The_Set_Name, The_Data, Location, Found);
return Found;
end Is_In;
-------------------------------------------------------------
procedure Delete( The_Set_Name : in out Set_Type;
                  Old_Data     : in     Data_Type ) is
    Location : Subscript;
    Found    : Boolean;
begin
--Find location of item in array.
Find( The_Set_Name, Old_Data, Location, Found);
if not Found then
    raise Item_Not_in_Set;
end if;
--Loop: close the hole left by the deleted item.
for I in Location .. The_Set_Name.Ptr.Size - 1 loop
    The_Set_Name.Ptr.Data_Array(I) := The_Set_Name.Ptr.Data_Array(I + 1);
end loop;
--Terminate
The_Set_Name.Ptr.Size := The_Set_Name.Ptr.Size - 1;
end Delete;
-------------------------------------------------------------
end Ordered_Set_Package;
To instantiate this package, the user or client will have to specify the < function to be used.
Some data types, such as integer and float, already include a less than function, but the exact
meaning of < for a user defined data type depends upon the data type. The data type designer
will have to include a less than function in the data type package. It is not even necessary that the
data type actually have a "natural" less than function. Some data types, such as a list of chores,
may not have a "natural" ordering, but one might define one for search purposes, say alphabetic
order or an order based upon priority. All user defined less than functions will need to be
defined and programmed by the user.
To illustrate using Specification 5.4.1, we can instantiate a set with a user defined data type,
say Course_Data_Type, by a statement like:
with Course_Package;
with Ordered_Set_Package;
package Schedules is
new Ordered_Set_Package(
Data_Type => Course_Package.Course_Data_Type,
"<" => Course_Package.Course_Less_Than,
Maximum_Size => 10);
where the three actual generic parameters correspond to the formal generic parameters in the set
specification. To be precise:
- the formal parameter Data_Type is now Course_Data_Type,
- the formal parameter < is now the user furnished Boolean valued function,
Course_Less_Than, which compares two items of type Course_Data_Type, and
- the formal parameter Maximum_Size, the maximum size of the array, is now 10.
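To show the instantiation in use, here is a hedged sketch of a client program. The operation names Set_Type, Insert, and Is_In are assumed to match the Ordered_Set_Package specification, and CS_101 stands for a course record built elsewhere.

```ada
with Course_Package;
with Schedules;
procedure Build_Schedule is
   My_Courses : Schedules.Set_Type;
   CS_101     : Course_Package.Course_Data_Type;  --Assume this is filled in.
begin
   --Insertion keeps the set ordered by Course_Less_Than.
   Schedules.Insert( My_Courses, CS_101 );
   if Schedules.Is_In( My_Courses, CS_101 ) then
      null;  --The course is now in the set.
   end if;
end Build_Schedule;
```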
Note that the Course_Less_Than function is specially written to perform the desired
comparison. Assume that Course has been defined as a record with four components: Department,
Number, Room, and Time. Further assume the values of Course_Data_Type are ordered
first by Department and then, within a department, by Number.
The record definition and function definition are combined into a single data type package,
as was done in Chapter 2. The result is in Program 5.4.2. It contains the definition of Course
and the details of the Course_Less_Than function. In fact, the implementation of the
Course_Less_Than function uses a slightly different algorithm from the one above.
Of course, the user is free to use any criterion in the less than function; the package simply
uses whatever the user specifies.
Exercises
3. Develop a generic set package for processing records where searches are made using only one
field of the record. Use a linked implementation.
4. Extend your package from Exercise 1, 2 or 3 to allow search and update on any criteria.
with Text_Package;
package Course_Package is
type Course_Data_Type is
    record
        Department : Text_Package.Text; --Component types are illustrative.
        Number     : Integer;
        Room       : Integer;
        Time       : Integer;
    end record;
function Course_Less_Than( Data_1 : in Course_Data_Type;
                           Data_2 : in Course_Data_Type )
    return Boolean;
end Course_Package;
----------------------------------------------------------------
----------------------------------------------------------------
-- Package Body for Course_Package
package body Course_Package is
function Course_Less_Than( Data_1 : in Course_Data_Type;
                           Data_2 : in Course_Data_Type )
    return Boolean is
    Result : Boolean;
begin
if Data_1.Department < Data_2.Department then
    Result := true;
elsif Data_1.Department = Data_2.Department then
    Result := ( Data_1.Number < Data_2.Number );
else --Data_1.Department > Data_2.Department
    Result := false;
end if;
return Result;
end Course_Less_Than;
end Course_Package;
end Course_Package;
Course Data Type Package
Program 5.4.2
All the set operations defined so far manipulate only one set at a time. There are also some
set operations that manipulate two sets to generate a third set; consider, for example, the follow-
ing three set functions:
Intersect( Set1, Set2 ) whose value is the set intersection of Set1 and Set2,
Union( Set1, Set2 ) whose value is the set union of Set1 and Set 2, and
Differ( Set1, Set2 ) whose value is the set difference of Set1 and Set2.
There are several different algorithms for implementing each of these operations, each with
its set of pros and cons. Let's start with set intersection and consider some possible algorithms.
If the sets are implemented using a bit-mapped data structure, then an algorithm to compute
the intersection, Set3, of the two sets, Set1 and Set2, is:
Intersect( Set1, Set2, Set3 )
    Loop for I in 1 .. Set1.Maximum_Size
        Set3.Map( I ) <-- Set1.Map( I ) and Set2.Map( I )
end intersect
If Set1 and Set2 have the same maximum size, the total execution time is obviously O(
Set1.Maximum_Size ). In theory, it is possible for Set1 and Set2 to have different values for
Maximum_Size in which case, the algorithm must be extended so that:
Set3.First = Largest_of ( Set1.First, Set2.First )
Set3.Last = Smallest_of ( Set1.Last, Set2.Last )
Set3.Maximum_Size = Set3.Last - Set3.First + 1
The total execution time of this algorithm is O( Set3.Maximum_Size ).
This value, O( Set3.Maximum_Size ), is also the minimum possible execution time because
all the values of Set3 must be set to either true or false.
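As a concrete, if simplified, sketch of the bit-mapped case, the following procedure represents each set as an array of Boolean values; the names, the bound of 8, and the sample values are assumptions for the illustration.

```ada
procedure Bit_Map_Intersect_Demo is
   Maximum_Size : constant := 8;
   type Bit_Map is array ( 1 .. Maximum_Size ) of Boolean;
   Set1 : constant Bit_Map := ( 1 | 3 | 5 => true, others => false );
   Set2 : constant Bit_Map := ( 3 | 4 | 5 => true, others => false );
   Set3 : Bit_Map;
begin
   --One pass over the maps; O( Maximum_Size ) total.
   for I in Bit_Map'Range loop
      Set3( I ) := Set1( I ) and Set2( I );
   end loop;
   --Set3 is now true exactly at positions 3 and 5.
end Bit_Map_Intersect_Demo;
```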
When a set is represented by an array, there are more possibilities. We will consider three
ways to compute the intersection of two sets stored in an array representation. The first
algorithm is basically brute force:
Initialize
    Clear Set3
Loop for each Item in Set1
    If Is_In( Item, Set2 )
        then Insert Item in Set3
where the algorithm is implemented using the set operations Is_In and Insert.
This algorithm is slow. The slowest part of the algorithm might be the part: Insert Item in
Set3. The time required to insert a single item in a set, using the standard set insertion operation,
is O(Size) of the set. Since Size3 items must be inserted in Set3, the total insertion time can be:
Size3 * O(Size3) = O( Size3² ).
The reason this insertion takes so long to execute is that it must first search the set each time
to insure that no duplicate item is being inserted. Since the items this algorithm is inserting
already come from a set that has no duplicates, this extra search time is not necessary. The
following loop eliminates the search for duplicate items:
Loop for each Item in Set1
    If Is_In( Item, Set2 )
        then Size3 <-- Size3 + 1
             Set3.Data_Array( Size3 ) <-- Item
With this new algorithm, the time to insert a single new item is O(1) and the total time for all
the insertions is now O(Size3), a significant time savings.
The execution time to determine if a single item from Set1 is in Set2, assuming a sequential
search of Set2, is O(Size2) and, since the search must be repeated once for each item in Set1, the
total time for the searches is O(Size1 * Size2). Combining the search and the insertion time, the
total execution time for the intersection algorithm is:
O(Size1 * Size2) + O(Size3) = O(Size1 * Size2) .
There are several possible ways to speed up this algorithm. One is to assume that Set2 is
ordered so that binary searches are possible. The search time for one item is then O(log2Size2)
and the total search time is O(Size1 * log2Size2) which can be significantly faster than O( Size1
* Size2 ) if Set2 is large. Unfortunately, this requires sorting Set2 which, even assuming some
fast sort method such as Quick Sort, takes time O(Size2 * log2Size2). The total time then to
compute the intersection, including the sort time is the sum of the sort time, the search time, and
the insertion time or:
O(Size2 * log2Size2) + O(Size1 * log2Size2) + O(Size3).
Combining the two terms containing the logarithm of Size2, gives:
O( [Size1+Size2] * log2Size2) + O(Size3) = O( [Size1+Size2] * log2Size2),
which is a significant speed gain over the previous version which used a sequential search of
Set2. For example, if Size1 = Size2 = 1000, the sequential version makes on the order of
1000 * 1000 = 1,000,000 comparisons, while the sort-and-binary-search version makes on the
order of ( 1000 + 1000 ) * 10 = 20,000.
There is still another improvement possible. Assume both Set1 and Set2 are ordered, then a
merge kind of algorithm can be used as follows:
Initialize
    Clear Set3
    J <-- 1
Loop for I in 1 .. Size1
    Loop while J <= Size2 and Set2.Data_Array( J ) < Set1.Data_Array( I )
        J <-- J + 1
    If J <= Size2 and Set1.Data_Array( I ) = Set2.Data_Array( J )
        then Size3 <-- Size3 + 1
             Set3.Data_Array( Size3 ) <-- Set1.Data_Array( I )
This algorithm repeats the inner loop at most once for each item in Set1 and each item in Set2
for a total of Size1 + Size2 times. Of course, both Set1 and Set2 must be ordered so the total
execution time is the sum of the two sort times, the merge operation, and the insertion time; i.e.,
O(Size1 * log2 Size1) + O(Size2 * log2 Size2) + O(Size1+Size2) + O(Size3).
This would be significantly faster than either of the two methods above if it were not for the
time required to sort the two sets.
Thus this method is faster only if the two sets are already ordered. So let us consider the
possibility of always keeping the sets ordered as was done earlier in Section 5.3. The execution
time now for an intersection is:
O(Size1+Size2) + O(Size3) = O(Size1+Size2) .
This is a significant improvement over any of the other methods. Recall from Section 5.3 that
the ordered array representation was slightly slower than the unsorted array representation for
every operation except Is_In. That statement must be modified to read that the ordered array
representation is significantly faster than the unsorted array representation for the Is_In and the
Intersect operations and slightly slower for all of the other operations.
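The merge idea above can be sketched directly in Ada for ordered arrays; the array type and sample values below are assumptions for the illustration, not part of the set package.

```ada
procedure Merge_Intersect_Demo is
   type Item_Array is array ( Positive range <> ) of Integer;
   Set1  : constant Item_Array := ( 1, 3, 5, 7 );  --Both already ordered.
   Set2  : constant Item_Array := ( 3, 4, 5, 8 );
   Set3  : Item_Array ( 1 .. Set1'Length );
   Size3 : Natural  := 0;
   J     : Positive := Set2'First;
begin
   for I in Set1'Range loop
      --Inner loop: skip items of Set2 smaller than Set1( I ).
      while J <= Set2'Last and then Set2( J ) < Set1( I ) loop
         J := J + 1;
      end loop;
      exit when J > Set2'Last;
      if Set1( I ) = Set2( J ) then  --The item is in both sets.
         Size3 := Size3 + 1;
         Set3( Size3 ) := Set1( I ); --Append at the rear: O( 1 ).
      end if;
   end loop;
   --Set3( 1 .. Size3 ) now holds 3 and 5.
end Merge_Intersect_Demo;
```

Because each of I and J only moves forward, the loop bodies execute at most Size1 + Size2 times in total.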
Initialize
    Clear Set3
Loop for each Item in Set1
    If Is_In( Item, Set2 )
        then Insert Item in Set3
where, as before, the algorithm is implemented using the set Is_In and Insert operations.
Also as before, the slowest part of the algorithm might be the part: Insert Item in Set3. The
time required to insert an item in a set, using the standard set insertion operation, is O(Size) of
the set. Since Size3 items must be inserted in Set3, the total insertion time is O( Size3² ).
Again, as in the array representation, the reason this insertion takes so long to execute is that it
must first search the set each time to insure that no duplicate item is being inserted. Since the
items this algorithm is inserting already come from a set that has no duplicates, this extra
search time is not necessary. This suggests the faster loop:
Loop for each Item in Set1
    If Is_In( Item, Set2 )
        then Add Item to the front of Set3 --O( 1 ); no duplicate check needed.
where the items in the linked list are unordered. The time to insert a new item is now O(1) and
the total time for all the insertions is O(Size3).
The execution time for the search to determine if the item is in Set2 is O(Size2) and, since the
search must be repeated once for each item in Set1, the total time for the searches is O(Size1 *
Size2). Combining the search and the insertion time, the total execution time for the algorithm
is:
O(Size1 * Size2) + O(Size3) = O(Size1 * Size2).
One is tempted to replace the unordered linked list by a lexicographically ordered linked list
and use a binary search as was done in the array representation. Binary searches, however, are
impossible in a linked list, so only sequential searches can be used. The linked lists however are
much easier to keep in lexicographical order, because a new item can be inserted between two
other nodes without having to move half of the list as must be done in an array representation.
This suggests that we assume the items are stored in the linked list in lexicographical order so
that the sets are always ordered. Again, this representation was covered in Section 5.3.2. Now
if both Set1 and Set2 are ordered, the intersection operation can be executed by a merge kind of
algorithm. To keep the items in Set3 in lexicographical order, all insertions are made at the rear
of Set3 and a new temporary variable Rear is used to keep track of the last node in the linked
list.
Initialize
    Clear Set3
    P <-- Set1.Head
    Q <-- Set2.Head
    Rear <-- Λ
Loop while P ≠ Λ
    Loop while Q ≠ Λ and Q.Item < P.Item
        Q <-- Q.Next
    If Q ≠ Λ and P.Item = Q.Item
        then Append a node containing P.Item after Rear
             Rear <-- the new node
    P <-- P.Next
This algorithm repeats the inner loop at most once for each item in Set1 and each item in Set2
for a total of Size1 + Size2 times, that is:
O( Size1+Size2 ) + O( Size3 ) = O( Size1 + Size2 ).
Recall from Section 5.3.2 that all of the other operation execution times are about the same
for both the ordered and unsorted linked representation, so the ordered representation is best
when intersections must be performed.
The whole preceding discussion has been in terms of the intersection operation. The same
arguments remain valid, however, for the union and set difference operations and the develop-
ment of algorithms for these operations is left for the exercises.
5.5.4. Timing
The times required to execute the set operations for the various representations using ordered
array and linked representations are:
Operation      Ordered Array         Ordered Linked        Bit-Mapped
Intersect      O( Size1 + Size2 )    O( Size1 + Size2 )    O( Maximum_Size )
Union          O( Size1 + Size2 )    O( Size1 + Size2 )    O( Maximum_Size )
Differ         O( Size1 + Size2 )    O( Size1 + Size2 )    O( Maximum_Size )
All three representations use approximately the same amount of time and space to perform
these operations. Of course unordered arrays or linked representations are significantly slower.
Exercises
3. The bag representations in the text assume that duplicate items are stored in separate
locations. It is possible to add a count field to each location so that, instead of adding a new
entry for a duplicate item, only the count field is incremented. Develop intersection, union,
and difference algorithms for this version assuming the bag entries are stored in lexicographical
order in:
a. an array or
b. a linked structure.
Compare this new representation to the previous representation for speed, space requirements,
and flexibility.
COMBINING ADTs
The first few chapters have introduced the basic concepts involved in an ADT. This chapter
presents ways of combining object classes and ADT's into larger data structures which can be
applied to problems requiring more sophisticated approaches to information representation and
manipulation.
6.1. Inheritance
There are several ways of combining ADT's and object classes into larger structures. The
first we will consider is hierarchies of object classes and inheritance.
One of the advantages of object oriented programming is that one can define hierarchies of
object classes where each item in the hierarchy inherits all of the attributes and operations of any
object class above it in the hierarchy. To illustrate, consider a bank with three kinds of accounts,
savings, checking, and CD accounts. While all accounts have some common features, such as
ID number, customer name, and balance, each account also has some distinct features:
a. a savings account has an interest rate,
b. a checking account has fees, and
c. a CD account has an interest rate and an ending date.
One could develop a separate package for each kind of account with the attendant duplication,
but as will be seen, there are many advantages to combining the different kinds of accounts into
a hierarchy containing four distinct object classes:
1. a parent or base object class, called Account, with the attributes common to
all three kinds of accounts; that is, the attributes: ID#, Name, and Balance,
2. two child object classes,
- Checking with one attribute, Fees, and
- Savings with one attribute, I_Rate.
3. Savings in turn has one child object class, CD account with one attribute, Ending.
In symbolic form:
        +-----------+
        |  Account  |
        |-----------|
        | ID#       |
        | Name      |
        | Balance   |
        +-----------+
         /         \
+-----------+   +------------+
|  Savings  |   |  Checking  |
|-----------|   |------------|
| I_Rate    |   | Fees       |
+-----------+   +------------+
      |
+-----------+
|    CD     |
|-----------|
| Ending    |
+-----------+
where the entry at the top of each box is the name of the object class and the items below the line
in each box are the attributes for that object class. Thus, Account has the three attributes listed
and the other object classes all have the one, additional attribute listed. As noted above, each
object class inherits the attributes of all the object classes above it in the hierarchy, so the CD
object class inherits the attributes I_Rate from the Savings object class and the ID#, Name, and
Balance attributes from the Account object class. This means that the CD object class can use
those variables as if they were included in the CD package. Compare this to defining and
implementing three independent object classes, one each for savings, checking, and CD
accounts, each containing its own declaration of all of the pertinent variables.
Each object class also inherits the operations of the object classes above it in the hierarchy.
In this case, there are no operations to inherit, so let us add some operations to the object classes,
an operation to print the object and an operation to enter data into an object. The Ada code for
even three complete packages is rather long, so to simplify the remainder of this example, only
the Account, Savings, and Checking object classes are included and the CD account is left for
the exercises. In symbolic form the three object class version is:
           +-----------+
           |  Account  |
           |-----------|
           | ID#       |
           | Name      |
           | Balance   |
           |-----------|
           | Print     |
           +-----------+
            /         \
+----------------+   +-----------------+
|    Savings     |   |    Checking     |
|----------------|   |-----------------|
| I_Rate         |   | Fees            |
|----------------|   |-----------------|
| Print          |   | Print           |
| Create_Savings |   | Create_Checking |
+----------------+   +-----------------+
where the items in the lowest portion of each object class are the operations on that object class.
Note that all three object classes have a print operation; since each object class has a different
associated data type and set of attributes, these three print operations will produce different
outputs.
As another example to illustrate the concept, consider the hierarchy:
         +-------------+
         |   Person    |
         |-------------|
         | Name        |
         | Birthdate   |
         +-------------+
          /           \
+-------------+   +-------------+
|   Student   |   |   Faculty   |
|-------------|   |-------------|
| Major       |   | Department  |
| Year        |   | Rank        |
| GPA         |   | Salary      |
|-------------|   |-------------|
| Report_Card |   | Pay_Check   |
+-------------+   +-------------+
Here the parent has two attributes, Name and Birthdate, and no operations. Each child inherits
two attributes from the parent and has three additional attributes of its own. Furthermore, each
child also has a different operation.
On the other hand, most objects in a hierarchy have certain operations which have at least the
same name even though they operate on different kinds of data and produce different kinds of
results. For example, the hierarchy:
          +-----------------+
          |    Employee     |
          |-----------------|
          | Name            |
          | Birthdate       |
          | Department      |
          | Title           |
          +-----------------+
           /               \
+-----------------+   +-------------------+
| Hourly Employee |   | Salaried Employee |
|-----------------|   |-------------------|
| Paycheck        |   | Paycheck          |
+-----------------+   +-------------------+
uses the same name, Paycheck, for two distinct operations. The Paycheck operation in Hourly
Employee calculates a paycheck one way and the Paycheck operation in Salaried Employee
calculates it differently.
One advantage of object oriented programming is that new kinds of objects can be built from
old ones by inheritance. Consider, for example, the employee example. Assume the company
wants to have a new kind of employee, a salaried employee who is hired for a limited time
period. Then a new child of Salaried Employee, say Temp_Employee, can inherit all of the
characteristics of Salaried Employee with one additional attribute, say a Termination_Date.
Inheritance makes adding Temp_Employee to the system a minor addition; without inheritance,
it would require much more work and time to implement.
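The text gives no code for the employee hierarchy, but the extension might be sketched as follows, assuming a parent package Employees.Salaried that declares a tagged type Salaried_Employee; all of the names and the representation of Termination_Date here are assumptions for the illustration.

```ada
package Employees.Salaried.Temps is
   type Temp_Employee is new Salaried_Employee with private;
   --Temp_Employee inherits Name, Birthdate, Department, Title,
   --and the Paycheck operation from the types above it.
private
   type Temp_Employee is new Salaried_Employee with
      record
         Termination_Date : Natural := 0;  --e.g., a day number; assumed.
      end record;
end Employees.Salaried.Temps;
```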
The same kind of extension is possible with some of the ADT's developed earlier. There are
many possible ways to extend a set to include additional capabilities. In fact, the extension can
be designed to match the problem at hand.
To illustrate Ada code for a hierarchy, consider the banking hierarchy at the beginning of this
section. The algorithms for these operations, as will be seen, are trivial, so we start with devel-
oping the packages. The first package is the parent or base package, the Accounts Package. The
whole package is contained in Program 6.1.1.1. The important thing to notice is that this
package differs from all the Ada packages covered so far in only the two statements:
type Account is tagged private; --Parent type.
type Account is tagged --Parent description.
where the new term tagged appears. This specifies that this data type is to be the parent data
type in an object class hierarchy. Everything else in the Account specification and body is the
same as before.
It greatly simplifies the use of records in a hierarchy if we always refer to them by means of
a pointer. (The reasons will become clearer as we proceed.) For this reason, there is one
additional statement in the Account specification:
type Account_Pointer is access Account'Class;
The 'Class after Account specifies that these pointers can point to any item in the hierarchy;
that is to any type of data in the account hierarchy. Thus, instead of several pointer types, we
only need one type when the items are in the same hierarchy.
with Text_Package;
package Accounts is
type Account is tagged private; --Parent type.
type Account_Pointer is access Account'Class; --Points to any account type.
procedure Print( Account_Name : in Account );
private
type Account is tagged --Parent description.
record
ID_Num : Natural := 0; --ID# of account.
Name : Text_Package.Text := Text_Package.Text_Of( " " );
--Customer's name.
Balance: Integer := 0; --Current balance.
end record;
end Accounts;
--------------------------------------------------------------
--------------------------------------------------------------
with Ada.Text_IO;
package body Accounts is
package Int_IO is new Ada.Text_IO.Integer_IO ( Integer );
-------------------------------------------------------------
procedure Print( Account_Name : in Account ) is
begin
Ada.Text_IO.Put ( "This is an account. " );
Text_Package.Put ( Account_Name.Name );
Int_IO.Put ( Account_Name.ID_Num );
Int_IO.Put ( Account_Name.Balance );
Ada.Text_IO.New_Line;
end Print;
end Accounts;
Accounts Package
The Parent Package in the Bank Hierarchy
Program 6.1.1.1
The child packages introduce several new features. First, the names of the packages consist
of the name of the parent followed by a period and the name of the child; e.g.,
Accounts.Savings
Accounts.Checking
Second, the child data type must contain the phrase "new Account with" in each part of the
declaration; e.g.,
type Savings is new Account with private; --Child type.
type Savings is new Account with --Child description.
The rest of the two child specifications are the same as usual. In this case, each child
declares a function to return an object class of the specified kind with the specified data; in other
words, the two functions Create_Savings and Create_Checking, each with its own set of parame-
ters. Both children also have a print procedure, with the savings account print operation output-
ting a savings account and the checking account print operation outputting a checking account.
The complete package specifications are in Specification 6.1.1.1.
The package bodies for these two children are in Programs 6.1.1.2 and 6.1.1.3. These
package bodies are rather routine except for one feature -- each can access all of the data in its
parent's package specification -- even the data declared in the private part of the specification. In
particular, each child can access the attributes ID_Num, Name, and Balance from the parent
package.
with Text_Package;
package Accounts.Savings is
type Savings is new Account with private; --Child type.
function Create_Savings( Cust_Name : in Text_Package.Text;
                         ID        : in Natural;
                         Amount    : in Integer;
                         Interest  : in Integer)
    return Savings;
procedure Print( Savings_Name : in Savings);
private
type Savings is new Account with --Child description.
record
I_Rate : Integer; --Interest rate of account.
end record;
end Accounts.Savings;
--------------------------------------------------------------
--------------------------------------------------------------
with Text_Package;
package Accounts.Checking is
type Checking is new Account with private; --Child type.
function Create_Checking( Cust_Name : in Text_Package.Text;
                          ID        : in Natural;
                          Amount    : in Integer;
                          Fee_Value : in Integer)
    return Checking;
procedure Print( Checking_Name : in Checking);
private
type Checking is new Account with --Child description.
record
Fee : Integer; --Fees charged to the account.
end record;
end Accounts.Checking;
Savings and Checking Child Package Specifications
Specification 6.1.1.1
with Ada.Text_IO;
package body Accounts.Savings is
package Int_IO is
new Ada.Text_IO.Integer_IO (Integer);
-------------------------------------------------------------
function Create_Savings( Cust_Name : in Text_Package.Text;
ID : in Natural;
Amount : in Integer;
Interest : in Integer)
return Savings is
    Savings_Name : Savings; --The result being built.
begin
Savings_Name.Name := Cust_Name;
Savings_Name.ID_Num := ID;
Savings_Name.Balance := Amount;
Savings_Name.I_Rate := Interest;
return( Savings_Name );
end Create_Savings;
-------------------------------------------------------------
procedure Print( Savings_Name : in Savings) is
begin
Ada.Text_IO.Put ( "This is a savings account. " );
Text_Package.Put ( Savings_Name.Name );
Int_IO.Put ( Savings_Name.ID_Num );
Int_IO.Put ( Savings_Name.Balance );
Int_IO.Put ( Savings_Name.I_Rate );
Ada.Text_IO.New_Line;
end Print;
end Accounts.Savings;
Savings Child Package Body
Program 6.1.1.2
with Ada.Text_IO;
package body Accounts.Checking is
package Int_IO is
new Ada.Text_IO.Integer_IO (Integer);
-------------------------------------------------------------
function Create_Checking( Cust_Name : in Text_Package.Text;
ID : in Natural;
Amount : in Integer;
Fee_Value : in Integer)
return Checking is
    Checking_Name : Checking; --The result being built.
begin
Checking_Name.Name := Cust_Name;
Checking_Name.ID_Num := ID;
Checking_Name.Balance := Amount;
Checking_Name.Fee := Fee_Value;
return( Checking_Name );
end Create_Checking;
------------------------------------------------------------
procedure Print( Checking_Name : in Checking) is
begin
Ada.Text_IO.Put ( "This is a checking account. " );
Text_Package.Put ( Checking_Name.Name );
Int_IO.Put ( Checking_Name.ID_Num );
Int_IO.Put ( Checking_Name.Balance );
Int_IO.Put ( Checking_Name.Fee );
Ada.Text_IO.New_Line;
end Print;
end Accounts.Checking;
Checking Child Package Body
Program 6.1.1.3
Once the package specifications and bodies are available, the actual use of the packages is
also rather routine. The start of a program to use this hierarchy needs a with statement for each
of the packages, followed by the usual declarations and executable statements. Thus, the
program segment:
with Text_Package;
with Accounts;
with Accounts.Savings;
with Accounts.Checking;
procedure Test is
    A : Accounts.Account; --Keeps its default values.
    S : Accounts.Savings.Savings;
    C : Accounts.Checking.Checking;
begin
--Initialize
S := Accounts.Savings.Create_Savings(
Text_Package.Text_Of( "J. Doe" ),
11, 8, 8);
C := Accounts.Checking.Create_Checking(
Text_Package.Text_Of( "M. Ewe" ),
22, 68, 8 );
--Output
Accounts.Print( A );
Accounts.Savings.Print( S );
Accounts.Checking.Print( C );
end Test;
will output:
This is an account. 0 0
This is a savings account. J. Doe 11 8 8
This is a checking account. M. Ewe 22 68 8
Clearly, these definitions and operations can be extended to include almost any desired
operation on any one of the object class types and the user or client program can produce almost
any desired manipulation of any or all of the various types of data.
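For instance, the class-wide pointer type Account_Pointer from the text lets one list hold different kinds of accounts, with each Print call dispatching at run time to the right version. This sketch assumes the Print and Create operations shown in the programs above; the name Mixed_List is invented.

```ada
with Text_Package;
with Accounts;
with Accounts.Savings;
with Accounts.Checking;
procedure Mixed_List is
   --One pointer type serves every type in the hierarchy.
   List : array ( 1 .. 2 ) of Accounts.Account_Pointer;
begin
   List( 1 ) := new Accounts.Savings.Savings'(
                   Accounts.Savings.Create_Savings(
                      Text_Package.Text_Of( "J. Doe" ), 11, 8, 8 ) );
   List( 2 ) := new Accounts.Checking.Checking'(
                   Accounts.Checking.Create_Checking(
                      Text_Package.Text_Of( "M. Ewe" ), 22, 68, 8 ) );
   for I in List'Range loop
      --Dispatches: the savings Print, then the checking Print.
      Accounts.Print( List( I ).all );
   end loop;
end Mixed_List;
```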
Exercises
1. For each of the following, sketch a hierarchy with some typical attributes and operations for
each object class.
a. students: undergraduate and graduate,
b. shapes: circles, squares, and rectangles,
c. loans: car, house, and student, and
d. library: book, tape, CD.
2. Extend the bank hierarchy in the text to include a CD object class with the operations
Create_CD and Print.
3. Alter the bank hierarchy in the text so that all three objects have a Create function.
4. Develop the following object classes into Ada packages and a test program.
1. A parent, Vehicle, with an operation to print "This is a vehicle."
2. A child, Car, with an operation to print "This is a car."
3. A child, Boat, with an operation to print "This is a boat."
Assume:
a. none of the object classes has an attribute or
b. each object class has one attribute, ID_Num for Vehicle, RPM for Car, and
Length for Boat.
Section 5.2 presented some of the advantages and applications of extended sets and bags.
Section 5.2 assumed that any extended set or bag was implemented as a distinct package, but sets
or bags with additional operations can also be implemented by a straightforward use of the
concept of object class hierarchies. First, let the parent in the hierarchy be a set of some kind
and then extend the set to include the desired operations. A simplified, generic set will illustrate
the idea.
Program 6.1.2.1 contains a simplified generic set package with only one operation, Insert.
(More operations, such as Clear or Empty, can be added but they would only make the example
harder to follow. The additional operations are left for the exercises.) This package implements
the set by an array which is declared in the package specification. The only variables "visible"
to descendent packages are those declared in the parent package specification, so this array must
be declared in the package specification. The package body also does not check for duplicates
before inserting a new item; this check needs to be added before using the package.
Our main interest at the moment, however, is the extended set. Assume we need to extend
the set package to include a print operation and an increment operation that keeps a count for
each item in the set. Instead of a print operation, an iteration operation is added; this can be
used for printing and for any other later operation that must process all of the items in the set.
Specification 6.1.2.1 contains the desired extension.
generic
   type Data_Type is private;   --Data type stored in set.
package Simple_Sets is
   type Data_Array is array (0..100) of Data_Type;
   type Simple_Set_Type is tagged record
      Size  : Natural := 0;     --Number of items in the set.
      Items : Data_Array;       --Items(1..Size) hold the set.
   end record;
   procedure Insert( Set_Name : in out Simple_Set_Type;
                     The_Data : in Data_Type );
end Simple_Sets;
--------------------------------------------------------------
package body Simple_Sets is
   procedure Insert( Set_Name : in out Simple_Set_Type;
                     The_Data : in Data_Type ) is
   begin --No check for duplicates; add this check before use.
      Set_Name.Size := Set_Name.Size + 1;
      Set_Name.Items( Set_Name.Size ) := The_Data;
   end Insert;
end Simple_Sets;
Generic, Parent, Simplified Set Package
Program 6.1.2.1
generic
package Simple_Sets.Extended_Sets is
   type Count_Array is array (0..100) of Integer;
   type Extended_Set_Type is new Simple_Set_Type with record
      Count : Count_Array := (others => 0);   --Count(I) parallels Items(I).
   end record;
   type Procedure_Access_Type is
      access procedure ( Data_Item : in Data_Type;
                         Count_Item: in Integer );
   procedure Increment( Set_Name : in out Extended_Set_Type;
                        The_Data : in Data_Type );
   procedure Iterate( Set_Name : in Extended_Set_Type;
                      Operation_Pointer : in Procedure_Access_Type );
end Simple_Sets.Extended_Sets;
with Ada.Text_IO;
package body Simple_Sets.Extended_Sets is
package Int_IO is
new Ada.Text_IO.Integer_IO ( Integer );
------------------------------------------------------------
procedure Increment( Set_Name : in out Extended_Set_Type;
                     The_Data : in Data_Type) is
   I : Integer;
begin
   --Find location of The_Data in Items, using position 0 as a sentinel.
   I := Set_Name.Size;
   Set_Name.Items(0) := The_Data;
   while Set_Name.Items(I) /= The_Data loop
      I := I - 1;
   end loop;
   --Increment the parallel count; I = 0 means The_Data was not in the set.
   Set_Name.Count(I) := Set_Name.Count(I) + 1;
end Increment;
-------------------------------------------------------------
procedure Iterate(
          Set_Name : in Extended_Set_Type;
          Operation_Pointer : in Procedure_Access_Type) is
begin
   --Repeat for each item in set
   for I in 1..Set_Name.Size loop
      Operation_Pointer( Set_Name.Items(I),
                         Set_Name.Count(I) );
   end loop;
end Iterate;
end Simple_Sets.Extended_Sets;
Package Body of Extended Set
Implements the Increment and Iterate Operations
Program 6.1.2.2
The Extended_Set_Type is declared as the child of the Simple_Set_Type in the usual way.
The only thing that must be noted is that the Extended_Set_Type contains an integer valued
array to hold the values of the counts. To increment a given item in the set, the item must first
be found in the Simple_Set_Type.Items array and then the corresponding location incremented
in the Counts array. This is an example of a data structure called parallel arrays, two arrays
treated as if they were side by side so that each item in the first array corresponds to a single
item in the second array. In this case, the first array contains the items in the set and the second
array contains the count for each corresponding item. This technique can be extended so that
the items in either array are themselves records.
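As a small sketch of the idea (the names here are illustrative, not from the text's packages), two parallel arrays might be declared and used like this:

```ada
with Ada.Text_IO;
procedure Parallel_Demo is
   --Two parallel arrays: Names(I) and Counts(I) describe the same item.
   Names  : array (1..3) of Character := ( 'A', 'B', 'C' );
   Counts : array (1..3) of Integer   := ( 0, 0, 0 );
begin
   Counts(2) := Counts(2) + 1;   --Increment the count for item 'B'.
   for I in 1..3 loop
      Ada.Text_IO.Put_Line( Names(I) & Integer'Image( Counts(I) ) );
   end loop;
end Parallel_Demo;
```

The two arrays must always be indexed together; inserting or deleting in one array without the other breaks the correspondence.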
A typical put operation, assuming the set items are of type Text, might appear as follows:
with Text_Package;
with Ada.Text_IO;
procedure Put( Item : in Text_Package.Text;
Count : in Integer) is
package Int_IO is new Ada.Text_IO.Integer_IO (Integer);
begin
Text_Package.Put( Item );
Int_IO.Put( Count );
end Put;
A sample main program with Text data type to use this Put is in Program 6.1.2.3.
with Simple_Sets;
with Simple_Sets.Extended_Sets;
with Text_Package;
with Put;
procedure Test is
   package S_Set is
      new Simple_Sets ( Data_Type => Text_Package.Text );
   package Ext_Set is
      new S_Set.Extended_Sets;
   The_Set     : Ext_Set.Extended_Set_Type;
   Put_Pointer : Ext_Set.Procedure_Access_Type;
begin
   Put_Pointer := Put'Access;
   Ext_Set.Insert( The_Set, Text_Package.Text_Of( "Ada" ) );
   Ext_Set.Increment( The_Set, Text_Package.Text_Of( "Ada" ) );
   Ext_Set.Iterate( The_Set, Put_Pointer );
end Test;
Exercises
2. Develop a main program for a video tape rental store where the store needs to know how
many times each tape has been rented.
3. Expand Exercise 2 so that the store knows whether each tape is in the store or rented at the
moment.
4. Develop a program to count votes in an election. Assume each input entry consists of the
name of one candidate.
6.2. Polymorphism
One of the advantages of inheritance is that it simplifies using collections of items in the
same hierarchy. Consider, for example, the bank hierarchy object classes given in the previous
section. Recall that this hierarchy consisted of a parent or base class, Account, and two children,
Savings and Checking. The details including programs are in Programs 6.1.1.1 through 6.1.1.3.
The bank may have many occasions to treat all of the accounts alike in one sense; for
example, if the bank issues a statement once a month for each of its accounts, it would be
convenient to use a simple loop like the following:
Repeat for each account in the bank
   Print this account
where the only difficulty is that each kind of account is printed in a different manner with its
own kind of information. The "standard" way to set up this loop is to have a separate loop for
each kind of account, but there is a better way based upon two features:
1. The first feature is an array which can store any kind of item provided the
item is a pointer to an object in a hierarchy; that is, the array contains pointers
to the parent and/or descendants of the parent.
2. The second feature is the ability of each item to "carry its own operations" so
that when told to print an item in the collection, the item "knows" which kind
of item it is and how that item is to be printed.
This is an example of something called polymorphism. The term polymorphism can be
loosely translated as "many forms" and is used in computing to refer to an operation which is
common to several object classes in the same hierarchy (printing or outputting is one of the most
common examples of polymorphism) where each object carries along an indicator of how this
operation should be executed for objects in this object class. This sounds more complicated than
it is; an example will help illustrate the concept.
Consider again the bank hierarchy where we want to print a list of all of the accounts,
regardless of type. To do this with a single loop of the kind above requires both polymorphism and
some kind of common container for the objects. The bank hierarchy packages are polymorphic
so all that is necessary is to be able to define an array of the desired type. Recall that the
Account specification, Program 6.1.1.1, contained the declaration:
type Account_Pointer is access Account'Class;
which specifies a pointer to Account'Class; that is, to any type of data in the account hierarchy.
In particular, any pointer of type Account_Pointer can point to either an account, or to a savings
account, or to a checking account.
Since a single pointer can point to any one of these three types of data, the two statements:
type Account_List is
array (Integer range <>)
of Accounts.Account_Pointer;
All_Accounts : Account_List(1..3);
specify an array of such pointers; that is, any item in the array can point at any item in the
hierarchy. In other words, the array stores pointers to any type of object in the bank hierarchy,
so the single loop can process every account in the bank. This capability greatly simplifies
programming and reuse of code. If, for example, the bank later wishes to add some more types of
accounts to the hierarchy, much of the current code can be reused with little or no change
whereas if separate loops were necessary for each type of object, much more work would need
to be done in blending the new and the old code to achieve the desired results.
To implement polymorphism in the base and child packages requires a few changes. The
most important change is to have the Create operations return a pointer to a record rather than
the record itself.
The Accounts package has no Create operation so it remains the same. The new
specification of the Savings package, in Specification 6.2.1, illustrates the use of the type
Account_Pointer in a child package. The Create_Savings routine now returns a pointer of type
Account_Pointer rather
than a record. Everything else in the Savings package specification is the same as before. While
it is not illustrated, the Checking package specification has the same change.
The actual code necessary to return a pointer is illustrated in Program 6.2.1. The example in
this case is the package body for the Savings package. Note how the Create routine inserts the
values in a new record and then returns not the record, but a pointer to the record. Otherwise
everything is the same as before.
Assuming that the Checking account package has the same alterations, Program 6.2.2 is a
driver which illustrates the use of this approach to polymorphism.
with Text_Package;
package Accounts.Savings is --a sample child class.
   type Savings is new Account with private;
   function Create_Savings( Cust_Name : in Text_Package.Text;
                            ID        : in Natural;
                            Amount    : in Integer;
                            Interest  : in Integer)
            return Account_Pointer;
   procedure Print( Savings_Name : in Savings );
private
   type Savings is new Account with
      record
         I_Rate : Integer;
      end record;
end Accounts.Savings;
with Ada.Text_IO;
package body Accounts.Savings is
package Int_IO is
new Ada.Text_IO.Integer_IO (Integer);
--------------------------------------------------------------
------------------------------------------------------------
function Create_Savings(
          Cust_Name : in Text_Package.Text;
          ID : in Natural;
          Amount : in Integer;
          Interest : in Integer)
          return Account_Pointer is
   Pointer : Account_Pointer;
begin
   Pointer := new Savings' ( Name => Cust_Name,
                             ID_Num => ID,
                             Balance => Amount,
                             I_Rate => Interest);
   return Pointer;
end Create_Savings;
---------------------------------------------------------------
procedure Print( Savings_Name : in Savings) is
begin
Ada.Text_IO.Put ( "This is a savings account. " );
Text_Package.Put ( Savings_Name.Name );
Int_IO.Put ( Savings_Name.ID_Num );
Int_IO.Put ( Savings_Name.Balance );
Int_IO.Put ( Savings_Name.I_Rate );
Ada.Text_IO.New_Line;
end Print;
end Accounts.Savings;
with Text_Package;
with Accounts;
with Accounts.Savings;
with Accounts.Checking;
procedure Test is
type Account_List is
array (Integer range <>)
of Accounts.Account_Pointer;
All_Accounts : Account_List(1..3);
begin
--Initialize three accounts, one of each kind.
All_Accounts(1) := new Accounts.Account;
All_Accounts(2) := Accounts.Savings.Create_Savings (
Cust_Name => Text_Package.Text_Of("Anon"),
ID => 33,
Amount => 4,
Interest => 5);
All_Accounts(3) := Accounts.Checking.Create_Checking (
Cust_Name => Text_Package.Text_Of("J.Donne"),
ID => 44,
Amount => 15,
Fee_Value => 10);
--Output all three accounts with a single loop.
for I in All_Accounts'Range loop
   Accounts.Print( All_Accounts(I).all );
end loop;
end Test;
will output:
This is an account. 0 0
This is a savings account. Anon 33 4 5
This is a checking account. J.Donne 44 15 10
Exercises
2. Develop the following objects into Ada subprograms and a test program which uses
polymorphism.
1. A parent, Vehicle, with an operation to print "This is a vehicle."
2. A child, Car, with an operation to print "This is a car."
3. A child, Boat, with an operation to print "This is a boat."
Assume:
a. none of the objects has an attribute or
b. each object has one attribute, ID_Num for Vehicle, RPM for Car, and
Length for Boat and each object (including Vehicle) has a new operation,
create; that is,
Vehicle has the operation: Create( ID_Num ),
Car has the operation: Create( ID_Num, RPM ), and
Boat has the operation: Create( ID_Num, Length).
3. Redo the bank example so that all the create operations return a pointer to a record rather
than the record itself. Compare the resulting package and client programs to the version used in
the text above.
6.3. Containers
Recall from Chapter 3 that a container (sometimes called a collection) is any data type
capable of holding items. One of the simplest containers is an array, but stacks, queues, pipes,
sets, and bags are all examples of containers. Most containers have some kind of operators to
insert, retrieve, search, test for membership, and so forth. Most containers also have some kind
of iterator to allow the user or client program to process all of the data in the container in a
simple, straightforward way. The various containers differ in the way that they insert, store, and
retrieve data, but the common capabilities are the items of interest here. We want to see how
containers can be combined with records and arrays to solve more complicated problems.
Records with containers and containers of records have been used before without comment.
In some sense the goal of the following discussion is to make explicit what is commonly done so
as to bring the underlying concepts to the level of consciousness.
The first use of containers we will consider are records with containers. A simple case is a
student record which contains a set of courses completed by the student. This can be represented
in algorithmic form by:
Student is record
Name : Text; --Student's name.
Major : Major_Type; --Student's major.
.....any other appropriate information plus.....
Courses : Set of Courses; --Set of courses taken.
end record.
Realistically, the set of courses would be an extended set of records; each record in the set could
contain a course ID number and a grade and the extended set would contain an operation to
compute the student's GPA. The end result is that there is one extended set for each student
record.
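In Ada, such a record with a container might be sketched as follows; the type and field names here are illustrative (in particular, the array-based course set stands in for whatever set package is actually used):

```ada
with Text_Package;   --The text's Text ADT.
package Student_Records is
   type Major_Type is (CS, Math, Other);   --Illustrative majors.
   type Course_Record is
      record
         Course_ID : Natural   := 0;       --Course ID number.
         Grade     : Character := ' ';     --Letter grade.
      end record;
   type Course_Array is array (1..50) of Course_Record;
   type Course_Set is                      --A simple array-based set of courses.
      record
         Size  : Natural := 0;
         Items : Course_Array;
      end record;
   type Student is
      record
         Name    : Text_Package.Text;      --Student's name.
         Major   : Major_Type := Other;    --Student's major.
         Courses : Course_Set;             --Set of courses taken.
      end record;
end Student_Records;
```

Each Student value carries its own Course_Set, so there is one set of courses per student record, exactly as the text describes.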
The information in the record does not have to be a set. Consider, for example, an airline
flight record with a list of passengers waiting for a seat on the flight. If the waiting customers
are given seats on a first come, first served basis, then the waiting list is actually a queue. The
flight record might appear as follows:
Flight is record
Flight Number : Integer; --Flight number.
Source : Airport; --Airport where flight starts.
.....any other appropriate information plus.....
Waiting_Queue : Queue of Passengers; --People waiting for seat on flight.
end record.
A record can have more than one container. The flight record above, for example, might
also contain a set of passengers and, if the flight has more than one "leg", a bag containing the
sources and destinations for each leg of the flight.
This can be generalized in another direction. Consider, for example, a government agency
which works with a number of universities. Each university is divided into colleges and the
colleges are then divided into departments. In this case, the program might contain a set of
universities where each university record contains a set of colleges and each college record
contains a set of departments.
Containers of Records
We have had a number of examples of containers of records. The section above, for
example, mentioned a set of course records, a queue of passenger records, and sets of college
and department records. These examples raise an interesting question about containers. What
exactly should be stored in the container? Should the records themselves be stored in the
container or should only pointers to the records be stored in the container? As a general rule,
containers of pointers take less storage space than containers of records, but each case should be
decided on its own merits. Consider some of the examples above:
1. In the student record example, each entry in the set of courses completed
contains a record with a course ID number and a grade. Since this is the only
place the record is used, it can be stored in the set, but we must also consider
the set representation. If the set is implemented using an array, storing the
whole record in the set can waste a large amount of storage. If the set is
implemented using a linked structure, then storing the whole record in the
container wastes little storage.
2. In the flight example, each flight record contained a queue of waiting
passengers. Since any given passenger might have reservations on more than one
flight and be in the waiting queue for some more flights, it makes sense to
have a separate record for each passenger (with the passenger's ID number,
name, address, and so forth) and then store only a pointer to the passenger in
the waiting queue. (This pointer might be an access variable or a passenger
ID number.) The space savings are significant and, if, for example, the
passenger changes his/her phone number, the change only has to be made in
one place rather than in a number of separate waiting queues and passenger
lists.
3. In the university example, it would again make sense to use college pointers
and department pointers in the given containers. (The pointers themselves
might be either access variables or ID numbers.)
Most cases require an analysis of the problem before deciding on the kind of container and
what exactly will be stored in the container. As an example, assume someone wants to construct
a program to do polynomial arithmetic; that is, given two polynomials such as:
f(x) = 9 + 7x + 19x² and g(x) = 3 + 5x + 3x²
the computer should be able to generate the sum or the difference of the two polynomials.
To accomplish this requires at least three things, a way of representing polynomials in the
computer, some algorithms for performing the polynomial arithmetic, and some way for the
program user to specify what calculations to perform.
Assuming polynomials with integer valued coefficients, each polynomial can be represented
by a record of the form:
Polynomial is a record
Name : Character; --Name of polynomial.
Degree: Natural; --Degree of the polynomial.
Coef : Integer_array(0..100); --Coefficients of the polynomial.
end record
A typical polynomial record might then contain:
Record
Name = Distance
Degree = 4
Coef(0) = 9, Coef(1) = 7, Coef(2) = 19, Coef(3) = 0, and Coef(4) = 2
end record
where missing coefficients are zero. This scheme can handle polynomials of up to degree 100.
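In Ada, this representation might be declared as follows (a sketch; the bound 100 matches the degree limit above, and the default values make every missing coefficient zero):

```ada
package Polynomials is
   type Coef_Array is array (0..100) of Integer;
   type Polynomial is
      record
         Name   : Character  := ' ';            --Name of polynomial.
         Degree : Natural    := 0;              --Degree of the polynomial.
         Coef   : Coef_Array := (others => 0);  --Missing coefficients are zero.
      end record;
end Polynomials;
```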
This representation also makes it simple to add two polynomials; for example, the function
algorithm:
--Initialize
Sum.Degree <-- maximum of P.Degree and Q.Degree
--Add corresponding coefficients
Repeat for I in 0..Sum.Degree
   Sum.Coef(I) <-- P.Coef(I) + Q.Coef(I)
--Terminate
Return ( Sum )
end Add
does this and a similar algorithm can be used for subtraction. (Subtraction and multiplication are
left for the exercises. Division is ignored here because the division of polynomials might
produce both a quotient and a remainder which confuses the discussion without advancing the
computer science.)
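Assuming an Ada record matching the representation above, the addition algorithm might be sketched as follows (coefficients above a polynomial's degree are assumed to be zero, which the default initialization guarantees):

```ada
function Add ( P, Q : in Polynomial ) return Polynomial is
   Sum : Polynomial;
begin
   --Initialize: the sum's degree is the larger of the two degrees.
   Sum.Degree := Natural'Max( P.Degree, Q.Degree );
   --Add corresponding coefficients.
   for I in 0 .. Sum.Degree loop
      Sum.Coef(I) := P.Coef(I) + Q.Coef(I);
   end loop;
   return Sum;
end Add;
```

A Subtract function is identical except for the minus sign; multiplication needs a nested loop over both coefficient arrays.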
The first design decision is therefore to design a polynomial data type consisting of
polynomial records and operations to add, subtract, and multiply two specified polynomials.
Of course, these polynomials will have to be stored in a container of some kind yet to be
determined.
Next we design a simple language for describing the desired calculations, where polynomial
expressions can be any combinations of sums, differences, and products of
named polynomials. A typical "program" might be:
Read P
Read Q
Read R
S := P * ( Q + R )
Write S
End
To implement this system we will have to have some way to store and retrieve polynomials
by their name. There are several ways to do this, but, since the polynomial names are unique, a
simple way is to use a set with operations to store and retrieve polynomials based upon the name
of the polynomial.
While a set of pointers may use less space, the number of entries in the set is relatively small, so
it is simpler to store the whole records in the set and implement the set using a linked structure.
If one is more interested in speed one might store the records in an array sorted on polynomial
name and use a binary search to locate specified polynomials.
Then, a possible interpreter is of the form:
Initialize
   More-to-do <-- true
Repeat while More-to-do
   Input next command
   If the command is Read, input a polynomial and store it in the set
   If the command is an assignment (Let), evaluate the expression and
      store the result in the set
   If the command is Write, retrieve the named polynomial and output it
   If the command is End, More-to-do <-- false
end
Note that executing the Let command requires two stacks, one to convert an infix expression to a
postfix expression and one to evaluate a postfix expression. The first stack stores operators and
the second stack will have to store either data names or pointers to polynomial records or
complete polynomial records. It is faster to store the whole polynomial record on the second
stack, but it takes less storage space if the stack contains not the whole polynomial record, but
only the polynomial data names. Speed should be the determining factor in this case, so the
second stack should store polynomial records.
Note that this problem has three containers and a single change in one of them can affect the
others. If, for example, the set of polynomials contained only pointers, then the second stack
should probably only contain pointers and perhaps the polynomial data type arithmetic
operations should be changed so that the parameters are pointers rather than records. In other words,
all of the decisions are interrelated.
Containers of Containers
A container can itself hold containers. The classic example is the LISP list, where each item
in a list is either a number or another list; the list (2 (5 (7 9)) 3), for example, contains the
number 2, the list (5 (7 9)), and the number 3. Such a list can be implemented as a linked
structure of items,
which is almost identical to the linked structure used to implement a stack or a set. To define an
Item, we use a variant record like the following:
Item is record
Case
when Number
Itm : Integer;
when List
Itm : List;
end record;
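In Ada, this variant record needs a discriminant and, because the record is recursive, an access type. A sketch (the names are illustrative):

```ada
package Lisp_Lists is
   type Item_Kind is (Number, List);
   type Item;                             --Incomplete declaration; Item is recursive.
   type Item_Access is access Item;
   type Item (Kind : Item_Kind := Number) is
      record
         Next : Item_Access;              --Next item at this level of the list.
         case Kind is
            when Number =>
               Value : Integer;           --A number in the list.
            when List =>
               Sublist : Item_Access;     --First item of a nested list.
         end case;
      end record;
end Lisp_Lists;
```

Each Item is one node of a linked structure; a Number node carries an integer, while a List node points at the first node of a nested list.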
Other implementation structures are also possible (one method uses a hierarchy), but the linked
structure of variant records is possibly the simplest one.
Once the data structure is available, one can define operations on lists, such as concatenate
two lists or print a list. These are straightforward and left for the exercises.
Exercises
1. Develop a detailed design of the data structures for a simple airline reservation system where
each flight contains a list of passengers and a queue of waiting passengers. Develop a set of
commands along with the data descriptions, operations, and container(s).
2. Develop a detailed design of the data structures for a simple student registration system.
Develop a set of commands, data descriptions, operations, and containers.
3. Develop a system for counting votes during an election. The voting district consists of a set
of precincts, each precinct has a set of races, and each race has a set of candidates.
4. An automobile repair shop classifies repairs into areas such as tune-ups, brakes,
transmission and so forth. For each type of repair there is a set of mechanics who can do this
kind of work and a queue of repairs waiting to be done. Develop a system to keep track of
repairs, mechanics, and replacement parts.
5. Replace the array by a linked structure in the discussion above of polynomial arithmetic.
Compare the two implementations.
6. Expand the discussion above of polynomial arithmetic into a working Ada system.
7. Design and implement an arithmetic system capable of doing 100 digit addition, subtraction,
and multiplication.
8. Draw the list (2 (5 (7 9)) 3) using the representation given in the text.
9. Implement a LISP list in Ada. Include two operations, one to insert a new item (an integer
or a list) at the beginning of a list and one to print a list.
There are problems where we don't know in advance how many containers we will need or
what their names will be. A simple example to illustrate the difficulty is constructing an index
to a book. (An index is a list of the words in the book and the pages on which each word
appears.) One approach to solving this problem is to have one set for each word in the book and
as we input each word of the book, we insert the current page number in the set associated with
this word. When the processing is completed, for each word in the book there is a set containing
every page number where that word occurs. The interesting feature of this solution is that we
must have one set for each distinct word in the book and, in general, we do not have in advance
a list of all the words in the book. Even if we did have a complete list of words in advance, it
would be silly to write a program with thousands of variables, one variable for each word in the
book.
The solution is to define a new ADT which allows us to name and add new sets during
program execution; that is, the set names are determined (or more precisely, bound) at execution
time rather than at compile time as we have done so far with queues, pipes and sets.
The new ADT, called a set of sets (sometimes called a multi-set), contains an arbitrary
number of sets and we can perform the usual set operations on each and every one of these sets.
The important feature of set of sets is that the invoking program can add new sets, called by any
name, at any time during program execution.
A similar ADT for bags, called a set of bags, is a minor variation on the set of sets ADT.
Similarly, sets of stacks, queues, etc. are also minor variations on the same theme, so we present
here details only for the set of sets and leave the others for the exercises.
The basic set of sets operations begin with three operations that initialize the ADT
(Clear_All_Sets), allow the invoking program to add a new set, and print the contents of every
set (Print_All_Sets). (Note that the set names are defined at execution time.) The next
operation, Is_Set, is used to determine whether the ADT contains a particular set. The last six
operations are standard set operations using the set name as a parameter and, if necessary, any
data values as a second parameter.
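As a sketch of what such a package specification might look like, assuming integer items and string set names (the names and parameter profiles here are illustrative, not the text's actual package):

```ada
package Set_Of_Sets is
   procedure Clear_All_Sets;                    --Remove every set from the ADT.
   procedure Add_Set ( Set_Name : in String );  --Add a new, empty set.
   function  Is_Set  ( Set_Name : in String ) return Boolean;
   procedure Insert  ( Set_Name : in String;
                       Item     : in Integer );
   procedure Delete  ( Set_Name : in String;
                       Item     : in Integer );
   function  Is_Member ( Set_Name : in String;
                         Item     : in Integer ) return Boolean;
   procedure Print_All_Sets;                    --Print every set and its contents.
end Set_Of_Sets;
```

The essential point is that every operation takes the set's name as a run-time value, so the client can create and use sets whose names are not known until execution.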
Some examples will illustrate the advantages of being able to name and define sets at
execution time.
Example 6.4.1.1. To illustrate the use of a set of sets, let us develop an algorithm for the
problem above, producing an index to a book. Assuming we have a set of sets package, the
main algorithm to generate one set for each word is:
Initialize
   Clear_All_Sets
   Page number <-- 1
Repeat for each word in the book
   If the word marks the start of a new page
      Page number <-- Page number + 1
   else
      If there is no set named by this word, add a new set with this name
      Insert Page number into the set named by this word
Terminate
   Print_All_Sets
end
This algorithm can be expanded to include error handling. The important feature at the
moment, however, is that we can add a new set with any name we want at any time we want.
This is a powerful capability.
Example 6.4.1.2. Given a set of records, each containing the name of a movie and the names of
two stars of the movie, output for each star a list of the movies the star was in.
The simplest solution here is to construct for each star a list of all the movies with this star.
In more detail, as the records are input, each movie title is inserted into two lists, one for each of
its two stars. When all of the records have been processed, there will be a list for each star and this
list will contain all of the titles for this star. All that remains is to output all of the sets. A
detailed algorithm is:
Initialize
   Clear_All_Sets
Repeat for each input record
   Repeat for each of the two stars in the record
      If there is no set named by this star, add a new set with this name
      Insert the movie title into the set named by this star
Terminate
   Print_All_Sets
end
This algorithm is easily generalized for records containing a title and an arbitrary number of
stars. The inner loop becomes:
Repeat for each star in the record
   If there is no set named by this star, add a new set with this name
   Insert the title into the set named by this star
Example 6.4.1.3. Design a simple airline reservation system to add a new flight, add a
passenger to a specified flight, delete a passenger from a flight, and to produce a passenger list for a
specified flight.
The first step in the solution is to specify the input in more detail. The following user
commands seem adequate:
NEWFLIGHT <F#>         adds a new flight with this flight number.
ADD <F#> <P_name>      adds this passenger to this flight.
DELETE <F#> <P_name>   deletes this passenger from this flight.
PRINT <F#>             prints the passenger list for this flight.
STOP                   terminates the program.
where <F#> stands for a particular flight number and <P_name> stands for a passenger name.
The basic solution to this problem is to assign a set to each flight. This set will store the
passenger list for the flight. Given this data structure, one possible algorithm is:
Initialize
   Clear_All_Sets
   More-to-do <-- true
Repeat while More-to-do
   Input next command
   If the command adds a new flight, add a new set named <F#>
   If the command adds a passenger, insert <P_name> into the set named <F#>
   If the command deletes a passenger, delete <P_name> from the set named <F#>
   If the command prints a passenger list, print the set named <F#>
   If the command terminates the program, More-to-do <-- false
Terminate
end
This algorithm can be expanded to include error handling and other refinements, but these
refinements are simple extensions of the basic algorithm above.
These three examples illustrate the advantage of being able to add new sets at any time
during program execution --- to allow the number of sets (and their names) to grow during
program execution to meet the demands of the problem.
There are several ways to extend the set of sets concept to handle more complex data
relationships. The simplest is to use a set of extended sets to solve the problem. In other words,
instead of using only standard set operations to solve the problem, assume the problem solution
needs sets with some more operations; that is, each set is an extended set. The next example
illustrates this concept by using sets extended to include an update operation.
Example 6.4.4. Each branch office of a company has an inventory of office furniture. The
Chicago branch, for example, might have so many desks, so many chairs, and so forth. The
problem is to keep track of this inventory as various offices add and remove various items.
One solution is to use a set for each office and let this set contain, for each kind of item, one
record containing the name of the item and the quantity on hand. Thus, there might be in each
office set a record specifying how many desks this office has. The system might then have
commands like the following:
NEWITEM <Office> <Item Name> <Quantity> inserts this new item with this quantity
into the specified office set.
UPDATE <Office> <Item Name> <Amount> updates the current value of quantity for
this item in this office by adding the
value of amount to the current value of
quantity.
This set of commands and the solution data structure are very similar to those used in the
other problems in this section. The major difference is the need for a set update command to
update the value of the quantity for a particular item in a particular office. This means that the list
of set commands, such as insert and delete, will have to be expanded to include an update
command. The mechanics for doing this are discussed in the next section, but for the moment it
suffices to know that any extended set operation, such as update, can be easily added to the set of
sets operations. Given this update operation, a possible algorithm is:
Initialize
   Clear_All_Sets
   More-to-do <-- true
Repeat while More-to-do
   Input next command
   If the command is NEWITEM, insert a record with <Item Name> and <Quantity>
      into the set named <Office>, adding a new set if necessary
   If the command is UPDATE, add <Amount> to the quantity stored for
      <Item Name> in the set named <Office>
   If the command terminates the program, More-to-do <-- false
Terminate
end
A second common way to extend the set of sets data structure is to allow each entry to
contain both a set of values and a collection of fixed fields, the so-called extended set of sets.
The next two examples illustrate the basic concept.
Example 6.4.5. An instructor wants a "grade book" to keep track of student grades. For each
student there is to be a list of quiz grades and the total points scored on all of the quizzes.
The obvious approach is to have one set, or more precisely, one bag for each student; this
bag will contain a list of the student's quiz scores. (Why a bag rather than a set?) Not so
obviously, associated with each bag is one additional value, the total points scored by this
student. Now, whenever a new value is inserted into a bag, this value is also added to the total
associated with this student. The mechanics of how this is done are left for the next section, but
assume for the moment that it is done. Also assume the system has the following user
commands:
NEW <Student> adds the new student.
ADD <Student> <Grade> adds the specified grade to the list of the specified student.
PRINT prints a list of all the students and, for each student, a list of
all of the quiz grades, and the total quiz score.
STOP terminates the program.
The algorithm is almost identical to the one above for the airline reservation system and is
left for the exercises.
The important feature of this example is that each set can have one or more additional,
non-set values which are altered by the set of sets package. This greatly extends the capability
of the set of sets package to many more kinds of problems. The next example illustrates an even
larger addition.
Example 6.4.6. A school wants to keep a set of student records. Each student record contains
the student's name, age, and major along with a set of courses the student has completed. Help
the school.
One solution to this problem is to have a set for each student, a set of completed courses.
Associated with each set is a record with the student's name, age and major. Now, the set of sets
has two insert operations, one for associating a record with a set and one for inserting an item
into the set. Similarly, the print operation will have to be extended to print both the associated
record and the set of values. With these extensions, the basic problem solution is
straightforward and left for the exercises.
While this section presents the set of sets and the set of bags ADTs, the same concept is
applicable to a set of stacks, a set of queues, a set of arrays, and even a set of integers ---
provided one wants to bind the data names at execution time rather than at compile time.
Exercises
1. The algorithm for Example 6.4.1 is very slow to execute because it must search the set before
inserting a new page number. Design a new insertion operation which speeds up this algorithm.
2. Design a system to produce a line-by-line index to a text; that is, the system is to input a text
and, for each word in the text, output a list of the line numbers of all the lines that contain that
word.
4. A school has a record for each student. This record contains among other things a list of the
clubs and organizations the student belongs to. The school wants for each club and organization
a list of all the student members. Help them.
5. A company keeps a list for each customer showing what items the customer has purchased in
the past. The company would like for each product a list of all the customers who have
purchased this product. Help them.
7. Design a simple checking account system for a bank based upon a set of bags ADT. Upon
command the system is to:
a. add a new checking account,
b. deduct a check from a specified account,
c. add a deposit to a specified account, and
d. produce a statement for a specified account.
Design your own commands.
8. Design a simple department store charge account system based upon a set of bags ADT.
Upon command the system is to:
a. add a new customer,
b. charge an item to a customer,
c. prepare statements for all the customers, and
d. enter a payment by a customer.
Design your own input commands.
10. A mail order company wants to keep a set of customer records. Each customer record
contains the customer's name, address and credit rating along with a list of items the customer
has bought in the past. Design your own input commands and then develop the corresponding
system.
11. A police department would like to keep a set of criminal records. For each person in the
system there should be a record with the person's name, last known address, and a list of
outstanding warrants, a list of arrests and a separate list of convictions. Help the department.
12. Design a system to store and retrieve one hundred digit integers. The individual integers
can be of any size up to one hundred digits long. Can you extend your system to perform
addition, subtraction, and multiplication of the integers?
6.4.2. Representation
Conceptually, a set of sets is a table with two columns; in each row of the table, the first
column contains the name of a set and the second column contains the set. For example, if the
set names are to be words and the set items are pages in a book where this word occurs, the table
might look like the following:
List 1, 5, 6, 9, ...
Structure 1, 6, 12, ...
Queue 3, 5, ...
Stack 2, 7, ...
... ...
where List, Structure, Queue, and Stack are set names and the integers are page numbers on
which the given word occurs.
We cannot physically insert a set in a column, but we can use a set data type as a column
entry. In other words, assume we have a set package with the usual set operations. This
package then allows us to define one or more variables to be of type Set. As usual, the Set data
type is essentially a pointer to the array or nodes containing the actual set. The data structure for
the set of sets is then a table that pairs each set name with one of these Set values.
This suggests we use a set package to implement all of the set operations. That is, given a set
operation with a set name and perhaps a data item, the operation can be implemented by:
1. Finding the row in the table with this set name.
2. Using the corresponding set package operation on that row's set.
In more detail, assume there is a Find( The_Set_Name ) function that will return the row of the
table containing this set name (or which will raise an exception if no such set name occurs in the
table). Then to implement, for example, the insert operation, we can use:
Row <-- Find( The_Set_Name )
Set_Package.Insert( Table(Row).Set_pointer, Data )
where the set package does the actual insertion of data into the set.
This technique is obviously applicable to the other set operations. It has several advantages.
First, it uses already developed software; this greatly simplifies the amount of effort involved in
the design. Second, the resulting program is easier to understand, develop, and follow, and is
more likely to be correct.
A detailed design of a set of sets, assuming the table is implemented using an array, is given
in Algorithm 6.4.2.1. The table can also be implemented using a linked list or even as a set, but
we leave these designs to the reader.
The data structure for implementing this table is similar to the one used before.
The only real difference between this version and the standard set of sets representation is the
addition of an extra column to the table. This column must, of course, be updated whenever an
item is inserted into the set or deleted from the set. The value also needs to be printed whenever
the set is printed. Otherwise this version and the standard one are the same. The algorithms are
left for the exercises.
Timing
For a standard set of sets representation implemented using a non-sorted set package
implementation, the execution times are:
Operation Timing
Clear_All_Sets O(1)
Add_New_Set O(Number of Sets)
Print_All_Sets O(Number of Sets) + O(Number of set values)
Is_Set O(Number of Sets)
Clear_Set O(Number of Sets)
Empty O(Number of Sets)
Insert O(Number of Sets) + O(Size of Set)
Delete O(Number of Sets) + O(Size of Set)
Print O(Number of Sets) + O(Size of Set)
Is_In O(Number of Sets) + O(Size of Set)
If an ordered array set package is used to implement the set of sets, then the execution times
remain the same for all but the Is_In operation. In this case, binary search is possible and the
time for Is_In becomes O(Number of Sets) + O( log2 (Size of Set) ).
Data Specification
Table_Maximum : Positive --Maximum number of sets.
Algorithms
Clear_All_Sets
Table_Size <-- 0
end clear
Add_New_Set( The_Set_Name )
If Table_Size = Table_Maximum then Overflow
If Is_Set( The_Set_Name )
then Error--set already exists
else Table_Size <-- Table_Size + 1
Table(Table_Size).Name <-- The_Set_Name
Set_Package.Create( Table(Table_Size).Set_Pointer )
end add_new_set
Print_All_Sets
Repeat for each row in table (for I = 1 to Table_Size)
Output: Table(I).Name
Set_Package.Print( Table(I).Set_Pointer )
end for
end print_all_sets
Is_Set( The_Set_Name )
Initialize
I <-- Table_Size
Table(0).Name <-- The_Set_Name
Repeat for each item in table until found (while Table(I).Name /= The_Set_Name)
I <-- I - 1
end while
Terminate
Return ( I > 0 )
end Is_Set
The remaining algorithms all assume the existence of a set package with standard set ADT
operations. The first routine below, Find, is used for the other operations.
Find( The_Set_Name )
Initialize
I <-- Table_Size
Table(0).Name <-- The_Set_Name
Repeat for each item in table until found (while Table(I).Name /= The_Set_Name)
I <-- I - 1
end while
Terminate
If ( I > 0 )
then Return (I)
else Error--no set called The_Set_Name
end find
Clear( The_Set_Name )
Row <-- Find( The_Set_Name )
Set_Package.Clear( Table(Row).Set_pointer )
end clear
Empty( The_Set_Name )
Row <-- Find( The_Set_Name )
Return( Set_Package.Empty( Table(Row).Set_pointer ) )
end empty
Exercises
1. Extend the set of sets algorithms in Algorithm 6.4.2.1 to include algorithms to:
a. test for full table,
b. delete a set from a set of sets,
c. print all set names in a set of sets,
d. test for all sets empty,
e. count items in specified set,
f. sum items in specified set,
g. sum items in all sets, and
h. print all sets sorted in alphabetical order of the set names.
For each algorithm, give the execution time using big O notation.
N.B. Some of these also require adding new operations to the Set ADT.
3. Develop a set of operations and a set of algorithms for an extended set of sets assuming:
a. the additional data associated with each set is treated as a single record for
insertion and printing, or
b. the additional data items associated with each set are treated independently for
insertion and printing.
4. Develop a set of operations and a set of algorithms for the extended set of sets used in
Example 6.4.5.
5. The set of sets representation above implemented the Table using an array. The Table can
also be implemented using a linked representation. In more general terms, the rows in the Table
form a set and can be implemented using any set representation method. Design a Table
representation using a:
a. sorted array representation, or
b. linked representation.
In either case, also develop detailed algorithms for the corresponding set of sets.
6. Sometimes a problem requires several distinct sets for each set name; for example, an
instructor's "grade book" program might require three sets for each student -- one set for quiz grades,
one set for homework grades, and one set for exam grades. Extend the set of sets ADT so each
set name can have several sets.
This implementation is for a single set of sets with arbitrary data types allowed for both the
set names and for the items in the sets.
The sorted set package of Specification 5.4.1 is used for the sets. The package has been
extended to include an iteration operation; this allows the set contents to be printed or, indeed,
any other iterative operation to be added later. This iteration operation has two
parameters:
1. the set name and
2. a pointer to a procedure with one parameter of the same type as the items
stored in the set.
The whole operation and its implementation are very similar to the one covered in Section 3.6,
Iteration. Assuming the set is stored in an array, an algorithm is:
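A minimal sketch of this set-package iteration, assuming the usual array representation with the set size stored alongside the array (the field names Size and Item are assumptions for illustration), is:

Iterate( The_Set, Put_Pointer )
Repeat for each item in the set (for I = 1 to The_Set.Size)
Put_Pointer( The_Set.Item(I) )
end for
end iterate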
Put_Pointer can of course point to any procedure with the same kind of parameter.
Combining this operation with the appropriate operation in the set of sets package allows the
user/client to print the contents of one set. The corresponding operation in the set of sets
package is very similar to the insert, delete, and other operations in Algorithm 6.4.2.1. (N.B.
The_Set_Name here is not the same as the one in the set package itself. This set name is the
actual set name whereas the one above in the set package is a pointer to the record containing the
set and its size.)
This operation not only allows the user/client to print one set, it also allows the user/client to
use any other kind of procedure in place of the Put procedure to process one set.
The set of sets also contains a new operation, iteration over all of the sets, which can be used
to implement the Print_All_Sets operation. This iteration operation also has two parameters:
1. a pointer to a procedure with one parameter of the same type as the set names
and
2. a pointer to a procedure with one parameter of the same type as the items
stored in the set.
An algorithm is:
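A sketch, built on the same table scan used by the other set of sets operations, is:

Iterate( Name_Procedure_Pointer, Item_Procedure_Pointer )
Repeat for each row in table (for I = 1 to Table_Size)
Name_Procedure_Pointer( Table(I).Name )
Set_Package.Iterate( Table(I).Set_Pointer, Item_Procedure_Pointer )
end for
end iterate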
This operation allows one to print all the sets because the first procedure can print a set name
and the second procedure can print an item in the set.
Since there is only one set of sets, the initialization of the set of sets package itself is done in
the package body when Table_Size is defaulted to zero at the point it is declared. The fastest set
implementation, the sorted array implementation, is the one used to store the sets.
The Ada specification of the set of sets package is given in Specification 6.4.3.1. There are
three major items of interest:
- Generic Parameters. The generic parameters start with two private data types,
one for the set names and one for the items in the sets. The next generic
parameter is a less than comparison operator that is needed in the sorted set
implementation. The final two generic parameters are the maximum number
of sets (needed because the set names are stored in an array) and the
maximum set size (because the sets are stored in an array).
- Instantiation of One Package Inside Another. The second item of interest is
the instantiation of the set package inside the package specification. This is
necessary because the set data type is not known until the set of sets package
is instantiated. This is the first time that we have used our own data structure
package to implement a second data structure package, but the basic concept
is no different from using IO packages inside a data structure package. The
only restriction is that the set package has a create operation instead of an
initialize operation.
- Passing Exceptions Through. The last item of interest is the way the set
exceptions are handled in the package body by being "caught" and reraised as
set of sets exceptions.
The remainder of the specification is mainly a list of the package operations.
Exercises
1. Develop an Ada set of sets package where the table is implemented as a linked list.
2. Develop an extended Ada set of sets package for a bank checking account system. Upon
command the system is to:
a. add a new checking account,
b. deduct a check from a specified account,
c. add a deposit to a specified account, and
d. produce a statement for all of the accounts.
The extended set of sets package will contain, for each account, the current value of the balance
on hand for this account.
3. A stock broker has, for each customer, a list of the stocks (and the number of shares of each
stock) owned by the customer. The broker wants a system to:
a. add a new customer,
b. alter the number of shares of a stock owned by the customer,
c. add a new stock to those currently owned by a customer, and
d. produce a statement for all of the customers.
Produce a set of sets package to solve this problem.
4. The table in the set of sets package can be implemented by a set package. Do so. Which set
implementation of the table gives the fastest set of sets implementation?
8. The set of sets package in this section allows only one set of sets. Extend it to allow any
number of sets of sets. Develop:
a. a suitable set of operations,
b. a data representation and a set of algorithms for implementing the operations, and
c. translate the algorithms into an Ada program.
What choices have to be made in each section?
with Ordered_Set_Package;
generic
type Set_Name_Type is private; --Data type of set names.
type Data_Type is private; --Data type of items in sets.
package Set_Of_Sets_Package is
type Name_Procedure_Access_Type is
access procedure (I : Set_Name_Type);
procedure Clear_All_Sets;
--deletes all sets and their values from memory.
procedure Iterate_One (
The_Set_Name : in Set_Name_Type;
Name_Procedure_Pointer : Name_Procedure_Access_Type;
Item_Procedure_Pointer : Set_Pkg.Procedure_Access_Type);
--prints the contents of one set.
--Exceptions: No_Such_Set_Name, Set_Is_Empty
procedure Iterate(
Name_Procedure_Pointer : Name_Procedure_Access_Type;
Item_Procedure_Pointer : Set_Pkg.Procedure_Access_Type);
--prints the contents of every set.
--Exceptions: No_Sets_To_Print
end Set_Of_Sets_Package;
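A client instantiates the package by supplying the generic actual parameters and then calls the operations directly. The sketch below is hypothetical: the formal names "<", Table_Maximum, and Set_Maximum follow the description of the generic part in the text rather than the specification excerpt above, which shows only the two type parameters.

with Set_Of_Sets_Package;
procedure Build_Index is
   --Hypothetical instantiation for a word/page-number index.
   package Word_Index is new Set_Of_Sets_Package
     (Set_Name_Type => String (1 .. 10),  --words being indexed
      Data_Type     => Positive,          --page numbers
      "<"           => "<",               --assumed formal, for the sorted sets
      Table_Maximum => 100,               --assumed formal, maximum number of sets
      Set_Maximum   => 500);              --assumed formal, maximum set size
begin
   Word_Index.Clear_All_Sets;
   Word_Index.Add_New_Set ("List      ");
   Word_Index.Insert ("List      ", 1);
   Word_Index.Insert ("List      ", 5);
end Build_Index;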
with Ada.Text_IO;
--Declarations
type Set_Entry is
record
Set_Name : Set_Name_Type; --Name of the set.
Set_Pointer : Set_Pkg.Set; --Pointer to the set.
end record;
------------------------------------------------------------
procedure Clear_All_Sets is
begin
Table_Size := 0;
end Clear_All_Sets;
------------------------------------------------------------
procedure Add_New_Set( The_Set_Name: in Set_Name_Type ) is
begin
--Check for exceptions
if Table_Size = Table_Maximum then
raise Sets_Overflow;
end if;
------------------------------------------------------------
procedure Iterate(
Name_Procedure_Pointer: Name_Procedure_Access_Type;
Item_Procedure_Pointer: Set_Pkg.Procedure_Access_Type) is
begin
--Check for exceptions.
if Table_Size = 0 then
raise No_Sets_To_Print;
end if;
------------------------------------------------------------
function Is_Set ( The_Set_Name : in Set_Name_Type )
return Boolean is
begin
--Initialize
I := Table_Size;
Table(0).Set_Name := The_Set_Name;
while Table(I).Set_Name /= The_Set_Name loop
I := I - 1;
end loop;
--Terminate
return I > 0;
end Is_Set;
------------------------------------------------------------
function Find ( The_Set_Name : in Set_Name_Type )
return Natural is
--Locates the subscript for a given set name. Returns the
--position in the table or raises an error if it is not found.
begin
--Initialize
I := Table_Size;
Table(0).Set_Name := The_Set_Name;
while Table(I).Set_Name /= The_Set_Name loop
I := I - 1;
end loop;
--Terminate
if I > 0
then return I;
else raise No_Such_Set_Name;
end if;
end Find;
Program 6.4.3.1 (Continued)
------------------------------------------------------------
procedure Clear ( The_Set_Name : in Set_Name_Type ) is
begin
Row := Find( The_Set_Name );
Set_Pkg.Clear( Table(Row).Set_Pointer );
end Clear;
------------------------------------------------------------
function Empty ( The_Set_Name : in Set_Name_Type )
return Boolean is
begin
Row := Find( The_Set_Name);
return Set_Pkg.Empty( Table(Row).Set_Pointer );
end Empty;
------------------------------------------------------------
procedure Insert ( The_Set_Name : in Set_Name_Type;
New_Data : in Data_Type) is
begin
Row := Find( The_Set_Name );
Set_Pkg.Insert( Table(Row).Set_Pointer, New_Data );
exception
when Set_Pkg.Set_Overflow => raise Set_Is_Full;
when Set_Pkg.Duplicate_Entry => raise Duplicate_Set_Data;
end Insert;
------------------------------------------------------------
procedure Delete ( The_Set_Name : in Set_Name_Type;
Old_Data : in Data_Type) is
begin
Row := Find (The_Set_Name);
Set_Pkg.Delete( Table(Row).Set_Pointer, Old_Data );
exception
when Set_Pkg.Item_Not_in_Set =>
raise Data_Not_Found_in_Set;
end Delete;
------------------------------------------------------------
function Is_In ( The_Set_Name: in Set_Name_Type;
The_Data : in Data_Type)
return Boolean is
begin
Row := Find( The_Set_Name );
return Set_Pkg.Is_In( Table(Row).Set_Pointer, The_Data );
end Is_In;
------------------------------------------------------------
procedure Iterate_One (
The_Set_Name : in Set_Name_Type;
Name_Procedure_Pointer: Name_Procedure_Access_Type;
Item_Procedure_Pointer: Set_Pkg.Procedure_Access_Type) is
begin
Row := Find( The_Set_Name );
Name_Procedure_Pointer( Table(Row).Set_Name );
Ada.Text_IO.New_Line;
Set_Pkg.Iterate( Table(Row).Set_Pointer,
Item_Procedure_Pointer );
exception
when Set_Pkg.Print_Set_Underflow => raise Set_Is_Empty;
end Iterate_One;
end Set_Of_Sets_Package;
TREES
Trees are used to describe almost anything with several levels and branches at each level.
Some common examples are hierarchies, decision sequences, management organizations, game
strategies, family trees, and arithmetic expressions. Any structure so widely used must be easy
to store and manipulate, so this chapter discusses these examples along with the representation
and use of trees.
7.1. Trees
The everyday world is full of examples of trees. The trees in Figure 7.1.1 are typical.
- The tree in Figure 7.1.1a is a management structure for a typical corporation. This
tree shows four vice presidents under the president, another layer of submanagers
under one of the vice presidents, and another layer of people under one of these
submanagers, and so forth. Each person in the tree is a node of the tree and the
president is the root of the tree.
- Figure 7.1.1b is an example of a decision tree for playing the game Animal,
Vegetable, or Mineral. In this case, each node is a question and the answer to the question
determines the next question.
- Figure 7.1.1c is an expression tree, a tree which is used to store arithmetic
expressions. To evaluate an expression tree, we start at the bottom of the tree and work our
way up the tree evaluating each lower level before evaluating the next level up the
tree. The tree in Figure 7.1.1c, for example, has the value a + (b*c). Any arithmetic
expression can be written as a tree in this form. Even more interesting, every
computer program can be written as a "parse tree," a tree which reflects the
program's structure.
- Figure 7.1.1d is an outline of a book with a subtree corresponding to each chapter of
the book and lower level subtrees for each section in each chapter.
Before progressing further, it is necessary to cover some technical vocabulary. Trees are
well known with many kinds and examples available, so a large vocabulary is also available.
The first definition is of a tree itself: A tree is either empty or consists of a node (called the
root) and zero or more subtrees. In Figure 7.1.1a, for example, the president is the root of the
tree and there are four subtrees. In Figure 7.1.1c the + sign is the root of the tree and there are
two subtrees.
An n-way tree is a tree where each node can have an arbitrary number of subtrees; Figures
7.1.1a and 7.1.1b are typical examples. In general, the subtrees are unordered; that is, there is
no particular precedence or
order among the subtrees. There are, however, ordered trees, ones with a particular order
imposed on the subtrees. The book outline in Figure 7.1.1d is an example of an ordered tree
where the order is imposed by the fact that the chapters are ordered.
A binary tree is a tree where each node has at most two subtrees and the subtrees are ordered
in the sense that one subtree is the left subtree and the other subtree is the right subtree. Figures
7.1.1c, e, and f are all examples of binary trees.
The next definitions concern the relationships between the nodes in a tree. The root of a tree
is the parent of the roots of the tree's subtrees. In Figure 7.1.1c, for example, the root, "+", is the
parent of the "a" and the "*" nodes. In Figure 7.1.1d, the root, Book, is the parent of the nodes
Chap1, Chap2, and Chap3.
Conversely, the root of the subtree is the child of the root of the tree. Thus, "a" and "*" are
children of "+". The general terminology follows that of a family tree. We have not only parents
and children, we also have siblings, grandparents, grandchildren, ancestors, descendants, and so
forth.
The number of children of a node is called the degree of the node. A leaf is a node of degree
zero, that is, a node with no children.
Figure 7.1.1c (expression tree):

    +
   / \
  a   *
     / \
    b   c

Figure 7.1.1d (book outline):

              Book
      _________|_________
     |         |         |
  Chap 1    Chap 2    Chap 3
  /  |  \     / \       / \
1.A 1.B 1.C 2.A 2.B   3.A 3.B
The level of a node is defined recursively: The root of the tree is at level 1 and every other
node has level one greater than its parent. In figure 7.1.1c, for example, the root, the + sign, is
level 1, the nodes a and the * sign are at level 2, and the nodes b and c are both at level 3.
Other definitions will be presented as they are needed. Table 7.1 summarizes some common
tree definitions.
- A tree is either empty or consists of an item (called the root) and zero or more trees
(called subtrees). In BNF terminology,
<tree> ::= null | ( <item> [<tree>]... )
where the [<tree>]... indicates zero or more subtrees.
- The root is the top item in the tree. (In everyday usage, we think of the root of a tree
as being at the bottom of the tree. In computer science, the root of a tree is usually
drawn at the top of the tree.)
- The parent, child, sibling, descendant, and ancestor of a node are all defined using
analogies to family trees.
- The root of a tree is said to have level 1 and every other node in the tree has level one
greater than its parent node. (N.B. Some authors define the root as level 0 so their
value of the level is one less than ours.)
- The height (or depth) of a tree is the maximum level of the tree.
- A complete tree is a tree that contains every possible node at every level down to a
given level and no nodes below that level. A degenerate tree is a tree with
exactly one node per level.
- A binary tree is an ordered tree with at most two subtrees per node.
The left subtree is the first subtree in a binary tree.
The right subtree is the second subtree of a binary tree.
- A binary search tree (BST) is a binary tree such that every node in the left subtree
lexicographically precedes the root and every node in the right subtree
lexicographically succeeds the root.
Before developing algorithms for generating and manipulating trees, one must understand
something about how many items can be stored in a tree and how many levels are contained in a
typical tree. These questions are trivial in an array or a linked list, but they take some thought
when discussing trees.
The simplest tree to analyze, in some sense, is a complete binary tree, a tree that contains
every possible node at every level down to a given level and no nodes below that level.
Figure 7.1.1e is a complete tree with three levels. A moment's
thought allows us to conclude that since each node has two children, each level has twice as
many nodes as the previous level. Expanding this thought gives the following table:
Level    Number of Nodes at this Level    Total Number of Nodes in the Tree
  1                  1                     1
  2                  2                     1 + 2 = 3
  3                  4                     1 + 2 + 4 = 7
  4                  8                     1 + 2 + 4 + 8 = 15
 ...                ...                    ...
  L               2^(L-1)                  1 + 2 + 4 + ... + 2^(L-1) = 2^L - 1
Note how fast the number of nodes increases with the level. A tree with ten levels can hold
1023 items and a tree with twenty levels can hold over one million items.
Let the height of the tree, H, be the maximum level of the tree. Let Size be the total number
of nodes in the tree; then, for a complete tree, clearly,
Size = 2^H - 1
or
Size + 1 = 2^H.
Taking the log base two of both sides gives:
Log2(Size + 1) = Log2(2^H) = H
or, in other words, as H and Size increase to the point where the 1 is negligible:
H ~ Log2(Size).
Thus, a complete tree with 1023 nodes (approximately equal to 1024 = 2^10) has a height of ten.
One of the interesting features of a complete tree is that over half of the nodes of the tree are
in the bottom level of the tree. Comparing the last two columns in the table above shows that
the exact ratio is
2^(L-1) / (2^L - 1) = 1 / (2 - 2^(1-L)).
This implies that the Average Node Level (ANL) is close to the height of the tree. To compute
the average level of a node in a tree we find the level of each and every node in the tree and then
compute the average of these levels. Starting with the table above, we note that there is:
1 node at level 1,
2 nodes at level 2,
ble right subtrees. So, the total number of trees with a k node left subtree is the product of B(k)
and B(n-k-1). Summing this product over all possible left subtree sizes gives:
B(n) = B(0)*B(n-1) + B(1)*B(n-2) + ... + B(n-1)*B(0).
The solution to this difference equation is beyond this text, but the final result is:
B(n) = (1 / (n + 1)) * C(2n, n)
where C(2n, n) is the binomial coefficient "2n choose n".
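As a check, the recurrence above can be evaluated directly, in the style of the other algorithms in this text:

B(0) <-- 1
Repeat for N = 1, 2, 3, ...
B(N) <-- 0
Repeat for K = 0 to N - 1
B(N) <-- B(N) + B(K) * B(N-1-K)
end for
end for

This gives B(1) = 1, B(2) = 2, B(3) = 5, and B(4) = 14, matching the closed form; for example, B(3) = (1/4) * C(6, 3) = 20/4 = 5.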
Strangely enough, a formula for the average level of a node in an arbitrary binary tree was
developed only in 1985. More details are given later in this chapter, but the final result is that
the average level in a general binary tree is O( log2 Size ). This implies most binary trees are
much closer to complete trees than to degenerate trees.
While most trees are n-way trees, we will start our study by examining only binary trees.
There are two reasons for this choice. First, binary trees are the simplest and most common type
of tree. Second, it turns out that once we understand binary trees, a minor variation on binary
tree techniques suffices for handling n-way trees.
Exercises
3. How many ancestor(s) does a node at level N have? What is the maximum number of nodes
at level N of a binary tree? In a three-way tree (each node has up to three children)?
4. What is the result of interchanging all left and right subtrees in a binary tree?
6. A strictly binary tree is a tree such that every node has either zero or two children. How
many nodes are there in a strictly binary tree with N leaves?
7. Develop the first four levels of a tree to play the game Animal, Vegetable, or Mineral.
The exact set of tree operations depends upon the kind of tree, but almost all trees share a
common core of basic operations.
This section presents binary trees because these are the simplest and the most commonly
used trees, but the same operations are also applied to n-way trees later in the chapter.
The most common kind of binary tree is the binary search tree (BST), a tree which is in
lexicographical order; that is, for each node N in the tree, every node in the left subtree of N
precedes N and every node in the right subtree of N succeeds N. Binary search trees can be
searched very quickly and are often used wherever a quick search is necessary. This section
presents first binary search trees and then ordinary binary trees.
One of the advantages of BSTs is their search speed. Consider, for example, searching the
BST:
D
/ \
B H
/ \
A C
for the letter C. Since C comes before the root D, we know that C must be in the left subtree. In
other words, if we are searching the BST for C, one comparison eliminates approximately one
half of the tree. Similarly since C comes after B, we know that C must be in B's right subtree.
Each further comparison eliminates approximately half the remaining tree. The maximum
number of comparisons is equal to the height of the tree. A tree of height 10 can store over 1000
items, so the resulting BST search is much faster than a sequential search of a list with 1000
items.
The basic search mechanism is:
Do-one-of
The_Data < Root of subtree: Search Left subtree
The_Data = Root of subtree: Found <-- true
The_Data > Root of subtree: Search Right subtree
end do-one-of
This is repeated over and over again until we find the item or until we reach the bottom of the
tree.
The maximum number of times this mechanism is repeated is equal to one plus the height of
the tree. For a degenerate tree with Size nodes, the search time can be as large as O( Size ). For
a complete tree, the search time is limited to O( log2 Size ). Note that there is a large spread
between these two search times. The discussion of the average search time and of various ways
of ensuring that the search time is close to the average and not the maximum is left for the end
of this section.
A general recursive search algorithm for a BST is:
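A sketch in the pseudocode style of this text, assuming each node contains Data, Left, and Right fields, is:

Search( Subtree, The_Data, Found )
If Subtree = null
then Found <-- false
else Do-one-of
The_Data < Subtree.Data: Search( Subtree.Left, The_Data, Found )
The_Data = Subtree.Data: Found <-- true
The_Data > Subtree.Data: Search( Subtree.Right, The_Data, Found )
end do-one-of
end search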
A non-recursive search routine is very similar to the recursive one. The major difference is
that it uses a while loop instead of recursive calls.
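A sketch of the while-loop version, under the same node-field assumptions, is:

Search( The_Tree, The_Data, Found )
Initialize
Subtree <-- The_Tree
Found <-- false
Repeat until found or bottom of tree (while Subtree /= null and not Found)
Do-one-of
The_Data < Subtree.Data: Subtree <-- Subtree.Left
The_Data = Subtree.Data: Found <-- true
The_Data > Subtree.Data: Subtree <-- Subtree.Right
end do-one-of
end while
Terminate
Return( Found )
end search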
While the recursive search routine is perhaps easier to follow, the non-recursive version
executes faster because it does not have the overhead of the extra recursive procedure calls. In
practice either one can be used, but the non-recursive version is usually preferred for speed
reasons.
A new item is normally inserted into the bottom of a BST. Thus, to insert a B into a BST
consisting of the single node D, we would insert the B at the bottom. Since B precedes D, B
must go to the left of D; that is,
D
/
B
To then insert an A into this tree we note that A precedes D and B, so the tree becomes:
D
/
B
/
A
Then to insert H in this tree, note that H comes after D, so the tree becomes:
D
/ \
B H
/
A
Finally, to insert a C, note that C comes after B but before D, so C becomes the right child of B:
D
/ \
B H
/ \
A C
The disadvantage of this simple approach is that at times it can lead to a completely unbalanced
tree. If, for example, we inserted the items A,B,C,D into a BST in that order, the resulting tree
degenerates into a sequential list.
A general, recursive algorithm for inserting an item into a BST is:

Insert( Tree, New_Data )
    If Tree is empty
    then Make New_Data the root of Tree
    else Do-one-of
             New_Data < Root of Tree: Insert( Left subtree of Tree, New_Data )
             New_Data = Root of Tree: ???
             New_Data > Root of Tree: Insert( Right subtree of Tree, New_Data )
         end do-one-of
end insert
Since some trees allow duplicate entries and some trees don't, the algorithm uses a ??? to
indicate that the handling of duplicate entries depends upon the circumstances.
A non-recursive insertion algorithm is much longer, but executes faster; it descends from the
root, remembering the parent node, and links the new item in at the bottom of the tree.
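As a rough illustration (in Python, not the book's Ada; the Node class and names are ours), a non-recursive insertion can be sketched as follows. It also demonstrates the degenerate case mentioned above: inserting A, B, C, D in that order produces a tree that is really a sequential list:

```python
class Node:
    """One node of a linked binary search tree."""
    def __init__(self, item):
        self.item = item
        self.left = None
        self.right = None

def insert(root, new_data):
    """Non-recursive BST insertion; returns the (possibly new) root.
    Duplicates are sent to the right subtree here -- one of the
    choices the book marks with ???."""
    new_node = Node(new_data)
    if root is None:
        return new_node
    parent, pointer = None, root
    while pointer is not None:
        parent = pointer
        pointer = pointer.left if new_data < pointer.item else pointer.right
    if new_data < parent.item:
        parent.left = new_node
    else:
        parent.right = new_node
    return root

# Inserting A, B, C, D in that order degenerates into a "list":
root = None
for item in "ABCD":
    root = insert(root, item)
```

Each new item ends up as the right child of the previous one, so every search of this tree is a sequential search.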
More sophisticated insertion techniques are available which, by one means or another, keep
the tree balanced as insertions are made. We will cover a few such methods in the Search
chapter later in this book.
Deletions are also delayed until the search chapter when we develop some tools to simplify
the process.
Timing
The average time to search a BST is equal to the Average Node Level in the tree, ANL.
ANL itself depends upon the configuration of the tree and the configuration of the tree in turn
depends upon the exact order in which the items were inserted into the tree. Consider, for
example, a tree with the three items a, b, and c. There are six different orders in which these
items can be inserted in the tree. (If a tree has N items, then there are N! different orders in
which the items can be inserted in the tree.) The six insertion orders and the corresponding trees
are:
abc:  a      acb:  a      bac:    b      bca:    b
       \           \            / \            / \
        b           c          a   c          a   c
         \         /
          c       b

cab:    c      cba:      c
       /                /
      a                b
       \              /
        b            a
To determine the average node level for a tree with three items, we have to compute the average
node level of each individual tree and then average these averages. The average node level for
each of the six trees is:
ANL(abc) = (1 + 2 + 3) / 3 = 2
ANL(acb) = (1 + 2 + 3) / 3 = 2
ANL(bac) = (1 + 2 + 2) / 3 = 5/3
ANL(bca) = (1 + 2 + 2) / 3 = 5/3
ANL(cab) = (1 + 2 + 3) / 3 = 2
ANL(cba) = (1 + 2 + 3) / 3 = 2
and the final average for all six trees with three items is:
ANL = (2 + 2 + 5/3 + 5/3 + 2 + 2) / 6
= (34/3) / 6
= 17 / 9.
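This average is easy to check by brute force. The following Python sketch (ours, not the book's) builds the BST for each of the six insertion orders and averages the node levels using exact fractions, so no rounding enters the final comparison:

```python
from fractions import Fraction
from itertools import permutations

def levels(order):
    """Insert the items of order into a BST; return the level of each item.
    A node is the list [item, left, right]; None is the empty tree."""
    root = None
    result = []
    for item in order:
        level, parent, node = 1, None, root
        while node is not None:
            parent = node
            node = node[1] if item < node[0] else node[2]
            level += 1
        new = [item, None, None]
        if parent is None:
            root = new
        elif item < parent[0]:
            parent[1] = new
        else:
            parent[2] = new
        result.append(level)
    return result

# Average the six per-tree averages, exactly as in the text.
anls = [Fraction(sum(levels(p)), 3) for p in permutations("abc")]
average = sum(anls, Fraction(0)) / len(anls)
```

The computed average is exactly 17/9, matching the hand calculation above.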
Generalizing this last result to a tree with Size items is a very difficult problem and was only
done successfully in the 1980s. The technique is too complicated and tedious to present here,
but the final result is very simple:
ANL = O( log2 Size ).
Hence:
The average search time in a BST is O( log2 Size ).
Note that the average search time is essentially equal to the time to search a complete tree and is
far from the time required to search a degenerate tree, so we can conclude that most trees are
much closer to a complete tree than they are to a degenerate tree. This is not to say that
degenerate trees cannot occur; it is that they are relatively uncommon.
The above analysis is dependent upon two assumptions that can greatly affect the result.
They are:
- All insertion orders are equally likely.
- Nothing is ever deleted from the tree.
The first assumption may not be true in practice and, in fact, may vary from one set of
insertions to the next.
The second assumption is even more interesting. Deletions greatly complicate the analysis
of the average search time. They complicate the analysis so much that the problem has never
been solved for a tree with more than three nodes!
Strangely enough, assuming that the average search time is O( log2 Size ) seems to give
reasonable results in most cases, but there are no guarantees. Later in this book we will examine
ways to ensure that the average search time is always O( log2 Size ) regardless of the insertion
order or deletions.
Determining the time required to insert an item into a BST is very similar to determining the
average search time. The reason is that since all of the items are inserted at the bottom of the
tree, the execution time of an insertion depends upon the levels of the bottom nodes in the tree.
For complete trees, the level of every bottom (or leaf) node is equal to the height of the tree, so
the average insertion time is O( log2 Size ). For a degenerate tree, the insertion time is O( Size ).
To determine an average insertion time, we must compute the insertion time over all possible
trees. The derivation of the result is too complicated to give here, but the final result is again
simple:
The average insertion time in a BST is O( log2 Size ).
As noted above, this is an average value and insertions can take times as long as O( Size ) in
certain cases.
We present the remaining tree operations for standard binary trees (BT), because the same
algorithms are applicable to both BSTs and BTs.
Printing
The print routines for a tree are interesting because there are many different ways to print the
same tree. One obvious way to print the BST above is in alphabetical order, ABCDH. To do
this we start by noting that the definition of a BST implies that to print the tree in alphabetical
order we must print everything in the left subtree before we print the root and we must print
everything in the right subtree after we print the root. This suggests a recursive approach:
which prints the left subtree before it prints the root and it prints the right subtree after it prints
the root.
Expanding this into a more detailed recursive routine gives:
In_Order_Print( Tree )
If Tree is not empty then
In_Order_Print ( Left subtree of Tree )
Output: Root of Tree
In_Order_Print ( Right subtree of Tree )
end in_order_print
This algorithm is called the in_order_print because the data comes out in alphabetical order.
There are other possible print orders. Recall the expression tree of Figure 7.1.c. If we use in
order print with this tree, we get a+b*c, the algebraic expression in standard infix notation. To
produce postfix notation, abc*+, from this tree we can use the recursive routine:
Post_Order_Print( Tree )
If Tree is not empty then
Post_Order_Print ( Left subtree of Tree )
Post_Order_Print ( Right subtree of Tree )
Output: Root of Tree
end post_order_print
which prints first the left subtree, then the right subtree, and last of all the root. The output is
called, for obvious reasons, postorder print.
To produce prefix notation, +a*bc, from the same tree we can use the recursive definition:
Pre_Order_Print( Tree )
If Tree is not empty then
Output: Root of Tree
Pre_Order_Print ( Left subtree of Tree )
Pre_Order_Print ( Right subtree of Tree )
end pre_order_print
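The three recursive print orders can be compared side by side with a short Python sketch (tuple-based nodes, a convention of ours rather than the book's). Applied to the expression tree for a+b*c, the three functions return the infix, prefix, and postfix strings given above:

```python
# A node is (item, left, right); None is the empty tree.
# The expression tree of Figure 7.1.c for a + b * c:
EXPR = ('+', ('a', None, None),
             ('*', ('b', None, None), ('c', None, None)))

def in_order(t):
    """Left subtree, root, right subtree."""
    return "" if t is None else in_order(t[1]) + t[0] + in_order(t[2])

def pre_order(t):
    """Root, left subtree, right subtree."""
    return "" if t is None else t[0] + pre_order(t[1]) + pre_order(t[2])

def post_order(t):
    """Left subtree, right subtree, root."""
    return "" if t is None else post_order(t[1]) + post_order(t[2]) + t[0]
```

Here in_order(EXPR) yields the infix form, pre_order(EXPR) the prefix form, and post_order(EXPR) the postfix form of the same expression.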
A fourth print order is breadth first print, which outputs the nodes one level at a time, from
left to right within each level. Applying breadth first print to the tree:
D
/ \
B H
/ \
A C
produces the first level D, then the second level BH, and then the third level AC. The complete
output is then DBHAC. Similarly, applying breadth first print to the expression tree of Figure
7.1.c produces +a*bc. Neither output seems very useful for these two trees, but there are cases
where breadth first print is necessary.
The basic algorithm uses a queue to store the next set of nodes to be processed:
Breadth_First_Print
    Initialize
        Clear Queue
        If Tree is not empty, then Enqueue( Tree )
    Repeat until Queue is empty
        Dequeue( Subtree )
        If Subtree is not empty then
            Output: Root of Subtree
            Enqueue( Left subtree of Subtree )
            Enqueue( Right subtree of Subtree )
end breadth_first_print
This algorithm works by inserting the level 1 node (the root) in the queue and, when it dequeues
the level 1 node, it inserts the level 2 nodes in the queue. When it dequeues level 2 nodes from
the queue it inserts the level 3 nodes in the queue, and so forth. One level at a time, this
algorithm inserts all the nodes into the queue. Since each node is printed when it is dequeued,
all the nodes are printed by their level in the tree.
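In executable form the queue-based scheme might look like this Python sketch (collections.deque plays the role of the queue; the Node class is ours, not the book's):

```python
from collections import deque

class Node:
    """One node of a linked binary tree."""
    def __init__(self, item, left=None, right=None):
        self.item, self.left, self.right = item, left, right

def breadth_first(tree):
    """Return the node items level by level, exactly as the algorithm above
    prints them: enqueue the root, then the children of each dequeued node."""
    out = []
    queue = deque()
    if tree is not None:
        queue.append(tree)
    while queue:
        node = queue.popleft()
        if node is not None:          # null children are enqueued too
            out.append(node.item)
            queue.append(node.left)
            queue.append(node.right)
    return "".join(out)
```

For the five-node BST with root D, the function returns the string DBHAC, one level at a time.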
Traversals
Many problems require "visiting" each and every node in a tree. Some examples are routines
to count or sum the nodes in a tree. The formal term for this is tree traversal. The four print
routines all visit every node in the tree; they just visit the nodes in different orders. Since the
visitation order does not matter for counting or summing nodes, we can use any of the print
routines as a model. There is one important difference, however. The inorder, preorder, and
postorder routines are all recursive; the breadth first routine is non-recursive. We will examine
both approaches.
To illustrate recursive traversal routines, let's start with a recursive routine to count the
nodes in a BT. Examining the recursive print routines suggests that it suffices to add 1 to the
counts of the nodes in the left and right subtrees; that is,
Count( Tree )
If Tree is empty
then Answer <-- 0
else Answer <-- 1 + Count( Left subtree of Tree )
+ Count( Right subtree of Tree )
Return( Answer )
end count
Note that, if the routine correctly counts the nodes in the left and right subtrees, it will count the
nodes in the whole tree. Since it correctly counts empty trees, it will correctly count trees with
one node and hence trees with any number of nodes.
A minor variation on this technique can be used to compute the height of a tree. We first
note that the height of a tree is 1 larger than the height of its deepest subtree. In other words, the
height of the tree
D
/ \
B H
/ \
A C
is 1 plus the height of the deeper of the two subtrees. In this example, the left subtree is of
height 2 and the right subtree is of height 1; since 1 + max[2,1] equals 3, the whole tree is of
height 3. This suggests the basic algorithm:
Height( Tree )
If Tree is empty
then Answer <-- 0
else Answer <-- 1 + Largest_of[ Height( Left subtree ), Height( Right subtree ) ]
Return( Answer )
end height
where Largest_of is a function which returns the largest of its two parameters.
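Both routines translate almost line for line into Python (tuple-based nodes again, our own convention, with None as the empty tree):

```python
def count(tree):
    """Count the nodes: 1 for the root plus the counts of the two subtrees."""
    if tree is None:
        return 0
    return 1 + count(tree[1]) + count(tree[2])

def height(tree):
    """Height: 1 plus the height of the deeper subtree; an empty tree is 0.
    Python's built-in max plays the part of Largest_of."""
    if tree is None:
        return 0
    return 1 + max(height(tree[1]), height(tree[2]))
```

For the five-node tree with root D used above, count returns 5 and height returns 3.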
To develop non-recursive traversal algorithms, we start with the breadth first algorithm. For
example, to count the nodes in a BT we can use:
Count
Initialize:
Clear Queue
If tree is not empty, then Enqueue Root of tree
Cnt <-- 0
Repeat until Queue is empty
    Dequeue( Subtree )
    If Subtree is not empty then
        Cnt <-- Cnt + 1
        Enqueue( Left subtree of Subtree )
        Enqueue( Right subtree of Subtree )
Terminate
Return ( Cnt )
end count
As before, this algorithm inserts all of the nodes of the tree into a queue. Since the algorithm
increments the value of Cnt each time it dequeues a non-null node, the result is a count of the
number of nodes in the tree.
We can also use the breadth first algorithm to compute the height of a tree non-recursively.
The essential idea is to insert into the queue some distinctive item, say a @, at the end of each
level of the tree; then, to compute the height of the tree, it suffices to increment the count every
time an @ is dequeued. An algorithm using this approach is:
Height
Initialize:
Clear Queue
If tree is not empty, then
Enqueue Root of tree
Enqueue @
Cnt <-- 0
Repeat until Queue is empty
    Dequeue( Subtree )
    If Subtree = @
    then Cnt <-- Cnt + 1
         If Queue is not empty, then Enqueue @
    else if Subtree is not empty then
         Enqueue( Left subtree of Subtree )
         Enqueue( Right subtree of Subtree )
Terminate
Return ( Cnt )
end height
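The level-marker idea can be exercised in Python, with None standing in for the @ sentinel (tuple-based nodes as in the earlier sketches; this version enqueues only non-empty subtrees so the sentinel cannot be confused with a null child):

```python
from collections import deque

def height(tree):
    """Breadth first height: a None sentinel in the queue marks the end of
    each level; the number of sentinels dequeued is the height."""
    cnt = 0
    queue = deque()
    if tree is not None:
        queue.append(tree)
        queue.append(None)            # sentinel: end of level 1
    while queue:
        node = queue.popleft()
        if node is None:
            cnt += 1
            if queue:                 # more levels still to come
                queue.append(None)
        else:
            if node[1] is not None:
                queue.append(node[1])
            if node[2] is not None:
                queue.append(node[2])
    return cnt
```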
The breadth first algorithm is the basis for many non-recursive tree algorithms; that is, many
non-recursive algorithms are generated by essentially taking the breadth first algorithm and
appropriately processing each node as it is dequeued.
We are now ready to appreciate a general solution to the tree traversal problem. Any tree
problem requiring tree traversal can be based upon either the recursive or the non-recursive print
algorithms. The general recursive algorithm for a traversal function is:
Traverse( Tree )
If Tree is empty
then Answer <-- ?
else Update Answer using Root of tree,
and Traverse( Left subtree of Tree )
and Traverse( Right subtree of Tree )
Return( Answer )
end traverse
The general non-recursive algorithm, based upon the breadth first algorithm, is:
Traverse
Initialize:
Clear Queue
If tree is not empty, then Enqueue Tree
Answer <-- ?
Repeat until Queue is empty
    Dequeue( Subtree )
    If Subtree is not empty then
        Update Answer using Root of Subtree
        Enqueue( Left subtree of Subtree )
        Enqueue( Right subtree of Subtree )
Terminate
Return ( Answer )
end traverse
Timing
Since all of the traversal algorithms (and the corresponding print algorithms) must process
each and every node in the tree, they must use at least O(Size) amount of time. Even worse, the
algorithms given above use at least O(2*Size) amount of time. To see this, consider the breadth
first algorithm above. This algorithm inserts into the queue each and every left and right child
subtree. Since each node in the tree has two children (one or both may be null), this implies
2*Size children are inserted into the queue. Since the repeat loop is repeated once for each entry
in the queue, the total execution time is O(2*Size).
A few moments' reflection shows that if a tree has Size nodes, then, including empty child
subtrees, it has 2*Size child subtrees. Furthermore, since each node has only one parent,
approximately one half of the child subtrees must be Λ. We can use this fact to halve the
number of loop repetitions by inserting only non-null child subtrees into the queue.
This new version of the breadth first algorithm is:
Traverse
Initialize:
Clear Queue
If Tree is not Λ, then Enqueue Tree
Answer <-- ?
Repeat until Queue is empty
    Dequeue( Subtree )
    Update Answer using Root of Subtree
    If Left subtree of Subtree is not Λ, then Enqueue( Left subtree of Subtree )
    If Right subtree of Subtree is not Λ, then Enqueue( Right subtree of Subtree )
Terminate
Return ( Answer )
end traverse
This algorithm is not as obvious as the earlier version, but, by inserting only non-null subtrees
into the queue, this algorithm is twice as fast as the earlier one.
A similar argument can be applied to the recursive versions. The in order print routine given
earlier calls itself for each child subtree in the tree giving a total execution time of O(2*Size).
The algorithm can be modified to only call itself for non-null child subtrees, but this eliminates a
test for the empty tree. To include a test for the empty tree, an additional "interface" routine is
used:
In_Order_Print( Tree )
    If Tree is not Λ, then In( Tree )
end in_order_print
In( Tree )
    If Left subtree of Tree is not Λ, then In( Left subtree of Tree )
    Output: Root of Tree
    If Right subtree of Tree is not Λ, then In( Right subtree of Tree )
end in
The resulting pair of algorithms halve the inorder print execution time to O(Size).
7.2.3. Representations
The two basic representations for trees are arrays and linked records. This is the first case
where the linked representation is easier to understand, so we will start with the linked
representation and leave the array representation for later.

7.2.3.1. Linked Representation
The linked representation simply stores each node in a record with three components: Item to
store the node value, and Left and Right to store the pointers to the left and right subtrees
respectively.
For example, each node has the form:

      Item
 Left      Right

where Item is the item stored in the node and Left and Right point to the left and right subtrees
respectively. The tree:
D
/ \
B F
/ / \
A E G
would be stored as follows:
Root --> [ D ]
        /     \
    [ B ]     [ F ]
    /   \     /   \
 [ A ]   Λ [ E ] [ G ]
  / \       / \   / \
 Λ   Λ     Λ   Λ Λ   Λ
One additional item is needed to represent the tree, a pointer to the root of the tree. Thus, to
represent a tree in the computer, we use one pointer, called the Root, to point at the node
containing the root of the tree. (Note that this terminology is a bit confusing. In one context, the
root is the value at the top of the tree; in another context, the root is a pointer to the node at the
top of the tree. Unfortunately, both uses are standard and the reader must determine the desired
meaning from the context.)
If Root is a pointer to the node containing the root of the tree, then
- Root.Item is the value of the root of the tree,
- Root.Left points to the left subtree, and
- Root.Right points to the right subtree.
In general, if P is a pointer to a node, then
- P.Item is the value of the node,
- P.Left is a pointer to the left subtree of the node, and
- P.Right is a pointer to the right subtree of the node.
With these substitutions, the non-recursive search routine:
Is_In( The_Data )
    Initialize
        Subtree <-- Tree
        Found <-- False
    Repeat while Subtree is not empty and not Found
        Do-one-of
            The_Data < Root of Subtree: Subtree <-- Left subtree of Subtree
            The_Data = Root of Subtree: Found <-- True
            The_Data > Root of Subtree: Subtree <-- Right subtree of Subtree
        end do-one-of
    Terminate
        Return( Found )
end is_in

becomes:

Is_In( The_Data )
    Initialize
        Pointer <-- Root
        Found <-- False
    Repeat while Pointer /= Λ and not Found
        Do-one-of
            The_Data < Pointer.Item: Pointer <-- Pointer.Left
            The_Data = Pointer.Item: Found <-- True
            The_Data > Pointer.Item: Pointer <-- Pointer.Right
        end do-one-of
    Terminate
        Return( Found )
end is_in
The advantages of the tree and subtree algorithms used earlier should be becoming clear by
now. They are independent of any particular representation, yet can be easily translated into
representation dependent algorithms. Furthermore, the same basic algorithms are used for any
representation rather than having to use independent algorithms for each representation.
Since Root is a pointer to the top node in the tree, to test for an empty tree it suffices to use:
Root = null
and the tree can be initialized to empty by the operation:
Root <-- null
The other operations are obvious alterations of the algorithms given earlier and Module
7.2.3.1.1 contains a complete set of detailed algorithms for a linked representation.
There is one minor difference between this set of algorithms and the ones used in the set
chapter. The insert, search, and delete algorithms in the set chapter all used a common find
routine. It is possible to use a similar approach for the tree algorithms, but the insert and search
algorithms in Module 7.2.3.1.1 each have their own find routine. One reason is to leave
open the option of inserting a duplicate entry into the tree, something that is not allowed in a set.
Module 7.2.3.1.1. Linked Representation
Data Specification
Root : Pointer to Node; --Pointer to root node of tree.
Node is record
    Item : ?? --Data item to be stored in tree.
    Left : Pointer to Node; --Pointer to left subtree of node.
    Right : Pointer to Node; --Pointer to right subtree of node.
end record;
Algorithms
Clear
Root <-- Λ
end clear
Empty
Return( Root = Λ)
end empty
Is_In( The_Data )
Initialize
Pointer <-- Root
Found <-- False
    Repeat while Pointer /= Λ and not Found
        Do-one-of
            The_Data < Pointer.Item : Pointer <-- Pointer.Left
            The_Data = Pointer.Item : Found <-- True
            The_Data > Pointer.Item : Pointer <-- Pointer.Right
        end do-one-of
    Terminate
        Return( Found )
end is_in
Insert( New_Data )
Find Location of New Node in Tree
Initialize
Pointer <-- Root
Parent <-- Λ
        Repeat until Pointer = Λ
            Parent <-- Pointer
            If New_Data < Pointer.Item
            then Pointer <-- Pointer.Left
            else Pointer <-- Pointer.Right
    Attach New Node to Tree
        Get New_Node
        New_Node.Item <-- New_Data
        New_Node.Left <-- Λ
        New_Node.Right <-- Λ
        Do-one-of
            Parent = Λ : Root <-- New_Node
            New_Data < Parent.Item : Parent.Left <-- New_Node
            otherwise : Parent.Right <-- New_Node
        end do-one-of
end insert
Pre_Order_Print
If Tree not Empty, then Pre( Root )
end pre_order_print
Pre( Pointer )
Output: Pointer.Item
If Pointer.Left /= Λ, then Pre( Pointer.Left )
If Pointer.Right /= Λ, then Pre( Pointer.Right )
end pre
In_Order_Print
If Tree not Empty, then IOP( Root )
end in_order_print
IOP( Pointer)
If Pointer.Left /= Λ, then IOP( Pointer.Left )
Output: Pointer.Item
If Pointer.Right /= Λ, then IOP( Pointer.Right )
end in
Post_Order_Print
If Tree not Empty, then Post( Root )
end post_order_print
Post( Pointer )
If Pointer.Left /= Λ, then Post( Pointer.Left )
If Pointer.Right /= Λ, then Post( Pointer.Right )
Output: Pointer.Item
end post
Breadth_First_Print
Initialize
Clear Queue
If Tree not Empty, then Enqueue( Root )
    Repeat until Queue is empty
        Dequeue( Pointer )
        Output: Pointer.Item
        If Pointer.Left /= Λ, then Enqueue( Pointer.Left )
        If Pointer.Right /= Λ, then Enqueue( Pointer.Right )
end breadth_first_print

7.2.3.2. Array Representation
To store a BST in an array, we use a table with three columns: the first column, Item,
contains the node value, the next column, Left, contains pointers to the row containing the Left
Child of the node, and the last column, Right, stores pointers to the row containing the Right
Child. Also, let the variable Root point at the row in the table containing the root of the tree.
Thus, if the value of Root is 3, then the root node is stored in the third row of the table.
With these assumptions, the binary tree:
D
/ \
B F
/ / \
A E G
can be stored in the tree table as follows:
Root = 3
Item Left Right
1 B 4 0
2 E 0 0
3 D 1 6
4 A 0 0
5 G 0 0
6 F 2 5
7 ... ... ...
where the entry in the Left or Right column is the row number of the left or right child of the
given node. Thus, reading from row 3, the left child of D is in row 1 and the right child is in
row 6. Note that any node can be stored in any row in the table. It is completely arbitrary, for
example, that the root is stored in the third row of the table or that the value G is stored in the
fifth row of the table.
This data structure uses the variable, Root, to store the row number of the row containing the
root of the tree. Note that
a. zero is used for the Λ pointer, and
b. a node can be stored in any row of the table as long as the pointers are consistent.
Part (b) implies that we have some way of keeping track of empty rows in the table. One
simple way is to use the rows in order and use a row only once. A more sophisticated scheme is
to use some means of tracking which rows are NOT in use at any given time. One might, for
example, keep a list of empty rows. The details are left for Exercise 17 at the end of this section.
In either case, the algorithms in Module 7.2.3.2.1 assume there is a routine to return the location
of an empty row.
There is one interesting new twist that makes the algorithms easier to read. So far we have
always stored a table as an array of records, one record for each row. There is another way to
store a table: as a series of parallel arrays, one array for each column in the table. This method
is called the parallel array representation of a table. Note that conceptually it makes no
difference whether we store a table by rows or by columns. In either case, the table is stored and the
table entries can be manipulated in any way desired.
To store the tree table using parallel arrays implies that we need three arrays: one array for
the Item column, one array for the Left column, and one array for the Right column. The advan-
tage of this technique is that rather than having to use:
Array(P).Item, Array(P).Left, and Array(P).Right
to refer to the entries in row P of the table, we can use:
Item(P), Left(P), and Right(P)
The result is a less cluttered, easier to read set of algorithms. The corresponding Ada programs
are also easier to read.
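The parallel array idea is easy to try out in Python lists (our own sketch, not the book's Ada; row 0 is left unused so that 0 can serve as the Λ pointer, exactly as in the table above):

```python
# Parallel arrays for the tree table above; row 0 is a dummy so that 0
# can play the role of the null pointer.
Item  = [None, 'B', 'E', 'D', 'A', 'G', 'F']
Left  = [0,     4,   0,   1,   0,   0,   2]
Right = [0,     0,   0,   6,   0,   0,   5]
Root  = 3

def in_order(pointer):
    """In order listing of the array-represented tree, as a string."""
    if pointer == 0:
        return ""
    return in_order(Left[pointer]) + Item[pointer] + in_order(Right[pointer])
```

Here in_order(Root) returns the string ABDEFG, the in order listing of the tree stored in the table.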
Module 7.2.3.2.1 contains detailed algorithms for an array representation. Again, the use of
representation independent algorithms at the beginning of this chapter simplifies the
development of the representation dependent algorithms.
Again, the insert and search algorithms each use their own find routine rather than a common
one as was used in the set chapter. The reason, as before, is that this allows inserting duplicate
entries into the tree, something that is impossible with a set.
7.2.3.3. Timing
For the first time we give not only the average execution time, but also the maximum
possible execution time for each operation. This is because, for the first time, there is a significant
difference between the two values.
Operation          Array                         Linked
              Average       Maximum        Average       Maximum
Is_In       O(log2 Size)    O(Size)      O(log2 Size)    O(Size)
Insert      O(log2 Size)    O(Size)      O(log2 Size)    O(Size)
Print         O(Size)       O(Size)        O(Size)       O(Size)
Module 7.2.3.2.1. Array Representation
Data Specification
Maximum_Size : Positive; --Maximum number of items in tree.
Item : array( 1 .. Maximum_Size ) of ?? --Data items, one row per node.
Left : array( 1 .. Maximum_Size ) of Natural; --Row of left child (0 = Λ).
Right : array( 1 .. Maximum_Size ) of Natural; --Row of right child (0 = Λ).
Root : Natural; --Row containing root of tree (0 = empty tree).
Algorithms
Clear
Root <-- 0
end clear
Empty
Return( Root = 0 )
end empty
Is_In( The_Data )
Initialize
Pointer <-- Root
Found <-- False
Repeat until Found or Pointer = 0
    Do-one-of
        The_Data < Item( Pointer ) : Pointer <-- Left( Pointer )
        The_Data = Item( Pointer ) : Found <-- True
        The_Data > Item( Pointer ) : Pointer <-- Right( Pointer )
    end do-one-of
Terminate :
Return (Found)
end is_in
Insert( New_Data )
If no empty row, then Overflow error.
    Get location, New_Row, of an empty row
    Item( New_Row ) <-- New_Data
    Left( New_Row ) <-- 0
    Right( New_Row ) <-- 0
    Find Location of New Node in Tree
        Initialize
            Pointer <-- Root
            Parent <-- 0
        Repeat until Pointer = 0
            Parent <-- Pointer
            If New_Data < Item( Pointer )
            then Pointer <-- Left( Pointer )
            else Pointer <-- Right( Pointer )
    Attach New Row to Tree
        Do-one-of
            Parent = 0 : Root <-- New_Row
            New_Data < Item( Parent ) : Left( Parent ) <-- New_Row
            otherwise : Right( Parent ) <-- New_Row
        end do-one-of
end insert
Pre_Order_Print
If Tree not Empty, then Pre( Root )
end pre_order_print
Pre( Pointer )
Output: Item(Pointer)
If Left(Pointer) /= 0, then Pre( Left(Pointer) )
If Right(Pointer) /= 0, then Pre( Right(Pointer) )
end pre
In_Order_Print
If Tree not Empty, then IOP( Root )
end in_order_print
IOP( Pointer )
If Left(Pointer) /= 0, then IOP( Left(Pointer) )
Output: Item(Pointer)
If Right(Pointer) /= 0, then IOP( Right(Pointer) )
end in
Post_Order_Print
If Tree not Empty, then Post( Root )
end post_order_print
Post( Pointer )
If Left(Pointer) /= 0, then Post( Left(Pointer) )
If Right(Pointer) /= 0, then Post( Right(Pointer) )
Output: Item(Pointer)
end post
Breadth_First_Print
Initialize
Clear Queue
If Tree not Empty then Enqueue( Root )
Repeat until Queue is empty
    Dequeue( Pointer )
    Output: Item( Pointer )
    If Left( Pointer ) /= 0, then Enqueue( Left( Pointer ) )
    If Right( Pointer ) /= 0, then Enqueue( Right( Pointer ) )
end breadth_first_print
Exercises
1. For each of the following data sets, sketch the tree that is generated when the data set is
inserted into the BST one item at a time.
a. A, B, C, D, E
b. A, D, E, C, B
c. B, C, D, E, A
d. A, C, E, D, B
e. A, E, B, C, D
3. What would the output be if the queue in the breadth first algorithm were replaced by a
stack? Use this information to develop a non-recursive preorder print algorithm.
5. Can two different trees produce the same result under pre-order, post-order, or in-order
traversal?
6. Develop a recursive and a non-recursive routine to determine the parent of a specified item
in a BST.
8. Write procedures to
a. sum the nodes,
b. count the leaves,
c. sum the leaves,
d. delete the leaves,
e. print the leaves,
f. print the non-leaves,
g. interchange all left and right subtrees,
12. Develop an algorithm to determine if a tree is strictly binary; that is, every node has either
zero children or two children.
15. Develop an insertion algorithm for a tree where every node contains a pointer to its parent.
Are there any advantages to such a tree?
16. Develop an insertion and a search algorithm for a tree where every node has two parents.
Are there any advantages to such a tree?
17. The array representation of a tree needs some method of keeping track of unused rows. One
method is to keep a linked stack of empty rows and, each time a row is needed, an empty row is
popped off of this stack. Develop an algorithm to do this.
Hint: The pointers for the stack can be stored in the right column of the tree table.
7.3. Using Trees

Example 7.3.1. Implement a bag using a binary search tree.
We can speed up bag searches if we implement the bag using a tree as the storage
mechanism. Assuming the tree is balanced, the search time is then O(log2 Size) instead of O(Size)
which is considerably faster, especially as the value of Size increases. Basically, we simply
replace each bag operation by the corresponding tree operation; that is,
Bag.Clear --> Tree.Clear
Bag.Empty --> Tree.Empty
Bag.Insert --> Tree.Insert
Bag.Is_In --> Tree.Is_In
Bag.Print --> Tree.In_Order_Print
The only bag operation which causes difficulty is the Bag.Delete because we do not yet have
a Tree.Delete. For now we will include a Deleted field in each tree node. This field is false
when the node is created, but is marked true if the node is deleted. The search and print routines
then have to be altered slightly to skip deleted nodes. This solution is not very satisfying, but it
does work and it is simple to implement. We will consider more sophisticated methods later.
The tree representation of a bag is much faster at searching than the array or linked
representations of a bag presented in the last chapter -- provided that the tree is balanced. The tradeoffs
or cost of using a tree representation are (1) the need to keep the tree balanced and (2) the tree
has a slower insertion operation. The time to insert an item into a bag now varies from O(Size)
to O(log2 Size) rather than O(1). This can be much slower than the bag representations given
earlier, so the tree representation of a bag is normally used for applications with few insertions
but many searches.
Example 7.3.2. Develop an algorithm to evaluate an expression tree.
To develop this algorithm, let us start with some simple cases and work our way up to a
complete algorithm. The simplest case is a tree with a single value, say, for example, the tree:
5
That is, the root is the only value in the tree. In this case, evaluating the tree is only a matter of
returning the value of the root. The algorithm is then (assuming a linked representation):
Return( Root.Item )
In general, a node contains either a number or an operator. If the node contains a number,
the value of the node is the number itself; if the node contains an operator, the value of the node
is obtained by applying the operator to the values of the left and right subtrees. This gives the
recursive routine:
Eval( Pointer )
Do-one-of
Pointer.Item = Number : Answer <-- Pointer.Item
Pointer.Item = + : Answer <-- Eval(Pointer.Left) + Eval(Pointer.Right)
Pointer.Item = - : Answer <-- Eval(Pointer.Left) - Eval(Pointer.Right)
Pointer.Item = * : Answer <-- Eval(Pointer.Left) * Eval(Pointer.Right)
Pointer.Item = / : Answer <-- Eval(Pointer.Left) / Eval(Pointer.Right)
end do-one-of
Return( Answer )
end eval
Expanding this algorithm to include a test for an empty tree gives a complete algorithm for
evaluating an expression tree.
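A Python rendering of the evaluator (our own tuple-based nodes, with numbers at the leaves rather than the letters used above):

```python
def eval_tree(node):
    """Evaluate an expression tree whose leaves are numbers and whose
    interior nodes are one of the four operators."""
    item = node[0]
    if item == '+':
        return eval_tree(node[1]) + eval_tree(node[2])
    if item == '-':
        return eval_tree(node[1]) - eval_tree(node[2])
    if item == '*':
        return eval_tree(node[1]) * eval_tree(node[2])
    if item == '/':
        return eval_tree(node[1]) / eval_tree(node[2])
    return item          # a leaf: the item is the number itself
```

Evaluating the tree for 2 + 3 * 4 returns 14, just as the pseudocode would.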
Example 7.3.3. Develop an algorithm to insert a postfix expression into an expression tree;
for example, given the postfix expression A B C * + ;, the algorithm should produce the tree:
+
/ \
A *
/ \
B C
The algorithm to do this is very similar to the algorithm for evaluating a postfix expression.
(Review Section 3.3.1 before continuing.) The major difference is that instead of stacking
values as we proceed, we insert the values into nodes and stack pointers to the nodes. Thus, if
each node is of the form:
Node is record
Item : value of some sort
Left : pointer to the left subtree
Right : pointer to the right subtree
end record
and, if we are processing A B C * + ;, then instead of inserting the A into the stack, we would
get a new node, insert A into the Item part of the node, set the values of Left and Right to Λ, and
then insert a pointer to this new node into the stack.
Continuing in this way with the expression, A B C * + ;, we
1. insert a pointer to A into the stack,
2. insert a pointer to B in the stack,
3. insert a pointer to C in the stack, and then
4. process the operators.
To process an operator, create a new node, insert the operator into the Item part of the node, pop
two pointers from the stack, make these pointers the values of the Left and Right parts of the
new node, and lastly push a pointer to the new node onto the stack.
A more detailed algorithm is:
Initialize
Clear stack
More-to-do <-- true
Repeat while More-to-do
    Get next Symbol from expression
    Do-one-of
        Symbol = ; : More-to-do <-- false
        Symbol is a value : Get new Node
                            Node.Item <-- Symbol
                            Node.Left <-- Λ; Node.Right <-- Λ
                            Push( pointer to Node )
        Symbol is an operator : Get new Node
                                Node.Item <-- Symbol
                                Node.Right <-- Pop
                                Node.Left <-- Pop
                                Push( pointer to Node )
    end do-one-of
Terminate
Stack must contain exactly one Pointer or there is an Error
Return( Pop ) which is root of expression tree
end
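The stack-driven construction can be sketched in Python (tuple-based nodes and single-character tokens, conventions of ours rather than the book's):

```python
def postfix_to_tree(expression):
    """Build an expression tree from a postfix string terminated by ';'.
    Operands become new leaf nodes; an operator pops two subtrees,
    right operand first, and pushes the combined tree."""
    stack = []
    for symbol in expression:
        if symbol == ';':
            break
        if symbol in '+-*/':
            right = stack.pop()      # top of stack is the RIGHT operand
            left = stack.pop()
            stack.append((symbol, left, right))
        else:
            stack.append((symbol, None, None))
    assert len(stack) == 1, "malformed postfix expression"
    return stack.pop()
```

For the expression A B C * + ; this produces the tree with root +, left leaf A, and right subtree * over B and C, as in Example 7.3.3.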
A priority queue is a queue where every entry has a "priority" and the dequeue operation
returns the item in the queue with lowest (or highest) priority and, in case of a tie, the one
enqueued first. A simple example is a repair service that assigns a priority to each repair and
then processes the items on the basis of highest priority first.
We could implement a priority queue by a linked list. To enqueue an item in a linked list
representation of a priority queue, we must first search the linked list and then insert the new
entry in the correct position in the linked list. This requires time O(Size) to execute an enqueue
and time O(1) to dequeue. A faster execution time is possible if we store the items in a binary
search tree based upon priority. Thus, inserting the items:
[F, 50], [M, 30], [Q, 70], [J, 75]
(where the letter indicates an item and the integer is its priority) gives the queue:
[F, 50]
   /      \
[M, 30]   [Q, 70]
              \
            [J, 75]
Dequeue( Data )
Initialize
Pointer <-- Root
If Pointer = Λ, then Underflow Error
    Find leftmost node in tree
        Parent <-- Λ
        Repeat while Pointer.Left /= Λ
            Parent <-- Pointer
            Pointer <-- Pointer.Left
    Remove leftmost node from tree
        If Parent = Λ
        then Root <-- Pointer.Right
        else Parent.Left <-- Pointer.Right
    Terminate
        Data <-- Pointer.Item
end dequeue
which locates and returns the lower leftmost item in the tree. Since this is a dequeue operation,
it must also delete the item from the tree. (Make sure you understand how the item is deleted.)
The execution time is O(Size) to O( log2 Size ) depending upon the tree.
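A Python sketch of this priority queue (list-based nodes of our own devising, ordered on priority with the lowest priority dequeued first; ties go to the right subtree so that equal priorities come out in arrival order):

```python
def enqueue(root, item, priority):
    """Insert into a BST ordered on priority; returns the (new) root.
    A node is the list [item, priority, left, right]."""
    if root is None:
        return [item, priority, None, None]
    if priority < root[1]:
        root[2] = enqueue(root[2], item, priority)
    else:                       # equal priorities go right: FIFO on ties
        root[3] = enqueue(root[3], item, priority)
    return root

def dequeue(root):
    """Remove and return (item, new_root) for the leftmost, i.e. lowest
    priority, node; raises IndexError on an empty queue."""
    if root is None:
        raise IndexError("underflow: empty priority queue")
    parent, pointer = None, root
    while pointer[2] is not None:       # walk to the leftmost node
        parent, pointer = pointer, pointer[2]
    if parent is None:                  # the root itself was leftmost
        root = pointer[3]
    else:                               # splice in its right subtree
        parent[2] = pointer[3]
    return pointer[0], root
```

Enqueueing [F, 50], [M, 30], [Q, 70], [J, 75] and then dequeueing repeatedly returns the items in the order M, F, Q, J, i.e. by ascending priority.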
If the tree remains reasonably well balanced, this representation can be much faster than a
linked list representation. There is a data structure called a heap which guarantees that the tree
is always balanced, so the priority queue execution times are minimal. We will present this
version of a priority queue in the section on heap sort.
These examples illustrate an interesting way that trees differ from the ADTs covered in
earlier chapters. Stacks, queues, pipes, and sets are all ADTs; that is, a set of values and the
operations on the values, implemented using an array or a linked structure. All three
examples above implement an ADT (a set, an expression, or a priority queue) using a tree as a
representation data structure. Since the tree is a data structure in these cases, it can be replaced
by another data structure without changing the desired ADT. (This is not true of an ADT,
because altering an ADT requires changing one of its operations, which obviously alters the
ADT.) In other words, a tree is both an ADT and a technique for storing and representing an
ADT; a technique which is used because it is faster or easier than some other data structure.
Exercises
1. Show how a set could be represented by a binary search tree. Compare the time and space
requirements of using a binary tree to:
a. insert and search for an item, with and without the tree, and
b. find the maximum, the minimum, and the kth smallest item in a set, compared with
doing the same operations by first sorting the set.
3. Design a tree data structure which uses a Deleted Field to denote a deleted item. Then
design a complete set of tree algorithms using this data structure.
4. Represent a set by using a binary tree such that each set element is a leaf of the tree.
Develop algorithms to generate and search such sets.
5. How could we implement a set of sets using a tree? What are the advantages of this
representation?
6. Assume a tree is defined with an order such that every parent precedes its children and every
left child precedes the corresponding right child.
a. What properties would such trees have?
b. Name a use for trees with this property.
c. Design an insert algorithm to produce such trees.
7. Develop an algorithm to print an expression stored in a tree so that the output version of the
expression is completely parenthesized.
9. Develop a routine to evaluate an expression tree. Assume the tree is implemented using an
array.
11. Design a family tree program. The program is to input family data and then be able to give
the parents, grandparents, ancestors, children, grandchildren, and descendants of any specified
person.
12. A tree can also be represented in the form of a list where each entry in the list:
a. is empty, (), or
b. consists of a root, a left subtree, and a right subtree enclosed in parentheses; e.g.,
(A (B () ()) () )
is the tree with root A and left subtree B and right subtree ().
Develop an algorithm to output a binary tree as a list.
13. Develop an algorithm to insert an item in a priority queue assuming the queue is imple-
mented using a tree.
Translating the tree representations above into Ada code is straightforward. We only give
one version here, the generic tree object package using the linked representation. While trees
can contain duplicate items, this implementation does not allow duplicate entries.
The implementation has only two exceptions, one for attempting to insert a duplicate item
into the tree and one for attempting to print an empty tree. This last exception is not absolutely
necessary; one could simply ignore the fact that the tree is empty or let the print procedure
output a message that the tree is empty, but raising an exception allows the user/client program
to determine its own way of handling this case.
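For example, a user/client program might choose to handle the empty-tree case in its own way. A sketch only: Put_Item and the message text are assumptions for illustration, and Tree_Empty is the exception raised by the package in this section.

```ada
--  Hypothetical client fragment.  Put_Item is assumed to be a value of
--  the package's Procedure_Pointer type designating the client's own
--  output procedure; Ada.Text_IO is assumed visible for Put_Line.
begin
   In_Order_Print( Put_Item );
exception
   when Tree_Empty =>
      Put_Line( "Tree is empty; nothing to print." );
end;
```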
Exercises
1. Develop a tree package which uses an array for storing the tree.
2. Develop a generic, data type package for trees using a linked structure. The design should
allow duplicate entries in a tree. The package should include initialization and exceptions,
including an exception for running out of memory.
generic
package Tree_Package is
procedure Clear;
--sets the tree to empty.
type Node;
type Node_Pointer is access Node;
--------------------------------------------------------------
procedure Clear is
begin
Root := null;
end Clear;
--------------------------------------------------------------
function Empty return Boolean is
begin
return Root = null;
end Empty;
--------------------------------------------------------------
function Is_In( The_Data : in Data_Type) return Boolean is
begin
   --Initialize
   Pointer := Root;
   Found := false;
   --Repeat for each level of the tree until data item is found.
   while (Pointer /= null and not Found) loop
      if The_Data < Pointer.Item then
         Pointer := Pointer.Left;
      elsif The_Data > Pointer.Item then
         Pointer := Pointer.Right;
      else
         Found := true;
      end if;
   end loop;
   --Terminate
   return Found;
end Is_In;
--------------------------------------------------------------
procedure Insert( New_Data : in Data_Type) is
begin
   --Find location of new node in tree.
   --Initialize
   Pointer := Root;
   Parent := null;
   --Search down the tree, remembering the parent of the current node.
   while Pointer /= null loop
      Parent := Pointer;
      if New_Data < Pointer.Item then
         Pointer := Pointer.Left;
      elsif New_Data > Pointer.Item then
         Pointer := Pointer.Right;
      else
         raise Duplicate_Data; --exception name assumed; duplicates not allowed
      end if;
   end loop;
   --Link the new node to the tree.
   if Parent = null then
      Root := new Node'( Item => New_Data, Left => null, Right => null );
   elsif New_Data < Parent.Item then
      Parent.Left := new Node'( Item => New_Data, Left => null, Right => null );
   else
      Parent.Right := new Node'( Item => New_Data, Left => null, Right => null );
   end if;
end Insert;
--------------------------------------------------------------
procedure I_O_P( Pointer : in Node_Pointer;
                 Put_Pointer : Procedure_Pointer) is
--recursive routine to print out each node of the
--tree in order if there is anything to print.
begin
   if Pointer.Left /= null then
      I_O_P( Pointer.Left, Put_Pointer );
   end if;
   Put_Pointer( Pointer.Item );
   if Pointer.Right /= null then
      I_O_P( Pointer.Right, Put_Pointer );
   end if;
end I_O_P;
--------------------------------------------------------------
procedure In_Order_Print (Put_Pointer: Procedure_Pointer) is
begin
if not Empty
then I_O_P( Root, Put_Pointer );
else raise Tree_Empty;
end if;
end In_Order_Print;
end Tree_Package;
All of the print and traversal routines we have studied so far use recursion or a stack or a
queue. All of these techniques are slow and space consuming. The difficulty is that there seems
to be no easy way to find the next inorder, preorder, or postorder entry in a binary tree. What
we really need is some way to include in each node a pointer to, say, the next inorder node. Then
to do an inorder print or traversal, we can use this additional pointer to go directly to the next
inorder node. There is then no need for recursion or a stack or a queue.
The difficulty with this approach is that all those extra pointers can themselves use a lot of
space. There is a compromise, however, that is almost as fast and uses no extra space. If you
look carefully at the leaves of a binary search tree, you will note that both the left and the right
pointers of every leaf are Λ. More precisely, recall that every tree has (Size + 1) Λ pointers. Λ
pointers are not very useful and, with care, we can replace the Λ pointers with pointers to the
next inorder, preorder, or postorder node without paying a large space penalty.
A threaded tree replaces the Λ pointers in a standard binary tree by "threads" which point to
the next in-order node in the tree. The result is a fast, space efficient representation. Figure
7.5.1, for example, contains a threaded tree where the threads are denoted by dashed arrows. In
particular, there is a thread from A to its successor B, another thread from B to its successor D,
and a third thread from E to its successor F.
[Threaded binary search tree diagram: nodes A, B, D, E, F, and G, with threads drawn as dashed arrows]
Figure 7.5.1
While threads are denoted by dashed arrows in drawings, they are implemented by storing a
standard pointer in the Right portion of a tree node and there is always some way to determine
whether the Right portion contains a regular pointer or a thread pointer. In the array representa-
tion of a threaded tree, the pointers are positive integers and the threads are negative integers.
(More details are given below.) In the linked representation of a threaded tree, nodes are
expanded to include one more field, a Boolean variable used to specify whether Right is a thread
or a standard pointer. (Again, more details are presented below.) For now, it suffices to know
that threads are essentially standard pointers stored in the Right field of a node and that the
threaded tree procedures can determine whether the Right field contains a thread or a standard
pointer.
While the threads make it trivial to determine some successors (such as the successor of A or
B in Figure 7.5.1), it is slightly more difficult to determine the successor of, for example, nodes
D and F in the same threaded tree. The successor to F is clearly G; that is, the right child of F.
Determining the successor of D requires more work. Since the right child of D, F, itself has a
left child, finding the successor of D requires going first to F and then as far to the left from F as
possible. Assuming a linked representation, the general scheme to determine the successor of
node N is:
Successor( N )
  If N.Right is a thread
    then Successor <-- N.Right
    else
      Pointer <-- N.Right
      --Go as far left from Pointer as possible
      If Pointer /= Λ
        while Pointer.Left /= Λ
          Pointer <-- Pointer.Left
        end while
      end if
      Successor <-- Pointer
  Return( Successor )
end successor
With this algorithm to find the successor of any node in a threaded tree, it is now possible to
traverse the tree non-recursively without using a queue or a stack. For example, assuming a
successor function equivalent to the algorithm above, an inorder print is now:
In_Order_Print
  --Find first (smallest) item in tree
  Pointer <-- Root
  If Pointer /= Λ then
    while Pointer.Left /= Λ
      Pointer <-- Pointer.Left
    end while
    --Visit Nodes
    while Pointer /= Λ
      Output: Pointer.Item
      Pointer <-- Successor( Pointer )
    end while
end in_order_print
Note how the successor function is used to eliminate recursion without the use of a stack or a
queue. This greatly simplifies, and speeds up, any algorithm requiring traversal of the tree.
The standard tree algorithms work with threaded trees with one minor modification. The
main difficulty is that a node at the bottom of the tree can now contain either a Λ pointer or a
thread. All of the algorithms that test for a Λ pointer must also test for a thread. Several
examples are given in the representation section.
7.5.1. Representations
As usual, we give two possible representations, an array representation and a linked repre-
sentation. Because both representations are so similar to the standard BST representations, many
of the algorithms are left for the reader.
The data structure remains the same as for a standard binary search tree except that the Right
pointer can be either a pointer or a thread. If the right pointer is a positive integer, it is a pointer;
for example, 3 is a pointer to row 3. If the right pointer is a negative integer, then it is a thread.
Note that a thread is simply the negative value of the row associated with the thread; for
example, -3 is a thread pointing to row 3. This makes it easy to determine whether the Right
pointer is storing a pointer or a thread. For example, the threaded BST:
[Threaded BST diagram with nodes A, B, D, E, F, and G, and its array representation; each row holds the Item, Left, and Right values of one node]
Root = 3
Note that we still have not used all of the zero-valued pointers. We still have space that might
be used to store pre-order or post-order threads.
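One possible Ada rendering of this array data structure is the following sketch; the type and field names are assumptions, not the book's exact declarations.

```ada
--  Sketch of the array representation of a threaded tree.
--  0 is the null pointer; a positive Right is a pointer (a row number);
--  a negative Right is a thread to row (abs Right).
type Tree_Node is record
   Item  : Data_Type;  --one data item (Data_Type assumed generic)
   Left  : Integer;    --row of the left subtree, or 0
   Right : Integer;    --row of the right subtree; thread if negative; 0 if null
end record;
Tree : array (1 .. Maximum_Size) of Tree_Node;
Root : Integer := 0;   --row of the root node; 0 when the tree is empty
```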
To find the next inorder node using this data structure, we have to consider two possibilities.
a. Right is a thread, then the absolute value of Right is the location of the next
inorder node.
b. Right is not a thread, so it is a pointer and we must go down and as far to the
left as possible from this new node. Thus, the Right of D is a pointer to F, so
we go as far left as possible starting at F; in this case, E is the next inorder item.
Both cases lead us quickly to the next inorder node without recursion, stacks, or queues. An
algorithm to determine the inorder successor of a given node is:
In_Order_Successor( Last )
Next <-- Right( Last )
If Next > 0 then --it's a non-null pointer, so go left
while Left( Next ) /= 0
Next <-- Left( Next )
end while
Return( absolute value of Next )
end in_order_successor
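Rendered in Ada over the array representation, the successor function might look like this. A sketch only: Left and Right are assumed to be integer-array lookups matching the pseudocode, with 0 as the null pointer.

```ada
--  Returns the row of the next inorder node, or 0 if there is none.
function In_Order_Successor( Last : Integer ) return Integer is
   Next : Integer := Right( Last );
begin
   if Next > 0 then                  --a non-null pointer, so go left
      while Left( Next ) /= 0 loop
         Next := Left( Next );
      end loop;
   end if;
   return abs Next;                  --abs turns a thread into a row number
end In_Order_Successor;
```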
Once this procedure is available, an algorithm to produce an inorder print of the threaded
tree is:
In_Order_Print
--Find first (smallest) item in tree
Pointer <-- Root
If Pointer /= 0 then
while Left( Pointer ) /= 0
Pointer <-- Left( Pointer )
end while
--Visit Nodes
while Pointer /= 0
Output: Item( Pointer )
Pointer <-- In_Order_Successor( Pointer )
end while
end in_order_print
This inorder print routine does not use recursion or extra storage in the form of stacks or queues.
Note that for clarity the In_Order_Successor is treated as a separate routine. In practice, to save
the overhead of subprogram invocations, the actual code for the In_Order_Successor is inserted at
this point in the algorithm.
The In_Order_Print can also be used as the basis of all the traversal algorithms. Thus, the
general traversal algorithm is:
Traversal
--Find first (smallest) item in tree
Pointer <-- Root
If Pointer /= 0 then
while Left( Pointer ) /= 0
Pointer <-- Left( Pointer )
end while
--Visit Nodes
while Pointer /= 0
Visit Item( Pointer )
Pointer <-- In_Order_Successor( Pointer )
end while
end traversal
where Visit Item(Pointer) can be replaced by any desired operation, such as counting or
summing.
The search routine for a threaded tree is almost identical to the one in Module 7.2.3.2.1. The
major difference is in the test for the bottom of the tree; that is, the bottom of the tree can now
be either an empty pointer or a thread:
Search( The_Data )
Initialize
Pointer <-- Root
Found <-- False
Repeat while [Pointer > 0] & [Not Found] --N.B. Pointer < 0 tests for threads!
Do-one-of
The_Data < Item(Pointer) : Pointer <-- Left(Pointer)
The_Data = Item(Pointer) : Found <-- true
The_Data > Item(Pointer) : Pointer <-- Right(Pointer)
end do-one-of
end repeat
An algorithm to insert data into a threaded tree is identical to the one in Module 7.2.3.2.1
except for the method of linking the new node to the tree:
Insert( New_Data )
--Find location of new node in tree
Initialize
Pointer <-- Root
Parent <-- 0
Repeat while (Pointer > 0) --N.B. This also tests for threads!
Parent <-- Pointer
Do-one-of
New_Data < Item( Pointer ) : Pointer <-- Left( Pointer )
New_Data = Item( Pointer ) : ??
New_Data > Item( Pointer ) : Pointer <-- Right( Pointer )
end do-one-of
end repeat
This algorithm is a minor variation on the BST Insert algorithm in Module 7.2.3.1.1.
The remaining threaded tree algorithms for an array representation are also minor modifications
of the usual binary search tree algorithms and are left for the reader.
In a linked representation of a threaded tree, each node contains the four fields:
Item, Left, Right, and Thread.
The first three are the same as in a normal tree. The Thread is a Boolean variable which is true if
and only if Right contains a thread instead of a normal pointer.
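In Ada, such a node might be declared as follows; this is a sketch, with Data_Type standing for the assumed generic data parameter.

```ada
type Node;
type Node_Pointer is access Node;
type Node is record
   Item   : Data_Type;     --holds one data item
   Left   : Node_Pointer;  --left subtree
   Right  : Node_Pointer;  --right subtree, or the next inorder node
   Thread : Boolean;       --true if and only if Right holds a thread
end record;
```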
The algorithms for a linked representation are minor modifications of those for an array
representation. For example, a routine to return the successor of a given node is:
In_Order_Successor( Last )
Next <-- Last.Right
If Last.Thread = false and Next /= Λ then --Next is a non-null pointer, so go left
repeat while Next.Left /= Λ
Next <-- Next.Left
end repeat
end if
Return( Next )
end in_order_successor
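The same routine can be sketched in Ada, assuming the four-field node type described above:

```ada
--  Returns a pointer to the next inorder node, or null if there is none.
function In_Order_Successor( Last : Node_Pointer ) return Node_Pointer is
   Next : Node_Pointer := Last.Right;
begin
   if not Last.Thread and then Next /= null then  --a real pointer: go left
      while Next.Left /= null loop
         Next := Next.Left;
      end loop;
   end if;
   return Next;
end In_Order_Successor;
```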
In_Order_Print
  --Find first (smallest) item in tree
  Pointer <-- Root
  If Pointer /= Λ then
    while Pointer.Left /= Λ
      Pointer <-- Pointer.Left
    end while
    --Visit Nodes
    while Pointer /= Λ
      Output: Pointer.Item
      Pointer <-- In_Order_Successor( Pointer )
    end while
end in_order_print
The insertion routine is very similar to the array insertion routine and is left for the exercises.
All of the threaded tree algorithms execute in the same big O amount of time as the corre-
sponding BST algorithms. The only difference is that the recursive calls are eliminated for the
inorder print and inorder traversals. This means that the threaded inorder print and traversal
algorithms execute much faster in practice than the corresponding BST algorithms even though
they have the same big O execution times.
All of the comparisons between array and linked representations of BSTs remain true of
threaded trees.
Exercises
2. Develop algorithms to
a. search, and b. breadth first print
a threaded tree stored using (A) an array representation and (B) a linked representation. Give the
execution time of each algorithm using big O notation.
3. Develop algorithms to insert an item in a threaded tree stored using a linked representation.
Give the execution time of the algorithm using big O notation.
5. Write procedures to
a. count the nodes,
b. sum the nodes,
c. count the leaves,
d. sum the leaves,
e. delete the leaves,
f. print the leaves,
g. print the non-leaves,
h. return the maximum value,
i. return the height,
j. make a copy, and
k. find the leaf of least level,
of a threaded BST. Use (A) an array and (B) a linked representation. For each algorithm, give
the execution time using big O notation.
6. Develop an algorithm to find the inorder predecessor of any given node in a threaded tree.
Most common trees such as management trees and decision trees require more than two
children per node. Trees with an arbitrary number of children per node are called N-way trees,
N-ary trees, multi-way trees, or general trees. The sample trees in Figure 7.1.1 illustrate
the need for n-way trees. The questions are: How can we implement them and how can we
use them?
Let's start with the n-way tree of Figure 7.6.1.
We could represent this tree by using a pointer for each possible child. If, for example, a tree
can have as many as ten children per node, we could use ten pointers per node. This method
requires a lot of space and would be awkward to implement.
There is a better solution which uses only two pointers per node. Call these two pointers
Child and Sibling. The Child pointer points to the node containing the first (eldest) child. The
Sibling pointer points to the node containing the next sibling of the current node. If we denote the eldest
child by a downward line and the sibling by a leftward line, then the tree above can be rewritten
in the form shown in Figure 7.6.2.
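A linked Ada declaration of such a node might be the following sketch; Data_Type is the assumed generic data parameter.

```ada
type Node;
type Node_Pointer is access Node;
type Node is record
   Item    : Data_Type;     --holds one data item
   Child   : Node_Pointer;  --first (eldest) child of this node
   Sibling : Node_Pointer;  --next sibling of this node
end record;
```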
The basic operations on the tree, Clear and Empty, are the same as for a binary search tree.
The representations of the other operations are slightly different, but are basically minor modifications
of those used for a binary search tree.
Printing
The printing of n-way trees uses the standard binary search tree print algorithms with Left
and Right replaced by Child and Sibling. Some of the results are rather interesting. The basic
preorder print for n-way trees is:
Pre_Order_Print
If Tree is not empty then
Output: Root of Tree
Pre_Order_Print ( Child Subtree )
Pre_Order_Print ( Sibling Subtree )
end pre_order_print
1
1.1
1.1.1
1.1.2
1.1.3
1.2
1.2.1
1.2.2
1.3
...
where indenting has been used to emphasize the order of the output.
Similarly, the inorder print for n-way trees is:
In_Order_Print
If Tree is not empty then
In_Order_Print ( Child Subtree )
Output: Root of Tree
In_Order_Print ( Sibling Subtree )
end in_order_print
1.1.1
1.1.2
1.1.3
1.1
1.2.1
1.2.2
1.2
...
1
where again indenting has been used to clarify the order of the output.
Better algorithms would of course test for empty subtrees before calling themselves recur-
sively, but the basic algorithms above illustrate the advantages of combining our storage scheme
with minor modifications of standard binary tree print procedures.
Similarly, to print an n-way tree in breadth first form we can use a minor modification of the
standard binary breadth first tree:
Breadth_First_Print
  Initialize
    Clear Queue
    If tree is not Empty, then Enqueue( Tree )
  Repeat while queue is not empty
    Dequeue( Pointer )
    --Inner loop: process all the siblings of this node
    Repeat while Pointer /= Λ
      Output: Pointer.Item
      If Child Subtree of Pointer is not empty, then Enqueue( Child Subtree )
      Pointer <-- Pointer.Sibling
    end repeat
  end repeat
end breadth_first_print
Note that this algorithm differs from the binary tree version of the breadth first print by using an
inner loop to process all the siblings of a node. This ensures that the algorithm processes all the
siblings of a node before it processes any of the children of that node, and, hence, the algorithm
prints first level 1, then level 2, and so forth, one level at a time.
Both the recursive, say In_Order_Print, and the non-recursive, Breadth_First_Print, routines
can be used as the basis for n-way traversal algorithms. Instead of printing the tree, we simply
process each node in any desired way as it is visited.
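For example, a routine to count the nodes of an n-way tree is just the recursive traversal with the output statement replaced. An Ada sketch over the linked Child/Sibling node (names assumed):

```ada
--  Counts every node reachable through the Child and Sibling pointers.
function Count( Pointer : Node_Pointer ) return Natural is
begin
   if Pointer = null then
      return 0;
   end if;
   return 1 + Count( Pointer.Child ) + Count( Pointer.Sibling );
end Count;
```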
Searching
To find the location of any node in the tree, we must do a sequential search of some kind.
(Without an ordering among the nodes, there is no way to use a binary search tree kind of
search.) This search can use either a recursive or a non-recursive approach. A breadth first
(non-recursive) type search is:
Is_In( The_Data )
  Initialize
    Clear Queue
    If tree is not Empty, then Enqueue( Tree )
    Found <-- false
  Repeat while queue is not empty and not Found
    Dequeue( Pointer )
    --Check each sibling; enqueue its child subtree for later
    Repeat while Pointer /= Λ and not Found
      If Pointer.Item = The_Data, then Found <-- true
      If Child Subtree of Pointer is not empty, then Enqueue( Child Subtree )
      Pointer <-- Pointer.Sibling
    end repeat
  end repeat
  Return( Found )
end is_in
Since this is a sequential search, its execution time is O(Size). The search can be sped up
by testing a node for equality before inserting it in the queue. The execution time, however, is
still O( Size ). This routine also requires the extra space needed to store the queue.
We can also generalize the recursive binary tree search to a recursive search of an n-way
tree. The only difference is that we must, in general, still search the whole tree. An algorithm
is:
Search( Tree, The_Data )
  If Tree is empty, then Return( false )
  If Root of Tree = The_Data, then Return( true )
  If Search( Child Subtree, The_Data ), then Return( true )
  Return( Search( Sibling Subtree, The_Data ) )
end search
On the whole, both the recursive and the non-recursive search of an n-way tree are unsatis-
factory. They both require an O(Size) execution time. But unless the nodes have some kind of
special order that allows quick searches, O(Size) is as fast a search as is possible.
Since both of these search routines essentially traverse the tree, one node at a time, they are
easily converted into standard traversal routines to process all of the nodes in the tree. The
details are left for the reader.
Insertion
The most different appearing operation is insertion. Because there is no particular order
among the nodes of an n-way tree, the insert operation must specify where to insert a node in the
tree. We present two kinds of insert. In the first we are given the parent of the item and we
insert the item as a child of the specified parent -- generally as the eldest child. In the second,
we are given a sibling of the item and we insert the item as a sibling of the specified sibling. In
both cases the general algorithm is:
Insert( Where, New_Data )
  Find the node specified by Where (the given parent or sibling)
  If no such node, then Error
  Create a new node containing New_Data
  Link the new node into the tree as a child or as a sibling of the node found
end insert
There is one special case that must be considered: the root of the tree has no parent and no
siblings, so we include one extra insertion routine, one to insert the root of the tree. More
detailed versions are given in Module 7.6.2.1.
Since insertion requires a search, the execution time of an insertion is O(Size) unless some
special search is possible.
7.6.2. Representations
N-way trees can be implemented using either an array or a linked representation. Since the
two are so similar, Module 7.6.2.1 gives only a generalized representation and leaves the exact
representation to the reader.
7.6.3. Timing
Since almost all of the operations require a sequential tree search in the general case, most of
the operations require O(Size) time to execute. For this reason, it is important that the imple-
menter take advantage of any known tree facts which can speed the search.
The table below summarizes the execution times for the general n-way tree.
Operation      Array        Linked
Clear          O(1)         O(1)
Empty          O(1)         O(1)
Is_In          O(Size)      O(Size)
Insert         O(Size)      O(Size)
Print          O(Size)      O(Size)
Data Specification
Node is record
Item : ?? --Holds one data item.
Child : Pointer to node; --Pointer to child subtree of node.
Sibling : Pointer to node; --Pointer to sibling subtree of node.
end record;
Algorithms
Clear
Root <-- Λ
end clear
Empty
Return( Root = Λ )
end empty
Terminate
Subtree <-- Tree
end find
Is_In( The_Data )
Find( The_Data, Found, Subtree )
Return( Found )
end is_in
Insert_Root ( New_Data )
If Tree is empty
then Tree <-- new Node ( New_Data, Λ, Λ )
else Error -- Tree already has a root
end insert_root
Breadth_First_Print
  Initialize
    Clear Queue
    If not Empty, then Enqueue( Tree )
  Repeat while queue is not empty
    Dequeue( Pointer )
    Repeat while Pointer /= Λ
      Output: Pointer.Item
      If Child Subtree is not empty, then Enqueue( Child Subtree )
      Pointer <-- Pointer.Sibling
    end repeat
  end repeat
end breadth_first_print
In_Order_Print
If not Empty, then IOP( Tree )
end in_order_print
IOP( Pointer )
If Child Subtree is not empty, then IOP( Child Subtree )
Output: Root of Tree
If Sibling Subtree is not empty, then IOP( Sibling Subtree )
end iop
Pre_Order_Print
If not Empty, then Pre( Tree )
end pre_order_print
Pre( Pointer )
Output: Root of Tree
If Child Subtree is not empty, then Pre( Child Subtree )
If Sibling Subtree is not empty, then Pre( Sibling Subtree )
end pre
Exercises
3. What would the output be if the queue in the breadth first algorithm were replaced by a
stack?
6. Write procedures to
a. count the nodes,
b. sum the nodes,
c. count the leaves,
d. sum the leaves,
e. delete the leaves,
f. print the leaves,
g. print the non-leaves,
h. return the maximum value,
i. return the height,
j. make a copy, and
k. find the leaf of least level,
of an n-way tree. These should be done
A recursively, and
B non-recursively.
Give execution time in big "O" notation for each procedure.
8. Give some examples of n-way trees where fast searches are possible and
a. develop a fast search algorithm and
b. give the execution time of this algorithm.
9. Discuss the effect of replacing the stack by a queue in the breadth first search and print
routines.
10. Discuss the effects of eliminating the Insert_Sibling operation from the N-way tree opera-
tions. Are there other possible insertion operations that will allow the user to generate the
desired tree?
11. Discuss the properties of an N-way tree where the nodes are ordered and the tree is organized
such that every node comes after every node in its child subtree and before every node in its
sibling subtree. What if every node in the child subtree comes after the node and before every
node in the sibling subtree?
12. It is a nuisance to have to use a special insert routine for inserting the root. Compare the
following two methods of avoiding this.
a. Replace the root of the n-way tree with a dummy node which contains nothing.
b. Allow a null parent in the insert-parent routine or a null sibling in the insert-sibling
routine.
GRAPHS
Many real world objects consist of parts with some kind of connection between the parts.
This type of object can be represented by a graph, so this chapter presents graphs, their uses and
representation.
8.1. Graphs
Graphs consist of nodes and connections between nodes, called edges. Unlike trees, how-
ever, the edges can form loops and there are often several paths from one node to another. In
fact any number of edges between any pair of nodes is possible. The abbreviated map in Figure
8.1.1 is a typical graph. Each node, in this case, is a city and each edge is a road between two of
the cities. The figure also contains a simple water distribution system. Each node in this case is
a component (well, pump, or tank) and each edge is a water pipe connecting two components.
Note that both graphs, unlike trees, contain loops or circuits.
[Two sample graphs: a road map connecting Toledo, Dayton, Columbus, and Cincinnati,
and a water distribution system connecting a Well, Pump1, Pump2, and a Tank]
Sample Graphs
Figure 8.1.1
Graphs have a long history. The first technical reference to graphs is by Euler in 1736. He
seems to have invented the concept to simplify the solution of the Bridges of Koenigsberg prob-
lem. The city of Koenigsberg has a river with two islands and the islands are connected to the
mainland as shown in Figure 8.1.2. A summertime activity was strolling over the bridges and
admiring the view. This rather naturally led to the question: Is there a path which includes each
bridge once and only once? By simplifying the picture to the graph of Figure 8.1.2, Euler was
able to show the answer is no. (Why?)
Graphs are useful precisely because of their vagueness: anything with nodes and edges is a
graph. They can be used to represent chemical bonds, electrical networks (in fact, networks of
any kind), tournament scheduling (construction and other kinds of scheduling), and, in general,
almost any kind of relationship between entities.
The formal definition of a graph emphasizes the generality of the concept. A graph is an
ordered pair {N, E} where N is a set of nodes and E is a set of pairs of nodes, called edges. In
other words, any set of items can be nodes and any set of pairs of nodes can be edges.
- A graph is an ordered pair [N,E] where N is a set of elements called nodes and
E is a set of pairs of nodes, called edges.
- Nodes are also referred to as points or vertices. Edges are also referred to as arcs.
- A circuit (loop, cycle) is a path with the same starting and ending node.
- A graph is complete if there is an edge from every node to every other node.
- A directed graph is an ordered pair [N,E] where each edge is also an ordered pair.
This is a precise way of saying that each edge in the graph has a direction and
can only be traveled in that direction.
A path is a sequence of edges {n1, n2}, {n2, n3}, ..., {nk-1, nk} such that the last node in one
edge is the first node in the next edge. A circuit is a path with the same starting and ending
nodes. A directed graph is a graph where each edge has a direction; that is, the edge {n, m} is
not the same as the edge {m, n}.
Graph operations can be divided into "housekeeping" operations and rather special opera-
tions dependent upon the field of application. Some basic "housekeeping" operations are Clear,
Insert_Node, Insert_Edge, and Print.
The special operations include finding various kinds of paths, and minimizing or maximizing
various kinds of flows through the graph. We leave the special operations for later in this chap-
ter and concentrate for now on representation techniques and housekeeping operations.
Adjacency Representations
Sometimes we can't or don't want to draw a picture of a graph. In this case, there are two
basic adjacency representations of graphs; that is, ones based upon the concept of adjacent
nodes: nodes connected by an edge. In Figure 8.1.1, for example, Toledo is adjacent to Dayton
and Columbus is adjacent to Dayton and Cincinnati.
The adjacency matrix representation is simply a matrix with one row and one column for
each node in the graph. For each edge in the graph, we insert a one in the matrix at the intersec-
tion of the row and column corresponding to this edge. Figure 8.1.3 contains an adjacency
matrix for each of the graphs in Figure 8.1.1. The city names are abbreviated to improve read-
ability.
For the moment, dashes are used down the diagonal; more details are given in the next section.
The adjacency matrix is useful for some purposes, but difficult to follow for large graphs
with many nodes. Some representations simply list the nodes and edges in some form or other.
Figure 8.1.4, for example, contains, for each of the graphs in Figure 8.1.1, a list of nodes and for
each node, the nodes adjacent to that node. This representation is called the adjacency list repre-
sentation.
We will use both the adjacency list and the adjacency matrix representations and flip back
and forth between representations depending upon the circumstances.
Exercises
1. Show how to represent a city's water distribution network as a graph. Is the graph directed?
Connected?
2. Given a list of required courses and the prerequisites for each of these courses, develop a
graph which will represent the courses and the order in which they must be taken. Is the graph
directed? Connected?
3. How could an Ada program be represented as a graph? Is the graph directed? Connected?
4. A binary relation is a set of ordered pairs from some set. A relation can be represented by a
graph where each edge of the graph corresponds to one of the ordered pairs. A relation is
a. reflexive if every item in the set is related to itself,
b. symmetric if a related to b implies b is related to a, and
c. transitive if x is related to y and y is related to z always implies x is related to z.
Develop
(1) a graphical representation of a relation and
(2) an algorithm to determine if the relation is reflexive, symmetric, or transitive.
5. Represent the following maze as a graph. How can you find a path through the maze?
[A small rectangular maze diagram]
6. Represent a set of statements as nodes in a graph and logical implications as directed edges
between the nodes. How do you show that A implies B?
construct:
a. an adjacency matrix representation, and
b. an adjacency list representation.
10. Find the minimum number of edges necessary for a connected graph. (A connected graph
has a path between every pair of nodes.)
11. Find the minimum number of edges necessary for a complete graph. (A complete graph has
an edge between every pair of nodes.)
There are two distinct computer representation schemes for graphs. The first is based upon
the adjacency matrix representation and the second on the adjacency list representation. This
section presents both implementation schemes along with algorithms for the basic "housekeep-
ing" operations, Clear, Insert_Node, Insert_Edge, and Print.
The adjacency matrix based representation starts with the adjacency matrix. It is difficult to
use the node names as row and column subscripts in a generic Ada package, so our adjacency
matrix will have integer row and column subscripts. To convert the node names into integer
subscripts, we will use a Node_List containing the names of all the nodes in the graph. The
position of a node name in Node_List specifies the row and column of the adjacency matrix cor-
responding to that node.
Figure 8.2.1.1, for example, contains the adjacency matrix representation of the map graph
from Figure 8.1.1 along with the representation version including Node_List. (The city names,
as usual, have been abbreviated to improve readability.) Since Dayton is in the second row of
this Node_List, row 2 and column 2 of the representation version of the adjacency matrix corre-
spond to the city of Dayton.
Figure 8.2.1.1
The exact details of the adjacency matrix depend upon the type of graph. Some graphs allow
edges from a node to itself; such graphs can have non-zero values on the diagonal of the adja-
cency matrix. Some graphs do not allow an edge from a node to itself; such graphs must have
all zeros down the diagonal of the adjacency matrix.
Some graphs allow only one edge between two nodes; some allow more than one. Multiple
edges between the same two nodes imply the adjacency matrix must have some way of keeping
count of the number of edges between each pair of nodes. The simplest way to do this is to
make each entry in the matrix an integer equal to the number of edges between the given pair of
nodes. On the other hand, if only a single edge is allowed between two nodes, then we must
check to see if an edge already exists before inserting a new edge.
The most general type of graph seems to be one which allows:
a. an edge from a node to itself, and
b. multiple edges between the same two nodes.
Since the more restricted types of graphs require at most minor modifications to the general
type, we present the general type here and leave the special types for the exercises.
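The general type's data structure can be sketched in Ada as follows; the names Adj and Maximum_Size follow the text, but the exact declarations are assumptions.

```ada
--  Each entry counts the number of edges between a pair of nodes,
--  so multiple edges and edges from a node to itself are both representable.
type Matrix is array (1 .. Maximum_Size, 1 .. Maximum_Size) of Natural;
Adj  : Matrix  := (others => (others => 0));  --adjacency matrix, all zeros
Size : Natural := 0;                          --current number of nodes
```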
To simplify the graph algorithms, assume Node_List is a separate package with the opera-
tions:
Clear( Maximum_Size ) clears node list to empty and sets the maximum number of
nodes to Maximum_Size.
Insert( Name ) inserts the value of Name in the node list
(note that no duplicates are allowed).
Find_Loc( Name ) returns the location of Name in the node list (if the name is
not found, then an error exception is generated).
Print_Name( Loc ) prints the name of the node in position Loc of the node list.
This package is a minor variation of the set packages in Chapter 5. The need for the Clear,
Insert, and Find_Loc operations should be obvious. The Print_Name is used by the graph pack-
age to print node names.
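One possible shape for such a package specification is sketched below. This is only a sketch (the actual package is Exercise 1); the generic Put procedure and the exception names are assumptions:

```ada
generic
   type Name_Type is private;                   --Type of node names.
   with procedure Put( Name : in Name_Type );   --How to print a name (assumed).
package Node_List_Package is
   procedure Clear( Maximum_Size : in Positive );
   --empties the list and sets the maximum number of nodes.
   procedure Insert( Name : in Name_Type );
   --inserts Name in the list; no duplicates are allowed.
   function Find_Loc( Name : Name_Type ) return Positive;
   --returns the location of Name; raises Name_Not_Found otherwise.
   procedure Print_Name( Loc : in Positive );
   --prints the name of the node in position Loc.
   Duplicate_Name : exception;   --assumed name.
   Name_Not_Found : exception;   --assumed name.
end Node_List_Package;
```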
A graph representation using this data structure and containing the housekeeping operations,
Clear, Insert_Node, and Insert_Edge, is in Module 8.2.1.2. Note that most of the work is actu-
ally done by the Node_List package.
The time to execute these operations depends heavily upon the speed of the Node_List pack-
age. Both the Insert_Node and the Insert_Edge operations require searching the node list, and,
in fact, this search time is essentially the execution time of the graph operations. The execution
time of the graph Clear operation is essentially determined by the time required to set the
adjacency matrix, called Adj, to all zeros; this time is O(Maximum_Size^2).
Exercises
1. Develop algorithms and an Ada package for the Node_List package used in Figure 8.2.1.1.
2. Discuss the pros and cons of implementing the Node_List package using either a sorted list
or a tree.
Data Specification
Adj : array( 1..Maximum_Size, 1..Maximum_Size ) of Integer --Adjacency matrix.
Node_List : the node list package described above
Size : number of nodes currently in the graph
Algorithms
Clear
Set Adj to all zeros
Clear Node_List( Maximum_Size )
end clear
Insert_Node ( Node )
Insert Node in Node_List
If Node already in Node_List, then Error
end insert_node
Insert_Edge ( Node1, Node2 )
Row <-- Find_Loc( Node1 )
Col <-- Find_Loc( Node2 )
Add 1 to Adj( Row, Col ) and, if Row /= Col, to Adj( Col, Row )
end insert_edge
Print
Repeat for each row (Row = 1 to Size)
Print_Name( Row )
Repeat for each column (Col = 1 to Size)
for I = 1 to Adj( Row, Col ): Print_Name( Col )
end repeat
end repeat
end print
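Translated into Ada, the Print algorithm might look like the following sketch. It assumes the integer adjacency matrix Adj, the node count Size, and the Node_List Print_Name operation:

```ada
--A sketch of the Print algorithm for the adjacency matrix version.
procedure Print is
begin
   --Repeat for each row: print the node, then its adjacent nodes.
   for Row in 1 .. Size loop
      Node_List.Print_Name( Row );
      for Col in 1 .. Size loop
         --Print Col once for each edge between Row and Col.
         for I in 1 .. Adj( Row, Col ) loop
            Node_List.Print_Name( Col );
         end loop;
      end loop;
   end loop;
end Print;
```

Note that the innermost loop body executes Adj( Row, Col ) times, which is exactly the number of edges between the pair of nodes, so multiple edges are printed correctly.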
The adjacency list representation of a graph is very similar to the set of sets representation
in Chapter 5. We start by storing the node names
in the first column of a table. The second column of the table contains a pointer to a pipe con-
taining the names of the adjacent nodes. To be more precise we store the row number of the
adjacent nodes in the pipe. Figure 8.2.2.1 contains an adjacency list version of the map of Figure
8.1.1 (we have abbreviated the city names for clarity) and the corresponding representation ver-
sion. The reason for storing the row number of the corresponding city in the pipe is speed. It is
much faster to find a city's entry using the row number of the city than it is to use the name of
the city.
                                        Table
Adjacency List Version           Node       Adj_Nodes
Tol --> Day                      Tol    ----> 2
Day --> Tol, Col, Cin            Day    ----> 1, 4, 3
Cin --> Day, Col                 Cin    ----> 2, 4
Col --> Day, Cin                 Col    ----> 2, 3
Figure 8.2.2.1
It simplifies matters if we assume each pipe is implemented by a pipe package with the opera-
tions:
Create( Pipe_Name ) creates the pipe and sets the pipe to empty.
Clear( Pipe_Name ) sets the pipe to empty.
Insert( Pipe_Name, Data ) inserts the value of Data in the pipe.
Open( Pipe_Name ) sets pipe so Get_Next returns the first item in the pipe.
Get_Next( Pipe_Name, Data ) sets the value of Data to the next value in the pipe.
End_of_Pipe( Pipe_Name ) Boolean function which is true if and only if every
item in the pipe has been returned by Get_Next.
The graph Print routine in Module 8.2.2.2 uses the pipe operations to simplify the adjacency list
representation, but their real use is in the path routines of the next few sections.
8.2.3. Timing
                               Representation
Operation         Adjacency Matrix        Adjacency List
Clear             O(Maximum_Size^2)       O(1)
Insert_Node       O(N)                    O(N)
Insert_Edge       O(2*N)                  O(2*N)
Print             O(N^2)                  O(E)
where N is the number of nodes in the graph and E is the number of edges in the graph.
Data Specification
Row_record is record
Node_Name : ?? --Name of node, any data type.
Adj_Nodes : Pipe --Pipe of adjacent nodes.
end record
Algorithms
Clear
Table_Size <-- 0
end clear
Insert_Node( New_Node )
If Table_Size = Maximum_Size then Table overflow error
If New_Node already in Node_Name column of Table, then Error
Table_Size <-- Table_Size + 1
Table( Table_Size ) . Node_Name <-- New_Node
Create( Table( Table_Size ) . Adj_Nodes )
end insert_node
Print
Repeat for each node in graph (I = 1 to Table_Size)
Output: Table(I) . Node_Name
Open Table(I) . Adj_Nodes
while not End_of_Pipe( Table(I) . Adj_Nodes )
Get_Next( Table(I) . Adj_Nodes, Node )
Output: Table(Node) . Node_Name
end while
end repeat
end print
N.B. The print routine can be greatly simplified if an iterate operation is added to the pipe
package.
Translating the graph algorithms above into Ada code is a straightforward process. We only
give one version here, the adjacency list version, because it offers another chance to implement
one package in terms of another package. In particular, it uses the pipe package developed
earlier.
The package specification has a number of features that must be noted:
- The node name is made a generic type so the user can use any type of node
names.
- The table is implemented by an array of records. Other possibilities are left
for the exercises.
- The maximum number of nodes allowable in the graph is left for the user to
determine at instantiation time.
- The package only allows one graph; in other words, it defines an object.
While multiple graphs might be used at times, most graph programs process
only one graph at a time; assuming a single graph makes the
Ada code both faster and more readable.
- There are four exceptions, one each for:
- trying to use more nodes than the space allocated for nodes,
- trying to use more edges than the space allocated for edges,
- trying to insert a duplicate node, and
- trying to use a node which has not been inserted in the graph.
The package body is a straightforward implementation of the algorithms in Module 8.2.2.2.
The data structure is essentially the table developed above with one column for node names and
one column for the corresponding pipes. The only addition is the function Location_Of, used to
determine the row in the table containing a given node. This function is used to simplify the two
insertion routines.
-- A graph consists of
-- nodes and edges and
-- the operations: Clear, Insert_Node, Insert_Edge and Print
generic
   type Node_Name_Type is private;   --Type of node names.
   Maximum_Size : in Positive;       --Maximum number of nodes.
package Graph_Package is
   type Procedure_Pointer is
      access procedure( I : in Node_Name_Type );
   procedure Clear;
   --sets the graph to empty with no nodes or edges.
   procedure Insert_Node( Node : in Node_Name_Type );
   --inserts a node with the given name in the graph.
   procedure Insert_Edge( Node1, Node2 : in Node_Name_Type );
   --inserts an edge between the two named nodes.
   procedure Print( Put_Procedure : Procedure_Pointer );
   --prints each node name followed by its adjacent nodes.
   Too_Many_Nodes      : exception;
   Too_Many_Edges      : exception;
   Duplicate_Node_Name : exception;
   Node_Name_Not_Found : exception;
end Graph_Package;
with Ada.Text_IO;
with Pipe_Package;
package body Graph_Package is
   subtype Subscript is Natural range 0..Maximum_Size;
   package Edge_Pipe is
      new Pipe_Package( Data_Type    => Subscript,
                        Maximum_Size => 100 );
   type Row_Type is
      record
         Node_Name : Node_Name_Type;   --Name of node
         Adj_Nodes : Edge_Pipe.Pipe;   --Pipe of adjacent nodes.
      end record;
   --Row 0 is a sentinel used by the Location_Of search.
   Table      : array( 0..Maximum_Size ) of Row_Type;
   Table_Size : Natural := 0;
   -------------------------------------------------------------
   procedure Clear is
   begin
      Table_Size := 0;
   end Clear;
   -------------------------------------------------------------
   function Location_Of( Node : Node_Name_Type ) return Natural is
      I : Natural;
   begin
      --Initialize: place Node in row 0 as a sentinel.
      I := Table_Size;
      Table(0).Node_Name := Node;
      --Search backwards; the sentinel guarantees termination.
      while Table(I).Node_Name /= Node loop
         I := I - 1;
      end loop;
      --Terminate: I = 0 means Node is not in the table.
      return I;
   end Location_Of;
-------------------------------------------------------------
   -------------------------------------------------------------
   procedure Insert_Edge( Node1, Node2 : in Node_Name_Type ) is
      Location1, Location2 : Natural;
   begin
      --Find location of nodes in Node_Name column.
      Location1 := Location_Of( Node1 );
      if Location1 = 0 then
         raise Node_Name_Not_Found;
      end if;
      Location2 := Location_Of( Node2 );
      if Location2 = 0 then
         raise Node_Name_Not_Found;
      end if;
      --Insert the edge in the adjacency pipe of each node.
      Edge_Pipe.Insert( Table(Location1).Adj_Nodes, Location2 );
      Edge_Pipe.Insert( Table(Location2).Adj_Nodes, Location1 );
   exception
      when Edge_Pipe.Pipe_Overflow => raise Too_Many_Edges;
   end Insert_Edge;
   -----------------------------------------------------------------
   procedure Print( Put_Procedure : Procedure_Pointer ) is
      Node : Subscript;
   begin
      --Repeat for each row in table
      for I in 1..Table_Size loop
         --Initialize
         Put_Procedure( Table(I).Node_Name );
         Ada.Text_IO.Put( " -> " );
         Edge_Pipe.Open( Table(I).Adj_Nodes );
         --Print each node adjacent to node I.
         while not Edge_Pipe.End_of_Pipe( Table(I).Adj_Nodes ) loop
            Edge_Pipe.Get_Next( Table(I).Adj_Nodes, Node );
            Put_Procedure( Table(Node).Node_Name );
         end loop;
         Ada.Text_IO.New_Line;
      end loop;
   end Print;
end Graph_Package;
Exercises
1. Give the representation data structure for the water system of Figure 8.1.1 assuming an:
a. adjacency matrix, and b. adjacency list
representation.
Do the above exercises assuming an (1) adjacency list and (2) adjacency matrix representation.
Include a big O timing estimate for each algorithm.
3. Module 8.2.2.2 assumes the Table is implemented using an array. Discuss replacing the
array by either a linked list or by a tree. Discuss the speed and space requirements of each possi-
ble Table representation.
4. An undirected graph is one that allows an edge to be traveled in both directions. A directed
graph is one that allows an edge to be traveled in only one direction. An antisymmetric graph is
one such that if there is an edge from a to b then there is no edge from b to a. Develop algo-
rithms to determine if a graph is:
a. directed, b. undirected, and c. antisymmetric.
Assume the graph is stored using an (1) adjacency list and (2) adjacency matrix representation.
Include a big O timing estimate for each algorithm.
7. Rewrite the Ada graph package of Program 8.2.3.1 so that the adjacency lists are imple-
mented using:
a. linked lists or b. bags with iteration
rather than pipes. Which version is easiest to develop, write, and understand? Which version
executes fastest?
8. What changes must be made to the Ada graph package of Program 8.2.3.1 so that:
a. no edges are allowed from a node to itself,
b. only one edge is allowed between any two nodes, and
c. edges and nodes can be deleted from the graph.
9. Rewrite the Ada package given in Program 8.2.3.1 so that the table is implemented using:
a. linked or b. tree
data structures.
8.3. Paths
Many graph problems involve finding paths of various kinds. Some typical problems are:
- Does a path exist between two specific nodes?
- What is the shortest path between two specific nodes?
- What nodes are connected to a specified node?
- Does a path exist which goes through every node exactly once?
- Does a path exist which goes through every edge exactly once?
- Pert/CPM (Roughly speaking, compute the minimum time to visit every
node in a directed, weighted graph.)
- Determine the minimum/maximum flow through a graph.
There are two different kinds of algorithms for solving path problems. The first is based upon a
breadth first traversal of a graph and the second upon a depth first traversal of a graph, so we
start by covering each kind of traversal in turn.
The breadth first traversal of a graph is based upon the same approach as that used in the
breadth first traversal of a tree. We start at a node, next visit each adjacent node, then each node
adjacent to each of the adjacent nodes, and so forth. If you like, we visit first the node at dis-
tance 0 (the original node), then the nodes at distance 1 (adjacent nodes), then those at distance 2
(adjacent to adjacent nodes), then those at distance 3, and so forth. For example, a breadth first
traversal of the map graph in Figure 8.1.1:
Toledo ---- Dayton ---- Columbus
               |            |
            Cincinnati -----+
would produce, if we start with Toledo, first Toledo, then Dayton (distance 1), then Columbus
and Cincinnati (distance 2). If we start with Columbus, it would produce first Columbus, then
Dayton and Cincinnati (distance 1) and finally Toledo (distance 2).
A breadth first traversal does not have a unique order. Since Columbus and Cincinnati are
both at the same distance from Toledo, a breadth first traversal starting at Toledo may include
these two nodes in either order.
To illustrate the concept with a more complicated example, consider the graph:
[Figure: a graph on the nodes A, B, C, D, E, and F]
[The same graph redrawn with node D at the left, nodes B, E, and F in the middle
column, and nodes A and C at the right]
(Before continuing, assure yourself that redrawing a graph does not change the graph and this
graph is the one above redrawn to emphasize the distance of each node from node D.) From the
graph it is clear that nodes B, E, and F are at distance 1 from node D and that nodes A and C are
at distance 2 from D. A breadth first traversal is then D, B, E, F, A, and C.
Example 8.3.1.1. Develop an algorithm to visit all the nodes connected to a given node.
The breadth first traversal of a graph is very similar to the breadth first traversal of a tree.
Recall that a queue was used in the breadth first traversal of a tree and, as each node was
dequeued and printed, its children were inserted in the queue. The general scheme was:
Repeat until Queue is empty
   Dequeue( Node )
   Process Node
   Insert each child of Node in the Queue
end repeat
This ensured that as each level is dequeued, the next level is inserted in the queue. In other
words, the queue contains first the root, then those nodes distance 1 from the root, then those
nodes distance 2 from the root, and so forth.
We can do the same thing with a graph, only instead of children, we insert in the queue the
nodes adjacent to (distance 1 from) the node dequeued. The general scheme is:
Repeat until Queue is empty
   Dequeue( Vertex )
   Process Vertex
   Insert each node adjacent to Vertex in the Queue
end repeat
Note that each time an item is dequeued, the nodes one edge further from the starting node are
enqueued. The result is a breadth first traversal.
There is one minor addition necessary for a graph. There is only one path in a tree from the
root to each node in the tree, but in a graph, there may be several paths from the starting node to
each node in the graph. The algorithm needs to consider each node only once, so, to keep track
of which nodes have already been visited, we will use a set called Visited. Let the given node be
called Start_Node. An algorithm then is:
Breadth_First_Traverse( Start_Node )
Initialize
   Clear Queue
   Clear Visited
   Insert Start_Node in Visited
   Enqueue( Start_Node )
Repeat until Queue is empty
   Dequeue( Vertex )
   Process Vertex
   For each Node adjacent to Vertex & not yet Visited
      Insert Node in Visited
      Enqueue( Node )
   end for
end repeat
Terminate
end breadth_first_traverse
Note that the first time we meet a node, we mark the node as already visited and insert the node
in the queue for later processing. This way the Visited set contains a list of nodes already
inserted in the Queue and the Queue contains nodes already visited and not yet processed.
Note that when we process the starting node in the repeat loop, we insert into the queue all
nodes adjacent to the starting node (nodes at distance 1). When we process the nodes at distance
1 in the repeat loop, we insert into the queue those nodes at distance two from the starting node,
and so forth; each time inserting into the queue nodes which are one edge further away from the
starting node. When the Queue is empty, we have processed every node connected to the start-
ing node.
The inner loop:
   For each Node adjacent to Vertex & not yet Visited
      Insert Node in Visited
      Enqueue( Node )
   end for
of the above algorithm must be carefully translated depending upon the representation.
The adjacency matrix representation executes this loop by processing the Vertex row of the
adjacency matrix; that is,
   Repeat for each column (Node = 1 to N)
      If Adj( Vertex, Node ) = 1 and Node not in Visited then
         Insert Node in Visited
         Enqueue( Node )
   end repeat
To compute the execution time of this loop, note that the time to execute an Insert is O(1)
and the time to execute an Enqueue is O(1). Let, for the moment, the time needed to search the
Visited set be simply O(Time to search Visited Set). (We will return to this point later.) Then,
the execution time of the loop is O(N)*O(Time to search Visited Set) where N is the number of
nodes in the graph.
Examining the Breadth First Traverse algorithm, we see that this inner loop is executed once
for each node connected to the starting node. Let C be the number of edges connected to the
starting node, then the execution time of the adjacency matrix representation of the Breadth First
Traverse algorithm is:
   O(C) * O(N) * O(Time to search Visited Set)
The adjacency list representation executes this loop by processing the pipe associated with
Vertex; that is,
   Open( Pipe of Vertex )
   While not End_of_Pipe( Pipe of Vertex )
      Get_Next( Pipe of Vertex, Node )
      If Node not in Visited then
         Insert Node in Visited
         Enqueue( Node )
   end while
Note how the Open, Get_Next, and End_of_Pipe operations are used to process the adjacent
nodes, one at a time.
The execution time of this inner loop is a bit more interesting. Again, the time to execute an
Insert is O(1) and the time to execute an Enqueue is O(1). Again let the time needed to search
the Visited set be simply O(Time to search Visited Set). The loop is repeated once for each
node adjacent to the Vertex, so the total execution time of the loop is:
   O(Number of nodes adjacent to Vertex) * O(Time to search Visited Set)
Examining the Breadth First Traverse algorithm, we see that this inner loop is executed once
for each node connected to the starting node. Thus, for each node connected to the starting
node, we process all of its adjacent nodes (by processing all of the edges through this node). In
the worst possible case (where every node is connected to the starting node), we must process
every edge exactly once. Therefore, the total execution time is roughly:
                         Representation
Algorithm        Adjacency Matrix     Adjacency List
Traverse             O(C*N)               O(E)
One final comment about this algorithm. While a graph may have more than one breadth
first traversal, this algorithm always produces exactly one traversal. The exact traversal the
algorithm produces depends upon the order in which the edges are inserted into the pipes. Exer-
cises 1 and 2 at the end of this section illustrate this.
procedure Breadth_First_Traversal(
            Start_Node : in Node_Name_Type ) is
   package Subscript_Queue is
      new Queue_Package( Data_Type    => Subscript,
                         Maximum_Size => 100 );
   Node_Place   : Natural;
   Vertex, Node : Subscript;
   Visited : array( 1..Maximum_Size ) of Boolean := ( others => False );
begin
   --Initialize
   Node_Place := Location_Of( Start_Node );
   if Node_Place = 0 then
      raise Node_Name_Not_Found;
   end if;
   Visited( Node_Place ) := True;
   Subscript_Queue.Enqueue( Node_Place );
   --The queue operations (Enqueue, Dequeue, Empty) are assumed from Chapter 4.
   while not Subscript_Queue.Empty loop
      Subscript_Queue.Dequeue( Vertex );
      --Process Vertex here.
      Edge_Pipe.Open( Table(Vertex).Adj_Nodes );
      while not Edge_Pipe.End_of_Pipe( Table(Vertex).Adj_Nodes ) loop
         Edge_Pipe.Get_Next( Table(Vertex).Adj_Nodes, Node );
         if not Visited( Node ) then
            Visited( Node ) := True;
            Subscript_Queue.Enqueue( Node );
         end if;
      end loop;
   end loop;
end Breadth_First_Traversal;
A careful examination of the traversal algorithm shows that we actually visit each node by
processing an edge to that node. For example, we visit the nodes adjacent to the starting node by
processing the edges from the starting node to the adjacent nodes. When we visit any node in
the graph, we always process the edge from the previous node to this new node. If we can keep
track of these edges as we proceed, we can use this information to construct paths.
To keep track of the edges as they are visited, we will save the name of each node as it is vis-
ited and the name of the node used to reach this new node. These two node names are then the
edge used to reach the new node. To save this information we will use a new table, called the
Path_Table. Path_Table contains two columns: the first column is the name of a node and the
second column is the name of the node used to reach this new node.
A sample Path_Table, using the data from the example above, and assuming the starting
node is Tol, is:
        Path_Table
Node    Previous Node
Tol         Null
Day         Tol
Col         Day
Cin         Day
The Null entry in the second column indicates that Tol is the starting node. The row with Day in
the first column has Tol in the second column indicating that Day was "reached" from Tol; that
is, Day is adjacent to Tol and that Tol was the Vertex when Day was enqueued. The two copies
of Day in the second column indicate Cin and Col were both "reached" from Day.
To use the Path_Table to find a path, we follow the table backwards. To go, for example,
from Col to Tol, we can start in the first column with Col. Here we find an edge to Day. Using
the first column entry for Day, we find an edge from Tol. Thus, the path is {Col, Day}, {Day,
Tol}, or in shorter form, {Col, Day, Tol}.
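This backward walk can be sketched in Ada as follows. The sketch is illustrative only: the Path_Table representation, the Find_Row lookup, the Put_Name output routine, and the Null_Name marker are all assumptions here, not the book's code:

```ada
--All names here (Find_Row, Put_Name, Null_Name) are assumptions.
type Path_Entry is
   record
      Node, Previous : Name_Type;   --Previous = Null_Name marks the start.
   end record;
Path_Table : array( 1..Maximum_Size ) of Path_Entry;

procedure Print_Path( End_Node : in Name_Type ) is
   Row : Natural := Find_Row( End_Node );   --Search the first column.
begin
   --Follow the second column backwards to the starting node.
   while Row /= 0 loop
      Put_Name( Path_Table(Row).Node );
      exit when Path_Table(Row).Previous = Null_Name;
      Row := Find_Row( Path_Table(Row).Previous );
   end loop;
end Print_Path;
```

Note that this prints the path in reverse order, from the end node back to the starting node; pushing the names onto a stack first would reverse them.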
It simplifies matters if we assume the operations on this table are:
Clear sets the Path_Table to empty.
Insert( Node, Previous ) inserts the pair [Node, Previous] in the Path_Table.
Print_Path( Node ) prints the path from Node back to the starting node.
Now to insert the correct entries into the Path_Table we use the following modification of
the Breadth First Traverse algorithm above:
Breadth_First_Paths( Start_Node )
Initialize
   Clear Queue
   Clear Visited
   Insert Start_Node in Visited
   Insert Start_Node in Queue
   Clear Path_Table
   Insert [Start_Node, null] in Path_Table
Repeat until Queue is empty
   Dequeue( Vertex )
   Process Vertex
   For each Node adjacent to Vertex & not yet Visited
      Insert Node in Visited
      Insert [Node, Vertex] in Path_Table
      Enqueue( Node )
   end for
end repeat
end breadth_first_paths
The major difference between this algorithm and the Breadth First Traverse algorithm is that
every time this algorithm visits a new node, it saves the corresponding edge in the Path_Table.
Example 8.3.1.3. Develop an algorithm to find a path between two specified nodes.
Using the above Breadth First Paths algorithm, this is now a trivial problem. An algorithm
is:
   Find_Path( Start_Node, End_Node )
      Breadth_First_Paths( Start_Node )
      Print_Path( End_Node )
   end find_path
where the Print_Path routine processes the Path_Table to produce the final path.
Example 8.3.1.4. How are the connected components and the transitive closure of a graph
computed?
Recall that a connected component of a graph is a set of nodes such that there is a path con-
necting any two nodes in the component. A moment's analysis of the breadth first algorithm
shows that it produces the connected component containing the starting node. In particular,
when the algorithm is done executing, the Visited set is the connected component containing the
starting node. If the graph contains more than one connected component, then the Visited set
will not contain all of the nodes in the graph and it is necessary to repeat the breadth first algo-
rithm using a starting node which is not in the original Visited set. This will produce a second
Visited set which is equal to a second connected component of the graph. This process must be
repeated over and over again until every node in the graph is in one of the Visited sets.
To develop an algorithm to compute all the connected components of a graph, first assume
the breadth first algorithm has been modified so that:
1. the new value of Vertex is output every time Vertex is dequeued (this will
print out the connected component), and
2. the Visited set is global to the graph package and is not initialized at the
beginning of the breadth first algorithm.
If the Visited set is implemented using a bit-map, then the algorithm to generate all of the
connected components is:
   Initialize
      Clear Visited
   Repeat until every node is in Visited
      Select any Start_Node not in Visited
      Breadth_First_Traverse( Start_Node )
   end repeat
Each time this algorithm executes the Breadth First algorithm, it will output another connected
component.
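In Ada, the driver loop might look like the following sketch. It assumes the two modifications just listed (the vertex is printed as it is dequeued, and the Visited bit-map is global to the graph package), along with the adjacency list Table of the graph package:

```ada
--A sketch only: Visited is assumed to be a package-global
--Boolean array (a bit-map) indexed by table row.
procedure Print_All_Components is
begin
   Visited := ( others => False );   --Clear the bit-map once.
   for I in 1 .. Table_Size loop
      if not Visited( I ) then
         Ada.Text_IO.New_Line;       --Start a new component.
         Breadth_First_Traversal( Table(I).Node_Name );
      end if;
   end loop;
end Print_All_Components;
```

Each pass through the if statement starts a traversal from a node no earlier traversal reached, so each traversal prints exactly one connected component.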
The transitive closure of a graph is closely related to the concept of a connected component.
The transitive closure of a graph is another graph which contains the same nodes, but the nodes
in the transitive closure are adjacent to one another if and only if they are in the same connected
component in the original graph; that is, two nodes are adjacent to one another in the transitive
closure if and only if there is a path between the two nodes in the original graph.
An algorithm to compute the transitive closure of a graph is similar to the one for computing
the connected components of a graph. The major difference is that the Breadth First algorithm is
modified so that it uses a set, the Connected Component, to keep a list of nodes currently in the
component being generated. Every time a Vertex is dequeued, edges between the new value of
the Vertex and the other nodes in Connected Component are inserted into the transitive closure.
Breadth_First_for_Transitive_Closure( Start_Node )
Initialize
   Clear Queue
   Clear Connected Component Set
   Insert Start_Node in Visited
   Insert Start_Node in Queue
Repeat until Queue is empty
   Dequeue( Vertex )
   Insert edges between Vertex and every node in Connected Component
      into the transitive closure
   Insert Vertex in Connected Component
   For each Node adjacent to Vertex & not yet Visited
      Insert Node in Visited
      Enqueue( Node )
   end for
end repeat
end breadth_first_for_transitive_closure
Each time this routine is executed, it inserts into the transitive closure graph edges between
every pair of nodes in the same connected component. To generate the complete transitive clo-
sure graph this routine must be executed for every connected component in the original graph.
An algorithm to do this is:
   Initialize
      Clear Visited Set and Transitive Closure Graph
      Insert all nodes of the original graph into Transitive Closure Graph
   Repeat until every node is in Visited
      Select any Start_Node not in Visited
      Breadth_First_for_Transitive_Closure( Start_Node )
   end repeat
The actual computer program can be sped up if the insertions into the Transitive closure
graph take advantage of the fact that the Node and Vertex are actually row numbers in the corre-
sponding adjacency matrix or table.
The Big Oh execution times for computing the connected component and the transitive
closure are (assuming the Visited set is implemented using a bit-map):
                                      Representation
Algorithm                     Adjacency Matrix        Adjacency List
One Connected Component          O( C*N )                O( E )
All Connected Components         O( N^2 )                O( E )
Transitive Closure           O( N ) + O( N^2 )    O( N ) + O( E ) + O( N^2 )
where C is the number of nodes in the connected component and E is the number of edges in the
graph.
Exercises
produce if the starting node were the:
a. Well, c. Pump2, or
b. Pump1, d. Tank.
5. What Path_Table would be generated if the Breadth First Paths algorithm were applied to the
graph in:
a. Exercise 1 using Well as the starting node,
b. Exercise 2 using Well as the starting node,
c. Exercise 3 using A as the starting node,
d. Exercise 3 using D as the starting node, and
e. Exercise 4 using A as the starting node?
6. Develop a complete, detailed breadth first graph traversal algorithm assuming an adjacency
matrix representation.
7. Translate the algorithm of Exercise 6 into a working Ada subprogram and add it to your
graph package.
9. The adjacency matrix representation of the Breadth First Traverse algorithm can be made
faster if the Visited set is replaced by a Not Visited set and the inner loop is replaced by:
For each Node in Not_Visited
   If Adj( Vertex, Node ) = 1 then
      Process Node
      Delete Node from Not_Visited
      Enqueue( Node )
end for
Develop a complete breadth first traversal algorithm for this new version and discuss its speed
and space requirements.
10. Develop an algorithm to determine the node furthest from a specified node in a graph.
11. Develop an algorithm to print all of the nodes and their distance (number of edges) from a
given starting node.
12. Develop detailed algorithms for producing a connected component of a graph assuming the
graph is represented using:
a. an adjacency matrix, or b. an adjacency list.
13. Develop detailed algorithms for producing the transitive closure of an undirected graph
assuming the graph is represented using:
a. an adjacency matrix, or b. an adjacency list.
14. Develop detailed algorithms for producing the transitive closure of a directed graph assum-
ing the graph is represented using:
a. an adjacency matrix, or b. an adjacency list.
The depth first traversal follows one path as far as possible before attempting to follow
another path. For example, given the graph:
A --> B          D --> B, F
B --> A, C, D    E --> C
C --> B, E       F --> D
a depth first traversal, starting at node A, would first follow the path A, B, D, F as far as possible
before coming back and picking up the remaining path B, C, E.
On the other hand if the starting node is node B, then first rearrange the graph as follows:
      A
      |
      B --- D --- F
      |
      C --- E
to emphasize the paths leading from B. A depth first traversal is then B, D, F, C, E, and A.
As with breadth first traversals, the depth first traversal is not necessarily unique, but
depends upon the order in which the edges are inserted into the graph.
The major advantage of depth first traversal over breadth first traversal is that sometimes
there are a large number of possible paths and we wish to find a path (any path will do) without
trying all possible subpaths first. Depth first follows a single path until it knows it can go no
further.
There are two kinds of algorithms for depth first traversal. The first is based upon replacing
the queue of the breadth first traversal by a stack. The second kind of algorithm is based upon
recursion. We will develop both.
Example 8.3.2.1. Develop depth first graph traversal algorithms using a stack.
We essentially take the algorithm for breadth first traversal and replace the
queue by a stack. Thus, the basic depth first traversal algorithm is:
Depth_First_Traverse( Start_Node )
Initialize
   Clear Stack
   Clear Visited
   Insert Start_Node in Visited
   Push Start_Node onto Stack
Repeat until Stack is empty
   Pop( Vertex )
   Process Vertex
   For each Node adjacent to Vertex & not yet Visited
      Insert Node in Visited
      Push Node onto Stack
   end for
end repeat
Terminate
end depth_first_traverse
Following this algorithm by hand using the sample graph above with node A as the starting node
will show that it does indeed follow first the path A, B, D, F, then the remaining path B, C, E.
Since the algorithm is essentially the same as the breadth first traversal, the execution time is
also the same; that is, depending upon the representation method the execution time is:
                         Representation
Algorithm        Adjacency Matrix     Adjacency List
Traverse             O(C*N)               O(E)
To develop a depth first algorithm to determine all the paths to a given node, we simply take
the Breadth First Paths algorithm from section 8.3.1 and replace every occurrence of a queue by
a stack. The new algorithm is:
Depth_First_Paths( Start_Node )
Initialize
   Clear Stack
   Clear Visited
   Insert Start_Node in Visited
   Push Start_Node onto Stack
   Clear Path_Table
   Insert [Start_Node, null] in Path_Table
Repeat until Stack is empty
   Pop( Vertex )
   For each Node adjacent to Vertex & not yet Visited
      Insert Node in Visited
      Insert [Node, Vertex] in Path_Table
      Push Node onto Stack
   end for
end repeat
end depth_first_paths
where, as before, we assume the existence of a Path_Table package to save and print the
Path_Table.
The recursive traversals simply call themselves recursively instead of inserting a node in a
stack or queue. Thus, a recursive depth first graph traversal algorithm is:
Recursive_Depth_First_Traverse( Start_Node )
Clear Visited
Insert Start_Node in Visited
Call Trav ( Start_Node )
end recursive_depth_first_traverse
Trav( Vertex )
Process Vertex
For each Node adjacent to Vertex & not yet Visited
Insert Node in Visited
Trav ( Node )
end for
end trav
The Recursive_Depth_First_Traverse algorithm clears the Visited set, marks the start node as
visited, and calls Trav passing the starting node as a parameter. For each node adjacent to the
start node and not yet visited, the Trav algorithm visits the node, inserts the node in Visited and
then calls itself recursively. (Note that this version assumes Visited is global. What effect
would there be if we made Visited a parameter of the Trav algorithm?)
The reader should carefully follow this algorithm with a few sample graphs to make certain
the basic technique is clear. Note carefully that the recursive version of depth first visits the
nodes in a different order from the non-recursive version of depth first. In other words, the par-
ticular order for visiting nodes depends upon the exact algorithm; the important feature is that
the visiting is done in depth first order.
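As a sketch, the recursive pair translates into Ada along these lines. It assumes the adjacency list Table, the Edge_Pipe package, and a package-global Visited bit-map (a Boolean array); the Process step is left as a comment:

```ada
--A sketch only: Visited is assumed to be a package-global
--Boolean array indexed by table row.
procedure Trav( Vertex : in Subscript ) is
   Node : Subscript;
begin
   --Process Vertex here.
   Edge_Pipe.Open( Table(Vertex).Adj_Nodes );
   while not Edge_Pipe.End_of_Pipe( Table(Vertex).Adj_Nodes ) loop
      Edge_Pipe.Get_Next( Table(Vertex).Adj_Nodes, Node );
      if not Visited( Node ) then
         Visited( Node ) := True;
         Trav( Node );   --The recursive call replaces the stack.
      end if;
   end loop;
end Trav;

procedure Recursive_Depth_First_Traverse( Start : in Node_Name_Type ) is
   Place : constant Natural := Location_Of( Start );
begin
   if Place = 0 then
      raise Node_Name_Not_Found;
   end if;
   Visited := ( others => False );
   Visited( Place ) := True;
   Trav( Place );
end Recursive_Depth_First_Traverse;
```

The recursion is safe with the pipe operations because each pipe keeps its own cursor, and the Visited guard ensures each node's pipe is opened at most once.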
The routine to find all the paths to a given node is very similar. Assuming a Path_Table
package like the one used in the breadth first traversal, the algorithm is:
Recursive_Depth_First_Paths (Start_Node)
Initialize
Clear Visited
Clear Path_Table
Insert [Start_Node, null] in Path_Table
Insert Start_Node in Visited
Generate Path
R_Paths( Start_Node )
Print Path
end recursive_depth_first_paths
R_Paths( Vertex )
For each Node adjacent to Vertex & not yet Visited
Insert [Node, Vertex] in Path_Table
Insert Node in Visited
Call R_Paths( Node )
end for
end r_paths
The algorithm is a straightforward modification of the depth first paths algorithm above. Since
it traverses every node connected to the starting node, it generates a Path_Table from which the
final path(s) can be printed. Since it is essentially a copy of the depth first traversal, the execu-
tion time is also the same.
If all we want to know is if there is a path from one node to another, then a faster version of
this algorithm is available:
   R_Path_Exists( Vertex )
      For each Node adjacent to Vertex & not yet Visited
         Insert Node in Visited
         If Node = End_Node then Halt
         Call R_Path_Exists( Node )
      end for
   end r_path_exists
If the End node is in the Visited set, then there is a path between the two nodes; otherwise not.
The advantage of this algorithm over the previous algorithm is that it halts as soon as it
reaches the end node instead of continuing on to traverse every node. Hence it can be
considerably faster than the previous algorithm. It can also take just as much time as the
previous algorithm. It all depends upon whether the end node is one of the first or one of the
last nodes visited.
Exercises
a. A, c. F, or
b. B, d. C.
6. What is the effect of moving the Process Node statement to inside the loop in the recursive
procedure Trav?
7. Develop:
a. a connected component algorithm and
b. a transitive closure algorithm
based upon depth first search of a graph.
8. Develop algorithms based upon the depth first traversal of a graph to print all nodes and their
distance (number of edges) from a given starting node.
A weighted graph is a graph where each edge has a numerical value called a weight. The
two graphs of Figure 8.1.1, for example, might appear as:
[Figure: the map graph of Figure 8.1.1 with distances in miles on its edges (Toledo-Dayton
100, Dayton-Cincinnati 50, Dayton-Columbus 60, Cincinnati-Columbus 70) and the water
system graph (Well, Pump1, Pump2, Tank) with flow rates on its edges]
where the weights on the map are distances in miles and the weights on the water distribution
system are flow rates in gallons/minute. The weights could just as easily have been in feet, travel
time, telephone costs, or any other units that could be of interest. The important fact is that each
edge has a numerical weight.
It simplifies the discussion if we talk as if the weights are distances. Then we can refer to
such things as the "closest" node or the "shortest path." You must keep in mind, however, that
the weights can be any kind of units that make sense in the given problem.
The basic "housekeeping" operations for a weighted graph are the same as for an ordinary
graph except that each edge now has a weight:
Representations
Each entry in the adjacency matrix for a weighted graph contains the weight of the corre-
sponding edge with * to indicate non-edges; thus the map graph above is stored as:
         Tol   Day   Cin   Col
   Tol    -    100    *     *
   Day   100    -     50    60
   Cin    *     50    -     70
   Col    *     60    70    -
Note that it is impossible to have multiple edges between the same two nodes using the adja-
cency matrix representation.
To make the minimum path algorithms work, the -'s and *'s will have to be replaced by some
very large positive number. (Ada uses the T'Large attribute to obtain the largest number of type
T.)
The actual algorithms to implement the housekeeping operations for a weighted graph are
almost identical to those for an ordinary graph and are left for the reader.
The adjacency list representation stores the weight along with each node in the adjacency list
or pipe. Thus, the map graph above would be stored as:
Computer Representation
Toledo --> [2, 100]
Dayton --> [1, 100], [3, 50], [4, 60]
Cincinnati --> [2, 50], [4, 70]
Columbus --> [2, 60], [3, 70]
where the square brackets indicate a record with two fields. Again, the housekeeping algorithms
for the adjacency list representation are almost identical to the ordinary graph algorithms for an
adjacency list representation and are left for the reader.
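The two representations above can be sketched as follows. The book's packages are written in Ada; this illustrative sketch uses Python, and the names (`matrix`, `adj_list`, `weight`) are ours, not the book's. `INF` plays the role of the very large number that replaces the -'s and *'s.

```python
# Both weighted-graph representations, using the map graph above.
INF = float("inf")

nodes = ["Tol", "Day", "Cin", "Col"]

# Adjacency matrix: the weight of each edge, INF for non-edges and the diagonal.
matrix = [
    [INF, 100, INF, INF],   # Tol
    [100, INF,  50,  60],   # Day
    [INF,  50, INF,  70],   # Cin
    [INF,  60,  70, INF],   # Col
]

# Adjacency list: each entry is a (neighbor index, weight) record.
adj_list = {
    0: [(1, 100)],
    1: [(0, 100), (2, 50), (3, 60)],
    2: [(1, 50), (3, 70)],
    3: [(1, 60), (2, 70)],
}

def weight(u, v):
    """Look the weight up in the adjacency list; INF if there is no edge."""
    return next((w for n, w in adj_list[u] if n == v), INF)

# The two representations agree edge for edge.
for u in range(4):
    for v in range(4):
        if u != v:
            assert matrix[u][v] == weight(u, v)
```

Note the trade-off the text describes: the matrix answers "what is the weight of {u, v}?" in constant time, while the list stores only the edges that exist.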
A common problem is developing a minimum cost network that spans a number of nodes of
some kind. A city's water system, for example, must have a connection to every building in the
city. There are many possible ways to link the buildings together into a complete distribution
system and associated with each possible link is a cost. The problem is to determine the system
with the minimum cost. It doesn't matter whether the distribution system is a water system, a
gas system, an electric system, a telephone system, or even a set of computers, the basic problem
remains the same, find the set of edges with minimum weight that connect every node in the
graph. This set of edges is called the minimal spanning tree.
The minimum spanning tree is not in general unique. Sometimes more than one spanning
tree has the same length; for example, the graph:
[Figure: a four-node graph A, B, C, D whose edges have weights 1, 1, 2, and 2, together with
two different minimal spanning trees of the same total length.]
When more than one minimal spanning tree exists, the exact spanning tree produced by a com-
puter program depends upon the algorithm used and the order in which the nodes and edges are
entered into the package. Since all of the minimal spanning trees are of the same length, in some
sense, it does not matter which one is produced.
There are several schemes for producing the minimal spanning tree and all are variations
upon the so-called "greedy" method. All of the greedy methods work by starting with a one- or
two-node spanning tree and, at each step, expanding this spanning tree by adding the next smallest
edge which preserves the minimal spanning tree properties. Since each step adds at least one
new node to the spanning tree, sooner or later the spanning tree must span the whole graph.
There are several possible variations of this basic algorithm. One possible algorithm, due to
Kruskal, starts with the smallest edge in the graph. This edge is now the spanning tree for its
two end nodes. Then, one edge at a time, the algorithm always tries to add the next smallest edge
which will not complete a circuit. Once it finds the next smallest edge which does not complete
a circuit, it adds the edge to the spanning tree. The basic idea then is to always keep adding the
smallest remaining edge that will not complete a circuit. A more detailed algorithm is (note that
Visited is a set of edges):
Kruskal's_Minimum_Spanning_Tree
   Initialize
      Clear Visited
      Insert all the edges into a pipe S sorted on weights
      Open Pipe S
   Repeat until Visited contains N - 1 edges
      Read the next edge [u, v, Weight] from Pipe S
      If the edge does not make a cycle with the edges in Visited
         then Insert the edge into Visited
end Kruskal's_Minimum_Spanning_Tree
[Figure: the example graph with edges {A, B} of weight 1, {A, C} of weight 3, {B, D} of
weight 3, {B, C} of weight 4, and {C, D} of weight 2.]
Initialize
Sorted Pipe : [A, B, 1], [C, D, 2], [A, C, 3], [B, D, 3], [C, B, 4]
Visited = {}
Remaining Passes: [A, B, 1], [C, D, 2], and [A, C, 3] are dequeued in turn and added to
Visited; every edge remaining in the pipe after that makes a cycle with the edges already
chosen.
While conceptually simple and easy to do by hand, this algorithm can be a rather slow and
tedious algorithm to execute on a computer. The tedious part is checking each new edge to see
if it completes a cycle. Unless the optimal data structure is used for this checking, the checking
can be very time consuming. That structure is beyond the scope of this text; the interested
reader should see R. E. Tarjan's book, Data Structures and Network Algorithms, published by
the Society for Industrial and Applied Mathematics, 1983.
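As a concrete (if simplified) illustration, here is a Python sketch of Kruskal's algorithm; the book's code is in Ada, so the function and variable names here are our own. The cycle check uses a simple union-find structure, a rudimentary version of the optimal structure mentioned above, and a sorted list stands in for the pipe sorted on weights.

```python
# A sketch of Kruskal's algorithm with a simple union-find cycle check.
def kruskal(nodes, edges):
    """edges is a list of (weight, u, v); returns the spanning-tree edges."""
    parent = {n: n for n in nodes}

    def find(n):                      # follow parent links to the set root
        while parent[n] != n:
            n = parent[n]
        return n

    tree = []
    for w, u, v in sorted(edges):     # the "pipe sorted on weights"
        ru, rv = find(u), find(v)
        if ru != rv:                  # edge does not complete a circuit
            parent[ru] = rv
            tree.append((u, v, w))
        if len(tree) == len(nodes) - 1:
            break                     # the tree already spans every node
    return tree

# The example graph used in the trace above.
edges = [(1, "A", "B"), (2, "C", "D"), (3, "A", "C"), (3, "B", "D"), (4, "C", "B")]
tree = kruskal(["A", "B", "C", "D"], edges)
```

Running this on the example graph picks {A, B}, {C, D}, and {A, C}, for a total weight of 6, matching the hand trace.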
Prim developed a slightly different algorithm which is harder to follow, but which executes
faster on a computer. Prim's algorithm works with nodes rather than edges. Starting with any
node, the algorithm asks which node is closest to the starting node. These two nodes are now a
spanning tree for two nodes. The algorithm continues by always adding to the spanning tree the
closest node not yet in the spanning tree. To determine the closest node not yet in the spanning
tree, the algorithm uses a priority queue. (Recall that a priority queue is a queue where every
item has a "priority" and the dequeue operation returns the item with smallest priority.) A
detailed algorithm is (Visited is a set of nodes this time):
Prim's_Minimum_Spanning_Tree ( Start_Node )
   Initialize
      Clear Visited
      Clear Priority Queue
      Insert [Start_Node, null, null] in the Priority Queue
   Repeat until the Priority Queue is empty
      Dequeue the smallest entry [Vertex, Predecessor, Weight]
      If Vertex is not in Visited
         then Insert Vertex into Visited and add the edge {Predecessor, Vertex} to the tree
              Enqueue [w, Vertex, Weight of {Vertex, w}] for every node w not in Visited
end Prim's_Minimum_Spanning_Tree
To compute the execution time of this algorithm, assume the Visited set is implemented by a
bit map so both the search and insert times of the Visited set are O(1). The innermost loop is
repeated at most once for each edge. The execution time of the innermost loop is determined by
the time necessary to enqueue an item in a priority queue. Assuming the priority queue is imple-
mented by a binary search tree ordered on the "priority," the enqueue time varies from O(E) to
O(log₂E) where E is the number of edges. The total execution time then varies from O(E²) to
O(E·log₂E) depending upon how well balanced we keep the tree used in the priority queue.
Using the heap representation of a priority queue (see the Sorting chapter) guarantees a bal-
anced tree and, hence, the faster execution time.
It is possible to speed up this algorithm. Since the minimal spanning tree of a graph with N
nodes requires only N-1 edges, the inner loop can be halted as soon as N-1 edges have been
found. This does not affect the general big O execution time because in the worst cases it must
still examine every possible edge, but the practical result for most cases is a much faster pro-
gram.
To illustrate the use of Prim's algorithm, we will use the same graph as before:
[Figure: the example graph with edges {A, B} of weight 1, {A, C} of weight 3, {B, D} of
weight 3, {B, C} of weight 4, and {C, D} of weight 2.]
Assume the Start_Node is D, then initialize the priority queue and enqueue [D, Λ, Λ]. Now the
results after initialization and after each pass through the loop are:
Initialize:
Visited = {}
Priority Queue: [D, Λ, Λ]
Remaining Passes until the Priority Queue is empty: when the Vertex in the dequeued
edge is already in Visited, the edge is ignored.
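A Python sketch of Prim's algorithm may make the bookkeeping concrete (the book works in Ada; the names here are ours). The standard library heap serves as the priority queue, each entry being a (Weight, Vertex, Predecessor) triple so that the smallest weight is dequeued first; `None` stands in for the book's null.

```python
import heapq

def prim(adj, start):
    """adj maps each node to a list of (neighbor, weight); returns tree edges."""
    visited = set()
    tree = []
    queue = [(0, start, None)]            # [Weight, Vertex, Predecessor]
    while queue:
        w, vertex, pred = heapq.heappop(queue)
        if vertex in visited:             # already spanned: ignore the edge
            continue
        visited.add(vertex)
        if pred is not None:
            tree.append((pred, vertex, w))
        for nbr, nw in adj[vertex]:       # enqueue edges to unvisited nodes
            if nbr not in visited:
                heapq.heappush(queue, (nw, nbr, vertex))
    return tree

# The same example graph, starting from node D as in the trace above.
adj = {"A": [("B", 1), ("C", 3)],
       "B": [("A", 1), ("D", 3), ("C", 4)],
       "C": [("A", 3), ("D", 2), ("B", 4)],
       "D": [("C", 2), ("B", 3)]}
tree = prim(adj, "D")
```

The resulting tree again has total weight 6; which of the equal-length spanning trees appears depends on tie-breaking in the heap, as the text notes.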
Example 8.4.2. Find the minimum weight path between any two nodes.
This problem has intrigued many researchers and whole books are devoted to the subject.
Many variations on the problem exist and algorithms have been developed for most of these
variations. We will present Ford's algorithm which finds the minimum path by repeatedly look-
ing for a shorter path.
Ford's algorithm depends upon an array called Dist where Dist(u) contains the distance from
the Start_Node to node u for each u. The algorithm starts by setting the entries in the Dist array
to infinity (or some suitable approximation of infinity) for every node except the start node
whose value of Dist is set to zero. It then examines all the edges in the graph one at a time. If
Dist(u) plus the weight of the edge from u to w is less than Dist(w), then it must be true that
the path that goes to u and then uses the edge {u,w} to reach the w node is shorter than the cur-
rent path to node w, so, in this case, the algorithm makes u the predecessor of w and updates the
value of Dist(w). This process is executed over and over again until no further improvements
are possible (no shorter paths are found). The algorithm is guaranteed to work provided the
graph contains no negative cycles, cycles with a total path length which is a negative number.
Ford_Minimum_Path ( Start_Node )
   Initialize
      Dist(Node) <-- Infinity for all Nodes except Start_Node
      Dist(Start_Node) <-- 0
      Pass <-- 0
      Done <-- false
      Clear Path_Table
   Repeat until Done or Pass = Number of Nodes
      Done <-- true
      Pass <-- Pass + 1
      For every edge {u, w} in the graph
         If Dist(u) + Weight(u, w) < Dist(w)
            then Dist(w) <-- Dist(u) + Weight(u, w)
                 Path_Table(w) <-- u
                 Done <-- false
end Ford_Minimum_Path
As noted earlier, the algorithm produces minimum paths provided the graph contains no
negative cycles. To see the effect of negative cycles, consider applying the algorithm to a
graph in which the edge {B, C} has a negative weight that gives some cycle a negative total
length. The negative value of the edge {B, C} sends the algorithm into an infinite loop unless there is a
limit on the number of passes through the outer loop. Hence, if the algorithm stops and Done is
false, then the graph contains a negative cycle and no minimum path.
If Done is true when the algorithm stops, then on the last pass through the outer loop, no
shorter paths were found, so the algorithm has found the minimum path from the Start_Node to
every other node in the graph. The Dist array contains the minimum distance from Start_Node
to every other node and the minimum paths are in the Path_Table. The execution time is
O(N*E) where N is the number of nodes and E is the number of edges.
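A Python sketch of Ford's algorithm as described above (names are ours, not the book's). The pass counter bounds the outer loop so that a negative cycle cannot cause an infinite loop; Done still false when the loop stops would signal such a cycle.

```python
INF = float("inf")

def ford_minimum_path(nodes, edges, start):
    """edges is a list of (u, v, weight); returns (dist, path_table, done)."""
    dist = {n: INF for n in nodes}
    dist[start] = 0
    path_table = {}                        # predecessor of each node
    done, passes = False, 0
    while not done and passes < len(nodes):
        done = True
        passes += 1
        for u, v, w in edges:
            if dist[u] + w < dist[v]:      # a shorter path to v through u
                dist[v] = dist[u] + w
                path_table[v] = u
                done = False
    return dist, path_table, done          # done False => negative cycle

# The map graph: each undirected edge appears once in each direction.
nodes = ["Tol", "Day", "Cin", "Col"]
edges = [("Tol", "Day", 100), ("Day", "Tol", 100),
         ("Day", "Cin", 50), ("Cin", "Day", 50),
         ("Day", "Col", 60), ("Col", "Day", 60),
         ("Cin", "Col", 70), ("Col", "Cin", 70)]
dist, paths, done = ford_minimum_path(nodes, edges, "Tol")
```

Starting from Toledo, the minimum distances come out as 100 to Dayton, 150 to Cincinnati, and 160 to Columbus, with the paths recoverable by following the predecessors in `paths`.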
Exercises
3. The pipe in Kruskal's algorithm can be replaced by a priority queue. Discuss the pros and
cons of this change in the algorithm.
4. Kruskal's algorithm depends upon determining if an edge makes a cycle with edges in Vis-
ited. Develop an algorithm to do this.
6. Develop a complete Ada package for the weighted graph ADT using:
a. adjacency matrix or
b. adjacency list representations.
7. Translate the algorithms of Exercise 5 into working Ada subprograms and add them to your
graph package.
Many problems in everyday life require that things be done in a certain order. College courses,
for example, often have prerequisites and must be taken in a certain order. Buildings must be
erected in a certain order: the foundation must be installed before the walls can be erected and
the walls must be erected before the roof can be built. Given a collection of items with con-
straints on the ordering of the items, a topological sort is an ordering of the items that preserves
all of the necessary order between individual items.
In many cases there is a simple, straightforward way to determine the necessary ordering, the
topological sort, but some problems are more complicated. Consider, for example, the directed
graph:
[Figure: a directed graph on nodes a, b, c, and d with edges a→b, a→c, b→c, b→d, and c→d,]
where the arrows indicate the order in which events must occur; that is, a must come before both
b and c, b must come before both c and d, and c must come before d. One possible order for this
problem is a, then b, then c, then d, or in abbreviated form, [a, b, c, d]. This case is easy to see
because the graph is simple. A similar graph for erecting a large building may contain 10,000
nodes and 100,000 edges. It is difficult to look at a graph of this size and determine a topologi-
cal ordering.
There is, however, a simple technique for determining a topological ordering of a graph
regardless of the size of the graph. It is based upon a predecessor count, how many nodes of the
graph must occur before this node can occur. Take the graph above: by counting the number of
edges coming into a node, it is obvious that node a has no predecessors, node b has one prede-
cessor (node a), node c has two predecessors (nodes a and b), and node d has two predecessors
(nodes b and c). This can be summarized in the table:
Node a b c d
Number of Predecessors 0 1 2 2
Since node a has no predecessors, it can obviously go first. If node a is eliminated from the
graph, then node b no longer has any predecessors and node c has only one predecessor; that is,
the predecessor counts are now:
Node b c d
Number of Predecessors 0 1 2
Since b now has no predecessors, it can go next so that, at this point, our topological sort is [a,
b]. With node b eliminated from the graph, node c no longer has any predecessors and node d
only has one predecessor; that is, the predecessor counts are now:
Node c d
Number of Predecessors 0 1
Clearly node c can go next so that, at this point, our topological sort is [a, b, c]. With node c
eliminated from the graph, node d has no predecessors left and can be output. Thus, the final
topological sort of this graph is [a, b, c, d].
The core of this technique is keeping track of the predecessor counts. Every time a node is
output, the predecessor counts are updated by subtracting one from the count of each node
pointed to by the output node. Any time a predecessor count becomes zero, the node is available for output.
More than one node can be available for output at any given time. There can even be more than
one node available for output at the beginning; for example, the graph:
[Figure: a directed graph on nodes a, b, c, and d with edges a→b, b→d, and c→d]
has nodes a and c available at the beginning. When more than one node is available for output,
it does not matter which node is output first. This last graph, for example, has three possible
topological sorts, [a, c, b, d], [c, a, b, d], and even [a, b, c, d], depending upon which of the
available nodes is output first. There are several ways to handle this in an algorithm, but one
simple method is to insert nodes with no predecessors into a queue and dequeue them as needed.
Before developing a formal algorithm for producing a topological sort, it helps to develop
some of the necessary pieces. The first piece produces the initial value of the Predecessor
Count, the number of predecessors, for each node. Let Predecessor_Count be an array; then:

Initialize_Predecessor_Count
   Predecessor_Count(Node) <-- 0 for every Node in the graph
   For every edge {u, w} in the graph
      Predecessor_Count(w) <-- Predecessor_Count(w) + 1
Another necessary subalgorithm is one to update the values of the Predecessor Counts after a
node is output or "removed from the graph." In practice the node stays in the graph data struc-
ture and the values of the predecessor counts are updated, but, conceptually, the node is removed
from the graph. After the node Out_Node is "removed from the graph," the subalgorithm to
update the values of Predecessor Count is:

Update_Predecessor_Count ( Out_Node )
   For every edge {Out_Node, w} in the graph
      Predecessor_Count(w) <-- Predecessor_Count(w) - 1
      If Predecessor_Count(w) = 0
         then Enqueue w
This subalgorithm also inserts any nodes which no longer have predecessors into a queue for
later output.
Combining these subalgorithms into the final algorithm gives:

Topological_Sort
   Initialize
      Clear Queue
      Initialize the values of Predecessor_Count for every node in the graph
      Enqueue every node with a Predecessor_Count of zero
   Repeat until the Queue is empty
      Dequeue a node and output it
      Update the values of Predecessor_Count, enqueueing any node
         whose Predecessor_Count becomes zero
   Terminate
      If all of the nodes have been output
         then topological sort
         else no topological sort
The Terminate piece of this algorithm is necessary because not every graph has a topological
sort. Consider, for example, a graph that contains a circuit such as the following:
[Figure: a directed graph on nodes a, b, c, and d whose edges form a circuit, so that every
node has exactly one predecessor.]
Clearly this graph has no topological sort and, in fact, the algorithm above would produce noth-
ing because every node has a Predecessor Count of one. No graph with a circuit has a topologi-
cal sort and, in every such case, the algorithm above will come to the point where the queue is
empty, but not all of the nodes have been output. Thus, it is necessary to count the nodes as
they are output; a topological sort exists, and has been produced, only if every node has been
output.
This last result seems obvious, but it is best to double check. There is, in fact, a standard
theorem in mathematical graph theory:
Theorem: A directed graph has a topological sort if and only if it does not have a circuit.
Thus, our intuition is backed up by mathematical results. Many graph algorithms depend upon
the results of mathematical graph theory and anyone who works with graphs should become
familiar with mathematical graph theory.
This same algorithm can obviously also be used to determine if a directed graph has a circuit.
It suffices to change the terminate portion of the algorithm so that it outputs a comment about
the presence or absence of a circuit. It is also possible to eliminate the output of the nodes in the
repeat loop of the algorithm.
To determine the execution time of this algorithm, assuming an adjacency list representation,
note that every edge is processed twice, once in the initialization and once in the part that
updates the values of the predecessor counts. Similarly, every node is processed four times,
three times in the initialize section of the algorithm and once in the repeat loop. The final execu-
tion time is then
O(Number of Edges) + O(Number of Nodes).
For an adjacency matrix representation, the execution time of the initialization phase is
O(Number of Nodes²). Updating the value of the predecessor count takes O(Number of Nodes).
The final execution time is, therefore,
O(Number of Nodes²).
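The complete predecessor-count algorithm above can be sketched in Python as follows (an illustrative translation, not the book's Ada; the names are ours). A return of None corresponds to the "no topological sort" case.

```python
from collections import deque

def topological_sort(adj):
    """adj maps each node to the list of nodes it must precede."""
    # Initialize: count the predecessors of every node.
    pred_count = {n: 0 for n in adj}
    for targets in adj.values():
        for t in targets:
            pred_count[t] += 1
    queue = deque(n for n in adj if pred_count[n] == 0)
    output = []
    while queue:
        node = queue.popleft()
        output.append(node)
        for t in adj[node]:            # "remove" node from the graph
            pred_count[t] -= 1
            if pred_count[t] == 0:
                queue.append(t)
    # Terminate: if not every node was output, the graph has a circuit.
    return output if len(output) == len(adj) else None

# The four-node example above: a before b and c, b before c and d, c before d.
order = topological_sort({"a": ["b", "c"], "b": ["c", "d"], "c": ["d"], "d": []})
```

On the example graph this produces the ordering [a, b, c, d], while a graph with a circuit (say two nodes pointing at each other) yields None.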
Exercises
1. Give two everyday examples of problems where things must occur in a certain sequence.
3. Expand the topological sort algorithm above to show all the details of the algorithm
assuming:
a. an array, or
b. a linked
representation.
4. What is the effect of replacing the queue in the topological sort algorithm above with a stack?
This book and most other textbooks give the impression that corresponding to every problem
there is a solution consisting of an algorithm which executes in time O(Size), or O(Size²), or
O(Size³), but always in some reasonable amount of time. This gives the impression that all
problems have solutions and that every algorithm executes in some reasonable amount of time.
In fact, problems can be classified into at least four categories:
1. impossible -- that is, no solution exists,
2. impractical -- any solution takes an unreasonable amount of time to compute,
3. practical -- a solution exists which executes in a reasonable amount of time, and
4. no one knows if a reasonable solution exists or not.
It is important that you realize that these categories exist and have some idea of what problems
fit into what categories.
Impossible Problems
The most basic question in problem solving is: Does an algorithm of any kind exist for solv-
ing a given problem? In other words, does the problem have a solution of any kind? The
answer is: Not always. The classic example of a problem without a solution is the Halting Prob-
lem. Here the problem is to write a program which will determine if any other program has an
infinite loop. This would be a handy program to have. It could be used to test a new program to
make sure that the program contains no infinite loops.
Unfortunately, no such program can exist. To see this, assume the opposite, assume that we
have such a program. Now make one minor modification: whenever the program finds a pro-
gram with an infinite loop, it prints out this fact and stops, but if it finds a program with no infi-
nite loop, it prints out this fact and goes into an infinite loop. With this simple modification, the
program acts in the opposite way from the program it is examining: if the program it is examin-
ing executes forever, then this program stops; if the program it is examining stops, then this pro-
gram executes forever. Now, what does this program do when applied to itself? If it has an
infinite loop, it halts; if it doesn't have an infinite loop then it executes forever. Either way is a
contradiction. Therefore, the original assumption must be false and no such program can exist.
The difficulty here is that the problem describes a logical impossibility.
For readers who prefer a more concrete example, assume the Halts function exists; that is,
assume Halts( P ) is a Boolean function which returns true if the program P halts and returns false otherwise.
Then consider the Test function:
procedure Test is
begin
if Halts( Test ) then -- if Test halts
loop
null; -- then Test does NOT halt
end loop;
end if; -- otherwise it does!
end Test;
If Test stops, then Halts returns a true and Test goes into an infinite loop. If Test does not stop,
then Halts returns a false and Test stops. A contradiction either way.
Halts is perhaps the simplest impossible problem, but there are others. Worse, it is not
always clear whether a given problem is impossible or not. Other problems can seem reason-
able, yet are still impossible. It certainly seems reasonable that there should be a computer pro-
gram for every possible mathematical function. There are certainly programs to compute the
square root of a number, the trigonometric functions, and so forth. Strangely enough, there are
mathematical functions which cannot be calculated by a computer. The proof is roundabout and
based upon two facts.
Fact 1: Mathematicians have shown that there are an uncountable number of
mathematical functions; that is, there are more mathematical functions than
there are positive integers.
Fact 2: There are only a countable number of computer programs; that is, there
is at most one computer program for each positive integer.
To see this last fact, note that every computer program eventually consists of a finite number of
zeros and ones in the computer memory. Thus, each computer program corresponds to some
base 2 integer (not all base 2 integers correspond to a valid computer program), so there are only
a countable number of possible computer programs. Since there are only a countable number of
possible computer programs and there are an uncountable number of mathematical functions,
there must be mathematical functions which cannot be computed by a computer.
Even more interesting, there are many more mathematical functions which cannot be com-
puted than there are functions which can be computed. Fortunately, this does not seem to be a
practical difficulty, but it is interesting that computers are limited to some subset of all possible
problems and that most problems have no computer solution regardless of how big or how fast
the computer is.
Even if a problem has a computer solution, an algorithm which computes the solution to the
problem, this does not mean that the solution is practical; that is, it can be computed in a reason-
able amount of time.
As a simple example of an impractical problem, consider a program to print out every per-
mutation of a set of N items. If, for example, the set has three items, A, B, and C, then the pos-
sible permutations are:
ABC, ACB, BAC, BCA, CAB, CBA
or a total of six outputs. In general, a set with Size items has Size! permutations, so the program
would have to produce Size! outputs. Thus, any algorithm to produce this output has a
minimum execution time of O(Size!). O(Size!) is a reasonable execution time for small values
of Size, but quickly grows out of any reasonable bounds. To get some feeling for the necessary
execution time, consider the case where Size equals 25, a fairly small integer. To compute the
necessary execution time, start with the fact that 25! ≈ 1.55 × 10²⁵. Assuming the computer can
produce 10⁶ outputs per second, the required time is:

Time = 1.55 × 10²⁵ outputs × (1 second / 10⁶ outputs) = 1.55 × 10¹⁹ seconds
This is too large to be meaningful, so let us convert it into years. First, the number of seconds in
one year is about 3.15 × 10⁷, so the required time is

1.55 × 10¹⁹ seconds / (3.15 × 10⁷ seconds/year) ≈ 5 × 10¹¹ years

or 500 billion years, which greatly exceeds the age of the universe. Thus, even this trivial
sounding problem takes an unreasonable (unattainable?) amount of time.
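The arithmetic is easy to check mechanically. This Python fragment assumes, as an illustration, a machine producing one million outputs per second:

```python
import math

# 25! outputs at an assumed rate of 10**6 outputs per second, in years.
outputs = math.factorial(25)              # about 1.55 * 10**25
seconds = outputs / 10**6                 # at one million outputs per second
years = seconds / (365.25 * 24 * 3600)    # about 3.16 * 10**7 seconds per year
```

The result is on the order of 5 × 10¹¹ years, the "500 billion years" quoted above.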
Given that 25! is too large an execution time, is there some way of deciding what is a reason-
able and what is an unreasonable execution time without having to do a detailed calculation? Is
there some simple rule of thumb that suffices for most purposes? This, of course, depends upon
the definition of reasonable, but most people consider any algorithm with an execution time of
O(Size), or O(Size²), or, in general, O(Size^C) for some constant C, the so-called polynomial exe-
cution time algorithms, to be reasonable. Even for C = 3 and Size = 1000, this can be a very
long time, but it usually does not require centuries or even years of computer time. On the other
hand, exponential execution time algorithms, ones with execution times of O(2^Size) or worse,
such as O(Size!), can take longer to execute than the lifetime of the universe. These are defi-
nitely unreasonable. So one possible definition of a reasonable problem is one with polynomial
execution time and an unreasonable problem is one with exponential execution time.
The next question is: Is there some way in advance to know if a problem is going to take an
unreasonable amount of computer time? At the moment, the answer depends on the problem.
Any problem with a polynomial execution time algorithm is obviously a reasonable problem.
Any problem that requires printing Size! items is obviously unreasonable. But there are many
problems between these two extremes. Most problems in this in-between category are either:
a. ones where no algorithm is known or
b. ones where all known algorithms take exponential execution time,
but perhaps a faster algorithm exists which has not yet been found.
In other words, most in-between problems are ones where we don't know whether a significantly
faster algorithm exists or not.
A simple example to illustrate the difficulty is the Hamiltonian cycle problem, to find a cir-
cuit which goes through each node of an undirected graph once and only once. The Hamiltonian
circuit problem sounds simple, but all known solutions essentially require trying all possible
paths through the graph --- which can take O(Size!) execution time.
Strangely enough, a very similar sounding problem, the Euler circuit problem, has a reason-
able solution. The Euler circuit problem is to find a circuit which visits each edge of a graph
once and only once. There is an algorithm for the Euler circuit problem with execution time
O(Number of Edges) + O(Number of Nodes); that is, a polynomial time solution. It seems rea-
sonable that, since the Hamiltonian circuit problem is so similar to the Euler circuit problem, the
Hamiltonian circuit problem should also have a polynomial execution time solution. But no one
has ever found such a solution. Neither has anyone proven that a polynomial execution time
algorithm is impossible. Thus, we have a problem where the known solutions are unreasonable
and no one knows if there is a reasonable solution waiting to be discovered.
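The brute-force approach just described, trying all possible paths, can be sketched in Python (an illustration, not the book's code; the names are ours):

```python
from itertools import permutations

def hamiltonian_circuit(nodes, edges):
    """edges is a set of frozenset{u, v} pairs; returns a circuit or None."""
    first, rest = nodes[0], nodes[1:]
    for perm in permutations(rest):          # (Size - 1)! candidate orders
        cycle = (first,) + perm + (first,)
        if all(frozenset((cycle[i], cycle[i + 1])) in edges
               for i in range(len(cycle) - 1)):
            return cycle                     # every consecutive pair is an edge
    return None

# A four-node ring: the circuit A-B-C-D-A visits every node once.
edges = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]}
circuit = hamiltonian_circuit(["A", "B", "C", "D"], edges)
```

The factorial number of candidate orders is exactly why this approach becomes hopeless as the graph grows, which is the point of the discussion above.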
The Hamiltonian circuit problem is typical of a large class of problems for which no one
knows if faster solutions are available. Trying to find faster algorithms is a basic part of com-
puter science and many people are working on these kinds of problems. Even if someone finds a
faster algorithm for one particular problem, that still leaves the untold multitude of other prob-
lems. So, instead of attacking individual problems, some experts are trying to group problems
into classes of similar problems by some scheme or other so that one can discuss whole groups
of problems. One common scheme is to classify problems by their relative execution times.
Obviously the simplest are the O(1) problems. This is a special case of the class P of polynomial
execution time problems. Also obviously the class E of exponential execution time algorithms
contains the unreasonable algorithms. There are problems with even worse than exponential
execution times, but they are mostly of theoretical interest at the moment.
As an example of a problem with more than exponential execution time, consider a program
to input an integer n and output (n!)^(n!) ones. Clearly this program has execution time O( (n!)^(n!) ).
This problem is obviously made up, but the mere fact that we can easily make up a problem with
this kind of execution time indicates that there may be practical problems with the same kind of
execution times.
One might think that talking about classes of problems is so abstract that it tells us nothing
about how to solve individual problems, but this is not necessarily so. One of the more interest-
ing developments in computer science in the last 25 years is the discovery of classes of problems
where the solution of any one problem in the class tells us something about the solutions to all of
the other problems in the same class.
The best known class of this type is the class of NP-Complete problems. This is a very inter-
esting class which contains many practical problems, but to explain the underlying concepts, we
have to start in a rather roundabout manner. Instead of using a standard, deterministic computer,
assume we use a nondeterministic one, a computer which at each step of the algorithm is free to
decide what step to execute next and at every step it infallibly chooses the step which leads
directly to the solution of the problem. Thus, a nondeterministic computer can solve the Hamil-
tonian circuit problem in O(Size). (At each step it chooses the next edge in the Hamiltonian
circuit.) This sounds like a pretty powerful computer. It is impossible to build such a computer,
but it is an interesting theoretical concept because it concentrates on the essence of producing an
answer to a problem and ignores tedious details such as how we produce the answer. Strangely
enough, the extra power is not as powerful as one might guess at first.
Now let NP denote the class of problems solvable in polynomial execution time on a nonde-
terministic computer. In some sense these are problems whose minimum execution time is at
least polynomial. Many graph problems, such as the Hamiltonian circuit problem, and other
practical problems are in this category. But, like the Hamiltonian circuit problem, many of these
problems have no known algorithm with polynomial execution time on a deterministic machine.
The most interesting problems in the NP class of problems are the so called NP-Complete
problems. This class includes the Hamiltonian problem, the Traveling Salesman problem, and
the Satisfiability problem, to mention a few of the better known members. In some sense the
problems in this class all require finding one solution from either Size! or 2^Size possible solu-
tions. Thus, the Traveling Salesman problem requires finding the cheapest route for visiting all
of the salesman's customers exactly once and the Satisfiability problem requires, given a Boo-
lean expression in Size variables, finding a set of values of the variables that makes the Boolean
expression true.
An interesting property of the NP-Complete problems is that any problem in NP can be
essentially replaced by an NP-Complete problem; that is, in some sense, any NP problem can be
translated into an equivalent NP-Complete problem so that solving the NP-Complete problems is
equivalent to solving all of the NP problems. In some very real sense, the NP-Complete prob-
lems are the core set of the class of NP problems; if we can solve the NP-Complete problems,
we can solve the NP problems.
Even more interesting, it turns out that the NP-Complete problems are all equivalent in the
sense that if we can solve one of them, we can solve all of the others. Thus, a solution to any
one of the NP-Complete problems leads to a solution of all of the NP problems. In other words,
if a single NP-Complete problem is shown to have a polynomial execution time algorithm on a
standard, deterministic computer, then every NP problem must have a polynomial execution
time algorithm on a standard, deterministic computer. In more general terms, if a single NP-
Complete problem is also in P, then every NP problem must also be in P. This means that find-
ing a polynomial solution to any single one of the thousands of NP-Complete problems will suf-
fice to show that all of the other problems have a polynomial solution.
For this reason much effort has been spent on the NP-Complete problems, but, while thou-
sands of problems are now known to be NP-Complete, no real progress has been made in prov-
ing or disproving that any NP-Complete problem is also a P problem. Most experts now assume
that if P = NP, then someone should have found a proof by now. But, by the same argument, if
P ≠ NP, then someone should have found a proof by now, so the only solid conclusion at the
moment is that no one knows.
This may seem like a rather long diversion from the original point: determining if a problem
has a reasonable solution or not, a solution that executes in polynomial time or one that executes
in exponential time. But the last conclusion above states that even if we cannot answer the ques-
tion directly, we can know that, currently, all known solutions to the NP-Complete problems
have exponential execution time. Thus, before attempting to solve a new category of problem it
is worth checking if the problem is in P, E, NP, or NP-Complete.
Since exponential solutions are impractical some programmers have successfully changed
the original problem slightly to a new problem with a polynomial solution. Take the Traveling
Salesman for example; if finding the cheapest possible path is unreasonable, then is there a way
to find a path which is close to the cheapest possible path? Various polynomial execution time
algorithms are known which approximate the cheapest path; that is, find one close to but not
necessarily equal to the cheapest path. Thus, the slightly altered problems are in P, and, hence,
slightly altered solutions to all of the NP problems are in P. The tools and techniques used for
these altered problems are outside the scope of this text, but, obviously, any programmer faced
with an NP-Complete problem should consider altering the problem to some sort of approxima-
tion.
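One widely used heuristic of this kind, sketched here in Python rather than Ada purely for illustration, is the nearest-neighbor rule: from each city go to the cheapest unvisited city and finally return home. It runs in polynomial, O(Size^2), time but only approximates the cheapest tour; the cost matrix below is hypothetical.

```python
def nearest_neighbor_tour(cost, start=0):
    """Greedy approximate tour: from each city, go to the cheapest
    unvisited city, then return to the start.
    cost[i][j] is the travel cost between cities i and j."""
    n = len(cost)
    unvisited = set(range(n)) - {start}
    tour, total, here = [start], 0, start
    while unvisited:
        nxt = min(unvisited, key=lambda j: cost[here][j])
        total += cost[here][nxt]
        unvisited.remove(nxt)
        tour.append(nxt)
        here = nxt
    total += cost[here][start]   # close the tour
    tour.append(start)
    return tour, total

# Hypothetical 4-city symmetric cost matrix.
cost = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(nearest_neighbor_tour(cost))   # ([0, 1, 3, 2, 0], 18)
```

Note that the heuristic gives no guarantee of optimality; it merely trades the exponential search for a polynomial one.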
Another possible approach to an NP-Complete problem is to develop an algorithm which
works quickly most, but not all, of the time. The Simplex method for solving linear programming
problems is in this category. While the method can take an exponential amount of time
to produce an answer, most of the time it produces an answer in much less time than this. One
always hopes that the problem at hand is one where the method works quickly and not one
where the method takes an exponential amount of computer time.
This seems like a long way from our original question: Is there some way to know in
advance if a problem is going to take an unreasonable amount of computer time? The reason is
that no one really knows the answer for all possible problems. If the problem has a polynomial
execution time solution, then there is no difficulty. If the problem requires outputting an unrea-
sonable amount of data, then the answer is that the problem has only unreasonable solutions.
Many problems are between these two extremes and, if the problem is NP-Complete, then no
one knows.
There is one final comment that must be kept in mind. There is a difference between an
unreasonable problem and an unreasonable algorithm. It is always possible to start with a rea-
sonable problem and generate an unreasonable algorithm to solve the problem. Many program-
mers have done this quite by accident; they start with a simple problem and generate an algo-
rithm that takes an unreasonable amount of time. In other words, if your algorithm takes an
unreasonable amount of computer time, maybe the algorithm needs to be redesigned.
Exercises
2. Define f(n) as n raised to the n power raised to the n power, and so forth, n times; for exam-
ple, f(2) = 2^2 and f(3) = 3^(3^3). Is there a problem with an execution time of O( f(n) )?
3. Let f(n) be an arbitrary function defined for each positive integer value of n. Develop a pro-
gram with execution time of O( f(n) ).
4. Is there a largest possible, finite execution time? If so, give a problem with this execution
time. If not, then give a problem with the largest possible, finite execution time that you can
find.
Searching
Storing and retrieving data is one of the most common operations in many programs. The
exact way the data is stored can greatly affect the performance of the system. To illustrate some
of the techniques and their properties, this chapter compares different data structures for imple-
menting a simple set ADT with insert and search operations. Most other ADT's have as many
different representation data structures, but searching a set is a particularly simple case and allows
us to explore the options in some detail. Our goals are to study search techniques and to illustrate
the effect of the data structure on the final system.
9.1. Overview
The goal of this chapter is a thorough understanding of searching and the effects of the data
structure used to store the data on the search process. The data is assumed to be a collection of
items which can consist of, to name just a few cases, a list of simple items, a table, or a file of
records. The search itself can be on the whole item or some key part of the item, but the basic
goal in every case is the same, to search a collection for a given item. Most of the searches
assume that the search is always made using the same key part of the data, but the last section in
the chapter presents search methods to be used when the same data is accessed using different
parts of the data as the key.
The important thing is the number of different data structures that can be used to store a
collection and the effect of the data structure upon the insertion and search times. As will be
seen, each data structure has different insertion and searching characteristics; in other words,
changing the data structure used to store the collection can drastically alter the amount of storage
space used and the time necessary to access an item.
While any kind of collection can be stored and accessed, it simplifies the presentation and
discussion to assume the collection is a simple set ADT with only two operations: Insert and
Search. The items in the set are, of course, unique and the insert operation must ensure that no
duplicate items are inserted into the set.
Occasionally a bag ADT with the same two operations will be discussed. A bag can contain
duplicate items so that the insert operation is simpler than the set insert operation. The bag search
on the other hand may be more complicated than the set search because it is possible that the bag
search must find all occurrences of the specified or sought item.
For programming purposes, all of the data in this chapter is assumed to be stored in the set
package whose program specification is in Program 9.1.1. There are several features of this
program specification which should be noted:
- The set data is assumed to be of type Data_Type.
- The data is always accessed using a key of type Key_Type; in other words, the key
is the part of the data that is used in searches.
- Since the data is accessed by a key, the client program must include in the instantia-
tion Boolean functions to compare two key values for equality and for less than.
Not all implementations use the less than function.
- The Maximum_Size is the maximum subscript allotted in the array used to store the
set items. Not all implementations use this feature.
- The specification includes two exceptions:
- one for running out of storage space, and
- one for trying to insert a duplicate item into a set.
As usual, the package specification says nothing about how the data is actually stored. The
package body will implement the set using various data structures, each with its own storage and
search scheme. Regardless of the storage scheme used and the search method used, the goal is
the same: to store and retrieve information using a fast, efficient scheme. The specification in
Program 9.1.1 begins:

generic
   type Data_Type is private;   --Data type of set items.
   type Key_Type is private;    --Data type of key.
package Set_ADT is
   ...
For simplicity and clarity, Specification 9.1.1 assumes the key is stored and accessed
separately from the rest of the data. This assumption is not necessary; it is possible to store the
key inside the data. If, for example, each piece of data is an employee record and the key is the
employee name, then the record can be stored as a single unit containing the employee name and
other pertinent information. In this case, the two comparison functions, equal and less than, have
to compare data within records; that is, they are of the form:
function "=" ( Left : in Data_Type;
Right: in Key_Type ) return Boolean;
function "<" ( Left : in Data_Type;
Right: in Key_Type ) return Boolean;
Similarly, if there is a function to extract the key value from the complete record, then the Insert
procedure does not need a separate key value.
With similar modifications all of the other programs in this chapter can be made to work
without storing the key separately, but this version is left for the exercises.
The simplest type of search is a sequential search, a search that examines each item in turn
until it finds the item it is seeking or it reaches the end of the data set. We can speed up sequen-
tial search by either optimizing the algorithm or reorganizing the data in the data set. We consider
each option in turn.
This section examines the basic sequential search algorithm and ways to speed it up. We start
with an array representation of a data set, study ways to speed it up, and then do the same thing
for a linked representation of a data set.
Assume the data is stored in two parallel arrays:

   Key ( 0 .. Maximum_Size )    --Key(k) holds the key value of the kth item
   Item( 0 .. Maximum_Size )    --Item(k) holds the kth item

where Item(k) contains the kth item in the set and Key(k) contains the corresponding key value.
(As noted above, the key value might actually be stored as part of the item, but the explanations
are clearer if we consider the key as stored separately from the item.)
A basic search algorithm to find the data item with key value equal to The_Key is:
Search( The_Key )
   Initialize
      I <-- 1
   Repeat for each item until found (while I <= Size & Key( I ) /= The_Key)
      I <-- I + 1
   end repeat
   Return( I <= Size )
end search
Since, on the average, this algorithm must examine one half of the items in the list, its execu-
tion time is O(Size).
This algorithm works well, but can it be speeded up? The initialize and terminate portions of
the algorithm are already O(1), so any speed gain must be made in the loop. The loop executes
two tests and one assignment each time through the loop, so we ask is there a way to eliminate
one of the tests?
There is a way to eliminate the test (I < Size) from the loop, but it is not obvious. The trick is
to store the sought key value in Key( 0 ) and use the following algorithm:
Search( The_Key )
   Initialize
      I <-- Size
      Key( 0 ) <-- The_Key
   Repeat until the key is found (while Key( I ) /= The_Key)
      I <-- I - 1
   end repeat
   Return( I > 0 )
end search
which uses the fact that the sought key value is in Key( 0 ) to stop the search in the case where
the item is not in the array.
This algorithm executes only one comparison and one assignment each time through the loop,
so it is the fastest possible sequential search algorithm. (Why can't we have a loop which executes
less than one comparison and one assignment each time through the loop?)
To use this last algorithm, we must of course always leave Key( 0 ) empty so we have room
to insert the sought key in Key( 0 ). This is, however, a small price for the expected gains.
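The book's implementations are in Ada, but the sentinel trick is language independent; here is a minimal Python sketch (names illustrative), with position 0 reserved for the sentinel as in the text:

```python
def sentinel_search(key, keys, size):
    """Sequential search with a sentinel in keys[0].
    keys[1..size] hold the stored key values; returns True
    if key is among them.  One comparison per step."""
    keys[0] = key          # sentinel guarantees the loop stops
    i = size
    while keys[i] != key:  # no separate bounds test needed
        i -= 1
    return i > 0

keys = [None, 'C', 'A', 'D', 'B']      # slot 0 reserved for the sentinel
print(sentinel_search('D', keys, 4))   # True
print(sentinel_search('Z', keys, 4))   # False
```

The loop body has exactly one comparison and one assignment, matching the count in the pseudocode above.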
The standard algorithm to search a linked representation of a set:
Search( The_Key )
Initialize
P <-- Head
Found <-- false
Repeat for each item until found (while P /= null & not Found)
if P.Key = The_Key
then Found <-- true
else P <-- P.Next
end repeat
Return( Found )
end search
also executes three comparisons for each item in the list. In the array representation we need only
one (1) comparison per item if we insert the sought item at the end of the data set. To obtain the
same speed with a linked representation, we can add a "dummy node" at the end of the list; for
example, storing the data set A, B, C in this data structure gives:
Head --> A --> B --> C --> Dummy --> Λ
where the last item, the "dummy node," is left empty. The "dummy node" is inserted at the end of
the linked list in the Clear operation. A detailed data specification and the Clear, Empty, Insert,
and Traversal algorithms for this data structure are in Module 9.2.1.1.
This representation allows us to insert the sought item in the dummy node and use the search:
Search( The_Key )
   Initialize
      P <-- Head
      Dummy.Key <-- The_Key
   Repeat until the key is found (while P.Key /= The_Key)
      P <-- P.Next
   end repeat
   Return( P /= Dummy )
end search
which executes one comparison and one assignment each time through the loop.
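A minimal Python sketch of the dummy-node version (the node class and helper names are illustrative, not from the text):

```python
class Node:
    """Singly linked node; the list ends in an empty dummy node."""
    def __init__(self, key=None, item=None, next=None):
        self.key, self.item, self.next = key, item, next

def make_list(pairs):
    """Build a key/item linked list terminated by a dummy node."""
    dummy = Node()                       # empty dummy at the end
    head = dummy
    for key, item in reversed(pairs):
        head = Node(key, item, head)
    return head, dummy

def search(key, head, dummy):
    """One comparison per node: put the key in the dummy so the
    loop always stops, then test whether we stopped early."""
    dummy.key = key
    p = head
    while p.key != key:
        p = p.next
    return p is not dummy

head, dummy = make_list([('A', 1), ('B', 2), ('C', 3)])
print(search('B', head, dummy))   # True
print(search('X', head, dummy))   # False
```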
Both of these last two algorithms have an execution time of O( Size ), but the second version
is faster (has a smaller big O coefficient) than the first version. In other words, while the big O
execution time is a good rough indicator of execution times, it is not precise enough for compar-
ing algorithms where the execution time of one algorithm is some constant times the execution
time of the other algorithm.
Since this is the fastest possible loop for a sequential search, any further speed gains will have
to be obtained by processing fewer items. The rest of the chapter considers methods of doing
this.
The tree algorithms were the first algorithms covered in this book whose execution time
depends upon the data; in other words, altering the order in which the data is inserted in the tree
can affect the shape of the tree and, thus, the execution time of the insert and search algorithms.
The execution time of many other algorithms, however, is also dependent upon the data
processed and other factors. The execution time of sequential search depends, for example, upon
which item is being sought. Items at the beginning of the list are found much more quickly than
items at the rear of the list and the average execution time depends upon which items are sought
and how often they are sought.
Data Specification
Node is record
Key : ?? --Key value of data item.
Item : ?? --Data item to be stored in set.
Next : Pointer to Node; --Pointer to next node in set.
end record
Algorithms
Clear
   Dummy <-- new Node ( Next => Λ )
   Head <-- Dummy
end clear
Empty
Return( Head = Dummy)
end empty
Insert( The_Key, New_Item )
   Head <-- new Node ( Key => The_Key, Item => New_Item, Next => Head )
end insert
Traversal
   Initialize
      P <-- Head
   Repeat for each item (while P /= Dummy)
      Process( P )
      P <-- P.Next
   end repeat
end traverse
To get a more precise estimate of the average execution time of sequential search, we assume
the execution time is proportional to the number of comparisons that must be made. In other
words, finding the first item in the list requires one comparison, finding the second item in the list
requires two comparisons, and so forth. To simplify the analysis further, assume that all items are
sought equally often. To compute the average search time (AST) requires adding the times spent
seeking each individual item in the set and dividing by the number of items in the set.
AST = [1 + 2 + 3 + ... + Size]/Size
The sum in the square brackets is given in Appendix A as:
1 + 2 + ... + K = K(K + 1)/2
Substituting this into the equation above gives:
AST = [(Size + 1) Size/2]/Size
    = (Size + 1)/2
Thus, the average number of comparisons used in a sequential search, assuming all items are
sought equally often, is (Size +1)/2 and the average search time is O( (Size+1)/2 ) which is equal
to O( Size ). This average however is based upon all items being sought equally often. A minor
change in the access pattern (the items being sought and how often they are sought) can greatly
change the average search time. The next section presents this approach in more detail.
This search time also assumes that the item sought is in the set. If the item sought is not in the
set, then the whole set must be searched and the search time is always O( Size ). Similarly if the
item sought is in a Bag and we must find every occurrence of the item in the bag, then the search
time is O( Size ).
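The (Size + 1)/2 average derived above is easy to check by brute force; this Python sketch (illustrative, not the text's Ada) counts comparisons over one search for each item:

```python
def comparisons_to_find(key, keys):
    """Count the comparisons a plain left-to-right scan makes."""
    for count, k in enumerate(keys, start=1):
        if k == key:
            return count
    return len(keys)   # unsuccessful search examines every item

size = 100
keys = list(range(1, size + 1))
# Each item sought once, i.e. all items sought equally often.
avg = sum(comparisons_to_find(k, keys) for k in keys) / size
print(avg)   # 50.5, i.e. (Size + 1)/2
```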
Not all data is created equal in terms of the usage pattern. Most data sets seem to have some
data which is accessed much more often than other data. The classic example of this is the 80-20
rule said to have been first discovered for beer drinkers: 20% of the beer drinkers drink 80% of
the beer sold. At any rate it was soon noted that a small percentage of the buyers of anything
account for most of the sales of that item. For example, 20% of the credit card holders account
for most credit card sales. The exact percentages 80-20 are not important; 70-30 or 90-10 work
just as well as 80-20. The important fact is that often a small percentage of the items in a data set
account for most of the accesses to the data.
If, by some magic, the frequently accessed items were at the front of the data set, if they were
the first items checked by the search algorithm, then the sequential search algorithm is much faster
than when the data set is unordered. The question is how do we insure that the heavily accessed
items are at the front of the data set? There are several methods for doing this and this section
presents some of the more commonly used methods.
- Frequency Counts: The most accurate method is to count how often each item is accessed
and then organize the data on the basis of frequency of use; that is, the most frequently used item
should be first, the second most frequently used item should be second, and so forth. This can be
done by adding a Count field to each item and incrementing the Count field every time the item is
accessed. From time to time then, the data set is sorted on the Count field. Provided the access
pattern remains the same, this is the ideal method. Any change in the access pattern, however,
requires waiting until the new values of the Counts are available and then sorting. For a set with
thousands of items, it can take a large number of accesses until the new values of the Counts are
reasonably accurate.
The frequency count method is most useful when the data access pattern is stable; that is,
changes very slowly or not at all.
- Transposition: This method moves an item forward one position every time the item is
accessed. Over a period of time the heavily used data items migrate to the front of the list. This
method is slow to respond to a change in usage, but it also tends to ignore abnormal access
patterns.
The transposition method is most useful when the data access pattern changes slowly over a
period of time.
- Move-To-Front: This method moves an item to the front of the data set every time the item
is accessed. This method responds very quickly to a change in usage, but it also responds quickly
to any abnormal access pattern. This method is easy to implement and executes quickly for data
stored in a linked representation. On the other hand, it executes very slowly for data stored in an
array because moving an item to the first location in an array implies that all the items in front of
the item to be moved must be shifted down one position to make room.
The move-to-front method is very useful when the program tends to access the same data item
several times in a row. After the first access, each additional access then tends to require only
O(1) execution time.
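A move-to-front search is easy to sketch; this Python version uses a built-in list standing in for the linked representation (in a real linked list the move itself is O(1)):

```python
def mtf_search(key, items):
    """Search; on a hit, move the item to the front of the list."""
    for i, k in enumerate(items):
        if k == key:
            items.insert(0, items.pop(i))   # move to front
            return True
    return False

items = ['A', 'B', 'C', 'D']
mtf_search('C', items)
print(items)   # ['C', 'A', 'B', 'D']
mtf_search('C', items)   # a repeat access now takes one comparison
print(items)   # ['C', 'A', 'B', 'D']
```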
- Mixed: Every time a piece of data is accessed, that piece of data is either moved forward
say 5% or 10% of the distance to the front of the list or swapped forward 5% or 10% of the
distance to the front of the list. This method has medium response speed to changes of access
pattern. Again, by using either moving forward or swapping this method can be made to work for
either arrays or linked representations.
- Cache: Here we have a small, high speed memory, called a cache, which is used to store
the most frequently accessed items. A common example occurs when a large data set is too large
to store in the main memory and must be stored on a disk file. In this case, we can store a small
(relatively speaking) subset of the data set in the main memory. This main memory storage is
called the cache. If the most frequently accessed data is stored in the cache, then the access speed
is essentially determined by the time to search the cache rather than the time to search the disk
file. Very large speed gains are possible by using a cache. The question is then: How to deter-
mine what data to store in the cache? The answer is that we can use any of the above methods;
that is, store the most frequently used data, the most recently used data, or any other scheme that
seems reasonable.
- Analysis of possible speed gains from data reorganization: The exact speed gains to be
expected from data reorganization depends upon the exact access pattern and is difficult to
compute exactly. Given a few assumptions, however, rough approximations are possible which
give an order of magnitude estimate of the improvements to be expected.
As an example, assume that 20% of the items account for 80% of the accesses and that this
20% of the items is at the front of the list. To be more precise, assume that:
a. the list has 100 items, and
b. the first twenty items in the list account for 80% of the accesses (and each item in
this twenty items is accessed equally often), and
c. the last eighty items in the list account for 20% of the accesses (and, while each of
these eighty items are accessed equally often, they are obviously accessed much less
frequently than the first twenty items in the list).
Then:
80% of the accesses succeed in (1 + 20)/2 = 10.5 comparisons
20% of the accesses require 20 + (1 + 80)/2 = 60.5 comparisons
for an average of 0.80(10.5) + 0.20(60.5) = 20.5 comparisons, versus an expected number of
(1 + 100)/2 = 50.5 comparisons for an unordered list; that is, searching the unordered list
requires approximately two and one half times as much effort as searching the ordered list.
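The arithmetic of this example can be scripted; the percentages and list size below follow the text's assumptions:

```python
size, hot = 100, 20          # 100 items; the first 20 are the "hot" items
p_hot = 0.80                 # hot items receive 80% of the accesses

# Average comparisons when the hot items are stored first.
hot_avg  = (1 + hot) / 2                   # 10.5
cold_avg = hot + (1 + (size - hot)) / 2    # 60.5
ordered   = p_hot * hot_avg + (1 - p_hot) * cold_avg
unordered = (1 + size) / 2                 # 50.5
print(ordered, unordered, unordered / ordered)
```

Varying `hot` and `p_hot` gives the corresponding estimates for the 70-30 and 90-10 rules in the exercises.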
Exercises
1. Is it possible to eliminate the loop in a sequential search algorithm; that is, write a sequential
search algorithm without a loop?
2. Assume the data specification given in the text for a linked representation and develop
algorithms for each of the following operations:
a. print, c. count,
b. delete, and d. maximum.
Assuming the entries are integers also develop algorithms for:
e. sum, and f. average
3. Assume a data set has 500 items. If the data set satisfies the 80-20 rule, how much faster is
sequential search if this set is stored in a set ordered by frequency of use rather than an unordered
one?
4. Assume a data set has 500 items and calculate the expected number of comparisons required
to sequentially search the data set assuming the accesses satisfy:
a. the 70-30 rule, or b. the 90-10 rule.
and we use (1) an unordered data set and (2) a data set ordered by frequency of use.
5. Develop a program to simulate inserts and searches in a data set. Compare the search times if
the accesses are:
a. uniformly distributed, c. follow the 90-10 rule, or
b. follow the 80-20 rule, d. follow the 70-30 rule.
6. Develop a general formula for estimating the search time assuming an 80-20 rule and a data
set ordered by frequency of use.
8. Develop search algorithms that use move to front and transposition for sets for an
a. array or b. linked
representation of a set. Give big O timing estimates of the execution time of your algorithms.
10. All of the above timing estimates are based upon the assumption that the given item is in the
set. What is the search time if the given item is not in the set? What is the search time if one half
of the given search items are not in the set?
Assuming the package specification of Section 9.1, a corresponding package body for an array
implementation with sequential search is given in Program 9.2.4.1. The package body contains
the standard data definitions for a single set. Note that the key values are stored in one array and
the data values in a second array. The translation of the algorithms into Ada routines is straight-
forward.
The Ada code for the linked implementation is very similar and left for the reader as an
exercise.
Exercises
1. Develop the Ada package body for the Set ADT using the linked implementation presented in
Module 9.2.1.1.
3. Develop an Ada package which allows the user/client to define as many sets as desired at
compile time; i.e., make the set a data type instead of an object.
4. Alter Program 9.2.4.1 so that the key value is stored as part of the data instead of being stored
separately. What change does this require in the package specification?
-- Set Package --
-- Implemented Using Sequential Search --
type Set_Node is
record
Size : Subscript := 0; --Number of items in set.
Key : Key_Array; --Array to hold Key values.
Items : Items_Array; --Array to hold Items.
end record;
begin
--Initialize
I := Set.Size;
Set.Key( 0 ) := The_Key;   --Sentinel: guarantees the loop stops.
--Search backward until the key (or the sentinel) is found
while Set.Key( I ) /= The_Key loop
   I := I - 1;
end loop;
--Terminate
return( I > 0 );
end Is_In;
---------------------------------------------------------------
--Insert item
Set.Size := Set.Size + 1;
Set.Key( Set.Size ) := The_Key;
Set.Items( Set.Size ) := New_Data;
end Insert;
end Set_ADT;
If we assume the data set is sorted in lexicographical order, we can use binary search, which is
much faster than sequential search. The standard binary search algorithm for data stored in
lexicographical order in an array is:
Binary_Search ( The_Key )
   Initialize
      Bottom <-- 1
      Top    <-- Size
      Found  <-- false
   Repeat until found or nothing left to search (while Bottom <= Top & not Found)
      Middle <-- (Bottom + Top)/2
      if Key( Middle ) = The_Key
         then Found <-- true
      elsif Key( Middle ) < The_Key
         then Bottom <-- Middle + 1
         else Top <-- Middle - 1
   end repeat
   Return( Found )
end binary_search
This algorithm has an average search time of O(log2Size) if the sought item is in the set and a
search time of O(log2Size) otherwise. Note that for practical purposes the search time is essen-
tially the same whether the sought item is in the set or not.
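For comparison with the pseudocode, a Python sketch of the same algorithm (0-based subscripts; illustrative, not the text's Ada):

```python
def binary_search(key, keys):
    """Standard binary search of a sorted list; O(log2 Size) comparisons."""
    bottom, top = 0, len(keys) - 1
    while bottom <= top:
        middle = (bottom + top) // 2   # midpoint of the remaining region
        if keys[middle] == key:
            return True
        elif keys[middle] < key:
            bottom = middle + 1        # discard the lower half
        else:
            top = middle - 1           # discard the upper half
    return False

keys = ['A', 'C', 'E', 'G', 'I']
print(binary_search('E', keys))   # True
print(binary_search('F', keys))   # False
```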
The price we pay for this increase in speed is that we must keep the items in the array sorted.
This implies that to insert a new value into the array, space for the new item must be created by
moving down one location all the items in the array greater than the new value. A general routine
to insert a new item into a sorted list is:
Insert ( The_Key, New_Item )
   Initialize
      I <-- Size
   Repeat for each larger item (while I >= 1 & Key( I ) > The_Key)
      Key( I + 1 )  <-- Key( I )     --Move larger items down one position
      Item( I + 1 ) <-- Item( I )
      I <-- I - 1
   end repeat
   Key( I + 1 )  <-- The_Key
   Item( I + 1 ) <-- New_Item
   Size <-- Size + 1
end insert
The insert (and delete) time is now O(Size) because, on the average, one half of the list must
be moved to make room for the new item or to delete an old item. This is the price of keeping the
list sorted. If we average all the accesses (searches as well as inserts and deletes), then binary
search of a sorted data set is faster than sequential search of an unordered data set if and only if
the total number of searches is much larger than the number of inserts and deletes. If most
accesses are inserts or deletes, then sequential search of an unordered data set may be faster.
The real problem with binary search is of course the insert and delete times. Any method
which speeds up these operations can be useful. There is one way, that sometimes works, to
speed up sorted data set inserts and deletes. When the list is first sorted, leave every other entry
in the set unused. This doubles the size of the set for search purposes (which adds, in case of
failure, one additional comparison to each binary search), but, at least for the first few inserts, the
insert time is the sum of the search time to find the position, O(log2(2 Size)) = O(1 + log2Size),
and the insertion time O(1). Thus the insert time is O(1 + log2Size). The insert time degrades as more
inserts are made and the unused entries are filled in. Therefore, the data set must be reorganized
(more blank entries inserted) whenever the insert time becomes too large.
Another possible solution is to combine both the binary and the sequential search methods.
One way to do this is to store the data items in a sorted list and insert the new entries in a
separate, unsorted list. The sorted list is searched using a binary search and, if the item is not
found there, the separate list is searched using a sequential search. From time to time the
unsorted list is sorted and merged into the sorted list. This method works well provided there are
a large number of searches for each insert.
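The sorted-list-plus-unsorted-overflow scheme can be sketched as follows; the class name and the merge threshold are illustrative assumptions, not from the text:

```python
import bisect

class HybridSet:
    """Sorted main list searched by binary search, plus a small
    unsorted overflow list for new items, searched sequentially.
    When the overflow grows past `limit`, it is sorted and merged."""
    def __init__(self, limit=8):
        self.sorted, self.overflow, self.limit = [], [], limit

    def insert(self, key):
        if key in self:                 # sets reject duplicates
            return
        self.overflow.append(key)       # O(1) insert
        if len(self.overflow) > self.limit:
            self.sorted = sorted(self.sorted + self.overflow)
            self.overflow = []

    def __contains__(self, key):
        i = bisect.bisect_left(self.sorted, key)      # binary search
        if i < len(self.sorted) and self.sorted[i] == key:
            return True
        return key in self.overflow                   # sequential search

s = HybridSet(limit=2)
for k in [5, 1, 9, 3]:
    s.insert(k)
print(3 in s, 7 in s)   # True False
```

In a production system the merge would typically be scheduled during idle time, as the text suggests.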
Another way to combine the binary and sequential search methods is to store the data items in
a sorted list. Associated with each data item in the sorted list is a set. New items are inserted in
the set associated with the next larger item in the original data set. Insertion time is now
O(log2Size) and the search time is O(log2Size) plus the average length of the sets. Provided the
sets are relatively small, the search times are almost as good as binary search. From time to time
the sorted list is updated by sorting each set and merging it into the sorted list.
The final way we will consider is to replace the sets in the last version with binary search
trees. That is, the data items are stored in a sorted list and a BST is associated with each data
item in the sorted list. Insertion time is O(log2Size) plus the average depth of the associated BST.
Search time is the sum of the time required to search the sorted list and the time required to
search the BST.
These last few methods require updating the sorted list by inserting a set of new items into the
correct position in the sorted list. This can take some computer time, so it is often done when
nothing else is going on, say at two or three o'clock in the morning.
The binary search time of a bag is O(log2Size) + O( number of duplicates). The correspond-
ing insert time for a bag is the same as for a set. The proof is left for the reader.
A precise comparison between sequential search and binary search depends upon the exact
algorithm used, the Ada compiler used, and the computer used. To perform such a comparison
we would have to choose a computer and an Ada compiler, then develop, compile, and use both
methods on this computer with a data sample which accurately represents the data and a search
sample which accurately reflects the searches to be performed. This is a very time consuming
process and a minor change in the assumptions can invalidate the results.
Fortunately, in most cases, one method is so superior to the other method that a rough calcu-
lation suffices to determine the best method for a particular case. For practical purposes, we can
estimate the cost tradeoffs between the two search methods by using the number of comparisons
required by each method. The execution time of a binary search is thus log2Size and the execu-
tion time of an associated insertion or deletion is Size/2. The execution time of a sequential
search, by a similar argument, is Size/2. Since a set insert must check for duplicates before insert-
ing the new item, the insertion execution time for sequential access is Size/2 and the associated
deletion time is Size/2. We know these execution times are not exact, but they do give us order
of magnitude estimates for comparing various methods.
Since the two insertion times are the same and the binary search time is much faster than the
sequential search time, the first conclusion is that the sorted representation with binary search is
better for storing sets.
For storing bags, however, the result is different because the bag insertion does not have to
search first for duplicates before inserting a new item in the bag. The insertion time for a bag with
sequential search is therefore only O(1). This suggests that more detailed comparisons are neces-
sary between sequential and binary search methods when storing a bag. For the sake of the
comparison, assume that the bag searches are seeking only one occurrence of the sought item.
Example 9.3.1.1. Compare sequential search and binary search assuming a bag with 1000
items and
a. 99 searches for each insert,
b. 50 searches for each 50 inserts, and
c. 1 search for each 99 inserts.
Assume the execution times for the operations are approximately proportional to the number
of comparisons. For binary search of a list with 1000 items the number of comparisons is
log2(1000) or approximately 10. Similarly, the execution time for an insertion into an ordered list
with 1000 items is 1001/2 or approximately 500. The approximate relative costs for 100 opera-
tions are then:
a. 99*10 + 1*500 = 1490 (approx. 15/operation),
b. 50*10 + 50*500 = 25500 (approx. 250/operation), and
c. 1*10 + 99*500 = 49510 (approx. 500/operation).
Note how the costs increase as the fraction of inserts increases.
For comparison purposes, assume we replace the binary search by a sequential search on an
unordered list. The approximate search time for a list with 1000 items is 500 and the approximate
insert time is 1, so the approximate relative costs for 100 operations are:
a. 99*500 + 1*1 = 49501 (approx. 500/operation),
b. 50*500 + 50*1 = 25050 (approx. 250/operation), and
c. 1*500 + 99*1 = 599 (approx. 6/operation).
In cases (a) and (c), there is no question. Binary search is superior in case (a) and sequential
search is superior in case (c). Case (b) needs a more precise calculation because our calculation is
not accurate enough when the two cases are approximately equal. Neither method is very good in
case (b). Even this crude comparison, however, suffices to determine the best methods in cases
(a) and (c).
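The cost arithmetic in Example 9.3.1.1 can be checked mechanically. The following sketch is in Python purely for illustration (it is not part of the text's Ada programs); it recomputes the relative costs of 100 operations from the per-operation comparison counts assumed above:

```python
def relative_cost(searches, inserts, search_time, insert_time):
    """Total comparisons for a mix of searches and inserts."""
    return searches * search_time + inserts * insert_time

mixes = [(99, 1), (50, 50), (1, 99)]

# Sorted list with binary search on 1000 items: search ~10, insert ~500.
binary = [relative_cost(s, i, 10, 500) for s, i in mixes]

# Unordered list with sequential search: search ~500, insert ~1.
sequential = [relative_cost(s, i, 500, 1) for s, i in mixes]

print(binary)      # [1490, 25500, 49510]
print(sequential)  # [49501, 25050, 599]
```

Note how the two methods trade places as the fraction of inserts grows, exactly as the crude comparison above concludes.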
Thus, in the case of a bag, the best method depends upon the ratio of inserts to searches. This
result of course, assumes that all items are accessed equally often; if this assumption is not true,
then new execution times for search are necessary and the results will change.
Exercises
3. Develop a binary search algorithm for a bag. Show that the binary search time for a bag is
O(log2Size) + O(Number of duplicates) and the insert time is O(Size).
4. Assume (1) 50% of the accesses to the data set are searches and 50% are inserts of new data,
and (2) 10% of the data accounts for 90% of the accesses. Which is faster, the sequential or
binary method? Assume the data set contains:
a. 10 items, b. 50 items, c. 10,000 items, or d. 1,000,000 items.
6. Give practical examples where binary search is (a) a good and (b) a bad method.
8. Develop a binary search algorithm assuming the data is sorted and stored in an array with
every other entry empty. Develop an insert algorithm for this data structure.
9. What is the expected number of comparisons needed to search a data set that contains 1000
items and is stored in a sorted list, with new items inserted into a separate unsorted list?
Assume:
a. 99 searches for each insert,
b. 50 searches for each 50 inserts, or
c. 1 search for each 99 inserts.
Compute the expected number of comparisons over 1000 operations.
10. Develop an algorithm to insert a small, unsorted list into a large sorted list. Your algorithm
should be as fast as possible and require as little movement of the large list as possible. What is
the execution time of your algorithm?
11. Repeat Exercise 8 assuming that new items are inserted in a BST. How do the two methods
compare?
12. The Insert routine in the text uses a preliminary search to determine if the new item is already
in the list. Redesign the Insert routine so that this preliminary search is unnecessary and compare
the two insertion routines.
13. Show how to eliminate one comparison each time through the loop in the Insert routine by
inserting the new key value in Key( 0 ).
The Ada implementation of the binary search algorithms is straightforward. The only real
change from the algorithms is the addition of the exceptions. (In fact it is pretty much a simplified
version of the one in the Set/Bag Chapter using a key rather than an item to determine the
location in the array.) The result is in Program 9.3.2.1.
Exercises
1. Alter the insertion routine in Program 9.3.2.1 to eliminate one comparison each time through
the loop.
2. Alter Program 9.3.2.1 so that it works for a set of items which are of an enumerated data
type. What are the difficulties?
3. Generalize Program 9.3.2.1 so that it will work for any number of sets.
4. Alter Program 9.3.2.1 so that the key value is stored as part of the data instead of being stored
separately. What change does this require in the package specification given in Program 9.1.1?
-- Set Package --
-- Implemented using a Sorted List and Binary Search --
package body Set_ADT is

   subtype Subscript is Natural range 0..Maximum_Size;
   type Key_Array  is array( Subscript ) of Key_Type;
   type Data_Array is array( Subscript ) of Data_Type;
   type Set_Node is
      record
         Size  : Subscript := 0;   --Number of items in set.
         Key   : Key_Array;
         Items : Data_Array;
      end record;

   --------------------------------------------------------------
   --  (The Insert and other operations of Program 9.3.2.1 are
   --  not reproduced here.)
   --------------------------------------------------------------
   function Is_In( Set : Set_Node; The_Key : Key_Type ) return Boolean is
      Bottom, Top, Middle : Natural;
      Found : Boolean;
   begin
      --Initialize
      Bottom := 1;
      Top    := Set.Size;
      Found  := false;
      --Repeat for each item while not found and not done
      while not Found and Bottom <= Top loop
         Middle := (Bottom + Top) / 2;
         if The_Key = Set.Key( Middle ) then
            Found := true;
         elsif The_Key < Set.Key( Middle ) then
            Top := Middle - 1;      --Discard the upper half.
         else
            Bottom := Middle + 1;   --Discard the lower half.
         end if;
      end loop;
      return Found;
   end Is_In;

end Set_ADT;
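The same loop can be sketched outside Ada. This Python version, given only for illustration, mirrors the binary search in Is_In with the interval-halving steps made explicit:

```python
def is_in(keys, the_key):
    """Binary search of a sorted list, mirroring the Is_In loop."""
    bottom, top = 0, len(keys) - 1      # 0-based instead of Ada's 1..Size
    found = False
    while not found and bottom <= top:
        middle = (bottom + top) // 2
        if the_key == keys[middle]:
            found = True
        elif the_key < keys[middle]:
            top = middle - 1            # discard the upper half
        else:
            bottom = middle + 1         # discard the lower half
    return found

print(is_in([10, 20, 30, 40, 50], 30))  # True
print(is_in([10, 20, 30, 40, 50], 35))  # False
```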
9.4. Trees
Tree searches have an average search speed equal to binary search, O(log2Size), with insert and
delete times approaching O(log2Size). Trees thus combine the fast search times of a sorted list
with a much faster insertion operation. There is of course a tradeoff involved. The cost of these
improved execution times includes:
- extra memory to store the pointers,
- efforts to keep the tree more or less balanced, and
- more complicated algorithms.
The gains, however, often more than offset the costs, so trees are commonly used for fast
searches.
The O(log2Size) search time depends upon the tree being more or less balanced. In the worst
case, if the tree is one long branch, the search time can degenerate to O(Size). If the tree is
practically never altered once the data is entered, it is easy to balance the tree. We first construct
the tree and then, using inorder traversal of the tree, instead of printing the items, we insert the
items into an array so that the items in the array are sorted. Now, let the middle item in the array
be the root of the new tree; let the items in positions Size/4 and 3*Size/4 be the roots of the two
subtrees, and so forth. A general algorithm to generate a balanced tree from a sorted list is:
Balance( List )
If Size > 0, then Bal ( List, 1, Size )
end balance
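The recursive helper Bal can be sketched as follows. The sketch is in Python for illustration only; the names Balance and Bal follow the text, while the nested (item, left, right) tuple representation of tree nodes is an assumption made here:

```python
def balance(items):
    """Build a balanced BST, as nested (item, left, right) tuples,
    from a sorted list, following the text's Balance/Bal outline."""
    def bal(first, last):
        if first > last:
            return None
        middle = (first + last) // 2       # the middle item becomes the root
        return (items[middle],
                bal(first, middle - 1),    # left half  -> left subtree
                bal(middle + 1, last))     # right half -> right subtree
    return bal(0, len(items) - 1)

print(balance([1, 2, 3, 4, 5, 6, 7]))
# (4, (2, (1, None, None), (3, None, None)), (6, (5, None, None), (7, None, None)))
```

Each item is placed with O(1) work, so building the whole tree takes O(Size) time, as claimed below.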
This new version inserts the item into the tree in time O(1), hence the total execution time of this
version of the Balance routine is O(Size).
In practice, a non-recursive version is needed for efficiency. The non-recursive version is left
for the exercises.
If new items are continually being inserted in the tree as searches are being made, then more
sophisticated methods are necessary to keep the tree balanced as it is altered. The earliest method
for doing this is probably the AVL tree. The tree is named after its inventors Adel'son-Vel'skii
and Landis, but can be thought of as the AVerage Length tree because it keeps the tree balanced.
To be more precise, it assures that no two sibling subtrees differ in height by more than 1. Every
time an item is inserted in the tree, the heights of all affected subtrees are checked. If any two
sibling subtrees differ in height by more than one, the tree is rebalanced. The algorithms for rebal-
ancing are straightforward but tedious and the interested reader is referred to H. Smith's Data
Structures, Form and Function, Harcourt Brace Jovanovich Publishers, 1987. The advantage of
AVL trees is that even in the worst possible case a search takes at most 44% more time than a
search of a perfectly balanced tree. If the tree is searched often and seldom altered, AVL
trees do have advantages.
To compare the relative speeds of using a sorted array and a balanced binary tree for storing a
set, it suffices to compare the Big Oh times.
The balanced binary tree has search and insertion times of O( log2Size ). The sorted list has
the same search time, but the insertion time is O( Size ). The balanced binary tree is obviously
better in every case. This conclusion is true whether the list is a set or a bag.
The trouble is, of course, that a binary search tree does not stay balanced if too many inser-
tions are made. In other words, as random insertions are made into a binary search tree, even if
the tree is balanced to begin with, there is a distinct probability that the insertions will unbalance
it. One solution to this problem is to keep track of the number of comparisons made per search
and to rebalance the tree whenever the number of comparisons becomes larger than log2Size.
Another solution is to use bottom up trees which remain balanced as insertions are made. (They
are covered later in this section.) Still another method is to use the AVL tree. Each of these
methods has its own drawbacks and there is a need for some better way of keeping a tree balanced.
The conclusion is slightly different if a bag is stored in an unordered list.
Example 9.4.1.1. Compare sequential search and a balanced binary search tree assuming a bag
with 1000 items and
a. 99 searches for each insert,
b. 50 searches for each 50 inserts, and
c. 1 search for each 99 inserts.
Assume the execution times for the operations are approximately proportional to the number of
comparisons. For a balanced binary search tree with 1000 items the number of comparisons to do
a search is log21000 or approximately 10 and the number of comparisons needed to insert an item
is the same. The approximate relative costs for 100 operations are then:
a. 99*10 + 1*10 = 1000 (approx. 10/operation),
b. 50*10 + 50*10 = 1000 (approx. 10/operation), and
c. 1*10 + 99*10 = 1000 (approx. 10/operation).
For comparison purposes, assume we replace the balanced binary search tree by a sequential
search on an unordered list. The approximate search time for a list with 1000 items is 500 and the
approximate insert time is 1, so the approximate relative costs for 100 operations are:
a. 99*500 + 1*1 = 49501 (approx. 500/operation),
b. 50*500 + 50*1 = 25050 (approx. 250/operation), and
c. 1*500 + 99*1 = 599 (approx. 6/operation).
In cases (a) and (b), there is no question. A balanced binary search tree is superior in both cases.
Case (c) needs a more precise calculation because our calculation is not accurate enough when the
two cases are approximately equal. Even this crude comparison, however, suffices to determine
the best methods in cases (a) and (b).
Note how this conclusion differs from a similar comparison between a sorted list and an
unsorted list.
Exercises
1. Given a set with 10,000 items, compare the execution times of a sorted list vs. a binary tree
assuming 1000 inserts and:
a. 100, b. 1000, c. 65,000, or d. 1,000,000 searches.
2. Compare the execution times of a sorted list vs. a binary tree data structure for implementing
a bag assuming:
a. 1% inserts and 99% searches, b. 10% inserts and 90% searches,
c. 50% inserts and 50% searches, and d. 90% inserts and 10% searches.
3. What BST is generated when the Balance algorithm of the text is applied to the list:
a. 1,2,3,4,5,6,7 or b. 1,2,3,4,5.
4. What is the execution time of the Balance algorithm? What is the total time required to
balance a tree using the balance algorithm approach?
7. How unbalanced can an AVL tree be? Give examples to illustrate your answer.
8. Develop an algorithm to perfectly balance a tree after each insert. How much time does your
algorithm take? How does your method compare to using a sorted list and binary search?
9. The Balance algorithm can be altered so that it uses a stack or a queue rather than recursion.
Which method is (a) faster or (b) uses less storage?
10. Assuming the list is stored in a binary search tree, what is the search time if the sought item is
not in the tree?
If the data access pattern is skewed in some way, say for example, 80-20, then we can speed
up tree searches by organizing the BST based upon usage patterns. If the data is known in
advance and we know how often each piece of data will be accessed, then there is an optimal
BST, one that maximizes the search speed. The difficulty, of course, is that we must know in
advance not only the data but how often each piece of data will be accessed.
There are cases when this is known. Consider, for example, a spelling checker. We know in
advance that a small number of words account for most of the text in almost any English text. We
even know how often each word occurs. The five most common words in English, for example,
and their approximate relative frequencies are:
Word Frequency
the 0.4
of 0.2
and 0.15
to 0.15
a 0.1
A spelling checker can speed up its searches if it uses this information. Storing these five
words in the balanced tree:

          of
        /    \
     and      the
     /           \
    a             to

uses 0.2*1 + 0.15*2 + 0.4*2 + 0.1*3 + 0.15*3 = 2.05 comparisons per search on the average,
while storing them in the unbalanced tree:

        the
       /   \
     of     to
    /
  and
  /
 a

uses 0.4*1 + 0.2*2 + 0.15*2 + 0.15*3 + 0.1*4 = 1.95 comparisons per search on the average.
The unbalanced tree, therefore, is approximately 5% faster on the average than the balanced tree.
The speed gain, even in this trivial case, is significant, but depends upon knowing in advance the
access frequency of each node in the tree.
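The 5% figure can be verified by weighting each word's frequency by its depth in the tree. The following Python sketch (illustration only; the tuple trees encode a BST-consistent reading of the two trees discussed above) does the computation:

```python
def avg_comparisons(tree, freq, depth=1):
    """Expected comparisons per search: the sum of frequency * depth."""
    if tree is None:
        return 0.0
    word, left, right = tree
    return (freq[word] * depth
            + avg_comparisons(left, freq, depth + 1)
            + avg_comparisons(right, freq, depth + 1))

freq = {'the': 0.4, 'of': 0.2, 'and': 0.15, 'to': 0.15, 'a': 0.1}

balanced = ('of', ('and', ('a', None, None), None),
                  ('the', None, ('to', None, None)))
unbalanced = ('the', ('of', ('and', ('a', None, None), None), None),
                     ('to', None, None))

print(round(avg_comparisons(balanced, freq), 2))    # 2.05
print(round(avg_comparisons(unbalanced, freq), 2))  # 1.95
```

The ratio 2.05/1.95 is about 1.05, which is where the "approximately 5% faster" figure comes from.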
Detailed algorithms for computing the optimal BST for a given data set and access
pattern are available. (See, for example, Smith, op. cit.) The difficulty is that the computation time
is O(Size^2). As the size of the tree grows, this computation time becomes excessive.
- Frequency Counts: For more general trees where we do not know the frequency of use in
advance, we can, of course, use frequency counts to obtain a reasonable approximation to the
access pattern.
- Balanced Heuristic: While the time required to compute the optimal tree is O( Size^2 ), there
are faster, heuristic methods which produce BSTs which are close to optimal, but can be
computed in time O(Size). The balanced heuristic, for example, tries to keep the subtrees more or
less balanced. That is, we attempt to construct a tree such that the relative access frequency of
each subtree is balanced with its sibling. To illustrate, assume we are given the data:
We start at the top of the list and consider each item in turn as a possible root.
1. If Atlanta is the root, then it has no left subtree.
2. If Boston is the root,
then its left subtree has a frequency of 0.1 and its right subtree a frequency of 0.8
3. If Chicago is the root,
then its left subtree has a frequency of 0.2 and its right subtree a frequency of 0.6
4. If Detroit is the root,
then its left subtree has a frequency of 0.4 and its right subtree a frequency of 0.3
5. If Miami is the root,
then its left subtree has a frequency of 0.7 and its right subtree a frequency of 0.25
Choosing Detroit as the root gives a left subtree with relative frequency 0.4 and a right subtree
with relative frequency 0.3 --- which is as close to balanced as is possible with this data. Using
Detroit as the root and applying the same technique to the left subtree, we see that the left subtree
is best balanced if we use Boston as the root. Similarly, the right subtree is best balanced if
Omaha is the root. The tree is then:
Detroit
/ \
Boston Omaha
/ \ / \
Atlanta Chicago Miami Ottawa
\
Raleigh
This may not be the optimal BST, but it is close to it. Its average access time is:
Many other heuristics have been proposed. One of the better ones is the min/max heuristic
which at each node minimizes the access frequency of the subtree with the maximum access
frequency. It is slightly better than the balanced heuristic, but requires a more complicated
algorithm. Detailed algorithms are available in Smith op. cit.
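The balanced heuristic can be sketched as follows. The sketch is in Python for illustration; the frequency values below are hypothetical (the original frequency table is not reproduced here) but were chosen, in units of 0.05, to be consistent with the subtree sums quoted in the text (0.4 vs. 0.3 at Detroit, and so on):

```python
def balanced_heuristic(items):
    """items: list of (key, frequency) pairs sorted by key.  Choose as
    root the item whose left and right frequency sums are closest to
    balanced, then recurse; returns nested (key, left, right) tuples."""
    if not items:
        return None
    total = sum(f for _, f in items)
    left_sum, best, best_diff = 0, 0, None
    for i, (_, f) in enumerate(items):
        diff = abs(left_sum - (total - left_sum - f))
        if best_diff is None or diff < best_diff:
            best, best_diff = i, diff       # ties keep the earlier key
        left_sum += f
    return (items[best][0],
            balanced_heuristic(items[:best]),
            balanced_heuristic(items[best + 1:]))

# Hypothetical relative frequencies, scaled to integers (1 unit = 0.05):
data = [('Atlanta', 2), ('Boston', 2), ('Chicago', 4), ('Detroit', 6),
        ('Miami', 1), ('Omaha', 2), ('Ottawa', 2), ('Raleigh', 1)]

tree = balanced_heuristic(data)
print(tree[0], tree[1][0], tree[2][0])   # Detroit Boston Omaha
```

With these assumed frequencies the sketch reproduces the tree shown in the text: Detroit at the root, Boston and Omaha as its children, and Raleigh as the right child of Ottawa.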
When either we don't know the access pattern or the access pattern changes with time, we can
use the same types of data reorganization techniques as those developed for sequential searches;
we can, for example, use transposition and move to the front.
- Transposition: Transposition depends upon moving the search item up one level in the tree.
To move the node A up one level in a tree, we can use transformations like the following (where
C is the original parent of B and can be any node in the tree):
        C                          C
       /                          /
      B                          A
     / \         becomes        / \
    A   /3\       ===>       /1\    B
   / \                             / \
 /1\  /2\                       /2\   /3\
where the numbered triangles represent subtrees. Note that after the transformation, the node A
has moved up one level in the tree and B has moved down one level, yet the binary search tree
characteristics are preserved. Note also that only three pointers have to be changed to achieve
this result: the right pointer of A, the left pointer of B, and the pointer that originally pointed to B
must now point to A. (This transformation assumes that A is to the left of B. A transformation
to handle the case when A is to the right of B is left for the exercises.) These transformations are
easy to do and over a period of time should bring the frequently accessed items to the top of the
tree.
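The three pointer changes can be written out directly. The following Python sketch is an illustration only; the node class and the field names are assumptions made here, not the text's Ada declarations:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def transpose_up(parent, side):
    """Move B's left child A up one level, where B = parent.<side>.
    Exactly three pointers change: A's right pointer, B's left
    pointer, and the pointer in C (the parent) that pointed to B."""
    b = getattr(parent, side)
    a = b.left
    b.left = a.right            # subtree 2 becomes B's left subtree
    a.right = b                 # B becomes A's right child
    setattr(parent, side, a)    # C now points to A

def inorder(n):
    return inorder(n.left) + [n.key] + inorder(n.right) if n else []

c = Node(9, Node(5, Node(3, Node(2), Node(4)), Node(6)))   # C -> B -> A
transpose_up(c, 'left')
print(c.left.key)    # 3  (A has moved up one level)
print(inorder(c))    # [2, 3, 4, 5, 6, 9]  (BST order is preserved)
```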
- Move-To-The-Front: Move to the front requires moving the search item up to the root of
the tree. This can be done by repeating the transposition transformation until the desired item has
moved up to the root position. One kind of tree which uses repeated transpositions to achieve
move to the front is the splay tree. By careful choice of the sequence of transpositions, splay
trees attempt to balance the tree during each insertion. Splay tree insertion can be very slow
because the transposition transformation may have to be performed once for each level starting
with the level of the sought item and working upwards to the root of the tree.
There are ways, however, to transform the desired item into the root in one move. One such
transformation is:
  /1\                             A
     \                           / \
      A         becomes       /1\   /3\
     / \                         \
  /2\   /3\                     /2\

and

      /3\                        A
      /                         / \
     A          becomes      /1\   /3\
    / \                            /
 /1\   /2\                      /2\
These transformations require changing the root and two other pointers.
When the sought item is not in an outermost branch of the tree, then more complicated trans-
formations are necessary. For example,
       /4\                        A
       /                        /   \
    /1\         becomes      /1\     /4\
       \                        \    /
        A                     /2\  /3\
       / \
    /2\   /3\

and

  /1\                             A
     \                          /   \
     /4\        becomes      /1\     /4\
     /                          \    /
    A                         /2\  /3\
   / \
/2\   /3\
Exercises
1. Use the balanced heuristic method to develop a near optimal BST for the distribution:
Alabama      2     Kansas      4     Michigan    8     Oklahoma      3
California  10     Louisiana   5     Nebraska    2     Pennsylvania  9
2. The local phone book has the following distribution of the first letters of customers' names:
That is, there are 18 pages of A's, 61 pages of B's and so forth. (There were too few X's to
include.) Use the balanced heuristic to find a near optimal BST for the distribution above for the
letters:
a. A-G d. V-Z
b. I-N e. A-O
c. O-U f. A-Z
In each case, calculate the expected number of comparisons required for a search.
3. Develop a search algorithm for a binary search tree that applies the transposition method to
every item found.
4. In a sequential list there is only one way to move an item to the front. Show that for a binary
search tree there are at least three ways to move an item to the front. What is the execution time
for each method?
6. Develop a BST search algorithm which improves the search times by using:
a. transposition or b. move-to-the-front.
Another way to insure that the tree stays balanced is to keep all the leaves at the same level. The
secret is to let insertions bubble up the tree. Many such types of trees have been developed, each
with its own advantages and disadvantages. We will present three such trees, the 2-3 tree, the B
tree, and the B+ tree.
9.4.3.1. 2-3 Trees

A 2-3 tree is a tree in which every node contains either one or two data items, every non-leaf
node with k items has k+1 children, and all of the leaves are at the same level. For example,
the tree

         . 20 . 30 .
        /      |     \
      10    22, 25    40, 50

is a 2-3 tree with one node containing one item and three nodes containing two items. The root in
this case has two items along with pointers to a node with values less than 20, a node with values
between 20 and 30, and a node with values greater than 30.
The critical part of this definition is that all leaves are at the same level. This insures the tree
is always balanced. To achieve this, however, requires two new features. The first new feature is
that all insertions are made on a bottom up basis (which will be covered in a moment). The
second new feature is that to execute a bottom up insertion we use nodes with more than one data
item.
Every node in a 2-3 tree contains either one or two data items. If a non-leaf node contains
one item, then it has two pointers, one pointer to nodes with values less than the value and one
pointer to nodes with values greater than the value. This kind of node is almost identical to a
node in an ordinary BST.
If a node contains two items, then it has three pointers:
- one pointer points to nodes with items less than the first item,
- one pointer points to nodes with items between the two items, and
- the last pointer points to nodes greater than the second item.
In other words, if a non-leaf node contains two items, then the node points to three other nodes.
The graphical representation of a node is:
Item1 Item2
Next1 Next2 Next3
where Next1, Next2, and Next3 are pointers. In record form a node is:
Node is a record
Next1 : Pointer to node with data < value of Item1
Item1 : First piece of data
Next2 : Pointer to node with data between value of Item1 and Item2
Item2 : Second piece of data
Next3 : Pointer to node with data > value of Item2
end record
To insert a new entry into a 2-3 tree, we first locate the leaf node that should contain the new
entry and then repeat the following algorithm until we run out of items to insert:
   Insert the item into the current node
   If the node now contains three items,
      then break the node into two nodes, one with the smallest item
         and one with the largest item,
         and repeat this algorithm to insert the middle item
         into the parent node
Thus, we start inserting at a leaf node. If the leaf node overflows, we break the node into two
nodes and insert the middle of the three values into the parent node. If the parent node overflows,
we repeat the process again inserting the middle value into the parent's parent. We keep this
process up until we find a node which does not overflow, or we have generated a new root for the
tree. Figure 9.4.3.1 illustrates insertion into a 2-3 tree by starting with an empty tree and inserting
several items, one at a time. The leaves in Figure 9.4.3.1 all contain pointers. Since once a leaf,
always a leaf in a 2-3 tree, some designers save space by omitting the pointers from the leaves.
Insertion is more complicated if duplicate values are allowed in the tree; that is, the 2-3 tree is
used to store a bag.
Insertion into a 2-3 Tree, One Item at a Time: 56, 32, 72, 21, 35, 60, 27, 29
Figure 9.4.3.1

The basic search algorithm, assuming all nodes contain exactly two items, is:
Search ( Data )
   Initialize
      P <-- Root
      Found <-- false
   Repeat while not Found and P /= null
      If Data = P.Item1, then Found <-- true
      else if Data < P.Item1, then P <-- P.Next1
      else if Data = P.Item2, then Found <-- true
      else if Data < P.Item2, then P <-- P.Next2
      else P <-- P.Next3
   Return ( Found )
end search
A slightly more complex algorithm is necessary to allow for nodes containing only one item.
Note that 2-3 tree search uses five cases in the loop vs. the three cases used in a BST.
The primary advantage of 2-3 trees is that they always stay balanced. This insures that the
depth of the tree never exceeds log2Size, and, in fact, it may be less. The price we pay for this
guarantee is additional storage space, more complex algorithms, and slower executing algorithms.
9.4.3.2. B Trees
A B tree is a generalization of a 2-3 tree to a tree where each node may contain up to m items
where the value of m depends upon the particular B tree but may be as large as 100 or 200. In a
B tree:
1. All leaves are at the same level.
2. Every non-leaf node with k values contains (k+1)
pointers to nodes at the next level down.
Figure 9.4.3.2 illustrates a typical B tree with m equal to four. Note that the items in each node
are sorted; this allows fast search of a node.
Since the leaf nodes have no children, and never will, they do not need pointers. Hence, the
leaf nodes normally have no space for pointers.
Insertion into a B tree is very similar to insertion into a 2-3 tree. Given a new item, we first
locate the leaf node that should contain the new entry and then repeat the following algorithm
until we run out of items to insert:
   Insert the item into the current node
   If the node now contains more than m items,
      then break the node into two nodes
         and repeat this algorithm to insert the middle item
         into the parent node
Thus, we start inserting at a leaf node. If the leaf node overflows, we break the node into two
nodes and insert the middle item into the parent node. If the parent node overflows, we repeat the
process again inserting the middle item into the parent's parent. We keep this process up until we
find a node which does not overflow, or we have generated a new root for the tree. The need to
insert an item in the parent node is a nuisance, but if a node contains, say, 100 items, this occurs
on the average at most once every fifty insertions and may be much rarer.
As with 2-3 trees, duplicate entries complicate insertion.
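The shared insertion process for 2-3 trees and B trees (split on overflow, push the middle item up) can be sketched as follows. This Python sketch is an illustration, not the text's algorithm box; it assumes distinct items, and the node representation is an assumption made here:

```python
import bisect

class Node:
    def __init__(self, items=None, kids=None):
        self.items = items if items is not None else []  # sorted; at most m
        self.kids = kids if kids is not None else []     # [] for a leaf

def insert(root, item, m):
    """Insert into a B tree with at most m items per node (m = 2 gives
    a 2-3 tree); returns the root, which may be a newly grown one."""
    def ins(node):
        i = bisect.bisect_left(node.items, item)
        if not node.kids:                    # leaf: insert the item here
            node.items.insert(i, item)
        else:                                # descend; absorb a child split
            split = ins(node.kids[i])
            if split is not None:
                mid, right = split
                node.items.insert(i, mid)
                node.kids.insert(i + 1, right)
        if len(node.items) > m:              # overflow: break the node in two
            h = len(node.items) // 2
            mid = node.items[h]              # the middle item moves up
            right = Node(node.items[h + 1:], node.kids[h + 1:])
            node.items = node.items[:h]
            node.kids = node.kids[:h + 1]
            return mid, right
        return None
    split = ins(root)
    if split is not None:                    # the root overflowed:
        mid, right = split                   # grow a new root
        root = Node([mid], [root, right])
    return root

def inorder(n):
    """Items in sorted order (sequential processing of the tree)."""
    out = []
    for i, item in enumerate(n.items):
        if n.kids:
            out.extend(inorder(n.kids[i]))
        out.append(item)
    if n.kids:
        out.extend(inorder(n.kids[-1]))
    return out

def leaf_depths(n, d=0):
    """The set of leaf levels; a balanced tree yields a single level."""
    if not n.kids:
        return {d}
    return set().union(*(leaf_depths(k, d + 1) for k in n.kids))

root = Node()
for x in [56, 32, 72, 21, 35, 60, 27, 29]:   # the insertions of Figure 9.4.3.1
    root = insert(root, x, 2)
print(root.items)         # [32]
print(inorder(root))      # [21, 27, 29, 32, 35, 56, 60, 72]
print(leaf_depths(root))  # {2}
```

With m = 2 and the eight insertions of Figure 9.4.3.1 the sketch ends with 32 at the root and all leaves at the same level, matching the final tree of the figure.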
Over the years, designers have developed many variations on B trees. One variation is based
upon the fact that "real" B trees always use at least half of every node; that is, if after deletion a
node contains less than m/2 values, then the node is merged with its siblings so that every node
contains at least m/2 values. This method insures that the tree does not degenerate into a large
collection of almost empty nodes. A minor variation on this idea is to insure that every node
contains at least 2m/3 values by merging any node into its siblings if it fails to meet this criterion.
These trees are called B* trees.
Regardless of the kind of B tree, it is always balanced and this makes insertion and searching
both O(logmSize) operations.
9.4.3.3. B+ Trees
B+ trees are useful when we are storing records that we want to access both by key and
sequentially, say sorted on the key value. They differ from B trees in two ways:
1. They store all the data in leaves; non-leaf nodes contain only key values. This
implies that non-leaf nodes either can be much smaller (since they contain only the
key part of the item and not the entire item) or they can contain many more items in
the same space. The only purpose of a non-leaf node is a quick tree search using the
key to locate a record.
2. Each leaf node contains a pointer to the next leaf node so that, once we have found
the first leaf node, we can process the leaf nodes one at a time as a linked list. This
gives a quick, non-recursive, sorted output of all the items in the tree.
There are many variations on the basic concept of a B+ tree. Figure 9.4.3.3 contains a typical
B+ tree with four values per node. Note that each non-leaf node contains the largest key in each
node below it; for example, if the largest key in a given node is 1234, then the parent node will
contain the key value 1234 and a pointer to the given node.
B+ trees in various forms are widely used. Probably the most common use is in direct access
disk files; that is, disk files whose records can be accessed by the record's key value. The most
commonly used example is a VSAM file in an IBM mainframe system. Most large database
systems use some form of B+ trees for their ISAM file storage and retrieval system.
Search of B+ trees and insertion into B+ trees is a minor variation on B tree search and inser-
tion and left for the exercises.
Exercises
1. Is it possible to design a binary tree such that the leaves are all on the same level?
3. Insert the following items into a B tree with a maximum of four items per node.
a. 31, 10, 30, 93, 66, 49, 126, 33, 11, 41, 32, 34, 39, 29, 10, 47, 54, 6, 8, 127, 63, 43
b. 30, 29, 125, 31, 96, 97, 123, 39, 47, 21, 69, 49, 79, 74, 51, 48, 114, 25, 43, 64, 126,
50, 62, 6, 78
A Typical B+ Tree
Figure 9.4.3.3
4. Insert the following items into a B+ tree with a maximum of four items per node.
a. 31, 10, 30, 93, 66, 49, 126, 33, 11, 41, 32, 34,
39, 29, 10, 47, 54, 6, 8, 127, 63, 48, 43, 125
b. 30, 29, 31, 96, 97, 123, 39, 47, 21, 69, 49,
79, 74, 51, 48, 114, 25, 49, 64, 126, 49, 62, 6, 78
c. 42, 31, 49, 51, 124, 125, 47, 100, 96, 123, 48, 20,
78, 32, 75, 34, 17, 5, 68, 45, 50, 74, 14, 61, 77
5. Develop search algorithms, insertion algorithms, and sequential processing algorithms for:
a. 2-3 trees, c. B+ trees.
b. B trees, and
6. What is the maximum number of items that can be stored in a B tree of height h?
7. Given a set with Size items stored in a B tree, how large does m have to be for the B tree to
have height less than or equal to three?
9.5. Tries
A trie (the name is taken from the middle syllable of the word retrieval and is pronounced
"try") stores and searches the data set one digit/character at a time rather than one item at a time.
To illustrate, we can store the data set
101, 107, 111, 117
in the form of the n-way tree:
                 0 1 2 3 4 5 6 7 8 9
                   |
                 0 1 2 3 4 5 6 7 8 9
                 |                  \
  0 1 2 3 4 5 6 7 8 9        0 1 2 3 4 5 6 7 8 9
where each digit in the item takes us one further step down the tree. This way we need a
maximum of three probes to find a three digit item. (A probe is the work needed to determine the
next branch to take in the tree.) In fact, three probes suffices for any collection of three digit
numbers. In other words, we can search for any one of up to 1000 three digit numbers by using
only three probes.
Tries can be used with any alphanumeric data. Alphabetic data sets might have 26 or more
branches per node, but the basic principle is the same.
We could store the trie using a standard n-way tree data structure, but then each probe would
have to sequentially search a list of siblings. Each probe, on the other hand, is an O(1) operation
if we store the trie in a table with eleven columns, one column for each digit and one special
column. The above trie is then:
      0    1    2    3    4    5    6    7    8    9    *
 1         2
 2    3    4
 3        101                                107
 4        111                                117
where each entry in the table is either a pointer to another row or the item sought. In the tables
here, a pointer is shown as a single row number and a data set item is shown as the full
number itself. In an Ada
package each table entry can be a variant record, although some implementations use positive
integers for pointers and represent items by negative integers denoting a subscript in an array
which contains the items. The * column in the table will be covered shortly.
To search this trie for a particular number we always start with the first row and use the first
digit of the number to determine the column. Thus, to search for 107, we examine Table(1,1)
which is a 2, so our next probe is in row 2. The next digit in 107 is a zero, so we examine
Table(2,0) which sends us to row 3 for our next probe. The next digit is a 7, so we examine
Table(3,7) and find the number 107 so our search is successful. Any time a probe lands on an
empty table entry, the search fails.
Since each probe picks a single entity from the table, the time to execute a probe is O(1).
Tries grow, as long as more rows are available, to contain as many items as necessary. For
example, adding the numbers 103, 1005 and 14 to the above trie gives:
      0    1    2    3    4    5    6    7    8    9    *
 1         2
 2    3    4              14
 3    5   101       103                      107
 4        111                                117
 5                             1005
where we have used the first empty row, row 5, to store the next set of values.
Now assume we need to add the item 141 to the trie. The first two digits, 14, also happen to
be the value of an item in the table. To handle this, we use the * column. The trie with this
addition is:
      0    1    2    3    4    5    6    7    8    9    *
 1         2
 2    3    4               6
 3    5   101       103                      107
 4        111                                117
 5                             1005
 6        141                                           14
The 14 has been moved from row 2, where there is no longer room for it, to the * column of
row 6. The * column is used for items when there are no more digits in the item, or, in other
words, the last character of every number is assumed to be a *.
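The trie table can be sketched as follows. The sketch is in Python for illustration; the tagged-tuple entries ('ptr', row) and ('item', value) stand in for the text's variant record, and the rows are numbered from 0 instead of the text's 1:

```python
def new_row(table):
    table.append([None] * 11)        # columns 0-9 plus the '*' column (10)
    return len(table) - 1

def insert(table, number):
    """Follow the digits, creating pointer rows as needed; store the
    item at its last digit, or in a '*' column when a longer number
    shares its digits.  Assumes the numbers inserted are distinct."""
    digits = str(number)
    row = 0
    for pos, d in enumerate(digits):
        col = int(d)
        entry = table[row][col]
        if pos == len(digits) - 1:               # last digit: store the item
            if entry is None:
                table[row][col] = ('item', number)
            elif entry[0] == 'ptr':              # longer numbers share this prefix
                table[entry[1]][10] = ('item', number)
            return
        if entry is None:                        # extend the path with a new row
            nxt = new_row(table)
            table[row][col] = ('ptr', nxt)
            row = nxt
        elif entry[0] == 'ptr':
            row = entry[1]
        else:                                    # a shorter item blocks the path:
            other = entry[1]                     # move it to the new row's '*'
            nxt = new_row(table)
            table[row][col] = ('ptr', nxt)
            table[nxt][10] = ('item', other)
            row = nxt

def search(table, number):
    digits = str(number)
    row = 0
    for d in digits:
        entry = table[row][int(d)]
        if entry is None:
            return False
        if entry[0] == 'item':
            return entry[1] == number
        row = entry[1]
    star = table[row][10]            # ran out of digits at a pointer row
    return star is not None and star[1] == number

table = [[None] * 11]                # row 0 is the root (row 1 in the text)
for n in [101, 107, 111, 117, 103, 1005, 14, 141]:
    insert(table, n)
print(search(table, 107), search(table, 14), search(table, 105))  # True True False
```

Built in this order, the table reproduces the one shown above: 14 ends up in the '*' column of the row reached through its two digits, with 141 stored in that same row.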
Exercises
3. Give space and time estimates for trie inserts and searches.
9.6. Hashing
An ideal search scheme would take one look at the given key and know immediately, or at
least in time O(1), if the given item is in the data set. There is a way, called hashing, which comes
close to this ideal. It stores the items in a table location determined by the value of the key.
As a simple example, an airline might use a three digit number for a flight number and then
store the flight information in the table location corresponding to the flight number; for example,
flight 259 would be stored in row 259 of the table. This scheme allows accessing the item directly
from the key value in time O(1).
This case is particularly simple because the key values are three digit integers. In the more
general case, there must be some way to calculate the table location from the key value. Any
mathematical function which transforms a key into a table address is called a hash function; that
is, any calculation that turns a key into an address in a table is a hash function, or, as it is
sometimes known, a key-to-address function.
To illustrate hash functions consider a company with twenty or thirty employees which needs
to store information on each employee. It may set up a table with 100 rows (numbered 00
through 99) and store each employee's record in the row determined by the last two digits of the
employee's Social Security number; for example, if the Social Security number is 425-86-1234,
the employee record is stored in row 34, the last two digits of the number. Since the last two
digits of Social Security numbers are probably fairly randomly distributed between 00 and 99, the
employee records are probably fairly randomly distributed throughout the table. Thus, to insert or
to search the table for an employee record, we can use the last two digits of the Social Security
number to quickly locate the desired record.
Hash functions have Insert and Search times of O(1); they also have Delete and Update times
of O(1). These are the fastest possible speeds for these operations. There are some practical
problems however that often keep us from attaining these speeds.
Collisions
The first practical problem is collision. The above example illustrates the problem. If two
different employee Social Security numbers end with the same two digits, the records are said to
"collide" since they would be stored in the same row of the table. There are at least two ways to
handle collisions.
- Separate Chaining: The first way to handle collisions is to store a pointer to a linked list in
the table rather than the record itself. That is, each entry in the table is a pointer to a set of
records and every record in the same set has a Social Security number ending with the same two
digits. This method is called separate chaining since a set is "chained" to each array entry.
Figure 9.6.1 shows the resulting data structure as 11, 23, 83, and 51 are inserted using
separate chaining where each item is inserted into the location determined by the units digit of the
number. Thus, the number 11 is inserted in location 1 and the number 23 is inserted in location 3.
Collision occurs when there is an attempt to insert 83 in location 3, so the new value, 83 is
inserted in the "chain" associated with location 3. Similarly, attempting to insert 51 into location
1 also generates a collision, so 51 is inserted in the chain associated with location 1.
The Insert time is still O(1) with separate chaining (why?). The Delete, Update and Search
operations all require searching a set; this normally means an execution time of O(Size of the set).
If each set contains only one or two items, these are still fast times. If one or more of the sets
become very large, the execution times can also be large. In other words, the execution time
increases with the set size or the number of collisions.
0            0            0            0
1 --> 11     1 --> 11     1 --> 11     1 --> 51,11
2            2            2            2
3            3 --> 23     3 --> 83,23  3 --> 83,23
4            4            4            4
....         ....         ....         ....
Inserting 11, 23, 83, and 51 with Separate Chaining
Figure 9.6.1
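In Ada, the heart of a separate-chaining insert is only a few lines. The sketch below is illustrative only: the table size, the Node record, and the units-digit hash are assumptions made for this example, not part of any package in this book.

```ada
--  Separate chaining: each table entry heads a linked chain.
type Node;
type Node_Ptr is access Node;
type Node is record
   Value : Integer;
   Next  : Node_Ptr;
end record;

Table : array (0 .. 9) of Node_Ptr := (others => null);

procedure Insert (Item : in Integer) is
   Row : constant Integer := Item mod 10;   --  hash on the units digit
begin
   --  The new item goes on the front of the chain, so Insert is O(1)
   --  no matter how long the chain is.
   Table (Row) := new Node'(Value => Item, Next => Table (Row));
end Insert;
```

Searching, deleting, or updating, on the other hand, must walk the chain, which is where the O(Size of the set) times come from.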
- Open Addressing: A second way to handle a collision is to store the record in the first
empty row in the table following the specified one; thus, the record with Social Security number
425-86-1234 is stored in row 34 if that row is empty; if the row is full, the record is stored in the
first empty row following row 34 (if we ever reach row 100, we start over at row 00). This
method is called open addressing. To search for a record, we start at the specified row and
examine one row at a time until we find the record or we find an empty row. Again, if the table is
mostly empty rows, the execution times can be very fast. If the table contains long sequences of
filled rows, the execution times can be very slow.
Figure 9.6.2 shows the steps when each item is inserted in the location determined by the units
digit of the item:
first, 11 is inserted in location 1,
second, 23 is inserted in location 3,
third, attempting to insert 83 in location 3 generates a collision so 83 is inserted in
the first empty location, location 4, and
fourth, attempting to insert 51 in location 1 also generates a collision so 51 is inserted
in the first empty location, location 2.
0 0 0 0
1 11 1 11 1 11 1 11
2 2 2 2 51
3 3 23 3 23 3 23
4 4 4 83 4 83
.... .... .... ....
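The open-addressing insert can be sketched the same way. Again the table size, the Empty sentinel, and the units-digit hash are assumptions for this illustration, and the sketch assumes the table is not yet full (otherwise the probe loop would never terminate).

```ada
Empty : constant Integer := -1;    --  assumed sentinel for an empty row
Table : array (0 .. 9) of Integer := (others => Empty);

procedure Insert (Item : in Integer) is
   Row : Integer := Item mod 10;   --  hash on the units digit
begin
   --  Probe one row at a time, wrapping around at the bottom,
   --  until an empty row is found.
   while Table (Row) /= Empty loop
      Row := (Row + 1) mod 10;
   end loop;
   Table (Row) := Item;
end Insert;
```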
- A Comparison: The reason for restricting the example above to a company with twenty or
thirty employees is to make sure that there are very few collisions; that is, if most of the rows are
empty and the non-empty rows contain only one or two items, then there are few collisions and
the execution times are fast. If, on the other hand, the company had 100 employees then the
probability of collisions becomes fairly large and the execution times would be slow.
If we have an ideal hashing function, one where every array entry is equally likely to occur,
collisions can still occur, but may have negligible effect on the resulting execution times. Assume
for example, an ideal hashing function that hashes a set of N items into an array with M entries, so
that each of the M entries is equally likely to occur. Then the expected number of probes for a
successful search is approximately:

    Separate chaining:    1 + R/2
    Open addressing:      (1/2) ( 1 + 1/(1 - R) )

where R = N/M is the fraction of the table in use. The differences between these two formulas
can be seen in the table:

      R      Separate Chaining    Open Addressing
     0.25         1.13                 1.17
     0.50         1.25                 1.50
     0.75         1.38                 2.50
     0.90         1.45                 5.50
Comparing the two columns shows that separate chaining is always faster than open addressing,
but only at the price of storing all of the additional pointers.
These results assume of course an ideal hash function, but, even in the ideal case, note how
the number of probes used by the open addressing method starts growing when R becomes
greater than 0.5. In real life, hash functions are sometimes far from ideal and the number of
probes is larger than indicated by this table.
Also note that while open addressing limits the number of entries that can be stored to the size
of the array, separate chaining has no limits on the number of items which can be stored, but does
require extra storage for the pointers.
Practical hashing functions must distribute the records fairly uniformly throughout the table --
otherwise there will be too many collisions. If our company, for example, used the first two digits
of the employee's Social Security number as a function, there would be many collisions because
the first three digits of a Social Security number denote the region or office issuing the Social
Security numbers. If most of the employees are from the same area, they will tend to have the
same first three digits in their Social Security numbers, and there will be many collisions.
Many types of hashing functions have been proposed and used. There are whole books
devoted to the subject. We will cover only three of the more popular functions here.
- Truncation: Truncation normally uses either the first few or the last few digits of the key as
a function. The Social Security example above used truncation with the last two digits. If the key
is alphabetic data, we might set up a table with 26 rows (keeping one letter from the key), or
26² = 676 rows (keeping two letters from the key), and so forth. While normally the first or last digits are
used in truncation, actually any combination of digits or characters in the key can be used. It is a
matter of determining the best combination. Truncation is simple, fast, and, when it works, a
good method.
- Division: The division or the remainder method divides the key by some number and uses
the remainder of the division to determine the row in the table. For example, if the table has 29
rows (numbered 00 through 28), we divide the key by 29 and use the remainder of the division to
determine the row. If the key is 100, then we do the division

          3
     29 ) 100
          87
          ---
          13

so the record with key 100 is stored in row 13 (100 = 3 x 29 + 13). The method is fairly fast, but works best when the
divisor is a prime number and the table is at most about 70-80% full.
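As an Ada function, the division method for the 29-row table above is a single line; the parameter type is an assumption for this sketch.

```ada
function Hash (Key : in Natural) return Natural is
begin
   return Key mod 29;   --  remainder is a row number from 0 to 28
end Hash;
--  Hash(100) yields 13, the same row as the long division above.
```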
- Folding: The folding method breaks the key into pieces and adds the pieces. The key
123456 for example might be broken into the three pieces 12, 34, and 56. These pieces are then
added and the result again folded until folding is no longer possible. Thus, starting with the key
123456, we compute:
12
34
+ 56
102
which must be folded again (102 folds into 10 + 2 = 12, and 12 into 1 + 2 = 3) to produce the new sum 3, so the number is stored in row 3.
Folding is often used with alphabetic data because it tends to scatter keys that otherwise
would end up together. People's names, for example, are often spelled the same or nearly the
same so that simple truncation or remaindering would tend to generate many collisions. No
matter which method we use, all the Smiths or all the Johns would end up colliding. With folding
we replace each letter by a number between 1 (for A) and 26 (for Z). We then take the whole
name (first, middle and last) and add all the numbers together. Since every pair of Smiths would
presumably have different first or middle names, this would generate different rows for each pair.
Similarly, all the Johns would presumably have different middle or last names so they also would
end up in different rows. Collisions can of course still occur; John Jacob Smith and Jacob John
Smith could collide, but hopefully such collisions are rare.
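A folding function for names might look like the sketch below. As a simplification, it replaces the repeated folding step with a single division at the end; the table size of 100 rows is an assumption for this illustration.

```ada
Table_Size : constant := 100;

function Fold (Name : in String) return Natural is
   Sum : Natural := 0;
begin
   --  Replace each capital letter by 1 (for A) through 26 (for Z)
   --  and add the values together.
   for I in Name'Range loop
      if Name (I) in 'A' .. 'Z' then
         Sum := Sum + Character'Pos (Name (I)) - Character'Pos ('A') + 1;
      end if;
   end loop;
   return Sum mod Table_Size;   --  reduce the sum to a row number
end Fold;
```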
Using Hashing
Many beginners think that hashing solves all their searching problems. Hashing after all gives
the best possible insertion and search times, O(1). What they don't realize is how hard it is to find
a good hashing function. As we have noted before, data is seldom uniform. Just as some items are
accessed more often than others, there can be data "clusters" that hashing functions find it difficult
to separate into distinct locations in the array.
A few minutes thought, for example, will show that certain surnames, such as Smith or
Brown, account for a relatively large fraction of the population of the United States. Given that
certain first names are also very common, we can expect many duplicate or almost duplicate
names that will give any hashing function trouble. We can of course extend the hashing function
to include the person's Social Security number or birth date to differentiate between similar
names, but our key and hash function are getting long and complicated.
The only way to be sure that a hashing function works with a given set of keys is to actually
try the function on all the keys and study the collisions that occur. This is a lot of work, especially
for large data sets and any change in the distribution of the data values requires a new study.
Probably the best way to describe hashing is to say that when it works, it works very well,
and, when it doesn't work, it tends to fail miserably.
Exercises
1. What location would the item value 37 be stored in if the data is accessed by a hash function
using:
a. division with the divisor 11,
b. truncation using the units digit,
c. truncation using the tens digit, or
d. folding using each digit individually.
Also for each case, how much memory is required with this hash function?
2. Insert the numbers 1, 5, 13, 15, 19, 22 into an array assuming hashing using division by 7 and
separate chaining.
3. Insert the numbers 1, 5, 13, 15, 19, 22 into an array assuming hashing using division by 7 and
open addressing.
4. Insert the numbers 1, 5, 13, 15, 19, 22 into an array assuming hashing using division by 5 and
open addressing.
5. Develop insertion, deletion and search algorithms based upon division and open addressing.
6. Develop insertion, deletion and search algorithms based upon division and chaining.
7. Insert the numbers 351, 5218, 4193, and 298 in an array using folding upon one digit groups.
8. Insert the values Able, Baker, Abet, Beth, and Zeta in an array using:
a. the first two letters as subscript or b. folding using all the letters.
9. Develop an Ada truncation procedure to take the first two letters of a name and turn these
letters into:
a. a pointer or b. the next empty location in an array.
10. The text above treats collision in open addressing by storing the new item in the first empty
location following the desired location. Double hashing uses two hashing functions. The first
hashing function determines the primary location of the item. If the location is available, the item
is inserted there. If the location is already occupied, a second hashing function is used to
determine the next location to try. In particular, the second hashing function determines the
number of locations to skip before trying again. Name an advantage of this method.
Program 9.6.2.1 contains a set package body based upon hashing. This package is the result
of several design decisions and compromises.
First, since the hash function is usually unique to the particular data in the set, the hash set
package should be designed from the beginning to the end for this particular set of data. For one
thing, the hash function should be included as an internal function of the package body. Also
making the package specific to the data set eliminates the overhead introduced by using a generic
package. Program 9.6.2.1 is generic only so it will match the package specification in
Specification 9.1.1 and can be compared to the other package bodies developed earlier for this
same specification.
Second, the package uses a separate chaining implementation, but, rather than implement the
details of the chaining, the package implements the chaining by using a set package based upon a
linked implementation (say one similar to the linked representation of a bag in Algorithm 5.1.2.2.1
in the Set chapter of this book).
Note that the hash function included in the package body is only a skeleton and the user must
determine the details of the function.
Exercises
2. Redesign Program 9.6.2.1 so that the hash function is one of the generic parameters. What
changes does this require in the package specification?
4. Redesign Program 9.6.2.1 so that it hashes by division and the divisor is a generic parameter.
5. Redo Program 9.6.2.1 to make the hashing function a generic parameter. What are the pros
and cons of this approach?
with Ada.Text_IO;
with Linked_List_Set_Package;
package body Set_Package is

   package Set renames Linked_List_Set_Package;

   Items : Array_Type(Subscript);
   ------------------------------------------------------------
   function Hash_Function ( The_Key : in Key_Type ) return Subscript is
   --A function to compute a subscript from a key value.
   --Depends upon the particular application and
   --the details are left to the designer.
   begin
      return Subscript'First;  --placeholder only; supply the real hash here
   end Hash_Function;
   ------------------------------------------------------------
   procedure Insert (The_Key  : in Key_Type;
                     New_Data : in Data_Type) is
      Row : Subscript;
   begin
      Row := Hash_Function( The_Key );
      Set.Insert ( Items(Row), The_Key, New_Data );
   exception
      when Duplicate_Entry => raise Duplicate_Entry_Error;
   end Insert;
   ------------------------------------------------------------
   function Is_In (The_Key : in Key_Type) return Boolean is
      Row : Subscript;
   begin
      Row := Hash_Function( The_Key );
      return Set.Is_In ( Items(Row), The_Key );
   end Is_In;

end Set_Package;
Sometimes the data has some property that makes storing and retrieving it almost trivial. One
obvious case is when the data is a straight sequence of numbers, say the integers from 1 to 100.
The obvious solution is to store the data in an array and use each number as its own location in
the array. This gives O(1) insertion, search, and deletion times. This method can be used anytime
the data keys are a set of integers (or even a set of alphabetic strings) in some limited range. A
set of airline flight numbers, for example, might be three digit numbers. We can then use an array
with subscripts from 1 to 999 to store the flight numbers. Many of the entries in the array might
be empty (no such flight), but the insertion, search, and deletion speeds more than compensate
for the extra storage space.
Since many of the array entries would be empty and it takes a rather large record to describe
each flight, rather than store the complete records in the array, it is better to store only pointers to
the records in the array. This way each empty array entry only wastes the space needed for a
single pointer.
One might object that data very seldom has this nice property and, hence, the method, while
occasionally valuable, is of limited use. That is both true and false. Many kinds of data do not
have this property and the method cannot be used in these cases. Many kinds of data, however,
are under our control to some extent and we can make the data satisfy the conditions above. The
data often consists of records of some kind and we can choose the record key to meet our own
conditions.
Many companies, for example, use Social Security numbers as employee identifiers. Ignoring
for the moment the fact that duplicate Social Security numbers exist and can wreak havoc with
this choice, consider the problems of using a nine digit, rather random number as an employee
identifier. At best, such keys imply the use of some kind of hashing method with its collisions and
other problems. A much simpler scheme is to assign the employees unique identification numbers,
say 1, 2, ... . These identification numbers can be used to store and retrieve the employee records.
If one insists on double checking, once the record is retrieved the employee's name and Social
Security number can be checked.
If one wishes to disguise the fact that the identifiers are sequential integers such as 1, 2, 3, ...,
it is possible to multiply each identifier by some large integer, say N; the user then sees the
identifiers N, 2N, 3N, ..., but, of course, it is easy to divide these user identifiers by N to get the
original identifiers back.
Assigning sequential numbers as keys to records gives a fast and reliable storage and retrieval
method. It is widely used by banks, credit card companies, and anyone else who needs a very fast
storage and retrieval system.
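A direct-address table built on sequential identifiers is almost no code at all. The record type and the bound below are assumptions for this sketch; pointers are stored, as suggested earlier, so an empty entry wastes only the space of one pointer.

```ada
type Employee_Record is record
   Name : String (1 .. 30);
   --  other components as the application requires
end record;
type Employee_Ptr is access Employee_Record;

Max_ID : constant := 1000;       --  assumed upper bound on identifiers
Table  : array (1 .. Max_ID) of Employee_Ptr := (others => null);

procedure Insert (ID : in Positive; Item : in Employee_Ptr) is
begin
   Table (ID) := Item;           --  the key is its own location: O(1)
end Insert;

function Search (ID : in Positive) return Employee_Ptr is
begin
   return Table (ID);            --  null means no such employee: O(1)
end Search;
```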
Sometimes it is possible to design a perfect hash function, a hash function with no collisions.
Assigning the key values sequentially, as discussed above, gives simple perfect hash functions.
When the keys cannot be assigned sequentially then it can be difficult to find a perfect hash
function. There are cases, however, where it has been done. The best examples are connected
with the reserved words in some computer language or other. (The words "if", "then", and "else",
for example, are reserved words in Ada and can only be used as part of an if statement.) The
problem is to determine a hash function which maps each reserved word into a position in an array
with no collisions. Several perfect hash functions have been developed for the reserved words in
the Pascal language.
To determine a perfect hash function, we must be given a complete list of items to be hashed,
because it is impossible to design a perfect hash function unless all the items to be hashed are
known in advance. There are complicated algorithms which can produce perfect hash functions
when the list of items is small. As might be imagined, the computation time for these algorithms
becomes excessive for a list of any real size.
Exercises
1. Which of the following are candidates for assigning unique, sequential identifiers? Justify
your answer.
a. Checks in a checking account.
b. Students in a university.
c. Automobile license numbers.
d. Books in a library.
e. Charge accounts.
We have so far acted as if all the data is stored and retrieved using a single component of the
data, either the data itself or some key item in the data. Many data sets and many problems
require searching for the data on not one component, but on several components. A student
record, for example, might have a numerical identifier, a name, a major, a grade point average,
and other components which don't interest us at the moment. Figure 9.8.1 contains a typical table
of such records with the data arranged alphabetically by name.
While the data set might normally be searched using the student's name, from time to time it
might be necessary to search on the student's major or grade point average. In this case, the name
is called the primary key and the major and grade point average are called secondary keys.
The question is: How can we quickly access data using not just the primary key, but also
accessing data using one of the secondary keys? If we arrange the data so that it produces fast
searches using the primary key, then searches on the secondary keys can be very slow.
Examining Figure 9.8.1 illustrates the difficulty. The table in Figure 9.8.1 can be searched on
name with a binary search. Searching the records to find all the Computer Science majors,
however, we see that the Computer Science majors are fairly randomly scattered all through the
table. To find all the computer science majors, we may have to sequentially search the whole
array. The same thing is of course true of any other column that is not strictly related to the way
the data is stored in the table.
The difficulty is summarized in the theorem:
Theorem: An arbitrary data set can be organized for fast retrieval on only one
of its components unless one uses extra storage space.
That is, if we have only one copy of a table, then it is possible to do a fast search on only one
column of the table.
The wording of this theorem suggests two ways around the difficulty. The first way is to have
some very special data. If all the majors were determined by a student's ID number, then we can
store the data by ID number and execute fast searches on both ID number and on major.
The second way around the difficulty is to use extra storage space to speed up the search on
secondary keys. We might store every record twice, with, for example, one copy of the data set
organized by name and the second copy of the data set organized by major. This undoubtedly will
give fast searches, but it is very wasteful of storage space. (It also introduces some consistency
problems. How can we insure that both copies always contain exactly the same data -- in spite of
any hardware or software failures?)
We can improve the situation considerably by storing not two copies of the whole record, but
one copy of each record and then storing two sets of pointers (one organized by name and one
organized by major). This way we only have to store one copy of each record. A table of
pointers, organized by a key component, for accessing records is called an index or an index table.
Figure 9.8.2, for example, shows an index or index table for the Major column of the Student
Table in Figure 9.8.1; that is, for each value of major in the original table, there is a corresponding
value in the index table with a pointer to the entry in original table.
Major Name
Art Amy
CS Beth
CS Chuck
Hist Bill
Hist Cathy
Math Abe
Index to The Student Table of Figure 9.8.1 using the Major Column
Figure 9.8.2
Since the Student Table in Figure 9.8.1 is organized by student name, each pointer in this table is
a student name. To locate, for example, all of the CS majors, one searches the first column of this
index for CS and finds that the names of the CS majors are Beth and Chuck. These names can
then be used to locate the complete record for each of these students.
These index tables are simply two column tables and the rows can be organized any way
desired to speed up the search. Thus, the Major Index Table above is sorted by major to allow
fast access by major, but the rows in the table could have been organized for a sequential search,
or a tree search (make each row in the table an entry in a balanced binary search tree), or even
hashed by some means or other. Designing an index table for fast searching is the same as
designing any other data set for fast searching and all of the searching options presented earlier in this
chapter should be considered.
Before designing the index tables for fast search, it helps to organize the original table for fast
access from an index table. Assume, for example, that the ID numbers in Figure 9.8.1 are
assigned sequentially so that we can store the records in an array using the ID numbers as the
location of the record. Figure 9.8.3 contains the table of Figure 9.8.1 stored using the ID
numbers as the record location in the table. The pointers used in the index tables can now be
these ID numbers.
Name   ID          Major  ID
Abe     7          Art     2
Amy     2          CS      3
Beth    3          CS      5
Bill    1          Hist    1
Cathy   4          Hist    4
Chuck   5          Math    7
Figure 9.8.3 also has two indices, one for the name and one for the major. After each name or
major in an index is the ID number of the corresponding row in the table of student records.
To summarize, Figure 9.8.3 assumes:
- the student ID numbers are assigned sequentially,
- the records are stored in an array using the ID as the record location, and
- the major and grade point average indices use the student ID as a pointer into the array.
Even though the ID numbers are assigned sequentially, it is possible that over a period of
time, some of them will go out of use. Row 6 in the table is an example of this. While empty
rows are in some sense a waste of space, this wasted space can be a small price to pay for the
additional speed.
In the original version of the Student Table, the student name was in some sense the primary
key because the table was organized on this column whereas, in some sense, the new version of
the table makes the ID the primary key for the table. Given the extra index tables, however, this
is more of a language difference than a practical one.
To search the new version of the Student Table on an ID number, we can go directly to the
array. To search on a secondary key, we use the index to determine the entries in the array; for
example, to locate the record for a given student name, we first search the index for the name and
then use the corresponding ID number in the index to determine the row containing the record of
the given student. In this case, the total search time is determined by the time to search the index
and then the time to find the corresponding record in the big table.
The next question then is to determine the time needed to search the index. This obviously
depends upon the data structure used to store the index. If the rows are stored in random order
and sequential search are used, then the search time is O( Size ). If the index is stored sorted on
name (as shown in Figure 9.8.3) then binary search can be used to find a given student's name and
the index search time is O( log2Size ). If the index is stored in a balanced binary search tree using
the student names as key values, then the search time is again O( log2Size ). If it is possible to
hash on the student name, then the index can be stored in a hashed array and, provided a good
hash function is used, the access time is O( 1 ). In any case, the index is simply another table and
the index table storage scheme is designed and analyzed the same way as all the other data sets so
far.
Assume the index is arranged alphabetically on name and that we use a binary search to locate
a specified name in the name index; the index search time is then O( log2Size ). Once the correct
ID number is known, the time to locate the corresponding student record in the big table is O(1).
The total search time for a student name is then O( log2Size ). Of course when a single value can
be repeated many times in an index, such as occurs in the major index where a single value like CS
can occur several times, then the search time depends upon how often the entry occurs in the
index. If, for example, there are 200 CS majors, then it takes time O( log2Size ) to locate the first
one and an additional O( 200 ) to find the rest of the CS majors. A more detailed analysis is given
shortly.
While the indices in Figure 9.8.3 use the ID number as a pointer, there is some question
whether the pointers in the index should be access type variables rather than the student ID
number. Both versions execute at about the same speed, but the student ID numbers are slightly
more general in that the scheme for storing records can be changed without having to alter the
data in the indices, hence the choice made in Figure 9.8.3.
The next question is: How do we organize the index tables? An index is searched on only one
component and must contain one or more pointers for each value of the component. This
presents no difficulty in the name index in Figure 9.8.3 because there is a unique name for each
student. On the other hand a single value of student major may contain anywhere from none to
several thousand pointers, depending upon the major and the school. If there were only one
student for each major then we might consider storing the index as a table with two columns: the
first column contains the major and the second column a pointer to a record with that major.
When there is more than one student record for each major, the situation is more interesting.
One solution is to repeat each major once for each student with this major, as is done in Figure
9.8.3. When a single major may have hundreds of students, a better data structure is to store the
index as a set of pipes; that is, use one pipe for each possible major. This reduces the number of
rows in the index table to the number of distinct majors; that is, one row and one pipe for each
major. If, for example, a school has 5000 students and 50 majors, then this reduces the index
table from 5000 rows to 50 rows which greatly reduces the search time. Each row of the index
table then has a pipe of pointers with each pointer pointing to the record of a student with that
major. Figure 9.8.4 contains this version of the index for Figure 9.8.3.
Major ID
Art 2
CS 3, 5
Hist 1, 4
Math 7
Major Index for the Student Table in Figure 9.8.3 based upon using Pipes
Figure 9.8.4
To process, for example, all the Computer Science majors, assuming a pipe based index on
majors, we can use an algorithm like the following:
Initialize
Find Computer Science pipe in Major Index
Open Computer Science pipe
Repeat for each entry in pipe (while not end of Computer Science pipe)
Get_Next ( ID_Number )
Process Record in Array( ID_Number )
end repeat
This representation of the index is simple and straightforward. The execution
time is essentially the time to locate "Computer Science" in the index table plus the time to
process each of the individual Computer Science major records. The total execution time of this
method is then:
O( log2 Number of Distinct Majors ) + O( Number of Computer Science Majors ).
Hint. For efficiency, the pipe package used to implement this version of the index table should
be compiled with the Ada inline pragma.
Another way to implement an index is to link the records which have the same major by using
spaces in the records themselves. In this version each student record contains space for a pointer
to the next student with the same major. See Figure 9.8.5 where the entry in the Next_Major
column is the row number of the next entry with this same major. As usual, a zero in this entry
indicates the last student with this major.
The associated index then contains, for each major, a pointer to the first student with that
major. This student record then points to the next student with the same major, and so forth. In
other words, the pipe of pointers is stored in the records themselves. The index is then a table
with two columns: the first column contains the value of a major and the second column contains
a pointer to the first student with that major.
An algorithm to process all the Computer Science majors, given this representation of the
index, is:
Initialize
Find Computer Science in Index
ID_Number <-- Pointer associated with Computer Science in Index
Repeat for each student with this major (while ID_Number /= null)
Process Record in Array( ID_Number )
ID_Number <-- Array( ID_Number) . Next_Major
end repeat
where Array( ID_Number ) . Next_Major contains a pointer to the next student record with the
same major.
The execution time of this algorithm is the same as the last representation:
O( log2 Number of Distinct Majors ) + O( Number of Computer Science Majors ).
The advantage of this index representation over the previous representation is that this one
uses the minimum amount of additional space and executes slightly faster (has a slightly smaller
value of the constant in the big O function).
There is one advantage to the index representation based upon pipes. This advantage occurs
when we want to search simultaneously on two or more criteria. Assume, for example, we want
to find all the Computer Science majors who live in Founder's Hall and we also have an index
specifying where each student lives.
We could search the set of Computer Science majors and check each student's record to
determine if their local residence is Founder's Hall. Presumably, however, only a small fraction of
the Computer Science majors live in Founder's Hall and, conversely, only a small fraction of
Founder's Hall residents are Computer Science majors, so this method is slow and inefficient. If
the two indices are stored using the set of pipes method, it suffices to compare the two pipes of
pointers and keep only the pointers common to both pipes. This is usually faster than storing
pointers in the student records and having to examine the local residence of every Computer
Science major.
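A sketch of this pipe intersection in Ada, assuming each pipe of pointers is kept in increasing row order (the names and sample rows are illustrative):

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Intersect_Pipes is
   type Row_List is array (Positive range <>) of Positive;

   --  Illustrative pipes of row numbers, each in increasing order
   CS_Majors : constant Row_List := (2, 5, 9, 14);
   Founders  : constant Row_List := (1, 5, 8, 14, 20);

   I : Positive := CS_Majors'First;
   J : Positive := Founders'First;
begin
   while I <= CS_Majors'Last and then J <= Founders'Last loop
      if CS_Majors (I) = Founders (J) then
         Put_Line ("Common row:" & Integer'Image (CS_Majors (I)));
         I := I + 1;
         J := J + 1;
      elsif CS_Majors (I) < Founders (J) then
         I := I + 1;
      else
         J := J + 1;
      end if;
   end loop;
end Intersect_Pipes;
```

Since each pipe is examined once, the intersection takes time proportional to the sum of the two pipe lengths.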
Using indices means that inserting a new student record also requires updating each and every
index. Regardless of the representation, this means, for example, first finding the major in the
index and then updating the pointers to include the new record. The index thus increases the
insertion time for a new record by the search time needed to find the particular major in the index
and the time to update the pointers. For practical purposes, the search time in the index deter-
mines the insertion time. The index search time may not matter when we are inserting only a few
records, but overall the insertion time is increased by the number of records times the search time
of the index. This can be a very large amount of time if the index search is slow.
Building an index for the grade point average (GPA) presents a slightly different problem.
Normally we want either all the students with a GPA greater than some value or all the students
with a GPA below some amount. This implies that the index must make it easy to find all the
records with indices above or below any given value. This implies the index should somehow or
other be sorted on the value of the GPA.
Depending upon how the GPA is calculated, there might be thousands of values in the index.
(If the GPA is calculated to five significant digits, there can be tens of thousands of values.)
Realistically this is too many values for most indices. We can create an index with values
0.00, 0.01, 0.02, ... , 3.98, 3.99, 4.00
and use the first three digits of the GPA to determine the entry in the index table. (Note that there
is nothing magic about using the first three significant digits as the index value. The number of
digits must depend upon the data set and how the data set is to be searched.) This method gives
an index with 401 values. Furthermore, for fast insertion and retrieval, the index can be stored in
an array with indices from 0 to 400.
We can implement the index using any of the methods described above. The space and time
required remain essentially the same. The only difference is the way we determine the index
value.
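As a sketch, the index computation for this scheme might be coded as follows in Ada (the names are illustrative):

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure GPA_Index_Demo is
   subtype GPA_Index is Integer range 0 .. 400;

   --  The first three significant digits of the GPA select a slot
   function Index_Of (GPA : Float) return GPA_Index is
   begin
      return GPA_Index (Float'Floor (GPA * 100.0));
   end Index_Of;
begin
   --  Index_Of (3.14159) = 314; Index_Of (4.0) = 400
   Put_Line (Integer'Image (Index_Of (3.14159)));
end GPA_Index_Demo;
```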
The whole analysis and presentation of this section is based upon using sequential integers as
student ID numbers. If sequential ID numbers are impossible for some reason, then the whole
analysis becomes more complicated. The only way to keep the O(1) access to the table is to hash
the table, presumably on the student ID number. This should be done if at all possible. If no
reasonable hash function is available, then the best one can do is to access the table in O( log₂ Size )
using either a bottom-up binary search tree or a sorted array with binary search. Either one,
however, implies a much slower execution time. Regardless of the method of storing the table,
the indices are essentially two column tables and can be stored any way one chooses.
Exercises
1. Develop algorithms to insert a new record assuming the secondary index is implemented
using:
a. a set of sets,
b. a set of sets extended to include a process records operation, and
c. a student record which stores the pointer to the next student with the same major.
3. Compare the pros and cons of implementing a secondary index using a set of sets versus
storing the pointer in the record.
4. Design a data structure for a set of customer records where searches are made on the
customer's name, location, and phone number. What questions would you ask before beginning
this design?
5. If the table is stored as a linked structure, how can the items be accessed on (a) a primary and
(b) a secondary index? What difference does it make if the table must be stored in a file for later
use?
Search Exercises
5. If one student in twenty belongs to the glee club, and each block contains four student names,
what is the probability that:
a. a block contains no members of the glee club, or
b. a block contains at least one member of the glee club.
6. What data structure should be used to store and search (assuming it fits in memory):
a. spelling checker (20,000 words),
b. telephone directory (100,000 entries),
c. employee list (1000 social security numbers),
d. chore list (4 chores),
e. for smallest (3rd largest, nth smallest) item in a list,
f. for item which has been in the list the longest.
7. Design a data type to keep track of grades for a course. The data type operations should
include:
8. Design a data type to keep track of customer's charge account for a department store. The
operations should include:
a. adding and dropping customers (2% of operations),
b. adding items bought (95% of operations),
c. deducting payments from amount due (1% of operations),
d. printing bills (2% of operations).
Describe your ADT's, your data structures, and the necessary algorithms. You can use any data
types developed in class as building blocks, but be sure to specify the type and the operations you
are using. You will be graded on both the correctness and the quality of your solution.
9. A company pays its employees on a piecework basis. Each time an employee completes an
item, the employees account is credited with the item. The company makes up to 100 different
kinds of items and the kinds of items change regularly. If an item is returned by a customer, the
credit is deducted from the employee's account. At the end of the month, each employee is paid
on the basis of the number and kinds of items completed. Design a system to keep track of the
employees' earnings. The operations should include:
a. adding and dropping employees (2% of operations),
b. adding items made (95% of operations),
c. deducting items returned (1% of operations),
d. printing paychecks (2% of operations).
SORTING
Sorting is one of the most common computer operations. Knuth once estimated that sorting
has used up to 25% of all the CPU cycles that have ever existed. Because of its importance, a large
amount of time and effort has gone into developing sorting methods.
This chapter presents some of the more common sorting methods with the intent of showing
some of the different approaches to sorting and their associated data structures. Sorting also
illustrates the effect of different implementation data structures on the execution time and the
amount of memory required to achieve a given goal.
The sorting methods presented here illustrate the richness of the possibilities in developing
sorting methods, but it is important to remember that almost any problem has the same richness of
methods; it is just that sorting is so important that some of the possibilities have actually been
developed.
Many of the simpler sort methods work by swapping items in the list. If each swap moves the
list one step closer to being sorted, then sooner or later, after enough swaps, the list will be
sorted. The formal term for swapping two items in a list is an exchange, and methods based upon
exchanges are called, for obvious reasons, exchange methods. While exchange is the correct,
formal term, we will normally use the simpler term swap.
This section presents some of the more common exchange sort methods.
10.1.1. Selection Sort
The simplest and most straightforward method to solve a problem is often brute force. For
example, to sort a list of items, we search the list for the smallest item and put it into the first
position in the list; then search the remaining items to find the second smallest item and put it into
the second position in the list, and so forth.
Starting, for example, with the unsorted list:
7, 3, 5, 1, 4, 2,
it is clear that the smallest element is the 1, so swapping this with the first item in the list gives:
1, 3, 5, 7, 4, 2.
The next smallest element in the list is the 2, so swapping this with the second item gives:
1, 2, 5, 7, 4, 3.
Swapping the next smallest item, the 3, with the third item in the list gives:
1, 2, 3, 7, 4, 5.
Swapping the next smallest item, the 4, with the fourth item in the list gives:
1, 2, 3, 4, 7, 5.
And finally swapping the next smallest item, the 5, with the fifth item in the list gives:
1, 2, 3, 4, 5, 7.
Note that after each swap the sorted part of the list is one item longer, and, in general, after N
swaps the first N items in the list are in order. Hence, if the list has Size items, after Size swaps,
the list is in order. (Actually only Size-1 swaps are necessary to produce a sorted list. Why?)
The process of determining the next smallest item and performing the swap is called a pass.
Selection sort requires exactly Size-1 passes.
An algorithm to sort a list stored in an array is:
Selection Sort
For I = 1 to (Size - 1)
Find Location of minimum of Array(I), Array(I+1), ..., Array(Size)
Swap Array(I) and Array(Location)
end for
end selection sort
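Coded in Ada for an integer array, the algorithm might look like this sketch (the type and procedure names are illustrative):

```ada
procedure Selection_Sort_Demo is
   type Int_Array is array (Positive range <>) of Integer;

   procedure Selection_Sort (A : in out Int_Array) is
      Location : Positive;
      Temp     : Integer;
   begin
      for I in A'First .. A'Last - 1 loop
         Location := I;                    --  find location of the minimum
         for J in I + 1 .. A'Last loop     --  of A (I), ..., A (A'Last)
            if A (J) < A (Location) then
               Location := J;
            end if;
         end loop;
         Temp         := A (I);            --  swap A (I) and A (Location)
         A (I)        := A (Location);
         A (Location) := Temp;
      end loop;
   end Selection_Sort;

   List : Int_Array (1 .. 6) := (7, 3, 5, 1, 4, 2);
begin
   Selection_Sort (List);
   --  List is now (1, 2, 3, 4, 5, 7)
end Selection_Sort_Demo;
```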
An algorithm to sort a list stored in a linked structure is more complicated. Assuming that
each item in the linked structure contains the item and a pointer to the next item, we can set up a
routine to find the smallest item in the remainder of the list. Swapping this minimum item with
the current item, however, introduces a choice. Do we leave the links alone and swap the items in
the nodes or do we leave the items alone and perform the swap by updating the pointers? The
following algorithm swaps the items; an algorithm which updates the pointers is left for an
exercise.
Selection Sort
Initialize
Pointer <-- Head
The method works regardless of the storage method used and, since we already know how to
find the smallest item in a sublist, is easy to develop. Its disadvantage is that like all brute force
methods, it spends a lot of time and effort huffing and puffing when a little finesse would eliminate
much of the work.
To determine the execution time of the selection sort, we start by determining the number of
comparisons.
- When I is 1, we must compare Size-1 items to find the minimum.
- When I is 2, we must compare Size-2 items to find the minimum.
- When I is 3, we must compare Size-3 items to find the minimum.
and so forth.
The total number of comparisons is then:
(Size-1) + (Size-2) + (Size-3) + ... + 1.
From Appendix A, this sum is
(Size-1) * Size / 2 = (Size² - Size) / 2.
In Big Oh terms, the total number of comparisons is then O( Size² ).
Since the total execution time is equal to the number of comparisons plus the number of
swaps (there are Size-1 swaps), the total execution time is:
O( Size² ) + O( Size ) = O( Size² ).
Since Selection Sort treats every list exactly the same, its execution time is independent of the
original list and is always O( Size² ). For later reference, however, it is convenient to list three
execution times:
Minimum: O( Size² )
Average: O( Size² )
Maximum: O( Size² )
In this case, all three execution times are of course all the same, but this is not true for some other
methods covered later in this chapter.
One advantage of Selection Sort is that it sorts the list in place, so no extra space is needed.
This is a very poor method, but it does show some of the advantages and disadvantages of
brute force methods.
10.1.2. Insertion Sort
Sometimes the simplest solution to a problem is to sidestep it. For example, instead of sorting
a list, we can pretend our list is already sorted and insert each new item into the correct position
in the list. Since an empty list is by definition already sorted, if we insert each new item into the
correct position, the list is always sorted.
The question is exactly how to perform the insertion. The simplest method is to move items
down one position in the list. To insert the value 6 into the sorted list:
1, 2, 5, 7, 9, 10
we proceed to make room in the list by shifting larger items down one position as follows:
1, 2, 5, 7, 9, *, 10
1, 2, 5, 7, *, 9, 10
1, 2, 5, *, 7, 9, 10
where the * indicates an empty slot in the list and each step moves the empty slot one position
further ahead in the list. Once the proper slot is freed up, the new item can be inserted: in this
case, the 6 can be inserted between the 5 and the 7.
An algorithm to insert an item into the correct location in a sorted list stored in an array is:
(The "and then" clause in the repeat while statement indicates that the second comparison,
Array(I) > Item, is only made if the first comparison, I >= 1, is true. This is necessary because
the normal "and" operator evaluates both comparisons before deciding whether or not to continue.
In this case, the value of I can decrease to zero on the last time through the loop and Array( 0 ) is
undefined, so the program would abort.)
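A sketch of this insertion in Ada shows the short-circuit "and then" at work; the declarations are illustrative, and slot 0 exists in the sketch only so the array bounds are concrete — the "and then" keeps A (0) from ever being read:

```ada
procedure In_Order_Insert_Demo is
   A : array (0 .. 10) of Integer := (0, 1, 2, 5, 7, 9, 10, others => 0);
   Size : Natural := 6;             --  the sorted items occupy A (1 .. 6)
   Item : constant Integer := 6;    --  the new item to be inserted
   I    : Natural;
begin
   I := Size;
   while I >= 1 and then A (I) > Item loop
      A (I + 1) := A (I);           --  shift the larger item down one slot
      I := I - 1;
   end loop;
   A (I + 1) := Item;               --  drop the new item into the open slot
   Size := Size + 1;
   --  A (1 .. 7) is now (1, 2, 5, 6, 7, 9, 10)
end In_Order_Insert_Demo;
```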
This algorithm must of course be repeated for each new item inserted in the list. The execu-
tion times for inserting a single item into the list are:
Minimum: O( 1 )
Average: O( Size/2 ) = O( Size )
Maximum: O( Size )
(What data can produce each of these insertion times?)
One advantage of this version of In_Order Insert is that it sorts in place, it needs no extra
space.
A faster version of the In_Order Insert is possible if we assume Array(0) is available to
temporarily store the value of the new data. The algorithm is:
Placing the value of New_Data in Array(0) insures that, in the worst case, the routine will exit
the loop when I = 0. Thus there is no need to always make the check I >= 1.
This version has the same Big Oh execution time as the first version, but the actual execution
time is less because it has to make one less comparison each time through the loop, all at the cost
of one extra storage location.
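The sentinel version, as a sketch in Ada (the declarations are illustrative):

```ada
procedure Sentinel_Insert_Demo is
   A : array (0 .. 10) of Integer := (0, 1, 2, 5, 7, 9, 10, others => 0);
   Size     : Natural := 6;          --  the sorted items occupy A (1 .. 6)
   New_Data : constant Integer := 6;
   I        : Natural;
begin
   A (0) := New_Data;                --  sentinel: the loop must stop by I = 0
   I := Size;
   while A (I) > New_Data loop       --  only one comparison per iteration
      A (I + 1) := A (I);
      I := I - 1;
   end loop;
   A (I + 1) := New_Data;
   Size := Size + 1;
   --  A (1 .. 7) is now (1, 2, 5, 6, 7, 9, 10)
end Sentinel_Insert_Demo;
```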
The technique can also be used to sort an unsorted list. The items in the unsorted list are
inserted, one at a time, into a new, sorted list. With care, this new list can occupy the same space
as the old list. The basic algorithm is:
Insertion Sort
Repeat for each item in list (I = 2 to Size)
Insert Array(I) in the correct position in
the sublist consisting of Array(1), Array(2), ..., Array(I-1)
end repeat
end insertion sort
where one pass now consists of inserting one more item into the part of the list already sorted.
Note that after the Ith time through the repeat loop the first I items in the list are sorted and the
remaining Size-I items are still unsorted. Hence, after (Size - 1) passes, the list is sorted.
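A complete Insertion Sort in Ada might look like this sketch, using the Array(0) sentinel described above (the names and data are illustrative):

```ada
procedure Insertion_Sort_Demo is
   type Int_Array is array (Natural range <>) of Integer;
   A : Int_Array (0 .. 6) := (0, 7, 3, 5, 1, 4, 2);  --  slot 0 is the sentinel
   Item : Integer;
   J    : Natural;
begin
   for I in 2 .. A'Last loop
      Item  := A (I);
      A (0) := Item;                 --  sentinel stops the inner loop
      J := I - 1;
      while A (J) > Item loop
         A (J + 1) := A (J);         --  shift larger items down one slot
         J := J - 1;
      end loop;
      A (J + 1) := Item;
   end loop;
   --  A (1 .. 6) is now (1, 2, 3, 4, 5, 7)
end Insertion_Sort_Demo;
```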
Since the insertion must be executed once for each item in the list, the overall execution time
of a complete insertion sort is:
Minimum: O( Size ) (for a list which is already sorted)
Average: O( Size² )
Maximum: O( Size² )
This is one sort method which is simpler to follow when the list is stored in a linked structure.
Assuming the data is stored in a node along with a pointer to the next node, an algorithm to do an
inorder insertion is:
Although this algorithm has the same Big Oh execution time as the corresponding algorithm
for a list stored in an array, it does not move any items, so its actual execution time is faster.
Just as the extra test in the loop was omitted in the array version of this algorithm by storing
the value of New_Data in Array(0), the extra test in the loop can be eliminated by using the linked
data structure of Algorithm 9.2.1.1 to store the value of New_Data in a dummy node at the end
of the linked list.
This is a poor method for sorting a list unless the list is almost sorted already. Its major
advantage is not so much sorting a list as inserting a new item into an already sorted list.
10.1.3. Bubble Sort
A more common approach to developing a method is to find some basic "trick" which solves
the problem. Generally this trick is some minor variation on the brute force method. Consider
the brute force sorting method, Selection Sort, which sorts the list by essentially finding the small-
est item in the list and moving this item to the front of the list. We can accomplish the same result
by starting at the rear end of the list and comparing each item with its predecessor; if the items are
in the wrong order, then we swap the two items. At the end of one pass through the list, the
smallest item is at the front of the list. If we continue this process over and over again, the list is
eventually sorted.
An example, one pass over our earlier list, will illustrate the concept:
7, 3, 5, 1, *4, *2
7, 3, *5, *1, 2, 4
7, *3, *1, 5, 2, 4
*7, *1, 3, 5, 2, 4
1, 7, 3, 5, 2, 4
where the * before an item denotes one of the two items to be swapped at this stage.
Note that after the first pass, the first item is in the correct position; after the second pass, the
first two items are in order; after each succeeding pass, the first Pass items are in their correct
positions. Also note that if the first item is the largest item in the list, it only moves down one
position in the list during any one pass.
More interesting, however, is that the new method can stop any time the list is sorted. The
Selection Sort must go through the whole sort process every time -- even if the list is sorted to
begin with. Our new method can keep track of the swaps and, any time we make a complete pass
through the list without swapping any items, the list must be sorted, so we can stop at this point.
If the list is sorted to begin with, our method will only make one pass and use O( Size ) time
whereas Selection Sort requires O( Size² ) in every case. Note that in some cases, our new
method is as slow as Selection Sort, but, for lists which are more or less sorted to begin with, our
new method can be much faster.
This new method is, of course, the Bubble Sort, so called because the smaller, "lighter" items
bubble to the top of the list. One Bubble Sort algorithm is:
Bubble Sort
Initialize:
Done <-- false
Pass <-- 0
(Why does the inner repeat loop stop executing when I becomes equal to the value of Pass?)
There are many variations on this basic concept and many different bubble sort algorithms, but
all do basically the same thing and require the same amount of time.
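One way to complete the algorithm in Ada is the following sketch (the names and data are illustrative); the inner loop runs from the rear of the list down to Pass + 1, so it stops when I would reach Pass:

```ada
procedure Bubble_Sort_Demo is
   type Int_Array is array (Positive range <>) of Integer;
   List : Int_Array (1 .. 6) := (7, 3, 5, 1, 4, 2);
   Done : Boolean := False;
   Pass : Natural := 0;
   Temp : Integer;
begin
   while not Done loop
      Done := True;
      Pass := Pass + 1;
      --  Work from the rear of the list toward the front
      for I in reverse Pass + 1 .. List'Last loop
         if List (I) < List (I - 1) then
            Temp := List (I);            --  swap the out-of-order pair
            List (I)     := List (I - 1);
            List (I - 1) := Temp;
            Done := False;               --  a swap: not sorted yet
         end if;
      end loop;
   end loop;
   --  List is now (1, 2, 3, 4, 5, 7)
end Bubble_Sort_Demo;
```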
As we noted above, the Bubble Sort can use as little as O( Size ) time, but in the general case,
the inner loop must be repeated for each possible value of Pass. The total effort in this case is:
- When Pass is 1, we must compare Size-1 pairs of items.
- When Pass is 2, we must compare Size-2 pairs of items.
- When Pass is 3, we must compare Size-3 pairs of items.
and so forth.
The total number of comparisons is then:
(Size-1) + (Size-2) + (Size-3) + ... + 1 = Size * (Size-1) / 2
The total number of swaps required depends upon the original list and can vary from zero for a
list already in order to Size * (Size-1) / 2 swaps for a list in reverse order.
Since the total execution time is the sum of the time to do the comparisons plus the time to
execute the swaps, the execution times, in big O notation, are:
Minimum: O( Size )
Average: O( Size² )
Maximum: O( Size² )
The bubble sort above must be completely reversed if it is to be used to sort a list stored in a
linked structure. The difficulty is that the above algorithm sorts the list by starting at the end of
the list and working forward. This cannot be done on a linked structure unless there are
backward links; that is, pointers from each node to its predecessor. If the linked structure
consists of nodes containing only an item and a pointer to the next node, we must start at the
beginning and work forward through the list. This means that each swap moves the
bigger item one position towards the rear of the list. After one pass, the largest item will be at the
end of the list. Each succeeding pass must stop one step further from the end of the list. An
algorithm is:
Repeat for each pass (while not Done and Head /= Last)
Done <-- true
Pointer <-- Head
The Big Oh execution time of the Bubble Sort for a list stored in a linked structure is exactly
the same as for one stored in an array.
One advantage of Bubble Sort is that it sorts in place so no additional space is needed.
This is a poor method for sorting an arbitrary list. The difficulty is the large number of swaps
that are necessary. Some versions of Bubble Sort eliminate some of the swaps by sifting items
down in the list and this can be helpful when the list is in reverse order or close to reverse order.
The result, however, is an algorithm almost identical to the Insertion Sort, so one might as well
use Insertion Sort to begin with. One major advantage of Bubble Sort is that it is easy to under-
stand and follow, and, for this reason, it is used in many elementary texts.
10.1.4. Quicksort
Once we start improving a basic idea, there are often variations which offer even more
improvement. The weakest part of Bubble Sort of items in an array, for example, is that each
pass moves a large item back only one position in the list. If the largest item starts in the first
position in the list then it takes Size-1 passes to move the item to the rear of the list. A faster
method must have some way of "jumping" an item to some place close to its final position.
The basic need is obvious, but it took some effort to develop methods which do this jumping.
One method of doing this is to break the list into two parts such that every item in the first part is
less than every item in the second part. To break the list into two such parts, we can move any
item which is larger than or equal to the median value of the list to the second half of the list and
move any item smaller than the median to the first half of the list.
To illustrate the process, consider the list:
7, 3, 5, 1, 8, 6, 10, 4, 9, 2.
The median of this list is 5.5 which is not in the list, but this does not matter. Now starting at the
beginning of the list we mark (with a *) the first item greater than or equal to the median:
*7, 3, 5, 1, 8, 6, 10, 4, 9, 2
and working backwards from the end of the list we mark the first item less than the median:
*7, 3, 5, 1, 8, 6, 10, 4, 9, *2
These two marked items are in the wrong half of the list, so swap them to produce:
2, 3, 5, 1, 8, 6, 10, 4, 9, 7
This single swap has moved two items into the correct half of the list.
We repeat this process again. Marking the two out of place items gives:
2, 3, 5, 1, *8, 6, 10, *4, 9, 7
and swapping these two items gives:
2, 3, 5, 1, 4, 6, 10, 8, 9, 7
Note that this process puts all of the items less than the median at the beginning of the list and
all of the items greater than or equal to the median at the end of the list. When the process is
finished, the list can be subdivided into two sublists, one with all of the items less than the median
and one with all of the items greater than or equal to the median. We then apply the routine
recursively to each half; that is,
Qsort_Ideal( Array )
If the Array has more than one item then
Partition list into 2 parts
(every item in first part is < median
every item in second part is >= median)
Qsort_Ideal( First part of Array )
Qsort_Ideal( Second part of Array )
end Qsort_Ideal
Since large "jumps" are possible when breaking the list into two subparts, the final algorithm can
be fast.
The only difficulty is that finding the median of the list is almost as much work as sorting the
list, so we must find some way to replace the median by something which accomplishes the same
goal, splitting the list into two parts such that every item in the first part is less than every item in
the second part. It turns out, in theory, that it really doesn't matter much what we use to replace
the median. Almost any item in the list can be used in place of the median and it has very little
effect on the average sort time (it can affect the maximum sort time). This is a surprising result
and its proof, which is based upon a statistical analysis of all possible lists, is too complex for this
text, but the result is important. The replacement for the median is called the pivot and the
programmer is free to use almost any item in the list as the pivot. Two common choices are the
first item in the list and the middle item in the list. In theory, it doesn't matter which one we use;
in practice, the middle item seems to make a better pivot for reasons discussed later.
A more complete algorithm including the steps for breaking the list into the two sublists is:
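As a stand-in sketch, a conventional Ada version along these lines, using the middle item as the pivot (the names are illustrative; the two-direction scan-and-swap follows the usual form, and items equal to the pivot may end up in either part, which does not affect correctness):

```ada
procedure Quicksort_Demo is
   type Int_Array is array (Positive range <>) of Integer;

   procedure Qsort (A : in out Int_Array) is
      Pivot : Integer;
      I     : Integer := A'First;
      J     : Integer := A'Last;
      Temp  : Integer;
   begin
      if A'Length <= 1 then
         return;
      end if;
      Pivot := A ((A'First + A'Last) / 2);     --  middle item as the pivot
      while I <= J loop
         while A (I) < Pivot loop              --  scan forward
            I := I + 1;
         end loop;
         while A (J) > Pivot loop              --  scan backward
            J := J - 1;
         end loop;
         if I <= J then
            Temp := A (I);  A (I) := A (J);  A (J) := Temp;
            I := I + 1;
            J := J - 1;
         end if;
      end loop;
      Qsort (A (A'First .. J));                --  sort each part recursively
      Qsort (A (I .. A'Last));
   end Qsort;

   List : Int_Array (1 .. 10) := (7, 3, 5, 1, 8, 6, 10, 4, 9, 2);
begin
   Qsort (List);
   --  List is now 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
end Quicksort_Demo;
```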
While in theory any item in the list can be used for the pivot, it helps to choose a pivot near
the median of the items in the list --- particularly if the list is already sorted or close to it.
Consider using the first or last item in the list as the pivot when the list is almost sorted. In
most cases, this will partition the original list into one sublist of length 1 and one sublist of
length Size-1. This can end up requiring the processing of Size sublists, each one item smaller
than the last, and the total execution time becomes O( Size² ).
On the other hand, if the middle item of the list is used as the pivot when the list is almost
sorted, then the list is partitioned into two equally sized sublists and the execution time becomes
O( Size * log₂ Size ). The time savings can be significant.
In general, the closer the pivot is to the median of the list, the faster the sort. For this reason,
a common approximation to the median of the list is to take the median of three items, the first,
middle, and last items in the list. This choice of pivot insures that the maximum execution time is
O( Size * log₂ Size ); to be more precise, this choice insures that the worst case takes only about
40% longer than the average case.
The above version of Quicksort is not very suitable for lists stored in a linked structure where
each node contains one pointer pointing to the next node in the linked structure. The reason is
that the algorithm above must move through the list in both directions. There is, however, a
slower variation which partitions the list by moving through the list in only one direction. The
idea is to go through the list from beginning to end and at all times keep the part processed so far
partitioned into two sublists. The first sublist consists of items less than the pivot and the second
sublist contains items equal to or greater than the pivot. Picture the list as a rectangle divided
into three blocks: the first block contains the list items less than the pivot, the second block
contains the list items greater than or equal to the pivot, and the third block contains the list
items yet to be processed. Now as each additional
item in the list is compared to the pivot, one of two things happens. If the next item in the list is
greater than or equal to the pivot, the upper list is extended by one item. On the other hand, if
the next item in the list is less than the pivot, then it is swapped with the first item in the equal to
or greater than list. After either operation, the properties of the partition are preserved and, when
the whole list has been processed, the list is partitioned.
To keep track of the location of the last element in the first sublist (items less than the pivot),
we will use the variable LastSmall, and to keep track of the location of the first element in the
second sublist (items equal to or greater than the pivot), we will use the variable FirstBig. To
ensure that the list starts in the correct mode, we start by skipping over the items less than the
pivot. A partitioning algorithm is (First and Last are pointers to the first item and to the last item
respectively in the list to be partitioned):
end partition
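A sketch of this one-direction partition in Ada, swapping items between nodes as described; the names LastSmall and FirstBig follow the text, everything else (the node declarations and the data) is illustrative:

```ada
procedure Linked_Partition_Demo is
   type Node;
   type Node_Ptr is access Node;
   type Node is record
      Item : Integer;
      Next : Node_Ptr;
   end record;

   procedure Partition (First : Node_Ptr; Pivot : Integer) is
      Last_Small : Node_Ptr := null;    --  last node of the "small" sublist
      First_Big  : Node_Ptr := null;    --  first node of the "big" sublist
      Current    : Node_Ptr := First;
      Temp       : Integer;
   begin
      --  Start by skipping over the leading items less than the pivot
      while Current /= null and then Current.Item < Pivot loop
         Last_Small := Current;
         Current    := Current.Next;
      end loop;
      First_Big := Current;
      --  Process the rest of the list, keeping the processed part
      --  partitioned into a "small" sublist followed by a "big" one
      while Current /= null loop
         if Current.Item < Pivot then
            Temp           := First_Big.Item;   --  swap with the first
            First_Big.Item := Current.Item;     --  item of the "big" list
            Current.Item   := Temp;
            Last_Small     := First_Big;
            First_Big      := First_Big.Next;
         end if;
         Current := Current.Next;
      end loop;
   end Partition;

   Data : constant array (1 .. 10) of Integer :=
     (7, 3, 5, 1, 8, 6, 10, 4, 9, 2);
   Head : Node_Ptr := null;
begin
   for I in reverse Data'Range loop     --  build the linked list
      Head := new Node'(Item => Data (I), Next => Head);
   end loop;
   Partition (Head, 6);
   --  The list is now 3, 5, 1, 4, 2, 6, 10, 7, 9, 8: every item less
   --  than the pivot precedes every item greater than or equal to it
end Linked_Partition_Demo;
```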
Developing the remainder of the Quicksort algorithm using this partitioning algorithm is
straightforward and left for the reader. Also, a non-recursive version is left for an exercise.
The Big Oh execution times (minimum, average, and maximum) of the linked version of
Quicksort are the same as the array version, but since the linked version swaps approximately
three times as many items as the array version (Size/2 versus Size/6 swaps on the average), the
linked version can be significantly slower.
Quicksort is generally considered one of the better methods -- provided the pivot is carefully
chosen. A poorly chosen pivot can produce a series of sublists such that one of the sublists is
always one item long; this can produce sort times of O( Size² ) and turn Quicksort into one of the
poorer methods. A better choice is to make the pivot the median of, say, three items in the list. A
common choice is the median of the first, middle, and last items in the list; this choice insures a
maximum sort time which is about 40% worse than the average.
Exercises
1. Give an example of a list which produces the worst (best) possible execution times for:
a. Bubble Sort, b. Insertion Sort, and c. Quicksort.
2. Given the list 2, 5, 4, 3, 1, 9, 8, what is the list after one pass of:
a. Insertion Sort, b. Bubble Sort, and c. Quicksort.
4. Which sort methods require no extra space? Which sort methods require extra space? List for
each method the amount of extra space required.
5. Prove or disprove: The repeat loop in In_Order Insert can be reduced to one test and one
assignment statement each time through the loop. Will additional assumptions change the conclu-
sion?
6. Develop an algorithm for In_Order Insert, assuming the list is stored in a linked structure, like
the one in Module 9.2.1.1.
7. Develop detailed algorithms for Selection Sort assuming the list is stored in:
a. an array or b. a linked structure.
9. Bubble Sort makes many unnecessary swaps. Modify Bubble Sort so that the number of
swaps is greatly reduced.
10. The number of times Qsort is invoked in Quicksort is greatly reduced if Qsort is only invoked
for sublists of length greater than one. Do this. How does this affect the space requirements of
Quicksort?
11. Develop a non-recursive version of Quicksort. How does this change the space requirements
of Quicksort?
to determine the five (5) smallest items in a list. Give the execution time, in Big Oh, for each of
the algorithms.
15. Determine the minimum, average, and maximum number of comparisons and moves required
by:
a. Selection Sort, c. Bubble Sort, and
b. Insertion Sort, d. Quicksort
assuming the list is stored in (A) an array or (B) a linked structure.
One tree data type which has these nice properties is a heap. A heap is a binary tree such that
every node is less than or equal to its parent. In particular, the root node is always the largest
item in the heap. Heap Sort starts by turning the list into a heap. It then swaps the root (the
largest item) into the last position in the list and makes the remaining items back into a heap. The
root of this new heap is then swapped into the second last position of the list and again the
remaining items are made back into a heap. This process of swapping and making the remaining
items into a heap continues until the list is completely sorted.
This sounds a lot like the Selection Sort method where we:
1. find the largest item and move it to the end of the list,
2. find the second largest item and move it to the second last position in the list,
3. and so forth.
The difference between the two methods is the way the next largest item is found. In Selection
Sort, it takes O( Size ) to find the next largest item. By using a heap, finding the next largest item
takes only O( log₂ Size ). The difference is significant.
Besides being a way of sorting a tree, a heap is also a way of efficiently storing a tree so as to
waste no space. This can be confusing at times: a heap is both an ADT and a storage scheme.
We will present the heap data structure after showing how to use a heap ADT to sort a list.
To illustrate the process, let us start with the list:
7, 3, 5, 1, 4.
and start building the heap by inserting the items one at a time into a heap. Start by inserting the
first item as the root of the heap:
7
3
which is a heap. Inserting the remaining items in breath first order, one at a time, gives:
7
3 5
1 4
which is not a heap because 4 is larger than its parent 3. Swapping the 4 and the 3, however,
gives the heap:
7
4 5
1 3
This last insertion illustrates that any time a new item is added at the end of the heap, the new
item is swapped with its parent (if necessary, over and over again) until the tree is a heap.
By definition of a heap the largest element in the tree must be the root of the heap. Swapping
the root and the last element gives:
3
4 5
1 7
The 7 is now in its final, correct position at the end of the list and we confine our attention to the
remaining items. The remaining items no longer form a heap. But, by "sifting down" the 3 (i.e.,
swapping it with its largest child), they can be restored to a heap:
5
4 3
1 7
At this point the heap has 5 as a root and swapping the root with the last element in the heap
gives:
1
4 3
5 7
The last two items in the list, the 5 and the 7, are now in their final, correct position, but the
remaining items no longer form a heap. Restoring the heap property by sifting down the 1,
while ignoring the last two items, gives:
4
1 3
5 7
Again swapping the root of the heap with the last item gives:
3
1 4
5 7
with the 4, 5, and 7 in their final, correct position. The remaining items form a heap so swapping
the root of the heap with the last item gives:
1
3 4
5 7
which, when output in breadth first order, produces the list in sorted order.
Note that as an ADT, a heap has two important operations: insert a new item into the heap
and sift the root down to recreate a heap. The sort is basically a combination of these two
operations.
The heap data structure stores the tree in an array as follows: the root is stored in List( 1 ) and
the left child of List( I ) in List( 2*I ) and the right child of List( I ) in List( 2*I+1 ). This storage
scheme eliminates the need to explicitly store the pointers. In other words, it saves space, but
only if the tree is almost complete. If a tree is relatively sparse, say close to degenerate, then this
storage scheme wastes a lot of space for items that are not in the tree. The heap data structure is
a natural one to use for any application based upon the heap ADT because the heap ADT is a
complete tree.
As one possible use of the heap data structure, consider a priority queue, a queue where each
element has a priority. The tree chapter contains an example of how a priority queue can be
stored in a tree. The heap data structure is an ideal data structure for this tree. A heap has the
largest item as the root, so dequeueing an item requires removing the root from the heap, moving
the last item in the heap into the root and then sifting the root down to produce a new heap. The
execution time is O( log2 Size ). Inserting an item into the priority queue is the same as inserting
an item into a heap, so is also O( log2 Size ). The result is a priority queue implementation which
is fast and requires no extra space for pointers. The detailed algorithms are left for an exercise.
A general algorithm for the heap sort using the heap data structure is:
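In outline, combining the two subalgorithms that follow:

```
Heap Sort
    Make Heap out of Array(1..Size)
    For Length = Size downto 2
        Swap Array(1) and Array(Length)
        Sift_Down (1..Length-1)
    end for
end heap sort
```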
Assuming the routine starts with the original list stored in the array, the subalgorithms are:
Make Heap
For I = 2 to Size
J <-- I
while (J/2) >= 1 and then Array(J) > Array(J/2)
Swap Array(J) and Array(J/2)
J <-- J/2
end while
end for
end make heap
Sift_Down (1..Length)
Initialize
Parent <-- 1
Child <-- 2
If (Length >= 3) and (Array(Child+1) > Array(Child)) then
Child <-- Child + 1
Repeat while Child <= Length and then Array(Child) > Array(Parent)
Swap Array(Parent) and Array(Child); Parent <-- Child; Child <-- 2*Parent
If (Child < Length) and then (Array(Child+1) > Array(Child)) then Child <-- Child + 1
end repeat
end sift down
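These subalgorithms can be rendered compactly as executable code. The sketch below uses Python rather than the book's Ada for brevity, and the function names are illustrative; the 1-based parent/child arithmetic is mapped onto 0-based list indices, so the children of position I are 2I+1 and 2I+2.

```python
def heap_sort(items):
    """Heap Sort: build a max-heap in place, then repeatedly swap the
    root (the largest item) to the end and sift the new root down."""
    a = list(items)  # work on a copy

    def sift_down(parent, length):
        # 0-based indices: the children of a[i] are a[2i+1] and a[2i+2].
        while True:
            child = 2 * parent + 1
            if child >= length:
                return
            if child + 1 < length and a[child + 1] > a[child]:
                child += 1          # pick the larger of the two children
            if a[child] <= a[parent]:
                return              # heap property restored
            a[parent], a[child] = a[child], a[parent]
            parent = child

    # Make Heap: insert items one at a time, swapping each new item
    # with its parent until the tree is a heap again.
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j] > a[(j - 1) // 2]:
            a[j], a[(j - 1) // 2] = a[(j - 1) // 2], a[j]
            j = (j - 1) // 2

    # Repeatedly move the root to its final position and re-heap the rest.
    for last in range(len(a) - 1, 0, -1):
        a[0], a[last] = a[last], a[0]
        sift_down(0, last)
    return a
```

Running it on the list 7, 3, 5, 1, 4 reproduces the intermediate heaps shown above.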
Another kind of tree sort is based upon a tournament tree. Most readers have seen a tourna-
ment tree of the kind
Team1 --+
        +--
Team2 --+
Team3 --+
        +--
Team4 --+
for a tennis or basketball tournament. After Team1 and Team2 play one another, the winner is
listed in the next level of the tree. Assuming Team1 and Team4 win their respective games, the
tournament tree would then appear as follows:
Team1 --+
        +-- Team1
Team2 --+
Team3 --+
        +-- Team4
Team4 --+
Again at this level, each pair of teams play a game and the winner advances to the next level of the
tree. Assuming Team4 wins this game, the final version of the tournament tree is:
Team1 --+
        +-- Team1 --+
Team2 --+           |
                    +-- Team4
Team3 --+           |
        +-- Team4 --+
Team4 --+
The general scheme is that the winner of each game advances to the next level of the tree.
This continues until one team, the tournament winner, reaches the final level of the tree.
The same scheme can be used to sort a set of items. Instead of two teams playing a game,
each pair of numbers are compared and the smallest one is declared the winner and moves to the
next level. At each level, two items are compared and the smallest one moves to the next level.
Eventually, the smallest item in the list ends up at the winning position. Figure 10.2.2.1 illustrates
the process.
In this particular case, the smallest item in the original list is a 3 and it eventually arrives at the
last level in the tree. Now, the 3 is eliminated from the tree (normally it is replaced by an infinity or
some other suitably large item) and the process repeated to produce the next smallest item in the
tree. Note that only one branch of the tree, using log2Size comparisons, needs to be recalculated
to produce the second smallest item in the tree.
Leaves :  7  12  17   9  22   3  30  25
Round 1:    7       9       3      25
Round 2:        7               3
Winner :            3
Tournament Sort--Step 1
Figure 10.2.2.1
The second smallest item is replaced by infinity or some large value and the process repeated
again to produce the third smallest item in the tree. Repeating this procedure over and over again
will eventually produce the whole list in sorted order.
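A minimal sketch of the whole process, in Python rather than the book's Ada, with the function name illustrative. The tree is padded to a power of two and infinity is used as the "retired" value, as described above; after the first full tournament, only one branch is replayed per output item.

```python
import math

def tournament_sort(items):
    """Tournament Sort: pair items into 'games'; the smaller item wins
    and advances.  After each winner is output, it is replaced by
    infinity and only its branch of the tree is replayed."""
    n = len(items)
    if n == 0:
        return []
    size = 1 << (n - 1).bit_length()      # round leaves up to a power of 2
    INF = math.inf
    # tree[size .. size+n-1] holds the leaves; tree[1] is the winner slot.
    tree = [INF] * (2 * size)
    for i, x in enumerate(items):
        tree[size + i] = x
    for i in range(size - 1, 0, -1):      # play the first full tournament
        tree[i] = min(tree[2 * i], tree[2 * i + 1])

    result = []
    for _ in range(n):
        result.append(tree[1])
        # Find the winning leaf, retire it, and replay only its branch.
        i = 1
        while i < size:
            i = 2 * i if tree[2 * i] == tree[i] else 2 * i + 1
        tree[i] = INF
        while i > 1:
            i //= 2
            tree[i] = min(tree[2 * i], tree[2 * i + 1])
    return result
```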
The time to execute this process can be estimated in the usual way. Note that it takes (Size -
1) comparisons to produce the smallest item in the tree and O( log2Size ) comparisons to produce
the second and all later items. The total execution time is then:
Minimum: O( Size * log2Size )
Average : O( Size * log2Size )
Maximum: O( Size * log2Size )
Considering all of the overhead associated with a tree, this does not seem a very useful
method, but there is one case where it is useful. Assume we have multiple computers and each
comparison is made on a different computer. Then all of the first level comparisons are made at
the same time, all of the second level comparisons are made at the same time, and so forth. The
smallest item is then produced in time O( log2Size ). If each level of the tree begins processing
the next set of comparisons as soon as it finishes the previous set, then it takes only O( Size-1 )
time units to process all of the items once the first item is produced. This gives a total execution
time of:
O( log2Size ) + O( Size-1 ) = O( Size ).
Even more interesting, once the smallest item is generated, each additional clock pulse suffices
to generate the next item in the sorted list. This is convenient because later processing stages can
consume the items as soon as they are generated by the tournament sort.
Exercises
2. What kind of list gives the fastest sort time for Heap Sort? What kind of list gives the slowest
sort time?
5. Using a heap data structure to implement a priority queue, show what the heap would look
like after each of the following operations: (the second item in the enqueue is the priority)
Enqueue( A, 10 ), Enqueue( B,20 ), Enqueue( A, 5 ), Enqueue( B,20 ), Enqueue( A, 10 ),
Dequeue, Enqueue( B, 15 ), Enqueue( A, 10 ), Enqueue( B,20 ), Dequeue, Dequeue,
Dequeue, Dequeue
6. Develop a set of algorithms for implementing a priority queue using a heap data structure.
7. Develop an algorithm based upon heap ADT for finding the maximum item in a list. Is this a
good method? Develop an algorithm based upon heap ADT for finding the five largest items in a
list. Is this a good method?
8. Modify the Heap Sort method so it produces the minimum item in a list. Is this a good
method? Modify the Heap Sort method so it produces the five smallest items in a list. Is this a
good method?
9. Modify the Heap Sort method so it produces the items sorted in reverse order; that is, from
largest to smallest.
Another approach is to consider the list as a collection of smaller, sorted lists and merge these
small, sorted lists into larger and larger sorted lists. Thus to sort the list:
{ 7, 12, 17, 9, 22, 3, 30, 25}
we first assume it is the set of sublists:
{7}, {12}, {17}, {9}, {22}, {3}, {30}, {25}
each of length one. Since any list of length 1 is sorted, these sublists are all sorted.
Next we merge these sublists pairwise into sorted sublists of length two:
{7,12}, {9,17}, {3,22}, {25,30}
These sublists are then merged in pairs into sorted sublists of length four:
{7, 9, 12, 17}, {3, 22, 25, 30}
These two sublists are then merged into one sorted list of length eight:
{3, 7, 9, 12, 17, 22, 25, 30}
and the list is sorted.
The method works for a list of any length: since the sorted sublists double in length after
every merge, sooner or later the whole list becomes sorted.
To turn the method into an algorithm, we use a top-down, recursive algorithm which divides
the list in two, sorts the two halves, and then merges the two sorted halves into a sorted whole;
that is,
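In the pseudocode style used elsewhere in this chapter, the scheme is:

```
Merge_Sort ( List )
    If List has more than one item then
        Divide List into two halves, First and Second
        Merge_Sort ( First )
        Merge_Sort ( Second )
        Merge the sorted halves First and Second
end merge_sort
```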
This algorithm keeps calling itself over and over again until it reaches sublists of length 1.
These are considered sorted, so they are merged into lists of length 2, and so forth back up the line
until the whole list is sorted. The exact implementation details differ depending upon whether the
list is stored in an array or a linked structure, so we take each one in turn.
If the list is stored in an array, then the sublists are denoted by their first and last subscript in
the array. The main algorithm simply calls the recursive routine, M_Sort, with the subscripts 1 to
Size. M_Sort then performs the actual sort.
The merging subroutine is the only part that remains to be developed. If the subscripts of the
first sublist run from First_1 to Last_1 and the subscripts of the second sublist run from First_2 to
Last_2, then a routine for merging the two sorted sublists is (note that a temporary array Result is
used to temporarily store the merged sublists):
Merge ( First_1..Last_1, First_2..Last_2 )
Initialize: I_1 <-- First_1 and I_2 <-- First_2
Repeat while I_1 <= Last_1 and I_2 <= Last_2
Move the smaller of Array(I_1) and Array(I_2) into Result and increment its subscript
end repeat
Terminate
If I_1 <= Last_1 then Copy rest of Array(I_1)...Array(Last_1) into Result
If I_2 <= Last_2 then Copy rest of Array(I_2)...Array(Last_2) into Result
Copy Result into Array(First_1)...Array(Last_2)
end merge
Ideally the two sublists should be merged in place so as to use no extra storage space and
to eliminate the need to copy the merged list back into the original array. This is impossible
(Why?), so the two sublists must be merged into a separate array, Result, and then the contents of
this array copied back into the original array. This additional array must obviously be as large as
the original array, so this at least doubles the amount of space required by the method. The result-
ing routines are a bit on the slow side. The Big Oh execution times seem promising:
Minimum: O( Size * log2 Size )
Average: O( Size * log2 Size )
Maximum: O( Size * log2 Size )
but these execution times are a bit misleading. Because of all the sublists used, the work in
merging them, and the recursive calls, the Merge Sort tends to require a lot of memory and more
time than the Big Oh times indicate.
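The array version can be sketched as follows (in Python rather than the book's Ada; the function name is illustrative). The temporary array Result is created on each merge and copied back, which is exactly the extra space and copying discussed above.

```python
def merge_sort(a, first=None, last=None):
    """Recursive Merge Sort for a list stored in an array: sort each
    half, then merge the two sorted halves through a temporary array."""
    if first is None:
        first, last = 0, len(a) - 1
    if first >= last:
        return a                      # a sublist of length 0 or 1 is sorted
    middle = (first + last) // 2
    merge_sort(a, first, middle)
    merge_sort(a, middle + 1, last)

    # Merge a[first..middle] and a[middle+1..last] into result.
    result = []
    i, j = first, middle + 1
    while i <= middle and j <= last:
        if a[i] <= a[j]:
            result.append(a[i]); i += 1
        else:
            result.append(a[j]); j += 1
    result.extend(a[i:middle + 1])    # copy whichever sublist remains
    result.extend(a[j:last + 1])
    a[first:last + 1] = result        # copy result back into the array
    return a
```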
Some of this overhead disappears when the list is stored in a linked structure. Assume the
linked structure is a collection of nodes and each node contains one item and a pointer to the next
node. The pointer in the last node is assumed to have a value of null. The merger in this case can
be done by rearranging pointers, so the only extra space required (assuming the space used by the
pointers is necessary for some other reason) is that for the recursive calls. The amount of this
additional space is O( log2 Size ).
Assume a linked list is denoted by a pointer to the first node in the linked structure; then a
possible algorithm is:
Merge_Sort ( List )
If List.Next /= Λ then --List has more than one item
Divide List into two sublists, First and Second
Merge_Sort ( First )
Merge_Sort ( Second )
Merge the sublist First with the sublist Second
end merge_sort
The linked structure, however, introduces an interesting subproblem: how to divide a linked
list into two separate, equal sized, linked lists? Let List point to the first item in the list and First
and Second point to the first items in the two respective sublists. Then one possible algorithm to
subdivide the list is:
Divide ( List, First, Second )
Initialize
First <-- List
Second <-- List
Pointer <-- List
Repeat for each item in list (while Pointer /= Λ and then Pointer.Next /= Λ )
Last_of_First <-- Second
Second <-- Second.Next
Pointer <-- Pointer.Next.Next
end repeat
Last_of_First.Next <-- Λ
end divide
The routine to merge the two sorted sublists is rather tedious, but straightforward. As before,
First and Second point to the two respective sublists and List points to the final, merged list.
Merge_Sublists ( First, Second, List )
Initialize: set List and Pointer to whichever sublist head holds the smaller item, and advance that sublist
Repeat while First /= Λ and Second /= Λ
Link Pointer.Next to whichever sublist head holds the smaller item; advance that sublist and Pointer
end repeat
Terminate
If First /= Λ then Pointer.Next <-- First
If Second /= Λ then Pointer.Next <-- Second
end merge sublists
The Big Oh execution times of the linked structure version of Merge Sort are the same as
those for the array version. However, because the linked structure version does not have to copy
back the merged sublists, the linked version uses less storage space and executes faster.
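The linked version can be sketched as follows (again in Python; the Node type and helper names are illustrative). Note that the merge relinks nodes rather than copying items, so no temporary array is needed.

```python
class Node:
    def __init__(self, item, next=None):
        self.item, self.next = item, next

def split(head):
    """Divide a linked list (of at least two nodes) into two roughly
    equal halves: one pointer advances two nodes for every one node
    the other advances."""
    second = head
    ptr = head
    while ptr is not None and ptr.next is not None:
        last_of_first = second
        second = second.next
        ptr = ptr.next.next
    last_of_first.next = None     # cut the first half off from the second
    return head, second

def merge_sort(head):
    """Merge Sort on a linked structure: merging rearranges pointers."""
    if head is None or head.next is None:
        return head               # a list of length 0 or 1 is sorted
    first, second = split(head)
    first, second = merge_sort(first), merge_sort(second)
    # Merge by relinking the node with the smaller item, over and over.
    dummy = tail = Node(None)
    while first is not None and second is not None:
        if first.item <= second.item:
            tail.next, first = first, first.next
        else:
            tail.next, second = second, second.next
        tail = tail.next
    tail.next = first if first is not None else second
    return dummy.next
```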
A non-recursive version of Merge Sort is possible, but tends to obscure the basic technique
used by Merge Sort. On the other hand, it executes faster because there are no recursive calls and
there is no need for the additional storage needed by the recursive calls. It is left for the exercises.
There are some other sort techniques based upon using sublists. The radix sort methods to
sort, for example, a list of integers, use (for base 10 numbers) 10 boxes numbered 0 through 9.
Each box has a queue and, to break the list into sublists, we insert each number into one of the
queues depending upon the particular digit in the number we are currently using. There are two
methods of choosing the digit, the Most Significant Digit method and the Least Significant Digit
method. We present each method in turn by applying it to the sample list:
21, 12, 32, 11, 23, 33, 22, 13, 31.
In the Least Significant Digit (LSD) method, we insert each number into the queue associated
with the least significant digit in the number; i.e., the unit's digit first. This gives the boxes and
queues:
0 -->
1 --> 21, 11, 31
2 --> 12, 32, 22
3 --> 23, 33, 13
where every item in box 1 ends in a one, every item in box 2 ends in a two, and so forth.
Collecting these queues into one list gives:
21, 11, 31, 12, 32, 22, 23, 33, 13
Note that the numbers in this list are sorted on the unit's digit.
Distributing the numbers in this list into the boxes using the next least significant digit, the
ten's digit, gives:
0 -->
1 --> 11, 12, 13
2 --> 21, 22, 23
3 --> 31, 32, 33
. . . . .
where the items in each box or queue are now sorted on the last two digits.
Collecting these into one list gives the sorted list.
Note that after two passes the numbers are sorted on the last two digits. After n passes of the
Least Significant Digit method, the numbers are sorted on the last n digits. Hence, the number of
passes required is equal to the number of digits in the largest number in the list.
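A sketch of the Least Significant Digit method, in Python rather than Ada; the ten boxes are plain lists used as queues, and the function name is illustrative. It assumes non-negative integers.

```python
def lsd_radix_sort(numbers):
    """Least Significant Digit radix sort for non-negative integers:
    distribute into ten queues on each digit, units digit first, then
    collect the queues back into one list; one pass per digit of the
    largest number."""
    items = list(numbers)
    digits = len(str(max(items))) if items else 0
    power = 1
    for _ in range(digits):
        boxes = [[] for _ in range(10)]           # one queue per digit 0..9
        for x in items:
            boxes[(x // power) % 10].append(x)    # distribute on this digit
        items = [x for box in boxes for x in box] # collect the queues
        power *= 10
    return items
```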
If the value of Maximum_Number is the largest number in the list, then the mathematical
expression, log10 Maximum_Number, is a close approximation to the number of digits in the
largest number in the list. Therefore, the execution times are:
Minimum: O( Size * log10 Maximum_Number )
Average: O( Size * log10 Maximum_Number )
Maximum: O( Size * log10 Maximum_Number )
where Maximum_Number is the largest number in the list.
The method looks fast in theory, O( Size * log10 Maximum_Number ), but the overhead due
to the queues can be large and, in practice, the method is relatively slow unless one has special
equipment. (Special sorting devices using this method have been built and are very fast, but of
use only in very special cases. Banks, for example, use special purpose sorting machines based
upon this approach to sort checks. )
One other practical difficulty with radix sorting is that it is difficult in high level languages,
such as Ada, to access individual digits in a number.
The method can also sort alphabetical data; the basic technique is the same except that one
uses twenty-six boxes labeled A, B,..., Z. If necessary, extra boxes can be used for blank spaces
and other special characters. The Big Oh execution times are:
Minimum: O( Size * Maximum_Length )
Average: O( Size * Maximum_Length )
Maximum: O( Size * Maximum_Length )
where Maximum_Length is the number of characters in the longest item; one pass is required for
each character position.
The method can even be used for binary data. In this case there are only two boxes.
The Most Significant Digit (MSD) method sorts the items into boxes starting with the most significant digit. Given
the sample list
21, 12, 32, 11, 23, 33, 22, 13, 31
and distributing these using the Most Significant Digit, the tens digit, gives the boxes and queues:
0 -->
1 --> 12, 11, 13
2 --> 21, 23, 22
3 --> 32, 33, 31
. . . . .
This gives us essentially three sublists, one of numbers in the teens, one of numbers in the
twenties, and one of numbers in the thirties. Each sublist is now sorted separately using the unit's
digit and the three sorted sublists are then joined into one large, sorted list.
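A sketch of the Most Significant Digit method, again in Python with illustrative names. Here each box is sorted by a recursive call on the next digit rather than on a separate processor; it assumes non-negative integers.

```python
def msd_radix_sort(numbers, power=None):
    """Most Significant Digit radix sort: distribute on the leading
    digit, sort each box on the next digit, then join the boxes."""
    items = list(numbers)
    if len(items) <= 1:
        return items
    if power is None:
        power = 10 ** (len(str(max(items))) - 1)  # weight of the leading digit
    if power == 0:
        return items                              # all digit positions used
    boxes = [[] for _ in range(10)]
    for x in items:
        boxes[(x // power) % 10].append(x)        # distribute on this digit
    result = []
    for box in boxes:                             # each box sorted separately
        result.extend(msd_radix_sort(box, power // 10))
    return result
```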
The Most Significant Digit method is, in some sense, a mirror image of the Least Significant
Digit method and has the same sort times:
Minimum: O( Size * log10 Maximum_Number )
Average: O( Size * log10 Maximum_Number )
Maximum: O( Size * log10 Maximum_Number )
where Maximum_Number is the largest number in the list.
The method can also be used for alphabetic data (twenty-six boxes) or even binary data (two
boxes).
The Most Significant Digit method is a poor method on a single processor system because of
all of the overhead associated with the queues. On a multiprocessor system, however, each of the
three sublists can be sorted on a separate processor; in other words, the three sublists can be
sorted in parallel. The time savings can be significant.
Exercises
2. Give an example of a list which produces the worst (best) possible execution times for Merge
Sort.
3. Sort the following lists using (A) Most Significant Digit and (B) Least Significant Digit:
a. 12, 32, 22, 21, 13, 33, 11, 15,
b. 121, 231, 113, 312, 233, 321, 112, 221, 313,
c. Abe, Chuck, Abby, Chris, Abel, Cherry,
d. 011, 010, 000, 101, 111, 001.
5. Develop a non-recursive version of Merge Sort for a list stored in an array. Hint: One way is
to merge first lists of lengths 1, then lists of length 2, and so forth, using a for loop. Watch for the
case when the Size of the list is not exactly equal to a power of 2.
7. Determine the minimum, average, and maximum number of comparisons and moves required
by:
a. Merge Sort and b Least Significant Digit Radix Sort
assuming the list is stored in (A) an array or (B) a linked structure.
9. The queues in the Least Significant Digit method can be implemented by using a set of
queues. Compare this method to using a special purpose collection of queues. (Hint: What is the
cost of combining ten queues into a single list?)
There are several choices to be made in translating the sort algorithms above into Ada
routines. The first set of choices concerns generality. To ensure generality, the routines should
work for any kind of data which, in Ada, usually implies a generic subprogram. If speed is
important, however, the routine should be specific to the actual data type to be sorted. Secondly, the
routine should work for a list of any size and, if the list is stored in an array, any range or type of
subscripts.
The Ada subprogram, Program 10.5.1, is a generic procedure to perform a Selection Sort.
Selection Sort was chosen because it is a particularly simple method and suffices to illustrate how
to write Ada sort procedures based upon any sort method.
The generic parameter list specifies the Data_Type, the Array_Data_Type, and a greater-than
function to compare two Data_Type items. Note that the Array_Data_Type specifies an array
with integer subscripts; a procedure to sort arrays with enumerated subscripts requires changing
this specification. As noted above, if speed is important, the generic routine should be altered to a
routine specific to the data to be sorted. The greater-than function compares two complete items
to determine the larger of the pair, or, if the individual items are, for example, records to be sorted
on only one field of the record, then it can compare only that field.
The only parameter of the sort procedure itself is the array, Data_Array, containing the items
to be sorted. The routine itself sorts all of the items in the Data_Array; that is, all items with
subscripts in the range Data_Array'First..Data_Array'Last. This is very general, but not always a
good choice. If the size of the array changes from one invocation to the next, this choice requires
the invoking routine to pass a slice of the array, which, in many compilers, means a lot of
additional overhead to create the slice from the original array. In other words, this can increase
the amount of time necessary to perform the sort. A faster version would pass the whole array
and the size of the list to be sorted. But this means checking the value of the size to make certain
it is a valid value and, if not, raising an exception. An exception visible to the calling routine must
be declared in a package, so this requires changing the freestanding procedure into one inside a package.
Note how seemingly minor changes can require major modifications in the Ada procedure
itself.
generic
   type Data_Type is private;                --Data type stored in Array
   type Array_Data_Type is array ( Integer range <> ) of Data_Type;
   with function ">" ( Left, Right : Data_Type ) return Boolean;
procedure Selection_Sort ( Data_Array : in out Array_Data_Type );
---------------------------------------------------------------
procedure Selection_Sort ( Data_Array : in out Array_Data_Type ) is
   Minimum             : Data_Type;
   Location_of_Minimum : Integer;
begin
   for I in Data_Array'First..(Data_Array'Last-1) loop
      --Find the minimum item in the rest of the list
      Minimum := Data_Array(I);
      Location_of_Minimum := I;
      for J in (I+1)..Data_Array'Last loop
         if Minimum > Data_Array(J) then
            Minimum := Data_Array(J);
            Location_of_Minimum := J;
         end if;
      end loop;
      --Swap the minimum item into position I
      Data_Array(Location_of_Minimum) := Data_Array(I);
      Data_Array(I) := Minimum;
   end loop;
end Selection_Sort;
Exercises
1. Alter Program 10.5.1 so that it works for an array with enumerated subscripts.
2. Redesign Program 10.5.1 so that it works for lists stored in linked structures. What changes
does this require in the generic parameter list?
3. Redesign Program 10.5.1 so that it works for a list of arbitrary size stored in an array with
subscripts ranging from 1 to some positive integer greater than or equal to the size of the list.
(How can the routine guarantee that the specified array is large enough to hold a list of the speci-
fied size?)
4. Compare the speed of the procedure developed in Exercise 3 to that of Program 10.5.1 when
both are applied to the same input data. One way to do this is to actually run both procedures
with the same set of test lists ( say a thousand or so test lists) and measure the execution time.
Does the final result depend upon the size of the test list?
10.6. Theory
There is an extensive theoretical foundation for sorting and sorting methods. One of the
major theoretical results is:
Theorem: Any general sorting method which works by comparing pairs of items requires,
on a single processor and an arbitrary list, at least O( Size * log2Size ) comparisons.
This theorem gives us a lower bound on sorting time; there is no point in trying to find a
general O( Size ) sort algorithm because the theorem states that such algorithms are impossible.
The conditions of the theorem are, however, also of interest. If we have multiple processor
computer systems, then faster methods are possible. The Most Significant Digit radix sort
mentioned above is one such method. There are many more, so called, parallel sorting methods.
There also exist particular data collections for which faster sorts are possible. If, for example,
the list of numbers contains only one or two numbers out of order, then Insertion Sort can sort
the list in O( Size ).
As another example to illustrate fast sorting methods for special data collections, assume the
data collection contains most of the numbers between 1 and 10,000. Then consider the method:
1. initialize an array of 10,000 items to all zeros;
2. then for each item in the data collection, add one to the corresponding entry in
the array; and
3. after all the numbers in the collection have been processed, output items corre-
sponding to the non-zero entries in the array.
This method is an O( Size ) sort algorithm, but only works for very special data sets.
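This special-case method can be sketched as follows (in Python; the function name is illustrative, and the value bound of 10,000 follows the example above):

```python
def counting_sort(collection, max_value=10_000):
    """The special-case O(Size) method: count occurrences of each value
    in an auxiliary array, then output values with non-zero counts.
    Only works when every item is an integer in 1..max_value."""
    counts = [0] * (max_value + 1)          # step 1: initialize to zeros
    for x in collection:
        counts[x] += 1                      # step 2: tally each item
    result = []
    for value in range(1, max_value + 1):   # step 3: output in order
        result.extend([value] * counts[value])
    return result
```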
Thus, for multiple processors or special data sets, faster sort times are possible, but the
general result still holds: A single processor and an arbitrary data collection mean at best an O(
Size * log2Size ) sorting time.
This result is interesting for another reason. Sorting is one of the few areas in computer
science where explicit lower bounds are known; in other words, the bounds apply to the
problem itself rather than to a particular algorithm. Thus, all reasonable sort algorithms will have the
same Big Oh execution time, O( Size * log2Size ), and, in general, no better result is possible.
The various sorting algorithms are all attempts to make the Big Oh constant as small as possible
or to use less storage space. It would be convenient if explicit lower bounds were known for all
problems. As noted in the last section of the Graph chapter, there is a continual search for exact
lower bounds, but they are hard to find.
There is no such thing as a general, best sorting method. Every method has its advantages
and disadvantages and the proper choice often depends upon the data to be sorted and the exact
version of the method.
As a simple example, Insertion Sort works well on a collection with only a few items out of
place. If there are more than a few items out of place or the items are far from their final position,
Insertion sort degenerates into an O( Size2 ) method and becomes one of the slower methods.
Theoretical bounds are, of course, available for all of the standard sorting methods and, in
theory, it suffices to compare these bounds to determine the best method. Theoretical bounds,
however, skim over such questions as the kind of list that is being sorted, its size, how close to
sorted the list is, the compiler used to generate the machine code, the computer used to execute
the sort, and even the effect of the operating system on the sorting time. Some older computers,
for example, can only address 64,000 bytes of memory at a time and, if the list exceeds this size,
execution can slow to a crawl. Every computer system has some practical limit on the amount of
main memory available for sorting purposes. If the size of the list exceeds conveniently available
memory, then the operating system might have to start paging items into and out of the memory
with all of the attendant slowdowns implied by paging. (Some sort methods are more affected by
paging than others.) Some compilers don't take full advantage of hardware pointers or built-in
subscripting capabilities and a program that is very fast in general may be slow when a certain
compiler is used.
The most important factor may be the list to be sorted. Most of the sort algorithms and the
theoretical analyses assume that the list of items is in random order. If the list is not in random
order, but contains some kind of order, a different method or kind of sorting method may be
preferable. In the largest study known to the author, IBM measured a large collection of actual
data to be sorted. Their conclusion was that most of the lists were close to sorted to begin with.
This fact invalidates the assumptions used in the standard theoretical analyses. If such nearly
sorted lists are at all common, then we may need to reconsider our approach. The only way to be
certain in most cases is to measure the actual data to be sorted. If the lists vary greatly from one
time to the next, then the measurements will show this and the standard analyses are valid. If, on
the other hand, the lists all tend to be ordered, then this should affect the choice of sorting
method.
Similarly, it does not suffice to say, for example, the Bubble Sort method; one must specify
exactly which version is meant. Over the years, many, many versions of Bubble Sort have been
developed and the exact version used has an effect. The array version of the Bubble Sort
algorithm used in Section 10.1.3 of this chapter moves the smallest item all the way to the front of
the collection during the first pass. During this same first pass it only moves the largest item
downward one position in the list. Hence, it works very poorly on a list in reverse order.
There is a minor variation of the Bubble Sort (the linked structure version in Section 10.1.3)
which, during the first pass, moves the largest item all the way to the rear of the list and the small-
est item only one position forward. This one also works very poorly on a collection in reverse
order. Compare the two versions, however, when only one item is out of place in the collection.
Let the smallest item be at the rear of the list and let all the other items be in increasing order.
Now one version sorts the list in one pass and the other version needs Size-1 passes to sort the
list. (Which is which?)
Even more interesting, the Bubble Sort version given earlier moves items up or down the list
by repeated swapping. Each swap requires three moves, so moving an item five positions in the
list requires 15 moves. Some versions instead shift items along the list until they find where the
current item goes (somewhat the way Insertion Sort makes space for the next item). Moving an
item five positions in the list now requires one move to save the current item, five moves in the
list, and one move to insert the item back into the list, for a total of seven moves. While more complex, this
version can greatly reduce the number of moves and can be much faster than the version given in
the text.
Most "real" sorting methods are combinations of two or more of the above methods. As an
example, each pass of Quicksort has a fairly high overhead so that, for small sets, Insertion Sort is
faster than Quicksort. Detailed studies of execution times show that the break point is in the
neighborhood of ten or twenty items; below this boundary Insertion Sort is faster, above this
boundary Quicksort is faster. (The exact value of the break even point, say somewhere in the
range five to twenty-five items, depends upon exact details of the algorithm, compiler, and
computer. The difference between using ten or twenty as the break point has little effect on the
overall results, so we will rather arbitrarily use ten as the break point.) A more sophisticated
version of Quicksort, one which switches from Quicksort to Insertion Sort for any sublist of size ten
or less, is shown in Algorithm 10.7.1.
Thus, one criterion for choosing a method is collection size: some O( Size2 ) methods are faster
for small sets because they tend to have less overhead per pass. For most large sets there is no
question: O( Size * log2Size ) methods are better.
Even for large sets there is some question about which method to use. The exact method of
choosing the pivot in Quicksort, for example, is important; while in theory any item in the collec-
tion can be used as pivot, it tends to work better if the middle item of the collection is used. It
works even better if the median item is used, so many versions approximate the median by choosing the
median of the first, middle, and last items in the list as the pivot. The additional overhead is small
and the performance improvement is great.
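Combining the two refinements just described, a hybrid Quicksort with a median-of-three pivot and an Insertion Sort cutoff can be sketched as follows (in Python; the names are illustrative, and the break point of ten follows the text):

```python
BREAK_POINT = 10   # below this sublist size, Insertion Sort is faster

def insertion_sort(a, first, last):
    """Insertion Sort on the sublist a[first..last]."""
    for i in range(first + 1, last + 1):
        item, j = a[i], i
        while j > first and a[j - 1] > item:
            a[j] = a[j - 1]       # shift larger items up one position
            j -= 1
        a[j] = item

def quicksort(a, first=0, last=None):
    """Hybrid Quicksort: median-of-three pivot, with a switch to
    Insertion Sort for sublists of BREAK_POINT items or fewer."""
    if last is None:
        last = len(a) - 1
    if last - first + 1 <= BREAK_POINT:
        insertion_sort(a, first, last)
        return a
    # Median of the first, middle, and last items as the pivot.
    mid = (first + last) // 2
    pivot = sorted((a[first], a[mid], a[last]))[1]
    i, j = first, last
    while i <= j:                 # partition around the pivot
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    quicksort(a, first, j)
    quicksort(a, i, last)
    return a
```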
Radix methods have too much overhead to be used for most purposes. As noted earlier, they
are used in very special cases such as parallel processing or in special hardware such as the
machines used to sort bank checks. The Most Significant Digit method might be useful if one did
not have to completely sort the list but only put it in more or less reasonable order. One or two
passes would then group together items with the same leading one or two digits. As a variation
on this, in some cases one might use one or two passes of Most Significant Digit to get the data
into small groups and then sort the individual groups with Bubble Sort or Insertion Sort.
Heap Sort looks better than it is because sifting an item down requires two comparisons per
level. There are versions which eliminate this extra comparison but at the cost of extra
complexity.
Insertion Sort and Bubble Sort are mainly useful for short lists or lists of items that are almost
in order. The question is: Can the designer guarantee that these conditions are true and will
remain true in the future?
There is one final factor to consider. All of the sorting methods presented in this chapter are
designed for internal sorting. Sorting a list stored on, say, a hard disk is a completely different
problem and special methods have been developed for this case. There is, in fact, commercial
software available which examines your list and your hardware and then determines the best
sorting method. It can even take into account the disk rotational speed and where on the disk the
data is stored. Needless to say, such software is far beyond this textbook, but it is nice to know
that experts have developed such useful tools for the rest of us. Most of us, however, need to
sort lists internally, and the methods in this chapter cover the most common internal sort methods.
The goal of this chapter is to introduce you to the interplay between the data structure used
and the sorting methods available. The same interplay holds for many other problem areas.
Changing the data structure can greatly affect the time and space requirements of a problem. If,
after reading this book, the reader is better able to understand and use the results of this interplay
between data structures and algorithms to solve his/her problems, then the book has been a
success.
Exercises
1. Which sort methods require no extra space? Which sort methods require extra space? How
much extra space does each method need?
2. The relationship between sort methods and particular data structures is interesting.
a. Which sort methods depend upon a specific data structure?
b. For as many data structures as possible, list the sort methods that can use each
data structure.
3. The problem "Find the ten largest items in a collection" can be done using any of the sort
methods. For each sort method, develop a corresponding solution to this problem and compare
the resulting algorithms for speed.
4. The problem "Find the median item in a collection" can be done using any of the sort methods.
For each sort method, develop a corresponding solution to this problem and compare the resulting
algorithms for speed.
5. Give the execution time for each sorting method when the original list:
a. is already sorted, b. has two items out of order.
7. A stable sorting method is one where, after the sort is completed, two equal items have the
same order they had before the sort was started; in other words, the sorting method preserves the
order of equal valued items. Which of the above methods are stable?
9. Develop a version of Bubble Sort which replaces multiple swaps with a "move up and replace"
strategy. How does this version compare to the one in the text for speed?
10. The Quicksort version given in the last section changes to Insertion Sort for lists of ten items
or less. Study the effects of changing this break point from 10 to 5, 15, or 20 by using the various
versions to sort a large number of lists.
Appendix A
This appendix summarizes some of the mathematical results used in this book. The results are
taken from standard mathematical works and the reader is referred there for more details.
The first results are from college algebra and any college algebra book which discusses sums of
series will have the following series and their sums.
 N
 Σ (a + bk)  =  aN + bN(N + 1)/2  =  N(2a + bN + b)/2
k=1

Most college algebra books also contain the following result, provided x ≠ 1:

 N                    a − (a + Nb)x^(N+1)     bx(1 − x^N)
 Σ (a + bk)x^k   =   --------------------- + -------------
k=0                         1 − x              (1 − x)^2

Two special cases of this result are used in this book. With a = 1 and b = 0:

 N             1 − x^(N+1)
 Σ x^k   =   -------------  =  (x^(N+1) − 1)/(x − 1) .
k=0              1 − x

With a = 0, b = 1, and x = 2, so that (1 − x)^2 = 1:

 N
 Σ k·2^k   =   (N − 1)·2^(N+1) + 2 .
k=0
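Closed forms like the last one are easy to check numerically for small values of N. The following checking program is our own sketch; it is not part of the text.

```ada
--  Compare direct evaluation of the sum k*2**k with the
--  closed form (N - 1)*2**(N + 1) + 2 for N = 0 .. 10.
with Ada.Text_IO;
procedure Check_Sum is
   Sum : Integer;
begin
   for N in 0 .. 10 loop
      Sum := 0;
      for K in 0 .. N loop
         Sum := Sum + K * 2**K;                 -- direct evaluation
      end loop;
      if Sum /= (N - 1) * 2**(N + 1) + 2 then   -- closed form
         Ada.Text_IO.Put_Line("Formula fails for N =" & Integer'Image(N));
      end if;
   end loop;
end Check_Sum;
```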
Appendix B
Designing a Text Package
The data types and packages developed in Chapter 2 are deliberately kept simple to illustrate
particular features of records or arrays and their use in packages. Realistic data types are often
more complicated and require careful analysis before the implementation details can be determined.
As a more realistic example, we consider a variable length string data type and develop it in
some detail. Ada allows only fixed length strings and the string lengths must match before com-
paring two strings or assigning a value to a string variable. This is a nuisance in ordinary textual
data processing where one wants to process strings without having to keep track of the details of
each string's length before processing the string. A variable length string is a string whose
length is arbitrary. Two variable length strings can be directly compared or assignments made
without worrying about the length of the strings.
The advantage of a variable length string data type is that the user can ignore the length of a
string; hence, variable length strings correspond more closely to everyday usage of textual data.
In other words, instead of having to use the fixed length string type, where every item has to be
exactly the same length, we want to be able to declare an item to be a variable length string and
then ignore the length of the data. For example, if the program must process people's names, we
might declare Name to be a variable length string and then use operations, such as Get(Name),
without having to make everyone's name exactly the same length. This certainly simplifies life
and makes our programs more readable and more likely to be correct.
To differentiate the new data type from fixed length strings, we will call the new data type,
Text. A more precise definition of the data type Text is:
a. a set of data values, each value consisting of a string of zero or more characters;
if the string contains one or more characters, the first character must be a non-
blank character, and
b. the operations:
- Get which inputs a Text value,
- Put which outputs a Text value,
- To_Text which converts a fixed String value to a Text value, and
- the comparison operators <, <=, =, /=, >, and >= which compare two Text
values and produce the correct Boolean value.
Note that Text is one example where the real world object and the data type are the same; that
is, the real world object and the data type have exactly the same set of possible values and the
same set of operations produce identical values. The only difficulty will occur if the value of a
Text variable is stored in an array, for then the set of possible Text values is limited by the
length of the array.
Applying the definition of an external point of view of a data structure from Chapter 1, we
note that from the external point of view Text variables have two features to consider. First, this
definition of Text makes it impossible to access any subpart of a Text value. Second, the existence
of the comparison operators implies that there is an ordering between individual Text values;
that is, we can compare two Text values and determine which is less than or greater than the
other.
Applying the definition of an internal point of view of a data structure from Chapter 1 leads
to much more interesting results. The question is how to implement this definition of Text as a
data structure in the package? We have some important choices to make at this point. First,
how shall we store the Text values in the computer? Second, how shall we implement the
operations?
Some possible storage structures are:
1. Each value is stored in a linked list with one character per node.
2. Each value is stored in an array (one character per item)
where the array is either fixed length or variable length.
In other languages, there might be more choices, but Ada allows storing collections of values
only in arrays and linked lists.
Linked lists are more flexible than arrays, but they also use more storage space per item.
Some other disadvantages of linked lists are more subtle and require some thought.
To understand the disadvantages of linked lists for general data types, we need to start by
considering data type implementations in Ada. The set of data type values must be either private or
limited private. Otherwise, the using program can access the individual parts of a value and
perform operations on the values which are not part of the package. This would violate the
whole concept of a data type.
Now, Ada treats private linked lists and private arrays differently. For example, if the using
or client program includes the assignment statement:
A := B;
where A and B are both array valued variables, then executing this assignment statement copies
the value of the array named B into the storage locations allotted to the array named A.
On the other hand, if A and B are access types (i.e., point to linked lists), then executing this
assignment statement sets the value of A so that A points to the same location as B. Thus, any
change in B's linked list, no matter how much later during program execution, is also a change in
A's linked list. This is not what we normally mean by an assignment statement; we normally ex-
pect the assignment statement to change the value of A depending only on B's value at the time
the assignment statement is executed. To avoid this difficulty, we normally use linked lists only
to store data type values that are not used in assignment statements; that is, usually limited pri-
vate types.
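A small sketch makes the difference concrete. The type names Vector and Vector_Ptr below are our own illustrations, not part of the Text package.

```ada
--  Array assignment copies values; access assignment copies pointers.
procedure Assignment_Demo is
   type Vector is array (1 .. 3) of Integer;
   type Vector_Ptr is access Vector;
   A : Vector := (1, 2, 3);
   B : Vector := (1, 2, 3);
   P, Q : Vector_Ptr;
begin
   A := B;          -- copies B's components into A's own storage
   B(1) := 99;      -- A(1) is still 1; A holds an independent copy

   P := new Vector'(1, 2, 3);
   Q := P;          -- Q now points to the SAME vector as P
   P(1) := 99;      -- Q(1) is also 99; there is only one vector
end Assignment_Demo;
```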
Another possible solution is to make our data type limited private and overload the assignment
operator, but Ada forbids overloading the assignment operator. The best we can do is develop a
new assignment procedure, say Assign(A, B), which assigns the value of B to A; but this is
clumsy and confusing to the program reader, who expects an assignment-like statement to
contain a := token.
Thus, a Text value should be a private type and stored in arrays. The question then is: should
we use a variable length array or a fixed length array? Variable length arrays have their
advantages, but Ada forbids constructing arrays where each component of the array is itself a
variable length array. Thus, if we stored a Text value in a discriminated record (a variable
length array), we could not form arrays of Text variables. There might be times when we wish
to use an array of Text values; for example, a list of names stored in an array. Storing Text
variables in variable length arrays would forbid this. Therefore, we will use fixed length records
containing an array and a length field.
Since arrays have fixed lengths, we must use a length long enough to include any possible
Text value that can occur in the using program. There are two ways to do this: one is to pick a
length, say 80 characters, and arbitrarily make this the maximum length allowed; another is
make the maximum length a parameter that is determined at compile time (in Ada we can use a
generic parameter to do this). We will use the first way for now and leave generics for later.
This suggests a record definition of the general form:

   type Text is
      record
         Size : Integer range 0..Maximum_Size := 0; --Text Length.
         Data : String(1..Maximum_Size);            --Text value.
      end record;

where Maximum_Size is preset to the maximum allowed length of a Text value. Specification
B.1 contains an Ada package specification using this approach.
Note that the maximum length of the Text data is always given in terms of the constant inte-
ger Maximum_Size which is set to 80 at the beginning of the package specification. This makes
it easy to change the maximum allowed length. We change this one constant, recompile, and the
maximum length is changed everywhere in the package. A good design always declares, at the
beginning of the package specification, a named value for every constant that one might wish to
change at some time in the future. If there is any doubt in your mind about whether or not to
treat a given constant this way, err on the safe side and replace the constant by a data name
whose value is easily altered.
A practical package must also allow for various types of errors. The major possible error in
this package is attempting to generate too long a value, a value whose length is greater than
Maximum_Size. There are several ways to handle such errors. For reasons discussed in Chap-
ter 3, this package uses an exception to handle the error. The exception is raised at appropriate
points in the package body.
We now consider implementing the operations. Since Text is a private type, we do not
overload the = and /= operators but instead use the system-defined operators. This, however, is inconsistent
with our record definition. Assume we have two identical Text values and that each contains
three characters. The system = operator considers not only the first three characters in the Data
field of the Text record, but all the other characters as well. These remaining characters may
well be different even though the first three characters agree. Thus, the = operator could decide
the two Text values are different even when they are identical. To avoid this difficulty, we will
make the additional assumption that all unused positions in the Data vector will be set to a blank.
Now we can use the system = and /= operators. The complete package specification and body
are contained in Specification B.1.
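To see why the blank fill matters, consider two Text records built without it. The values below are hypothetical and this fragment is only a sketch, not code from the package.

```ada
--  X and Y hold the same three-character value "cat", but the
--  unused positions of Data have been left undefined.
X, Y : Text;
...
X.Size := 3;  X.Data(1..3) := "cat";   -- positions 4..80 undefined
Y.Size := 3;  Y.Data(1..3) := "cat";   -- positions 4..80 undefined
--  The predefined "=" compares ALL of Data, so X = Y may be False
--  even though both represent "cat".  With every unused position
--  set to ' ', X = Y is guaranteed to be True.
```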
This is a lot of detail for one rather straightforward data type. One reason is that it is a com-
monly used data type and it helps to consider its design in some detail. The second and most im-
portant reason is to impress you with the fact that we must make implementation decisions and
language features do affect our choices. Beginning courses concentrate on learning the basic
features of a language but often leave for a later course the more subtle effects of these features.
This course must always consider those effects.
Appendix B - 3
-- A Package to implement the Data Type Text
-- Text consists of:
-- - a set of data values, each value consisting of a string
-- of zero or more characters; if the string contains one
-- or more characters, the first character must be a non-
-- blank character, and
-- - the operations:
-- - Get which inputs a Text value,
-- - Put which outputs a Text value,
-- - To_Text which converts a fixed String value to a
-- Text value, and
-- - the comparison operators <, <=, =, /=, >, and >=
-- which compare two Text values and produce the
-- correct Boolean value.
package Text_Package is
Maximum_Size : constant Integer := 80; --Maximum length of
--a Text value.
type Text is private;
--Input/Output procedures
procedure Get (Item : out Text);
procedure Put (Item : in Text);
--Conversion Operator
function To_Text (Str : String) return Text;
--Comparison Operators
function "<" ( Left : in Text;
Right : in Text) return Boolean;
function "<=" ( Left : in Text;
Right : in Text) return Boolean;
function ">" ( Left : in Text;
Right : in Text) return Boolean;
function ">=" ( Left : in Text;
Right : in Text) return Boolean;
--Raised when a Text value would exceed Maximum_Size
Text_Value_Too_Long : exception;
private --Declarations
type Text is
record
Size : Integer range 0..Maximum_Size := 0;--Text Length.
Data : String(1..Maximum_Size); -- Text value.
end record;
end Text_Package;
-- Package Body For Text Data Type
with Ada.Text_IO;
package body Text_Package is
--Input/Output Operators
--------------------------------------------------------------
procedure Get_Char (Char : out Character) is
--Inputs next character value
begin
if Ada.Text_IO.End_Of_Line
then
Char := ' ';
Ada.Text_IO.Skip_Line;
else
Ada.Text_IO.Get(Char);
end if;
end Get_Char;
---------------------------------------------------------------
procedure Get (Item : out Text) is
Char : Character;
begin
Item := (Size => 0, Data => (others => ' '));
Get_Char( Char ); --Skip leading blanks
while Char = ' ' loop
Get_Char( Char );
end loop;
while Char /= ' ' loop --Collect the value
if Item.Size = Maximum_Size then
raise Text_Value_Too_Long;
end if;
Item.Size := Item.Size + 1;
Item.Data(Item.Size) := Char;
Get_Char( Char );
end loop;
end Get;
---------------------------------------------------------------
procedure Put (Item : in Text) is
begin
Ada.Text_IO.Put( Item.Data(1..Item.Size) ); --Output Text value
end Put;
---------------------------------------------------------------
--Conversion Operator
---------------------------------------------------------------
function To_Text (Str : String) return Text is
Result : Text;
begin
--Make sure string is not too long
if Str'Length > Maximum_Size then
raise Text_Value_Too_Long;
end if;
--Copy the characters and blank fill the unused positions
Result.Size := Str'Length;
Result.Data(1..Str'Length) := Str;
Result.Data(Str'Length + 1..Maximum_Size) := (others => ' ');
return Result;
end To_Text;
---------------------------------------------------------------
-- Comparison operators
---------------------------------------------------------------
function ">" ( Left : in Text;
Right : in Text) return Boolean is
begin
return
Left.Data(1..Left.Size) > Right.Data(1..Right.Size );
end ">";
--------------------------------------------------------------
function ">=" ( Left : in Text;
Right : in Text) return Boolean is
begin
return
Left.Data(1..Left.Size) >= Right.Data(1..Right.Size);
end ">=";
---------------------------------------------------------------
function "<" ( Left : in Text;
Right : in Text) return Boolean is
begin
return
Left.Data(1..Left.Size) < Right.Data(1..Right.Size);
end "<";
---------------------------------------------------------------
function "<=" ( Left : in Text;
Right : in Text) return Boolean is
begin
return
Left.Data(1..Left.Size) <= Right.Data(1..Right.Size);
end "<=";
end Text_Package;
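A client program might use the package along these lines. This demonstration program is our own sketch; it is not part of Specification B.1.

```ada
--  A client of Text_Package: read a name, compare it, echo it.
with Text_Package;  use Text_Package;
with Ada.Text_IO;
procedure Text_Demo is
   Name  : Text;
   Limit : constant Text := To_Text("Smith");
begin
   Get(Name);                    -- read one Text value
   if Name < Limit then
      Ada.Text_IO.Put_Line(" comes before Smith");
   end if;
   Put(Name);                    -- echo the value read
end Text_Demo;
```

Note that the client never mentions the length of Name; the package handles all length bookkeeping, which was the whole point of the design.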
Exercises
1. Add the following operations to the TEXT package contained in this appendix.
a. concatenation,
b. an index function which returns the position of
one TEXT value inside another TEXT value,
c. a substitute operation which replaces one TEXT value
by a second TEXT value inside a third TEXT value,
d. a delete operation which deletes the first occurrence of one TEXT value
from a second TEXT value,
e. a function to convert a TEXT value into an INTEGER value,
f. a function to convert a TEXT value into a STRING value,
g. a function to convert an INTEGER value to a TEXT value,
h. a function to convert a STRING value to a TEXT value, and
i. a substring function which returns a specified substring of a given string.
Be sure to make allowances for various types of errors.
2. Alter the TEXT package in Specification B.1 so that the package user can specify the maxi-
mum possible length of a TEXT value:
a. without using a generic parameter, and
b. by using a generic parameter.