Chapter 16: Data representation
Learning objectives
By the end of this chapter you should be able to:
■ show understanding of why user-defined types are necessary
■ define and use non-composite data types
■ define and use composite data types
■ choose and design an appropriate user-defined data type for a given problem
■ show understanding of the methods of file organisation and select an appropriate method of file organisation and file access for a given problem
■ show understanding of methods of file access
■ show understanding of hashing algorithms
■ describe the format of binary floating-point real numbers
■ show understanding of the effects of changing the allocation of bits to mantissa and exponent in a floating-point representation
■ convert binary floating-point real numbers into denary and vice versa
■ normalise floating-point numbers
■ show understanding of the consequences of a binary representation only being an approximation to the real number it represents (in certain cases)
■ show understanding that binary representations can give rise to rounding errors.
Cambridge International AS & A Level Computer Science
A user-defined data type is a data type for which the programmer has included the definition
in the program. Once the data type has been defined, variables can be created and
associated with the user-defined data type. Note that, although the user-defined data type
is not a built-in data type, using the user-defined data type is only possible if a programming
language offers support for the construct.
TIP
Make sure that you do not confuse user-defined data types and abstract data types (defined in
Section 13.07 of Chapter 13).
Following these definitions, variables can be declared and assigned values, for example:
DECLARE Direction1 : TDirections
DECLARE StartDay : TDays
Direction1 ← North
StartDay ← Wednesday
It is important to note the following points.
• The values of the enumerated type look like string values but they are not. The values
must not be enclosed in quotes.
• The values defined in an enumerated data type are ordinal. This means that enumerated
data types have an implied order of values.
The ordering can be put to many uses in a program. For example, a comparison statement
can be used with the values of the variables of an enumerated data type:
DECLARE Weekend : Boolean
DECLARE Day : TDays
Weekend ← (Day > Friday)
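As a rough illustration of the ordinal property, the comparison can be sketched in Python. The class name Days, its member list and the Monday-first ordering are assumptions made for this sketch, because the definition of TDays is not reproduced here. Python's IntEnum is used only because its members can be compared.

```python
from enum import IntEnum

# Hypothetical counterpart of TDays: the order of definition gives the ordinal values.
class Days(IntEnum):
    MONDAY = 1
    TUESDAY = 2
    WEDNESDAY = 3
    THURSDAY = 4
    FRIDAY = 5
    SATURDAY = 6
    SUNDAY = 7

def is_weekend(day: Days) -> bool:
    # Relies on the implied order: Saturday and Sunday come after Friday.
    return day > Days.FRIDAY

start_day = Days.WEDNESDAY       # no quotes: the value is not a string
print(is_weekend(start_day))     # False
print(is_weekend(Days.SUNDAY))   # True
```

The comparison only makes sense because the values were listed in a meaningful order when the type was defined.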
KEY TERMS
User-defined data type: where the programmer includes the definition in the program
Non-composite data type: a data type defined without reference to another data type
Enumerated data type: a non-composite user-defined data type for which the definition identifies all
possible values
The enumerated data type is one reason why user-defined data types are sometimes
needed. There could not be a built-in generic definition of an enumerated data type
because the possible values would not be known. The values can only be known when the
programmer has identified them in the type definition.
1 An example of the definition of a pointer type which requires only the identification of a
data type for which the pointer is to be used.
TYPE
TIntegerPointer ← ^Integer
2 An example of the declaration of a variable of the pointer data type which does not
require the use of the caret (^) symbol.
DECLARE MyIntegerPointer : TIntegerPointer
3 An example of the declaration of two ordinary variables of type integer and the
assignment of a value for one of them.
DECLARE Number1, Number2 : INTEGER
Number1 ← 100
4 An example of an assignment to a pointer variable of a value which is the address of a
different variable.
MyIntegerPointer ← @Number1
5 An example of an assignment which uses the ‘dereferenced’ value which has been stored
at the address defined by the pointer variable. This assigns the value 200 to Number2.
Number2 ← MyIntegerPointer^ * 2
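Python has no built-in pointer data type, so the five steps above can only be imitated. The sketch below uses the standard ctypes module, which exposes C-style integers and pointers. The variable names mirror the pseudocode, and the mapping of ^ and @ onto ctypes calls is an analogy chosen for this sketch rather than an exact equivalent.

```python
from ctypes import c_int, pointer

number1 = c_int(100)            # an ordinary integer variable holding 100
my_pointer = pointer(number1)   # plays the role of MyIntegerPointer ← @Number1

# Dereference: read the value stored at the address held by the pointer.
number2 = my_pointer.contents.value * 2
print(number2)                  # 200

# The pointer refers to the variable itself, not to a copy of its value.
number1.value = 150
print(my_pointer.contents.value)  # 150
```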
KEY TERM
Pointer variable: one for which the value is the address in memory of a different variable
Not all programming languages offer support for the use of a pointer data type. Those
languages that do so will have their own version of the symbolism illustrated above with ^
and @.
Because arithmetic can be performed on pointer variables, it is possible to use pointer
variables to construct dynamically varying data structures. For some programming
languages it is necessary to declare an array with a large upper bound to ensure that the
array is unlikely to be fully populated with values. If the language supports the use of a
pointer variable, the size of an array can expand while a program is running. The details of
how this can be done are beyond the scope of this discussion.
KEY TERM
Set: a collection of data items that lacks any structure; contains no duplicates and has a number of
defined operations that can be performed on it
The most useful property of a set is the fact that duplicate values are not allowed. A list or a
one-dimensional array might be created but has to be checked to remove duplicate values. A
simple way of removing duplicate values would be to convert the structure to a set and then
convert the set back to the original structure.
A slightly different example would be if students were allocated to groups for studying a
particular subject. For each subject, the students’ names would be entered into a data
structure defined for that subject. Set data types could then, for example, find out which
students were studying both computer science and physics. The students studying both
subjects would be found by applying the ‘intersection’ operation on the two individual sets.
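Both uses of a set can be sketched with Python's built-in set type. The student names are invented for illustration. Note that converting a list to a set and back does not, in general, preserve the original order of the items.

```python
# Removing duplicates by converting to a set and back to a list.
names = ["Asha", "Ben", "Asha", "Chen", "Ben"]
unique_names = list(set(names))      # duplicates removed, order not guaranteed
print(sorted(unique_names))          # ['Asha', 'Ben', 'Chen']

# One set of student names per subject.
computer_science = {"Asha", "Ben", "Chen"}
physics = {"Ben", "Chen", "Dina"}

# Intersection: students who appear in both sets.
both = computer_science & physics
print(sorted(both))                  # ['Ben', 'Chen']
```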
KEY TERMS
Binary file: a file designed for storing data to be used by a computer program
Record: a collection of fields containing data values
Discussion Point:
A record is a user-defined data type. It is also a component of a file. Can there be or should
there be any relationship between these two concepts?
Serial files
A serial file contains records that have not been organised in any defined order. A typical
use of a serial file would be for a bank to record transactions involving customer accounts.
A program would be running. Each time there was a withdrawal or a deposit the program
would receive the details as data input and would record the data in a transaction file. In a
serial file each new record is simply appended to the file so that the only ordering in the file is
the time order of data entry.
Sequential files
A sequential file has records that are ordered. In the bank example, a sequential file could
be used as a master file for an individual customer account. At regular periods of time, the
transaction file would be read, and all affected customer account master files would be
updated. In order to allow a sequential file to be ordered, there has to be a key field for which
the values are unique and sequential but not necessarily consecutive. When a new record is
to be added to a sequential file it would be possible to simply append the record, with the
intention of sorting the file later. A more likely approach is for the file to be read sequentially
and each record written to a new file. This is continued until the appropriate position for the
new record is reached. The new record is then written to the new file before the remaining
records in the old file are copied in.
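The copy-and-insert approach can be sketched as follows. To keep the example short, a Python list stands in for each file and each record is a (key, data) tuple. The function name, the record layout and the account data are assumptions for this sketch.

```python
def insert_into_sequential(old_file, new_record):
    """Copy records in key order to a new file, placing new_record where it belongs."""
    new_file = []
    inserted = False
    for record in old_file:
        # The first record with a larger key marks the position for the new record.
        if not inserted and new_record[0] < record[0]:
            new_file.append(new_record)
            inserted = True
        new_file.append(record)
    if not inserted:
        # The new key is larger than every existing key, so it goes at the end.
        new_file.append(new_record)
    return new_file

master = [(1001, "Ali"), (1004, "Bea"), (1010, "Cai")]
updated = insert_into_sequential(master, (1007, "Dev"))
print([key for key, _ in updated])   # [1001, 1004, 1007, 1010]
```

The old file is read exactly once and is never modified, which matches the way the records are copied to a new file in the description above.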
Direct-access files
Direct-access files are sometimes referred to as ‘random-access’ files but, as with random-
access memory, the randomness is only that the access can be to any record in the file
without sequential reading of the file. Direct access can be achieved with a sequential file. A
separate index file is created which has two fields per record. The first field has the key field
value and the second field has a value for the position of this key field value in the main file.
The alternative is to use a hashing algorithm when a record is entered into the direct-access
file.
One simple hashing algorithm is applicable if there is a numeric key field in each record.
The algorithm chooses a suitable number and divides the value in the key field by this
number. The remainder from this division then identifies the address in the file for storage of
that record. The method works best if the chosen number is a prime number of a similar size to the
expected size of the file.
For simplicity this can be illustrated for 4-digit values in the key field where 1000 is used for
the dividing number. The following represent three calculations:
0045/1000 gives remainder 45 for the address in the file
2005/1000 gives remainder 5 for the address in the file
3005/1000 gives remainder 5 for the address in the file
There are two facts apparent from these calculations. The first fact is that the addresses
calculated do not have any order depending on the value in the key field. The second fact
is that different key field values can produce the same remainder and therefore the same
address in the file.
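The three calculations can be reproduced with a one-line function. The function name hash_address is an assumption for this sketch, and the key 0045 is written as 45 because Python does not allow leading zeros in integer literals.

```python
def hash_address(key: int, divisor: int = 1000) -> int:
    """Return the remainder when the key field value is divided by the chosen number."""
    return key % divisor

for key in (45, 2005, 3005):
    print(key, "->", hash_address(key))
# 45 -> 45
# 2005 -> 5
# 3005 -> 5
```

The last two keys produce the same address, which is the second fact noted above.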
Part 3: Chapter 16: Data representation
If the records do not have a suitable field with numeric digits, an alternative is to choose a
field with some alphabetic characters. The ASCII code for each character can be looked up
and the values then added. The sum is then used in the same way as described above, to
calculate an address as the remainder from an integer division.
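A minimal sketch of the character-based version follows. The function name, the sample keys and the divisor 11 are assumptions chosen for illustration. Python's ord function returns the ASCII code for the characters used here.

```python
def hash_text_key(text: str, divisor: int) -> int:
    """Add the character codes, then take the remainder from an integer division."""
    total = sum(ord(ch) for ch in text)
    return total % divisor

print(hash_text_key("CAB", 11))   # 67 + 65 + 66 = 198, and 198 mod 11 = 0
print(hash_text_key("ABC", 11))   # same letters, same sum, same address: 0
```

Because addition ignores the order of the characters, any two keys built from the same letters produce the same address.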
When the same address is calculated for different key field values, it is usually referred to as a
collision (the key values involved are sometimes called synonyms). The best choice for a hashing
algorithm is one that spreads the addresses most evenly and minimises the number of
collisions. However, collisions cannot be avoided altogether so there has to be a defined
method for dealing with collisions. There are a number of options, including the following:
• use a sequential search to look for a vacant address following the calculated one
• keep a number of overflow addresses at the end of the file
• have a linked list accessible from each address.
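The first option in the list can be sketched as below. A Python list stands in for the file, with None marking a vacant address. The file size of 7, the function names and the decision to wrap round to the start of the file when the end is reached are all assumptions for this sketch.

```python
FILE_SIZE = 7  # hypothetical number of addresses in the file

def store(file, key, data):
    """Place a record at its hashed address, or at the next vacant address after it."""
    address = key % FILE_SIZE
    for step in range(FILE_SIZE):
        slot = (address + step) % FILE_SIZE   # wrap round at the end of the file
        if file[slot] is None:
            file[slot] = (key, data)
            return slot
    raise RuntimeError("file is full")

def find(file, key):
    """Jump to the hashed address, then search sequentially until found or vacant."""
    address = key % FILE_SIZE
    for step in range(FILE_SIZE):
        slot = (address + step) % FILE_SIZE
        if file[slot] is None:
            return None                       # a vacant address ends the search
        if file[slot][0] == key:
            return file[slot][1]
    return None

file = [None] * FILE_SIZE
print(store(file, 10, "first"))    # 3, because 10 mod 7 = 3
print(store(file, 17, "second"))   # 4, because address 3 is already occupied
print(find(file, 17))              # second
```

The search for a record has to follow the same sequence of addresses that was used when the record was stored.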
Question 16.01
Imagine the possible numeric values for a key field in a direct-access file are in the range of 1
to 30 but you want the file to have fewer than 30 file addresses.
You decide to test two examples of a modular division hashing algorithm. The first test uses
10 as the number for division, the second test uses 11.
a What are the two sets of addresses generated as remainders from the division for the
key values 0 to 39 using 10 and 11?
b State one difference between the two sets of addresses.
c Is there any significant difference between the two sets of addresses?
d 11 is a prime number. Prime numbers are stated to give a better spread of use of the
addresses in a file. Do you know when this is more likely to be true?
File access
Once a file organisation has been chosen and the data has been entered into a file, you need
to consider how this data is to be accessed. For a serial file, the normal usage is to read the
whole file record by record. If there was a need to search for a particular value in one of the
fields, the only option would be to read the records from the beginning until the target record
was found. If the data is stored in a sequential file and a particular value is needed, searching
may have to be done in the same way. However, if the key field value is known for the record
containing the wanted data, the process is faster because only key field values need to be
read. For a direct-access file, the value in the key field is submitted to the hashing algorithm.
The value is the same value that was used when entering the data originally and will provide
the same value for the position in the file that was provided when the algorithm was used at
the time of data input. This eliminates the need to read records from the beginning of the file.
However, because of the collision problem some serial searching might be needed after the
initial jump to the hashed position.
File access might also be needed to delete or edit data. For a sequential file the same
method is used as when a new record was added. Records are copied from the old file to a
new file until the record that needs to be deleted or edited is reached. Following deletion or
editing all remaining records are copied to the new file.
For a direct-access file there is no need to create a new file. If a record needs editing it can
be accessed directly and edited without disturbing any other content. However, if a record
is to be deleted it is necessary to have a flag set in the record. Then, in a subsequent reading
process, that record is skipped over.
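The use of a deletion flag can be sketched as follows. A list of dictionaries stands in for the file, and the field name deleted, the function names and the sample data are assumptions for this sketch. The loop in delete_record is only there because a list is being used in place of a real direct-access file.

```python
# Each record carries a hypothetical 'deleted' flag alongside its key and data.
file = [
    {"key": 12, "data": "Ali", "deleted": False},
    {"key": 27, "data": "Bea", "deleted": False},
    {"key": 31, "data": "Cai", "deleted": False},
]

def delete_record(file, key):
    """Set the flag in the matching record; no other record is moved or rewritten."""
    for record in file:
        if record["key"] == key:
            record["deleted"] = True

def read_records(file):
    """Return the data from every record that has not been flagged as deleted."""
    return [record["data"] for record in file if not record["deleted"]]

delete_record(file, 27)
print(read_records(file))   # ['Ali', 'Cai']
print(len(file))            # 3, the flagged record still occupies its address
```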
KEY TERM
Floating-point representation: a representation of real numbers that stores a value for the mantissa
and a value for the exponent
A simple example can be used to illustrate the differences between the two representations.
Let’s consider that a real number is to be stored in eight bits.
For the fixed-point option, a possible choice would be to use the most significant bit as a
sign bit and the next five bits for the whole number part. This would leave two bits for the
fractional part. Some important non-zero values in this representation are shown in Table
16.01. (The bits are shown with a gap to indicate the implied position of the binary point.)
A possible choice for a floating-point representation would be four bits for the mantissa and
four bits for the exponent with each using two’s complement representation. The exponent
is stored as a signed integer. The mantissa has to be stored as a fixed-point real value. The
question now is where the binary point should be.
Two of the options for the mantissa being expressed in four bits are shown in Table 16.02(a)
and Table 16.02(b). In each case, the denary equivalent is shown, and the position of the
implied binary point is shown by a gap. Table 16.02(c) shows the three largest magnitude
positive and negative values for integer coding that will be used for the exponent.
(a)                              (b)                              (c)
First bit      Real value        Second bit     Real value        Integer bit    Integer value
pattern for    in denary         pattern for    in denary         pattern        in denary
a real value                     a real value
011 1          3.5               0 111          0.875             0111           7
011 0          3.0               0 110          0.75              0110           6
010 1          2.5               0 101          0.625             0101           5
101 0          –3.0              1 010          –0.75             1010           –6
100 1          –3.5              1 001          –0.875            1001           –7
100 0          –4.0              1 000          –1.0              1000           –8
Table 16.02 Coding a floating-point real value in eight bits (four for the mantissa and four for
the exponent)
When the mantissa has the implied binary point immediately following the sign bit, a smaller
spacing is produced between the values that can be represented. This is the preferred option
for a floating-point representation. Using this option, the most important non-zero values for
the floating-point representation are shown in Table 16.03. (The implied binary point and the
mantissa exponent separation are shown by a gap.)
The comparison between the values in Tables 16.01 and 16.03 illustrates the greater range of
positive and negative values available if floating-point representation is used.
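The difference in range can be checked with a short decoder. This is a sketch under the assumptions used in this section: both parts are in two's complement and the binary point of the mantissa follows the sign bit. The function names are invented for the sketch.

```python
def twos_complement(bits: str) -> int:
    """Interpret a bit string as a two's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)
    return value

def decode(mantissa_bits: str, exponent_bits: str) -> float:
    """Return the denary value of mantissa x 2^exponent."""
    # Binary point directly after the sign bit: scale by 2^(number of bits - 1).
    mantissa = twos_complement(mantissa_bits) / (1 << (len(mantissa_bits) - 1))
    exponent = twos_complement(exponent_bits)
    return mantissa * 2 ** exponent

# Floating point with four bits each for mantissa and exponent.
print(decode("0111", "0111"))   # 112.0, the largest positive value
print(decode("1000", "0111"))   # -128.0, the most negative value
print(decode("0001", "1000"))   # 0.00048828125, the smallest positive value

# Fixed point in eight bits: sign bit, five whole-number bits, two fractional bits.
print(twos_complement("01111111") / 4)   # 31.75, the largest positive value
print(twos_complement("10000000") / 4)   # -32.0, the most negative value
```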
Table 16.04 Alternative representations of denary 2 using four bits each for mantissa and exponent
For a negative number we can consider representations for –4 as shown in Table 16.05.
Table 16.05 Alternative representations of denary −4 using four bits each for mantissa and exponent
When the number is represented with the highest magnitude for the mantissa, the two
most significant bits are different. This fact can be used to recognise that a number is in a
normalised representation. The values in Tables 16.04 and 16.05 also show how a number
could be normalised. For a positive number, the bits in the mantissa are shifted left until
the most significant bits are 0 followed by 1. For each shift left the value of the exponent is
reduced by 1.
The same process of shifting is used for a negative number until the most significant bits are
1 followed by 0. In this case, no attention is paid to the fact that bits are falling off the most
significant end of the mantissa.
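The shifting process can be sketched as below. For simplicity the mantissa is held as a string of bits and the exponent as an ordinary integer. The function name and the two starting bit patterns are assumptions for this sketch, chosen to represent denary 2 and denary –4.

```python
def normalise(mantissa: str, exponent: int):
    """Shift the mantissa left until its two most significant bits differ."""
    if "1" not in mantissa:
        return mantissa, exponent          # zero has no normalised form
    while mantissa[0] == mantissa[1]:
        mantissa = mantissa[1:] + "0"      # the leading bit falls off, a 0 enters on the right
        exponent -= 1                      # each left shift is balanced by reducing the exponent
    return mantissa, exponent

# Denary 2 stored as 0.25 x 2^3 becomes 0.5 x 2^2.
print(normalise("0010", 3))   # ('0100', 2)

# Denary -4 stored as -0.25 x 2^4 becomes -1.0 x 2^2.
print(normalise("1110", 4))   # ('1000', 2)
```

In both cases the value represented is unchanged. Only the split between mantissa and exponent is different.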
Conversion of representations
In Chapter 1 (Section 1.01), a number of methods for converting numbers into different
representations were discussed. These only considered integer values. We now need to
consider the conversion of real numbers.
We can start by considering the conversion of a simple real number, such as 4.75, into a
simple fixed-point binary representation. This looks easy because 4 converts to 100 in binary
and .75 converts to .11 in binary so the binary version of 4.75 should be:
100.11
However, remember that a positive number should start with 0. Can we just add a sign bit?
For a positive number we can. Denary 4.75 can be represented as 0100.11 in binary.
For negative numbers we still want to use two’s complement form. So, to find the
representation of –4.75 we can start with the representation for 4.75 then convert it to two’s
complement as follows:
0100.11 converts to 1011.00 in one’s complement
then to 1011.01 in two’s complement
To check the result, we can apply Method 2 from Worked Example 1.01 in Chapter 1. 1011 is
the code for –8 + 3 and .01 is the code for .25; –8 + 3 + .25 = –4.75.
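The same result can be reached arithmetically, which gives a way of checking the working. The function below is a sketch: its name and parameters are assumptions, the whole-number bits include the sign bit, and no check is made that the value fits into the bits available.

```python
def to_fixed_point(value: float, whole_bits: int, fraction_bits: int) -> str:
    """Return the two's complement fixed-point bit pattern for value."""
    total_bits = whole_bits + fraction_bits
    # Scale so that the fractional part becomes part of an integer: 4.75 x 4 = 19.
    scaled = round(value * (1 << fraction_bits))
    # Taking the remainder modulo 2^total_bits gives the two's complement pattern.
    pattern = format(scaled % (1 << total_bits), f"0{total_bits}b")
    return pattern[:whole_bits] + "." + pattern[whole_bits:]

print(to_fixed_point(4.75, 4, 2))    # 0100.11
print(to_fixed_point(-4.75, 4, 2))   # 1011.01
```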
We can now consider the conversion of a denary value expressed as a real number into a
floating-point binary representation. Before considering the conversion method it should
be remembered that most fractional parts do not convert to a precise representation. This
is because the binary fractional parts represent a half, a quarter, an eighth, a sixteenth and
so on. Unless a denary fraction is a sum of a collection of these values, there cannot be
an accurate conversion. In particular, of the values from .1 through to .9, only .5 converts
accurately. This was mentioned in Chapter 1 (Section 1.03) in the discussion about storing
currency values.
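The claim about .1 through to .9 can be tested with exact fractions. A denary fraction converts exactly only if, in its lowest terms, the denominator is a power of two. The function name is an assumption for this sketch.

```python
from fractions import Fraction

def converts_exactly(denary: str) -> bool:
    """True if the value is a finite sum of halves, quarters, eighths and so on."""
    denominator = Fraction(denary).denominator   # fraction reduced to lowest terms
    # A power of two has a single 1 bit, so clearing that bit leaves zero.
    return (denominator & (denominator - 1)) == 0

tenths = ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]
print([d for d in tenths if converts_exactly(d)])   # ['0.5']

# A familiar consequence of the approximation when real numbers are stored in binary.
print(0.1 + 0.2 == 0.3)   # False
```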
The method for conversion of a positive value is as follows.
1 Convert the whole-number part using the method described in Chapter 1 (Section 1.01).
2 Add the 0 sign bit.
3 Convert the fractional part choosing a method from one of the examples in Worked
Example 16.01.
4 Combine the whole number and fractional parts and enter these into the most significant
of the bits allocated for the representation of the mantissa.
5 Fill the remaining bits for the mantissa and the bits for the exponent with zeros.
6 Adjust the position of the binary point by changing the exponent value to achieve a
normalised representation.
To convert a negative value the number is treated initially as positive and the same first five
steps are followed. At this stage a two’s complement conversion of the mantissa code is used
to convert this to a negative value before step 6 is carried out.
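The outcome of the method can be sketched in Python. This sketch reaches the normalised result by arithmetic rather than by manipulating the bits one step at a time, so it is a way of checking an answer, not a transcription of the six steps. The function name and default bit allocations are assumptions, and any fraction bits that do not fit in the mantissa are simply discarded.

```python
def to_floating_point(value: float, mantissa_bits: int = 10, exponent_bits: int = 4):
    """Return (mantissa, exponent) bit strings for a normalised two's complement form."""
    if value == 0:
        return "0" * mantissa_bits, "0" * exponent_bits
    scale = 1 << (mantissa_bits - 1)       # binary point follows the sign bit
    magnitude, exponent = abs(value), 0
    while magnitude >= 1:                  # move the binary point to the left
        magnitude /= 2
        exponent += 1
    while magnitude < 0.5:                 # move the binary point to the right
        magnitude *= 2
        exponent -= 1
    stored = int(magnitude * scale)        # discard bits that do not fit
    if value < 0:
        stored = -stored                   # two's complement of the mantissa
        if stored == -(scale // 2):        # pattern 1100... needs one more shift
            stored *= 2
            exponent -= 1
    low, high = -(1 << (exponent_bits - 1)), (1 << (exponent_bits - 1)) - 1
    if not low <= exponent <= high:
        raise ValueError("exponent does not fit in the bits available")
    mantissa = format(stored % (1 << mantissa_bits), f"0{mantissa_bits}b")
    return mantissa, format(exponent % (1 << exponent_bits), f"0{exponent_bits}b")

print(to_floating_point(8.63))      # ('0100010100', '0100')
print(to_floating_point(-4))        # ('1000000000', '0010')
print(to_floating_point(2, 4, 4))   # ('0100', '0010')
```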
Example 1
Example 2
Let’s consider the conversion of 8.63. The first step is the same but now the .63 has to be converted by
the ‘multiply by two and record whole number parts’ method. This works as follows:
.63 × 2 = 1.26 so 1 is stored to give the fraction .1
.26 × 2 = .52 so 0 is stored to give the fraction .10
.52 × 2 = 1.04 so 1 is stored to give the fraction .101
.04 × 2 = .08 so 0 is stored to give the fraction .1010
At this stage it can be seen that multiplying .08 by 2 successively is going to give a lot of zeros in the
binary fraction before another 1 is added so the process can be stopped. .63 has been approximated
as .625. So, following Steps 3–5 in Example 1, the final representation becomes 0100010100 for the
mantissa and 0100 for the exponent.
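The 'multiply by two and record whole number parts' method can be sketched as below. Exact fractions are used so that the working matches the hand calculation. The function name and the choice to pass the fraction as a string are assumptions for this sketch.

```python
from fractions import Fraction

def fraction_to_binary(fraction: str, places: int) -> str:
    """Convert a denary fraction to a binary fraction with the given number of bits."""
    remainder = Fraction(fraction)
    bits = ""
    for _ in range(places):
        remainder *= 2
        if remainder >= 1:
            bits += "1"          # record the whole-number part
            remainder -= 1       # carry on with the fractional part only
        else:
            bits += "0"
    return "." + bits

print(fraction_to_binary("0.63", 4))   # .1010
print(fraction_to_binary("0.75", 2))   # .11
```

Stopping after a fixed number of bits is what turns .63 into the approximation .625.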
TASK 16.01
Convert the denary value –7.75 to a floating-point binary representation with ten bits
for the mantissa and four bits for the exponent. Start by converting 7.75 to binary
(make sure you add the sign bit!). Then convert to two’s complement form. Finally,
choose the correct value for the exponent to leave the implied position of the binary
point after the sign bit. Convert back to denary to check the result.
Summary
■ Examples of non-composite user-defined data types include enumerated and pointer data types.
■ Record, set and class are examples of composite user-defined data types.
■ File organisation allows for serial, sequential or direct access.
■ Floating-point representation for a real number allows a wider range of values to be represented.
■ A normalised floating-point representation achieves the best precision for the value stored.
■ Stored floating-point values rarely give an accurate representation of the denary equivalent.
Reflection Point:
Whenever you are asked to create a binary representation from a denary value or vice versa,
do you always check your answer by converting it back to the original value?