Chapter 2
Computer Evolution and Performance
William Stallings: Computer Organization and Architecture, 9th Edition
Objectives
Why should we study this chapter?
How have computers developed across generations?
What applications require high-performance computers?
What are multicore processors, MICs (many integrated cores), and GPGPUs (general-purpose graphics processing units)?
How do we assess computer performance?
Objectives
After studying this chapter, you should be able to:
Present an overview of the evolution of computer
technology from early digital computers to the
latest microprocessors.
Understand the key performance issues that relate
to computer design.
Explain the reasons for the move to multicore
organization, and understand the trade-off between
cache and processor resources on a single chip.
Contents
2.1- A Brief History of Computers
2.2- Designing for Performance
2.3- Multicore, MICs, and GPGPUs
2.6- Performance Assessment
2.1- History of Computers
A computer generation is defined by an essential invention in its basic technology — the vacuum tube, the transistor, or the integrated circuit (IC).
First Generation: Vacuum Tubes
Basic technology: Vacuum tubes
Building block: composition and operation of the vacuum tube ([Link])
Typical computers:
ENIAC (Electronic Numerical Integrator And Computer)
EDVAC (Electronic Discrete Variable Computer) and John Von Neumann
IAS computer (Princeton Institute for Advanced Studies)
Commercial computers: UNIVAC (Universal Automatic Computer)
IBM computers (International Business Machines)
First Generation: ENIAC Computer
(Read by yourself)
Electronic Numerical Integrator And Computer
Designed and constructed at the University of Pennsylvania
Started in 1943 – completed in 1946, by John Mauchly and John Eckert
World’s first general purpose electronic digital computer
Army’s Ballistics Research Laboratory (BRL) needed a way to supply trajectory
tables for new weapons accurately and within a reasonable time frame
Was not finished in time to be used in the war effort
Its first task was to perform a series of calculations that were used to help
determine the feasibility of the hydrogen bomb
Continued to operate under BRL management until 1955, when it was disassembled
ENIAC: Characteristics
Weighed 30 tons
Occupied 1500 square feet of floor space
Contained more than 18,000 vacuum tubes
Power consumption of 140 kW
Capable of 5000 additions per second
Decimal rather than binary machine
Memory consisted of 20 accumulators, each capable of holding a 10-digit number
Major drawback was the need for manual programming by setting switches and plugging/unplugging cables
John von Neumann
EDVAC (Electronic Discrete Variable Computer)
First publication of the idea was in 1945
Stored program concept
Attributed to ENIAC designers, most notably the mathematician
John von Neumann
Program represented in a form suitable for storing in memory alongside the data (program = data + instructions)
IAS computer
Princeton Institute for Advanced Studies
Prototype of all subsequent general-purpose computers
Completed in 1952
Structure of von Neumann Machine
CA: Central Arithmetic unit
CC: Central Control unit
IAS Memory Formats
Both data and instructions are stored in the same memory.
The memory of the IAS consists of 1000 storage locations (called words) of 40 bits each.
Numbers are represented in binary form, and each instruction is a binary code.
One word holds either a number or two instructions.
Structure of IAS Computer
AC: Accumulator
MQ: Multiplier-Quotient register
MBR: Memory Buffer Register
IBR: Instruction Buffer Register
PC: Program Counter
IR: Instruction Register
MAR: Memory Address Register
Table 2.1: The IAS Instruction Set
Hexadecimal code: 010FA210FB
IAS instruction length: 20 bits (8-bit opcode + 12-bit address); one 40-bit word holds two instructions.
Left instruction: 010FA
Opcode: 01(h) = 0000 0001
Address: 0FA
Load the data in memory word 0FA into AC: AC = [0FA]
Right instruction: 210FB
Opcode: 21(h) = 0010 0001
Address: 0FB
Store AC into memory word 0FB: [0FB] = AC
Running this code on the IAS machine copies the contents of word 0FA (here, the value 7) into word 0FB: [0FB] = [0FA]
(Part of exercise 2.7)
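The decoding above can be sketched in code. A minimal Python illustration (the function name and helper are my own, not part of the IAS; the 8-bit opcode / 12-bit address split follows the word format described above):

```python
def decode_ias_word(word40: int):
    """Split a 40-bit IAS word into two 20-bit instructions.

    Each instruction is an 8-bit opcode followed by a 12-bit address.
    """
    left = (word40 >> 20) & 0xFFFFF   # left (high-order) instruction
    right = word40 & 0xFFFFF          # right (low-order) instruction

    def split(instr20):
        return (instr20 >> 12) & 0xFF, instr20 & 0xFFF  # (opcode, address)

    return split(left), split(right)

# The word from the example: 010FA210FB (hex)
(lop, laddr), (rop, raddr) = decode_ias_word(0x010FA210FB)
print(f"left:  opcode {lop:02X}h, address {laddr:03X}h")   # opcode 01h, address 0FAh
print(f"right: opcode {rop:02X}h, address {raddr:03X}h")   # opcode 21h, address 0FBh
```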
Commercial Computers: UNIVAC
(Read by yourself)
1947 – Eckert and Mauchly formed the Eckert-Mauchly Computer
Corporation to manufacture computers commercially
UNIVAC I (Universal Automatic Computer)
First successful commercial computer
Was intended for both scientific and commercial applications
Commissioned by the US Bureau of the Census for the 1950 calculations
The Eckert-Mauchly Computer Corporation became part of the UNIVAC
division of the Sperry-Rand Corporation
UNIVAC II – delivered in the late 1950s
Had greater memory capacity and higher performance
Backward compatible
IBM (Read by yourself)
Was the major manufacturer of punched-card processing equipment.
Delivered its first electronic stored-program computer, the 701, in 1953, intended primarily for scientific applications.
Introduced the 702 product in 1955; its hardware features made it suitable for business applications.
The series of 700/7000 computers established IBM as the overwhelmingly dominant computer manufacturer.
Second Generation: Transistors
Transistor = transfer + resistor (a device that can either pass or block current)
Building block: composition and operation of the transistor
More details: [Link]
Its operation is similar to that of a vacuum tube
Smaller, cheaper
Dissipates less heat than a vacuum tube
Is a solid-state device made from silicon
Was invented at Bell Labs in 1947
It was not until the late 1950s that fully transistorized computers were commercially available
Typical computers: IBM 700/7000 series
Second Generation Computers
Introduced:
More complex arithmetic and logic units and control units
The use of high-level programming languages
Provision of system software, which provided the ability to load programs, move data to peripherals, and call libraries to perform common computations
Appearance of the Digital Equipment Corporation (DEC) in 1957
PDP-1 (Programmed Data Processor) was DEC's first computer
This began the mini-computer phenomenon that would become so prominent in the third generation
Table 2.3: Example Members of the IBM 700/7000 Series
IBM 7094 Configuration (Read by yourself)
A multiplexer centrally manages several devices.
Mag: magnetic
Drum: magnetic drum for storing data
Third Generation: Integrated Circuits
1958 – the invention of the integrated circuit (IC)
All components of a circuit are miniaturized to microscopic size, so they can all be packed into a single chip
Discrete component
Single, self-contained transistor
Manufactured separately, packaged in their own containers, and soldered or wired together onto masonite boards (like circuit boards)
Manufacturing process was expensive and cumbersome
The two most important members of the third generation
were the IBM System/360 and the DEC PDP-8
Microelectronics
Integrated Circuits
A computer consists of gates, memory cells, and interconnections among these elements:
Data storage – provided by memory cells
Data processing – provided by gates
Data movement – the paths among components are used to move data from memory to memory and from memory through gates to memory
Control – the paths among components can carry control signals
The gates and memory cells are constructed of simple digital electronic components.
Microelectronics exploits the fact that such components as transistors, resistors, and conductors can be fabricated from a semiconductor such as silicon.
Many transistors can be produced at the same time on a single wafer (thin piece) of silicon.
Transistors can be connected with a process of metallization (covering with metal) to form circuits.
More details: [Link]
Wafer, Chip, and Gate Relationship
Wafer: a thin piece of silicon (< 1 mm)
Chip Growth
Figure 2.8: Growth in Transistor Count on Integrated Circuits (number of transistors vs. year; m: million, bn: billion)
Moore's Law
1965: Gordon Moore (co-founder of Intel) observed that the number of transistors that could be put on a single chip was doubling every year.
The pace slowed to a doubling every 18 months in the 1970s but has sustained that rate ever since.
Consequences of Moore's law:
The cost of computer logic and memory circuitry has fallen at a dramatic rate
The electrical path length is shortened, increasing operating speed
The computer becomes smaller and more convenient to use in a variety of environments
Reduction in power and cooling requirements
Fewer interchip connections
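Moore's observation is simple exponential growth; a quick Python sketch (illustrative only — the starting count is the Intel 4004's roughly 2,300 transistors):

```python
def projected_transistors(n0: float, years: float, doubling_months: float = 18.0) -> float:
    """Project a transistor count forward, doubling every `doubling_months`."""
    return n0 * 2 ** (years * 12 / doubling_months)

# Starting from ~2,300 transistors in 1971, 30 years of 18-month doublings
# (20 doublings) lands in the billions -- roughly where chips were by the 2000s.
print(f"{projected_transistors(2300, 30):,.0f}")
```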
Table 2.4: Characteristics of the System/360 Family
Table 2.5: Evolution of the PDP-8
(Read by yourself)
PDP: Programmed Data Processor
Produced by Digital Equipment Corporation (DEC)
DEC PDP-8 Bus Structure
DEC: Digital Equipment Corporation
PDP: Programmed Data Processor
Omni (Latin) = for all (the PDP-8's backplane bus was called the Omnibus)
Later Generations
LSI: Large Scale Integration
VLSI: Very Large Scale Integration
ULSI: Ultra Large Scale Integration
Key developments: semiconductor memory and microprocessors
Semiconductor Memory
In 1970 Fairchild produced the first relatively capacious semiconductor memory:
Chip was about the size of a single core
Could hold 256 bits of memory
Non-destructive read
Much faster than core
In 1974 the price per bit of semiconductor memory dropped below the price per bit of core memory.
There has been a continuing and rapid decline in memory cost, accompanied by a corresponding increase in physical memory density.
Developments in memory and processor technologies changed the nature of computers in less than a decade.
Since 1970 semiconductor memory has been through 13 generations; each generation has provided four times the storage density of the previous generation, accompanied by declining cost per bit and declining access time.
Microprocessors
The density of elements on processor chips continued to rise
More and more elements were placed on each chip so that fewer and fewer
chips were needed to construct a single computer processor
1971 Intel developed 4004
First chip to contain all of the components of a CPU on a single chip
Birth of microprocessor
1972 Intel developed 8008
First 8-bit microprocessor
1974 Intel developed 8080
First general purpose microprocessor
Faster, with a richer instruction set and a larger addressing capability
Evolution of Intel Microprocessors
2.2- Designing for Performance
Desktop applications that require the great power of today’s
microprocessor-based systems include
• Image processing
• Speech recognition
• Videoconferencing
• Multimedia authoring
• Voice and video annotation of files
• Simulation modeling
Microprocessor Speed
Techniques built into contemporary processors include:
Pipelining – the processor moves data or instructions into a conceptual pipe, with all stages of the pipe processing simultaneously.
Branch prediction – the processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next.
Data flow analysis – the processor analyzes which instructions are dependent on each other's results, or data, to create an optimized schedule of instructions.
Speculative execution – using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations, keeping execution engines as busy as possible.
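For the pipelining entry, the standard idealized model (not derived in this chapter) makes the benefit concrete: with k stages and n instructions, sequential execution takes n·k stage-times while a full pipeline takes k + (n − 1), so the speedup approaches k for long instruction streams. A sketch:

```python
def pipeline_speedup(n: int, k: int) -> float:
    """Idealized speedup of a k-stage pipeline over sequential execution
    for n instructions: n*k stage-times sequentially vs. k + (n - 1) pipelined."""
    return (n * k) / (k + n - 1)

print(round(pipeline_speedup(100, 5), 2))  # 4.81 -- close to the 5-stage limit
```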
Performance Balance
Adjust the organization and architecture to compensate for the mismatch among the capabilities of the various components.
Architectural examples include:
Increase the number of bits that are retrieved at one time by making DRAMs "wider" rather than "deeper" and by using wide bus data paths
Reduce the frequency of memory access by incorporating increasingly complex and efficient cache structures between the processor and main memory
Change the DRAM interface to make it more efficient by including a cache or other buffering scheme on the DRAM chip
Increase the interconnect bandwidth between processors and memory by using higher-speed buses and a hierarchy of buses to buffer and structure data flow
Typical I/O Device Data Rates
Improvements in Chip Organization
and Architecture
Increase hardware speed of processor
Fundamentally due to shrinking logic gate size
More gates, packed more tightly, increasing clock rate
Propagation time for signals reduced
Increase size and speed of caches
Dedicating part of processor chip
Cache access times drop significantly
Change processor organization and architecture
Increase effective speed of instruction execution
Parallelism
Problems with Clock Speed and Logic Density
Power
Power density increases with density of logic and clock speed
Dissipating heat
RC (Resistance and Capacitance) delay
Speed at which electrons flow limited by resistance and capacitance of
metal wires connecting them
Delay increases as RC product increases
Wire interconnects thinner, increasing resistance
Wires closer together, increasing capacitance
Memory latency
Memory speeds lag behind processor speeds
Processor Trends
2.3- Multicore, MICs, and GPGPUs
Multicore CPU: a CPU with several cores running concurrently
MIC: Many Integrated Core
GPGPU: General-Purpose Graphics Processing Unit
Multicore
The use of multiple processors on the same chip provides the potential to increase performance without increasing the clock rate.
Strategy is to use two simpler processors on the chip rather than one more complex processor.
With two processors, larger caches are justified.
As caches became larger, it made performance sense to create two and then three levels of cache on a chip.
Many Integrated Core (MIC) and Graphics Processing Unit (GPU)
MIC:
A leap in performance, as well as challenges in developing software to exploit such a large number of cores
The multicore and MIC strategy involves a homogeneous collection of general-purpose processors on a single chip
GPU:
A core designed to perform parallel operations on graphics data
Traditionally found on a plug-in graphics card, it is used to encode and render 2D and 3D graphics as well as process video
Used as a vector processor for a variety of applications that require repetitive computations
Read by Yourself
2.4- The Evolution of the Intel x86 Architecture
2.5- Embedded Systems and the ARM
Some definitions:
CISC: Complex Instruction Set Computer — the CPU is equipped with a large set of instructions
RISC: Reduced Instruction Set Computer — the CPU is equipped with basic instructions only, based on the idea that a complex instruction can be built from several basic instructions
ARM: Advanced RISC Machine
2.6- Performance Assessment
Factors affecting computer performance:
Clock Speed and Instructions per Second
Instruction execution rate
Methods: Benchmarks
Some laws: Read by yourself
Amdahl’s Law
Little’s Law
System Clock
- Digital devices need pulses to operate. Pulses are created by a clock generator (hardware using a crystal oscillator).
- The rate of pulses is known as the clock rate, or clock speed.
- The time between pulses is the cycle time.
- One increment, or pulse, of the clock is referred to as a clock cycle, or a clock tick.
- Unit: cycles per second, hertz (Hz).
- Operations performed by a processor, such as fetching an instruction, decoding the instruction, performing an arithmetic operation, and so on, are governed by the system clock.
- Higher clock rate → higher performance (other factors being equal).
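The clock-rate/cycle-time relation is a simple reciprocal; a one-line sketch:

```python
def cycle_time_s(clock_rate_hz: float) -> float:
    """Cycle time in seconds is the reciprocal of the clock rate in Hz."""
    return 1.0 / clock_rate_hz

# A 2 GHz clock ticks every 0.5 nanoseconds.
print(cycle_time_s(2e9))  # 5e-10
```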
Instruction Execution Rate
- Unit: MIPS (millions of instructions per second)
- Unit: MFLOPS (floating-point performance, expressed as millions of floating-point operations per second)
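The textbook defines the MIPS rate as Ic / (T × 10⁶) = f / (CPI × 10⁶), where Ic is the instruction count, T the execution time, f the clock rate, and CPI the average cycles per instruction. A sketch of the second form:

```python
def mips_rate(clock_rate_hz: float, cpi: float) -> float:
    """MIPS rate = f / (CPI * 10^6): clock rate over average cycles
    per instruction, scaled to millions of instructions per second."""
    return clock_rate_hz / (cpi * 1e6)

# A 400 MHz processor averaging 2 cycles per instruction runs at 200 MIPS.
print(mips_rate(400e6, 2.0))  # 200.0
```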
Benchmark
- A test used to measure hardware or software
performance.
- Benchmarks for hardware use programs that test the
capabilities of the equipment
- Benchmarks for software determine the efficiency,
accuracy, or speed of a program in performing a
particular task, such as recalculating data in a
spreadsheet.
- The same data is used with each program tested, so the
resulting scores can be compared to see which programs
perform well and in what areas.
Benchmarks …
For example, consider this high-level language statement:
A = B + C /* assume all quantities in main memory */
With a traditional instruction set architecture, referred to as a complex instruction set computer (CISC), this instruction can be compiled into one processor instruction:
add mem(B), mem(C), mem(A)
On a typical RISC machine, the compilation would look something like this:
load mem(B), reg(1);
load mem(C), reg(2);
add reg(1), reg(2), reg(3);
store reg(3), mem(A)
(The two code sequences may need the same amount of time when they execute on two different machines.)
Benchmark
- The design of fair benchmarks is something of an art,
because various combinations of hardware and software
can exhibit widely variable performance under different
conditions. Often, after a benchmark has become a
standard, developers try to optimize a product to run that
benchmark faster than similar products run it in order to
enhance sales (MS Computer Dictionary)
Beginning in the late 1980s and early 1990s, industry
and academic interest shifted to measuring the
performance of systems using a set of benchmark
programs
Desirable Benchmark Characteristics
1. It is written in a high-level language, making
it portable across different machines.
2. It is representative of a particular kind of
programming style, such as system
programming, numerical programming, or
commercial programming.
3. It can be measured easily.
4. It has wide distribution.
System Performance Evaluation Corporation (SPEC)
Benchmark suite
A collection of programs, defined in a high-level language
Attempts to provide a representative test of a computer in a
particular application or system programming area
SPEC
An industry consortium
Defines and maintains the best known collection of benchmark
suites
Performance measurements are widely used for comparison and
research purposes
SPEC CPU2006
Best-known SPEC benchmark suite
Industry-standard suite for processor-intensive applications
Appropriate for measuring performance for applications that spend most of their time doing computation rather than I/O
Consists of 17 floating-point programs written in C, C++, and Fortran and 12 integer programs written in C and C++
Suite contains over 3 million lines of code
Fifth generation of processor-intensive suites from SPEC
Amdahl's Law (Read by yourself)
Gene Amdahl [AMDA67]
Deals with the potential speedup of a program using multiple processors compared to a single processor
Illustrates the problems facing industry in the development of multicore machines
Software must be adapted to a highly parallel execution environment to exploit the power of parallel processing
Can be generalized to evaluate and design technical improvement in a computer system
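Amdahl's law states that if a fraction f of a program can be sped up by a factor N (for example, run on N processors), the overall speedup is 1 / ((1 − f) + f/N). A sketch showing the bound it imposes:

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Overall speedup when fraction f of the work is sped up by factor n;
    the remaining (1 - f) fraction runs at the original speed."""
    return 1.0 / ((1.0 - f) + f / n)

# With 90% of the work parallelizable, 8 processors give under 5x,
# and no processor count can exceed 1 / (1 - 0.9) = 10x.
print(round(amdahl_speedup(0.9, 8), 2))   # 4.71
print(round(amdahl_speedup(0.9, 1e9), 2)) # 10.0
```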
Little's Law (Read by yourself)
The general setup is that we have a steady state system to which items arrive at an
average rate of λ items per unit time. The items stay in the system an average of W
units of time. Finally, there is an average of L units in the system at any one time.
Little’s Law relates these three variables as L = λ W.
Fundamental and simple relation with broad applications
Can be applied to almost any system that is statistically in steady state, and in which
there is no leakage
Queuing system
If server is idle an item is served immediately, otherwise an arriving item joins a
queue
There can be a single queue for a single server or for multiple servers, or multiple queues with one for each of multiple servers
Average number of items in a queuing system equals the average rate at which items
arrive multiplied by the time that an item spends in the system
Relationship requires very few assumptions
Because of its simplicity and generality it is extremely useful
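The relation L = λW is easy to check numerically; a toy sketch with made-up numbers:

```python
def little_L(arrival_rate: float, avg_time_in_system: float) -> float:
    """Little's Law: average items in system L = lambda * W."""
    return arrival_rate * avg_time_in_system

# Items arrive at 50 per second and spend 0.2 s in the system on average,
# so on average 50 * 0.2 = 10 items are in the system at any instant.
print(little_L(50.0, 0.2))  # 10.0
```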
Questions (Use your notebook)
Building blocks: Composition and operating of vacuum tube/transistor
2.1 What is a stored program computer?
2.2 What are the four main components of any general-purpose computer?
2.3 At the integrated circuit level, what are the three principal constituents of a computer
system?
2.4 Explain Moore’s law.
2.5 List and explain the key characteristics of a computer family.
2.6 What is the key distinguishing feature of a microprocessor?
2.7 Refer to Table 2.1
Summary
Chapter 2: Computer Evolution and Performance
First generation computers: vacuum tubes
Second generation computers: transistors
Third generation computers: integrated circuits
Performance designs: microprocessor speed, performance balance, chip organization and architecture
Multicore, MICs, GPGPUs
Performance assessment: clock speed and instructions per second, benchmarks, Amdahl's Law, Little's Law