Second-Generation Stack Computer Architecture
A thesis
presented to the Independent Studies Program
of the University of Waterloo
in fulfilment of the
thesis requirements for the degree
Bachelor of Independent Studies (BIS)
Independent Studies
University of Waterloo
Canada
April 2007
Declaration
I authorize the University of Waterloo to lend this thesis to other institutions or individuals for
the purpose of scholarly research.
Signature:
I further authorize the University of Waterloo to reproduce this research paper by photocopy-
ing or other means, in total or in part, at the request of other institutions or individuals for the
purpose of scholarly research.
Signature:
The work in this research paper is based on research carried out in the Independent Studies
Program at the University of Waterloo, Canada. No part of this thesis has been submitted else-
where for any other degree or qualification and is all my own work unless referenced to the
contrary in the text.
Second-Generation Stack Computer Architecture
Abstract
It is commonly held in current computer architecture literature that stack-based computers were
entirely superseded by the combination of pipelined, integrated microprocessors and improved
compilers. While this claim is correct, the literature omits a second, new generation of stack computers
that emerged at the same time. In this thesis, I develop historical, qualitative, and quantitative
distinctions between the first and second generations of stack computers. I present a rebuttal
of the main arguments against stack computers and show that they are not applicable to those
of the second generation. I also present an example of a small, modern stack computer and
compare it to the MIPS architecture. The results show that second-generation stack computers
have much better performance for deeply nested or recursive code, but are correspondingly
worse for iterative code. The results also show that even though the stack computer’s zero-
operand instruction format only moderately increases the code density, it significantly reduces
instruction memory bandwidth.
Acknowledgements
Firstly, thanks go to my family, immediate and extended, who have always given me the leeway
and support I needed, who always believed in me.
Sometime in 2000, Ralph Siemsen and Andrew E. Mileski introduced me to the Forth
programming language, which changed my view of programming. Soon after, I discovered the
microprocessors of Chen-Hanson Ting, Jeff Fox, and Charles H. (Chuck) Moore, which did
the same for my view of computer hardware. Aaron Holtzman suggested I play with FPGA
simulations of these computers, and graciously bore all my grumblings about broken Verilog
compilers. At the same time, I had stimulating email discussions with Myron Plichota and
Jecel Mattos de Assumpcao Jr. which led to some of the new ideas in this thesis.
It was Sheryl Cronk who eventually gave me the arguments and reasons to return to Uni-
versity. Many friends bought my old junk and helped me move. For this kick-start and support,
I am forever grateful.
Once at Waterloo, Professor Chrysanne DiMarco became my Adviser. Her thorough knowl-
edge of the English language and of the customs of academia improved me greatly. Thus, I
must atone by myself for any linguistic errors in this thesis. Professors Giuseppe Tenti and
Barry Ferguson unrusted and expanded my mathematical skills. Professor Manoj Sachdev and
his PhD students, Shahab Ardalan and Bhaskar Chatterjee, took much time both inside and
outside of class to discuss the details of VLSI circuitry with me. Professor Mark Aagaard
helped me gain a broader perspective on computer architecture and led me to the class of Pro-
fessor Paul Dasiewicz who taught me more about the subject. PhD candidate Brad Lushman
took time to help me with my exploration of programming languages. I also thank Professor
Anne Innis Dagg of Independent Studies, whose course on independent research rounded me
out well.
Outside of class, the denizens of the Computer Science Club provided both enthusiastic
discussions and gave me a chance to make my work heard. Mike Jeays provided me with a
useful and rare primary source for the KDF9, his favourite computer. Professor Steven M.
Nowick of Columbia University helped me understand his MINIMALIST synthesis tool.
The wheels of Independent Studies were kept turning by Professors Bill Abbott and Richard
Holmes and especially by Susan Gow, who provided endless enthusiasm, countless good ex-
amples, and sage advice.
The writing of this thesis was supervised by Dr. Andrew Morton here at Waterloo, and by
Professor J. Gregory Steffan of the University of Toronto. I am very grateful for their feedback
and guidance.
And finally, thanks to Joy, my fiancée. You brighten my life. You make me happy.
The years ahead with you glow with promise and adventure.
“But the speed was power, and the speed was joy, and the speed was pure beauty.”
“If my calculations are correct, when this baby hits eighty-eight miles per hour,
you’re gonna see some serious shit.”
Contents
1 Introduction
1.1 Research Goals
1.2 Thesis Outline
1.2.1 Part I: Historical Review
1.2.2 Part II: Qualitative Arguments
1.2.3 Part III: Quantitative Arguments
I Historical Review
2 History of the First Generation of Stack Computers
2.1 Lukasiewicz and the First Generation
2.1.1 Poland: Jan Lukasiewicz (1878-1956)
2.1.2 Germany: Konrad Zuse (1910-1995)
2.1.3 Germany: Friedrich Ludwig Bauer (1924-)
2.1.4 Australia: Charles Leonard Hamblin (1922-1985)
2.1.5 USA: Robert Stanley Barton
2.2 The First Generation of Stack Computers
2.2.1 Zuse Z4
2.2.2 English Electric Co. KDF9
2.2.3 Burroughs B5000 and later models
2.2.4 International Computers Ltd. ICL2900 series
2.2.5 Hewlett-Packard HP3000
2.3 Shortcomings and Disappearance of the First Generation
2.3.1 Explicit High-Level Language Support
2.3.2 The Rise of RISC
2.3.3 Excessive Memory Traffic
2.3.4 The Need for Index Registers
3.2 The Second Generation of Stack Computers
3.2.1 NOVIX NC4016
3.2.2 Harris RTX-2000
3.2.3 Sh-BOOM (Patriot Scientific IGNITE I)
3.2.4 MuP21
3.2.5 F21
3.2.6 c18
3.3 Recent Research
3.4 Strengths and Weaknesses of the Second Generation
3.4.1 The Need for Index Registers
3.4.2 Stack Manipulation Overhead
3.4.3 Poor Support of ALGOL-like Languages
3.4.4 Reduced Instruction Memory Bandwidth and System Complexity
3.4.5 Fast Subroutine Linkage and Interrupt Response
II Qualitative Arguments
4 Distinguishing the First and Second Generations
4.1 Location of Stacks: In-Memory vs. In-Processor
4.2 Use of Stacks: Procedure Nesting vs. Expression Evaluation
4.3 Operations with Stacks: High-Level Language Support vs. Primitive Operations
6.2.4.1 Synthesizing More Complex Operations
6.2.5 Stack Manipulation
6.2.6 No-Op and Undefined
6.2.7 Instruction Format and Execution Example
6.3 State Machine and Register Transfer Description
6.3.1 Improvement: Instruction Fetch Overlap
9.2.3 Reducing Gullwing’s CPI by Executing Multiple Instructions using Generalized Instruction Folding
B.1.5 Instruction Density
B.1.6 Compiled Instruction Counts
B.2 Dynamic Analyses
B.2.1 Overall Execution
B.2.2 Executed Instruction Counts
B.2.3 Average CPI
B.2.4 Instruction Types
B.2.5 Basic Block Length
B.2.6 Data Stack Depth
B.2.7 Return Stack Depth
Bibliography
List of Tables
B.6 Compiled Instruction Counts
B.7 Overall Execution
B.8 Executed Instruction Counts
B.9 Average CPI
B.10 Instruction Types
B.11 Basic Block Length
B.12 Data Stack Depth
B.13 Return Stack Depth
List of Figures
A.1 Flight Language Kernel Memory Map
A.2 Counted String Format
List of Algorithms
Chapter 1
Introduction
I first learnt about stack computers in 2000 while working at a computer manufacturer where
co-workers introduced me to the Forth programming language, a stack-based programming
environment. Soon after, while looking for a suitable processor for a homebrew computer
system, I came across a mention of the MuP21 [MT95] in the Usenet Embedded Processor
and Microcontroller Primer and FAQ1:
The MuP21 was designed by Chuck Moore, the inventor of Forth. With the
MuP21, Forth can compile into machine code and still be Forth, because the ma-
chine code IS Forth. The MuP21 freaks out at 100 MIPS while consuming only 50
milliwatts. Not only that, the chip includes a video generator, has only about 7000
transistors (that’s right, 7000 and not 7,000,000), and costs about $20.
The assembler on this chip is a sort of dialect of Forth, as the CPU is modeled
after the Forth virtual machine. MuP21 is a MINIMAL Forth engine. [. . . ] The
CPU programs the video generator and then just manipulates the video buffer. It
is composite video out, so it only needs one pin. MuP21 is only a 40 pin chip.
I’d never heard of anything like it. It was smaller and faster, and its machine code was a
structured language! I was hooked. Understanding this type of hardware and software be-
came a hobby that ultimately led me to pursue a University degree on the topic. However, I
couldn’t simply take a Computer Engineering degree since this kind of computer is virtually
non-existent in the mainstream literature and totally absent from the curriculum. Therefore, I
had to create one under the aegis of the Independent Studies (IS) program.
The IS program is a self-directed course of study guided and vetted by a Faculty Adviser
and composed of a combination of Independent Study Units and regular courses. After two
years of study (typically), a student petitions to enter Thesis Phase and if approved, spends a
year developing a thesis on a selected topic. A successfully completed thesis grants the degree
of Bachelor of Independent Studies (BIS). Overall, IS bears more resemblance to graduate
studies than undergraduate ones.
The structure of this thesis reflects the directions I have taken throughout the IS program.
I began with broad historical explorations of stack architecture and programming languages,
complemented by regular engineering courses on digital systems, computer architecture, and
1. Copyright (c) 1997 by Russ Hersch, all rights reserved. http://www.faqs.org/faqs/microcontroller-faq/primer/
integrated circuits. These efforts eventually concentrated on defining, simulating, program-
ming, and partially implementing a particular stack computer design. In this thesis, I leave
aside the issues of programming language design and VLSI implementation to focus on the
architecture of the computer itself.
1.2 Thesis Outline
This thesis is divided into three major parts: a Historical Review, Qualitative Arguments, and
Quantitative Arguments. The first and third may be read independently. However, the second
part depends on the background provided by the first and is supported by data from the third.
low-level analyses of how they execute iterative, recursive, tail-recursive, and nested subroutine
code. The issue of pipelining Gullwing is explored as a transformation of the DLX pipeline.
Gullwing is found to have a definite advantage at subroutine calls and memory bandwidth, but
is unfortunately architecturally equivalent to a DLX processor without load or branch delay
slots, with the same penalty to performance.
Chapter 8 addresses Gullwing’s inefficient usage of memory for holding compiled code by
adding a third stack to temporarily hold instructions during subroutine calls. This new archi-
tectural feature will increase the density of code to the maximum possible value and accelerate
returns from subroutines.
Finally, Section 9.2 outlines the addition of stacks to a MIPS processor, without altering the
pipeline or instruction set, in order to give it the efficient subroutine call mechanism of a stack
computer. This section also introduces the addition of two forms of parallelism to Gullwing:
one which reduces its instruction count with compound stack operations, and the other which
reduces its CPI by overlapping the execution of instructions.
Appendix A provides the source to the Flight language kernel and the software used to
benchmark Gullwing. Appendix B contains the tabulated raw data from the analyses of the
dynamic and static properties of Gullwing machine code.
Part I
Historical Review
Chapter 2
History of the First Generation of Stack Computers
I present here the first generation of stack computers in the context of the pioneers of the field
and of the machines that followed their insights. I discuss the organization and design goals of
these computers, their shortcomings, and ultimately their replacement by RISC designs.
[Figure: the stack ("cellar") contents at successive steps of evaluating an expression: / , /+ , /+5 , /+55 , / 10 , / 10 2 , 5]
Hardware cellars, stacks, and pushdown stores have been discussed elsewhere,
possibly as early as 1947 by Alan Turing, certainly in 1949 by Harry D. Huskey
in connection with the ZEPHYR (SWAC) computer and in 1956 by Willem L.
van der Poel in connection with the design of the MINIMA computer; in all cases
presumably for the treatment of return jumps in subroutines. [. . . ]
Figure 2.2: Fig. 1 from Bauer and Samelson German Patent #1094019
into a sequence of machine instructions. For example, the expression (5 + 5)/2 is expressed
unambiguously as 5 5 + 2 /, while the alternative interpretations (assuming no parentheses)
would be written as 5 2 / 5 + or 5 5 2 / +. Furthermore, only a single stack is required since the
operators are never waiting for their operands.
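As an illustration of this property, here is a minimal sketch of postfix evaluation with a single stack (written in Python purely for exposition; no such code appears in the machines discussed). By the time an operator is read, its operands are already the topmost elements of the stack.

    # Minimal sketch: evaluating postfix notation with a single stack.
    def eval_postfix(tokens):
        stack = []
        ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
               '*': lambda a, b: a * b, '/': lambda a, b: a / b}
        for tok in tokens:
            if tok in ops:
                b = stack.pop()              # operands are already on the stack,
                a = stack.pop()              # so the operator never waits for them
                stack.append(ops[tok](a, b))
            else:
                stack.append(float(tok))
        return stack.pop()

    print(eval_postfix("5 5 + 2 /".split()))     # (5 + 5)/2 = 5.0
    print(eval_postfix("5 5 2 / +".split()))     # 5 + (5/2) = 7.5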
Hamblin expanded upon this insight in a pair of 1957 papers [Ham57a][Ham57b]1 (reprinted
[Ham85]). In summary:
It is now possible to visualize the general shape a machine designed to use such
code might take. It would have a ’running accumulator’ and ’nesting register’ as
described, and a number-store arranged on something like the pattern indicated,
[. . . ]
The running accumulator is a stack and is equivalent to Bauer’s Number Cellar. The nesting
register is of the same structure but holds the return addresses of subroutines. This separation
of evaluation and flow-control into two stacks, which are also separate from main memory, is
the main architectural feature of second-generation stack computers (Figure 4.3).
Some employees of the English Electric Co. were present when Hamblin delivered his first
paper [All85][Dun77]. They integrated his ideas into their next computer, the KDF9.
1. This is a slightly abridged form of the first paper.
2.2 The First Generation of Stack Computers
There were many more machines of this type than those I’ve enumerated here. I’ve mentioned
the ones that are directly linked to the people of the previous section or which have been notable
in industry. A much larger list can be found in Koopman’s book [Koo89, App.A].
2.2.1 Zuse Z4
The Z4 is too simple to fit into either the first or second generation of stack computers, not
being a stored-program machine, but it is the earliest one known and thus deserves mention.
Originally built in 1945, it was damaged during World War II, and later rebuilt in 1950. It
currently resides in the Deutsches Museum in Munich, Germany.
Like many of Zuse’s computers, the Z4 was designed to perform engineering calculations.
Its program was read from a punched plastic tape2, and it included a second tape reader as a
form of subroutine call. Its ingenious mechanical main memory held 64 32-bit floating-point
numbers which could be loaded into a stack of 2 elements, operated upon, and then stored back.
It supported a full complement of basic arithmetic operations, including square root and some
built-in constants such as π. Its 8-bit instruction format was zero-operand, with one-operand
loads and stores for direct addressing of memory. No address calculations or access to the
stack pointer were possible. It supported conditional stop, skip, and call instructions. Figure
2.3 shows the programming model for the Z4. It is a simplified reproduction from Blaauw and
Brooks’ book [BFPB97, fig. 10-29].
[Figure 2.3: Z4 programming model — a two-level stack (Stack Level 0 and Stack Level 1) feeding the ALU]
2. Used movie film, in fact!
2.2.2 English Electric Co. KDF9
The design of the KDF9 (Figure 2.5) [Eng63, fig. 2] was inspired by Hamblin’s first paper on
stack-based computing [Ham57a] and thus uses a pair of stacks for its operation. The Nesting
Store was a 19-deep hardware stack, with the top two elements visible to the Arithmetic Unit,
upon which expressions were evaluated. The Sub-Routine Jump Nesting Store was similar,
but only 16-deep with only the top-most element visible. Neither of these stacks extended into
main memory. All storage locations were 48 bits wide.
A third set of 16 stores, named 'Q-Stores' (Figure 2.4) [KDF61, pg. 9], was used for
random access storage, address modification, loop counting, and I/O operations. The first store,
Q0, was a read-only zero register. The remainder were either used as single 48-bit registers,
or triads of 16-bit registers, with direct or accumulative storage. The 16-bit sub-registers could
also be used respectively as modifier, increment, and counter. An access to main memory
could have its address augmented by the modifier. Afterwards, the modifier could then be
incremented by the increment and the counter decremented by one. With a jump instruction to
test the counter, this made for efficient loops and array processing. The counter of a Q-Store
could also hold the amount of positive or negative shift for shift instructions. Finally, a Q-Store
could hold a device number and the start and end addresses of an area in memory in preparation
for an automated I/O operation.
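A rough behavioural sketch of this mechanism follows (Python, for illustration only; the field names mirror the prose above, not the actual KDF9 instruction encoding). It shows how a modifier/increment/counter triad supports an array loop.

    class QStore:
        # Behavioural model of one Q-Store used as a modifier/increment/counter triad.
        def __init__(self, modifier, increment, counter):
            self.modifier, self.increment, self.counter = modifier, increment, counter

        def fetch(self, memory, base):
            # Fetch memory[base + modifier], then step the modifier and counter.
            value = memory[base + self.modifier]
            self.modifier += self.increment   # address modification for the next pass
            self.counter -= 1                 # tested by the loop's jump instruction
            return value

    memory = {100 + i: i * i for i in range(8)}      # a small array at address 100
    q = QStore(modifier=0, increment=1, counter=8)
    total = 0
    while q.counter > 0:                             # the jump that tests the counter
        total += q.fetch(memory, base=100)
    print(total)                                     # 0 + 1 + 4 + ... + 49 = 140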
The English Electric Co. went through a series of acquisitions and mergers, eventually
forming International Computers Ltd. in 1968 [Lav80]. However, by then their focus seemed
to have changed to competing with Burroughs’ B5000 series and IBM’s System/360 [Dun77]
and so the dual-stack approach, and the KDF9, was dropped entirely.
The KDF9 is an oddity. Historically, it is a first-generation stack computer. However, based
on the distinguishing criteria for first and second-generation stack computers (Chapter 4), it
falls squarely into the second. Had it not been discontinued, the first and second generations
might have existed in parallel.
3. The variations 'KDF.9' and 'KDF-9' are also used.
Figure 2.5: KDF9 Block Diagram
2.2.3 Burroughs B5000 and later models
The B5000 spawned an entire series of stack computers, all aimed at the direct and efficient
execution of the ALGOL language. They were complex multiprocessing systems with tagged
memory and descriptors for primitive data types and automatic management of subroutine
parameters. I will concentrate here on the design and use of the single, in-memory stack that
governed the execution of a program. This feature is essentially unchanged across the entire
series.
Figure 2.6 shows the implementation of the stack in the B6900 [Bur81, Sec.3]. The stack
memory area is delimited by the contents of the Bottom Of Stack (BOS) and Limit Of Stack
(LOS) registers. The current subroutine area is indicated by the F register which points to
a Mark Stack Control Word (MSCW). This word contains the context information necessary
to return to the subroutine’s caller. The topmost stack element in use is pointed to by the S
register.
The A and B registers are a working cache for the top of the stack and are connected to
the ALU. They are extended by the X and Y registers for double-precision calculations. Their
contents are loaded and unloaded as required by each operation in progress and so their entire
operation is transparent to the program. They are not part of the stack proper since they are
flushed whenever the top of the stack is altered by some operation, such as a subroutine call,
and so cannot be used to pass parameters.
The B7700 added a 32-entry circular stack buffer between main memory and the A and B
registers (Figure 2.7) [Bur73, Sec.2]. This is a genuine buffer that is transparent to the program,
and is only flushed if the processor registers (including the S, F, LOSR, and BOSR registers)
are altered with a SPRR (Set Processor Register) or a MVST (Move To Stack) operation, or
in the case of an atomic memory exchange with the top of the stack using RDLK (Read With
Lock).
Both computation and subroutine linkage were done on the same stack in a manner specif-
ically designed to support the structure of the ALGOL programming language. When a sub-
routine or nested block of code was to be entered, a MSCW was placed on the stack, followed
by the parameters to the subroutine, followed by a Return Control Word (RCW) which saved
the condition flags, amongst other things. The local variables and temporary values were then
allocated above all this. Only at this point could the Enter (ENTR) operator be executed to
enter the subroutine.
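The following sketch (Python, illustrative only; the word contents are placeholders rather than the real B6900 word formats) models the call sequence just described:

    stack = []

    def call(params, locals_needed):
        # Build the activation record: MSCW, parameters, RCW, then locals/temporaries.
        f = len(stack)                                # the F register points at the MSCW
        stack.append(("MSCW", "caller context"))      # linkage back to the caller
        stack.extend(params)                          # parameters to the subroutine
        stack.append(("RCW", "condition flags"))      # saved flags and return information
        stack.extend([None] * locals_needed)          # locals and temporaries above all this
        return f                                      # ENTR would now enter the subroutine

    def ret(f):
        del stack[f:]                                 # discard the whole activation record

    frame = call(params=[10, 20], locals_needed=3)
    print(len(stack))                                 # 7 words: MSCW + 2 params + RCW + 3 locals
    ret(frame)
    print(len(stack))                                 # 0: back to the caller's stack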
In 1986, the Burroughs Corporation merged with the Sperry Corporation into the Unisys
Corporation [Ros87]. The B5000 series of computers continues in the company’s ClearPath
line of mainframes.
Figure 2.6: B6900 Top-of-Stack and Stack Bounds Registers
2.2.4 International Computers Ltd. ICL2900 series
The ICL2900 series, introduced in 1974, was fairly similar to the Burroughs computers save for
a lack of tagged memory and a different approach to the use of the stack. Its design is derived
from the Manchester MU5 [IC78]. The ICL computers were accumulator-based with a stack
that was explicitly referenced by the programmer. Figure 2.8 shows a comparison between the
stacks of the B6700 and the ICL2900 [Dor75a]. The top three stack elements were buffered in
registers [Dor75b, Chu75].
2.2.5 Hewlett-Packard HP3000
The HP 3000 series was originally introduced in 1972. It is similar to the Burroughs computers,
but has been simplified to support real-time response [McK80, Sto80]. Figure 2.9 shows the
structure of its stack. The main difference lies in the use of the four-element circular stack
buffer. Unlike the B7700, the current top two elements of the stack buffer feed the ALU
directly and the S register points to the current head of the buffer instead of main memory.
Like the Burroughs computers, the buffer is managed automatically and flushed on subroutine
calls [Bla77]. In later models (Series 68), the stack buffer was expanded to eight elements
[Hew84, pg.86]. This series of computers was being sold by Hewlett-Packard, under the name
’e3000’, up until November 2001.
2.3 Shortcomings and Disappearance of the First Generation
During their heyday of about 20 years, stack computers were quite possibly the most sophis-
ticated general-purpose computers available. But in retrospect, they had several glaring short-
comings which were endemic in machines of the time.
• The machines were slow, leading to long compilation times, only to end up with sub-
optimal code!
• Since code was written mostly by hand, and programs were getting larger and harder to
write (including compilers), supporting a high-level language helped the programmer.
These led to two major features: hardware support for the execution models of structured
languages such as ALGOL, and the integration of complex functions in the instruction set,
implemented as microcode, making it easier to program the computer directly.
These features became weaknesses over time. A computer designed to execute one lan-
guage well would perform poorly with another [Org73, ch.8]. As compilers improved they
generated simpler subroutine linkages that did not match the full-featured built-in ones [HP02,
2.14]. The compilers also could not use the complex instructions provided. Finally, the mi-
crocode for these computers had itself grown to the point of unmanageability [Pat85] (reprinted
[Pat86, FL86]).
Eventually, compilers became able to effectively reduce high-level language features into
series of simple operations, and the RISC computers that followed were designed in that light.
• Ditzel and Patterson criticized the original arguments for High-Level Language Com-
puter Systems (HLLCS) [DP80] (reprinted [DP86, FL86], and [DP98b] with updated
comments [DP98a]), and concluded that “. . . almost any system can be a HLLCS through
the appropriate software. . . ”. They also wrote an overview of the arguments for reduced
instruction sets [PD80].
• Patterson later wrote an extremely broad article on the features and successes of the early
RISC experiments, including software measurements and compiler techniques [Pat85]
(reprinted [Pat86, FL86]), and advocated considering implementation as a factor in computer
architecture.
• At Berkeley, Patterson and Sequin headed the RISC I and RISC II projects as one ap-
proach to RISC designs [PS81] (reprinted [PS98a] with updated comments [PS98b]).
The fine details of their implementation were presented in the PhD thesis of one of their
students, Manolis Katevenis [Kat85].
• One of the premises of RISC design is that the hardware and the software must be con-
sidered together. Hennessy and Jouppi measured the low-level features of software and
proposed some architectural guidelines to support them without the pitfalls of past high-
level language support. These included “. . . the use load/store architecture and the ab-
sence of condition codes. . . ”. [HJB+ 82]. These data guided the Stanford MIPS project
[HJP+ 82].
• The major technological change of the time was the emergence of Very Large Scale In-
tegrated (VLSI) circuits which made possible the implementation of an entire processor
on a single chip. The various approaches to integrated architecture are discussed by
Hennessy [Hen84] (reprinted [Hen86, FL86]).
The B7700 was likely the only first-generation stack computer to address this problem by including
a genuine 32-entry buffer (Figure 2.7) for the top of the stack [Bur73, pg.3-36].
• The KDF9 was the first to do so by including the Q-Stores [Hal62] (Figure 2.4). They
were abundant (16) and could also be used as general-purpose registers.
• The B5000 series encoded loop counts in special instructions, such as BEGIN LOOP
(BLP), END LOOP (ELP), and JUMP OUT LOOP CONDITIONAL (JLC), which re-
used some internal registers to hold addresses while in Character Mode4 [Bur63] [Bur67].
• The B7700 added a vector mode of operations in which the index for one loop and the
addresses and increments for up to three arrays were stored in separate internal registers
so as to free the stack for computations [Bur73, pg.3-112].
• One of the three top-of-stack registers of the ICL 2900 could be used as an index register
[Dor75a].
• The HP 3000 series had a single index register (X) to support loops [Hew84].
4. Character Mode processed 6-bit Binary-Coded Decimal numbers, while Word Mode processed binary 48-bit numbers.
Chapter 3
In this chapter, I present the second generation of stack computers in the context of the pioneers
of the field and of the machines that followed their insights. At the same time that the first
generation of stack computers was fading away in the light of RISC, the second generation
began to emerge and found a niche in embedded control systems instead of general-purpose
computing.
Burroughs B5500 computer was the influence for the use of a stack for expression evaluation
[Moo91]. The best introduction to Forth and the methodologies it favours is a pair of books
by Leo Brodie [BI86] [Bro84].
A Forth system is divided into two interpreters. The outer interpreter receives source input
and looks up each word in a dictionary. If found, it calls the inner interpreter to process the
word’s definition. In the most basic case a word definition is a series of addresses of other
words, themselves defined in the same manner, ending with primitives which are written in
machine code2 . The inner interpreter is a small virtual machine which walks through these
definitions and executes the primitives it encounters. The inner interpreter keeps track of the
nesting of these definitions on a stack in memory, commonly referred to as the Return Stack.
The Forth primitives do their operations on another such stack, the Data Stack, where they
take their arguments and place their results. The primitives are thus simple function applica-
tions which can be composed by executing them in sequence. Higher-level words are functional
compositions of these primitives. These new words interact with the stack and compose in the
same manner as the primitives.
A second-generation stack computer is basically a physical realization of the inner inter-
preter, the Forth primitives, and the stacks. The primitives become the instruction set which
operates on a hardware Data Stack. The inner interpreter reduces to simple call and return
instructions which use a Return Stack to store the return addresses of subroutines.
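The following sketch models this structure (Python, for exposition only; a real Forth system threads addresses through memory rather than Python lists, and the word names here are hypothetical). Primitives act on the Data Stack, high-level words are lists of other words, and the inner interpreter tracks nesting on the Return Stack.

    data_stack, return_stack = [], []

    # Primitives: in a real system these would be machine code.
    def lit(n): return lambda: data_stack.append(n)            # push a literal
    def add():  data_stack.append(data_stack.pop() + data_stack.pop())
    def mul():  data_stack.append(data_stack.pop() * data_stack.pop())
    def dup():  data_stack.append(data_stack[-1])
    def dot():  print(data_stack.pop())

    # High-level words: definitions composed of primitives and other words.
    SQUARE  = [dup, mul]                                       # : SQUARE  DUP * ;
    PROGRAM = [lit(3), SQUARE, lit(4), SQUARE, add, dot]       # 3 SQUARE 4 SQUARE + .

    def execute(word):
        # Inner interpreter: walk definitions, nesting via the Return Stack.
        code, ip = [word], 0
        while True:
            if ip == len(code):              # end of a definition: return to the caller
                if not return_stack:
                    return
                code, ip = return_stack.pop()
                continue
            w, ip = code[ip], ip + 1
            if callable(w):                  # a primitive: execute it directly
                w()
            else:                            # a high-level word: nest into its definition
                return_stack.append((code, ip))
                code, ip = w, 0

    execute(PROGRAM)                         # prints 25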
2. This is known as 'indirect-threaded code'. There exist also direct-threaded, string-threaded, token-threaded, and subroutine-threaded versions, each with different size/speed trade-offs. Second-generation stack computers are subroutine-threaded systems.
3.2 The Second Generation of Stack Computers
I’m concentrating here on the computers primarily designed by Chuck Moore. There are many
more machines than the ones listed here (see Koopman’s book [Koo89, App.A]), but Moore’s
work was by far the most ground-breaking and influential. Much less was published about his
machines than those of the first generation. Therefore, the descriptions here are mostly based
on information found in Koopman’s book, reference manuals, and unpublished documentation.
to/from memory as required. The implementation used about 9000 gates. Figure 3.2 shows a
block diagram of the main processor [Sha02, fig.1].
Unlike the NC4016 or the RTX-2000, the Sh-BOOM did not use an unencoded instruction
format; instead, it packed four 8-bit, Forth-like instructions into each memory word. This formed
a simple instruction cache that allowed instructions to be executed while the next memory fetch
was in progress. This also allowed very small loops to execute from within the cache without
requiring an instruction fetch. Another interesting feature was the use of conditional SKIP
instructions which, if the condition was met, would skip over the remainder of the instructions
in the memory word. Conditional jumps and calls were implemented this way [GWS91].
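A small sketch of this packing scheme follows (Python; the opcode values and the tested condition are made up for illustration and are not the actual Sh-BOOM encoding). It shows four 8-bit slots per word and a conditional SKIP that abandons the rest of the word when its condition is met.

    INC, SKIPZ = 0x01, 0x02                   # hypothetical opcodes

    def run_word(word, acc):
        # Execute the four 8-bit slots of one packed 32-bit word, high slot first.
        slots = [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]
        for op in slots:
            if op == SKIPZ and acc == 0:
                break                          # condition met: skip the rest of the word
            if op == INC:
                acc += 1
        return acc

    word = (INC << 24) | (SKIPZ << 16) | (INC << 8) | INC
    print(run_word(word, acc=0))    # SKIPZ sees acc == 1, so both trailing INCs run: 3
    print(run_word(word, acc=-1))   # first INC makes acc == 0, SKIPZ skips the rest: 0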
The Sh-BOOM broke away from a pure stack architecture by including 16 general-purpose
registers (g0-g15), and by making most of the on-chip return stack addressable as registers (r0,
r1, etc. . . ). The general-purpose registers were used for temporary storage and for I/O opera-
tions, much like the KDF9. To support the stack frames of ALGOL-like languages, instead of
simply pushing values on the return stack, a number of empty locations could be allocated in
one step and then later filled from the general-purpose registers.
The Sh-BOOM design is currently being marketed by Patriot Scientific4 as the IGNITE I
[Sha02] (previously PSC1000 [Sha99]) processor core, targeted at embedded Java applications.
It is the most sophisticated second-generation stack computer currently available.
4. http://www.ptsc.com/
Figure 3.2: IGNITE I Block Diagram
3.2.4 MuP21
First published in 1995 [MT95], the MuP21 was the first of a family of chips dubbed Minimal
Instruction Set Computers (MISC). Like the Sh-BOOM, it used a packed instruction format
which held four 5-bit instructions in a 20-bit memory word. The internal stacks and ALU were
21 bits wide to support arithmetic carry. The data stack was only six cells deep, and the return
stack was a mere 4 cells deep. An address register (A) was added for temporary storage and
for memory accesses.
Like the Sh-Boom, the MuP21 also contained a small auxiliary processor which had pri-
ority access to the memory bus. However, it was a video processor which executed its own
instruction set, also 5-bits wide, which was tailored for creating a 16-colour NTSC signal at an
output pin. A frame of video was described by a block of these instructions which could be
manipulated by the main processor.
Amazingly, the entire implementation used only 7000 transistors in a 1.2u process, had a
typical speed of 80 MIPS, and dissipated only 50 mW. There are hints that the design was
fully asynchronous. The MuP21 was an influential design. Its simplicity made it an ideal
choice for undergraduate and hobbyist projects, usually as FPGA implementations (for exam-
ple: [HH00]).
3.2.5 F21
Jeff Fox formed UltraTechnology in Berkeley in 1990 to develop a custom computer in
collaboration with Chuck Moore. The result was the F21 microprocessor [Fox98], which was
an extension of the MuP21. The instruction set added post-incrementing loads and stores from
the address register and the top of the return stack, and some extra arithmetic operations. The
stacks were expanded to about 17 cells each. Like the MuP21, the memory interface was
20 bits wide and values were stored internally as 21 bits.
Like the MuP21, the F21 contained a video coprocessor, and added similar coprocessors
for analog I/O and serial networks. Some common routines were included in on-chip ROM.
In a 0.8u process, the F21 was implemented in about 15,000 transistors, and had a typical
execution rate of 100 MIPS (peaking internally at 500 MIPS), depending on memory access
time. UltraTechnology ceased to exist in 2002 with nothing formally published about the design
and only some prototype chips made. The website for the company5 contains some fairly
detailed documentation. For a reconstruction of what the block-level design might have been
like, see Figure 6.1.
3.2.6 c18
Around 2001, Moore took the F21 design in a new direction and produced the c18 [Moo01b,
Moo01a]. Architecturally, it is virtually identical to the F21, but adds a second address register
for on-chip communication. Its width was also reduced to 18 bits to match the fast cache
memory chips available at the time. This leaves room to pack only 3 instructions per memory
word.
The coprocessors were eliminated and replaced by a watchdog timer. There is no external
memory bus. External memory must be accessed via the parallel I/O pins, and programs must
be loaded into a few hundred words of on-chip RAM to be executed.
The c18 was simulated in a modern 1.8V 0.18u process. It had a predicted sustained
execution rate of 2400 MIPS, while dissipating about 20 mW. It was an aggressive, fully asyn-
chronous design.
The c18 was targeted at multiprocessing. A 5x5 array of c18s, connected by horizontal
and vertical buses, would fit in 7 mm². This was eventually realized as the SEAforth-24
multiprocessor currently entering production at Intellasys (Intelligent Array Systems).
formance, and instruction-level parallelism [Bai94] [BS94] [Bai00, DBL00] [Bai04]
[SB04]. His 1996 PhD thesis presented an improvement to stack spill/fill algorithms
so as to further reduce memory traffic [Bai96].
• During his Master’s studies at the University of Alberta, Robert James Chapman ex-
plored the synthesis of stack computers using VHDL [Cha97] [Cha98] and wrote a pa-
per which decomposed the usual stack permutation operations into smaller primitives
[Cha95].
• While at the Technical University of Munich, Bernd Paysan wrote his 1996 Diploma the-
sis on the 4stack processor8 [Pay96], a 4-way superscalar VLIW (Very Long Instruction
Word) design specified in Verilog. It is meant for embedded DSP (Digital Signal Pro-
cessing) applications, but has a supervisor mode and virtual memory for desktop use. He
also designed the much smaller b16 microprocessor9 [Pay02], a 12-bit version of which
is used internally at Mikron AG.
• Chung Kwong Yuen10 at the National University of Singapore has an unpublished paper
[Yue] on how to implement a reorder buffer to obtain superscalar execution in stack
computers11.
• Chen-Hanson Ting currently runs the eForth Academy12 in Taiwan, which provides de-
sign classes for embedded systems. He created the P series of microprocessors13 derived
from the works of Chuck Moore. They are described in [Tin97a, Tin97b].
• A number of students at the Chinese University of Hong Kong designed and imple-
mented two versions of a derivative of the MuP21 microprocessor: the MSL16 was
first implemented on an FPGA (Field-Programmable Gate Array) [LTL98] and later re-
implemented in silicon using asynchronous logic as the MSL16A [TCCLL99].
7. The only documentation was at http://www.stringtuner.com/myron.plichota/steamer1.htm which is now defunct, but archived in the Internet Archive Wayback Machine at http://www.archive.org/web/web.php
8. http://www.jwdt.com/~paysan/4stack.html
9. http://www.jwdt.com/~paysan/b16.html
10. http://www.comp.nus.edu.sg/~yuenck/
11. One of two papers available at http://www.comp.nus.edu.sg/~yuenck/stack
12. http://www.eforth.com.tw/
13. http://www.eforth.com.tw/academy-n/Chips.htm
14. Publication list: http://www.eforth.com.tw/academy-n/Bookstore/bookstore_4.htm
15. Published by Offete Enterprises: http://www.ultratechnology.com/offete.html
3.4 Strengths and Weaknesses of the Second Generation
The second generation of stack computers still has some of the drawbacks of the first: a need
for index registers, stack manipulation overhead, and poor support for ALGOL-like languages.
However, the second generation also has some distinct advantages: reduced
memory bandwidth and system complexity, and faster subroutine linkage and interrupt re-
sponse.
• The NC4016 buffered the topmost Return Stack element on-chip so it could be used as an
index register (I). A loop-on-index instruction would decrement I, and then conditionally
jump to the beginning of the loop. The RTX-2000 used the same mechanism.
• The Sh-Boom used a count (ct) and an index (x) register. The count register was used
by decrement-and-branch-on-non-zero instructions, and the index register was used for
direct, post-incrementing, and pre-decrementing memory accesses.
• The MuP21 used the A register to hold memory addresses which could then be moved to
the data stack, modified, and then returned to A for the next access. The alternative would
have needed a deeper stack and more stack manipulation opcodes.
• The F21 could do post-incrementing memory accesses from the A register and from the top
of the Return Stack. They were primarily meant for fast memory-to-memory transfers.
The c18 has the same mechanism.
3.4.3 Poor Support of ALGOL-like Languages
Languages which are derived from ALGOL, such as C, use the stack as a means of allocating
space for, amongst other things, the local variables of a procedure. This entails pushing entire struc-
tures onto the stack and then accessing them randomly, often through a pointer. Thus a local
variable must be present in main memory since the stack in a second-generation computer is
not addressable.
Solutions to this problem include more sophisticated compilers [Koo94] [ME97] [ME98],
the addition of some form of frame pointer register which can support indexed addressing, or
making the return stack addressable such as in the PSC1000/IGNITE I [Sha02] [Sha99].
Part II
Qualitative Arguments
Chapter 4
Distinguishing the First and Second Generations
The existing computer architecture literature considers all stack computers to be of the same
kind1. This view seems correct when contrasted against modern register-based machines. How-
ever, it conflates the first and second generations of stack computers, which makes a clear
discussion of their respective properties difficult. Distinguishing the generations is important since
the second resembles current register-based designs much more than it does the first. Without
this distinction, an entire branch of computer architecture is missed solely due to a categorical
error.
The differences between the generations stem from the two historical approaches to stack
computation. The first generation is based on Bauer’s “stack principle” for subroutine storage
allocation (Section 2.1.3) and is exemplified by the Burroughs B5000 series (Section 2.2.3).
The second generation originates in Hamblin’s method for evaluating and composing expres-
sions (Section 2.1.4), first seen in the English Electric KDF9 (Section 2.2.2), and later inde-
pendently rediscovered as the Forth programming language (Section 3.1.1.1).
The only significant exception to this conflation I’ve found is a section in Feldman and
Retter’s text [FR93, pp.599-604] which lists some of the same “first generation stack machines”
as I do in Section 2.2, then proceeds to explain in a nutshell the origin and features of the second
generation of stack computers, although they do not refer to them by that name.
In this chapter, I expand on Feldman and Retter’s statements and propose a codification
based on some properties of the stacks: their location, purpose, and the operations done upon
them.
1. Except for a passing mention in the preface of Koopman's book [Koo89] and a short summary in Bailey's PhD Thesis [Bai96].
4.1 Location of Stacks: In-Memory vs. In-Processor
“[. . . ] It is this explicit coupling of all ALU operations to a hardware stack which
sets these machines apart.”
[FR93, pg.600]
[Figures 4.1 and 4.2: a first-generation stack computer and a general-purpose register computer. In both, main memory holds the activation records of nested procedures A and B (return value, parameters, return address, and locals); the CPU holds a stack pointer and a small stack in the first case, and a set of registers in the second.]
4.2 Use of Stacks: Procedure Nesting vs. Expression Evaluation
“[. . . ] While recursion is easy to accomplish with a stack in memory, this is not
what we mean by stack machine. [. . . ]”
“[. . . ] Unlike most earlier stack machines, these Forth processors have two stacks,
one to hold temporary data and pass subroutine parameters, and the other to save
subroutine return addresses and loop counters. [. . . ]”
[FR93, pg.600]
The second distinguishing feature is the use of the stacks.
First-generation stack computers (Figure 4.1) used stacks as structured temporary storage
for program procedures. Each procedure invocation would automatically cause the allocation
of an amount of space on the stack to contain (typically) the parameters of the procedure, its
return value, its local variables, and the return address of its caller. This area is referred to as
a procedure activation record. Values from the record were loaded and stored into the small
internal stack as needed for expression evaluation. The internal stack only held the intermediate
results of computations within a given procedure. All linkage between procedures was done
solely on the stack in memory.
General-purpose register computers (Figure 4.2) use the same kind of stacks, in the same
manner, except that the procedure activation records are manually managed by software and
procedure linkage can occur through the registers if the parameters, locals, and return values
fit within them.
Second-generation stack computers (Figure 4.3) use separate stacks to control procedure
nesting and to perform expression evaluation. The return addresses of procedure calls are
stored on a stack dedicated to that purpose (Return Stack). Storing them separately helps to
eliminate the division of the other stack (Data Stack) into procedure activation records. Thus,
a called procedure finds its parameters (P) on top of the Data Stack, manipulates them directly
as locals (L) during expression evaluation, and leaves on top either its return value (R) upon
exit or the parameters for calling another procedure. The Data Stack is used for an effectively
single, large, and complex expression evaluation whose parts are tracked by the contents of the
Return Stack.
[Figure 4.3: a second-generation stack computer. The Data Stack and Return Stack are inside the CPU: the Data Stack holds the locals/parameters/return values (L/P/R) of procedures A and B, and the Return Stack holds their return addresses.]
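As a minimal sketch of this convention (Python, illustrative only; real hardware uses bounded stacks and machine addresses), a called word consumes its parameters directly off the Data Stack and leaves its result in their place, while the Return Stack holds only the return address:

    data_stack, return_stack = [], []

    def call(word, return_address):
        return_stack.append(return_address)   # procedure nesting is tracked here only
        word()                                # the callee works directly on the Data Stack
        return return_stack.pop()             # resume the caller

    def sum_of_squares():
        b = data_stack.pop()                  # parameters are simply the topmost items...
        a = data_stack.pop()
        data_stack.append(a * a + b * b)      # ...and the return value replaces them

    data_stack += [3, 4]                      # the caller leaves the parameters on top
    pc = call(sum_of_squares, return_address=0x1234)
    print(data_stack, hex(pc))                # [25] 0x1234 -- no activation record needed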
4.3 Operations with Stacks: High-Level Language Support
vs. Primitive Operations
“First generation stack machines, such as the Burroughs B5000, B5500, B6700,
B7700, Hewlett-Packard HP3000, and ICL2900, were designed to execute lan-
guages like Algol-60 [Dor75b]. Most of these had a single stack which was used
for holding temporary data, passing subroutine parameters, and saving subroutine
addresses. [. . . ]”
[FR93, pg.600]
The third distinguishing feature is the operations performed upon the stacks.
First-generation stack computers (Figure 4.1) had built-in hardware and microcode support
for high-level language (typically ALGOL) and operating system features. The procedure call
and return instructions would automatically allocate and deallocate activation records on the
stack. The arithmetic instructions would determine the data type of their operands from special
tags. Indirect reference words and data descriptors would allow for lexical scoping across ac-
tivation records and resolve call-by-name references. Multiple stacks could be maintained and
cross-referenced to enable multitasking and inter-process communication. Descriptors could
point to data on disk to support code and data paging-on-demand. The end result was powerful,
but extremely complex. These features are well-described in Organick’s book [Org73].
General-purpose register computers (Figure 4.2) have none of these language-specific fea-
tures, although they were designed with the efficient support of ALGOL-like languages in
mind. A large register file supports fast, simple procedure linkage when possible, and can hold
multiple intermediate values during expression evaluation. The instruction set is simple but can
use any registers as source and target. High-level language and system features are managed
in software by the compiler and the operating system.
Second-generation stack computers (Figure 4.3) have more in common with general-purpose
register machines than with first-generation stack computers. The call and return instructions
only save and restore the calling procedure’s return address, leaving more complex procedure
linkage to software. Arithmetic instructions operate solely on the top few elements of an inter-
nal stack and their operands must be loaded from and stored to memory explicitly. Only simple
direct and indirect addressing modes are supported, although post-incrementing/decrementing
versions are common. Other than the implicit use of stacks for basic procedure linkage and
expression evaluation, all high-level language and operating system features are implemented
in software in the same manner as in register-based machines.
Chapter 5
Additionally, the CiteSeer2 scientific literature library shows a total of 1525 citations for all
editions combined, as of January 2006.
1. http://portal.acm.org/
2. http://citeseer.ist.psu.edu/
Incidentally, the main text on second-generation stack computers [Koo89] had, according
to the ACM Guide, seven citations in the Fall of 2004 and twelve as of January 2006. It is not
listed in CiteSeer.
5.3 Expression Evaluation on a Stack
Hennessy & Patterson state:
The first point, which implies that the stack is in memory, is no longer valid. Second-generation
stack computers keep their stacks internal to the CPU (Section 4.1). Furthermore, the access
to a stack is faster than to registers since no addressing is required. The inputs to the ALU are
hardwired to the top two elements of the stack. This is put to advantageous use when pipelining
(Section 7.4).
Secondly, the claim that compilers can use registers more effectively is true only because
much research has been done on register allocation in modern compilers. Prior to the advent
of modern compiler techniques like graph colouring, registers were seen as difficult to use
and stacks were favoured. For example, this is the reason the SPARC architecture uses reg-
ister windows which effectively form a stack of activation records [PS98a, pg. 2]. There has
been promising work showing that it is possible to cache virtually all local variable memory
references onto the stack [Koo94]4 [ME97]5 [ME98]6 . Also, it could be possible to evaluate
expressions in an out-of-order fashion on a stack computer (Section 3.3).
Lastly, the final point raised is true. Operands are hidden on the stack and even with the
aforementioned compiler techniques this fact makes for poor performance on iterative code
due to the stack manipulations required (Section 7.3.3). However, a register-based computer
has the same kind of repeated memory accesses and register-to-register copying overhead when
3. The arguments in this section suggest an interesting way of thinking qualitatively about stacks: that they are the 'reciprocal' (or the 'inverse') of registers. For example, reading a register does not destroy its contents, but writing does. Conversely, reading from a stack pops information from it, but writing to it simply pushes the existing information down. Up to its maximum capacity, no information is lost. This is a tantalizing symmetry.
4. http://www.ece.cmu.edu/~koopman/stack_compiler/index.html
5. http://www.complang.tuwien.ac.at/papers/maierhofer%26ertl97.ps.gz
6. http://www.complang.tuwien.ac.at/papers/maierhofer%26ertl98.ps.gz
entering or exiting a subroutine (Section 7.3.6). This overhead is virtually nil in stack computers.
Finally, there are hints that a redesigned stack computer instruction set could combine
stack manipulations with arithmetic operations ’for free’, without adding any new datapath or
control lines (Sections 9.2.2 and 9.2.3).
More importantly, registers can be used to hold variables. When variables are
allocated to registers, the memory traffic is reduced, the program is sped up (since
registers are faster than memory), and the code density improves (since a register
can be named with fewer bits than a memory location). [HP96, pg. 71][HP02, pg.
93]
As stated in the previous section, a stack can also be used to hold variables. Since second-
generation stack computers keep their stack in the CPU, the memory traffic is reduced (Section
7.2.3) and the program sped up in the same proportion. The code density is improved to an
even greater extent since a stack does not need to be named (Section 7.2.4). No addressing bits
are required since operations implicitly use the top of the stack. Overall, these features are no
different than in register-based computers.
The authors of both the original IBM 360 paper [ABB64] and the original PDP-
11 paper [BCM+ 70] argue against the stack organization. They cite three major
points in their arguments against stacks:
1. Performance is derived from fast registers, not the way they are used.
2. The stack organization is too limiting and requires many swap and copy op-
erations.
3. The stack has a bottom, and when placed in slower memory there is a perfor-
mance loss.
[HP96, pg. 113][HP02, pg. 149]
At first glance, these points are correct when referring to first-generation stack computers: an
abundance of fast registers which can be randomly and non-destructively accessed will increase
performance by reducing memory traffic and re-ordering operations. For second-generation
stack computers however, these points are moot:
2. The limitations are partly a matter of compiler technology (Section 5.3); on the other
hand, stacks avoid the subroutine call overhead of register-based computers (Section 7.3.6).
3. This is a straw-man: All stacks have a bottom. Past experiments have shown that an
on-chip stack buffer that is 16 to 32 elements deep eliminates virtually all in-memory
stack accesses for both expression evaluation and subroutine nesting and that the number
of accesses to memory decreases exponentially for a linear increase in hardware stack
depth [Koo89, 6.4] [Bai96, 4.2.1] (Appendices B.2.6 and B.2.7).
Current Intel processors use exactly such a stack, 16 elements deep, to cache return
addresses on-chip [Int06, 2.1.2.1, 3.4.1.4], as does the Alpha AXP 21064 processor
[McL93]. Hennessy & Patterson themselves show data supporting this feature [HP96,
pg. 277] [HP02, pg. 214]. The SPARC architecture accomplishes a similar result with
its register windows.
The preceding quote is however an abridged version. The original statements by Bell et al.
were:
The System/360 designers also claim that a stack organized machine such as the
English Electric KDF 9 (Allmark and Lucking, 1962) or the Burroughs B5000
(Lonergan and King, 1961) has the following disadvantages:
1. Performance is derived from fast registers, not the way they are used.
2. Stack organization is too limiting, and requires many copy and swap opera-
tions.
3. The overall storage of general register and stack machines are the same, con-
sidering point #2.
4. The stack has a bottom, and when placed in slower memory there is a perfor-
mance loss.
5. Subroutine transparency is not easily realized with one stack.
6. Variable length data is awkward with a stack.
We generally concur with points 1, 2, and 4. Point 5 is an erroneous conclusion,
and point 6 is irrelevant (that is, general register machines have the same problem).
[BCM+ 70]
Hennessy & Patterson are simply repeating the points which are supported by the authors of
this quote. In retrospect, it is peculiar that the authors lump both the KDF9 and the B5000
together, even though they are very different machines (see Sections 2.2.2, 2.2.3, and Chapter 4). The
arguments should not apply equally to both. It turns out that the quote is an extremely abridged
version of the original statements by the System/360 designers:
Serious consideration was given to a design based on a pushdown accumulator
or stack. This plan was abandoned in favor of several registers, each explicitly
addressed. Since the advantages of the pushdown organization are discussed in
the literature, it suffices here to enumerate the disadvantages which prompted the
decision to use an addressed-register organization:
1. The performance advantage of a pushdown stack organization is derived prin-
cipally from the presence of several fast registers, not from the way they are
used or specified.
2. The fraction of “surfacings” of data in the stack which are “profitable”, i.e.,
what was needed next, is about one-half in general use, because of the oc-
currence of repeated operands (both constants and common factors). This
suggests the use of operations such as TOP and SWAP, which respectively
copy submerged data to the active positions and assist in clearing submerged
data when the information is no longer required.
3. With TOP’s and SWAP’s counted, the substantial instruction density gained
by the widespread use of implicit addresses is about equalled by that of the
same instructions with explicit, but truncated, addresses which specify only
the fast registers.
4. In any practical implementation, the depth of the stack has a limit. The reg-
ister housekeeping eliminated by the pushdown organization reappears as
management of a finite-depth stack and as specification of locations of sub-
merged data for TOP’s and SWAP’s. Further, when part of a full stack must
be dumped to make room for new data, it is the bottom part, not the active
part, which should be dumped.
5. Subroutine transparency, i.e., the ability to use a subroutine recursively, is one
of the apparent advantages of the stack. However, the disadvantage is that the
transparency does not materialize unless additional independent stacks are
introduced for addressing purposes.
6. Fitting variable-length fields into a fixed-width stack is awkward.
In the final analysis, the stack organisation would have been about break-even for
a system intended principally for scientific computing. Here the general-purpose
objective weighed heavily in favor of the more flexible addressed-register organi-
zation. [ABB64]
I’ll address these points individually:
1. As previously mentioned, on-chip stacks are really a linear collection of registers. Addi-
tionally, the unabridged statement supports the use of stacks when either fully on-chip,
as in second-generation stack computers (which the KDF9 was), or sufficiently buffered
as in the B7700 (Sections 2.2.3 and 2.3.3) which was not introduced until 1973.
2. While stack permutations are inevitable, how often they are required is a strong function
of how the code is arranged by the programmer or compiler. It also depends on whether
the code is primarily iterative, recursive, or composed of nested procedures (Section
7.3.1). The choice of stack instruction set affects the overhead significantly (Section
9.2.2).
3. The static code analyses for a basic second-generation stack computer (Appendix B.1.5)
support this statement, but also suggest that the instruction density can be increased much
further (Chapter 8). This is also dependent on the nature and arrangement of the code.
4. Experiments done since show that a stack needs to be only 16 to 32 elements deep to
virtually eliminate stack overflow to main memory [Koo89, 6.4.1] [Bai96, 4.2]. What
memory traffic remains can be mitigated by the hardware stack management algorithms
that have been discovered since [Koo89, 6.4.2] [Bai96, 6.2]. The abridged versions
of this statement omitted the trade-off between the register housekeeping required for
subroutine calls on register-based computers versus the stack housekeeping required for
iterative code on stack-based computers.
5. I suspect the authors were thinking of the KDF9 since first-generation stack comput-
ers exhibit subroutine transparency using a single stack. This is possible because the
activation record of an instance of a subroutine is stored in main memory. For second-
generation stack computers like the KDF9, whose stack is not randomly accessible, a
single stack is insufficient since at the very least the arguments being passed recursively
would be buried under the return address (Section 4.2). Thus at the minimum a second
stack is needed to hold return addresses.
The statement is correct for both generations when index registers are used (Sections
2.3.4 and 3.4.1) since unless they are also stacks themselves, nested subroutines which
use them would have to do housekeeping to preserve and access their values, effectively
implementing in software the stack of a first-generation stack computer at a great perfor-
mance penalty. Finally, given integrated circuits, the cost of an extra stack is minimal.
6. Bell et al. are correct when saying that this point is irrelevant. The problem, if present,
is orthogonal to the internal organization of a fixed-width machine.
In summary, the arguments quoted by Hennessy & Patterson were oversimplified and referred
to old hardware and compiler technology. In fact, the original source of the arguments is far
less critical of stacks than the version Bell et al. present and most of the shortcomings have
been overcome since then.
Part III
Quantitative Arguments
Chapter 6
1 The name ’Gullwing’ was inspired by Richard Bach’s book, "Jonathan Livingston Seagull", and also by the DeLorean DMC-12 made famous in the "Back To The Future" movie trilogy by Robert Zemeckis and Bob Gale.
6.1.1 Memory Subsystem
At the top is the main memory (MEM), whose address port is continually driven by the Mem-
ory Address Register (MAR). The MAR is itself loaded by one of the Address Register (A),
the Program Counter (PC), or the Return Register (R). The contents of each of these can be
modified by a dedicated incrementer (+1). The MAR can also be loaded directly from memory
output in the case of jumps and calls, or from the incremented version of the PC in the case of
subroutine returns.
The PC holds the address of the next memory word to be read for fetching instructions or
literals. Upon a subroutine call, it is stored into R, which is itself saved onto the Return Stack
(RS). The reverse process occurs when a subroutine returns.
The A register is used under program control to hold the addresses of loads and stores, with
optional post-incrementing. The R register can be used in the same manner.
As each instruction is executed, the contents of the instruction registers are shifted over by one instruction and filled in at the far end with a
Program Counter Fetch (PC@) instruction (see Section 6.2). After the last loaded instruction
is executed, the PC@ instruction fetches the next instruction group into the ISR.
If the instruction after INST is a PC@, then INST is the last one of this group. This is
signalled by the Last line (L) being high during the execution of INST. If the encoding of PC@
is conveniently chosen to be all zeroes, then L is simply the NOR of all the bits of S1. This
line is used to overlap the execution of INST and PC@ whenever possible (Section 6.3.1).
• To fetch instructions, the F21 uses an internal memory access delay triggered by a lack
of memory-accessing instructions in its ISR. Gullwing instead uses a Program Counter
Fetch (PC@) instruction as the instruction fetching mechanism. It was the simplest
choice to implement and the most deterministic in behaviour. This also allows skipping
of unused instruction slots, which would otherwise have to be filled with no-ops.
• The F21 call and jump instructions use the remainder of the ISR to hold the low-order
bits of the target address. The number of these bits depends on where in the ISR the
instruction is placed. These bits would replace the corresponding ones in the PC, provid-
ing a limited branch distance relative to the current value of the PC. In Gullwing, the call
and jump instructions take their target address from a memory address which follows the
instruction, usually the very next one. This allows uniform, absolute addressing over the
entire memory range.
• Like the calls and jumps, literal fetches on the F21 also place their argument in the
remaining low-order bits. In Gullwing, the literal fetch instruction takes its literal value
from a memory address which follows the instruction, usually the very next one. This
makes for uniform and general storage of literals.
respectively take theirs from the Program Counter (PC) and the Return Register (R). They all
execute in two cycles, except for PC@ which requires only one.
Binary operations are destructive: they pop two values from the DS and push one result back. Unary operations
simply replace the contents of the top of the DS. All these instructions execute in a single cycle.
The +* operation is unusual in that it is conditional and non-destructive. It replaces the
contents of TOP with the sum of TOP and of the first element of DS, only if the original value
of TOP was odd (least significant bit is set). Combined with shift operations, it provides a
means of synthesizing a multiplication operation.
Algorithm 2 shows how the multiplication of two 4-bit numbers is implemented. The
topmost element of the Data Stack contains the multiplier, while the second contains the mul-
tiplicand.
The multiplicand is first shifted left four times to align it with the empty space in front of (above) the multiplier. It is then added into that empty space whenever the least significant bit (LSB) of the multiplier is set. The partial product and multiplier are then shifted right by one, which both brings the next bit of the multiplier into the LSB position and effectively shifts the multiplicand left by one relative to the partial product. The process is repeated four times,
shifting out the multiplier and leaving the completed 8-bit product on the top of the stack. The
multiplicand is then discarded.
This method produces one bit of product approximately every two cycles and is a compro-
mise between full hardware and software implementations. Its disadvantage is that it is limited
to single-word products, and thus to half-word factors. However, it can also be optimized when
the multiplier is much smaller than the multiplicand.
The multiplication of full-word factors, with a double-word product, could be accomplished
in the same way by shifting across two registers. For example, TOP could hold the multiplicand
and A the multiplier. By having +* test the LSB of A, and 2/ shifting from TOP into A, the
process would leave the lower half of the product in A and the upper half in TOP. Executing
A> would then place both halves on the Data Stack.
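As a sanity check of the single-word scheme, here is a minimal C model of the sequence described above (my own sketch, not the Gullwing code of Algorithm 2; a 32-bit TOP register is assumed and the function name is illustrative):

#include <assert.h>
#include <stdint.h>

/* Model of the 4-bit x 4-bit multiply: 'top' plays the role of the TOP register
   (initially the multiplier, finally the product), 'second' the next Data Stack
   element (the multiplicand, pre-aligned above the multiplier). */
static uint32_t multiply_4x4(uint32_t multiplier, uint32_t multiplicand)
{
    uint32_t top    = multiplier;
    uint32_t second = multiplicand << 4;   /* four 2* shifts of the multiplicand */

    for (int i = 0; i < 4; i++) {
        if (top & 1)          /* +*: conditional, non-destructive add of DS to TOP */
            top += second;
        top >>= 1;            /* 2/: next multiplier bit into the LSB, scale product */
    }
    return top;               /* 8-bit product; the shifted multiplicand is dropped */
}

int main(void)
{
    assert(multiply_4x4(13, 11) == 143);
    assert(multiply_4x4(15, 15) == 225);
    return 0;
}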
6.2.7 Instruction Format and Execution Example
Instruction opcodes are 5-bit numbers. Six of these can be packed into a 32-bit memory word.
The last two bits are unused (see Section 6.1.3). Literals and addresses always occupy an entire
word. Each instruction position is termed a slot and the slots in a word constitute an instruction
group. After a group is fetched, the instructions in the slots are executed in turn until they are
exhausted. Figure 6.3 shows the memory layout of a random snippet of code tailored to show
the layout of the instructions. They are laid out in left-to-right order for convenience only.
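As a concrete illustration of this packing, the following C sketch unpacks one such word (an illustration only; the assignment of slot 0 to the least significant bits is an assumption here, not taken from the thesis):

#include <stdint.h>

#define SLOTS_PER_WORD 6
#define OPCODE_BITS    5
#define OPCODE_MASK    0x1Fu

/* Split a 32-bit memory word into its six 5-bit instruction slots.
   The two most significant bits of the word are left unused. */
static void unpack_slots(uint32_t word, uint8_t slot[SLOTS_PER_WORD])
{
    for (int i = 0; i < SLOTS_PER_WORD; i++)
        slot[i] = (uint8_t)((word >> (i * OPCODE_BITS)) & OPCODE_MASK);
}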
In the initial state of this example, the group at address 1 has just been fetched, the instruc-
tion in Slot 0 (S0) is about to be executed, and the Program Counter (PC) has been incremented
to point to address 2. The first instruction (>R) executes without comment and shifts the con-
tents of the ISR to the left by one instruction, placing JMP0 as the current instruction.
In the case of the JMP0 conditional jump being not taken, the PC is incremented by one to
point to address 3, and the execution continues with DUP. When the LIT instruction is reached
it fetches the literal pointed-to by the PC (at address 3) and increments the PC by one. The
addition instruction is then executed.
When the PC@ instruction is reached, it fetches a new group from the location pointed to
by the PC, which is currently address 4, and then increments the PC. Eventually the CALL
instruction is reached. It loads the instruction group from the target address in the memory
word pointed to by PC (at address 5), and loads the PC with the target address plus one. The
remaining slots are never executed and are filled by convention with PC@.
Had the JMP0 conditional jump been taken, it would have executed similarly to the CALL
instruction, loading the group at address 4 and leaving the PC pointing to address 5. The strictly
word-wide access to memory is why the R> in Slot 0 at address 4 could not be placed in Slot 5
at address 1. Jumping or calling to an address always begins executing at Slot 0.
Address   S0    S1     S2     S3    S4    S5
1         >R    JMP0   DUP    LIT   +     PC@
2         Address for JMP0 (4)
3         Number for LIT
4         R>    XOR    CALL   PC@   PC@   PC@
5         Address for CALL
6.3 State Machine and Register Transfer Description
The operation of the unpipelined Gullwing processor is very simple. There are no exceptions or
interrupts. Algorithms 3 through 4 show the state transitions and register transfers. The inputs
of its state machine consist of the opcode of the current instruction (INST), the current state
(S), and whether the contents of the TOP register is all zeroes (=0), has the most significant bit
clear (MSB0), or has the least significant bit set (LSB1). These three flags have been collapsed
into a single input column (TOP) for brevity. ’Don’t Care’ values are represented by X. The
outputs are the next state (N) and the various enable and select signals needed to steer the
datapath, which are represented symbolically.
All actions on a line occur concurrently. For example, the operation of PC@ can be under-
stood as: leave the state at zero, route the PC through the incrementer (+1) and store its output
into both PC and MAR, and load the ISR with the current output of MEM. All instructions end
with either loading or shifting the ISR (ISR<<), which changes INST to the next instruction
to execute (see Figure 6.2). Instructions that take two cycles use the state bit to select which
phase to execute.
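For example, the register transfers of PC@ just described can be written out in C (a sketch only; the structure, memory size, and names are mine, and the concurrent hardware transfers are approximated by reading the memory output first):

#include <stdint.h>

#define MEM_WORDS (1u << 16)   /* arbitrary size for the sketch */

typedef struct {
    uint32_t mem[MEM_WORDS];   /* MEM, addressed by the MAR        */
    uint32_t mar;              /* Memory Address Register          */
    uint32_t pc;               /* Program Counter                  */
    uint32_t isr;              /* Instruction Shift Register group */
} gullwing_regs;

/* PC@: load the ISR with the current output of MEM, and route the PC
   through the +1 incrementer into both PC and MAR, all in one cycle. */
static void do_pc_fetch(gullwing_regs *g)
{
    uint32_t mem_out = g->mem[g->mar];   /* current output of MEM   */
    uint32_t pc_plus = g->pc + 1;        /* PC through the +1 block */

    g->isr = mem_out;
    g->pc  = pc_plus;
    g->mar = pc_plus;
}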
Instructions which load instructions from memory, such as CALL and JMP, execute the
same sequence of steps as PC@ in their second phase. For brevity, these steps have been
replaced with the opcode.
Algorithm 5 Gullwing ALU Instructions
Inputs Outputs
----------- ---------
INST TOP S N Control
---- --- - - -------
COM X 0 0 TOP→ALU(NOT)→TOP, ISR<<
AND X 0 0 TOP,DS→ALU(AND)→TOP, DS(POP), ISR<<
XOR X 0 0 TOP,DS→ALU(XOR)→TOP, DS(POP), ISR<<
+ X 0 0 TOP,DS→ALU(+)→TOP, DS(POP), ISR<<
2* X 0 0 TOP→ALU(2*)→TOP, ISR<<
2/ X 0 0 TOP→ALU(2/)→TOP, ISR<<
+* LSB0 0 0 ISR<<
+* LSB1 0 0 DS,TOP→ALU(+)→TOP, ISR<<
6.3.1 Improvement: Instruction Fetch Overlap
Using an instruction (PC@) to fetch the next group of instructions eliminates the need for
dedicated instruction-fetching hardware. The fact that it is also shifted into the ISR for free is
also very elegant. However, this means that at least one out of every seven instructions executed
will be a PC@, or about 14%5 . This overhead could be reduced by overlapping fetching with
execution while the memory bus is free.
A priori, it is unclear how much benefit would come from overlapping the instruction fetch.
Koopman provides a table of the dynamic instruction execution frequencies of Forth primitives
averaged over a set of benchmarks [Koo89, 6.3]. These primitives map almost directly to the
Gullwing instruction set and the benchmarks from Section 7.1 are also written in a Forth-like
language. The data predicts that approximately half of the executed primitives are either flow
control or load/store instructions and thus access memory. Assuming that this probability is
evenly distributed, the fetching of instructions could be overlapped half the time, reducing the
overhead to about 7% for straight-line code.
Accomplishing this overlap depends on this fact: if the last instruction before a PC@
does not access memory, then both instructions can be executed simultaneously without con-
flict. Figure 6.2 shows how the ISR makes this possible. If the next instruction to be executed
is a PC@6 , then the current instruction being executed is the last one of this group. This is
signalled by raising the Last flag (L). If the current instruction does not access memory7 , then
instead of shifting the ISR at the end of its execution, the instruction will fetch the next group
of instructions from memory in the same manner as PC@, as if it had executed concurrently.
Implementing this optimization requires adding the L bit as an input to the state machine.
Flow control and load/store instructions ignore this bit since they always access memory. The
remaining instructions now have two versions, selected by L, which either shift or load the ISR.
Algorithms 8, 9, and 10 show the necessary changes.
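The added decision can be pictured as follows (a C sketch of the control step, not the thesis's register transfer notation; the opcode values and the use of the opcode MSB to mark memory-accessing instructions follow the suggestion in the footnote below and are otherwise assumptions):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define OP_PCFETCH 0x00u   /* PC@ is conveniently encoded as all zeroes          */
#define OP_MEM_BIT 0x10u   /* assumed: MSB of the opcode marks memory accessors  */

static bool accesses_memory(uint8_t opcode) { return (opcode & OP_MEM_BIT) != 0; }

/* Stubs standing in for the real register transfers. */
static void shift_isr(void)        { puts("ISR<<"); }
static void fetch_next_group(void) { puts("fetch next group, as PC@ would"); }

/* End-of-instruction step with fetch overlap: if the next slot holds PC@
   (raising L) and the current instruction leaves the memory bus idle, the
   fetch is performed now instead of a shift, folding the PC@ into this cycle. */
static void finish_instruction(uint8_t current_opcode, uint8_t next_slot)
{
    bool last = (next_slot == OP_PCFETCH);   /* L = NOR of the bits of S1 */

    if (last && !accesses_memory(current_opcode))
        fetch_next_group();
    else
        shift_isr();
}

int main(void)
{
    finish_instruction(0x05, OP_PCFETCH);   /* hypothetical ALU op: overlapped  */
    finish_instruction(0x15, OP_PCFETCH);   /* hypothetical load:   sequential  */
    return 0;
}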
Analysis of code executed with an overlapping instruction fetch (Appendix B.2.2) confirms
that the actual overhead of explicit PC@ instructions is reduced to 4.1 to 2.2% of the total
executed instructions. It also shows that 4.2 to 7.5% of instructions are executed concurrently
(’FOLDS’) with a PC@. This demonstrates a 50.6 to 77.3% reduction in instruction fetch
overhead.
5
The actual overhead will be lower since other flow control instructions do their own instruction loading.
Without instruction fetch overlapping the actual overhead of instruction fetching is 8.3 to 9.7%.
6
PC@ is conveniently encoded as all zeroes.
7
Since the opcodes are divided about equally between memory and non-memory instructions, a single bit (the
MSB, for example) can be used to test if the instruction accesses memory. This should simplify the implementa-
tion of the state machine.
Algorithm 8 Gullwing ALU Instructions with Instruction Fetch Overlap
Inputs Outputs
------------- ---------
INST TOP L S N Control
---- --- - - - -------
COM X 0 0 0 TOP→ALU(NOT)→TOP, ISR<<
COM X 1 0 0 TOP→ALU(NOT)→TOP, PC@
AND X 0 0 0 TOP,DS→ALU(AND)→TOP, DS(POP), ISR<<
AND X 1 0 0 TOP,DS→ALU(AND)→TOP, DS(POP), PC@
XOR X 0 0 0 TOP,DS→ALU(XOR)→TOP, DS(POP), ISR<<
XOR X 1 0 0 TOP,DS→ALU(XOR)→TOP, DS(POP), PC@
+ X 0 0 0 TOP,DS→ALU(+)→TOP, DS(POP), ISR<<
+ X 1 0 0 TOP,DS→ALU(+)→TOP, DS(POP), PC@
2* X 0 0 0 TOP→ALU(2*)→TOP, ISR<<
2* X 1 0 0 TOP→ALU(2*)→TOP, PC@
2/ X 0 0 0 TOP→ALU(2/)→TOP, ISR<<
2/ X 1 0 0 TOP→ALU(2/)→TOP, PC@
+* LSB0 0 0 0 ISR<<
+* LSB0 1 0 0 PC@
+* LSB1 0 0 0 DS,TOP→ALU(+)→TOP, ISR<<
+* LSB1 1 0 0 DS,TOP→ALU(+)→TOP, PC@
Algorithm 10 Gullwing No-Op and Undefined Instructions with Instruction Fetch Overlap
Inputs Outputs
------------ ---------
INST TOP L S N Control
---- --- - - - -------
NOP X 0 0 0 ISR<<
NOP X 1 0 0 PC@
UND0 X 0 0 0 ISR<<
UND0 X 1 0 0 PC@
UND1 X 0 0 0 ISR<<
UND1 X 1 0 0 PC@
UND2 X 0 0 0 ISR<<
UND2 X 1 0 0 PC@
UND3 X 0 0 0 ISR<<
UND3 X 1 0 0 PC@
Chapter 7
As stated in Section 1.1, this thesis aims to divide the family of stack-based computers into first
and second generations. Part of this distinction consists of showing that the second generation
resembles the register-based machines which replaced the first.
This chapter supports this argument by comparing the Gullwing processor from Chapter 6
with the well-known MIPS and DLX processors used as demonstrators by Hennessy & Pat-
terson. This comparison is based on statistics derived from the organization and execution of
programs and further based on a comparison of the pipeline structure of each microprocessor.
The characteristics compared include: cycle time, cycle count, cycles per instruction, instruc-
tion count, dynamic instruction mix, memory accesses per cycle, code size, and code density.
is executed as it is received and can contain the name of any built-in or previously defined
function. The source to the kernel is listed in Appendix A.1.
Metacompiler The metacompiler manipulates the Flight kernel to retarget its operations to
the memory area used by the virtual machine. It saves and restores the internal state of the
kernel, such as the location of the dictionary and input buffer, and defines a new main loop
to replace the kernel’s default one. The new loop first searches the new dictionary in the
virtual machine memory and if there is no match, continues the search in the kernel’s original
dictionary. This allows the use of functions previously defined outside of the virtual machine
to bootstrap code inside it. The compilation of the metacompiler requires 106 lines of code
containing 491 names. This process exercises the functions of the kernel and its extensions and
executes 4,069,393 instructions. The source is listed in Appendix A.3.2.
1 Only non-blank lines are counted.
2 Each name is either a function name, or a string to be processed.
3 Although the Virtual Machine software includes the compilation of the Extensions, the latter contributes only 2.2% of the total number of executed instructions.

Self-Hosted Kernel The first thing compiled into the virtual machine is another instance of the Flight kernel. While the original kernel was written in assembly language, this new kernel
is defined using the functions of the original kernel and all its extensions. The result is a higher-
level description of the Flight kernel, written in itself. The new kernel binary residing in the
virtual machine memory area is identical to the original4 . The compilation of the self-hosted
kernel requires 246 lines of code containing 681 names. This process exercises the functions
of the kernel and its extensions via the indirection of the metacompiler and executes 6,511,976
instructions. The source is listed in Appendix A.3.3.
Flight Language Kernel Extensions Now that a Flight kernel resides in the virtual machine
memory, it is possible to start the virtual machine and execute this new kernel. The Flight
language extensions from Section 7.1.2 are fed to the new kernel, exercising the same code,
but through the emulation layer of the virtual machine. This executes 184,993,301 instructions,
taking about 90% of the total execution time of the VM test suite.
instructions which do not exist in a given machine are represented by a dash (’-’).
For both the compiler and interpreter test data, the dynamic instruction mix of the Gullwing
processor differs from that of DLX and MIPS in some significant ways:
• The proportion of loads and stores is lower. I believe this originates from a lack of
complex data structures in the Gullwing software, and the need for DLX and MIPS to
use a stack in main memory to pass parameters to subroutines.
• There are many more immediate loads (fetches) since Gullwing cannot include small
literal operands within most instructions as DLX and MIPS do.
• The proportion of calls and returns is much greater. The stack-based architecture of
Gullwing results in extremely efficient procedure linkage, which makes the inlining of
code unnecessary in most cases.
• About a quarter of all the instructions executed are stack manipulation instructions. Some
are manipulations made explicit by the absence of operands in Gullwing instructions as
a consequence of the lack of random access to a stack. DLX and MIPS include implicit
move, copy, and delete operations within their instructions by using a three-operand
instruction format which gives random access to their registers. A large portion of these
stack manipulations are moves of addresses from the top of the Data Stack to the Address
Register, as performed by the >A instruction. Section 9.2.3 discusses a means to reduce
the overhead of these moves.
Additionally, the interpreter dynamic instruction mix (Table 7.2) has some further differences:
• The large incidence of shifts is due to the use of the 2/ shift instruction in the VM to extract individual instructions out of a memory word.
DLX/MIPS Instr.    DLX (GCC 92)    MIPS (GCC 2000)    Gullwing (Extensions)    Gullwing Instr.
load 22.8% 25.1% 15.2% @A, @A+, @R+
store 14.3% 13.2% 0.4% !A, !A+, !R+
add 14.6% 19.0% 10.3% +
sub 0.5% 2.2% -
mul 0.1% 0.1% -
div 0.0% - -
compare 12.4% 6.1% -
load imm 6.8% 2.5% 16.8% LIT, PC@
cond branch 11.5% 12.1% 6.6% JMP0, JMP+
(incl. TAKEN jumps)
cond move - 0.6% -
jump 1.3% 0.7% 2.1% JMP
call 1.1% 0.6% 6.4% CALL
return, jmp ind 1.5% 0.6% 6.4% RET
shift 6.2% 1.1% 0.2% 2/, 2*
and 1.6% 4.6% 0% AND
or 4.2% 8.5% -
other (xor, not) 0.5% 2.5% 6.5% XOR, NOT
other (moves) - - 29.2% DUP, DROP, OVER,
>R, R>, >A, A>
other - - 0% NOP, +*
Table 7.1: Compilers Dynamic Instruction Mix
DLX/MIPS Instr.    DLX (Lisp 92)    MIPS (Perl 2000)    Gullwing (VM)    Gullwing Instr.
load 31.3% 28.7% 9.3% @A, @A+, @R+
store 16.7% 16.2% 4.6% !A, !A+, !R+
add 11.1% 16.7% 7.7% +
sub 0.0% 2.5% -
mul 0.0% 0.0% -
div 0.0% - -
compare 5.4% 3.8% -
load imm 2.4% 1.7% 18.0% LIT, PC@
cond branch 14.6% 10.9% 2.0% JMP0, JMP+
(incl. TAKEN jumps)
cond move - 1.9% -
jump 1.8% 1.7% 2.7% JMP
call 3.1% 1.1% 5.0% CALL
return, jmp ind 3.5% 1.1% 7.5% RET
shift 0.7% 0.5% 12.6%* 2/, 2*
and 2.1% 1.2% 3.4% AND
or 6.2% 8.7% -
other (xor, not) 0.1% 3.1% 3.9% XOR, NOT
other (moves) - - 23.3% DUP, DROP, OVER,
>R, R>, >A, A>
other - - 0% NOP, +*
Table 7.2: Interpreters Dynamic Instruction Mix
The higher CPI of Gullwing does not compare favourably with that of DLX. However,
loads and taken jumps on Gullwing take two cycles, which is as if a load or branch stall always
occurred. Therefore, Gullwing is really architecturally equivalent to a DLX without load delay
slots and without delayed branches, which always stalls on loads and taken branches. For
example, if 100% of loads are assumed to stall on DLX, their load penalty increases to 29.6%
for GCC and 33.7% for Lisp, which raises the total CPI of DLX to 1.34 and 1.41 respectively,
which is comparable to Gullwing.
Correspondingly, there are some possible optimizations to Gullwing which would reduce
the CPI of loads and jumps and make Gullwing equivalent to a normal DLX (Section 9.2.3).
Test             GCC      Lisp
Total Loads      29.6%    33.7%
Load Stalls      23%      24%
Load Penalty     6.81%    8.10%
Branch Penalty   4%       7%
Total Penalty    10.8%    15.1%
Overall CPI      1.11     1.15
Table 7.3: DLX CPI with Load and Branch Penalties

                 Fraction of Total CPI
Instr. Type      Extensions   VM
Conditionals     0.071        0.031
Subroutine       0.300        0.304
Fetches          0.168        0.180
Load/Store       0.312        0.276
ALU              0.169        0.277
Stack            0.292        0.233
Total            1.312        1.301
Table 7.4: Gullwing CPI by Instruction Type
Gullwing has a single memory bus, shared between instruction fetches and data fetches/loads, and therefore only one Memory Address Register (MAR) (Section 6.1.1). Thus,
a memory access is defined as an alteration of the MAR.
The average number of memory accesses per cycle is calculated as follows: for each instruction, its number of memory accesses is divided by its number of cycles, and the results are weighted by each instruction's fraction of the total cycles and summed:

    Accesses per Cycle = Sum over instructions of (Accesses / Cycles) × (Fraction of Total Cycles)
The number of accesses is determined from the register transfer description (Section 6.3)
as is the number of cycles, which can alternately be seen in Appendix B.2.4. The percentage of
total cycles is listed under the C/C column in Appendix B.2.2. Table 7.5 condenses this data.
In summary, the Gullwing processor performs an average of 0.667 memory accesses per
cycle while using a single memory bus, compared to 1.421 for DLX/MIPS which uses two.
It would be ideal to reduce Gullwing’s memory bandwidth further, but part of the reason
it is already low is the extra instructions required to manipulate the stack (which
do not access memory). Reducing this instruction overhead will increase the proportion of
memory-accessing instructions and thus increase the average number of memory accesses per
cycle, towards a maximum of one (Section 9.2.3).
Either way, since loads and stores cannot (and have no need to) overlap instruction fetches
on Gullwing, the maximum number of memory accesses per cycle cannot exceed one. By
comparison, MIPS must have a minimum of one memory access per cycle. When MIPS must
manipulate its call stack, it must perform additional memory accesses in the form of data loads
and stores, increasing its memory bandwidth further7 .
                                                     Fraction of Total Cycles
Instruction           Accesses  Cycles  Acc./Cyc.    Extensions      VM
PC@                       1        1        1          0.031       0.017
FOLDS                     1        1        1          0.032       0.058
CALL                      2        2        1          0.098       0.076
JMP                       2        2        1          0.033       0.042
JMP0, JMP+                1        1        1          0.046       0.009
JMP0, JMP+ (TAKEN)        2        2        1          0.007       0.001
RET                       2        2        1          0.098       0.115
LIT                       1        1        1          0.097       0.122
@A, @A+, @R+              2        2        1          0.233       0.142
!A, !A+, !R+              2        2        1          0.006       0.070
all others                0        1        0          0.352       0.392
Total of (Acc./Cyc.) × Fraction                        0.681       0.652
Average                                                     0.667
Table 7.5: Gullwing Memory Accesses Per Cycle (Total)
7
Implying, of course, a second memory bus.
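The two totals and the overall average can be reproduced directly from Table 7.5; the short C check below simply transcribes the Acc./Cyc. column and the two fraction-of-total-cycles columns:

#include <stdio.h>

int main(void)
{
    /* Rows of Table 7.5, in order: PC@, FOLDS, CALL, JMP, JMP0/JMP+,
       JMP0/JMP+ (taken), RET, LIT, loads, stores, all others. */
    const double acc_per_cyc[11] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0};
    const double ext_frac[11]    = {0.031, 0.032, 0.098, 0.033, 0.046, 0.007,
                                    0.098, 0.097, 0.233, 0.006, 0.352};
    const double vm_frac[11]     = {0.017, 0.058, 0.076, 0.042, 0.009, 0.001,
                                    0.115, 0.122, 0.142, 0.070, 0.392};
    double ext = 0.0, vm = 0.0;

    for (int i = 0; i < 11; i++) {
        ext += acc_per_cyc[i] * ext_frac[i];
        vm  += acc_per_cyc[i] * vm_frac[i];
    }
    /* Prints the 0.681 and 0.652 totals and their ~0.667 average. */
    printf("Ext: %.3f  VM: %.3f  Avg: %.3f\n", ext, vm, (ext + vm) / 2.0);
    return 0;
}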
Test Compiler Interpreter
CPU DLX MIPS DLX MIPS
Component GCC92 GCC00 Lisp Perl
Loads 22.8% 25.1% 31.3% 28.7%
Stores 14.3% 13.2% 16.7% 16.2%
Total 37.1% 38.3% 48.0% 44.9%
Average 37.7% 46.5%
Average 42.1%
Table 7.6: DLX/MIPS Memory Accesses Per Cycle Caused by Loads and Stores
8
Calculated as the sum of the fraction of total memory accesses done by instructions which alter the Program
Counter (PC): PC@, FOLDS, CALL, RET, all JMPs, and LIT, averaged over both the Ext. and VM tests.
7.2.4.1 Basic Blocks and Instruction Fetch Overhead
This reduction in the number of instruction fetches manifests as a reduction of the number of
PC Fetch (PC@) instructions required to load the next sequential instruction word. If a basic
block of code spans several memory words, the last instruction in each word will be a PC@
which fetches the next group of instructions (Sections 6.2.7 and 6.3.1). The call and jump
instructions that terminate basic blocks do their own instruction loading.
Appendix B.2.5 shows that the average length of a basic block, measured in instructions,
fits well within the six instruction slots in each memory word. In fact, 71.1 to 80.9% of basic
blocks fit in a single memory word. This places a limit on how much overhead the PC@
instructions can cause.
This PC@ overhead drops as the memory word gets wider and more instructions can be
packed in one. At the limit, when all basic blocks fit into one memory word, no PC@ in-
structions will be executed and all loading of instructions will come from calls and jumps. In
effect, each memory word behaves as a cache line. This effect is amplified by the mechanism
proposed in Chapter 8.
Memory Words This is a measure of code size, assuming 32-bit memory words. Instructions
are displayed such that one line represents one word of memory.
Instructions This is the number of instructions executed. All other things being equal, a
difference in number suggests a difference in suitability to the given task.
Memory Accesses This is the count of the number of memory fetches, loads, and stores9 . For
MIPS, this also includes the fetching of instructions. There is no such distinction in Gullwing:
fetching instructions is a special case of data loads.
Cycles This is the count of the number of cycles required. The cycle time is assumed to
be the same for both processors. All MIPS instructions each count as one cycle, while the
Gullwing instructions count as one or two cycles10 . The MIPS pipeline is assumed to be full at
the start and to never stall.
Memory Accesses Per Cycle This is a measure of the memory bandwidth required. A mea-
sure of one access per cycle implies a fully-utilized memory bus.
Cycles Per Instruction (CPI) The MIPS pipeline is assumed to never stall and thus to always
have a CPI of one. Since Gullwing instructions take one or two cycles, the CPI depends on the
code being executed.
Instructions Per Memory Word This is a measure of code density. For MIPS, this measure
is always one. For Gullwing, it is variable.
7.3.2 Demonstrators
The C source for the demonstrators was compiled to assembly language, at various optimiza-
tion levels, with a little-endian MIPS32 cross-compiling version of GCC 3.4.4. The simplest
resulting code was then hand-optimized, if possible, to eliminate the quirks of the compiler
and reproduce optimal hand-assembled code. From this, the equivalent stack-oriented code
was written with an effort towards keeping the algorithm unchanged.
The first demonstrator is the triangular numbers function. It is the additive analogue of the
factorial function: each triangular number is the sum of all the preceding positive integers. It is
quite possibly the tiniest non-trivial example of looping code (since the number of repetitions
is part of the calculations). It is also interesting since it is expressible in several forms. The
iterative, recursive, and tail-recursive forms are analyzed11 .
9
In the MIPS pipeline, all instructions require one memory access to fetch the instruction, and another access
if the instruction is a load or a store. For Gullwing, there is no division between instruction and data memory.
The fetching or loading of instructions is done by the flow control instructions. Calls, returns, jumps, and taken
conditionals perform two memory accesses. Untaken conditionals and PC@ (PC Fetch) do only one. A PC@ is
implied at the end of a group of instructions that does not end in a flow control instruction. Literal fetches take
one cycle.
10
For Gullwing, conditional jumps use a variable number of cycles. If taken, they behave like a jump, call, or
return, taking two cycles. If not taken, they merely advance to the next instruction, taking one cycle. Fortunately,
in the programs shown, the conditional jump is used to test the loop exit condition or the recursion terminal case
and thus takes one cycle for all algorithm steps except the last one. It is thus considered a one-cycle instruction.
See Section 6.3 for details.
11
There is a closed form of the triangular numbers algorithm which I’ve not investigated here because it requires
multiplication, which Gullwing does not have: Tri_n = n(n+1)/2. Since it’s a straightforward expression, it
should evaluate in the same manner on stack-based and register-based computers, with perhaps a small stack
manipulation overhead on stack computers, depending on the exact instruction set.
The second demonstrator is a sequence of progressively more complex subroutine calls
implementing the sum of their parameters. While the function is trivial, it is implemented in a
manner that highlights parameter passing in nested subroutines.
Algorithm 11 Triangular Iterative C Source
int triangular (int foo) {
    int bar = 0;
    while (foo != 0) {
        bar = bar + foo;
        foo = foo - 1;
    }
    return bar;
}
MIPS Instr.    MIPS  Gullwing  Gullwing Instr.
move             0      4      >R, R>, DUP, OVER
addu, addiu      2      2      +
load imm         0      1      LIT
beq              1      1      JMP0
b                1      1      JMP
Table 7.8: Iterative Dynamic Instruction Mix
7.3.4 Recursive Triangular Numbers
The recursive version stores the intermediate values on the stack across a series of recursive
calls, summing them once the terminal case is reached.
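In C, and in the style of Algorithm 11, the recursive form reads roughly as follows (an illustrative rendering; the thesis's own source for this variant is not reproduced here):

int triangular (int foo) {
    if (foo == 0)
        return 0;
    return foo + triangular(foo - 1);
}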
The stack-oriented instruction set of Gullwing eliminates all the explicit call stack manipu-
lations and copying of arguments done by MIPS, approximately halving the number of memory
words required and also resulting in straightforward code12 .
Gullwing has to perform very little extra stack manipulation and so does not have an ar-
tificially reduced memory bandwidth as in the iterative example. On the other hand, MIPS
must now manage a stack in memory, increasing its cycle and memory access counts to well
above those of Gullwing. The result is that Gullwing genuinely requires about half the mem-
ory bandwidth than MIPS. Section 9.2.1 discusses a method which could eliminate the stack
management overhead of MIPS.
12
It is fair to say that the MIPS code shown, which has been cleaned-up from the actual compiler output, is
spaghetti-code. It’s not possible to optimize it further without altering the algorithm (recursing to a separate entry
point), and even then it would only save one instruction. The convoluted nature of efficient code created by a
compiler is not usually a human concern, but there’s no way I can call it a good thing.
Algorithm 15 Triangular Recursive MIPS32 Assembly
Tri:  addiu $sp,$sp,-32
      sw    $31,28($sp)
      sw    $16,24($sp)
      move  $16,$4
      beq   $4,$0,$L1
      move  $2,$0
      addiu $4,$4,-1
      jal   Tri
      addu  $2,$2,$16
$L1:  lw    $31,28($sp)
      lw    $16,24($sp)
      j     $31
      addiu $sp,$sp,32

Algorithm 16 Triangular Recursive Gullwing Assembly
Tri:  DUP JMP0 DUP LIT + CALL
      End
      -1
      Tri
      +
End:  RET
MIPS Instr.    MIPS  Gullwing  Gullwing Instr.
move             2      2      DUP
addu, addiu      4      2      +
load imm         0      1      LIT
sw               2      0      !A, etc.
lw               2      0      @A, etc.
beq              1      1      JMP0
jal              1      1      CALL
j                1      1      RET
Table 7.10: Recursive Dynamic Instruction Mix
7.3.5 Tail-recursive Triangular Numbers
Expressing the recursive algorithm in a tail-recursive form reuses the parameters across calls,
reducing the data stack depth to that of the iterative case, and provides the opportunity to
eliminate the tail call.
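Again in the style of Algorithm 11, the tail-recursive form can be sketched as below; bar accumulates the running sum and the computation starts as triangular(n, 0) (an illustrative rendering, not the thesis's source):

int triangular (int foo, int bar) {
    if (foo == 0)
        return bar;
    return triangular(foo - 1, bar + foo);
}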
The advantage of this representation is that the tail call is trivially eliminated from the
Gullwing code by replacing the CALL instruction with a JMP and eliminating the following
RET instruction. This eliminates the accumulation of return addresses from the return stack.
Optimizing the MIPS code in the same manner yields the initial iterative code. The tail of the
Gullwing code is not counted since it executes only once.
Expressing the algorithm in a tail-recursive form brings it closer to the iterative case and
so some of the stack manipulation overhead reappears in the Gullwing code. However, the
memory bandwidth remains lower than MIPS as in the recursive case.
Algorithm 18 Triangular Tail-Recursive MIPS32 Assembly
Tri:  addiu $sp,$sp,-32
      sw    $31,24($sp)
      beq   $4,$0,$L1
      addu  $5,$5,$4
      jal   Tri
      addiu $4,$4,-1
$L1:  lw    $31,24($sp)
      j     $31
      addiu $sp,$sp,32

Algorithm 19 Triangular Tail-Recursive Gullwing Assembly
Tri:  OVER JMP0 OVER + >R
      End
      LIT + R> CALL
      -1
      Tri
      RET
End:  >R DROP R> RET
MIPS Instr.    MIPS  Gullwing  Gullwing Instr.
move             0      4      DUP, OVER, >R, R>
addu, addiu      4      2      +
load imm         0      1      LIT
sw               1      0      !A, etc.
lw               1      0      @A, etc.
beq              1      1      JMP0
jal              1      1      CALL
j                1      1      RET
Table 7.12: Tail-Recursive Dynamic Instruction Mix
7.3.6 Subroutine Calls
Nested subroutine calls are really a form of expression evaluation (Section 4.2), which is a
process that naturally maps onto a stack. However, the efficiency of this mapping is quite
dependent on how the expression is arranged by the programmer or compiler. A fundamental
assumption here is that subroutines are library calls and thus cannot be inlined or have their
registers globally allocated by a smart compiler.
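The C sources for these demonstrators are not reproduced here; the sketch below merely illustrates the call structure being measured, with each level adding one more parameter and one more nested, non-inlined call (the names and bodies are my own, not the thesis's):

int add2(int a, int b)               { return a + b; }
int add3(int a, int b, int c)        { return add2(a, add2(b, c)); }
int add4(int a, int b, int c, int d) { return add2(a, add3(b, c, d)); }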
The MIPS subroutine call overhead, manifested as additional instructions and memory ac-
cesses to manage an in-memory stack and move parameters between registers, is quickly am-
plified as the nesting of subroutines increases. This overhead is so large that even the low
density and high CPI of sequences of calls in Gullwing code do not negate it. Section 9.2.1
discusses a method which could eliminate the stack management overhead of MIPS.
Add2 The stack code cycle overhead here comes from (the unoptimized) RET requiring two
cycles to execute.
MIPS Instr.    MIPS  Gullwing  Gullwing Instr.
addu             1      1      +
j                1      1      RET
Table 7.14: Add2 Dynamic Instruction Mix
Add3 The stack code shows a great advantage, as previously seen in the recursive triangular
number example, even for a single nested call with one additional parameter.
Add4 This more complex example requires some stack manipulation. Nonetheless, its per-
formance is still far better than the corresponding MIPS code.
7.4 Pipelining
The unpipelined view of the Gullwing processor presented in Chapter 6 shows a simple com-
puter with a number of components comparable to the DLX processor of Hennessy & Pat-
terson. However, the cycle time of Gullwing as shown must be greater than that of the DLX
simply because an instruction cycle includes the full path from the Instruction Shift Register
(ISR), through the (implicit) decoding logic, to the controlled units such as the ALU.
Pipelining Gullwing would overlap the decoding and executing of an instruction and reduce
the cycle time to that of the slowest stage, which is usually the adder in the ALU. I’ll show this
change in the same manner as Koopman [Koo90], as a transformation of the well-known DLX
pipeline, but in much more detail. The resulting Gullwing pipeline has a structure that implies
a comparable cycle time to the DLX pipeline.
(Figure: the classic five-stage DLX pipeline, IF, ID, EX, MEM, and WB, with instruction memory (IM), the register file (Reg), the ALU, and data memory (DM).)
The pipelining of the Gullwing processor can be explained by exchanging the subsystems
in the DLX pipeline with those of Gullwing and following the implications.
The first change is that Gullwing has a single memory bus for instructions and data. This
removes the distinction between IM and DM. Since the IF stage fetches an instruction every
cycle, this implies a structural hazard for every data access. However, the zero-operand instruc-
tion format of Gullwing means that several instructions can be packed as a group into a single
memory word (Section 6.2.7). This reduces the occurrence of the structural hazard to between
2 and 4 percent of executed instructions (PC@ count in Appendix B.2.2). Furthermore, the end
of a group of instructions is explicitly marked by the inclusion of a PC@ (PC Fetch) instruction
which fetches the next group of instructions in parallel with the current instruction if possible.
Combined, these two features accomplish the function of the IF stage and effectively divide
it between MEM, where the instructions are fetched, and ID, where they are held and decoded.
The second change is the replacement of the register file with a stack, which has the dis-
advantage of forcing a RAW (Read-After-Write) dependency between all instructions. This
means that an instruction in ID must wait for the previous instruction in EX to reach WB before
being able to continue. However, a stack has the advantage of having its inputs and outputs
immediately available, without having to decode addresses13 . This effectively makes them
into registers equivalent to the ID/EX and EX/MEM pipeline registers connected to the ALU.
Thus the stack can be moved out of ID and placed into EX without any speed penalty. This
eliminates the WB stage and simplifies ID.
The third change is the use of direct addressing. The DLX processor uses displacement
addressing to simulate a number of other addressing modes. This requires the ALU to compute
an address for each load or store instruction, forcing EX to precede MEM. Direct addressing
removes this dependency and so both stages can now operate in parallel. Since MEM already
contains the incrementer for the Program Counter (brought over from IF in the first transfor-
mation), it can be re-used to implement post-incrementing direct addressing.
The end result of these changes is the pipeline shown in Figure 7.2a, where ID
decodes the instructions from the ISR, and EX and MEM execute the instructions. Figure 7.2b
shows how these stages map to the existing Gullwing functional blocks. Note that the EX and
MEM stages both contain adding circuitry and so place a lower limit on the cycle time that is
comparable to that of the DLX EX stage.
The operation of the pipeline is similar to that of the DLX (Figure 7.3). Since the pipeline
introduces one stage of latency to execution, the next group of instructions is loaded into the
ISR while the last instruction in the current group (5) decodes. This process is detailed in the
next section. Instructions that require two cycles to execute, such as loads, occupy the ID
stage for the duration of their execution (Figure 7.4). Loads and stores must take two cycles
since they currently cannot be overlapped, but on the other hand there is no load delay slot.
Overlapping loads and stores are discussed in Section 9.2.3.
(Figure 7.2: (a) Pipeline; (b) Stage Details: the ID stage followed by parallel EX and MEM stages, with the DS, R, and RS among the attached registers.)
                              Cycle (Current Instruction Word)
Instruction    1       2       3       4       5       6       1
0              ID      EX/MEM
1                      ID      EX/MEM
2                              ID      EX/MEM
3                                      ID      EX/MEM
4                                              ID      EX/MEM
5                                                      ID      EX/MEM
               (next word loaded while 4 executes and 5 decodes)
0                                                              ID
Figure 7.3: Gullwing Pipeline Operation
Cycle Notes
Instruction 4 5 6 7
@A ID MEM @A stays in ID
@A ID EX/MEM
4 ID EX/MEM
5 ID
Figure 7.4: Gullwing Load/Stores Pipeline Diagram
inserted PC@ will be in S1 and enable the L signal, signalling the last instruction. If the in-
struction in S0 does not access memory, then both it and the PC@ execute in parallel and S0
and S1 are then loaded from S214 , which contains the actual last instruction (Figure 7.6). If
the instruction in S0 accesses memory, then the instruction fetch is not overlapped, and the
instructions are shifted as usual (Figure 7.7).
If not all the instruction slots are filled, then the compiler must make sure that a PC@ is
compiled before the last instruction and fill the remaining slots with NOPs. It must also make
sure the built-in PC@ is never executed after a previous PC@ in the same word. For example,
if the last instruction would end up in S4, then it must be moved to S5 and a NOP placed in S4.
Such instruction reorderings usually lead to a one-cycle penalty in execution.
((a) ISR for Pipeline: the 32-bit word from MEM is divided into six 5-bit slots S0 to S5 plus two unused bits, with built-in PC@ and NOP slots; the 5-bit INST is taken from S0 and the L signal is derived from S1.)
Cycle Notes
Instruction 4 5 6 1
3 ID EX/MEM
4 ID EX Doesn’t use MEM
PC@ (L is set) MEM Executed concurrently
5 ID EX/MEM Loaded from S2
0 ID
Figure 7.6: Gullwing Instruction Fetch (with Overlap) Pipeline Diagram
14
S1 is overwritten by S2 so as to prevent two PC@ from being executed in sequence. The instruction in S2,
being the actual last instruction, can never be a PC@, else the situation described in the first paragraph occurs.
Cycle Notes
Instruction 4 5 6 7 1
3 ID EX/MEM
4 ID MEM Uses MEM
PC@ (L is set) ID MEM Executed sequentially
5 ID EX/MEM Loaded from S1
0 ID
Figure 7.7: Gullwing Instruction Fetch (without Overlap) Pipeline Diagram
Cycle Notes
Instruction 3 4 5 6 1
2 ID EX/MEM Data Hazard Slot
JMP0 ID MEM JMP0 stays in ID
JMP0 ID MEM
4 ID EX/MEM Branch Delay Slot
0 ID Branch Target
Figure 7.8: Gullwing Taken Jumps or Calls Pipeline Diagram
7.5 Summary and Performance Comparison
In Section 7.2, the comparison of benchmark statistics revealed these facts about Gullwing,
relative to DLX/MIPS:
• A greater number of literal fetches, subroutine calls, and stack permutations are executed.
• An average CPI of 1.31, which is poor compared to the average of 1.14 for a DLX.
However, Gullwing is actually architecturally equivalent to a DLX without load delay
slots or delayed branches, whose average CPI is 1.38.
• An average number of memory accesses per cycle of 0.667, compared to 1.421 for
DLX/MIPS (Section 7.2.3).
• An average code density of only 1.2 instructions per memory word, out of a potential
maximum of three, because most of the instruction slots remain unused.
In Section 7.3, a detailed inspection and analysis of equivalent programs which express fun-
damental code features uncovered these differences between Gullwing and a generic MIPS-32
computer:
• The random-access registers and explicit operands of the MIPS design are a definite
advantage when multiple values must be maintained at once in an algorithm. Gullwing
must instead execute additional instructions to manipulate the stacks to get to the value
it needs.
• The MIPS processor must simulate a stack in software for subroutine calls and recursive procedures. The extra instructions to implement this stack consume more memory bandwidth and processor cycles than the equivalent Gullwing code.
• The Gullwing processor requires less memory bandwidth, often half that of MIPS, re-
gardless of the number of cycles required for an algorithm.
In Section 7.4, the pipelined form of Gullwing is derived through incremental transformation
of the DLX pipeline. The result is a 2-stage pipeline composed of an Instruction Decode stage
followed by parallel Execute and Memory stages. Each stage is structurally no more complex
than any stage from the DLX pipeline, which implies that the cycle time will be similar. The
Gullwing pipeline exhibits data hazard and branch delay slots similar to those of the DLX pipeline. In
the worst case, these delays should increase the average CPI from 1.31 to 1.51.
In summary, a pipelined Gullwing processor would have a cycle time similar to that of a
DLX processor. However, with the exception of subroutine calls, Gullwing usually requires
a greater number of cycles to execute the same algorithms. It also suffers from a higher CPI
due to un-optimized load and branch delays. Therefore, to bring Gullwing up to the same
performance as DLX, the number of executed instructions and/or the average CPI must be
reduced (Sections 9.2.2 and 9.2.3).
Chapter 8
The density of Gullwing low-level code is very good. A 32-bit word holds six instruction slots.
However, instructions that require an in-line argument such as calls, jumps, and literal fetches,
also additionally use up one whole subsequent memory word1 . Including one such instruction
in a group, while keeping all slots filled, raises the memory usage to two words and thus halves
the code density to three. Adding another such instruction drops the density to two, and so on
until all six instructions require an in-line argument, with a resulting minimum code density of
six-sevenths (∼0.86). As the number of slots in a memory word increases, the minimum code
density increases towards unity.
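The pattern can be summarized in one expression (my notation, not the thesis's): with n slots per word and k of the packed instructions carrying an in-line argument, a full group occupies 1 + k memory words, giving

    code density = n / (1 + k),    k = 0, 1, ..., n,

which for n = 6 reproduces the values 6, 3, 2, ..., 6/7 ≈ 0.86 given above.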
This situation is unfortunately a narrow best case. It is only applicable when all instruction
slots can be filled. This is true for literal fetches since they do not alter the program flow, and
for conditional jumps since they simply continue with the next instruction if the condition is
not met. Calls and unconditional jumps always load a new instruction group2 . Memory is
word-addressed and groups always begin execution with the first slot; therefore, jumping or returning to the instruction slots after a call or jump is impossible, and execution must instead resume at the next memory word after the argument. Sequences of calls thus end up wasting most of
the available slots (Figure 8.1), bringing the minimum code density down to one-half.
Sequences of calls are typical of high-level code, where a procedure is primarily composed
of calls to other procedures. Sequences of jumps are rare, never executed sequentially, and are
not considered further. The actual usage of the instruction slots is listed in Appendix B.1.5.
The low ratio of filled instruction slots suggests that there is room for significant improvement
in code density.
(Figure 8.1: a sequence of CALL instructions, each followed by a memory word holding the address of its subroutine; the remaining instruction slots after each CALL are unreachable.)
8.1 Improving High-Level Code Density by Adding an Instruction Stack
The key to improving the density of high-level code is the observation that if the instructions
to be executed after a subroutine call are placed in the same memory word as the call, they
will be fetched along with the call and they should not need to be fetched again when the
subroutine returns. The instructions simply need to be temporarily stored in the processor
while the subroutine executes. The number of words that need to be stored is identical to the
current nesting depth of the program. This suggests extending the Instruction Shift Register
(ISR) with a stack that operates in synchrony with the Return Stack (RS). Figure 8.2 illustrates
the process. For clarity, only four slots are depicted and the Return Register (R) is omitted (see
Figure 6.1).
When a subroutine call is executed, the remaining instructions are pushed onto an Instruc-
tion Stack (IS) at the same time that the return address is pushed from the Program Counter
(PC) onto the RS. When the subroutine returns, the saved instructions are popped from the IS
and placed at the head of the ISR at the same time that the return address is popped from the RS
into the PC. The last slot in the ISR is filled with a Program Counter Fetch (PC@) instruction,
as during normal execution. The net effect is that the instructions following a call are now ex-
ecuted upon return from the subroutine. This makes it possible to increase the minimum code
density of high-level code back to six-sevenths (Figure 8.3).
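A rough C model of this lock-step behaviour (my own sketch; the depth, types, and names are illustrative and not taken from the thesis):

#include <stdint.h>

#define STACK_DEPTH 32   /* assumed depth, as for the other stacks */
#define SLOTS        6

/* The Return Stack and Instruction Stack move together: a call pushes the
   return address and the instruction slots remaining in the ISR; a return
   pops both, so the saved slots execute once the subroutine finishes. */
typedef struct {
    uint32_t return_address;        /* Return Stack entry      */
    uint8_t  saved_slots[SLOTS];    /* Instruction Stack entry */
} call_frame;

static call_frame frames[STACK_DEPTH];
static int depth = 0;

static void on_call(uint32_t *pc, uint8_t isr[SLOTS], uint32_t target)
{
    frames[depth].return_address = *pc;            /* push onto the RS */
    for (int i = 0; i < SLOTS; i++)                /* push onto the IS */
        frames[depth].saved_slots[i] = isr[i];
    depth++;
    *pc = target;
}

static void on_return(uint32_t *pc, uint8_t isr[SLOTS])
{
    depth--;
    *pc = frames[depth].return_address;            /* pop the RS into PC  */
    for (int i = 0; i < SLOTS; i++)                /* pop the IS into ISR */
        isr[i] = frames[depth].saved_slots[i];
    /* In the real ISR the last slot is then refilled with a PC@. */
}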
(Figure 8.2: (a) a CALL pushes the remaining ISR slots onto the IS as the PC is pushed onto the RS; (b) a RET pops them back, with a PC@ filling the last slot. Figure 8.3: the slots after a call thus become available.)
8.1.1 Side-Effects on Return Stack Manipulation
A consequence of the Instruction Stack (IS) is that the return addresses on the Return Stack
(RS) must always be matched with the corresponding instructions stored on the IS. Any offset
between the Instruction and Return stacks would mean that from that point onwards all re-
turns to calling procedures would execute a few random instructions before fetching the next
(correct) groups of instructions!
The >R (“To R”) and R> (“R From”) instructions move data back and forth between the top
of the Data Stack (TOP) and the RS. Hence they could leave the RS with a different number
of elements than the IS. To compensate, >R pushes a group of PC@ instructions onto the IS,
and R> pops the IS, discarding the instructions. If >R is used to push a return address onto
the RS, such as when the address of a function call is computed at runtime, then the next
subroutine return will execute the stored PC@ and forcibly fetch the first instruction group
of that procedure. If R> is used to discard a return address, possibly for some kinds of error
handling, the stored instructions for that procedure are also discarded. Figure 8.4 illustrates the
process.
It is still possible to cause incorrect execution with the unusual code sequence ’R> >R’
which would replace the stored instructions with a PC@. This would skip a few instructions
upon return unless the instruction slots after the corresponding call were deliberately left unused
(and thus filled with PC@ anyway). However, the need to inspect or alter the address of the
calling procedure’s caller is rather unusual. Similarly, unless there exists a means of loading
or storing the contents of the IS under program control, the RS cannot be saved to memory for
the purpose of context switching, debugging, or exception handling.
(Figure 8.4: the IS and RS during (a) >R and (b) R>.)
8.2 Implementation
The implementation of the code density optimization is straightforward, consisting mainly of
adding control lines to push and pop the IS as required. Algorithm 29 shows the changes re-
quired. Added controls are in bold, while removed controls are struck through. The instruction
fetch overlap optimizations (Section 6.3.1) are not included as they are orthogonal.
Silicon Area Adding a third stack increases the silicon area required by the processor. The
additional area consumed by the IS is similar to that of the Return Stack since it must be of the
same depth and slightly less wide.
The breakeven point between the additional area of the IS and the saved area in main
memory, given a uniform word width, is when the reduction R in the original size S of a
program, due to the addition of the IS mechanism, is equal to the depth D of the IS stack:
S − S(1 − R) = D.
For example, if R is taken to be the previously determined value of 31.5%, then S needs to
be 3.17 times larger than D for its size to be reduced by the same amount as the size of the IS.
Since a stack rarely needs to be more than 32 words deep (Section 5.5), S would only need to
be equal to 102 memory words to justify the additional area of the IS mechanism.
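Rearranging the break-even condition makes both figures explicit: S − S(1 − R) = S·R = D, so S = D/R; with R = 0.315 this gives S ≈ 3.17·D, and for D = 32 words, S ≈ 32/0.315 ≈ 102 words.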
For larger programs, the reduction in code size would greatly outweigh the area of the IS.
The size of the main memory can thus be correspondingly reduced 3 . The lowered total silicon
area (processor and memory) would especially benefit embedded systems.
Subroutine Overhead The use of the IS eliminates the need to fetch the remaining instruc-
tion slots after a call instruction. Algorithm 29 shows that the load in the eliminated second
cycle of the return instruction is no longer required because the remaining instruction slots are
now loaded from the IS during the first cycle. This reduces the overhead of calling and return-
ing from a subroutine to three cycles, down from four, and also reduces the associated memory
traffic by the same 25%.
3
This assumes that the stacks and main memory are implemented using the same memory technology.
Chapter 9
The first part of this thesis presented the historical origins of the first generation of stack computers and found that these machines were derived from Bauer’s stack principle for the allocation of storage in nested subroutines, later used in the specification of ALGOL and now seen as the call stack of C and other programming languages. The second generation of stack computers, by contrast, was based on Hamblin’s independently discovered stack principle, geared towards function composition instead of storage allocation. The English Electric KDF9, made commercially
available around 1963, was found to stand out as the first second-generation stack computer
and the only one until the NOVIX NC4016 in 1985. This gap, and the coincidence with the
appearance of RISC microprocessors, accounts for the obscurity of the second generation.
The second part of this thesis built upon the first by proposing a set of criteria to distinguish
first and second-generation stack computers. In summary, second generation stack computers
keep their stacks in the CPU instead of main memory, use the stacks for the evaluation and
composition of functions instead of procedure storage allocation, and use simple, RISC-like
instructions instead of complex microcoded operations. Chapter 5 then presented a rebuttal to
the influential arguments against stack architectures cited by Hennessy & Patterson and found
that they are not applicable to second-generation stack computers due to their different design
and to advances in hardware technology and compilers. The most telling finding is that some
modern processors, such as the Intel Pentium IV and the Digital Alpha AXP 21064, use a 16-
deep internal stack to cache return addresses in the same manner as second-generation stack
computers.
The third part of this thesis specified the design of a small second-generation stack ma-
chine, named ’Gullwing’. The first unusual feature found was that the packing of multiple
instructions per memory word, made possible by the zero-operand instruction format, reduced
the number of sequential instruction fetches to between 8.3% and 9.7% of the total number
of executed instructions despite achieving an average code density of only 1.2 instructions per
memory word. An additional simple optimization of the instruction fetch mechanism reduced
this fetching overhead by 50.6 to 77.3%, down to 4.1 to 2.2% of the total number of executed
instructions, effectively eliminating the need for a second memory bus dedicated to fetching
instructions.
Chapter 7 then compared Gullwing to the DLX and MIPS processors, via some aggregate
benchmarks and some low-level code comparisons. In Section 7.2, it was observed that 23.3
to 29.2% of the executed instructions on Gullwing are stack manipulation instructions whose
actions are implicit in the three-operand instruction format of MIPS. Similarly, Gullwing must
perform 10 to 16% more immediate loads since there are no operands to hold small constants.
The average CPI of Gullwing was between 1.301 and 1.312: a penalty of 13.1 to 18.2% over
DLX. However, Gullwing is architecturally equivalent to a DLX processor without delayed
branches and loads. Without these optimizations the CPI of DLX would have been 1.34 to
1.41 instead: 2.1 to 8.4% worse than Gullwing.
It was also found that MIPS performed an average of 1.421 memory accesses per cycle,
while Gullwing required only 0.667: a 53% reduction in memory bandwidth. For instruction
fetching alone, Gullwing required 55.9% fewer memory accesses per cycle on average (0.441
vs. 1.00 for MIPS). These improvements were found to originate in the packing of multiple
instructions per memory word: 71.1 to 80.9% of code basic blocks fit within a single memory
word on Gullwing.
Section 7.3 showed that Gullwing is at a disadvantage when multiple intermediate results
are required. A comparison of iterative code demonstrated that Gullwing required 25% more
memory space, 125% more instructions, 25% more memory accesses, and 150% more cycles,
compared to MIPS, to execute the iterative algorithm due to the need to shuffle values on and
off the Data Stack.
On the other hand, Gullwing exhibits extremely efficient function calls. The same algo-
rithm, implemented recursively on Gullwing, required 54% less memory space, 38% fewer
instructions, 59% fewer memory accesses, and 23% fewer cycles than the equivalent recur-
sive MIPS code due to the elimination of the instructions required to simulate a stack in main
memory.
When looking at pure nested subroutines, as would be the case with precompiled libraries,
Gullwing’s advantage at subroutine calls is amplified further: 58 to 67% less (code) memory
space, 71 to 75% fewer instructions, 62 to 72% fewer memory accesses, 50 to 52% fewer
cycles, and a 25 to 44% reduction in memory bandwidth, despite an average CPI between 1.67
and 2.00 and a code density between 0.60 and 0.86 instructions per word.
As originally specified, the Gullwing processor was not pipelined, and so its cycle time
would have exceeded that of the pipelined DLX processor. A pipelined form of Gullwing was
specified by transforming the stages of the DLX pipeline to Gullwing’s stack architecture. The
result is a two-stage pipeline with parallel EX and MEM stages, which has branch delay slots like
DLX, no load delay slots, and additional branch hazard slots due to the use of a stack instead of registers. In the worst case where these slots could not be used productively, the average CPI
of Gullwing would increase by 12.9 to 17.8%, to a range of 1.481 to 1.533. Assuming that the
ALU is the critical path of a simple pipeline, then the two-stage pipelined form of Gullwing
should have a similar cycle time to the five-stage DLX pipeline.
Finally, Chapter 8 proposed the use of a third stack to temporarily hold instructions during
subroutine calls. This Instruction Stack would maximize the density of high-level code with
many branches and calls, reducing the overall code size by a little over 30% (up to a theoretical
limit of 50%), and would reduce the memory traffic and cycle count overhead of calling and
returning from a subroutine by 25%.
9.1 Contributions
This thesis makes the following contributions:
1. a historical review of first-generation stack computers which uncovers the origins of the
conceptual difference between first and second-generation machines (Chapter 2);
3. a set of criteria to distinguish first and second-generation stack computers which expand
on those given by Feldman and Retter [FR93, pp.599-604] (Chapter 4);
4. a rebuttal of the arguments against stack computers cited by Hennessy and Patterson,
showing that they are applicable to the first generation of stack computers, but not the
second (Chapter 5);
8. the design of the Gullwing pipeline as a transformation of the DLX pipeline (Section
7.4);
9. the proposal of an instruction stack to maximize the code density and accelerate the
subroutine returns of Gullwing (Chapter 8).
9.2.1 Reducing the DLX/MIPS Subroutine Call Overhead by Adding Stacks
The MIPS architecture has very inefficient subroutine calls compared to Gullwing, stemming
from the need to save and restore registers to a stack in memory (Section 7.3.6). Conversely,
Gullwing has poor performance for iterative code because it lacks random-access registers. A
combination of both might yield the best of both worlds.
Figure 9.1 shows a conceptual modification to the MIPS register file which would create
the possibility of running stack-like code for efficient subroutine calls, while otherwise leaving
the instruction set, architecture, and pipeline unchanged. A Data Stack and a Return Stack are
added ’underneath’ some registers. The Return Stack goes under register $31 since it is where
the current return address is normally stored by the jal (Jump And Link) instruction. The
Data Stack could be placed under any other registers, but is placed here under registers $2 and
$1 for illustrative purposes. Two registers are used so that two operands can be taken from the
stack when needed. Register $0 is, of course, a source of zeroes on reads and a sink on writes.
If the stacks are disabled, the register file behaves as usual, and existing code runs un-
changed. When the stacks are enabled, a read from a register connected to a stack ($31, $2,
and $1) pops the content of the stack into the register, making the read operation destructive,
and a write to a register pushes the previous contents of the register onto the stack. There are two exceptions to this behaviour: when both $2 and $1 are read together, only the contents of $2 are overwritten by the stack; and when $1 is both read and written in the same instruction, it is simply overwritten, as if it had been popped and then pushed.
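As an illustration only, the read and write behaviour just described can be sketched in C. This is a minimal model under the assumptions above (the names, stack depth, and enable flag are hypothetical), and it omits the two special cases for $1 and $2 as well as any overflow or underflow handling:

/* Sketch only: a MIPS-style register file where $1, $2 and $31 are backed by stacks. */
#include <stdint.h>

#define DEPTH 32

typedef struct { uint32_t item[DEPTH]; int top; } stack_t;

static void     push(stack_t *s, uint32_t v) { s->item[++s->top] = v; }
static uint32_t pop(stack_t *s)              { return s->item[s->top--]; }

static uint32_t reg[32];       /* the ordinary register file                        */
static stack_t  ds, rs;        /* Data Stack under $1/$2, Return Stack under $31    */
static int      stacks_enabled;/* when 0, this behaves as a plain register file     */

/* Reading a stack-backed register is destructive: the register is refilled
   by popping the stack underneath it. */
uint32_t reg_read(int n)
{
    uint32_t v = reg[n];
    if (stacks_enabled) {
        if (n == 1 || n == 2) reg[n] = pop(&ds);
        else if (n == 31)     reg[n] = pop(&rs);
    }
    return v;
}

/* Writing a stack-backed register pushes its previous contents underneath it. */
void reg_write(int n, uint32_t v)
{
    if (n == 0) return;        /* $0 reads as zero and sinks writes */
    if (stacks_enabled) {
        if (n == 1 || n == 2) push(&ds, reg[n]);
        else if (n == 31)     push(&rs, reg[n]);
    }
    reg[n] = v;
}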
Given this rough sketch of stack and register interaction, Table 9.1 shows how a number
of stack computer instructions can be synthesized by suitable MIPS instructions. Small inline
constants and branch target labels are denoted by ’$Ln’.
The stack manipulation instructions, such as DROP and OVER, should never be required.
A compiler would use other registers as usual for the storage of counters and common subex-
pressions, thus avoiding the stack manipulation overhead of iterative code seen in Section 7.3.3,
and avoiding the RAW (Read After Write) dependency of the stack. Conversely, the stack could
be used to hold local variables and arguments to subroutines, which would reduce or outright
eliminate the loads and stores required for nested subroutine calls on MIPS (Section 7.3.6).
Algorithm 30 shows the recursive example from Section 7.3.4 implemented using stacks.
Tables 9.2 and 9.3 show that, relative to the original MIPS code, the loads and stores are
eliminated, the algorithm uses 38% fewer cycles, and memory bandwidth is reduced by 24%.
A similar register and stack mechanism was proposed by Richard L. Sites [Sit78], with the
primary intent of simplifying expression evaluation. He observed that a stack placed under one
of the registers allowed the evaluation of an expression using fewer actual registers. In current
systems, this could make possible smaller register files, or give a compiler more room to do
register allocation.
(Figure 9.1: conceptual MIPS register file with an attached Return Stack (RS) and Data Stack (DS).)
Stack   MIPS                    Stack   MIPS
CALL    jal $Ln                 NOT     xor $1,$1,-1
RET     j $31                   AND     and $1,$1,$2
JMP     b $Ln                   +       add $1,$1,$2
JMP0    beq $1,$0,$Ln           DROP    add $0,$0,$1
LIT     addi $1,$0,$Ln          SWAP    add $1,$2,$0
@       lw $1,($30)             PUSH    add $1,$30,$0
!       sw $1,($30)             POP     add $30,$1,$0
Table 9.1: Synthesized Stack Operations on MIPS with Stacks
9.2.2 Reducing Gullwing’s Instruction Count with Compound Stack Operations
There has been a hidden assumption throughout this thesis about which operations can be
performed on a stack while avoiding random access to the elements it contains. The commonly
given model of stack behaviour assumes that a dyadic operation pops two items from the stack
in sequence, stores them in some working registers, operates upon the items, and then pushes
the (usually single-item) result onto the stack.
This process is usually optimized to a single step by making the visible top of the stack into
a separate register, thus allowing two items to be read simultaneously and the result to be stored
into the register while the stack proper simply discards its topmost element. This mechanism
is illustrated by the Data Stack in Figure 6.1.
However, the capacity to see the two topmost items creates the possibility of more complex
operations without adding cycles, data lines, or control lines, and thus without most of the
pitfalls of past, high-level instruction sets.
For example, addition on a stack destroys both its arguments. This brings about the use of the OVER instruction, which preserves the second element by pushing a copy of it onto the stack to be added to the first. This is seen in the iterative and tail-recursive code examples (Sections 7.3.3 and 7.3.5). Similarly, both arguments can be preserved by first copying them both with two consecutive OVER instructions.
By simply not popping the stack proper when performing an addition, the ’OVER +’ se-
quence is accomplished in a single instruction. Similarly, by merging the addition operation
into the implementation of the OVER instruction, the sequence ’OVER OVER +’ reduces to
a single instruction. The Gullwing implementation of these new operations is shown by the
first two lines of Algorithm 31. Removed actions are struck out, while added ones are bolded.
Similarly, the last three lines show the implementation of a non-destructive JMP0 instruction,
replacing the common ’DUP JMP0’ sequence, where the stack is simply not manipulated.
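As a rough illustration only (not the Gullwing RTL; the names are hypothetical), the datapath change can be modelled in C. The adder inputs stay the same in all three cases; only what happens to the stack proper differs between the plain ’+’, the folded ’OVER +’, and the folded ’OVER OVER +’ forms:

/* Sketch of a data stack with a visible TOP register and a stack proper. */
#include <stdint.h>

#define DEPTH 32

typedef struct {
    int32_t top;          /* visible top-of-stack register (TOP)                */
    int32_t body[DEPTH];  /* the stack proper; body[sp] is the second element   */
    int     sp;
} dstack_t;

/* Plain '+': consumes both arguments (the stack proper is popped). */
void op_add(dstack_t *s)
{
    s->top = s->top + s->body[s->sp];
    s->sp--;                              /* second element discarded */
}

/* Compound 'OVER +': same adder, but the stack proper is not popped,
   so the second element survives. */
void op_over_add(dstack_t *s)
{
    s->top = s->top + s->body[s->sp];     /* sp unchanged */
}

/* Compound 'OVER OVER +': the addition is merged into OVER's push,
   so both arguments survive beneath the sum. */
void op_over_over_add(dstack_t *s)
{
    int32_t sum = s->top + s->body[s->sp];
    s->body[++s->sp] = s->top;            /* old TOP moves into the stack proper */
    s->top = sum;                         /* result lands in TOP */
}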
To provide enough opcodes to support these compound operations, the width of the Gull-
wing opcodes would have to be increased to six bits. While this is still comparable to MIPS,
further study is needed to make sure that at least one compound operation would occur on
average per executed group of instructions, thus balancing out the reduction in the number of instructions per group from six to five, and still saving one cycle compared to the execution of
the original instruction group.
9.2.3 Reducing Gullwing’s CPI by Executing Multiple Instructions using Generalized Instruction Folding
Section 6.3.1 describes a change to the decoding of instructions which allows the overlapping
of the execution of the PC@ (’PC Fetch’) instruction with that of any instruction that does not
access memory, retiring both instructions at once. This mechanism could be generalized to a
greater number of pairs of instructions.
For example, the ’OVER +’ and ’DUP JMP0’ compound operations described in Section
9.2.2 could be implemented, without adding new opcodes, by decoding both the current and
the next instruction and then shifting them both out. This feature widens the instruction input
to the decoder and might present a time or area penalty since the instruction encoding is dense.1
Loads and stores can especially benefit from the decoding of two instructions at once.
Instead of holding steady for two cycles the instruction input to the decoder while the load or
store executes its two phases, the first phase can be executed when the instruction enters the
first part of the two-instruction window of the improved decoder, and the second phase when
the instruction enters the second part of the window. Algorithm 32 shows how the instruction
sequence ’DROP A@+ R@+ XOR RET’2 would be executed: The first line shows the first
phase of A@+ folded with the execution of DROP. The second line contains the overlapped
second phase of A@+ and first phase of R@+. The third line does not overlap the execution of
XOR with the second phase of R@+ since the load must complete before XOR can operate on
the loaded value. Finally, the fourth and fifth lines show how XOR and RET are overlapped.
The execution of this five-instruction code sequence now takes five cycles instead of eight.
Its CPI went from 1.60 to 1.00. The number of memory accesses per cycle increased from 0.75
to 1.00. The performance is now on par with that of non-branching code on MIPS.
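For reference, the unfolded total decomposes into one cycle each for DROP and XOR and two cycles each for the two loads and the return: 1 + 2 + 2 + 1 + 2 = 8 cycles, which the folding just described compresses into five.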
The data dependency of the ’R@+ XOR’ sequence could be avoided by enabling the ALU to use MEM as an input. Similarly, the ’LIT JMP0’ sequence could be overlapped if an additional zero-detect circuit were connected to MEM, and ’>A @A’ would benefit from a direct path between TOP and MAR. This forwarding of data, equivalent to the result forwarding of the MIPS processor, does add multiplexers to the data path, and so the cycle-time versus cycle-count trade-off should be considered carefully.
1 By comparison, MIPS R-type instructions use a total of twelve opcode bits, but do not encode 2^12 unique instructions. The encoding must thus be sparse and simple to decode.
2 A simple hypothetical comparison routine based on COMPARE_STRINGS from Appendix A.1.3.
Appendix A
The appendix provides the source code for the benchmarks described in Section 7.1.
HERE Contains the address of the memory location that is the current target for compilation.
HERE_NEXT Contains the address of the next memory word where code is to be compiled.
SLOT Contains the bitmask which defines the current available instruction slot in the mem-
ory word pointed-to by HERE.
THERE Contains the address of the top of the function name dictionary. Also the pointer to
the bottom of the input buffer.
NAME_END Contains the address of the end of the function name dictionary. It is used to
detect the failure of a dictionary search.
INPUT Contains the address of the top of the input buffer, which is the beginning of the most
recently received string.
Start of Memory
Function Code
<- HERE
<- HERE_NEXT
Free Memory
<- INPUT
Input Buffer
<- THERE
Name Dictionary
<- NAME_END
Figure A.1: Flight Language Kernel Memory Map
1 Comparing a string to itself is a corner case not dealt with here.
A.1.2 Utility Functions
These are small functions which synthesize operations not implemented by the Gullwing pro-
cessor. They are shown here as subroutines but are normally compiled in-line due to their
size.
MINUS Negates the number on the top of the Data Stack before adding it to the next number.
OR Performs the bit-wise logical OR of the top two numbers on the Data Stack.
STRING_TAIL Returns the address of the tail of a string. Takes the head address of the
string. The tail contains an address used for referencing code or terminating string compar-
isons.
PUSH_STRING Alters INPUT to allocate the space for a counted string. Sets the tail of the
string to point to its own address. Takes the length of the string from the Data Stack.
POP_STRING Alters INPUT to discard the most recently SCAN’ed string.
For comparison, this alternate version keeps a copy of the address of INPUT on the stack. It
avoids a literal fetch and a function call, but is harder to follow. An optimizing compiler might
generate code like this.
COMPARE_STRINGS Takes the head address of two strings. Returns the addresses (in the
same order) of the first non-matching pair of symbols. Note that the practice of tailing each string with the address of the tail guarantees that the comparison will terminate (at the tail).
>A >R
LOOP:
R@+ A@+ XOR
JMP0 LOOP
LIT -1 R> +
LIT -1 A> + RET
READ1 It is presumed here that READ1 always returns a memory word. The implementation depends on the interface to the outside world. Here I assume a memory port for illustration purposes.
SCAN_STRING Reads a string into a pushed string entry.
SCAN Reads in a string from the outside world. The input is a counted string, where the first memory word contains the number of following memory words used by the string, regardless of symbol encoding. The string is always read in; there is no check.
The location after the string is kept free to hold its own address (which is [THERE]) as a
termination marker for LOOK, or an actual code address should it become a name entry.
To see if there is enough free space, the sender of the string can check in advance if the
result of "INPUT HERE -" is greater than the string length. The check depends on the fact that
HERE points to an address that begins at 0 and increments, while INPUT points to a location
that begins at THERE and decrements. THERE begins at the end of memory and decrements
towards zero. If they meet, then there is no more free memory. Thus if the start address is
less than the one HERE points at, there is no room for the string (or for anything else at all!).
CALL READ1
CALL PUSH_STRING
JMP SCAN_STRING
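As a side illustration only (hypothetical C, not part of the kernel), the advance free-space check described above amounts to comparing the gap between the two growing regions against the incoming length:

/* Code grows upward from HERE; the input buffer grows downward from INPUT.
   "INPUT HERE -" must exceed the string length for the string to fit. */
#include <stdint.h>

int room_for_string(uint32_t here, uint32_t input, uint32_t length)
{
    return (input - here) > length;   /* true when the string still fits between the regions */
}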
// uses the DUPed name entry address
DROP
CALL STRING_TAIL
// point to start of next name entry
LIT 1 +
// are we at end of name dict?
DUP LIT [address of NAME_END] >A A@ XOR JMP0 NOMATCH
DUP JMP LOOP
MATCH:
// return code address for name entry
>A DROP A@ RET
NOMATCH:
// return address of end of dict
DROP LIT [address of NAME_END] >A A@ RET
LAST_SLOT Sets up SLOT at the fifth instruction slot, so that the next compilation will occur at the zeroth slot.
ALIGN Makes HERE point to the next free location, so we don’t point a name at the tail
end of the previous procedure or clobber literals. Makes HERE_NEXT point to the following location. Marks HERE as empty with NULL_SLOT.
// update HERE
LIT [address of HERE] >A A!
JMP NULL_SLOT
NEXT_SLOT Points to the first slot if HERE is empty (NULL_SLOT), else points to the next free slot at HERE, else ALIGNs.
DEFN_AS Takes the ALIGNed address of a code entry (usually [HERE]) and the address of
a SCANed string (from INPUT) and converts it to a name entry. Updates THERE.
The name string must be the only one in the input buffer stack as it is converted in place,
so INPUT must be pointing to it and its tail must be at THERE. Because of this, INPUT does
not need to be changed.
CALL ALIGN
// get [HERE]
LIT [address of HERE] >A A@ RET
CALL NEW_WORD
JMP DEFN_AS
A.1.7 Compilation Functions
These functions enable the kernel to compile the Gullwing opcodes. By convention, functions
that compile a Gullwing opcode or inline another function have parentheses around their name.
COMPILE_OPCODE Takes an opcode (in slot 0) and compiles it at the next empty HERE/SLOT
location. Assumes the current slot is full. Leaves SLOT pointing to next full slot.
This function fails for the PC@ instruction, since its opcode is all zeroes. It shouldn’t be
necessary to compile it explicitly, as that’s what ALIGN does when it zeroes the memory word.
Note the addition instead of logical OR at the end. We can do this since empty slots are filled
with PC@ opcodes (zeroes).
CALL NEXT_SLOT
// get slot mask and invert
>R LIT [address of SLOT] >A A@ ~
//place opcode on top
R>
// slot 0
OVER OVER AND JMP0 COMPILE
// shift opcode by one slot
2* 2* 2* 2* 2*
// slot 1
OVER OVER AND JMP0 COMPILE
2* 2* 2* 2* 2*
// slot 2
OVER OVER AND JMP0 COMPILE
2* 2* 2* 2* 2*
// slot 3
OVER OVER AND JMP0 COMPILE
2* 2* 2* 2* 2*
// slot 4
OVER OVER AND JMP0 COMPILE
2* 2* 2* 2* 2*
// then we *have* to be in slot 5
COMPILE:
LIT [address of HERE] >A A@
// compile opcode, drop mask, return
>A A@ + A! DROP RET
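The shift-and-add packing above can be restated as a rough C sketch (illustrative names, assuming the 5-bit opcodes and six slots per 32-bit memory word described earlier). The addition works only because an empty slot already holds the all-zero PC@ opcode:

#include <stdint.h>

/* Merge a 5-bit opcode into the given slot of a memory word. */
uint32_t pack_opcode(uint32_t word, unsigned slot, uint32_t opcode)
{
    return word + ((opcode & 0x1F) << (5 * slot));   /* '+' acts as OR: the slot is zero */
}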
COMPILE_LITERAL Takes a procedure address or literal (for CALL, JMP, JMP0, or LIT)
and compiles it at HERE_NEXT. Increments HERE_NEXT.
LIT [address of HERE_NEXT] >A A@
// store address/literal and increment
>A A!+
// update HERE_NEXT
A> LIT [address of HERE_NEXT] >A A! RET
CMPC, (CALL) Takes the address of a procedure and compiles a call to it. Then aligns to the
next free word since later slots in current word will never execute.
LIT [opcode for CALL]
CALL COMPILE_OPCODE
CALL COMPILE_LITERAL
JMP ALIGN
CMPJ, (JMP)
LIT [opcode for JMP]
CALL COMPILE_OPCODE
CALL COMPILE_LITERAL
JMP ALIGN
CMP0, (JMP0) If JMP0 is not taken, the next opcode in the same memory word runs instead,
unlike CALL and JMP.
LIT [opcode for JMP0]
CALL COMPILE_OPCODE
JMP COMPILE_LITERAL
CMP+, (JMP+)
LIT [opcode for JMP+]
CALL COMPILE_OPCODE
JMP COMPILE_LITERAL
A.1.8 Inline Compilation
Copying a word slot-by-slot requires decompiling it instruction by instruction and recompiling
it at the new location. This requires reimplementing the compilation words ’in reverse’. This
is complicated and ugly. Simply ALIGNing and copying existing code is no better since the
words that will be worthwhile to inline are the shortest, and so many slots will get proportion-
ally wasted. The extreme case is when inlining machine instructions: you would end up with
one opcode per word! So that’s the key: we know in advance which words will need to be
inlined, and so we write a function which, when run, compiles the code in situ. It’s a macro!
The kernel contains a full set of inlining functions for the Gullwing opcodes.
EXECUTE Synthesizes a function call or jump using an address on the top of the Data Stack.
If called: Takes a procedure address and calls to it. If inlined or jumped-to: Takes a procedure
address and jumps to it.
NXEC This is the user-interface loop. Reads in a string, looks up the function address, and
calls to it. In effect, the kernel does nothing but execute all commands given to it.
CALL SCAN
CALL LOOK
CALL POP_STRING
CALL EXECUTE
JMP NXEC
EXEC Reads in the address of a function and calls to it. It is an alternate main loop meant
to be used by communicating instances of the kernel since it is more efficient to pass function
addresses than names.
CALL READ1
CALL EXECUTE
JMP EXEC
A.1.10 Decimal to Binary Conversion
By design, the encoding of strings is irrelevant. However, numbers cannot avoid a predefined
decimal encoding. Ideally, this would be UNICODE, but I settled for decimal ASCII numerals
for now (0 -> 48,...9 -> 57), one per memory word. Wasteful, but simple.
NUMI Takes the string address of an unsigned decimal number and returns the corresponding
integer on the stack. No error or overflow checking!
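For illustration only, the conversion NUMI performs is the usual multiply-by-ten-and-add loop over the ASCII digits. A hypothetical C equivalent (not the kernel code, which appears in Section A.3.3) might look like:

#include <stdint.h>

/* words[0] holds the digit count; words[1..count] hold one ASCII digit each. */
uint32_t numi(const uint32_t *words)
{
    uint32_t count = words[0];
    uint32_t value = 0;
    for (uint32_t i = 1; i <= count; i++)
        value = value * 10 + (words[i] - 48);   /* '0' is 48; no error checking */
    return value;
}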
• Functions which compile code have their names between parentheses ’()’.
• The use of strings in the input buffer is denoted by a dollar sign ’$’.
• Moving data, input, and output is denoted by angle brackets ’<’ and ’>’.
The Flight language evolves very quickly at the beginning. The first few functions are used
throughout the latter code and must be understood before proceeding further.
alias CMPJMPZERO (JMP0)
alias CMPJMPPLUS (JMP+)
alias CMPCALL (CALL)
alias CMPRET ;
alias NUMC #
alias NUMI $n
alias LOOK $l
alias POP_STRING $pop
alias DEFN $:
alias SCAN >$
alias WRITE1 #>
alias READ1 >#
alias TENSTAR 10*
alias CMPLOADA (A@)
alias CMPSTOREA (A!)
alias CMPLOADAPLUS (A@+)
alias CMPSTOREAPLUS (A!+)
alias CMPLOADRPLUS (R@+)
alias CMPSTORERPLUS (R!+)
alias CMPXOR (XOR)
alias CMPAND (AND)
alias CMPNOT (NOT)
alias CMPTWOSTAR (2*)
alias CMPTWOSLASH (2/)
alias CMPPLUS (+)
alias CMPPLUSSTAR (+*)
alias CMPDUP (DUP)
alias CMPDROP (DROP)
alias CMPOVER (OVER)
alias CMPTOR (>R)
alias CMPRFROM (R>)
alias CMPTOA (>A)
alias CMPAFROM (A>)
alias CMPNOP (NOP)
A.2.2 Interactively Usable Opcodes
These functions create interpreted versions of some opcodes to interactively manipulate mem-
ory and the Data Stack. Opcodes which manipulate the A register and the Return Stack cannot
be interpreted since the interpretation process alters their contents.
: @ (>A) (A@) ;
: ! (>A) (A!) ;
: XOR (XOR) ;
: AND (AND) ;
: NOT (NOT) ;
: OR (OVER) (NOT) (AND) (XOR) ;
: 2* (2*) ;
: 2/ (2/) ;
: + (+) ;
: +* (+*) ;
: DUP (DUP) ;
: DROP (DROP) ;
: OVER (OVER) ;
: NOP (NOP) ;
// Same, and compile as literal
: -n# c -n c # ;
// Synthesize subtraction
: - (negate) (+) ;
: (-) c (negate) c (+) ;
: \a n# 7 n# 1 c #> j #>
: \b n# 8 n# 1 c #> j #>
: \t n# 9 n# 1 c #> j #>
: \n n# 10 n# 1 c #> j #>
: \v n# 11 n# 1 c #> j #>
: \f n# 12 n# 1 c #> j #>
: \r n# 13 n# 1 c #> j #>
: \s n# 32 n# 1 c #> j #>
: if-
n# 0 c (JMP+)
l# HERE_NEXT (@)
(N-) 1 ;
// Backpatch the jump to target
// the next memory word
: else
(>R) c NEW_WORD (R>)
(>A) (A!) ;
: max
(OVER) (OVER) (-)
if-
(>R) (DROP) (R>) ;
else
(DROP) ;
: min
(OVER) (OVER) (-)
if-
(DROP) ;
else
(>R) (DROP) (R>) ;
: abs
(DUP) if- (negate) ; else ;
: <=
(-) (DUP)
if-
(DROP) -n# 1 ;
else
if
n# 0 ;
else
-n# 1 ;
: >=
(-) (DUP)
if-
(DROP) -n# 0 ;
else
if
n# 0 ;
else
-n# 1 ;
A.2.6 Code Memory Allocation
This function allocates a zeroed-out span of memory in the code area, usually for static storage
of data.
: allot
(DUP)
if
c ALIGN
(N-) 1
j allot
else
(DROP) ;
// Output a string, given its address
: cs>
(DUP) (@) (+) (N+) 1
(A>) (>R)
: cs>-loop (R>) (OVER) (OVER) (XOR)
if (>R) (R@+) c #> j cs>-loop
else (DROP) (DROP) ;
: match?
l# INPUT (@) c STRING_TAIL (XOR) ;
: end?
l NAME_END @ # (XOR) ;
: erase
(DUP) l# THERE (!)
(DUP) (N+) 1 l# INPUT (!)
(@) (DUP)
(N-) 1 l# HERE (!)
l# HERE_NEXT (!)
j ALIGN
: forget
c >$
l# THERE (@) (N+) 1 (DUP)
: forget-loop
l# INPUT (@)
c COMPARE_STRINGS
c match?
if (DROP) c STRING_TAIL (N+) 1 (DUP) c end?
if
(DUP) j forget-loop
else
(DROP) c POP_STRING ;
else
c erase (DROP) ;
: 15x15
(>R)
(2*) (2*) (2*) (2*)
(2*) (2*) (2*) (2*)
(2*) (2*) (2*) (2*)
(2*) (2*) (2*)
(R>)
(+*) (2/) (+*) (2/) (+*) (2/) (+*) (2/)
(+*) (2/) (+*) (2/) (+*) (2/) (+*) (2/)
(+*) (2/) (+*) (2/) (+*) (2/) (+*) (2/)
(+*) (2/) (+*) (2/) (+*) (2/) (+*)
(>R) (DROP) (R>) ;
: U/
c divby0check
n# 0 (>A)
: U/-loop
(OVER) (>R) (DUP) (R>) c <=
if
(DUP) (>R) (-) (R>)
(A>) (N+) 1 (>A)
j U/-loop
else
(DROP) (A>) ;
: numstrlen
c abs (DUP) (N-) 10
if- (DROP) n# 1 ; else (DUP) (N-) 100
if- (DROP) n# 2 ; else (DUP) (N-) 1000
if- (DROP) n# 3 ; else (DUP) (N-) 10000
if- (DROP) n# 4 ; else (DUP) (N-) 100000
if- (DROP) n# 5 ; else (DUP) (N-) 1000000
if- (DROP) n# 6 ; else (DUP) (N-) 10000000
if- (DROP) n# 7 ; else (DUP) (N-) 100000000
if- (DROP) n# 8 ; else (DUP) (N-) 1000000000
if- (DROP) n# 9 ; else (DROP) n# 10 ;
A.2.11 Simple Fibonacci Examples
// Given two Fibonacci numbers,
// compute the next pair
// eg: 1 1 -> 1 2 -> 2 3 ...
: 1fib
(OVER) (>R) (+) (R>) ;
: manyfibs
c 1fib c 1fib c 1fib j 1fib
A.2.12 Static Variables
This code creates the ability to store data in the code area at the time a function is created.
It is equivalent to C language static variables. This is used to implement named variables as
functions which return the address of the associated storage.
// Output: 1 3 2
pass
A.2.13 Accumulator Generator
This is a simple example of generating code with an initial argument that becomes static data.
// Factored out Fibonacci code
: memfib
(OVER) (@) (OVER) (@) (+) (>R)
(OVER) (@) (OVER) (!) (DROP)
(>A) (R>) (DUP) (A!) ;
n 0 n 1 fibgen2 fibonacci
// Output: 1 2 3 5 8 13 21 34
fibonacci #$> \t
fibonacci #$> \t
fibonacci #$> \t
fibonacci #$> \t
fibonacci #$> \t
fibonacci #$> \t
fibonacci #$> \t
fibonacci #$> \t \n
n 3 caesargen encode
-n 3 caesargen decode
// Input: ABCD
// Output: ABCD DEFG ABCD
>$ ABCD
// Print the string
l INPUT @ DUP cs> \t
DUP map1 encode
// Print the ciphered version
DUP cs> \t
// Print the deciphered version
DUP map1 decode
$> \t \n
A.2.16 Higher-Order Function (Map)
This is an example of a mapped function generator.
// Takes an integer and the function name of
// a function that alters memory.
// Compiles code which applies the given
// function to each location in a string
// at the interval provided by the integer
: mapgen
c (DUP) c (>R) l# STRING_TAIL c (CALL) c (R>)
n 1 # c # c (+)
c NEW_WORD
c (DUP) c l c (CALL) (>R) c # (R>) c (+)
c (OVER) c (OVER) c (XOR)
c if (>R) c (JMP) (R>)
c else c (DROP) c (DROP) c ; ;
n 5 caesargen encode1
-n 5 caesargen decode1
// Input: lmnopq
// Output: lmnopq qmsouq lhnjpl
>$ lmnopq
l INPUT @ DUP cs> \t
DUP cipher
DUP cs> \t
DUP decipher
$> \t \n
A.3 Virtual Machine
The virtual machine is an emulation of the Gullwing hardware. The opcodes are emulated
directly on the hardware if possible, and their memory accesses are bounds-checked.
A.3.1 VM
// Define an 8kB memory for the VM
: MEMSIZE n# 8192 ;
: MEM MEMSIZE var ;
: OPCODEWIDTH n# 5 ;
: OPCODEMASK n# 31 ;
: MEMHEAD MEM # ;
: MEMTAIL MEM_OUTPUT # ;
: (check_low)
MEMHEAD negate # c # c (+) ;
: (check_high)
c (negate) MEMTAIL # c # c (+) ;
: mem_in_range?
(DUP) (check_low) (OVER) (check_high) (OR) ;
: mem_access_msg
create >$ ILLEGAL_MEMORY_ACCESS: $>c
does j cs>
: report_mem_error
c mem_access_msg
c \s c #$> j \n
: access_check
c mem_in_range?
if-
c report_mem_error
j errcontext
else ;
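Stated as a rough C sketch (illustrative only; the name is hypothetical), the bounds check works by OR-ing the two signed differences so that the answer ends up in the sign bit:

#include <stdint.h>

/* An address is out of range exactly when (addr - MEMHEAD) or (MEMTAIL - addr)
   is negative; OR-ing the two differences merges their sign bits. */
int mem_out_of_range(int32_t addr, int32_t memhead, int32_t memtail)
{
    return ((addr - memhead) | (memtail - addr)) < 0;
}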
: PCFETCHopcode n# 0 ;
: CALLopcode n# 1 ;
: RETopcode n# 2 ;
: JMPopcode n# 3 ;
: JMPZEROopcode n# 4 ;
: JMPPLUSopcode n# 5 ;
: LOADAopcode n# 6 ;
: STOREAopcode n# 7 ;
: LOADAPLUSopcode n# 8 ;
: STOREAPLUSopcode n# 9 ;
: LOADRPLUSopcode n# 10 ;
: STORERPLUSopcode n# 11 ;
: LITopcode n# 12 ;
: UND0opcode n# 13 ;
: UND1opcode n# 14 ;
: UND2opcode n# 15 ;
: XORopcode n# 16 ;
: ANDopcode n# 17 ;
: NOTopcode n# 18 ;
: TWOSTARopcode n# 19 ;
: TWOSLASHopcode n# 20 ;
: PLUSopcode n# 21 ;
: PLUSSTARopcode n# 22 ;
: DUPopcode n# 23 ;
: DROPopcode n# 24 ;
: OVERopcode n# 25 ;
: TORopcode n# 26 ;
: RFROMopcode n# 27 ;
: TOAopcode n# 28 ;
: AFROMopcode n# 29 ;
: NOPopcode n# 30 ;
: UND3opcode n# 31 ;
: do_pcfetch
PCREG # (@)
c access_check (DUP)
(N+) 1 (A!)
(@) ISRREG # (!) ;
: do_call
// Move run_vm return address
(R>)
PCREG # (@)
(DUP) (N+) 1 (>R)
(@) c access_check (DUP)
(N+) 1 PCREG # (!)
(@) ISRREG # (!)
// Restore run_vm return address
(>R) ;
: do_ret
(R>)
(R>) c access_check (DUP)
(N+) 1 PCREG # (!)
(@) ISRREG # (!)
(>R) ;
: do_jmp
PCREG # (@) (@)
c access_check (DUP)
(N+) 1 PCREG # (!)
(@) ISRREG # (!) ;
: do_jmpzero
if
PCREG # (@) (N+) 1 (A!) ;
else
j do_jmp
: do_jmpplus
if-
PCREG # (@) (N+) 1 (A!) ;
else
j do_jmp
: do_loada
AREG # (@)
(DUP) MEM_INPUT # (XOR)
if
c access_check (@) ;
else
(DROP) j >#
: do_storea
AREG # (@)
(DUP) MEM_OUTPUT # (XOR)
if
c access_check (!) ;
else
(DROP) j #>
: do_loadaplus
AREG # (@)
(DUP) MEM_INPUT # (XOR)
if
c access_check
(>A) (A@+) (A>)
AREG # (!) ;
else
(N+) 1 AREG # (!) j >#
: do_storeaplus
AREG # (@)
(DUP) MEM_OUTPUT # (XOR)
if
c access_check
(>A) (A!+) (A>)
AREG # (!) ;
else
(N+) 1 AREG # (!) j #>
: do_loadrplus
(R>) (>A)
(R>)
(DUP) MEM_INPUT # (XOR)
if
c access_check
(>R) (R@+)
(A>) (>R) ;
else
(N+) 1 (>R)
(A>) (>R)
j >#
: do_storerplus
(R>) (>A)
(R>)
(DUP) MEM_OUTPUT # (XOR)
if
c access_check
(>R) (R!+)
(A>) (>R) ;
else
(N+) 1 (>R)
(A>) (>R)
j #>
: do_lit
PCREG # (@)
(DUP) (N+) 1 (A!)
(@) ;
: do_und n 31 COMPILE_OPCODE ;
: do_xor (XOR) ;
: do_and (AND) ;
: do_not (NOT) ;
: do_twostar (2*) ;
: do_twoslash (2/) ;
: do_plus (+) ;
: do_plusstar (+*) ;
: do_dup (DUP) ;
: do_drop (DROP) ;
: do_over (OVER) ;
: do_tor
(R>) (>A)
(>R)
(A>) (>R) ;
: do_rfrom
(R>) (>A)
(R>)
(A>) (>R) ;
: ,
c #>c c ALIGN ;
: &>
c l c , ;
: (shift_isr)
c (2/) c (2/) c (2/) c (2/) c (2/) ;
: (extract_instruction)
OPCODEMASK # c # c (AND) ;
: do_next_instruction
ISRREG # (@)
(DUP) (shift_isr) (A!)
(extract_instruction)
instruction_call_table # (+) (@)
(>R) ;
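A hypothetical C rendering of this dispatch step (a sketch, not the VM's code; the function name is invented, and a plain logical shift stands in for the five 2/ operations) may make the shift-and-mask structure clearer:

#include <stdint.h>

#define OPCODE_WIDTH 5
#define OPCODE_MASK  31u

typedef void (*op_fn)(void);
extern op_fn instruction_call_table[32];    /* one handler per 5-bit opcode */

static uint32_t isrreg;                     /* instruction shift register of the emulated CPU */

void vm_step(void)
{
    uint32_t opcode = isrreg & OPCODE_MASK; /* (extract_instruction)        */
    isrreg >>= OPCODE_WIDTH;                /* (shift_isr): five 2/ shifts  */
    instruction_call_table[opcode]();       /* call the opcode's handler    */
}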
A.3.2 Metacompiler
The metacompiler saves and restores the internal state of the language kernel. This allows
redirecting the operation of the kernel to a different memory area. In this case, it is used to
direct compilation and execution of code to the previously defined Virtual Machine memory
area.
: nm_there n 1 var ;
: nm_slot n 1 var ;
: nm_input n 1 var ;
: nm_name_end n 1 var ;
// Flight Language Kernel main loop while in VM.
// (Compare with NXEC)
// Lookup names in VM dictionary, and if not found,
// repeat in native machine dictionary.
: vm_nxec
c SCAN
c LOOK
(DUP) l# NAME_END (@) (XOR)
// If found in VM memory
if
c POP_STRING
c EXECUTE
j vm_nxec
else
// Else find in native machine memory
// but execute with kernel pointed at VM memory
(DROP)
c save_vm_there
c save_vm_name_end
c restore_nm_there
c restore_nm_name_end
c LOOK
c save_nm_there
c save_nm_name_end
c restore_vm_there
c restore_vm_name_end
c POP_STRING
c EXECUTE
j vm_nxec
: init_name_dict
create >$ zeroword $>c
does c c>$ n# 0 c DEFN_AS ;
c save_nm_here
c save_nm_here_next
c save_nm_slot
c save_nm_there
c save_nm_input
c save_nm_name_end
c restore_vm_here
c restore_vm_here_next
c restore_vm_slot
c restore_vm_there
c restore_vm_input
c restore_vm_name_end
c init_name_dict
// Change the main loop
(unwind)
j vm_nxec
A.3.3 Self-hosted Kernel
This is a reimplementation of the Flight language kernel, written in the language it defines,
using all the extensions previously compiled. It compiles to the same machine code as the original kernel implementation (Section A.1), but is compiled into the Virtual Machine memory area instead.
VM(
: HERE
: HERE_NEXT
: THERE
: SLOT
: INPUT
: NAME_END
: PUSH_STRING
(DUP) n# 1 (+) (NOT)
l# INPUT (@) (+) (DUP) (A!)
(!)
l# INPUT (@)
c STRING_TAIL
(DUP) (!) ;
: POP_STRING
l# INPUT (@)
c STRING_TAIL
n# 1 (+)
l# INPUT (!) ;
: COMPARE_STRINGS
(>A) (>R)
: CS_LOOP
(R@+) (A@+) (XOR)
j0 CS_LOOP
-n# 1 (R>) (+)
-n# 1 (A>) (+) ;
: SCAN_STRING
l# INPUT (@) (>R) (R@+)
: SS_LOOP
c READ1 (R!+)
-n# 1 (+) (DUP)
if
j SS_LOOP
else
(DROP) (R>) (DUP) (!) ;
: SCAN
c READ1
c PUSH_STRING
j SCAN_STRING
: FIRST_SLOT
l# SLOT (>A)
n# 31 (A!) ;
: LAST_SLOT
l# SLOT (>A)
n# 1040187392 (A!) ;
: NULL_SLOT
l# SLOT (>A)
n# 0 (A!) ;
: ALIGN
l# HERE_NEXT (@)
(DUP) (>R) n# 0 (R!+)
(R>) (A!)
l# HERE (!)
j NULL_SLOT
: NEXT_SLOT
l# SLOT (@)
n# 0 (OVER) (XOR)
if
n# 1040187392 (OVER) (XOR)
if
(2*) (2*) (2*) (2*) (2*) (A!) ;
else
(DROP) c ALIGN j FIRST_SLOT
else
(DROP) j FIRST_SLOT
: DEFN_AS
l# THERE (@) (!)
l# INPUT (@)
-n# 1 (+)
l# THERE (!) ;
: NEW_WORD
c ALIGN
l# HERE (@) ;
: DEFN
c NEW_WORD
j DEFN_AS
: LOOK
l# THERE (@) n# 1 (+) (DUP)
: LOOK_LOOP
l# INPUT (@)
c COMPARE_STRINGS
l# INPUT (@)
c STRING_TAIL (XOR)
if
(DROP) c STRING_TAIL
n# 1 (+)
(DUP) l# NAME_END (@) (XOR)
if
(DUP) j LOOK_LOOP
else
(DROP) l# NAME_END (@) ;
else
(>A) (DROP) (A@) ;
: COMPILE_OPCODE
c NEXT_SLOT
(>R) l# SLOT (@) (NOT)
(R>) (OVER) (OVER) (AND) if
(2*) (2*) (2*) (2*) (2*)
(OVER) (OVER) (AND) if
(2*) (2*) (2*) (2*) (2*)
(OVER) (OVER) (AND) if
(2*) (2*) (2*) (2*) (2*)
(OVER) (OVER) (AND) if
(2*) (2*) (2*) (2*) (2*)
(OVER) (OVER) (AND) if
(2*) (2*) (2*) (2*) (2*)
else else else else else
l# HERE (@)
(@) (+) (A!) (DROP) ;
: COMPILE_LITERAL
l# HERE_NEXT (@)
(>A) (A!+)
(A>) l# HERE_NEXT (!) ;
: CMPCALL
CALLopcode # c COMPILE_OPCODE c COMPILE_LITERAL j ALIGN
: CMPJMP
JMPopcode # c COMPILE_OPCODE c COMPILE_LITERAL j ALIGN
: CMPJMPZERO
JMPZEROopcode # c COMPILE_OPCODE j COMPILE_LITERAL
: CMPJMPPLUS
JMPPLUSopcode # c COMPILE_OPCODE j COMPILE_LITERAL
: NUMC
LITopcode # c COMPILE_OPCODE j COMPILE_LITERAL
: CMPPCFETCH PCFETCHopcode # j COMPILE_OPCODE
: CMPRET RETopcode # j COMPILE_OPCODE
: CMPLOADAPLUS LOADAPLUSopcode # j COMPILE_OPCODE
: CMPLOADRPLUS LOADRPLUSopcode # j COMPILE_OPCODE
: CMPLOADA LOADAopcode # j COMPILE_OPCODE
: CMPSTOREAPLUS STOREAPLUSopcode # j COMPILE_OPCODE
: CMPSTORERPLUS STORERPLUSopcode # j COMPILE_OPCODE
: CMPSTOREA STOREAopcode # j COMPILE_OPCODE
: CMPNOT NOTopcode # j COMPILE_OPCODE
: CMPAND ANDopcode # j COMPILE_OPCODE
: CMPXOR XORopcode # j COMPILE_OPCODE
: CMPPLUS PLUSopcode # j COMPILE_OPCODE
: CMPTWOSTAR TWOSTARopcode # j COMPILE_OPCODE
: CMPTWOSLASH TWOSLASHopcode # j COMPILE_OPCODE
: CMPPLUSSTAR PLUSSTARopcode # j COMPILE_OPCODE
: CMPAFROM AFROMopcode # j COMPILE_OPCODE
: CMPTOA TOAopcode # j COMPILE_OPCODE
: CMPDUP DUPopcode # j COMPILE_OPCODE
: CMPDROP DROPopcode # j COMPILE_OPCODE
: CMPOVER OVERopcode # j COMPILE_OPCODE
: CMPTOR TORopcode # j COMPILE_OPCODE
: CMPRFROM RFROMopcode # j COMPILE_OPCODE
: CMPNOP NOPopcode # j COMPILE_OPCODE
: EXECUTE (>R) ;
: EXEC
c READ1
c EXECUTE
j EXEC
: NXEC
c SCAN
c LOOK
c POP_STRING
c EXECUTE
j NXEC
: TENSTAR
(DUP) (2*) (2*) (2*) (OVER) (+) (+) ;
: NUMI
l# INPUT (@)
(>R) (R@+)
n# 0 (>A)
(DUP)
if
: NUMI_LOOP
(A>) c TENSTAR (R@+) -n# 48 (+) (+) (>A)
-n# 1 (+) (DUP)
if
j NUMI_LOOP
else
else
(DROP) (R>) (DROP) (A>)
j POP_STRING
l NXEC l START_WORD !
)VM
A.3.4 Flight Language Extensions
This is a repetition of the compilation of the Extensions from Section A.2, except that they are
processed by the new kernel within the Virtual Machine. This indirection greatly alters the nature, and increases the amount, of the code that is executed during the compilation of the extensions.
run_vm
Appendix B
The following analyses are based on the software developed in Appendix A for the Gullwing
processor. The analyses are imperfect since it cannot always be known if a memory word
contains instructions or a literal. For example, this happens wherever the software contains
directly accessed memory locations. These memory locations are neither executed nor accessed
in-line like literal fetches. The analysis software considers these memory locations to contain
instructions by default, despite actually being literals. Fortunately, such cases are infrequent
and contribute very little noise to the data, showing up as rare, large literal values and UND
instructions.
B.1.2 Range of Literals
Table B.2 shows the distribution of the number of bits required to represent the absolute value of literals in immediate fetches. Most of these are small constants used in calculations.
B.1.4 Instructions per Instruction Word
Table B.4 shows the distribution of the number of instructions compiled into memory words.
The PC@ (PC Fetch) instruction is not counted since it is used to fill the empty instruction slots.
A memory word which contains zero instructions is thus actually filled with PC@ instructions.
B.1.6 Compiled Instruction Counts
Table B.6 shows the number of times each possible instruction was found in memory. The
ratios are calculated relative to the total number of instructions (C/I) and to the total number of
instruction slots (C/S).
B.2 Dynamic Analyses
The dynamic analyses are done on an execution trace log of each test in Appendix A. The
’Bare’ case present in the static analyses is not included since the Flight language kernel is
built into the simulator and does virtually nothing without external input.
Test Ext. VM
Instructions 5,018,751 204,325,372
Cycles 6,574,996 265,567,537
Table B.7: Overall Execution
Instruction Ext. C/I C/C VM C/I C/C
JMP0 304,141 0.061 0.046 2,280,533 0.011 0.009
JMP+ 63 0.000 0.000 63 0.000 0.000
JMP0 TAKEN 23,307 0.005 0.007 83,610 0.000 0.001
JMP+ TAKEN 137 0.000 0.000 1,875,075 0.009 0.014
CALL 320,857 0.064 0.098 10,143,368 0.050 0.076
RET 321,997 0.064 0.098 15,306,286 0.075 0.115
JMP 107,803 0.021 0.033 5,617,365 0.027 0.042
PC@ 230,798 0.041 0.031 4,428,928 0.022 0.017
FOLDS 210,080 0.042 0.032 15,381,940 0.075 0.058
LIT 636,924 0.127 0.097 32,273,461 0.158 0.122
@A 321,272 0.064 0.098 16,753,698 0.082 0.126
@A+ 320,744 0.064 0.098 1,546,398 0.008 0.012
!A 12,909 0.003 0.004 9,326,086 0.046 0.070
!A+ 428 0.000 0.000 1272 0.000 0.000
@R+ 120,753 0.024 0.037 560,414 0.003 0.004
!R+ 6038 0.001 0.002 28,593 0.000 0.000
XOR 319,174 0.064 0.049 4,203,870 0.021 0.016
AND 2247 0.000 0.000 7,042,637 0.034 0.027
NOT 2249 0.001 0.000 3,758,267 0.018 0.014
2* 9914 0.002 0.002 32,069 0.000 0.000
2/ 0 0.000 0.000 25,802,660 0.126 0.097
+ 515,465 0.103 0.078 15,671,697 0.077 0.059
+* 0 0.000 0.000 0 0.000 0.000
DUP 212,572 0.042 0.032 10,923,102 0.053 0.041
DROP 103,643 0.021 0.016 511,936 0.003 0.002
OVER 6777 0.001 0.001 3,770,383 0.018 0.014
>R 104,923 0.021 0.016 7,110,502 0.035 0.027
R> 103,781 0.021 0.016 1,947,580 0.010 0.007
>A 633,291 0.126 0.096 21,482,640 0.105 0.081
A> 302,944 0.060 0.046 1,843,179 0.009 0.007
NOP 0 0.000 0.000 0 0.000 0.000
UND0 0 0.000 0.000 0 0.000 0.000
UND1 0 0.000 0.000 0 0.000 0.000
UND2 0 0.000 0.000 0 0.000 0.000
UND3 0 0.000 0.000 0 0.000 0.000
Table B.8: Executed Instruction Counts
B.2.3 Average CPI
Table B.9 shows the computed average CPI (Cycles Per Instruction) values based on the in-
struction counts from Table B.8. The ’Worst’ and ’Best’ values are for the hypothetical bound-
ary cases where all conditional jumps are taken or not, respectively.
Test Ext. VM
Best 1.305 1.290
Actual 1.310 1.300
Worst 1.371 1.311
Table B.9: Average CPI
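For reference, the ’Actual’ values follow directly from the totals of Table B.7:
CPI (Ext.) = 6,574,996 cycles / 5,018,751 instructions ≈ 1.310
CPI (VM) = 265,567,537 cycles / 204,325,372 instructions ≈ 1.300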
B.2.5 Basic Block Length
Table B.11 lists the lengths, measured in instructions, of the basic blocks encountered during
execution. Calls, returns, and jumps (taken or not) terminate a basic block. A block length of
zero signifies two consecutive calls or jumps.
The odd peak at length 17 is the main loop of the VM, do_next_instruction, which
was deliberately inlined into a single basic block for performance reasons.
B.2.6 Data Stack Depth
Table B.12 shows the distribution of the number of items on the Data Stack over the entire
execution.
B.2.7 Return Stack Depth
Table B.13 shows the distribution of the usage of the Return Stack over the entire execution.
The depth of the Return Stack is usually equal to the call depth of the program, plus some
transient, temporary storage.
Bibliography
[Bai94] Chris Bailey, HLL enhancement for stack based processors, EuroMicro Journal
of Microprocessing and Microprogramming 40 (1994), 665–668.
[Bar87] R. S. Barton, A new approach to the functional design of a digital computer, IEEE
Annals of the History of Computing 09 (1987), no. 1, 11–15.
[Bau90] F. L. Bauer, The cellar principle of state transition and storage allocation, IEEE
Ann. Hist. Comput. 12 (1990), no. 1, 41–49.
[Bau02] Friedrich L. Bauer, From the Stack Principle to ALGOL, pp. 26–42, in Broy and
Denert [BD02], 2002, Points to possible earlier origins of stacks for computation.
[BD02] Manfred Broy and Ernst Denert (eds.), Software pioneers: contributions to
software engineering, Springer-Verlag New York, Inc., New York, NY, USA,
2002.
[BFPB97] Gerrit A. Blaauw and Frederick P. Brooks, Jr., Computer architecture: Concepts
and evolution, Addison-Wesley Longman Publishing Co., Inc., Boston, MA,
USA, 1997.
[BI86] Leo B. Brodie and FORTH Inc., Starting FORTH, second ed., Prentice-Hall, Inc.,
Upper Saddle River, NJ, USA, 1986.
[Bla77] Russell P. Blake, Exploring a stack architecture, Computer 10 (1977), no. 5, 30–
39, QA76.5.I54, not in ACM Digital Library.
[Bro84] Leo B. Brodie, Thinking FORTH: a language and philosophy for solving
problems, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1984.
[BSa] Friedrich Ludwig Bauer and Klaus Samelson, French patent 1.204.424: Machine
à calculer automatique et procédé pour son exploitation, Filed March 28, 1958.
Delivered August 10, 1959. Published January 26, 1960.
[BSd] Friedrich Ludwig Bauer and Klaus Samelson, US patent 3,047,228: Automatic computing machines and method of
operation, Filed March 28, 1958. Granted July 31, 1962.
[BS94] C. Bailey and R. Sotudeh, HLL enhancement for stack based processors, Selected
papers of the short notes session on Euromicro ’94 (Amsterdam, The Netherlands,
The Netherlands), Elsevier Science Publishers B. V., 1994, pp. 685–688.
[Bur81] Burroughs Corporation, Detroit, Michigan, B6900 system reference manual, July
1981, Form 5100986.
[Cha95] Robert James Chapman, Stack quarks, Proc. Rochester Forth Con-
ference on Emerging Technology (Rochester, New York) (Lawrence
P. G. Forsley, ed.), University of Rochester, The Institute for Applied
Forth Research, Inc., June 1995, Decomposes the typical stack permutation operations into smaller primitives. Online as of Nov. 2006:
http://www.compusmart.ab.ca/rc/Papers/StackQuarksPaper.pdf.
[Chu75] Yaohan Chu, High-level language computer architecture, Academic Press, Inc.,
Orlando, FL, USA, 1975.
[DBL00] 26th EUROMICRO 2000 conference, informatics: Inventing the future, 5-7
september 2000, Maastricht, The Netherlands, IEEE Computer Society, 2000.
[Dor75a] R. W. Doran, The International Computers ltd. ICL2900 computer architecture,
SIGARCH Comput. Archit. News 4 (1975), no. 3, 24–47.
[Dor75b] Robert W. Doran, Architecture of stack machines, pp. 63–108, in [Chu75], 1975.
[Dun77] Fraser George Duncan, Stack machine development: Australia, great britain, and
europe, Computer 10 (1977), no. 5, 50–52.
[Fox98] Jeff Fox, F21 CPU, web page, 1998, Online as of April 2007:
http://ultratechnology.com/f21.html.
[FR93] James M. Feldman and Charles Retter, Computer architecture; a designer’s text
based on a generic RISC, McGraw-Hill, Inc., 1993.
[Fre98] Paul Frenger, Forth in space, or, so NEAR yet so far out, SIGPLAN Not. 33
(1998), no. 6, 24–26.
[Fre01] Paul Frenger, Close encounters of the Forth kind, SIGPLAN Not. 36 (2001), no. 4,
21–24.
[GL03] William F. Gilreath and Phillip A. Laplante, Computer Architecture, Kluwer Aca-
demic Publishers, 2003.
[GWS91] George William Shaw II, Sh-BOOM: the sound of the RISC market changing,
Proceedings of the second and third annual workshops on Forth, ACM Press,
1991, p. 125.
[Hal62] A. C. D. Haley, The KDF.9 computer system, Proceedings of the AFIPS Fall Joint
Computer Conference, vol. 21, 1962, pp. 108–120.
[Hay97] John P. Hayes, Computer architecture and organization, McGraw-Hill, Inc., 1997.
[HJB+ 82] John Hennessy, Norman Jouppi, Forest Baskett, Thomas Gross, and John Gill,
Hardware/software tradeoffs for increased performance, ASPLOS-I: Proceedings
of the first international symposium on Architectural support for programming
languages and operating systems (New York, NY, USA), ACM Press, 1982,
pp. 2–11.
[HJP+ 82] John Hennessy, Norman Jouppi, Steven Przybylski, Christopher Rowen, Thomas
Gross, Forest Baskett, and John Gill, MIPS: A microprocessor architecture, MI-
CRO 15: Proceedings of the 15th annual workshop on Microprogramming (Pis-
cataway, NJ, USA), IEEE Press, 1982, pp. 17–22.
[HP02] John L. Hennessy and David A. Patterson, Computer architecture: A quantitative approach, Morgan Kaufmann
Publishers Inc., 2002.
[IC78] R. N. Ibbett and P. C. Capon, The development of the MU5 computer system,
Commun. ACM 21 (1978), no. 1, 13–24.
[Int00] Intersil, Data sheet for HS-RTX2010RH, March 2000, File Number 3961.3.
[KDF61] KDF9: Very high speed data processing system for commerce, industry, science,
English Electric, Kidsgrove, Stoke-On-Trent, Staffordshire, England, 1961, Sales
brochure for the KDF9.
[KKC92a] William F. Keown, Philip Koopman, and Aaron Collins, Performance of the
Harris RTX 2000 stack architecture versus the Sun 4 SPARC and the Sun 3
M68020 architectures, SIGARCH Comput. Archit. News 20 (1992), no. 3, 45–52.
[KKC92b] William F. Keown, Philip Koopman, and Aaron Collins, Real-time performance of the Harris RTX 2000 stack architecture versus
the Sun 4 SPARC and the Sun 3 M68020 architectures with a proposed real-time
performance benchmark, SIGMETRICS Perform. Eval. Rev. 19 (1992), no. 4,
40–48.
[Koo89] Philip J. Koopman, Stack computers: the new wave, Halsted Press, 1989, A com-
pendium of stack computer architectures. Has useful experimental data.
[Koo90] Philip J. Koopman, Modern stack computer architecture, System Design and Network Ar-
chitecture Conference (1990), 153–164.
[Koo91] Philip J. Koopman, Some ideas for stack computer design, Rochester Forth Conference
(1991), 58.
[Lav80] Simon Hugh Lavington, Early british computers: The story of vintage computers
and the people who built them, Butterworth-Heinemann, Newton, MA, USA,
1980.
[LTL98] P. H. W. Leong, P. K. Tsang, and T. K. Lee, A FPGA based forth microprocessor,
FCCM ’98: Proceedings of the IEEE Symposium on FPGAs for Custom Com-
puting Machines (Washington, DC, USA), IEEE Computer Society, 1998, p. 254.
[Luk29] Jan Lukasiewicz, Elements of mathematical logic, Warsaw, 1929, [English trans-
lation of 1958 edition: Macmillan, 1963].
[McK80] William M. McKeeman, Stack computers, pp. 319–362, in [Sto80], 1980.
[McL93] Edward McLellan, The Alpha AXP architecture and 21064 processor, IEEE Mi-
cro 13 (1993), no. 3, 36–47.
[ME97] Martin Maierhofer and M. Anton Ertl, Optimizing stack code, Forth-Tagung
1997, 1997.
[ME98] Martin Maierhofer and M. Anton Ertl, Local stack allocation, Compiler Construction 1998, Springer LNCS
1383, 1998, pp. 189–203.
[MK97] M. Morris Mano and Charles R. Kime, Logic and computer design fundamentals,
Prentice-Hall, Inc., 1997.
[ML70] Charles H. Moore and Geoffrey C. Leach, FORTH – a language for interactive
computing, Mohasco Industries, Inc., Amsterdam, NY, 1970, Internal publica-
tion.
[Moo91] Charles H. Moore, Forth - the early years, Unpublished notes that became
the papers by Rather, Colburn, and Moore [RCM93] [RCM96]. Accessible at
http://www.colorforth.com/HOPL.html as of Nov. 2006., 1991.
[Moo01a] Charles H. Moore, 25x emulator, Proceedings of the 17th EuroForth Conference (Schloss
Dagstuhl, Saarland, Germany), University of Teesside, November 2001, ISBN: 0
907550 97 6.
[Moo01b] Charles H. Moore, c18 colorForth compiler, Proceedings of the 17th EuroForth Conference
(Schloss Dagstuhl, Saarland, Germany), University of Teesside, November 2001,
One of the few published papers by Chuck Moore. Describes the c18 instruction
set in detail.
[MP95] Silvia M. Müller and Wolfgang J. Paul, The complexity of simple computer
architectures, Springer-Verlag New York, Inc., 1995.
[MT95] Charles H. Moore and C. H. Ting, MuP21 – a MISC processor, Forth Dimensions
(1995), 41, http://www.ultratechnology.com/mup21.html.
[Mur86] Robert W. Murphy, Under the hood of a superchip: the NOVIX Forth engine, J.
FORTH Appl. Res. 3 (1986), no. 2, 185–188.
[Mur90] William D. Murray, Computer and digital system architecture, Prentice-Hall, Inc.,
1990.
[Org73] Elliott Irving Organick, Computer system organization: The B5700/B6700 series
(ACM monograph series), Academic Press, Inc., Orlando, FL, USA, 1973.
[Pay96] Bernd Paysan, Implementation of the 4stack processor using Verilog, Diploma
thesis, Technische Universität München, Institut für Informatik, August 1996,
http://www.jwdt.com/~paysan/4stack.html.
[Pay02] Bernd Paysan, b16 – A Forth processor in an FPGA, Forth-Tagung 2002 (2002),
http://www.b16-cpu.de/.
[PD80] David A. Patterson and David R. Ditzel, The case for the reduced instruction set
computer, SIGARCH Comput. Archit. News 8 (1980), no. 6, 25–33.
[PS81] David A. Patterson and Carlo H. Sequin, RISC I: A reduced instruction set VLSI
computer, ISCA ’81: Proceedings of the 8th annual symposium on Computer
Architecture (Los Alamitos, CA, USA), IEEE Computer Society Press, 1981,
pp. 443–457.
[PS98b] David A. Patterson and Carlo H. Sequin, RISC I: a reduced instruction set VLSI
computer, ISCA ’98: 25 years of the international symposia on Computer archi-
tecture (selected papers) (New York, NY, USA), ACM Press, 1998, pp. 216–230.
[Ras03] James Rash, Space-related applications of forth, webpage:
http://forth.gsfc.nasa.gov/, April 2003, Presents space-related applications
of Forth microprocessors and the Forth programming language at NASA.
Accessed on Nov. 2006.
[RCM93] Elizabeth D. Rather, Donald R. Colburn, and Charles H. Moore, The evolution of
Forth, The second ACM SIGPLAN conference on History of programming lan-
guages (Cambridge, Massachusetts, United States), ACM Press, 1993, pp. 177–
199.
[RCM96] Elizabeth D. Rather, Donald R. Colburn, and Charles H. Moore, The evolution of Forth, History of programming languages—II (New
York, NY, USA), ACM Press, 1996, pp. 625–670.
[Ros87] Robert F. Rosin, Prologue: The Burroughs B5000, Annals of the History of Com-
puting 9 (1987), no. 1, 6–7.
[SB60] K. Samelson and F. L. Bauer, Sequential formula translation, Commun. ACM 3
(1960), no. 2, 76–83.
[SB04] Huibin Shi and Chris Bailey, Investigating available instruction level parallelism
for stack based machine architectures, DSD ’04: Proceedings of the Digital Sys-
tem Design, EUROMICRO Systems on (DSD’04) (Washington, DC, USA), IEEE
Computer Society, 2004, pp. 112–120.
[Sha99] George William Shaw, PSC1000 microprocessor reference manual, Patriot Sci-
entific Corporation, San Diego, CA, March 1999, Ref. No. 99-0370001.
[Sha02] George William Shaw, IGNITE intellectual property reference manual, Patriot Scientific Cor-
poration, San Diego, CA, March 2002, Revision 1.0.
[Sit78] Richard L. Sites, A combined register-stack architecture, SIGARCH Comput.
Archit. News 6 (1978), no. 8, 19–19.
[SSK97] Dezso Sima, Terence Fountain, and Peter Kacsuk, Advanced computer architectures,
Addison-Wesley Longman Publishing Co., Inc., 1997.
[Sta90] William Stallings, Computer organization and architecture, 2nd ed., Prentice Hall
PTR, 1990.
[Sta93] William Stallings, Computer organization and architecture: principles of structure and
function, 3rd ed., Macmillan Publishing Co., Inc., 1993.
[Sta02] William Stallings, Computer organization and architecture, Prentice Hall Professional
Technical Reference, 2002.
[Sto80] Harold S. Stone, Introduction to computer architecture, Science Research Asso-
ciates, 1980.
[Sto92] Harold S. Stone, High-performance computer architecture, Addison-Wesley Longman
Publishing Co., Inc., 1992.
[TCCLL99] P.K. Tsang, K.H. Leung, C.C. Cheung, T.K. Lee, and P.H.W. Leong,
MSL16A: an asynchronous Forth microprocessor, TENCON 99. Proceedings of
the IEEE Region 10 Conference, vol. 2, September 1999, pp. 1079–1082.
[Wil91] Barry Wilkinson, Computer architecture: Design and performance, 1st ed.,
Prentice-Hall, Inc., 1991.