chow
- Using the number of explicit operands named per instruction.
- Using operand location. Can ALU operands be located in memory? A RISC architecture requires all operands to be in registers. A stack architecture requires all operands to be on the stack (the top portion of the stack is inside the CPU; the rest is in memory).
- Using the operations provided in the ISA.
- Using the types and sizes of operands.
cs420/520-CH2-5/14/99--Page 1-
- Code size (How many bytes are in a program's executable code?)
- Code density (How many instructions fit in x KBytes?) Instruction length. Stack architecture has the best code density (no operands in ALU ops).
- Code efficiency (Are there limitations on operand access?) The stack cannot be randomly accessed: e.g., M[top-x] with x>=2 cannot be directly accessed, and M[top]-M[top-1] is translated into a subtract followed by a negate.
- Bottleneck in operand traffic. The stack will be the bottleneck: both input operands come from the stack and the result goes back to the stack.
- Memory traffic (How many memory references, instruction plus data (i+d), are in a program?) Accumulator architecture: each ALU operation involves one memory reference.
- Ease of writing compilers. General-purpose register architecture: we have more registers to allocate. Do more choices make the compiler more difficult to write?
- Ease of writing assembly programs. Stack architecture: you need to use reverse Polish expressions.
[Figure: ALU, operands a and b, registers]
Instruction count = 8
Total bits for the instructions = bits for opcodes + bits for addresses
    = (8*8) + (6*(5+16) + 2*(5+5)) = 210 bits
Memory traffic = 210 bits + 6*4*8 bits = 402 bits
With a 32-bit-wide bus, an integer that is not aligned is stored in bytes 2002, 2003, 2004, and 2005, which span the two 32-bit memory words starting at addresses 2000 and 2004. To get the integer into a register, it not only requires two memory accesses but also requires the CPU to perform shift and OR operations when receiving the data.
Byte Ordering
There are two ways to order the bytes of short, int, and long data in memory.
- Little endian (e.g., PC) puts the less significant byte at the lower address of the allocated memory. The least significant byte is sent/saved first.
- Big endian (e.g., SPARC) puts the more significant byte at the lower address of the allocated memory. The most significant byte is sent first.

address      2000   2002   2004         2008 (to 2011)
field               size   X            Y
On SPARC:    08 00  00 04  00 00 00 01  00 00 01 00
On Pentium:  08 fd  04 00  01 00 00 00  00 01 00 00

Big-endian machine (SPARC): X is allocated at addresses 2004-2007. The least significant of its four bytes, 01, is at address 2007 (the higher address).
Little-endian machine (Pentium): X is allocated at addresses 2004-2007. The least significant of its four bytes, 01, is at address 2004 (the lower address).
Memory Alignment
What happens if we change the declaration by moving the size field to the end?
    struct PenRecord { unsigned char msgType; int X; int Y; short size; } pr;
Assume pr is allocated in memory starting at address 2000. After we execute pr.msgType=8, pr.size=4, pr.X=1, pr.Y=256, here are the contents of memory on different machines, generated by write2.c:

address      2000         2004         2008         2012 (to 2015)
field                     X            Y            size
On SPARC:    08 00 00 00  00 00 00 01  00 00 01 00  00 04 06 38   16 bytes
On Pentium:  08 00 00 00  01 00 00 00  00 01 00 00  04 00 40 00   16 bytes
On old 86:   08 01 00 00 00 00 01 00 00 04 00                     11 bytes

In the SPARC and Pentium layouts, the three bytes after msgType (2001-2003) and the last two bytes (2014-2015) are padding bytes.
A structure variable is allocated memory whose size is a multiple of its largest element size.
What if we change int to double?
~cs520/alignment/writer.c
#include <stdio.h>
...
struct PenRecord {
    unsigned char msgType;
    short size;
    int X;
    int Y;
};

int main(void)
{
    int fd, cc;
    int accessible;
    struct PenRecord r;

    if (access(datafile, F_OK) == 0) {
        /* a file with the same name exists */
        if ((fd = open(datafile, O_WRONLY, 00777)) < 0) {
            printf("fd=%d\n", fd);
            exit(1);
        }
    } else if ((fd = open(datafile, O_WRONLY | O_CREAT, 00777)) < 0) {
        /* O_WRONLY is needed so that the write() below can succeed */
        printf("fd=%d\n", fd);
        exit(1);
    }
    r.msgType = 8; r.X = 1; r.Y = 256; r.size = 4;
    printf("size of PenRecord = %d\n", (int)sizeof(struct PenRecord));
    /* write the raw in-memory image of the struct, padding included */
    cc = write(fd, &r, sizeof(struct PenRecord));
    printf("no.of bytes written is %d\nr.msgType=%d, r.X=%d, r.Y=%d, r.size=%d\n",
           cc, r.msgType, r.X, r.Y, r.size);
    close(fd);
    return 0;
}
~cs520/alignment/reader.c
#include <stdio.h>
....
struct PenRecord {
    unsigned char msgType;
    short size;
    int X;
    int Y;
} r;

int main(void)
{
    int fd, cc;

    if ((fd = open(datafile, O_RDONLY, 0)) < 0)
        exit(1);
    printf("size of PenRecord = %d\n", (int)sizeof(struct PenRecord));
    /* read the raw bytes straight into the struct, with no byte swapping */
    cc = read(fd, &r, sizeof(struct PenRecord));
    printf("no.of bytes read is %d\nr.msgType=%d, r.X=%d, r.Y=%d, r.size=%d\n",
           cc, r.msgType, r.X, r.Y, r.size);
    close(fd);
    return 0;
}

Run writer on SPARC, we get the following output:
    size of PenRecord = 12
    no.of bytes written is 12
    r.msgType=8, r.X=1, r.Y=256, r.size=4
Run reader on DEC3100 with the datafile generated by writer on SPARC, got:
    size of PenRecord = 12
    no.of bytes read is 12
    r.msgType=8, r.X=16777216, r.Y=65536, r.size=1024
Run reader on SPARC with the datafile generated by writer on SPARC, got:
    size of PenRecord = 12
    no.of bytes read is 12
    r.msgType=8, r.X=1, r.Y=256, r.size=4
Register vs. memory as intermediate location     ADD R4, (R1)      ADD R4, @(R1)
Indexed
    Full size vs. short displacement             LW R12, (R11+R2)
    Single vs. multiple index registers          ADD R1, 100(R2)[R3]
Self modifying
    Pre- vs. post-increment                      ADD R1, (R2)+
    Pre- vs. post-decrement                      ADD R1, -(R2)
    Fixed vs. data-dependent increment (decrement)

Powerful instruction set architectures such as the VAX were designed to translate the constructs of programming languages into very few instructions, providing high-level hardware support for languages. But they also made the design of those processors very difficult and delayed their delivery. In the early 1980s, the direction of computer architecture started to shift toward simpler architectures, and RISC was born.
- Branches occur every 6 to 20 instructions.
- Procedure calls occur every 70 to 300 instructions.
- Branches are 12 to 14 times more frequent than procedure calls.
- 0 is the most frequently used immediate value in comparisons (83%).
- Most backward-going branches are loop branches.
- Loop branches are taken with 90% probability.
- Branch behavior is application dependent and sometimes compiler dependent.
Procedure Call/Return
Two basic approaches to saving registers in procedure calls:
- Caller-saving
- Callee-saving
Where to save registers in procedure calls:
- The memory area pointed to by the stack pointer
- Register windows in the SPARC architecture
- Procedure inlining and loop transformations
- Global and local optimizations; register allocation
- Detailed instruction selection and machine-dependent optimizations; may include or be followed by assembler
Code Generation
Procedure Inlining/Integration
int gv;

int addn(int src, int n)
{
    return src+n;
}

main()
{
    int i, j;
    gv = 1;
    i = 3;
    j = addn(gv, i);    /* replace this statement with j=gv+i */
    printf("j=%d\n", j);
}
- Avoid the overhead of procedure call/return by replacing the procedure call with the procedure body (with proper variable substitution).
- Trade-off between the size of the procedure body and the frequency of the procedure call: the code size of the expanded program vs. the call/return overhead.
- Some modern programming languages such as C++ have an explicit inline language construct.
int A[128], B[128];
int i, j, k;
B[i+j*k] = A[i+j*k];

    lw    $14, 8($sp)         # load j
    lw    $15, 4($sp)         # load k
    mul   $24, $14, $15       # $24=j*k
    lw    $25, 12($sp)        # load i
    addu  $8, $25, $24        # $8 = i+j*k
    mul   $9, $8, 4           # each integer is 4 bytes
    addu  $10, $sp, 1040      # add the base address of the stack
    addu  $11, $9, $10
    lw    $12, -512($11)      # -512 = the offset from base address to A[0]
    sw    $12, -1024($11)     # -1024 = the offset from base address to B[0]

Note that the common subexpression, i+j*k, is calculated only once. This is MIPS code compiled on a DEC3100 using cc -S -O1 commsubexp.c
Constant Propagation
     int i, j, k, l;
# 4  i = 3;
        li    $14, 3
        sw    $14, 12($sp)
# 5  j = i+4;
        li    $15, 7
        sw    $15, 8($sp)
# 6  k = i;
        li    $24, 3          # const 3 is propagated to k
        sw    $24, 4($sp)
# 7  l = k;
        li    $25, 3
        sw    $25, 0($sp)
Note that the constant propagation will be stopped when the value of the variable is replaced by the result of a non-constant expression.
Code Motion
codemotion(k, j)
int k;
int j;
{
    int i, base;
    int A[128];
    for (i=1; i<10; i++) {
        base = k*j;
        A[base+i] = i;
    }
}

M68020 code

_codemotion:
|#PROLOGUE# 0
        link    a6,#-524
        moveml  #192,sp@
|#PROLOGUE# 1
        moveq   #1,d6
        movl    a6@(8),d0               | k*j is moved out
        mulsl   a6@(12),d0              | of the loop
        moveq   #4,d7
        asll    #2,d0
        movl    d0,a0
        addl    d7,a0
        lea     a6@(-512,a0:l),a0
L77003:
        movl    d6,a0@+                 | auto-increment addressing mode
        addql   #1,d6
        addql   #4,d7
        moveq   #40,d1
        cmpl    d1,d7
        jlt     L77003
|#PROLOGUE# 2
        moveml  a6@(-524),#192
        unlk    a6
|#PROLOGUE# 3
        rts
[Figure: stack frame with slots spnew+40 down to spnew+0, where spnew = SPold-40; register $4 is labeled]
jal: save PC+4 (the next instruction address) to $31, then jump to the target address.
- Register allocation is more effective for stack-allocated objects than for global variables.
- Aliased variables (variables that can be referenced in multiple ways) cannot be register-allocated.

Is it possible to allocate variable a in a register?
    p = &a;
    a = 2;
    *p = 3;        How to compile this statement?
    i = a + 4;
p typically contains a memory address, not a register address. *p=3; can be implemented as
    ADDI R1, R0, #3
    LW   R2, p
    SW   0(R2), R1

After register allocation, the remaining memory traffic consists of the following five types:
1. Unallocated reference: a potentially register-allocable reference
2. Global scalar
3. Save/restore memory reference
4. A required stack reference, due to aliasing or caller-saved variables
5. A computed reference, such as a heap reference or a reference to a stack variable via a pointer or array index
- Regularity: make operations, data types, and addressing modes orthogonal. This helps simplify code generation. Try not to restrict a register to a certain class of operations.
- Provide primitives, not solutions, e.g., the overloaded CALLS instruction in the VAX. (It tries to do too much.)
  1. Align the stack if needed.
  2. Push the argument count on the stack.
  3. Save the registers indicated by the procedure call mask on the stack.
  4. Push the return address on the stack, then push the top and base of the stack pointers for the activation record.
  5. Clear the condition codes.
  6. Push a word for status information and a zero word on the stack.
  7. Update the two stack pointers.
  8. Branch to the first instruction of the procedure.
- Simplify trade-offs (decision making) among alternative instruction sequences.
[Figure: registers (R0, R1, R3, R8, R10) and memory, with the labels 1000 and 0x8F]
DLX Architecture
- 32 32-bit GPRs, R0, R1, ..., R31. R0 is always 0.
- 32 32-bit FPRs (floating-point registers), F0, F1, ..., F31. Each of them can store a single-precision floating-point value.
- The register pairs (F0,F1), (F2,F3), ..., (F30,F31) serve as double-precision FPRs. They are named F0, F2, ..., F30.
- There are instructions for moving or converting values between FPRs and GPRs.
- Data types include 8-bit bytes, 16-bit half words, and 32-bit words for integers, and 32-bit single-precision and 64-bit double-precision values for floating point.
- Only immediate and displacement addressing modes exist.
- Byte addressing, big endian, with 32-bit addresses.
- A load-store architecture.
J and JAL use a 26-bit signed offset added to PC+4 (the address of the instruction after the J/JAL). The other jumps use registers (32 bits) for destination addresses.
Homework #5
1. Byte ordering and memory alignment:
       struct CircleRecord { unsigned char size; short X; int time; } cr;
   a) How many bytes will be allocated for the cr structure above on a PC?
      Ans: 8 bytes. There will be a padding byte after size, since the X field needs to align with an even address. The address of the next available memory happens to be a multiple of 4 when we allocate the time field, a 4-byte integer.
   b) How about on SUN SPARC?
      Ans: the same, 8 bytes.
   c) Show the content of the memory area allocated to cr, generated by a program compiled on SPARC with cr.size=8; cr.X=3; cr.time=256. Assume cr was allocated memory starting at address 2000. Use hexadecimal to represent the byte content and follow the same format as in pages 11-13 of the handout.
   d) If the same data is read for cr by a program compiled on a PC, what will be the decimal values of cr.size, cr.X, and cr.time? Assume the padding characters for the structure are all zeros.
Homework#5
2. For the following DLX code,
          ADDI  R1, R0, #1       ;; keep the value of i in register R1
          LW    R2, 1500(R0)     ;; keep the value of C in R2
          ADDI  R3, R0, #100     ;; keep the loop count, 100, in R3
   L1:    SGT   R4, R1, R3
          BNEZ  R4, L2
          SLLI  R5, R1, #2       ;; multiply i by 4
          LW    R6, 5000(R5)     ;; calculate address of B[i]
          ADD   R6, R6, R2       ;; B[i]+C
          SW    0(R5), R6        ;; A[i]=B[i]+C
          ADDI  R1, R1, #1       ;; i++
          J     L1
   L2:    SW    2000(R0), R1     ;; save final value of i to memory
   a) What is the total number of memory references (including data references and instruction references)?
   b) How many instructions are dynamically executed (including those outside of the loop)?
Solution to hw#5
1. Byte ordering and memory alignment:
   a) How many bytes will be allocated for the following cr structure on DEC3100?
          struct CircleRecord { unsigned char size; short X; short Y; } cr;
      Ans: 6 bytes, with one byte of padding between the size field and the X field.
   b) How about on SUN SPARC?
      Ans: 6 bytes.
   c) How about on PC?
      Ans: 5 bytes.
   d) Show the sequence of bytes (according to the byte ordering of DEC3100) w.r.t. cr generated by a program compiled on DEC3100 with cr.size=8; cr.X=3; cr.Y=256. (See page 11 format.)
      Ans: [Figure: byte layout of cr over addresses 2000 to 2003 on DEC3100 (6 bytes) and on a little-endian PC (5 bytes); X marks the padding byte]
   e) If the same data is read for cr by a program compiled on a PC, what will be the values of cr.size, cr.X, and cr.Y? Assume the padding characters for the structure are all zeros.
      Ans: Note that only five bytes will be read in, as indicated in the figure above. cr.size=8; bytes 2001 and 2002 will be interpreted as cr.X=3*256=768; bytes 2003 and 2004 will be interpreted as cr.Y=0.
2. For the following DLX code,
          ADDI  R1, R0, #1       ;; keep the value of i in register R1
          LW    R2, 1500(R0)     ;; keep the value of C in R2
          ADDI  R3, R0, #100     ;; keep the loop count, 100, in R3
   L1:    SGT   R4, R1, R3
          BNEZ  R4, L2
          SLLI  R5, R1, #2       ;; multiply i by 4
          LW    R6, 5000(R5)     ;; calculate address of B[i]
          ADD   R6, R6, R2       ;; B[i]+C
          SW    0(R5), R6        ;; A[i]=B[i]+C
          ADDI  R1, R1, #1       ;; i++
          J     L1
   L2:    SW    2000(R0), R1     ;; save final value of i to memory
   a) What is the total number of memory references?
      Ans: The loop between the SGT and J instructions contains 8 instructions. The loop will be executed 100 times. Therefore 800 instructions will be executed in this loop. There are 3 instructions before L1. SGT, BNEZ, and SW will be executed as the last three instructions. In total, there are 800+3+3=806 instruction references. There are 2*100 data references inside the loop and 2 data references outside the loop, for a total of 202 data references. Therefore, the total number of memory references is 806+202=1008. Since the data and instructions are all 32 bits, 32*1008 bits are referenced.
   b) How many instructions are dynamically executed?
      Ans: 806 instructions are dynamically executed.
Solution to Homework #5
1. a) 6 bytes. b) 6 bytes. c) [Figure: byte layout of cr over addresses 2000 to 2003, big endian (SUN3); X marks a padding byte]
   d) cr.size = 8, cr.X = 768, cr.Y = 1. Note that if the largest field in the struct declaration is an 8-byte double-precision field, the struct will be allocated with a size that is a multiple of 8, which dictates how many padding bytes the struct has.
2. See the figures on the next page. Three registers are required.
3. a) The number of memory references is instruction references + operand references = (4+9*100+3) + (2+2*100+1) = 1110.
   b) The number of instructions dynamically executed is 4+9*100+3 = 907.
The live ranges of C and D do not overlap. They can share the same register.
The live ranges of A and D do not overlap. They can share the same register.