COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface: ARM Edition

Chapter 3: Arithmetic for Computers

§3.1 Introduction

Arithmetic for Computers
• Operations on integers
  • Addition and subtraction
  • Multiplication and division
  • Dealing with overflow
• Floating-point real numbers
  • Representation and operations
§3.2 Addition and Subtraction

Integer Addition
• Example: 7 + 6
• Overflow if result out of range (see the C sketch below)
  • Adding +ve and –ve operands, no overflow
  • Adding two +ve operands
    • Overflow if result sign is 1
  • Adding two –ve operands
    • Overflow if result sign is 0

Integer Subtraction
• Add negation of second operand
• Example: 7 – 6 = 7 + (–6)
      +7: 0000 0000 … 0000 0111
      –6: 1111 1111 … 1111 1010
      +1: 0000 0000 … 0000 0001
• Overflow if result out of range
  • Subtracting two +ve or two –ve operands, no overflow
  • Subtracting +ve operand from –ve operand
    • Overflow if result sign is 0
  • Subtracting –ve operand from +ve operand
    • Overflow if result sign is 1
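The sign-bit rules above drop straight into C. Below is a minimal sketch (the helper name add_overflows is ours, not the book's); it performs the addition with unsigned wraparound so the check itself never triggers signed-overflow undefined behavior:

    #include <stdint.h>
    #include <stdio.h>

    /* The slide's rule: adding operands of the same sign overflows exactly
       when the result's sign differs from theirs; mixed signs never overflow. */
    int add_overflows(int64_t a, int64_t b)
    {
        int64_t sum = (int64_t)((uint64_t)a + (uint64_t)b);  /* wraparound add */
        return ((a < 0) == (b < 0)) && ((sum < 0) != (a < 0));
    }

    int main(void)
    {
        printf("%d\n", add_overflows(INT64_MAX, 1));  /* 1: two +ve, result sign is 1 */
        printf("%d\n", add_overflows(7, -6));         /* 0: +ve and -ve never overflow */
        return 0;
    }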
§3.3 Multiplication

Arithmetic for Multimedia
• Graphics and media processing operates on vectors of 8-bit and 16-bit data
• Use 64-bit adder, with partitioned carry chain
  • Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors
  • SIMD (single-instruction, multiple-data)
• Saturating operations
  • On overflow, result is largest representable value
    • cf. 2's-complement modulo arithmetic
  • E.g., clipping in audio, saturation in video

Multiplication
• Start with long-multiplication approach (see the C sketch below)

      multiplicand      1000
      multiplier      × 1001
                      ------
                        1000
                       0000
                      0000
                     1000
                     -------
      product        1001000

  • Length of product is the sum of operand lengths
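The long-multiplication idea is easy to phrase in C: scan the multiplier bits from least significant upward, adding a shifted copy of the multiplicand for each 1 bit. A minimal sketch, using a 32×32 to 64-bit product so everything fits in standard types:

    #include <stdint.h>
    #include <stdio.h>

    /* Shift-and-add multiplication, as in the slide's long-multiplication
       example. The 64-bit product matches "product length = sum of operand
       lengths". */
    uint64_t shift_add_mul(uint32_t multiplicand, uint32_t multiplier)
    {
        uint64_t product = 0;
        uint64_t mcand = multiplicand;     /* shifted left each step */
        while (multiplier != 0) {
            if (multiplier & 1)
                product += mcand;          /* add this partial product */
            mcand <<= 1;
            multiplier >>= 1;
        }
        return product;
    }

    int main(void)
    {
        /* The slide's operands: 1000_2 x 1001_2 = 1001000_2, i.e. 8 x 9 = 72 */
        printf("%llu\n", (unsigned long long)shift_add_mul(8, 9));
        return 0;
    }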
Multiplication Hardware
[Figure: sequential multiplication hardware; product register initially 0]

Optimized Multiplier
• Perform steps in parallel: add/shift
• One cycle per partial-product addition
  • That's OK, if frequency of multiplications is low
Faster Multiplier
• Uses multiple adders
  • Cost/performance tradeoff
• Can be pipelined
  • Several multiplications performed in parallel

LEGv8 Multiplication
• Three multiply instructions (sketched in C below):
  • MUL: multiply
    • Gives the lower 64 bits of the product
  • SMULH: signed multiply high
    • Gives the upper 64 bits of the product, assuming the operands are signed
  • UMULH: unsigned multiply high
    • Gives the upper 64 bits of the product, assuming the operands are unsigned
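What the three instructions compute can be sketched in C. The 128-bit integer type used here is a GCC/Clang extension, not ISO C; it stands in for the double-width product the hardware forms internally:

    #include <stdint.h>
    #include <stdio.h>

    uint64_t mul_lo(uint64_t a, uint64_t b)    /* MUL: low 64 bits, mod 2^64 */
    {
        return a * b;
    }

    int64_t smulh(int64_t a, int64_t b)        /* SMULH: high 64 bits, signed */
    {
        return (int64_t)(((__int128)a * b) >> 64);
    }

    uint64_t umulh(uint64_t a, uint64_t b)     /* UMULH: high 64 bits, unsigned */
    {
        return (uint64_t)(((unsigned __int128)a * b) >> 64);
    }

    int main(void)
    {
        /* 2^63 * 4 = 2^65: the low half is 0, the high half catches the rest */
        printf("lo=%llu hi=%llu\n",
               (unsigned long long)mul_lo(1ULL << 63, 4),
               (unsigned long long)umulh(1ULL << 63, 4));  /* lo=0 hi=2 */
        return 0;
    }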
§3.4 Division

Division
• Check for 0 divisor
• Long-division approach

                    1001        quotient
                  ---------
      divisor 1000 ) 1001010    dividend
                    -1000
                    -----
                       10
                       101
                       1010
                      -1000
                      -----
                         10     remainder

  • If divisor ≤ dividend bits
    • 1 bit in quotient, subtract
  • Otherwise
    • 0 bit in quotient, bring down next dividend bit
• Restoring division (see the C sketch below)
  • Do the subtract, and if remainder goes < 0, add divisor back
• Signed division
  • Divide using absolute values
  • Adjust sign of quotient and remainder as required
• n-bit operands yield n-bit quotient and remainder

Division Hardware
[Figure: sequential division hardware; divisor initially in left half of the divisor register; remainder register initially holds the dividend]
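A C sketch of the restoring algorithm just described, for unsigned 32-bit operands (signed division would divide magnitudes and fix up the signs afterwards; a zero divisor must be checked beforehand, as the slide notes):

    #include <stdint.h>
    #include <stdio.h>

    /* Restoring division: one quotient bit per step. */
    void restoring_div(uint32_t dividend, uint32_t divisor,
                       uint32_t *quotient, uint32_t *remainder)
    {
        uint64_t rem = dividend;                 /* remainder register: dividend initially */
        uint64_t div = (uint64_t)divisor << 32;  /* divisor starts in left half */
        uint32_t q = 0;

        for (int step = 0; step < 33; step++) {  /* 32-bit operands: 33 steps */
            rem -= div;                          /* trial subtract */
            if ((int64_t)rem >= 0) {
                q = (q << 1) | 1;                /* 1 bit in quotient */
            } else {
                rem += div;                      /* restore: add divisor back */
                q <<= 1;                         /* 0 bit in quotient */
            }
            div >>= 1;                           /* move divisor right one place */
        }
        *quotient = q;
        *remainder = (uint32_t)rem;
    }

    int main(void)
    {
        uint32_t q, r;
        restoring_div(74, 8, &q, &r);            /* the slide's 1001010 / 1000 */
        printf("q=%u r=%u\n", (unsigned)q, (unsigned)r);  /* q=9 r=2 */
        return 0;
    }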
Optimized Divider
• One cycle per partial-remainder subtraction
• Looks a lot like a multiplier!
  • Same hardware can be used for both

Faster Division
• Can't use parallel hardware as in multiplier
  • Subtraction is conditional on sign of remainder
• Faster dividers (e.g., SRT division) generate multiple quotient bits per step
  • Still require multiple steps
LEGv8 Division
• Two instructions:
  • SDIV (signed)
  • UDIV (unsigned)
• Both instructions ignore overflow and division-by-zero

§3.5 Floating Point

Floating Point
• Representation for non-integral numbers
  • Including very small and very large numbers
• Like scientific notation
  • –2.34 × 10^56      (normalized)
  • +0.002 × 10^–4     (not normalized)
  • +987.02 × 10^9     (not normalized)
• In binary
  • ±1.xxxxxxx₂ × 2^yyyy
• Types float and double in C
Floating Point Standard
• Defined by IEEE Std 754-1985
• Developed in response to divergence of representations
  • Portability issues for scientific code
• Now almost universally adopted
• Two representations
  • Single precision (32-bit)
  • Double precision (64-bit)

IEEE Floating-Point Format

      | S | Exponent | Fraction |
        Exponent: single 8 bits, double 11 bits
        Fraction: single 23 bits, double 52 bits

      x = (–1)^S × (1 + Fraction) × 2^(Exponent – Bias)

• S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
• Normalize significand: 1.0 ≤ |significand| < 2.0
  • Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)
  • Significand is Fraction with the "1." restored
• Exponent: excess representation: actual exponent + Bias
  • Ensures exponent is unsigned
  • Single: Bias = 127; Double: Bias = 1023
• (A C sketch that unpacks these fields follows below.)
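The field layout can be checked directly in C by reinterpreting a float's bits. A small sketch, assuming the usual 32-bit IEEE 754 float:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Unpack a single-precision float: 1 sign bit, 8 exponent bits
       (Bias = 127), 23 fraction bits. */
    int main(void)
    {
        float f = -0.75f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);              /* reinterpret without UB */

        unsigned sign = bits >> 31;
        unsigned exponent = (bits >> 23) & 0xFF;     /* biased exponent */
        unsigned fraction = bits & 0x7FFFFF;         /* without the hidden bit */

        printf("S=%u exponent=%u (actual %d) fraction=0x%06X\n",
               sign, exponent, (int)exponent - 127, fraction);
        /* Prints: S=1 exponent=126 (actual -1) fraction=0x400000 */
        return 0;
    }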
Single-Precision Range
• Exponents 00000000 and 11111111 reserved
• Smallest value
  • Exponent: 00000001 ⇒ actual exponent = 1 – 127 = –126
  • Fraction: 000…00 ⇒ significand = 1.0
  • ±1.0 × 2^–126 ≈ ±1.2 × 10^–38
• Largest value
  • Exponent: 11111110 ⇒ actual exponent = 254 – 127 = +127
  • Fraction: 111…11 ⇒ significand ≈ 2.0
  • ±2.0 × 2^+127 ≈ ±3.4 × 10^+38

Double-Precision Range
• Exponents 0000…00 and 1111…11 reserved
• Smallest value
  • Exponent: 00000000001 ⇒ actual exponent = 1 – 1023 = –1022
  • Fraction: 000…00 ⇒ significand = 1.0
  • ±1.0 × 2^–1022 ≈ ±2.2 × 10^–308
• Largest value
  • Exponent: 11111111110 ⇒ actual exponent = 2046 – 1023 = +1023
  • Fraction: 111…11 ⇒ significand ≈ 2.0
  • ±2.0 × 2^+1023 ≈ ±1.8 × 10^+308
Floating-Point Precision
• Relative precision
  • All fraction bits are significant
  • Single: approx 2^–23
    • Equivalent to 23 × log₁₀ 2 ≈ 23 × 0.3 ≈ 6 decimal digits of precision
  • Double: approx 2^–52
    • Equivalent to 52 × log₁₀ 2 ≈ 52 × 0.3 ≈ 16 decimal digits of precision

Floating-Point Example
• Represent –0.75
  • –0.75 = (–1)^1 × 1.1₂ × 2^–1
  • S = 1
  • Fraction = 1000…00₂
  • Exponent = –1 + Bias
    • Single: –1 + 127 = 126 = 01111110₂
    • Double: –1 + 1023 = 1022 = 01111111110₂
• Single: 1 01111110 1000…00
• Double: 1 01111111110 1000…00
Floating-Point Example
• What number is represented by the single-precision float
  1 10000001 01000…00 ?
• S = 1
• Fraction = 01000…00₂
• Exponent = 10000001₂ = 129
• x = (–1)^1 × (1 + 0.01₂) × 2^(129 – 127)
    = (–1) × 1.25 × 2^2
    = –5.0

Denormal Numbers
• Exponent = 000...0 ⇒ hidden bit is 0

      x = (–1)^S × (0 + Fraction) × 2^(1 – Bias)

• Smaller than normal numbers
  • Allow for gradual underflow, with diminishing precision (see the C sketch below)
• Denormal with Fraction = 000...0:

      x = (–1)^S × (0 + 0) × 2^(1 – Bias) = ±0.0

  • Two representations of 0.0!
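Gradual underflow is observable from C. A small sketch, assuming IEEE 754 single precision with denormals enabled (the default on most hardware):

    #include <float.h>
    #include <stdio.h>

    /* Below FLT_MIN (the smallest normal), halving walks through the
       denormals until the value finally rounds to zero. */
    int main(void)
    {
        float f = FLT_MIN;                 /* 1.0 x 2^-126, smallest normal */
        while (f / 2.0f > 0.0f)
            f /= 2.0f;
        printf("smallest denormal: %g\n", f);   /* ~1.4e-45 = 2^-149 */
        printf("negative zero: %g\n", -0.0f);   /* the "other" 0.0 */
        return 0;
    }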
Infinities and NaNs
• Exponent = 111...1, Fraction = 000...0
  • ±Infinity
  • Can be used in subsequent calculations, avoiding need for overflow check
• Exponent = 111...1, Fraction ≠ 000...0
  • Not-a-Number (NaN)
  • Indicates illegal or undefined result
    • e.g., 0.0 / 0.0
  • Can be used in subsequent calculations

Floating-Point Addition
• Consider a 4-digit decimal example (a toy C version follows below)
  • 9.999 × 10^1 + 1.610 × 10^–1
• 1. Align decimal points
  • Shift number with smaller exponent
  • 9.999 × 10^1 + 0.016 × 10^1
• 2. Add significands
  • 9.999 × 10^1 + 0.016 × 10^1 = 10.015 × 10^1
• 3. Normalize result & check for over/underflow
  • 1.0015 × 10^2
• 4. Round and renormalize if necessary
  • 1.002 × 10^2
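The four steps map naturally onto integer arithmetic. Below is a toy C model of the decimal example (positive operands only; the dec4 type and its scaling are our invention, not the book's):

    #include <stdio.h>

    /* Value is sig/1000 * 10^exp, with 4 significant digits
       (1000 <= sig <= 9999 when normalized). */
    typedef struct { long sig; int exp; } dec4;

    dec4 dec4_add(dec4 a, dec4 b)
    {
        while (a.exp > b.exp) { b.sig /= 10; b.exp++; }  /* 1. align points */
        while (b.exp > a.exp) { a.sig /= 10; a.exp++; }
        dec4 r = { a.sig + b.sig, a.exp };               /* 2. add significands */
        while (r.sig >= 10000) {                          /* 3. normalize... */
            r.sig = (r.sig + 5) / 10;                     /* 4. ...and round */
            r.exp++;
        }
        return r;
    }

    int main(void)
    {
        dec4 a = { 9999, 1 }, b = { 1610, -1 };  /* 9.999e1 + 1.610e-1 */
        dec4 r = dec4_add(a, b);
        printf("%ld.%03ld x 10^%d\n", r.sig / 1000, r.sig % 1000, r.exp);
        /* Prints: 1.002 x 10^2, matching the slide */
        return 0;
    }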
Floating-Point Addition
• Now consider a 4-digit binary example
  • 1.000₂ × 2^–1 + –1.110₂ × 2^–2   (0.5 + –0.4375)
• 1. Align binary points
  • Shift number with smaller exponent
  • 1.000₂ × 2^–1 + –0.111₂ × 2^–1
• 2. Add significands
  • 1.000₂ × 2^–1 + –0.111₂ × 2^–1 = 0.001₂ × 2^–1
• 3. Normalize result & check for over/underflow
  • 1.000₂ × 2^–4, with no over/underflow
• 4. Round and renormalize if necessary
  • 1.000₂ × 2^–4 (no change) = 0.0625

FP Adder Hardware
• Much more complex than integer adder
• Doing it in one clock cycle would take too long
  • Much longer than integer operations
  • Slower clock would penalize all instructions
• FP adder usually takes several cycles
  • Can be pipelined
FP Adder Hardware
[Figure: FP adder datapath; Step 1: compare exponents and align significands; Step 2: add significands; Step 3: normalize; Step 4: round]

Floating-Point Multiplication
• Consider a 4-digit decimal example
  • 1.110 × 10^10 × 9.200 × 10^–5
• 1. Add exponents
  • For biased exponents, subtract bias from sum
  • New exponent = 10 + –5 = 5
• 2. Multiply significands
  • 1.110 × 9.200 = 10.212 ⇒ 10.212 × 10^5
• 3. Normalize result & check for over/underflow
  • 1.0212 × 10^6
• 4. Round and renormalize if necessary
  • 1.021 × 10^6
• 5. Determine sign of result from signs of operands
  • +1.021 × 10^6
Floating-Point Multiplication
• Now consider a 4-digit binary example
  • 1.000₂ × 2^–1 × –1.110₂ × 2^–2   (0.5 × –0.4375)
• 1. Add exponents
  • Unbiased: –1 + –2 = –3
  • Biased: (–1 + 127) + (–2 + 127) = –3 + 254 – 127 = –3 + 127
• 2. Multiply significands
  • 1.000₂ × 1.110₂ = 1.110₂ ⇒ 1.110₂ × 2^–3
• 3. Normalize result & check for over/underflow
  • 1.110₂ × 2^–3 (no change), with no over/underflow
• 4. Round and renormalize if necessary
  • 1.110₂ × 2^–3 (no change)
• 5. Determine sign: +ve × –ve ⇒ –ve
  • –1.110₂ × 2^–3 = –0.21875

FP Arithmetic Hardware
• FP multiplier is of similar complexity to FP adder
  • But uses a multiplier for significands instead of an adder
• FP arithmetic hardware usually does
  • Addition, subtraction, multiplication, division, reciprocal, square-root
  • FP ↔ integer conversion
• Operations usually take several cycles
  • Can be pipelined
FP Instructions in LEGv8
• Separate FP registers
  • 32 single-precision: S0, …, S31
  • 32 double-precision: D0, …, D31
  • Sn stored in the lower 32 bits of Dn
• FP instructions operate only on FP registers
  • Programs generally don't do integer ops on FP data, or vice versa
  • More registers with minimal code-size impact
• FP load and store instructions
  • LDURS, LDURD
  • STURS, STURD

FP Instructions in LEGv8
• Single-precision arithmetic
  • FADDS, FSUBS, FMULS, FDIVS
    • e.g., FADDS S2, S4, S6
• Double-precision arithmetic
  • FADDD, FSUBD, FMULD, FDIVD
    • e.g., FADDD D2, D4, D6
• Single- and double-precision comparison
  • FCMPS, FCMPD
  • Sets or clears FP condition-code bits
• Branch on FP condition code true or false
  • B.cond
FP Example: °F to °C
• C code:

    float f2c (float fahr) {
      return ((5.0/9.0)*(fahr - 32.0));
    }

  • fahr in S12, result in S0, literals in global memory space
• Compiled LEGv8 code:

    f2c:
      LDURS S16, [X27,const5]   // S16 = 5.0 (5.0 in memory)
      LDURS S18, [X27,const9]   // S18 = 9.0 (9.0 in memory)
      FDIVS S16, S16, S18       // S16 = 5.0 / 9.0
      LDURS S18, [X27,const32]  // S18 = 32.0
      FSUBS S18, S12, S18       // S18 = fahr – 32.0
      FMULS S0, S16, S18        // S0 = (5/9)*(fahr – 32.0)
      BR LR                     // return

FP Example: Array Multiplication
• X = X + Y × Z
  • All 32 × 32 matrices, 64-bit double-precision elements
• C code:

    void mm (double x[][32],
             double y[][32], double z[][32]) {
      int i, j, k;
      for (i = 0; i != 32; i = i + 1)
        for (j = 0; j != 32; j = j + 1)
          for (k = 0; k != 32; k = k + 1)
            x[i][j] = x[i][j]
                    + y[i][k] * z[k][j];
    }

  • Addresses of x, y, z in X0, X1, X2, and i, j, k in X19, X20, X21
FP Example: Array Multiplication
• LEGv8 code:

    mm: ...
        LDI   X10, 32          // X10 = 32 (row size/loop end)
        LDI   X19, 0           // i = 0; initialize 1st for loop
    L1: LDI   X20, 0           // j = 0; restart 2nd for loop
    L2: LDI   X21, 0           // k = 0; restart 3rd for loop
        LSL   X11, X19, 5      // X11 = i * 2^5 (size of row of c)
        ADD   X11, X11, X20    // X11 = i * size(row) + j
        LSL   X11, X11, 3      // X11 = byte offset of [i][j]
        ADD   X11, X0, X11     // X11 = byte address of c[i][j]
        LDURD D4, [X11,#0]     // D4 = 8 bytes of c[i][j]
    L3: LSL   X9, X21, 5       // X9 = k * 2^5 (size of row of b)
        ADD   X9, X9, X20      // X9 = k * size(row) + j
        LSL   X9, X9, 3        // X9 = byte offset of [k][j]
        ADD   X9, X2, X9       // X9 = byte address of b[k][j]
        LDURD D16, [X9,#0]     // D16 = 8 bytes of b[k][j]
        LSL   X9, X19, 5       // X9 = i * 2^5 (size of row of a)
        ADD   X9, X9, X21      // X9 = i * size(row) + k
        LSL   X9, X9, 3        // X9 = byte offset of [i][k]
        ADD   X9, X1, X9       // X9 = byte address of a[i][k]
        LDURD D18, [X9,#0]     // D18 = 8 bytes of a[i][k]
        FMULD D16, D18, D16    // D16 = a[i][k] * b[k][j]
        FADDD D4, D4, D16      // D4 = c[i][j] + a[i][k] * b[k][j]
        ADDI  X21, X21, #1     // k = k + 1
        CMP   X21, X10         // test k vs. 32
        B.LT  L3               // if (k < 32) go to L3
        STURD D4, [X11,#0]     // c[i][j] = D4
        ADDI  X20, X20, #1     // j = j + 1
        CMP   X20, X10         // test j vs. 32
        B.LT  L2               // if (j < 32) go to L2
        ADDI  X19, X19, #1     // i = i + 1
        CMP   X19, X10         // test i vs. 32
        B.LT  L1               // if (i < 32) go to L1
Accurate Arithmetic
• IEEE Std 754 specifies additional rounding control
  • Extra bits of precision (guard, round, sticky)
  • Choice of rounding modes (see the C sketch below)
  • Allows programmer to fine-tune numerical behavior of a computation
• Not all FP units implement all options
  • Most programming languages and FP libraries just use defaults
• Trade-off between hardware complexity, performance, and market requirements
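C99 exposes the IEEE rounding-mode choice through <fenv.h>. A minimal sketch; note that some compilers need extra flags (e.g. GCC's -frounding-math) before they fully respect runtime rounding-mode changes:

    #include <fenv.h>
    #include <stdio.h>

    int main(void)
    {
        volatile float one = 1.0f, three = 3.0f;  /* volatile defeats constant folding */

        fesetround(FE_TONEAREST);                 /* default: round to nearest even */
        printf("to nearest: %.8f\n", one / three);

        fesetround(FE_UPWARD);                    /* round toward +infinity */
        printf("upward:     %.8f\n", one / three);  /* one ulp larger */

        fesetround(FE_TONEAREST);                 /* restore the default */
        return 0;
    }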
§3.6 Parallelism and Computer Arithmetic: Subword Parallelism

Subword Parallelism
• Graphics and audio applications can take advantage of performing simultaneous operations on short vectors (see the C sketch below)
  • Example: 128-bit adder:
    • Sixteen 8-bit adds
    • Eight 16-bit adds
    • Four 32-bit adds
• Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD)
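The partitioned carry chain can even be imitated in plain C ("SWAR": SIMD within a register). This sketch performs four independent 8-bit adds inside one 32-bit word by masking so that carries cannot cross lane boundaries:

    #include <stdint.h>
    #include <stdio.h>

    uint32_t add4x8(uint32_t a, uint32_t b)
    {
        /* Add the low 7 bits of each lane; no 7-bit sum can carry
           into the neighboring byte. */
        uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
        /* Fold in each lane's top bit without propagating a carry out. */
        return low7 ^ (a & 0x80808080u) ^ (b & 0x80808080u);
    }

    int main(void)
    {
        /* Lanes: 0x01+0x02, 0x7F+0x01, 0xFF+0x01 (wraps), 0x10+0x20 */
        uint32_t r = add4x8(0x017FFF10u, 0x02010120u);
        printf("%08X\n", (unsigned)r);   /* 03800030: each lane wraps independently */
        return 0;
    }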
ARMv8 SIMD
• 32 128-bit registers (V0, …, V31)
• Works with integer and FP
• Examples:
  • 16 8-bit integer adds:
      ADD V1.16B, V2.16B, V3.16B
  • 4 32-bit FP adds:
      FADD V1.4S, V2.4S, V3.4S

Other ARMv8 Features
• 245 SIMD instructions, including:
  • Square root
  • Fused multiply-add, multiply-subtract
  • Conversion and scalar and vector round-to-integral
  • Structured (strided) vector load/stores
  • Saturating arithmetic
§3.7 Real Stuff: Streaming SIMD Extensions and AVX in x86

x86 FP Architecture
• Originally based on 8087 FP coprocessor
  • 8 × 80-bit extended-precision registers
  • Used as a push-down stack
  • Registers indexed from TOS: ST(0), ST(1), …
• FP values are 32-bit or 64-bit in memory
  • Converted on load/store of memory operand
  • Integer operands can also be converted on load/store
• Very difficult to generate and optimize code
  • Result: poor FP performance

x86 FP Instructions

    Data transfer      Arithmetic            Compare          Transcendental
    FILD mem/ST(i)     FIADDP mem/ST(i)      FICOMP           FPATAN
    FISTP mem/ST(i)    FISUBRP mem/ST(i)     FIUCOMP          F2XM1
    FLDPI              FIMULP mem/ST(i)      FSTSW AX/mem     FCOS
    FLD1               FIDIVRP mem/ST(i)                      FPTAN
    FLDZ               FSQRT                                  FPREM
                       FABS                                   FSIN
                       FRNDINT                                FYL2X

• Optional variations
  • I: integer operand
  • P: pop operand from stack
  • R: reverse operand order
  • But not all combinations allowed
§3.8 Going Faster: Subword Parallelism and Matrix Multiply

Streaming SIMD Extension 2 (SSE2)
• Adds 8 × 128-bit registers
  • Extended to 16 registers in AMD64/EM64T
• Can be used for multiple FP operands
  • 2 × 64-bit double precision
  • 4 × 32-bit single precision
• Instructions operate on them simultaneously
  • Single-Instruction Multiple-Data

Matrix Multiply
• Unoptimized code:

    void dgemm (int n, double* A, double* B, double* C)
    {
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
          double cij = C[i+j*n];          /* cij = C[i][j] */
          for (int k = 0; k < n; k++)
            cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k]*B[k][j] */
          C[i+j*n] = cij;                 /* C[i][j] = cij */
        }
    }
Matrix Multiply
• x86 assembly code (inner loop):

    vmovsd (%r10),%xmm0              # Load 1 element of C into %xmm0
    mov    %rsi,%rcx                 # register %rcx = %rsi
    xor    %eax,%eax                 # register %eax = 0
    vmovsd (%rcx),%xmm1              # Load 1 element of B into %xmm1
    add    %r9,%rcx                  # register %rcx = %rcx + %r9
    vmulsd (%r8,%rax,8),%xmm1,%xmm1  # Multiply %xmm1, element of A
    add    $0x1,%rax                 # register %rax = %rax + 1
    cmp    %eax,%edi                 # compare %eax to %edi
    vaddsd %xmm1,%xmm0,%xmm0         # Add %xmm1, %xmm0
    jg     30 <dgemm+0x30>           # jump if %edi > %eax
    add    $0x1,%r11d                # register %r11 = %r11 + 1
    vmovsd %xmm0,(%r10)              # Store %xmm0 into C element

Matrix Multiply
• Optimized C code:

    #include <x86intrin.h>
    void dgemm (int n, double* A, double* B, double* C)
    {
      for ( int i = 0; i < n; i+=4 )
        for ( int j = 0; j < n; j++ ) {
          __m256d c0 = _mm256_load_pd(C+i+j*n);  /* c0 = C[i][j] */
          for ( int k = 0; k < n; k++ )
            c0 = _mm256_add_pd(c0,               /* c0 += A[i][k]*B[k][j] */
                   _mm256_mul_pd(_mm256_load_pd(A+i+k*n),
                                 _mm256_broadcast_sd(B+k+j*n)));
          _mm256_store_pd(C+i+j*n, c0);          /* C[i][j] = c0 */
        }
    }
Matrix Multiply
• Optimized x86 assembly code (inner loop):

    vmovapd (%r11),%ymm0             # Load 4 elements of C into %ymm0
    mov     %rbx,%rcx                # register %rcx = %rbx
    xor     %eax,%eax                # register %eax = 0
    vbroadcastsd (%rax,%r8,1),%ymm1  # Make 4 copies of B element
    add     $0x8,%rax                # register %rax = %rax + 8
    vmulpd  (%rcx),%ymm1,%ymm1       # Parallel mul %ymm1, 4 A elements
    add     %r9,%rcx                 # register %rcx = %rcx + %r9
    cmp     %r10,%rax                # compare %r10 to %rax
    vaddpd  %ymm1,%ymm0,%ymm0        # Parallel add %ymm1, %ymm0
    jne     50 <dgemm+0x50>          # jump if %r10 != %rax
    add     $0x1,%esi                # register %esi = %esi + 1
    vmovapd %ymm0,(%r11)             # Store %ymm0 into 4 C elements

§3.9 Fallacies and Pitfalls

Right Shift and Division
• Left shift by i places multiplies an integer by 2^i
• Right shift divides by 2^i?
  • Only for unsigned integers
• For signed integers
  • Arithmetic right shift: replicate the sign bit
  • e.g., –5 / 4
    • 11111011₂ >> 2 = 11111110₂ = –2
    • Rounds toward –∞
    • cf. logical shift: 11111011₂ >>> 2 = 00111110₂ = +62
  • (A C demonstration follows below.)
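The pitfall is easy to demonstrate in C, where >> on a negative signed value is implementation-defined but is an arithmetic shift on typical machines:

    #include <stdio.h>

    int main(void)
    {
        int x = -5;
        printf("-5 / 4  = %d\n", x / 4);    /* -1: division truncates toward 0 */
        printf("-5 >> 2 = %d\n", x >> 2);   /* -2 on typical machines: rounds toward -inf */

        unsigned u = 0xFBu;                 /* 1111 1011 in the low 8 bits */
        printf("0xFB >> 2 = 0x%X\n", u >> 2);  /* 0x3E = 62: logical shift, like >>> */
        return 0;
    }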
Associativity
• Parallel programs may interleave operations in unexpected orders
  • Assumptions of associativity may fail
• Example, with x = –1.50E+38, y = 1.50E+38, z = 1.0:

      (x + y) + z = 0.00E+00 + 1.0       = 1.00E+00
      x + (y + z) = –1.50E+38 + 1.50E+38 = 0.00E+00

  (y + z rounds to 1.50E+38 because 1.0 is too small to change it)
• Need to validate parallel programs under varying degrees of parallelism (see the C demonstration below)

Who Cares About FP Accuracy?
• Important for scientific code
  • But for everyday consumer use?
    • "My bank balance is out by 0.0002¢!" ☹
• The Intel Pentium FDIV bug
  • The market expects accuracy
  • See Colwell, The Pentium Chronicles
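The failure case above runs as-is in C:

    #include <stdio.h>

    /* FP addition is not associative: grouping changes the rounding. */
    int main(void)
    {
        float x = -1.5e38f, y = 1.5e38f, z = 1.0f;
        printf("(x + y) + z = %e\n", (x + y) + z);  /* 1.000000e+00 */
        printf("x + (y + z) = %e\n", x + (y + z));  /* 0.000000e+00: z vanishes into y */
        return 0;
    }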
§3.10 Concluding Remarks

Concluding Remarks
• Bits have no inherent meaning
  • Interpretation depends on the instructions applied
• Computer representations of numbers
  • Finite range and precision
  • Need to account for this in programs

Concluding Remarks
• ISAs support arithmetic
  • Signed and unsigned integers
  • Floating-point approximation to reals
• Bounded range and precision
  • Operations can overflow and underflow