Lecture 05: Quantization (Part I)
Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT
MIT 6.5940: TinyML and Efficient Deep Learning Computing (https://efficientml.ai)
Lecture Plan
Today we will:
1. Review the numeric data types, including integers and floating-point numbers (FP32, FP16, INT4, etc.)
2. Learn the basic concept of neural network quantization
3. Learn three types of common neural network quantization:
   1. K-Means-based Quantization
   2. Linear Quantization
   3. Binary and Ternary Quantization
[Figures: an 8-bit two's-complement example, 1 1 0 0 1 1 1 1 → −2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0 = −49, and a plot of a continuous signal vs. its quantized signal over time.]
Low Bit Operations are Cheaper
Less Bit-Width → Less Energy
How should we make deep learning more efficient?
[Energy per operation: a 32-bit float ADD costs about 0.9 pJ, while an 8-bit int MULT costs about 0.2 pJ; lower bit-width operations cost far less energy.]
Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
Numeric Data Types
How is numeric data represented in modern computing systems?
Integer
• Unsigned Integer
  • n-bit range: [0, 2^n − 1]
  • Example: 0 0 1 1 0 0 0 1 → 2^5 + 2^4 + 2^0 = 49
• Signed Integer, Sign-Magnitude Representation (the first bit is the sign bit)
  • n-bit range: [−(2^(n−1) − 1), 2^(n−1) − 1]
  • Both 000…00 and 100…00 represent 0
  • Example: 1 0 1 1 0 0 0 1 → −(2^5 + 2^4 + 2^0) = −49
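A minimal Python sketch (not from the slides) that decodes an 8-bit pattern under these encodings, plus the two's-complement example used on the lecture-plan slide:

```python
# Minimal sketch: interpret 8-bit patterns under different integer encodings.
def decode_unsigned(bits: str) -> int:
    return int(bits, 2)

def decode_sign_magnitude(bits: str) -> int:
    sign = -1 if bits[0] == "1" else 1
    return sign * int(bits[1:], 2)

def decode_twos_complement(bits: str) -> int:
    value = int(bits, 2)
    return value - (1 << len(bits)) if bits[0] == "1" else value

print(decode_unsigned("00110001"))         # 49
print(decode_sign_magnitude("10110001"))   # -49
print(decode_twos_complement("11001111"))  # -49 (the lecture-plan example)
```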
Fixed-Point Number
Integer . Fraction: the ("decimal") point position is fixed; here, 4 integer bits and 4 fraction bits, with the MSB carrying weight −2^3 (two's complement).
Example: 0 0 1 1 . 0 0 0 1 → 2^1 + 2^0 + 2^-4 = 3.0625
Equivalently, interpret the bits as a two's-complement integer scaled by 2^-4:
0 0 1 1 0 0 0 1 → (2^5 + 2^4 + 2^0) × 2^-4 = 49 × 0.0625 = 3.0625
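A minimal sketch of the same arithmetic (the choice of 4 fraction bits is just this example's assumption): the fixed-point value equals the underlying two's-complement integer scaled by 2^-4.

```python
# Minimal sketch: an 8-bit fixed-point number with 4 fraction bits is the underlying
# two's-complement integer scaled by 2**-4.
def fixed_point_value(bits: str, frac_bits: int = 4) -> float:
    raw = int(bits, 2)
    if bits[0] == "1":              # negative in two's complement
        raw -= 1 << len(bits)
    return raw * 2.0 ** (-frac_bits)

print(fixed_point_value("00110001"))  # 49 * 0.0625 = 3.0625
```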
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)
0 01111101 00010000000000000000000
Exponent field = 125, Fraction = 0.0625
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
0 01111101 00010000000000000000000 → 0.265625 = 1.0625 × 2^-2 = (1 + 0.0625) × 2^(125−127)
0 00000000 00000000000000000000000 → 0 (= 0 × 2^-126)
1 00000000 00000000000000000000000 → −0
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754 (smallest positive values)
0 00000001 00000000000000000000000 → smallest normal number: 2^-126 = (1 + 0) × 2^(1−127)
0 00000000 00000000000000000000001 → smallest subnormal number: 2^-149 = 2^-23 × 2^-126
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754 (around the normal/subnormal boundary)
0 00000001 00000000000000000000000 → smallest normal number: 2^-126
0 00000000 11111111111111111111111 → largest subnormal number: (1 − 2^-23) × 2^-126
Floating-Point Number
Example: 32-bit floating-point number in IEEE 754 (special values)
0 11111111 00000000000000000000000 → +∞ (positive infinity)
1 11111111 00000000000000000000000 → −∞ (negative infinity)
S 11111111 (any nonzero fraction)  → NaN (Not a Number)
A large share of the encoding space is spent on NaN; much waste. We will revisit this in FP8.
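A minimal Python sketch (not from the slides) that decodes a 32-bit IEEE 754 pattern by hand (normal, subnormal, and special cases) and cross-checks it against struct:

```python
import struct

# Minimal sketch: decode a 32-bit IEEE 754 bit pattern into sign, exponent, fraction.
def decode_fp32(bits: str) -> float:
    raw = int(bits, 2)
    sign = raw >> 31
    exponent = (raw >> 23) & 0xFF
    fraction = raw & ((1 << 23) - 1)
    if exponent == 0:                      # subnormal (or zero)
        value = (fraction / 2**23) * 2.0 ** (-126)
    elif exponent == 255:                  # infinity / NaN
        value = float("nan") if fraction else float("inf")
    else:                                  # normal number
        value = (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
    return -value if sign else value

pattern = "00111110100010000000000000000000"
print(decode_fp32(pattern))                                         # 0.265625
print(struct.unpack(">f", int(pattern, 2).to_bytes(4, "big"))[0])   # 0.265625 (cross-check)
```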
Floating-Point Number
Exponent Width → Range; Fraction Width → Precision

Format                                    Exponent (bits)   Fraction (bits)   Total (bits)
IEEE 754 Single Precision (IEEE FP32)     8                 23                32
IEEE 754 Half Precision (IEEE FP16)       5                 10                16
Brain Float (BF16)                        8                 7                 16
Numeric Data Types
• Question: What is the following IEEE half-precision (IEEE FP16) number in decimal?
  1 10001 1100000000
  • Sign: − (sign bit = 1)
  • Exponent: 10001₂ − 15 = 17 − 15 = 2
  • Fraction: 0.1100000000₂ = 0.75
  • Decimal answer: −(1 + 0.75) × 2² = −1.75 × 4 = −7.0
Numeric Data Types
• Question: What is the decimal 2.5 in Brain Float (BF16)?
  • Sign: + (2.5 = 1.25 × 2¹)
  • Exponent: 1 + 127 = 128 = 10000000₂
  • Fraction: 0.25 = 0.0100000₂
  • Binary answer: 0 10000000 0100000
    (Sign | 8-bit Exponent | 7-bit Fraction)
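A minimal sketch of the same conversion: since BF16 keeps the FP32 sign, the full 8-bit exponent, and the top 7 fraction bits, truncating the FP32 bit pattern to its upper 16 bits reproduces the answer above (2.5 is exactly representable, so truncation vs. rounding does not matter here).

```python
import struct

# Minimal sketch: BF16 = upper 16 bits of the FP32 encoding (simple truncation;
# production converters typically round to nearest even instead).
def float_to_bf16_bits(x: float) -> str:
    fp32_bits = int.from_bytes(struct.pack(">f", x), "big")
    return format(fp32_bits >> 16, "016b")

print(float_to_bf16_bits(2.5))   # 0100000000100000
```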
Floating-Point Number
Exponent Width → Range; Fraction Width → Precision

Format                                                Exponent (bits)   Fraction (bits)   Total (bits)
IEEE 754 Single Precision (IEEE FP32)                 8                 23                32
IEEE 754 Half Precision (IEEE FP16)                   5                 10                16
NVIDIA FP8 (E4M3)                                     4                 3                 8
NVIDIA FP8 (E5M2), for gradients in the backward pass 5                 2                 8

* FP8 E4M3 does not have INF; S.1111.111₂ is used for NaN. The largest E4M3 normal value is S.1111.110₂ = 448.
* FP8 E5M2 has INF (S.11111.00₂) and NaN (S.11111.XX₂). The largest E5M2 normal value is S.11110.11₂ = 57344.
INT4 and FP4
Exponent Width → Range; Fraction Width → Precision

INT4 (S, 3 magnitude bits)
  representable values: −8 … −1 and 0 … 7
  0 0 0 1 = 1        0 1 1 1 = 7

FP4 E1M2 (S E M M)
  representable values: ±{0, 1, 2, …, 7} × 0.5 = ±{0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5}
  0 0 0 1 = 0.25 × 2^(1−0) = 0.5        0 1 1 1 = (1 + 0.75) × 2^(1−0) = 3.5

FP4 E2M1 (S E E M); no Inf, no NaN
  representable values: ±{0, 1, 2, 3, 4, 6, 8, 12} × 0.5 = ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}
  0 0 0 1 = 0.5 × 2^(1−1) = 0.5        0 1 1 1 = (1 + 0.5) × 2^(3−1) = 6

FP4 E3M0 (S E E E); no Inf, no NaN
  representable values: ±{0, 1, 2, 4, 8, 16, 32, 64} × 0.25 = ±{0, 0.25, 0.5, 1, 2, 4, 8, 16}
  0 0 0 1 = (1 + 0) × 2^(1−3) = 0.25        0 1 1 1 = (1 + 0) × 2^(7−3) = 16
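A minimal sketch (not from the slides) that decodes such small floating-point formats given their exponent/mantissa widths and bias; the bias values below are the ones implied by the worked examples above, and Inf/NaN handling is omitted:

```python
# Minimal sketch: decode a small float with E exponent bits and M mantissa bits.
# exp == 0 is treated as subnormal: fraction * 2**(1 - bias); otherwise (1 + fraction) * 2**(exp - bias).
def decode_minifloat(bits: str, e_bits: int, m_bits: int, bias: int) -> float:
    sign = -1.0 if bits[0] == "1" else 1.0
    exp = int(bits[1:1 + e_bits], 2) if e_bits else 0
    frac = int(bits[1 + e_bits:], 2) / (1 << m_bits) if m_bits else 0.0
    if exp == 0:                                   # subnormal (or zero)
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1 + frac) * 2.0 ** (exp - bias)

print(decode_minifloat("0111", e_bits=2, m_bits=1, bias=1))      # FP4 E2M1: 6.0
print(decode_minifloat("0001", e_bits=2, m_bits=1, bias=1))      # FP4 E2M1: 0.5
print(decode_minifloat("01111110", e_bits=4, m_bits=3, bias=7))  # FP8 E4M3 max normal: 448.0
```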
What is Quantization?
[Figure: a continuous signal vs. its quantized counterpart over time.]
Baseline, before quantization:
• Storage: Floating-Point Weights
• Computation: Floating-Point Arithmetic
Neural Network Quantization: Agenda
[Overview: 32-bit float weights (2.09, −0.98, 1.48, 0.09) can be quantized with a K-Means-based codebook (2-bit indexes 3, 0, 2, 1 and codebook 3: 2.00, …; arithmetic stays floating-point) or with linear quantization (integer weights 1, −2, 0, −1, …; arithmetic becomes integer).]
Neural Network Quantization
Weight Quantization
weights (32-bit float):
2.09  −0.98   1.48   0.09
Neural Network Quantization
Weight Quantization
weights (32-bit float):
2.09  −0.98   1.48   0.09
0.05  −0.14  −1.08   2.12
Weights with similar values (e.g., 2.09, 2.12, 1.92, 1.87) can share a single representative value.
K-Means-based Weight Quantization
weights (32-bit float):            cluster indexes (2-bit int):    codebook (32-bit float):
2.09  −0.98   1.48   0.09          3  0  2  1                      3: 2.00
0.05  −0.14  −1.08   2.12          1  1  0  3                      2: 1.50
…                                  …                               1: 0.00
                                                                   0: −1.00
quantization error (weight − codebook[index]):
 0.09   0.02  −0.02   0.09
 0.05  −0.14  −0.08   0.12
 0.09  −0.08   0     −0.03
−0.13   0      0.03  −0.01

Storage (16 weights, 2-bit quantization):
• weights: 32 bit × 16 = 512 bit = 64 B
• cluster indexes: 2 bit × 16 = 32 bit = 4 B
• codebook: 32 bit × 4 = 128 bit = 16 B
• total after quantization: 4 B + 16 B = 20 B → 3.2× smaller

In general, assume N-bit quantization and #parameters = M ≫ 2^N:
• before: 32 bit × M = 32M bit
• after: N bit × M (indexes) + 32 bit × 2^N = 2^(N+5) bit (codebook) ≈ NM bit → about 32/N × smaller
Deep Compression [Han et al., ICLR 2016]
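A minimal sketch of the clustering step, assuming scikit-learn is available (KMeans here stands in for the per-layer weight clustering used in Deep Compression; the 2×4 weight matrix reuses values from the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

# Minimal sketch of K-Means-based weight quantization: fit 2^N centroids to the
# flattened weights, then store N-bit indexes plus a small float codebook.
def kmeans_quantize(weights: np.ndarray, n_bits: int = 2):
    w = weights.reshape(-1, 1)
    kmeans = KMeans(n_clusters=2 ** n_bits, n_init=10, random_state=0).fit(w)
    codebook = kmeans.cluster_centers_.flatten()        # 2^N float centroids
    indexes = kmeans.labels_.reshape(weights.shape)     # N-bit index per weight
    return codebook, indexes

weights = np.array([[2.09, -0.98, 1.48, 0.09],
                    [0.05, -0.14, -1.08, 2.12]], dtype=np.float32)
codebook, indexes = kmeans_quantize(weights, n_bits=2)
reconstructed = codebook[indexes]
print(codebook)
print(np.abs(weights - reconstructed).max())            # max quantization error
```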
K-Means-based Weight Quantization
Fine-tuning Quantized Weights
The gradient with respect to the 32-bit float weights is grouped by cluster index (the same 2-bit index map used for the weights), reduced (summed) within each group, and the reduced gradient, scaled by the learning rate, is used to update the float centroids in the codebook.
gradient (two rows shown):
−0.01   0.01  −0.02   0.12
−0.01   0.02   0.04   0.01
[Plot: accuracy loss (+0.5% to −4.5%) vs. model size ratio after compression (2%-20%); curve shown: Quantization Only.]
Deep Compression [Han et al., ICLR 2016]
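A minimal sketch of the centroid update step (the gradient values reuse the two rows shown above; the learning rate is an arbitrary example value, and a real implementation would fold this into the training loop):

```python
import numpy as np

# Minimal sketch: group the weight gradients by cluster index, reduce (sum) each group,
# and take a gradient step on the codebook centroids.
def update_codebook(codebook, indexes, weight_grad, lr=0.1):
    grad = np.zeros_like(codebook)
    for k in range(len(codebook)):
        grad[k] = weight_grad[indexes == k].sum()   # group-by index, then reduce
    return codebook - lr * grad

codebook = np.array([-1.00, 0.00, 1.50, 2.00], dtype=np.float32)
indexes = np.array([[3, 0, 2, 1],
                    [1, 1, 0, 3]])
weight_grad = np.array([[-0.01, 0.01, -0.02, 0.12],
                        [-0.01, 0.02, 0.04, 0.01]], dtype=np.float32)
print(update_codebook(codebook, indexes, weight_grad))
```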
K-Means-based Weight Quantization
Accuracy vs. compression rate for AlexNet on the ImageNet dataset
[Plot: accuracy loss (−1.0% to −4.5%) vs. model size ratio after compression (2%-20%).]
Deep Compression [Han et al., ICLR 2016]
Before Quantization: Continuous Weight
[Histogram: count vs. weight value; the weights form a continuous distribution.]
Deep Compression [Han et al., ICLR 2016]
After Quantization: Discrete Weight
[Histogram: count vs. weight value; the weights collapse onto a few discrete centroid values.]
Deep Compression [Han et al., ICLR 2016]
After Quantization: Discrete Weight after Retraining
[Histogram: count vs. weight value; the discrete centroids shift slightly after retraining the codebook.]
Deep Compression [Han et al., ICLR 2016]
How Many Bits do We Need?
[Figure 1 of Deep Compression: the three-stage compression pipeline. Pruning (Train Connectivity → Prune Connections → Train Weights), quantization (Generate Code Book → Quantize the Weights with Code Book), and Huffman coding (Encode Weights, Encode Index). Relative to the original network, each stage keeps the same accuracy while shrinking the size: 9x-13x reduction after pruning, 27x-31x after quantization, and 35x-49x after Huffman coding.]
Deep Compression [Han et al., ICLR 2016]
Deep Compression Results
[Diagram: SqueezeNet Fire module; a 1×1 "squeeze" convolution down to 16 channels, followed by parallel 1×1 and 3×3 "expand" convolutions (64 channels each), concatenated into a 128-channel output.]

Network      Method             Compressed Size   Compression Ratio   Top-1 Accuracy   Top-5 Accuracy
AlexNet      Deep Compression   6.9 MB            35×                 57.2%            80.3%
SqueezeNet   Deep Compression   0.47 MB           510×                57.5%            80.3%
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]
K-Means-based Weight Quantization
[Diagram: inference with K-Means-quantized weights; the 2-bit cluster indexes (3 0 2 1, …) are looked up in the float codebook (3: 2.00, …) to reconstruct 32-bit float weights (2.09 −0.98 1.48 0.09 → 2.00 −1.00 1.50 0.00), which then feed a float computation: weights × inputs, plus a float bias, through ReLU to float outputs.]
• The weights are decompressed using a lookup table (i.e., the codebook) during runtime inference.
• K-Means-based weight quantization only saves the storage cost of a neural network model.
• All the computation and memory access are still floating-point (see the sketch below).
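A minimal sketch of that lookup-table decompression, reusing the 2-bit indexes and codebook from the figure (the input x is just a random example):

```python
import numpy as np

# Minimal sketch: expand 2-bit cluster indexes into float weights via the codebook,
# then run the layer entirely in floating point.
codebook = np.array([-1.00, 0.00, 1.50, 2.00], dtype=np.float32)  # 2^2 centroids
indexes = np.array([[3, 0, 2, 1],
                    [1, 1, 0, 3]], dtype=np.uint8)                # 2 bits per weight
weights = codebook[indexes]                  # decompressed 32-bit float weights
x = np.random.randn(4).astype(np.float32)
y = np.maximum(weights @ x, 0.0)             # float matmul (+ bias would go here), then ReLU
print(weights, y)
```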
Neural Network Quantization
[Recap: 32-bit float weights (2.09, −0.98, 1.48, 0.09) → K-Means-based quantization: 2-bit indexes (3, 0, 2, 1) plus a float codebook (3: 2.00, …), computation stays floating-point → Linear quantization: integer weights (1, −2, 0, −1, …), computation becomes integer arithmetic.]
Linear Quantization
What is Linear Quantization?
weights (32-bit float)
What is Linear Quantization?
An affine mapping of integers to real numbers
[Diagram: weights (32-bit float) → quantized weights (2-bit signed int) + zero point (2-bit signed int) + scale (32-bit float) → reconstructed weights (32-bit float)]
MIT 6.5940: TinyML and E cient Deep Learning Computing https://e cientml.ai 47
fl
fl
fl
ffi
ffi
ffi
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)
weights (32-bit float) ≈ reconstruction from quantized weights (2-bit signed int), zero point (2-bit signed int), and scale (32-bit float):
r = (q − Z) × S
floating-point   integer   integer   floating-point
• Z (zero point) is a quantization parameter; it allows the real number r = 0 to be exactly representable by a quantized integer.
• S (scale) is a quantization parameter (floating-point).
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
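A minimal sketch of the mapping r = S(q − Z) applied to the 2-bit example on these slides (using S ≈ 1.07 and Z = −1, the values this example implies; how S and Z are chosen is derived on the next slides):

```python
import numpy as np

# Minimal sketch of linear (affine) quantization and its inverse, r = S * (q - Z).
def linear_quantize(r, scale, zero_point, n_bits=2):
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(r / scale) + zero_point, qmin, qmax)
    return q.astype(np.int8)

def linear_dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

r = np.array([2.09, -0.98, 1.48, 0.09], dtype=np.float32)
q = linear_quantize(r, scale=1.07, zero_point=-1)
print(q)                                   # [ 1 -2  0 -1]
print(linear_dequantize(q, 1.07, -1))      # reconstructed (approximate) weights
```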
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)
[Diagram: the floating-point range [rmin, rmax] (which contains 0) is mapped onto the integer range [qmin, qmax]; S is the floating-point scale and Z is the integer zero point.]

Bit Width N    qmin         qmax
2              −2           1
3              −4           3
4              −8           7
N              −2^(N−1)     2^(N−1) − 1
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Scale of Linear Quantization
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
The extremes of the floating-point range map to the extremes of the integer range:
rmin = S (qmin − Z),   rmax = S (qmax − Z)
Subtracting the two equations gives the scale:
S = (rmax − rmin) / (qmax − qmin)
Zero Point of Linear Quantization
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
From rmin = S (qmin − Z), solve for the zero point:
Z = qmin − rmin / S
Since Z must be an integer:
Z = round(qmin − rmin / S)
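A minimal sketch computing S and Z from a tensor's min/max with these formulas (the 2-bit weight matrix reuses values from the earlier slides):

```python
import numpy as np

# Minimal sketch: derive scale and zero point for asymmetric N-bit linear quantization
# from the tensor's min/max, using S = (rmax - rmin)/(qmax - qmin) and
# Z = round(qmin - rmin/S).
def get_quantization_params(r: np.ndarray, n_bits: int):
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    rmin, rmax = float(r.min()), float(r.max())
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, zero_point

w = np.array([[2.09, -0.98, 1.48, 0.09],
              [0.05, -0.14, -1.08, 2.12]], dtype=np.float32)
print(get_quantization_params(w, n_bits=2))   # approximately (1.07, -1)
```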
Linear Quantized Matrix Multiplication
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the matrix multiplication Y = WX. Substituting r = S(q − Z) for Y, W, and X:
SY (qY − ZY) = SW (qW − ZW) · SX (qX − ZX)
qY = (SW SX / SY) (qW − ZW)(qX − ZX) + ZY
qY = (SW SX / SY) (qW qX − ZW qX − ZX qW + ZW ZX) + ZY
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Linear Quantized Matrix Multiplication
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following matrix multiplication: Y = WX
qY = (SW SX / SY) (qW qX − ZW qX − ZX qW + ZW ZX) + ZY
  • qW qX: N-bit integer multiplication (accumulated in 32 bits)
  • − ZW qX − ZX qW + ZW ZX: 32-bit integer addition/subtraction; the terms that depend only on the weights and zero points (− ZX qW + ZW ZX) can be precomputed
  • + ZY: N-bit integer addition
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Linear Quantized Matrix Multiplication
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following matrix multiplication: Y = WX
qY = (SW SX / SY) (qW qX − ZW qX − ZX qW + ZW ZX) + ZY
• Empirically, the combined scale SW SX / SY always lies in the interval (0, 1), so it can be written as
  SW SX / SY = 2^(−n) · M0, where M0 ∈ [0.5, 1)
  and implemented as a fixed-point multiplication by M0 followed by a bit shift by n.
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Linear Quantized Matrix Multiplication
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following matrix multiplication: Y = WX
qY = (SW SX / SY) (qW qX − ZW qX − ZX qW + ZW ZX) + ZY
  • SW SX / SY: rescale to N-bit integer
  • qW qX: N-bit integer multiplication
  • − ZW qX − ZX qW + ZW ZX: 32-bit integer addition/subtraction (partially precomputable)
  • + ZY: N-bit integer addition
• What if ZW = 0?
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
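A minimal sketch of this whole pipeline on a random example (the per-tensor parameters come from a simple min/max calibration; SY and ZY are taken from the float output, standing in for a calibration pass):

```python
import numpy as np

# Minimal sketch: simulate integer-arithmetic matmul for linear quantization and
# compare the dequantized result against the float reference.
def quantize(r, n_bits=8):
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    S = (r.max() - r.min()) / (qmax - qmin)
    Z = int(np.round(qmin - r.min() / S))
    q = np.clip(np.round(r / S) + Z, qmin, qmax).astype(np.int32)
    return q, S, Z

def quant_matmul(qW, qX, ZW, ZX, ZY, SW, SX, SY):
    acc = (qW - ZW) @ (qX - ZX)                   # int32 multiply-accumulate
    return np.round(SW * SX / SY * acc) + ZY      # rescale, add output zero point

W, X = np.random.randn(4, 4).astype(np.float32), np.random.randn(4, 4).astype(np.float32)
Y = W @ X
qW, SW, ZW = quantize(W)
qX, SX, ZX = quantize(X)
_, SY, ZY = quantize(Y)                           # output scale/zero point (calibration)
qY = quant_matmul(qW, qX, ZW, ZX, ZY, SW, SX, SY)
print(np.abs(SY * (qY - ZY) - Y).max())           # small error vs. the float reference
```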
Symmetric Linear Quantization
Zero point Z = 0 and a symmetric floating-point range [−|r|max, |r|max]
[Diagram: asymmetric mode maps [rmin, rmax] onto [qmin, qmax] with a zero point Z; symmetric mode maps [−|r|max, |r|max] onto [qmin, qmax] with Z = 0.]
Asymmetric:  S = (rmax − rmin) / (qmax − qmin)
Symmetric (full range): from rmin = S (qmin − Z) with Z = 0,
S = rmin / qmin = −|r|max / qmin = |r|max / 2^(N−1)

Bit Width N    qmin         qmax
2              −2           1
3              −4           3
4              −8           7
N              −2^(N−1)     2^(N−1) − 1
Linear Quantized Matrix Multiplication
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the same matrix multiplication Y = WX when ZW = 0 (symmetric weight quantization):
qY = (SW SX / SY) (qW qX − ZW qX − ZX qW + ZW ZX) + ZY
simplifies to
qY = (SW SX / SY) (qW qX − ZX qW) + ZY
  • SW SX / SY: rescale to N-bit integer; qW qX: N-bit integer multiplication (32-bit accumulation); − ZX qW: 32-bit integer subtraction, precomputable; + ZY: N-bit integer addition
Linear Quantized Fully-Connected Layer
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• So far, we have ignored the bias. Now consider the following fully-connected layer with bias:
Y = WX + b
Linear Quantized Fully-Connected Layer
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• So far, we have ignored the bias. Now consider the fully-connected layer with bias: Y = WX + b
Let ZW = 0, Zb = 0, and Sb = SW SX. Then
qY = (SW SX / SY) (qW qX + qb − ZX qW) + ZY
Precompute qbias = qb − ZX qW (we will discuss how to compute the activation zero point ZX in the next lecture):
qY = (SW SX / SY) (qW qX + qbias) + ZY
Linear Quantized Fully-Connected Layer
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Y = WX + b with ZW = 0, Zb = 0, Sb = SW SX, and qbias = qb − ZX qW:
qY = (SW SX / SY) (qW qX + qbias) + ZY
  • SW SX / SY: rescale to N-bit integer
  • qW qX: N-bit integer multiplication
  • + qbias: 32-bit integer addition
  • + ZY: N-bit integer addition
Note: both qb and qbias are stored as 32-bit integers.
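A minimal sketch of that computation; the quantized weights reuse the 2-bit row from the earlier example, while qX, qb, and the scales/zero points are hypothetical example values:

```python
import numpy as np

# Minimal sketch: quantized fully-connected layer with symmetric weights (ZW = 0)
# and a 32-bit bias quantized with Sb = SW * SX, Zb = 0.
def quant_fc(qX, qW, qb, ZX, ZY, SW, SX, SY):
    qbias = qb - ZX * qW.sum(axis=1)               # precomputed once, stored in int32
    acc = qW @ qX + qbias                          # int32 multiply-accumulate
    return np.round(SW * SX / SY * acc) + ZY       # rescale to N bits, add ZY

qW = np.array([[1, -2, 0, -1]], dtype=np.int32)    # 2-bit weights from the earlier example
qX = np.array([3, 0, 2, 1], dtype=np.int32)        # hypothetical quantized activations
qb = np.array([12], dtype=np.int32)                # hypothetical quantized bias
print(quant_fc(qX, qW, qb, ZX=1, ZY=0, SW=1.07, SX=0.5, SY=2.0))
```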
Linear Quantized Convolution Layer
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following convolution layer: Y = Conv(W, X) + b
With ZW = 0, Zb = 0, Sb = SW SX, and qbias = qb − Conv(qW, ZX):
qY = (SW SX / SY) (Conv(qW, qX) + qbias) + ZY
Linear Quantized Convolution Layer
Linear quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Y = Conv(W, X) + b with ZW = 0, Zb = 0, Sb = SW SX, and qbias = qb − Conv(qW, ZX):
qY = (SW SX / SY) (Conv(qW, qX) + qbias) + ZY
[Dataflow: quantized inputs qX (int) and quantized weights qW (int) → Conv (int32 accumulation) → add quantized bias qbias (int32) → multiply by the scale factor SW SX / SY (rescale to N-bit int) → add the output zero point ZY → quantized outputs (int).]
  • SW SX / SY: rescale to N-bit integer; Conv(qW, qX): N-bit integer multiplication; + qbias: 32-bit integer addition; + ZY: N-bit integer addition
Note: both qb and qbias are 32-bit integers.
Neural Network                        ResNet-50   Inception-V3
Floating-point accuracy               76.4%       78.4%
8-bit integer-quantized accuracy      74.9%       75.4%

[Figure: latency-vs-accuracy tradeoff of float vs. integer-only MobileNets on ImageNet using Snapdragon 835 big cores, plotting Top-1 accuracy (40-70%) against latency (5-120 ms). At the same latency, the integer-only (8-bit) quantized MobileNets achieve higher accuracy than the float models.]
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Neural Network Quantization
Example: weights 2.09, −0.98, 1.48, 0.09 → K-Means-based: indexes 3, 0, 2, 1 with codebook (3: 2.00, …) → Linear: integer weights 1, −2, 0, −1.

                 (Unquantized)               K-Means-based Quantization                  Linear Quantization
Storage          Floating-Point Weights      Integer Weights; Floating-Point Codebook    Integer Weights
Computation      Floating-Point Arithmetic   Floating-Point Arithmetic                   Integer Arithmetic
Summary of Today’s Lecture
Today, we reviewed and learned:
• the numeric data types used in modern computing systems, including integers and floating-point numbers
  (e.g., the 8-bit two's-complement pattern 1 1 0 0 1 1 1 1 = −2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0 = −49);
• the basic concept of neural network quantization: converting the weights and activations of neural networks into a limited, discrete set of numbers;
• two types of common neural network quantization:
  • K-Means-based Quantization
  • Linear Quantization: an affine mapping r = S(q − Z) from the integer range [qmin, qmax] to the floating-point range [rmin, rmax], with floating-point scale S and integer zero point Z.
References
1. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey
[Deng et al., IEEE 2020]
2. Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
3. Deep Compression [Han et al., ICLR 2016]
4. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
5. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
[Jacob et al., CVPR 2018]
6. BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations
[Courbariaux et al., NeurIPS 2015]
7. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations
Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
8. XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al.,
ECCV 2016]
9. Ternary Weight Networks [Li et al., arXiv 2016]
10. Trained Ternary Quantization [Zhu et al., ICLR 2017]