
EfficientML.ai Lecture 05
Quantization, Part I

Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT

MIT 6.5940: TinyML and Efficient Deep Learning Computing (https://efficientml.ai)
Lecture Plan
Today we will:
1. Review the numeric data types, including integers and floating-point numbers (FP32, FP16, INT4, etc.)
2. Learn the basic concept of neural network quantization
3. Learn three types of common neural network quantization:
   1. K-Means-based Quantization
   2. Linear Quantization
   3. Binary and Ternary Quantization

(Figures: a two's-complement bit example and a continuous vs. quantized signal over time.)

Low Bit Operations are Cheaper
Less Bit-Width → Less Energy

Operation            Energy [pJ]
8-bit int ADD        0.03
32-bit int ADD       0.1
16-bit float ADD     0.4
32-bit float ADD     0.9
8-bit int MULT       0.2
32-bit int MULT      3.1
16-bit float MULT    1.1
32-bit float MULT    3.7

A 32-bit float ADD costs roughly 30× the energy of an 8-bit int ADD, and a 32-bit int MULT costs roughly 16× that of an 8-bit int MULT.

Rough energy cost for various operations in 45nm, 0.9V.
Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]

How should we make deep learning more efficient?
Numeric Data Types
How is numeric data represented in modern computing systems?

Integer
• Unsigned Integer
  • n-bit range: [0, 2^n − 1]
  • Example: 0011 0001 → 2^5 + 2^4 + 2^0 = 49
• Signed Integer
  • Sign-Magnitude Representation (the MSB is the sign bit)
    • n-bit range: [−(2^(n−1) − 1), 2^(n−1) − 1]
    • Both 000…00 and 100…00 represent 0
    • Example: 1011 0001 → −(2^5 + 2^4 + 2^0) = −49
  • Two's Complement Representation
    • n-bit range: [−2^(n−1), 2^(n−1) − 1]
    • 000…00 represents 0; 100…00 represents −2^(n−1)
    • Example: 1100 1111 → −2^7 + 2^6 + 2^3 + 2^2 + 2^1 + 2^0 = −49
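To make the three encodings concrete, here is a minimal Python sketch (an illustration added for this writeup, not from the slides) that decodes an 8-bit pattern under each convention:

```python
def decode_unsigned(bits: str) -> int:
    # Plain positional binary: sum of 2^i for each set bit.
    return int(bits, 2)

def decode_sign_magnitude(bits: str) -> int:
    # MSB is the sign; the remaining bits are the magnitude.
    sign = -1 if bits[0] == "1" else 1
    return sign * int(bits[1:], 2)

def decode_twos_complement(bits: str) -> int:
    # The MSB carries weight -2^(n-1); the other bits are positive place values.
    n = len(bits)
    value = int(bits, 2)
    return value - (1 << n) if bits[0] == "1" else value

print(decode_unsigned("00110001"))          # 49
print(decode_sign_magnitude("10110001"))    # -49
print(decode_twos_complement("11001111"))   # -49
```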

Fixed-Point Number

Format: Integer bits . Fraction bits, with the "decimal" point fixed in place.

Example (4.4 format, two's complement): 0011 0001 with the point between the two nibbles:
  0·(−2^3) + 0·2^2 + 1·2^1 + 1·2^0 + 0·2^−1 + 0·2^−2 + 0·2^−3 + 1·2^−4 = 3.0625

Equivalently, read the 8 bits as a two's-complement integer and apply a fixed scale of 2^−4:
  49 × 2^−4 = 49 × 0.0625 = 3.0625
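The fixed-point view is just "an integer plus a fixed scale of 2^−f". A small sketch of that idea (hypothetical helper names, standard Python only):

```python
def to_fixed_point(x: float, frac_bits: int = 4) -> int:
    # Quantize a real number to a fixed-point integer (nearest multiple of 2^-frac_bits).
    return round(x * (1 << frac_bits))

def from_fixed_point(q: int, frac_bits: int = 4) -> float:
    # Recover the real value by applying the fixed scale 2^-frac_bits.
    return q / (1 << frac_bits)

q = to_fixed_point(3.0625)      # 49, the two's-complement integer from the slide
print(from_fixed_point(q))      # 3.0625
```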

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754

Sign (1 bit) | Exponent (8 bits) | Fraction, also called significand or mantissa (23 bits)

value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127),  Exponent Bias = 127 = 2^(8−1) − 1

How to represent 0.265625?
0.265625 = 1.0625 × 2^−2 = (1 + 0.0625) × 2^(125 − 127)
Bit pattern: 0 01111101 00010000000000000000000   (Exponent field = 125, Fraction = 0.0625)
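A quick check of this example using Python's standard struct module (a sketch; the field widths are as defined above):

```python
import struct

def fp32_from_fields(sign: int, exponent: int, fraction_bits: int) -> float:
    # Assemble the 32-bit word: 1 sign bit, 8 exponent bits, 23 fraction bits.
    word = (sign << 31) | (exponent << 23) | fraction_bits
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]

# 0.265625 = (1 + 0.0625) * 2^(125 - 127): exponent field 125, fraction 0.0625 = 2^-4,
# i.e. fraction field 2^(23 - 4) = 1 << 19.
print(fp32_from_fields(0, 125, 1 << 19))   # 0.265625
```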

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
Sign (1 bit) | Exponent (8 bits) | Fraction / significand / mantissa (23 bits)
value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127),  Exponent Bias = 127

How should we represent 0?

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)

Normal Numbers (Exponent ≠ 0): value = (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
Subnormal Numbers (Exponent = 0): the value "should have been" (−1)^sign × (1 + Fraction) × 2^(0 − 127), but we force it to be (−1)^sign × Fraction × 2^(1 − 127)

Normal example:     0 01111101 00010000000000000000000 → 0.265625 = (1 + 0.0625) × 2^(125 − 127)
Subnormal examples: 0 00000000 00000000000000000000000 → +0 = 0 × 2^(−126)
                    1 00000000 00000000000000000000000 → −0

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
Normal Numbers (Exponent ≠ 0): (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
Subnormal Numbers (Exponent = 0): (−1)^sign × Fraction × 2^(1 − 127)

What is the smallest positive subnormal value?

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
Normal Numbers (Exponent ≠ 0): (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
Subnormal Numbers (Exponent = 0): (−1)^sign × Fraction × 2^(1 − 127)

Smallest positive normal value:    0 00000001 00000000000000000000000 → 2^−126 = (1 + 0) × 2^(1 − 127)
Smallest positive subnormal value: 0 00000000 00000000000000000000001 → 2^−149 = 2^−23 × 2^−126

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
Normal Numbers (Exponent ≠ 0): (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
Subnormal Numbers (Exponent = 0): (−1)^sign × Fraction × 2^(1 − 127)

What is the largest positive subnormal value?

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
Normal Numbers (Exponent ≠ 0): (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
Subnormal Numbers (Exponent = 0): (−1)^sign × Fraction × 2^(1 − 127)

Smallest positive normal value:   0 00000001 00000000000000000000000 → 2^−126 = (1 + 0) × 2^(1 − 127)
Largest positive subnormal value: 0 00000000 11111111111111111111111 → Fraction = 2^−1 + 2^−2 + … + 2^−23 = 1 − 2^−23, so the value is (1 − 2^−23) × 2^−126 = 2^−126 − 2^−149

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
Normal Numbers (Exponent ≠ 0): (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
Subnormal Numbers (Exponent = 0): (−1)^sign × Fraction × 2^(1 − 127)

Exponent = 255 (all ones):
  0 11111111 00000000000000000000000 → +∞ (positive infinity)
  1 11111111 00000000000000000000000 → −∞ (negative infinity)
  S 11111111 (any nonzero fraction)  → NaN (Not a Number)
This wastes many encodings; we will revisit this when discussing FP8.

Floating-Point Number
Example: 32-bit floating-point number in IEEE 754
Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)

Exponent               Fraction = 0   Fraction ≠ 0   Equation
00H = 0                ±0             subnormal      (−1)^sign × Fraction × 2^(1 − 127)
01H … FEH = 1 … 254    normal         normal         (−1)^sign × (1 + Fraction) × 2^(Exponent − 127)
FFH = 255              ±INF           NaN            (special values)

Number line: ±0, then subnormal values from 2^−149 up to (1 − 2^−23) × 2^−126, then normal values from 2^−126 up to (1 + 1 − 2^−23) × 2^127.
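These boundary values can be sanity-checked against NumPy's float32 metadata; a minimal sketch (assuming NumPy is installed):

```python
import numpy as np

info = np.finfo(np.float32)
print(info.tiny == 2.0**-126)                            # smallest positive normal value
print(float(info.max) == (2.0 - 2.0**-23) * 2.0**127)    # largest finite normal value
print(np.float32(2.0**-149) > 0)                         # smallest positive subnormal is representable
print(np.float32(2.0**-150) == 0)                        # anything smaller underflows to zero
```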

Floating-Point Number
Exponent Width → Range; Fraction Width → Precision
Format                                           Exponent (bits)   Fraction (bits)   Total (bits)
IEEE 754 Single Precision 32-bit Float (FP32)    8                 23                32
IEEE 754 Half Precision 16-bit Float (FP16)      5                 10                16
Google Brain Float (BF16)                        8                 7                 16

Numeric Data Types
• Question: What is the following IEEE half-precision (IEEE FP16) number in decimal?

  1 10001 1100000000   (Sign | 5-bit Exponent | 10-bit Fraction), Exponent Bias = 15

• Sign: −
• Exponent: 10001₂ − 15 = 17 − 15 = 2
• Fraction: 1100000000₂ = 0.5 + 0.25 = 0.75
• Decimal answer: −(1 + 0.75) × 2^2 = −1.75 × 4 = −7.0
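The decoding can be verified by reinterpreting the 16-bit pattern as an IEEE FP16 value (a sketch assuming NumPy):

```python
import numpy as np

# Reinterpret the pattern 1 10001 1100000000 as an IEEE FP16 value.
bits = np.array([0b1100011100000000], dtype=np.uint16)
print(bits.view(np.float16)[0])   # -7.0
```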

Numeric Data Types
• Question: What is the decimal 2.5 in Brain Float (BF16)?

  2.5 = 1.25 × 2^1, Exponent Bias = 127

• Sign: +
• Exponent binary: 1 + 127 = 128 = 10000000₂
• Fraction binary: 0.25 = 0100000₂
• Binary answer: 0 10000000 0100000   (Sign | 8-bit Exponent | 7-bit Fraction)
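BF16 keeps FP32's sign and 8-bit exponent and truncates the fraction to 7 bits, so for an exactly representable value like 2.5 the pattern can be derived by taking the top 16 bits of the FP32 encoding (a sketch using the standard library; rounding is ignored here):

```python
import struct

def bf16_bits(x: float) -> str:
    # Encode as FP32, then keep the top 16 bits (sign, 8-bit exponent, 7-bit fraction).
    fp32 = int.from_bytes(struct.pack(">f", x), "big")
    return format(fp32 >> 16, "016b")

print(bf16_bits(2.5))   # 0100000000100000
```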

Floating-Point Number
Exponent Width → Range; Fraction Width → Precision
Format                                                    Exponent (bits)   Fraction (bits)   Total (bits)
IEEE 754 Single Precision 32-bit Float (FP32)             8                 23                32
IEEE 754 Half Precision 16-bit Float (FP16)               5                 10                16
Nvidia FP8 (E4M3)                                         4                 3                 8
Nvidia FP8 (E5M2), used for gradients in the backward     5                 2                 8

* FP8 E4M3 does not have INF; S.1111.111₂ is used for NaN. The largest E4M3 normal value is S.1111.110₂ = 448.
* FP8 E5M2 has INF (S.11111.00₂) and NaN (S.11111.XX₂). The largest E5M2 normal value is S.11110.11₂ = 57344.

INT4 and FP4
Exponent Width → Range; Fraction Width → Precision
INT4 (two's complement: S + 3 magnitude bits)
• Representable values: −8 … 7
• Examples: 0001 = 1,  0111 = 7

FP4 (E1M2), layout S E M M, exponent bias 0
• Representable magnitudes: 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5 (the integer grid 0…7 scaled by 0.5); the sign bit gives ±
• Examples: 0001 = 0.25 × 2^(1−0) = 0.5 (subnormal),  0111 = (1 + 0.75) × 2^(1−0) = 3.5

FP4 (E2M1), layout S E E M, exponent bias 1, no inf, no NaN
• Representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6 (the grid 0, 1, 2, 3, 4, 6, 8, 12 scaled by 0.5)
• Examples: 0001 = 0.5 × 2^(1−1) = 0.5 (subnormal),  0111 = (1 + 0.5) × 2^(3−1) = 6

FP4 (E3M0), layout S E E E, exponent bias 3, no inf, no NaN
• Representable magnitudes: 0, 0.25, 0.5, 1, 2, 4, 8, 16 (the grid 0, 1, 2, 4, 8, 16, 32, 64 scaled by 0.25)
• Examples: 0001 = (1 + 0) × 2^(1−3) = 0.25,  0111 = (1 + 0) × 2^(7−3) = 16
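A small sketch (plain Python, assuming the E2M1 layout and bias shown above) that enumerates the non-negative FP4 E2M1 codes:

```python
def fp4_e2m1(code: int) -> float:
    # code is a 4-bit value: 1 sign bit, 2 exponent bits, 1 mantissa bit; bias = 1.
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:                                  # subnormal: no implicit leading 1
        return sign * (mantissa * 0.5) * 2.0 ** (1 - 1)
    return sign * (1 + mantissa * 0.5) * 2.0 ** (exponent - 1)

print([fp4_e2m1(c) for c in range(8)])   # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```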
What is Quantization?

Quantization is the process of constraining an input from a continuous or otherwise large set of values to a discrete set.

(Figures: a continuous signal vs. its quantized signal and the resulting quantization error over time; an original image vs. a 16-color "palettized" image.)

The difference between an input value and its quantized value is referred to as the quantization error.

Quantization [Wikipedia]
Neural Network Quantization: Agenda
(Illustration: a 4×4 floating-point weight matrix alongside its K-Means-quantized, linearly quantized, and binary/ternary quantized counterparts.)

We will cover three approaches: K-Means-based Quantization, Linear Quantization, and Binary/Ternary Quantization.

Starting point, before any quantization:
• Storage: Floating-Point Weights
• Computation: Floating-Point Arithmetic

Neural Network Quantization: Agenda
(Illustration as above.)

K-Means-based Quantization:
• Storage: Floating-Point Weights → Integer Weights + Floating-Point Codebook
• Computation: Floating-Point Arithmetic (unchanged)

Neural Network Quantization
Weight Quantization
weights (32-bit float):
  2.09  -0.98   1.48   0.09
  0.05  -0.14  -1.08   2.12
 -0.91   1.92   0     -1.03
  1.87   0      1.53   1.49

The values 2.09, 2.12, 1.92, and 1.87 are all close together; they can be represented by a single shared value, 2.0.

K-Means-based Weight Quantization
weights (32-bit float):        cluster index (2-bit int):     centroids (codebook, 32-bit float):
  2.09  -0.98   1.48   0.09      3  0  2  1                     3:  2.00
  0.05  -0.14  -1.08   2.12      1  1  0  3                     2:  1.50
 -0.91   1.92   0     -1.03      0  3  1  0                     1:  0.00
  1.87   0      1.53   1.49      3  1  2  2                     0: -1.00

reconstructed weights (32-bit float):    quantization error (original − reconstructed):
  2.00  -1.00   1.50   0.00                0.09   0.02  -0.02   0.09
  0.00   0.00  -1.00   2.00                0.05  -0.14  -0.08   0.12
 -1.00   2.00   0.00  -1.00                0.09  -0.08   0     -0.03
  2.00   0.00   1.50   1.50               -0.13   0      0.03  -0.01

Storage: weights = 32 bit × 16 = 512 bit = 64 B; indexes = 2 bit × 16 = 32 bit = 4 B; codebook = 32 bit × 4 = 128 bit = 16 B.
Quantized total = 4 B + 16 B = 20 B → 3.2× smaller.

In general, assume N-bit quantization and #parameters = M >> 2^N:
weights = 32M bit; indexes = NM bit; codebook = 32 × 2^N = 2^(N+5) bit → roughly 32/N × smaller.
Deep Compression [Han et al., ICLR 2016]
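A minimal sketch of K-Means weight quantization in the spirit of Deep Compression (assuming NumPy and scikit-learn; this is an illustration, not the paper's actual implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights: np.ndarray, n_bits: int = 2):
    # Cluster all weight values into 2^n_bits centroids (the codebook) and keep
    # only the per-weight cluster index plus the codebook.
    flat = weights.reshape(-1, 1)
    kmeans = KMeans(n_clusters=2 ** n_bits, n_init=10).fit(flat)
    indexes = kmeans.labels_.reshape(weights.shape)        # small integers
    codebook = kmeans.cluster_centers_.flatten()           # float centroids
    return indexes, codebook

def kmeans_dequantize(indexes, codebook):
    # Reconstruct weights by looking up each index in the codebook.
    return codebook[indexes]

w = np.array([[2.09, -0.98, 1.48, 0.09],
              [0.05, -0.14, -1.08, 2.12],
              [-0.91, 1.92, 0.0, -1.03],
              [1.87, 0.0, 1.53, 1.49]], dtype=np.float32)
idx, cb = kmeans_quantize(w, n_bits=2)
print(kmeans_dequantize(idx, cb))   # values near -1.0, 0.0, 1.5, 2.0
```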
K-Means-based Weight Quantization
Fine-tuning Quantized Weights
weights (32-bit float):        cluster index (2-bit int):     centroids → fine-tuned centroids:
  2.09  -0.98   1.48   0.09      3  0  2  1                     3:  2.00 → 1.96
  0.05  -0.14  -1.08   2.12      1  1  0  3                     2:  1.50 → 1.48
 -0.91   1.92   0     -1.03      0  3  1  0                     1:  0.00 → -0.04
  1.87   0      1.53   1.49      3  1  2  2                     0: -1.00 → -0.97

gradient:                        group by cluster index, then reduce (sum):
 -0.03  -0.01   0.03   0.02       cluster 3: -0.03, 0.12, 0.02, -0.07        → 0.04
 -0.01   0.01  -0.02   0.12       cluster 2: 0.03, 0.01, -0.02               → 0.02
 -0.01   0.02   0.04   0.01       cluster 1: 0.02, -0.01, 0.01, 0.04, -0.02  → 0.04
 -0.07  -0.02   0.01  -0.02       cluster 0: -0.01, -0.02, -0.01, 0.01       → -0.03

Each centroid is updated by subtracting the learning rate times its reduced gradient (in this example the learning rate is 1).

Deep Compression [Han et al., ICLR 2016]
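The centroid fine-tuning step (group the weight gradients by cluster index, reduce, update each centroid) can be sketched as follows (assuming NumPy; lr is a hypothetical learning-rate argument, equal to 1 in the slide's example):

```python
import numpy as np

def finetune_codebook(codebook, indexes, gradient, lr=1.0):
    # Group gradients by cluster index, sum within each group, update that centroid.
    new_codebook = codebook.copy()
    for k in range(len(codebook)):
        new_codebook[k] -= lr * gradient[indexes == k].sum()
    return new_codebook

grad = np.array([[-0.03, -0.01, 0.03, 0.02],
                 [-0.01, 0.01, -0.02, 0.12],
                 [-0.01, 0.02, 0.04, 0.01],
                 [-0.07, -0.02, 0.01, -0.02]])
codebook = np.array([-1.0, 0.0, 1.5, 2.0])
indexes = np.array([[3, 0, 2, 1], [1, 1, 0, 3], [0, 3, 1, 0], [3, 1, 2, 2]])
print(finetune_codebook(codebook, indexes, grad))   # approximately [-0.97, -0.04, 1.48, 1.96]
```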


K-Means-based Weight Quantization
Accuracy vs. compression rate for AlexNet on ImageNet dataset

(Plot: accuracy loss, from +0.5% down to −4.5%, vs. model size ratio after compression, from 2% to 20%, comparing Pruning + Quantization, Pruning Only, and Quantization Only.)
Deep Compression [Han et al., ICLR 2016]
Before Quantization: Continuous Weight

(Histogram: count vs. weight value; before quantization the weight distribution is continuous.)
Deep Compression [Han et al., ICLR 2016]
After Quantization: Discrete Weight

(Histogram: count vs. weight value; after quantization the weights take only a few discrete centroid values.)
Deep Compression [Han et al., ICLR 2016]
After Quantization: Discrete Weight after Retraining

(Histogram: count vs. weight value; after retraining, the discrete centroid values shift slightly.)
Deep Compression [Han et al., ICLR 2016]
How Many Bits do We Need?

Deep Compression [Han et al., ICLR 2016]


Huffman Coding

(Figure: after pruning and quantization (27x-31x reduction), Huffman-encode the weights and the indexes to reach 35x-49x reduction, with the same accuracy.)

• Infrequent weights: use more bits to represent
• Frequent weights: use fewer bits to represent
Deep Compression [Han et al., ICLR 2016]
Summary of Deep Compression
Pruning: fewer weights. Quantization: fewer bits per weight. Huffman Encoding.

Pipeline (Figure 1 of the paper):
  original network → Pruning (Train Connectivity → Prune Connections → Train Weights): 9x-13x size reduction, same accuracy
  → Quantization (Cluster the Weights → Generate Code Book → Quantize the Weights with Code Book → Retrain Code Book): 27x-31x reduction, same accuracy
  → Huffman Encoding (Encode Weights, Encode Index): 35x-49x reduction, same accuracy

Figure 1 caption: "The three stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10×, while quantization further improves the compression rate: between 27× and 31×. Huffman coding gives more compression: between 35× and 49×."

Deep Compression [Han et al., ICLR 2016]
Deep Compression Results
Network      Original Size   Compressed Size   Compression Ratio   Original Accuracy   Compressed Accuracy
LeNet-300    1070KB          27KB              40x                 98.36%              98.42%
LeNet-5      1720KB          44KB              39x                 99.20%              99.26%
AlexNet      240MB           6.9MB             35x                 80.27%              80.30%
VGGNet       550MB           11.3MB            49x                 88.68%              89.09%
GoogleNet    28MB            2.8MB             10x                 88.90%              88.92%
ResNet-18    44.6MB          4.0MB             11x                 89.24%              89.28%

Can we make compact models to begin with?


Deep Compression [Han et al., ICLR 2016]
SqueezeNet
(Vanilla Fire module: Input (64 channels) → 1x1 Conv "Squeeze" (16 channels) → two parallel "Expand" branches, a 1x1 Conv and a 3x3 Conv (64 channels each) → Concat/Eltwise → Output (128 channels).)


SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]
Deep Compression on SqueezeNet
Network      Approach           Size     Ratio   Top-1 Accuracy   Top-5 Accuracy
AlexNet      -                  240MB    1x      57.2%            80.3%
AlexNet      SVD                48MB     5x      56.0%            79.4%
AlexNet      Deep Compression   6.9MB    35x     57.2%            80.3%
SqueezeNet   -                  4.8MB    50x     57.5%            80.3%
SqueezeNet   Deep Compression   0.47MB   510x    57.5%            80.3%

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]
K-Means-based Weight Quantization
(Diagram: In Storage, the model keeps the quantized weights (uint cluster indexes) and a floating-point codebook. During Computation, the indexes are decoded back into float weights, and the graph (float inputs → Conv → + bias → ReLU → float outputs) runs entirely in floating point.)

• The weights are decompressed using a lookup table (i.e., the codebook) during runtime inference.
• K-Means-based Weight Quantization only saves the storage cost of a neural network model.
• All the computation and memory access are still floating-point.

Neural Network Quantization
(Illustration as above.)

              K-Means-based Quantization                  Linear Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic
(Starting point: Floating-Point Weights and Floating-Point Arithmetic.)

Linear Quantization

What is Linear Quantization?
An affine mapping of integers to real numbers.

weights (32-bit float):        quantized weights (2-bit signed int):    zero point Z = −1, scale S = 1.07
  2.09  -0.98   1.48   0.09      1  -2   0  -1
  0.05  -0.14  -1.08   2.12     -1  -1  -2   1
 -0.91   1.92   0     -1.03     -2   1  -1  -2
  1.87   0      1.53   1.49      1  -1   0   0

reconstructed weights = (q − Z) × S (32-bit float):    quantization error:
  2.14  -1.07   1.07   0                                 -0.05   0.09   0.41   0.09
  0      0     -1.07   2.14                               0.05  -0.14  -0.01  -0.02
 -1.07   2.14   0     -1.07                               0.16  -0.22   0      0.04
  2.14   0      1.07   1.07                              -0.27   0      0.46   0.42

We will learn how to determine the parameters S and Z.

2-bit signed integers: binary 01 = 1, 00 = 0, 11 = −1, 10 = −2.

Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

(Example as above: 32-bit float weights, 2-bit signed-integer quantized weights, zero point Z = −1, scale S = 1.07.)

r = (q − Z) × S
• r: floating-point real number
• q: integer
• Z: zero point (an integer quantization parameter); it allows the real number r = 0 to be exactly representable by a quantized integer
• S: scale (a floating-point quantization parameter)
Quantization and Training of Neural Networks for E cient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

(Diagram: the floating-point range [rmin, rmax], which contains 0, is mapped onto the integer range [qmin, qmax] through the scale S and the zero point Z.)

Bit Width   qmin        qmax
2           -2          1
3           -4          3
4           -8          7
N           -2^(N-1)    2^(N-1)-1
Quantization and Training of Neural Networks for E cient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Scale of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

rmax = S (qmax − Z)
rmin = S (qmin − Z)
Subtracting: rmax − rmin = S (qmax − qmin)

S = (rmax − rmin) / (qmax − qmin)

Scale of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

For the example weight matrix, rmax = 2.12 and rmin = −1.08; with 2-bit signed integers, qmax = 1 and qmin = −2.

S = (rmax − rmin) / (qmax − qmin)
  = (2.12 − (−1.08)) / (1 − (−2))
  = 1.07
Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

rmin = S (qmin − Z)
Z = qmin − rmin / S
Since the zero point must be an integer: Z = round(qmin − rmin / S)

Zero Point of Linear Quantization
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

For the example weights (rmin = −1.08, S = 1.07, qmin = −2):

Z = round(qmin − rmin / S)
  = round(−2 − (−1.08) / 1.07)
  = round(−0.99)
  = −1
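Putting the scale and zero-point formulas together, a minimal NumPy sketch of asymmetric linear quantization (a generic illustration, not a specific library's API):

```python
import numpy as np

def linear_quantize(r: np.ndarray, n_bits: int = 2):
    # Asymmetric (affine) quantization: r ≈ S * (q - Z).
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    S = (r.max() - r.min()) / (qmax - qmin)          # scale
    Z = int(round(qmin - r.min() / S))               # zero point
    q = np.clip(np.round(r / S + Z), qmin, qmax).astype(np.int8)
    return q, S, Z

def linear_dequantize(q, S, Z):
    return S * (q.astype(np.float32) - Z)

w = np.array([[2.09, -0.98, 1.48, 0.09],
              [0.05, -0.14, -1.08, 2.12],
              [-0.91, 1.92, 0.0, -1.03],
              [1.87, 0.0, 1.53, 1.49]], dtype=np.float32)
q, S, Z = linear_quantize(w, n_bits=2)
print(S, Z)                        # about 1.07 and -1
print(linear_dequantize(q, S, Z))  # values on the grid S*{-1, 0, 1, 2}, roughly {-1.07, 0, 1.07, 2.13}
```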
Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following matrix multiplication:

Y = WX
SY (qY − ZY) = SW (qW − ZW) ⋅ SX (qX − ZX)
qY = (SW SX / SY) (qW − ZW)(qX − ZX) + ZY
qY = (SW SX / SY) (qW qX − ZW qX − ZX qW + ZW ZX) + ZY
Quantization and Training of Neural Networks for E cient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
MIT 6.5940: TinyML and E cient Deep Learning Computing https://e cientml.ai 54
ffi
ffi
ffi
ffi
Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following matrix multiplication:

Y = WX
qY = (SW SX / SY) (qW qX − ZW qX − ZX qW + ZW ZX) + ZY

• Precompute: the terms that do not depend on the runtime input (ZW ZX, and ZX qW once the weights are fixed).
• qW qX uses N-bit integer multiplications accumulated with 32-bit integer addition/subtraction; adding ZY is an N-bit integer addition.

Quantization and Training of Neural Networks for E cient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following matrix multiplication:

Y = WX
qY = (SW SX / SY) (qW qX − ZW qX − ZX qW + ZW ZX) + ZY

• Empirically, the combined scale SW SX / SY is always in the interval (0, 1).
• Write SW SX / SY = 2^(−n) M0 with M0 ∈ [0.5, 1): M0 becomes a fixed-point multiplication and 2^(−n) a bit shift.
Quantization and Training of Neural Networks for E cient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following matrix multiplication:

Y = WX
qY = (SW SX / SY) (qW qX − ZW qX − ZX qW + ZW ZX) + ZY
(N-bit integer multiplications, 32-bit integer addition/subtraction, a rescale back to N-bit integers, and a final N-bit integer addition of ZY.)

Can we arrange for ZW = 0?

Quantization and Training of Neural Networks for E cient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Symmetric Linear Quantization
Zero point Z = 0, with a symmetric floating-point range [−|r|max, |r|max]

(Diagram: instead of mapping [rmin, rmax] onto [qmin, qmax] with a nonzero zero point, the symmetric range [−|r|max, |r|max] is mapped with Z = 0.)

Bit Width   qmin        qmax
2           -2          1
3           -4          3
4           -8          7
N           -2^(N-1)    2^(N-1)-1
Symmetric Linear Quantization
Full range mode

With Z = 0 and a symmetric range, start from rmin = S (qmin − Z):

S = rmin / (qmin − Z) = −|r|max / qmin = |r|max / 2^(N−1)

(compare with the asymmetric case, S = (rmax − rmin) / (qmax − qmin))

Bit Width   qmin        qmax
2           -2          1
3           -4          3
4           -8          7
N           -2^(N-1)    2^(N-1)-1
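A corresponding sketch for symmetric, full-range quantization (Z = 0), again assuming NumPy:

```python
import numpy as np

def symmetric_quantize(r: np.ndarray, n_bits: int = 8):
    # Symmetric quantization: Z = 0, scale chosen from the largest magnitude.
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    S = np.abs(r).max() / (2 ** (n_bits - 1))        # full-range mode: |r|max / 2^(N-1)
    q = np.clip(np.round(r / S), qmin, qmax).astype(np.int32)
    return q, S

w = np.random.randn(4, 4).astype(np.float32)
q, S = symmetric_quantize(w)
print(np.abs(w - S * q).max())   # quantization error, at most about S / 2
```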
Linear Quantized Matrix Multiplication
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following matrix multiplication, when ZW = 0:

Y = WX
qY = (SW SX / SY) (qW qX − ZW qX − ZX qW + ZW ZX) + ZY

With ZW = 0, the terms ZW qX and ZW ZX vanish:
qY = (SW SX / SY) (qW qX − ZX qW) + ZY
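A sketch of the ZW = 0 case computed with integer tensors plus a final floating-point rescale (assuming NumPy; a real kernel would replace the float rescale with the fixed-point multiply and bit shift described earlier):

```python
import numpy as np

def quantized_matmul(qW, qX, ZX, ZY, SW, SX, SY, n_bits=8):
    # Integer products accumulated in int32, then rescaled and offset by ZY.
    acc = qW.astype(np.int32) @ qX.astype(np.int32)                 # qW qX
    acc -= ZX * qW.astype(np.int32).sum(axis=1, keepdims=True)      # - ZX qW (precomputable per row)
    qY = np.round(SW * SX / SY * acc) + ZY
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    return np.clip(qY, qmin, qmax).astype(np.int8)
```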

Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• So far we have ignored the bias. Now consider a fully-connected layer with bias:

Y = WX + b
SY (qY − ZY) = SW (qW − ZW) ⋅ SX (qX − ZX) + Sb (qb − Zb)

With ZW = 0:
SY (qY − ZY) = SW SX (qW qX − ZX qW) + Sb (qb − Zb)

Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

Y = WX + b, with ZW = 0:
SY (qY − ZY) = SW SX (qW qX − ZX qW) + Sb (qb − Zb)

Choose Zb = 0 and Sb = SW SX:
SY (qY − ZY) = SW SX (qW qX − ZX qW + qb)

Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

Y = WX + b, with ZW = 0, Zb = 0, Sb = SW SX:
SY (qY − ZY) = SW SX (qW qX − ZX qW + qb)

qY = (SW SX / SY) (qW qX + qb − ZX qW) + ZY

Precompute qbias = qb − ZX qW (we will discuss how to compute the activation zero point in the next lecture):
qY = (SW SX / SY) (qW qX + qbias) + ZY
Linear Quantized Fully-Connected Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)

Y = WX + b, with ZW = 0, Zb = 0, Sb = SW SX, and qbias = qb − ZX qW:

qY = (SW SX / SY) (qW qX + qbias) + ZY

• qW qX: N-bit integer multiplications accumulated in 32-bit integers
• + qbias: 32-bit integer addition (both qb and qbias are 32 bits)
• × SW SX / SY: rescale back to an N-bit integer
• + ZY: N-bit integer addition
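A short sketch of how the bias folds in under the assumptions above (NumPy; Zb = 0, Sb = SW SX, per-tensor scales):

```python
import numpy as np

def quantized_fc(qW, qX, qb, ZX, ZY, SW, SX, SY):
    # Precompute the 32-bit bias: qbias = qb - ZX * sum_k qW[:, k].
    qbias = qb.astype(np.int32) - ZX * qW.astype(np.int32).sum(axis=1)
    # 32-bit integer accumulation of qW qX plus qbias, then rescale and add ZY.
    acc = qW.astype(np.int32) @ qX.astype(np.int32) + qbias[:, None]
    return np.round(SW * SX / SY * acc) + ZY
```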

Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following convolution layer:

Y = Conv(W, X) + b, with ZW = 0, Zb = 0, Sb = SW SX, and qbias = qb − Conv(qW, ZX):

qY = (SW SX / SY) (Conv(qW, qX) + qbias) + ZY

• Conv(qW, qX): N-bit integer multiplications with 32-bit integer accumulation; + qbias: 32-bit integer addition; × SW SX / SY: rescale to an N-bit integer; + ZY: N-bit integer addition. Both qb and qbias are 32 bits.

Linear Quantized Convolution Layer
Linear Quantization is an affine mapping of integers to real numbers: r = S(q − Z)
• Consider the following convolution layer:

Y = Conv(W, X) + b, with ZW = 0, Zb = 0, Sb = SW SX, and qbias = qb − Conv(qW, ZX):

qY = (SW SX / SY) (Conv(qW, qX) + qbias) + ZY

(Dataflow: quantized inputs qX (int) and quantized weights qW (int) → Conv with int32 accumulation → add quantized bias qbias (int32) → multiply by the scale factor SW SX / SY → add the output zero point ZY → quantized outputs (int). Both qb and qbias are 32 bits.)

INT8 Linear Quantization
An affine mapping of integers to real numbers: r = S(q − Z)

Neural Network                       ResNet-50   Inception-V3
Floating-point Accuracy (Top-1)      76.4%       78.4%
8-bit Integer-quantized Accuracy     74.9%       75.4%

(Figure 4.1 of the paper: latency-vs-accuracy tradeoff of float vs. integer-only MobileNets on ImageNet using Snapdragon 835 big cores, Top-1 accuracy roughly 40-70% over latencies of 5-120 ms; integer-only quantized MobileNets reach higher accuracy at a given latency.)

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
Neural Network Quantization
(Illustration: the floating-point weight matrix, its K-Means cluster indexes and codebook, and its linearly quantized integer weights.)

              K-Means-based Quantization                  Linear Quantization
Storage       Integer Weights; Floating-Point Codebook    Integer Weights
Computation   Floating-Point Arithmetic                   Integer Arithmetic
Summary of Today’s Lecture
Today, we reviewed and learned:
• the numeric data types used in modern computing systems, including integers and floating-point numbers.
• the basic concept of neural network quantization: converting the weights and activations of neural networks into a limited, discrete set of numbers.
• two types of common neural network quantization:
  • K-Means-based Quantization
  • Linear Quantization

(Recap figures: the two's-complement example 1100 1111 = −49, and the affine mapping from the floating-point range [rmin, rmax] to the integer range [qmin, qmax] via scale S and zero point Z.)

References
1. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
2. Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
3. Deep Compression [Han et al., ICLR 2016]
4. Neural Network Distiller: https://intellabs.github.io/distiller/algo_quantization.html
5. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Jacob et al., CVPR 2018]
6. BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations [Courbariaux et al., NeurIPS 2015]
7. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1 [Courbariaux et al., arXiv 2016]
8. XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks [Rastegari et al., ECCV 2016]
9. Ternary Weight Networks [Li et al., arXiv 2016]
10. Trained Ternary Quantization [Zhu et al., ICLR 2017]
