This document discusses quantization techniques for convolutional neural networks to improve performance. It examines quantizing models trained with floating-point precision to fixed point in order to reduce memory usage and accelerate inference. The TensorFlow and Caffe Ristretto quantization approaches are described and tested on the MNIST and CIFAR10 datasets. Results show that quantization reduces model size with minimal accuracy loss but increases inference time, likely because only a subset of operations is supported in quantized form.
Purposes
Problem
Accuracy/inference speed trade-off
For real-world applications, a convolutional neural network (CNN) model
can take more than 100 MB of space and can be too computationally
expensive
How can embedded devices such as smartphones be given the power of
neural networks?
Is floating-point precision really needed?
No: deep neural networks tend to cope well with noise in their input
Training, however, still needs floating-point precision, since it is an
iteration of small incremental adjustments of the weights
Purposes
Solution
The solution is quantization.
Deep networks can be trained with floating-point precision; a quantization
algorithm can then be applied to obtain smaller models and to speed up the
inference phase by reducing memory requirements:
Fixed-point compute units are typically faster and consume far less
hardware resources and power than floating-point engines
Low-precision data representation reduces the memory footprint,
enabling larger models to fit within the given memory capacity (a
back-of-envelope example follows)
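As a rough sketch of the saving (the weight count below is an illustrative assumption, not a figure from the slides), moving from 32-bit floats to 8-bit integers cuts weight storage by about 4x:

```python
# Back-of-envelope estimate of the saving from 8-bit quantization.
# Roughly 25 million float32 weights correspond to the ~100 MB model
# mentioned earlier (illustrative assumption).
num_weights = 25_000_000

float32_bytes = num_weights * 4   # 4 bytes per float32 weight
int8_bytes = num_weights * 1      # 1 byte per quantized weight

print(f"float32 model: {float32_bytes / 1e6:.0f} MB")  # ~100 MB
print(f"int8 model:    {int8_bytes / 1e6:.0f} MB")     # ~25 MB (plus per-layer min/max)
```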
Purposes
Project purpose
Using a machine learning framework with support for convolutional neural
networks:
Define different kinds of networks
Train
Quantize
Evaluate the original and the quantized models
Compare them in terms of model size, cache misses, and inference time
Quantization
What is quantization?
In developing this project we examined two different approaches:
TensorFlow
Caffe Ristretto
Quantization
TensorFlow quantization
Unsupervised approach:
Get a trained network
Obtain, for each layer, the min and the max of the weight values
Represent the weights, distributed linearly between the minimum and
maximum, with 8-bit precision
The operations have to be reimplemented for the 8-bit format
The resulting data structure is composed of an array containing the
quantized values plus the two float min and max (see the sketch below)
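A minimal NumPy sketch of this min/max linear scheme (purely illustrative; it mimics the idea, not TensorFlow's actual quantized kernels):

```python
import numpy as np

def quantize(weights):
    """Map float weights linearly onto 0..255 between their min and max."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    q = np.round((weights - w_min) / scale).astype(np.uint8)
    # The quantized tensor is the uint8 array plus the two float bounds.
    return q, w_min, w_max

def dequantize(q, w_min, w_max):
    """Recover approximate float weights from the 8-bit representation."""
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    return q.astype(np.float32) * scale + w_min

w = np.random.randn(5, 5).astype(np.float32)
q, lo, hi = quantize(w)
print(np.abs(w - dequantize(q, lo, hi)).max())  # error at most ~one step, (hi - lo) / 255
```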
Quantization
Caffe Ristretto quantization
Supervised approach:
Get a trained network
Three different methods:
Dynamic fixed point: a modified fixed-point format (see the sketch
below)
Minifloat: bit-width-reduced floating-point numbers
Power of two: layers with power-of-two parameters do not need any
multipliers when implemented in hardware
The performance of the network is evaluated during quantization in order
to keep the accuracy above a given threshold
Support for training of quantized networks (fine-tuning)
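A rough NumPy sketch of the dynamic fixed point idea (an illustration only; the bit-allocation rule here is an assumption and this is not Ristretto's actual C++ implementation):

```python
import numpy as np

def dynamic_fixed_point(x, total_bits=8):
    """Round a tensor to a fixed-point grid whose fractional length is
    chosen per tensor from its value range (illustrative sketch)."""
    # Give the integer part (plus sign) just enough bits for the largest magnitude.
    int_len = int(np.ceil(np.log2(np.abs(x).max() + 1e-12))) + 1
    frac_len = total_bits - int_len
    step = 2.0 ** (-frac_len)
    q_min, q_max = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x / step), q_min, q_max)
    # Like Ristretto, return the quantized values stored back in float.
    return (q * step).astype(np.float32)

w = np.random.randn(3, 3).astype(np.float32)
print(dynamic_fixed_point(w, total_bits=8))
```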
Our work
Caffe Ristretto
The results obtained with Ristretto on a simple network for the MNIST
dataset are not very satisfying...
network         accuracy   model size (MB)   time (ms)   LL_d misses (10^6)   L1_d misses (10^6)
Original        0.9903     1.7               29.2        32.098               277.189
Dynamic f. p.   0.9829     1.7               126.41      42.077               303.209
Minifloat       0.9916     1.7               29.5        37.149               282.396
Power of two    0.9899     1.7               61.1        35.774               280.819
Linux running on a MacBook Pro (Intel i5 2.9 GHz, 3 MB L3 cache, 16 GB RAM); cache statistics collected with the cachegrind tool.
The quantized values are still stored at float size after quantization
The quantized layer implementations work with float variables (see the
sketch below):
they perform the computation with low-precision values stored in float
variables
and quantize the results, which are again stored in float variables
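This is why the model size column above does not shrink: the values are quantized to a coarse grid but serialized as float32. A tiny self-contained illustration (the step size is an arbitrary example):

```python
import numpy as np

w = np.random.randn(1000).astype(np.float32)

step = 2.0 ** -5                                       # e.g. 5 fractional bits
w_q = (np.round(w / step) * step).astype(np.float32)  # quantized values, float storage

print(w.nbytes, w_q.nbytes)  # both 4000 bytes: the stored model is no smaller
```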
Our work
TensorFlow
Quantization is better supported
The quantized model is stored with low precision weights
Some low precision operations are already implemented
We tried different network topologies to see how quantization affects
different architectures
Our work
How we used the TensorFlow quantization tool
We used Python (with a bit of OO, since we needed a way to apply it to
different networks)
An abstract class defines the pattern of the networks that the main
script can handle (sketched below)
The core methods of the pattern are:
prepare: load the data and build the computational graph and the
training step of the network
train: iterate the training step
The main script takes an instance as input and:
calls prepare and train
quantizes the obtained network
evaluates the accuracy
evaluates cache performance using linux-perf
plots the data
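A condensed sketch of this structure (class and method names here are illustrative, not necessarily the ones used in the repository):

```python
from abc import ABC, abstractmethod

class QuantizableNet(ABC):
    """Pattern the main script expects from every network definition."""

    @abstractmethod
    def prepare(self):
        """Load the data and build the computational graph and training step."""

    @abstractmethod
    def train(self):
        """Iterate the training step."""

def run(net: QuantizableNet):
    """Main-script flow; the quantize/evaluate/profile/plot steps are stubs here."""
    net.prepare()
    net.train()
    # quantize(net)   -> e.g. via TensorFlow's graph_transforms tool
    # evaluate(net)   -> accuracy of the original vs the quantized model
    # profile(net)    -> cache statistics collected with linux-perf
    # plot(results)   -> comparison plots like the ones on the following slides
```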
Our work
Topologies on MNIST
Network topologies tested: (a) Big, (b) More convolutional, (c) More FC, (d) TF Example, (e) Only FC
Our work
Some data on MNIST - accuracy
Our work
Some data on MNIST - model sizes
Our work
Some data on MNIST - data cache misses
Our work
Some data on MNIST - inference time
Our work
Some data on CIFAR10 - accuracy
Our work
Some data on CIFAR10 - model sizes
Our work
Some data on CIFAR10 - data cache misses
Our work
Some data on CIFAR10 - inference time
Our work
Why is the inference time worse?
We see an improvement only in the model size, and consequently in the
data cache misses
Inference time and last-level cache misses are worse in the quantized
networks
From the TensorFlow GitHub page:
Only a subset of ops are supported, and on many platforms the quantized
code may actually be slower than the float equivalents, but this is a way of
increasing performance substantially when all the circumstances are right.
Our work
Original net - TensorFlow benchmark tool
Benchmark of the original network on the MNIST dataset
Our work
Quantized net - TensorFlow benchmark tool
Benchmark of the quantized network on the MNIST dataset
Our work
References
TensorFlow quantization: https://www.tensorflow.org/performance/quantization
Graph transforms tool: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms
Ristretto: http://lepsucd.com/?page_id=621
GitHub repository of the project: https://github.com/EmilianoGagliardiEmanueleGhelfi/CNN-compression-performance
Deep Learning with Limited Numerical Precision (Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan)