
FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA

Shehzeen Hussain∗, Mojan Javaheripi∗, Paarth Neekhara†, Ryan Kastner† and Farinaz Koushanfar∗
∗ UC San Diego Department of Electrical and Computer Engineering
† UC San Diego Department of Computer Science
Email: [email protected], [email protected]

arXiv:2002.04971v1 [eess.AS] 9 Feb 2020

Abstract—Autoregressive convolutional neural networks (CNNs) have been widely exploited for sequence generation tasks such as audio synthesis, language modeling and neural machine translation. WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution that is used for sequence generation. While WaveNet produces state-of-the-art audio generation results, the naive inference implementation is quite slow; it takes a few minutes to generate just one second of audio on a high-end GPU. In this work, we develop FastWave, the first accelerator platform for autoregressive convolutional neural networks, and address the associated design challenges. We design the Fast-Wavenet inference model in Vivado HLS and perform a wide range of optimizations including fixed-point implementation, array partitioning and pipelining. Our model uses a fully parameterized parallel architecture for fast matrix-vector multiplication that enables per-layer customized latency fine-tuning for further throughput improvement. Our experiments comparatively assess the trade-off between throughput and resource utilization for various optimizations. Our best WaveNet design on the Xilinx XCVU13P FPGA, which uses only on-chip memory, achieves 66× faster generation speed than the CPU implementation and 11× faster generation speed than the GPU implementation.
I. INTRODUCTION

Autoregressive convolutional models achieve state-of-the-art results in audio [1]–[4] and language domains [5], [6] with respect to both estimating the data distribution and generating high-quality samples. Wavenet [1] is an example of an autoregressive convolutional network, used for modelling audio for applications such as text-to-speech (TTS) synthesis and music generation. WaveNet has been rated by human listeners to provide substantially more natural sounding audio when compared to the best existing parametric and concatenative systems in TTS applications for both English and Mandarin [1]. Popular cloud based TTS synthesis systems such as Google Now and Google Assistant, which produce natural sounding speech, are built on the WaveNet architecture [4], [7]. Alongside achieving state-of-the-art results in the audio domain, convolutional models are prominent for natural language modeling tasks like text generation and machine translation [5].

Generally, both autoregressive convolutional neural networks (CNNs) and Recurrent Neural Networks (RNNs) [8] are widely popular for sequence modelling tasks. The main advantage of CNN based models is that they can achieve higher parallelism during training and can capture longer time-dependencies as compared to RNN based models [9], [10]. However, this comes at the cost of slower inference, since inference still remains sequential and CNN based models are usually very deep. To overcome this problem, Fast-Wavenet [11] exploits the temporal dependencies by caching redundant computations using fixed-length convolutional queues, and thus makes the generation time linear in the length of the sequence. Such efforts have made it feasible to use autoregressive CNNs for practical sequence generation applications, as an alternative to RNN-based models.

While the Fast-Wavenet algorithm provides a speed-up in audio generation over the naïve Wavenet implementation, the generation time is still high, even on a high-end GPU. It takes 120 seconds to generate 1 second of audio using the Fast-Wavenet implementation on an NVIDIA TITAN Xp GPU. Prior works have shown FPGAs to be successful in accelerating the inference of pre-trained neural networks by providing custom data paths to achieve high parallelism. A vast amount of such research focuses on accelerating neural networks in the image domain [12], [13], speech recognition [14], [15] and language modelling [16]. To the best of our knowledge, similar efforts have not been made for accelerating neural networks for speech/audio synthesis.

We aim to accelerate audio synthesis using the autoregressive CNN model WaveNet on FPGA. The primary challenges in deploying autoregressive CNN inference on FPGA are designing modules for dilated convolutional layers, buffers for storing redundant computations using convolutional queues, and dealing with the large depth of these networks, which is necessary to maintain high audio quality. In this work, we address these challenges of deploying large autoregressive convolutional models on FPGAs and perform a wide range of application and architectural optimizations. Furthermore, we comprehensively analyze and compare the performance of the Fast-Wavenet implementation on FPGA with its CPU and GPU counterparts.

Summary of Contributions:
• Creation of the first accelerator platform for autoregressive convolutional neural networks. We deploy the fast inference model Fast-Wavenet [11] on a Xilinx XCVU13P FPGA, which achieves 11 times faster generation speed than a high-end GPU and 66 times faster generation speed than a high-end CPU.
• Development of reconfigurable basic blocks pertinent to autoregressive convolutional networks, i.e., dilated causal convolutional layers, convolutional queues, and fully
connected layer. Our operations are powered by a fully-customizable matrix-multiplication engine that uses two levels of parallelism controlled by tunable parameters.
• Creation of an end-to-end framework that accesses only on-chip memory, thereby ensuring high throughput and avoiding any latency caused by communication with off-chip memory.
• Exploration of the design space using different architectural and application optimizations, as well as comparison of the performance and resource usage. We present extensive evaluation of throughput and power efficiency for our fully optimized and baseline designs.

II. BACKGROUND

In this section, we provide a background on autoregressive convolutional neural networks. We choose WaveNet as an ideal representation of such models, and describe its overall generative architecture. We first elaborate on the 1D convolution operation as it is the core computation performed in the WaveNet model. Next, we explain WaveNet and its more efficient inference algorithm called Fast-Wavenet.

A. 1D Convolution

The 1D convolution operation is performed by sliding a one-dimensional kernel over a 1D input signal. Each output value at position i is produced by calculating the dot product of the kernel and the overlapping values of the input signal, starting from position i. More formally, for an input vector a of length n and a kernel k of length m, the 1D convolution is calculated as follows:

(a ∗ k)_i = \sum_{j=1}^{m} k_j × a_{i−j+m/2}    (1)

where i is an arbitrary index in the output vector, which has a total length of n − m + 1. The subscripts denote the indices of the kernel/input vectors.
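For reference, the sketch below is a direct, unoptimized software rendering of Equation (1); the function name and the zero treatment of out-of-range indices are our own illustrative choices and are not taken from the accelerator code.

```cpp
#include <vector>

// Plain 1D convolution following Equation (1): each output element is the dot
// product of the kernel with the overlapping window of the input signal.
// Indices that fall outside the input are treated as zero (illustrative choice).
std::vector<float> conv1d(const std::vector<float>& a, const std::vector<float>& k) {
    const int n = static_cast<int>(a.size());  // input length
    const int m = static_cast<int>(k.size());  // kernel length
    std::vector<float> out(n - m + 1, 0.0f);
    for (int i = 0; i < static_cast<int>(out.size()); ++i) {
        float acc = 0.0f;
        for (int j = 0; j < m; ++j) {
            int idx = i - j + m / 2;           // a_{i-j+m/2} from Equation (1)
            if (idx >= 0 && idx < n) acc += k[j] * a[idx];
        }
        out[i] = acc;
    }
    return out;
}
```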
B. WaveNet and Autoregressive CNNs

Autoregressive neural networks are popularly used for sequence generation tasks which rely on ancestral sampling, i.e., the predictive distribution for each sample in the sequence is conditioned on all previous ones. While RNNs are popular autoregressive models, they do not exhibit a high receptive field, making them unsuitable for modeling sequences with long-term dependencies like audio [9]. In contrast, autoregressive CNN based models use a stack of dilated convolutional layers to achieve the higher receptive field necessary for modeling sequences with long-term dependencies.

Wavenet [1] is an autoregressive convolutional neural network that produces raw audio waveforms by directly modeling the underlying probability distribution of audio samples. This has led to state-of-the-art performance in text-to-speech synthesis [2], [7], [17], [18], speech recognition [19], and other audio generation settings [1], [3], [4]. The Wavenet architecture aims to model the conditional probability among subsequent audio samples. The joint probability distribution of waveform sample points x = x_0, x_1, ..., x_T can be written as:

P(x|λ) = \prod_{t=1}^{T} P(x_t | x_{t−1}, ..., x_0, λ)

where λ denotes the learnable parameters of the Wavenet model. During inference, next-sample audio (x_t) generation is performed by sampling from the conditional probability distribution given all of the previous samples, P(x_t | x_{t−1}, ..., x_1, x_0, λ).

One possible method for modeling the probability density is via a stack of causal convolutional layers as depicted in Figure 1(a). The input passes through this stack of convolutional layers and gated activation functions and finally through a softmax layer to get the posterior probability P(x_t | x_{t−1}, ..., x_1, x_0). The downside of this approach is that in order to model long temporal dependencies from samples far in the past, the causal convolutional network requires either many layers or large filters to increase the receptive field. In general, the receptive field is calculated as (# of layers + filter size − 1), which gives a receptive field of 5 in the architecture shown in Figure 1(a). To address this problem, WaveNet leverages dilated convolutions [20], [21] which deliver higher receptive fields without a significant increase in the computational cost of the model. Dilated convolution is equivalent to performing convolutions with dilated filters where the size of the filter is expanded by filling the empty positions with zeros. In practice, this is achieved by performing a convolution where the filter skips input values with a certain step.

Figure 1(b) illustrates a network with dilated causal convolutions for dilation values of 1, 2, 4, and 8. Here, the input nodes are shown in blue and the output is shown in orange. Each edge in the graph corresponds to a 1-dimensional convolution (see Section II-A), or more generally a matrix multiplication. Due to the binary tree structure of the network, the time complexity of computing the output at each time-step is O(2^L) where L is the number of layers in the network, which becomes highly undesirable as L increases. Similarly, the total memory required to store the inputs, output, and the intermediate layer features is O(2^L).

Fig. 1: a. (Left) Stacked causal convolution layers without any dilations. b. (Right) Stacked causal 1-d convolution layers with increasing dilation. (Figures from WaveNet paper [1]).
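To make the dilation mechanism concrete, the sketch below (our own illustration, not the HLS source) applies a width-2 causal filter whose second tap reaches back by the dilation amount, which is exactly the "skip input values with a certain step" behaviour described above.

```cpp
#include <vector>

// Causal dilated convolution with filter width 2: output[t] depends on
// input[t] and input[t - dilation]. Positions before the start of the
// signal are treated as zero so the output stays causal.
std::vector<float> dilated_causal_conv(const std::vector<float>& x,
                                       float w0, float w1, int dilation) {
    std::vector<float> y(x.size(), 0.0f);
    for (int t = 0; t < static_cast<int>(x.size()); ++t) {
        float past = (t - dilation >= 0) ? x[t - dilation] : 0.0f;  // skipped-back input
        y[t] = w0 * past + w1 * x[t];
    }
    return y;
}
```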
C. Fast-Wavenet

The naïve implementation in Figure 1(b) has many redundant computations when generating a new sample; that is, it recomputes activations that have already been computed for generating previous samples. Fast-Wavenet [11] proposed an efficient algorithm that caches these recurrent activations in queues instead of recomputing them from scratch while generating a new sample. Fast-Wavenet uses a per-layer first-in-first-out queue to cache the states to be used in future timestamps.

The queue size at each layer is determined by its corresponding dilation value. Figure 2 demonstrates an example 4-layer network and the corresponding queues. For the first layer, the dilation value is 1 and therefore the corresponding queue (Q1) only keeps one value. Similarly, the output layer has a dilation value of 8, which means that its queue (Q4) will store 8 recurrent values. By removing the redundant computations through the queue storing mechanism, the computational complexity of Fast-Wavenet is O(L) where L is the number of layers. The overall memory requirement for the queues as well as the intermediate values remains the same as in the naïve implementation, i.e., O(2^L).

The basic queue operations performed in Fast-Wavenet are as follows (refer to Figure 2):
1) Pop phase: The oldest recurrent states are popped off the queues in each layer and fed as input to the generation model. These popped-off states and the current input are operated with the convolutional kernel to compute the current output and the new recurrent states.
2) Push phase: Newly calculated recurrent states (orange dots) are pushed to the back of their respective layer queues to be used in future time stamps.

Maintaining the convolutional queues in the above manner allows us to handle the sparse convolutional operation, avoid redundant computations, and make the generation algorithm linear in the length of the sequence.

Fig. 2: Basic queue operations (Push and Pop) performed in Fast-Wavenet to achieve linear time in audio generation.
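As a minimal illustration of the pop/push discipline (not the HLS implementation itself), the sketch below models one per-layer FIFO: the oldest cached state is read for the current convolution and the newly computed state is appended for future time steps.

```cpp
#include <deque>
#include <vector>

// One Fast-Wavenet layer queue holding cached recurrent states.
// Each state is a vector of length equal to the layer's input channels.
struct LayerQueue {
    std::deque<std::vector<float>> states;  // front = oldest cached state

    // Pop phase: read (and remove) the oldest state; it is combined with the
    // current input by the convolutional kernel to produce the layer output.
    std::vector<float> pop_oldest() {
        std::vector<float> oldest = states.front();
        states.pop_front();
        return oldest;
    }

    // Push phase: append the newly computed state so it can be reused
    // queue-length time steps later.
    void push_newest(const std::vector<float>& state) {
        states.push_back(state);
    }
};
```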
III. METHODOLOGY

Our primary objective is to accelerate the inference of autoregressive CNNs for sequence generation on FPGAs. As an ideal candidate for autoregressive models, we choose the WaveNet model for raw audio generation from random seed inputs. The computation and storage complexity of such state-of-the-art autoregressive CNNs is very high; in particular, our FastWave architecture comprises 28 convolutional layers with 128 channels each in order to maintain high audio quality. When designing an accelerator for such models, it is important to be aware of the system restrictions, particularly those of memory access bandwidth [12], [22]. Accessing off-chip memory is expensive and can limit the throughput of our network, making it important to compress DNNs into an optimal model for efficient inference.

Design Flow: We start with an open source TensorFlow implementation of the Fast-Wavenet algorithm. We save the weights of the convolutional and fully connected layers of our trained model, which are used in the inference stage for generating audio. We implement the Fast-Wavenet inference in NumPy without using any high level deep learning libraries. This implementation serves as a bridge between the high level TensorFlow and the low level Vivado HLS implementation in C++. On the FPGA platform, we then accelerate the audio generation process from random seeds, and perform optimized operations using queue buffers and matrix-vector multiplications to generate raw audio.

A. Model Architecture and Training on GPU

We use an open-source TensorFlow implementation of Fast-Wavenet [11] to pre-train our network in Python. The network architecture we use is a stack of two dilated convolutional blocks. Each block consists of 14 convolutional layers with kernel size (filter width) = 2 and dilation increasing in powers of 2. Therefore each of the kernels is a 3-dimensional array of shape 2 × input channels × output channels.

Block No. | Layer No. | Filter Width | Queue Length | Input Channels | Output Channels | Queue Size
1 | 1  | 2 | 1    | 1   | 128 | 1
1 | 2  | 2 | 2    | 128 | 128 | 256
1 | 3  | 2 | 4    | 128 | 128 | 512
1 | 4  | 2 | 8    | 128 | 128 | 1024
1 | 5  | 2 | 16   | 128 | 128 | 2048
1 | 6  | 2 | 32   | 128 | 128 | 4096
1 | 7  | 2 | 64   | 128 | 128 | 8192
1 | 8  | 2 | 128  | 128 | 128 | 16384
1 | 9  | 2 | 256  | 128 | 128 | 32768
1 | 10 | 2 | 512  | 128 | 128 | 65536
1 | 11 | 2 | 1024 | 128 | 128 | 131072
1 | 12 | 2 | 2048 | 128 | 128 | 262144
1 | 13 | 2 | 4096 | 128 | 128 | 524288
1 | 14 | 2 | 8192 | 128 | 128 | 1048576
2 | 1  | 2 | 1    | 128 | 128 | 128
2 | 2  | 2 | 2    | 128 | 128 | 256
2 | 3  | 2 | 4    | 128 | 128 | 512
2 | 4  | 2 | 8    | 128 | 128 | 1024
2 | 5  | 2 | 16   | 128 | 128 | 2048
2 | 6  | 2 | 32   | 128 | 128 | 4096
2 | 7  | 2 | 64   | 128 | 128 | 8192
2 | 8  | 2 | 128  | 128 | 128 | 16384
2 | 9  | 2 | 256  | 128 | 128 | 32768
2 | 10 | 2 | 512  | 128 | 128 | 65536
2 | 11 | 2 | 1024 | 128 | 128 | 131072
2 | 12 | 2 | 2048 | 128 | 128 | 262144
2 | 13 | 2 | 4096 | 128 | 128 | 524288
2 | 14 | 2 | 8192 | 128 | 128 | 1048576

TABLE I: Details of the Fast-Wavenet architecture. The column Queue Size denotes the number of floating point numbers stored in each queue and is equal to QueueLength × InputChannels.
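The per-layer dimensions in Table I follow directly from the dilation schedule; the short helper below is our own sketch (not part of the released code) that reproduces the Queue Length and Queue Size columns for both blocks.

```cpp
#include <cstdio>

int main() {
    const int layers_per_block = 14;
    const int num_blocks = 2;
    const int channels = 128;  // output channels of every layer
    for (int b = 1; b <= num_blocks; ++b) {
        for (int l = 1; l <= layers_per_block; ++l) {
            int dilation = 1 << (l - 1);                 // 1, 2, 4, ..., 8192
            int queue_length = dilation;                 // queue length equals the dilation
            int in_channels = (b == 1 && l == 1) ? 1 : channels;  // raw audio enters block 1, layer 1
            long queue_size = static_cast<long>(queue_length) * in_channels;  // floats stored per queue
            std::printf("block %d layer %2d: queue_length=%5d in=%3d out=%3d queue_size=%8ld\n",
                        b, l, queue_length, in_channels, channels, queue_size);
        }
    }
    return 0;
}
```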
The number of output channels in each layer is 128 in the baseline implementation. After each convolutional layer there is a tanh activation function which serves as the non-linearity in our model, as used in the original WaveNet paper [1]. A tanh activation normalizes values between -1 and 1 and also allows us to better utilize fixed point data-types in the Vivado implementation without compromising on accuracy.

After the two convolutional blocks, we have a single fully connected layer which maps the activation of size 100 from the last convolutional layer to an output vector of size 256, followed by a softmax normalization layer. The output after the softmax layer is the generated distribution. The target audio is quantized linearly between -1 and 1 into 256 values. The one-hot representation of each sample of size 256 serves as the target distribution at each time-step. The cross entropy loss between the generated and target distribution is back-propagated to train the convolutional kernels and weights of the fully connected layer.

Memory challenges: The primary memory bottleneck in implementing the Fast-Wavenet inference is not the parameters of the convolutional kernels, but the convolutional queues which cache the intermediate outputs of the convolutional layers to be used for future predictions. The size of these queues increases exponentially with the depth of the block in the neural network. As highlighted in Table I, the 14th convolutional queue in each of the blocks stores 1,048,576 floating point numbers (≈ 33 Mb). One way of addressing this challenge is to reduce the number of channels in the 13th and 14th convolutional layers via pruning. However, in our experiments we found pruning to degrade the quality of generated audio. To address this memory challenge without pruning the network, we utilize both BRAMs and URAMs available on our FPGA board. We store all convolutional queues on the BRAMs by default and off-load the 14th convolutional queue of each block onto the URAMs on our board. In this way, we are able to utilize only on-chip memory and achieve higher bandwidth without compromising on audio quality.

B. Accelerator Design Overview

Fig. 3: Acceleration Methodology

The primary objective of our system is to generate an output stream given a seed input. Figure 3 shows the overview of our accelerator design. Given a seed input, our system generates an output stream in an autoregressive manner, one sample at a time. The output sample produced at each time-step is fed back as input to generate the next output sample. During each cycle, as the input goes through all the convolutional layers, the corresponding convolutional queues are updated using push and pop operations as explained in Section II. It is important to note that the entire model, including the convolutional queues and the parameters, does not use any off-chip memory and is stored in the BRAM and URAM available on the FPGA board. We describe the details of implementing the convolution operations, queue updates and output generation using the fully connected layers in the following section.

IV. IMPLEMENTATION DETAILS

Our design is composed of 5 main elements: (i) the dilated convolution layers, (ii) the queue control unit, (iii) the fully-connected layer, (iv) the matrix multiplication engine, and (v) the network description module. We implement and accelerate the inference of WaveNet on the Xilinx XCVU13P FPGA.

A. Dilated Convolution Layer

As explained in Section II-C, Fast-WaveNet leverages queues to implement the dilated convolutional layers. A convolution of size = 2 is used in the WaveNet architecture, and can be implemented as two matrix-vector multiplications followed by vector addition in the manner explained below. Notations used for our variables along with their shapes are listed below:

IC_n: number of input channels of layer n.
OC_n: number of output channels of layer n.
O[n]_{(OC_n × 1)}: output of convolutional layer n.
K[n]_{(2 × OC_n × IC_n)}: convolutional kernel of layer n.
Q[n]_{(queueLength × IC_n)}: convolutional queue of layer n.

O_1[n] = K[n][0]_{(OC_n × IC_n)} × Q[n][0]_{(IC_n × 1)}
O_2[n] = K[n][1]_{(OC_n × IC_n)} × O[n−1]_{(IC_n × 1)}
O[n]_{(OC_n × 1)} = O_1[n]_{(OC_n × 1)} + O_2[n]_{(OC_n × 1)}

In other words, we matrix-multiply the first component of the convolutional kernel with the first element of the queue, and the second component of the kernel with the previous layer's output, and then add the two products to obtain the output of any layer. The details of the matrix-vector multiplication engine are provided in Section IV-D.

The output of the convolution layer is then passed to a tanh activation function. We use the CORDIC implementation available in the Vivado HLS math library for applying tanh, allowing us to optimize our design and memory usage. The output of the dilated convolution module is a vector of length equal to the number of layer output channels.
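A compact software model of this layer computation, written as a sketch under the notation above (it is not the Vivado HLS source), is shown below: one matrix-vector product against the popped queue state, one against the previous layer's output, a vector add, and the tanh non-linearity.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<std::vector<float>>;  // [rows][cols]

// y = M * x for an (OC x IC) matrix and an (IC x 1) vector.
Vec matvec(const Mat& M, const Vec& x) {
    Vec y(M.size(), 0.0f);
    for (std::size_t r = 0; r < M.size(); ++r)
        for (std::size_t c = 0; c < x.size(); ++c)
            y[r] += M[r][c] * x[c];
    return y;
}

// Dilated convolution layer of width 2: K0/K1 are the two kernel slices,
// q_front is the oldest queue entry (pop phase), prev_out is O[n-1].
Vec dilated_conv_layer(const Mat& K0, const Mat& K1,
                       const Vec& q_front, const Vec& prev_out) {
    Vec o1 = matvec(K0, q_front);   // O1[n] = K[n][0] * Q[n][0]
    Vec o2 = matvec(K1, prev_out);  // O2[n] = K[n][1] * O[n-1]
    Vec out(o1.size());
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = std::tanh(o1[i] + o2[i]);  // tanh non-linearity after the add
    return out;
}
```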
B. Cyclic Queue Buffer Unit

In order to reduce the number of operations, Fast-Wavenet removes redundant convolution operations by caching previous calculations in a queue, thereby reducing the complexity of synthesis to O(L) time. This means that after performing a convolutional operation, we push the computed state onto the end of the queue and pop out the first element. These push and pop operations are shown in Figure 2. As described above, the queue in each layer Q[n] is a 2-D array of shape QueueLength × InputChannels. The QueueLength depends on the dilationFactor of the layer and is equal to 2^{dilationFactor}. We aim to fit our queue computations in the on-chip memory BRAMs and URAMs. Our baseline queue implementation in Vivado HLS used shift operations to perform the pop and push functionalities of a queue. The longest queue in our model is of size 8192 × 24. The shifting of a large number of elements in the queue resulted in very high latency.

To make queue push and pop operations computationally efficient, we implemented our queues using fixed-length circular arrays for each layer. This is a lot more efficient than shifting all the elements present in the queue. The push and pop operations are reduced to just overwriting one column of our circular array, which is indexed using a modulo-QueueLength index.
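The sketch below illustrates the circular-array idea with modulo indexing (a simplified software model with hypothetical member names, not the HLS code): a push simply overwrites the slot that the pop just vacated, so no elements are ever shifted.

```cpp
#include <vector>

// Fixed-length circular queue for one layer: rows are time slots,
// columns are input channels. head points at the oldest entry.
struct CyclicQueue {
    std::vector<std::vector<float>> buf;  // QueueLength x InputChannels
    int head = 0;

    CyclicQueue(int queue_length, int in_channels)
        : buf(queue_length, std::vector<float>(in_channels, 0.0f)) {}

    // Pop phase: return a copy of the oldest cached state (no shifting).
    std::vector<float> pop() const { return buf[head]; }

    // Push phase: overwrite the slot we just consumed with the new state,
    // then advance head modulo QueueLength.
    void push(const std::vector<float>& state) {
        buf[head] = state;
        head = (head + 1) % static_cast<int>(buf.size());
    }
};
```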
C. Fully-connected Layer

The fully connected layer in WaveNet is a linear layer after all the convolutional layers. This layer is characterized by a weight matrix W_{(channels × OutputSize)} and a bias vector b_{(1 × OutputSize)}. The fully connected layer performs the following operation on ConvOut, the output of the last convolution layer:

FinalOutput = ConvOut × W + b

In our design, the weight matrix W has shape 100 × 256 and the bias b has shape 1 × 256. We use arg-max sampling on the final vector of length 256 to obtain the quantized output value between -1 and 1.
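A minimal software model of this step is sketched below (illustrative only; the names are ours): a vector-matrix product plus bias, followed by arg-max over the 256 quantization bins, with the winning bin mapped back to [-1, 1] assuming the linear quantization described in Section III-A.

```cpp
#include <cstddef>
#include <vector>

// Fully connected layer followed by arg-max sampling.
// conv_out: 1 x channels, W: channels x 256, b: 1 x 256.
float fc_argmax_sample(const std::vector<float>& conv_out,
                       const std::vector<std::vector<float>>& W,
                       const std::vector<float>& b) {
    const int out_size = static_cast<int>(b.size());  // 256 quantization bins
    int best = 0;
    float best_score = 0.0f;
    for (int o = 0; o < out_size; ++o) {
        float score = b[o];
        for (std::size_t c = 0; c < conv_out.size(); ++c)
            score += conv_out[c] * W[c][o];            // FinalOutput = ConvOut * W + b
        if (o == 0 || score > best_score) { best_score = score; best = o; }
    }
    // Arg-max makes the softmax unnecessary; map the winning bin back to [-1, 1].
    return -1.0f + 2.0f * static_cast<float>(best) / static_cast<float>(out_size - 1);
}
```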
D. Matrix Multiplication Engine

The most computationally-intensive operation in DNN execution is matrix-vector multiplication. FPGAs are equipped with DSP units which offer a high computation capacity together with the reconfigurable logic. The basic function of a DSP unit is Multiply-Accumulate (MAC). Layers in a convolutional neural network take as input a vector X_{N×1} and compute the output vector Y_{M×1} as formulated below:

Y = f(WX + b)    (2)

where f(.) is a nonlinear function, W_{M×N} is the 2D matrix of the weights and b_{M×1} is a vector of bias values. As can be seen, each layer computes a vector-matrix multiplication and a vector-vector addition. In order to optimize the design and make efficient use of the DSP blocks, we propose a parallelized approach to convert layer computations into multiple MAC operations. Figure 4 presents our method to parallelize the matrix-vector multiplication computations.

Fig. 4: Schematic representation of the matrix multiplication engine and the corresponding parallelization factors.

We define two levels of parallelism for our engine which control the parallel computations with the parameters num_parallel_in and num_parallel_out, denoting the level of parallelism in the input and output, respectively. For the first level of parallelism, multiple rows of the weight matrix are processed simultaneously by dividing it into chunks, each having num_parallel_out rows. In each round, a chunk of the weight matrix is copied to one of the weight buffers while the other weight buffer is fed into the dot product modules together with a copy of the input vector. The iterations end when all rows of the weight matrix have been processed. For the second level of parallelism, each dot-product function partitions its input vectors into num_parallel_in chunks and concurrently executes MAC operations over the partitioned subsets. The accumulated results of the subsets are then added together within the reduce_sum function to compute the final output. The reduce_sum module performs a tree-based reduction algorithm as outlined in Figure 5. The reduction function takes an array of size 2^M as its input (array a) and oscillates between 2 different modes. In mode 0, the function reduces a by using temp as a temporary array. In mode 1, temp is reduced using a. The result is returned based on the final mode.

The aforementioned parameters num_parallel_in and num_parallel_out are individually defined for each of the layers to enable fine-tuning according to the per-layer requirements. Due to the limited number of available resources on the FPGA platform, it is not possible to define high parallelization factors for all layers. As such, we give priority to layers with higher computational complexity, i.e., a higher number of input and output channels, by instantiating their corresponding matrix multiplication engines with larger parallelization parameters.

Fig. 5: Realization of the tree-based vector reduction algorithm.
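The following HLS-style fragment is our own simplified rendering of the two parallelization levels and the tree reduction (parameter and function names follow the paper's terminology, but the code is not the released implementation, and the ping-pong temp buffering of the original reduce_sum is replaced by an in-place tree): the outer loop unrolls over NUM_PARALLEL_OUT rows, each dot product splits its input into NUM_PARALLEL_IN partial sums, and reduce_sum folds the partial sums together.

```cpp
#include <ap_fixed.h>

typedef ap_fixed<27, 8> data_t;          // fixed-point type used in our designs

#define N_IN  128                        // input vector length (layer input channels)
#define N_OUT 128                        // output vector length (layer output channels)
#define NUM_PARALLEL_IN  4               // second-level parallelism (per dot product)
#define NUM_PARALLEL_OUT 8               // first-level parallelism (rows per chunk)

// Tree-based reduction of NUM_PARALLEL_IN partial sums (power-of-two size assumed).
static data_t reduce_sum(data_t a[NUM_PARALLEL_IN]) {
#pragma HLS INLINE
    for (int step = NUM_PARALLEL_IN / 2; step > 0; step >>= 1) {
        for (int i = 0; i < step; i++) {
#pragma HLS UNROLL
            a[i] += a[i + step];
        }
    }
    return a[0];
}

// Dot product with the input split into NUM_PARALLEL_IN concurrent partitions.
static data_t dot_product(const data_t w[N_IN], const data_t x[N_IN]) {
#pragma HLS INLINE
    data_t partial[NUM_PARALLEL_IN];
    for (int p = 0; p < NUM_PARALLEL_IN; p++) {
#pragma HLS UNROLL
        data_t acc = 0;
        for (int i = p; i < N_IN; i += NUM_PARALLEL_IN) acc += w[i] * x[i];
        partial[p] = acc;
    }
    return reduce_sum(partial);
}

// Matrix-vector engine: rows are processed in chunks of NUM_PARALLEL_OUT.
void matvec_engine(const data_t W[N_OUT][N_IN], const data_t x[N_IN], data_t y[N_OUT]) {
    for (int r0 = 0; r0 < N_OUT; r0 += NUM_PARALLEL_OUT) {
        for (int r = 0; r < NUM_PARALLEL_OUT; r++) {
#pragma HLS UNROLL
            y[r0 + r] = dot_product(W[r0 + r], x);
        }
    }
}
```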
E. Network Description Module

In this module, we implement the overall architecture of our network as a stack of dilated convolutional layers and perform queue update operations followed by a fully connected layer. This module instantiates the corresponding function for each network layer and manages the layer inter-connections. Since each layer is independently instantiated, we can use custom dilation, channels and parallelization parameters for each layer. After the last fully connected layer, to make audio generation deterministic we use arg-max sampling. This allows us to bypass the final softmax layer, since we can directly apply the arg-max function on the output of our final fully connected layer.

V. RESULTS AND EXPERIMENTS

In this section, we evaluate the effect of our optimizations, namely cyclic queues, pipelining, loop unrolling and the customized matrix multiplication engine, by conducting extensive design space exploration. Our design experiments are synthesized for the Xilinx XCVU13P board using Xilinx Vivado HLS 2017.4. In particular, we discuss the experimental techniques applied to reduce the resource utilization and latency of our baseline implementation. We further provide a comprehensive comparison of our best designs with CPU and GPU implemented baselines in terms of throughput and power efficiency.

A. Evaluation Metrics

To evaluate the accuracy of our implementation we compare the output generated from our FPGA implementation with the golden output generated by the TensorFlow GPU implementation for the same initial seed. We use the following metrics to compare any two audio signals x_1, x_2 of the same length:

• Mean Squared Error (MSE): The mean squared error (MSE) between any two given signals x_1, x_2 is the mean squared error between their representations in the time domain as a sequence of floating point numbers. That is, MSE = mean((x_1 − x_2)^2). The MSE losses reported are from the comparison of the entire waveform, i.e., the total mean squared error over all 32000 samples (a minimal reference sketch is given after this list).
• Log-Spectral Distance (LSD): The log-spectral distance [23] is a commonly utilized metric, obtained as the root mean square error between the normalized log-spectra of the given signals. Given two signals x_1, x_2, we calculate the log-spectral distance between them as follows:

ps_1 = |STFT(x_1)|^2
ps_2 = |STFT(x_2)|^2
ls_1 = normalize(log(ps_1))    (3)
ls_2 = normalize(log(ps_2))
LSD = RMSE(ls_1, ls_2)

Here ps_1, ps_2 are the power spectra and ls_1, ls_2 are the normalized log spectra of signals x_1, x_2 respectively. The normalization is performed across all frequencies in the log spectrograms.
• Qualitative Evaluation: Along with the quantitative results, we also provide log-spectrogram visualizations of the audio signal generated using our FPGA implementation and the golden-output audio signal generated from the TensorFlow implementation in Figure 6(c).
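As a reference for how waveforms can be compared offline, the following is a minimal sketch of the MSE metric over two equal-length signals; it is our own helper and not part of the accelerator or the evaluation scripts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean squared error between two equal-length waveforms,
// MSE = mean((x1 - x2)^2), computed over all generated samples.
double mse(const std::vector<float>& x1, const std::vector<float>& x2) {
    assert(x1.size() == x2.size() && !x1.empty());
    double acc = 0.0;
    for (std::size_t i = 0; i < x1.size(); ++i) {
        double d = static_cast<double>(x1[i]) - static_cast<double>(x2[i]);
        acc += d * d;
    }
    return acc / static_cast<double>(x1.size());
}
```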
B. Design Space Exploration

We implement the following designs to study the effect of various optimization techniques in isolation and in combination with other techniques. The resource utilization, performance (throughput) and error in the generated audio for each of the following designs are reported in Table II. Throughput measures the number of audio samples generated per second by our implementation of the autoregressive model. Note that one second of audio contains 16000 samples if audio is sampled at 16 KHz.

Design | BRAM (Mb) | URAM (Mb) | FF (K) | LUT (K) | DSP48E | Latency (cycles) | Clock-Cycle Time (ns) | Throughput (Hz) | MSE | LSD
Available Resources | 94.5 | 360 | 3456 | 1728 | 12288 | - | - | - | - | -
FloatingPointBaseline | 93 (98%) | 144 (40%) | 35 (1%) | 86 (5%) | 288 (2%) | 12110989 | 8.83 | 9.4 | 0 | 0
FloatingPointCQ | 93 (98%) | 144 (40%) | 35 (1%) | 83 (5%) | 330 (3%) | 6170104 | 8.83 | 18.4 | 0 | 0
FloatingPointPipeline | 93 (99%) | 144 (40%) | 231 (7%) | 231 (13%) | 475 (4%) | 612952 | 8.88 | 183.7 | 0 | 0
FixedPointUnrolling | 79 (84%) | 144 (40%) | 22 (1%) | 146 (8%) | 660 (5%) | 293914 | 8.75 | 388.8 | 0.006 | 0.104
FixedPointMME (Best) | 90 (96%) | 144 (40%) | 425 (12%) | 1669 (97%) | 540 (4%) | 78275 | 8.66 | 1475.2 | 0.006 | 0.104

TABLE II: Resource utilization, performance and measured error in generation for each design implementation. The error metrics, namely Mean Squared Error (MSE) and Log-Spectral Distance (LSD), are measured by comparing the generated audio from the FPGA implementations against audio generated from the corresponding GPU implementation. The percentages reported indicate the percentage of resources utilized by the design.
Fig. 6: A: Throughput (Number of Samples generated per second) of different designs. B: Normalized Resource Utilization of different designs. C: Log-Spectrograms of the 2-second audio generated from the TensorFlow implementation (top) and the FPGA FixedPointMME design implementation (bottom).

1) Baseline Floating Point Implementation (FloatingPointBaseline): The baseline design of our network is comprised of modules that implement the basic functionality of each layer and queue, initialization of weights from stored data files, and forward propagation. We use an array-shifting implementation of the queue, which results in a fairly high latency as shown in Table II because of the very long queues (length = 8192 and 4096) in the 13th and 14th layers of our design. For matrix-vector multiplication we use simple for loops without any optimization.

2) Floating Point + Cyclic Queue (FloatingPointCQ): In this design, we replace our shifting based queue implementation with a cyclic queue implementation that uses dynamic indexing to produce the same effect as push and pop operations. This helps reduce latency substantially, since the shifting operations in the longer queues were the bottleneck in our baseline design. The resource utilization, however, stays almost the same as in our baseline design.

3) Floating Point + Cyclic Queue + Pipelining (FloatingPointPipeline): In this design, we modify the above design and add the pipelining pragma in the dot product computation and queue update operations. Pipelining the above design helped increase the throughput substantially at the cost of higher resource utilization.

4) Fixed Point + Cyclic Queue + Unrolling (FixedPointUnrolling): Including both the Cyclic Queue and Pipelining optimizations, we switch from floating point to fixed point operations. Since the order of magnitude of our kernels, inputs, activations and outputs is nearly the same, we keep a common data-type across all of them. After some experimentation, we found that loop unrolling outperforms pipelining in terms of both resource utilization and throughput for fixed point data-types. We use a loop unrolling factor of 8 for the inner loop of our dot product and also for the queue update operations. We observe a trade-off between precision and resource utilization for different fixed point bit widths and chose ap_fixed<27,8> (8 bits for the integer and 19 bits for the fractional part), since it gives reasonable MSE under the constraints of resources.
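To make the data-type and unrolling choices concrete, here is a small HLS-style fragment (illustrative only, not the exact source; the array size of 128 is an assumption) showing the ap_fixed<27,8> type and an inner dot-product loop unrolled by a factor of 8.

```cpp
#include <ap_fixed.h>

// 27-bit fixed point: 8 integer bits, 19 fractional bits.
typedef ap_fixed<27, 8> data_t;

// Inner loop of the dot product with an unroll factor of 8, as used in the
// FixedPointUnrolling design (sketch; vector length is illustrative).
data_t dot_product_unrolled(const data_t w[128], const data_t x[128]) {
    data_t acc = 0;
DOT:
    for (int i = 0; i < 128; i++) {
#pragma HLS UNROLL factor=8
        acc += w[i] * x[i];
    }
    return acc;
}
```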
5) Fixed Point + Matrix Multiplication Engine (FixedPointMME - Best): For our best design, we use the fixed-point implementation in a parallelized approach to convert layer computations into multiple MAC operations (refer to Section IV-D for details). For the first dilated convolution layer we set num_parallel_out and num_parallel_in to 1, since the number of input channels is just 1. For all other layers, including the fully connected layer, we set num_parallel_out to 8 and num_parallel_in to 4 to get the best throughput under the constraint of available resources.

Implementation | Time (in seconds) for 1-Second Audio Generation | Power (W)
CPU (NumPy) | 732 | -
GPU - NVIDIA Titan Xp (TensorFlow) | 120 | 70
GPU - NVIDIA Tesla V100 (TensorFlow) | 85 | 66
FPGA - FloatingPointPipeline | 87 | 10.2
FPGA - FixedPointUnrolling | 41 | 7.6
FPGA - FixedPointMME (Best) | 11 | 23

TABLE III: Power consumption and wall-clock time required when generating 1 second of audio for different implementations.

C. Performance and Power Analysis

Table III illustrates the performance and power consumption for our implemented designs and highly optimized CPU and GPU implementations. We benchmark the optimized TensorFlow implementation of Fast-Wavenet on two GPUs: NVIDIA TITAN Xp and NVIDIA Tesla V100. The CPU implementation is the NumPy inference program written by us and fully optimized. We measure the power consumption for the GPU benchmarks using the NVIDIA power measurement tool (nvidia-smi) running on the Linux operating system, which is invoked during program execution. For our FPGA implementations, we synthesize our designs using Xilinx Vivado v2017.4. We then integrate the synthesized modules, accompanied by the corresponding peripherals, into a system-level schematic using Vivado IP Integrator. The frequency is set to 150 MHz and power consumption is estimated using the synthesis tool.

As shown, our best FPGA implementation achieves an 11× speed-up in audio generation while being 3× more power efficient as compared to the NVIDIA Titan Xp GPU. As compared to a NumPy based CPU implementation, our best design is 66× faster.

VI. PRIOR WORKS ON ACCELERATING DNNS FOR FPGAS

Prior works have made significant efforts in compressing Deep Neural Networks (DNNs) to support fast, energy-efficient applications. However, recent research on DNNs is still increasing the depth of models and introducing new architectures, resulting in a higher number of parameters per network and higher computational complexity.
Other than CPUs and GPUs, FPGAs are becoming a platform candidate to achieve energy efficient neural network computation [12], [13], [22], [24]–[27]. Equipped with the necessary hardware for basic DNN operations, FPGAs are able to achieve high parallelism and utilize the properties of neural network computation to remove unnecessary logic. Algorithm explorations also show that neural networks can be simplified to become more hardware friendly without sacrificing the accuracy of the model. Therefore, it has become possible to achieve increased speedup and higher energy efficiency on FPGAs compared to CPU and GPU platforms [28] while maintaining state-of-the-art accuracy. Prior efforts have been made in FPGA acceleration of speech recognition, classification and language modelling using Recurrent Neural Networks [14], [16], [27]; however, the challenges in the generation of sequences with long-term dependencies, particularly in the audio domain, have not been addressed.

VII. CONCLUSION

We present the first accelerator platform for deep autoregressive convolutional neural networks. While prior works have proposed algorithms for making the inference of such networks faster on GPUs and CPUs, they do not exploit the potential parallelism offered by FPGAs. We develop a systematic approach to accelerate the inference of WaveNet based neural networks by optimizing their fundamental computational blocks and utilizing only on-chip memory. We demonstrate the effectiveness of using FPGAs for fast audio generation by achieving a significant speed-up over prior efforts on CPU and GPU based implementations.

REFERENCES

[1] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in SSW, 2016, p. 125.
[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[3] A. Gibiansky, S. Arik, G. Diamos, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, "Deep Voice 2: Multi-speaker neural text-to-speech," in Advances in Neural Information Processing Systems, 2017, pp. 2962–2970.
[4] K. Qian, Y. Zhang, S. Chang, X. Yang, D. A. F. Florêncio, and M. Hasegawa-Johnson, "Speech enhancement using Bayesian WaveNet," in INTERSPEECH, 2017.
[5] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, "Neural machine translation in linear time," CoRR, vol. abs/1610.10099, 2016.
[6] Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick, "Improved variational autoencoders for text modeling using dilated convolutions," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 3881–3890.
[7] A. van den Oord, T. Walters, and T. Strohman, "WaveNet launches in the Google Assistant," 2017. [Online]. Available: https://deepmind.com/blog/wavenet-launches-google-assistant/
[8] R. J. Williams and D. Zipser, "Backpropagation," Y. Chauvin and D. E. Rumelhart, Eds. L. Erlbaum Associates Inc., 1995, ch. Gradient-based Learning Algorithms for Recurrent Networks and Their Computational Complexity, pp. 433–486.
[9] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," CoRR, vol. abs/1803.01271, 2018.
[10] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4784–4788.
[11] T. L. Paine, P. Khorrami, S. Chang, Y. Zhang, P. Ramachandran, M. A. Hasegawa-Johnson, and T. S. Huang, "Fast WaveNet generation algorithm," arXiv preprint arXiv:1611.09482, 2016.
[12] M. Samragh, M. Ghasemzadeh, and F. Koushanfar, "Customizing neural networks for efficient FPGA implementation," in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2017, pp. 85–92.
[13] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to FPGAs," in The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 2016, p. 17.
[14] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung, "FPGA-based low-power speech recognition with recurrent neural networks," in 2016 IEEE International Workshop on Signal Processing Systems (SiPS). IEEE, 2016, pp. 230–235.
[15] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, "ESE: Efficient speech recognition engine with sparse LSTM on FPGA," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. ACM, 2017, pp. 75–84.
[16] S. Li, C. Wu, H. Li, B. Li, Y. Wang, and Q. Qiu, "FPGA acceleration of recurrent neural network based language model," in Proceedings of the 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, ser. FCCM '15. IEEE Computer Society, 2015, pp. 111–118.
[17] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in INTERSPEECH, 2017.
[18] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, "WaveNet vocoder with limited training data for voice conversion," in Proc. Interspeech, 2018, pp. 1983–1987.
[19] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall, "English conversational telephone speech recognition by humans and machines," in Proc. Interspeech 2017, 2017, pp. 132–136.
[20] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.
[21] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[22] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[23] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, Inc., 1993.
[24] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, "Accelerating deep convolutional neural networks using specialized hardware," Microsoft Research Whitepaper, vol. 2, no. 11, 2015.
[25] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 16–25.
[26] C. Shea, A. Page, and T. Mohsenin, "ScaleNet: A scalable low power accelerator for real-time embedded deep neural networks," in Proceedings of the 2018 Great Lakes Symposium on VLSI. ACM, 2018, pp. 129–134.
[27] Y. Guan, Z. Yuan, G. Sun, and J. Cong, "FPGA-based accelerator for long short-term memory recurrent neural networks," 2017, pp. 629–634.
[28] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, "A survey of FPGA-based neural network accelerator," ACM Transactions on Reconfigurable Technology and Systems, vol. 9, no. 4, 2017.
