FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA
Abstract—Autoregressive convolutional neural networks (CNNs) have been widely exploited for sequence generation
tasks such as audio synthesis, language modeling and neural machine translation. WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution that is used for sequence generation. While WaveNet produces state-of-the-art audio generation results, the naive inference implementation is quite slow; it takes a few minutes to generate just one second of audio on a high-end GPU. In this work, we develop the first accelerator platform, FastWave, for autoregressive convolutional neural networks, and address the associated design challenges. We design the Fast-Wavenet inference model in Vivado HLS and perform a wide range of optimizations including fixed-point implementation, array partitioning and pipelining. Our model uses a fully parameterized parallel architecture for fast matrix-vector multiplication that enables per-layer customized latency fine-tuning for further throughput improvement. Our experiments comparatively assess the trade-off between throughput and resource utilization for various optimizations. Our best WaveNet design on the Xilinx XCVU13P FPGA, which uses only on-chip memory, achieves 66× faster generation speed compared to a CPU implementation and 11× faster generation speed than a GPU implementation.

I. INTRODUCTION
Autoregressive convolutional models achieve state-of-the-art results in audio [1]–[4] and language domains [5], [6] with respect to both estimating the data distribution and generating high-quality samples. WaveNet [1] is an example of an autoregressive convolutional network, used for modelling audio for applications such as text-to-speech (TTS) synthesis and music generation. WaveNet has been rated by human listeners to provide substantially more natural sounding audio when compared to the best existing parametric and concatenative systems in TTS applications for both English and Mandarin [1]. Popular cloud-based TTS synthesis systems such as Google Now and Google Assistant, which produce natural sounding speech, are built on the WaveNet architecture [4], [7]. Alongside achieving state-of-the-art results in the audio domain, convolutional models are prominent for natural language modeling tasks like text generation and machine translation [5].

Generally, both autoregressive convolutional neural networks (CNNs) and Recurrent Neural Networks (RNNs) [8] are widely popular for sequence modelling tasks. The main advantage of CNN-based models is that they can achieve higher parallelism during training and can capture longer time-dependencies as compared to RNN-based models [9], [10]. However, this comes at the cost of slower inference, since the inference still remains sequential and CNN-based models are usually very deep. To overcome this problem, Fast-Wavenet [11] exploits the temporal dependencies by caching redundant computations using fixed-length convolutional queues and thus makes the generation time linear in the length of the sequence. Such efforts have made it feasible to use autoregressive CNNs for practical sequence generation applications, as an alternative to RNN-based models.

While the Fast-Wavenet algorithm provides a speed-up in audio generation over the naïve WaveNet implementation, the generation time is still high, even on a high-end GPU. It takes 120 seconds to generate 1 second of audio using the Fast-Wavenet implementation on an NVIDIA TITAN Xp GPU. Prior works have shown FPGAs to be successful in accelerating the inference of pre-trained neural networks by providing custom data paths to achieve high parallelism. A vast amount of such research focuses on accelerating neural networks in the image domain [12], [13], speech recognition [14], [15] and language modelling [16]. To the best of our knowledge, similar efforts have not been made for accelerating neural networks for speech/audio synthesis.

We aim to accelerate audio synthesis using the autoregressive CNN model WaveNet on FPGA. The primary challenges in deploying autoregressive CNN inference on FPGA are designing modules for dilated convolutional layers, buffers for storing redundant computations using convolutional queues, and dealing with the large depth of these networks, which is necessary to maintain high audio quality. In this work, we address these challenges of deploying large autoregressive convolutional models on FPGAs and perform a wide range of application and architectural optimizations. Furthermore, we comprehensively analyze and compare the performance of the Fast-Wavenet implementation on FPGA with its CPU and GPU counterparts.

Summary of Contributions:
• Creation of the first accelerator platform for autoregressive convolutional neural networks. We deploy the fast inference model Fast-Wavenet [11] on a Xilinx XCVU13P FPGA, which achieves 11 times faster generation speed than a high-end GPU and 66 times faster generation speed than a high-end CPU.
• Development of reconfigurable basic blocks pertinent to autoregressive convolutional networks, i.e., dilated causal convolutional layers, convolutional queues, and the fully connected layer. Our operations are powered by a fully-customizable matrix-multiplication engine that uses two levels of parallelism controlled by tunable parameters.
• Creation of an end-to-end framework that accesses only on-chip memory, thereby ensuring high throughput and avoiding any latency caused by communication with off-chip memory.
• Exploration of the design space using different architectural and application optimizations, as well as comparing the performance and resource usage. We present extensive evaluation of throughput and power efficiency for our fully optimized and baseline designs.
II. BACKGROUND

In this section, we provide a background on autoregressive convolutional neural networks. We choose WaveNet as an ideal representation of such models, and describe its overall generative architecture. We first elaborate on the 1D convolution operation as it is the core computation performed in the WaveNet model. Next, we explain WaveNet and its more efficient inference algorithm called Fast-Wavenet.
A. 1D Convolution

The 1D convolution operation is performed by sliding a one-dimensional kernel over a 1D input signal. Each output value at position i is produced by calculating the dot product of the kernel and the overlapping values of the input signal, starting from position i. More formally, for an input vector a of length n and a kernel k of length m, the 1D convolution is calculated as follows:

$$ (a * k)_i = \sum_{j=1}^{m} k_j \times a_{i - j + m/2} \qquad (1) $$

where i is an arbitrary index in the output vector, which has a total length of n − m + 1. The subscripts denote the indices of the kernel/input vectors.
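To make the sliding-window indexing concrete, here is a small C++ sketch of the dot-product view of 1D convolution described above (a "valid" convolution with simplified indexing; it is an illustration, not the FastWave kernel itself):

```cpp
#include <vector>

// Valid 1D convolution: output length is n - m + 1, as in Eq. (1).
// Each output element is the dot product of the kernel with the
// m input samples starting at position i.
std::vector<float> conv1d(const std::vector<float>& a,   // input, length n
                          const std::vector<float>& k) { // kernel, length m
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(k.size());
    std::vector<float> out(n - m + 1, 0.0f);
    for (int i = 0; i <= n - m; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < m; ++j) {
            acc += k[j] * a[i + j];   // sliding dot product
        }
        out[i] = acc;
    }
    return out;
}
```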
B. WaveNet and Autoregressive CNNs

Autoregressive neural networks are popularly used for sequence generation tasks which rely on ancestral sampling, i.e., the predictive distribution for each sample in the sequence is conditioned on all previous ones. While RNNs are popular autoregressive models, they do not exhibit a high receptive field, making them unsuitable for modeling sequences with long-term dependencies like audio [9]. In contrast, autoregressive CNN-based models use a stack of dilated convolutional layers to achieve the higher receptive field necessary for modeling sequences with long-term dependencies.

WaveNet [1] is an autoregressive convolutional neural network that produces raw audio waveforms by directly modeling the underlying probability distribution of audio samples. This has led to state-of-the-art performance in text-to-speech synthesis [2], [7], [17], [18], speech recognition [19], and other audio generation settings [1], [3], [4]. The WaveNet architecture aims to model the conditional probability among subsequent audio samples. The joint probability distribution of waveform sample points x = x_0, x_1, ..., x_T can be written as:

$$ P(x \mid \lambda) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}, \ldots, x_0, \lambda) $$

where λ denotes the learnable parameters of the WaveNet model. During inference, next-sample audio (x_t) generation is performed by sampling from the conditional probability distribution given all of the previous samples, P(x_t | x_{t−1}, ..., x_1, x_0, λ).

One possible method for modeling the probability density is via a stack of causal convolutional layers as depicted in Figure 1(a). The input passes through this stack of convolutional layers and gated activation functions and finally through a softmax layer to get the posterior probability P(x_t | x_{t−1}, ..., x_1, x_0). The downside of this approach is that in order to model long temporal dependencies from samples far in the past, the causal convolutional network requires either many layers or large filters to increase the receptive field. In general, the receptive field is calculated as (# of layers + filter size − 1), which gives a receptive field of 5 in the architecture shown in Figure 1(a). To address this problem, WaveNet leverages dilated convolutions [20], [21], which deliver higher receptive fields without a significant increase in the computational cost of the model. Dilated convolution is equivalent to performing convolutions with dilated filters, where the size of the filter is expanded by filling the empty positions with zeros. In practice, this is achieved by performing a convolution where the filter skips input values with a certain step.

Figure 1(b) illustrates a network with dilated causal convolutions for dilation values of 1, 2, 4, and 8. Here, the input nodes are shown in blue and the output is shown in orange. Each edge in the graph corresponds to a 1-dimensional convolution (see Section II-A), or more generally a matrix multiplication. Due to the binary tree structure of the network, the time complexity of computing the output at each time-step is O(2^L), where L is the number of layers in the network, which becomes highly undesirable as L increases. Similarly, the total memory required to store the inputs, output, and the intermediate layer features is O(2^L).

Fig. 1: a. (Left) Stacked causal convolution layers without any dilations. b. (Right) Stacked causal 1-D convolution layers with increasing dilation. (Figures from the WaveNet paper [1].)
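To illustrate the dilation mechanism just described (an explanatory sketch, not the paper's layer implementation), a dilated causal 1-D convolution with filter width 2 only combines the current sample with the sample d steps in the past:

```cpp
#include <vector>

// Dilated causal 1-D convolution with filter width 2 and dilation d:
// y[t] = w0 * x[t - d] + w1 * x[t].  Samples before t = 0 are treated as zero.
std::vector<float> dilated_causal_conv(const std::vector<float>& x,
                                       float w0, float w1, int d) {
    std::vector<float> y(x.size(), 0.0f);
    for (int t = 0; t < static_cast<int>(x.size()); ++t) {
        float past = (t - d >= 0) ? x[t - d] : 0.0f;  // causal zero padding
        y[t] = w0 * past + w1 * x[t];
    }
    return y;
}

// Stacking L such layers with dilations 1, 2, 4, ..., 2^(L-1) grows the
// receptive field exponentially, matching the binary-tree structure in Fig. 1(b).
```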
C. Fast-Wavenet

The naïve implementation in Figure 1(b) has many redundant computations when generating a new sample; that is, it recomputes activations that have already been computed for generating previous samples. Fast-Wavenet [11] proposed an efficient algorithm that caches these recurrent activations in queues instead of recomputing them from scratch while generating a new sample. Fast-Wavenet uses a per-layer first-in-first-out queue to cache the states to be used in future timestamps.

The queue size at each layer is determined by its corresponding dilation value. Figure 2 demonstrates an example 4-layer network and its corresponding queues. For the first layer, the dilation value is 1 and therefore the corresponding queue (Q1) only keeps one value. Similarly, the output layer has a dilation value of 8, which means that its queue (Q4) will store 8 recurrent values. By removing redundant computations through this queue storing mechanism, the computational complexity of Fast-Wavenet is O(L), where L is the number of layers. The overall memory requirement for the queues as well as the intermediate values remains the same as the naïve implementation, i.e., O(2^L).

The basic queue operations performed in Fast-Wavenet are as follows (refer to Figure 2):

1) Pop phase: The oldest recurrent states are popped off the queues in each layer and fed as input to the generation model. These popped-off states and the current input are operated on with the convolutional kernel to compute the current output and the new recurrent states.

2) Push phase: Newly calculated recurrent states (orange dots) are pushed to the back of their respective layer queues to be used in future time stamps.

Maintaining the convolutional queues in the above manner allows us to handle the sparse convolutional operation, avoid redundant computations, and make the generation algorithm linear in the length of the sequence; a minimal code sketch of one generation step follows below.
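The following is a minimal sketch of one generation step for a single layer, assuming filter width 2 and a single channel for readability (the actual model uses 128-channel matrix-vector operations; the function and variable names are illustrative, not the paper's code):

```cpp
#include <deque>

// One generation step of a single Fast-Wavenet layer (filter width 2,
// single channel for clarity).  The queue holds the activation from
// `dilation` steps ago, so no past activations are recomputed.
float fastwavenet_layer_step(std::deque<float>& queue,  // length == dilation
                             float current_input,
                             float w0, float w1) {
    // Pop phase: the oldest recurrent state leaves the queue.
    float recurrent_state = queue.front();
    queue.pop_front();

    // Dilated convolution collapses to a two-term dot product.
    float output = w0 * recurrent_state + w1 * current_input;

    // Push phase: the current state is cached for a future time step.
    queue.push_back(current_input);

    return output;
}
```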
Fig. 2: Convolutional queues and the generation model, showing the queue pop and push operations across layers (Q1–Q4).

The FastWave architecture comprises 28 convolutional layers with 128 channels each in order to maintain high audio quality. When designing an accelerator for such models, it is important to be aware of the system restrictions, particularly those of memory access bandwidth [12], [22]. Accessing off-chip memory is expensive and can limit the throughput of our network, making it important to compress DNNs into an optimal model for efficient inference.

Design Flow: We start with an open-source TensorFlow implementation of the Fast-Wavenet algorithm. We save the weights of the convolutional and fully connected layers of our trained model, which are used in the inference stage for generating audio. We implement the Fast-Wavenet inference in NumPy without using any high-level deep learning libraries. This implementation serves as a bridge between the high-level TensorFlow and the low-level Vivado HLS implementation in C++. On the FPGA platform, we then accelerate the audio generation process from random seeds, and perform optimized operations using queue buffers and matrix-vector multiplications to generate raw audio.

A. Model Architecture and Training on GPU

Block No. | Layer No. | Filter Width | Queue Length | Input Channels | Output Channels | Queue Size
1 | 1 | 2 | 1 | 1 | 128 | 1
1 | 2 | 2 | 2 | 128 | 128 | 256
1 | 3 | 2 | 4 | 128 | 128 | 512
1 | 4 | 2 | 8 | 128 | 128 | 1024
1 | 5 | 2 | 16 | 128 | 128 | 2048
1 | 6 | 2 | 32 | 128 | 128 | 4096
1 | 7 | 2 | 64 | 128 | 128 | 8192
1 | 8 | 2 | 128 | 128 | 128 | 16384
1 | 9 | 2 | 256 | 128 | 128 | 32768
1 | 10 | 2 | 512 | 128 | 128 | 65536
1 | 11 | 2 | 1024 | 128 | 128 | 131072
1 | 12 | 2 | 2048 | 128 | 128 | 262144
1 | 13 | 2 | 4096 | 128 | 128 | 524288
1 | 14 | 2 | 8192 | 128 | 128 | 1048576
2 | 1 | 2 | 1 | 128 | 128 | 128
2 | 2 | 2 | 2 | 128 | 128 | 256
2 | 3 | 2 | 4 | 128 | 128 | 512
2 | 4 | 2 | 8 | 128 | 128 | 1024
2 | 5 | 2 | 16 | 128 | 128 | 2048
2 | 6 | 2 | 32 | 128 | 128 | 4096
2 | 7 | 2 | 64 | 128 | 128 | 8192
2 | 8 | 2 | 128 | 128 | 128 | 16384
2 | 9 | 2 | 256 | 128 | 128 | 32768
2 | 10 | 2 | 512 | 128 | 128 | 65536
2 | 11 | 2 | 1024 | 128 | 128 | 131072
2 | 12 | 2 | 2048 | 128 | 128 | 262144
2 | 13 | 2 | 4096 | 128 | 128 | 524288
2 | 14 | 2 | 8192 | 128 | 128 | 1048576
TABLE II: Resource utilization, performance, and measured error in generation for each design implementation. The error metrics, namely Mean Squared Error (MSE) and Log-Spectral Distance (LSD), are measured by comparing the generated audio from the FPGA implementations against audio generated from the corresponding GPU implementation. The percentages reported indicate the percentage of resources utilized by the design.
4096) in the 13th and 14th layers of our design. For matrix-vector multiplication we use simple for loops without any optimization.
2) Floating Point + Cyclic Queue (FloatingPointCQ): In this design, we replace our shifting-based queue implementation with a cyclic queue implementation that uses dynamic indexing to produce the same effect as push and pop operations. This helps reduce latency substantially, since shifting operations in the longer queues were the bottleneck in our baseline design. The resource utilization, however, stays almost the same as in our baseline design.
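One possible shape for this optimization, as a hedged C++ sketch (the template and member names are illustrative, not the FastWave source): a head index advances modulo the queue length, so pop and push no longer touch every element.

```cpp
// Cyclic (ring-buffer) queue: pop and push become one read, one write,
// and one index update, independent of the queue length.
template <int LEN>
struct CyclicQueue {
    float buf[LEN];
    int   head = 0;                 // position of the oldest element

    float pop_push(float new_val) {
        float oldest = buf[head];   // pop: read the oldest recurrent state
        buf[head] = new_val;        // push: overwrite it with the new state
        head = (head + 1) % LEN;    // advance the head (dynamic indexing)
        return oldest;
    }
};
```

Compared with shifting all entries on every step, this removes O(LEN) data movement per generated sample, which matters most for the deepest layers, whose queue lengths reach 8192.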
3) Floating Point + Cyclic Queue + Pipelining (FloatingPointPipeline): In this design, we modify the above design and add the pipelining pragma to the dot-product computation and queue update operations. Pipelining the above design helped increase the throughput substantially at the cost of higher resource utilization.
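In Vivado HLS, this optimization corresponds to placing the PIPELINE pragma on the inner loops; a simplified sketch (the loop bound and array names are assumptions, not the paper's code) is shown below:

```cpp
// Dot product of a 128-element weight row with the layer input vector.
// PIPELINE lets a new iteration start before the previous
// multiply-accumulate has fully completed.
float dot128(const float w[128], const float x[128]) {
    float acc = 0.0f;
dot_loop:
    for (int j = 0; j < 128; ++j) {
#pragma HLS PIPELINE
        acc += w[j] * x[j];
    }
    return acc;
}
```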
4) Fixed Point + Cyclic Queue + Unrolling (FixedPointUnrolling): Retaining both the cyclic queue and pipelining optimizations, we switch from floating-point to fixed-point operations. Since the order of magnitude of our kernels, inputs, activations and outputs is nearly the same, we keep a common data-type across all of them. After some experimentation, we found that loop unrolling outperforms pipelining in terms of both resource utilization and throughput for fixed-point data-types. We use a loop unrolling factor of 8 for the inner loop of our dot product and also for the queue update operations. We observe a trade-off between precision and resource utilization for different fixed-point bit widths and chose ap_fixed<27,8> (8 bits for the integer and 19 bits for the fractional part) since it gives reasonable MSE under the constraints of resources.
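A hedged HLS C++ sketch of how the fixed-point type and unrolling might be combined (only the ap_fixed<27,8> type and the unroll factor of 8 come from the text; the function and array names are illustrative):

```cpp
#include "ap_fixed.h"

// 27-bit fixed point: 8 integer bits (incl. sign) and 19 fractional bits,
// shared by kernels, inputs, activations and outputs.
typedef ap_fixed<27, 8> data_t;

data_t dot128_fixed(const data_t w[128], const data_t x[128]) {
    data_t acc = 0;
dot_loop:
    for (int j = 0; j < 128; ++j) {
#pragma HLS UNROLL factor=8
        acc += w[j] * x[j];   // 8 multiply-accumulates per unrolled group
    }
    return acc;
}
```

In practice the weight and input arrays would also need to be partitioned (e.g., with #pragma HLS ARRAY_PARTITION) so the unrolled iterations can access memory in parallel, in line with the array-partitioning optimization mentioned in the abstract.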
5) Fixed Point + Matrix Multiplication Engine (FixedPointMME - Best): For our best design, we use the fixed-point implementation in a parallelized approach to convert layer computations into multiple MAC operations (refer to Section IV-D for details). For the first dilated convolution layer we set num_parallel_out and num_parallel_in to 1, since the number of input channels is just 1. For all other layers, including the fully connected layer, we set num_parallel_out to 8 and num_parallel_in to 4 to get the best throughput under the constraint of available resources.
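A hedged sketch of a matrix-vector engine with the two parallelism knobs described above (an illustrative reconstruction, not the actual Section IV-D implementation; NUM_PARALLEL_OUT and NUM_PARALLEL_IN mirror the tunable parameters):

```cpp
#include "ap_fixed.h"

typedef ap_fixed<27, 8> data_t;

// y = W * x with two levels of parallelism: NUM_PARALLEL_OUT output rows are
// updated concurrently, and each row consumes NUM_PARALLEL_IN inputs per step.
template <int ROWS, int COLS, int NUM_PARALLEL_OUT, int NUM_PARALLEL_IN>
void matvec(const data_t W[ROWS][COLS], const data_t x[COLS], data_t y[ROWS]) {
init_loop:
    for (int r = 0; r < ROWS; ++r) {
        y[r] = 0;
    }
row_tile:
    for (int r0 = 0; r0 < ROWS; r0 += NUM_PARALLEL_OUT) {
col_tile:
        for (int c0 = 0; c0 < COLS; c0 += NUM_PARALLEL_IN) {
out_par:
            for (int r = 0; r < NUM_PARALLEL_OUT; ++r) {
#pragma HLS UNROLL
in_par:
                for (int c = 0; c < NUM_PARALLEL_IN; ++c) {
#pragma HLS UNROLL
                    y[r0 + r] += W[r0 + r][c0 + c] * x[c0 + c];  // parallel MAC lanes
                }
            }
        }
    }
}
```

With num_parallel_out = 8 and num_parallel_in = 4, each pass of the two unrolled inner loops issues 8 × 4 = 32 multiply-accumulate operations in parallel.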
TABLE III: Power consumption and wall-clock time required when generating 1-second audio for different implementations.

Implementation | Time (s) for 1-Second Audio Generation | Power (W)
CPU (NumPy) | 732 |
GPU - NVIDIA Titan Xp (TensorFlow) | 120 | 70
GPU - NVIDIA Tesla V100 (TensorFlow) | 85 | 66
FPGA - FloatingPointPipeline | 87 | 10.2
FPGA - FixedPointUnrolling | 41 | 7.6
FPGA - FixedPointMME (Best) | 11 | 23

C. Performance and Power Analysis

Table III illustrates the performance and power consumption for our implemented designs and for highly optimized CPU and GPU implementations. We benchmark the optimized TensorFlow implementation of Fast-Wavenet on two GPUs: NVIDIA TITAN Xp and NVIDIA Tesla V100. The CPU implementation is the NumPy inference program written by us and fully optimized. We measure the power consumption for the GPU benchmarks using the NVIDIA power measurement tool (nvidia-smi) running on the Linux operating system, which is invoked during program execution. For our FPGA implementations, we synthesize our designs using Xilinx Vivado v2017.4. We then integrate the synthesized modules, accompanied by the corresponding peripherals, into a system-level schematic using the Vivado IP Integrator. The frequency is set to 150 MHz and power consumption is estimated using the synthesis tool.

As shown, our best FPGA implementation achieves an 11× speed-up in audio generation while being 3× more power-efficient compared to the NVIDIA Titan Xp GPU. Compared to the NumPy-based CPU implementation, our best design is 66× faster.

VI. PRIOR WORKS ON ACCELERATING DNNS FOR FPGAS

Prior works have made significant efforts in compressing Deep Neural Networks (DNNs) to support fast, energy-efficient applications. However, recent research on DNNs is still increasing the depth of models and introducing new architectures, resulting in a higher number of parameters per network and higher computational complexity. Other than CPUs and GPUs, FPGAs are becoming a platform candidate to achieve energy-efficient neural network computation [12], [13], [22], [24]–[27]. Equipped with the necessary hardware for basic DNN operations, FPGAs are able to achieve high parallelism and utilize the properties of neural network computation to remove unnecessary logic. Algorithm explorations also show that neural networks can be simplified to become more hardware-friendly without sacrificing the accuracy of the model. Therefore, it has become possible to achieve increased speedup and higher energy efficiency on FPGAs compared to CPU and GPU platforms [?], [28] while maintaining state-of-the-art accuracy. Prior efforts have been made in FPGA acceleration of speech recognition, classification and language modelling using Recurrent Neural Networks [14], [16], [27]; however, the challenges in generation of sequences with long-term dependencies, particularly in the audio domain, have not been addressed.

VII. CONCLUSION

We present the first accelerator platform for deep autoregressive convolutional neural networks. While prior works have proposed algorithms for making the inference of such networks faster on GPUs and CPUs, they do not exploit the potential parallelism offered by FPGAs. We develop a systematic approach to accelerate the inference of WaveNet-based neural networks by optimizing their fundamental computational blocks and utilizing only on-chip memory. We demonstrate the effectiveness of using FPGAs for fast audio generation by achieving a significant speed-up over prior efforts on CPU and GPU based implementations.

REFERENCES

[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "Wavenet: A generative model for raw audio," in SSW, 2016, p. 125.
[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[3] A. Gibiansky, S. Arik, G. Diamos, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, "Deep Voice 2: Multi-speaker neural text-to-speech," in Advances in Neural Information Processing Systems, 2017, pp. 2962–2970.
[4] K. Qian, Y. Zhang, S. Chang, X. Yang, D. A. F. Florêncio, and M. Hasegawa-Johnson, "Speech enhancement using Bayesian WaveNet," in INTERSPEECH, 2017.
[5] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, "Neural machine translation in linear time," CoRR, vol. abs/1610.10099, 2016.
[6] Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick, "Improved variational autoencoders for text modeling using dilated convolutions," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 3881–3890.
[7] A. van den Oord, T. Walters, and T. Strohman, "WaveNet launches in the Google Assistant," 2017. [Online]. Available: https://deepmind.com/blog/wavenet-launches-google-assistant/
[8] R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," in Backpropagation, Y. Chauvin and D. E. Rumelhart, Eds. L. Erlbaum Associates Inc., 1995, pp. 433–486.
[9] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," CoRR, vol. abs/1803.01271, 2018.
[10] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4784–4788.
[11] T. L. Paine, P. Khorrami, S. Chang, Y. Zhang, P. Ramachandran, M. A. Hasegawa-Johnson, and T. S. Huang, "Fast wavenet generation algorithm," arXiv preprint arXiv:1611.09482, 2016.
[12] M. Samragh, M. Ghasemzadeh, and F. Koushanfar, "Customizing neural networks for efficient FPGA implementation," in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2017, pp. 85–92.
[13] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to FPGAs," in The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 2016, p. 17.
[14] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung, "FPGA-based low-power speech recognition with recurrent neural networks," in 2016 IEEE International Workshop on Signal Processing Systems (SiPS). IEEE, 2016, pp. 230–235.
[15] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. B. J. Dally, "ESE: Efficient speech recognition engine with sparse LSTM on FPGA," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. ACM, 2017, pp. 75–84.
[16] S. Li, C. Wu, H. Li, B. Li, Y. Wang, and Q. Qiu, "FPGA acceleration of recurrent neural network based language model," in Proceedings of the 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, ser. FCCM '15. IEEE Computer Society, 2015, pp. 111–118.
[17] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in INTERSPEECH, 2017.
[18] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, "WaveNet vocoder with limited training data for voice conversion," in Proc. Interspeech, 2018, pp. 1983–1987.
[19] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall, "English conversational telephone speech recognition by humans and machines," in Proc. Interspeech 2017, 2017, pp. 132–136.
[20] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.
[21] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[22] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[23] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, Inc., 1993.
[24] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, "Accelerating deep convolutional neural networks using specialized hardware," Microsoft Research Whitepaper, vol. 2, no. 11, 2015.
[25] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 16–25.
[26] C. Shea, A. Page, and T. Mohsenin, "ScaleNet: A scalable low power accelerator for real-time embedded deep neural networks," in Proceedings of the 2018 Great Lakes Symposium on VLSI. ACM, 2018, pp. 129–134.
[27] Y. Guan, Z. Yuan, G. Sun, and J. Cong, "FPGA-based accelerator for long short-term memory recurrent neural networks," 2017, pp. 629–634.
[28] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, "A survey of FPGA-based neural network accelerator," ACM Transactions on Reconfigurable Technology and Systems, vol. 9, no. 4, 2017.