Article 1
ARTICLE INFO

Keywords: Complex cube root; Computational modeling; Computer architecture; Hardware implementation; Field programmable gate arrays; Digital signal processing

ABSTRACT

This study presented an algorithm for fast hardware execution of complex cube root. In this algorithm, which is based on the Laurent series of ∛z function, first, the z-plane's numbers are mapped by using a rapid scaling and rotation operation to a pre-specified limited region, and then the sequences of the series are computed. The parameters of the algorithm are thoroughly analyzed and selected to achieve high precision. The algorithm has been implemented on a field programmable gate array-based platform using the Simulink HDL Coder tool and Xilinx ISE 14.7. In addition, the resource usage and speed parameters are carefully examined for the implementation of each step of the algorithm. Hardware was implemented in two 56-bit and 32-bit versions (for comparison). The 32-bit version occupies 140 slice Regs, 421 slice LUTs, and 5 DSP48s. The hardware with the capability of computing complex cube roots has appropriate specifications comparable with those of previous implementations of real cube root calculation on FPGA.
1. Introduction

Cube root calculation, a fundamental operation in solving cubic and quartic equations, is a complicated operation used in some digital signal processing applications [1–6]. Different algorithms and implementations have been proposed for this calculation [7–21]. For example, field programmable gate array (FPGA)-based hardware was presented for the cube root calculation of 32-bit floating point input numbers in accordance with the IEEE 754-2008 standard format [19]. This hardware separates the exponent and mantissa sections of the input number, which are 8 bits and 23 bits respectively, and then the results of dividing the exponent part by three, namely the quotient and remainder, are obtained without calculation via a read-only memory (ROM). Further, the cubic root of the mantissa is calculated by applying the Newton-Raphson relationship for the cube root function, which also requires a Newton-Raphson approximation for the reciprocal function. After implementation on a Virtex 5 FPGA, all of the steps take a latency of 19 clock cycles.

In another study [20], based on the mathematical relationship for the third power of a two-digit number ((pq)^3), first 'p' and then 'q' is obtained. Then some mathematical expressions were provided, showing the relationships between a 32-bit real fixed-point number and its cubic root bits. In that study, the input number was extended to 33 bits by adding a zero bit in its most significant position and was then divided into eleven 3-bit sections. Next, these sections were used to calculate each of the cubic root bits from the most significant bit (MSB) to the least significant bit by solving a conditional first-order equation. Implementation of this algorithm needs 1 multiplier, 5 adders and several multiplexers and registers. After implementation on FPGA, the resulting hardware performs the computations in 13 clock cycles for 32-bit input numbers.

In [21], the cubic root of a fixed-point 32-bit binary-coded decimal (BCD) real number was computed on FPGA based on the long division method. Like the previous work, its presentation of the mathematical theory begins with the third power of a two-digit number. Here, however, unlike the previous work, which focused on bit-by-bit recovery, the authors focused on obtaining mathematical relations for the recovery of the decimal digits of the cube root. It considers the cubic root as a three-digit decimal number (Y1 Y2 Y3), and then the digits of this number are calculated from left to right. Y1 is obtained directly from the eight MSBs of the input by using some conditional relations. Subsequently, Y2 and Y3, in order, are obtained by solving two second-degree algebraic equations through a trial-and-error method. The coefficients of the equations are obtained by using the previously discovered digits of the cubic root and the remaining bits of
✰ Simulink and VHDL files of the designed hardware are available at https://disk.yandex.com/d/GicXvykrtTlpuA.
* Corresponding author.
E-mail address: [email protected] (E. Rajaby).
https://doi.org/10.1016/j.micpro.2023.104847
Received 26 January 2023; Received in revised form 29 March 2023; Accepted 26 April 2023
Available online 16 May 2023
0141-9331/© 2023 Elsevier B.V. All rights reserved.
E. Rajaby et al. Microprocessors and Microsystems 100 (2023) 104847
The algorithms used in previous works for computing the real cubic
root are not generally suitable for complex cube root computing. For
instance, using Newton-Raphson-based methods [7,10] requires the
division or reciprocal operation in the complex domain with high
computational resources. In addition, the Newton-Raphson algorithm
requires a carefully chosen initial value before running;
with a poor initial value, the algorithm may not converge at all
or may converge only after many iterations. Several other proposed algorithms in
other studies [8,9,14] are not compatible with the complex domain or
they need some major modifications to be used for complex values.
The proposed method in this work is a division-free algorithm whose convergence is guaranteed inside a known region. It is based on the Laurent series expansion of the cube root function in the complex domain. The function is holomorphic at point 'a' and its expansion is expressed as:

$$\tilde{z}_n^{1/3} = a^{1/3} + \frac{z-a}{3a^{2/3}} - \frac{(z-a)^2}{9a^{5/3}} + \dots + a^{\frac{1}{3}-n}\,(z-a)^n \binom{1/3}{n}, \quad n \in [0, \infty),\ (\text{converges when } |a - z| < |a|) \tag{1}$$

To avoid redundant computations, the relationship between each term in the series and its previous one is used instead of applying explicit formulae in the sequence:

$$r_t = r_{t-1} \cdot \frac{\left(\frac{1}{3} - t + 1\right)\cdot(z-a)}{t \cdot a}, \quad r_0 = a^{1/3} \tag{2}$$

Further, to avoid long-bit words in hardware implementation, first, the input number is scaled around a specific constant value of 'a', and then after computing the series, the scaling is compensated by a reverse operation. If value 'a' is far from input value 'z', the scaling operations take longer and thus the latency of the system increases. Furthermore, the value of 'a' in Eqs. (1) and (2) can affect the complexity of the computation. Given these considerations, 'a' is set to 1, and input value 'z' is considered in the convergence zone C expressed by:

$$|z - 1| < 1 \tag{3}$$

A mapping operation moves the input values that are not inside the convergence zone into the zone. The evaluation of Eq. (3) in the complex domain is equivalent to the evaluation of Eq. (4) in the real domain:

$$\left((\mathrm{real}(z) - 1)^2 + \mathrm{imag}(z)^2\right)^{1/2} < 1 \tag{4}$$

To make the computation simple, a part of region C, namely S, is considered for the mapping operations. It is characterized by the following relationships:

$$\mathrm{real}(z) \ge \mathrm{imag}(z) \tag{5}$$

$$\mathrm{real}(z) \ge -\mathrm{imag}(z) \tag{6}$$

$$\frac{h}{8} < \lVert z \rVert_1 < h \tag{7}$$

where $\lVert z \rVert_1$ denotes the 1-norm of z, and h and h/8 are two boundary values of region S that are located on the real axis. Region S, which resembles a rhombus with a cut corner, is shown in Fig. 1. The region has two main properties. First, with the scaling factor of 8, each point outside the region is mapped only once in the region, and second, point zero (which is a divergence point) and all the points near zero, which slowly converge, are not located in the region.

Fig. 2. Mapping operation on the input number.
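The recurrence in Eq. (2) can be sketched in software as follows. This is a floating-point illustration only, assuming a = 1 by default; the function name `cube_root_series` is hypothetical, and the sketch does not model the fixed-point word lengths or the structure of the hardware.

```python
def cube_root_series(z: complex, n_terms: int = 40, a: complex = 1.0) -> complex:
    """Partial sum of the series in Eq. (1), built term by term with Eq. (2).

    Assumption: plain double-precision arithmetic, not the paper's fixed-point format.
    """
    r = complex(a) ** (1.0 / 3.0)  # r_0 = a^(1/3)
    total = r
    for t in range(1, n_terms + 1):
        # Eq. (2): each term is the previous term times a simple factor.
        r *= (1.0 / 3.0 - t + 1.0) * (z - a) / (t * a)
        total += r
    return total
```

For an input inside the convergence zone of Eq. (3), the partial sum can be compared with a library cube root, for example `cube_root_series(1.2 + 0.3j)` against `(1.2 + 0.3j) ** (1 / 3)`.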
The mapping operation for a sample input number 'z' outside region S is illustrated in Fig. 2. As shown, scaling and rotation operations are performed for the mapping operation. These operations and their corresponding reverse operations after computing the cube root of the mapped data should not be complex for the hardware implementation. Hence, the two above-mentioned operations and their corresponding reverse operations are performed by Eqs. (8) and (9), as well as (10) and (11), respectively:

$$z_1 = z \cdot 8^m, \quad m \in \mathbb{Z} \tag{8}$$

$$z_2 = z_1 \cdot e^{\frac{jk\pi}{2}}, \quad k \in \{1, 2, 3\} \tag{9}$$

$$\tilde{z}_1^{1/3} = \tilde{z}_2^{1/3} \cdot e^{-\frac{jk\pi}{6}} \tag{10}$$

$$\tilde{z}^{1/3} = \tilde{z}_1^{1/3} \cdot 2^{-m} \tag{11}$$

The value of parameter 'm' in Eqs. (8) and (11) is determined by the constraint defined in (7). The value of 'h' in Eq. (7) can be in the interval of 0–2. This value affects the precision of the calculated result, that is, the estimation error, so finding its optimal value is necessary. The error is estimated by the absolute difference between the exact root value and the estimated value:

$$R_n(z) = \left| \tilde{z}_n^{1/3} - z^{1/3} \right| \tag{12}$$

To determine the average of relative remainders (errors) on region S for a specified value of 'h', it is necessary to compute the following formula:

$$AR_n = \frac{\iint_S \frac{R_n(z)}{|z|^{1/3}}\, ds}{\iint_S ds} \tag{13}$$

As expressed in Eq. (14), the formula cannot be calculated in the usual way of using uniform gridding.

$$\frac{\iint \frac{R_n(z)}{|z|^{1/3}}\, dx\, dy}{\iint_S ds}, \quad (z = x + iy) \tag{14}$$

The reason is that the input values are not uniformly scaled to their corresponding mapping values in region S; therefore, there is a need for geometrical gridding in which the distances of the grid lines for the integration vary accordingly. In other words, the number of points considered for the integration should not change with the value of 'h'. Hence, to compute the average of relative remainders of selected uniformly distributed points (here $2^{18}$ points) on the entire input range, first, they are mapped to region S, and then the mapped points are placed on a geometrical grid for the integration, followed by computing the relative error of each point. This process is expressed as follows:

$$AR_n = \frac{\int_{-2^{15}}^{2^{15}-1} \int_{-2^{15}}^{2^{15}-1} \frac{R_n(\mathrm{map}(x+iy))}{|\mathrm{map}(x+iy)|^{1/3}}\, dx\, dy}{\int_{-2^{15}}^{2^{15}-1} \int_{-2^{15}}^{2^{15}-1} dx\, dy} \tag{15}$$

Fig. 3 displays the average error values for five different values of 'n' in Eq. (1) and for different values of 'h'. Based on these results, the values of n = 40 and h = 1.93 are chosen for the implementation.

Fig. 3. Average error obtained by Eq. (15) for different values.

To fulfill condition (7), one way is to multiply or divide $\lVert z \rVert_1$ by eight repeatedly until the condition becomes true. However, this method is time-consuming, so the following method is employed instead. From Eq. (7), we have:

$$\frac{1}{8} \le \lVert z \rVert_1 \cdot \frac{1}{h} \cdot 8^m < 1 \tag{16}$$

or

$$-1 \le \log_8\left(\lVert z \rVert_1 \cdot \frac{1}{h}\right) + m < 0 \tag{17}$$

By defining $\lVert z \rVert_1'$ as:

$$\lVert z \rVert_1' = \lVert z \rVert_1 \cdot \frac{1}{h} \tag{18}$$

(17) can be re-written as:

$$-\log_8\left(\lVert z \rVert_1'\right) - 1 \le m < -\log_8\left(\lVert z \rVert_1'\right) \tag{19}$$

By using (19), scaling factor 'm' can be calculated as:

$$m = -\left\lfloor \log_8\left(\lVert z \rVert_1'\right) \right\rfloor - 1 = -\left\lfloor \frac{\left\lfloor \log_2\left(\lVert z \rVert_1'\right) \right\rfloor}{3} \right\rfloor - 1 \tag{20}$$

The value of $\lfloor \log_2(\lVert z \rVert_1') \rfloor$ can be obtained directly from the binary representation of $\lVert z \rVert_1'$. It is equal to the Most Significant '1' Position (MS1P()) in the binary representation of the number, minus the fractional length, which is 40 here. Hence, 'm' can be expressed as:

$$m = -\left\lfloor \frac{\mathrm{MS1P}\left(\lVert z \rVert_1'\right) - 40}{3} \right\rfloor - 1 \tag{21}$$

For the rotation step, Eqs. (5) and (6) provide the approximate angular position of input 'z' in the complex plane, and parameter 'k' in Eqs. (9) and (10) is determined based on that position. The implementation of Eq. (10), considering its term $e^{-\frac{jk\pi}{6}}$, is more complicated than Eq. (9). Terms $e^{\frac{j\pi}{2}}$, $e^{\frac{j2\pi}{2}}$ and $e^{\frac{j3\pi}{2}}$ are equal to j, −1, and −j, respectively, in Eq. (9). To implement Eq. (10), considering the modular property of the phase, parameter k' is defined as expressed in Eq. (22), and the reverse rotation is applied by using Eq. (23):

$$\left(e^{\frac{jk'\pi}{2}}\right)^3 = \left(e^{-\frac{jk\pi}{6}}\right)^3 \;\Rightarrow\; e^{\frac{j3k'\pi}{2}} = e^{-\frac{jk\pi}{2}} \tag{22}$$

Table 1
Values of k and k' for different evaluations of (5) and (6).
Evaluation of (5) | False | False | True | True
Evaluation of (6) | False | True | False | True
k | 2 | 3 | 1 | 0
k' | 2 | 3 | 1 | 0
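As a side check that is not taken from the paper, the condition in Eq. (22) can be solved for k' with modular arithmetic. The result is consistent with the identical k and k' rows of Table 1:

$$
\begin{aligned}
e^{\frac{j3k'\pi}{2}} = e^{-\frac{jk\pi}{2}}
&\iff 3k' \equiv -k \pmod 4 \\
&\iff k' \equiv -3k \equiv k \pmod 4 \qquad (\text{since } 3 \cdot 3 \equiv 1 \pmod 4)
\end{aligned}
$$

So k' = k for k ∈ {0, 1, 2, 3}. Under this reading, a reverse rotation by $e^{\frac{jk'\pi}{2}}$ needs only the factors 1, j, −1 and −j, and it still yields a valid cube root of $z_1$, though possibly on a different branch than the factor $e^{-\frac{jk\pi}{6}}$ of Eq. (10) would give.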
Fig. 4. Block diagram of the proposed hardware; details of step 2 are shown in Fig. 5, step 4 in Fig. 6, steps 5 and 7 in Fig. 7, and step 6 in Fig. 8.
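The mapping and reverse-mapping steps around the series generator can be sketched in floating point as below. Several points are assumptions rather than details confirmed by the text: the 1-norm is taken as |real(z)| + |imag(z)|, `math.log2` stands in for the MS1P-based logic of Eq. (21), and the names `map_to_region_s` and `unmap_cube_root` are hypothetical.

```python
import cmath
import math

# Rotation factors exp(j*k*pi/2) for k = 0..3, written exactly.
ROTATIONS = (1 + 0j, 1j, -1 + 0j, -1j)

# Table 1: (Eq. (5) holds, Eq. (6) holds) -> k
K_TABLE = {(False, False): 2, (False, True): 3, (True, False): 1, (True, True): 0}


def map_to_region_s(z: complex, h: float = 1.93):
    """Return (z2, m, k) with z2 = z * 8**m * exp(j*k*pi/2) inside region S."""
    norm1 = abs(z.real) + abs(z.imag)  # assumed definition of the 1-norm
    if norm1 == 0.0:
        raise ValueError("zero cannot be mapped into region S")
    # Eq. (20): m = -floor(floor(log2(norm1 / h)) / 3) - 1
    m = -(math.floor(math.log2(norm1 / h)) // 3) - 1
    z1 = z * 8.0 ** m  # Eq. (8)
    k = K_TABLE[(z1.real >= z1.imag, z1.real >= -z1.imag)]
    z2 = z1 * ROTATIONS[k]  # Eq. (9)
    return z2, m, k


def unmap_cube_root(root_z2: complex, m: int, k: int) -> complex:
    """Undo the mapping on a cube root of z2, following Eqs. (10) and (11)."""
    return root_z2 * cmath.exp(-1j * k * math.pi / 6) * 2.0 ** (-m)
```

In this sketch any cube root of the mapped value can be passed to `unmap_cube_root`; for instance `z2 ** (1 / 3)` from the standard library can stand in for the series of Eq. (1) when only the mapping logic is being checked.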
Fig. 7. The rotation unit.

connected select bit in the main data bit word. The output range of the MS1P function is [0 55]. To avoid the implementation of "division by 3" and floor rounding operations, two 54-point lookup tables, as expressed in Eq. (24), are used to provide 3* … 1 (for reverse rotation) values in the output. This unit is illustrated in Fig. 5.

Fig. 8. The unit for step 6 of the algorithm (a Laurent series generator).

$$r_{t-1} \tag{26}$$

$$x_t = \frac{\frac{4}{3} - t}{t}, \quad t \in \{1, 2, \dots, 40\} \;\Rightarrow\; \mathrm{LUT3} = \left\{ \frac{1}{3}, -\frac{1}{3}, \dots, -\frac{116}{120} \right\} \tag{27}$$

$$z - 1 \tag{28}$$

Using the four above-mentioned values, the following calculations are performed for the series calculation. Eq. (26) is multiplied by Eq. (27) by using two R_multipliers as expressed in Eq. (29). Then, Eq. (28) is multiplied by the result of Eq. (29) through a C_multiplier as expressed in Eq. (30). To increase the efficiency, as expressed in Eq. (31), the C_multiplier is made up of 3 R_multipliers and 5 R_adders instead of the conventional 4 R_multipliers and 2 R_adders.
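A complex multiplication with 3 real multiplications and 5 real additions or subtractions can be arranged in several ways. Eq. (31) is not reproduced in this excerpt, so the sketch below shows one standard arrangement with the same operation counts. It may differ from the authors' exact form, and the name `cmul3` is hypothetical.

```python
def cmul3(a: float, b: float, c: float, d: float):
    """Return (real, imag) of (a + jb) * (c + jd).

    Uses 3 real multiplications and 5 real additions/subtractions,
    versus 4 multiplications and 2 additions in the direct form.
    """
    k1 = c * (a + b)  # 1 add, 1 multiply
    k2 = a * (d - c)  # 1 subtract, 1 multiply
    k3 = b * (c + d)  # 1 add, 1 multiply
    real = k1 - k3    # a*c - b*d
    imag = k1 + k2    # a*d + b*c
    return real, imag
```

The trade is one multiplier for three extra adders, which tends to pay off on devices where multipliers (such as DSP48 slices) are scarcer than adder logic.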
Table 2
Some sample complex cube root calculation results obtained by the proposed hardware.
Number | Exact cube root | Cube root by the 56-bit hardware | Relative error

Table 3
The resource utilization of the 56-bit cube root hardware after place and route.
Device utilization summary
Logic utilization | Used | Available | Utilization

Algorithm 1
The steps of the proposed complex cube root calculation.
Input: z

Table 4
Comparison of the present work with some similar works.
Ref. | Year | Type | Device | Resource utilization | Computation time (ns) | Max error | Input range and precision
Additionally, the 32-bit version circuit computed eight sequences of the Laurent series to obtain a specific relative error of less than 0.29% in this study.

The timing and resource usage of the 32-bit version design and of some similar works that compute the real cube root are presented in Table 4. The 32-bit version can work at 65.14 MHz clock frequency and needs 8 clock cycles to complete the operation. Our design has comparatively reasonable resource usage, speed, and precision while having the capability of computing complex roots. The hardware presented in [20] is superior to our work only in terms of resource utilization. The hardware reported in [19] is a suitable choice when the numbers are real, but it has high resource usage. For fair comparison, our 32-bit version hardware, like the other works, was implemented on a Virtex 5 FPGA by ISE 14.7.

5. Conclusion

The proposed hardware in this work computes complex and real cube roots by detecting the approximate absolute value and angular position and then using a mapping (shift and rotation) process and computing the Laurent series. It is fast, with efficient resource usage due to techniques such as computational reuse, converting multiplications to add-shift operations, and using pre-computed data. The design can be utilized in different applications. For example, based on the application, the bit width of the signals and the number of sequences of the Laurent series can be modified to achieve the desired run time and precision. In this work we implemented two cases, 56-bit and 32-bit. With the proposed design, higher-order roots can be computed by changing the precomputed coefficients of the Laurent series.

Declaration of Competing Interest

Supplementary materials

References

[8] O. Ahmadi, F.R. Henriquez, Low complexity cubing and cube root computation over $\mathbb{F}_{3^m}$ in polynomial basis, IEEE Trans. Comput. 59 (10) (2010) 1297–1308.
[9] Y. Li, W. Chu, On the improved implementations and performance evaluation of digit-by-digit integer restoring and non-restoring cube root algorithms, in: 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), 2016, pp. 1–5.
[10] S. Yammen, J. Ieamsaard, Newton's cube root finding data sequence, in: 2021 9th International Electrical Engineering Congress (iEECON), 2021, pp. 405–407.
[11] V. Pieterse, P. Black, Cube root, Dictionary of Algorithms and Data Structures, 2009.
[12] L. Moroz, V. Samotyy, C.J. Walczyk, J.L. Cieśliński, Fast calculation of cube and inverse cube roots using a magic constant and its implementation on microcontrollers, Energies 14 (4) (2021) 1058.
[13] M.S.B. Mohamad, An algorithms for finding the cube roots in finite fields, Procedia Comput. Sci. 179 (2021) 838–844.
[14] G.H. Cho, S. Kwon, H.-S. Lee, A refinement of Müller's cube root algorithm, Finite Fields and Their Applications 67 (2020) 101708.
[15] C. Zhou, H. Geng, P. Wang, C. Guo, Ten-input cube root logic computation with rational designed DNA nanoswitches coupled with DNA strand displacement process, ACS Appl. Mater. Interfaces 12 (2) (2019) 2601–2606.
[16] J. Jo, I.-C. Park, Low-latency low-cost architecture for square and cube roots, IEICE Trans. Fundam. Electr. Commun. Comput. Sci. 100 (9) (2017) 1951–1955.
[17] A. Pineiro, J.D. Bruguera, F. Lamberti, P. Montuschi, A radix-2 digit-by-digit architecture for cube root, IEEE Trans. Comput. 57 (4) (2008) 562–566, https://doi.org/10.1109/TC.2007.70848.
[18] G.H. Cho, N. Koo, E. Ha, S. Kwon, New cube root algorithm based on the third order linear recurrence relations in finite fields, Designs Codes Cryptogr. 75 (3) (2015) 483–495.
[19] C.M. Guardia, E. Boemo, FPGA implementation of a binary32 floating point cube root, in: 2014 IX Southern Conference on Programmable Logic (SPL), Nov. 2014, pp. 1–6, https://doi.org/10.1109/SPL.2014.7002202.
[20] R.V.W. Putra, T. Adiono, Optimized hardware algorithm for integer cube root calculation and its efficient architecture, in: 2015 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), 2015, pp. 263–267.
[21] S.K. Padhan, S. Gadtia, B. Bhoi, FPGA based implementation for extracting the roots of real number, Alexandria Eng. J. 55 (3) (Sep. 2016) 2849–2854, https://doi.org/10.1016/j.aej.2016.07.003.