Microprocessors and Microsystems 100 (2023) 104847

Contents lists available at ScienceDirect

Microprocessors and Microsystems


journal homepage: www.elsevier.com/locate/micpro

Hardware design and implementation of high-efficiency cube-root of complex numbers✰
Elias Rajaby *, Sayed Masoud Sayedi, Ehsan Yazdian
Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran

A R T I C L E  I N F O

Keywords: Complex cube root; Computational modeling; Computer architecture; Hardware implementation; Field programmable gate arrays; Digital signal processing

A B S T R A C T

This study presents an algorithm for fast hardware execution of the complex cube root. In this algorithm, which is based on the Laurent series of the ∛z function, the numbers of the z-plane are first mapped to a pre-specified limited region by a rapid scaling and rotation operation, and then the terms of the series are computed. The parameters of the algorithm are thoroughly analyzed and selected to achieve high precision. The algorithm has been implemented on a field programmable gate array-based platform using the Simulink HDL Coder tool and Xilinx ISE 14.7. In addition, the resource usage and speed parameters are carefully examined for the implementation of each step of the algorithm. The hardware was implemented in 56-bit and 32-bit versions (the latter for comparison). The 32-bit version occupies 140 slice registers, 421 slice LUTs, and 5 DSP48s. The hardware, which is capable of computing complex cube roots, has specifications comparable with those of previous FPGA implementations of real cube root calculation.

✰ Simulink and VHDL files of the designed hardware are available at https://disk.yandex.com/d/GicXvykrtTlpuA.
* Corresponding author.
E-mail address: [email protected] (E. Rajaby).

https://doi.org/10.1016/j.micpro.2023.104847
Received 26 January 2023; Received in revised form 29 March 2023; Accepted 26 April 2023
Available online 16 May 2023
0141-9331/© 2023 Elsevier B.V. All rights reserved.

1. Introduction

Cube root calculation, a fundamental operation in solving cubic and quartic equations, is a complicated operation used in some digital signal processing applications [1–6]. Different algorithms and implementations have been proposed for this calculation [7–21]. For example, field programmable gate array (FPGA)-based hardware was presented in [19] for the cube root calculation of 32-bit floating-point input numbers in accordance with the IEEE 754–2008 standard format. This hardware separates the exponent and mantissa sections of the input number, which are 8 bits and 23 bits respectively; the quotient and remainder of dividing the exponent by three are then obtained without calculation from a read-only memory (ROM). Further, the cube root of the mantissa is calculated by applying the Newton-Raphson relationship for the cube root function, which also requires a Newton-Raphson approximation of the reciprocal function. After implementation on a Virtex 5 FPGA, all of the steps together take a latency of 19 clock cycles.

In another study [20], based on the mathematical relationship for the third power of a two-digit number ((pq)^3), first 'p' and then 'q' is obtained. Mathematical expressions were then provided that show the relationships between a 32-bit real fixed-point number and the bits of its cube root. In that study, the input number was extended to 33 bits by adding a zero bit in its most significant position and was then divided into eleven 3-bit sections. Next, these sections were used to calculate each of the cube root bits, from the most significant bit (MSB) to the least significant bit, by solving a conditional first-order equation. Implementation of this algorithm needs 1 multiplier, 5 adders, and several multiplexers and registers. After implementation on an FPGA, the resulting hardware performs the computations in 13 clock cycles for 32-bit input numbers.

In [21], the cube root of a fixed-point 32-bit binary-coded decimal (BCD) real number was computed on an FPGA based on the long division method. As in the previous work, the presentation of the mathematical theory begins with the third power of a two-digit number. However, unlike the previous work, which focused on bit-by-bit recovery, the authors focused on obtaining mathematical relations for the recovery of the decimal digits of the cube root. The cube root is considered as a three-digit decimal number (Y1 Y2 Y3), and the digits of this number are calculated from left to right. Y1 is obtained directly from the eight MSBs of the input by using some conditional relations. Subsequently, Y2 and Y3 are obtained in order by solving two second-degree algebraic equations through a trial-and-error method. The coefficients of the equations are obtained from the previously discovered digits of the cube root and the remaining bits of the input number. The design was implemented on a Virtex 5 device. The hardware needs at least 10 clock cycles to complete the computations.
In some applications, there is a need for computing the cube root of complex numbers; for instance, executing the syndrome decoding algorithm in the Sparse Fast Fourier Transform (SFFT) requires solving cubic equations and calculating the complex cube roots that arise within them [5,6].

In this paper, a fast algorithm and its hardware implementation are presented for the cube root calculation of complex numbers. To improve the speed and area usage of the design on the FPGA, each mathematical relationship involved in the algorithm has been implemented through a low-level design approach that carefully manages its resource and time consumption.

The remaining sections of the paper are organized as follows: Section 2 describes the proposed algorithm and the related mathematical relationships. A detailed implementation of the relationships and the architecture of the designed hardware are discussed in Section 3. Section 4 presents the results of the FPGA implementation and compares them with those of some previous works. Finally, the main findings are provided in Section 5.

Fig. 1. Convergence zone C, and subregion S.

2. The proposed algorithm

The algorithms used in previous works for computing the real cube root are generally not suitable for complex cube root computation. For instance, Newton-Raphson-based methods [7,10] require a division or reciprocal operation in the complex domain, which consumes high computational resources. In addition, the Newton-Raphson algorithm needs a fairly strictly chosen initial value before running; otherwise, the algorithm may not converge at all or may converge only after many iterations. Several algorithms proposed in other studies [8,9,14] are not compatible with the complex domain, or they need major modifications to be used for complex values.
The proposed method in this work is a division-free algorithm that converges inside a known zone and is based on the Laurent series expansion of the cube root function in the complex domain. The function is holomorphic at point 'a' and its expansion is expressed as:

$$\tilde{z}_n^{1/3} = a^{1/3} + \frac{1}{3}\,\frac{z-a}{a^{2/3}} - \frac{1}{9}\,\frac{(z-a)^2}{a^{5/3}} + \dots + a^{\frac{1}{3}-n}\,(z-a)^n \binom{1/3}{n}, \qquad n \in [0,\infty),\ (\text{converges when } |a-z| < |a|) \tag{1}$$

To avoid redundant computations, the relationship between each term in the series and its previous one is used instead of applying explicit formulae for each term:

$$r_t = r_{t-1}\cdot\frac{\left(\frac{1}{3}-t+1\right)\cdot(z-a)}{t\cdot a}, \qquad r_0 = a^{1/3} \tag{2}$$

Further, to avoid long-bit words in the hardware implementation, the input number is first scaled around a specific constant value of 'a', and then, after computing the series, the scaling is compensated by a reverse operation. If value 'a' is far from input value 'z', the time spent on scaling operations, and thus the latency of the system, increases. Furthermore, the value of 'a' in Eqs. (1) and (2) can affect the complexity of the computation. Based on these considerations, 'a' is set to 1, and input value 'z' is considered in the convergence zone C expressed by:

$$|z-1| < 1 \tag{3}$$

A mapping operation moves the input values that are not inside the convergence zone into the zone. The evaluation of Eq. (3) in the complex domain is equivalent to the evaluation of Eq. (4) in the real domain:

$$\left((\mathrm{real}(z)-1)^2 + \mathrm{imag}(z)^2\right)^{1/2} < 1 \tag{4}$$

To make the computation simple, a part of region C, namely S, is considered for the mapping operations. It is characterized by the following relationships:

$$\mathrm{real}(z) \ge \mathrm{imag}(z) \tag{5}$$

$$\mathrm{real}(z) \ge -\mathrm{imag}(z) \tag{6}$$

$$\frac{h}{8} < \|z\|_1 < h \tag{7}$$

where ‖z‖₁ = |real(z)| + |imag(z)|, and h and h/8 are the two boundary values of region S that are located on the real axis. Region S, which resembles a square rhombus with a cut corner, is shown in Fig. 1. The region has two main properties. First, with the scaling factor of 8, each point outside the region is mapped to exactly one point inside the region. Second, point zero (which is a divergence point) and all the points near zero, which converge slowly, are not located in the region.

Fig. 2. Mapping operation on the input number.
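The recurrence of Eq. (2) with a = 1 and the membership test for region S in Eqs. (5)–(7) can be sketched in floating point as follows. This is only an illustrative model, not the authors' fixed-point hardware: the function names `cube_root_series` and `in_region_s` are hypothetical, and summing the terms r_0 through r_n is an assumption about how the number of terms is counted.

```python
def cube_root_series(z: complex, n: int = 40) -> complex:
    """Partial sum of Eq. (1) with a = 1, built with the recurrence of Eq. (2)."""
    term = 1.0 + 0.0j            # r_0 = a^(1/3) = 1
    total = term
    for t in range(1, n + 1):
        # r_t = r_{t-1} * (1/3 - t + 1) * (z - a) / (t * a), with a = 1
        term *= (1.0 / 3.0 - t + 1.0) * (z - 1.0) / t
        total += term
    return total


def in_region_s(z: complex, h: float = 1.93) -> bool:
    """Membership test for region S: Eqs. (5), (6) and (7)."""
    norm1 = abs(z.real) + abs(z.imag)          # ||z||_1
    return z.real >= z.imag and z.real >= -z.imag and h / 8 < norm1 < h


if __name__ == "__main__":
    for z in (1.2 + 0.3j, 0.25 + 0.0j):
        approx = cube_root_series(z)
        print(z, in_region_s(z), abs(approx - z ** (1.0 / 3.0)))
```

Points close to z = 1 converge after a handful of terms, whereas points near the boundary of S (for example close to h/8 on the real axis) need many more terms for the same accuracy, which is why the choice of h and n is analyzed in the text.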

The mapping operation for a sample input number 'z' outside region S is illustrated in Fig. 2. As shown, scaling and rotation operations are performed for the mapping. These operations, and their corresponding reverse operations applied after computing the cube root of the mapped data, should be simple to implement in hardware. Hence, the two operations are performed by Eqs. (8) and (9), and their corresponding reverse operations by Eqs. (10) and (11), respectively:

$$z_1 = z\cdot 8^{m}, \qquad m \in \mathbb{Z} \tag{8}$$

$$z_2 = z_1\cdot e^{\frac{jk\pi}{2}}, \qquad k \in \{1, 2, 3\} \tag{9}$$

$$\tilde{z}_1^{1/3} = \tilde{z}_2^{1/3}\cdot e^{-\frac{jk\pi}{6}} \tag{10}$$

$$\tilde{z}^{1/3} = \tilde{z}_1^{1/3}\cdot 2^{-m} \tag{11}$$

The value of parameter 'm' in Eqs. (8) and (11) is determined by the constraint defined in (7). The value of 'h' in Eq. (7) can be in the interval of 0–2. This value affects the precision of the calculated result (the estimation error), thus finding its optimal value is necessary. The error is estimated by the absolute difference between the exact root value and the estimated value:

$$R_n(z) = \left|\tilde{z}_n^{1/3} - z^{1/3}\right| \tag{12}$$

To determine the average of the relative remainders (errors) on region S for a specified value of 'h', it is necessary to compute the following formula:

$$AR_n = \frac{\iint_S \frac{R_n(z)}{|z|^{1/3}}\,ds}{\iint_S ds} \tag{13}$$

This formula cannot be calculated in the usual way of using uniform gridding, that is, in the form expressed in Eq. (14):

$$\frac{\iint_S \frac{R_n(z)}{|z|^{1/3}}\,dx\,dy}{\iint_S ds}, \qquad (z = x + iy) \tag{14}$$

The reason is that the input values are not uniformly scaled to their corresponding mapping values in region S; therefore, there is a need for geometrical gridding in which the distances of the grid lines for the integration vary accordingly. In other words, the number of points considered for the integration should not change with the value of 'h'. Hence, to compute the average of the relative remainders of selected uniformly distributed points (here $2^{18}$ points) on the entire input range, the points are first mapped to region S, and then the mapped points are placed on a geometrical grid for the integration, followed by computing the relative error of each point. This process is expressed as follows:

$$AR_n = \frac{\int_{-2^{15}}^{2^{15}-1}\int_{-2^{15}}^{2^{15}-1} \frac{R_n(\mathrm{map}(x+iy))}{|\mathrm{map}(x+iy)|^{1/3}}\,dx\,dy}{\int_{-2^{15}}^{2^{15}-1}\int_{-2^{15}}^{2^{15}-1} dx\,dy} \tag{15}$$

Fig. 3 displays the average error values for five different values of 'n' in Eq. (1) and for different values of 'h'. Based on these results, the values of n = 40 and h = 1.93 are chosen for the implementation.

Fig. 3. Average error obtained by Eq. (15) for different values.

To fulfill condition (7), one way is to multiply or divide ‖z‖₁ by eight repeatedly until the condition becomes true. However, this method is time-consuming, thus the following method is employed for this purpose. From Eq. (7), we have:

$$\frac{1}{8} \le \|z\|_1\cdot\frac{1}{h}\cdot 8^{m} < 1 \tag{16}$$

or

$$-1 \le \log_8\left(\|z\|_1\cdot\frac{1}{h}\right) + m < 0 \tag{17}$$

By defining ‖z‖′₁ as:

$$\|z\|'_1 = \|z\|_1\cdot\frac{1}{h} \tag{18}$$

(17) can be re-written as:

$$-\log_8\left(\|z\|'_1\right) - 1 \le m < -\log_8\left(\|z\|'_1\right) \tag{19}$$

By using (19), scaling factor 'm' can be calculated as:

$$m = -\left\lfloor \log_8\left(\|z\|'_1\right)\right\rfloor - 1 = -\left\lfloor \frac{\left\lfloor \log_2\left(\|z\|'_1\right)\right\rfloor}{3}\right\rfloor - 1 \tag{20}$$

The value of ⌊log₂(‖z‖′₁)⌋ can be obtained directly from the binary representation of ‖z‖′₁: it equals the Most Significant '1' Position (MS1P()) in the binary representation of the number minus the fraction length, which is 40 here. Hence, 'm' can be expressed as:

$$m = -\left\lfloor \frac{\mathrm{MS1P}\left(\|z\|'_1\right) - 40}{3}\right\rfloor - 1 \tag{21}$$

Table 1
Values of k and k' for different evaluations of (5) and (6).

| Evaluation of (5) | False | False | True | True |
| --- | --- | --- | --- | --- |
| Evaluation of (6) | False | True | False | True |
| k | 2 | 3 | 1 | 0 |
| k' | 2 | 3 | 1 | 0 |

For the rotation step, Eqs. (5) and (6) provide the approximate angular position of input 'z' in the complex plane, and parameter 'k' in Eqs. (9) and (10) is determined based on that position. The implementation of Eq. (10), considering its term $e^{-jk\pi/6}$, is more complicated than that of Eq. (9): in Eq. (9), terms $e^{j\pi/2}$, $e^{j2\pi/2}$ and $e^{j3\pi/2}$ are simply equal to j, −1, and −j, respectively. To implement Eq. (10), considering the modular property of the phase, parameter k' is defined as expressed in Eq. (22), and the reverse rotation is applied by using Eq. (23):

$$\left(e^{\frac{jk'\pi}{2}}\right)^{3} = \left(e^{-\frac{jk\pi}{6}}\right)^{3} \;\Rightarrow\; e^{\frac{j3k'\pi}{2}} = e^{-\frac{jk\pi}{2}}$$

$$\Rightarrow\; \frac{3k'\pi}{2} \equiv -\frac{k\pi}{2} \pmod{2\pi} \tag{22}$$

$$\tilde{z}_1^{1/3} = \tilde{z}_2^{1/3}\cdot e^{\frac{jk'\pi}{2}} \tag{23}$$

The values of k and k' that are determined by Eqs. (5) and (6) are given in Table 1.

After performing the scaling and rotation process, the cube root is estimated by Eqs. (1) and (2), and then the reverse mapping is performed on the result of the series by Eqs. (10) and (11).

The steps of the proposed approach are summarized in Algorithm 1.

3. Implementation and architecture

The proposed design is a fixed-point architecture. After a range analysis in floating-point mode, a format with a 56-bit word length and a 40-bit fraction length was chosen for the input value of the architecture (input 'z' in Fig. 4). The majority of the signals during the calculations, considering their real and imaginary parts, are numbers with 112 bits.

The block diagram of the proposed hardware and the algorithm steps performed by each block are depicted in Fig. 4. For efficient resource usage of the hardware, the implementation of each step of the algorithm was optimized through a low-level design approach. Each step of the algorithm (block of the architecture) and its corresponding circuit are explained in the following sections.

Fig. 4. Block diagram of the proposed hardware; details of step 2 are shown in Fig. 5, step 4 in Fig. 6, steps 5 and 7 in Fig. 7, and step 6 in Fig. 8.

Step 1: In the first step of the algorithm, an R_adder circuit computes the sum of the absolute values of the imaginary and real parts. (Prefixes R and C in the text denote Real and Complex calculation, respectively.)

Step 2: To compute ‖z‖′₁ in Eq. (18) without the need for multiplication, the value of 1/h is chosen in such a way that only a few shift and addition operations are needed for the calculation. In other words, the binary representation of 1/h contains only a few nonzero bits. As mentioned earlier, the optimal value of h in our design is around 1.93, and 1/h is nearly 0.5181. Thus, the value of 0.5313 (1/2 + 1/32) is selected for 1/h, requiring two R_shifts and one R_addition for the implementation of Eq. (18). In step 2, to calculate parameter 'm', the most significant '1' position (MS1P) function in Eq. (21) is implemented by using a priority circuit. It consists of 56 two-to-one multiplexers, each 6 bits wide, that are connected serially (Fig. 5). Each bit of the 56-bit input data is connected to the select signal of one of the multiplexers, and the input data of the multiplexers, which are 6-bit numbers, indicate the position of the connected select bit in the main data word. The output range of the MS1P function is [0 55]. To avoid implementing the "division by 3" and floor rounding operations, two 54-point lookup tables, as expressed in Eq. (24), are used to provide the shift amounts at the output: LUT1 holds the values scaled by 3 (for the scaling), and LUT2 holds the values scaled by 1 (for the reverse scaling). This unit is illustrated in Fig. 5.

$$\mathrm{LUT1} = \{-39, -39, -36, -36, -36, \dots, 15\}, \qquad \mathrm{LUT2} = \{-13, -13, -12, -12, -12, \dots, 5\} \tag{24}$$

Fig. 5. Circuit diagram of step 2 of the algorithm.

Step 3: A dynamic shifter with a rounding procedure is used to implement Eq. (8). It receives the output of the previous step (the number of shifts) and performs the shifting operation accordingly.

Step 4: The angular position of the number on the z-plane is detected by two comparators (Fig. 6) that evaluate the real and imaginary components of 'z'.

Fig. 6. The angular position detector unit.

Step 5: To implement Eq. (9) and rotate numbers on the z-plane by integer multiples of π/2, two multiplexers controlled by the outputs of the comparators in Fig. 6 are used instead of a complex multiplier. The circuit is shown in Fig. 7.

Fig. 7. The rotation unit.

Step 6: The recursive formula (2) is employed to compute the first 40 (8 for the 32-bit version) terms of the series in Eq. (1). To optimize the implementation, the relation is decomposed into four terms as expressed in Eqs. (25) to (28). Eq. (25) is a constant value. Further, Eq. (26) is the previously calculated term in the series, which is saved in a register. The values in Eq. (27) do not depend on 'z', thus they are precomputed and stored as a real vector in a LUT. Furthermore, Eq. (28) is computed only once at the beginning of the calculation, after the mapping step.

$$r_0 = a = 1 \tag{25}$$

$$r_{t-1} \tag{26}$$

$$x_t = \frac{\frac{4}{3} - t}{t}, \quad t \in \{1, 2, \dots, 40\} \;\Rightarrow\; \mathrm{LUT3} = \left\{\frac{1}{3}, -\frac{1}{3}, \dots, -\frac{116}{120}\right\} \tag{27}$$

$$z - 1 \tag{28}$$

Using the four above-mentioned values, the following calculations are performed for the series. Eq. (26) is multiplied by Eq. (27) by using two R_multipliers, as expressed in Eq. (29). Then, Eq. (28) is multiplied by the result of Eq. (29) through a C_multiplier, as expressed in Eq. (30). To increase the efficiency, as expressed in Eq. (31), the C_multiplier is made up of 3 R_multipliers and 5 R_adders instead of the conventional 4 R_multipliers and 2 R_adders.
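The data path described for step 6 (a precomputed table of x_t values, one real-by-complex scaling, and a complex product built from three real multiplications) can be modeled in software as below. This is a floating-point sketch under stated assumptions rather than the authors' circuit: the names `LUT3`, `cmul3` and `series_step6` are hypothetical, the table is generated from the x_t formula of Eq. (27), and the accumulation of terms into a running sum is written out explicitly.

```python
from fractions import Fraction

N_TERMS = 40

# Table of Eq. (27): x_t = (4/3 - t) / t for t = 1..40, kept exact for inspection.
LUT3 = [(Fraction(4, 3) - t) / t for t in range(1, N_TERMS + 1)]


def cmul3(a, b, c, d):
    """(a + bi)(c + di) with 3 real multiplications and 5 additions, as in Eq. (31)."""
    q1 = d * (a - b)
    q2 = a * (c - d)
    q3 = b * (c + d)
    return q1 + q2, q1 + q3        # (real part, imaginary part)


def series_step6(z: complex, n_terms: int = N_TERMS) -> complex:
    """Sum of the series terms for an input already mapped into region S."""
    zr, zi = z.real - 1.0, z.imag  # Eq. (28): z - 1, computed once
    rr, ri = 1.0, 0.0              # Eq. (25): r_0 = 1
    sr, si = rr, ri                # running sum of the terms
    for t in range(1, n_terms + 1):
        x = float(LUT3[t - 1])
        yr, yi = x * rr, x * ri              # Eq. (29): two real multiplications
        rr, ri = cmul3(yr, yi, zr, zi)       # Eq. (30), evaluated as in Eq. (31)
        sr, si = sr + rr, si + ri
    return complex(sr, si)


if __name__ == "__main__":
    print(LUT3[0], LUT3[1], LUT3[-1])
    w = series_step6(1.2 + 0.3j)
    print(w, w ** 3)
```

Trading one multiplication for three extra additions is attractive on FPGAs because wide multipliers consume DSP slices, while adders map to ordinary logic.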

Fig. 8. The unit for step 6 of the algorithm (a Laurent series generator).

Table 2
Some sample complex cube root calculation results obtained by the proposed hardware.

| Number | Exact cube root | Cube root by the 56-bit hardware | Relative error |
| --- | --- | --- | --- |
| −3255.365024682178 + 7128.25509509333i | 2.826762700162849 − 19.660566429183490i | 2.82677118011569 − 19.6605760468843i | 6.45e-07 |
| 1.04995984450020 + 0.674768219654003i | 1.05721150890496 + 0.203760817022829i | 1.057211508972549 + 0.203760817164240i | 1.45e-10 |
| −0.004689200353823 + 0.00000000000000i | −0.167378471036465 + 0.000000000000i | −0.167378471639658 + 0.0000000000000i | 3.60e-09 |
| −1401.327899706519 − 3068.48008295311i | 2.13436837450103 + 14.8448581158965i | 2.13436837450342 + 14.8448581158943i | 2.16e-13 |
| −0.329488528351652 − 0.21174941676904i | −0.718429929994387 − 0.138466019596150i | −0.718429932013664 − 0.138466018477094i | 3.15e-09 |
| 821.614695155512 + 9026.624548114502i | −5.389113344235370 − 22.214227764551840i | −5.38911130328913 − 22.2142286939108i | 9.81e-08 |
| 1021.56803888605 − 2236.92198047282i | −1.92093153705092 + 13.3603723043068i | −1.92093153704819 + 13.3603723043047i | 2.55e-13 |
| 0.0159305391625821 − 0.00467762834432i | 0.253956056031554 − 0.0242498631143139i | 0.253956056037006 − 0.0242498631180101i | 2.58e-11 |
| −5481.34114634651 + 1609.46697740049i | −17.7955204147214 + 1.69926616794021i | −17.7955204147214 + 1.69926616794021i | 2.04e-16 |

$$y = x_t\cdot\mathrm{Re}(r_{t-1}) + x_t\cdot\mathrm{Im}(r_{t-1})\,i \tag{29}$$

$$(a + bi)\cdot(c + di) = (a\cdot c - b\cdot d) + (a\cdot d + b\cdot c)\,i \tag{30}$$

$$(a + bi)\cdot(c + di) = (q_1 + q_2) + (q_1 + q_3)\,i, \qquad q_1 = d\cdot(a - b),\quad q_2 = a\cdot(c - d),\quad q_3 = b\cdot(c + d) \tag{31}$$

A counter unit controls the iteration steps until 40 terms of the series are generated and summed up, as expressed in Eq. (32). The circuit diagram of step 6 is displayed in Fig. 8. The critical path is shown by a dashed line. This path includes two multiplexers, an adder, and two multipliers.

$$r_t = y\cdot(z - 1), \qquad s_t = s_{t-1} + r_t \tag{32}$$

Step 7: The reverse rotation of the output of the series is implemented similarly to the rotation of the fifth step, by employing some multiplexers to exchange the real and imaginary components and negate them.

Step 8: At the final step of the algorithm, the output of the reverse rotation step is scaled by a dynamic shifter. Considering Eqs. (8) and (11), the number of shifts is one-third of the shifts applied in the third step, and it is in the reverse direction. The shift values are obtained from a look-up table.

The design was implemented by using the Matlab HDL Coder toolbox.

4. Results

The functional verification of the design was performed by applying various complex numbers with different amplitudes and angles to the circuit and measuring the relative error. Some sample results are presented in Table 2. Based on the results, the relative errors are extremely small. The hardware was synthesized and implemented on an Artix 7 FPGA (XC7A100T) using Xilinx ISE 14.7. The resource usage is provided in Table 3. The maximum throughput of the hardware is 925,925 samples per second at a 37.037 MHz clock frequency. To compare our design in terms of precision and hardware usage with previous ones, a 32-bit version of the hardware was also implemented, containing only the cube root core section without the scaling units at the front and back end of the system. The previous works, which were 32-bit designs, did not scale numbers and only accepted input numbers in a specified range. They also computed the cube root with specific precision.

Table 3
The resource utilization of the 56-bit cube root hardware after place and route.

Device utilization summary

| Logic utilization | Used | Available | Utilization |
| --- | --- | --- | --- |
| Number of slice registers | 722 | 126,800 | 0.6% |
| Number of slice LUTs | 4346 | 63,400 | 6.9% |
| Number of DSP48E1s | 60 | 240 | 25% |

Algorithm 1
The steps of the proposed complex cube root calculation.

Input: z
Output: $\tilde{z}_n^{1/3}$

1. Computing ‖z‖₁.
2. Obtaining ‖z‖′₁ and 'm' by (18) and (21).
3. 'z' is rescaled by (8).
4. Detecting the position of input 'z' in the complex plane by (5) and (6).
5. Scaled 'z' is rotated to region S by simple arithmetic operations using (9), according to Table 1.
6. Forty terms of the Laurent series (1) are computed in a loop using (2).
7. The output of the series is rotated back using (23), which compensates for step 5.
8. The rotated result is rescaled by (11), which compensates for step 3. This is $\tilde{z}^{1/3}$.
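Algorithm 1 can be sketched end to end in software as follows. This is a floating-point model under stated assumptions, not the authors' fixed-point design: all function names are hypothetical, 1/h is taken as 1/2 + 1/32 as in step 2, the MS1P position is assumed to be zero-based and obtained from Python's `int.bit_length`, and k' = k is taken from Table 1. The two reference values in the usage lines are the exact cube roots listed in Table 2.

```python
H_INV = 0.5 + 1.0 / 32.0     # 1/h approximated by two shifts and one addition
FRAC_BITS = 40               # fraction length of the 56-bit fixed-point format


def scale_exponent(z: complex) -> int:
    """m of Eq. (21), from the most significant '1' position of ||z||'_1."""
    norm1 = abs(z.real) + abs(z.imag)                  # step 1
    fixed = int(norm1 * H_INV * (1 << FRAC_BITS))      # ||z||'_1 as a fixed-point integer
    if fixed == 0:
        raise ValueError("input magnitude too small for this sketch")
    ms1p = fixed.bit_length() - 1                      # assumed zero-based bit position
    return -((ms1p - FRAC_BITS) // 3) - 1


def rotation_index(z: complex) -> int:
    """k of Table 1, from the two comparisons of Eqs. (5) and (6)."""
    c5 = z.real >= z.imag
    c6 = z.real >= -z.imag
    table = {(False, False): 2, (False, True): 3, (True, False): 1, (True, True): 0}
    return table[(c5, c6)]


def rotate_quarter_turns(z: complex, k: int) -> complex:
    """Multiply by j**k using only component swaps and sign changes."""
    for _ in range(k % 4):
        z = complex(-z.imag, z.real)
    return z


def complex_cube_root(z: complex, n: int = 40) -> complex:
    m = scale_exponent(z)                      # step 2
    z1 = z * 8.0 ** m                          # step 3, Eq. (8)
    k = rotation_index(z1)                     # step 4
    z2 = rotate_quarter_turns(z1, k)           # step 5, Eq. (9)
    term = total = 1.0 + 0.0j                  # step 6, Eqs. (1) and (2) with a = 1
    for t in range(1, n + 1):
        term *= (4.0 / 3.0 - t) * (z2 - 1.0) / t
        total += term
    root1 = rotate_quarter_turns(total, k)     # step 7, Eq. (23) with k' = k
    return root1 * 2.0 ** (-m)                 # step 8, Eq. (11)


if __name__ == "__main__":
    samples = {
        -3255.365024682178 + 7128.25509509333j: 2.826762700162849 - 19.66056642918349j,
        1.04995984450020 + 0.674768219654003j: 1.05721150890496 + 0.203760817022829j,
    }
    for z, reference in samples.items():
        w = complex_cube_root(z)
        print(w, abs(w - reference) / abs(reference))
```

The returned value is one of the three cube roots of z, not necessarily the principal one, because the reverse rotation uses k' rather than the angle −kπ/6 of Eq. (10).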

Table 4
Comparison of the present work with some similar works.

| Ref. | Year | Type | Device | Resource utilization | Computation time (ns) | Max error | Input range and precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| [11] | 2009 | Real | N/A | 12 Reg, 1966 LUT, 15 DSP48 | N/A | N/A | N/A |
| [19] | 2014 | Real | Virtex 5 | 439 Reg, 576 LUT, 12 DSP48 | 127.3 | 0.0001% | IEEE 754–2008 standard (single precision) floating point format |
| [20] | 2015 | Real | Virtex 5 | 119 Reg, 380 LUT, 0 DSP48 | 174 | 2% | [0, 2^32), precision 1 |
| [21] | 2016 | Real | Virtex 5 | 7 Reg, 2320 LUT, 4 DSP48 | N/A | N/A | [0, 10^8], precision 1 |
| Present work: 32-bit complex cube root core | 2022 | Complex | Virtex 5 | 140 Reg, 421 LUT, 5 DSP48 | 122.8 | 0.29% | Real and imaginary parts in [−2, 2), precision 2^−14 |
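The timing figures reported in this section can be cross-checked with simple arithmetic. The sketch below uses the clock frequencies quoted in the text (65.14 MHz and 8 clock cycles for the 32-bit core, 37.037 MHz for the 56-bit design); the helper names are hypothetical, and the value of 40 clock cycles per sample for the 56-bit design is an inference from the quoted throughput, not a number stated by the authors.

```python
def latency_ns(cycles: int, clock_mhz: float) -> float:
    """Time needed for `cycles` clock periods, in nanoseconds."""
    return cycles * 1e3 / clock_mhz


def throughput_samples_per_s(cycles_per_sample: int, clock_mhz: float) -> float:
    """Samples per second when one result is produced every `cycles_per_sample` cycles."""
    return clock_mhz * 1e6 / cycles_per_sample


if __name__ == "__main__":
    # 32-bit core: 8 cycles at 65.14 MHz, compared with the computation time in Table 4.
    print(round(latency_ns(8, 65.14), 1))
    # 56-bit design: assuming 40 cycles per sample at 37.037 MHz.
    print(round(throughput_samples_per_s(40, 37.037)))
```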

Additionally, the 32-bit version of the circuit computes eight terms of the Laurent series to obtain a relative error of less than 0.29% in this study.

The timing and resource usage of the 32-bit version of the design and of some similar works that compute the real cube root are presented in Table 4. The 32-bit version can work at a 65.14 MHz clock frequency and needs 8 clock cycles to complete the operation. Our design has comparatively reasonable resource usage, speed, and precision while having the capability of computing complex roots. The hardware presented in [20] is superior to our work only in terms of resource utilization. The hardware reported in [19] is a suitable choice when the numbers are real, but it has high resource usage. For a fair comparison, our 32-bit version of the hardware, like the other works, was implemented on a Virtex 5 FPGA by ISE 14.7.

5. Conclusion

The hardware proposed in this work computes complex and real cube roots by detecting the approximate absolute value and angular position, then applying a mapping (shift and rotation) process and computing the Laurent series. It is fast and uses resources efficiently owing to techniques such as computational reuse, converting multiplications to add-shift operations, and using precomputed data. The design can be utilized in different applications. For example, depending on the application, the bit width of the signals and the number of terms of the Laurent series can be modified to achieve the desired run time and precision. In this work, we implemented two cases, 56-bit and 32-bit. With the proposed design, higher-order roots can be computed by changing the precomputed coefficients of the Laurent series.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.micpro.2023.104847.

References

[1] D. Trofimowicz, T.P. Stefański, Multimodal particle swarm optimization with phase analysis to solve complex equations of electromagnetic analysis, in: 2020 23rd International Microwave and Radar Conference (MIKON), 2020, pp. 44–48.
[2] V. Maslennikov, Method for approximate determination of the roots of a cubic equation with positive coefficients and complex conjugate roots, Vestnik Natsional'nogo Issledovatel'skogo Yadernogo Universiteta MIFI 4 (2) (2015) 179–183.
[3] X. Liu, Amount of log-square-Hoyt fading in satellite optical communications, IEEE Commun. Lett. 16 (5) (2012) 666–669.
[4] D.M. Kipping, Investigations of approximate expressions for the transit duration, Mon. Not. R. Astron. Soc. 407 (1) (2010) 301–313.
[5] B. Ghazi, H. Hassanieh, P. Indyk, D. Katabi, E. Price, L. Shi, Sample-optimal average-case sparse fourier transform in two dimensions, in: 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2013, pp. 1258–1265.
[6] S.-H. Hsieh, C.-S. Lu, S.-C. Pei, Sparse fast fourier transform for exactly and generally k-sparse signals by downsampling and sparse recovery, arXiv preprint arXiv:1407.8315, 2014.
[7] Y. Zhang, Z. Ke, D. Guo, F. Li, Solving for time-varying and static cube roots in real and complex domains via discrete-time ZD models, Neur. Comput. Appl. 23 (2) (2013) 255–268.
[8] O. Ahmadi, F.R. Henriquez, Low Complexity Cubing and Cube Root Computation over F_{3^m} in Polynomial Basis, IEEE Trans. Comput. 59 (10) (2010) 1297–1308.
[9] Y. Li, W. Chu, On the improved implementations and performance evaluation of digit-by-digit integer restoring and non-restoring cube root algorithms, in: 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), 2016, pp. 1–5.
[10] S. Yammen, J. Ieamsaard, Newton's cube root finding data sequence, in: 2021 9th International Electrical Engineering Congress (iEECON), 2021, pp. 405–407.
[11] V. Pieterse, P. Black, cube root, Dictionary of Algorithms and Data Structures, 2009.
[12] L. Moroz, V. Samotyy, C.J. Walczyk, J.L. Cieśliński, Fast calculation of cube and inverse cube roots using a magic constant and its implementation on microcontrollers, Energies 14 (4) (2021) 1058.
[13] M.S.B. Mohamad, An algorithms for finding the cube roots in finite fields, Procedia Comput. Sci. 179 (2021) 838–844.
[14] G.H. Cho, S. Kwon, H.-S. Lee, A refinement of Müller's cube root algorithm, Finite Fields and Their Applications 67 (2020) 101708.
[15] C. Zhou, H. Geng, P. Wang, C. Guo, Ten-input cube root logic computation with rational designed DNA nanoswitches coupled with DNA strand displacement process, ACS Appl. Mater. Interfaces 12 (2) (2019) 2601–2606.
[16] J. Jo, I.-C. Park, Low-latency low-cost architecture for square and cube roots, IEICE Trans. Fundam. Electr. Commun. Comput. Sci. 100 (9) (2017) 1951–1955.
[17] A. Pineiro, J.D. Bruguera, F. Lamberti, P. Montuschi, A radix-2 digit-by-digit architecture for cube root, IEEE Trans. Comput. 57 (4) (2008) 562–566, https://doi.org/10.1109/TC.2007.70848.
[18] G.H. Cho, N. Koo, E. Ha, S. Kwon, New cube root algorithm based on the third order linear recurrence relations in finite fields, Designs Codes Cryptogr. 75 (3) (2015) 483–495.
[19] C.M. Guardia, E. Boemo, FPGA implementation of a binary32 floating point cube root, in: 2014 IX Southern Conference on Programmable Logic (SPL), Nov. 2014, pp. 1–6, https://doi.org/10.1109/SPL.2014.7002202.
[20] R.V.W. Putra, T. Adiono, Optimized hardware algorithm for integer cube root calculation and its efficient architecture, in: 2015 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), 2015, pp. 263–267.
[21] S.K. Padhan, S. Gadtia, B. Bhoi, FPGA based implementation for extracting the roots of real number, Alexandria Eng. J. 55 (3) (Sep. 2016) 2849–2854, https://doi.org/10.1016/j.aej.2016.07.003.

Elias Rajaby received his B.Sc. from the Electrical Engineering Department, Yazd University, Yazd, Iran, in 2014, and the M.Sc. degree in Digital Electronic Engineering from Amirkabir University of Technology, Tehran, Iran, in 2016. He is currently pursuing his Ph.D. degree at Isfahan University of Technology, Isfahan, Iran. His research interests include digital system implementation, image processing, computer vision, and evolutionary algorithms.

Sayed Masoud Sayedi received B.Sc. and M.Sc. degrees in electrical engineering from Isfahan University of Technology (IUT), Isfahan, Iran, in 1986 and 1988, respectively. He also received his Ph.D. degree in electronics from Concordia University, Montreal, QC, Canada, in 1996. From 1988 to 1992, and then since 1997, he has been with IUT, where he is currently a full professor at the Department of Electrical and Computer Engineering. His current research interests include VLSI fabrication processes, image sensors, low-power VLSI circuits, and data converters.

Ehsan Yazdian received his B.Sc. degree in electrical engineering from Isfahan University of Technology (IUT), Isfahan, Iran, in 2004. Furthermore, he received his M.Sc. and Ph.D. degrees in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2006 and 2012, respectively. Since 2013, he has been with the Electrical Engineering Department, IUT. His current research interests include statistical array signal processing, wireless communications, digital communication systems, software defined radio, and design and implementation of signal processing algorithms on FPGA.