Parallel Implementation of Cryptographic Algorithm AES Using OpenCL on GPU
IEEE Xplore Compliant - Part Number:CFP18J06-ART, ISBN:978-1-5386-0807-4; DVD Part Number:CFP18J06DVD, ISBN:978-1-5386-0806-7
Abstract: The importance of protecting information has increased rapidly during the last decades. This motivates the need for cryptographic algorithms. The symmetric key cryptography algorithm is accelerated using a parallel implementation on GPGPUs with Open Computing Language (OpenCL). General Purpose Graphics Processing Units (GPGPUs) enable a high level of parallelism with the Compute Unified Device Architecture (CUDA) and OpenCL programming environments using a Single Instruction Multiple Data (SIMD) architecture. In this paper, a parallel implementation of the Advanced Encryption Standard (AES) algorithm using OpenCL is presented. The experimental results show that the parallel implementation of the encryption algorithm tested on GPUs is faster than the sequential implementation, with a reported improvement of 99.8% over the sequential implementation.

Keywords: Advanced Encryption Standard (AES), Graphics Processing Unit (GPU), Image Restoration, OpenCL, SIMD

I. INTRODUCTION

The Graphics Processing Unit (GPU) [1] plays a vital role in various types of image and video processing applications. The invention of many-core GPUs provides the scope to accelerate massively parallel applications. GPU-ported applications improve performance by offloading the compute-intensive part onto the GPU and keeping the remaining code on the Central Processing Unit (CPU). Multi-core GPUs provide high performance for data-parallel tasks using SIMD architectures [2] [3]. In [4], the author shows that the use of GPGPUs accelerates the cryptographic solution to crack the UNIX password cipher in 100 MHz.

The focus of this paper is to accelerate the implementation of the AES algorithm. The proposed work is implemented using OpenCL and tested on Nvidia GPUs. The experimental results are compared with a sequential implementation on different sets of inputs.

The rest of the paper is organized as follows: Section II discusses the existing parallel formulation of the AES algorithm and gives an introduction to GPU computing. Section III describes the GPU implementation of the AES algorithm. Experimental results followed by conclusions are presented in Sections IV and V respectively.

II. LITERATURE SURVEY

Details of the AES algorithm [5] and the hardware and software employed are briefly discussed in this section. The software constructs and terms used in the implementation are also elucidated. The program flow of OpenCL [2] is explained in detail.

A. Advanced Encryption Standard Algorithm

The Advanced Encryption Standard (AES) is a specification for the encryption of electronic data established by the National Institute of Standards and Technology (NIST) [3] in 2001, based on the Rijndael cipher, where the input is plain text and encryption produces cipher text.

The algorithm has four stages based on the Rijndael cipher [6], as shown in Fig 2:

1. Key Expansion: Round keys are derived from the cipher key using Rijndael's key schedule. A 128-bit round key block is required for each round of AES.

2. Initial Round: Each byte of the state is combined with a block of the round key using bitwise XOR.

3. Rounds:
   i) Sub Bytes: A non-linear substitution step where each byte is replaced with another based on a lookup table.
   ii) Shift Rows: In this step, the last three rows of the state are shifted cyclically.
   iii) Mix Columns: In this step, the four bytes in each column are combined.
   iv) AddRoundKey: In this step, the sub key is added by combining each byte of the state with the corresponding byte of the sub key using bitwise XOR.

4. Last Round: In this round, the Sub Bytes, Shift Rows and AddRoundKey operations take place.
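Steps 3(ii) and 3(iv) above can be sketched in plain C as follows. This is a minimal illustration, not the implementation used in this paper: the function names shift_rows and add_round_key are hypothetical, and the state is assumed to be a 4x4 byte matrix indexed as state[row][column], with row r rotated cyclically to the left by r positions.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: the AES state is a 4x4 byte matrix, state[row][col]. */

/* Shift Rows: rotate row r cyclically to the left by r positions.
   Row 0 is left unchanged; rows 1..3 move by 1, 2 and 3 bytes. */
void shift_rows(uint8_t state[4][4])
{
    for (int r = 1; r < 4; r++) {
        uint8_t tmp[4];
        for (int c = 0; c < 4; c++)
            tmp[c] = state[r][(c + r) % 4];
        memcpy(state[r], tmp, sizeof tmp);
    }
}

/* AddRoundKey: XOR each state byte with the corresponding
   byte of the round key (sub key). */
void add_round_key(uint8_t state[4][4], const uint8_t round_key[4][4])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            state[r][c] ^= round_key[r][c];
}
```

Both functions touch only the 16 bytes of a single state, so separate blocks of the input can in principle be processed independently of one another.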
3. OpenCL constructs, such as the BARRIER construct, allow results to be copied back to the host program only after complete execution of a work-group.
4. The AES algorithm employed uses 14 rounds (i.e., AES-256).
5. The size of each work item is 24 bits, which represents an RGB pixel in hexadecimal.

Fig 3: Flow diagram of Parallel implementation of AES algorithm

As shown in Fig 3, the Encrypt() function is executed concurrently, whereas key expansion is executed sequentially. Based on the prefix computation technique, the encrypt function is executed concurrently by utilizing the supported GPUs. The parallel formulation of the shiftrow() function is illustrated with the following example:

Code Snippet for Sequential Execution of shift row function:

    // Rotate first row 1 column to left
    temp = state[1][0];
    state[1][0] = state[1][1];
    state[1][1] = state[1][2];
    state[1][2] = state[1][3];
    state[1][3] = temp;
    // Rotate second row 2 columns to left
    temp = state[2][0];
    state[2][0] = state[2][2];
    state[2][2] = temp;
    temp = state[2][1];
    state[2][1] = state[2][3];
    state[2][3] = temp;
    // Rotate third row 3 columns to left
    temp = state[3][0];
    state[3][0] = state[3][3];
    state[3][3] = state[3][2];
    state[3][2] = state[3][1];
    state[3][1] = temp;

Code Snippet for Parallel Formulation of shift row function:

    // Rotate first row 1 column to right
    if (k == 4)
    {
        temp = P[i + 3];
        P[i + 3] = P[i + 2];
        P[i + 2] = P[i + 1];
        P[i + 1] = P[i];
        P[i] = temp;
    }
    // Rotate second row 2 columns to right
    if (k == 8)
    {
        temp = P[i];
        P[i] = P[i + 2];
        P[i + 2] = temp;
        temp = P[i + 1];
        P[i + 1] = P[i + 3];
        P[i + 3] = temp;
    }
    // Rotate third row 3 columns to right
    if (k == 12)
    {
        temp = P[i];
        P[i] = P[i + 1];
        P[i + 1] = P[i + 2];
        P[i + 2] = P[i + 3];
        P[i + 3] = temp;
    }

IV. EXPERIMENTAL RESULTS

The proposed work is implemented using OpenCL and tested on GPUs. The GPU implementation of the AES algorithm is tested on AMD Radeon 8550M and 8570G GPUs. AMD APP Profiler is used for performance analysis, to evaluate the proposed work using data from the OpenCL run-time and the AMD Radeon GPUs during the execution of an OpenCL application. Figure 4 shows screenshots of the GPU implementation of the proposed work. Various parameters, such as the GPU platform currently in use, the global and local item sizes processed by the GPU, and the kernel occupancy of the OpenCL application, are considered to evaluate the proposed work.

Table I shows the execution time in milliseconds for the encryption and decryption functions. In Table I, the first column gives the number of work items, columns 2 and 3 list the time taken by the parallel implementation of the proposed work, and columns 4 and 5 list the time taken by the sequential implementation of the AES algorithm.
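One plausible way to relate sequential and parallel execution times to a percentage improvement, such as the 99.8% reported in the abstract, is sketched below. The exact formula used in this work is not stated, so the definition here (relative reduction in execution time) is an assumption, and the function names improvement_percent and speedup are hypothetical.

```c
/* Assumption: "improvement" is the relative reduction in execution
   time of the parallel run with respect to the sequential run. */
double improvement_percent(double t_seq_ms, double t_par_ms)
{
    return (t_seq_ms - t_par_ms) / t_seq_ms * 100.0;
}

/* Speedup: how many times faster the parallel run is. */
double speedup(double t_seq_ms, double t_par_ms)
{
    return t_seq_ms / t_par_ms;
}
```

Under this definition, illustrative timings of 1000 ms (sequential) and 2 ms (parallel), which are not taken from Table I, give an improvement of about 99.8% and a speedup of 500.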
VI. REFERENCES
[8] Jielin Wang, Weizhen Wang, Jianwei Yang, Zhiyi Yu, Jun Han, Xiaoyang Zeng, "Parallel Implementation of AES on 2.5D Multicore Platform with Hardware and Software Co-Design", IEEE 11th International Conference on ASIC (ASICON), 2015.