Lopez Software
Julio López
[email protected]
ASCrypto 2017
Agenda
2 Symmetric-Key Cryptography
Data Encryption
Hash Functions
SHA2 Implementation
SHA3 Implementation
Software Efficiency
Software Efficiency
• Ensure security.
• Running time.
• Code size.
• Memory consumption.
• Computer platform characteristics.
• Energy consumption.
These goals sometimes conflict with each other. For example, accelerating an
operation with look-up tables increases code size and, if not implemented
carefully, can leave the code vulnerable to cache-timing attacks.
Julio López (IC-UNICAMP) Efficient Algorithms in Software ASCrypto 2017 3 / 83
Vector instructions
[Figure: timeline, 1994 to 2020, of x86 vector instruction set extensions.
Extensions shown: MMX, SSE, SSE2, SSE3, SSE4, AES-NI + CLMUL, AVX, BMI and
SHA1-SHA2. Capabilities added along the way: integer arithmetic,
floating-point arithmetic, string manipulation and cryptography. Register
sets: MMX (64 bits), XMM (128 bits), YMM (256 bits) and ZMM (512 bits).]
[Figure: a vector register holding four lanes, c3 c2 c1 c0.]
Symmetric-Key Cryptography
2.1 Data Encryption
Secure Communication
Using a secret key $k$, Alice and Bob can exchange encrypted messages.
Charles cannot read the messages without knowledge of the key $k$.
[Figure: key generation gives both parties the key $k$. The sender encrypts the
message $M$ under $k$ into the ciphertext $C$; the receiver decrypts $C$ under $k$
to recover $M$.]
$C = E_k(M)$, $M = D_k(C)$
[Figure: AES maps a message block M to a ciphertext block C.]
• AES supports three key sizes, $|k| \in \{128, 192, 256\}$ bits, leading to three
algorithms:
• AES-128.
• AES-192.
• AES-256.
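[Figure: M passes through $N_r$ rounds, using the round keys $k_0, \ldots, k_{N_r}$, to produce C.]
$$N_r = \begin{cases} 10 & \text{if } |k| = 128 \\ 12 & \text{if } |k| = 192 \\ 14 & \text{if } |k| = 256 \end{cases}$$
Each round is built from the following steps: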
• SubBytes
• ShiftRows
• MixColumns
• AddRoundKey
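MixColumns multiplies each column of the state by a fixed matrix over $\mathbb{F}_{2^8}$:
$$\begin{pmatrix} c'_0 \\ c'_1 \\ c'_2 \\ c'_3 \end{pmatrix} = \begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}$$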
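InvMixColumns uses the inverse matrix:
$$\begin{pmatrix} c'_0 \\ c'_1 \\ c'_2 \\ c'_3 \end{pmatrix} = \begin{pmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}$$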
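The two matrix products can be checked with a few lines of Python. This is a minimal sketch, not the optimized code the slides discuss: the helper names (`xtime`, `gmul`, `mat_vec`) are made up for illustration, and bytes are multiplied in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B).

```python
def xtime(b):
    """Multiply a byte by x (i.e. by 02) in GF(2^8) modulo 0x11B."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gmul(a, b):
    """Multiply two bytes in GF(2^8) with shift-and-add."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


MIX = [[0x02, 0x03, 0x01, 0x01],
       [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03],
       [0x03, 0x01, 0x01, 0x02]]

INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09],
           [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B],
           [0x0B, 0x0D, 0x09, 0x0E]]


def mat_vec(matrix, column):
    """Matrix-vector product over GF(2^8); addition is XOR."""
    out = []
    for row in matrix:
        acc = 0
        for coef, byte in zip(row, column):
            acc ^= gmul(byte, coef)
        out.append(acc)
    return out


column = [0xDB, 0x13, 0x53, 0x45]
mixed = mat_vec(MIX, column)
restored = mat_vec(INV_MIX, mixed)   # applying the inverse gives the column back
```

Table-free arithmetic like `xtime` avoids the look-up tables mentioned earlier, although Python itself gives no constant-time guarantees.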
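[Figure: AES encryption and decryption data paths.
Encryption: Plaintext, AddRoundKey, then Nr − 1 rounds of SubBytes, ShiftRows,
MixColumns, AddRoundKey (one AESENC instruction each), then a last round of
SubBytes, ShiftRows, AddRoundKey (one AESENCLAST instruction), Ciphertext.
Decryption: the same structure built from InvShiftRows, InvSubBytes,
InvMixColumns and AddRoundKey, with Nr − 1 rounds executed by AESDEC.]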
AES-128 Encryption
Analogously, decryption uses AESDEC and AESDECLAST, with the round keys taken
in reverse order and the inner round keys transformed with AESIMC.
Modes of Operation
[Figure: a chained mode that uses an IV. Encryption maps P1..P4 to C1..C4 with
E_k and executes sequentially; decryption maps C1..C4 back to P1..P4 with D_k
and can execute in parallel.]
[Figure: a second mode in which both encryption (P1..P4 to C1..C4) and
decryption (C1..C4 to P1..P4) use only E_k.]
[Figure: pipeline diagram over clock cycles 0 to 24, illustrating instruction
latency and throughput.]
[Figure: bar chart, scale 0.0 to 1.0, for Haswell, Skylake and Zen.]
Can we do better?
[Figures: two bar charts, scale 0.0 to 1.0, for Haswell, Skylake and Zen.]
Hash Functions
Hash Function
Cryptographic Properties
SHA2 Implementation
SHA2 Algorithm
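$S_j = \mathrm{Update}(S_{j-1}, M_j)$ for $1 \le j \le n$.
For SHA-256, the message schedule expands each block into 64 words:
$w_i = \sigma_1(w_{i-2}) + w_{i-7} + \sigma_0(w_{i-15}) + w_{i-16}$ for $16 \le i < 64$ (additions modulo $2^{32}$),
where
$\sigma_0(x) = \mathrm{Rot}(x, 7) \oplus \mathrm{Rot}(x, 18) \oplus \mathrm{Shr}(x, 3)$
$\sigma_1(x) = \mathrm{Rot}(x, 17) \oplus \mathrm{Rot}(x, 19) \oplus \mathrm{Shr}(x, 10)$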
(a_0, b_0, c_0, d_0, e_0, f_0, g_0, h_0) ← S
for i ← 0 to 63 do
    T1 ← h_i ⊞ Σ1(e_i) ⊞ Ch(e_i, f_i, g_i) ⊞ k_i ⊞ w_i
    T2 ← Σ0(a_i) ⊞ Maj(a_i, b_i, c_i)
    h_{i+1} ← g_i,  g_{i+1} ← f_i
    f_{i+1} ← e_i,  e_{i+1} ← d_i ⊞ T1
    d_{i+1} ← c_i,  c_{i+1} ← b_i
    b_{i+1} ← a_i,  a_{i+1} ← T1 ⊞ T2
end for
S′ ← (a_0 ⊞ a_64, . . . , h_0 ⊞ h_64)
⊞ is addition modulo 2^32.
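The Update function can be sketched in plain Python and compared against `hashlib`. This is an illustrative reference model, not the vectorized code of the talk. The function names (`update`, `sha256_sketch`) are hypothetical, and the round constants and initial state are derived from the fractional parts of cube and square roots of the first primes, as FIPS 180-4 defines them, rather than typed in as tables.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(n):
    primes, c = [], 2
    while len(primes) < n:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes


def icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# First 32 fractional bits of the cube roots (K) and square roots (H0).
K = [icbrt(p << 96) & MASK for p in first_primes(64)]
H0 = [math.isqrt(p << 64) & MASK for p in first_primes(8)]


def rotr(x, r):
    return ((x >> r) | (x << (32 - r))) & MASK


def update(state, block):
    """One application of Update: 64 rounds over a 64-byte block."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((s1 + w[i - 7] + s0 + w[i - 16]) & MASK)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[i] + w[i]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        h, g, f, e = g, f, e, (d + t1) & MASK
        d, c, b, a = c, b, a, (t1 + t2) & MASK
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]


def sha256_sketch(msg):
    """Pad the message, then chain Update over the 64-byte blocks."""
    padded = (msg + b"\x80" + b"\x00" * ((55 - len(msg)) % 64)
              + struct.pack(">Q", 8 * len(msg)))
    state = list(H0)
    for off in range(0, len(padded), 64):
        state = update(state, padded[off:off + 64])
    return struct.pack(">8I", *state)


digest = sha256_sketch(b"abc")
matches = digest == hashlib.sha256(b"abc").digest()
```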
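[Figure: message schedule on XMM registers. The words w7..w0 are held in xmm0
and xmm1; σ0 is applied to four words and added lane-wise, giving x3..x0 in
xmm2; σ1 is then applied two words at a time and added lane-wise.]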
The remaining values $A_{i+2} = [a_{i+2}, b_{i+2}, e_{i+2}, f_{i+2}]$ are calculated by the
SHA256RNDS2 instruction, where $X$ holds $k_i + w_i$ and $k_{i+1} + w_{i+1}$:
$A_{i+2} = \mathrm{SHA256RNDS2}(C_i, A_i, X)$
[Figure: two consecutive rounds, i and i+1, taking $k_i + w_i$ and $k_{i+1} + w_{i+1}$
as inputs.]
After two rounds, half of the state is a copy of older values: $c_{i+2} = a_i$,
$d_{i+2} = b_i$, $g_{i+2} = e_i$ and $h_{i+2} = f_i$. Writing $C_i = [c_i, d_i, g_i, h_i]$,
this gives:
$C_{i+2} = A_i$
$A_{i+2} = \mathrm{SHA256RNDS2}(C_i, A_i, X)$
$C_{i+4} = A_{i+2}$
$A_{i+4} = \mathrm{SHA256RNDS2}(C_{i+2}, A_{i+2}, Y)$
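The relation between the two halves of the state can be modelled in Python. The sketch below is an assumption-laden model of what SHA256RNDS2 computes, not the instruction itself: `sha256rnds2_model` is a hypothetical name, the tuples ignore the real lane order inside the XMM registers, and `X` is simply a pair of precomputed sums k + w.

```python
MASK = 0xFFFFFFFF


def rotr(x, r):
    return ((x >> r) | (x << (32 - r))) & MASK


def one_round(state, kw):
    """One SHA-256 round on the full state (a..h); kw is k_i + w_i."""
    a, b, c, d, e, f, g, h = state
    t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
          + ((e & f) ^ (~e & g)) + kw) & MASK
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
          + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def sha256rnds2_model(C, A, X):
    """Two rounds: C = (c, d, g, h), A = (a, b, e, f), X = two k + w sums.

    Returns the new (a, b, e, f); the new (c, d, g, h) is simply the old A.
    """
    a, b, e, f = A
    c, d, g, h = C
    state = (a, b, c, d, e, f, g, h)
    for kw in X:
        state = one_round(state, kw)
    return (state[0], state[1], state[4], state[5])


# Arbitrary starting values, chosen only to exercise the model.
A = (0x6A09E667, 0xBB67AE85, 0x510E527F, 0x9B05688C)
C = (0x3C6EF372, 0xA54FF53A, 0x1F83D9AB, 0x5BE0CD19)
X = (0x11111111, 0x22222222)

A_next, C_next = sha256rnds2_model(C, A, X), A   # C_{i+2} = A_i
```

Only one half of the state has to be computed per call; the other half is a register rename, which is why the instruction pairs naturally with two alternating registers.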
[Figure: running time (cycles per byte, logarithmic scale from 2^1 to 2^10) and
speedup (1× to 5×).]
Can we do better?
[Figure: pipeline diagram over clock cycles 0 to 26, illustrating instruction
latency and throughput.]
Example:
Calculating four hashes (pipelined) is 20% faster than a sequential
implementation.
[Figure: vertical scale 1.0 to 2.0, plotted against message size (256, 4K, 64K
and 1M bytes).]
SHA3 Implementation
SHA-3 comprises four hash functions and two extendable-output functions
(XOFs), SHAKE128 and SHAKE256. The input is split into blocks of r bits; the
larger the bit rate r, the faster the execution.
Extendable-Output Function
Sponge Construction
Initializing: The state has 1,600 bits, all initialized to 0; the input is
split into blocks of r bits.
Absorbing: Each block is added (XORed) to the first r bits of the state; then,
the state is processed by a permutation function P.
Permutation Function P
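The state has 1,600 bits and is represented by a 5 × 5 matrix S; each entry
of the matrix is a 64-bit word.
$$S = \begin{pmatrix} s_0 & s_1 & s_2 & s_3 & s_4 \\ s_5 & s_6 & s_7 & s_8 & s_9 \\ s_{10} & s_{11} & s_{12} & s_{13} & s_{14} \\ s_{15} & s_{16} & s_{17} & s_{18} & s_{19} \\ s_{20} & s_{21} & s_{22} & s_{23} & s_{24} \end{pmatrix}; \quad S[x, y] = s_{5x+y} \text{ for } 0 \le x, y < 5.$$
P applies 24 rounds, each composed of the steps θ, ρ, π, χ and ι.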
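The sponge and the permutation P can be put together in a short Python model and checked against `hashlib`. This is a sketch for illustration only, written in the style of compact Keccak reference code rather than the vectorized code discussed next. The names `keccak_f` and `sha3_256_sketch` are hypothetical. The state is a flat list of 25 lanes in the order s0..s24; the loop variables follow the usual Keccak convention in which lane (x, y) sits at index x + 5y, so the letters play the opposite roles to the slide's S[x, y].

```python
import hashlib

M64 = (1 << 64) - 1


def rol(x, n):
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & M64


def keccak_f(s):
    """24 rounds of theta, rho, pi, chi, iota on 25 lanes."""
    rc = 1  # LFSR state that generates the iota round constants
    for _ in range(24):
        # theta: mix each lane with the parities of two neighbouring columns
        c = [s[x] ^ s[x + 5] ^ s[x + 10] ^ s[x + 15] ^ s[x + 20] for x in range(5)]
        d = [c[(x - 1) % 5] ^ rol(c[(x + 1) % 5], 1) for x in range(5)]
        s = [s[i] ^ d[i % 5] for i in range(25)]
        # rho and pi: rotate each lane and move it to its new position
        x, y = 1, 0
        cur = s[1]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            s[x + 5 * y], cur = rol(cur, (t + 1) * (t + 2) // 2), s[x + 5 * y]
        # chi: non-linear step along each row of five lanes
        for y in range(0, 25, 5):
            row = s[y:y + 5]
            for x in range(5):
                s[y + x] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: add the round constant to lane 0
        for j in range(7):
            rc = ((rc << 1) ^ ((rc >> 7) * 0x71)) % 256
            if rc & 2:
                s[0] ^= 1 << ((1 << j) - 1)
    return s


def sha3_256_sketch(msg):
    rate = 136  # r = 1088 bits for SHA3-256
    padded = bytearray(msg)
    padded.append(0x06)              # SHA-3 domain bits plus first padding bit
    while len(padded) % rate:
        padded.append(0)
    padded[-1] ^= 0x80               # final padding bit
    state = [0] * 25
    for off in range(0, len(padded), rate):   # absorbing
        block = padded[off:off + rate]
        for i in range(rate // 8):
            state[i] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        state = keccak_f(state)
    return b"".join(lane.to_bytes(8, "little") for lane in state[:4])


digest = sha3_256_sketch(b"abc")
matches = digest == hashlib.sha3_256(b"abc").digest()
```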
State representation with 256-bit registers:
Y0   s0   s1   s2   s3
Y1   s5   s6   s7   s8
Y2   s10  s11  s12  s13
Y3   s15  s16  s17  s18
Y4   s20  s21  s22  s23
Y5   s24  s24  s24  s24
Y6   s4   s9   s14  s19
• Yi: 256-bit vector registers.
• si: 64-bit words.
Pros:
• It uses just a few 256-bit vector registers.
Cons:
• The permutation instructions of AVX2 are expensive.
State representation with 128-bit registers:
X0   s0   s1        X7   s15  s16
X1   s2   s3        X8   s17  s18
X2   s5   s6        X9   s14  s19
X3   s7   s8        X10  s20  s21
X4   s4   s9        X11  s22  s23
X5   s10  s11       X12  s24  s24
X6   s12  s13
• The state uses 13 variables (X0 to X12) of 128 bits, two 64-bit words each.
• Pros:
• The permutation instructions of SSE4 are cheaper than those of AVX2.
• Cons:
• It uses more variables.
4-way implementation
• State representation.
[Figure: running time (cycles per byte, scale 0 to 18) on Haswell, Skylake and
Zen.]
[Figure: speedup (scale 1 to 3) versus number of messages (1 to 4) on Haswell,
Skylake and Zen.]
(2M) 128-bit vector instructions [SSE2/AVX].
(4M) 256-bit vector instructions [AVX2].
Elliptic Curves
• Introduction
• Point Multiplication kP
• Elliptic Curve Diffie-Hellman (X25519, X448)
• Digital Signature (EdDSA)
• Performance (vector instructions on Intel Haswell/Skylake)
• In 1985, Koblitz [8] and Miller [9] independently suggested the use of
elliptic curves for cryptographic purposes.
• ECC achieves the same security as RSA-based protocols using shorter
key sizes. For example, at the 128-bit security level:
• RSA uses keys of 3,072 bits.
• ECC uses keys of 256 bits.
• Applications of ECC:
• Key-agreement protocols.
• Digital signatures.
• Bitcoin.
• End-to-end encryption.
• Smart card security.
$E/\mathbb{F}_p : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$
Point Addition
Let P and Q be two points on the curve. Then P + Q can be computed with a
geometric construction: the line through P and Q meets the curve at a third
point, and reflecting that point across the x-axis gives P + Q.
[Figure: step-by-step construction of P + Q.]
Point Doubling
Point Multiplication kP
$kP = k_{n-1} 2^{n-1} P + \cdots + k_1 2P + k_0 P$
Input: P ∈ E and k ∈ Z+.
Output: kP
(k_{n−1}, . . . , k_1, k_0)_2 ← k
Q ← O
for i ← n − 1 to 0 do
    Q ← 2Q
    Q ← Q + k_i P
end for
return Q
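The double-and-add loop above can be tried on a toy curve. The sketch below assumes a small short-Weierstrass curve, y^2 = x^3 + 2x + 3 over F_97 with the point (3, 6), chosen only for illustration; the curve, the point and the function names are not from the talk. The branch on each key bit makes this version unsuitable for secret scalars.

```python
P_MOD = 97        # toy field size (illustrative assumption)
A_COEF = 2        # curve: y^2 = x^3 + 2x + 3 over F_97
B_COEF = 3
INFINITY = None   # the point at infinity O


def point_add(p, q):
    """Affine addition on the toy curve, including doubling and O."""
    if p is INFINITY:
        return q
    if q is INFINITY:
        return p
    x1, y1 = p
    x2, y2 = q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INFINITY                      # P + (-P) = O
    if p == q:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k, p):
    """Left-to-right double-and-add: Q <- 2Q, then Q <- Q + k_i P."""
    q = INFINITY
    for bit in bin(k)[2:]:
        q = point_add(q, q)
        if bit == "1":
            q = point_add(q, p)
    return q


def on_curve(p):
    if p is INFINITY:
        return True
    x, y = p
    return (y * y - (x ** 3 + A_COEF * x + B_COEF)) % P_MOD == 0


base = (3, 6)
result = scalar_mult(22, base)
```

`pow(x, -1, m)` for modular inverses requires Python 3.8 or later.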
Techniques for kP
Given two points, P and Q, the problem of finding an integer k such that
Q = kP is known as the elliptic curve discrete logarithm problem.
• Pollard's rho algorithm is the best known algorithm for solving the
ECDLP. Its complexity is $O\left(\sqrt{\#E(\mathbb{F}_p)}\right)$.
$E/\mathbb{F}_p : y^2 = x^3 - 3x + b$

            P-256                                 P-384
Security    128-bit                               192-bit
p           2^256 − 2^224 + 2^192 + 2^96 − 1      2^384 − 2^128 − 2^96 + 2^32 − 1
b           0x5ac635d...27d2604b                  0xb3312fa...d3ec2aef
#E          2^256 − 2^224 + 2^192 − 2^128 + t     2^384 − t
t           0xbce6faa...fc632551                  0x389cb27...333ad68d
RFC 7748 recommends two functions for computing a shared secret:
• X25519: keys of 32 bytes.
• X448: keys of 56 bytes.
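[Figure: key exchange between two parties; ←$ denotes uniform random sampling,
and the public values K_A and K_B are exchanged.]
a ←$ {0, 1}^256                  b ←$ {0, 1}^256
K_A ← X25519(9, a)               K_B ← X25519(9, b)
K = X25519(K_B, a)               K = X25519(K_A, b)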
The X Function
Montgomery ladder algorithm.
Input: P ∈ E and k ∈ Z+.
Output: kP
(k_{n−1} = 1, . . . , k_0)_2 ← k
Q0 ← 2P
Q1 ← P        (the pair starts swapped because k_{n−1} = 1)
for i ← n − 2 to 0 do
    b ← k_i ⊕ k_{i+1}
    Q0, Q1 ← cswap(b, Q0, Q1)
    Q0, Q1 ← 2Q0, Q0 + Q1
end for
Q0, Q1 ← cswap(k_0, Q0, Q1)
return Q0

Example: 22P, with 22 = (10110)_2. The table lists the pair in logical order,
that is, with any pending swap undone.
k_i     Q0      Q1
        O       P
1       P       2P
0       2P      3P
1       5P      6P
1       11P     12P
0       22P     23P
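For X25519, the ladder with conditional swaps can be sketched following the formulas of RFC 7748. This is a readable model only: Python integers are not constant time, the function names are made up for illustration, and the argument order (u-coordinate first, scalar second) follows the slide's X25519(9, a) rather than the RFC's X25519(k, u).

```python
P = 2**255 - 19
A24 = 121665          # (486662 - 2) / 4, curve constant used by the ladder


def cswap(bit, a, b):
    """Swap a and b when bit == 1, using a mask instead of a branch."""
    mask = -bit                    # 0 or an all-ones value
    t = mask & (a ^ b)
    return a ^ t, b ^ t


def ladder(k, u, bits=255):
    """x-only Montgomery ladder: returns the u-coordinate of k * (u, .)."""
    x1, x2, z2, x3, z3, swap = u, 1, 0, u, 1, 0
    for t in reversed(range(bits)):
        kt = (k >> t) & 1
        swap ^= kt                 # swap only when consecutive bits differ
        x2, x3 = cswap(swap, x2, x3)
        z2, z3 = cswap(swap, z2, z3)
        swap = kt
        a = (x2 + z2) % P
        aa = a * a % P
        b = (x2 - z2) % P
        bb = b * b % P
        e = (aa - bb) % P
        c = (x3 + z3) % P
        d = (x3 - z3) % P
        da = d * a % P
        cb = c * b % P
        x3 = (da + cb) ** 2 % P    # differential addition
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P           # doubling
        z2 = e * (aa + A24 * e) % P
    x2, x3 = cswap(swap, x2, x3)
    z2, z3 = cswap(swap, z2, z3)
    return x2 * pow(z2, P - 2, P) % P


def x25519(u_bytes, k_bytes):
    """Clamp the 32-byte scalar, decode u, and run the ladder."""
    k = bytearray(k_bytes)
    k[0] &= 248
    k[31] &= 127
    k[31] |= 64
    u = int.from_bytes(u_bytes, "little") & ((1 << 255) - 1)
    return ladder(int.from_bytes(k, "little"), u).to_bytes(32, "little")


BASE = (9).to_bytes(32, "little")
a_secret = bytes(range(32))          # illustrative secrets, not random
b_secret = bytes(range(32, 64))
k_a = x25519(BASE, a_secret)
k_b = x25519(BASE, b_secret)
shared_a = x25519(k_b, a_secret)
shared_b = x25519(k_a, b_secret)     # both sides derive the same value
```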
Let W be the machine's word size, and let field elements be represented in
radix 2^w; then there are two cases:
• w = W: Full-radix or saturated arithmetic.
• w < W: Reduced-radix, redundant representation, unsaturated arith...
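The difference between the two cases shows up as soon as two elements are added limb by limb. The sketch below assumes W = 64 and compares w = 64 (four limbs) with w = 51 (five limbs) for 255-bit values; the limb sizes and helper names are illustrative choices, not taken from the talk.

```python
W = 64  # machine word size assumed for this illustration


def to_limbs(x, w, n):
    """Split x into n limbs of w bits each, least significant first."""
    return [(x >> (w * i)) & ((1 << w) - 1) for i in range(n)]


def from_limbs(limbs, w):
    """Recombine limbs; limbs may exceed w bits (redundant representation)."""
    return sum(limb << (w * i) for i, limb in enumerate(limbs))


def add_limbs(a, b):
    """Limb-wise addition with no carry propagation."""
    return [x + y for x, y in zip(a, b)]


x = 2**255 - 20
y = 2**255 - 20

# Reduced radix (w = 51 < W): every limb sum still fits in a 64-bit word,
# so carries can be postponed.
reduced = add_limbs(to_limbs(x, 51, 5), to_limbs(y, 51, 5))
reduced_fits = all(limb < 2**W for limb in reduced)

# Full radix (w = W = 64): limb sums can exceed 64 bits, so each addition
# needs an immediate carry into the next limb.
full = add_limbs(to_limbs(x, 64, 4), to_limbs(y, 64, 4))
full_overflows = any(limb >= 2**W for limb in full)
```

The spare bits in each word are what allow vector instructions, which lack carry flags, to operate on several limbs at once.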
[Figure: running time (10^3 clock cycles, scale 25 to 125) on Haswell and
Skylake.]
[Figure: running time (10^3 clock cycles, scale 100 to 500).]
Digital Signatures
[Figure: signature generation hashes the message and signs it with the private
key. Signature verification checks the signature with the public key and
outputs Valid or Reject.]
EdDSA Scheme
A $b$-bit string $s$ encodes the point $(x, y)$ with
$$y = s \bmod 2^{b-1}, \qquad x = \sqrt{\frac{y^2 - 1}{d y^2 - a}}$$
such that $x \equiv s_{b-1} \pmod 2$.
• H, a hash function producing 2b bits.
• Ex: use of the SHAKE128 function which is part of the SHA3 standard.
EdDSA: Key Generation
sk ∈_R [0, ℓ)
h = (h_{2b−1}, . . . , h_0)_2 ← H(sk)
a ← 2^n + Σ_{c ≤ i < n} 2^i h_i        (a has n + 1 bits, with the bottom c bits cleared)
pk ← aB
return (sk, pk)
EdDSA: Signing
r ← H(h_H ‖ M) (mod ℓ)
R′ ← rB
R ← Encode(R′)
S ← r + H(R ‖ pk ‖ M) · a (mod ℓ)
return (R, S)
EdDSA: Verification
P ← Decode(pk)
h ← H(R ‖ pk ‖ M) (mod ℓ)
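The final check of the verification is not shown above. As a sketch, assuming the usual cofactorless form of the EdDSA check and writing $R' = \mathrm{Decode}(R)$, a signature produced by the signing steps satisfies the following identity, because $P = aB$ and $S = r + h \cdot a \pmod{\ell}$:

$$SB = (r + h a)B = rB + h(aB) = R' + hP$$

A verifier can therefore accept when $SB = R' + hP$ and reject otherwise. Some specifications multiply both sides by the cofactor $2^c$ before comparing.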
[Figure: running time (10^3 cycles, scale 20 to 100) on Haswell and Skylake.]
[Figure: running time (10^3 cycles, scale 40 to 160).]