Llama - A Low Latency Math Library For Secure Inference
Inference algorithms that contain mathematical functions such as sigmoid and tanh are more expensive to compute securely; e.g. running a recurrent neural network (RNN) [26] on the standard Google-30 dataset [40] to identify commands, directions, and digits from speech costs 415 MB, ≈ 60,000 rounds, and ≈ 37 seconds [30].

1.1 Function Secret Sharing based 2PC

A recently emerging trend is to build secure computation protocols with low online cost through the technique of function secret sharing (FSS) [5, 7]. In this paradigm, a trusted dealer provides input-independent correlated randomness to the two parties executing the secure computation protocol. Here, round- and communication-optimal 2PC protocols are known for many functions such as addition, multiplication, and comparison, leading to 2PC protocols with low online cost for various algorithms [3, 7, 34]. This paradigm, which yields dramatic improvements in online cost, is also practically well motivated [2, 4, 10, 21, 22] – typically the vendor providing or building the 2PC solution for the client and server in our above example is trusted to provide correct 2PC code, and hence providing correlated randomness is not an additional burden.

While FSS-based 2PC protocols for secure computation [7], fixed-point arithmetic [3], and secure inference [34] are known, these works suffer from the following drawbacks.

Lack of math functions support. First, they lack support for math functions (e.g. sigmoid, tanh, reciprocal square root) and thus are incapable of handling benchmarks such as the RNN described above¹. Recent work [30] provides a secure computation library for precisely computing various math functions by designing approximate functionalities that simultaneously have low error and are also efficient to compute securely in 2PC. Unfortunately, the protocols used to compute these functionalities require high online communication and rounds (due to the highly sequential nature of the functionalities themselves) and hence, as we note later, the online cost of computing them is still prohibitive. Furthermore, given this, even the functionalities themselves are not the optimal design choice to be computed securely using FSS techniques.

Lack of support for mixed bitwidth arithmetic. Second, most existing MPC frameworks support only uniform bitwidth arithmetic – i.e., a single bitwidth and scale are used uniformly for all fixed-point values and all secure computation must execute within these limits. On the other hand, state-of-the-art converters such as Tensorflow Lite [16] and Shiftry [24], which quantize floating-point models to corresponding fixed-point ones for efficiency, use low bitwidth representations (e.g. 8 or 16). To obtain precise outputs of operators such as simple matrix multiplication with low bitwidth inputs and outputs, these libraries rely on higher bitwidths for intermediate results such as element-wise multiplications and accumulations in a dot product, to avoid overflows. Later, these intermediate results are quantized to the right output format. However, without support for such precise computation on low bitwidths (which implicitly requires mixed bitwidth arithmetic), systems such as [9] were forced to perform secure computation over large bitwidths, resulting in high performance overhead. The work of [30] was the first to provide 2PC protocols for such precise low bitwidth arithmetic, but these protocols have high online interaction and communication.

Erroneous ReLU and truncations. Finally, the only prior FSS-based secure inference library, AriaNN [34], uses protocols for ReLU activations² and truncations of fixed-point values (required to maintain scale) that are only probabilistically correct. This has two drawbacks: first, since this error probability is inversely proportional to the bitwidth used, the protocols are forced to use larger bitwidths to ensure that the error does not affect the accuracy of the resulting inference algorithm; second, as the number of instances of ReLU/truncation in the algorithm increases, the probability of the overall computation being incorrect increases and hence, as noted in multiple works [9, 19, 25], on large benchmarks such probabilistic computations would almost certainly provide incorrect outputs.

1.2 Our Contributions

In this work, we address the above drawbacks and present LLAMA – an end-to-end semi-honest secure

1 Recent work of [37], which focuses on application of FSS to recommendation systems, considered only reciprocal square root. We provide a comparison with this work in Section 1.3.
2 The ReLU activation is defined as ReLU(x) = max(x, 0).
2PC inference system (in the trusted dealer model) that is based on FSS techniques. LLAMA has significantly lower online complexity compared to previous works on 2-party secure inference [28, 30, 31] and is much more expressive than AriaNN [34], the only prior secure inference system based on FSS. We discuss our technical contributions and the features of LLAMA below.

Precise Low Bitwidth Computation. First, LLAMA supports precise secure computation over low bitwidth values. As discussed, this requires computing intermediary results over higher bitwidths and then appropriately reducing these results to lower bitwidths. Supporting this requires providing FSS-based protocols for two crucial functionalities – Zero/Signed-Extension (to increase bitwidth) as well as Truncate-Reduce (to decrease both bitwidth and scale). We design a single-round FSS protocol, also known as an FSS gate, for each of these operations; they act as building blocks in our other protocols (Section 3). Next, we build on these and design protocols for element-wise multiplications and matrix multiplications (and convolutions) that implement the arithmetic logic used in fixed-point implementations of converters [16, 24] (Section 4).

Low Bitwidth Splines and Math Functions. Second, we provide "FSS-friendly" low bitwidth approximations to math functions (such as sigmoid, tanh and reciprocal square root) that are provably precise. We use ULPs³ as the measure of preciseness of math implementations, which is also used by standard math libraries, and ensure that our implementations have at most 4 ULPs of error (similar to math libraries and the 2PC work SIRNN [30]). As already mentioned, the approximate functionalities provided in SIRNN are highly sequential and would lead to a large number of rounds in the online phase even when implemented with FSS-based techniques. Hence, we deviate significantly from SIRNN in our design choice and instead use low bitwidth piecewise polynomials or splines to approximate our math functions⁴. However, standard tools for finding splines result in floating-point splines. We convert these to fixed-point splines keeping FSS costs in mind. Next, once we have such a spline, as discussed in Section 5, we need to evaluate it efficiently using FSS techniques. Here, we build upon the work of [3] that provided an FSS protocol for uniform bitwidth spline evaluation and extend it to protocols for low bitwidth splines. Further, LLAMA uses two novel optimizations over [3] that significantly reduce keysize as well as PRG invocations during the online phase of that protocol. As an example, for the spline approximating sigmoid, our techniques reduce keysize by 4× and PRG invocations by 40×.

ReLU, Truncation, Pooling. Third, unlike [34], LLAMA uses correct protocols for comparison and ReLU from [3]. We build on these to provide correct protocols for average pool, maxpool, and argmax (Appendix C). Next, unlike [34], in LLAMA all types of truncations (bitwidth preserving or bitwidth reducing) and comparisons are faithful, resulting in correct computation. With this, our FSS implementations are bitwise equivalent to the corresponding cleartext fixed-point implementations. This enables us to execute large benchmarks with no fear of incorrect outputs (even when using very small bitwidths in some cases).

E2E Inference System. Fourth, we integrate LLAMA as a cryptographic backend to the EzPC framework [8]. Together, all of the above enable us to execute various benchmarks securely – large CNNs (such as ResNet-50 on the ImageNet dataset) using the CrypTFlow toolchain [25] as well as RNNs (e.g. [26] on the Google-30 dataset) using the SIRNN toolchain [30]. For almost all our benchmarks, we obtain at least an order of magnitude reduction in online communication, rounds, and runtimes (Section 6), thus obtaining a low latency framework for secure inference.

Let us revisit the two examples presented earlier in the introduction. Running LLAMA on the RNN [26] over the Google-30 dataset costs only roughly 8.6 MB of online communication, 2600 online rounds, and 1.9 seconds, resulting in roughly 48×, 22×, and 19× improvements in communication, rounds, and performance over SIRNN [30]. Similarly, executing the CNN model [27] on the CIFAR-10 dataset costs 8.25 MB of online communication and 0.5 seconds, resulting in approximately 24× and 5× improvements in online communication and time over DELPHI [28]. We now proceed to introduce all background technical information in the next section.

3 Informally, ULP (unit of least precision) is the number of representable values between our result and the output over reals. It is widely accepted as the standard for measuring accuracy in numeric calculations [14]. See Section 5 for more details.
4 We note that although prior works such as SecureML [29] also used a spline to approximate sigmoid, the spline used had only 3 pieces and leads to very high ULP error and hence a high degradation in classification accuracy, as shown in [30].
1.3 Other Related Works

The work on FSS-based 2PC protocols for fixed-point arithmetic by Boyle et al. [3] provides FSS gates for various building blocks such as ReLU, arithmetic/logical right shift, and splines. However, [3] lacks support for FSS gates for signed-extension and truncate-reduce and hence does not support mixed-bitwidth operations and precise math functions over small bitwidths.

A recent work by Vadapalli et al. [37] also uses a spline to compute reciprocal square root and realizes it using a DPF (Distributed Point Function) [6]. Our work and [37] present a trade-off between online computation and key size. The online compute in LLAMA grows proportionally to the number of intervals of the spline. On the other hand, the online compute of [37] grows exponentially with the input bitwidth. (For reciprocal square root with 16-bit inputs, in the online phase, LLAMA makes 1448 AES evaluations, and [37] makes 131072 AES evaluations.) However, the key size in LLAMA is higher than that in [37]. (For reciprocal square root with 16-bit inputs, the key size in LLAMA is nearly 5KB, compared to around 0.3KB in [37].) Since the primary benefit of FSS-based techniques is to reduce online complexity, LLAMA would perform better than [37].

…For x ∈ S_N, we use the notation x[i] to represent the i-th bit from the LSB in the 2's complement representation of x, such that the LSB is x[0] and the MSB is x[n − 1]. We also use x[i,j) ∈ Z_{2^{j−i}} (where j > i) to denote the number formed by the bitstring x[j − 1], x[j − 2] … x[i].

Fixed-point representation. Real numbers are encoded into Z_N using fixed-point notation. The fixed-point numbers are parameterized by two values: a bitwidth n and a scale s. The first n − s bits and the last s bits correspond to the integer part and the fractional part, respectively. We have y = fix_{n,s}(x) = ⌊x · 2^s⌋ mod N. To convert a fixed-point integer y to its real counterpart x, we have x = urt_{n,s}(y) = uint(y)/2^s if y ∈ U_N and x = srt_{n,s}(y) = sint(y)/2^s if y ∈ S_N.

2.1 Function Secret Sharing (FSS)

An FSS scheme [5, 6] is a pair of algorithms, namely Gen and Eval. Gen splits a secret function f : G^in → G^out into a pair of functions f_0 and f_1. For the party identifier σ ∈ {0, 1}, Eval evaluates the function f_σ on a given input x ∈ G^in. While correctness of an FSS scheme requires that f_0(x) + f_1(x) = f(x) for all x ∈ G^in, security requires that each f_σ hides f.
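As an illustrative sketch (helper names mirror the notation above; this is not LLAMA code), the fixed-point encoding and decoding maps can be exercised as follows:

```python
import math

# Sketch of the fixed-point maps fix_{n,s}, sint, and srt_{n,s} above.

def fix(x, n, s):
    """fix_{n,s}(x) = floor(x * 2^s) mod 2^n."""
    return math.floor(x * 2**s) % 2**n

def sint(y, n):
    """Signed value of y in 2's complement on n bits."""
    return y - 2**n if y >= 2**(n - 1) else y

def srt(y, n, s):
    """srt_{n,s}(y) = sint(y) / 2^s."""
    return sint(y, n) / 2**s

y = fix(-1.25, 16, 12)         # bitwidth n = 16, scale s = 12
assert y == 60416              # (-1.25 * 4096) mod 2^16
assert srt(y, 16, 12) == -1.25 # decoding recovers the real value
```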
…for every sequence (f̂_λ)_{λ∈N} of polynomial-size function descriptions from F and polynomial-size input sequence x_λ for f_λ, the outputs of the following experiments Real and Ideal are computationally indistinguishable:
– Real_λ: (k_0, k_1) ← Gen(1^λ, f̂_λ); Output k_σ.
– Ideal_λ: Output Sim_σ(1^λ, Leak(f̂_λ)).

2.2 2PC with preprocessing via FSS

Threat Model. We consider 2PC with a trusted dealer, secure against a static PPT adversary that corrupts one of the parties. That is, the adversary is computationally bounded, corrupts one of the parties at the beginning of the protocol, and follows the protocol specification. The dealer gives out the input-independent correlated randomness to both parties in the offline phase. Given the correlated randomness, the parties engage in a 2PC protocol in the online phase. We consider the standard simulation paradigm for semi-honest security.

Boyle et al. [7] construct 2PC protocols (in a trusted dealer model, where the dealer provides correlated randomness to the 2 parties⁵) via FSS. The high-level idea is as follows: For each wire w_i in the circuit to be computed, the dealer picks a mask value r_i uniformly at random. Denote the cleartext value at w_i by x_i. The protocol maintains the invariant that the 2 parties hold masked values for the input wires of a gate, and run an FSS evaluation protocol to learn the masked value of the output wire with one round of simultaneous message exchange. In more detail, to compute a gate g_ij with input and output wires w_i and w_j, parties start with x_i + r_i and end up with x_j + r_j after one round of interaction. Moreover, the size of the message exchanged is the same as the bitwidth of x_j. To enable this, the dealer gives out FSS keys for the offset function g_ij^{[r^in, r^out]}(x) = g_ij(x − r_i) + r_j in the pre-processing phase. In the online phase, the parties compute their share of the function on the masked input x_i + r_i to obtain secret shares of the value x_j + r_j, which they reconstruct to obtain the masked output value. For the input wires, the dealer simply sends the mask of each wire to its respective owner, who, on receiving the mask, adds it to the input and sends the result to the other party. For the output wire, the dealer sends the mask to both parties. For more details on the construction of 2PC protocols using FSS, we refer the reader to [7]. Now we formally define FSS gates:

Definition 3 (FSS Gates [3]). Let G = {g : G^in → G^out} be a computation gate (parameterized by input and output groups G^in, G^out). The family of offset functions Ĝ of G is given by

Ĝ := { g^{[r^in, r^out]} : G^in → G^out | g : G^in → G^out ∈ G, r^in ∈ G^in, r^out ∈ G^out },

where g^{[r^in, r^out]}(x) := g(x − r^in) + r^out, and g^{[r^in, r^out]} contains an explicit description of r^in, r^out. Finally, we use the term FSS gate for G to denote an FSS scheme for the corresponding offset family Ĝ.

2.3 Prior FSS schemes and FSS gates

In this section, we discuss the FSS schemes and FSS gates constructed in prior works [3, 6] that serve as building blocks for us. A distributed comparison function is an FSS scheme for special intervals as defined below.

Definition 4 (DCF [3, 6]). A special interval function f^<_{α,β}, also referred to as a comparison function, outputs β if x < α and 0 otherwise. We refer to an FSS scheme for comparison functions as a distributed comparison function (DCF). Analogously, the function f^≤_{α,β} outputs β if x ≤ α and 0 otherwise. In all of these cases, we allow the default leakage Leak(f̂) = (G^in, G^out).

For α ∈ {0, 1}^n and β ∈ G, we use Gen^<_n(1^λ, α, β, G) and Eval^<_n(b, k_b, x) to denote the keygen and evaluation algorithms for DCF.

Theorem 1 (Concrete cost of DCF [3]). Let λ be the security parameter, G be an Abelian group, and ℓ = ⌈log |G|⌉. Given a PRG G : {0, 1}^λ → {0, 1}^{4λ+2}, there exists a DCF for f^<_{α,β} : {0, 1}^n → G with key size n(λ + ℓ + 2) + λ + ℓ bits. For ℓ′ = ⌈ℓ/(4λ+2)⌉, the key generation algorithm Gen^<_n invokes G at most 2n(1 + 2ℓ′) + 2ℓ′ times and the evaluation algorithm Eval^<_n invokes G at most n(1 + ℓ′) + ℓ′ times. In the special case that |G| = 2^c for c ≤ λ, the number of PRG invocations in Gen^<_n is 2n and the number of PRG invocations in Eval^<_n is n.

A Dual Distributed Comparison Function (DDCF) is a function f^DDCF_{α,β1,β2} : {0, 1}^n → G which returns β1 if the input value is less than α and β2 otherwise. DDCF can be described in…

5 …using standard 2PC, but in this work we focus on the 2PC with a trusted dealer setting.
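To make the wire-masking invariant concrete, here is a minimal Python sketch for a *linear* gate, where additive shares of the offset constant already suffice as "keys" (illustrative names; real FSS gates for non-linear functions such as comparisons require DCF-based keys instead):

```python
import random

# Toy illustration of the masked-wire invariant of [7]: parties hold
# x + r_in on the input wire and use shares of the offset function
# g^{[r_in, r_out]}(x) = g(x - r_in) + r_out to obtain x + r_out on the
# output wire. For g(x) = c*x the offset function is affine, so additive
# shares of (r_out - c*r_in) act as the FSS keys.

N = 2**16
c = 3                                  # public gate g(x) = c * x

# Dealer (offline phase): sample wire masks, share the offset constant.
r_in = random.randrange(N)
r_out = random.randrange(N)
k0 = random.randrange(N)
k1 = (r_out - c * r_in - k0) % N       # k0 + k1 = r_out - c*r_in (mod N)

# Online phase: both parties hold the masked input wire value x + r_in.
x = 1234
x_hat = (x + r_in) % N
share = [(b * c * x_hat + [k0, k1][b]) % N for b in (0, 1)]  # party b

# One round of exchange reconstructs the masked output wire value.
y_hat = (share[0] + share[1]) % N
assert y_hat == (c * x + r_out) % N    # invariant: y_hat = g(x) + r_out
```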
Zero and Signed-Extension functions are used to extend the bitwidths of unsigned and signed numbers, respectively. More precisely, for an m-bit number x ∈ U_M (resp. x ∈ S_M), Zero-Extension (resp. Signed-Extension) to n bits (n > m) is defined by y = ZExt(x, m, n) ∈ U_N (resp. y = SExt(x, m, n) ∈ S_N), such that uint_n(y) = uint_m(x) (resp. sint_n(y) = sint_m(x)) holds. In the discussion that follows, we only consider the case of Signed-Extension and note that the protocol for Zero-Extension can be derived similarly.

The Signed-Extension gate G^SExt is the family of functions g_{SExt,m,n} : S_M → S_N parameterized by input group G^in = S_M and output group G^out = S_N, and defined as g_{SExt,m,n}(z) := sint_m(z) mod N. For this gate (and further gates described in the coming sections), we use the following equation for sint_m from [30]:

sint_m(z) = z′ − 2^{m−1}, for z′ = z + 2^{m−1} mod M    (1)

We denote the corresponding offset gate class by Ĝ^SExt and the offset functions by:

ĝ^{[r^in, r^out]}_{SExt,m,n}(x) = g_{SExt,m,n}(x − r^in) + r^out = sint_m((x − r^in) mod M) + r^out

Eval^SExt_{m,n}(b, k_b, x):
1: Parse k_b as k′_b ∥ r_b.
2: Set x′ ← x + 2^{m−1} mod M.
3: Set t ← M · Eval^<_m(b, k′_b, x′).
4: return u_b = b · x′ + r_b + t.

Fig. 1. FSS Gate for Signed-Extension G^SExt; b refers to the party id.

Theorem 2. There is an FSS gate (Gen^SExt_{m,n}, Eval^SExt_{m,n}) for G^SExt with a keysize of n bits plus the keysize of DCF_{m,S_N}, and 1 invocation of DCF_{m,S_N} in Eval^SExt_{m,n}.

Proof. For b ∈ {0, 1}, the simulator Sim_b for the signed-extension gate is given x and u_b (the input and output of the ideal functionality). It first invokes Sim′_b, the simulator for DCF over m bits, which outputs a DCF key k′_b. It then computes x′ and t in the same manner as Steps 2 and 3 of Eval^SExt_{m,n} in Figure 1. Finally, it computes r_b = u_b − b · x′ − t and outputs k′_b ∥ r_b. One can easily see that the output of Sim_b is computationally indistinguishable from k_b, the output of Gen^SExt_{m,n}.
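Equation (1) and the Signed-Extension semantics can be checked exhaustively in cleartext; a small illustrative sketch (m = 8, n = 16 chosen for the example):

```python
# Exhaustive cleartext check of Equation (1) and of the
# Signed-Extension semantics g_SExt,m,n(z) = sint_m(z) mod N.

m, n = 8, 16
M, N = 2**m, 2**n

def sint(z, bits):
    """Signed value of z in 2's complement on `bits` bits."""
    return z - 2**bits if z >= 2**(bits - 1) else z

for z in range(M):
    z_prime = (z + 2**(m - 1)) % M
    assert sint(z, m) == z_prime - 2**(m - 1)   # Equation (1)
    y = sint(z, m) % N                          # y = SExt(z, m, n)
    assert sint(y, n) == sint(z, m)             # extension preserves value
```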
Table 1. Description and costs for FSS gates from [3, 7]. (Table body not shown; its spline gate is parameterized by knots p_0, …, p_m with p_0 = p_m = N − 1 and p_i < p_{i+1} for 0 < i < m.)
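As an aside, the concrete cost formulas of Theorem 1 can be instantiated numerically; a small sketch with illustrative parameters (λ = 128, n = 16, G = Z_{2^64}):

```python
import math

# Plugging concrete numbers into Theorem 1: DCF key size and PRG-call
# counts for security parameter lam = 128 and output group G = Z_{2^64}
# (so l = 64). Parameter choices are illustrative.

lam, n, l = 128, 16, 64

key_bits = n * (lam + l + 2) + lam + l        # n(λ+ℓ+2) + λ + ℓ
l_prime = math.ceil(l / (4 * lam + 2))        # ℓ' = ⌈ℓ/(4λ+2)⌉
gen_calls = 2 * n * (1 + 2 * l_prime) + 2 * l_prime
eval_calls = n * (1 + l_prime) + l_prime

print(key_bits, gen_calls, eval_calls)        # 3296 98 33
# Since |G| = 2^64 and 64 <= lam, the special case of Theorem 1 applies
# here, dropping the counts to 2n = 32 (Gen) and n = 16 (Eval).
```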
…sint_m(a) ∗_ℓ sint_n(b). Intuitively, this says that the finite-bit signed values a and b are lifted to Z and multiplied (note that taking a modulo with L for ℓ = m + n does not lose any bits).

For a′ = a + 2^{m−1} mod M and b′ = b + 2^{n−1} mod N, using Equation (1) we get:

g_{smult,m,n}(a, b) = (a′ − 2^{m−1}) · (b′ − 2^{n−1}) mod L
                   = a′ · b′ − a′ · 2^{n−1} − b′ · 2^{m−1} + 2^{m+n−2} mod L

We denote the corresponding offset gate class by Ĝ^smult and the offset functions by:

ĝ^{[r1^in, r2^in, r^out]}_{smult,m,n}(x, y) = g_{smult,m,n}(x − r1^in, y − r2^in) + r^out
  = ((x′ − r1^in) mod M − 2^{m−1}) · ((y′ − r2^in) mod N − 2^{n−1}) + r^out mod L

where x′ = x + 2^{m−1} mod M and y′ = y + 2^{n−1} mod N. We used Equation (1) in the above expression. Now, using Equation (2) we have:

ĝ^{[r1^in, r2^in, r^out]}_{smult,m,n}(x, y) = (x′ − r1^in − 2^{m−1} + 2^m · 1{x′ < r1^in}) · (y′ − r2^in − 2^{n−1} + 2^n · 1{y′ < r2^in}) + r^out mod L

…for G^smult, which has a total keysize of 3(m + n) bits plus the key sizes of DCF_{m,S_{2N}} and DCF_{n,S_{2M}}, and requires 1 invocation each of DCF_{m,S_{2N}} and DCF_{n,S_{2M}} in Eval^smult_{m,n}.

Remark. We use the above protocol for signed multiplication to realize the element-wise multiplications occurring in Hadamard Product layers in our benchmarks.

4.2 Matrix Multiplication and Convolution

Consider signed matrix multiplication of A ∈ S_M^{d1×d2} and B ∈ S_N^{d2×d3}. In the resulting matrix C = A · B, each element is the result of d2 multiplications and d2 − 1 additions. Even if we store the result of each multiplication in a larger ring S_{2^{m+n}}, similar to signed multiplication, and do the additions over these, the result can still overflow due to the additions. To avoid such an overflow, underlying libraries assume that the computation is happening over a sufficiently large domain. Note that ℓ = m + n + ⌈log d2⌉ suffices to avoid any overflows during the whole dot product computation. Even though the compute looks similar to what we discussed in the last section, matrix multiplication…
…output in n bits and scale 2s, followed by truncation by s using the arithmetic right shift gate from [3] to adjust the output scale to s (see Table 1 for the costs).

5 Math Functions

In this section, we first discuss our novel FSS-based protocols for precise math functions for the same input and output domains considered in [30]. To quantify how precise our math function implementations are, we use the standard notion of ULP error (defined formally in [14]) that we discuss below. Then, we provide a high-level overview of the design of our math functions, followed by mixed-bitwidth splines that are crucial to obtain low-bitwidth splines that are good approximations to math functions. Finally, we provide details on FSS-friendly math function design for popular activations – sigmoid and tanh – used in neural networks.

ULP error. It is impossible to represent an irrational value exactly using a finite number of bits. Therefore, it is important to quantify the deviation between the exact real result and the output of a math library in finite-bit representation. There are various notions of error one can use – absolute, relative, and ULP error. Standard math libraries use ULP error as a metric to determine whether the real output of a math function is approximately close to the finite-bit output that the library produces [1, 38]. The lower the ULP value, the higher the precision and accuracy of the implementation of that math function. At a high level, the ULP error between the exact real result r1 and library output r2 is equal to the number of representable values between r1 and r2 [14]. We use the same notion to quantify the precision of our math functions, which use fixed-point as the finite-bit representation.

Design of math functions. Although our techniques are general, for a high-level discussion, let us assume that we want to approximate sigmoid within 4 ULPs of error over fixed-point inputs and outputs with bitwidth 16 and scale 12. There are multiple design choices possible in coming up with such an implementation. For instance, SIRNN [30] used the recipe of first obtaining a good initial approximation followed by Goldschmidt's iterations, where the number of iterations depends on the final output scale precision desired. However, this approach leads to a large number of online rounds and communication due to the iterative nature of the algorithm. Our first design choice is to use piecewise polynomials or splines to approximate math functions, as these allow for one-round, low-communication protocols using FSS techniques [3, 7]. However, we notice that, for obvious reasons, uniform bitwidth splines cannot be used to obtain low ULP errors. In particular, for the above-mentioned case, we cannot find a reasonable spline that uses coefficients on 16 bits, does all arithmetic over 16 bits, and provides at most 4 ULP error for scale 12. Similar to [30], the polynomial computation in splines needs to happen with intermediate results in higher bitwidths, and the final result needs to be truncated (to reduce bitwidth and adjust scale). Here, intuitively, while evaluating polynomials one keeps accumulating bitwidth and only reduces the final result.

We discuss the details of our mixed-bitwidth splines followed by our protocols for precise math functions.

5.1 Mixed-bitwidth Splines

We first discuss the cleartext functionality for the mixed-bitwidth splines followed by their FSS implementation. For ease of exposition, we discuss the splines over integers (no scale) and later show how to handle fixed-point arithmetic that has an associated scale with each value.

Suppose the spline under consideration is composed of m polynomials f1, f2, …, fm (with degree d and coefficients of bitwidth nc), and a set of m + 1 knots P. Let the input to the spline be x ∈ S_{N_I}, where the bitwidth of x is n_I and N_I = 2^{n_I}. Let n be a sufficiently large bitwidth that prevents overflow of values during polynomial evaluation. Note that n = nc + d · n_I suffices for this purpose. The functionality g^mixed_{spline,(n_I,nc),m,d,P,{fi}_i} : S_{N_I} → S_N for mixed-bitwidth splines is defined as follows. First, sign extend the input x from n_I bits to n, resulting in sint(x) mod N. Then, sign extend the coefficients of all polynomials from nc bits to n, and sign extend all knots from n_I bits to n. Finally, output the result of the uniform bitwidth spline functionality on input sint(x) using the new (sign extended) coefficients and knots. Note that this evaluation procedure (mod N) is the same as doing all the computation over Z.

Now we describe a simple (yet non-optimized) 2-round FSS-based protocol for this spline evaluation. For the first round, parties call the FSS gate for Signed-Extension g_{SExt,n_I,n} with input x ∈ S_{N_I} masked by r^in, and reconstruct x̄ = sint(x − r^in) + r^temp mod N ∈ S_N. Here, r^temp ∈ S_N is chosen randomly by the dealer during key generation. For the second round, let f̄_i be the polynomial corresponding to f_i with (publicly known) coefficients sign extended to n bits from nc bits. Also, let P̄ be the sign extended (publicly known) knots to n bits.
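The cleartext functionality above can be sketched for a toy spline (the pieces, knots, and parameters below are illustrative, not from the paper):

```python
# Cleartext mixed-bitwidth spline sketch over integers: sign extend the
# input (and, implicitly, the coefficients and knots) from nI to
# n = nc + d*nI bits, select the piece by the knots, evaluate mod N.

nI, nc, d = 8, 8, 1
n = nc + d * nI                       # 16 bits: no overflow in a0 + a1*x
NI, N = 2**nI, 2**n

def sint(z, bits):
    return z - 2**bits if z >= 2**(bits - 1) else z

knots = [-128, 0, 128]                # m + 1 = 3 knots (sign extended)
coeffs = [[3, 2], [1, -1]]            # piece i: a_{i,0} + a_{i,1} * x

def spline_mixed(x):
    xs = sint(x, nI)                  # sign extended input
    for i, (a0, a1) in enumerate(coeffs):
        if knots[i] <= xs < knots[i + 1]:
            return (a0 + a1 * xs) % N # mod N equals computation over Z

y = spline_mixed(200)                 # 200 encodes sint_8(200) = -56
assert sint(y, n) == 2 * (-56) + 3    # first piece: f1(-56) = -109
```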
Next, parties call the FSS gate for uniform-bitwidth splines g_{spline,n,m,d,P̄,{f̄i}_i} from [3]. The input to this gate is x̄ ∈ S_N from the first round, masked by r^temp. The output is the spline evaluated on (x̄ − r^temp), masked by r^out ∈ S_N.

5.1.1 Optimizations

We propose two optimizations to the protocol described above that significantly reduce the FSS key size and the offline and online computational cost.

Optimization 1. Here, we reduce the FSS key size for the spline gate used in the second round. In the above construction, the DCF key used in the spline operates over inputs from S_N (obtained after sign extending the original input) and uses the sign extended knots during key generation by the dealer and evaluation by the servers. Our observation is as follows: even though the parties need to learn shares of the coefficients of the correct polynomial to be evaluated in the larger domain, i.e., S_N, the DCF input (and hence, its depth, etc.) and the knots themselves can come from the original domain S_{N_I}, where N_I = 2^{n_I}. In particular, we replace DCF_{n, S_N^{m(d+1)}} in the above unoptimized scheme by DCF_{n_I, S_N^{m(d+1)}} and evaluate it on the original input x instead of x̄. This reduces the key size and the number of PRG calls made in key generation and evaluation of the uniform bitwidth spline gate (used in the second round of our protocol) by a factor of n/n_I. For instance, for the case of sigmoid, this reduction factor is 4×.

Optimization 2. This optimization significantly reduces the number of PRG calls made by the servers during the online phase. Recall that the FSS gate for g_{spline,n_I,m,d,P,{f̄i}_i} used in the second round uses a DCF with a payload of m(d+1)n bits. This DCF key gets evaluated m times and the output of each invocation is m(d+1)n bits. Let the output of the i-th invocation be s_1^{(i)}, …, s_m^{(i)} (using the same notation as Figure 5 in [3]). We observe that only s_{i−1}^{(i)}, s_i^{(i)}, i.e., 2(d+1)n bits, are used during evaluation and the other values are discarded. The Gen^<_n algorithm of the DCF construction of [3] generates a key k_b for each party b ∈ {0, 1} such that k_b consists of a random seed s_b and n + 1 correction words CW_i ∈ G for i ∈ {1 … n + 1} (where G denotes the output group). The seed generates a binary tree with 2^n leaves and each node is associated with a tuple (s_b, t_b, V_b), with the invariant that the sums of V_b along the evaluation paths for an input x form secret shares of f^<_{α,β}(x). Hence, in Eval^<_n, it suffices to perform this addition along the respective path to get the desired output. In the case of splines, where G is a vector of group elements, these additions are performed element-wise. Since we need only 2 out of the m elements in the output of Eval^<_n, we can tweak Eval^<_n to only do the required additions and PRG calls. This change reduces the number of PRG calls in the spline evaluation by roughly a factor of m/2. For the above example of sigmoid, this factor is roughly 10×.

Overall performance improvement. The two optimizations discussed above are compatible with each other and together lead to roughly a factor n/n_I reduction in FSS key size, an n/n_I reduction in PRG calls by the dealer in key generation, and an n/n_I · m/2 reduction in PRG calls by the servers in the online phase. For the case of sigmoid, this amounts to a 4× reduction in key size, 4× fewer PRG calls by the dealer, and 40× fewer PRG calls by the servers.

We now discuss how we can easily extend our protocol to work over fixed-point arithmetic as required.

5.1.2 Fixed-point arithmetic

In the context of fixed-point arithmetic, let the scale of the input x be s_I, and that of the spline coefficients be s_c. In symbolic notation, let f_i(x) = Σ_{j=0}^{d} a_{i,j} · x^j. The first round of our FSS protocol remains the same. Note that during polynomial evaluation in the spline, we require all the summands to have the same scale. This requires a small change at the dealer as follows: the dealer sign extends the coefficients from nc to n bits and also left shifts a_{i,j} by (d − j) · s_I bits. So, during evaluation, the scale of each summand of the polynomial is the same, viz. s = s_c + d · s_I. Now, after running the above described protocol, we have the output with bitwidth n and scale s.

Next, suppose the desired output is required to have bitwidth n_O and scale s_O. We do this adjustment in the final round as follows: we can safely assume that s_O ≤ s since, to obtain a precise output with scale s_O, the scales of the coefficients will have to be appropriately large as well. Now, in the third round, we reduce the scale of the output by tr = s − s_O and adjust the bitwidth to n_O by using appropriate truncation operations, as discussed in the "fixed-point mixed-bitwidth matrix multiplication" paragraph in Section 4.2. The complete 3-round FSS protocol is described in Appendix E.

Below, we summarize the key size and evaluation cost of the FSS protocol for mixed-bitwidth splines
over fixed-point, g^{(mixed,fixed)}_{spline,(n_I,s_I,n_O,s_O,nc,sc),m,d,P,{fi}_i}. (Let G^{(mixed,fixed)-spline} denote the corresponding function family, parameterized accordingly.) After Optimization 2, our protocol for spline evaluation uses the underlying DCF key in a non-black-box manner; hence, we report the cost of this step in the number of PRG calls made. Also, we report the cost when the third round of the protocol uses a Truncate-Reduce gate. The other case is similar.

Theorem 5. Let params = (n_I, s_I, n_O, s_O, nc, sc), n = nc + d · n_I, tr = sc + d · s_I − s_O. There is a 3-round FSS protocol (Gen^{(mixed,fixed)-spline}_{params,m,d,P}, Eval^{(mixed,fixed)-spline}_{params,m,d,P}) for mixed-bitwidth splines over fixed-point that has a total key size of 2mn(d+1) + n bits, plus the key size of DCF_{n_I, S_N^{m(d+1)}}, plus the key sizes of the FSS gates for g_{SExt,n_I,n} and g_{TR,n,tr}. Let ℓ = ⌈2n(d+1)/(4λ+2)⌉, where λ is the security parameter. The online phase makes single evaluations of the Sign-Extension and Truncate-Reduce gates and at most m(n_I(1 + ℓ) + ℓ) calls to the PRG G (used in the DCF) during spline evaluation (in the second round).

5.2 Math Functions

In this section, we discuss our approach for computing math functions using FSS techniques – in particular, sigmoid and tanh. We use mixed-bitwidth splines over fixed-point as approximations to math functions that can be realized directly using the 3-round protocol described in the previous section. Below, we discuss how…

…In the third step, we quantize this spline, i.e., represent it over fixed-point as follows: we quantize the knots to have the same bitwidth and scale as our inputs. We linearly search over bitwidths and scales for the coefficients. For a choice of nc, sc, we exhaustively run the mixed-bitwidth spline cleartext algorithm for all inputs and check their ULP error w.r.t. the output of a high-precision math library [13]. We crucially note that since sigmoid (and also tanh) are well-behaved functions with bounded outputs, and the output scale s_O ≤ 14, this exhaustive testing is feasible. If the maximum ULP error is ≤ 4, we stop. Otherwise, we increase the value of either nc or sc until nc = 32. If we do not find a good approximation, we increase the number of knots, m, up to 100; even if this is unsuccessful, we increment the degree d and go back to the second step of spline finding.

Following the above procedure, we successfully find splines with d = 2, m ≤ 52 for the sigmoid function for input and output scales such that 0 ≤ s_I, s_O ≤ 14, and this suffices for our benchmarks as well as the benchmarks considered in prior works. As expected from a 2D graph of sigmoid, we were unable to find linear splines for sigmoid (and also tanh) with even 100 knots.

Tanh. Over the reals, tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}) and tends to −1 for small values of x and 1 for large values. Our procedure for tanh is identical to that for sigmoid except for a straightforward change to clipping in terms of outputs on inputs with large magnitude.
we obtain the required splines for each of the math Reciprocal square root. Over reals, rsqrt(x) =
√
functions. We use sigmoid to illustrate this. 1/ x, x > 0. To avoid division-by-zero error when x
is very small, we assume that all inputs to rsqrt satisfy
Sigmoid. Over the reals, sigmoid(x) = 1/(1 + e−x ) and x > , where is a small public constant. As in the
tends to 0 for small values of x and tends to 1 for large case of SIRNN [30], we set = 0.1. The procedure to
values of x. Our task is as follows: Given the bitwidths find splines is similar to sigmoid and tanh with one
and scales for the inputs and outputs, find a spline that difference. We observe that since the precision of input
approximates the real result with at most 4ULPs of er- x is sI , it suffices to compute the output with preci-
ror (see beginning of this section for the definition of sion sI /2. Hence, the ULP error of spline obtained is
ULP error). The first step is to clip the input domain to computed over bitwidth nO and scale dsI /2e, instead
an interesting interval as follows: we find the largest xL of sO . Later, we adjust the scale of spline output to sO
and the smallest xR such that if we set sigmoid(x) = 0 by left-shifting the output by (sO − dsI /2e) bits.
(with appropriate fixed-point representation of outputs)
for all x 6 xL and set sigmoid(x) = 1 for all x > xR , the Sample choice of parameters. Table 2 lists the
resulting ULP error 6 4. In the second step, we start choice of our spline parameters that give at most 4 ULP
with a choice of degree of the polynomials, d, and num- error for various configurations of math functions re-
ber of knots, m, and run an off-the-shelf tool Octave quired by our benchmarks in Section 6.2. In Appendix
[12] to find a best fit spline for sigmoid for the reduced B, we provide fixed-point values of the coefficients and
domain. Note that this step, returns a floating-point intervals of a mixed-bitwidth spline for tanh.
spline, i.e., both polynomial coefficients as well as knots
are floating-point values.
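The exhaustive ULP check at the heart of the spline quantization step can be sketched in cleartext Python. This is our own illustrative harness, not LLAMA's code: the bitwidths are toy-sized (n_I = 8, s_I = 4), double-precision sigmoid stands in for the high-precision reference library [13], and the "candidate" evaluator shown simply rounds the true result (so its worst-case error is 0 ULPs); a real candidate would evaluate the quantized spline instead, and the search accepts it if the worst case is ≤ 4.

```python
import math

def to_fxp(x, scale, bits):
    """Round a real number to fixed-point with the given scale, clamped
    to the signed range of the given bitwidth."""
    q = round(x * (1 << scale))
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, q))

def max_ulp_error(eval_fxp, n_i, s_i, n_o, s_o, ref):
    """Exhaustively compare a fixed-point evaluator against a real-valued
    reference, measuring error in units of the output's last place (ULPs)."""
    worst = 0
    for u in range(1 << n_i):                                  # all n_i-bit inputs
        q_in = u - (1 << n_i) if u >= 1 << (n_i - 1) else u    # signed decode
        x = q_in / (1 << s_i)
        want = to_fxp(ref(x), s_o, n_o)                        # best n_o-bit answer
        got = eval_fxp(q_in)
        worst = max(worst, abs(got - want))
    return worst

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# A "perfect" evaluator (round the real result) has 0 ULP error; a quantized
# spline candidate plugged in here would be accepted if the result is <= 4.
perfect = lambda q_in: to_fxp(sigmoid(q_in / (1 << 4)), 14, 16)
print(max_ulp_error(perfect, 8, 4, 16, 14, sigmoid))  # -> 0
```

Because the input domain has only 2^{n_I} points, the exhaustive loop is cheap, which is exactly why the search described above is feasible for low-bitwidth inputs.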
Layer                     Batch  Technique  Offline Comm. (KB)  Online Comm. (KB)  Online Rounds  Offline LAN (ms)  Online LAN (ms)
Signed-Extension          100    LLAMA      35                  0.8                1              0.38 ± 0.14       0.37 ± 0.26
(m = 8, n = 21)           100    SIRNN      -                   30                 7              -                 4.5 ± 0.6
                          1000   LLAMA      352                 7.8                1              0.96 ± 0.68       0.81 ± 0.22
                          1000   SIRNN      -                   114                7              -                 5.73 ± 1.54
Truncate-Reduce           100    LLAMA      47                  0.2                1              0.47 ± 0.49       0.48 ± 0.43
(n = 21, s = 13)          100    SIRNN      -                   41                 13             -                 9.34 ± 6.34
                          1000   LLAMA      466                 2                  1              1.72 ± 1.91       0.89 ± 0.44
                          1000   SIRNN      -                   211                13             -                 13.65 ± 2.17
Sigmoid (nI = nO = 16,    100    LLAMA      3297                3.5                3              12.47 ± 5.5       4.09 ± 1.47
sI = 9, sO = 14)          100    SIRNN      -                   768                139            -                 91.96 ± 8.50
                          100    MP-SPDZ    3696                134                145            **                32.32 ± 8.12
                          1000   LLAMA      33044               35                 3              128.45 ± 46.91    27.05 ± 4.37
                          1000   SIRNN      -                   5007               139            -                 102.46 ± 8.06
                          1000   MP-SPDZ    5246                1308               145            **                52.10 ± 8.90
Tanh (nI = nO = 16,       100    LLAMA      1320                3.5                3              5.35 ± 3.88       2.81 ± 0.84
sI = sO = 9)              100    SIRNN      -                   604                131            -                 83.7 ± 8.26
                          100    MP-SPDZ    3696                137                155            **                35.74 ± 12.46
                          1000   LLAMA      13219               35                 3              51.06 ± 19.99     10.16 ± 3.44
                          1000   SIRNN      -                   3614               131            -                 88.07 ± 8.96
                          1000   MP-SPDZ    5246                1341               155            **                57.60 ± 8.80
Reciprocal square root    100    LLAMA      1138                3.5                3              4.80 ± 4.35       2.84 ± 1.10
(nI = nO = 16, sI = 12,   100    SIRNN      -                   881                185            -                 124.05 ± 10.95
sO = 11)                  100    MP-SPDZ    2457                44.4               87             **                22.11 ± 5.00
                          1000   LLAMA      11375               35                 3              41.20 ± 15.60     8.99 ± 1.79
                          1000   SIRNN      -                   5488               185            -                 126.03 ± 11.41
                          1000   MP-SPDZ    2467                413                87             **                28.92 ± 5.78

Table 3. Performance comparison for bitwidth changing and math functions. For Signed-Extension, m, n are input, output bitwidths. For Truncate-Reduce, n is input bitwidth and s is shift amount. For Sigmoid, Tanh and Reciprocal square root, nI, nO, sI, sO denote input/output bitwidths and scales. ** denotes that the value was not reported by the code.
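As a reference point for the two bitwidth-changing layers benchmarked in Table 3, their cleartext fixed-point semantics can be sketched as follows (our own illustrative Python; in LLAMA these are realized securely as the single-round FSS gates reported in the table):

```python
def sign_extend(x, m, n):
    """Reinterpret an m-bit two's-complement value as an n-bit value (m < n)."""
    assert 0 <= x < (1 << m)
    s = x - (1 << m) if x >= (1 << (m - 1)) else x   # signed decode
    return s % (1 << n)                              # re-encode in n bits

def truncate_reduce(x, n, s):
    """Arithmetic right shift of an n-bit value by s, output in n - s bits."""
    assert 0 <= x < (1 << n)
    v = x - (1 << n) if x >= (1 << (n - 1)) else x   # signed decode
    return (v >> s) % (1 << (n - s))                 # Python >> floors, i.e. arithmetic shift

# m = 8, n = 21 as in Table 3: the 8-bit encoding of -3 becomes -3 in 21 bits.
print(sign_extend(253, 8, 21) == (-3) % (1 << 21))                           # True
# n = 21, s = 13: dropping 13 fractional bits of -2**15 gives -4 in 8 bits.
print(truncate_reduce((-(1 << 15)) % (1 << 21), 21, 13) == (-4) % (1 << 8))  # True
```

Sign extension widens a value (e.g., before a high-bitwidth accumulation), while truncate-reduce both rescales and narrows it, which is why the pair appears around every mixed-bitwidth multiplication.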
In Table 3, the choices of parameters for bitwidths and scales are made using examples from our benchmarks such as Google-30 [26] and Heads [30] (described in Section 6.2), and we evaluate for batch sizes of 100 and 1000. For the math functions, for these choices of bitwidths and scales, Table 2 provides details on the spline chosen by LLAMA, i.e., degree and number of knots, as well as coefficient bitwidths and scales. From Table 3, LLAMA is up to 40× better than MP-SPDZ in online communication, up to 51× better in terms of online rounds, and up to 12× better in online execution time. As seen in Table 3, LLAMA communicates between 105 − 251× less than SIRNN in the online phase and has between 13 − 61× fewer rounds of online communication. In terms of performance, LLAMA is between 3.7 − 43× faster than SIRNN in the LAN setting. Finally, as expected, while the total communication of LLAMA (i.e., communication including the offline key size as well) can be comparable to SIRNN in a few cases (e.g., Truncate-Reduce), LLAMA does have a larger total communication (by up to 4.4×) in other cases.

Table 4 summarizes our microbenchmarks for mixed-bitwidth matrix multiplication (multiplying a d_1 × d_2 matrix with a d_2 × d_3 matrix). The input/output bitwidths for all experiments are 8, while the scale is 6; however, due to d_2 being different in each case, the intermediate bitwidth (16 + ⌈log d_2⌉) in each computation is different. As can be seen from the table, LLAMA communicates between 59 − 208× less than SIRNN in the online phase and has 10 − 13× fewer rounds. LLAMA also performs between 2.2 − 7.3× better in the LAN setting. Further, in these microbenchmarks, LLAMA also has 1.3 − 5.4× lower total communication.

6.2 Benchmarks

In this section, we evaluate and compare the performance of LLAMA on several machine learning inference algorithms. We provide details on the benchmarks considered in Appendix D and summarize the findings in Table 5. We split the discussion below into two kinds of benchmarks. Our main focus is networks with math functions or networks that use low bitwidths for activations and weights for efficiency. We also consider simple convolutional neural networks (CNNs) from prior works to demonstrate our generality and scalability.

Neural networks with math functions/mixed-bitwidth arithmetic. First, to illustrate the performance of LLAMA on algorithms that use mixed-bitwidth arithmetic and/or math functions (tanh, sigmoid, or reciprocal square root), we run it on the following end-to-end inference benchmarks: DeepSecure B4 [32], which enables embedded sensors to classify various physical activities, as well as an RNN algorithm [26] that enables keyword spotting on the Google-30 dataset [40]. We compare the performance of LLAMA with SIRNN and observe that LLAMA has up to 4 orders of magnitude lower online communication, up to 22× fewer online rounds, and up to 57× faster runtime. We also evaluate LLAMA on the sigmoid/tanh layers of the MiniONN LSTM [27] (a language model for word predictions) and the Industrial-72 benchmark [24, 30] (a model that provides feedback on the quality of shots in a sports game), as well as the reciprocal square root layers from the Heads model [35] (a model for counting the number of people in an image). Here, we show that, in comparison with SIRNN, the online communication of LLAMA is at least 200× less, the number of rounds is at least 43× better, and the performance is at least 15× and 43× better in the LAN and WAN settings.

Other neural networks. While not the primary focus of this work, for the sake of completeness, we also compare LLAMA with prior systems on neural networks not requiring mixed-bitwidth arithmetic or math functions. We compare with DELPHI [28] – a 2PC system designed specifically with online cost in mind – and show ≈ 24× better communication and ≈ 3 − 5× better runtime for the online phase. To illustrate that LLAMA can scale to large benchmarks, we run it on the ResNet-50 CNN on the ImageNet dataset [18], and compare with both CrypTFlow (a 3PC system) as well as CrypTFlow2 (a 2PC system).^6 Finally, we consider AriaNN [34], which like LLAMA is an FSS-based secure inference system (in the trusted dealer model), but does not support mixed-bitwidth arithmetic or math functions. Here alone, since the AriaNN code [33] does not support execution on different VMs, we ran all parties in both AriaNN and LLAMA on the same VM and appropriately set the latency and bandwidth on the VM using the tc command.^7 On the ResNet-18 benchmark on the Hymenoptera dataset, we show that LLAMA outperforms AriaNN by about 3× in online communication and 1.7× in online runtime (despite AriaNN using a probabilistically correct, cheaper, local truncation protocol compared to the correct truncation in LLAMA). This improvement can be attributed to ReLU being a 2-round protocol in AriaNN compared to an FSS gate in LLAMA. For fairness, we also provide numbers for LLAMA with the same probabilistically correct local truncation.

7 Conclusion

This paper proposes LLAMA, an FSS-based 2PC secure inference system in the semi-honest, trusted dealer setting. The main design goal of LLAMA is to minimize

^6 In very recent work, Cheetah [19] shows an improvement of ≈ 12× in communication and 4-5× in runtime over CrypTFlow2. We do not directly compare with this work, as it is orthogonal to the focus of this work; however, even in comparison to Cheetah, we note that LLAMA has much lower communication and is expected to outperform it.
^7 The end-to-end code execution time in AriaNN took around 40 minutes. In Table 5, we report the offline and online times (around 350 seconds and 13 seconds respectively) that are output by their code. Due to longer execution times, the mean and standard deviation of runtimes are calculated over 25 iterations.
Network                  Technique  Offline Comm. (MB)  Online Comm. (MB)  Online Rounds  Offline LAN (s)  Online LAN (s)  Offline WAN (s)  Online WAN (s)
DeepSecure B4            LLAMA      183                 0.15               21             0.78 ± 0.20      0.11 ± 0.01     3.15 ± 0.16      0.83 ± 0.05
                         SIRNN      -                   1844               379            -                6.45 ± 0.31     -                47.63 ± 1.67
Google-30                LLAMA      882                 8.6                2687           5.31 ± 0.94      1.89 ± 0.12     8.02 ± 0.32      87.71 ± 3.65
                         SIRNN      -                   415                59899          -                37.18 ± 1.95    -                1997.8 ± 93.5
MiniONN LSTM             LLAMA      49.7                0.04               6              0.21 ± 0.07      0.02 ± 0.003    0.49 ± 0.23      0.23 ± 0.01
(only Sigmoid, Tanh)     SIRNN      -                   9.7                403            -                0.34 ± 0.02     -                14.47 ± 0.90
Industrial-72            LLAMA      19.4                0.03               42             0.09 ± 0.03      0.04 ± 0.006    0.30 ± 0.04      1.41 ± 0.07
(only Sigmoid, Tanh)     SIRNN      -                   7.9                1847           -                1.23 ± 0.08     -                61.36 ± 3.85
</gr-replace>
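The probabilistically correct local truncation mentioned in the AriaNN comparison of Section 6.2 (a SecureML-style technique [29]) can be demonstrated in a few lines. This is our own toy model of the idea over 2-out-of-2 additive shares, not AriaNN's implementation: each party arithmetically shifts its own share locally, which costs no rounds, but the reconstruction is only correct up to an off-by-one error, and fails badly with probability roughly |x|/2^n due to share wrap-around.

```python
import random

def local_truncate_errors(x, s, n, trials=1000):
    """Secret-share x, let each party arithmetically shift its own share
    by s, reconstruct, and record the deviation from the true x >> s."""
    N = 1 << n
    def ars(u):                       # arithmetic right shift of a ring element
        v = u - N if u >= N // 2 else u
        return (v >> s) % N
    errs = []
    for _ in range(trials):
        x0 = random.randrange(N)      # fresh random sharing of x each trial
        x1 = (x - x0) % N
        y = (ars(x0) + ars(x1)) % N
        errs.append((y - (x >> s)) % N)
    return errs

errs = local_truncate_errors(x=1 << 20, s=10, n=32)
near = sum(e in (0, 1, (1 << 32) - 1) for e in errs)
print(near, "of", len(errs), "reconstructions within 1 of the true shift")
```

The rare large errors are exactly what LLAMA's faithful (but more expensive) truncation gates avoid.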
[9] Dalskov, A.P.K., Escudero, D., Keller, M.: Secure evaluation of quantized neural networks. Proc. Priv. Enhancing Technol. (2020)
[10] Damgård, I., Pastro, V., Smart, N.P., Zakarias, S.: Multiparty computation from somewhat homomorphic encryption. In: CRYPTO (2012)
[11] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009)
[12] Eaton, J.W., Bateman, D., Hauberg, S., Wehbring, R.: GNU Octave version 6.1.0 manual: a high-level interactive language for numerical computations (2020), https://[Link]/software/octave/doc/v6.1.0/
[13] Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., Zimmermann, P.: MPFR: A multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw. (2007)
[14] Goldberg, D.: What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. (1991)
[15] Goldreich, O., Micali, S., Wigderson, A.: How to Play any Mental Game or A Completeness Theorem for Protocols with Honest Majority. In: STOC (1987)
[16] Google: Tensorflow Lite (2019), [Link]org/lite/
[17] Gopinath, S., Ghanathe, N., Seshadri, V., Sharma, R.: Compiling kb-sized machine learning models to tiny iot devices. In: PLDI (2019)
[18] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
[19] Huang, Z., jie Lu, W., Hong, C., Ding, J.: Cheetah: Lean and fast secure two-party deep neural network inference. In: USENIX Security (2022)
[20] Juvekar, C., Vaikuntanathan, V., Chandrakasan, A.: GAZELLE: A Low Latency Framework for Secure Neural Network Inference. In: USENIX Security (2018)
[21] Kaissis, G., Ziller, A., Passerat-Palmbach, J., Ryffel, T., Usynin, D., Trask, A., Lima, I., Mancuso, J., Jungmann, F., Steinborn, M., Saleh, A., Makowski, M.R., Rueckert, D., Braren, R.: End-to-end privacy preserving deep learning on multi-institutional medical imaging. Nat. Mach. Intell. (2021)
[22] Kamara, S., Mohassel, P., Raykova, M., Sadeghian, S.S.: Scaling private set intersection to billion-element sets. In: Christin, N., Safavi-Naini, R. (eds.) FC (2014)
[23] Keller, M.: MP-SPDZ: A versatile framework for multi-party computation. In: ACM CCS (2020)
[24] Kumar, A., Seshadri, V., Sharma, R.: Shiftry: RNN Inference in 2KB of RAM. In: OOPSLA (2020)
[25] Kumar, N., Rathee, M., Chandran, N., Gupta, D., Rastogi, A., Sharma, R.: CrypTFlow: Secure TensorFlow Inference. In: IEEE S&P (2020)
[26] Kusupati, A., Singh, M., Bhatia, K., Kumar, A., Jain, P., Varma, M.: FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network. In: NeurIPS (2018)
[27] Liu, J., Juuti, M., Lu, Y., Asokan, N.: Oblivious neural network predictions via minionn transformations. In: CCS (2017)
[28] Mishra, P., Lehmkuhl, R., Srinivasan, A., Zheng, W., Popa, R.A.: DELPHI: A cryptographic inference service for neural networks. In: USENIX Security (2020)
[29] Mohassel, P., Zhang, Y.: SecureML: A system for scalable privacy-preserving machine learning. In: IEEE S&P (2017)
[30] Rathee, D., Rathee, M., Goli, R.K.K., Gupta, D., Sharma, R., Chandran, N., Rastogi, A.: SIRNN: A math library for secure RNN inference. In: IEEE S&P (2021)
[31] Rathee, D., Rathee, M., Kumar, N., Chandran, N., Gupta, D., Rastogi, A., Sharma, R.: CrypTFlow2: Practical 2-party secure inference. In: CCS (2020)
[32] Rouhani, B.D., Riazi, M.S., Koushanfar, F.: Deepsecure: Scalable provably-secure deep learning. In: DAC (2018)
[33] Ryffel, T., Pointcheval, D., Bach, F.: ARIANN: Low-interaction privacy-preserving deep learning via function secret sharing. [Link] (2020)
[34] Ryffel, T., Pointcheval, D., Bach, F.: ARIANN: Low-interaction privacy-preserving deep learning via function secret sharing. PoPETS (2022)
[35] Saha, O., Kusupati, A., Simhadri, H.V., Varma, M., Jain, P.: RNNPool: Efficient non-linear pooling for RAM constrained inference. In: NeurIPS (2020)
[36] Soin, A., Bhatu, P., Takhar, R., Chandran, N., Gupta, D., Alvarez-Valle, J., Sharma, R., Mahajan, V., Lungren, M.P.: Production-level open source privacy preserving inference in medical imaging. CoRR (2021)
[37] Vadapalli, A., Bayatbabolghani, F., Henry, R.: You may also like... privacy: Recommendation systems meet pir. In: PoPETs (2021)
[38] Wang, E., Zhang, Q., Bo, S., Zhang, G., Lu, X., Wu, Q., Wang, Y.: Intel math kernel library (2014)
[39] Wang, F., Yun, C., Goldwasser, S., Vaikuntanathan, V., Zaharia, M.: Splinter: Practical private queries on public data. In: Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (2017)
[40] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv (2018)
[41] Yao, A.C.: How to generate and exchange secrets. In: FOCS (1986)

A Unsigned multiplication

Unsigned multiplication of two values x ∈ U_M, y ∈ U_N refers to the multiplication of the two values uint_m(x) and uint_n(y) carried out in the group U_L, where L = M · N, which is equivalent to uint_m(x) ∗_ℓ uint_n(y). The unsigned multiplication gate G^umult is the family of functions g_{umult,m,n} : U_M × U_N → U_L parameterized by input group G^in = U_M × U_N and output group G^out = U_L, and given by g_{umult,m,n}(x, y) := uint_m(x) ∗_ℓ uint_n(y).
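As a concrete instance of this gate (our own toy code, with m = n = 8): the operands are multiplied in the output group of L = M · N elements, where the product of an m-bit and an n-bit unsigned value always fits, so no overflow occurs.

```python
def umult(x, y, m, n):
    """g_umult,m,n: multiply an m-bit by an n-bit unsigned value in the
    output group U_L with L = M * N = 2**(m + n)."""
    M, N_, L = 1 << m, 1 << n, 1 << (m + n)
    assert 0 <= x < M and 0 <= y < N_
    return (x * y) % L   # never actually wraps: x*y <= (M-1)*(N_-1) < L

print(umult(200, 250, 8, 8))   # -> 50000, exact even though inputs are 8-bit
```

This is the cleartext behavior that low-bitwidth matrix multiplication relies on: products are formed in a wide group and only later truncated back to the output format.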
We denote the corresponding offset gate class by Ĝ^umult, with the offset functions defined accordingly.

[Table (Appendix B): spline endpoints (Left, Right) and spline coefficients a_2, a_1, a_0 of each piece a_2 x^2 + a_1 x + a_0.]
where 0 < d < N. Let ⟨a⟩_0, ⟨a⟩_1 ∈ S_N denote additive secret shares of a over S_N, i.e., ⟨a⟩_0 and ⟨a⟩_1 are random elements of S_N subject only to the constraint that (⟨a⟩_0 + ⟨a⟩_1) mod N = a. The following theorem [31] allows expressing rdiv(a, d) in terms of ⟨a⟩_0 and ⟨a⟩_1.

Theorem 6 (Division of ring element [31]). Let shares of a ∈ S_N be ⟨a⟩_0, ⟨a⟩_1 ∈ S_N, for some N = N_1·d + N_0 ∈ Z, where N_1, N_0, d ∈ Z and 0 ≤ N_0 < d < N. Let the unsigned representations of a, ⟨a⟩_0, ⟨a⟩_1 in S_N lifted to integers be a_u, a_0, a_1 ∈ {0, 1, . . . , N − 1}, respectively, such that a_0 = a_0^1·d + a_0^0 and a_1 = a_1^1·d + a_1^0, where a_0^1, a_0^0, a_1^1, a_1^0 ∈ Z and 0 ≤ a_0^0, a_1^0 < d. Let N′ = ⌈N/2⌉. Let corr, A, B, C ∈ Z be defined as

  corr = −1 if (a_u ≥ N′) ∧ (a_0 < N′) ∧ (a_1 < N′); 1 if (a_u < N′) ∧ (a_0 ≥ N′) ∧ (a_1 ≥ N′); 0 otherwise,
  A = a_0^0 + a_1^0 − (1{a_0 ≥ N′} + 1{a_1 ≥ N′} − corr) · N_0,
  B = idiv(a_0^0 − 1{a_0 ≥ N′} · N_0, d) + idiv(a_1^0 − 1{a_1 ≥ N′} · N_0, d),
  C = 1{A < d} + 1{A < 0} + 1{A < −d}.

Then, rdiv(a, d) = rdiv(⟨a⟩_0, d) + rdiv(⟨a⟩_1, d) + (corr · N_1 + 1 − C − B) mod N.

In the FSS setting, the dealer holds r^in, r^out ∈ S_N, while the two parties P_0 and P_1 hold x ∈ S_N, with the goal of computing rdiv(x − r^in, d) + r^out. We set ⟨a⟩_0 = x and ⟨a⟩_1 = −r^in in Theorem 6 (i.e., a = x − r^in mod N). We first compute A in the above theorem. To do this, we use the following fact (from [31]): let w = 1{a_0 + a_1 ≥ N}; then corr = 1{a_0 ≥ N′} + 1{a_1 ≥ N′} − w − 1{a ≥ N′}. Now, using DCF_{n,S_N}, P_0 and P_1 can compute shares of w = 1{a_0 + a_1 ≥ N} = 1{N − 1 − a_0 < a_1}. Similarly, shares of 1{a ≥ N′} can be computed using the FSS gate for g_{IC,N′,N−1}. These two computations can be done in parallel in the first round, and from this, the parties can compute shares of Ā = A + r^temp ∈ S_N, where r^temp ∈ S_N is a random mask chosen by the dealer.

In the second round, the parties first locally compute shares of B. Next, they reconstruct Ā, and then, along with an FSS gate for G^sCMP from [3], compute C. Shares of rdiv(x − r^in, d) + r^out can then be computed locally from shares of B, C and corr. The full FSS protocol for signed division is given in Figure 5.

Theorem 7. There is a 2-round FSS protocol (Gen^div_{n,d}, Eval^div_{n,d}) for G^div which has a total key size of 6n bits, plus the key size of DCF_{n,S_N}, plus the key sizes of FSS gates for g_{IC,n,⌈N/2⌉,N−1} and g_{sCMP,n}. This protocol requires 1 invocation of DCF_{n,S_N}, 1 invocation of Eval^IC_{n,⌈N/2⌉,N−1} and 3 invocations of Eval^sCMP_n.

Remark. In the special case when d is a power of 2, we have rdiv(a, d) = (a ≫_A log_2 d), i.e., an arithmetic right shift by log_2 d, and it is more efficient to use the (single-round) arithmetic right-shift (ARS) gate from [3] to perform signed division.

Average pool. The family of functions G^avgpool to compute the average of d elements is defined as g_{avgpool,n,d}(x_1, x_2, . . . , x_d) = (Σ_{i=1}^d x_i)/d = rdiv(Σ_{i=1}^d x_i, d) ∈ S_N, where x_1, x_2, . . . , x_d ∈ S_N. It is straightforward to derive a 2-round FSS protocol for G^avgpool from the protocol for signed division.

Theorem 8. There is a 2-round FSS protocol (Gen^avgpool_{n,d}, Eval^avgpool_{n,d}) for G^avgpool which has the same key size and evaluation cost as (Gen^div_{n,d}, Eval^div_{n,d}).

C.2 ReLU, Maxpool and Argmax

For the ReLU function, we use the FSS gate for g_{ReLU,n} from the work of [3]. With this gate, one can easily construct an FSS gate to compute the maximum of two elements by defining the function in terms of ReLU – i.e., g_{max,n}(x_1, x_2) = ReLU(x_1 − x_2) + x_2. We then build upon this to construct an FSS protocol for Maxpool (i.e., the function that computes the maximum of d elements) by computing the maximum of 2 elements at a time in a tree-like manner, resulting in (d − 1) comparisons done over ⌈log d⌉ rounds. Finally, Argmax (which computes the index with the maximum value out of d elements) is computed in a similar manner to Maxpool, in 2⌈log d⌉ rounds.

Theorem 9. There is a ⌈log d⌉-round FSS protocol (Gen^maxpool_{n,d}, Eval^maxpool_{n,d}) for maxpool on d elements, which has a total key size of n(d − 1) bits plus (d − 1) times the key size of the FSS gate for g_{ReLU,n}, and requires (d − 1) invocations of Eval^ReLU_n.

Theorem 10. There is a 2⌈log d⌉-round FSS protocol (Gen^argmax_{n,d}, Eval^argmax_{n,d}) for G^argmax which has a total key size of n(d − 1) bits, plus the key size of the FSS protocol for g_{maxpool,n,d}, plus (d − 1) times the key sizes of FSS protocols for g_{sCMP,n} and g_{×,n}. The protocol requires (d − 1) invocations of Eval^ReLU_n, Eval^sCMP_n and Eval^×_n each.
Fixed-point mixed-bitwidth spline protocol (Gen^(mixed,fixed)-spline_{(n_I,s_I,n_O,s_O,n_c,s_c),m,d,{p_i}_i}, Eval^(mixed,fixed)-spline_{(n_I,s_I,n_O,s_O,n_c,s_c),m,d,{p_i}_i})

Gen^(mixed,fixed)-spline_{(n_I,s_I,n_O,s_O,n_c,s_c),m,d,{p_i}_i}(1^λ, {f_i}_i, r^in, r^out):