Llama - A Low Latency Math Library For Secure Inference
Inference algorithms that contain mathematical functions such as sigmoid and tanh are more expensive to compute securely; e.g. running a recurrent neural network (RNN) [26] on the standard Google-30 dataset [40] to identify commands, directions, and digits from speech costs 415 MB, ≈ 60,000 rounds, and ≈ 37 seconds [30].

1.1 Function Secret Sharing based 2PC

A recently emerging trend is to build secure computation protocols with low online cost through the technique of function secret sharing (FSS) [5, 7]. In this paradigm, a trusted dealer provides input-independent correlated randomness to the two parties executing the secure computation protocol. Here, round- and communication-optimal 2PC protocols are known for many functions such as addition, multiplication, and comparison, leading to 2PC protocols with low online cost for various algorithms [3, 7, 34]. This paradigm, which yields dramatic improvements in online cost, is also practically well motivated [2, 4, 10, 21, 22] – typically the vendor providing or building the 2PC solution for the client and server in our above example is trusted to provide correct 2PC code, and hence providing correlated randomness is not an additional burden.

While FSS-based 2PC protocols for secure computation [7], fixed-point arithmetic [3], and secure inference [34] are known, these works suffer from the following drawbacks.

Lack of math functions support. First, they lack support for math functions (e.g. sigmoid, tanh, reciprocal square root) and thus are incapable of handling benchmarks such as the RNN described above¹. Recent work [30] provides a secure computation library for precisely computing various math functions by designing approximate functionalities that simultaneously have low error and are also efficient to compute securely in 2PC. Unfortunately, the protocols used to compute these functionalities require high online communication and rounds (due to the highly sequential nature of the functionalities themselves) and hence, as we note later, the online cost of computing them is still prohibitive. Furthermore, given this, even the functionalities themselves are not the optimal design choice to be computed securely using FSS techniques.

Lack of support for mixed bitwidth arithmetic. Second, most existing MPC frameworks support only uniform bitwidth arithmetic – i.e., a single bitwidth and scale are used uniformly for all fixed-point values and all secure computation must execute within these limits. On the other hand, state-of-the-art converters such as Tensorflow Lite [16] and Shiftry [24], which quantize floating-point models to corresponding fixed-point ones for efficiency, use low bitwidth representations (e.g. 8 or 16). To obtain precise outputs of operators such as simple matrix multiplication with low bitwidth inputs and outputs, these libraries rely on higher bitwidths for intermediate results such as element-wise multiplications and accumulations in a dot product, to avoid overflows. Later, these intermediate results are quantized to the right output format. However, without support for such precise computation on low bitwidths (which implicitly requires mixed bitwidth arithmetic), systems such as [9] were forced to perform secure computation over large bitwidths, resulting in high performance overhead. The work of [30] was the first to provide 2PC protocols for such precise low bitwidth arithmetic, but these protocols have high online interaction and communication.

Erroneous ReLU and truncations. Finally, the only prior FSS-based secure inference library, AriaNN [34], uses protocols for ReLU activations² and truncations of fixed-point values (required to maintain scale) that are only probabilistically correct. This has two drawbacks: first, since this error probability is inversely proportional to the bitwidth used, the protocols are forced to use larger bitwidths to ensure that the error does not affect the accuracy of the resulting inference algorithm; second, as the number of instances of ReLU/truncation in the algorithm increases, the probability of the overall computation being incorrect increases and hence, as noted in multiple works [9, 19, 25], on large benchmarks such probabilistic computations would almost certainly provide incorrect outputs.

1.2 Our Contributions

In this work, we address the above drawbacks and present LLAMA – an end-to-end semi-honest secure

1 Recent work of [37], which focuses on application of FSS to recommendation systems, considered only reciprocal square root. We provide a comparison with this work in Section 1.3.
2 The ReLU activation is defined as ReLU(x) = max(x, 0).
2PC inference system (in the trusted dealer model) that is based on FSS techniques. LLAMA has significantly lower online complexity compared to previous works on 2-party secure inference [28, 30, 31] and is much more expressive than AriaNN [34], the only prior secure inference system based on FSS. We discuss our technical contributions and the features of LLAMA below.

Precise Low Bitwidth Computation. First, LLAMA supports precise secure computation over low bitwidth values. As discussed, this requires computing intermediary results over higher bitwidths and then appropriately reducing these results to lower bitwidths. Supporting this requires providing FSS-based protocols for two crucial functionalities – Zero/Signed-Extension (to increase bitwidth) as well as Truncate-Reduce (to decrease both bitwidth and scale). We design a single-round FSS protocol, also known as an FSS gate, for each of these operations; they act as building blocks in our other protocols (Section 3). Next, we build on these and design protocols for element-wise multiplications and matrix multiplications (and convolutions) that implement the arithmetic logic used in fixed-point implementations of converters [16, 24] (Section 4).

Low Bitwidth Splines and Math Functions. Second, we provide "FSS-friendly" low bitwidth approximations to math functions (such as sigmoid, tanh and reciprocal square root) that are provably precise. We use ULPs³ as the measure of preciseness of math implementations, which is also used by standard math libraries, and ensure that our implementations have at most 4 ULPs of error (similar to math libraries and the 2PC work SIRNN [30]). As already mentioned, the approximate functionalities provided in SIRNN are highly sequential and would lead to a large number of rounds in the online phase even when implemented with FSS-based techniques. Hence, we deviate significantly from SIRNN in our design choice and instead use low bitwidth piecewise polynomials or splines to approximate our math functions⁴. However, standard tools for finding splines result in floating-point splines. We convert these to fixed-point splines keeping FSS costs in mind. Next, once we have such a spline, as discussed in Section 5, we need to evaluate it efficiently using FSS techniques. Here, we build upon the work of [3] that provided an FSS protocol for uniform bitwidth spline evaluation and extend it to protocols for low bitwidth splines. Further, LLAMA uses two novel optimizations over [3] that significantly reduce keysize as well as PRG invocations during the online phase of that protocol. As an example, for the spline approximating sigmoid, our techniques reduce keysize by 4× and PRG invocations by 40×.

ReLU, Truncation, Pooling. Third, unlike [34], LLAMA uses correct protocols for comparison and ReLU from [3]. We build on these to provide correct protocols for average pool, maxpool, and argmax (Appendix C). Next, unlike [34], in LLAMA all types of truncations (bitwidth preserving or bitwidth reducing) and comparisons are faithful, resulting in correct computation. With this, our FSS implementations are bitwise equivalent to the corresponding cleartext fixed-point implementations. This enables us to execute large benchmarks with no fear of incorrect outputs (even when using very small bitwidths in some cases).

E2E Inference System. Fourth, we integrate LLAMA as a cryptographic backend to the EzPC framework [8]. Together, all of the above enable us to execute various benchmarks securely – large CNNs (such as ResNet-50 on the ImageNet dataset) using the CrypTFlow toolchain [25] as well as RNNs (e.g. [26] on the Google-30 dataset) using the SIRNN toolchain [30]. For almost all our benchmarks, we obtain at least an order of magnitude reduction in online communication, rounds, and runtimes (Section 6), thus obtaining a low latency framework for secure inference.

Let us revisit the two examples presented earlier in the introduction. Running LLAMA on the RNN [26] over the Google-30 dataset costs only roughly 8.6 MB of online communication, 2600 online rounds, and 1.9 seconds, resulting in roughly 48×, 22×, and 19× improvements in communication, rounds, and performance over SIRNN [30]. Similarly, executing the CNN model [27] on the CIFAR-10 dataset costs 8.25 MB of online communication and 0.5 seconds, resulting in approximately 24× and 5× improvements in online communication and time over DELPHI [28]. We now proceed to introduce all background technical information in the next section.

3 Informally, ULP (unit of least precision) is the number of representable values between our result and the output over reals. It is widely accepted as the standard for measuring accuracy in numeric calculations [14]. See Section 5 for more details.
4 We note that although prior works such as SecureML [29] also used a spline to approximate sigmoid, the spline used had only 3 pieces and leads to very high ULP error and hence a high degradation in classification accuracy, as shown in [30].
1.3 Other Related Works

The work on FSS-based 2PC protocols for fixed-point arithmetic by Boyle et al. [3] provides FSS gates for various building blocks such as ReLU, arithmetic/logical right shift, and splines. However, [3] lacks support for FSS gates for signed-extension and truncate-reduce and hence does not support mixed-bitwidth operations and precise math functions over small bitwidths.

A recent work by Vadapalli et al. [37] also uses a spline to compute reciprocal square root and realizes it using a DPF (Distributed Point Function) [6]. Our work and [37] present a trade-off between online computation and key size. The online compute in LLAMA grows proportionally to the number of intervals of the spline. On the other hand, the online compute of [37] grows exponentially with the input bitwidth. (For reciprocal square root with 16-bit inputs, in the online phase, LLAMA makes 1448 AES evaluations, and [37] makes 131072 AES evaluations.) However, the key size in LLAMA is higher than that in [37]. (For reciprocal square root with 16-bit inputs, the key size in LLAMA is nearly 5KB, compared to around 0.3KB in [37].) Since the primary benefit of FSS-based techniques is to reduce online complexity, LLAMA would perform better than [37].

…For x ∈ S_N, we use the notation x[i] to represent the i-th bit from the LSB in the 2's complement representation of x, such that the LSB is x[0] and the MSB is x[n − 1]. We also use x[i,j) ∈ Z_{2^{j−i}} (where j > i) to denote the number formed by the bitstring x[j − 1], x[j − 2] … x[i].

Fixed-point representation. Real numbers are encoded into Z_N using fixed-point notation. The fixed-point numbers are parameterized by two values: a bitwidth n and a scale s. The first n − s bits and the last s bits correspond to the integer part and the fractional part, respectively. We have y = fix_{n,s}(x) = ⌊x · 2^s⌋ mod N. To convert a fixed-point integer y to its real counterpart x, we have x = urt_{n,s}(y) = uint(y)/2^s if y ∈ U_N and x = srt_{n,s}(y) = sint(y)/2^s if y ∈ S_N.

2.1 Function Secret Sharing (FSS)

An FSS scheme [5, 6] is a pair of algorithms, namely Gen and Eval. Gen splits a secret function f : G^in → G^out into a pair of functions f_0 and f_1. For the party identifier σ ∈ {0, 1}, Eval evaluates the function f_σ on a given input x ∈ G^in. While correctness of an FSS scheme requires that f_0(x) + f_1(x) = f(x) for all x ∈ G^in, security requires that each f_σ hides f.
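As an illustrative sketch (helper names mirror the notation above; this is not LLAMA code), the fixed-point encoding and decoding maps can be exercised as follows:

```python
import math

# Sketch of the fixed-point maps fix_{n,s}, sint, and srt_{n,s} above.

def fix(x, n, s):
    """fix_{n,s}(x) = floor(x * 2^s) mod 2^n."""
    return math.floor(x * 2**s) % 2**n

def sint(y, n):
    """Signed value of y in 2's complement on n bits."""
    return y - 2**n if y >= 2**(n - 1) else y

def srt(y, n, s):
    """srt_{n,s}(y) = sint(y) / 2^s."""
    return sint(y, n) / 2**s

y = fix(-1.25, 16, 12)         # bitwidth n = 16, scale s = 12
assert y == 60416              # (-1.25 * 4096) mod 2^16
assert srt(y, 16, 12) == -1.25 # decoding recovers the real value
```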
…for every sequence (f̂_λ)_{λ∈N} of polynomial-size function descriptions from F and polynomial-size input sequence x_λ for f_λ, the outputs of the following experiments Real and Ideal are computationally indistinguishable:
– Real_λ: (k_0, k_1) ← Gen(1^λ, f̂_λ); Output k_σ.
– Ideal_λ: Output Sim_σ(1^λ, Leak(f̂_λ)).

2.2 2PC with preprocessing via FSS

Threat Model. We consider 2PC with a trusted dealer, secure against a static PPT adversary that corrupts one of the parties. That is, the adversary is computationally bounded, corrupts one of the parties at the beginning of the protocol, and follows the protocol specification. The dealer gives out the input-independent correlated randomness to both parties in the offline phase. Given the correlated randomness, the parties engage in a 2PC protocol in the online phase. We consider the standard simulation paradigm for semi-honest security.

Boyle et al. [7] construct 2PC protocols (in a trusted dealer model, where the dealer provides correlated randomness to the 2 parties⁵) via FSS. The high-level idea is as follows: For each wire w_i in the circuit to be computed, the dealer picks a mask value r_i uniformly at random. Denote the cleartext value at w_i by x_i. The protocol maintains the invariant that the 2 parties hold masked values for the input wires of a gate, and run an FSS evaluation protocol to learn the masked value of the output wire with one round of simultaneous message exchange. In more detail, to compute a gate g_ij with input and output wires w_i and w_j, parties start with x_i + r_i and end up with x_j + r_j after one round of interaction. Moreover, the size of the message exchanged is the same as the bitwidth of x_j. To enable this, the dealer gives out FSS keys for the offset function g_ij^{[r^in, r^out]}(x) = g_ij(x − r_i) + r_j in the pre-processing phase. In the online phase, the parties compute their share of the function on the masked input x_i + r_i to obtain secret shares of the value x_j + r_j, which they reconstruct to obtain the masked output value. For the input wires, the dealer simply sends the mask of each wire to its respective owner, who, on receiving the mask, adds it to the input and sends the result to the other party. For the output wire, the dealer sends the mask to both parties. For more details on the construction of 2PC protocols using FSS, we refer the reader to [7]. Now we formally define FSS gates:

Definition 3 (FSS Gates [3]). Let G = {g : G^in → G^out} be a computation gate (parameterized by input and output groups G^in, G^out). The family of offset functions Ĝ of G is given by

Ĝ := { g^{[r^in, r^out]} : G^in → G^out | g : G^in → G^out ∈ G, r^in ∈ G^in, r^out ∈ G^out },

where g^{[r^in, r^out]}(x) := g(x − r^in) + r^out, and g^{[r^in, r^out]} contains an explicit description of r^in, r^out. Finally, we use the term FSS gate for G to denote an FSS scheme for the corresponding offset family Ĝ.

2.3 Prior FSS schemes and FSS gates

In this section, we discuss the FSS schemes and FSS gates constructed in prior works [3, 6] that serve as building blocks for us. A distributed comparison function is an FSS scheme for special intervals as defined below.

Definition 4 (DCF [3, 6]). A special interval function f^<_{α,β}, also referred to as a comparison function, outputs β if x < α and 0 otherwise. We refer to an FSS scheme for comparison functions as a distributed comparison function (DCF). Analogously, the function f^≤_{α,β} outputs β if x ≤ α and 0 otherwise. In all of these cases, we allow the default leakage Leak(f̂) = (G^in, G^out).

For α ∈ {0, 1}^n and β ∈ G, we use Gen^<_n(1^λ, α, β, G) and Eval^<_n(b, k_b, x) to denote the keygen and evaluation algorithms for DCF.

Theorem 1 (Concrete cost of DCF [3]). Let λ be the security parameter, G be an Abelian group, and ℓ = ⌈log |G|⌉. Given a PRG G : {0, 1}^λ → {0, 1}^{4λ+2}, there exists a DCF for f^<_{α,β} : {0, 1}^n → G with key size n(λ + ℓ + 2) + λ + ℓ bits. For ℓ′ = ⌈ℓ/(4λ+2)⌉, the key generation algorithm Gen^<_n invokes G at most 2n(1 + 2ℓ′) + 2ℓ′ times and the evaluation algorithm Eval^<_n invokes G at most n(1 + ℓ′) + ℓ′ times. In the special case that |G| = 2^c for c ≤ λ, the number of PRG invocations in Gen^<_n is 2n and the number of PRG invocations in Eval^<_n is n.

A Dual Distributed Comparison Function (DDCF) is a function f^DDCF_{α,β1,β2} : {0, 1}^n → G which returns β1 if the input value is less than α and β2 otherwise. DDCF can be described in…

5 …using standard 2PC, but in this work we focus on the 2PC with a trusted dealer setting.
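To make the wire-masking invariant concrete, here is a minimal Python sketch for a *linear* gate, where additive shares of the offset constant already suffice as "keys" (illustrative names; real FSS gates for non-linear functions such as comparisons require DCF-based keys instead):

```python
import random

# Toy illustration of the masked-wire invariant of [7]: parties hold
# x + r_in on the input wire and use shares of the offset function
# g^{[r_in, r_out]}(x) = g(x - r_in) + r_out to obtain x + r_out on the
# output wire. For g(x) = c*x the offset function is affine, so additive
# shares of (r_out - c*r_in) act as the FSS keys.

N = 2**16
c = 3                                  # public gate g(x) = c * x

# Dealer (offline phase): sample wire masks, share the offset constant.
r_in = random.randrange(N)
r_out = random.randrange(N)
k0 = random.randrange(N)
k1 = (r_out - c * r_in - k0) % N       # k0 + k1 = r_out - c*r_in (mod N)

# Online phase: both parties hold the masked input wire value x + r_in.
x = 1234
x_hat = (x + r_in) % N
share = [(b * c * x_hat + [k0, k1][b]) % N for b in (0, 1)]  # party b

# One round of exchange reconstructs the masked output wire value.
y_hat = (share[0] + share[1]) % N
assert y_hat == (c * x + r_out) % N    # invariant: y_hat = g(x) + r_out
```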
Zero and Signed-Extension functions are used to extend the bitwidths of unsigned and signed numbers, respectively. More precisely, for an m-bit number x ∈ U_M (resp. x ∈ S_M), Zero-Extension (resp. Signed-Extension) to n bits (n > m) is defined by y = ZExt(x, m, n) ∈ U_N (resp. y = SExt(x, m, n) ∈ S_N), such that uint_n(y) = uint_m(x) (resp. sint_n(y) = sint_m(x)) holds. In the discussion that follows, we only consider the case of Signed-Extension and note that the protocol for Zero-Extension can be derived similarly.

The Signed-Extension gate G^SExt is the family of functions g_{SExt,m,n} : S_M → S_N parameterized by input group G^in = S_M and output group G^out = S_N, and defined as g_{SExt,m,n}(z) := sint_m(z) mod N. For this gate (and further gates described in the coming sections), we use the following equation for sint_m from [30]:

sint_m(z) = z′ − 2^{m−1}, for z′ = z + 2^{m−1} mod M    (1)

We denote the corresponding offset gate class by Ĝ^SExt and the offset functions by:

ĝ^{[r^in, r^out]}_{SExt,m,n}(x) = g_{SExt,m,n}(x − r^in) + r^out = sint_m((x − r^in) mod M) + r^out

Eval^SExt_{m,n}(b, k_b, x):
1: Parse k_b as k′_b ∥ r_b.
2: Set x′ ← x + 2^{m−1} mod M.
3: Set t ← M · Eval^<_m(b, k′_b, x′).
4: return u_b = b · x′ + r_b + t.

Fig. 1. FSS Gate for Signed-Extension G^SExt; b refers to the party id.

Theorem 2. There is an FSS gate (Gen^SExt_{m,n}, Eval^SExt_{m,n}) for G^SExt with a keysize of n bits plus the keysize of DCF_{m,S_N}, and 1 invocation of DCF_{m,S_N} in Eval^SExt_{m,n}.

Proof. For b ∈ {0, 1}, the simulator Sim_b for the signed-extension gate is given x and u_b (the input and output of the ideal functionality). It first invokes Sim′_b, the simulator for DCF over m bits, which outputs a DCF key k′_b. It then computes x′ and t in the same manner as Steps 2 and 3 of Eval^SExt_{m,n} in Figure 1. Finally, it computes r_b = u_b − b · x′ − t and outputs k′_b ∥ r_b. One can easily see that the output of Sim_b is computationally indistinguishable from k_b, the output of Gen^SExt_{m,n}.
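Equation (1) and the Signed-Extension semantics can be checked exhaustively in cleartext; a small illustrative sketch (m = 8, n = 16 chosen for the example):

```python
# Exhaustive cleartext check of Equation (1) and of the
# Signed-Extension semantics g_SExt,m,n(z) = sint_m(z) mod N.

m, n = 8, 16
M, N = 2**m, 2**n

def sint(z, bits):
    """Signed value of z in 2's complement on `bits` bits."""
    return z - 2**bits if z >= 2**(bits - 1) else z

for z in range(M):
    z_prime = (z + 2**(m - 1)) % M
    assert sint(z, m) == z_prime - 2**(m - 1)   # Equation (1)
    y = sint(z, m) % N                          # y = SExt(z, m, n)
    assert sint(y, n) == sint(z, m)             # extension preserves value
```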
Table 1. Description and costs for FSS gates from [3, 7]. (Table body not shown; its spline gate is parameterized by knots p_0, …, p_m with p_0 = p_m = N − 1 and p_i < p_{i+1} for 0 < i < m.)
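As an aside, the concrete cost formulas of Theorem 1 can be instantiated numerically; a small sketch with illustrative parameters (λ = 128, n = 16, G = Z_{2^64}):

```python
import math

# Plugging concrete numbers into Theorem 1: DCF key size and PRG-call
# counts for security parameter lam = 128 and output group G = Z_{2^64}
# (so l = 64). Parameter choices are illustrative.

lam, n, l = 128, 16, 64

key_bits = n * (lam + l + 2) + lam + l        # n(λ+ℓ+2) + λ + ℓ
l_prime = math.ceil(l / (4 * lam + 2))        # ℓ' = ⌈ℓ/(4λ+2)⌉
gen_calls = 2 * n * (1 + 2 * l_prime) + 2 * l_prime
eval_calls = n * (1 + l_prime) + l_prime

print(key_bits, gen_calls, eval_calls)        # 3296 98 33
# Since |G| = 2^64 and 64 <= lam, the special case of Theorem 1 applies
# here, dropping the counts to 2n = 32 (Gen) and n = 16 (Eval).
```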
…sint_m(a) ∗_ℓ sint_n(b). Intuitively, this says that the finite-bit signed values a and b are lifted to Z and multiplied (note that taking a modulo with L for ℓ = m + n does not lose any bits).

For a′ = a + 2^{m−1} mod M and b′ = b + 2^{n−1} mod N, using Equation (1) we get:

g_{smult,m,n}(a, b) = (a′ − 2^{m−1}) · (b′ − 2^{n−1}) mod L
                   = a′ · b′ − a′ · 2^{n−1} − b′ · 2^{m−1} + 2^{m+n−2} mod L

We denote the corresponding offset gate class by Ĝ^smult and the offset functions by:

ĝ^{[r1^in, r2^in, r^out]}_{smult,m,n}(x, y) = g_{smult,m,n}(x − r1^in, y − r2^in) + r^out
  = ((x′ − r1^in) mod M − 2^{m−1}) · ((y′ − r2^in) mod N − 2^{n−1}) + r^out mod L

where x′ = x + 2^{m−1} mod M and y′ = y + 2^{n−1} mod N. We used Equation (1) in the above expression. Now, using Equation (2) we have:

ĝ^{[r1^in, r2^in, r^out]}_{smult,m,n}(x, y) = (x′ − r1^in − 2^{m−1} + 2^m · 1{x′ < r1^in}) · (y′ − r2^in − 2^{n−1} + 2^n · 1{y′ < r2^in}) + r^out mod L

…for G^smult, which has a total keysize of 3(m + n) bits plus the key sizes of DCF_{m,S_{2N}} and DCF_{n,S_{2M}}, and requires 1 invocation each of DCF_{m,S_{2N}} and DCF_{n,S_{2M}} in Eval^smult_{m,n}.

Remark. We use the above protocol for signed multiplication to realize the element-wise multiplications occurring in Hadamard Product layers in our benchmarks.

4.2 Matrix Multiplication and Convolution

Consider signed matrix multiplication of A ∈ S_M^{d1×d2} and B ∈ S_N^{d2×d3}. In the resulting matrix C = A · B, each element is the result of d2 multiplications and d2 − 1 additions. Even if we store the result of each multiplication in a larger ring S_{2^{m+n}}, similar to signed multiplication, and do the additions over these, the result can still overflow due to the additions. To avoid such an overflow, underlying libraries assume that the computation is happening over a sufficiently large domain. Note that ℓ = m + n + ⌈log d2⌉ suffices to avoid any overflows during the whole dot product computation. Even though the compute looks similar to what we discussed in the last section, matrix multiplication…
…output in n bits and scale 2s, followed by truncation by s using the arithmetic right shift gate from [3] to adjust the output scale to s (see Table 1 for the costs).

5 Math Functions

In this section, we first discuss our novel FSS-based protocols for precise math functions for the same input and output domains considered in [30]. To quantify how precise our math function implementations are, we use the standard notion of ULP error (defined formally in [14]) that we discuss below. Then, we provide a high-level overview of the design of our math functions, followed by mixed-bitwidth splines that are crucial to obtain low-bitwidth splines that are good approximations to math functions. Finally, we provide details on FSS-friendly math function design for popular activations – sigmoid and tanh – used in neural networks.

ULP error. It is impossible to represent an irrational value exactly using a finite number of bits. Therefore, it is important to quantify the deviation between the exact real result and the output of a math library in finite-bit representation. There are various notions of error one can use – absolute, relative, and ULP error. Standard math libraries use ULP error as a metric to determine whether the real output of a math function is approximately close to the finite-bit output that the library produces [1, 38]. The lower the ULP value, the higher the precision and accuracy of the implementation of that math function. At a high level, the ULP error between the exact real result r1 and library output r2 is equal to the number of representable values between r1 and r2 [14]. We use the same notion to quantify the precision of our math functions, which use fixed-point as the finite-bit representation.

Design of math functions. Although our techniques are general, for a high-level discussion, let us assume that we want to approximate sigmoid within 4 ULPs of error over fixed-point inputs and outputs with bitwidth 16 and scale 12. There are multiple design choices possible in coming up with such an implementation. For instance, SIRNN [30] used the recipe of first obtaining a good initial approximation followed by Goldschmidt's iterations, where the number of iterations depends on the final output scale precision desired. However, this approach leads to a large number of online rounds and communication due to the iterative nature of the algorithm. Our first design choice is to use piecewise polynomials or splines to approximate math functions, as these allow for one-round, low-communication protocols using FSS techniques [3, 7]. However, we notice that, for obvious reasons, uniform bitwidth splines cannot be used to obtain low ULP errors. In particular, for the above-mentioned case, we cannot find a reasonable spline that uses coefficients on 16 bits, does all arithmetic over 16 bits, and provides at most 4 ULP error for scale 12. Similar to [30], the polynomial computation in splines needs to happen with intermediate results in higher bitwidths, and the final result needs to be truncated (to reduce bitwidth and adjust scale). Here, intuitively, while evaluating polynomials one keeps accumulating bitwidth and only reduces the final result.

We discuss the details of our mixed-bitwidth splines followed by our protocols for precise math functions.

5.1 Mixed-bitwidth Splines

We first discuss the cleartext functionality for the mixed-bitwidth splines followed by their FSS implementation. For ease of exposition, we discuss the splines over integers (no scale) and later show how to handle fixed-point arithmetic that has an associated scale with each value.

Suppose the spline under consideration is composed of m polynomials f1, f2, …, fm (with degree d and coefficients of bitwidth nc), and a set of m + 1 knots P. Let the input to the spline be x ∈ S_{N_I}, where the bitwidth of x is n_I and N_I = 2^{n_I}. Let n be a sufficiently large bitwidth that prevents overflow of values during polynomial evaluation. Note that n = nc + d · n_I suffices for this purpose. The functionality g^mixed_{spline,(n_I,nc),m,d,P,{fi}_i} : S_{N_I} → S_N for mixed-bitwidth splines is defined as follows. First, sign extend the input x from n_I bits to n, resulting in sint(x) mod N. Then, sign extend the coefficients of all polynomials from nc bits to n, and sign extend all knots from n_I bits to n. Finally, output the result of the uniform bitwidth spline functionality on input sint(x) using the new (sign extended) coefficients and knots. Note that this evaluation procedure (mod N) is the same as doing all the computation over Z.

Now we describe a simple (yet non-optimized) 2-round FSS-based protocol for this spline evaluation. For the first round, parties call the FSS gate for Signed-Extension g_{SExt,n_I,n} with input x ∈ S_{N_I} masked by r^in, and reconstruct x̄ = sint(x − r^in) + r^temp mod N ∈ S_N. Here, r^temp ∈ S_N is chosen randomly by the dealer during key generation. For the second round, let f̄_i be the polynomial corresponding to f_i with (publicly known) coefficients sign extended to n bits from nc bits. Also, let P̄ be the sign extended (publicly known) knots to n bits.
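The cleartext functionality above can be sketched for a toy spline (the pieces, knots, and parameters below are illustrative, not from the paper):

```python
# Cleartext mixed-bitwidth spline sketch over integers: sign extend the
# input (and, implicitly, the coefficients and knots) from nI to
# n = nc + d*nI bits, select the piece by the knots, evaluate mod N.

nI, nc, d = 8, 8, 1
n = nc + d * nI                       # 16 bits: no overflow in a0 + a1*x
NI, N = 2**nI, 2**n

def sint(z, bits):
    return z - 2**bits if z >= 2**(bits - 1) else z

knots = [-128, 0, 128]                # m + 1 = 3 knots (sign extended)
coeffs = [[3, 2], [1, -1]]            # piece i: a_{i,0} + a_{i,1} * x

def spline_mixed(x):
    xs = sint(x, nI)                  # sign extended input
    for i, (a0, a1) in enumerate(coeffs):
        if knots[i] <= xs < knots[i + 1]:
            return (a0 + a1 * xs) % N # mod N equals computation over Z

y = spline_mixed(200)                 # 200 encodes sint_8(200) = -56
assert sint(y, n) == 2 * (-56) + 3    # first piece: f1(-56) = -109
```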
Next, parties call the FSS gate for uniform-bitwidth splines g_{spline,n,m,d,P̄,{f̄i}_i} from [3]. The input to this gate is x̄ ∈ S_N from the first round, masked by r^temp. The output is the spline evaluated on (x̄ − r^temp), masked by r^out ∈ S_N.

5.1.1 Optimizations

We propose two optimizations to the protocol described above that significantly reduce the FSS key size and the offline and online computational cost.

Optimization 1. Here, we reduce the FSS key size for the spline gate used in the second round. In the above construction, the DCF key used in the spline operates over inputs from S_N (obtained after sign extending the original input) and uses the sign extended knots during key generation by the dealer and evaluation by the servers. Our observation is as follows: even though the parties need to learn shares of the coefficients of the correct polynomial to be evaluated in the larger domain, i.e., S_N, the DCF input (and hence, its depth, etc.) and the knots themselves can come from the original domain S_{N_I}, where N_I = 2^{n_I}. In particular, we replace DCF_{n, S_N^{m(d+1)}} in the above unoptimized scheme by DCF_{n_I, S_N^{m(d+1)}} and evaluate it on the original input x instead of x̄. This reduces the key size and the number of PRG calls made in key generation and evaluation of the uniform bitwidth spline gate (used in the second round of our protocol) by a factor of n/n_I. For instance, for the case of sigmoid, this reduction factor is 4×.

Optimization 2. This optimization significantly reduces the number of PRG calls made by the servers during the online phase. Recall that the FSS gate for g_{spline,n_I,m,d,P,{f̄i}_i} used in the second round uses a DCF with a payload of m(d+1)n bits. This DCF key gets evaluated m times and the output of each invocation is m(d+1)n bits. Let the output of the i-th invocation be s_1^{(i)}, …, s_m^{(i)} (using the same notation as Figure 5 in [3]). We observe that only s_{i−1}^{(i)}, s_i^{(i)}, i.e., 2(d+1)n bits, are used during evaluation and the other values are discarded. The Gen^<_n algorithm of the DCF construction of [3] generates a key k_b for each party b ∈ {0, 1} such that k_b consists of a random seed s_b and n + 1 correction words CW_i ∈ G for i ∈ {1 … n + 1} (where G denotes the output group). The seed generates a binary tree with 2^n leaves and each node is associated with a tuple (s_b, t_b, V_b), with the invariant that the sums of V_b along the evaluation paths for an input x form secret shares of f^<_{α,β}(x). Hence, in Eval^<_n, it suffices to perform this addition along the respective path to get the desired output. In the case of splines, where G is a vector of group elements, these additions are performed element-wise. Since we need only 2 out of the m elements in the output of Eval^<_n, we can tweak Eval^<_n to only do the required additions and PRG calls. This change reduces the number of PRG calls in the spline evaluation by roughly a factor of m/2. For the above example of sigmoid, this factor is roughly 10×.

Overall performance improvement. The two optimizations discussed above are compatible with each other and together lead to roughly a factor n/n_I reduction in FSS key size, an n/n_I reduction in PRG calls by the dealer in key generation, and an n/n_I · m/2 reduction in PRG calls by the servers in the online phase. For the case of sigmoid, this amounts to a 4× reduction in key size, 4× fewer PRG calls by the dealer, and 40× fewer PRG calls by the servers.

We now discuss how we can easily extend our protocol to work over fixed-point arithmetic as required.

5.1.2 Fixed-point arithmetic

In the context of fixed-point arithmetic, let the scale of the input x be s_I, and that of the spline coefficients be s_c. In symbolic notation, let f_i(x) = Σ_{j=0}^{d} a_{i,j} · x^j. The first round of our FSS protocol remains the same. Note that during polynomial evaluation in the spline, we require all the summands to have the same scale. This requires a small change at the dealer as follows: the dealer sign extends the coefficients from nc to n bits and also left shifts a_{i,j} by (d − j) · s_I bits. So, during evaluation, the scale of each summand of the polynomial is the same, viz. s = s_c + d · s_I. Now, after running the above described protocol, we have the output with bitwidth n and scale s.

Next, suppose the desired output is required to have bitwidth n_O and scale s_O. We do this adjustment in the final round as follows: we can safely assume that s_O ≤ s since, to obtain a precise output with scale s_O, the scales of the coefficients will have to be appropriately large as well. Now, in the third round, we reduce the scale of the output by tr = s − s_O and adjust the bitwidth to n_O by using appropriate truncation operations, as discussed in the "fixed-point mixed-bitwidth matrix multiplication" paragraph in Section 4.2. The complete 3-round FSS protocol is described in Appendix E.

Below, we summarize the key size and evaluation cost of the FSS protocol for mixed-bitwidth splines
over fixed-point, g^{(mixed,fixed)}_{spline,(n_I,s_I,n_O,s_O,nc,sc),m,d,P,{fi}_i}. (Let G^{(mixed,fixed)-spline} denote the corresponding function family, parameterized accordingly.) After Optimization 2, our protocol for spline evaluation uses the underlying DCF key in a non-black-box manner; hence, we report the cost of this step in the number of PRG calls made. Also, we report the cost when the third round of the protocol uses a Truncate-Reduce gate. The other case is similar.

Theorem 5. Let params = (n_I, s_I, n_O, s_O, nc, sc), n = nc + d · n_I, tr = sc + d · s_I − s_O. There is a 3-round FSS protocol (Gen^{(mixed,fixed)-spline}_{params,m,d,P}, Eval^{(mixed,fixed)-spline}_{params,m,d,P}) for mixed-bitwidth splines over fixed-point that has a total key size of 2mn(d+1) + n bits, plus the key size of DCF_{n_I, S_N^{m(d+1)}}, plus the key sizes of the FSS gates for g_{SExt,n_I,n} and g_{TR,n,tr}. Let ℓ = ⌈2n(d+1)/(4λ+2)⌉, where λ is the security parameter. The online phase makes single evaluations of the Sign-Extension and Truncate-Reduce gates and at most m(n_I(1 + ℓ) + ℓ) calls to the PRG G (used in the DCF) during spline evaluation (in the second round).

5.2 Math Functions

In this section, we discuss our approach for computing math functions using FSS techniques – in particular, sigmoid and tanh. We use mixed-bitwidth splines over fixed-point as approximations to math functions that can be realized directly using the 3-round protocol described in the previous section. Below, we discuss how…

…In the third step, we quantize this spline, i.e., represent it over fixed-point as follows: we quantize the knots to have the same bitwidth and scale as our inputs. We linearly search over bitwidths and scales for the coefficients. For a choice of nc, sc, we exhaustively run the mixed-bitwidth spline cleartext algorithm for all inputs and check their ULP error w.r.t. the output of a high-precision math library [13]. We crucially note that since sigmoid (and also tanh) are well-behaved functions with bounded outputs, and the output scale s_O ≤ 14, this exhaustive testing is feasible. If the maximum ULP error is ≤ 4, we stop. Otherwise, we increase the value of either nc or sc until nc = 32. If we do not find a good approximation, we increase the number of knots, m, up to 100; even if this is unsuccessful, we increment the degree d and go back to the second step of spline finding.

Following the above procedure, we successfully find splines with d = 2, m ≤ 52 for the sigmoid function for input and output scales such that 0 ≤ s_I, s_O ≤ 14, and this suffices for our benchmarks as well as the benchmarks considered in prior works. As expected from a 2D graph of sigmoid, we were unable to find linear splines for sigmoid (and also tanh) with even 100 knots.

Tanh. Over the reals, tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}) and tends to −1 for small values of x and 1 for large values. Our procedure for tanh is identical to that for sigmoid except for a straightforward change to clipping in terms of outputs on inputs with large magnitude.
we obtain the required splines for each of the math Reciprocal square root. Over reals, rsqrt(x) =
√
functions. We use sigmoid to illustrate this. 1/ x, x > 0. To avoid division-by-zero error when x
is very small, we assume that all inputs to rsqrt satisfy
Sigmoid. Over the reals, sigmoid(x) = 1/(1 + e−x ) and x > , where is a small public constant. As in the
tends to 0 for small values of x and tends to 1 for large case of SIRNN [30], we set = 0.1. The procedure to
values of x. Our task is as follows: Given the bitwidths find splines is similar to sigmoid and tanh with one
and scales for the inputs and outputs, find a spline that difference. We observe that since the precision of input
approximates the real result with at most 4ULPs of er- x is sI , it suffices to compute the output with preci-
ror (see beginning of this section for the definition of sion sI /2. Hence, the ULP error of spline obtained is
ULP error). The first step is to clip the input domain to computed over bitwidth nO and scale dsI /2e, instead
an interesting interval as follows: we find the largest xL of sO . Later, we adjust the scale of spline output to sO
and the smallest xR such that if we set sigmoid(x) = 0 by left-shifting the output by (sO − dsI /2e) bits.
(with appropriate fixed-point representation of outputs)
for all x 6 xL and set sigmoid(x) = 1 for all x > xR , the Sample choice of parameters. Table 2 lists the
resulting ULP error 6 4. In the second step, we start choice of our spline parameters that give at most 4 ULP
with a choice of degree of the polynomials, d, and num- error for various configurations of math functions re-
ber of knots, m, and run an off-the-shelf tool Octave quired by our benchmarks in Section 6.2. In Appendix
[12] to find a best fit spline for sigmoid for the reduced B, we provide fixed-point values of the coefficients and
domain. Note that this step, returns a floating-point intervals of a mixed-bitwidth spline for tanh.
spline, i.e., both polynomial coefficients as well as knots
are floating-point values.
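The exhaustive ULP check at the heart of the spline quantization step can be sketched in cleartext Python. This is our own illustrative harness, not LLAMA's code: the bitwidths are toy-sized (n_I = 8, s_I = 4), double-precision sigmoid stands in for the high-precision reference library [13], and the "candidate" evaluator shown simply rounds the true result (so its worst-case error is 0 ULPs); a real candidate would evaluate the quantized spline instead, and the search accepts it if the worst case is ≤ 4.

```python
import math

def to_fxp(x, scale, bits):
    """Round a real number to fixed-point with the given scale, clamped
    to the signed range of the given bitwidth."""
    q = round(x * (1 << scale))
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, q))

def max_ulp_error(eval_fxp, n_i, s_i, n_o, s_o, ref):
    """Exhaustively compare a fixed-point evaluator against a real-valued
    reference, measuring error in units of the output's last place (ULPs)."""
    worst = 0
    for u in range(1 << n_i):                                  # all n_i-bit inputs
        q_in = u - (1 << n_i) if u >= 1 << (n_i - 1) else u    # signed decode
        x = q_in / (1 << s_i)
        want = to_fxp(ref(x), s_o, n_o)                        # best n_o-bit answer
        got = eval_fxp(q_in)
        worst = max(worst, abs(got - want))
    return worst

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# A "perfect" evaluator (round the real result) has 0 ULP error; a quantized
# spline candidate plugged in here would be accepted if the result is <= 4.
perfect = lambda q_in: to_fxp(sigmoid(q_in / (1 << 4)), 14, 16)
print(max_ulp_error(perfect, 8, 4, 16, 14, sigmoid))  # -> 0
```

Because the input domain has only 2^{n_I} points, the exhaustive loop is cheap, which is exactly why the search described above is feasible for low-bitwidth inputs.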
Layer                     Batch  Technique  Offline Comm. (KB)  Online Comm. (KB)  Online Rounds  Offline LAN (ms)  Online LAN (ms)
Signed-Extension          100    LLAMA      35                  0.8                1              0.38 ± 0.14       0.37 ± 0.26
(m = 8, n = 21)           100    SIRNN      -                   30                 7              -                 4.5 ± 0.6
                          1000   LLAMA      352                 7.8                1              0.96 ± 0.68       0.81 ± 0.22
                          1000   SIRNN      -                   114                7              -                 5.73 ± 1.54
Truncate-Reduce           100    LLAMA      47                  0.2                1              0.47 ± 0.49       0.48 ± 0.43
(n = 21, s = 13)          100    SIRNN      -                   41                 13             -                 9.34 ± 6.34
                          1000   LLAMA      466                 2                  1              1.72 ± 1.91       0.89 ± 0.44
                          1000   SIRNN      -                   211                13             -                 13.65 ± 2.17
Sigmoid (nI = nO = 16,    100    LLAMA      3297                3.5                3              12.47 ± 5.5       4.09 ± 1.47
sI = 9, sO = 14)          100    SIRNN      -                   768                139            -                 91.96 ± 8.50
                          100    MP-SPDZ    3696                134                145            **                32.32 ± 8.12
                          1000   LLAMA      33044               35                 3              128.45 ± 46.91    27.05 ± 4.37
                          1000   SIRNN      -                   5007               139            -                 102.46 ± 8.06
                          1000   MP-SPDZ    5246                1308               145            **                52.10 ± 8.90
Tanh (nI = nO = 16,       100    LLAMA      1320                3.5                3              5.35 ± 3.88       2.81 ± 0.84
sI = sO = 9)              100    SIRNN      -                   604                131            -                 83.7 ± 8.26
                          100    MP-SPDZ    3696                137                155            **                35.74 ± 12.46
                          1000   LLAMA      13219               35                 3              51.06 ± 19.99     10.16 ± 3.44
                          1000   SIRNN      -                   3614               131            -                 88.07 ± 8.96
                          1000   MP-SPDZ    5246                1341               155            **                57.60 ± 8.80
Reciprocal square root    100    LLAMA      1138                3.5                3              4.80 ± 4.35       2.84 ± 1.10
(nI = nO = 16, sI = 12,   100    SIRNN      -                   881                185            -                 124.05 ± 10.95
sO = 11)                  100    MP-SPDZ    2457                44.4               87             **                22.11 ± 5.00
                          1000   LLAMA      11375               35                 3              41.20 ± 15.60     8.99 ± 1.79
                          1000   SIRNN      -                   5488               185            -                 126.03 ± 11.41
                          1000   MP-SPDZ    2467                413                87             **                28.92 ± 5.78

Table 3. Performance comparison for bitwidth changing and math functions. For Signed-Extension, m, n are input, output bitwidths. For Truncate-Reduce, n is input bitwidth and s is shift amount. For Sigmoid, Tanh and Reciprocal square root, nI, nO, sI, sO denote input/output bitwidths and scales. ** denotes that the value was not reported by the code.
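As a reference point for the two bitwidth-changing layers benchmarked in Table 3, their cleartext fixed-point semantics can be sketched as follows (our own illustrative Python; in LLAMA these are realized securely as the single-round FSS gates reported in the table):

```python
def sign_extend(x, m, n):
    """Reinterpret an m-bit two's-complement value as an n-bit value (m < n)."""
    assert 0 <= x < (1 << m)
    s = x - (1 << m) if x >= (1 << (m - 1)) else x   # signed decode
    return s % (1 << n)                              # re-encode in n bits

def truncate_reduce(x, n, s):
    """Arithmetic right shift of an n-bit value by s, output in n - s bits."""
    assert 0 <= x < (1 << n)
    v = x - (1 << n) if x >= (1 << (n - 1)) else x   # signed decode
    return (v >> s) % (1 << (n - s))                 # Python >> floors, i.e. arithmetic shift

# m = 8, n = 21 as in Table 3: the 8-bit encoding of -3 becomes -3 in 21 bits.
print(sign_extend(253, 8, 21) == (-3) % (1 << 21))                           # True
# n = 21, s = 13: dropping 13 fractional bits of -2**15 gives -4 in 8 bits.
print(truncate_reduce((-(1 << 15)) % (1 << 21), 21, 13) == (-4) % (1 << 8))  # True
```

Sign extension widens a value (e.g., before a high-bitwidth accumulation), while truncate-reduce both rescales and narrows it, which is why the pair appears around every mixed-bitwidth multiplication.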
In Table 3, the choices of parameters for bitwidths and scales are made using examples from our benchmarks such as Google-30 [26] and Heads [30] (described in Section 6.2), and we evaluate for batch sizes of 100 and 1000. For the math functions, for these choices of bitwidths and scales, Table 2 provides details on the spline chosen by LLAMA, i.e., degree and number of knots, as well as coefficient bitwidths and scales. From Table 3, LLAMA is up to 40× better than MP-SPDZ in online communication, up to 51× better in terms of online rounds, and up to 12× better in online execution time. As seen in Table 3, LLAMA communicates between 105 − 251× less than SIRNN in the online phase and has between 13 − 61× fewer rounds of online communication. In terms of performance, LLAMA is between 3.7 − 43× faster than SIRNN in the LAN setting. Finally, as expected, while the total communication of LLAMA (i.e., communication including the offline key size as well) can be comparable to SIRNN in a few cases (e.g., Truncate-Reduce), LLAMA does have a larger total communication (by up to 4.4×) in other cases.

Table 4 summarizes our microbenchmarks for mixed-bitwidth matrix multiplication (multiplying a d_1 × d_2 matrix with a d_2 × d_3 matrix). The input/output bitwidths for all experiments are 8, while the scale is 6; however, due to d_2 being different in each case, the intermediate bitwidth (16 + ⌈log d_2⌉) in each computation is different. As can be seen from the table, LLAMA communicates between 59 − 208× less than SIRNN in the online phase and has 10 − 13× fewer rounds. LLAMA also performs between 2.2 − 7.3× better in the LAN setting. Further, in these microbenchmarks, LLAMA also has 1.3 − 5.4× lower total communication.

6.2 Benchmarks

In this section, we evaluate and compare the performance of LLAMA on several machine learning inference algorithms. We provide details on the benchmarks considered in Appendix D and summarize the findings in Table 5. We split the discussion below into two kinds of benchmarks. Our main focus is networks with math functions or networks that use low bitwidths for activations and weights for efficiency. We also consider simple convolutional neural networks (CNNs) from prior works to demonstrate our generality and scalability.

Neural networks with math functions/mixed-bitwidth arithmetic. First, to illustrate the performance of LLAMA on algorithms that use mixed-bitwidth arithmetic and/or math functions (tanh, sigmoid, or reciprocal square root), we run it on the following end-to-end inference benchmarks: DeepSecure B4 [32], which enables embedded sensors to classify various physical activities, as well as an RNN algorithm [26] that enables keyword spotting on the Google-30 dataset [40]. We compare the performance of LLAMA with SIRNN and observe that LLAMA has up to 4 orders of magnitude lower online communication, up to 22× fewer online rounds, and up to 57× faster runtime. We also evaluate LLAMA on the sigmoid/tanh layers of the MiniONN LSTM [27] (a language model for word predictions) and the Industrial-72 benchmark [24, 30] (a model that provides feedback on the quality of shots in a sports game), as well as the reciprocal square root layers from the Heads model [35] (a model for counting the number of people in an image). Here, we show that, in comparison with SIRNN, the online communication of LLAMA is at least 200× less, the number of rounds is at least 43× better, and the performance is at least 15× and 43× better in the LAN and WAN settings.

Other neural networks. While not the primary focus of this work, for the sake of completeness, we also compare LLAMA with prior systems on neural networks not requiring mixed-bitwidth arithmetic or math functions. We compare with DELPHI [28] – a 2PC system designed specifically with online cost in mind – and show ≈ 24× better communication and ≈ 3 − 5× better runtime for the online phase. To illustrate that LLAMA can scale to large benchmarks, we run it on the ResNet-50 CNN on the ImageNet dataset [18], and compare with both CrypTFlow (a 3PC system) as well as CrypTFlow2 (a 2PC system).^6 Finally, we consider AriaNN [34], which like LLAMA is an FSS-based secure inference system (in the trusted dealer model), but does not support mixed-bitwidth arithmetic or math functions. Here alone, since the AriaNN code [33] does not support execution on different VMs, we ran all parties in both AriaNN and LLAMA on the same VM and appropriately set the latency and bandwidth on the VM using the tc command.^7 On the ResNet-18 benchmark on the Hymenoptera dataset, we show that LLAMA outperforms AriaNN by about 3× in online communication and 1.7× in online runtime (despite AriaNN using a probabilistically correct, cheaper, local truncation protocol compared to the correct truncation in LLAMA). This improvement can be attributed to ReLU being a 2-round protocol in AriaNN compared to an FSS gate in LLAMA. For fairness, we also provide numbers for LLAMA with the same probabilistically correct local truncation.

7 Conclusion

This paper proposes LLAMA, an FSS-based 2PC secure inference system in the semi-honest, trusted dealer setting. The main design goal of LLAMA is to minimize

^6 In very recent work, Cheetah [19] shows an improvement of ≈ 12× in communication and 4-5× in runtime over CrypTFlow2. We do not directly compare with this work, as it is orthogonal to the focus of this work; however, even in comparison to Cheetah, we note that LLAMA has much lower communication and is expected to outperform it.
^7 The end-to-end code execution time in AriaNN took around 40 minutes. In Table 5, we report the offline and online times (around 350 seconds and 13 seconds respectively) that are output by their code. Due to longer execution times, the mean and standard deviation of runtimes are calculated over 25 iterations.
Network                  Technique  Offline Comm. (MB)  Online Comm. (MB)  Online Rounds  Offline LAN (s)  Online LAN (s)  Offline WAN (s)  Online WAN (s)
DeepSecure B4            LLAMA      183                 0.15               21             0.78 ± 0.20      0.11 ± 0.01     3.15 ± 0.16      0.83 ± 0.05
                         SIRNN      -                   1844               379            -                6.45 ± 0.31     -                47.63 ± 1.67
Google-30                LLAMA      882                 8.6                2687           5.31 ± 0.94      1.89 ± 0.12     8.02 ± 0.32      87.71 ± 3.65
                         SIRNN      -                   415                59899          -                37.18 ± 1.95    -                1997.8 ± 93.5
MiniONN LSTM             LLAMA      49.7                0.04               6              0.21 ± 0.07      0.02 ± 0.003    0.49 ± 0.23      0.23 ± 0.01
(only Sigmoid, Tanh)     SIRNN      -                   9.7                403            -                0.34 ± 0.02     -                14.47 ± 0.90
Industrial-72            LLAMA      19.4                0.03               42             0.09 ± 0.03      0.04 ± 0.006    0.30 ± 0.04      1.41 ± 0.07
(only Sigmoid, Tanh)     SIRNN      -                   7.9                1847           -                1.23 ± 0.08     -                61.36 ± 3.85
</gr-replace>
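The probabilistically correct local truncation mentioned in the AriaNN comparison of Section 6.2 (a SecureML-style technique [29]) can be demonstrated in a few lines. This is our own toy model of the idea over 2-out-of-2 additive shares, not AriaNN's implementation: each party arithmetically shifts its own share locally, which costs no rounds, but the reconstruction is only correct up to an off-by-one error, and fails badly with probability roughly |x|/2^n due to share wrap-around.

```python
import random

def local_truncate_errors(x, s, n, trials=1000):
    """Secret-share x, let each party arithmetically shift its own share
    by s, reconstruct, and record the deviation from the true x >> s."""
    N = 1 << n
    def ars(u):                       # arithmetic right shift of a ring element
        v = u - N if u >= N // 2 else u
        return (v >> s) % N
    errs = []
    for _ in range(trials):
        x0 = random.randrange(N)      # fresh random sharing of x each trial
        x1 = (x - x0) % N
        y = (ars(x0) + ars(x1)) % N
        errs.append((y - (x >> s)) % N)
    return errs

errs = local_truncate_errors(x=1 << 20, s=10, n=32)
near = sum(e in (0, 1, (1 << 32) - 1) for e in errs)
print(near, "of", len(errs), "reconstructions within 1 of the true shift")
```

The rare large errors are exactly what LLAMA's faithful (but more expensive) truncation gates avoid.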
[9] Dalskov, A.P.K., Escudero, D., Keller, M.: Secure evaluation of quantized neural networks. Proc. Priv. Enhancing Technol. (2020)
[10] Damgård, I., Pastro, V., Smart, N.P., Zakarias, S.: Multiparty computation from somewhat homomorphic encryption. In: CRYPTO (2012)
[11] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009)
[12] Eaton, J.W., Bateman, D., Hauberg, S., Wehbring, R.: GNU Octave version 6.1.0 manual: a high-level interactive language for numerical computations (2020), https://[Link]/software/octave/doc/v6.1.0/
[13] Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., Zimmermann, P.: MPFR: A multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw. (2007)
[14] Goldberg, D.: What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. (1991)
[15] Goldreich, O., Micali, S., Wigderson, A.: How to Play any Mental Game or A Completeness Theorem for Protocols with Honest Majority. In: STOC (1987)
[16] Google: Tensorflow Lite (2019), [Link]org/lite/
[17] Gopinath, S., Ghanathe, N., Seshadri, V., Sharma, R.: Compiling kb-sized machine learning models to tiny iot devices. In: PLDI (2019)
[18] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
[19] Huang, Z., jie Lu, W., Hong, C., Ding, J.: Cheetah: Lean and fast secure two-party deep neural network inference. In: USENIX Security (2022)
[20] Juvekar, C., Vaikuntanathan, V., Chandrakasan, A.: GAZELLE: A Low Latency Framework for Secure Neural Network Inference. In: USENIX Security (2018)
[21] Kaissis, G., Ziller, A., Passerat-Palmbach, J., Ryffel, T., Usynin, D., Trask, A., Lima, I., Mancuso, J., Jungmann, F., Steinborn, M., Saleh, A., Makowski, M.R., Rueckert, D., Braren, R.: End-to-end privacy preserving deep learning on multi-institutional medical imaging. Nat. Mach. Intell. (2021)
[22] Kamara, S., Mohassel, P., Raykova, M., Sadeghian, S.S.: Scaling private set intersection to billion-element sets. In: Christin, N., Safavi-Naini, R. (eds.) FC (2014)
[23] Keller, M.: MP-SPDZ: A versatile framework for multi-party computation. In: ACM CCS (2020)
[24] Kumar, A., Seshadri, V., Sharma, R.: Shiftry: RNN Inference in 2KB of RAM. In: OOPSLA (2020)
[25] Kumar, N., Rathee, M., Chandran, N., Gupta, D., Rastogi, A., Sharma, R.: CrypTFlow: Secure TensorFlow Inference. In: IEEE S&P (2020)
[26] Kusupati, A., Singh, M., Bhatia, K., Kumar, A., Jain, P., Varma, M.: FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network. In: NeurIPS (2018)
[27] Liu, J., Juuti, M., Lu, Y., Asokan, N.: Oblivious neural network predictions via minionn transformations. In: CCS (2017)
[28] Mishra, P., Lehmkuhl, R., Srinivasan, A., Zheng, W., Popa, R.A.: DELPHI: A cryptographic inference service for neural networks. In: USENIX Security (2020)
[29] Mohassel, P., Zhang, Y.: SecureML: A system for scalable privacy-preserving machine learning. In: IEEE S&P (2017)
[30] Rathee, D., Rathee, M., Goli, R.K.K., Gupta, D., Sharma, R., Chandran, N., Rastogi, A.: SIRNN: A math library for secure RNN inference. In: IEEE S&P (2021)
[31] Rathee, D., Rathee, M., Kumar, N., Chandran, N., Gupta, D., Rastogi, A., Sharma, R.: CrypTFlow2: Practical 2-party secure inference. In: CCS (2020)
[32] Rouhani, B.D., Riazi, M.S., Koushanfar, F.: Deepsecure: Scalable provably-secure deep learning. In: DAC (2018)
[33] Ryffel, T., Pointcheval, D., Bach, F.: ARIANN: Low-interaction privacy-preserving deep learning via function secret sharing. [Link] (2020)
[34] Ryffel, T., Pointcheval, D., Bach, F.: ARIANN: Low-interaction privacy-preserving deep learning via function secret sharing. PoPETS (2022)
[35] Saha, O., Kusupati, A., Simhadri, H.V., Varma, M., Jain, P.: RNNPool: Efficient non-linear pooling for RAM constrained inference. In: NeurIPS (2020)
[36] Soin, A., Bhatu, P., Takhar, R., Chandran, N., Gupta, D., Alvarez-Valle, J., Sharma, R., Mahajan, V., Lungren, M.P.: Production-level open source privacy preserving inference in medical imaging. CoRR (2021)
[37] Vadapalli, A., Bayatbabolghani, F., Henry, R.: You may also like... privacy: Recommendation systems meet pir. In: PoPETs (2021)
[38] Wang, E., Zhang, Q., Bo, S., Zhang, G., Lu, X., Wu, Q., Wang, Y.: Intel math kernel library (2014)
[39] Wang, F., Yun, C., Goldwasser, S., Vaikuntanathan, V., Zaharia, M.: Splinter: Practical private queries on public data. In: Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (2017)
[40] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv (2018)
[41] Yao, A.C.: How to generate and exchange secrets. In: FOCS (1986)

A Unsigned multiplication

Unsigned multiplication of two values x ∈ U_M, y ∈ U_N refers to the multiplication of the two values uint_m(x) and uint_n(y) carried out in the group U_L, where L = M · N, which is equivalent to uint_m(x) ∗_ℓ uint_n(y). The unsigned multiplication gate G^umult is the family of functions g_{umult,m,n} : U_M × U_N → U_L parameterized by input group G^in = U_M × U_N and output group G^out = U_L, and given by g_{umult,m,n}(x, y) := uint_m(x) ∗_ℓ uint_n(y).
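As a concrete instance of this gate (our own toy code, with m = n = 8): the operands are multiplied in the output group of L = M · N elements, where the product of an m-bit and an n-bit unsigned value always fits, so no overflow occurs.

```python
def umult(x, y, m, n):
    """g_umult,m,n: multiply an m-bit by an n-bit unsigned value in the
    output group U_L with L = M * N = 2**(m + n)."""
    M, N_, L = 1 << m, 1 << n, 1 << (m + n)
    assert 0 <= x < M and 0 <= y < N_
    return (x * y) % L   # never actually wraps: x*y <= (M-1)*(N_-1) < L

print(umult(200, 250, 8, 8))   # -> 50000, exact even though inputs are 8-bit
```

This is the cleartext behavior that low-bitwidth matrix multiplication relies on: products are formed in a wide group and only later truncated back to the output format.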
We denote the corresponding offset gate class by Ĝ^umult, with the offset functions defined accordingly.

[Table (Appendix B): spline endpoints (Left, Right) and spline coefficients a_2, a_1, a_0 of each piece a_2 x^2 + a_1 x + a_0.]
where 0 < d < N. Let ⟨a⟩_0, ⟨a⟩_1 ∈ S_N denote additive secret shares of a over S_N, i.e., ⟨a⟩_0 and ⟨a⟩_1 are random elements of S_N subject only to the constraint that (⟨a⟩_0 + ⟨a⟩_1) mod N = a. The following theorem [31] allows expressing rdiv(a, d) in terms of ⟨a⟩_0 and ⟨a⟩_1.

Theorem 6 (Division of ring element [31]). Let shares of a ∈ S_N be ⟨a⟩_0, ⟨a⟩_1 ∈ S_N, for some N = N_1·d + N_0 ∈ Z, where N_1, N_0, d ∈ Z and 0 ≤ N_0 < d < N. Let the unsigned representations of a, ⟨a⟩_0, ⟨a⟩_1 in S_N lifted to integers be a_u, a_0, a_1 ∈ {0, 1, . . . , N − 1}, respectively, such that a_0 = a_0^1·d + a_0^0 and a_1 = a_1^1·d + a_1^0, where a_0^1, a_0^0, a_1^1, a_1^0 ∈ Z and 0 ≤ a_0^0, a_1^0 < d. Let N′ = ⌈N/2⌉. Let corr, A, B, C ∈ Z be defined as

  corr = −1 if (a_u ≥ N′) ∧ (a_0 < N′) ∧ (a_1 < N′); 1 if (a_u < N′) ∧ (a_0 ≥ N′) ∧ (a_1 ≥ N′); 0 otherwise,
  A = a_0^0 + a_1^0 − (1{a_0 ≥ N′} + 1{a_1 ≥ N′} − corr) · N_0,
  B = idiv(a_0^0 − 1{a_0 ≥ N′} · N_0, d) + idiv(a_1^0 − 1{a_1 ≥ N′} · N_0, d),
  C = 1{A < d} + 1{A < 0} + 1{A < −d}.

Then, rdiv(a, d) = rdiv(⟨a⟩_0, d) + rdiv(⟨a⟩_1, d) + (corr · N_1 + 1 − C − B) mod N.

In the FSS setting, the dealer holds r^in, r^out ∈ S_N, while the two parties P_0 and P_1 hold x ∈ S_N, with the goal of computing rdiv(x − r^in, d) + r^out. We set ⟨a⟩_0 = x and ⟨a⟩_1 = −r^in in Theorem 6 (i.e., a = x − r^in mod N). We first compute A in the above theorem. To do this, we use the following fact (from [31]): let w = 1{a_0 + a_1 ≥ N}; then corr = 1{a_0 ≥ N′} + 1{a_1 ≥ N′} − w − 1{a ≥ N′}. Now, using DCF_{n,S_N}, P_0 and P_1 can compute shares of w = 1{a_0 + a_1 ≥ N} = 1{N − 1 − a_0 < a_1}. Similarly, shares of 1{a ≥ N′} can be computed using the FSS gate for g_{IC,N′,N−1}. These two computations can be done in parallel in the first round, and from this, the parties can compute shares of Ā = A + r^temp ∈ S_N, where r^temp ∈ S_N is a random mask chosen by the dealer.

In the second round, the parties first locally compute shares of B. Next, they reconstruct Ā, and then, along with an FSS gate for G^sCMP from [3], compute C. Shares of rdiv(x − r^in, d) + r^out can then be computed locally from shares of B, C and corr. The full FSS protocol for signed division is given in Figure 5.

Theorem 7. There is a 2-round FSS protocol (Gen^div_{n,d}, Eval^div_{n,d}) for G^div which has a total key size of 6n bits, plus the key size of DCF_{n,S_N}, plus the key sizes of FSS gates for g_{IC,n,⌈N/2⌉,N−1} and g_{sCMP,n}. This protocol requires 1 invocation of DCF_{n,S_N}, 1 invocation of Eval^IC_{n,⌈N/2⌉,N−1} and 3 invocations of Eval^sCMP_n.

Remark. In the special case when d is a power of 2, we have rdiv(a, d) = (a ≫_A log_2 d), i.e., an arithmetic right shift by log_2 d, and it is more efficient to use the (single-round) arithmetic right-shift (ARS) gate from [3] to perform signed division.

Average pool. The family of functions G^avgpool to compute the average of d elements is defined as g_{avgpool,n,d}(x_1, x_2, . . . , x_d) = (Σ_{i=1}^d x_i)/d = rdiv(Σ_{i=1}^d x_i, d) ∈ S_N, where x_1, x_2, . . . , x_d ∈ S_N. It is straightforward to derive a 2-round FSS protocol for G^avgpool from the protocol for signed division.

Theorem 8. There is a 2-round FSS protocol (Gen^avgpool_{n,d}, Eval^avgpool_{n,d}) for G^avgpool which has the same key size and evaluation cost as (Gen^div_{n,d}, Eval^div_{n,d}).

C.2 ReLU, Maxpool and Argmax

For the ReLU function, we use the FSS gate for g_{ReLU,n} from the work of [3]. With this gate, one can easily construct an FSS gate to compute the maximum of two elements by defining the function in terms of ReLU – i.e., g_{max,n}(x_1, x_2) = ReLU(x_1 − x_2) + x_2. We then build upon this to construct an FSS protocol for Maxpool (i.e., the function that computes the maximum of d elements) by computing the maximum of 2 elements at a time in a tree-like manner, resulting in (d − 1) comparisons done over ⌈log d⌉ rounds. Finally, Argmax (which computes the index with the maximum value out of d elements) is computed in a similar manner to Maxpool, in 2⌈log d⌉ rounds.

Theorem 9. There is a ⌈log d⌉-round FSS protocol (Gen^maxpool_{n,d}, Eval^maxpool_{n,d}) for maxpool on d elements, which has a total key size of n(d − 1) bits plus (d − 1) times the key size of the FSS gate for g_{ReLU,n}, and requires (d − 1) invocations of Eval^ReLU_n.

Theorem 10. There is a 2⌈log d⌉-round FSS protocol (Gen^argmax_{n,d}, Eval^argmax_{n,d}) for G^argmax which has a total key size of n(d − 1) bits, plus the key size of the FSS protocol for g_{maxpool,n,d}, plus (d − 1) times the key sizes of FSS protocols for g_{sCMP,n} and g_{×,n}. The protocol requires (d − 1) invocations of Eval^ReLU_n, Eval^sCMP_n and Eval^×_n each.
Fixed-point mixed-bitwidth spline protocol (Gen^(mixed,fixed)-spline_{(n_I,s_I,n_O,s_O,n_c,s_c),m,d,{p_i}_i}, Eval^(mixed,fixed)-spline_{(n_I,s_I,n_O,s_O,n_c,s_c),m,d,{p_i}_i})

Gen^(mixed,fixed)-spline_{(n_I,s_I,n_O,s_O,n_c,s_c),m,d,{p_i}_i}(1^λ, {f_i}_i, r^in, r^out):