0% found this document useful (0 votes)
53 views13 pages

1 s2.0 S266729522100009X Main

Uploaded by

Manan Kalavadia
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
53 views13 pages

1 s2.0 S266729522100009X Main

Uploaded by

Manan Kalavadia
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

High-Confidence Computing 1 (2021) 100008

Contents lists available at ScienceDirect

High-Confidence Computing
journal homepage: www.elsevier.com/locate/hcc

A survey of federated learning for edge computing: Research problems and


solutions
Qi Xia∗, Winson Ye, Zeyi Tao, Jindi Wu, Qun Li
Department of Computer Science, College of William and Mary 251 Jamestown Rd., Williamsburg, VA 23185, USA

a r t i c l e i n f o a b s t r a c t

Keywords: Federated Learning is a machine learning scheme in which a shared prediction model can be collaboratively
Federated learning learned by a number of distributed nodes using their locally stored data. It can provide better data privacy be-
Edge computing cause training data are not transmitted to a central server. Federated learning is well suited for edge computing
applications and can leverage the the computation power of edge servers and the data collected on widely dis-
persed edge devices. To build such an edge federated learning system, we need to tackle a number of technical
challenges. In this survey, we provide a new perspective on the applications, development tools, communication
efficiency, security & privacy, migration and scheduling in edge federated learning.

1. Introduction laborate the computational resources from different devices, but also
preserve the privacy at the same time.
The proliferation of real time technologies such as VR, AR, and self- Given the common features of both edge computing and federated
driving cars has led researchers and industry executives to come up with learning, edge computing is a naturally suitable environment to ap-
new architecture for data processing. The traditional model of cloud ply federated learning framework. Therefore, edge federated learning
computing is unsuitable for applications that demand low latency, so is more and more appealing in both academic research and industry in
as a result a new model of computation termed edge computing has recent days. Here, we first have a brief introduction to edge computing
sprung forth. Edge computing is primarily concerned with transmitting and federated learning respectively and discuss about their key advan-
data among the devices at the edge, closer to where user applications are tages.
located, rather than to a centralized server (see Fig. 1). Edge node (or
edge client, edge device) is usually the resource-constraint device that 1.1. Edge computing
end user uses and it is geographically close to the nearest edge server,
who has abundant computing resources and high bandwidth communi- There are a set of key reasons why industry executives are transition-
cating with end nodes. When the edge server requires more computing ing from a traditional cloud-based model to edge computing platforms.
power, it will connect to the cloud server. The most important conse- The two major factors that were already discussed beforehand are low
quences of this architecture are twofold: latency is dramatically reduced latency and high bandwidth [1]. However, the edge also provides for
as data does not need to travel as far, and bandwidth availability im- greater security. For example, sending data to an edge device will give
proves significantly, as the user is no longer relying on sharing a single any potential attackers less time to launch an attack as compared to
traffic lane in order to transfer their data. Indeed, this new computing the cloud simply because the latency is lower. Moreover, attacks like
paradigm offers great cost savings for companies who do not have the DDoS that would normally be debilitating in a cloud-based environ-
resources to build dedicated data centers for their operations. Instead, ment are rendered almost harmless in an edge computing environment
engineers can build a reliable network of smaller and cheaper edge de- because the affected edge devices can be removed from the network
vices. without hampering the overall functionality of the network as a whole.
In addition, federated learning has been discussed a lot recently. It Of course, this also means that edge networks are much more reliable
is a collaborative machine learning framework allowing devices from as they do not have a single point of failure. As discussed briefly before-
different resources with different private datasets working together to hand, edge networks are much more easily scalable because the devices
study and train a global model. Federated learning can not only col- have much smaller footprints. Indeed, a scale-out strategy of scalabil-
ity rather than a scale-up one offers companies a very attractive way of


Corresponding author.
E-mail addresses: [email protected] (Q. Xia), [email protected] (W. Ye), [email protected] (Z. Tao), [email protected] (J. Wu), [email protected] (Q. Li).

https://doi.org/10.1016/j.hcc.2021.100008
Received 24 December 2020; Received in revised form 3 March 2021; Accepted 3 March 2021
2667-2952/© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/)
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

from the same memoryless stochastic process. However, no such as-


sumption is made in the federated learning setting [3]. Instead, datasets
can be heterogeneous. For example, an ML model designed to recognize
criminals within a neighborhood may rely on camera footage collected
by a diverse group of users. Clearly, one cannot reasonably expect the
footage collected between two users to be i.i.d.
The promise of federated learning is appealing to many users. There
are a number of key advantages to take note of:

• Training time is reduced. Multiple devices are used to calculate gra-


Fig. 1. Devices by distance to user. dients in parallel, which offers significant speedups.
• Inference time is reduced. At the end of the day, each device has their
own local copy of the model, so predictions can be made extremely
getting good performance with low cost. Moreover, some of these edge quickly and without having to rely on slow queries to the cloud.
devices or edge data centers may not even need to be built from scratch • Privacy is preserved. Uploading sensitive information to the cloud
by any one company. Different stakeholders can partner up to share the presents a major privacy risk for applications like healthcare devices.
resources from the already existing IoT devices in the edge network. Privacy breaches in these settings may literally be a matter of life and
In order to deliver these benefits to end-users, engineers have re- death. As such, keeping data local helps preserve the privacy of end
lied on a common set of key operating principles when building edge users.
computing systems [2]: • Collaborative learning is easier. Instead of having to collect one mas-
sive dataset to train a machine learning model, federated learning
• Mobility: For applications like self-driving cars, the edge devices
allows for a ”crowdsourcing” of sorts that can make the data col-
have to accommodate a constantly moving end-user without sacri-
lection and labeling process much easier in terms of time and effort
ficing latency or bandwidth. Some approaches solve this problem by
spent.
positioning edge devices on the roadside.
• Proximity: In order to deliver low latency guarantees, the edge de- Because of the natural advantages of both edge computing and fed-
vices must be positioned as close as possible to the end users. This erated learning, and the fact that edge computing is a very suitable en-
could mean performing computation directly at the edge device or vironment to deploy federated learning, the promising future of edge
investing in a local edge computing data center that is close to the federated learning prompts us to present a survey designed to explore
end-user. the research problems that arise in this new area. The rest of this paper
• Coverage: For edge computing to become ubiquitous, network cov- will detail these challenges, summarize how the state of the art solves
erage must be far-reaching. Thus, the exact distribution of nodes in them, as well as provide our own insights into the future of this field. In
an edge computing framework is imperative to achieving an optimal this article, first, we will go into details regarding how different applica-
user experience. Of course, a dense distribution is preferred, but this tions make use of edge federated learning frameworks. Second, we will
must be balanced with cost constraints. discuss existing programming models for edge federated learning. Third,
we will discuss communication and computation efficiency. Fourth, we
1.2. Federated learning will discuss security and privacy. In the end, we will discuss resource
allocation and migration.
Federated learning is a method for training neural networks across
many devices. In this model of computation, a single global neural net- 2. Applications
work is stored in a central server. The data used to train the neural
network is stored locally across multiple nodes and are usually hetero- Edge federated learning solves the data island problem by fully ex-
geneous. Here we assume we have several nodes 𝑛𝑜𝑑𝑒1 , 𝑛𝑜𝑑𝑒2 , ⋯ , 𝑛𝑜𝑑𝑒𝑛 . ploring the huge potential of the data on terminal devices without in-
On the node side, 𝑛𝑜𝑑𝑒𝑖 keeps the private dataset 𝜉𝑖 . If we assume the fringing on user’s privacy, and it greatly improves the efficiency of
loss function on the neural network is 𝑓 (⋅), in one synchronization, 𝑛𝑜𝑑𝑒𝑖 model learning in edge computing systems. Therefore, it can be widely
computes the updated weight based on the current weight at time 𝑡 𝑤𝑖𝑡 , used in many scenarios where privacy protection and resource utiliza-
step size at time 𝑡 as 𝛾𝑡 , and its private dataset 𝜉𝑖 : tion are critical. In this section, we will discuss a few scenarios for edge
federated learning, and some recent work applied in these scenarios.
𝜕𝑓 (𝑤𝑡 , 𝜉𝑖 )
𝑤𝑖𝑡+1 = 𝑤𝑡 − 𝛾𝑡 ⋅ (𝑖 = 1, 2, ⋯ , 𝑛) (1)
𝜕𝑤 2.1. Healthcare system
Note that this local update can run one or several iterations. On the
server side, it receives the weights that are uploaded by all the nodes. The excellent performance of deep learning in complex pattern
Central server uses an aggregation function 𝐴(⋅) to aggregate all the up- recognition tasks makes it more widely used in the medical industry.
loaded weights and update the weights for the next round. The updated For each medical institution, its data is separately stored and processed
weights at time 𝑡 + 1 are: in the edge node, but the model trained with a small dataset that is
collected from an individual medical institution does not have a satis-
𝑤𝑡+1 = 𝐴(𝑤1𝑡 , 𝑤2𝑡 , ⋯ , 𝑤𝑛𝑡 ) (2)
factory accuracy when it is applied to the unseen data that is somehow
In practice, we usually simply use an average function to aggregate the uncorrelated with the training data. Therefore, a large amount of real
uploaded weights and update the global model. The model is replicated electronic health records (EHR) is needed to train a powerful medical
across all the end devices as needed so predictions can be made locally. model. However, the demand for real dataset is hard to satisfy because
Because of the heterogeneity of federated learning, we do not require all of the sensitivity and privacy of medical data. Edge federated learning
𝑛 nodes to participate in one synchronization. Only some of the nodes can help overcome this problem, allowing medical institutions to collab-
will be randomly selected to perform the computation. orate on training models without sharing patient data so that they can
Note that federated learning is distinct from the traditional dis- meet the requirements of data privacy protection and the Health Insur-
tributed computing scenario. The most profound difference lies in the ance Portability and Accountability Act (HIPAA). For example, Liu et al.
assumptions made on the datasets. In distributed learning, the partitions trained a chest X-ray image classification model using federated learn-
of the dataset are assumed to be i.i.d., meaning that they are generated ing for COVID-19 [4]. Sheller et al. used edge federated learning to train

2
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

an image semantic segmentation model for brain scans and the simula- sonalized models by exploring user similarities without violating user
tion result shows that the performance of the proposed model is similar privacy.
to the model trained with shared data [5]. Their extended work applied
the final model selection mechanism in which each medical institution 3. Development tools
selects the best locally validated model for global model aggregation to
achieve better performance for the medical image learning model [6]. There are many concerns that the programmer needs to take into
In addition to medical institutions, the edge federated transfer learn- account when designing an edge federated learning system. Issues such
ing method is applied to personal health measurement devices. Some as different APIs, dataflow models, network configurations, and device
personal healthcare devices, such as blood pressure meter and activity properties have to be considered. In light of the complexity involved
recognition device, are utilized to observe health conditions and push in edge federated learning, it is important that the research community
health alarms in time, which plays an important role in smart health spend time developing tools that can help programmers build edge fed-
systems [7]. For users, it is necessary to have a ready-made model at erated learning systems more easily. In this section, we will be discussing
the beginning and train a personalized model updated by their physi- the following areas that could benefit from development tools:
cal conditions in real time. Chen et al. proposed an accurate and per- • Application-level support. This is concerned with providing easy to
sonalized healthcare model FedHealth [8]. Moreover, FedPer [9] and use APIs for the developer.
pFedMe [6] can be used to collaboratively learn a model at the network • Systems design Support. This is concerned with providing helpful
edge while capturing personalization. abstractions for systems level technicalities such as network config-
uration.
2.2. Vehicular network
3.1. Application-level support
The data generated by the device on vehicles such as location and ori-
entation detected by the GPS, images captured by the on-board camera, First, let us discuss the available application-level support for edge
and the pressure data from the oil pressure sensor, are valuable resources federated learning systems. Ideally, any application-level support would
for vehicle manufacturers to provide intelligent navigation services and come in the form of easy-to-use integrated development environments
early warnings. The on-board computer collects locally generated sens- or APIs that can help the average developer perform common edge fed-
ing data, then uploads it to the Vehicle Edge Computing (VEC) system erated learning tasks easily. The reader can think back to the many clas-
to train the local learning model. Edge federated learning in VEC can sical software engineering tools such as numpy for numerical processing
meet the needs of users for smart vehicle decision-making. For instance, or IDLE for Python programming as examples of good application-level
a clustering-based federated energy demand learning approach is imple- support tools.
mented by Saputra et al. in [10] for electric vehicle networks to make One work of note is the programming model proposed by Hong et al
energy demand prediction in the considered areas. [17]. The authors provide a set of event handlers that the program-
In addition, image classification is a typical task in vehicular net- mer must implement and functions that individual applications can call
works. The performance and efficiency of edge federated learning are upon. This way, whenever a significant event occurs, such as when a
highly impacted by the training data quality and the computational message arrives from another device, the programmer can rest assured
power of edge nodes, respectively. Ye et al. proposed a selective model that the event handlers will do most of the work. For federated learning
aggregation approach [11], in which a model is selected if its training in particular, some adjustments may need to be made. For example, fed-
images are in high quality and the edge node has sufficient computa- erated learning systems usually require aggregation functions in order
tion capability. In order to further improve the learning accuracy and to assemble all the local gradients. This would have to be provided in
encourage devices with high-quality data to join the model training pro- the API. Event handlers would also have to be implemented to facilitate
cess, Kang et al. design an incentive mechanism [12] using the contract different stages of the learning process, such as when a round of learning
theory. has finished. However, given the easily extensible nature of their frame-
Moreover, autonomous vehicles are equipped with more sensors work, we believe that it would be fairly straightforward to implement
than regular vehicles such as LiDAR and ultrasonic sensors to perceive these changes for federated learning systems.
the surrounding environment without human interaction. Edge feder- Another significant paper by Giang et al. [18] focuses on developing
ated learning is a desirable solution in the VEC system to learn a privacy- a good abstraction that allows developers to reason about the complex-
preserving machine learning model from non-IID vehicular data [13]. ities of edge federated learning more easily. In particular, they propose
a methodology for federated learning systems using dataflow graphs.
2.3. Intelligent recommendation Even though this idea was proposed for edge computing, it is possible
to generalize this kind of framework to edge federated learning systems.
Intelligent recommendation is a useful function in smartphone or Their dataflow program can handle three key issues: 1) heterogeneity,
desktop applications to predict user choices so that users can easily ac- 2) mobility, and 3) scalability. For example, to avoid vertical and hori-
cess and use it. Compared with standard machine learning approaches, zontal heterogeneity, the program contains specialized nodes developed
edge federated learning is capable of effectively training flexible models by domain experts that can only be wired together with specific nodes.
for recommendation tasks. Because edge nodes are located in a certain Mobility requirements can be fulfilled through code duplication. Scal-
area and have similar tasks for efficiency and cost reasons, this kind of ability requirements are met by eliminating the need for an internal
similarity among edge nodes can be used to train adaptive models by management system to coordinate communication between nodes.
edge federated learning. For instance, the researchers from Google Key-
board (Gboard) team train models using edge federated learning on a 3.2. Systems design support
global scale for virtual keyboard search suggestion [14] and emoji pre-
diction [15], and the evaluation results show that the models on each Next, let us discuss the available development tools for system level
edge node have a good performance because the models are adjusted to design. Ideally, we are looking for tools that can help developers accom-
different language and culture styles in a specific area. In addition, Hart- plish systems-level tasks such as load balancing, resource management,
mann et al. show that the browser option suggestion model trained with or migrations easily. Some of these features may be integrated into a
federated learning can help users quickly find the website they need by larger IDE designed for edge federated learning. For example, a tool
entering fewer characters [16]. The work can be improved in edge fed- akin to MapReduce would be very helpful in the edge federated learn-
erated learning systems to provide different users with relatively per- ing setting.

3
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

The primary work here is Bonawitz et al.’s work based on TensorFlow 4. Communication and computation efficient edge federated
[19]. The major contribution of their work is offering a mature systems learning
level framework that the developer can use to deploy their federated
learning applications. They deal with a variety of key issues: 1) device Edge federated learning is a privacy-preserving machine learning
availability, 2) resource management, and 3) reliability. In a federated framework where the data is distributed across many resource con-
learning setting, devices cannot be expected to always be available for a strained edge devices. It shares the same training procedure as baseline
given round. The authors implement a ”pace steering” mechanism that federated learning [21] that is an edge server that distributes an initial
allows the server to suggest reconnection times that will ensure a suffi- model to each edge node who independently updates the model (local
cient number of devices are connected to ensure progress in the learning model) via local data, and the global model is updated by aggregating a
task. In order to handle the limited resources issues in end devices, the subset of the local models. The server broadcasts this new global model
authors created their framework such that a learning job is only run on to all nodes to start a new round of local training. This training proce-
the cellphone when it is idle, charging, and connected to WiFi. Besides, dure repeats until some criterion meets.
to deal with the limited storage capabilities of cellphones, the authors
provide programming utilities to help minimize the storage footprint
4.1. Scale of federation
of the local data these devices must store in order to participate in the
federated learning task. As for the reliability, the authors maintain a
Similar to the conventional federated learning [22], the edge feder-
coordinator to to deal with issues such as data crash, learning fail, etc.
ated learning can also be categorized into two types by the scale of fed-
eration: cross-silo and cross-device edge federated learning. Cross-silo
3.3. Future directions
edge federated learning trains data from different organizations (e.g.
medical center or geo-distributed datacenter). On the other hand, cross-
In the future, we believe that the research community should focus
device federated learning trains data on many IoT devices. The major
their attention on three key areas: 1) containerization, 2) security frame-
difference between them is the number of participating training nodes
works, and 3) extending current edge computing programming models
and the amount of training data stored on each node. In this section, we
to the edge federated learning setting.
discuss the impact of the scale of federation and how it affects commu-
By containerization, we are referring to the management of all the
nication and computation cost on edge federated learning.
various execution environments that devices in the edge federated learn-
ing setting will utilize. Since many of the devices in edge federated learn-
ing may be IoT devices, it is important that operating systems remain as 4.1.1. Cross-device edge federated learning
lightweight as possible. Several key challenges are present in this area. In cross-device edge federated learning, the number of active train-
For example, how will programmers manage all the containers running ing nodes is in the order of millions and each node has relatively small
on heterogeneous devices? How will these containers communicate with amounts of data as well as computational power [23]. The nodes are
the outside world without exposing themselves to any security vulner- usually portable devices or sensors. A remarkable example here is im-
abilities? Encouraging progress has already been made in the industry. proving the query suggestion of Google Keyboard by [14]. The major
For example, WeBank’s KubeFate1 2 allows developers to run federated challenges that cross-device edge federated learning faces are:
learning tasks across multiple containers, with features like security and
privacy already built into the framework. Other notable examples in- • There are extremely high communication costs when edge servers
clude TensorFlow Federated,3 PySyft,4 and PaddleFL.5 While all these synchronize the training models and broadcast a new global model
applications do address the containerization issue, they are not yet ma- to each node for next step training.
ture technologies. For example, many of these applications reply on Ku- • It is hard to efficiently manage a large number of nodes and deal
bernetes for container orchestration, which some users may take issue with possible issues such as the unexpected network connectivity
with because of the high overhead. between node and server.
Second, researchers should consider building security APIs that pro-
Given the number of total nodes  , selection rate 𝜂, the total com-
grammers can utilize in order to secure their own edge federated learn-
munication cost for one round training can be formulated as
ing systems. Current theory on the subject covers ideas such as differ-
ential privacy, homomorphic encryption, multi-party computation, and 2⋅𝜏 ⋅ ⋅𝜂⋅ (3)
secure enclaves. Nevertheless, major challenges exist when it comes to
implementing these security measures. For example, Kairouz et al. note where 𝜏 indicates the number of global synchronizations that model
that there is not yet a methodology for distributing federated learning can converge and  is the raw size of the training model including all
functions across trusted execution environments [20]. weights and training metadata. For the sake of simplicity, we denote
Finally, as the reader may have noticed in the previous sections,  as the number of total trainable parameters 𝑃𝑛 multiplies its pre-
many of the previous work referenced do not pertain directly to edge cision such as  = 𝑃𝑛 ⋅ bit (4, 8, 16, 32). For example, as the winner
federated learning. Instead, they refer to edge computing systems in gen- of ILSVRC-2012 competition, AlexNet [24] comes along with nearly 61
eral. As such, it is imperative that researchers focus on extending these million 32-bit real value parameters with an actual model size of 233MB.
tools to the edge federated learning setting. Edge federated learning is In the original federated learning, model aggregation happens on every
unique in that machine learning tasks require massive computations as global synchronization and it requires selected nodes passing their lo-
well as storage capabilities. For the model to perform well, gradients cal models 𝑊𝑡𝑘 where 𝑘 ∈ [𝑁] to the central server. This setting can be
must be coordinated carefully as the sudden failure of a few devices or further relaxed to be only passing the local updates Δ𝑊𝑡𝑘 = 𝑊𝑡𝑘+𝑒 − 𝑊𝑡𝑘
the presence of malicious actors may cause the model to act unexpect- where 𝑒 is the number of local epochs. The keys to reduce the commu-
edly. nication cost in edge federated learning as shown in Eq. (3) are total
communication rounds 𝜏 and model size .
1 FATE: Federated AI Technology Enabler, https://github.com/FederatedAI/ To reduce the communication cost in edge federated learning, one
FATE .
can reduce the size of local update Δ𝑊𝑡𝑘 by either vector quantization or
2
KubeFATE: https://github.com/FederatedAI/KubeFATE . specification. On the other hand, we can also find the optimal choice of
3
TensorFlow Federated: https://github.com/tensorflow/federated . 𝜏 for minimizing the overall communication cost of the process. The for-
4
PySyft: https://github.com/OpenMined/PySyft . mer has been widely studied in the past decade. However, determining
5
PaddleFL: https://github.com/PaddlePaddle/PaddleFL . the optimal communication rounds 𝜏 seems tricky. This is due to:

4
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

• Increasing the number of training nodes can significantly put the cess data in a non-i.i.d manner. This against a commonly used assump-
negative influence on performance of the model therefore it requires tion that all training data is drawn from an independent and identically
more global synchronization rounds to meet certain criterion. distributed data source. The presence of non-i.i.d data used in edge fed-
• Training on highly decentralized and heterogeneous data makes the erated learning leads to local models divergence. The further network
contribution of each local model shard to the global mode to be quantization and pruning makes the divergence problem even worse. In
rather limited which prolong the training epochs. the following sections, we will introduce communication efficient and
computation efficient techniques.
Although increasing the local training epochs may mitigate this is-
sue, it introduces extra computational workload and power consump-
4.2. Communication efficient methods
tion on each node. Finding optimal communication iterations remains
an open problem.
The optimization methods for federated learning are largely inher-
iting from the conventional distributed machine learning optimization.
4.1.2. Cross-silo edge federated learning The distributed first-order stochastic gradient descent (SGD) optimiza-
In cross-silo edge federated learning, on the contrary, the number of tion methods have been largely studied in literature [29–33]. Local-SGD,
nodes is relatively small, but it requires the nodes to have sufficient com- as another approach to train the neural network in a distributed manner
putational resources for processing a huge amount of data on each edge with less communication has been studied in [34]. They considered a
server. For example, big online retailers would recommend items for master-worker topology and provided theoretical analysis for the con-
users by training tens of million shopping data stored in geo-distributed vergence of local-SGD. The fundamental difference between distributed
data centers. In the above settings, the challenge is how edge federated SGD and local-SGD is the use of training data. More specifically, if every
learning efficiently distributes computation to edge servers under the local node uses training data that comes from the same data distribution,
constraint of computation budgets and privacy models. In recent years, the local-SGD is equivalent to its distributed version. However, if local
many researchers tend to deploy large and powerful deep networks to re- nodes use arbitrarily heterogeneous training data, local-SGD and dis-
source constrained devices because deeper and wider networks usually tributed SGD are entirely different. We cannot expect the model updates
achieve better performance than shallow networks [25]. It also relies on (or gradients) to be drawn from the same unknown distribution even
powerful end devices. when local epoch 𝑒 = 1. Although we cannot directly use distributed
Given a certain level computation budget, network quantization and version techniques to address the communication bottleneck in edge
pruning can further expand the reach of edge federated learning. Quan- federated learning, the vector quantization and sparsification are still
tizing network weights with a small number of bits can significantly main-stream optimization strategies for edge federated learning. In this
accelerate network training and inference as well as reduce model size. section, we summarize some methods of communication efficient train-
In [26], researchers reduce float version VGG-19 [27] from ∼ 500𝑀 to ing in conventional distributed learning and discuss how they related to
∼ 32𝑀 with ternary precision and no accuracy loss. Many network quan- edge federated learning.
tization mainly focuses on optimization at a single model. The quantized
weights are only used during the forward and backward propagations
4.2.1. Gradient qantization
but not during the parameter updates. It is very obvious that passing
When using SGD or other first order gradient method as model op-
the non-quantized updates in edge federated learning is undesired. A
timizer, quantizing the gradients to its low precision value has been
good starting point is from [28], authors introduce an algorithm allow-
widely adopted. Gradient quantization has been explored in the [35–
ing model updates to be quantized before being transmitted but they
39]. In particular, [40] summarized the general gradient quantization
do not use quantized training models. One open question here is that
scheme 𝑄(𝑔, s, 𝑙) as
can we train quantized models in edge federated learning while using
quantized model updates at the same time such that we can achieve both 𝑔̂𝑡 = s ⋅ sgn(𝑔𝑡 ) ⋅ 𝜅(𝑔𝑡 , 𝑙) (4)
communication and computation efficient edge federated learning train-
where s is a shared scaling factor (possible choices include ‖𝑔𝑡 ‖2 or
ing. Network pruning can be very beneficial, and it can also efficiently
‖𝑔𝑡 ‖∞ ), and sgn(⋅) returns the sign of gradient coordinate 𝑔𝑡 . 𝜅(⋅, ⋅) is
reduce the complexity of neural network models. A common approach
an independent random variable defined as follows. Let 0 ≤ 𝑘 ≤ 𝑙 be an
in network pruning is dropping the parameter with small enough mag-
integer such that |𝑔𝑡 |∕‖𝑔𝑡 ‖ ∈ [𝑝∕𝑙, (𝑝 + 1)∕𝑙], then
nitude. Similar to network quantization, the existing network pruning
{
algorithms are also limited to the single non-distributed settings, and |𝑔 |
𝑝∕𝑙, w.p 𝑝 − s𝑡 ⋅ 𝑙 + 1
neural networks are usually pruned step by step, that is, we train model 𝜅(𝑔𝑡 , 𝑙) ≜ (5)
(𝑝 + 1)∕𝑙, otherwise
until convergence before the next step pruning. In this way, local net-
work pruning increases computation consumption on local devices and A concrete explanation of the above formula is from [36]. TernGrad
delays the training. Recently, federated pruning has drawn much re- compresses gradients into ternary values {−1, 0, 1} with a stochastic
search attention. The models are kept being pruned together with the quantization function to ensure the unbiasedness. Terngrad sets the
standard FedAvg learning process. Federated pruning allows us to train quantization level 𝑙 = s, and chooses shared scaling factor among all
networks in a computation efficient as well as communication efficient workers such that s = max(‖𝑔𝑡 ‖∞ ) for all 𝑚 ∈ [𝑁]. Other work such
manner because we only need to upload the non-zero parameters for as [35], the authors adventurously applied 1-bit SGD on speech DNNs to
synchronization. One potential drawback is that it is hard to find opti- reduce data-exchange bandwidth and they empirically showed its feasi-
mal pruning ratio. bility on distributed environments. An error feedback scheme is intro-
It is worth mentioning that all the efficient approaches in both cross- duced during quantization, to compensate for the quantization error.
silo and cross-device edge federated learning could be corrupted due to Zhou et al. [38] proposed the DoReFa-Net to train convolutional net-
the system level heterogeneity and statistical level heterogeneity. Sys- works with weights, and gradients all quantized into fixed-point num-
tem heterogeneity refers to the different hardwares (CPU, GPU, mem- bers.
ory), network configurations, and power supplies of nodes in edge feder- FedPAQ [41], to our best knowledge, maybe the first study that
ated learning. Different computation capabilities may cause the unfair- bridges the gap between distributed gradient quantization and federated
ness results among local models and downgrade the fusion model. Dif- learning. The idea of FedPAQ is quite straightforward: using quantized
ferent network configurations may cause important local model shard model updates during the FedAvg process. Applying gradient quantiza-
missing and increase training time. The statistical heterogeneity refers tion methods to federated model updation should be very careful. The
to the highly non-i.i.d training data. Nodes frequently collect and pro- model divergence is enlarged by training on highly distributed non-i.i.d

5
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

data. Especially, when the training scale increases, quantized model up- The former uses different weight quantization resolution such as bi-
dates introduce lots of quantization variances. One possible solution is nary weights [62], which uses only one bit for each weight while still
to use error compensation quantization for model updation mentioned achieving state-of-the-art classification results. Also [26,63] added scal-
in [42]. The variance reduced SGD [43] and variance reduced quantized ing to the ternarized weights, and DoReFa-Net [38] further extended
SGD [44] are also helpful. quantization to any quantization levels. In a weight quantized network,
𝑚 bits where 𝑚 ≥ 2 are used to represent each weight. Let  be a set
4.2.2. Gradient sparsification of 2𝑘 + 1 quantized values, where 𝑘 = 2𝑚−1 + 1. The linear quantiza-
Gradient sparsification is also a popular and efficient method. The tion scheme has  = {−1, − 𝑘−1 𝑘
, ⋯ , 𝑘−1
𝑘
, 1} and logarithmic quantiza-
intuition of sparse approaches is straightforward, that is, dropping the tion scheme has 𝑄 = {−1, − 12 , ⋯ , − 2𝑘1−1 , 0, 2𝑘1−1 , ⋯ , 1}. When 𝑚 = 2, both
less beneficial coordinates of gradient vectors and then synchronizing on schemes reduce to  = {−1, 0, 1}. Both quantization schemes can be ap-
PS to ensure the unbiasedness. Therefore, reduction of communication plied to any hidden layer in the model. Particularly, in order to constrain
cost is adopted. The insight of gradient sparsification can attribute to: a CNN to have binary weights, a series of binary filters 𝐵1 , 𝐵2 , ⋯ , 𝐵𝑛 ∈
• DNNs usually over-parameterized [45] with a considerable number {−1, 1}𝑐𝑖𝑛 ×𝑤×ℎ×𝑐𝑜𝑢𝑡 is used to estimate the real-value weight filter 𝑊 ∈
of parameters that have numerical values close to zero, and therefore 𝑅𝑐𝑖𝑛 ×𝑤×ℎ×𝑐𝑜𝑢𝑡 such that 𝑊 ≈ 𝛼1 𝐵1 + 𝛼2 𝐵2 + ⋯ 𝛼𝑛 𝐵𝑛 which is a linear com-
resulting in spare sub-gradients. bination of 𝑛 binary or tenary filters. Here 𝑐𝑖𝑛 × 𝑤 × ℎ × 𝑐𝑜𝑢𝑡 is the dimen-
• Sparse SGD can be formulated as variants of delaying weight updates sion of weights. The optimal estimation can be solved by minimizing the
such as asynchronous SGD [46]. following optimization problem.

To avoid the information loss when coordinate dropping occurs,


min 𝐽 (𝛼, 𝐵 ) = ‖𝑊 − 𝛼𝐵 ‖2 , (6)
[47] applied gradients accumulation for gradient value less than the pre-
defined updating threshold. Unfortunately, updating those out-of-data
(stale) gradients cause convergence, slow down and model performance In addition, another approach known as loss-aware network quantiza-
degeneration. Early studies such as [48–51] used similar constant-like tion minimizes the loss directly w.r.t the quantized weights and often
or fixed ratio thresholds to perform gradient coordinates selection. It is achieves better performance than approximation-based methods. The
impractical to use the above methods, because the threshold is hard to existing weight quantization methods above simply find the closest ap-
choose for particular DNNs and our experiments show the threshold- proximation of weight and ignore its effects to the model loss. However,
based methods even fail to converge in some cases. Recently, a series of it uses full-precision weights during the training process and extra gra-
hybrid strategies via combining gradient sparsity and vector quantiza- dient information, which is expensive [59].
tion have been proposed. A heuristic algorithm proposed by [52] aimed In edge federated learning, if we could perform training with a quan-
to automatically tune the compression rate and then quantize them for tized model on each device, it can significantly reduce the computa-
updates. With limited coding length of stochastic gradients, and con- tional burden and accelerate the training and inference. To our best
strained gradient variance budget, [53] achieved a high compression knowledge, there is no such work that can fully adapt to the edge feder-
ratio on 𝑙2 -regularized logistic regression by using gradient sparsifica- ated learning environment. One close approach [28] tried to update the
tion technique. global model by using quantized local modes. It reduces the communica-
The gradient level sparsification training possibly leads the final tion cost, however the local computational cost increases because it uses
model to be sparse. It will accelerate the model training on nodes later. a full-precision model for local training and extra work on computing
One tightly related work to gradient sparsification is [54]. The method ready-update quantized models. Most of the aforementioned network
so called Federated Drop aims to train the randomly selected sub-model. quantization methods cannot be directly used in edge federated learn-
These sub-models are subsets of the global model and, as such, the com- ing without modification. The challenge is we cannot ignore the model
puted local updates have a natural interpretation as updates to the larger divergence in edge federated learning and inappropriate quantization
global model. Sparsification is an easily applied method. It does not re- introduces much mode noises to the global model, which makes con-
quire any network topology changes or extra computation bandwidth. vergence speed slow.
However, the gradient selection is challenging and non-trivial and yet
this is still an open question in edge federated learning.

4.3. Computational efficient methods 4.3.2. Network pruning


Neural network pruning is an alternative way to reduce the com-
Deep neural networks have made significant improvements in lots plexity of neural network models and accelerate the deep neural net-
of computer vision tasks such as image recognition and objective detec- work on resource-limited edge nodes. Continuously dropping the small
tion. This motivates interests to deploy the state-of-the-art deep models magnitudes weights and finding an optimal substructure of the origi-
to real world applications like mobile devices. For those applications, nal network is the key mechanism of pruning methods. It can be well
it is typically assumed that training is performed on the server and test explained by lottery ticket hypothesis [64]. Magnitude-based pruning
is executed on mobile devices. However, in the cross-device edge fed- methods including [65–69] that train until convergence before the next
erated learning scheme, both training and inference phases are located pruning step is prohibited on edge nodes. The iterative pruning meth-
on mobile devices. These models often need considerable storage and ods [70,71] is more attractive. The dynamic pruning allows the net-
computational power, and can easily overburden the limited storage, work to grow and shrink during the training. These existing pruning
battery power, and computer capabilities of the model devices. techniques consider the centralized setting with full access to the train-
ing data, which is fundamentally different from edge federated learning
4.3.1. Network quantization settings. The pruning method for decentralized data training is under
To address the computational and storage issues, methods using discussion. PrunEdge FL [72] proposed a two-stage distributed prun-
quantized weights or activations in models have been proposed. The ing algorithm for federated learning. At beginning, a shared pruning
network has been accelerated by quantizing each full-precision weight model is sent to each node to train. Then PrunEdge FL performs dynamic
to a small number of bits. This can be further divided to two sub- pruning together with the standard FedAvg procedure. One drawback
categories, depending on whether approximating full-precision weights of magnitude-based and adaptive pruning methods is that it is difficult
with the linear combination of multiple binary weight bases at each to control the model size for the update. The structure of the submodel
iteration [26,55–58] or the model loss information is used [59–61]. constantly changes over the training.

6
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

4.4. Other efficient methods

Straggler problem, where the nodes lag because of computational


resource heterogeneity recently has drawn much attention [41,73,74].
Synchronously or asynchronously updating models over heterogeneous
network configurations is quite challenging. A tier-based federated
learning framework that updates local model parameters synchronously
within tiers and updates the global model asynchronously across tiers
are proposed by Chai et al. [73]. Another approach is called Het-
eroFL [75]. By coordinatively training local models which are smaller
than a global model to produce a single global inference model, Het-
eroFL is robust against the non-i.i.d statistical heterogeneity.
Optimal sampling problem is another interesting problem. The num-
ber of nodes in edge federated learning is in the order of millions and
each node in the system has very limited contribution to the global
model for each round. By sampling the important nodes, we can tremen-
dously save communication costs and accelerate the training process. Fig. 2. Federated learning with Byzantine attackers.
The Ribero and Vikalo [76] uses Ornstein-Uhlenbeck (OU) process, a
continuous stochastic process to adaptively decide the node-side model
updation. Rizk et al. [77] uses a non-uniform sampling scheme, where how to attack and how to defend. In this section, we will introduce the
the nodes are sampled according to some predefined distribution. current attack and defense algorithms in edge federated learning.

4.5. Future directions 5.1. Security in edge federated learning

Communication and computation are the key bottlenecks to con- Security issues arise in edge federated learning because of the het-
sider when developing methods for edge federated networks. Using tra- erogeneity of both edge computing and federated learning. On the one
ditional methods such as gradient quantization and sparsification is less hand, edge nodes and edge servers are usually from different sources
beneficial as we discuss early. The current efficient method studies are while at the same time, not all of them are trusted. On the other hand,
based on the standard FedAvg process and its variants. It is necessary to in federated learning, we usually assume that all nodes keep their own
discover more efficient algorithms other than FedAvg which are more private data and do not share them with other nodes or servers. Those
suitable for Federated learning schemes. There are some conditions that features improve the application generality of edge computing and keep
new algorithms should satisfy: more privacy for users’ private data, but they also increase the risk of
suffering malicious attacks. Much work has been done to address secu-
• It can achieve the same convergence speed as FedAvg and at least
rity issues in edge federated learning. We will first introduce two major
have the same performance as FedAvg.
ways to inject attacks followed by some existing algorithms to defend
• It can deal with both systems heterogeneity and statistical hetero-
attacks.
geneity challenges in edge federated learning.
• It can be easily applied to any edge federated learning applications
5.1.1. Attacks
(image recognition, NLP, etc.).
In this section, we introduce two major security attacks in edge fed-
Alternatively, the communication efficiency can be reduced through erated learning: Byzantine attack and poisoning attack.
fast model training (using less communication rounds) or important Byzantine attack
node sampling (using less nodes) based on Eq. (3). Byzantine problems have been explored since the beginning of
Computation efficiency, on the other hand, is another bottleneck for the distributed system. This problem was first introduced by Lamport
edge federated learning development. The core idea is to reduce the et al. [79] in 1982 and talks about the fails of the whole distributed com-
workload of local nodes by using lightweight models to reduce the us- puting systems if some nodes are attacked, compromised or failed. This
age of computation resources and memory. One possible solution is we problem was first introduced into the distributed machine learning area
could use neural architecture search (NAS) to find the optimal model for by Blanchard et al. [80] in 2017. We briefly define this problem together
nodes. Besides NAS, we can also use split learning [78]. Many machine with Figure 2, which is the structure of the federated learning system.
learning tasks have a heavy computation on the fully connected layers. As this figure shows, in the synchronous federated learning, Byzantine
We can transfer these computation burdens to a powerful edge server. problems exist when some nodes are attacked or compromised and do
The training data is processed on the local node therefore privacy is still not compute or upload weights correctly. In this scenario, the uploaded
preserved. weights 𝑤𝑖𝑡 in (2) may not be the real 𝑤𝑖𝑡 computed by (1). Theoretically,
the generalized Byzantine model that is defined in [80,81] is:
5. Security and privacy
Definition 1 (Generalized Byzantine Model).
Security and privacy problems are two major problems of imple- { 𝑖
𝑤𝑡 if 𝑖th node is honest
𝑤𝑖,𝑡 (𝑟) = (7)
menting federated learning in edge computing. Because of the natural 𝑎𝑖 ≠ 𝑤𝑖𝑡 otherwise
heterogeneous environment of federated learning and edge computing,
it is always hard to predict activities of other nodes in the system. For Here we denote 𝑤𝑖,𝑡 (𝑟) as the actual gradient received by central server
example, in the training process of edge federated learning, due to the from 𝑛𝑜𝑑𝑒𝑖 . In each iteration of the training phase, some nodes may be-
uncertainty of other nodes, some nodes may be malicious and attack the come Byzantine nodes and upload an attack gradient 𝑎𝑖 to the server.
training process. In addition, the curious nodes and servers may find it As we can see in Fig. 2, 𝑛𝑜𝑑𝑒3 here is a Byzantine attacker. It uploads
interesting to learn our personal data and want to retrieve the private an alternative 𝑤3𝑡 ,(𝑟) rather than the actual 𝑤3𝑡 to the server. The cen-
data from the update information we upload in each synchronization. tral server, at the same time, does not know 𝑛𝑜𝑑𝑒3 is compromised. It
In this scenario, the security and privacy of our node may be harmed. In aggregates all the uploaded gradients and sends the incorrect updated
summary, the security and privacy problems are divided into two parts: weights back to all nodes. According to Theorem 1 in [82], when the

7
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

aggregation function is an average function, one Byzantine attacker can 5.1.2. Defenses
take over the aggregation result and lead the whole training process to Although the methods to attack edge federated learning are different
an incorrect phase. for Byzantine attacks and poisoning attacks, the defense problems are
As we said before, federated learning is a special kind of distributed actually from the same structure, that is, the central server must distin-
machine learning, so the Byzantine problems remain the same or get guish the information uploaded by honest nodes from the information
even worse in edge federated learning. The main differences are listed uploaded by attack nodes. In the related literature, all those defense
below: algorithms are denoted as Byzantine-resilient algorithms.
To defend Byzantine attacks in distributed machine learning, there
• Edge computing environment is complex, where edge nodes and are basically three different directions.
servers are usually from different sources. This increases the pos-
sibility of suffering Byzantine attacks in edge federated learning. • Score-based method. This direction usually defines a metric to score
• In the federated learning, the data on each node is private and does each uploaded weight and finally we can choose the one with the
not share with each other, so the distribution of subdataset on each highest score as the aggregation results.
node can be either non-i.i.d. or i.i.d., while the heterogeneity makes • Median-based method. Geometric median and its modifications are
it easier to attack and harder to detect. used in this direction.
• In each iteration of the federated learning, only some of nodes are • Distance-based method. This kind of method uses the distance infor-
selected to perform the computation, making the honest majority mation in euclidean space to remove outliers.
assumption unreasonable. It is possible that more than half of nodes
are under attacks in each iteration. Blanchard et al. first proposed a score-based algorithm called
Krum [80] to measure the scores for each uploaded gradient in the cen-
tral server by the L2 norm sum of its closest 𝑛 − 𝑓 − 2 gradients and
Mainly there are three different ways to inject Byzantine at-
chose the gradient with the highest score as the aggregated gradient.
tacks [81].
After this, they also proposed another algorithm to resist asynchronous
Byzantine attacks [92]. However, Krum has a natural shortcoming that
• Gaussian attack. Simply generate Gaussian noise as attack gradients they can only choose one gradient among all uploaded gradients, which
and weights. reduces the convergence speed a lot.
• Omniscient Attack. Let the direction of gradients uploaded by the As for median-based methods, in fact, most of the following work
Byzantine nodes equal to the direction of the sum of all honest nodes. usually focused on using median-based aggregation methods rather than
• Flip bit attack. Flip some bit of the uploaded gradients and weights. predefined score-based methods. For example, Xie et al. proposed geo-
metric median, marginal median and median-around-median [81], Yin
Poisoning attack et al. proposed coordinate-wised median [90], Su et al. proposed a
Poisoning attack is a kind of attack to injection attacks in the feder- batch normalized median [93], Alistarh et al. proposed a more com-
ated learning training process. In general, there are two types of poison- plicated modification of median-based methods that is called Byzanti-
ing attacks: data poisoning and model poisoning. neSGD [94]. However, because the geometric median is a point that
5.1.1.0.1. Data Poisoning. Data poisoning is a naive way to at- minimizes the sum of distances to all points, in order to find the geo-
tack the federated learning system. It uses a simple idea: affecting the metric median, a recursive method is adopted and therefore the time
training process by changing the input data. One method to inject data complexity is really large.
poisoning attacks is proposed by Tolpegin et al. [83]. Their method is The last direction is distance-based method. Yin et al proposed
based on a simple intuition by flipping the labels [84] of the input data. coordinate-wise trimmed mean [90], Xia et al. proposed an alternative
The labels of the input data are randomly shuffled such that the in- method called FABA [82]. Instead of using geometric median to ag-
put data mismatch their corresponding labels, which has a significantly gregate the uploaded gradients, those methods used euclidean distance
negative impact on the classes that are under attack. Besides, Shafahi to remove outlier gradients. They adaptively remove outliers based on
et al. [85] propose a clean-labels attack. They introduce an optimization- the center current remaining gradients. They later provided another
based method for crafting poisons without requiring the attackers to Byzantine-resilient algorithm for large scale distributed machine learn-
make any modifications on the input data label. By conducting this at- ing [95].
tack, they can make the model fail at some specific task. Gu et al. choose As for the federated learning area, several Byzantine robust algo-
to mix the clean data with adversarial data [86] to attack the model rithms are proposed. The biggest difference between federated learn-
training [87]. ing and classic distributed machine learning is that the subdataset on
5.1.1.0.2. Model Poisoning. Unlike data poisoning, model poison- each node is non-i.i.d. distributed. Although all those algorithms in
ing does not manipulate the input data. It aims at manipulating the classic distributed machine learning methods still work in some scenar-
local model to inject backdoors or malicious attacks into the global ios, they really depend on how non-i.i.d. the datasets are. Ghosh et al.
model. Bagdasaryan et al. first introduced backdoor attack into feder- first talked about this problem in 2019 [96], in which they combine K-
ated learning area [88]. They use a model replacement method to inject means and trimmed mean together to achieve Byzantine-resilient. They
this attack from one or multiple compromised nodes. Therefore, after use K-means to gather the uploaded weights into several clusters and
aggregation in the central server, the updated global model will be in- use trimmed-means to remove the outliers. However, there is one prob-
jected and misclassify some predefined inference tasks. Bhagoji et al. lem here that maybe all uploaded weights in the same cluster are from
proposed another method to inject the targeted model poisoning and Byzantine nodes, and thus may affect the performance. Muoz-Gonzlez
stealthy model poisoning for standard federated learning [89]. They et al. proposed an adaptive model averaging algorithm to resist Byzan-
choose the attack gradients and weights by estimating the benign nodes tine attacks [97]. They divide all nodes into two sets: good clients set
updates and optimizing for both the training loss and adversarial objec- and bad clients’ set. Then in each iteration, they compare the uploaded
tive. Their experiments show a high successful attack rate against some weights with the aggregation results from nodes in good clients’ set and
Byzantine-resilient algorithms such as Krum [80] and coordinate-wise update both sets. Then they use the updated good clients’ sets to per-
median [90]. Fang et al. focus on attacking the model training with form the aggregation in the current iteration. Prakash et al. proposed
four different Byzantine robust algorithms [91]. In order to attack those a method based on the direction similarity and length similarity [98].
algorithms, they solve an optimization problem for each objective algo- Kang et al. proposed a decentralized method to achieve reliable feder-
rithm. ated learning for mobile networks [99].

8
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

5.1.3. Future directions 5.2.2. Defenses


Right now, all the proposed defense methods assume that either the The privacy-preserving techniques in edge federated learning are
data distribution is i.i.d. on all nodes or the majority of the nodes are roughly in two directions.
honest. However, these assumptions are not practical in edge federated
• Algorithm-based solutions. The most widely used technique in
learning. First, edge nodes usually collect their data from their own
the machine learning area to protect privacy is differential pri-
sources, so the data distribution is not necessarily i.i.d. Second, at syn-
vacy [108]. The common technique is to add noise while maintain-
chronization time, the central server randomly selects some of the edge
ing an acceptable performance.
nodes to perform the computation, among which the honest nodes may
• Encryption-based solutions. These kinds of solutions are in low-level
not be the majority even if more than half nodes are honest. In this set-
architecture to encrypt the communication information to protect
ting, honest majority is not a reasonable assumption through the whole
privacy such as secure multi-party computation [109].
training process.
Therefore, a more practical algorithm without assuming i.i.d. data Most of the existing algorithms are in these two directions, while
distribution and honest majority is expected in the future to defend the some of them are the combinations of several techniques to protect
security attacks to edge federated learning. At present, there are very privacy. There is several work talking about implementing differen-
few explorations in this direction. How to perfectly defend Byzantine tial privacy techniques in federated learning area. Wei et al. proposed
attacks remains an open problem. an algorithm called NbAFL, which adds artificial noise before aggrega-
tion [110]. Geyer et al. considered this problem in the client-side per-
5.2. Privacy in edge federated learning spective and they use random sub-sampling and Gaussian mechanism to
distort the sum of all updates [111]. Bhowmick et al. designed a new op-
Although federated learning is designed to protect each node’s pri- timal locally differentially private algorithm for all privacy levels [112].
vate training data without relying on training data transmission between Ghazi et al. proposed a simple differential privacy-based algorithm for a
servers, privacy breach can still be incurred when information (e.g., shuffled model in federated learning to make user’s data indistinguish-
model weights) is shared between servers. It is possible that some curi- able with random noise [113].
ous nodes or servers can extract private information through the training Additionally, in encryption level, Phong et al. used an additively ho-
process. In this subsection, we will introduce various privacy attacks and momorphic encryption in cryptography area in collaborative deep learn-
privacy-preserving algorithms. ing and built a privacy-preserving enhanced system [114]. When com-
municating with the central server, the information is well encrypted so
that no information is leaked to the curious-but-honest server and ac-
5.2.1. Attacks
curacy is kept intact. Elgabli et al. proposed A-FADMM [115], which is
In edge federated learning, the information exchange between nodes
based on analog transmissions and the alternating direction method of
and server basically contains weights that are computed by nodes using
multipliers. This method can hide each local model’s update trajectory
local private data and the updated weights that are aggregated by the
from any eavesdropper, which protects privacy of each node.
central server. Therefore, in order to attack the privacy of some nodes,
Some hybrid methods are also proposed in this edge federated learn-
the process of the adversary is to retrieve the private dataset information
ing. Truex et al. proposed a hybrid method to implement differential
from uploaded weights (curious server) or aggregated weights (curious
privacy and secure multiparty computation at the same time [116].
node) [100]. Thus, this problem is equivalent to retrieving the data from
This method reduces the noise injection into communication with the
weight update. In general, there are two types of privacy attacks in fed-
central server to increase the performance and still maintain a high
erated learning area.
privacy level. Hao et al. integrated additively homomorphic encryp-
• Membership inference attack [101]. This kind of attack is to deter- tion [114] with differential privacy [117]. Their method provides a
mine whether a data record is contained in a node’s training dataset. stronger protection to prevent privacy leakage in the scenario that mul-
When the dataset is sensitive, this attack may leak a lot of useful in- tiple nodes or central servers are colluded. They also proposed another
formation. method by adding encryption-level and differential privacy protection
• Data inference attack. This attack aims to retrieve the training data in federated learning [118]. This method supports large-scale federated
or a class of training data from the information that node provides. learning applications.

There is some existing work for both attack types, and we list some 5.2.3. Future directions
below. Much work has been done in addressing the privacy issues in ma-
Nasr et al. proposed a white-box inference attack in federated learn- chine learning or federated learning, in particular, how to solve infor-
ing [102] and provided a comprehensive privacy analysis of deep learn- mation leakage when transmitting gradients and weights between nodes
ing models. Truex et al. proposed a feasible black-box membership and the central server. Little work has focused on privacy issues in edge
inference attack in federated learning [103]. Zhu et al. proposed a federated learning. The privacy issue, however, may become more se-
method called deep leakage to retrieve training data from public shared vere in edge computing because edge nodes/servers may leak infor-
gradients on both computer vision and natural language processing mation regarding data, usage, location to malicious users [119,120].
tasks [104]. Their method is based on minimizing the loss between the How to preserve privacy in edge federated learning by considering the
dummy gradients computed by the attack training data and real gra- specifics of edge computing is a new direction worth exploration.
dients computed by the true training data. Experiments show a very
large leakage rate for four different datasets. This shows that a curious 6. Migration and scheduling
server can easily retrieve the training data by the gradients that node
uploads. Hitaj et al. proposed an information leakage method [105] us- Migration and scheduling are two major low-level supports in edge
ing generative adversarial networks [106] in collaborative deep learn- computing [121,122]. In this section, we will compare their differences
ing. Although this needs a separate neural network to retrieve the train- between edge computing and edge federated learning.
ing data, it has an excellent performance in information leakage even
with privacy-preserving algorithms. Wang et al. use a similar GAN-based 6.1. Migration
method called Multi-task GAN in federated learning [107] to precisely
recover the private data from a specific client which causes user-level In edge federated learning, the migration problem arises when the
privacy leakage. edge node moves between edge servers. Because we know that in edge

9
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

computing, edge nodes are always connected to the geographically near- incentive mechanism can help invite more participants while maxi-
est edge server to get low latency and high bandwidth, it is possible mizing the usage of the resources.
for the edge node to travel from one edge server to another while in
the process of federated learning training. In this scenario, the new con- We list some solutions based on these four directions below.
nected edge server does not have a copy of the federated learning model Participant Selection
and thus a migration between edge servers must be implemented. From The idea of participant selection is based on the mechanism of feder-
Section 4, we know that a mid-size model can be around hundreds of ated learning. In federated learning, some nodes are randomly selected
megabyte level, so it takes time to migrate the model between edge as participants to perform the computation using their private local data
servers, especially when we adopt split learning [123,124] to offload for one iteration, among which some may have enough computational
most of the computation on edge servers. Therefore, a collaborative and resources and high network bandwidth, and some may have limited re-
efficient migration policy is necessary in edge federated learning. sources. The one with the slowest computational speed and upload speed
There is a lot of research about migration in edge comput- decides the total training time for this iteration. Therefore, it is very use-
ing [121,125–127]. However, in edge federated learning area, research ful to choose the participating nodes in a smarter way. Nishio et al. pro-
about migration is still at an early stage. Although it is possible to let the posed a novel federated learning protocol FedCS to mitigate this prob-
under-layer infrastructure handle the migration process between differ- lem [128]. The main part of FedCS is a client selection protocol. They
ent edge server, it is too heavy to migrate the model in system level. initialize the whole framework by requesting the resource information
Because the federated learning training has some extra features such as of all nodes. Then in order to select the clients, they solve an optimiza-
large model size, flexible node selection, we should specifically optimize tion problem by the computational resource and previous training time
the migration process for edge federated learning itself. information. This method is efficient in client selection but may affect
We believe the following strategies may help to further improve the the performance because of the non-i.i.d. distributed data. Yoshida et al.
migration-based edge federated learning performance. later proposed an enhanced framework called Hybrid-FL [129]. Similar
to their previous work, they also request the resource information of all
• A naive way is simply to keep a copy in all the adjacent edge servers. nodes at first. However, they choose to perform a client and data selec-
This method is fairly effective and achieves very low latency. How- tion at the same time instead of just selecting the clients to decrease the
ever, this process has a huge communication and storage cost. influence of the data distribution. Then they upload the selected data
• We can also split the network into several parts and broadcast part of and update the model. Yang et al. developed a formal framework to
the model to other edge servers, this can reduce the communication analyze different scheduling policies’ convergence performance [130].
cost in the migration process and stay a low storage cost. Through their analysis, using proportional fair scheduling policy to se-
• It is efficient to predict the movement path of the edge nodes using lect clients performs better than randomly scheduling and round robin.
a separate model and cache the model to the predicted edge servers Resource Optimization
for a better migration experience. Most of the proposed methods for resource allocation transform this
• If the migration cost is too high, we can update the scheduling policy problem to a resource optimization problem, that is, given constraints
by abandoning the computation results from this node and reuse it about the edge node computation resources and network limit, the goal
after finishing the migration process. is to find the most efficient way to implement the edge federated learn-
ing. For example, Dinh et al. proposed FEDL and solved an optimiza-
6.2. Scheduling tion problem to minimize energy and time consumption [131]. Li et al.
proposed q-FedAvg optimization objective for fair resource allocation
Scheduling, or resource allocation is an important problem in edge in edge federated learning [132]. Zeng et al. proposed energy-efficient
computing. Due to the node and server heterogeneity, the data, com- strategies for bandwidth allocation and scheduling [133]. Neely et al.
putation, memory, and network resources vary a lot among different proposed a scheduling control of heterogeneous network for resource
devices. In federated learning, a synchronous iteration requires all par- allocation [134]. Similar algorithms are also proposed in [135–137].
ticipating nodes finish their computation and upload computational re- Apart from optimization, Zou et al. considered about using game the-
sults to the central server before the server continues to perform the ory for resource allocation [138]. They proposed an evolutionary game
aggregation and model update. Therefore, the training speed is limited approach to dynamically schedule the computing resources and reach
by the node with slowest computational resources and network band- an evolutionary equilibrium. Recently there are some work about using
width. In order to get a better training speed, it is better to schedule the reinforcement learning for scheduling the resources. Nguyen et al. pro-
computational resource in an efficient way. posed a deep reinforcement learning based method [139]. They use a
neural network to decide the scheduling policy and update the network
6.2.1. Current solutions by the corresponding rewards. Zhan et al. also proposed a deep rein-
There is much previous work about scheduling and resource alloca- forcement learning based method [140] to get a near-optimal solution
tion in edge federated learning area. In summary, most of the solutions for the optimization problem without knowledge about the networks.
follow the following four directions. Asynchronous training
Asynchronous training fits well for resource allocation problems. It
• Participant selection. In federated learning, the central server ran- is because in asynchronous training, the central server does not have to
domly selects some nodes to perform the computation. Therefore, it wait for all the participating nodes to finish their computation before
helps efficiency to select the participated nodes in a smarter way. updating the global model. In this scenario, nodes can take their time
• Resource optimization. In edge computing, because of the hetero- to train on their private local data even if they lack computational re-
geneity of nodes, the computational and network resources in each sources or suffer the network delay. Chen et al. proposed ASO-fed, an
device are different. We can optimize the resource allocation by let- asynchronous online edge federated learning framework [141]. Central
ting nodes with more computational power compute more. server will take streaming of model updates from different edge nodes
• Asynchronous training. Most of the current edge federated research because of the node heterogeneity and update the model accordingly in
focus on synchronous training, but asynchronous training can signif- an exponential moving average way with non-i.i.d. and imbalanced set-
icantly improve the efficiency in a heterogeneous environment. ting. Lu et al. combined differential privacy and asynchronous federated
• Incentive Mechanism. Some of the researchers focus on the incentive learning together to reach both privacy guarantee and better resource
compensation in federated learning because the node must consume allocation [142]. Chen et al. used a temporally weighted aggregation
their computational resources for collaborative work. An efficient method in the setting of asynchronous federated learning in order to

10
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

make use of the previously trained local models [143]. This helps the References
convergence speed and accuracy performance. Chen et al. proposed a
vertical asynchronous federated learning method VAFL by using a per- [1] M. Satyanarayanan, The emergence of edge computing, Computer 50 (1) (2017)
30–39, doi:10.1109/MC.2017.9.
turbed local embedding [144], which improves the data privacy and [2] W.Z. Khan, E. Ahmed, S. Hakak, I. Yaqoob, A. Ahmed, Edge computing: a survey,
communication efficiency. Future Gener. Comput. Syst. 97 (2019) 219–235.
Incentive Mechanism [3] T. Li, A.K. Sahu, A. Talwalkar, V. Smith, Federated learning: challenges, methods,
and future directions, IEEE Signal Process. Mag. 37 (3) (2020) 50–60.
Incentive mechanism addresses how to reward the participating edge [4] B. Liu, B. Yan, Y. Zhou, Y. Yang, Y. Zhang, Experiments of federated learning for
nodes according to their computational resources and personal data so Covid-19 chest x-ray images, arXiv:2007.05592 (2020).
that they are willing to contribute their computational power for a col- [5] M.J. Sheller, G.A. Reina, B. Edwards, J. Martin, S. Bakas, Multi-institutional deep
learning modeling without sharing patient data: a feasibility study on brain tu-
laborative federated training. A practical incentive mechanism must be
mor segmentation, in: International MICCAI Brainlesion Workshop, Springer, 2018,
fair for both participated nodes and edge servers. Kang et al. introduced pp. 92–104.
a contract theory based incentive design [12]. They use a contract model [6] M.J. Sheller, B. Edwards, G.A. Reina, J. Martin, S. Pati, A. Kotrotsou, M. Milchenko,
W. Xu, D. Marcus, R.R. Colen, et al., Federated learning in medicine: facilitating
to define the data quality of edge nodes and give more reward to data
multi-institutional collaborations without sharing patient data, Sci. Rep. 10 (1)
owners who have high quality data. However, they only consider the (2020) 1–12.
price of data, but no price of computational resources. Feng et al. com- [7] J. Xu, B.S. Glicksberg, C. Su, P. Walker, J. Bian, F. Wang, Federated learning for
bined rewards for providing data and computation resources in one pric- healthcare informatics, J. Healthc. Inform. Res. (2020) 1–19.
[8] Y. Chen, X. Qin, J. Wang, C. Yu, W. Gao, Fedhealth: a federated transfer learning
ing model [145]. They use a Stackelberg game model to evaluate the framework for wearable healthcare, IEEE Intell. Syst. (2020).
value of edge nodes. Khan et al. also use a similar game model [146]. [9] M.G. Arivazhagan, V. Aggarwal, A.K. Singh, S. Choudhary, Federated learning with
personalization layers, arXiv:1912.00818 (2019).
6.2.2. Future directions [10] Y.M. Saputra, D.T. Hoang, D.N. Nguyen, E. Dutkiewicz, M.D. Mueck, S. Srikan-
teswara, Energy demand prediction with federated learning for electric vehicle
We summarized three future directions for scheduling in edge fed- networks, in: 2019 IEEE Global Communications Conference (GLOBECOM), IEEE,
erated learning. First, current algorithms for scheduling and resource 2019, pp. 1–6.
allocation always try to minimize the training time. However, in this [11] D. Ye, R. Yu, M. Pan, Z. Han, Federated learning in vehicular edge com-
puting: aselective model aggregation approach, IEEE Access 8 (2020) 23920–
setting, the central server may not select nodes with limited computa- 23935.
tional resource or unstable network because of the long waiting time. [12] J. Kang, Z. Xiong, D. Niyato, H. Yu, Y.-C. Liang, D.I. Kim, Incentive design for
The data on those nodes will not be used in the model training, which efficient federated learning in mobile networks: a contract theory approach, in:
2019 IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS), IEEE,
results in a biased model training. To mitigate this problem, we may
2019, pp. 1–5.
group the nodes with similar training time together and integrate their [13] A. Imteaj, M.H. Amini, Distributed sensing using smart end-user devices: path-
weights before sending them to the server in batches. This way, all data way to federated learning for autonomous IoT, in: 2019 International Confer-
ence on Computational Science and Computational Intelligence (CSCI), IEEE, 2019,
can be used for training and the training time can be reduced as well.
pp. 1156–1161.
Second, in asynchronous training, most of the previous work focuses [14] T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong, D. Ramage, F. Beau-
on using numerical experiments to show the performance. However, fays, Applied federated learning: improving google keyboard query suggestions,
there is still a lack of theoretical study for asynchronous training. Math- arXiv:1812.02903 (2018).
[15] S. Ramaswamy, R. Mathews, K. Rao, F. Beaufays, Federated learning for emoji
ematical analysis and comparisons between asynchronous training and prediction in a mobile keyboard, arXiv:1906.04329 (2019).
synchronous training are needed in edge federated learning. Third, the [16] F. Hartmann, S. Suh, A. Komarzewski, T.D. Smith, I. Segall, Federated learning for
incentive mechanisms in edge federated learning seem to be a less stud- ranking browser history suggestions, arXiv:1911.11807 (2019).
[17] K. Hong, D. Lillethun, U. Ramachandran, B. Ottenwälder, B. Koldehofe, Mobile
ied topic. It is promising to explore how to incentivize the participation fog: a programming model for large-scale applications on the internet of things, in:
of nodes with high-quality data and abundant computational resources. Proceedings of the Second ACM SIGCOMM Workshop on Mobile Cloud Computing,
2013, pp. 15–20.
7. Conclusion [18] N.K. Giang, M. Blackstock, R. Lea, V.C. Leung, Developing IoT applications in the
fog: a distributed dataflow approach, in: 2015 5th International Conference on the
Internet of Things (IOT), IEEE, 2015, pp. 155–162.
In this article, we carefully investigate edge federated learning, [19] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kid-
which is a paradigm to implement federated learning on edge comput- don, J. Konečnỳ, S. Mazzocchi, H.B. McMahan, et al., Towards federated learning
at scale: System design, arXiv:1902.01046 (2019).
ing environments. The development of edge federated learning is still
[20] P. Kairouz, H.B. McMahan, B. Avent, A. Bellet, M. Bennis, A.N. Bhagoji,
at an early stage, and there is not much research in this area. We sum- K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R.G.L. D’Oliveira, S.E. Rouay-
marize the research problems and methods respectively in applications, heb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P.B. Gibbons,
M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi,
development tools, communication efficiency, security, privacy, migra-
T. Javidi, G. Joshi, M. Khodak, J. Konecný, A. Korolova, F. Koushanfar, S. Koyejo,
tion and scheduling as well as providing some insights of the future T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova,
directions and open problems in edge federated learning. With the fast H. Qi, D. Ramage, R. Raskar, D. Song, W. Song, S.U. Stich, Z. Sun, A.T. Suresh,
advancement of both edge computing and federated learning, more and F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F.X. Yu, H. Yu,
S. Zhao, Advances and open problems in federated learning, CoRR (2019).
more collaborative training methods for edge federated learning are de- abs/1912.04977
veloped for better user experience and privacy protection. We will need [21] B. McMahan, E. Moore, D. Ramage, S. Hampson, B.A. y Arcas, Communication-ef-
more efforts on solving those open problems in edge federated learning. ficient learning of deep networks from decentralized data, in: Artificial Intelligence
and Statistics, PMLR, 2017, pp. 1273–1282.
[22] P. Kairouz, H.B. McMahan, B. Avent, A. Bellet, M. Bennis, A.N. Bhagoji, K.
Declaration of Competing Interest Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al., Advances and open prob-
lems in federated learning, arXiv:1912.04977 (2019).
The authors declare that they have no known competing financial [23] S. Wang, T. Tuor, T. Salonidis, K.K. Leung, C. Makaya, T. He, K. Chan, When
edge meets learning: adaptive control for resource-constrained distributed machine
interests or personal relationships that could have appeared to influence learning, in: IEEE INFOCOM 2018-IEEE Conference on Computer Communications,
the work reported in this paper. IEEE, 2018, pp. 63–71.
[24] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convo-
Acknowledgments lutional neural networks, Commun. ACM 60 (6) (2017) 84–90.
[25] J. Ba, R. Caruana, Do deep nets really need to be deep? Adv. Neural Inf. Process.
Syst. 27 (2014) 2654–2662.
This project was supported in part by US National Science Founda- [26] F. Li, B. Zhang, B. Liu, Ternary weight networks, arXiv:1605.04711 (2016).
tion grant CNS-1816399. This work was also supported in part by the [27] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image
recognition, arXiv:1409.1556 (2014).
Commonwealth Cyber Initiative, an investment in the advancement of
[28] M.M. Amiri, D. Gunduz, S.R. Kulkarni, H.V. Poor, Federated learning with quan-
cyber R&D, innovation and workforce development. For more informa- tized global model updates, arXiv:2006.10672 (2020).
tion about CCI, visit cyberinitiative.org. [29] B. Recht, C. Re, S. Wright, F. Niu, Hogwild: a lock-free approach to parallelizing

11
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

stochastic gradient descent, in: J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, [61] C. Leng, H. Li, S. Zhu, R. Jin, Extremely low bit neural network: squeeze the last
K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 24, bit out with ADMM, arXiv:1707.09870 (2017).
Curran Associates, Inc., 2011, pp. 693–701. [62] M. Courbariaux, Y. Bengio, J.-P. David, Binaryconnect: training deep neural net-
[30] J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M. Ranzato, works with binary weights during propagations, in: Advances in neural information
A. Senior, P. Tucker, K. Yang, A.Y. Ng, Large scale distributed deep networks, in: processing systems, 2015, pp. 3123–3131.
NIPS, 2012, pp. 1–2. [63] C. Zhu, S. Han, H. Mao, W.J. Dally, Trained ternary quantization, arXiv:1612.01064
[31] T.M. Chilimbi, Y. Suzue, J. Apacible, K. Kalyanaraman, Project adam: building an (2016).
efficient and scalable deep learning training system, in: OSDI, 2014, pp. 1–2. [64] J. Frankle, M. Carbin, The lottery ticket hypothesis: finding sparse, trainable neural
[32] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, networks, arXiv:1803.03635 (2018).
Mxnet: a flexible and efficient machine learning library for heterogeneous dis- [65] S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for effi-
tributed systems, CoRR (2015). abs/1512.01274 cient neural network, Adv. Neural Inf. Process. Syst. 28 (2015) 1135–1143.
[33] Y. Zhang, J. Duchi, M.I. Jordan, M.J. Wainwright, Information-theoretic lower [66] S. Han, H. Mao, W.J. Dally, Deep compression: compressing deep neural net-
bounds for distributed statistical estimation with communication constraints, Adv. works with pruning, trained quantization and Huffman coding, arXiv:1510.00149
Neural Inf. Process. Syst. 26 (2013) 2328–2336. (2015b).
[34] S.U. Stich, Local SGD converges fast and communicates little, in: International Con- [67] W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural
ference on Learning Representations, 2018. networks, Adv. Neural Inf. Process. Syst. 29 (2016) 2074–2082.
[35] F. Seide, H. Fu, J. Droppo, G. Li, D. Yu, 1-bit stochastic gradient descent and appli- [68] H. Li, A. Kadav, I. Durdanovic, H. Samet, H.P. Graf, Pruning filters for efficient
cation to data-parallel distributed training of speech DNNS, in: Interspeech 2014, convnets, arXiv:1608.08710 (2016).
2014, pp. 1–2. [69] P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz, Pruning convolutional neural
[36] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, H. Li, Terngrad: ternary networks for resource efficient inference, arXiv:1611.06440 (2016).
gradients to reduce communication in distributed deep learning, CoRR (2017). [70] N. Lee, T. Ajanthan, P.H. Torr, Snip: Single-shot network pruning based on connec-
abs/1705.07878 tion sensitivity, arXiv:1810.02340 (2018).
[37] H. Tang, C. Yu, X. Lian, T. Zhang, J. Liu, Doublesqueeze: parallel stochastic gra- [71] T. Lin, S.U. Stich, L. Barba, D. Dmitriev, M. Jaggi, Dynamic model pruning with
dient descent with double-pass error-compensated compression, in: International feedback, arXiv:2006.07253 (2020).
Conference on Machine Learning, PMLR, 2019, pp. 6155–6165. [72] Y. Jiang, S. Wang, B.J. Ko, W.-H. Lee, L. Tassiulas, Model pruning enables efficient
[38] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, Y. Zou, Dorefa-net: Training low bitwidth federated learning on edge devices, arXiv:1909.12326 (2019).
convolutional neural networks with low bitwidth gradients, arXiv:1606.06160 [73] Z. Chai, Y. Chen, L. Zhao, Y. Cheng, H. Rangwala, Fedat: a communication-
(2016). efficient federated learning method with asynchronous tiers under non-IID data,
[39] J. Sun, T. Chen, G. Giannakis, Z. Yang, Communication-efficient distributed learn- arXiv:2010.05958 (2020).
ing via lazily aggregated quantized gradients, Adv. Neural Inf. Process. Syst. 32 [74] F. Sattler, S. Wiedemann, K.-R. Müller, W. Samek, Robust and communication-ef-
(2019) 3370–3380. ficient federated learning from non-IID data, IEEE Trans. Neural Netw. Learn.Syst.
[40] D. Alistarh, D. Grubic, J. Li, R. Tomioka, M. Vojnovic, Qsgd: communication-effi- (2019).
cient SGD via gradient quantization and encoding, Adv. Neural Inf. Process. Syst. [75] E. Diao, J. Ding, V. Tarokh, Heterofl: Computation and communication efficient
30 (2017) 1709–1720. federated learning for heterogeneous clients, arXiv:2010.01264 (2020).
[41] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, R. Pedarsani, Fedpaq: a com- [76] M. Ribero, H. Vikalo, Communication-efficient federated learning via optimal client
munication-efficient federated learning method with periodic averaging and quan- sampling, arXiv:2007.15197 (2020).
tization, in: International Conference on Artificial Intelligence and Statistics, PMLR, [77] E. Rizk, S. Vlaski, A.H. Sayed, Federated learning under importance sampling,
2020, pp. 2021–2031. arXiv:2012.07383 (2020).
[42] J. Wu, W. Huang, J. Huang, T. Zhang, Error compensated quantized SGD and its [78] C. Thapa, M.A.P. Chamikara, S. Camtepe, Splitfed: when federated learning meets
applications to large-scale distributed optimization, arXiv:1806.08054 (2018). split learning, arXiv:2004.12088 (2020).
[43] R. Johnson, T. Zhang, Accelerating stochastic gradient descent using pre- [79] L. Lamport, R. Shostak, M. Pease, The byzantine generals problem, ACM Trans.
dictive variance reduction, Adv. Neural Inf. Process. Syst. 26 (2013) 315– Program. Lang. Syst. 4 (3) (1982) 382–401, doi:10.1145/357172.357176.
323. [80] P. Blanchard, E.M. El Mhamdi, R. Guerraoui, J. Stainer, Machine learning with ad-
[44] V. Gandikota, D. Kane, R.K. Maity, A. Mazumdar, vqsgd: vector quantized stochas- versaries: byzantine tolerant gradient descent, in: I. Guyon, U.V. Luxburg, S. Ben-
tic gradient descent, arXiv:1911.07971 (2019). gio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural
[45] A. Brutzkus, A. Globerson, E. Malach, S. Shalev-Shwartz, SGD learns over-param- Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 119–129.
eterized networks that provably generalize on linearly separable data, in: Interna- [81] C. Xie, O. Koyejo, I. Gupta, Generalized byzantine-tolerant SGD, CoRR (2018).
tional Conference on Learning Representations, 2018, pp. 1–2. abs/1802.10116
[46] N. Strom, Scalable distributed DNN training using commodity GPUcloud comput- [82] Q. Xia, Z. Tao, Z. Hao, Q. Li, Faba: an algorithm for fast aggregation against byzan-
ing, in: INTERSPEECH, 2015, pp. 1–2. tine attacks in distributed neural networks, in: Proceedings of the Twenty-Eighth
[47] Y. Lin, S. Han, H. Mao, Y. Wang, W.J. Dally, 5: reducing the communication band- International Joint Conference on Artificial Intelligence, IJCAI-19, International
width for distributed training, CoRR (2017). abs/1712.01887 Joint Conferences on Artificial Intelligence Organization, 2019, pp. 4824–4830,
[48] R. Garg, R. Khandekar, Gradient descent with sparsification: an iterative algo- doi:10.24963/ijcai.2019/670.
rithm for sparse recovery with restricted isometry property, in: Proceedings of [83] V. Tolpegin, S. Truex, M.E. Gursoy, L. Liu, Data poisoning attacks against federated
the 26th International Conference On Machine Learning, ICML 2009, 2009, p. 43, learning systems, in: L. Chen, N. Li, K. Liang, S. Schneider (Eds.), Computer Security
doi:10.1145/1553374.1553417. – ESORICS 2020, Springer International Publishing, Cham, 2020, pp. 480–501.
[49] N. Dryden, T. Moon, S.A. Jacobs, B.V. Essen, Communication quantization for data– [84] B. Biggio, B. Nelson, P. Laskov, Poisoning attacks against support vector machines,
parallel training of deep neural networks, in: 2016 2nd Workshop on Machine in: Proceedings of the 29th International Coference on International Conference on
Learning in HPC Environments (MLHPC), 2016, pp. 1–8. Machine Learning, in: ICML’12, Omnipress, Madison, WI, USA, 2012, p. 14671474.
[50] A.F. Aji, K. Heafield, Sparse communication for distributed gradient descent, CoRR [85] A. Shafahi, W.R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, T. Gold-
(2017). abs/1704.05021 stein, Poison frogs! targeted clean-label poisoning attacks on neural networks,
[51] Z. Tao, Q. Li, esgd: communication efficient distributed deep learning on the edge, in: Proceedings of the 32nd International Conference on Neural Information Pro-
in: USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18), USENIX cessing Systems, in: NIPS’18, Curran Associates Inc., Red Hook, NY, USA, 2018,
Association, Boston, MA, 2018, pp. 1–2. p. 61066116.
[52] C. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, K. Gopalakrishnan, Adacomp: [86] L. Huang, A.D. Joseph, B. Nelson, B.I. Rubinstein, J.D. Tygar, Adversarial machine
adaptive residual gradient compression for data-parallel distributed training, CoRR learning, in: Proceedings of the 4th ACM Workshop on Security and Artificial Intel-
(2017). abs/1712.02679 ligence, in: AISec ’11, Association for Computing Machinery, New York, NY, USA,
[53] J. Wangni, J. Wang, J. Liu, T. Zhang, Gradient sparsification for communication-ef- 2011, p. 4358, doi:10.1145/2046684.2046692.
ficient distributed optimization, in: S. Bengio, H. Wallach, H. Larochelle, K. Grau- [87] T. Gu, B. Dolan-Gavitt, S. Garg, Badnets: identifying vulnerabilities in the machine
man, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing learning model supply chain, 2019.
Systems 31, Curran Associates, Inc., 2018, pp. 1299–1309. [88] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, V. Shmatikov, How to backdoor fed-
[54] S. Caldas, J. Konečny, H.B. McMahan, A. Talwalkar, Expanding the reach of feder- erated learning, in: S. Chiappa, R. Calandra (Eds.), Proceedings of the Twenty
ated learning by reducing client resource requirements, arXiv:1812.07210 (2018). Third International Conference on Artificial Intelligence and Statistics, Pro-
[55] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural ceedings of Machine Learning Research, 108, PMLR, Online, 2020, pp. 2938–
networks: training neural networks with low precision weights and activations, J. 2948.
Mach. Learn. Res. 18 (1) (2017) 6869–6898. [89] A.N. Bhagoji, S. Chakraborty, P. Mittal, S. Calo, Analyzing federated learning
[56] Z. Lin, M. Courbariaux, R. Memisevic, Y. Bengio, Neural networks with few multi- through an adversarial lens, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings
plications, arXiv:1510.03009 (2015). of the 36th International Conference on Machine Learning, Proceedings of Machine
[57] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, Xnor-net: imagenet classification Learning Research, 97, PMLR, Long Beach, California, USA, 2019, pp. 634–643.
using binary convolutional neural networks, in: European conference on computer [90] D. Yin, Y. Chen, R. Kannan, P. Bartlett, Byzantine-robust distributed learning: to-
vision, Springer, 2016, pp. 525–542. wards optimal statistical rates, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th
[58] X. Lin, C. Zhao, W. Pan, Towards accurate binary convolutional neural network, International Conference on Machine Learning, Proceedings of Machine Learn-
in: Advances in neural information processing systems, 2017, pp. 345–353. ing Research, 80, PMLR, Stockholmsmssan, Stockholm Sweden, 2018, pp. 5650–
[59] L. Hou, Q. Yao, J.T. Kwok, Loss-aware binarization of deep networks, 5659.
arXiv:1611.01600 (2016). [91] M. Fang, X. Cao, J. Jia, N.Z. Gong, Local model poisoning attacks to byzantine-ro-
[60] L. Hou, J.T. Kwok, Loss-aware weight quantization of deep networks, bust federated learning, in: S. Capkun, F. Roesner (Eds.), 29th USENIX Security
arXiv:1802.08635 (2018).

12
Q. Xia, W. Ye, Z. Tao et al. High-Confidence Computing 1 (2021) 100008

Symposium, USENIX Security 2020, August 12-14, 2020, USENIX Association, [120] S. Yi, Z. Qin, Q. Li, Security and privacy issues of fog computing: A survey, in: in In-
2020, pp. 1605–1622. ternational Conference on Wireless Algorithms, Systems and Applications (WASA,
[92] G. Damaskinos, E.M. El Mhamdi, R. Guerraoui, R. Patra, M. Taziki, Asynchronous 2015, pp. 685–695, doi:10.1007/978-3-319-21837-3_67.
Byzantine machine learning (the case of SGD), in: J. Dy, A. Krause (Eds.), Pro- [121] S. Wang, J. Xu, N. Zhang, Y. Liu, A survey on service migration in mobile edge com-
ceedings of the 35th International Conference on Machine Learning, Proceedings puting, IEEE Access 6 (2018) 23511–23528, doi:10.1109/ACCESS.2018.2828102.
of Machine Learning Research, 80, PMLR, Stockholmsmssan, Stockholm Sweden, [122] Z. Tao, Q. Xia, Z. Hao, C. Li, L. Ma, S. Yi, Q. Li, A survey of virtual ma-
2018, pp. 1145–1154. chine management in edge computing, Proc. IEEE 107 (8) (2019) 1482–1499,
[93] Y. Chen, L. Su, J. Xu, Distributed statistical machine learning in adversarial settings: doi:10.1109/JPROC.2019.2927919.
byzantine gradient descent, Proc. ACM Meas. Anal. Comput. Syst. 1 (2) (2017), [123] C. Thapa, M.A.P. Chamikara, S. Camtepe, Splitfed: when federated learning meets
doi:10.1145/3154503. split learning, 2020,
[94] D. Alistarh, Z. Allen-Zhu, J. Li, Byzantine stochastic gradient descent, CoRR (2018). [124] P. Vepakomma, O. Gupta, T. Swedish, R. Raskar, Split learning for health: dis-
abs/1803.08917 tributed deep learning without sharing raw patient data, 2018.
[95] Q. Xia, Z. Tao, Q. Li, Defenses against byzantine attacks in distributed deep neural [125] C. Dupont, R. Giaffreda, L. Capra, Edge computing in IoT context: horizontal and
networks, IEEE Trans. Netw. Sci.Eng. (2020), doi:10.1109/TNSE.2020.3035112. vertical linux container migration, in: 2017 Global Internet of Things Summit
1–1 (GIoTS), 2017, pp. 1–4, doi:10.1109/GIOTS.2017.8016218.
[96] A. Ghosh, J. Hong, D. Yin, K. Ramchandran, Robust federated learning in a hetero- [126] M. Chen, W. Li, G. Fortino, Y. Hao, L. Hu, I. Humar, A dynamic service migra-
geneous environment, CoRR (2019). abs/1906.06629 tion mechanism in edge cognitive computing, ACM Trans. Internet Technol. 19 (2)
[97] L. Muoz-Gonzlez, K.T. Co, E.C. Lupu, Byzantine-robust federated machine learning (2019), doi:10.1145/3239565.
through adaptive model averaging, 2019. [127] T.G. Rodrigues, K. Suto, H. Nishiyama, N. Kato, K. Temma, Cloudlets activa-
[98] S. Prakash, A.S. Avestimehr, Mitigating byzantine attacks in federated learning, tion scheme for scalable mobile edge computing with transmission power control
2020, and virtual machine migration, IEEE Trans. Comput. 67 (9) (2018) 1287–1300,
[99] J. Kang, Z. Xiong, D. Niyato, Y. Zou, Y. Zhang, M. Guizani, Reliable feder- doi:10.1109/TC.2018.2818144.
ated learning for mobile networks, IEEE Wirel. Commun. 27 (2) (2020) 72–80, [128] T. Nishio, R. Yonetani, Client selection for federated learning with heterogeneous
doi:10.1109/MWC.001.1900119. resources in mobile edge, in: ICC 2019 – 2019 IEEE International Conference on
[100] L. Melis, C. Song, E. De Cristofaro, V. Shmatikov, Exploiting unintended feature Communications (ICC), 2019, pp. 1–7, doi:10.1109/ICC.2019.8761315.
leakage in collaborative learning, in: 2019 IEEE Symposium on Security and Pri- [129] N. Yoshida, T. Nishio, M. Morikura, K. Yamamoto, R. Yonetani, Hybrid-fl for wire-
vacy (SP), 2019, pp. 691–706, doi:10.1109/SP.2019.00029. less networks: cooperative learning mechanism using non-IID data, in: ICC 2020
[101] R. Shokri, M. Stronati, C. Song, V. Shmatikov, Membership inference attacks against – 2020 IEEE International Conference on Communications (ICC), 2020, pp. 1–7,
machine learning models, in: 2017 IEEE Symposium on Security and Privacy (SP), doi:10.1109/ICC40277.2020.9149323.
2017, pp. 3–18, doi:10.1109/SP.2017.41. [130] H.H. Yang, Z. Liu, T.Q.S. Quek, H.V. Poor, Scheduling policies for federated
[102] M. Nasr, R. Shokri, A. Houmansadr, Comprehensive privacy analysis of deep learn- learning in wireless networks, IEEE Trans. Commun. 68 (1) (2020) 317–333,
ing: passive and active white-box inference attacks against centralized and fed- doi:10.1109/TCOMM.2019.2944169.
erated learning, in: 2019 IEEE Symposium on Security and Privacy (SP), 2019, [131] C.T. Dinh, N.H. Tran, M.N.H. Nguyen, C.S. Hong, W. Bao, A.Y. Zomaya, V. Gramoli,
pp. 739–753, doi:10.1109/SP.2019.00065. Federated learning over wireless networks: convergence analysis and resource al-
[103] S. Truex, L. Liu, M. Gursoy, L. Yu, W. Wei, Demystifying membership inference location, IEEE/ACM Trans. Netw. (2020) 1–12, doi:10.1109/TNET.2020.3035770.
attacks in machine learning as a service, IEEE Trans. Serv. Comput. PP (2019), [132] T. Li, M. Sanjabi, A. Beirami, V. Smith, Fair resource allocation in federated learn-
doi:10.1109/TSC.2019.2897554. 1–1 ing, in: International Conference on Learning Representations, 2020.
[104] L. Zhu, Z. Liu, S. Han, Deep leakage from gradients, in: H. Wallach, H. Larochelle, [133] Q. Zeng, Y. Du, K. Huang, K.K. Leung, Energy-efficient radio resource alloca-
A. Beygelzimer, F. d’ Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Infor- tion for federated edge learning, in: 2020 IEEE International Conference on Com-
mation Processing Systems, 32, Curran Associates, Inc., 2019, pp. 14774–14784. munications Workshops (ICC Workshops), 2020, pp. 1–6, doi:10.1109/ICCWork-
[105] B. Hitaj, G. Ateniese, F. Perez-Cruz, Deep models under the gan: information shops49005.2020.9145118.
leakage from collaborative deep learning, in: Proceedings of the 2017 ACM [134] M.J. Neely, E. Modiano, C. Li, Fairness and optimal stochastic control for
SIGSAC Conference on Computer and Communications Security, in: CCS ’17, heterogeneous networks, IEEE/ACM Trans. Netw. 16 (2) (2008) 396–409,
Association for Computing Machinery, New York, NY, USA, 2017, p. 603618, doi:10.1109/TNET.2007.900405.
doi:10.1145/3133956.3134012. [135] M.S.H. Abad, E. Ozfatura, D. GUndUz, O. Ercetin, Hierarchical federated learning
[106] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, across heterogeneous cellular networks, in: ICASSP 2020 - 2020 IEEE International
A. Courville, Y. Bengio, Generative adversarial nets, in: Z. Ghahramani, M. Welling, Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8866–
C. Cortes, N. Lawrence, K.Q. Weinberger (Eds.), Advances in Neural Information 8870, doi:10.1109/ICASSP40776.2020.9054634.
Processing Systems, 27, Curran Associates, Inc., 2014, pp. 2672–2680. [136] Y. Sun, S. Zhou, D. Gndz, Energy-aware analog aggregation for federated learn-
[107] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, H. Qi, Beyond inferring class rep- ing with redundant data, in: ICC 2020 – 2020 IEEE International Conference on
resentatives: user-level privacy leakage from federated learning, in: IEEE INFO- Communications (ICC), 2020, pp. 1–7, doi:10.1109/ICC40277.2020.9148853.
COM 2019 - IEEE Conference on Computer Communications, 2019, pp. 2512–2520, [137] G. Zhu, Y. Wang, K. Huang, Broadband analog aggregation for low-latency fed-
doi:10.1109/INFOCOM.2019.8737416. erated edge learning, IEEE Trans. Wirel. Commun. 19 (1) (2020) 491–506,
[108] C. Dwork, Differential privacy: a survey of results, in: M. Agrawal, D. Du, Z. Duan, doi:10.1109/TWC.2019.2946245.
A. Li (Eds.), Theory and Applications of Models of Computation, Springer, Berlin, [138] Y. Zou, S. Feng, D. Niyato, Y. Jiao, S. Gong, W. Cheng, Mobile device train-
Heidelberg, 2008, pp. 1–19. ing strategies in federated learning: an evolutionary game approach, in: 2019
[109] O. Goldreich, Secure multi-party computation, Manuscript. Preliminary Ver- International Conference on Internet of Things (iThings) and IEEE Green Com-
sion(1999). puting and Communications (GreenCom) and IEEE Cyber, Physical and Social
[110] K. Wei, J. Li, M. Ding, C. Ma, H.H. Yang, F. Farokhi, S. Jin, T.Q.S. Quek, Computing (CPSCom) and IEEE Smart Data (SmartData), 2019, pp. 874–879,
H. Vincent Poor, Federated learning with differential privacy: algorithms doi:10.1109/iThings/GreenCom/CPSCom/SmartData.2019.00157.
and performance analysis, Trans. Inf. Forensic Secur. 15 (2020) 34543469, [139] H.T. Nguyen, N.C. Luong, J. Zhao, C. Yuen, D. Niyato, Resource allocation in mo-
doi:10.1109/TIFS.2020.2988575. bility-aware federated learning networks: a deep reinforcement learning approach,
[111] R.C. Geyer, T. Klein, M. Nabi, Differentially private federated learning: a client CoRR (2019). abs/1910.09172
level perspective, 2018. [140] Y. Zhan, P. Li, S. Guo, Experience-driven computational resource allocation of
[112] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, R. Rogers, Protection against re- federated learning by deep reinforcement learning, in: 2020 IEEE International
construction and its applications in private federated learning, 2019. Parallel and Distributed Processing Symposium (IPDPS), 2020, pp. 234–243,
[113] B. Ghazi, R. Pagh, A. Velingker, Scalable and differentially private distributed ag- doi:10.1109/IPDPS47924.2020.00033.
gregation in the shuffled model, CoRR (2019). abs/1906.08320 [141] Y. Chen, Y. Ning, H. Rangwala, Asynchronous online federated learning for edge
[114] L.T. Phong, Y. Aono, T. Hayashi, L. Wang, S. Moriai, Privacy-preserving deep learn- devices, CoRR (2019). abs/1911.02134
ing via additively homomorphic encryption, IEEE Trans. Inf. Forensics Secur. 13 [142] Y. Lu, X. Huang, Y. Dai, S. Maharjan, Y. Zhang, Differentially private asynchronous
(5) (2018) 1333–1345, doi:10.1109/TIFS.2017.2787987. federated learning for mobile edge computing in urban informatics, IEEE Trans.
[115] A. Elgabli, J. Park, C.B. Issaid, M. Bennis, Harnessing wireless channels for scalable Ind. Inform. 16 (3) (2020) 2134–2143, doi:10.1109/TII.2019.2942179.
and privacy-preserving federated learning, 2020. [143] Y. Chen, X. Sun, Y. Jin, Communication-efficient federated deep learning
[116] S. Truex, N. Baracaldo, A. Anwar, T. Steinke, H. Ludwig, R. Zhang, Y. Zhou, with layerwise asynchronous model update and temporally weighted aggre-
A hybrid approach to privacy-preserving federated learning, in: Proceedings of gation, IEEE Trans. Neural Netw. Learn.Syst. 31 (10) (2020) 4229–4238,
the 12th ACM Workshop on Artificial Intelligence and Security, in: AISec’19, doi:10.1109/TNNLS.2019.2953131.
Association for Computing Machinery, New York, NY, USA, 2019, p. 111, [144] T. Chen, X. Jin, Y. Sun, W. Yin, Vafl: a method of vertical asynchronous federated
doi:10.1145/3338501.3357370. learning, 2020.
[117] M. Hao, H. Li, X. Luo, G. Xu, H. Yang, S. Liu, Efficient and privacy-enhanced feder- [145] S. Feng, D. Niyato, P. Wang, D.I. Kim, Y. Liang, Joint service pricing and
ated learning for industrial artificial intelligence, IEEE Trans. Ind. Inform. 16 (10) cooperative relay communication for federated learning, in: 2019 Interna-
(2019) 6532–6542. tional Conference on Internet of Things (iThings) and IEEE Green Comput-
[118] M. Hao, H. Li, G. Xu, S. Liu, H. Yang, Towards efficient and privacy-preserving ing and Communications (GreenCom) and IEEE Cyber, Physical and Social
federated deep learning, in: ICC 2019 - 2019 IEEE International Conference on Computing (CPSCom) and IEEE Smart Data (SmartData), 2019, pp. 815–820,
Communications (ICC), 2019, pp. 1–6, doi:10.1109/ICC.2019.8761267. doi:10.1109/iThings/GreenCom/CPSCom/SmartData.2019.00148.
[119] S. Yi, C. Li, Q. Li, A survey of fog computing: concepts, applications and is- [146] L.U. Khan, S.R. Pandey, N.H. Tran, W. Saad, Z. Han, M.N.H. Nguyen,
sues, in: Proceedings of the 2015 Workshop on Mobile Big Data, in: Mobidata C.S. Hong, Federated learning for edge networks: resource optimization
’15, Association for Computing Machinery, New York, NY, USA, 2015, p. 3742, and incentive mechanism, IEEE Commun. Mag. 58 (10) (2020) 88–93,
doi:10.1145/2757384.2757397. doi:10.1109/MCOM.001.1900649.
13

You might also like