
Received May 2, 2020, accepted May 15, 2020, date of publication May 19, 2020, date of current version June 9, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2995583

CoEdge: Exploiting the Edge-Cloud Collaboration for Faster Deep Learning
LIANGYAN HU 1, GUODONG SUN 1,3, (Member, IEEE), AND YANLONG REN 2
1 School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
2 Network Information Management and Service Center, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
3 Engineering Research Center for Forestry-oriented Intelligent Information Processing, National Forestry and Grassland Administration, Beijing 100083, China

Corresponding author: Guodong Sun ([email protected])


This work was supported in part by NSF of China under Grant 61300180, and in part by the Fundamental Research Funds for the Central
Universities of China under Grant 2017ZY20.

ABSTRACT Recently a great number of ubiquitous Internet-of-Things (IoT) devices have been connecting
to the Internet. With the massive amount of IoT data, cloud-based intelligent applications have sprung up
to support accurate monitoring and decision-making. In practice, however, the intrinsic transport bottleneck
of the Internet severely handicaps the real-time performance of the cloud-based intelligence depending on
IoT data. In the past few years, researchers have paid attention to the computing paradigm of edge-cloud
collaboration; they offload the computing tasks from the cloud to the edge environment, in order to avoid
transmitting much data through the Internet to the cloud. To date, it is still an open issue to effectively
allocate the deep learning task (i.e., deep neural network computation) over the edge-cloud system to shorten
the response time of application. In this paper, we propose the latency-minimum allocation (LMA) problem,
which aims to allocate the deep neural network (DNN) layers over the edge-cloud environment such that the total
latency of processing this DNN is minimized. First, we formalize the LMA problem in general form,
prove its NP-hardness, and present an insightful characteristic of feasible DNN layer allocations. Second,
we design an approximate algorithm, called CoEdge, which can handle the LMA problem in polynomial
time. By exploiting the communication and computation resources of the edge, CoEdge greedily selects
the beneficial edge nodes and allocates the DNN layers to the selected nodes by a recursion-based policy.
Finally, we conduct extensive simulation experiments with realistic setups, and the experimental results show
the efficacy of CoEdge in reducing the deep learning latency compared to two state-of-the-art schemes.

INDEX TERMS Edge computing, deep learning, latency, allocation of DNN layers.

I. INTRODUCTION

In the past few years, the popularity of emerging Internet-of-Things (IoT) applications has been generating a huge amount of real-time data. Based on the IoT data, the IoT end-users can make accurate monitoring and effective decisions with their intelligent utility deployed in the cloud. For such cloud-based intelligent IoT applications, however, the data generated by the ubiquitous IoT devices has to be delivered, through the Internet, to the cloud for further processing. In industry, cloud computing has played an indispensable role in executing large-scale and intensive-computing tasks, such as deep learning, and thus, the intelligence of the IoT applications usually resides in the cloud [1], [2]. Under such a cloud-centric paradigm, the data delivered from IoT devices to the cloud will inevitably be impacted by the inherent transmission bottleneck of the Internet [3]–[6]. Accordingly, the response time or delay of the application will be significantly extended, especially in the intelligent IoT applications where massive real-time IoT data struggles to traverse the Internet, swarming into the cloud. As a matter of fact, the non-negligible transmission delay of the Internet has become the essential obstacle to expediting deep learning-based IoT applications [7]–[9].

One promising way of addressing the above issue is edge computing. It is a new computing paradigm complementary to cloud computing, in which the IoT data processing, or at least part of it, is moved from the cloud to the edge of the Internet, so that the computing task can be executed in the proximity of the data sources, rather than being done at the remote, hard-to-reach cloud [7], [10], [11].

The associate editor coordinating the review of this manuscript and approving it for publication was Jjun Cheng.


Many recent studies show the ability of edge computing in reducing the system delay. In [12], for instance, the researchers offload the computation from the cloud to a wearable cognitive assistance; they reduce the response time of the application by 80 ms to 200 ms, and moreover, their design consumes 30%-40% less energy than the cloud-based strategy. Edge computing has been envisioned to be a unifying platform that can engender a new breed of emerging services and support a variety of new intensive-computing real-time applications.

Recently researchers have attempted to process the deep learning task in the edge, mainly in order to reduce the system response delay. In general, the execution of a deep learning task is a layer-by-layer process of the deep neural network (DNN) model, which typically consists of a set of consecutive perceptron layers [13]. The raw IoT data is fed into the first DNN layer, and the inference or classification result is finally yielded out of the last DNN layer. In the ongoing process of the DNN, the intermediate results are transferred through the DNN layers and can be quickly scaled down in terms of size. For edge-based intelligent IoT systems, the whole DNN model or a part of its layers can be deployed on the edge node, which is far closer to the IoT devices than the cloud server is. Sometimes, the IoT devices are resource-rich and can then by themselves serve as a kind of edge device. The edge node only needs to upload to the cloud some size-reduced intermediate data or even the final result, instead of the raw IoT data of large size. The edge participation can therefore shorten the response time of the DNN-based IoT system by reducing the Internet traffic. In order to further reduce the system response time, however, it is still an open issue to design an effective paradigm that can take full advantage of the edge-cloud resources.

In this paper we design and implement a new allocation scheme, called CoEdge, which attempts to allocate the DNN layers over the edge and the cloud such that the deep learning delay can be further reduced in comparison with the schemes that only involve a single edge node. With a greedy criterion, CoEdge iteratively finds out the best edge node and forms a set of edge nodes, over which the DNN is then allocated with a recursion-based policy. Essentially, CoEdge unlocks the potential of the edge in deep learning, by exploiting the high-speed connections and the computing capacity embraced by the edge environment.

The remainder of this paper is organized as follows. Section II briefly introduces major works related to ours. Section III and Section IV give the models and the detailed design of CoEdge. Section V evaluates our design and compares it with two state-of-the-art schemes via extensive experiments. Finally, Section VI concludes this paper.

II. RELATED WORK

In this section, we first introduce the task offloading in edge computing and then the approaches to allocating DNN learning or inference tasks to the edge-cloud system.

A. OFFLOADING COMPUTE TASK ONTO EDGE

In cloud computing, transferring data to the cloud server often needs a considerable amount of time, which will surely weaken the quality-of-service of time-sensitive applications. To address this issue, a promising approach is to offload a part of the computing tasks onto the resource-rich edge, alleviating the deficiencies in network congestion, latency and energy consumption [7]. Existing works on edge-cloud offloading focus on how to determine an effective and efficient task offloading policy [14]–[18]. To improve the resource efficiency, reference [19] designs a resource-efficient edge computing framework, which enables intelligent IoT users to flexibly offload tasks across the edge device, the nearby assistant devices, and the adjacent edge cloud. In [20], the authors exploit the possibility that massive mobile devices collaboratively execute the on-edge task, aiming to optimize the energy efficiency of those devices. In [21], the authors consider a scenario where flying unmanned aerial vehicles serve as edge nodes; they propose a resource scheduling approach that can offload tasks in dynamic environments by leveraging a learning-based algorithm. How to offload the computing task onto the edge has recently attracted more and more attention in industry and academia. We have observed that sustainable edge computing can bring latency reduction and that there is a great chance to deploy deep learning in the edge.

B. ALLOCATING DNN TO EDGE-CLOUD

High-accuracy deep learning usually depends on a lot of training data, and as a result, the demand for bandwidth rises dramatically in cloud-based intelligent applications. To accelerate the model training or inference of a DNN, a popular means nowadays is to transfer a part of or even all the DNN layers to the edge environment, i.e., to make them closer to the data sources.

Many hardware platforms, including GPUs and customized accelerators such as FPGAs and ASICs, have emerged. In [22], DjiNN is designed, which is an open infrastructure for DNN with large-scale GPU servers to achieve high throughput and low network occupancy. There have been several approaches to accelerating machine learning [23]–[25]. FPGA-based accelerators have more flexibility than ASICs in accelerating large-scale CNN models. A deeply pipelined multi-FPGA architecture is designed in [26], which can achieve lower latency by using a dynamic programming method to map a DNN onto several pipelined FPGAs. For this approach based on a fixed pipeline, however, the FPGA devices are assumed to be homogeneous in computing capacity and each of them is required to undertake at least one DNN layer.

The demand for memory, computing and energy capacity has gradually grown to be a critical bottleneck for the allocation of DNN layers to edge devices. How to deploy a DNN into the edge environment has been studied extensively in industry and academia.


A software accelerator for DNN execution in edge networks is presented in [27], in which the resource demands are reduced by decomposing DNN layers into various unit-blocks that can be effectively processed by heterogeneous processors. In [28], a method called AAIoT is proposed to allocate a DNN to a set of devices that form a multi-level IoT system; AAIoT balances the computation and transmission time to minimize the overall response time. Reference [29] schedules DNN layers in the edge computing environment: it allocates as many deep learning tasks to the edge devices as possible, while satisfying a given constraint on response time. For a given deep learning task, however, the authors only employ a single edge device to share some DNN layers. Particularly, they do not consider the potential of collaboration across the edge devices, and thus, their allocation policy cannot well suit deep learning tasks with a strict requirement on system latency.

Different from the above works, the CoEdge proposed in this study is designed for a general edge-cloud environment where the edge devices could be heterogeneous in computing and communication resources. Additionally, in order to further shorten the deep learning latency, CoEdge can elaborately exploit the high-speed links and the strong computing capacity contained in the edge network.

III. MODELS AND PROBLEM

In this section, we first introduce the system model in use and the corresponding notations. Second, we describe the proposed LMA problem, reformulate it as a constrained delay-minimization problem, and analyze its intractability; particularly, we theoretically reveal an important feature of all optimal LMA solutions, which helps reduce the search space of our algorithm (to be given in Section IV) and thus improves its efficiency.

A. MODELS AND PRELIMINARIES

Typically, an edge-cloud consists of a cloud server and an edge network. We use c to represent the cloud server, and E = {e1, e2, ..., en} to represent the set of edge nodes. The edge nodes are interconnected via D2D links, 5G networks, or other kinds of high-speed networks. We assume that the transmission within the edge is much faster than that between any edge node e ∈ E and the cloud node c. If the deep learning task is completely processed in the edge, we still report the final result to the cloud node.

A deep neural network (DNN) can be represented with a hierarchical structure L = ⟨Li | 1 ≤ i ≤ m⟩, where each layer Li is a set of neurons. Because of the intrinsic directionality of processing a DNN, L is a partially ordered set (or simply a sequence) of perceptron layers. In this paper, we define a partial-order relation (denoted by "≺" and "⪯") on a set U: if ui ≺ uj, then ui precedes uj in U, and if ui ⪯ uj, we have ui ≺ uj or i = j. If there exists no uk ∈ U such that ui ≺ uk and uk ≺ uj, we say that ui and uj are adjacent or consecutive in U. For two layers Li and Lj of L, "Li ≺ Lj" means that Li should be executed earlier in the deep learning task than Lj. Fig. 1 shows a typical DNN which involves five layers. The calculation of layer Li (1 ≤ i < m) yields the intermediate data, which will immediately be passed on to layer Li+1 for further processing. We denote by θi^in the size of the data input to Li, and by θi^out the size of the data output from Li. Often, layer L1 is called the input layer, which receives the input data to be classified; Lm is called the output layer, which returns the final classification result; and the other layers are called hidden layers because they are not connected with the external world.

Fig. 1 shows an example where a pixelated image of a cat is fed to a 5-layer CNN and, after two stages (i.e., the feature extraction and the feature classification), a cat is recognized. Besides the number of layers, the DNN also specifies a neuron set for each layer and the connection pattern between two adjacent layers. Both factors collectively determine the computational cost at each layer and the volume of intermediate results to be transferred between two consecutive layers. Given the raw input data (often, an image), the latency of DNN-based learning or inference has two parts: the total computation time of all the layers, and the total time of transferring all the intermediate results between any two consecutive layers plus transferring the final result to the cloud server.

FIGURE 1. A typical example of layer-by-layer DNN structure.

In this paper, we call a set Li a segment of DNN L if Li is empty or a subset of L that involves a single layer or multiple consecutive layers. A segment of L is an element-consecutive subsequence derived from L. In Fig. 1, for instance, ⟨L1⟩ and ⟨L2, L3⟩ are two segments of that 5-layer CNN, but neither ⟨L1, L3⟩ nor ⟨L2, L1⟩ is a segment. We say that a partially-ordered set P(L) is a partition of L if P(L) is a set of segments of L and the following three properties hold:
1) ∪_{i=1}^{|P(L)|} Li = L, i.e., a partition of L should exactly cover all the layers of L;
2) Li ∩ Lj = ∅ for any two distinct segments Li and Lj of P(L); and
3) Li ≺ Lj if the last layer of Li and the first layer of Lj satisfy the relation ≺ in L.
By the above definition, we can also simply consider a partition as a sequence of segments. Especially, ⟨∅, L⟩ and ⟨L, ∅⟩ are both partitions of L, distinct from each other.
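To make the segment and partition notions concrete, the following small sketch (ours, not code from the paper; empty segments are omitted for simplicity) enumerates the contiguous segments of an m-layer DNN and all of its partitions into consecutive segments, each segment written as a (first, last) pair of layer indices.

    def segments(m):
        # all non-empty segments: consecutive layer ranges L_i..L_j of an m-layer DNN
        return [(i, j) for i in range(1, m + 1) for j in range(i, m + 1)]

    def partitions(start, m):
        # all partitions of layers L_start..L_m into ordered, disjoint, consecutive segments
        if start > m:
            return [[]]
        result = []
        for end in range(start, m + 1):          # the first segment is L_start..L_end
            result += [[(start, end)] + rest for rest in partitions(end + 1, m)]
        return result

    # For the 5-layer CNN of Fig. 1: m(m+1)/2 = 15 segments and 2^(m-1) = 16 partitions.
    print(len(segments(5)), len(partitions(1, 5)))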


B. PROBLEM DESCRIPTION AND ANALYSIS

In the edge-cloud system we consider, for a given deep learning task, some edge node of E (denoted by er) receives the raw input data, and the cloud node c receives the final result of learning or inference. So, both er and c will always participate in the deep learning task, although neither of them has to undertake any DNN layers. In this paper, the LMA problem is described as follows: for a given deep learning task defined with L, to determine a subset E′ ⊆ E \ {er} as well as a partition P(L) of L, and to allocate each segment of P(L) to exactly one node of E′ ∪ {er, c}, while minimizing the total latency of processing L. For simplicity, we hereafter use Src to represent E′ ∪ {er, c} for any E′ ⊆ E \ {er}. Next, we formally present the LMA problem to be addressed and then prove its NP-hardness.

The allocation of L over E ∪ {c} can be expressed with a tuple A = (P(L), Src, ϕ), where the function ϕ : P(L) → Src is the allocation policy. If ϕ(Lj) = ei, node ei will perform all the layers in segment Lj; we denote by d_i^j the time cost for ei to perform Lj. In practice, d_i^j can be figured out in advance, according to the available computing resource of ei and the total computing load incurred by Lj. Noticeably, we can reasonably neglect the time consumed to transfer the intermediate data between any two adjacent layers of segment Lj, because such on-node data transfers are carried out by ei within its local space of memory access. Consider two adjacent segments Lj and Lj+1 of partition P(L), and assume that ϕ(Lj) = ei and ϕ(Lj+1) = ek. If ei and ek are two different nodes of Src, a cross-edge data transfer is needed: node ei needs to transfer to node ek the intermediate data output from the last layer of Lj. We use d_{i→k}^j to represent the delay of such a cross-edge data transfer from node ei to node ek.

We denote by δ(A) the latency that the allocation A can achieve on processing a deep learning task. Thus, the LMA problem can be formally written as

    min : δ(A) = Σ_{Lj ∈ P(L)} ( d_i^j + d_{i→k}^j ),  with  e_i = ϕ(Lj)  and  e_k = ϕ(Lj+1),    (1)

where d_{i→k}^j equals zero if ei is the cloud node under the current allocation policy ϕ. Given a partition P(L) and Src, Fig. 2 shows three feasible allocation policies. In the solution for LMA, some nodes of Src might not process any segments but only relay the intermediate data between nodes. In Fig. 2, for example, e3 and er are just relaying nodes under allocation policies ϕ2 and ϕ3, respectively.

FIGURE 2. Examples of three feasible allocation policies with a given pair of P(L) and Src, where the blue arrows form the flow of intermediate data.
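For a fixed allocation A = (P(L), Src, ϕ), the objective in (1) can be evaluated directly. The following is our own illustrative sketch (the function and parameter names are ours), assuming the compute times d_i^j and the cross-edge transfer times d_{i→k}^j are already known, and following the convention that the cloud holds the last (possibly empty) segment so that the transfer of the final result to c is simply the last cross-node transfer:

    def total_latency(partition, phi, d_comp, d_xfer, cloud):
        # partition: segments in layer order; phi[seg]: node executing seg
        # d_comp(node, seg): d_i^j ; d_xfer(a, b, seg): d_{i->k}^j for a cross-edge transfer
        delta = 0.0
        for idx, seg in enumerate(partition):
            node = phi[seg]
            delta += d_comp(node, seg)
            if idx + 1 < len(partition):
                nxt = phi[partition[idx + 1]]
                if nxt != node and node != cloud:   # on-node transfers and c's output cost nothing
                    delta += d_xfer(node, nxt, seg)
        return delta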


Theorem 1: Generally, the LMA problem is NP-hard.

Proof: To show the NP-hardness of the LMA problem, we here consider a special instance of LMA, in which (1) the nodes that can process DNN L are given in advance as Src = {e1, e2, ..., ek} (k ≥ 2), (2) the time cost d_{p→q}^j is extremely low and can be neglected for any pair of nodes ep, eq ∈ Src, and (3) each node ei consumes τ_i^j time in performing layer Lj of L. For this special case, we can treat Src as a set of parallel machines and L as the set of jobs with precedence. This special problem instance is thus transformed into determining a partition of L and a policy that allocates each resulting segment to a machine of Src while minimizing the total processing time. Clearly, this problem can be modeled as the Workload Partition Problem (WPP) with the precedence constraint of jobs. The WPP has been proven NP-hard [30]–[32]. To solve the problem proposed in this paper, we need to determine not only the partition of L and the allocation policy ϕ but also the subset Src of E ∪ {c}, and thus, the LMA problem is generally at least NP-hard. □

Theorem 2: Suppose that there is an allocation of partition P(L) over Src (|Src| ≥ 2) and the corresponding allocation policy is ϕ. Consider three consecutive nonempty segments Li−1, Li, and Li+1 of P(L) such that ϕ(Li−1) = ϕ(Li+1) = ep and ϕ(Li) = eq, where ep and eq are two distinct nodes of Src. We declare that the allocation policy ϕ cannot lead to an optimal allocation of P(L) over Src.

Proof: We prove this theorem by construction. The left part of Fig. 3 shows the allocation policy ϕ on the given three consecutive segments. We next reshape ϕ under two mutually exclusive cases:
• Case I: ep is at least as powerful as eq in terms of computing capacity, and
• Case II: eq is more powerful than ep in terms of computing capacity.

FIGURE 3. Illustration of reshaping the allocation policy ϕ into ϕ′ or ϕ″ such that either ϕ′ or ϕ″ is better than ϕ. Here, the thin arrowed lines present the allocation of DNN layers to edge nodes, the blue solid arrowed lines profile the necessary intermediate data flow, and the dashed arrowed lines, the possible data flow.

If Case I holds, we can derive a new allocation policy ϕ′ from ϕ by only re-allocating segment Li to ep, as shown in the middle part of Fig. 3. We first evaluate by (2) the time cost of ϕ on processing these three consecutive segments:

    δ(ϕ) = d_p^{i−1} + d_{p→q}^{i−1} + d_q^{i} + d_{q→p}^{i} + d_p^{i+1} + { d_{p→x}^{i+1}  or  d_{p→q}^{i+1} + d_{q→y}^{i+1} }.    (2)

Under allocation policy ϕ, nodes ep and eq process segments Li−1 and Li, respectively, and then ep takes over segment Li+1; hence ϕ results in two cross-edge intermediate data transfers, which consume d_{p→q}^{i−1} and d_{q→p}^{i} time, respectively. The last term on the right-hand side of (2) indicates that ϕ has two alternatives for transferring the intermediate data output from the last layer of Li+1: one is to transfer the data directly from ep to some node ex processing Li+2, and the other is to transfer it through eq to some node ey processing Li+2. Similarly, the time cost of ϕ′ on processing the three segments can be expressed as

    δ(ϕ′) = d_p^{i−1} + d_p^{i} + d_p^{i+1} + { d_{p→x}^{i+1}  or  d_{p→q}^{i+1} + d_{q→y}^{i+1} }.    (3)

In Case I, ep is identical to or faster than eq in terms of processing speed, i.e., d_p^i ≤ d_q^i. Comparing (3) with (2), whichever alternative is taken for the last transfer we have δ(ϕ) − δ(ϕ′) = d_{p→q}^{i−1} + d_{q→p}^{i} + (d_q^i − d_p^i) > 0, so δ(ϕ′) < δ(ϕ) always holds, regardless of how the allocation policy ϕ relays the intermediate data output from the last layer of Li+1.

If Case II holds, i.e., eq is faster than ep in processing the segments, we can create a new allocation policy ϕ″ that re-allocates segments Li−1 and Li+1 to eq (so that all three segments are processed by eq), which is the sole difference from ϕ. Also, we can easily prove δ(ϕ″) < δ(ϕ). In conclusion, we can always reshape the ϕ given in this theorem into an allocation policy with shorter latency. □

Comparing the three allocation policies shown in Fig. 3, we can see that both Li−1 and Li+1 are allocated by ϕ to node ep but are "cut in" by Li, which is allocated to another node eq. Theorem 2 implies that if an allocation policy allocates non-consecutive segments to some node, such a cut-in policy is not optimal. Not limited to the case of three consecutive segments, Theorem 2 can easily be extended to apply to any three nonempty segments Li ≺ Lj ≺ Lk where Li and Lk are both allocated to one node but Lj is allocated to another node. The heuristics offered by Theorem 2 let designers safely bypass those cut-in allocations, which narrows down the solution space (or the feasible region) and then helps speed up their algorithms.

IV. DESIGNS

Recall that for the deep learning considered in this paper, we let c and er represent the cloud node and the edge node that receives the input data, respectively. If we only allocate L to er and c, as [29] does, we can easily figure out an optimal solution for our problem in polynomial time. Besides these two nodes, actually, any other nodes of E could be included in the optimal solution for the general case. To address the LMA problem, we design an approximate algorithm, called CoEdge. With a greedy policy, CoEdge attempts to iteratively insert a new edge node into Src, which is initialized only with {er, c}, until the iteratively updated Src can no longer assure a shorter time to perform the deep learning task. Algorithm 1 shows how the proposed CoEdge works. Before diving into the algorithm description, we first introduce the information to be input to CoEdge and how to initialize these inputs.

Algorithm 1 CoEdge
Input: E ∪ {c}, L, W, and C
Result: the allocation of L over a subset Src of E ∪ {c}
1   Src ← ⟨er, c⟩
2   Determine an optimal layer allocation A over Src and then obtain the minimum total latency (i.e., δmin)
3   while E ∪ {c} − Src ≠ ∅ do
4       foreach e ∈ E ∪ {c} − Src do
5           Determine a partition P(L) as well as an allocation policy ϕ for Src ⊕ e to minimize the total latency
6       end
7       Select e∗ from the nodes examined in the above for-loop such that the corresponding partition P∗(L) and allocation policy ϕ∗ on Src ⊕ e∗ achieve the minimum latency (denoted by δ∗)
8       if δ∗ < δmin then
9           δmin ← δ∗
10          Src ← Src ⊕ e∗
11          Update A with Src, P∗(L), and ϕ∗
12      else
13          return A
14      end
15  end
16  return A
VOLUME 8, 2020 100537


L. Hu et al.: CoEdge: Exploiting the Edge-Cloud Collaboration for Faster Deep Learning

A. INITIALIZATION OF THE CoEdge INPUTS

The input information needed by CoEdge consists of four data sets: E ∪ {c}, L, W, and C. The latter two profile the bandwidth resource and the computing capacity of the edge-cloud environment. More specifically, W is a matrix of |E ∪ {c}| rows and |E ∪ {c}| columns; each element ωij measures the best available bandwidth between two distinct nodes ei and ej. Since we neglect the time consumed in on-node data transfers, we let ωii = ∞. Input C is also a matrix; its element cij stores the computational time that node ei of E ∪ {c} needs to pay if it is assigned to perform the j-th possible segment Lj of L. Next we introduce how to determine the matrices W and C before the CoEdge algorithm can go ahead.

We assume that the edge network is connected and that each edge node connects with the cloud node through the Internet. So, there exists at least one communication path between any two edge nodes or between any edge node and the cloud node. We employ the Floyd algorithm to calculate the best available bandwidth between any two distinct nodes of E ∪ {c} and then obtain the input matrix W. Considering symmetric communication bandwidth, the example given in Fig. 4 shows the determination of matrix W.

FIGURE 4. Determination of W, where the weight of arc (i, j) in the right part represents the best available bandwidth between i and j.
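Because the "best available bandwidth" of a multi-hop path is a bottleneck (widest-path) quantity, the Floyd-style pass replaces the usual (min, +) update with a (max, min) update. The sketch below is our own illustration, assuming the direct-link bandwidths are given in a nested dict bw (0 meaning no direct link):

    def best_bandwidth(nodes, bw):
        # bw[a][b]: direct-link bandwidth between a and b (0 if there is no direct link)
        # Returns W with W[a][b] = best available (bottleneck) bandwidth over any path.
        W = {a: {b: (float('inf') if a == b else bw[a][b]) for b in nodes} for a in nodes}
        for k in nodes:
            for i in nodes:
                for j in nodes:
                    via_k = min(W[i][k], W[k][j])     # bottleneck of the path i -> k -> j
                    if via_k > W[i][j]:
                        W[i][j] = via_k
        return W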
For a DNN L with m perceptron layers, there are obviously m(m+1)/2 different segments. We set up C as a matrix of |E ∪ {c}| rows and m(m+1)/2 columns, and each element cij stores the total computational time for node ei ∈ E ∪ {c} to perform all the consecutive layers in the j-th segment of L. Asymptotically, we need O(nm²) time to determine matrix C. Since each segment can possibly be included in an optimal allocation, it is necessary for CoEdge to know how fast each node processes every possible segment.
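The paper only requires that the cij values be computable in advance from each node's computing resource and each segment's computing load. Under one simple cost model of our own (segment time = accumulated layer load divided by node capacity, with segments keyed by their (first, last) layer indices for convenience), C can be filled in the stated O(nm²) time:

    def segment_costs(nodes, layer_load, capacity):
        # layer_load[l-1]: computing load of layer L_l (e.g., FLOPs); capacity[node]: FLOP/s
        m = len(layer_load)
        C = {node: {} for node in nodes}
        for node in nodes:
            for i in range(1, m + 1):
                total = 0.0
                for j in range(i, m + 1):        # extend the segment L_i..L_j one layer at a time
                    total += layer_load[j - 1]
                    C[node][(i, j)] = total / capacity[node]
        return C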
B. DESCRIPTION OF CoEdge

After the above four input data sets are prepared, the CoEdge algorithm initializes Src with the partially-ordered set ⟨er, c⟩, in which er is the edge node receiving the input data of DNN L and c is the cloud server. At the very beginning, therefore, only er and c are prepared to share all the DNN layers. By brute-force enumeration, CoEdge can achieve an optimal solution A in linear time for allocating L to ⟨er, c⟩, which yields the shortest latency of processing L, denoted by δmin. CoEdge then enters a greedy iterative procedure (line 3 of Algorithm 1), which tries to invite more edge nodes to collaboratively process DNN L, aiming to update δmin with a shorter latency. In essence, the CoEdge algorithm continually grows the set Src to pursue the acceleration of deep learning. In CoEdge, we regulate Src such that it is a partially-ordered set: for two distinct nodes ei, ej ∈ Src that process segments Li and Lj, respectively, we have ei ≺ ej if Li ≺ Lj holds. Such a partial-order regulation on Src makes the CoEdge algorithm bypass checking the cut-in allocation policies, which will not be optimal solutions according to Theorem 2. Keeping Src partially ordered can thus help CoEdge narrow down the search space in its iterations. When a new node e is examined in each iteration in the hope of reducing the latency, CoEdge always puts it right ahead of the last node (i.e., the cloud server c) of Src. The addition of e into Src is expressed with "Src ⊕ e" in this paper.

After entering the iteration in line 3 of Algorithm 1, CoEdge first checks all the nodes of E that have not yet been inserted into Src, in order to decide which of these nodes can lead to a latency-minimum allocation (see lines 4 to 7). If the addition of e∗ into Src and the corresponding partition P∗(L) achieve the minimum latency so far, we then update δmin and the allocation while doing Src ⊕ e∗. More specifically, in line 5 CoEdge determines the latency-minimum partition over Src ⊕ e in a recursive way. We next resort to Fig. 5 to explain how to recursively obtain the best partition over a given Src.

FIGURE 5. Illustration of allocating segments to a given Src by recursively partitioning L.

As shown in Fig. 5, we suppose that there are k nodes in the sequence Src (including the cloud server c) and that an m-layer DNN L is to be allocated to Src while minimizing the total latency. Recall that er receives the raw input data and thus should be included in the sequence Src. To obtain a latency-minimum partition over Src, we need only consider two cases: (1) allocating all the DNN layers of L to er, and (2) allocating the layers from L1 to Lp to er while the remaining layers are allocated to the subsequence Src \ er. After evaluating the latencies of these two cases, we can pick out the best allocation. When we determine the allocation of layers Lp+1 to Lm over the subsequence Src \ er, we can again calculate the corresponding shortest latency by recursion. We next denote by Li∼j the set of consecutive layers from Li to Lj, and by δ(Li∼j, S) the latency of some partition of Li∼j over S ⊆ Src, where 1 ≤ i, j ≤ m and "i > j" makes Li∼j an empty set. We use π(S) to represent the first node of S (i.e., the node that precedes all the other nodes of S); then π(Src) = er. For the allocation of Li∼j over the subsequence S, its shortest latency, denoted by δ*(Li∼j, S), can be recursively expressed as

    δ*(Li∼j, S) = min_{i−1≤p<j} { δ(Li∼j, π(S)),   δ(Li∼p, π(S)) + d̃p + δ*(Lp+1∼j, S \ π(S)) },    (4)

where d̃p is the time consumed in transferring the intermediate data from layer Lp to layer Lp+1. In (4), when p = i − 1, we have δ(Li∼p, π(S)) = 0 because no layer is assigned to node π(S). In this case, although π(S) does not process any layers, it still needs to pay d̃p time to relay the intermediate data from layer Li−1 to layer Li. Given a DNN, the size of the intermediate data output from layer Lp (i.e., θp^out) is foreknown. If we allocate Lp and Lp+1 to nodes ei and ej, respectively, we can then evaluate the time cost d̃p as θp^out/ωij, where ωij is the best available bandwidth from ei to ej, stored in matrix W.


For a given sequence Src and an m-layer DNN, we can therefore obtain a latency-minimum allocation by recursively solving δ*(L1∼m, Src) according to (4). At last, CoEdge returns a partition P(L), a nonempty Src, and an allocation policy ϕ; and for any Li ≺ Lj of this partition, we always have ϕ(Li) ≺ ϕ(Lj).

In each iteration, for a given e ∈ E ∪ {c} − Src, CoEdge employs this recursion to allocate all the layers of L across Src ⊕ e. According to (4), CoEdge needs to evaluate δ(Li∼j, π(Src)) for any 1 ≤ i < j ≤ m on the first node of the sequence Src. As analyzed above, there are m(m+1)/2 different segments for an m-layer DNN. In addition, we have created the matrix C before CoEdge enters its greedy iterations, and cij stores the total computational time for node ei to perform all the consecutive layers of segment Lj. We thus know that in the iteration with Src and a given e, we can obtain δ(L1∼m, Src ⊕ e) with a time complexity of O(m²(|Src| + 1)). Furthermore, in each greedy iteration, CoEdge needs O(m²(|Src| + 1) · |E − Src|) time to find e∗ and the corresponding δ∗. We then easily know that for an m-layer DNN and an edge-cloud of size (n + 1), the total time complexity of CoEdge is upper bounded by O(m²n³).

V. EXPERIMENTS

In this section, we conduct simulation experiments with realistic setups to evaluate our design and compare it with two baseline algorithms [26], [29], which are here termed fixedEdge and singleEdge. We use AlexNet [33] and VGGNet-19 [34] in our experiments to do image classification. AlexNet is an eight-layer DNN, including five convolution layers and three fully-connected layers; the first, second, and fifth layers of AlexNet also involve max pooling. VGGNet-19 is a 19-layer DNN, which is divided into five convolutional segments; each convolutional segment of VGGNet-19 is followed by a max pooling layer that is used to reduce the size of the image data. Since our objective is to reduce the deep learning latency by turning to the collaborative edge, we evaluate our algorithm and the two baselines in terms of latency under a variety of experimental cases.

A. EXPERIMENTAL SETUP

In simulation, we set the parameters with realistic setups. The computing capacity of the cloud node is set to 3200 Gflops. We set the edge with four different setups in bandwidth resource and computing capacity; they are given as follows.
1) high-speed edge: the bandwidth of an in-edge link ranges from 500 Mbps to 1000 Mbps;
2) low-speed edge: the bandwidth of an in-edge link ranges from 10 Mbps to 200 Mbps;
3) high-capacity edge: the computing capacity of an edge node ranges from 80 Gflops to 640 Gflops;
4) low-capacity edge: the computing capacity of an edge node ranges from 4 Gflops to 32 Gflops.
All the above setups are advised by [29], [35], [36] on the basis of empirical measurements. We evaluate the proposed CoEdge and the baselines under four different cases. For each experimental case, the computing capacity and the bandwidth are randomly chosen from the corresponding ranges. In all the experiments, we make the edge network connected, although not all pairs of edge nodes are directly connected. Each edge node can communicate with the cloud through the Internet; for a given edge node, its bandwidth to the cloud is set to a random value between 1 Mbps and 10 Mbps. The input images for AlexNet and VGGNet-19 are 227 × 227 pixels (about 1.1794 Mb) and 224 × 224 pixels (about 1.1484 Mb) in size, respectively. The computing load and the reduction ratio of the intermediate results are set to the default values of these two DNN models. Each experimental case is repeated 40 times, each time with a randomly-chosen edge node as the data source (i.e., the receiver of the input images), and the average for that case is reported.
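For concreteness, the stated simulation parameters can be encoded as sampling ranges. The following sketch is only our own illustration of that setup, with hypothetical names, not the authors' simulator:

    import random

    REGIMES = {  # (in-edge bandwidth in Mbps, edge-node capacity in Gflops)
        "high-speed/high-capacity": ((500, 1000), (80, 640)),
        "high-speed/low-capacity":  ((500, 1000), (4, 32)),
        "low-speed/high-capacity":  ((10, 200),   (80, 640)),
        "low-speed/low-capacity":   ((10, 200),   (4, 32)),
    }

    def sample_edge(num_nodes, regime, cloud_gflops=3200):
        bw_range, cap_range = REGIMES[regime]
        capacity = {f"e{i}": random.uniform(*cap_range) for i in range(1, num_nodes + 1)}
        capacity["c"] = cloud_gflops
        # every edge node reaches the cloud over the Internet at 1-10 Mbps
        cloud_bw = {f"e{i}": random.uniform(1, 10) for i in range(1, num_nodes + 1)}
        return capacity, cloud_bw, bw_range     # in-edge links are then drawn from bw_range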


B. RESULTS AND ANALYSIS

Figures 6-9 show how the three schemes perform in the different cases. In Fig. 6, we first examine the inference latency of CoEdge when the edge network is formed with edge devices whose processing capability is higher than 80 Gflops and which support high-speed communications with bandwidth ranging from 500 Mbps to 1000 Mbps. It can be seen that for AlexNet and VGGNet-19, CoEdge always achieves the fastest inference, regardless of what the edge network size is set to be. Additionally, compared with the other two baselines, and especially with fixedEdge, CoEdge remains much more stable under each DNN model: its inference time experiences only a subtle fluctuation as the network size increases. It is worth noticing in Fig. 6 that although VGGNet-19 involves only a little more than twice the layers of AlexNet, the deep inference for VGGNet-19 takes a far longer time than that for AlexNet. For example, the inference latency of CoEdge on AlexNet is only 1.77 ms on average, whereas the latency of CoEdge on VGGNet-19 is higher than 90 ms, and the inference time needed by singleEdge sharply grows up to the order of hundreds of milliseconds. Such an observation reflects that for a complex DNN with a huge amount of computation, it is necessary and feasible to "dissolve" this DNN into the edge-cloud to further improve the inference performance.

FIGURE 6. Variation of inference latency against edge size under the high-speed and high-capacity edge network.

FIGURE 7. Variation of inference latency against edge size under the high-speed and low-capacity edge network.

Fig. 7 compares the three schemes in the cases where each edge node works with a constrained computing capacity that ranges from 4 Gflops to 32 Gflops. Comparing Fig. 7 with Fig. 6, we can find that the computing capacity considerably impacts the deep inference delay. For AlexNet and VGGNet-19, CoEdge outperforms the two baselines. The latency of CoEdge on VGGNet-19 increases up to eight times the latency of CoEdge on AlexNet; a similar increase in latency also happens to the singleEdge scheme. For the fixedEdge scheme, however, the constrained computing capacity considerably increases its latency when it works with VGGNet-19, which is deeper than AlexNet. The comparison shown in Fig. 7 leads to a two-fold indication. First, in the high-speed and low-capacity edge, the singleEdge scheme recruits only one edge node to share the DNN layers and thus avoids the possible latency bottleneck caused by these low-capacity edge nodes; that is why singleEdge performs better than fixedEdge on scheduling the VGGNet-19 model, since fixedEdge always pre-selects a set of edge nodes without considering the actual edge network resources. Second, in comparison with the two baselines, CoEdge always dynamically cherry-picks the "best-fitting" edge nodes and lets them collaboratively perform the DNN task, thereby resulting in far lower latency.

Fig. 8 compares the three schemes in latency under the low-speed and high-capacity edge. The overall performance of CoEdge for both AlexNet and VGGNet-19 is very close to its counterpart shown in Fig. 6. On the contrary, fixedEdge performs a little worse in the low-speed edge than it does in the high-speed edge, which reveals the impact of the reduced edge bandwidth on the deep learning latency. Comparing the results in Fig. 8 and Fig. 9, we can find that when the edge resource is very constrained in terms of bandwidth and computing capacity, the proposed CoEdge scheme is more capable of reducing the deep learning latency, because it well leverages the edge-cloud collaboration.

FIGURE 8. Variation of inference latency against edge size under the low-speed and high-capacity edge network.

FIGURE 9. Variation of inference latency against edge size under the low-speed and low-capacity edge network.

In summary, the above experimental results show that CoEdge is more resource-aware than the two baselines. The singleEdge baseline uses only a single edge device to share the DNN layers with the cloud, and the fixedEdge baseline processes a DNN on an already-configured set of pipelined edge devices. Both baselines allocate DNN layers without considering how to exploit as many available edge resources as possible, which is the essential reason why they perform worse than CoEdge does.

VI. CONCLUSION

In this paper, we have designed and implemented CoEdge, which exploits the communication and computing resources of the edge-cloud computing environment to deploy the deep learning task, in order to minimize the deep learning latency. CoEdge involves effective approaches to determining beneficial nodes, and it can allocate the deep learning layers to these selected nodes in polynomial time, laying a foundation for implementing collaborative processing of the deep learning task. Our extensive simulation experiments with realistic setups also demonstrate that CoEdge outperforms two state-of-the-art schemes in terms of latency. In the future, we will take a further step towards achieving latency-aware allocation of DNN layers in edge-cloud systems with dynamic edge resources in terms of communication and computation.

REFERENCES
[1] H. El-Sayed, S. Sankar, M. Prasad, D. Puthal, A. Gupta, M. Mohanty, and C.-T. Lin, "Edge of things: The big picture on the integration of edge, IoT and the cloud in a distributed computing environment," IEEE Access, vol. 6, pp. 1706–1717, 2017.
[2] H. Song, J. Bai, Y. Yi, J. Wu, and L. Liu, "Artificial intelligence enabled Internet of Things: Network architecture and spectrum access," IEEE Comput. Intell. Mag., vol. 15, no. 1, pp. 44–51, Feb. 2020.
[3] J. Ren, H. Guo, C. Xu, and Y. Zhang, "Serving at the edge: A scalable IoT architecture based on transparent computing," IEEE Netw., vol. 31, no. 5, pp. 96–105, Aug. 2017.
[4] P. G. Lopez, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge-centric computing: Vision and challenges," ACM SIGCOMM Comput. Commun. Rev., vol. 45, no. 5, pp. 37–42, 2015.


[5] K. Wang, H. Yin, W. Quan, and G. Min, "Enabling collaborative edge computing for software defined vehicular networks," IEEE Netw., vol. 32, no. 5, pp. 112–117, Sep. 2018.
[6] R. Liao, H. Wen, J. Wu, F. Pan, A. Xu, Y. Jiang, F. Xie, and M. Cao, "Deep-learning-based physical layer authentication for industrial wireless sensor networks," Sensors, vol. 19, no. 11, p. 2440, 2019.
[7] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, Oct. 2016.
[8] T. X. Tran, A. Hajisami, P. Pandey, and D. Pompili, "Collaborative mobile edge computing in 5G networks: New paradigms, scenarios, and challenges," IEEE Commun. Mag., vol. 55, no. 4, pp. 54–61, Apr. 2017.
[9] F. Xie, H. Wen, J. Wu, S. Chen, W. Hou, and Y. Jiang, "Convolution based feature extraction for edge computing access authentication," IEEE Trans. Netw. Sci. Eng., early access, Dec. 3, 2019, doi: 10.1109/TNSE.2019.2957323.
[10] S. K. Sharma and X. Wang, "Live data analytics with collaborative edge and cloud processing in wireless IoT networks," IEEE Access, vol. 5, pp. 4621–4635, 2017.
[11] X. Lyu, H. Tian, L. Jiang, A. Vinel, S. Maharjan, S. Gjessing, and Y. Zhang, "Selective offloading in mobile edge computing for the green Internet of Things," IEEE Netw., vol. 32, no. 1, pp. 54–60, Jan. 2018.
[12] K. Ha, Z. Chen, W. Hu, W. Richter, P. Pillai, and M. Satyanarayanan, "Towards wearable cognitive assistance," in Proc. 12th Annu. Int. Conf. Mobile Syst., Appl., Services (MobiSys), 2014, pp. 68–81.
[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2017.
[14] E. Cuervo, A. Balasubramanian, D. Cho, A. Wolman, S. Saroiu, R. Chandra, and P. Bahl, "MAUI: Making smartphones last longer with code offload," in Proc. 8th Int. Conf. Mobile Syst., Appl., Services, 2010, pp. 49–62.
[15] M. Satyanarayanan, P. Bahl, R. Caceres, and N. Davies, "The case for VM-based cloudlets in mobile computing," IEEE Pervas. Comput., vol. 8, no. 4, pp. 14–23, Oct. 2009.
[16] M. Gordon, D. Jamshidi, S. Mahlke, Z. Mao, and X. Chen, "COMET: Code offload by migrating execution transparently," in Proc. 10th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2012, pp. 93–106.
[17] B.-G. Chun, S. Ihm, P. Maniatis, M. Naik, and A. Patti, "CloneCloud: Elastic execution between mobile device and cloud," in Proc. 6th Conf. Comput. Syst., 2011, pp. 301–314.
[18] S. Kosta, A. Aucinas, P. Hui, R. Mortier, and X. Zhang, "ThinkAir: Dynamic resource allocation and parallel execution in the cloud for mobile code offloading," in Proc. IEEE INFOCOM, Mar. 2012, pp. 945–953.
[19] X. Chen, Q. Shi, L. Yang, and J. Xu, "ThriftyEdge: Resource-efficient edge computing for intelligent IoT applications," IEEE Netw., vol. 32, no. 1, pp. 61–65, Jan. 2018.
[20] X. Chen, L. Pu, L. Gao, W. Wu, and D. Wu, "Exploiting massive D2D collaboration for energy-efficient mobile edge computing," IEEE Wireless Commun., vol. 24, no. 4, pp. 64–71, Aug. 2017.
[21] X. Cheng, F. Lyu, W. Quan, C. Zhou, H. He, W. Shi, and X. Shen, "Space/aerial-assisted computing offloading for IoT applications: A learning-based approach," IEEE J. Sel. Areas Commun., vol. 37, no. 5, pp. 1117–1129, May 2019.
[22] J. Hauswald, Y. Kang, M. A. Laurenzano, C. Quan, and L. Tang, "DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers," in Proc. ACM/IEEE 42nd Annu. Int. Symp. Comput. Architecture (ISCA), Jun. 2015, pp. 27–40.
[23] T. Chen, Z. Du, N. Sun, W. Jia, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine learning," ACM SIGPLAN Notices, vol. 49, no. 4, pp. 269–284, 2014.
[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, Nov. 2014, pp. 675–678.
[25] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A polyvalent machine learning accelerator," ACM SIGARCH Comput. Archit. News, vol. 43, no. 1, pp. 369–381, 2015.
[26] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, "Energy-efficient CNN implementation on a deeply pipelined FPGA cluster," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), 2016, pp. 326–331.
[27] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, and F. Kawsar, "DeepX: A software accelerator for low-power deep learning inference on mobile devices," in Proc. 15th ACM/IEEE Int. Conf. Inf. Process. Sensor Netw. (IPSN), Apr. 2016.
[28] J. Zhou, Y. Wang, K. Ota, and M. Dong, "AAIoT: Accelerating artificial intelligence in IoT systems," IEEE Wireless Commun. Lett., vol. 8, no. 3, pp. 825–828, Jun. 2019.
[29] H. Li, K. Ota, and M. Dong, "Learning IoT in edge: Deep learning for the Internet of Things with edge computing," IEEE Netw., vol. 32, no. 1, pp. 96–101, Jan. 2018.
[30] A. Janiak, W. Janiak, and M. Lichtenstein, "Resource management in machine scheduling problems: A survey," Decis. Making Manuf. Services, vol. 1, no. 2, pp. 59–89, Oct. 2007.
[31] D. Oron, D. Shabtay, and G. Steiner, "Approximation algorithms for the workload partition problem and applications to scheduling with variable processing times," Eur. J. Oper. Res., vol. 256, no. 2, pp. 384–391, Jan. 2017.
[32] H. Wang and B. Alidaee, "Unrelated parallel machine selection and job scheduling with the objective of minimizing total workload and machine fixed costs," IEEE Trans. Autom. Sci. Eng., vol. 15, no. 4, pp. 1955–1963, Oct. 2018.
[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[34] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[35] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng, "Deep learning with COTS HPC systems," in Proc. Int. Conf. Mach. Learn., 2013, pp. 1337–1345.
[36] X. Zeng, K. Cao, and M. Zhang, "MobileDeepPill: A small-footprint mobile deep learning system for recognizing unconstrained pill images," in Proc. 15th Annu. Int. Conf. Mobile Syst., Appl., Services, 2017, pp. 56–67.

LIANGYAN HU received the B.E. degree in computer science and technology from the School of Information Science and Technology, Beijing Forestry University, Beijing, China, in 2018. She is currently pursuing the master's degree in software engineering with Beijing Forestry University. Her current research interests include edge computing, deep learning, and mobile computing.

GUODONG SUN (Member, IEEE) was a Postdoctoral Researcher with Tsinghua University, China, before joining the faculty of Beijing Forestry University. He was a Visiting Professor of computer science with the University of North Carolina at Charlotte, USA. He is currently an Associate Professor of computer science with the School of Information Science and Technology, Beijing Forestry University, Beijing, China. His research interests include mobile computing, wireless ad-hoc and sensor networks, combinatorial optimization, the Internet of Things, and machine learning. He is a member of the IEEE Computer Society.

YANLONG REN received the B.E. and M.E. degrees from the School of Information Science and Technology, Beijing Forestry University, Beijing, China, in 2013 and 2015, respectively. He is currently a Senior Engineer with the Network Information Management and Service Center, Beijing University of Civil Engineering and Architecture, Beijing. His major efforts are devoted to maintaining the large-scale network infrastructure and the cloud-based data center of his university.
