CoEdge: Exploiting the Edge-Cloud Collaboration for Faster Deep Learning
June 9, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2995583
ABSTRACT Recently a great number of ubiquitous Internet-of-Things (IoT) devices have been connecting to the Internet. With the massive amount of IoT data, cloud-based intelligent applications have sprung up to support accurate monitoring and decision-making. In practice, however, the intrinsic transport bottleneck of the Internet severely handicaps the real-time performance of cloud-based intelligence that depends on IoT data. In the past few years, researchers have paid attention to the computing paradigm of edge-cloud collaboration: they offload computing tasks from the cloud to the edge environment, in order to avoid transmitting large volumes of data through the Internet to the cloud. To date, however, it remains an open issue to effectively allocate the deep learning task (i.e., deep neural network computation) over the edge-cloud system so as to shorten the response time of the application. In this paper, we propose the latency-minimum allocation (LMA) problem, which aims to allocate the deep neural network (DNN) layers over the edge-cloud environment such that the total latency of processing the DNN is minimized. First, we formalize the LMA problem in general form, prove its NP-hardness, and present an insightful characteristic of feasible DNN layer allocations. Second, we design an approximate algorithm, called CoEdge, which can handle the LMA problem in polynomial time. By exploiting the communication and computation resources of the edge, CoEdge greedily selects the beneficial edge nodes and allocates the DNN layers to the selected nodes by a recursion-based policy. Finally, we conduct extensive simulation experiments with realistic setups, and the experimental results show the efficacy of CoEdge in reducing the deep learning latency compared to two state-of-the-art schemes.
INDEX TERMS Edge computing, deep learning, latency, allocation of DNN layers.
…ing the system delay. In [12], for instance, the researchers offload the computation from the cloud to a wearable cognitive assistance system. They reduce the response time of the application by 80 ms to 200 ms; moreover, their approach consumes 30%-40% less energy than the cloud-based strategy. Edge computing has been envisioned as a unifying platform that can engender a new breed of emerging services and support a variety of new computation-intensive real-time applications.

Recently, researchers have attempted to process the deep learning task in the edge, mainly in order to reduce the system response delay. In general, the execution of a deep learning task is a layer-by-layer process over the deep neural network (DNN) model, which typically consists of a set of consecutive perceptron layers [13]. The raw IoT data is fed into the first DNN layer, and the inference or classification result is finally yielded by the last DNN layer. In the ongoing process of the DNN, the intermediate results are transferred through the DNN layers and are quickly scaled down in size. For edge-based intelligent IoT systems, the whole DNN model or a part of its layers can be deployed on the edge node, which is far closer to the IoT devices than the cloud server is. Sometimes, the IoT devices are resource-rich and can then by themselves serve as a kind of edge device. The edge node only needs to upload to the cloud some size-reduced intermediate data or even the final result, instead of the raw IoT data of large size. The edge participation can therefore shorten the response time of a DNN-based IoT system by reducing the Internet traffic. In order to further reduce the system response time, however, it is still an open issue to design an effective paradigm that can take full advantage of the edge-cloud resources.

In this paper we design and implement a new allocation scheme, called CoEdge, which attempts to allocate the DNN layers over the edge and the cloud such that the deep learning delay can be further reduced in comparison with the schemes that only involve a single edge node. With a greedy criterion, CoEdge iteratively finds the best edge node and forms a set of edge nodes, over which the DNN is then allocated with a recursion-based policy. Essentially, CoEdge unlocks the potential of the edge in deep learning, by exploiting the high-speed connections and the computing capacity embraced by the edge environment.

The remainder of this paper is organized as follows. Section II briefly introduces major works related to ours. Section III and Section IV give the models and the detailed design of CoEdge. Section V evaluates our design and compares it with two state-of-the-art schemes via extensive experiments. Finally, Section VI concludes this paper.

II. RELATED WORK
In this section, we first introduce task offloading in edge computing and then the approaches to allocating DNN learning or inference tasks to the edge-cloud system.

A. OFFLOADING COMPUTE TASK ONTO EDGE
In cloud computing, transferring data to the cloud server often needs a considerable amount of time, which will surely weaken the quality-of-service for time-sensitive applications. To address this issue, a promising approach is to offload a part of the computing tasks onto the resource-rich edge, alleviating the deficiencies in network congestion, latency, and energy consumption [7]. Existing works on edge-cloud offloading focus on how to determine an effective and efficient task offloading policy [14]–[18]. To improve the resource efficiency, reference [19] designs a resource-efficient edge computing framework, which enables intelligent IoT users to flexibly offload tasks across the edge device, the nearby assistant devices, and the adjacent edge cloud. In [20], the authors exploit the possibility that massive mobile devices collaboratively execute the on-edge task, aiming to optimize the energy efficiency of those devices. In [21], the authors consider a scenario where flying unmanned aerial vehicles serve as edge nodes; they propose a resource scheduling approach that can offload tasks in dynamic environments by leveraging a learning-based algorithm. How to offload the computing task onto the edge has recently attracted more and more attention in industry and academia. We have observed that sustainable edge computing can bring latency reduction and that there is a great chance to deploy deep learning in the edge.

B. ALLOCATING DNN TO EDGE-CLOUD
High-accuracy deep learning usually depends on a lot of training data, and as a result, the demand for bandwidth rises dramatically in cloud-based intelligent applications. To accelerate the model training or inference of a DNN, a popular means nowadays is to transfer a part of or even all the DNN layers to the edge environment, i.e., making them closer to the data sources.

Many hardware platforms, including GPUs and customized accelerators such as FPGAs and ASICs, have emerged. In [22], DjiNN is designed, which is an open infrastructure for DNN with large-scale GPU servers to achieve high throughput and low network occupancy. There have been several approaches to accelerating machine learning [23]–[25]. FPGA-based accelerators have more flexibility than ASICs in accelerating large-scale CNN models. A deeply pipelined multi-FPGA architecture is designed in [26], which can achieve lower latency by using a dynamic programming method to map a DNN onto several pipelined FPGAs. For this approach based on a fixed pipeline, however, the FPGA devices are assumed to be homogeneous in computing capacity and each of them is required to undertake at least one DNN layer.

The demand for memory, computing and energy capacity has gradually grown into a critical bottleneck for the allocation of DNN to edge devices. How to deploy a DNN into the edge environment has been studied extensively in industry and academia. A software accelerator for DNN execution in the edge network is presented in [27], in which the resource demands are reduced by decomposing DNN layers into various unit-blocks that can be effectively processed by heterogeneous processors.
In [28], a method called AAIoT is proposed to allocate a DNN to a set of devices that form a multi-level IoT system. AAIoT balances the computation and transmission time to minimize the overall response time. Reference [29] schedules DNN layers in an edge computing environment: it allocates as many deep learning tasks to the edge devices as possible, while satisfying a given constraint on response time. For a given deep learning task, however, the authors only employ a single edge device to share some DNN layers. Particularly, they do not consider the potential of collaboration across the edge devices, and thus, their allocation policy cannot well suit a deep learning task with a strict requirement on system latency.

Different from the above works, the CoEdge proposed in this study is designed for a general edge-cloud setting where the edge devices could be heterogeneous in computing and communication resources. Additionally, in order to further shorten the deep learning latency, CoEdge can elaborately exploit the high-speed links and strong computing capacity contained in the edge network.

…the intermediate data, which will immediately be passed on to layer $L_{i+1}$ for further processing. We denote by $\theta_i^{in}$ the size of the data input to $L_i$, and by $\theta_i^{out}$ the size of the data output from $L_i$. Often, layer $L_1$ is called the input layer, which receives the input data to be classified; $L_m$ is called the output layer, which returns the final classification result; and the other layers are called hidden layers because they are not connected with the external world.

Fig. 1 shows an example where a pixelated image of a cat is fed to a 5-layer CNN and, after two stages (i.e., the feature extraction and the feature classification), a cat is recognized. Besides the number of layers, the DNN also specifies a neuron set for each layer and the connection pattern between two adjacent layers. Both factors collectively determine the computational cost at each layer and the volume of intermediate results to be transferred between two consecutive layers. Given the raw input data (often, an image), the latency of DNN-based learning or inference has two parts: the total computation time of all the layers, and the total time of transferring all the intermediate results between any two consecutive layers plus transferring the final result to the cloud server.
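To make this latency decomposition concrete, the following minimal Python sketch computes the end-to-end latency of a given layer-to-node allocation. It is only illustrative: the function and variable names (total_latency, compute_time, out_size, bandwidth) are ours and not part of the paper's formulation, and transfers between layers placed on the same node are treated as free, matching the assumption made later for the matrix W.

    def total_latency(allocation, compute_time, out_size, bandwidth, cloud):
        # compute_time[n][i]: time for node n to execute layer i
        # out_size[i]: size of the data output by layer i (theta_i^out)
        # bandwidth[a][b]: best available bandwidth between nodes a and b
        # allocation[i]: the node that runs layer i; the last output goes to the cloud
        latency = 0.0
        for i, node in enumerate(allocation):
            latency += compute_time[node][i]                  # computation part
            nxt = allocation[i + 1] if i + 1 < len(allocation) else cloud
            if nxt != node:                                    # cross-node transfer part
                latency += out_size[i] / bandwidth[node][nxt]
        return latency

The sum of the per-layer computation terms and the cross-node transfer terms corresponds to the two latency parts described above.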
…of ϕ on processing these three consecutive segments:

$$
\delta(\varphi) = d_p^{i-1} + d_{p\to q}^{i-1} + d_q^{i} + d_{q\to p}^{i} + d_p^{i+1}
+ \begin{cases} d_{p\to x}^{i+1}, & \text{or} \\ d_{p\to q}^{i+1} + d_{q\to y}^{i+1} \end{cases} \qquad (2)
$$

Under allocation policy ϕ, nodes $e_p$ and $e_q$ process segments $L_{i-1}$ and $L_i$, respectively, and then $e_p$ takes over segment $L_{i+1}$. Thus ϕ results in two cross-edge intermediate data transfers, which consume times $d_{p\to q}^{i-1}$ and $d_{q\to p}^{i}$, respectively. In (2), the last term on the right-hand side indicates that there are two alternatives for ϕ to transfer the intermediate data output from the last layer of $L_{i+1}$: one is to transfer the data directly from $e_p$ to some node $e_x$ processing $L_{i+2}$, and the other is to relay it through $e_q$ to some node $e_y$ processing $L_{i+2}$. Similarly, the time cost of ϕ′ on processing the three segments can be expressed as

$$
\delta(\varphi') = d_p^{i-1} + d_p^{i} + d_q^{i+1}
+ \begin{cases} d_{p\to x}^{i+1}, & \text{or} \\ d_{p\to q}^{i+1} + d_{q\to y}^{i+1} \end{cases} \qquad (3)
$$

In Case I, $e_p$ is identical to or faster than $e_q$ in terms of processing speed, i.e., $d_p^i \le d_q^i$. Comparing (3) and (2), we always have δ(ϕ′) < δ(ϕ), regardless of how the allocation policy ϕ relays the intermediate data output from the last layer of $L_{i+1}$.

If Case II holds, i.e., $e_q$ can process segment $L_i$ faster than $e_p$, we can create a new allocation ϕ″ that re-allocates segment $L_i$ to $e_q$, which is the sole difference from ϕ. Again, we can easily prove δ(ϕ″) < δ(ϕ). In conclusion, we can always reshape the ϕ given in this theorem into an allocation policy with shorter latency.

Comparing the three allocation policies shown in Fig. 3, we can see that both $L_{i-1}$ and $L_{i+1}$ are allocated by ϕ to node $e_p$ but are ‘‘cut in’’ by $L_i$, which is allocated to another node $e_q$. Theorem 2 implies that if an allocation policy allocates non-consecutive segments to some node, such a cut-in policy is not optimal. Not limited to the case of three consecutive segments, Theorem 2 can easily be extended to apply to the case of any three nonempty segments $L_i \prec L_j \prec L_k$ where $L_i$ and $L_k$ are both allocated to one node but $L_j$ is allocated to another node. The heuristics offered by Theorem 2 let designers safely bypass such cut-in allocations, which narrows down the solution space (i.e., the feasible region) and thus helps speed up their algorithms.
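As a quick numerical illustration of the cut-in penalty captured by (2) (the numbers below are hypothetical and only for intuition, not taken from the paper): suppose each of the three segments costs 2 ms on $e_p$ and 3 ms on $e_q$, and every cross-edge transfer costs 1 ms. The cut-in policy ϕ then spends

$$\delta(\varphi) = 2 + 1 + 3 + 1 + 2 = 9 \text{ ms}$$

on these segments (excluding the common relay term), whereas an allocation that keeps all three segments on $e_p$ spends only $2 + 2 + 2 = 6$ ms, since it avoids both cross-edge transfers; this is the intuition behind Case I.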
IV. DESIGNS
Recall that for the deep learning considered in this paper, we let c and $e_r$ represent the cloud node and the edge node that receives the input data, respectively. If we only allocate L to $e_r$ and c, as [29] does, we can easily figure out an optimal solution to our problem in polynomial time. Besides these two nodes, however, any other nodes of E could be included in the optimal solution for the general case. To address the LMA problem, we design an approximate algorithm, called CoEdge. With a greedy policy, CoEdge attempts to iteratively insert a new edge node into Src, which is initialized only with {$e_r$, c}, until the iteratively updated Src can no longer assure a shorter time for performing the deep learning task. Algorithm 1 shows how the proposed CoEdge works. Before diving into the algorithm description, we first introduce the information to be input to CoEdge and how to initialize these inputs.

Algorithm 1 CoEdge
Input: E ∪ {c}, L, W, and C
Result: the allocation of L over a subset Src of E ∪ {c}
1  Src ← ⟨e_r, c⟩
2  Determine an optimal layer allocation A over Src and then obtain the minimum total latency (i.e., δmin)
3  while E ∪ {c} − Src ≠ ∅ do
4      foreach e ∈ E ∪ {c} − Src do
5          Determine a partition P(L) as well as an allocation policy ϕ for Src ⊕ e that minimize the total latency
6      end
7      Select e* from the nodes examined in the above for-loop such that the corresponding partition P*(L) and allocation policy ϕ* on Src ⊕ e* achieve the minimum latency (denoted by δ*)
8      if δ* < δmin then
9          δmin ← δ*
10         Src ← Src ⊕ e*
11         Update A with Src, P*(L), and ϕ*
12     else
13         return A
14     end
15 end
16 return A
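The following Python sketch mirrors the structure of Algorithm 1 and is only illustrative: best_allocation(seq) stands in for the recursion-based layer-allocation step (evaluated per node sequence), and all names are ours; in particular, Src ⊕ e is modeled here simply as appending e, whereas the paper's operator may place the new node elsewhere in the sequence.

    def coedge(edge_nodes, cloud, receiver, best_allocation):
        # best_allocation(seq) is assumed to return (allocation, latency) for
        # the node sequence seq; it abstracts the recursion-based policy.
        src = [receiver, cloud]                      # Src initialized with {e_r, c}
        best_alloc, best_delta = best_allocation(src)
        candidates = set(edge_nodes) - set(src)
        while candidates:
            # Try each remaining node and keep the best one (e*).
            trials = [(best_allocation(src + [e]), e) for e in candidates]
            (alloc, delta), e_star = min(trials, key=lambda t: t[0][1])
            if delta < best_delta:                   # e* still shortens the latency
                best_delta, best_alloc = delta, alloc
                src.append(e_star)
                candidates.remove(e_star)
            else:                                    # no remaining node helps: stop
                return best_alloc
        return best_alloc

The greedy loop stops as soon as no remaining node can further reduce the latency, exactly as in lines 8-13 of Algorithm 1.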
A. INITIALIZATION OF THE CoEdge INPUTS
The input information needed by CoEdge consists of four data sets: E ∪ {c}, L, W, and C. The latter two sets profile the bandwidth resource and the compute capacity of the edge-cloud environment. More specifically, W is a matrix with |E ∪ {c}| rows and |E ∪ {c}| columns; each element $\omega_{ij}$ measures the best available bandwidth between two distinct nodes $e_i$ and $e_j$. Since we neglect the time consumed in on-node data transfers, we let $\omega_{ii} = \infty$. Input C is also a matrix; its element $c_{ij}$ stores the computational time that node $e_i$ of E ∪ {c} needs to pay if it is assigned to perform a possible segment $L_j$ of L. Next we introduce how to determine the matrices W and C before algorithm CoEdge can go ahead.

We assume that the edge network is connected and that each edge node connects with the cloud node through the Internet. So, there exists at least one communication path between any two edge nodes or between any edge node and the cloud node. We employ the Floyd algorithm to calculate the best available…
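A minimal sketch of how the matrix W could be initialized with a Floyd–Warshall-style pass, assuming (as the truncated sentence above suggests) that the ‘‘best available bandwidth’’ between two nodes is the bottleneck bandwidth of the widest path connecting them; the code and its names are ours, not the paper's.

    import math

    def init_bandwidth_matrix(direct):
        # direct[i][j]: direct-link bandwidth between nodes i and j (0 if absent)
        n = len(direct)
        w = [[direct[i][j] for j in range(n)] for i in range(n)]
        for i in range(n):
            w[i][i] = math.inf            # on-node transfers are treated as free
        for k in range(n):                # classic triple loop, max-min relaxation
            for i in range(n):
                for j in range(n):
                    via_k = min(w[i][k], w[k][j])   # bottleneck of the relayed path
                    if via_k > w[i][j]:
                        w[i][j] = via_k
        return w

Each entry of the returned matrix then gives the largest bandwidth achievable on any path between the corresponding pair of nodes, which is one natural reading of ω_ij.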
…and an m-layer DNN, we can obtain a latency-minimum allocation by recursively solving $\delta^*(L_{1\sim m}, Src)$ according to (4). At last, CoEdge returns a partition P(L), a nonempty Src, and an allocation policy ϕ; and for any $L_i \prec L_j$ of this partition, we always have $\varphi(L_i) \prec \varphi(L_j)$.

In each iteration, for a given e ∈ E ∪ {c} − Src, CoEdge employs recursion to complete allocating all the layers of L across Src ⊕ e. According to (4), CoEdge needs to evaluate $\delta(L_{i\sim j}, \pi(Src))$ for any 1 ≤ i < j ≤ m on the first node of sequence Src. As analyzed above, there are $\frac{m(m+1)}{2}$ different partitions for an m-layer DNN. In addition, we have created the set C before CoEdge enters the greedy iterations, and $c_{ij}$ stores the total computational time for node $e_i$ to perform all the consecutive layers of partition $L_j$. We thus know that in the iteration with Src and a given e, we can obtain $\delta(L_{1\sim m}, Src \oplus e)$ with a time complexity of $O(m^2(|Src|+1))$. Furthermore, in each greedy iteration, CoEdge needs $O(m^2(|Src|+1) \cdot |E - Src|)$ time to find $e^*$ and the corresponding $\delta^*$. We then easily conclude that for an m-layer DNN and an edge-cloud of size (n + 1), the total time complexity of CoEdge is upper bounded by $O(m^2 n^3)$.
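As a concrete instance of the $\frac{m(m+1)}{2}$ count used above (reading it as the number of consecutive layer segments $L_{i\sim j}$ with $i \le j$; the AlexNet layer count is taken from Section V), for $m = 8$ we have

$$\frac{m(m+1)}{2} = \frac{8 \times 9}{2} = 36$$

candidate segments to be evaluated in each recursion.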
V. EXPERIMENTS
In this section, we conduct simulation experiments with realistic setups to evaluate our design and compare it with two baseline algorithms [26], [29], which are here termed fixedEdge and singleEdge. We use AlexNet [33] and VGGNet-19 [34] in our experiments to do image classification. AlexNet is an eight-layer DNN, including five convolution layers and three fully-connected layers; the first, second, and fifth layers of AlexNet also involve max pooling. VGGNet-19 is a 19-layer DNN, which is divided into five convolutional segments. Each convolutional segment of VGGNet-19 is followed by a max pooling layer that is used to reduce the size of the image data. Since our objective is to reduce the deep learning latency by turning to the collaborative edge, we evaluate our algorithm and the two baselines in terms of latency under a variety of experimental cases.

A. EXPERIMENTAL SETUP
In simulation, we set the parameters with realistic setups. The computing capacity of the cloud node is set to 3200 Gflops. We set the edge with four different setups in bandwidth resource and computing capacity; they are given as follows.
1) high-speed edge: the bandwidth of an in-edge link ranges from 500 Mbps to 1000 Mbps;
2) low-speed edge: the bandwidth of an in-edge link ranges from 10 Mbps to 200 Mbps;
3) high-capacity edge: the compute capacity of an edge node ranges from 80 Gflops to 640 Gflops;
4) low-capacity edge: the compute capacity of an edge node ranges from 4 Gflops to 32 Gflops.
All the above setups are advised by [29], [35], [36] on the basis of empirical measurements. We evaluate the proposed CoEdge and the baselines under four different cases. For each experimental case, the computing capacity and the bandwidth are randomly chosen from the corresponding ranges. In all the experiments, we make the edge network connected, although not all the pairs of edge nodes are directly connected. Each edge node can communicate with the cloud through the Internet; for a given edge node, its bandwidth to the cloud is set to a random value between 1 Mbps and 10 Mbps. The input images for AlexNet and VGGNet-19 are 227 × 227 pixels (about 1.1794 Mb) and 224 × 224 pixels (about 1.1484 Mb) in size. The computing load and the reduced ratio of intermediate results are set to the default values of these two DNN models. Each experimental case is repeated 40 times, each time with a randomly-chosen edge node as the data source (i.e., the receiver of the input images), and the average for that case is reported.
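A minimal sketch of how one experimental case could be instantiated from the ranges above; the helper names and the use of uniform sampling are our assumptions for illustration, not details specified by the paper.

    import random

    SETUPS = {
        "high_speed":    {"bw_mbps": (500, 1000)},   # in-edge link bandwidth
        "low_speed":     {"bw_mbps": (10, 200)},
        "high_capacity": {"gflops": (80, 640)},      # edge-node compute capacity
        "low_capacity":  {"gflops": (4, 32)},
    }
    CLOUD_GFLOPS = 3200
    EDGE_TO_CLOUD_BW_MBPS = (1, 10)

    def sample_case(num_edge_nodes, speed_setup, capacity_setup):
        bw_lo, bw_hi = SETUPS[speed_setup]["bw_mbps"]
        cap_lo, cap_hi = SETUPS[capacity_setup]["gflops"]
        nodes = [{
            "gflops": random.uniform(cap_lo, cap_hi),
            "bw_to_cloud_mbps": random.uniform(*EDGE_TO_CLOUD_BW_MBPS),
        } for _ in range(num_edge_nodes)]
        # In-edge link bandwidths would be drawn from (bw_lo, bw_hi) over a
        # connected (but not fully connected) topology.
        return {"cloud_gflops": CLOUD_GFLOPS, "edge_nodes": nodes,
                "in_edge_bw_range_mbps": (bw_lo, bw_hi)}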
B. RESULTS AND ANALYSIS
Figures 6-9 show how the three schemes perform in different cases. In Fig. 6, we first examine the inference latency of our CoEdge when the edge network is formed with edge devices whose processing capability is higher than 80 Gflops and which support high-speed communications with bandwidth ranging from 500 Mbps to 1000 Mbps. It can be seen that for AlexNet and VGGNet-19, CoEdge always achieves the fastest inference, regardless of what the edge network size is set to be. Additionally, compared with the other two baselines, and especially with fixedEdge, CoEdge remains much more stable under each DNN model: its inference time experiences only a subtle fluctuation as the network size increases. It is worth noticing in Fig. 6 that although VGGNet-19 involves only a bit more than twice the layers of AlexNet, the deep inference for VGGNet-19 takes far longer than that for AlexNet. For example, the inference latency of CoEdge on AlexNet is only 1.77 ms on average, whereas the latency of CoEdge on VGGNet-19 is higher than 90 ms. The inference time needed by singleEdge sharply grows up to the order of hundreds of milliseconds. Such an observation reflects that for a complex DNN with a huge amount of computation, it is necessary and feasible to ‘‘dissolve’’ this DNN into the edge-cloud to further improve the inference performance.

FIGURE 6. Variation of inference latency against edge size under the high-speed and high-capacity edge network.
FIGURE 8. Variation of inference latency against edge size under the low-speed and high-capacity edge network.
REFERENCES
[1] H. El-Sayed, S. Sankar, M. Prasad, D. Puthal, A. Gupta, M. Mohanty, and
C.-T. Lin, ‘‘Edge of things: The big picture on the integration of edge,
IoT and the cloud in a distributed computing environment,’’ IEEE Access,
vol. 6, pp. 1706–1717, 2017.
[2] H. Song, J. Bai, Y. Yi, J. Wu, and L. Liu, ‘‘Artificial intelligence enabled
Internet of Things: Network architecture and spectrum access,’’ IEEE
Comput. Intell. Mag., vol. 15, no. 1, pp. 44–51, Feb. 2020.
[3] J. Ren, H. Guo, C. Xu, and Y. Zhang, ‘‘Serving at the edge: A scalable IoT
architecture based on transparent computing,’’ IEEE Netw., vol. 31, no. 5,
pp. 96–105, Aug. 2017.
[4] P. G. Lopez, J. Cao, Q. Zhang, Y. Li, and L. Xu, ‘‘Edge-centric computing:
Vision and challenges,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 45, no. 5, pp. 37–42, 2015.
[5] K. Wang, H. Yin, W. Quan, and G. Min, ‘‘Enabling collaborative edge computing for software defined vehicular networks,’’ IEEE Netw., vol. 32, no. 5, pp. 112–117, Sep. 2018.
[6] R. Liao, H. Wen, J. Wu, F. Pan, A. Xu, Y. Jiang, F. Xie, and M. Cao, ‘‘Deep-learning-based physical layer authentication for industrial wireless sensor networks,’’ Sensors, vol. 19, no. 11, p. 2440, 2019.
[7] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, ‘‘Edge computing: Vision and challenges,’’ IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, Oct. 2016.
[8] T. X. Tran, A. Hajisami, P. Pandey, and D. Pompili, ‘‘Collaborative mobile edge computing in 5G networks: New paradigms, scenarios, and challenges,’’ IEEE Commun. Mag., vol. 55, no. 4, pp. 54–61, Apr. 2017.
[9] F. Xie, H. Wen, J. Wu, S. Chen, W. Hou, and Y. Jiang, ‘‘Convolution based feature extraction for edge computing access authentication,’’ IEEE Trans. Netw. Sci. Eng., early access, Dec. 3, 2019, doi: 10.1109/TNSE.2019.2957323.
[10] S. K. Sharma and X. Wang, ‘‘Live data analytics with collaborative edge and cloud processing in wireless IoT networks,’’ IEEE Access, vol. 5, pp. 4621–4635, 2017.
[11] X. Lyu, H. Tian, L. Jiang, A. Vinel, S. Maharjan, S. Gjessing, and Y. Zhang, ‘‘Selective offloading in mobile edge computing for the green Internet of Things,’’ IEEE Netw., vol. 32, no. 1, pp. 54–60, Jan. 2018.
[12] K. Ha, Z. Chen, W. Hu, W. Richter, P. Pillai, and M. Satyanarayanan, ‘‘Towards wearable cognitive assistance,’’ in Proc. 12th Annu. Int. Conf. Mobile Syst., Appl., Services (MobiSys), 2014, pp. 68–81.
[13] I. Goodfellow, Y. Bengio, A. Courville, and F. Bach, Deep Learning. Cambridge, MA, USA: MIT Press, 2017.
[14] E. Cuervo, A. Balasubramanian, D. Cho, A. Wolman, S. Saroiu, R. Chandra, and P. Bahl, ‘‘MAUI: Making smartphones last longer with code offload,’’ in Proc. 8th Int. Conf. Mobile Syst., Appl., Services, 2010, pp. 49–62.
[15] M. Satyanarayanan, P. Bahl, R. Caceres, and N. Davies, ‘‘The case for VM-based cloudlets in mobile computing,’’ IEEE Pervasive Comput., vol. 8, no. 4, pp. 14–23, Oct. 2009.
[16] M. Gordon, D. Jamshidi, S. Mahlke, Z. Mao, and X. Chen, ‘‘COMET: Code offload by migrating execution transparently,’’ in Proc. 10th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2012, pp. 93–106.
[17] B.-G. Chun, S. Ihm, P. Maniatis, M. Naik, and A. Patti, ‘‘CloneCloud: Elastic execution between mobile device and cloud,’’ in Proc. 6th Conf. Comput. Syst., 2011, pp. 301–314.
[18] S. Kosta, A. Aucinas, P. Hui, R. Mortier, and X. Zhang, ‘‘ThinkAir: Dynamic resource allocation and parallel execution in the cloud for mobile code offloading,’’ in Proc. IEEE INFOCOM, Mar. 2012, pp. 945–953.
[19] X. Chen, Q. Shi, L. Yang, and J. Xu, ‘‘ThriftyEdge: Resource-efficient edge computing for intelligent IoT applications,’’ IEEE Netw., vol. 32, no. 1, pp. 61–65, Jan. 2018.
[20] X. Chen, L. Pu, L. Gao, W. Wu, and D. Wu, ‘‘Exploiting massive D2D collaboration for energy-efficient mobile edge computing,’’ IEEE Wireless Commun., vol. 24, no. 4, pp. 64–71, Aug. 2017.
[21] X. Cheng, F. Lyu, W. Quan, C. Zhou, H. He, W. Shi, and X. Shen, ‘‘Space/Aerial-assisted computing offloading for IoT applications: A learning-based approach,’’ IEEE J. Sel. Areas Commun., vol. 37, no. 5, pp. 1117–1129, May 2019.
[22] J. Hauswald, Y. Kang, M. A. Laurenzano, C. Quan, and L. Tang, ‘‘DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers,’’ in Proc. ACM/IEEE 42nd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2015, pp. 27–40.
[23] T. Chen, Z. Du, N. Sun, W. Jia, C. Wu, Y. Chen, and O. Temam, ‘‘DianNao: A small-footprint high-throughput accelerator for ubiquitous machine learning,’’ ACM SIGPLAN Notices, vol. 49, no. 4, pp. 269–284, 2014.
[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, and T. Darrell, ‘‘Caffe: Convolutional architecture for fast feature embedding,’’ in Proc. 22nd ACM Int. Conf. Multimedia, Nov. 2014, pp. 675–678.
[25] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, ‘‘PuDianNao: A polyvalent machine learning accelerator,’’ ACM SIGARCH Comput. Archit. News, vol. 43, no. 1, pp. 369–381, 2015.
[26] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, ‘‘Energy-efficient CNN implementation on a deeply pipelined FPGA cluster,’’ in Proc. Int. Symp. Low Power Electron. Design (ISLPED), 2016, pp. 326–331.
[27] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, and F. Kawsar, ‘‘DeepX: A software accelerator for low-power deep learning inference on mobile devices,’’ in Proc. 15th ACM/IEEE Int. Conf. Inf. Process. Sensor Netw. (IPSN), Apr. 2016.
[28] J. Zhou, Y. Wang, K. Ota, and M. Dong, ‘‘AAIoT: Accelerating artificial intelligence in IoT systems,’’ IEEE Wireless Commun. Lett., vol. 8, no. 3, pp. 825–828, Jun. 2019.
[29] H. Li, K. Ota, and M. Dong, ‘‘Learning IoT in edge: Deep learning for the Internet of Things with edge computing,’’ IEEE Netw., vol. 32, no. 1, pp. 96–101, Jan. 2018.
[30] A. Janiak, W. Janiak, and M. Lichtenstein, ‘‘Resource management in machine scheduling problems: A survey,’’ Decis. Making Manuf. Services, vol. 1, no. 2, pp. 59–89, Oct. 2007.
[31] D. Oron, D. Shabtay, and G. Steiner, ‘‘Approximation algorithms for the workload partition problem and applications to scheduling with variable processing times,’’ Eur. J. Oper. Res., vol. 256, no. 2, pp. 384–391, Jan. 2017.
[32] H. Wang and B. Alidaee, ‘‘Unrelated parallel machine selection and job scheduling with the objective of minimizing total workload and machine fixed costs,’’ IEEE Trans. Autom. Sci. Eng., vol. 15, no. 4, pp. 1955–1963, Oct. 2018.
[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[34] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for large-scale image recognition,’’ 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[35] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng, ‘‘Deep learning with COTS HPC systems,’’ in Proc. Int. Conf. Mach. Learn., 2013, pp. 1337–1345.
[36] X. Zeng, K. Cao, and M. Zhang, ‘‘MobileDeepPill: A small-footprint mobile deep learning system for recognizing unconstrained pill images,’’ in Proc. 15th Annu. Int. Conf. Mobile Syst., Appl., Services, 2017, pp. 56–67.

LIANGYAN HU received the B.E. degree in computer science and technology from the School of Information Science and Technology, Beijing Forestry University, Beijing, China, in 2018. She is currently pursuing the master's degree in software engineering with Beijing Forestry University. Her current research interests include edge computing, deep learning, and mobile computing.

GUODONG SUN (Member, IEEE) was a Postdoctoral Researcher with Tsinghua University, China, before joining the Faculty of Beijing Forestry University. He was a Visiting Professor of computer science with North Carolina University at Charlotte, USA. He is currently an Associate Professor of computer science with the School of Information Science and Technology, Beijing Forestry University, Beijing, China. His research interests include mobile computing, wireless ad-hoc and sensor networks, combinatorial optimization, the Internet of Things, and machine learning. He is a member of the IEEE Computer Society.

YANLONG REN received the B.E. and M.E. degrees from the School of Information Science and Technology, Beijing Forestry University, Beijing, China, in 2013 and 2015, respectively. He is currently a Senior Engineer with the Network Information Management and Service Center, Beijing University of Civil Engineering and Architecture, Beijing. His main work is to maintain the large-scale network infrastructure and the cloud-based data center of his university.