
ADVANCED COMPUTER ARCHITECTURE

ECE7373
FINAL PROJECT

LITERATURE REVIEW ON
MOBILE PLATFORMS

BY

KRISHNA SRIVATHSAV PRAKASH


2157874
INTRODUCTION
With the number of mobile devices growing rapidly thanks to their ease of use
and accessibility, the computational power of the processors inside these
devices deserves close attention. As more and more desktop features become
available on mobile phones, building efficient processors and better
architectures is increasingly important. A great deal of research has gone
into this field, driven by the demand for processing power that lets phones
run demanding applications, support voice assistants, and add ever smarter
features. In this project, several papers related to this field are reviewed
and summarized.

LIST OF PAPERS INCLUDED


1. Y. Zhu and V. J. Reddi, "High-performance and energy-efficient mobile web
   browsing on big/little systems," 2013 IEEE 19th International Symposium on
   High Performance Computer Architecture (HPCA), Shenzhen, China, 2013, pp.
   13-24, doi: 10.1109/HPCA.2013.6522303.
2. M. Song et al., "In-Situ AI: Towards Autonomous and Incremental Deep
   Learning for IoT Systems," 2018 IEEE International Symposium on High
   Performance Computer Architecture (HPCA), Vienna, Austria, 2018, pp. 92-
   103, doi: 10.1109/HPCA.2018.00018.
3. Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars and L. Tang,
   "Neurosurgeon: Collaborative intelligence between the cloud and mobile
   edge," ACM SIGARCH Computer Architecture News, vol. 45, no. 1, 2017, pp.
   615-629.
4. P. Guo and W. Hu, "Potluck: Cross-application approximate deduplication for
   computation-intensive mobile applications," in Proceedings of the Twenty-
   Third International Conference on Architectural Support for Programming
   Languages and Operating Systems (ASPLOS), 2018, pp. 271-284.
5. Y. Huang, Z. Zha, M. Chen and L. Zhang, "Moby: A mobile benchmark suite
   for architectural simulators," 2014 IEEE International Symposium on
   Performance Analysis of Systems and Software (ISPASS), Monterey, CA, USA,
   2014, pp. 45-54, doi: 10.1109/ISPASS.2014.6844460.
High-Performance and Energy-Efficient Mobile Web Browsing on Big/Little Systems

1. Abstract:

In this paper, heterogeneous systems that can accommodate the energy
requirements of applications and webpages on both desktops and mobiles are
simulated and proposed. Statistical inference models built from data on 5,000
webpages were used to understand energy consumption. Using this data, the
authors build a heterogeneous system that achieves higher energy savings while
staying under a latency cutoff.

2. Main content:

The authors begin by explaining the experimental setup used for their tests.
For the small core they use a Cortex-A8, which is low power, and for the big
core a Cortex-A9, which can draw considerably more power. They consider
parameters such as the most used webpages, load time measurements, and energy
measurements. After gathering the results, the authors perform both a
representative and a comprehensive analysis to understand the energy-latency
trade-off. They conclude that different webpages require different core and
frequency settings to achieve the ideal balance between performance and energy
efficiency; because these cutoffs vary, a versatile heterogeneous system
containing both big and little cores would be ideal. The authors then set out
to determine why load time and energy consumption vary across webpages. To do
this, they look at the languages in which a webpage is written, analysing HTML
first and then CSS. Using microbenchmarking, they compute the execution time
and energy consumption associated with each tag used in HTML pages. They also
look at attributes and DOM trees to understand the differences in energy
requirements, and infer that time and energy consumption increase with the
number of nodes in a webpage's DOM tree. For the CSS analysis, the authors
examine the selectors associated with each HTML tag and the property usage in
CSS. Microbenchmarking the selectors shows that time and energy consumption
rise drastically as the number of selectors increases. Considering the seven
most used CSS properties, the authors find that execution time and energy
increase both when properties affect individual HTML tags and when they affect
other nodes in the DOM tree.

2.2 Energy modelling analysis:

After analysing the webpage loading times and identifying the variables
required to measure them, the authors build their prediction models using
linear regression. The observed webpage loading times and energy consumption
are used to construct the regression model. After overcoming overfitting
problems and obtaining more observations, the authors build three regression
models: basic linear regression, regularized linear regression, and a
restricted cubic spline (RCS) based model. On evaluation, the RCS-based model
is the best performer, staying within 10% error for 70% of the webpages and
within 20% for 91.8% of them. The authors then build a dynamic webpage
scheduler, which is tested against an OS-based scheduler and an integrated
webpage-aware scheduler.

3. Conclusion:

The authors conclude by measuring the energy savings and showing how much more
efficient the webpage-aware scheduler is than a production OS DVFS scheduler.
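To make the modelling pipeline concrete, the sketch below shows how a
spline-based regression over webpage features could drive a big/little
scheduling decision of the kind the paper describes. The feature names,
training data, core configurations, and the use of scikit-learn's
SplineTransformer in place of the paper's restricted cubic spline fit are all
illustrative assumptions, not the authors' implementation.

```python
# Sketch: regression over webpage features (DOM node count, CSS selector
# count) predicts load time; a scheduler then picks the cheapest core
# setting that still meets a latency cutoff. All data here is invented.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical training set: [DOM nodes, CSS selectors] per webpage.
X = rng.uniform([100, 10], [5000, 2000], size=(200, 2))
load_time = 0.5 + 0.002 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, 200)

# Cubic-spline basis plus regularized linear fit, standing in for the
# paper's restricted cubic spline (RCS) regression.
model = make_pipeline(SplineTransformer(degree=3, n_knots=5), Ridge(alpha=1.0))
model.fit(X, load_time)

# Invented per-configuration scaling: the little core is slower but
# draws less power than the big core.
CONFIGS = {"little@0.8GHz": (2.0, 0.4), "big@1.2GHz": (1.0, 1.0)}

def schedule(features, cutoff_s):
    """Pick the lowest-energy config whose predicted load time meets the cutoff."""
    base = float(model.predict([features])[0])
    best = None
    for name, (slowdown, power) in CONFIGS.items():
        t = base * slowdown
        if t <= cutoff_s:
            e = t * power  # energy = time * power
            if best is None or e < best[1]:
                best = (name, e, t)
    return best or ("big@1.2GHz", base, base)  # fall back to the fastest core

print(schedule([1200, 300], cutoff_s=3.0))
```

The scheduler mirrors the shape of the paper's policy: satisfy the latency
cutoff first, then minimize energy among the settings that qualify.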
In-situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems

1. Abstract:

In this paper the authors propose In-situ AI, an autonomous and incremental
computing framework for deep-learning-based IoT applications. The main aim of
the system is to help IoT systems keep big-data movement at a stable level
while still processing large amounts of raw unsupervised data. The authors
consider two main types of tasks, diagnosis and inference, which In-situ AI
performs on a mobile GPU and an FPGA. The tasks run in two modes:
single-running and co-running. The authors then develop analytical models to
find the ideal configuration for deploying the In-situ AI architecture in IoT
systems.

2. Main content:

To tackle the problem of improving a system's accuracy on dynamic data, the
authors propose using unsupervised learning to deal with big raw IoT data.
They use a two-level weight-shared FPGA design to avoid interference between
inference and diagnosis in the co-running mode. The authors then explain the
challenges faced by cloud-centric and fog computing systems, and how IoT
systems based on the In-situ AI architecture overcome them. In-situ AI
consists of a node module and a cloud module. The unsupervised data is first
transferred to the cloud module, where data features are extracted and
transferred to a target inference network, after which the remaining layers
are trained by the inference network. The unsupervised network is deployed in
the node module to perform diagnosis and then sends the data back to the
cloud, allowing In-situ AI to learn dynamic data more accurately with less
data movement. The authors then explain the architecture of In-situ AI,
showing that the mobile GPU is used for single-running mode and the FPGA
design for co-running mode. They evaluate the In-situ AI node from a
microarchitecture perspective and compare their shared-weight architecture
with other existing architectures, using an NVIDIA TX1 as the mobile GPU and
a Xilinx Virtex-7 VX690T for the FPGA design. Both the single-running and
co-running cases are considered. Time models for the different architectures
and this implementation, measured with non-batching methods, show that the
next best architecture after this implementation was VGGNet, which achieves
only a 1.1x speedup compared with the 3.3x speedup of In-situ AI. To evaluate
the two-level shared architecture (in co-running mode), convolution is
performed on all layers of NWS, WS, and WSS (the In-situ AI co-running
architecture) and the runtimes are compared: WS had the worst compute time,
while WSS had the best processing time thanks to the weight sharing in its
architecture. The authors then evaluate the overall WSS-NWS architecture (the
In-situ AI architecture) by comparing its throughput with WSS, WS, NWS, and
NWS with batching. Among them all, WSS-NWS has the best latency, owing to
high throughput and high performance that the others lack. The authors then
evaluate the In-situ cloud module against three existing IoT architectures,
using data movement as the comparison since it indicates the energy
consumption of each architecture. In-situ AI consumed the least energy
because data is retrained in place and only the last two layers undergo
transfer learning. In-situ AI also achieves a 1.5x speedup compared with the
existing IoT systems.

3. Conclusion:

In-situ AI has shown results that improve on other existing systems.
Hopefully the authors develop it further so it can be integrated into
existing IoT products, as it has demonstrated lower latency and more
efficient energy consumption.
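A minimal sketch of the kind of analytical time model discussed above, with
invented per-layer costs: in co-running mode, weight sharing lets the
diagnosis and inference tasks load their shared layers' weights once instead
of twice, which is where the latency advantage of the weight-shared (WSS)
design over the non-weight-shared (NWS) design comes from. The numbers are
made up; the paper's actual models are derived from its FPGA and GPU
implementations.

```python
# Toy analytical latency model for co-running diagnosis + inference,
# illustrating why weight sharing helps. All numbers are invented.

# Per-layer (compute_ms, weight_load_ms) for a small hypothetical network.
LAYERS = [(4.0, 2.5), (6.0, 3.0), (6.0, 3.0), (2.0, 1.0)]

def corun_latency_ms(shared_layers: int) -> float:
    """Latency of two co-running tasks that share the first `shared_layers`
    layers: shared layers load weights once, the rest load twice."""
    total = 0.0
    for i, (compute, load) in enumerate(LAYERS):
        loads = 1 if i < shared_layers else 2  # weight reuse on shared layers
        total += 2 * compute + loads * load    # both tasks still compute
    return total

nws = corun_latency_ms(0)            # no weight sharing
wss = corun_latency_ms(len(LAYERS))  # fully weight-shared
print(f"NWS: {nws:.1f} ms  WSS: {wss:.1f} ms  speedup: {nws / wss:.2f}x")
```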
Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge

1. Abstract:

Investigating the status quo of cloud-based processing and exploring
computation partitioning techniques are the ideas this article is based on.
The authors look for partitioning techniques that allow low latency, low
energy consumption, and high datacentre throughput across both cloud and
mobile devices for smart intelligent applications. The study uses deep neural
networks (DNNs) from the computer vision, speech, and natural language
domains. To study and understand this, the authors build Neurosurgeon, a
lightweight scheduler that can partition DNN computation between mobile
devices and datacentres.

2. Main Content:

From their literature survey, the authors infer that the most efficient way
to execute a DNN-based application is partly on mobile and partly in the
cloud, rather than entirely on either, although executing completely on
mobile has proven to give better datacentre throughput. Using Neurosurgeon,
they find the partition points in DNNs that balance computation between the
cloud and the mobile device. The experimental setup used an NVIDIA Jetson TK1
as the mobile platform and a server equipped with an NVIDIA Tesla K40 GPU;
Caffe, an open-source deep learning library, runs on both the mobile and
server platforms. Using AlexNet, which is broadly similar to most existing
DNNs, they measure communication latency, computation latency, and end-to-end
latency. Measuring communication latency across 3G, LTE, and Wi-Fi, they find
3G the slowest and Wi-Fi the fastest, showing the importance of the
communication network. For computation latency, they find that among the
mobile CPU, mobile GPU, and cloud GPU, the mobile CPU is the slowest and the
cloud GPU the fastest. End-to-end latency measurements show the mobile GPU
has the lowest latency of the alternatives, and the energy consumption of a
mobile GPU is lower than that of transferring data to the cloud. Using these
findings, the authors work on finding the partition points across AlexNet's
layers between mobile and cloud under a Wi-Fi configuration. They discuss two
ways to partition, one driven by latency and one by energy; in both cases,
partitioning at the centre of the network was the best-case scenario,
minimizing both latency and energy. After finding the right partition points
for several other DNNs, the authors deploy Neurosurgeon to find partition
points in compute-intensive environments. Evaluating Neurosurgeon across 8
DNNs, they find improved latency and a massive optimization: a 59% reduction
in mobile energy. The authors then evaluate Neurosurgeon under network
variation using a T-Mobile LTE network over a period of time, which it
successfully handles by shifting the partition. Neurosurgeon is then put to
the test against server load variation; it overcomes this by changing its
partition selection from complete execution on the cloud at low load, to
partitioning the DNN between mobile and cloud at medium load, to moving
execution entirely to mobile at high load, thus successfully managing the
variation in server loads. Neurosurgeon also manages to increase datacentre
throughput by adapting the partitions to the connection quality across Wi-Fi,
LTE, and 3G: as the connection quality drops, more of the DNN is pushed onto
the mobile platform, allowing better datacentre throughput.

3. Conclusion:

The idea behind Neurosurgeon is an excellent one. The authors have understood
the requirements of building an efficient framework to improve collaboration
between mobile edge computing and cloud computing.
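The partitioning decision at the heart of this approach can be sketched as
follows: run the first k layers on the mobile device, transmit that layer's
output, and run the remaining layers in the cloud, choosing the k that
minimizes end-to-end latency. The per-layer numbers and bandwidths below are
invented, and the brute-force enumeration stands in for Neurosurgeon's actual
per-layer latency predictors.

```python
# Sketch of DNN partition-point selection in the spirit of Neurosurgeon:
# layers [0, k) run on mobile, layer k-1's output is uploaded, and layers
# [k, n) run in the cloud. All per-layer numbers are invented.

# (mobile_ms, cloud_ms, output_kb) per layer of a hypothetical DNN.
LAYERS = [
    (10.0, 1.0, 600.0),  # conv1: activations balloon early on
    (15.0, 1.5, 300.0),
    (15.0, 1.5, 6.0),    # conv3: output shrinks sharply
    (10.0, 1.0, 4.0),
    (5.0, 0.5, 2.0),
]
INPUT_KB = 150.0

def best_partition(uplink_kbps: float):
    """Return (k, latency_ms) minimizing end-to-end latency; k layers run on mobile."""
    per_kb_ms = 8_000.0 / uplink_kbps  # ms to transmit one kilobyte
    best = None
    for k in range(len(LAYERS) + 1):
        mobile = sum(m for m, _, _ in LAYERS[:k])
        cloud = sum(c for _, c, _ in LAYERS[k:])
        sent_kb = INPUT_KB if k == 0 else LAYERS[k - 1][2]
        transfer = sent_kb * per_kb_ms if k < len(LAYERS) else 0.0
        total = mobile + transfer + cloud
        if best is None or total < best[1]:
            best = (k, total)
    return best

for bw in (50_000, 5_000, 500):  # Wi-Fi-ish, LTE-ish, 3G-ish uplinks
    k, ms = best_partition(bw)
    print(f"{bw} kbps -> split after layer {k}, {ms:.0f} ms end-to-end")
```

With these toy numbers the best split moves deeper into the network as the
connection quality drops (all-cloud on the fast link, a mid-network split on
the medium link, all-mobile on the slow link), mirroring the behaviour the
paper reports.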
Potluck: Cross-Application Approximate Deduplication for Computation-Intensive Mobile Applications

1. Abstract:

While most papers deal with latency and energy constraints by moving
processing entirely to the cloud or to the mobile device to improve
computation speed, this paper aims for maximum efficiency within those
constraints. Potluck is a cross-application approximate deduplication service
for deduplicating computation across applications. Potluck stores results
between applications, and an input matching algorithm is used to find
similarities between the results. The authors mainly target deduplication
opportunities across AR technologies in mobile applications to implement
Potluck.

2. Main content:

One of the main reasons for implementing Potluck is to deal with computation
in AR applications, so the authors consider features of the output of smart
computer-vision-based cognitive assistance applications, such as a common
theme, temporal and spatial correlation, semantic correlation, high
similarity in outputs, and correlation in the results. They also consider
features found in 3D navigation based on object detection, which shows
similarities with the image processing applications mentioned before. The
authors handle the deduplication of raw input images by storing them in a
cache and designing a cache service method for them. Potluck is designed so
that when a raw input image is given, the input data is converted into
feature vectors, and a function key is used to look up features similar to
the input data in the stored cache: if the feature vector and function key
match an entry, the cached result is returned; if there is no match, the
input is added as another cache entry, leading to adjustment of the input
similarity threshold. The authors then explain the cache service system in
Potluck, which is quite different from a usual cache system in how items are
treated as entries, and they give a detailed explanation of each step in the
process. In this system, the admission of a cache entry depends on its access
frequency, its computation overhead, and its size; these three factors lead
to another factor called importance, upon which admission is based, with the
quantifiable formula importance = access frequency x computation overhead /
size of entry. Cache querying is done using a nearest-neighbour search that
parses the indexes of the entries and uses a threshold K to eliminate
unnecessary entries from the search, improving the lookup and its accuracy,
with K = 1 proving the best threshold among the values tried. To find similar
features between cache entries, the authors implement a KNN algorithm, which
also helps deal with unknown features. To prevent a decrease in QoS, a
combination of KNN and random dropout algorithms is implemented. The authors
also include an eviction policy and cache expiry to prevent storing
unnecessary cache entries that could harm the system. Different kinds of key
operations, such as the multi-index structure, cache lookup, cache insertion,
and cache eviction, are also supported. The authors implement Potluck as a
background application-level service on Android OS: the AppListener handles
all the deduplication and all the requests, the CacheManager maintains the
entries and handles expiry and eviction in the background, and the
DataStorage keeps all previously entered cache data, computation results, and
indexes.

3. Conclusion:

The authors evaluate Potluck on accuracy, which includes measuring
performance; on processing time, via the reduction in computation time; and
on missed opportunity, by quantifying the difference between the optimal case
and the observed case. All in all, Potluck has proven to be a very good
deduplication service for improving computation time in image-processing-based
applications on mobile devices.
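A minimal sketch of this cache design, assuming a Euclidean distance over
feature vectors and an invented class layout: lookups use a K=1
nearest-neighbour match under a similarity threshold, and eviction ranks
entries by the importance formula given above. Potluck's real service also
handles threshold adaptation, expiry, and the multi-index structure, which
this sketch omits.

```python
# Sketch of Potluck-style approximate deduplication: results are cached
# under (function key, input feature vector); a lookup returns a stored
# result when a sufficiently similar input has already been computed, and
# eviction ranks entries by
#   importance = access_frequency * computation_overhead / entry_size.
# The distance metric, thresholds, and layout are illustrative assumptions.
import math
from dataclasses import dataclass

@dataclass
class Entry:
    func_key: str
    features: tuple
    result: object
    cost_ms: float   # computation overhead saved on a hit
    size_kb: float
    freq: int = 1    # access frequency

    @property
    def importance(self) -> float:
        return self.freq * self.cost_ms / self.size_kb

class ApproxCache:
    def __init__(self, max_entries: int = 64):
        self.max_entries = max_entries
        self.entries: list[Entry] = []

    def lookup(self, func_key, features, threshold):
        """K=1 nearest-neighbour match within a similarity threshold."""
        cands = [e for e in self.entries if e.func_key == func_key]
        if not cands:
            return None
        best = min(cands, key=lambda e: math.dist(e.features, features))
        if math.dist(best.features, features) <= threshold:
            best.freq += 1  # a hit bumps the entry's access frequency
            return best.result
        return None

    def insert(self, entry: Entry):
        """On a miss, store the new result; evict the least important entry."""
        self.entries.append(entry)
        if len(self.entries) > self.max_entries:
            self.entries.remove(min(self.entries, key=lambda e: e.importance))

cache = ApproxCache()
cache.insert(Entry("detect_objects", (0.12, 0.88, 0.40), ["car"], 250.0, 4.0))
print(cache.lookup("detect_objects", (0.13, 0.86, 0.41), threshold=0.05))
```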
Moby: A Mobile Benchmark Suite for Architectural Simulators

1. Abstract:

With the number of smartphones increasing exponentially every day, making
people's lives easier and more efficient, building applications that run
efficiently on mobile platforms is a must. Since a mobile workload is
completely different from a desktop CPU workload, the authors do their
benchmarking on a full-system simulator. They build Moby, a benchmark suite
that helps evaluate microarchitectures for mobile platforms. Their results
show that mobile applications exhibit more complex instruction behaviour
than current mobile platforms are built to meet.

2. Main content:

A benchmarking suite built to test mobile platforms must have two necessary
properties: it should be diverse enough to exhibit the range of behaviours of
target applications, and its applications should be transferrable to mobile
platforms. The authors also face system design issues while building Moby, as
mobile applications on different operating systems are not compatible; hence
Moby considers only applications available on Android OS. Ten different free
applications are included in Moby, each performing a different task, such as
a web browser, a social networking application, email, audio, video,
document, map, and game. The authors use the gem5 simulator for testing and
evaluating Moby. They then explain the benchmark selection methods, which are
crucial during the evaluation of the benchmark suite. Since most of the
chosen apps are famous commercial apps with no source code available,
instrumentation becomes complex; at the same time, the apps involve a lot of
user interaction and hence depend on network connectivity, a dependence the
authors remove by buffering any required data in local storage. Another major
difficulty in analysing user-interactive applications was reproducing results
without manual user input; this was overcome by using automation tools.

Fig. 1: List of apps selected and their functions.

The authors use two kinds of characterization to analyse the results:
microarchitecture-independent and microarchitecture-dependent. The
microarchitecture-independent characterization is explained in terms of
instruction mix, working set size, spawned processes, and code locality. The
authors first examine the instruction mix and how the instructions lead to
branch penalties, since almost 70% of them are conditional branches, which
affects overall performance. For working sets, they consider the number of
pages used in order to understand cache and main memory behaviour, and find
that only a small portion of each touched page is used. To gain a better
understanding of the code, all requests are captured when they access memory
and the reuse distance is calculated; the authors find that over 80% of the
instructions involve highly associative accesses. To understand the execution
flow of instructions in more detail, they collect instruction traces and map
dynamic instructions back to the static binaries. The microarchitecture-
dependent characterization is explained using CPI, stalled cycles per
component, branch misprediction rate, cache, TLB, and memory behaviour,
followed by core utilization.
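The reuse-distance measurement mentioned above can be illustrated with a
short sketch: for each access in a memory trace, the reuse distance is the
number of distinct addresses touched since the previous access to the same
address. The trace below is invented; a real characterization would replay
traces captured from the simulator.

```python
# Sketch of the reuse-distance metric used in Moby's
# microarchitecture-independent characterization: for each access, count
# the distinct addresses touched since the last access to that address.
def reuse_distances(trace):
    last_seen = {}   # address -> index of its previous access
    distances = []   # None marks a cold (first-time) access
    for i, addr in enumerate(trace):
        if addr in last_seen:
            window = trace[last_seen[addr] + 1 : i]
            distances.append(len(set(window)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances

trace = ["A", "B", "C", "A", "B", "B", "D", "A"]
print(reuse_distances(trace))
# [None, None, None, 2, 2, 0, None, 2]
```

This quadratic version is fine for a sketch; production tools typically use a
tree-based counter to handle long traces efficiently.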

3. Conclusion:

The authors build a new benchmark suite for mobile applications and test it
using the gem5 simulator. They measure and evaluate both the
microarchitecture-dependent and -independent features, attempting what other
papers have not achieved before. Though this paper may be old, it is still
one of a kind, as it offers a wide range of features for testing architecture
designs for mobile applications.
