ADVANCED COMPUTER ARCHITECTURE
ECE7373 FINAL PROJECT
LITERATURE REVIEW ON MOBILE PLATFORMS
BY
KRISHNA SRIVATHSAV PRAKASH
2157874

INTRODUCTION

With the increase in the number of mobile devices, driven by their ease of use and accessibility, it is important to consider the computational power of the processors used in these devices. With many desktop features now available on mobile phones, it is essential to build efficient processors and better architectures. A great deal of research has been done in this field, motivated by the demand for more processing power to run applications, support voice assistants, and improve existing smart features or add new ones. In this project, several papers in this field are reviewed and summarized.
LIST OF PAPERS INCLUDED
1. Y. Zhu and V. J. Reddi, "High-performance and energy-efficient mobile web browsing on big/little systems," 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), Shenzhen, China, 2013, pp. 13-24, doi: 10.1109/HPCA.2013.6522303.
2. M. Song et al., "In-Situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems," 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna, Austria, 2018, pp. 92-103, doi: 10.1109/HPCA.2018.00018.
3. Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," ACM SIGARCH Computer Architecture News, vol. 45, no. 1, 2017, pp. 615-629.
4. P. Guo and W. Hu, "Potluck: Cross-application approximate deduplication for computation-intensive mobile applications," in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018, pp. 271-284.
5. Y. Huang, Z. Zha, M. Chen and L. Zhang, "Moby: A mobile benchmark suite for architectural simulators," 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, CA, USA, 2014, pp. 45-54, doi: 10.1109/ISPASS.2014.6844460.

High-Performance and Energy-Efficient Mobile Web Browsing on Big/Little Systems
1. Abstract:

In this paper, heterogeneous systems that can accommodate the energy requirements of applications and webpages running on both desktops and mobile devices are simulated and proposed. Statistical inference models based on data from 5,000 webpages were used to understand energy consumption. Using this data, the authors build a heterogeneous system that achieves higher energy savings within a latency cut-off.

2. Main content:

The authors begin by explaining the experimental setup used for their tests. For the small core they use a Cortex-A8, which is low power, and for the big core they use a Cortex-A9, which can consume more power. They consider parameters such as the most visited webpages, load-time measurements, and energy measurements. After obtaining the results, they perform both a representative and a comprehensive analysis to understand the energy-latency trade-off. They conclude that different webpages require different ideal core and frequency settings to achieve the best balance between performance and energy efficiency; because these cut-offs vary, a versatile heterogeneous system consisting of both big and little cores would be ideal. The authors then investigate why load time and energy consumption vary across webpages. To do so, they examine the languages a webpage is written in, analysing HTML first and then CSS. Using microbenchmarking, they compute the execution time and energy consumption associated with each tag used in HTML pages. They also look at attributes and DOM trees to understand the difference in energy requirements, inferring that time and energy consumption increase with the number of nodes in a webpage's DOM tree. For the CSS analysis, the authors examine the selectors associated with each HTML tag and property usage in CSS. Microbenchmarking the selectors shows that as the number of selectors increases, time and energy consumption increase drastically. Considering the seven most used CSS properties, they find that execution time and energy grow when properties affect individual HTML tags and when properties affect other nodes in the DOM tree.

2.2 Energy modelling analysis:

After analysing the webpage loading times and identifying the variables needed to measure them, the authors build prediction models using linear regression, constructing the model from the webpage-loading and energy-consumption observations. After overcoming overfitting problems and obtaining more observations, they build three regression models: basic linear regression, regularized linear regression, and a restricted cubic spline (RCS) based model. Evaluation shows the RCS-based model is the best performer, staying within 10% error for 70% of the webpages and within 20% for 91.8% of them. The authors then build a dynamic webpage scheduler, which is put to the test against an OS-based scheduler and an integrated webpage-aware scheduler. A rough sketch of this three-model comparison follows below.
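To make the comparison of the three regression variants concrete, the sketch below fits a basic, a regularized, and a spline-based model to synthetic page features. This is a minimal illustration under stated assumptions, not the authors' code: the feature set and data are invented, and scikit-learn's B-spline transformer stands in for the paper's restricted cubic splines.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

# Synthetic stand-in for webpage features (e.g. DOM node count, number of
# CSS selectors, number of CSS rules), normalized to [0, 1]. Not real data.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 3))
# Load time with one nonlinear term, so the spline model has something to win on.
y = 1.0 + 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + 0.5 * X[:, 2] + rng.normal(0.0, 0.1, 500)

models = {
    "basic linear": LinearRegression(),
    # Regularization counters the overfitting the authors report.
    "regularized linear": Ridge(alpha=1.0),
    # B-splines stand in for the paper's restricted cubic splines (RCS);
    # both allow the fit to curve within the observed feature range.
    "spline-based": make_pipeline(SplineTransformer(degree=3, n_knots=5),
                                  Ridge(alpha=1.0)),
}
for name, model in models.items():
    model.fit(X, y)
    rel_err = np.abs(model.predict(X) - y) / y
    # The paper reports the share of pages predicted within 10% / 20% error.
    print(f"{name}: within 10% on {np.mean(rel_err < 0.10):.0%} of pages, "
          f"within 20% on {np.mean(rel_err < 0.20):.0%} of pages")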
3. Conclusion:

The authors conclude by measuring the energy savings and showing how much more efficient the webpage-aware scheduler is than a production OS DVFS scheduler.

In-situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems

1. Abstract:

In this paper the authors propose In-situ AI, an autonomous and incremental computing framework for deep-learning-based IoT applications. The main aim of this system is to help IoT systems keep big-data movement at a stable level while computing large amounts of raw unsupervised data. The authors consider two main types of tasks, diagnosis and inference, which In-situ AI performs on a mobile GPU and an FPGA. The tasks run in different modes: single-running and co-running. The authors then develop an analytical model to find the ideal configuration for deploying the In-situ AI architecture in IoT systems.

2. Main content:

To tackle the problem of improving a system's accuracy on dynamic data, the authors propose using unsupervised learning to deal with big raw IoT data. They use a two-level weight-shared FPGA design to avoid interference between inference and diagnosis in co-running mode. The authors then explain the challenges faced by cloud-centric and fog-computing systems and how IoT systems based on the In-situ AI architecture overcome them. In-situ AI consists of a node module and a cloud module. The unsupervised data is first transferred to the cloud module, where data features are extracted and then transferred to a target inference network, after which the remaining layers are trained by the inference network. The unsupervised network is deployed in the node module to perform diagnosis and then sends the data back to the cloud, allowing In-situ AI to learn dynamic data more accurately with less data movement. The authors then explain the architecture of In-situ AI, showing that the mobile GPU is used for single-running mode and the FPGA design for co-running mode; a sketch of the weight-sharing idea appears below.

The authors evaluate the In-situ AI node from a microarchitecture perspective and compare their shared-weight architecture with other existing architectures, using an NVIDIA TX1 as the mobile GPU and a Xilinx Virtex-7 VX690T as the FPGA design. Both the single-running and co-running cases are considered. Timing comparisons between different architectures and this implementation, measured without batching, show that the next best architecture after this implementation is VGGNet, which achieves only a 1.1x speedup compared to the 3.3x speedup of In-situ AI. To evaluate the two-level shared architecture in co-running mode, convolution is performed on all layers of NWS, WS, and WSS (the In-situ AI co-running architecture) and the runtimes are compared: WS has the worst compute time, while WSS has the best processing time thanks to the weight sharing in its architecture. The authors then evaluate the overall WSS-NWS (In-situ AI) architecture by comparing its throughput against WSS, WS, NWS, and batched NWS; among these, WSS-NWS has the best latency due to its high throughput and performance, which the others lack. Finally, the authors evaluate the In-situ cloud module against three existing IoT architectures, comparing data movement across the systems since it indicates their energy consumption. In-situ AI consumes the least energy because, during retraining, only the last two layers undergo transfer learning, and it also achieves a 1.5x speedup over the existing IoT systems.
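The weight sharing between the diagnosis and inference tasks can be pictured as one shared convolutional trunk feeding two task heads. The PyTorch sketch below is only an illustration of that idea under assumptions: the layer sizes and head definitions are invented, and the paper's actual design is a two-level weight-shared FPGA accelerator, not host-side PyTorch code.

import torch
import torch.nn as nn

class SharedTrunkModel(nn.Module):
    """Diagnosis and inference share one conv trunk instead of two weight copies."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Shared convolutional layers: loaded once and reused by both tasks,
        # which is what removes duplicate weight traffic in co-running mode.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.inference_head = nn.Linear(32, num_classes)  # supervised task
        self.diagnosis_head = nn.Linear(32, 2)            # e.g. normal vs. drifted input

    def forward(self, x: torch.Tensor):
        features = self.trunk(x)  # computed once, consumed by both heads
        return self.inference_head(features), self.diagnosis_head(features)

model = SharedTrunkModel()
inf_logits, diag_logits = model(torch.randn(1, 3, 32, 32))
print(inf_logits.shape, diag_logits.shape)  # torch.Size([1, 10]) torch.Size([1, 2])

Because the trunk weights are loaded once and reused by both tasks, co-running avoids paying twice for weight traffic, which is the effect behind the WSS runtime advantage over the non-shared designs described above.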
3. Conclusion:

In-situ AI has shown results that are better than the other existing systems; hopefully the authors develop it further so that it gets integrated into existing IoT products, as it has demonstrated lower latency and more efficient energy consumption.

Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge

1. Abstract:

Investigating the status quo of the cloud-only processing approach and exploring computation-partitioning techniques are the ideas this article is based on. The authors look for partitioning techniques that allow low latency, low energy consumption, and high datacentre throughput for smart intelligent applications on both cloud and mobile devices. The study uses deep neural networks from the computer vision, speech, and natural language domains. To study and understand this, the authors build Neurosurgeon, a lightweight scheduler that can help partition DNN computation between mobile devices and datacentres.

2. Main Content:

From their literature survey, the authors infer that the most efficient way to execute a DNN-based application is partly on mobile and partly in the cloud, rather than entirely on either, although executing entirely on mobile has proven to give better datacentre throughput. Using Neurosurgeon, they find partition points in DNNs that balance computation between the cloud and the mobile device. The experimental mobile platform is a Jetson TK1 developed by NVIDIA, and the server is equipped with an NVIDIA Tesla K40 GPU; the authors use Caffe, an open-source deep learning library, on both the mobile and server platforms. Using AlexNet, which is somewhat like most existing DNNs, they measure communication latency, computation latency, and end-to-end latency, showing the importance of the communication network. For communication latency, among 3G, LTE, and Wi-Fi, 3G is the slowest and Wi-Fi the fastest. For computation latency, among the mobile CPU, mobile GPU, and cloud GPU, the mobile CPU is the slowest and the cloud GPU the fastest. End-to-end latency measurements show that the mobile GPU has the lowest latency of the three, and the energy consumption of a mobile GPU is lower than that of transferring data to the cloud. Using these findings, the authors search for partition points among AlexNet's layers between mobile and cloud under a Wi-Fi configuration. They discuss two ways to partition, based on latency and on energy; in both cases, partitioning near the middle of the network was the best case for minimizing latency and energy (the selection logic is sketched below). After finding the right partition points for several other DNNs, the authors deploy Neurosurgeon to find partition points in highly compute-oriented environments. They evaluate Neurosurgeon across 8 DNNs and find improved latency and a massive optimization of a 59% reduction in mobile energy. They then evaluate Neurosurgeon under network variation using a T-Mobile LTE network over a period of time, which it successfully handles by shifting the partition. Neurosurgeon is then put to the test under server-load variation: it adapts by changing its partition selection from complete execution on the cloud at low load, to partitioning the DNN between mobile and cloud at medium load, to moving execution entirely to mobile at higher load, thus successfully managing the variation in server load. Neurosurgeon also manages to increase datacentre throughput by adapting the partitions to the connection quality across Wi-Fi, LTE, and 3G: as connection quality degrades, more of the DNN is pushed onto the mobile platform, allowing better datacentre throughput.
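The partition decision described above can be reduced to a small search over candidate split layers: run a prefix of the network on the mobile device, ship the intermediate output, and finish in the cloud. The sketch below is a hedged reconstruction of that logic, not the tool itself; the per-layer latencies, output sizes, and bandwidth figures are invented, whereas the real system predicts them with per-layer-type models.

def best_partition(input_kb, mobile_ms, cloud_ms, out_kb, uplink_kb_per_s):
    """Pick split k: layers 0..k-1 run on the mobile device, layers k.. in
    the cloud. k = 0 means all-cloud (ship the raw input); k = n means
    all-mobile (nothing is shipped)."""
    n = len(mobile_ms)
    best_k, best_ms = 0, float("inf")
    for k in range(n + 1):
        shipped_kb = input_kb if k == 0 else (0.0 if k == n else out_kb[k - 1])
        total_ms = (sum(mobile_ms[:k]) + sum(cloud_ms[k:])
                    + shipped_kb / uplink_kb_per_s * 1000.0)  # kB / (kB/s) -> ms
        if total_ms < best_ms:
            best_k, best_ms = k, total_ms
    return best_k, best_ms

# Toy 5-layer network: early layers produce large outputs and late layers are
# expensive on the mobile GPU, so a mid-network split wins, matching the
# paper's observation for AlexNet-like models. All numbers are made up.
mobile = [8.0, 10.0, 12.0, 40.0, 60.0]    # per-layer latency on the mobile GPU (ms)
cloud  = [1.0, 1.0, 2.0, 3.0, 4.0]        # per-layer latency on the cloud GPU (ms)
outs   = [400.0, 300.0, 50.0, 20.0, 1.0]  # per-layer output size (kB)
k, ms = best_partition(600.0, mobile, cloud, outs, uplink_kb_per_s=1000.0)
print(f"best split: after layer {k}, {ms:.0f} ms end to end")

Lowering uplink_kb_per_s in this toy model pushes the best split toward all-mobile, which mirrors the behaviour Neurosurgeon shows as connection quality degrades from Wi-Fi to 3G.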
As the connection computation latency and end to end latency quality reduces more DNN is pushed into the as it is somewhat like most of the existing mobile platform allowing for better datacentre DNNs. While measuring communication throughput. latency they find that between 3G, LTE and Wi-Fi the slowest is 3G and the fastest is Wi-fi 3. Conclusion: The idea behind neurosurgeon is an excellent look up similar features to the input data from one. The authors have understood the the cache stored, if there is a match between requirements of building an efficient the feature vector and function key a cache is framework to help improve collaboration returned if there’s no match the input gets between mobile edge computing and cloud added as another cache leading to input computing. similarity threshold adjustment. The authors then explain the cache service system in Potluck: Cross-Application Approximate potluck which is quite different from a usual Deduplication for Computation-Intensive cache system where they are considered as Mobile Applications entries, they also give a detailed explanation about each step in the process. In this system 1. Abstract: the entry of a cache depends upon access frequency, computation overhead and the While most papers try to deal with latency size of the entry. The three factors lead to and energy constraints by either moving a another factor called importance based upon processing system completely to cloud or a which the entry of a cache is considered, and mobile device to improve the computation the quantifiable formula is importance= speed this paper deals with achieving access frequency*computation overhead/size maximum efficiency within the constraints. of entry. The querying of cache is done using a Potluck is a cross application approximate nearest neighbour search which parses deduplication service for deduplication of through all the indexes of the entries and computation across applications. Potluck is using a threshold K they eliminate made in such a way that it stores results unnecessary entries from the search and between applications and an input matching improve the output lookout and accuracy and algorithm is entered to find out similarities with K=1 proving to be the best threshold between the results. The authors mainly deal among many other values. To find the similar with deduplication opportunities across AR features between the caches the authors technologies on mobile applications to implement KNN algorithm to help deal with implement potluck. unknown features as well. To help preventing the decrease in QoS the combination of KNN 2. Main content: and random dropout algorithms have been implemented. The authors have also included One of the main reasons for implementing an eviction policy and expiry of cache systems potluck is to deal with computation in AR to help prevent the storage unnecessary applications hence features in the output cache which can damage the system. from smart computer vision based cognitive Different kind of input keys such as multi- assistance applications are considered such as index structure, cache lookup, cache insertion common theme, temporal and spatial as well as cache eviction are also supported. correlation, semantic correlation, high The authors implement potluck as a similarities in output and correlation in the background application-level service on results. The authors also consider features Android OS. 
The authors implement Potluck as a background application-level service on Android OS: the AppListener handles all the deduplication requests, the CacheManager maintains the entries and performs expiry and eviction in the background, and the DataStorage keeps all the previously entered cache entries, computation results, and indexes.

3. Conclusion:

The authors evaluate Potluck on accuracy, which includes measuring performance; on processing time, based on the reduction in computation time; and on missed opportunity, by quantifying the difference between the optimal case and the achieved case. All in all, Potluck has proven to be a very good deduplication service for improving computation time in image-processing-based applications on mobile devices.

Moby: A Mobile Benchmark Suite for Architectural Simulators

1. Abstract:

With the number of smartphones increasing exponentially every day, making human lives easier and more efficient, building applications that can run efficiently on mobile platforms is a must. Since a mobile workload is completely different from a desktop CPU workload, the authors build a full-system simulation setup that acts as a benchmark. Moby is a benchmark suite that can help evaluate microarchitectures for mobile platforms. The authors' results show that mobile applications exhibit more complex instruction behaviour than current mobile platforms are designed to meet.
2. Main content:

A benchmarking suite built to test mobile platforms must have two necessary properties: it should be diverse enough to exhibit the range of behaviours of target applications, and its applications should be transferrable to mobile platforms. The authors also face system-design issues while building Moby, as mobile applications on different operating systems are not compatible, so Moby only considers applications available on Android OS. Ten different free applications are downloaded into Moby, each performing a different task: a web browser, a social networking application, email, audio, video, document, map, and game applications. The authors use the gem5 simulator for testing and evaluating Moby. They then explain the benchmark-selection methods, which are crucial during the evaluation of the benchmark suite. Since most of the chosen apps are famous commercial apps with no available source code, instrumentation becomes complex; at the same time, the apps involve a lot of user interaction and hence depend on network connectivity, a dependence removed by buffering any required data in local storage. Another major difficulty in analysing user-interactive applications is reproducing results without manual user input, which is overcome by using automation tools.

Fig 1: List of apps selected and their functions.

The authors use two kinds of characterization to analyse the results: microarchitecture-independent characterization and microarchitecture-dependent characterization. Microarchitecture-independent characterization is explained in terms of instruction mix, working-set size, spawned processes, and code locality. The authors first examine the instruction mix to understand how the instructions lead to branch penalties, since almost 70% of the branches are conditional and affect overall performance. For working sets, they count the number of pages used to understand cache and main-memory behaviour, finding that only a small portion of each touched page is used. To gain a better understanding of the code, all memory requests are captured and the reuse distance is calculated (sketched below); they find that over 80% of the instructions are highly associative. To understand the execution flow of instructions in more detail, the authors collect instruction traces and map dynamic instructions back to the static binaries. Microarchitecture-dependent characterization is explained using CPI, stalled cycles per component, branch misprediction rate, cache, TLB, and memory behaviour, followed by core utilization.
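Reuse distance, mentioned in the working-set analysis above, has a standard definition: for each access, the number of distinct addresses touched since the previous access to the same address. The sketch below illustrates that definition on a toy trace; it is not Moby's tooling, and a real trace-driven tool would stream billions of accesses with a more efficient data structure than this quadratic scan.

def reuse_distances(trace):
    """For each access, count the distinct addresses touched since the
    previous access to the same address (None for cold first accesses)."""
    last_seen = {}  # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1 : i]
            distances.append(len(set(between)))  # distinct addresses in between
        else:
            distances.append(None)  # first touch of this address
        last_seen[addr] = i
    return distances

trace = ["A", "B", "C", "A", "B", "B", "D", "A"]
print(reuse_distances(trace))  # [None, None, None, 2, 2, 0, None, 2]

Short reuse distances indicate accesses that a small cache can capture, which is why the distribution of these values characterizes working-set behaviour independently of any particular cache configuration.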
3. Conclusion:

The authors build a new benchmark suite for mobile applications and test it using the gem5 simulator. They measure and evaluate both the microarchitecture-dependent and the microarchitecture-independent features, achieving what other papers had not achieved before. Though this paper may be old, it is still one of a kind, as it offers a wide range of features for testing architecture designs for mobile applications.