Complete IOT Notes
Dr. Rajiv Misra
Professor, Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
After this lecture you will know the following:
● Edge Computing Architecture & building blocks
● Edge Computing for IoT
● Advantages of Edge Computing
Recapitulate: Evolution of Cloud
Cloud: Virtual machines running in a remote data center, or storage that was offered in a remote data center.
Edge: Data processed locally, and compute comes much closer to the devices or the sources of data.
Don't need to rely on the cloud for all the processing and data aggregation, collection, processing and querying.
Mimics the public cloud platform capabilities.
Reduces the latency by avoiding the round-trip and brings in data sovereignty by keeping data where it actually belongs.
Delivers local storage, compute, and network services.
Edge Computing: makes distributed cloud
● Edge computing makes the cloud truly distributed. The current cloud, or rather the previous generation of cloud, was almost like a mainframe or a client-server architecture where very little processing was done on the client side and all the heavy lifting was done by the cloud.
● With all the innovations in hardware chips and with affordable electronics and silicon, it makes more sense to bring compute down to the last mile and keep the compute closer to the devices.
● That is when edge computing becomes more and more viable: you don't need to rely on the cloud for all the processing and data aggregation, collection, processing and querying; instead you can run a computing layer that is very close to the devices.
Edge Computing: Mimics the public cloud platform capabilities and moves cloud services closer to the data source
● Edge computing mimics the public cloud platform capabilities.
● For example, when you dissect an edge computing platform you notice that it has almost all the capabilities of a typical public cloud.
● Like an IoT PaaS, it has device management, data ingestion, stream analytics, and it can run machine learning models and serverless functions; all of these are capabilities that are predominantly available on the public cloud.
● But with edge computing they all come to the last-mile delivery point and run very close to the source of the data, which is the sensors, actuators and devices.
Edge Computing: Reduces the latency by avoiding the round-trip
and brings in the data sovereignty
● The biggest advantage of deploying an edge computing layer is that it reduces the latency by avoiding the round-trip.
● It also brings in data sovereignty by keeping data where it actually belongs.
● For example, in a healthcare scenario it may not be viable or compliant to stream sensitive patient data to the cloud for storage and processing; instead the patient data should remain on-premises within the hospital, but it still needs to go through a lot of processing to find useful insights. In that case the edge computing layer stays close to the healthcare equipment with connectivity back to the cloud, and the architects and customer engineers decide what data stays within the edge boundary and what crosses it and moves to the cloud, perhaps anonymized data.
Edge Computing Building Blocks
Data Ingestion
M2M Brokers
Object Storage
Function as a Service
NoSQL/Time-Series Database
Stream Processing
ML Models
Edge Computing Building Blocks: Data Ingestion
Data Ingestion:
This is the high-velocity, high-throughput data endpoint, like a Kafka endpoint, that is going to ingest the data.
It is the process of obtaining and importing data for immediate use or storage in a database. To ingest something is to take something in or absorb something. Data can be streamed in real time or ingested in batches. In real-time data ingestion, each data item is imported as the source emits it.
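As a small illustration of such an ingestion endpoint, the sketch below streams one sensor reading to a Kafka topic using the kafka-python client. The broker address, topic name and payload fields are assumptions for the example, not details from the lecture.

```python
# Minimal sketch: streaming a sensor reading into a Kafka ingestion endpoint.
# Assumes a local broker at localhost:9092 and a topic named "sensor-data".
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

reading = {"device_id": "temp-sensor-01", "temperature_c": 22.4, "ts": time.time()}
producer.send("sensor-data", value=reading)  # asynchronous send
producer.flush()                             # block until the message is delivered
```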
Edge Computing Building Blocks: M2M Brokers
M2M Brokers:
The edge will also run message brokers that orchestrate machine-to-machine communication.
For example, device one talks to device two via the M2M broker.
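A hedged sketch of this pattern with the Eclipse Paho MQTT client is shown below: "device one" publishes to a topic on a local M2M broker and "device two" receives it via a subscription. The broker address, topic name and the paho-mqtt 1.x-style constructor are assumptions for the example.

```python
# Minimal sketch: machine-to-machine messaging through a local MQTT broker.
# Assumes a broker (e.g. Mosquitto) reachable at localhost:1883 and paho-mqtt 1.x.
import time

import paho.mqtt.client as mqtt  # pip install paho-mqtt

BROKER, TOPIC = "localhost", "factory/line1/temperature"

def on_message(client, userdata, msg):
    # Device two: react to readings published by device one.
    print(f"device-two received {msg.payload.decode()} on {msg.topic}")

# Device two subscribes to the topic.
device_two = mqtt.Client(client_id="device-two")
device_two.on_message = on_message
device_two.connect(BROKER, 1883)
device_two.subscribe(TOPIC)
device_two.loop_start()          # handle network traffic in a background thread

# Device one publishes a reading; the broker routes it to device two.
device_one = mqtt.Client(client_id="device-one")
device_one.connect(BROKER, 1883)
device_one.publish(TOPIC, "22.4")
time.sleep(1)                    # give the message time to arrive in this toy script
```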
Edge Computing Building Blocks: Storage
Object Storage:
There may be unstructured storage, particularly to store the feed from video cameras and mics; anything that is unstructured will go into object storage.
NoSQL/Time-Series Database:
More structured data goes into a time-series database and a NoSQL database.
Edge Computing Building Blocks: Stream Processing
Stream Processing:
It is a complex event processing engine that enables you to perform real-time queries and process the data as it comes.
For example, for every data point you may want to convert Fahrenheit to Celsius, or convert the timestamp from one format to another; you can do that in stream processing.
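The per-event transformations mentioned above (Fahrenheit to Celsius, timestamp reformatting) can be sketched as a plain Python generator over an incoming stream; the field names are illustrative assumptions.

```python
# Minimal sketch: per-event stream transformation (Fahrenheit -> Celsius,
# epoch timestamp -> ISO 8601). Field names are illustrative.
from datetime import datetime, timezone

def transform(events):
    """Yield each incoming event with converted units and timestamp format."""
    for e in events:
        yield {
            "device_id": e["device_id"],
            "temperature_c": round((e["temperature_f"] - 32) * 5 / 9, 2),
            "timestamp": datetime.fromtimestamp(e["ts"], tz=timezone.utc).isoformat(),
        }

incoming = [{"device_id": "s1", "temperature_f": 98.6, "ts": 1700000000}]
for event in transform(incoming):
    print(event)  # temperature_c becomes 37.0, timestamp becomes an ISO 8601 string
```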
Edge Computing Building Blocks: Function as a Service
Function as a Service:
To add additional business logic there is Function as a Service, which is lightweight compute responsible for running the business logic functions.
ML Models:
Lastly, there is an ML runtime; for example, most edge computing platforms are capable of running TensorFlow Lite, Caffe and PyTorch models, so you can process the incoming data more intelligently, take preventive measures and perform predictive analytics.
Edge Computing Architecture
Edge Computing: Three-tier Architecture
Now let's look at this from a different dimension.
There are data sources, and by the way edge computing is not confined just to IoT; it can even be used for non-IoT use cases. Anything that generates data can be fed into it, like cameras, clickstream analysis, gaming, etc.
It's basically like a three-tier architecture.
Edge Computing Architecture: Data Source Tier
In an industrial IoT environment, this could be a set of devices that are generating the data.
These are nothing but the original endpoints from where the data is acquired, or the origin of the data.
Edge Computing Architecture: Intelligence Tier
This is the tier responsible for training and running the machine learning models.
The intelligence tier cuts across the cloud and the edge, so there is a very well-defined boundary between edge and cloud where the training takes place on the cloud and the inferencing is run on the edge. But collectively, this overlap between the cloud and the edge is the intelligence layer.
Edge Computing Architecture: Actionable Insight Tier
Actionable insight layer:
Responsible for sending an alert to the relevant stakeholders, populating dashboards and showing visualizations, or even the edge taking an action to immediately shut down a faulty machine or control an actuator. Again, the actionable insight takes place on the edge, so this is not a physical boundary.
Edge Computing Architecture: Summary
In summary, looking at the whole architecture logically: there is a data source, which is the original endpoint from where the data is acquired or the origin of the data.
Then there is an intelligence layer where the constant training and inferencing takes place.
Then there is an insight layer where you actually visualize the outcome from the intelligence and also perform actions based on those insights. That is one way of visualizing edge computing.
Lecture Summary
○ Edge makes the cloud distributed
○ Edge mimics the public cloud platform capabilities and moves cloud services closer to the data source
○ Edge reduces the latency by avoiding the round-trip and brings in data sovereignty
● Building Blocks of Edge Computing
● Three-tier architecture of Edge Computing
○ Data Source
○ Intelligence
○ Actionable Insight
Introduction to Cloud
Dr. Rajiv Misra
Professor, Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
On completion of this Lecture you will get to know about the following:
● Different objectives of cloud
● Current limitations of traditional cloud
● Compute is going beyond VMs
● Storage is complemented by CDN
● Network stack is programmable
● The Web and Software-as-a-Service
● Infrastructure-as-a-Service
● High-Availability cloud
Current State of Today's Cloud: Highly Centralized in Client-Server Architecture
● The cloud was essentially virtual machines that were running in a remote data center (or storage).
● This highly centralized architecture closely resembles 90s client-server computing.
● For example, the cloud (the remote data center or remote infrastructure) exposed by Amazon, Microsoft, Google, IBM and others is the server, and the machine from which you are connecting to it and consuming the cloud resources is the client.
Current State of Today's Cloud: Compute is going beyond VMs
● Although the cloud resembles 90s client-server computing, at the same time compute has gone beyond VMs; the first generation of cloud was all about virtual machines.
● You could programmatically launch a VM, SSH into it, take control of the virtual machine and install the software.
● But there is a dramatic shift in compute where VMs are slowly getting replaced by containers.
● More and more workloads are moving towards containers.
Current State of Today's Cloud: Storage is complemented by CDN
● Another important trend across almost all public clouds is in their storage offerings.
● Object storage is complemented by a content delivery network today.
● Whenever you put an object in a bucket or a container of the public cloud storage, you can click a check box to replicate and cache the data across multiple edge locations. But this edge is not the edge that we are talking about; this is the content delivery network, where frequently accessed content is cached in a set of PoP (point of presence) or edge locations.
Current State of Today's Cloud: Network stack is programmable
● Finally, the network has become extremely programmable today.
● If you look at hybrid cloud and multi-cloud scenarios, how network traffic is getting routed and how load balancers, firewalls and a variety of network components are configured, it is through APIs and programmability.
● The same capability of SDN is enabling hybrid scenarios, particularly when we look at the combination of software-defined networking with some of the emerging networking technologies like service mesh; they are opening up additional avenues. Some very recent trends like Google's Anthos, IBM Cloud Private and other container-based hybrid cloud platforms are heavily relying on the programmable network stack and also a combination of SDN with service mesh.
● This is the current state of the cloud, and these trends represent how the cloud is currently being consumed or how it is delivered to customers, but the cloud is going through a huge transformation.
Multiple waves of innovation in Cloud: PaaS to IoT
● Initially the cloud was all about compute, storage and network resources: a globally available, highly centralized set of resources. Because the cloud made compute and storage extremely cheap and affordable, a lot of industrial customers and enterprises started connecting devices to the cloud.
● The data that was earlier not persisted, aggregated or acquired is now streamed to the cloud, because it is extremely cheap to store data in the cloud.
● So a lot of companies and industrial environments started to take advantage of the cloud by streaming the data coming from a variety of sensors and devices.
● They also use the cheaper compute power to process those data streams and make sense out of the raw data generated by these sensors and devices. That was the next big shift in the cloud: IoT PaaS.
Challenges for IoT PaaS
● If you look at Azure IoT, Google Cloud IoT and AWS IoT Core, all of them essentially give you a mechanism, a platform, to connect devices and store and process data in the cloud. But it was not sufficient to address a lot of scenarios, even while the cloud enabled capabilities like Big Data and IoT.
● A lot of customers were not ready to move the data to the cloud; that is one challenge.
● The second one is that the round trip from the devices to the cloud and back to the devices was too long, and it was increasing the latency in a lot of mission-critical industrial IoT scenarios.
● Sending the data to the cloud and waiting for the cloud to process it and send the results back was just not feasible, so there had to be a mechanism where data could be processed locally and compute comes much closer to the devices or the sources of data. That is how IoT led to edge computing, and today almost every mainstream enterprise IoT platform has a complementary, associated edge offering. More recently there has been a lot of focus on artificial intelligence.
Cloud for AI-ML
● Today's cloud has become the logical destination for training and running artificial intelligence and machine learning models.
● Due to accelerators like GPUs and FPGAs, it has become extremely cheap and also powerful to train very complex, very sophisticated ML and AI models.
● But in most scenarios a model that is trained in the cloud is going to be run in an offline environment.
● For example, you might have trained an artificial intelligence model that can identify the make and model of a car and automatically charge the toll fee for that vehicle when it passes through the toll gate. Now, since the toll gates are on highways and freeways with very little connectivity and almost no network access, you need to run this model in an offline scenario.
● So edge computing became the boundary for running these cloud-trained AI models, but running in an offline mode within the edge. That is how we are basically looking at the evolution of the cloud and the waves of innovation.
● So the cloud is a distributed, or rather decentralized, platform for aggregating, storing and processing data with high-performance computing; IoT brought all the devices to the cloud; with IoT data, the edge made the cloud decentralized by bringing compute closer to the data source; and now it is AI that is actually driving the next wave, where the cloud is becoming the de facto platform for training the models and the edge is becoming the de facto platform for running the models. One is called training, the other one is called inferencing.
Limitations of current cloud system
● AI use cases need real-time responses from the devices they are monitoring.
● Cloud-based inference cannot provide this real-time response due to inherent issues with latency.
● If edge devices have connectivity issues or no internet connection, the system cannot perform well.
● Having sufficient bandwidth to transfer the relevant amount of data in a proper time frame can also be an issue.
Evolution of Cloud
Cloud: Virtual machines running in a remote data center, or storage that was offered in a remote data center.
Edge: Data processed locally, and compute comes much closer to the devices or the sources of data.
● Highly centralized in client-server architecture
● Compute is going beyond VMs
● Storage is complemented by CDN
● Network stack is programmable
● Multiple waves of innovation in Cloud
● Challenges for IoT PaaS
● Evolution of Cloud towards Edge Computing
Introduction to IoT Platform
Dr. Rajiv Misra
Professor, Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
After completion of this lecture you will know the following:
● IoT platform building blocks which are provided by different cloud providers such as Microsoft, Amazon, Google, etc.
Architectural approach for IoT platform
IoT applications have three components. Things or devices send data or events
that are used to generate insights. Insights are used to generate actions to help
improve a business or process.
The equipment or things in a manufacturing plant send various types of data as
they operate. An example is a milling machine sending feed rate and temperature
data. This data is used to evaluate whether the machine is running or not, an
PT
insight. The insight is used to optimize the plant, an action.
N
Introduction to IoT platform
[Diagram: IoT platform reference architecture on Azure. Things (Azure Sphere, Azure IoT Device SDK, Edge) connect to Cloud IoT services (IoT Central, IoT Hub, IoT Hub DPS, Azure Digital Twins). Data then flows through three paths: the Hot Path (real-time processing with Stream Analytics, Event Hub, Functions, Synapse, Databricks), a small-batch warm path (Functions, Data Lake, Data Factory, Synapse, Databricks, Azure DBaaS) and the Cold Path (batch processing with Data Lake, Data Factory, Synapse, Databricks, Azure DBaaS). Results reach the Presentation layer (Azure DBaaS, Power BI, Synapse, Azure App Services) and are used by Consumers (real-time consumers, data integration).]
Introduction to IoT platform: Things
Everything in the IoT space starts with the "things" side of the Internet of Things.
When you talk to people about IoT, people probably think about Nest doorbells, SimpliSafe appliances, and different kinds of things that you can use in your house to make it smart.
All of these things are part of an IoT network, so that is very familiar to most people, and it is the sensors that go into making a device work.
On Azure there are a couple of things that you can use to create these things.
One is Azure Sphere, which is like a lightweight operating system that you can put on a device and use as an embedded system; it allows you to create devices, connect them back up to Azure, and secure the device using that specialized operating system.
There is also the Azure IoT SDK, which is a specialized SDK for interacting with some of these other services. It can be embedded on many different systems and supports a lot of different languages as well.
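A minimal hedged sketch of sending telemetry with the Azure IoT Device SDK for Python is shown below; the connection string and payload fields are placeholders, not values from the lecture.

```python
# Minimal sketch: a device sending one telemetry message to Azure IoT Hub
# using the azure-iot-device SDK (pip install azure-iot-device).
# The connection string is a placeholder from the device's IoT Hub registration.
import json

from azure.iot.device import IoTHubDeviceClient, Message

CONNECTION_STRING = "HostName=<your-hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
client.connect()

payload = {"temperature": 22.4, "humidity": 41.0}
msg = Message(json.dumps(payload))
msg.content_type = "application/json"
msg.content_encoding = "utf-8"

client.send_message(msg)   # telemetry lands on the IoT Hub ingestion endpoint
client.shutdown()
```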
Introduction to IoT platform: Cloud IoT
In the Internet of Things you also have the IoT stack that typically exists in many different IoT deployments.
Basically, with the IoT stack you are going to be managing devices and also brokering messages between devices and the cloud.
This is a management suite that allows you to scale devices, and it provides a lot of services for that: you can provision devices, take devices offline, provide security for devices, and it also provides a messaging infrastructure so that you can send commands to devices and receive telemetry back from devices.
All of these endpoints and all of this management infrastructure are encapsulated in a couple of different services: IoT Central, IoT Hub, IoT Hub DPS, and Azure Digital Twins.
Introduction to IoT platform: Cloud IoT
Azure IoT Central is more of a software-as-a-service offering that encapsulates a lot of the functionality for creating applications in the context of IoT Central; it allows multi-tenancy with different devices, scaling not only the devices but also the downstream components that integrate with those devices.
IoT Hub is a general-purpose tool on Azure for managing devices; it has device provisioning services that are needed for scaling up devices, putting certificates on devices, generating those certificates, and for messaging from a device and to a device, i.e. it is low level and more functionally oriented.
Azure Digital Twins: digital twinning is the ability to manage device configuration in a suite of software that integrates with Azure IoT Hub and maintains some kind of state information about devices in the cloud.
Introduction to IoT platform: Hot Path
The data is routed to one of three different paths: the hot path, the cold path or the warm path.
Hot path data is data that is processed in real time; as it comes off of the IoT hub, it gets processed within seconds. The message hits the hot path, it is processed, and then it is presented to something in the consumption layer.
The consumption layer is able to consume that data immediately once it has been processed in the hot path.
You could also write the output from a hot path to a cold storage system that is consumed by something like an API. The data is written in real time, but the API might be querying data that was written an hour ago.
Event Hubs can also write messages to a cold storage that can be consumed by the cold path or warm path, but whatever you get out of Event Hubs can be wired up to all these other kinds of processors, such as Stream Analytics, which is a platform-as-a-service offering that uses SQL to transform, aggregate and enrich data.
Then you have Functions, which can be triggered by Event Hubs. There is also Azure Synapse, which gives you a full suite of tools at your disposal that do all kinds of related things.
You can also use Kafka, which comes out of the Apache space and is similar to Stream Analytics in that you do real-time data processing, but it is more specific in its implementation; it wires up directly to Event Hubs.
Databricks is typically used for more batch-oriented workloads, but you can use it for streaming as well.
Combining any number of these can do a lot of different kinds of hot-path aggregations, transformations, queries and filters; they are all different tools that provide very similar functionality within the Azure context.
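As a hedged sketch of the hot path, the Azure Function below is triggered by Event Hubs and applies a small transformation to each event as it arrives. The binding name and payload fields are assumptions, and the function.json binding configuration is omitted.

```python
# Minimal sketch: hot-path processing with an Event Hubs-triggered Azure Function.
# Assumes the Event Hub binding is named "event" in function.json (not shown).
import json
import logging

import azure.functions as func

def main(event: func.EventHubEvent):
    # Each message is processed within seconds of arriving on the hub.
    body = json.loads(event.get_body().decode("utf-8"))
    body["temperature_c"] = round((body["temperature_f"] - 32) * 5 / 9, 2)
    logging.info("hot path processed: %s", body)
    # In a real pipeline the result would be written to a downstream store or API.
```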
Introduction to IoT platform: Cold Path
The cold path is more batch-oriented. The hot path will process a message as it hits the system, while the cold path processes messages as they accumulate on the system. Rather than being triggered by the message itself, it allows data to be accumulated over a period of time; then, typically on a timer-based trigger, it takes whatever data has been accumulated and processes that data in batch.
Then it writes the data back to some kind of cold storage, whatever the processing on that data might look like.
As opposed to the hot path, where you have something like Event Hubs delivering a message to a processor, what you typically do in this case is land the message, as it comes off of IoT Hub, into some kind of cold storage; that is typically some kind of database or data retention system.
That could be something like a data lake, which is basically built on top of blob storage; you can also do it with blob storage directly, but fundamentally a data lake is built on top of blob storage in any case.
Introduction to IoT platform: Cold Path
You can also use an Azure database-as-a-service offering: SQL databases, Cosmos DB, Postgres or MySQL, putting the data into some kind of data storage platform.
Then, once the data has accumulated in that cold storage, the trigger fires and launches whatever processing capability is going to be part of that; that is where something like Data Factory, Azure Synapse or Databricks comes in.
Data Factory is a software-as-a-service or platform-as-a-service offering that gives the ability to visually build workflows; those workflows can take data out of a data lake or database, process it in batches, and write the results back to some kind of output.
Synapse has similar functionality, but it is integrated with the Synapse suite on Azure.
Databricks has the ability to scale, and it also integrates with a lot of other offerings on Azure including the databases, data lakes and many of these other similar things; Data Factory is more of a visual designer for building those kinds of workflows.
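The cold-path idea of "accumulate, then process on a timer" can be sketched with pandas: read everything that has landed in cold storage for a day, aggregate it, and write the result back. The file paths and column names are illustrative assumptions.

```python
# Minimal sketch: cold-path batch processing of accumulated telemetry.
# Assumes raw messages landed as JSON-lines files under ./cold-storage/2024-01-01/
# with columns device_id, ts and temperature_c (all illustrative).
import glob

import pandas as pd

frames = [pd.read_json(path, lines=True) for path in glob.glob("cold-storage/2024-01-01/*.jsonl")]
raw = pd.concat(frames, ignore_index=True)

# Batch aggregation: hourly average temperature per device.
raw["hour"] = pd.to_datetime(raw["ts"], unit="s").dt.floor("h")
summary = raw.groupby(["device_id", "hour"])["temperature_c"].mean().reset_index()

# Write the processed result back to cold storage for the presentation layer.
summary.to_parquet("cold-storage/processed/hourly_avg_2024-01-01.parquet")
```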
Introduction to IoT platform: Warm Path
Between the hot path and the cold path is the warm path. It has some functionality that might seem similar to the hot path and some that might seem similar to the cold path.
Tools that are more in line with the warm path include Data Lake, Data Factory, Synapse, Databricks and Azure Functions.
You can even use something like Stream Analytics or Kafka for some smaller workloads.
The distinction between hot path, warm path and cold path really isn't clear-cut. The takeaway is that the hot path is real-time; the warm path is more often smaller workloads operating on smaller time scales like 5 minutes, 10 minutes, 15 minutes or an hour; and the cold path is larger workloads operating over longer periods of time; it might be five minutes if there is a lot of data, or it could be an hour, a day or a week.
Introduction to IoT platform: Presentation
The data that was collected by way of things that originated in the IoT layer is going to be aggregated, plus some enhancement and some filtering of that data. This is going to be things like APIs that are consumed by applications, and reports that people are going to be looking at.
That could be some kind of dashboard-like report where you are looking at telemetry in real time, or querying telemetry out of a data set, or it could just be the raw data itself that you provide by way of some kind of data integration, where you take an export of the data into another data system for consumption in that system.
Regardless of what the presentation of that data is, it is basically the output of the data pipelines that you are employing, either as hot path, warm path or cold path, and the presentation layer can then take that data and make it available.
Introduction to IoT platform: Presentation
So this is going to imply things like security, access controls and those kinds of things, and also a database-as-a-service offering; anything that would store the data, which could be SQL Server, Cosmos DB, MariaDB, or Azure Data Explorer. There are a lot of different ways to present data.
Then you have the reporting services such as Power BI, which is the one tool a lot of folks love to use for building dashboards in the Microsoft context; it can hook up to all kinds of data sources, import those, and use data sets that are manipulated inside the Power BI context itself.
You can use Azure Functions and Azure App Services for serving up APIs; Azure Functions gives you the ability to create HTTP endpoints that can query back into whatever database or other data sources you want to use.
Then there is Azure App Services, if you want to write an application that exposes some kind of data API that external applications can then consume from that data source.
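A hedged sketch of such an API endpoint built with an HTTP-triggered Azure Function is shown below; in a real deployment the handler would query the database of your choice, whereas here the query is stubbed out and the route and fields are illustrative.

```python
# Minimal sketch: serving processed IoT data through an HTTP-triggered Azure Function.
# The query against a real database is stubbed out; fields are illustrative.
import json

import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    device_id = req.params.get("device_id", "unknown")

    # Placeholder for a query against Azure DBaaS / Data Explorer / Cosmos DB.
    latest = {"device_id": device_id, "temperature_c": 22.4, "status": "ok"}

    return func.HttpResponse(
        json.dumps(latest),
        mimetype="application/json",
        status_code=200,
    )
```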
Introduction to IoT platform: Consumers
Now we have consumers; this is not so much an explicit part of the system as an implicit one.
Ultimately, what ends up in the presentation layer is going to be determined by what the external consumers of this data want to be in that presentation layer.
So whenever you are designing a system that is going to be presenting data, you start with the API in mind and work back from that to the source data; that is really why we have set it up this way.
The reason we included it is that you need to be cognizantly aware of how you want this data to show up in whatever is going to be integrated with it, whether it be a report, an API, or some kind of external data integration.
Introduction to IoT platform: Edge
On the edge of a network, a local area network, there are a bunch of devices that are emitting telemetry and events and doing all those kinds of things that they do, and those are ultimately sent back to the cloud.
However, in some cases you might want to put some kind of preprocessor in place that will do some filtering, aggregation and other enhancements on the data closer to where the devices are.
So in a sense the edge is almost a microcosm of everything that happens in the cloud.
You will have things like message buses, data pipelines and other kinds of data enhancement tools that exist in that context for the purpose of pre-processing the data before it goes over to the cloud side.
There are two services in this space on the edge; the first one is IoT Edge.
IoT Edge is a platform, more of an operating system, that you can install on an appliance, and it is based around Docker containers. You can do things like stream analytics in that context; it also gives you the ability to do message filtering and a number of other things that are part of that ecosystem. You can also take the code that you want to run and install it by way of a Docker container on IoT Edge.
Introduction to IoT platform: Edge
It also offers a message proxy for sending messages from devices to the cloud so that you can queue those messages up on IoT Edge.
In the event of an internet outage, you can queue those messages up there, and when the internet is restored it will forward them on to the cloud, so it mitigates against things like message loss.
There is local response to events in that context as well, so you can build ML and other kinds of event management into IoT Edge. It can quickly respond to something like a fire; for instance, if a device reports that there is a fire, you can have a command issued by IoT Edge to put that fire out.
Data Box Edge is a similar service, but it is not as purpose-built as IoT Edge; it basically brings a lot more of the ML-type workloads that you get in something like ML workspaces.
These kinds of things are brought to the edge as well, so it can do data ingestion and apply ML models.
In the context of an edge installation, rather than having to ship all that data back up to the cloud, you can do it more intelligently on the edge and do it more quickly, so that you don't have to rely on an internet connection and the latency that the cloud introduces.
Summary: IoT platform
[Diagram: Generic IoT platform blueprint. On-premises: devices connected to an edge gateway. Public cloud: device registry and data ingestion endpoints feed message routing policies, which split data between batch processing and stream analytics; results are persisted in storage & databases and consumed by machine learning and business intelligence.]
IoT platform: Things
On one end of the platform we have devices, and these devices are either sensors or actuators.
Sensors generate data, for example a temperature sensor, a humidity sensor, a pressure sensor and so on. The generated data is going to be acquired and ingested into the cloud.
Then there are actuators like switches and bulbs. These are the things that you can switch on and off that have an electromechanical interface.
The devices are further connected to the edge, and the edge acts as a gateway abstracting the devices that are at the lowest level of the spectrum; the edge is what actually connects to the public cloud.
IoT platform: Insight
Now on the cloud side we have two touch points for the edge or the devices.
One is the device registry, which is primarily used for onboarding the devices and is the repository of devices.
Every device that is connected to the IoT platform has an identity within the device registry. The authentication, authorization and metadata of the devices are stored in the device registry.
The public cloud PaaS, also called IoT PaaS, exposes a data ingestion endpoint. This is the high-velocity, high-throughput endpoint where the sensor data gets streamed.
Typically it could be Kafka if you are doing it yourself, or it could be Azure Event Hubs, Amazon Kinesis or Google Cloud Pub/Sub; it is the pipe that basically acquires the data and passes it on to the data processing pipeline.
IoT platform: Insight
Now both the device registry and the data ingestion endpoint are connected to a message routing policy.
A message routing policy defines how this data is going to be split between real-time processing and batch processing, and how the raw data and the processed data are going to be stored.
This is the place where you actually create a rules engine, or basically some kind of policy, that is going to define how the data flows.
For example, some data needs to be batch processed, where you first collect and then process; in some cases you need to perform real-time stream analytics.
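A toy sketch of such a routing policy is shown below: each incoming message is matched against simple rules that decide whether it goes to the stream (hot) path, the batch (cold) path, or both. The rule conditions and destination names are illustrative, not a specific vendor's API.

```python
# Minimal sketch: a rules-engine style message routing policy.
# Conditions and destinations are illustrative assumptions.
def route(message):
    """Return the set of destinations a message should be sent to."""
    destinations = {"cold_storage"}          # raw data is always persisted for batch processing
    if message.get("type") == "alarm" or message.get("temperature_c", 0) > 80:
        destinations.add("stream_analytics") # urgent data also goes to the hot path
    return destinations

print(route({"type": "telemetry", "temperature_c": 25}))  # {'cold_storage'}
print(route({"type": "alarm", "temperature_c": 95}))      # {'cold_storage', 'stream_analytics'}
```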
IoT platform: Insight
The batch processing layer is also called cold path analytics, and the stream analytics layer is also called hot path analytics.
In other words, when you are performing queries on data as it comes, that is called hot path analytics, and if you are storing and processing the data over a period of time, it is called cold path analytics.
Now the raw data which is going, or about to go, to either batch processing or stream analytics is first persisted in a time-series database or an unstructured database, and even the output from the batch processing and stream analytics gets persisted in the same database.
So we have storage and databases for persisting the raw sensor data and also the processed data.
IoT platform: actions
Now from the same data store we apply machine learning algorithms, basically to perform anomaly detection and predictive analytics on the data that is coming in.
Finally, all of that is fed into an enterprise Business Intelligence platform, where you can run dashboards and alerts, and the entire visualization happens on the data warehousing or business intelligence layer.
These are the key building blocks of an IoT platform; you could map this to Azure, AWS, Google, GE Predix, Bosch IoT, etc.
Every platform has a very similar architecture; it is almost like a blueprint for any public cloud-based IoT platform.
Lecture Summary
● Details of the components of IoT architecture.
● Concepts of IoT platform building blocks.
Time and Clock Synchronization in IoT
You want to catch a bus at 9.05 am, but your watch is off by 15
minutes
What if your watch is Late by 15 minutes?
• You’ll miss the bus!
What if your watch is Fast by 15 minutes?
• You’ll end up unfairly waiting for a longer time than you
intended
Distributed Time
The notion of time is well defined (and measurable) at
each single location
But the relationship between time at different locations
is unclear
[Diagram: a process P asks a time server S for the time; S checks its local clock to find time t and sends t back to P, which sets its clock to t.]
What's Wrong:
By the time the message has been received at P, time has moved on. Setting P's time to t is inaccurate.
The inaccuracy is a function of message latencies.
Since latencies are unbounded in an asynchronous system, the inaccuracy cannot be bounded.
[Diagram: NTP server hierarchy - primary servers at the top, then secondary servers, tertiary servers, and clients at the bottom.]
NTP Protocol
[Diagram: NTP message exchange between a client and server, accounting for the one-way delay.]
Example 1: Happens-Before
[Diagram: timeline of processes P1 (events A-E), P2 (events E-G) and P3 (events H-J) with instructions/steps and messages between them.]
Happens-before relations in this example:
• H → G
• F → J
• H → J
• C → J
Lamport timestamps
Goal: Assign logical (Lamport) timestamp to each event
Timestamps obey causality
Rules
Each process uses a local counter (clock) which is an integer
• initial value of counter is zero
A process increments its counter when a send or an
instruction happens at it. The counter is assigned to the
event as its timestamp.
A send (message) event carries its timestamp
For a receive (message) event, the counter is updated to max(local clock, message timestamp) + 1.
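The rules above translate directly into a small Lamport clock class; this is a generic sketch, not code from the lecture.

```python
# Minimal sketch: a Lamport logical clock implementing the rules above.
class LamportClock:
    def __init__(self):
        self.counter = 0                      # initial value of the counter is zero

    def local_event(self):
        self.counter += 1                     # instruction or step: increment
        return self.counter

    def send_event(self):
        self.counter += 1                     # send: increment; the message carries this timestamp
        return self.counter

    def receive_event(self, msg_timestamp):
        # receive: counter = max(local, message timestamp) + 1
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

# Example: a process with local counter 0 receives a message carrying ts = 1,
# giving max(0, 1) + 1 = 2, as in the worked example that follows.
sender, receiver = LamportClock(), LamportClock()
ts = sender.send_event()        # sender's message carries ts = 1
print(receiver.receive_event(ts))  # 2
```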
Lamport Timestamps
[Diagrams: step-by-step assignment of Lamport timestamps on three processes P1, P2, P3.
- All counters (clocks) start at 0.
- Each instruction or message send increments the local counter; a send carries its timestamp (e.g. a message sent with ts = 1).
- Each receive sets the counter to max(local, msg) + 1; e.g. max(0, 1) + 1 = 2, max(2, 2) + 1 = 3, max(3, 4) + 1 = 5.
- Final timestamps: P1 events get 1, 2, 3, 5, 6; P2 events get 2, 3, 4; P3 events get 1, 2, 7.]
Obeying Causality
[Diagram: the same timeline with Lamport timestamps - P1: A=1, B=2, C=3, D=5, E=6; P2: E=2, F=3, G=4; P3: H=1, I=2, J=7.]
• A → B :: 1 < 2
• B → F :: 2 < 3
• A → F :: 1 < 3
Obeying Causality (2)
• H → G :: 1 < 4
• F → J :: 3 < 7
• H → J :: 1 < 7
• C → J :: 3 < 7
Not always implying Causality
• Does C → F hold? :: 3 = 3
• Does H → C hold? :: 1 < 3
• (C, F) and (H, C) are pairs of concurrent events
Concurrent Events
A pair of concurrent events doesn’t have a causal path
from one event to another (either way, in the pair)
Lamport timestamps are not guaranteed to be ordered or unequal for concurrent events.
This is OK, since concurrent events are not causally related!
Remember:
[Diagram: the same three-process timeline with events A-E on P1, E-G on P2 and H-J on P3.]
Vector Timestamps
[Diagrams: step-by-step assignment of vector timestamps on the same three-process example. Each process starts with the vector (0,0,0); each event increments the process's own component, and a receive takes the element-wise maximum of the local and message vectors before incrementing. Final vectors - P1: (1,0,0), (2,0,0), (3,0,0), (4,3,1), (5,3,1); P2: (0,1,1), (2,2,1), (2,3,1); P3: (0,0,1), (0,0,2), (5,3,3).]
Causality is obeyed:
• H → G :: (0,0,1) < (2,3,1)
• F → J :: (2,2,1) < (5,3,3)
• H → J :: (0,0,1) < (5,3,3)
• C → J :: (3,0,0) < (5,3,3)
Identifying Concurrent Events
[Diagram: the full example with vector timestamps - P1: A=(1,0,0), B=(2,0,0), C=(3,0,0), D=(4,3,1), E=(5,3,1); P2: E=(0,1,1), F=(2,2,1), G=(2,3,1); P3: H=(0,0,1), I=(0,0,2), J=(5,3,3). Two events are concurrent exactly when neither vector is less than or equal to the other, e.g. C=(3,0,0) and F=(2,2,1).]
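A hedged sketch of vector clocks and the concurrency test is given below; the process count and event values come from the example above, and the comparison rule used is the standard one (a happens before b iff a is component-wise less than or equal to b and differs from b).

```python
# Minimal sketch: vector clocks and identifying concurrent events.
class VectorClock:
    def __init__(self, n_processes, pid):
        self.v = [0] * n_processes
        self.pid = pid

    def local_event(self):
        self.v[self.pid] += 1
        return tuple(self.v)

    def send_event(self):
        self.v[self.pid] += 1                 # the outgoing message carries a copy of the vector
        return tuple(self.v)

    def receive_event(self, msg_vector):
        # element-wise max with the message vector, then increment own component
        self.v = [max(a, b) for a, b in zip(self.v, msg_vector)]
        self.v[self.pid] += 1
        return tuple(self.v)

def happens_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    return not happens_before(a, b) and not happens_before(b, a)

# Using vectors from the example: C = (3,0,0), F = (2,2,1), J = (5,3,3).
print(happens_before((3, 0, 0), (5, 3, 3)))   # True: C -> J
print(concurrent((3, 0, 0), (2, 2, 1)))       # True: C and F are concurrent
```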
Dr. Rajiv Misra
Professor, Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
After completion of this lecture you will know the following:
● How edge ML addresses the issues of the IoT platform
● Workflow of edge ML
● Advantages and applications of edge ML
Recapitulate: Internet of Things
Recapitulate: Traditional IoT platform
Once the edge device is connected to the Azure IoT Hub service, different custom code modules are developed: one to capture incoming data and send it to the custom vision module, another one to manage, control and get the score out to display the output of the model, and the last one is a custom vision model which is used to provide the insight.
Recapitulate: Limitation of traditional IoT platform
Poor internet connection: when the internet is down, the system fails. For example, if a smart fire alarm system can only detect fires when the internet connection is up, then it fails in performing its task.
Data gravity: IoT devices create lots of data, which demands finding the insights locally on the device rather than shipping all of the data to the cloud. For example, for a smart doorbell, you don't want to stream video to the cloud 24/7 just to identify faces for the two minutes that someone is in front of your door; you would rather do that locally on the smart doorbell.
Real-time responses: true real-time responses (as opposed to near real-time) cannot be achieved by sending data to the cloud, finding insights, and then sending the actions back down.
ML on cloud Vs ML on Edge
Remote monitoring and control
Once the IoT device fetches the workload description from the cloud, i.e. whenever the device receives its deployment manifest from the IoT Hub service, it understands that it should go fetch those two containers, i.e. action and things.
Enabling Intelligence at Edge layer for IOT
To manage the increasing amount of data generated by devices and sensors, most of the business logic is now applied at the edge instead of the cloud, to achieve low latency and faster response times for IoT devices using machine learning at the edge.
● Most of the business logic is now deployed at the edge layer instead of the cloud to ensure low latency and faster response time.
● Only a subset of the data generated by sensors is sent to the cloud, after aggregating and filtering the data at the edge.
To bridge the gap between the cloud and the edge, innovations in chip design offer purpose-built accelerators that speed up model inferencing significantly. Chip manufacturers such as Qualcomm, NVIDIA and ARM have launched specialized chips that speed up the execution of ML-enabled applications.
These modern processors and GPUs assist the CPU of the edge devices by taking over the complex mathematical calculations needed for running deep learning models, accelerating the inference process.
This results in faster prediction, detection and classification of the data ingested at the edge layer.
Solutions like the Microsoft Azure IoT Edge runtime and the Qualcomm Neural Processing SDK for ML make it possible to take models trained in the cloud and run hardware-accelerated inference at the intelligent edge.
Training: Involves the use of a deep-learning framework (e.g., TensorFlow) and a training dataset. IoT data provides a source of training data that data scientists and engineers can use to train machine learning models for a variety of use cases, from failure detection to consumer intelligence.
The following steps describe a general machine learning workflow. This lecture session is not intended as an in-depth introduction to machine learning; we intend to illustrate the process of creating and using a viable model for IoT data processing.
ML on edge IoT: Collect training data
The process begins by collecting training data. In some cases, data has already been collected and is available in a database or in the form of data files.
In other cases, especially for IoT scenarios, the data needs to be collected from IoT devices and sensors and stored in the cloud.
ML on edge IoT: Prepare data & Experiment
In most cases, the raw data as collected from devices and sensors will require preparation for machine learning.
This step may involve data clean-up, data reformatting, or preprocessing to inject additional information machine learning can key off.
Data preparation also involves calculating an explicit label for every data point in the sample, based on the actual observations in the data.
This information allows the machine learning algorithm to find correlations between actual sensor data patterns and the expected results. This step is highly domain-specific.
ML on edge IoT: Build a machine learning model
Based on the prepared data, we can now experiment with different machine learning algorithms and parameterizations to train models and compare the results to one another.
In this case, for testing, we compare the predicted outcome computed by the model with the real outcome observed for an IoT application.
In Azure Machine Learning, this can be done across the different iterations of models that are created in a model registry.
ML on edge IoT: Deploy the machine learning model
Once a suitable model has been selected, it is deployed to the edge. As described in the Edge ML Platform steps later in this lecture, the model and its supporting modules are packaged as containers, pushed to a container registry, and referenced in a deployment manifest so that the IoT Edge runtime on the device can download and run them locally.
ML on edge IoT: Maintain and refine the model
Our work is not done once the model is deployed. In many cases, we want to continue collecting data and periodically upload that data to the cloud.
We can then use this data to retrain and refine our model, which we can then redeploy to IoT Edge.
Edge ML Platform (SaaS)
Once the edge device is connected to the Azure IoT Hub service, two custom code modules are developed: one to capture incoming data and send it to the custom vision module, and another one to manage, control and get the score out to display the output of the model; the last module is a custom vision model which is used to provide the insight.
Edge ML Platform: Insight-Container
Step 1: Package the data transform, insight and
action into containers.
Now, write those three modules and package them as Docker containers.
Edge ML Platform: Docker
Step 2: Push the containers to a container registry.
Push all those Docker containers into a container registry.
Edge ML Platform: Cloud IOT-HUB
Step 3: Define a workload description in the cloud.
Then, write a deployment manifest, also called the workload description, that deploys those three modules.
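Below is a heavily simplified, illustrative sketch of what such a workload description can look like, expressed as a Python dict. The module names and image URIs are placeholders, and the real Azure IoT Edge deployment manifest schema has additional required sections; consult the Azure documentation for the exact format.

```python
# Illustrative, simplified workload description (deployment manifest) for three modules.
# Not the full Azure IoT Edge schema; names and image URIs are placeholders.
deployment_manifest = {
    "modulesContent": {
        "$edgeAgent": {
            "properties.desired": {
                "modules": {
                    "dataTransform": {
                        "type": "docker",
                        "status": "running",
                        "settings": {"image": "myregistry.azurecr.io/data-transform:1.0"},
                    },
                    "insight": {
                        "type": "docker",
                        "status": "running",
                        "settings": {"image": "myregistry.azurecr.io/custom-vision:1.0"},
                    },
                    "action": {
                        "type": "docker",
                        "status": "running",
                        "settings": {"image": "myregistry.azurecr.io/action:1.0"},
                    },
                }
            }
        }
    }
}
```

The manifest is then applied to a device, for example with the Azure CLI command `az iot edge set-modules` (exact flags depend on your CLI and extension versions).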
Edge ML Platform: Edge Run-time Manifestation
Step 4: Target an IoT Edge runtime on the edge device.
The edge device runs the IoT Edge runtime, which is hooked up to a specific IoT Hub instance.
Edge ML Platform: Migrating Workload
Step 5: Send the workload description to the target IoT Edge runtime on the edge device.
Whenever the device receives its deployment manifest from the IoT Hub service, it understands that it should go fetch those three containers.
Edge ML Platform: Enabling Edge
Step 6: The target IoT Edge runtime downloads the correct workload from the cloud via the container registry, starts it up, and runs it on the edge device.
Azure IoT Hub
Azure IoT Hub allows for bi-directional communication between the cloud and IoT devices.
It also allows developers to take advantage of this information to provide insights and monitoring, and to develop custom solutions for their IoT platform.
Azure IoT Hub: key characteristics
Managed service for bi-directional communication: it is a managed service for bi-directional communication between the cloud and IoT devices.
Platform as a Service (PaaS): it is a platform-as-a-service offering in Azure for IoT development.
Highly secure, scalable and reliable: it is a highly secure, scalable and reliable service for IoT devices.
Integrates with lots of Azure services: it integrates perfectly with a lot of Azure services.
Programmable SDK for popular languages: developers do not need to learn any new language to take advantage of IoT Hub for their development purposes.
Multiple protocols: it supports multiple common standards on the market when it comes to communication protocols.
IoT-Edge: key characteristics
Example solution: The Camera Capture Module handles scanning items using a camera. It then calls the Image Classification module to identify the item; a call is then made to the "Text to Speech" module to convert the item label to speech, and the name of the scanned item is played on the attached speaker.
The Image Classification Module runs a TensorFlow machine learning model that has been trained with images of fruit. It handles classifying the scanned items.
The Text to Speech Module converts the name of the scanned item from text to speech using Azure Speech Services.
Azure IoT Hub (free tier) is used for managing, deploying, and reporting on Azure IoT Edge devices running the solution.
Azure Speech Services (free tier) is used to generate very natural speech telling the shopper what they have just scanned.
The Azure Custom Vision service was used to build the fruit model used for image classification.
Services offered by IoT Edge:
Azure IoT Edge in Action
Azure IoT edge: Functionalities
● Target workloads at the correct type of device
○ Once the workload description is sent down to the edge, the runtime will download the correct workload from the cloud and get it up and running.
● Create workloads which can include high-value ML
○ This results in the custom code, machine learning model, and business logic all running locally, independent of the cloud connection, and delivers all of the value of edge analytics.
● Run those workloads locally, in a disconnected manner
○ The runtime is smart enough to detect if the workload is trying to send messages to the cloud while it doesn't have an internet connection; the runtime will catch those messages and sync them with the cloud once the internet is up.
● Monitor the health of the workloads
○ Azure IoT Edge ensures that the workloads continue to run and that status reports are sent back to the cloud. Reporting the status back to the cloud allows you to understand if there are any issues in the deployment and take preventive actions.
Advantages of Edge ML
Reduced latency: Transfer of data back & forth from the cloud takes time. Edge ML reduces
latency by processing data locally (at the device level).
Real-time analytics: Real-time analytics is a major advantage of Edge Computing. Edge ML
brings high-performance computing capabilities to the edge, where sensors and IoT devices are
located.
Reduced bandwidth requirement: Edge ML processes the data locally on the device itself,
reducing the cost of internet bandwidth and cloud storage.
Improved data security: Edge ML systems perform the majority of data processing locally i.e. on
the edge device itself. This greatly reduces the amount of data that is sent to the cloud and other
external locations.
Advantages of Edge ML
Scalability: Edge ML typically processes large amounts of data. If you have to process video or image data from many different sources simultaneously, transferring the data to a cloud service is not required.
Improved reliability: Higher levels of security combined with greater speed produce greater reliability for an Edge ML system.
Reduced cost: ML processing works on the edge device itself, so it is highly cost-efficient because only processed, required or valuable data is sent to the cloud.
Reduced power: Edge ML processes data at the device level, so it saves energy costs.
Applications of Edge ML
Manufacturing: rapid collection and analysis of data produced by edge-based devices and sensors.
Energy (Oil and Gas): real-time analytics with information processing in remote locations.
Industrial IoT: inspection of devices/machines via ML algorithms, instead of human beings performing manual inspections, can save time and money.
Autonomous Vehicles: fast data processing that takes milliseconds to perform, which can reduce collisions.
Healthcare: processing all patient monitoring device data locally, like glucose monitors, cardiac trackers, blood pressure sensors, etc.
Smart Homes: data movement time can be reduced and sensitive information can be processed only on the edge.
Lecture Summary
● Limitations of IoT platform
● How edge ML addresses the issues of the IoT platform
● Workflow of edge ML
● Advantages and applications of edge ML
ML-based Image Classifier at IoT-Edge
Dr. Rajiv Misra
Professor, Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
After completion of this lecture you will know the following:
● Different techniques of computer vision like image classification, detection, segmentation, etc.
● Object detection models like R-CNN, Fast R-CNN, Faster R-CNN, SSD, YOLO
● Azure computer vision as SaaS
Computer Vision: Introduction
Computer vision is a sub-branch of machine learning which deals with giving computers the ability to see, interpret and extract information from images and videos; videos can be seen as a collection of images.
How Computer Vision Works?
To train a computer vision model, you essentially feed it thousands of images of cats, and it does some complex mathematics, feature extraction, etc. in the background.
Based on that, it learns some key understanding or properties that define cats.
(Figure: thousands of cat images as input → feature extraction → output)
Computer Vision: Data Analytics View
With every machine learning model, the model itself is not the only important part.
The fundamental factor that determines how good your model is, is the data you feed it.
So another point we want to focus on is the fact that your model is only as good as your data.
One of the key things we focus on in today's lecture is how to make sure that our data is good when we are building computer vision models.
Computer Vision: Techniques
Computer vision deals with all the problems related to images and videos. There are many techniques, fundamentals, or problems that can be tackled with computer vision. A few of them include image classification, image detection, image segmentation, pattern detection, and object localisation. Object detection and image classification are the two things we are going to talk about today, and we will go into building a predictive model for object detection.
Computer Vision: Architecture
A typical end-to-end pipeline of computer vision architecture is shown below.
The input data is the images that you want to train your model on; those images might come from a lot of different sources.
Computer Vision Architecture: Pre-processing
The second and important step is pre-processing of the input data.
Machine learning depends on standardization, meaning you need to pre-process the input images to make sure they are all of the same size.
There might be some noise in the images; all of that needs to be dealt with before the images are fed into the model.
If pre-processing is not done correctly, the model might learn noise or other irrelevant features, which means your model will be fundamentally flawed.
Therefore, pre-processing is a very important step.
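As a small illustration of this step (not part of the original lecture), the following Python sketch resizes images to a common size and scales pixel values; the folder layout, file extension, and target size of 224x224 are assumptions:

```python
from pathlib import Path

import numpy as np
from PIL import Image

TARGET_SIZE = (224, 224)  # assumed input size expected by the downstream model

def preprocess_image(path: Path) -> np.ndarray:
    """Load an image, force RGB, resize it, and scale pixels to the range [0, 1]."""
    img = Image.open(path).convert("RGB")
    img = img.resize(TARGET_SIZE)
    return np.asarray(img, dtype=np.float32) / 255.0

def load_dataset(folder: str) -> np.ndarray:
    """Preprocess every .jpg in a folder into one (N, 224, 224, 3) array."""
    images = [preprocess_image(p) for p in sorted(Path(folder).glob("*.jpg"))]
    return np.stack(images)
```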
Computer Vision Architecture: Data labeling
The third step is labeling your data.
For an object detection problem, you have images with different objects; let's say you have an image with a cat and a dog, you would need to label the specific parts of the image where there is a cat and a dog.
These labels or tags mark the specific areas where there is a dog or a cat; this is essentially labeling, and it needs to be done as well.
Computer Vision Architecture: Feature extraction and prediction
Then the feature extraction and prediction parts are performed by a machine learning model.
These are part of model training: the model learns which features are present and extracts them.
If the features extracted from the images are relevant, they are learned along with the patterns, and later the model uses them to build a set of rules for itself; these rules are used to predict the output.
Computer Vision: Object Detection Models
The field of object detection is not as new as it may seem. In fact, object detection has evolved over the past 20 years. Popular deep learning algorithms that achieved remarkable results in this domain are:
● RCNN
● Fast RCNN
● Faster RCNN
● YOLO
● SSD (Single Shot Detector)
Object Detection Model: RCNN
R-CNN, or Region-based Convolutional Neural
Network, consisted of 3 simple steps:
1. Scan the input image for possible objects using
an algorithm called Selective Search,
generating ~2000 region proposals
2. Run a convolutional neural net (CNN) on top of
each of these region proposals
3. Take the output of each CNN and feed it into
a) an SVM to classify the region and b) a linear
regressor to tighten the bounding box of the
object, if such an object exists.
In other words, we first propose regions, then extract features, and then classify those regions based on their features. In essence, we have turned object detection into an image classification problem. R-CNN was very intuitive, but very slow.
Object Detection Model: Fast RCNN
As we can see from the image, we are now generating region proposals based on the last feature map of the network, not from the original image itself. As a result, we can train just one CNN for the entire image.
In addition, instead of training many different SVMs to classify each object class, there is a single softmax layer that outputs the class probabilities directly. Now we only have one neural net to train, as opposed to one neural net and many SVMs.
Fast R-CNN performed much better in terms of speed. There was just one big bottleneck remaining: the selective search algorithm for generating region proposals.
Object Detection Model: Faster RCNN
The main insight of Faster R-CNN was to replace the slow selective search algorithm with a fast neural net. Specifically, it introduced the region proposal network (RPN).
● A small sliding window moves across the feature map of the initial CNN and maps each window location to a lower dimension (e.g. 256-d)
● For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes)
● Each region proposal consists of a) an “objectness” score for that region and b) 4 coordinates representing the bounding box of the region
Object Detection Model: SSD
SSD stands for Single-Shot Detector. Like R-FCN, it provides enormous speed gains over Faster R-CNN, but does so in a markedly different manner.
Our first two models performed region proposals and region classifications in two separate steps. First, they used a region proposal network to generate regions of interest; next, they used either fully-connected layers or position-sensitive convolutional layers to classify those regions. SSD does the two in a “single shot,” simultaneously predicting the bounding box and the class as it processes the image.
Given an input image and a set of ground truth labels, SSD does the following:
● Pass the image through a series of convolutional layers, yielding several sets of feature maps at different scales (e.g. 10x10, then 6x6, then 3x3, etc.)
● For each location in each of these feature maps, use a 3x3 convolutional filter to evaluate a small set of default bounding boxes. These default bounding boxes are essentially equivalent to Faster R-CNN’s anchor boxes.
● For each box, simultaneously predict a) the bounding box offset and b) the class probabilities
● During training, match the ground truth box with these predicted boxes based on IoU. The best predicted box will be labeled a “positive,” along with all other boxes that have an IoU with the truth > 0.5.
Object Detection Model: YOLOv3
You Only Look Once, more popularly known as YOLO, is one of the fastest real-time object detection algorithms (45 frames per second) as compared to the R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN, etc.)
The R-CNN family of algorithms uses regions to localise the objects in images, which means the model is applied to multiple regions and high-scoring regions of the image are considered as detected objects.
Instead of selecting some regions, YOLO approaches the object detection problem in a completely different way. It forwards the entire image through the neural network only once to predict bounding boxes and their probabilities.
The authors have also improved the network by making it bigger and taking it towards residual networks by adding shortcut connections.
Object Detection Model: YOLOv3
First, it divides the image into a 13×13 grid of cells. The size of these 169 cells varies depending on the input size. For a 416×416 input size, the cell size is 32×32. Each cell is responsible for predicting a number of boxes in the image.
For each bounding box, the network also predicts the confidence that the bounding box actually encloses an object, and the probability of the enclosed object being a particular class.
Most of these bounding boxes are eliminated because their confidence is low or because they enclose the same object as another bounding box with a very high confidence score. This technique is called non-maximum suppression.
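A minimal sketch of non-maximum suppression in Python follows (greedy and class-agnostic, with boxes given as (x1, y1, x2, y2); this illustrates the idea and is not YOLOv3's actual implementation):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop lower-scoring boxes that overlap them."""
    order = list(np.argsort(scores)[::-1])   # indices sorted by confidence, best first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```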
Object Detection Models: Performance Metric
An overview of the most popular metrics used to compare performance of different deep learning models:
Intersection Over Union (IOU)
Intersection Over Union (IOU) is a measure based on the Jaccard Index that evaluates the overlap between two bounding boxes. IOU is given by the overlapping area between the predicted bounding box and the ground truth bounding box divided by the area of union between them:
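In symbols, with $B_p$ the predicted box and $B_{gt}$ the ground truth box:

$$\text{IoU} = \frac{\text{area}(B_p \cap B_{gt})}{\text{area}(B_p \cup B_{gt})}$$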
Object Detection Models: Performance Metric
Precision:
Precision is the ability of a model to identify only the relevant objects. It is the percentage of correct positive predictions.
Recall:
Recall is the ability of a model to find all the relevant cases (all ground truth bounding boxes). It is the percentage of true positives detected among all relevant ground truths.
True Positive (TP): a correct detection, i.e. a detection with IOU ≥ threshold
False Positive (FP): a wrong detection, i.e. a detection with IOU < threshold
False Negative (FN): a ground truth not detected
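Using the TP/FP/FN definitions above, the two metrics are:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$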
Computer Vision: SaaS Architecture
A computer vision architecture can easily be taken up by a cloud service that runs a computer vision model in the cloud.
SaaS (software as a service) offerings of this kind are available from providers like Azure, Amazon AWS, and Google Cloud; all of them offer some variation of a computer vision service.
In that architecture, all you need to do is upload your images and tag them. Tagging is vital because you, as the domain expert, know what information is present in the images.
Once you've uploaded them to the cloud, the model training and everything else is completely dependent on, and fully managed by, the cloud service provider.
Computer Vision: SaaS Architecture
It's extremely easy to scale up your dataset, and you can download the models you've built so they can later be used offline.
The SaaS architectures provided by different cloud providers offer similar services.
There might be differences in the UI, in how you upload images, or in the API and how you call the services, but under the hood they're doing the same thing.
SaaS: Azure Custom Vision
Azure Custom Vision is a cloud service used to build and deploy computer vision models.
Custom Vision uses an interesting neural network technique called transfer learning, which applies knowledge gained from solving one problem to a different but related situation. This can substantially decrease the time needed for creating the models.
● Build an image classification model with Custom Vision.
● Develop an IoT Edge module that queries the Custom Vision web server on the device.
● Send the results of the image classifier to IoT Hub.
Use Case: Creating an image recognition solution with
Azure IoT Edge and Azure Cognitive Services
Although there are lots of applications for image recognition, we have chosen this application, which is a solution for vision-impaired people scanning fruit and vegetables at a self-service checkout.
Required Components
Raspberry Pi 3B or better, USB Camera, and a Speaker.
Note: the solution will run on a Raspberry Pi 3A+; it has enough processing power, but the device is limited to 512MB RAM. A Raspberry Pi 3B+ has 1GB of RAM and is faster than the older 3B model. Azure IoT Edge requires an ARM32v7 or better processor. It will not run on the ARM32v6 processor found in the Raspberry Pi Zero.
Desktop Linux - such as Ubuntu 18.04
This solution requires USB camera pass-through into a Docker container as well as Azure IoT Edge support. So for now, that is Linux.
Guide for installing Raspberry Pi
Set up Raspbian Stretch Lite on Raspberry Pi: be sure to configure the correct Country Code in your wpa_supplicant.conf file.
Azure subscription: if you don't already have an Azure account, sign up for a free Azure account. If you are a student, sign up for an Azure for Students account; no credit card is required.
Create an Azure IoT Hub and an Azure IoT Edge device: install the Azure IoT Edge runtime on the Raspberry Pi and download the deployment configuration file that describes the Azure IoT Edge modules and routes for this solution. Open the [Link] link and save the [Link] in a known location on your computer.
Install Azure CLI and Azure CLI command line tools: with the CLI, open a command line console/terminal and change directory to the location where you saved the [Link] file.
Deploy IoT Edge to the device: the modules will now start to deploy to your Raspberry Pi; the Raspberry Pi green activity LED will flicker until the deployment completes. Approximately 1.5 GB of Docker modules will be downloaded and decompressed on the Raspberry Pi. This is a one-off operation.
Considerations and constraints for the solution
The solution should scale from a Raspberry Pi (running Raspbian Linux) on ARM32v7, to the desktop development environment, to an industrial-capable IoT Edge device such as those found in the Certified IoT Edge Catalog.
The camera capture module needs Docker USB device pass-through (not supported by Docker on Windows), so that, plus targeting the Raspberry Pi, means we need to target Azure IoT Edge on Linux.
To mirror the target devices, Docker support for the USB webcam is required, so the solution is developed on an Ubuntu 18.04 developer desktop.
Create Classification model using Azure Custom Vision
The Azure Custom Vision service is a simple way to create an image classification machine learning model without having to be a data science or machine learning expert.
You simply upload multiple collections of labelled images. For example, you could upload a collection of banana images and label them as ‘banana’.
It is important to have a good variety of labelled images to improve your classifier.
Create Custom Vision Classification model
1. Create a project in the Custom Vision service, specifying the project type, classification type, and domains.
2. Gather initial data (images) and separate them into different folders.
3. Once the data is uploaded, train your model by clicking the “Train” button on the navigation bar.
4. When the training has ended, the performance metrics will be shown. Click on the “i” bubble to see the meaning of each performance metric.
Create Custom Vision Classification model
5. Custom Vision offers flexible adjustment of the prediction thresholds to improve model performance. In our case we prefer higher Recall over high Precision. It is important not to lower those thresholds too much, as the model performance will suffer significantly. E.g., a low probability threshold will lead to an increased number of false positives. If the model is supposed to be deployed in a production setting, we can’t be stopping the production line for every false positive detection produced by the model.
For the problem we are working on right now, we decided to set our KPIs as follows:
● The main metric to optimize for is mAP – it cannot be any lower than 85%
● Recall and Precision are equally important, and both should stay above 80%
Export Custom Vision Classification model
Step 1: From the Performance tab of your Custom Vision project click Export.
Export Custom Vision Classification model
Step 2: Select Dockerfile from the list of available options
Export Custom Vision Classification model
Step 3: Then select the Linux version of the Dockerfile.
Step 4: Download the Dockerfile package and unzip it; you now have a ready-made Docker solution with a Python Flask REST API. This is how the Azure IoT Edge image classification module in this solution was created.
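The exported Flask REST API can be exercised from another module or a test script by POSTing a captured frame to it. The port and route below are assumptions (check the exported app for the exact values), so this is only a sketch:

```python
import requests

CLASSIFIER_URL = "http://localhost:80/image"  # assumed endpoint of the local classifier container

def classify(image_path: str) -> dict:
    """Send one captured frame to the local image classifier and return its JSON predictions."""
    with open(image_path, "rb") as f:
        response = requests.post(
            CLASSIFIER_URL,
            data=f.read(),
            headers={"Content-Type": "application/octet-stream"},
            timeout=10,
        )
    response.raise_for_status()
    return response.json()   # e.g. a list of tag names with probabilities

if __name__ == "__main__":
    print(classify("frame.jpg"))
```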
Installing the solution
Step 1: Clone the repository for creating an image recognition solution with Azure IoT Edge and Azure Cognitive Services.
Step 2: Install the Azure IoT Edge runtime on your Linux desktop or device (e.g. Raspberry Pi).
Step 4: With Visual Studio Code, open the IoT Edge solution you cloned to your developer desktop.
Building the Solution
Step 1: Push the image to a local Docker repository, specifying localhost.
Step 2: Confirm the processor architecture using Visual Studio Code.
Building the Solution
Step 3: Build and push the solution to Docker by right-clicking the [Link] file and selecting “Build and Push IoT Edge Solution”.
Deploying the Solution
When the Docker build and push process has completed, select the Azure IoT Hub device you want to deploy the solution to. Right-click the [Link] file found in the config folder and select the target device from the drop-down list.
Monitoring the Solution on the IoT Edge Device
Once the solution has been deployed, you can monitor it on the IoT Edge device itself using the iotedge list command.
Monitoring the Solution on the IoT Edge Device
You can also monitor the state of the Azure IoT Edge module from the Azure IoT Hub blade on the
Azure Portal.
Monitoring the Solution on the IoT Edge Device
Click on the device from the Azure IoT Edge blade to view more details about the modules running
on the device.
Lecture Summary
● Computer Vision
○ Introduction
○ How it works?
○ Techniques
○ Architecture
● Object detection models
○ RCNN
○ Fast-RCNN
○ Faster-RCNN
○ SSD
○ YOLO
● Azure Custom Vision as SaaS
● Use case
Introduction to Docker Containers and Kubernetes
Dr. Rajiv Misra
Professor, Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
After completion of this lecture you will know the following:
● Introduction to Kubernetes
○ Containers
○ Orchestration
● Concepts of Docker
● Power of Kubernetes to deploy software on edge devices
Introduction to Kubernetes
Kubernetes is the Greek word for helmsman or captain of a ship.
It is now an open source project and is one of the best and most popular container orchestration technologies out there.
As applications grow to span multiple containers deployed across multiple servers, operating them becomes more complex.
To manage this complexity, Kubernetes provides an open source API that controls how and where those containers will run.
Containers are completely isolated environments that can have their own processes, networking interfaces, and mounts, similar to virtual machines, except for the fact that they all share the same operating system kernel.
Orchestration consists of a set of tools and scripts that can help host containers in a production environment. An orchestrated environment consists of multiple container hosts that can host containers; if one fails, the application is still accessible through the others.
Introduction to Kubernetes
Kubernetes consists of one computer that gets designated as the control plane, and lots of other computers that get designated as worker nodes. Each of these has a complex but robust stack making orchestration possible.
Kubernetes also automatically manages service discovery, incorporates load balancing, tracks resource allocation, and scales based on compute utilisation. It also checks the health of individual resources and enables apps to self-heal by automatically restarting or replicating containers.
Now let's get familiar with each of the Kubernetes components:
● Control plane components
● Worker node components
Kubernetes: Control plane Components
Etcd:
Etcd is a fast, distributed, and consistent key-value store used as a backing store for persistently storing Kubernetes object data such as pods, replication controllers, secrets, and services.
Etcd is the only place where Kubernetes stores cluster state and metadata. The only component that talks to etcd directly is the Kubernetes API server. All other components read and write data to etcd indirectly through the API server.
Etcd also implements a watch feature, which provides an event-based interface for asynchronously monitoring changes to keys. Once you change a key, its watchers get notified. The API server component heavily relies on this to get notified and move the current state of etcd towards the desired state.
Kubernetes: Control plane Components
API Server:
The API server is the only component in Kubernetes that directly interacts with etcd. All other components in Kubernetes must go through the API server to work with the cluster state, including the clients (kubectl). The API server has the following functions:
● Performs validation of those objects so clients can't store improperly configured objects.
Controller Manager:
In Kubernetes, controllers are control loops that watch the state of your cluster, then make or request changes where needed. Controllers track at least one Kubernetes resource type, and these objects have a spec field that represents the desired state.
Controller examples:
● Node controller
● Service controller
● Endpoints controller
● Namespace controller
● Deployment controller
● StatefulSet controller
Kubernetes: Control plane Components
Scheduler:
The Scheduler is a control plane process that assigns pods to nodes. It watches for newly created pods that have no node assigned.
For every pod that the Scheduler discovers, the Scheduler becomes responsible for finding the best node for that pod to run on.
Kubernetes: Worker node components
Kubelet:
Kubelet is an agent that runs on each node in the cluster and is responsible for everything running on a worker node.
It ensures that the containers run in the pod.
● Registers the node it's running on by creating a node resource in the API server.
● Continuously monitors the API server for pods that got scheduled to the node.
● Starts the pod's containers by using the configured container runtime.
● Continuously monitors running containers and reports their status, events, and resource consumption to the API server.
● Runs the container liveness probes, restarts containers when the probes fail, and terminates containers when their pod gets deleted from the API server (notifying the server about the pod termination).
Kubernetes: Worker node components
Service proxy (kube-proxy):
The service proxy (kube-proxy) runs on each node and ensures that one pod can talk to another pod, one node can talk to another node, and one container can talk to another container.
It is responsible for watching the API server for changes on services and pod definitions to ensure that the entire network configuration is kept up to date.
When a service gets backed by more than one pod, the proxy performs load balancing across those pods.
Kubernetes: Worker node components
Container runtime:
There are two categories of container runtimes:
Lower-level container runtimes: these focus on running containers and setting up the namespace and cgroups for containers.
Higher-level container runtimes (container engines): these focus on formats, unpacking, management, and sharing of images, and on providing APIs for developers.
Docker is an open platform for developing, shipping, and running
applications.
Docker enables you to separate your applications from your
infrastructure so you can deliver software quickly.
With Docker, you can manage your infrastructure in the same ways you
manage your applications.
By taking advantage of Docker’s methodologies for shipping, testing,
and deploying code quickly, you can significantly reduce the delay
between writing code and running it in production.
Docker provides the ability to package and run an application in a
loosely isolated environment called a container.
Docker Architecture
Docker uses a client-server architecture.
The Docker client talks to the Docker
daemon, which does the heavy lifting of
building, running, and distributing your
Docker containers.
The Docker client and daemon can run on
the same system, or you can connect a
Docker client to a remote Docker daemon.
The Docker client and daemon communicate
using a REST API, over UNIX sockets or a
network interface.
The Docker daemon listens for Docker API requests and
manages Docker objects such as images, containers,
networks, and volumes. A daemon can also communicate
with other daemons to manage Docker services.
The Docker client is the primary way that many Docker
users interact with Docker. When you use commands
such as docker run, the client sends these commands to
dockerd, which carries them out. The docker command
uses the Docker API. The Docker client can communicate
with more than one daemon.
Docker Architecture: Components
Docker registries:
A Docker registry stores Docker images. Docker Hub is a public
registry that anyone can use, and Docker is configured to look
for images on Docker Hub by default. You can even run your
own private registry.
Docker objects:
When you use Docker, you are creating and using images,
containers, networks, volumes, plugins, and other objects.
Docker Desktop:
Docker Desktop includes the Docker daemon, the Docker client,
Docker Compose, Docker Content Trust, Kubernetes, and
Credential Helper. For more information, see Docker Desktop.
Power of Kubernetes to deploy software on edge devices
The architecture diagram shows the workflow from the cloud, through the Virtual Kubelet and the edge provider, down to all of your edge devices.
First, the Virtual Kubelet project lets you create a virtual node in your Kubernetes cluster. A virtual node is not a VM like most other nodes in the Kubernetes cluster; instead, it is an abstraction of a Kubernetes node provided by the Virtual Kubelet.
Backing it is an IoT Hub; Kubernetes can schedule workloads to the virtual node and treat it like any other Kubernetes node.
Power of Kubernetes to deploy software on edge devices
When workloads are scheduled to this virtual node, the edge provider comes in, as depicted.
The edge connector or edge provider, working in tandem with the Virtual Kubelet, takes the workload specification that comes in from Kubernetes and converts it into an IoT Edge deployment.
Lecture Summary
● Introduction to Kubernetes
○ Containers
○ Orchestration
● Concepts of Docker
● Power of Kubernetes to deploy software on edge devices
ML based Predictive Maintenance at IoT Edge
Dr. Rajiv Misra
Professor, Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
Lecture Overview
● In this lecture, we combine the Machine Learning (ML) and IoT together.
● The primary objective of this lecture is to introduce the processing of IoT data with machine
learning, specifically on the edge.
● While we touch many aspects of a general machine learning workflow, this lecture is not intended
as an in-depth introduction to machine learning
● We do not attempt to create a highly optimized model for the use case, it just illustrates the
process of creating and using a viable model for IoT data processing.
ML Development at IoT Edge
Machine Learning: Background
Artificial intelligence (A.I.) is defined as the property of machines that mimic
human intelligence as characterized by behaviours such as cognitive ability,
memory, learning, and decision making.
Machine learning is a branch of artificial intelligence (AI) and computer
science which focuses on the use of data and algorithms to imitate the way
that humans learn, gradually improving its accuracy.
Deep learning can ingest unstructured data in its raw form (e.g., text or
images), and it can automatically determine the set of features which
distinguish different categories of data from one another.
Example: Train a model for the motor vibration with two sensors namely
A and B, in normal operating conditions. That means, using normal data
points, model has good understanding of what the motor vibration value
could be approximately when the motor is operating in normal mode and
without any problems.
Now, let’s say, one day at a random point in time, model observes that
the value of sensor A is 8, and at the same time, the value for sensor B is
2. This is clearly an unusual value. The trained model can easily say that
this new value is not normal and can indicate that there might be
something wrong with the motor. This is how machine learning works to
detect the unusual behavior of a machine.
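A toy Python sketch of this idea follows (it is not the lecture's actual model; the "normal" readings are simulated and the 4-sigma threshold is an assumption):

```python
import numpy as np

# Simulated "normal" vibration readings for sensors A and B (one column each).
normal = np.random.normal(loc=[5.0, 5.0], scale=0.3, size=(1000, 2))

mu = normal.mean(axis=0)      # what the model "learns" about normal operation
sigma = normal.std(axis=0)

def is_anomalous(reading, k=4.0):
    """Flag a reading whose sensors deviate more than k standard deviations from normal."""
    z = np.abs((np.asarray(reading) - mu) / sigma)
    return bool(np.any(z > k))

print(is_anomalous([5.1, 4.9]))   # a typical reading -> False
print(is_anomalous([8.0, 2.0]))   # the unusual (A=8, B=2) reading from the example -> True
```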
Predictive Maintenance: Introduction
In the past, companies have used reactive maintenance, which focused on repairing an asset once failures had occurred.
This was followed by preventive maintenance, which aimed at reducing failures by replacing parts based on worst-case lifetimes for critical pieces of manufacturing tooling.
Next came condition-based maintenance methods, which repair or replace equipment when it begins to show signs of failure. However, this condition-based method requires an experienced maintenance team to inspect the equipment at regular intervals.
With the explosion of computers and sensors, companies are now engaging in machine-led condition-based maintenance to reduce costs while improving the uptime of factories. Predictive maintenance takes condition-based maintenance a step further. In this methodology, machine learning analytics are used to predict a machine's failure early by examining real-time sensor data and detecting changes in machine health status.
Predictive Maintenance: Introduction
Predictive maintenance employs advanced analytics on the machine data collected from end sensor nodes to draw meaningful insights that more accurately predict machine failures. It comprises three steps: sense, compute, and act.
Predictive Maintenance: Introduction
Data is collected from sensors that are already available
in machines or by adding new sensors, or by using
control inputs.
Depending upon the machine types and the required
failure analysis, different sensor signals, such as
temperature, sound, magnetic field, current, voltage,
ultrasonic, vibration are analyzed to predict the failure.
The predicted information from sensor data analysis is
used to generate an event, work order, and notification.
Predictive maintenance analysis can answer questions such as:
● What is the remaining useful life or the time to failure?
● Detecting anomalies in equipment behavior.
● With further analysis, it can provide failure classification.
● Optimizing equipment settings.
Machine Learning Workflow: Predictive Maintenance
A six-step process: define, prepare, analyse, model, deploy, and monitor.
Machine Learning Workflow: Define
The first step is to define the problem and the outcome. This includes the motivation behind creating a prediction and the intended goals and outcomes.
After this, you can decide how to tackle the task at hand, including which software to use at each stage.
For example, we might use Excel for data preparation, R for the analysis and modelling, and Power BI for deployment.
Machine Learning Workflow: Prepare
It is essential to properly prepare and clean the data that will be
used to create the prediction.
Data cleansing might involve removing duplicated or inaccurate
records, or dealing with missing data points or outliers.
In the case of a predictive maintenance project, the data will take
the form of a time series.
Depending on what is being predicted, the observations might be
daily, weekly, monthly, quarterly or yearly.
Machine Learning Workflow: Analyse
Once the data has been prepared, the next step is to analyse it.
For a time series, this involves decomposing the series into its
constituent parts. These include trend and seasonal effects.
The trend is the long-term overall pattern of the data and is not
necessarily linear.
Seasonality is a recurring pattern of a fixed length which is
caused by seasonal factors.
Machine Learning Workflow: Modelling
Predictions are created by combining the trend and seasonality components.
There are functions that can do this for you in Excel, or it can be done by hand in a
statistical package like R.
If modelling manually, refine the weightings of each component to produce a more
accurate model.
The model can be edited to account for any special factors that need to be included.
However, be careful to avoid introducing bias into the prediction and making it less
accurate.
Whether using Excel or R, your model will include prediction intervals (or confidence
intervals). These show the level of uncertainty in the prediction at each future point.
Machine Learning Workflow: Deploy
Once you are happy with your model, it's time to deploy it and make the
predictions live.
This means that decision makers within the business or organisation can utilise
and benefit from your predictions.
Deployment may take the form of a visualisation, a performance dashboard, a
graphic or table in a report, or a web application.
You may wish to include the prediction intervals calculated in the previous step.
These show the user the limits within which each future value can be expected to fall if your model is correct.
Machine Learning Workflow: Monitor
After the prediction goes live, it is important to monitor its performance.
A common way of doing this is to calculate the accuracy using an error
measurement statistic.
Popular measures include the mean absolute percentage error (MAPE)
and the mean absolute deviation (MAD).
Depending on what is being predicted, it may be possible to update your
model as new data becomes available.
This should also lead to a more accurate prediction of future values.
Machine Learning Methods: Predictive Maintenance
Problem definition: classification and regression approaches
– Classification: Is the machine going to fail within a given time window?
– Regression: After how long will it fail?
• Methods:
– Traditional machine learning:
● Decision trees: random forests, gradient boosting trees, isolation forest
● SVM (Support Vector Machines)
Deep learning has shown strong performance in domains such as computer vision and natural language processing. It has also gained popularity in domains such as finance where time-series data plays an important role.
Predictive Maintenance is also a domain where data is collected over time to monitor
the state of an asset with the goal of finding patterns to predict failures which can also
benefit from certain deep learning algorithms.
Among the deep learning methods, Long Short Term Memory (LSTM) networks are
especially appealing to the predictive maintenance domain due to the fact that they are
very good at learning from sequences.
This fact lends itself to their applications using time series data by making it possible to
look back for longer periods of time to detect failure patterns.
Deep Learning Methods: Multilayer Perceptrons (MLPs)
Generally, neural networks like Multilayer Perceptrons or MLPs provide capabilities that are offered
by few algorithms, such as:
● Robust to Noise. Neural networks are robust to noise in input data and in the mapping function
and can even support learning and prediction in the presence of missing values.
● Nonlinear. Neural networks do not make strong assumptions about the mapping function and
readily learn linear and nonlinear relationships.
● Multivariate Inputs. An arbitrary number of input features can be specified, providing direct support for multivariate forecasting.
● Multi-step Forecasts. An arbitrary number of output values can be specified, providing direct support for multi-step and even multivariate forecasting.
N
For these capabilities alone, feedforward neural networks may be useful for time series forecasting.
Deep Learning Methods: Convolutional Neural Networks (CNNs)
Convolutional Neural Networks or CNNs are a type of neural network that was designed to efficiently
handle image data.
The ability of CNNs to learn and automatically extract features from raw input data can be applied to
time series forecasting problems. A sequence of observations can be treated like a one-dimensional
image that a CNN model can read and distill into the most salient elements.
● Feature Learning. Automatic identification, extraction and distillation of salient features from raw input data that pertain directly to the prediction problem that is being modeled.
CNNs get the benefits of Multilayer Perceptrons for time series forecasting, namely support for
multivariate input, multivariate output and learning arbitrary but complex functional relationships, but
do not require that the model learn directly from lag observations. Instead, the model can learn a
representation from a large input sequence that is most relevant for the prediction problem.
Deep Learning Methods: Long Short-Term Memory Networks (LSTMs)
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN,
capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber
(1997), and were refined and popularized by different researchers.
LSTM add the explicit handling of order between observations when learning a mapping function
from inputs to outputs, not offered by MLPs or CNNs. They are a type of neural network that adds
native support for input data comprised of sequences of observations.
● Native Support for Sequences. Recurrent neural networks directly add support for input sequence data.
This capability of LSTMs has been used to great effect in complex natural language processing
problems such as neural machine translation where the model must learn the complex
interrelationships between words both within a given language and across languages in translating
from one language to another.
● Learned Temporal Dependence. The most relevant context of input observations to the
expected output is learned and can change dynamically.
Deep Learning Methods: Long Short-Term Memory Networks (LSTMs)
The model both learns a mapping from inputs to outputs and learns what context from the input
sequence is useful for the mapping, and can dynamically change this context as needed.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information
for long periods of time is practically their default behavior.
All recurrent neural networks have the form of a chain of repeating modules of neural network. In
standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
N
Deep Learning Methods: Long Short-Term Memory Networks (LSTMs)
LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of
having a single neural network layer, there are four, interacting in a very special way.
In the below diagram, each line carries an entire vector, from the output of one node to the inputs of
others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are
learned neural network layers. Lines merging denote concatenation, while a line forking denote its
content being copied and the copies going to different locations.
Deep Learning Methods: Long Short-Term Memory Networks (LSTMs)
An LSTM has three gates to protect and control the cell state. The first is called the Forget gate, the second is known as the Input gate, and the last one is the Output gate.
Forget Gate: The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at h_{t−1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t−1}.
A 1 represents “completely keep this” while a 0 represents “completely get rid of this.”
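In the standard notation (where $W_f$ and $b_f$ are the learned weights and bias of this layer), the forget gate is:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$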
Deep Learning Methods: Long Short-Term Memory Networks (LSTMs)
Input Gate: The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.
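In the same notation, the two parts of the input gate are:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$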
Deep Learning Methods: Long Short-Term Memory Networks (LSTMs)
It’s now time to update the old cell state, C_{t−1}, into the new cell state C_t. The previous steps already decided what to do; we just need to actually do it. We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t ∗ C̃_t. This is the new candidate values, scaled by how much we decided to update each state value.
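The cell state update therefore combines the two previous results:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$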
Deep Learning Methods: Long Short-Term Memory Networks (LSTMs)
Output gate: Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
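In the same notation, the output gate and the new hidden state are:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t * \tanh(C_t)$$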
Performance Metric: R-squared
The stationary R-squared is used in time series forecasting as a measure that compares the stationary part of the model to a simple mean model. SS_res denotes the sum of squared residuals from expected values and SS_tot denotes the sum of squared deviations from the dependent variable's sample mean. R² denotes the proportion of the dependent variable's variance that may be explained by the independent variable's variance. A high R² value shows that the model's variance is similar to that of the true values, whereas a low R² value suggests that the two values are not strongly related.
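With these quantities, the measure is:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$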
Performance Metric: Mean Absolute Error (MAE)
The MAE is defined as the average of the absolute differences between forecasted and true values, where y_i is the expected value, x_i is the actual value, and n is the total number of values in the test set.
The MAE shows us how much inaccuracy we should expect from the forecast on average. MAE = 0 means that the anticipated values are correct; the error statistic is in the original units of the forecasted values.
The lower the MAE value, the better the model; a value of zero indicates that the forecast is error-free. In other words, the model with the lowest MAE is deemed superior when comparing many models.
Performance Metric: Mean Absolute Percentage Error (MAPE)
MAPE is the average absolute difference between projected and true values divided by the true value. The anticipated value is F_t and the true value is A_t; n refers to the total number of values in the test set.
It works better with data that is free of zeros and extreme values because the true value appears in the denominator; the MAPE also takes an extreme value if this value is exceedingly tiny or huge.
Performance Metric: Mean Squared Error (MSE)
MSE is defined as the average of the squared errors. It is also known as the metric that evaluates the quality of a forecasting model or predictor. MSE also takes into account variance (the difference between anticipated values) and bias (the distance of a predicted value from its true value).
Here y'_i denotes the predicted value and y_i denotes the actual value, and n refers to the total number of values in the test set. MSE is almost always positive, and lower values are preferable. This measure penalizes large errors or outliers more than minor errors due to the square term.
Performance Metric: Root Mean Squared Error(RMSE)
This measure is defined as the square root of the mean squared error and is an extension of MSE. Here y'_i denotes the predicted value and y_i denotes the actual value, and n refers to the total number of values in the test set. This statistic, like MSE, penalizes greater errors more.
This statistic is likewise always positive, with lower values indicating higher performance. The RMSE is in the same unit as the projected value, which is an advantage of this technique. In comparison to MSE, this makes it easier to comprehend.
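The four error measures above can be computed directly from the forecasts; here is a small Python sketch (the function name and sample values are illustrative):

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """MAE, MAPE, MSE and RMSE for a forecast, following the definitions above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0   # assumes no zero true values
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    return {"MAE": mae, "MAPE": mape, "MSE": mse, "RMSE": rmse}

print(forecast_metrics([100, 110, 120], [98, 113, 121]))
```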
Use Case: Prognostics and Health Management
The objective of this use case is to build an LSTM model that can predict the number of remaining operational cycles before failure in the test set, i.e., the number of operational cycles after the last cycle that the engine will continue to operate. A vector of true Remaining Useful Life (RUL) values is also provided for the test data.
The data was generated using C-MAPSS, the commercial version of MAPSS (Modular Aero-Propulsion System Simulation) software. This software provides a flexible turbofan engine simulation environment to conveniently simulate the health, control, and engine parameters.
The simulated aircraft sensor values are used to predict two scenarios, so that maintenance can be planned in advance:
* Regression model: "Given this aircraft engine operation and failure event history, can we predict when an in-service engine will fail?"
* Binary classification: We re-formulate this question as "Is this engine going to fail within w1 cycles?"
(Figures: simplified diagram of the engine simulation; a layout showing the various modules and their connections as modeled in the simulation)
LSTM model: Dataset
Dataset consists of multiple multivariate time series, such data set is divided
into training and test subsets. Each time series is from a different engine.
The engine is operating normally at the start of each time series and
develops a fault at some point during the series.
In the training set, the fault grows in magnitude until system failure. In the
test set, the time series ends some time prior to system failure.
Public dataset (Nasa Turbo fan)
● Damage propagation for aircraft engine
● Run to failure simulation
Simplified diagram of engine simulation
Aircraft gas turbine. Dataset contains time series (cycles) for all
measurements of 100 different engines.
The training data consists of multiple multivariate time series with "cycle" as the time unit, together with 21 sensor readings for
each cycle.
Each time series can be assumed as being generated from a different engine of the same type.
The testing data has the same data schema as the training data. The only difference is that the data does not indicate when the
failure occurs.
Finally, the ground truth data provides the number of remaining working cycles for the engines in the testing data.
LSTM model: Data Preparation and Feature Engineering
The first step is to generate labels for the training data, which are Remaining Useful Life (RUL), label1, and label2.
Each row can be used as a model training sample where the s_k columns are the features and the RUL is the model target. The rows are treated as independent observations and the measurement trends from the previous cycles are ignored. The features are normalized to μ = 0, σ = 1 and PCA is applied.
For the LSTM model, we opt for more advanced feature engineering and choose to incorporate the trends from the previous cycles. In this case, each training sample consists of measurements at cycle i as well as i-5, i-10, i-20, i-30, i-40. The model input is a 3D tensor with shape (n, 6, 24) where n is the number of training samples, 6 is the number of cycles (timesteps), and 24 is the number of principal components (features).
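As an illustration of how such an (n, 6, 24) tensor can be built for one engine (the function and variable names are assumptions, not the lecture's code):

```python
import numpy as np

LOOKBACK = [40, 30, 20, 10, 5, 0]   # offsets from cycle i, oldest first

def build_sequences(features: np.ndarray, targets: np.ndarray):
    """Turn a (cycles, 24) feature matrix for one engine into LSTM inputs of shape (n, 6, 24).

    features: normalized, PCA-reduced sensor readings ordered by cycle.
    targets:  per-cycle label (e.g. RUL) aligned with `features`.
    """
    X, y = [], []
    for i in range(max(LOOKBACK), len(features)):
        X.append(features[[i - off for off in LOOKBACK]])   # cycles i-40, ..., i-5, i
        y.append(targets[i])
    return np.asarray(X), np.asarray(y)
```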
LSTM model: Modelling
When using LSTMs in the time-series domain, one important parameter to pick is the sequence length which is the
window for LSTMs to look back.
This may be viewed as similar to picking window_size = 5 cycles for calculating the rolling features which are rolling
mean and rolling standard deviation for 21 sensor values.
The idea of using LSTMs is to let the model extract abstract features out of the sequence of sensor values in the
window rather than engineering those manually. The expectation is that if there is a pattern in these sensor values
within the window prior to failure, the pattern should be encoded by the LSTM.
One critical advantage of LSTMs is their ability to remember from long-term sequences (window sizes) which is hard to
achieve by traditional feature engineering. For example, computing rolling averages over a window size of 50 cycles
may lead to loss of information due to smoothing and abstracting of values over such a long period, instead, using all
50 values as input may provide better results. While feature engineering over large window sizes may not make sense,
LSTMs are able to use larger window sizes and use all the information in the window as input.
LSTM model: Modelling
Let's first look at an example of the sensor values 50 cycles prior to the failure for engine id 3.
Dropout is also applied after each LSTM layer to control overfitting.
The final layer is a Dense output layer with a single unit, using sigmoid activation for the binary classification problem and linear activation for the regression problem.
(Figures: network for the binary classification problem; network for the regression problem)
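A minimal Keras sketch of such a network follows; the layer sizes (100 and 50 units) and dropout rate are assumptions, not the lecture's exact configuration:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_model(timesteps=6, n_features=24, task="classification"):
    """Stacked LSTM with dropout; sigmoid head for binary classification, linear for regression."""
    model = Sequential([
        LSTM(100, return_sequences=True, input_shape=(timesteps, n_features)),
        Dropout(0.2),
        LSTM(50),
        Dropout(0.2),
        Dense(1, activation="sigmoid" if task == "classification" else "linear"),
    ])
    loss = "binary_crossentropy" if task == "classification" else "mse"
    model.compile(optimizer="adam", loss=loss)
    return model
```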
LSTM model: Model Evaluation
Results of Regression problem:
LSTM model: Model Evaluation
Results of Binary Classification problem:
Azure Time Series Insights (PaaS): Predictive Maintenance
Azure Time Series Insights (TSI) is a cloud-based service offered by Azure that can be
used to ingest, model, query and visualize fast-moving time-series data generated by IoT
devices.
Azure Time Series Insights (PaaS): Predictive Maintenance
Real-time data in the form of a time-series can be generated by various devices like mobile devices,
sensors, satellites, medical devices etc.
Data from these devices can be fetched to the Azure environment using Azure IoT Hub. Azure IoT hub acts
as a data integration pipeline to connect to the source devices and then fetch data and deliver it to the TSI
platform.
Once the data is in the TSI, it can then be used for visualization purposes, and can be queried and
aggregated accordingly. Additionally, customers can also leverage existing analytics and machine learning
capabilities on top of the data available in TSI.
Data from TSI can be further processed using Azure Databricks and machine learning models can be
applied based on pre-trained models that will offer predictions in real-time. This is how an overall
architecture of Azure Time Series Insights can be enabled.
Azure Time Series Insights (PaaS): Components
Azure TSI provides the following four components using which users can consume data from varied data
sources as follows.
● Integration – TSI provides an easy integration from data generated using IoT devices by allowing
connection between the cloud data gateways such as Azure IoT Hub and Azure Event Hubs. Data from
these sources can be easily consumed in JSON structures, cleaned and then stored in a columnar
store
● Storage – Azure TSI also takes care of the data that is to be retained in the system for querying and
visualizing the data. By default, data is stored on SSDs for fast retrieval and has a data retention policy
of 400 days. This supports querying historic data for up to a period of 400 days
● Data Visualization – Once the data is fetched from the data sources and stored in the columnar stores,
it can be visualized in the form of line charts or heat maps. The visuals are provided out of the box by
Azure TSI and can be leveraged for easy visual analysis
● Query Service – Although visualizing the data will answer many questions, TSI also provides a query service with which you can integrate TSI into your custom applications.
Usually, a time series data is indexed by timestamps. Therefore, you can build your applications by using TSI
as a backend service for integrating and storing the data and using the client SDK for Azure TSI for building
the frontend and display visuals like line charts and heat maps.
Predictive Maintenance: Steps
Step 1: Sensor data are collected from edge devices and are forwarded to Azure IoT Hub.
Step 2: Azure IoT hub then drives these gathered data to the TSI platform and Stream Analytics.
Step 3: At TSI, data can be visualised, queried and aggregated with other services.
Step 4: Azure machine learning service provides the training of ML model or using a pretrained model on top of
the data available in TSI.
Step 5: Once the training is completed, inference is provided using Azure IoT Hub and the IoT Edge service.
Lecture Summary
● Understanding of predictive maintenance
● Machine learning models for predictive maintenance
● Use case of predictive maintenance using LSTM model
● Azure Time Series Insights
Deep Reinforcement Learning for Cloud-Edge
Dr. Rajiv Misra
Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
Deep Reinforcement Learning for Cloud-Edge
Preface
Content of this Lecture:
• In this lecture, we will discuss how Collaborative cloud-edge
approaches can provide better performance and efficiency
than traditional cloud or edge approaches.
• To understand how resource allocation strategies can be
tailored to specific use cases and can evolve over time
based on user demand and network conditions.
• Resource allocation is important for optimizing
system performance while ensuring efficient use of
resources.
• Collaborative cloud-edge approaches can be more effective than traditional approaches that focus solely on cloud or edge resources.
Cloud Services:
• Cloud services can be divided into private and public cloud.
• Private cloud is dedicated to a single organization and provides greater control and
security.
• Public cloud is shared by multiple organizations and provides more flexibility and
scalability.
Deep Reinforcement Learning for Cloud-Edge
The Collaborative Cloud-Edge Environment
Edge Nodes:
• Edge nodes are local computing resources that are
closer to the user than the cloud node.
• Edge nodes can provide low-latency, high-bandwidth
services to users and can offload some processing from
the cloud.
Resource Allocation Strategies:
• Cloud resource allocation strategies can be based on various factors, such as user demand, network conditions, and available resources.
• Collaborative cloud-edge approaches can use machine learning algorithms to optimize resource allocation over time.
• Load balancing, task offloading, and caching are some common resource allocation techniques that can be applied to both cloud and edge resources.
Multi-Edge-Node Scenario:
• In a multi-edge-node scenario, resource allocation becomes more complex, as the cloud and edge nodes must coordinate with each other to allocate resources effectively.
• Collaborative cloud-edge approaches can use communication protocols and data sharing
to enable effective coordination.
Deep Reinforcement Learning for Cloud-Edge
Public vs Private Cloud
Public Cloud Environment:
• In a public cloud environment, the cloud provider offers different pricing modes for
cloud services based on demand characteristics.
• Pricing modes have different cost structures that affect resource allocation strategies.
• Cloud service providers like Amazon, Microsoft, and Alicloud provide three different
pricing modes, each with different cost structures.
• The edge node must select the appropriate pricing mode and allocate user demands
to rented VMs or its own VMs.
Private Cloud Environment:
• In a private cloud environment, the edge node has its own virtual machines (VMs) to process user demands.
• If the number of VMs requested exceeds the edge node's capacity, the edge node can rent VMs from the cloud node to scale up.
• The cost of private cloud changes dynamically according to its physical computing cost,
so the edge node needs to allocate resources dynamically at each time slot according to
its policy.
• After allocating resources, the computing cost of the edge node and private cloud in this
time slot can be calculated and used to receive new computing tasks in the next time
slot.
Deep Reinforcement Learning for Cloud-Edge
User Settings
The time is discretized into T time slots.
We assume that in each time slot t, the demand submitted by the user can be
defined as the following:
$$D_t = (d_t, l_t)$$
$D_t$ is a pair of $d_t$ and $l_t$, where $d_t$ is the number of VMs requested in $D_t$, and $l_t$ is the computing time duration of $D_t$.
Computing Resources and Cost of Edge Nodes:
• The total computing resources owned by the edge node are represented by $E$.
• As resources are allocated to users, we use $e_t$ to represent the number of remaining VMs of the edge node in time slot t.
• The number of VMs provided by the edge node is expressed as $d_t^e$.
• The number of VMs provided by the cloud node is expressed as $d_t^c$.
• It should be noted that if the edge node has no available resources, it will hand over all the arriving computing tasks to the cloud service for processing. So, the number of VMs provided by the edge node in time slot t is given as:
$$d_t^e = \begin{cases} d_t - d_t^c, & e_t > 0 \\ 0, & e_t = 0 \end{cases}$$
Deep Reinforcement Learning for Cloud-Edge
Computing Resources and Cost of Edge Nodes:
The edge node keeps a list of allocation records, $H = \langle h_1, h_2, \dots, h_m \rangle$.
At the end of each time slot, the following actions are taken:
• The edge node traverses the allocation record list and subtracts one from the remaining computing time of each record.
• If a record's remaining computing time reaches 0, it means that the demand has been completed. The edge node releases the corresponding VMs and deletes the allocation record from the list.
• The number of VMs waiting to be released at the end of time slot t is denoted as $\eta_t$:
$$\eta_t = \sum_{i=1}^{m} d_i \quad \text{s.t. } l_i = 0,\ h_i \in H$$
• The number of remaining VMs at the next time slot t+1 is calculated based on the number of remaining VMs at the beginning of time slot t, the quantity allocated at the end of time slot t, and the quantity released due to completion of computing tasks in time slot t. Then, the number of remaining VMs of the edge node at time slot t + 1 is
  e_{t+1} = e_t − d_t^e + η_t
• The cost of the edge node in time slot t is calculated as the sum of the standby cost (e_t p_e) and the computing cost ((E − e_t) p_c):
  C_t^e = e_t p_e + (E − e_t) p_c
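As a concrete illustration (not part of the original slides), the per-slot bookkeeping described above can be sketched in Python; the record format [vms, remaining_slots] and the price symbols p_e, p_c follow the formulas just given.

def edge_time_slot(e_t, E, d_t_e, l_t, records, p_e, p_c):
    """One time slot of the edge node: allocate, age records, release, compute cost."""
    # Allocate resources and generate an allocation record h_t = (d_t_e, l_t).
    e_t = e_t - d_t_e
    if d_t_e > 0:
        records.append([d_t_e, l_t])
    # End of the slot: age every record and release completed ones (eta_t).
    eta_t = 0
    for rec in records:
        rec[1] -= 1
    for rec in [r for r in records if r[1] <= 0]:
        eta_t += rec[0]
        records.remove(rec)
    # Cost of the edge node in this slot: standby cost + computing cost.
    cost_edge = e_t * p_e + (E - e_t) * p_c
    # Remaining VMs at the beginning of slot t+1.
    return e_t + eta_t, eta_t, cost_edge

# Usage with the numbers of the worked example later in these notes (E = 80, p_e = 0.03, p_c = 0.2):
records = []
print(edge_time_slot(e_t=80, E=80, d_t_e=18, l_t=2, records=records, p_e=0.03, p_c=0.2))
# approximately (62, 0, 5.46)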
In the private cloud environment, the cost of collaborative cloud-edge in time slot t is
  C_t^{pri} = d_t^c p_t + C_t^e
where
  p_t: unit cost of VMs in the private cloud
  C_t^e: cost of the edge node
Cost in Public Cloud:
• In time slot t, the cost of collaborative cloud-edge in the public cloud environment includes the computing cost of the cloud node and the cost of the edge node, which is the following:
  C_t^{pub} = X_1 p_od d_t^c + X_2 p_reserve + X_3 p_re d_t^c + X_4 p_t d_t^c + C_t^e
  X_i = 1 if the corresponding service is used, 0 if the service is not used.
Where,
  X_1 p_od d_t^c: cost of on-demand instances
  X_2 p_reserve + X_3 p_re d_t^c: cost of reserved instances
  X_4 p_t d_t^c: cost of spot instances
• In a public cloud environment, the edge node determines the type of cloud service to be used based on the allocation and the price of the corresponding cloud service set by the cloud service provider.
• The cost of the current time slot t, denoted as C_t, is calculated based on the allocation and the price of the corresponding cloud service set by the cloud service provider.
• The long-term cost of the system is minimized over the T time slots by minimizing the sum of the costs over all time slots, i.e.
  min Σ_{t=1}^{T} C_t
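A minimal sketch of the public-cloud cost above, assuming the pricing mode has already been chosen; the price arguments correspond to the symbols p_od, p_reserve, p_re and p_t, and the function signature is purely illustrative.

def public_cloud_cost(mode, d_cloud, cost_edge, p_od, p_reserve, p_re, p_spot):
    """The 'mode' argument plays the role of the X indicator variables."""
    if mode == "on_demand":
        cloud = p_od * d_cloud
    elif mode == "reserved":
        cloud = p_reserve + p_re * d_cloud   # reservation fee plus per-VM rate
    else:                                    # spot
        cloud = p_spot * d_cloud
    return cloud + cost_edge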
Deep Reinforcement Learning for Cloud-Edge
Resource Allocation Algorithms: Markov Decision Process
• The resource allocation problem is a sequential decision-making problem
• It can be modeled as a Markov decision process.
• Markov decision process is a tuple (S, A, P, r, γ), where S is the finite set of
states, A the finite set of actions, P is the probability of state transition, r and γ
are the immediate reward and discount factor, respectively.
• s_t = (e_t, η_{t−1}, D_t, p_t) ∈ S is used to describe the state of the edge node at the beginning of each time slot, where
  e_t: number of remaining VMs of the edge node in t,
  η_{t−1}: number of VMs returned in the previous time slot,
  D_t: user's demand information in t,
  p_t: unit cost of VMs in the private cloud in t.
• In the public cloud environment, first, the edge node needs to select the pricing mode of
cloud service to be used and then determine the resource segmentation between the
edge node and the cloud node in each time slot t.
• The resource allocation action can be described by a parameterized action.
• In order to describe this parameterized-action sequential decision, the parameterized action Markov decision process (PAMDP) is used.
• Similar to the Markov decision process, PAMDP is a tuple (S, A, P, r, γ).
• The difference with the Markov decision process is that A is the finite set of parameterized actions.
DDPG introduces the ideas of DQN and contains four networks, where the main Actor network selects the appropriate action a according to the current state s and interacts with the environment.
The parameters θ of the Actor target network and the parameters ω of the Critic target network are updated using a soft update:
  ω' ← τω + (1 − τ)ω'
  θ' ← τθ + (1 − τ)θ'    (4)
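A tiny illustrative sketch of the soft update in (4); the value of τ and the way parameters are stored (plain Python lists) are assumptions for the example.

TAU = 0.005  # assumed soft-update coefficient

def soft_update(target_params, main_params, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * m + (1.0 - tau) * t for m, t in zip(main_params, target_params)]

# The target network drifts slowly towards the main network.
critic_main   = [0.8, -1.2, 0.4]   # toy parameter vector (omega)
critic_target = [0.0,  0.0, 0.0]   # omega'
critic_target = soft_update(critic_target, critic_main)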
• At the beginning of each iteration, the edge node first needs to obtain the state s_t of the collaborative cloud-edge environment.
• It then passes the state as the input of the neural network into the main Actor network to obtain the action a_t.
• After the edge node gets the action, the number of demands to be processed by the edge node and the number of demands to be processed by the private cloud will be calculated from the action value, i.e., d_t^e and d_t^c, respectively.
• Then, it interacts with the environment based on d_t^e and d_t^c to get the next state, reward, and termination flag.
• This round of experience is stored in the experience replay pool.
• CERAI will sample from the experience replay pool and calculate the loss functions of Actor and Critic to update the parameters of the main and target networks.
• After one round of iteration, training continues until the maximum number of training rounds is reached, to ensure the convergence of the resource allocation policy.
Deep Reinforcement Learning for Cloud-Edge
CERAI (Cost-Efficient Resource Allocation with private cloud) Algorithm
1. Initialize Actor main network and target network parameters θ, θ'; Critic main network and target network parameters ω, ω'; soft update coefficient τ; number of samples for batch gradient descent m; maximum number of iterations M; random noise 𝒩 and experience replay pool K
2. For i = 1 to M do
3.   Receive user task information and obtain the status s of the collaborative cloud-edge computing environment;
4.   The Actor main network selects an action according to s: a = π_θ(s) + 𝒩;
5.   The edge node performs action a and obtains the next status s', reward r and termination flag isend;
6.   The edge node generates an allocation record h_t according to the allocation operation. Add it to the allocation record list H;
7.   Add the state transition tuple (s, a, r, s', isend) to the experience replay pool K;
8.   Update status: s = s';
9.   Sample m samples from the experience replay pool K and calculate the target Q value y according to eq. (2);
10.  Calculate the loss function according to (1) and update the parameters of the Critic main network;
11.  Calculate the loss function according to (3) and update the parameters of the Actor main network;
12.  Update the parameters of the Critic and Actor target networks according to (4);
13.  Update the allocation record list H and release computing resources for completed tasks;
14.  If s' is terminated, complete the current round of iteration; otherwise go to step 3;
15. end.
• Each discrete action k is associated with a continuous parameter x_k, the parameter value corresponding to k.
• Similar to DQN, a deep neural network Q(s, k, x_k; ω) is used in P-DQN to estimate Q(s, k, x_k), where ω is the neural network parameter.
• For Q(s, k, x_k; ω), P-DQN uses the deterministic policy network x_k(·; θ): S → X_k to estimate the parameter value x_k^Q(s), where θ represents the policy network. That means the goal of P-DQN is to find the corresponding parameters θ when ω is fixed. It can be written as the following:
  Q(s, k, x_k(s; θ); ω) ≈ Q(s, k, x_k; ω)    (5)
• Similar to DQN, the value of ω can be obtained by minimizing the mean square error by gradient descent.
• In particular, at step t, ω_t and θ_t are the parameters of the value network and the deterministic policy network, respectively.
• y_t can be written as:
  y_t = r + max_{k ∈ K} Q(s', k, x_k(s'; θ_t); ω_t)    (6)
  where s′ is the next state after taking the mixed action a = (k, x_k).
Deep Reinforcement Learning for Cloud-Edge
4. Resource Allocation Based on P-DQN
The loss function of the value network can be written as the following:
  l_Q(ω) = ½ (Q(s, k, x_k; ω) − y)²    (7)
The loss function of the policy network can be written as:
  l_θ(θ) = − Σ_{k=1}^{K} Q(s, k, x_k(s; θ); ω)    (8)
• The P-DQN structure is shown in the figure.
• Cost-Efficient Resource Allocation with public cloud (CERAU) is a resource allocation algorithm based on P-DQN. The input of the algorithm contains information about the user request demands D_t and the unit cost of spot instances in the public cloud in time slot t, p_t.
• At the beginning of each iteration of the algorithm, the edge node first needs to obtain the state s_t of the collaborative cloud-edge environment.
• Then it passes the state as the input of the neural network into the strategy network to obtain the parameter values of each discrete action.
• After the edge node gets the action, it will select the appropriate public cloud instance type based on the discrete values in the action and determine the number of public cloud instances to be used based on the parameter values.
• Then, interaction with the environment occurs, to get the next state, reward, and termination flag.
• Storing this round of experience to the experience replay pool, CERAU will sample from the experience replay pool and calculate the gradients of the value network and the policy network.
• Then, it will update the parameters of the corresponding networks.
• After one round of iteration, to ensure the convergence of the resource allocation policy, training continues until the maximum number of training rounds is reached.
3. Receive user task information and obtain the status s of the collaborative cloud-edge computing environment;
4. Calculate the parameter value of each instance type in the cloud service: x_k ← x_k(s_t, θ_t) + 𝒩;
5. Select a discrete action according to the ε-greedy strategy:
   a = random discrete action,                        if rnd < ε
   a = (k, x_k), k = argmax_{k ∈ K} Q(s, k, x_k; ω),  if rnd ≥ ε
6. The edge node performs the action and obtains the next status s', reward r and termination flag isend;
7. The edge node generates an allocation record h_t according to the allocation operation. Add it to the allocation record list H;
8. Add the state transition tuple (s, a, r, s', isend) to the experience replay pool D;
9. Sample m samples from the experience replay pool D and calculate the target Q value y according to (6);
10. Update status: s = s';
11. Calculate the gradients ∇_ω l_Q(ω) and ∇_θ l_θ(θ) according to (7) and (8);
12. Update network parameters: ω' ← ω − τ_1 ∇_ω l_Q(ω), θ' ← θ − τ_2 ∇_θ l_θ(θ);
13. Update the allocation record list H and release computing resources for completed tasks;
14. If s' is terminated, complete the current round of iteration; otherwise go to step 3;
15. end
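A hedged sketch of the ε-greedy parameterized-action selection in step 5; the q_value and x_of callables stand in for the value network Q(s, k, x_k; ω) and the parameter network x_k(s; θ), and all names are hypothetical.

import random

def select_action(state, instance_types, x_of, q_value, epsilon, noise=0.0):
    """Pick a discrete instance type k plus its continuous parameter x_k."""
    # Continuous parameters proposed by the policy network, plus exploration noise.
    params = {k: x_of(state, k) + noise for k in instance_types}
    if random.random() < epsilon:
        k = random.choice(instance_types)   # explore: random discrete action
    else:
        k = max(instance_types, key=lambda kk: q_value(state, kk, params[kk]))  # exploit
    return k, params[k]

# Toy usage with hypothetical stand-ins for the networks.
types = ["on_demand", "reserved", "spot"]
x_net = lambda s, k: 0.5
q_net = lambda s, k, x: {"on_demand": 0.1, "reserved": 0.3, "spot": 0.2}[k]
print(select_action(state=None, instance_types=types, x_of=x_net, q_value=q_net, epsilon=0.1))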
Deep Reinforcement Learning for Cloud-Edge
Thank You
Deep Reinforcement Learning for Cloud-Edge: Example
Dr. Rajiv Misra
Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
Deep Reinforcement Learning for Cloud-Edge
Preface
Content of this Lecture:
• In this lecture, we will discuss how collaborative cloud-edge approaches can provide better performance and efficiency than traditional cloud or edge approaches, with the help of some examples.
Time slot t    Demand D_t = (d_t, l_t)
1              (30, 2)
2              (10, 1)
3              (20, 2)
where d_t represents the number of VMs requested and l_t represents the duration of the service request. Assume that time slot 1 is the starting slot such that no VMs have been allocated a priori. There are 80 VMs available at the edge node.
Time slot t    Action x_t
1              0.4
2              0.7
3              0.8
The action x_t ∈ [0, 1] represents the ratio of VMs allocated from the private cloud to the total VMs requested by the client at time slot t. The remaining VMs (1 − x_t) are allocated from the edge node.
Calculate the cost of collaborative cloud-edge computing (C_t^{pri}) in the given private cloud setting at each of the three time slots. Also, find out the number of VMs that will be available at the edge node at the beginning of the fourth time slot.
Given Constants:
Constant    Value

Time slot t    Action (k_t, x_t)
1              (1, 0.4)
2              (0, 0.7)
3              (2, 0.8)
where k_t ∈ {0 = on_demand, 1 = reserved, 2 = spot} represents the type of public cloud instance that was allocated. Calculate the cost of collaborative cloud-edge computing (C_t^{pub}) in the given public cloud setting at each of the three time slots. Assume that the same demands were made by the client as in part (a) and that no customization is performed on reserved instances.
Additional Constants:
Constant    Value

Example: Solution
At time slot (t = 1):
Demand: D_1 = (d_1, l_1) = (30, 2)
Action: x_1 = 0.4
No of VMs allocated from cloud: d_1^c = x_1 ∗ d_1 = 0.4 ∗ 30 = 12
No of VMs allocated from edge node: d_1^e = d_1 − d_1^c = 30 − 12 = 18
No of VMs remaining at the edge node: e_1 = e_1 − d_1^e = 80 − 18 = 62
Resources can be successfully allocated from the edge node; hence, an allocation record will be generated:
Allocation record: h_1 = (d_1^e, l_1) = (18, 2)
Allocation Record List H: <h_1> : <(18, 2)>
Updated Allocation Record List H: <h_1> : <(18, 1)>
Number of VMs waiting to be released: η_1 = 0
Number of VMs available at next time slot: e_2 = e_1 + η_1 = 62 + 0 = 62
Cost at the edge node: C_1^e = e_1 p_e + (E − e_1) p_c = 62 ∗ 0.03 + (80 − 62) ∗ 0.2 = 1.86 + 3.6 = 5.46
Cost at the private cloud: C_1^{pri} = d_1^c p_cloud + C_1^e = 12 ∗ 3.0 + 5.46 = 41.46
Cost at the public cloud: C_1^{pub} = d_1^c p_re + C_1^e = 12 ∗ 1.5 + 5.46 = 23.46 (reserved instance, k_1 = 1)
Deep Reinforcement Learning for Cloud-Edge
Example : Solution
At time slot (t = 2):
Demand: D_2 = (d_2, l_2) = (10, 1)
Action: x_2 = 0.7
No of VMs allocated from cloud: d_2^c = x_2 ∗ d_2 = 0.7 ∗ 10 = 7
No of VMs allocated from edge node: d_2^e = d_2 − d_2^c = 10 − 7 = 3
No of VMs remaining at the edge node: e_2 = e_2 − d_2^e = 62 − 3 = 59
Resources can be successfully allocated from the edge node; hence, an allocation record will be generated:
Allocation record: h_2 = (d_2^e, l_2) = (3, 1)
Allocation Record List H: <h_1, h_2> : <(18, 1), (3, 1)>
Updated Allocation Record List H: <h_1, h_2> : <(18, 0), (3, 0)>
Number of VMs waiting to be released: η_2 = 18 + 3 = 21
Number of VMs available at next time slot: e_3 = e_2 + η_2 = 59 + 21 = 80
Cost at the edge node: C_2^e = e_2 p_e + (E − e_2) p_c = 59 ∗ 0.03 + (80 − 59) ∗ 0.2 = 1.77 + 4.2 = 5.97
Cost at the private cloud: C_2^{pri} = d_2^c p_cloud + C_2^e = 7 ∗ 3.0 + 5.97 = 26.97
Cost at the public cloud: C_2^{pub} = d_2^c p_od + C_2^e = 7 ∗ 3.0 + 5.97 = 26.97 (on-demand instance, k_2 = 0)
At time slot (t = 3):
Demand: D_3 = (d_3, l_3) = (20, 2)
Action: x_3 = 0.8
No of VMs allocated from cloud: d_3^c = x_3 ∗ d_3 = 0.8 ∗ 20 = 16
No of VMs allocated from edge node: d_3^e = d_3 − d_3^c = 20 − 16 = 4
No of VMs remaining at the edge node: e_3 = e_3 − d_3^e = 80 − 4 = 76
Resources can be successfully allocated from the edge node; hence, an allocation record will be generated:
Allocation record: h_3 = (d_3^e, l_3) = (4, 2)
Allocation Record List H: <h_3> : <(4, 2)>
Updated Allocation Record List H: <h_3> : <(4, 1)>
Number of VMs waiting to be released: η_3 = 0
Number of VMs available at next time slot: e_4 = e_3 + η_3 = 76 + 0 = 76
Cost at the edge node: C_3^e = e_3 p_e + (E − e_3) p_c = 76 ∗ 0.03 + (80 − 76) ∗ 0.2 = 2.28 + 0.8 = 3.08
Cost at the private cloud: C_3^{pri} = d_3^c p_cloud + C_3^e = 16 ∗ 3.0 + 3.08 = 51.08
Cost at the public cloud: C_3^{pub} = d_3^c p_t + C_3^e = 16 ∗ 1.0 + 3.08 = 19.08 (spot instance, k_3 = 2)
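The numbers above can be reproduced with a short script; the prices used (p_e = 0.03, p_c = 0.2, a private-cloud VM price of 3.0, and public-cloud prices 1.5/3.0/1.0 for the instance types chosen in part (b)) are the ones appearing in the solution steps.

E, e = 80, 80
p_e, p_c = 0.03, 0.2
p_private = 3.0
p_public = [1.5, 3.0, 1.0]             # reserved, on-demand, spot (as chosen in part b)
demands = [(30, 2), (10, 1), (20, 2)]   # (d_t, l_t)
actions = [0.4, 0.7, 0.8]               # x_t

records = []                            # allocation record list H: [vms, remaining slots]
for t, ((d, l), x) in enumerate(zip(demands, actions), start=1):
    d_cloud = round(x * d)
    d_edge = d - d_cloud
    e -= d_edge
    records.append([d_edge, l])
    # end of slot: age records and release completed ones
    eta = 0
    for rec in records:
        rec[1] -= 1
    for rec in [r for r in records if r[1] <= 0]:
        eta += rec[0]
        records.remove(rec)
    cost_edge = e * p_e + (E - e) * p_c
    cost_pri = d_cloud * p_private + cost_edge
    cost_pub = d_cloud * p_public[t - 1] + cost_edge
    e += eta                            # VMs available at the start of slot t+1
    print(t, cost_edge, cost_pri, cost_pub, e)
# Expected: 5.46 / 41.46 / 23.46, then 5.97 / 26.97 / 26.97, then 3.08 / 51.08 / 19.08, and e_4 = 76.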
Public Cloud Services
Dr. Rajiv Misra
Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
Public Cloud Services: AWS Services
Contents of lecture
In this lecture, we will cover a Public Cloud Services,
a case study of AWS services
Reference Model
We will use a reference model to explain AWS services systematically as a 5-layered model.
Dr. Rajiv Misra
Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
Mathematical formulations for task-offloading in Edge-Cloud Environment
Preface
Content of this Lecture:
• An Edge-Cloud latency models that show the impact of different tasks’
offloading scenarios/schemes for time-sensitive applications in terms of
end-to-end service times.
• Evaluation of the offloading latency models that consider computation and communication as key parameters with respect to offloading to the local edge node, other edge nodes or the cloud.
• More than 50 billion devices will be connected to the internet , which will
produce a new set of applications such as Autonomous Vehicles, Augmented
Reality (AR), online video games and Smart CCTV.
• Thus, Edge Computing has been proposed to deal with this huge change in the area of distributed systems.
• Making offloading decisions in the Edge-Cloud system is challenging because it deals with several complex factors (e.g., different characteristics of IoT applications and heterogeneity of resources).
• Different service architectures and offloading strategies quantitatively impact the end-to-end service time performance of IoT applications.
IoT application tasks can be offloaded to different computing nodes such as Edge Computing and Cloud Computing nodes. Different service architectures and offloading strategies have a different impact on the service time performance of IoT applications.
An Edge-Cloud system architecture supports scheduling the offloading tasks of IoT applications in order to minimize the enormous amount of data transmitted in the network. It also introduces offloading latency models to investigate the delay of different offloading scenarios/schemes and explores the effect of computational and communication demand on each one.
Mathematical formulations for task-offloading in Edge-Cloud Environment
System Architecture
• The Edge-Cloud system, from bottom to the top, consists of three layers/tiers: IoT devices (end-user devices), multiple Edge Computing nodes and the Cloud (service provider).
• The IoT level is composed of a group of connected devices (e.g., smartphones, self-driving cars, smart CCTV);
• These devices have different applications, where each application has several tasks.
• The difference in the given architecture is the layer introduced between the edge nodes and the cloud. This layer is responsible for managing and assigning offloading tasks to the edge nodes.
• This layer keeps track of the computational resources in the system (e.g., available and used), the number of IoT devices, their applications' tasks and where IoT tasks have been allocated (e.g., Edge or Cloud).
• Its components include (among others) the Infrastructure Manager, Monitoring and the Planner.
• Application information includes the amount of data to be transferred, the amount of computational requirement (e.g., required CPU) and the latency constraints, besides the number of application users for each edge node.
• The Infrastructure Manager keeps track of the resources in the entire Edge-Cloud system, for instance, processors, networking and the connected IoT devices for all edge nodes.
• Edge-Cloud is a virtualized environment; thus, this component is responsible for the VMs as well. In this context, this component provides the EC with the utilization level of the VMs.
• Monitoring: observes the state of the Edge-Cloud system; furthermore, it detects tasks' failures due to network issues or the shortage of computational resources.
• Planner: the main role of this component is to propose the scheduling policy of the offloading tasks in the Edge-Cloud system and the location where they will be placed (e.g., local edge, other edges or the cloud). The offloading algorithm works on this component and passes its results to the EC for execution.
• Latency-sensitive applications offload their tasks and then wait for the response to receive the results.
• For example, self-driving cars consist of several services; these services can be classified into categories based on their latency-sensitivity, quality constraints and workload profile (required communication and computation).
• First, critical applications, which must be processed in the car's computational resources, for instance, autonomous driving and road safety applications.
• Second, high-priority applications, which can be offloaded but with
minimum latency, such as image aided navigation, parking navigation
system and traffic control.
• Third, low-priority applications, which can be offloaded and not vital as
high-priority applications (e.g., infotainment, multimedia, and speech
processing).
Mathematical formulations for task-offloading in Edge-Cloud Environment
Latency Sensitive Applications
• The Edge-Cloud system can reduce the end-to-end service time compared to the traditional Cloud system.
• However, different offloading decisions within the Edge-Cloud system can lead to various service times due to the computational resources and communication types. Current real-world applications measure the latency between the telecommunication service provider and the cloud services.
• Comparing the latency of offloading to the edge or the cloud, and the latency between multiple edge nodes that work collectively to process the offloading tasks, and thus investigating the latency of the Edge-Cloud system, is an essential step towards developing an effective scheduling policy.
• Firstly, task allocation in the Edge-Cloud system is not limited to two choices (e.g., either at the IoT device or in the cloud), but could be on any edge node. Moreover, edge nodes are connected in a loosely coupled way over heterogeneous wireless networks (i.e., WLAN, MAN and WAN), making the process of resource management and the offloading decision more sophisticated.
• Secondly, given that task processing is allocated among multiple edge nodes working collectively and the cloud, it is challenging to make an optimal offloading decision. The latency models below investigate the delay of different offloading scenarios/schemes.
Consider, as an example, IoT tasks that need to be processed on the edge or cloud. To clarify, IoT devices send their offloading tasks through the wireless network; the tasks are then processed by the edge node, and finally the results are sent back to the IoT devices.
This offloading scenario/scheme (offloading to the local edge only) provides ultra-low latency due to the avoidance of network backhaul delays. The end-to-end service time is composed of two delays: network delay and computational delay. The network delay consists of the time of sending the data to the edge and the time to receive the output from the edge at the IoT device. The computation time is the time from the arrival of the task at the edge node until the processing has completed. Therefore, the end-to-end service time is the sum of the communication delay and the computational delay, which can be calculated as follows:
• L_Local_edge = t_te_up + t_ce + t_te_down
Mathematical formulations for task-offloading in Edge-Cloud Environment
Latency Models: Latency to Local Edge with Cloud
In this offloading scenario/scheme, rather than
relying on only one Edge node, the IoT tasks can be
processed collaboratively between the connected
Edge node and the cloud servers.
This will combine the benefits of both Cloud and
Edge Computing, where the cloud has a massive
amount of computation resources, and the edge has
lower communication time.
In this scenario/scheme, the edge can do part of the
processing such as pre-processing, and the rest of
the tasks will be processed in the cloud.
IoT sends the computation tasks to the connected
edge and then part of these tasks forwarded to the
cloud.
Once the cloud finishes the computation, it will send
the result to the edge, and the edge will send it to the
IoT devices.
This scenario/scheme consists of communication
time (e.g., the time between the IoT device to the
edge node and the time between edge nodes to the
cloud) and computation time (e.g., processing time in
the edge and processing time in the cloud). Thus, the
end-to-end service time can be calculated as follows:
L_L_C = t_te_up + t_ce + t_tc_up + t_cc + t_tc_down + t_te_down
IoT devices send the computation tasks to the connected edge; part of these tasks is then transferred to other available resources at the edge level through the edge controller, and the rest to the cloud.
Mathematical formulations for task-offloading in Edge-Cloud Environment
Latency Models: Latency to Multiple Edge Nodes with Cloud
• This is known as a three-level offloading scenario/scheme that aims to utilize more resources at
the edge layer and support the IoT devices in order to reduce the overall service time.
• It adds another level by considering other available computation resources in the edge layer.
• Basically, it distributes IoT tasks over three levels: connected edge, other available edge nodes
and the cloud.
• The edge controller (edge orchestrator) controls all edge servers over a Wireless Local Area Network (WLAN) or Metropolitan Area Network (MAN), which have low latency compared to a Wide Area Network (WAN).
• This will help to decrease the dependency on cloud processing as well as increase the utilization of computing resources at the edge.
• This scenario/scheme consists of communication time (e.g., the time between the IoT device and the edge node, the time between the edge node and other collaborative edge nodes, and the time between edge nodes and the cloud) and computation time (e.g., processing time in the edge, processing time in other collaborative edge nodes and processing time in the cloud). Thus, the end-to-end service time can be calculated as follows:
• L_Three_all = t_te_up + t_ce + t_ce_o + t_tc_up + t_cc + t_tc_down + t_te_o_down + t_te_down
  where t_ce_o is the processing time at the other collaborative edge node and t_te_o_down is the time to receive its output.
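For illustration, the three latency models can be written as simple functions; the argument names follow the t_* symbols above and the numeric values in the usage example are arbitrary.

def latency_local_edge(t_te_up, t_ce, t_te_down):
    """Offload to the connected (local) edge node only."""
    return t_te_up + t_ce + t_te_down

def latency_edge_with_cloud(t_te_up, t_ce, t_tc_up, t_cc, t_tc_down, t_te_down):
    """Part of the task is pre-processed at the edge, the rest in the cloud."""
    return t_te_up + t_ce + t_tc_up + t_cc + t_tc_down + t_te_down

def latency_multi_edge_with_cloud(t_te_up, t_ce, t_ce_other, t_tc_up, t_cc,
                                  t_tc_down, t_te_other_down, t_te_down):
    """Three-level scheme: local edge, other edge nodes and the cloud."""
    return (t_te_up + t_ce + t_ce_other + t_tc_up + t_cc
            + t_tc_down + t_te_other_down + t_te_down)

# Illustrative numbers (milliseconds, purely assumed):
print(latency_local_edge(5, 20, 5))                              # 30
print(latency_edge_with_cloud(5, 10, 40, 8, 40, 5))              # 108
print(latency_multi_edge_with_cloud(5, 8, 8, 40, 5, 40, 6, 5))   # 117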
Assumptions:
• We have three edge nodes connected to the cloud.
• Each edge node has two servers, and each of them has four VMs with a similar configuration.
• The cloud contains an unlimited number of computational resources
Key parameters: Values
Simulation Time: 30 min
Warm-up Period: 3 min
Number of Iterations: 5
Number of IoT Devices: 100–1000
Number of Edge Nodes: 3
Number of VMs per Edge Server: 8
Number of VMs in the Cloud: not limited
Average Data Size for Upload/Download (KB): 500/500
• The architecture includes several components that interact with each other to support task offloading, such as IoT devices, edge nodes, and cloud servers.
• The latency models consider computation and communication as key parameters for offloading tasks to different destinations, including local edge nodes, other edge nodes, and the cloud.
• The algorithm predicts the edge server's load in real-time and allocates
resources in advance, improving the convergence accuracy and speed of the
offloading process.
• By predicting the characteristics of tasks and edge server loads, tasks are
dynamically offloaded to the optimal edge server
•In the training phase, the OPO algorithm predicts the load of the edge server in real-time
with the LSTM algorithm, improving the convergence accuracy and speed of the DRL
algorithm in the offloading process.
•In the testing phase, the LSTM network predicts the characteristics of the next task, and
the DRL decision model allocates computational resources for the task in advance,
reducing the response delay and improving offloading performance.
• Each arriving task is first stored in corresponding MD task cache queue, and then
the decision model gives where the task will be offloaded to be executed.
• For t ∈ T, a new task generated by the terminal device m ∈ M is denoted by the tuple (D_m^t, ρ_m^t, τ_{m,max}^t), where
  D_m^t: size of the task data,
  ρ_m^t: computational resources required per bit,
  τ_{m,max}^t: maximum tolerated delay of the task.
• y_{m,n}^t ∈ {0, 1} represents the edge server to which the task is offloaded for execution; y_{m,n}^t = 1 means the task is offloaded to edge server n ∈ N for execution.
• The tasks in this model are atomic-level tasks.
• Each offloaded task can be executed in only one edge server, and the tasks offloaded to the edge server for execution are constrained by
  Σ_{n ∈ N} y_{m,n}^t = 1,  m ∈ M, t ∈ T, x_m^t = 1.
• The task generated in time slot t must wait until the computation queue is free to execute; the waiting delay is
  τ_{m,wait}^t = max_{t' ∈ {0,1,…,t−1}} ( l_m^{comp}(t') − t + 1 )    (1)
  where l_m^{comp}(t'): completion time slot of the task.
• The processing delay in the computational queue is
  τ_{m,exe}^t = D_m^t ρ_m^t / f_m^{device}    (2)
  where f_m^{device}: processing capacity (bits/s) of the MD.
• By (1) and (2), the completion time slot of a locally executed task is
  l_m^{comp}(t) = min{ t + τ_{m,wait}^t + τ_{m,exe}^t , t + τ_{m,max}^t }
• The energy consumption E_m^{device} required for the task to be executed locally is
  E_m^{device} = P_m^{exe} τ_{m,exe}^t + P_m^{wait} τ_{m,wait}^t
  where
  τ_{m,wait}^t: waiting delay in the local model,
  τ_{m,n}^{tran}: transmission delay,
  τ_{m,wait}^t + τ_{m,n}^{tran} + τ_{m,n,exe}^t: time slots required for a task to be offloaded from the endpoint to the edge server and executed to completion,
  τ_{m,max}^t: maximum tolerated delay.
• The energy consumption incurred when tasks are offloaded to the edge server is
  E_{m,n}^{edge} = P_m^{wait} τ_{m,wait}^t + P_{m,n}^{tran} τ_{m,n}^{tran} + P_{m,n}^{exe} τ_{m,n,exe}^t
  where P_m^{wait} τ_{m,wait}^t, P_{m,n}^{tran} τ_{m,n}^{tran} and P_{m,n}^{exe} τ_{m,n,exe}^t denote the waiting energy consumption, the transmission energy consumption and the edge node computation energy consumption of the task, respectively.
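A small sketch of the local-execution part of this model (equations (1) and (2)); the numeric inputs in the example call are invented.

def local_execution(t, data_bits, rho, f_device, tau_wait, tau_max, p_exe, p_wait):
    """Processing delay, completion time slot and device energy for one locally executed task."""
    tau_exe = data_bits * rho / f_device                 # eq. (2): required work / device capacity
    l_comp = min(t + tau_wait + tau_exe, t + tau_max)    # finish no later than the deadline
    energy = p_exe * tau_exe + p_wait * tau_wait         # execution + waiting energy
    return tau_exe, l_comp, energy

# Toy numbers (all assumed): a 1 Mbit task, 100 cycles/bit, a 1 GHz device.
print(local_execution(t=0, data_bits=1e6, rho=100, f_device=1e9,
                      tau_wait=0.0, tau_max=1.0, p_exe=2.0, p_wait=0.1))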
Task Offloading Based on LSTM Prediction and Deep Reinforcement Learning
Goal
• The overall model of the system is a trade-off between the time delay and energy
consumption of the task computation to create a minimization cost problem
• The solution goal is to minimize the total cost of the tasks generated in the system
over time.
A decision process is required after task generation,
and there will be a certain time delay from task
generation to give a decision.
• Although task generation is a dynamic and random
process, considering the long-term nature of the
task, it will have a strong correlation with time.
• Therefore, based on the history of user devices, we
can predict the tasks that will be generated in the
next network time slot
• As shown in Figure, we can predict the information
of the future task by the prediction model, and
determine the decision and allocate computing
resources for the task.
• If the error between the real task and the predicted
task is within the allowed threshold, the task is
directly offloaded and computed according to the
assigned decision information.
• Otherwise, the offloading decision is given using
the decision model and the information of the new
task is added to the historical data as training
samples.
• By training the LSTM network, the weights and
biases of each gate in the network are updated to
improve the accuracy of the prediction model.
•Historical load sequence data is logged and
used to train an LSTM load prediction model.
•The predicted idle server (𝐻" ) is obtained
from the trained model using historical load
sequence data as input.
•The predicted idle server is used as the
offload computing node when training the
DRL.
•The DRL training process involves selecting
actions with a certain probability (ε).
•When a random action is selected, the size
comparison between a random value σ and
the probability ε is used to determine
whether it is a Random Action or a Prediction
Action.
•Using Prediction Action with the pre-
selected idle server can reduce the number
of explorations by the agent and improve
convergence speed of the algorithm.
•The goal of DRL is to maximize the total reward by making optimal actions.
•DRL typically uses ε-greedy strategies for exploration and exploitation.
•Exploration involves random selection of any action with probability in expectation of a higher reward,
while exploitation selects the action with the largest action estimate.
•The stochastic strategy fully explores the environment state, but requires extensive exploration and low
data utilization.
•In the model, action selection is the offloading decision of the task, with the action space known whether
to execute locally or offload to an edge server.
•During stochastic exploration, LSTM is used to predict the load of the edge server and give an optimal
action.
•The optimal server at the next time slot is predicted based on historical load situation to obtain a higher
reward and avoid edge server load imbalance.
•Each MD generates different types of tasks at different time slots.
•There is a system response delay to the task's decision request and a waiting delay in the queue between
the generation of a task and giving a decision.
•The edge system processes data from MD and stores processed records.
•Based on historical records, feature information of the next arriving task can be predicted by LSTM.
•The predicted information is given to the reinforcement learning decision model to make an offloading
scheme for the predicted task.
•When the real task arrives, the offloading decision is given directly if the error between the real task and
predicted task is within the allowed range.
•If the error is not within the allowed range, the decision is made according to the real task using the
decision model.
•Predicting the task's information can reduce the task's response and waiting delay in the system.
• A typical DQN model is composed of agent, state, action, and reward
• the policy is generated as a mapping π : S → A of states to actions to obtain a reward R; r_t(s_t, a_t) denotes the reward that can be obtained by choosing action a_t in state s_t
• R_0 = Σ_{t=0}^{∞} γ^t r_t(s_t, a_t) is the long-term (discounted) reward
• when the state space and action space dimensions are large, it is difficult to put all
state-action pairs into Q-table.
• To solve this problem, the DQN model in DRL combines deep neural networks and
Q-learning algorithms, and it transforms the Q-table tables into the Q-networks and
uses neural networks to fit the optimal Q-functions.
• There are two neural networks with the same structure but different parameters in
DQN, i.e., the target network and the main network.
• When iteratively updating the network, the algorithm first uses the target network to
generate the target Q-value as the label f(t), and uses the loss function Loss(θ) to
update the parameters of the main network.
• After the introduction of the target network, the target Q value generated by the
target network remains constant in time j, which can reduce the correlation between
the current Q value and the target Q value and improve the stability of the
algorithm.
Task Offloading Based on LSTM Prediction and Deep Reinforcement Learning
Algorithm Design: Replay Memory
• In order to break the correlation within the data, DQN uses the experience replay
method to solve this problem.
• After interacting with the environment, the agent is stored in the replay buffer in the
form of (𝑠" , 𝑎" , 𝑟" , 𝑠"4, ).
• When executing valuation updates, the agent randomly selects a small set of
experience tuples (𝑠" , 𝑎" , 𝑟" , 𝑠"4, ) from the replay buffer at each time step
• Then the algorithm updates the network parameters by optimizing the loss function
• Using experience replay can not only make training more efficient, but also reduce the overfitting problem generated by the training process.
• Compared with DQN, Dueling DQN splits the Q network into two parts:
  - the first part is only related to the state s and has nothing to do with the specific action a to be taken; it is called the value function part, denoted V^π(s);
  - the second part is related to both the state s and the action a; it is called the action advantage function, denoted A^π(s, a).
  The final value function can be expressed as
  Q^π(s, a) = A^π(s, a) + V^π(s)
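A minimal PyTorch sketch of the dueling split described above; layer sizes are arbitrary, and the mean-advantage subtraction in the last line is a common implementation detail rather than something stated in these notes.

import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: a shared trunk, then separate value V(s) and advantage A(s, a) streams."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, state):
        h = self.trunk(state)
        v = self.value(h)
        a = self.advantage(h)
        # Q(s, a) = V(s) + A(s, a); subtracting the mean advantage keeps the split unique.
        return v + a - a.mean(dim=1, keepdim=True)

q_net = DuelingQNet(state_dim=8, n_actions=4)
q_values = q_net(torch.randn(2, 8))   # batch of 2 states -> Q values of shape (2, 4)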
• State:
1. At the beginning of each time slot, each agent observes the state of the environment.
2. The state includes the properties of the MD task, the waiting queue state, the transmission queue state, bandwidth information, and the real-time load of the edge nodes; all these states are closely related to the action to be selected by the agent.
• Action:
1. Based on the current state, the agent first decides whether the newly generated
task needs to be offloaded for computation.
2. if it needs to be offloaded, it chooses which server to offload
3. It also chooses the appropriate transmission power when offloading the
transmission
• Reward:
1. After observing the state at time slot t, the agent takes an action according to
the policy and then receives a reward at time slot t + 1 while updating the
scheduling policy network to make an optimal decision in the next time slot.
2. The goal of each agent is to maximize its long-term discounted reward by
optimizing the mapping from states to actions so that the agent tends to make
optimal decisions in its continuous interaction with the environment.
3. The reward function is shown below,
Assume,
• We use a dataset from Google Cluster, which includes information about the arrival
time, data size, processing time, and deadline of the tasks.
• For each type of task, the processing density, the processing time and the size of the data volume are related.
• preprocess the raw data according to the characteristics of the data and make the
data size compatible with the established model by normalization and
denormalization
• When performing training on the DRL offload decision model, it takes a longer time
to explore and select the better result due to the initial random selection action of
the agent.
• We predict the server load based on the edge server record history data
• Based on the prediction results, the server predicted to be non-idle is selected with
a certain probability as the offload choice for the next moment
• This solution allows the agent to effectively avoid selecting servers with high loads,
thus reducing task processing latency and task dropping rates.
• We use LSTM for load prediction and compare the impact of decisions with load
prediction (LSTM & DRL) and without load prediction (DRL) on offloading
performance.
• As a result, DRL alone is significantly slower than LSTM & DRL (with load prediction) in the early stages of training the decision model.
• After a certain amount of training, the average delay, energy consumption, and the number of dropped tasks are reduced rapidly by using LSTM for load prediction.
Suppose you are managing a data center that provides cloud computing services to
customers. You want to use an LSTM model to forecast the hourly CPU utilization of the
data center for the next 24 hours in order to optimize resource allocation and minimize
energy consumption.
You have a dataset with hourly CPU utilization data for the past year, which contains 8,760
data points. You decide to use the first 7,000 data points for training and the remaining
1,760 data points for validation. You set the batch size to 64 and the number of epochs to
50.
Assuming the model takes 5 seconds to process one batch of data on a GPU, how long will it
take to train the model?
Note: This question assumes that the data has already been preprocessed and formatted for
input into the LSTM model.
The time it will take to train the model can be calculated as follows:
• Batch size = 64
• Number of training data points = 7,000
• Number of epochs = 50
• Number of iterations (batches) per epoch = Number of training data points / Batch size = 7,000 / 64 = 109.375 ≈ 109 (rounded down to the nearest integer)
Total number of iterations = Number of epochs × Number of iterations per epoch = 50 × 109 = 5,450
Time taken to process one batch of data on a GPU = 5 seconds
Total time taken to train the model = Time taken per batch × Total number of iterations = 5 seconds × 5,450 = 27,250 seconds ≈ 7.6 hours
Therefore, it will take approximately 7.6 hours to train the LSTM model using the given dataset, batch size, and number of epochs.
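The same estimate in a few lines of Python (no assumptions beyond the numbers stated in the question):

batch_size, n_train, epochs, sec_per_batch = 64, 7000, 50, 5
iters_per_epoch = n_train // batch_size           # 109
total_iters = epochs * iters_per_epoch            # 5450
total_seconds = total_iters * sec_per_batch       # 27250
print(total_iters, total_seconds, total_seconds / 3600)   # 5450, 27250 s, ~7.6 hours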
Workload Optimization for Cloud-Edge
Dr. Rajiv Misra
Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
Workload Optimization for Cloud-Edge
Preface
Content of this Lecture:
• An architecture of collaborative cloud-edge computing with the aim of providing both vertical and horizontal offloading between service nodes.
• An approximation algorithm which applies a branch-and-bound method to obtain optimal solutions iteratively.
• This allows for a reduction in end-to-end delay for accessing these resources, and makes it more suitable for real-time or delay-sensitive services.
• Edge resources are also geographically distributed, which can address the requirements of mobility and geo-distribution of mobile and IoT services.
Example:
• A smart home system that utilizes edge computing could provide a more secure,
efficient, and cost-effective solution for controlling and monitoring devices such as
lights, thermostats, cameras, and door locks.
• The system would have a gateway device, such as a router, that would provide a local connection for each device.
• The gateway would run a virtualized instance of a cloud application, allowing for local processing of data and commands.
• This would reduce the latency for any commands sent to the devices, providing a more responsive system.
• Additionally, all data would be stored on the local gateway, providing a more secure
solution than if the data were stored in a cloud.
The collaborative architecture provides both vertical and horizontal offloading between service nodes.
Vertical Offloading:
• Vertical offloading refers to the process of transferring tasks or services from the cloud or datacenters to edge nodes in order to reduce latency or increase efficiency. It is also known as cloud-edge computing and is used to reduce the burden on the cloud.
Horizontal Offloading :
• Horizontal offloading, on the other hand, is the process of transferring tasks or services
between edge nodes in order to reduce latency or increase efficiency. It is used to
improve the capacity of edge nodes and can also be used to reduce the load on the
cloud.
1. First Tier:
• The first tier of the hierarchy is composed of end devices, such as smartphones, IP cameras and IoT sensors, which directly receive service workloads from their sources.
• A device can by itself locally process a fraction of the input workloads or horizontally offload some of the other workloads to neighboring devices, using various short-range wireless transmission techniques such as LTE D2D, Wi-Fi Direct, ZigBee, and Bluetooth.
Workload Optimization for Cloud-Edge
Architecture of Collaborative Cloud-Edge Computing
2. Second Tier:
The second tier comprises access network technologies such as Ethernet, Wi-Fi, and 4G/5G. The edge nodes are capable of processing part of the workloads.
3. Third Tier:
The third tier consists of horizontal and vertical offloading from the edge nodes to the central offices.
4. Fourth Tier:
The fourth tier consists of horizontal offloading from the central offices to neighboring central offices and vertical offloading to a remote federated data center. The data center is the top-most tier of the cloud-edge computing hierarchy and is responsible for processing the remaining workloads.
Workload Optimization for Cloud-Edge
Architecture of Collaborative Cloud-Edge Computing
For example:
In the case of a vehicle congestion avoidance service in a smart city:
• IP cameras are used to monitor traffic and detect abnormal behavior that might indicate an emergency event.
• The data captured by the cameras is then sent to an edge server for further analysis and processing.
• The server can then send the refined data to drivers or news outlets throughout the city.
• If there is a lack of computational power, the data can be redirected to other edge servers or even to a remote data center.
1) Workload Model:
Let f ∈ F denote an offered service of a cloud-edge computing system. Each
service f has a computation size ZSf which is the number of mega CPU cycles
required to process a request for service f. Also, communication size ZNf indicates
the data size of the request in megabytes.
EL
Let Iα, Iβ, Iγ, and Iδ be the sets of devices, network edges, central offices and data
centers of the system, respectively. A service node i ∈ I could process a set of
services Fi ⊆ F, where I is the set of all service nodes of the system,
i.e., I = Iα ∪ Iβ ∪ Iγ ∪ Iδ .
a) Local processing:
Let p_i^f denote the workload (in requests per second) of a service f which is locally processed by a node i.
b) Horizontal offloading:
Similarly, let u_{j,i}^f be the workload of a service f which is horizontally offloaded from j ∈ H_i to i. Here, we assume that a service node i can offload the workload of a service f to a sibling node j on condition that j is able to process f, i.e., f ∈ F_j. In addition, to prevent loop situations, a node cannot receive the workloads of a service f from its siblings if it already horizontally offloads this type of workload. Thus, we have
  u_{j,i}^f ≥ 0, if f ∈ F_j, ∀ j ∈ H_i, ∀ i ∈ I,
  u_{j,i}^f = 0, if f ∉ F_j, ∀ j ∈ H_i, ∀ i ∈ I.
c) Vertical offloading:
Let v_{j,i}^f be the workload of a service f which is vertically offloaded from j ∈ K_i to i. Since a device i ∈ Iα directly receives service workloads from external sources, it has no child nodes, i.e., K_i = ∅, ∀ i ∈ Iα. Similarly, a data center i ∈ Iδ is in the top-most tier of the system, and hence has no parent nodes, i.e., V_i = ∅, ∀ i ∈ Iδ.
As opposed to horizontal offloading, a service node can carry out vertical offloading for all services f ∈ F. In other words, it can dispatch all types of workloads to its parents. Thus, we have
  y_{i,j}^f ≥ 0, ∀ f ∈ F, ∀ j ∈ V_i, ∀ i ∈ Iα ∪ Iβ ∪ Iγ,
  v_{j,i}^f ≥ 0, ∀ f ∈ F, ∀ j ∈ K_i, ∀ i ∈ Iβ ∪ Iγ ∪ Iδ.
Let λ_i^f denote the submitted workload of a service f from external sources to a device i ∈ Iα. We have
  λ_i^f ≥ 0, ∀ f ∈ F, ∀ i ∈ Iα.
d) Computation and communication delay of the cloud-edge computing system
The total system cost C of a cloud-edge computing system is defined as
  C = C^S + C^N
where C^S is the computation cost of service nodes and C^N is the communication cost of network connections.
Since we aim to minimize the total cost of the cloud-edge computing system while
guaranteeing its delay constraints, we hence have an optimization problem.
• We try to solve a problem (P) which has variables that are integers and nonlinear
delay constraints.
• This type of problem is usually very hard to solve, so we are using the branch-and-bound algorithm.
• We search the tree looking for solutions with integers, and when we find one, we use it as an upper bound for the original problem.
• We keep searching until all the nodes of the tree have been solved or the search conditions have been met.
2. If a feasible solution C*(N*,O*) is reached, set it to the current optimal solution C(N,O)
3. Add an NLP sub-problem SP, generated by removing the integrality conditions of variables n_i of the problem P, to the tree data structure T
4. Start the branch-and-bound procedure: iteratively solve the sub-problem SP using the Interior/Direct algorithm with parallel multiple initial searching points
5. If a feasible solution C*(N*,O*) is smaller than the current optimal solution C(N,O) and N* are integers, set C*(N*,O*) to the current optimal solution and prune the node SP, removing it and its sub-nodes from T
6. If N* is not integral, perform a branching operation on a variable n_i ∈ N*, creating two new sub-problems SSP1 and SSP2 of SP, added to T using the pseudo-cost branching method
7. If C*(N*,O*) >= C(N,O), or there is no feasible solution, prune the node SP
8. Repeat the branch-and-bound procedure until all nodes of T have been resolved
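A generic sketch of the branch-and-bound loop described above, not the paper's exact procedure; solve_relaxation and branch are hypothetical stand-ins for solving the NLP relaxation (steps 3-4) and the pseudo-cost branching (step 6).

import math

def branch_and_bound(root_problem, solve_relaxation, branch):
    """solve_relaxation(p) -> (cost, solution) or None if infeasible;
    branch(p, solution) -> list of child sub-problems, or [] if the solution is integral."""
    best_cost, best_sol = math.inf, None
    tree = [root_problem]                 # tree data structure T
    while tree:
        sub = tree.pop()
        result = solve_relaxation(sub)
        if result is None:
            continue                      # infeasible: prune
        cost, sol = result
        if cost >= best_cost:
            continue                      # bound: cannot improve the incumbent, prune
        children = branch(sub, sol)
        if not children:                  # integral solution: new incumbent
            best_cost, best_sol = cost, sol
        else:
            tree.extend(children)         # branch into new sub-problems
    return best_cost, best_sol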
Workload Optimization for Cloud-Edge
ALGORITHM DESIGN - BRANCH-AND-BOUND WITH PARALLEL MULTI-
START SEARCH POINTS
2. Compare the cloud-edge computing system with a traditional design (NH) which does not support horizontal offloading.
3. Adjust the arrival rate to generate workloads whose total demanded computation capacity is 10%, 50%, and 100% of the maximum capacity of all service nodes.
4. Optimize the system to minimize the total system cost C, which consists of the computation cost of service nodes and the communication cost of network connections.
5. Present results of other metrics such as computation capacity allocation, workload allocation, and horizontal offloading workloads.
Unbalanced Workload:
• An unbalanced workload refers to scenarios where incoming workloads are not evenly distributed across cloud computing and edge computing resources.
• This could occur due to a sudden spike in requests from one geographical location or due to a particular type of workload that is more suited to being processed locally at the edge.
Balanced Workload:
• A balanced workload refers to scenarios where incoming workloads are evenly distributed across cloud computing and edge computing resources.
• Achieving this requires planning, careful monitoring of incoming workloads and the use of intelligent algorithms to route the workloads to the most appropriate resources.
• In a homogeneous service allocation scenario, services are allocated to the same type of cloud-edge computing environment and resources.
• This means that the same type of hardware and software is used across all the cloud-edge sites.
• This type of scenario is useful when the same types of applications are running across multiple sites or when the same types of services need to be provided.
• For example, if the same type of virtual machine is allocated to different tasks on the cloud and edge, then it would be a homogeneous service allocation scenario.
• In a heterogeneous service allocation scenario, different types of hardware and software are used across different cloud-edge sites.
• This type of scenario is useful when different types of applications are running across multiple sites or when different types of services need to be provided.
• This type of scenario also allows for more flexibility in the types of resources that can be used, allowing for a more customized experience for each site.
• For example, if different types of virtual machines are allocated to different tasks on the cloud and edge, then it would be a heterogeneous service allocation scenario.
• Cloud-edge computing architectures typically provide more cost-efficient solutions than traditional designs, as they leverage the cost-effectiveness of the cloud while providing more localized processing power.
• For example, if computation capacity costs are high, cloud-edge computing architectures can be more cost-effective by utilizing the cloud for its cost-effectiveness and leveraging localized processing power for more efficiency.
• This allows for cost savings in both cloud and edge compute costs, as cloud capacity is leveraged for less expensive compute and edge compute resources can be used as needed to meet performance and latency requirements.
This lecture first defines consistent global states and discusses issues
to be addressed to compute consistent distributed snapshots.
[Figure: space-time diagram of three processes P1, P2 and P3 with events e_i^1 ... e_i^5 and messages m12 and m21.]
I2: How to determine the instant when a process takes its snapshot.
-A process pj must record its snapshot before processing a message
mij that was sent by process pi after recording its snapshot.
S1: A (sends $50 to B on channel C12); S2: B (sends $80 to A on channel C21)
        t0     t1     t2     t3     t4
B:      $200   $200   $120   $120   $170
C12:    $0     $50    $50    $50    $0
C21:    $0     $0     $80    $0     $0
T4: Site S2 receives the message for a $50
credit to Account B and updates Account B
In non-FIFO model, a channel acts like a set in which the sender process
adds messages and the receiver process removes messages from it in a
random order.
Let site S1 initiate the algorithm just after t1. Site S1 records
its local state (account A=$550) and sends a marker to site
S2. The marker is received by site S2 after t4. When site S2
receives the marker, it records its local state (account
B=$170), the state of channel C12 as $0, and sends a marker
along channel C21. When site S1 receives this marker, it
records the state of channel C21 as $80. The $800 amount in
the system is conserved in the recorded global state,
A = $550, B = $170, C12 = $0, C21 = $80
The $800 amount in the system is conserved in the recorded global state
Figure 6.4: Timing diagram of two possible executions of the banking example
Global State and Snapshot
Properties of the recorded global state
2. (Markers shown using green dotted arrows.)
Let site S1 initiate the algorithm just after t0 and before sending the
$50 for S2. Site S1 records its local state (account A = $600) and
sends a marker to site S2. The marker is received by site S2 between
t2 and t3. When site S2 receives the marker, it records its local state
(account B = $120), the state of channel C12 as $0, and sends a
marker along channel C21. When site S1 receives this marker, it
records the state of channel C21 as $80. The $800 amount in the
system is conserved in the recorded global state,
A = $600, B = $120, C12 = $0, C21 = $80
The $800 amount in the system is conserved in the recorded global state
Figure 6.4: Timing diagram of two possible executions of the banking example
Global State and Snapshot
Properties of the recorded global state
In both these possible runs of the algorithm, the recorded global
states never occurred in the execution.
This happens because a process can change its state asynchronously
before the markers it sent are received by other sites and the other sites
record their states.
But the system could have passed through the recorded global states in
some equivalent executions.
The recorded global state is a valid state in an equivalent execution and
if a stable property (i.e., a property that persists) holds in the system
before the snapshot algorithm begins, it holds in the recorded global
snapshot.
Therefore, a recorded global state is useful in detecting stable
properties.
[Figure: Azure IoT reference architecture: Device SDK, IoT Hub, DPS, Azure App services, Digital Twins; Cold Path: Batch Processing with Data Lake, Data Factory, Synapse, Databricks, Azure DBaaS.]
The main thing about the hot path is that you're processing data in real time as it's happening; however, what's consuming that might be querying old data that was processed an hour ago. It could be something that's processing it and then presenting it in real time, such as a dashboard that is constantly monitoring things in their present state as data comes off of the hot path and into the consumption layer.
Use cases: website monitoring, fraud detection, ad monetization, website statistics, etc.
Scales to hundreds of nodes; achieves second-scale latencies.
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-
real-time
• Event or micro-batch driven, continuous evaluation and long-lived
processing
• Enabler for real-time Prospective, Proactive and Predictive
Analytics for Next Best Action
Stream Processing + Batch Processing = All Data Analytics
real-time (now) historical (past)
• Storm
• Replays record if not processed by a node
tweets DStream
new DStream transformation: modify data in one DStream to create another DStream
tweets DStream
every batch
saved to HDFS
Scala
val tweets = [Link]()
val hashTags = [Link](status => getTags(status))
[Link]("hdfs://...")
Java
JavaDStream<Status> tweets = [Link]()
JavaDstream<String> hashTags = [Link](new Function<...> { })
[Link]("hdfs://...")
Function object
sockets
join, …
• Stateful operations – window, countByValueAndWindow, …
…
reduceByKey reduceByKey reduceByKey
tagCounts
[(#cat, 10), (#dog, 25), ... ]
sliding window
window length sliding interval
operation
window length
DStream of data
sliding interval
countByValu
e count over all
tagCounts
the data in the
window
counting)
moods = [Link](updateMood _)
[Link](tweetsRDD => {
[Link](spamHDFSFile).filter(...)
})
• CEP-style processing
• window-based operations (reduceByWindow, etc.)
flatMap
7 3.5
Grep WordCount
Cluster Thhroughput (GB/s)
Grep WordCount
60 30
Spark Spark
(MB/s)
(MB/s)
40 20
20 10 Storm
Storm
0 0
100 1000 100 1000
Record Size (bytes) Record Size (bytes)
2000
▪ Markov-chain Monte Carlo
observations 1200
800
▪ Very CPU intensive, requires
400
dozens of machines for useful
computation 0
0 20 40 60 80
# Nodes in Cluster
▪ Scales linearly with cluster size
object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    // realtime processing
  }
}
• Performance optimizations can be performed
• Both the parameters must be a multiple of the batch interval
• For (2) specifying a window spec, there are three components: partition by, order by, and
frame.
1. “Partition by” defines how the data is grouped; in the above example, it was by
customer. You have to specify a reasonable grouping because all data within a group will
be collected to the same machine. Ideally, the DataFrame has already been partitioned by
the desired grouping.
2. “Order by” defines how rows are ordered within a group; in the above example, it
was by date.
3. “Frame” defines the boundaries of the window with respect to the current row; in the
above example, the window ranged between the previous row and the next row.
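A hedged PySpark illustration of the three components described above; the column names customer, date and amount and the data rows are invented for the example.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-spec-demo").getOrCreate()
df = spark.createDataFrame(
    [("alice", "2023-01-01", 10.0), ("alice", "2023-01-02", 20.0),
     ("alice", "2023-01-03", 30.0), ("bob", "2023-01-01", 5.0)],
    ["customer", "date", "amount"])

# partition by: group rows per customer; order by: sort by date;
# frame: previous row .. next row relative to the current row.
w = Window.partitionBy("customer").orderBy("date").rowsBetween(-1, 1)
df.withColumn("moving_avg", F.avg("amount").over(w)).show()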
• Targeted advertising
• [Link]
• [Link]
Dr. Rajiv Misra, Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
Integration of MQTT and Kafka
The protocol is lightweight and implements a publish/subscribe communication pattern. MQTT is stable in unreliable environments of high latency and low network bandwidth, which makes it a perfect match for Internet of Things scenarios like connected cars or smart homes.
• Kafka is also available as a managed service providing several additional features like Schema Registry, REST & MQTT Proxies, and specific connectors.
• Kafka implements its own protocol under the hood, following a publish/subscribe pattern which structures communication into topics, similar to MQTT. However, that's the only thing both have in common.
• Kafka is designed for high-throughput, data-intensive applications.
• Its main use cases are distributed event streaming and storage/consumption of massive amounts of data as messages.
• This makes Kafka a perfect match for scenarios that require high-performance, scalable data pipelines, or data integration across multiple systems.
Introduction: Internet of Things Streaming using Kafka
The characteristics of Kafka are:
1. It is a distributed and partitioned messaging system.
2. It is highly fault-tolerant.
3. It is highly scalable.
4. It can process and send millions of messages per second to several receivers.
• Web site: page views, clicks, searches, …
• IoT: sensor readings, …
and so on.
If you have lots of small applications or devices, running in unwired or unstable environments, exchanging messages in real time on numerous different channels/topics — use MQTT.
There are two things that make it quite obvious to combine the two technologies:
• the communication structure in topics, and
• the publish/subscribe message exchange pattern.
But in which scenarios would you use both Kafka and MQTT together? Let's see in the further slides.
Use Case: Why using both MQTT and Kafka?
The most popular use case is probably the integration of MQTT devices with backend
applications for monitoring, control, or analytics running in the companies' data centers
or the cloud.
Imagine you want to send data from different IoT devices to a backend application for machine-learning-based pattern recognition or analytics. At the same time, the backend application should send back messages to control the IoT device based on the central insights (e.g. send control messages to prevent a device from overheating, …).
Consequently, MQTT and Kafka are a perfect combination for end-to-end IoT
integration from the edge to the business applications and data centers.
The IoT/edge devices can connect to the MQTT broker via MQTT protocol (with all the
advantages it has in these environments).
The messages are then forwarded to Kafka to distribute them into the subscribing
business applications and the other way around.
4. Connect MQTT Broker to Kafka via Kafka Connect
5. Connect MQTT Broker to Kafka via MQTT Broker extension
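For option 4 above (connecting the MQTT broker to Kafka via Kafka Connect), one hedged sketch is to register an MQTT source connector through the Kafka Connect REST API. The connector class and configuration keys below follow the Confluent MQTT source connector; the broker URI, topics, and names are placeholders and should be adapted to the connector actually deployed:
Python
# Register a (hypothetical) MQTT source connector with Kafka Connect's REST API.
import json, requests

connector = {
    "name": "mqtt-source",
    "config": {
        "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
        "mqtt.server.uri": "tcp://broker.example.com:1883",
        "mqtt.topics": "vehicles/+/telemetry",
        "kafka.topic": "iot-telemetry",
        "tasks.max": "1",
    },
}
resp = requests.post("http://localhost:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()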
The IoT device can publish two messages — one to the topic of the MQTT broker and a second one to the topic of the Kafka broker. This has several drawbacks:
• The IoT device needs to check the delivery guarantees of both protocols, and it must be ensured that the message is received by both or not at all. A lot of investment in error handling must be done.
• Additionally, most IoT devices are lightweight. Sending two messages with two different protocols is a huge overhead. Most IoT devices might not even have the possibility to connect to Kafka natively.
• Kafka is not designed to handle a massive number of different topics with millions of different devices. A full-blown IoT scenario with this integration option could lead to issues on the Kafka broker side.
… direction.
In this context resilience and fault tolerance are very important, but hard to achieve, especially if an end-to-end guarantee of at-least-once or exactly-once message delivery is required. The custom bridge application can only acknowledge the MQTT receipt if it successfully forwarded the message to the Kafka broker, or it needs to buffer the messages in case something goes wrong. A significant development effort is necessary for error handling and functionality similar to technology already found in the Kafka and/or MQTT broker.
If the only requirement is to persist MQTT messages or integrate them with legacy systems, this option could be a good fit. In this case, the Confluent Kafka MQTT proxy can be used by the IoT devices to directly publish the messages to Kafka. An MQTT broker would be additional overhead and would simply be removed from the picture.
Some MQTT brokers also offer a direct integration between the MQTT broker and Kafka by extending their brokers with a native Kafka protocol.
The processes that publish messages into a topic in Kafka are known as producers.
The processes that receive the messages from a topic in Kafka are known as consumers.
The processes or servers within Kafka that process the messages are known as brokers.
A Kafka cluster consists of a set of brokers that process the messages.
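A minimal producer/consumer sketch of these roles, assuming the kafka-python client library (broker address, topic, and group name are placeholders):
Python
# Producer/consumer sketch using kafka-python (pip install kafka-python).
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes messages into a topic on a broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("iot-telemetry", key=b"car42", value=b'{"speed_kmh": 63}')
producer.flush()

# Consumer: receives messages from the topic as part of a consumer group.
consumer = KafkaConsumer("iot-telemetry",
                         bootstrap_servers="localhost:9092",
                         group_id="dashboard",
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.partition, record.offset, record.value)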
A partition is also known as a commit log.
Each partition contains an ordered set of messages.
Each message is identified by its offset in the partition.
Messages are added at one end of the partition and consumed at the other.
The partitions of a topic can be distributed across multiple servers.
A topic can have any number of partitions.
Each partition should fit on a single Kafka server.
The number of partitions decides the parallelism of the topic.
The leader controls the reads and writes for the partition, whereas the followers replicate the data.
If a leader fails, one of the followers automatically becomes the leader.
ZooKeeper is used for the leader selection.
Topics should already exist before a message is placed by the producer.
Messages are added at one end of the partition.
The consumers specify what topics they want to listen to.
A message published to a topic is delivered to every subscribing consumer group, but to only one consumer within each group.
Consumer groups are thus used to control the messaging semantics (queue vs. publish-subscribe).
• Messages in a partition are appended and consumed in the same order.
• Each partition acts as a message queue.
• Consumers are divided into consumer groups.
• Each message is delivered to one consumer in each consumer group.
• ZooKeeper is used for coordination.
• Brokers coordinate among each other using ZooKeeper.
• One broker acts as the leader for a partition and handles the delivery and persistence, whereas the others act as followers.
Messages sent to a partition are appended in the same order.
The leader propagates the writes to the followers.
The leader waits until the writes are completed on all the replicas.
If a replica is down, it is skipped for the write until it comes back.
If the leader fails, one of the followers will be chosen as the new leader; this mechanism can tolerate n-1 failures if the replication factor is n.
Persistence in Kafka
Kafka uses the Linux file system for persistence of messages.
Persistence ensures no messages are lost.
Kafka relies on the file system page cache for fast reads and writes.
All the data is immediately written to a file in the file system.
Messages are grouped as message sets for more efficient writes.
2. Kafka Connect: A framework to import event streams from other source data systems into Kafka and export event streams from Kafka to destination data systems.
3. Kafka Streams: A Java library to process event streams live as they occur.
o Source code: [Link]
o Kafka Streams Java docs: [Link]
The Kafka data model consists of messages and topics.
The Kafka architecture consists of brokers that take messages from the producers and add them to a partition of a topic.
The Kafka architecture supports two types of messaging system, called publish-subscribe and queue systems.
Brokers are the Kafka processes that process the messages in Kafka.
Introduction to Edge Data Center for IoT platform
Dr. Rajiv Misra, Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
This lecture recapitulates Cloud Computing and also focuses on the aspects i.e. Why Clouds, What is a Cloud, What's new in today's Clouds, and also distinguishes Cloud Computing from the previous generation of distributed systems.
IoT (IoT-as-a-SaaS platform is a key driver of public cloud): Harnessing signals from sensors and devices, managed centrally by the cloud.
Edge (IoT realizes not everything needs to be in the cloud): Intelligence offloaded from the cloud to IoT devices.
ML (rise of AI; ML models trained in the cloud are deployed at the edge to make inferences for predictive analytics): Breakthrough intelligence capabilities, in the cloud and on the edge.
Current State of Cloud:
• Highly centralized set of resources; resembles Client/Server computing
• Compute is going beyond VMs as containers become mainstream
• Storage is complemented by CDN and is replicated and cached at edge locations
• Network stack is programmable (SDN), enabling hybrid scenarios
Edge Computing
● Edge mimics public cloud platform capabilities
● Delivers storage, compute, and network services locally
● Reduces the latency by avoiding the round trip to the cloud
● Brings in data sovereignty by keeping data where it actually belongs, saving on cloud and bandwidth usage
Functionality of Edge Computing for IOT
• Containers
• Distributed Computing
• NoSQL/Time-Series Database
• Stream Processing
• ML Models
… service and application delivery from online providers. Also, some M2M devices, such as autonomous vehicles, will require real-time communications with local processing resources to guarantee safety.
[Figure: connected devices communicating with the cloud.]
Today's IP networks cannot handle the high-speed data transmissions that tomorrow's connected devices will require. In a traditional IP architecture, data must often travel hundreds of miles over a network between end users or devices and cloud resources. This results in latency, or slow delivery of time-sensitive data.
Edge computing moves processing from a centralized data center to the users and devices. Edge computing relies on a distributed data center architecture, in which IT cloud servers housed in edge data centers are deployed on the outer edges of a network. By bringing IT resources closer to the end users and/or devices they serve, we can achieve high-speed, low-latency processing of applications and data.
1. Latency: edge data centers process data close to its source, avoiding congested network segments where data flow is impaired
2. Bandwidth: edge data centers process data locally, reducing the volume of traffic flowing to and from central servers. In turn, greater bandwidth across the user's broader network becomes available, which improves overall network performance
3.
Operating Cost: because edge data centers reduce the volume of traffic flowing to
and from central servers, they inherently reduce the cost of data transmission and
routing, which is important for high-bandwidth applications. More specifically, edge
data centers lessen the number of necessary high-cost circuits and interconnection
hubs leading back to regional or cloud data centers, by moving compute and storage
closer to end users
4. Security: edge data centers enhance security by: i) reducing the amount of sensitive
data transmitted, ii) limiting the amount of data stored in any individual location,
given their decentralized architecture, and iii) decreasing broader network
vulnerabilities, because breaches can be ring-fenced to the portion of the network that
they compromise
An edge data center is a smaller facility located close to the population it serves, in which client data is processed as close to the originating source as
possible. Because the smaller data centers are positioned close to
the end users, they are used to deliver fast services with minimal
latency.
In an edge computing architecture, time-sensitive data may be
processed at the point of origin by an intermediary server that is
located in close geographical proximity to the client. The point is
to provide the quickest content delivery to an end device that may
need it, with as little latency as possible. Data that is less time-
sensitive can be sent to a larger data center for historical analysis,
big data analytics and long-term storage. Edge data centers work
off of the same concept, except instead of just having one
intermediary server in close geographical proximity to the client,
it's a small data center -- that can be as small as a box. Even
though it is not a new concept, edge data center is still a relatively
new term.
Edge data centers typically connect back to a larger, central data center or multiple other edge data centers.
Data is processed as close to the end user as possible, while less integral or time-critical data can be sent to a central data center for processing. This allows an organization to reduce latency.
… would be used if data generated by devices needs more processing but is also too time-sensitive to be sent to a centralized server.
3. Healthcare: Some medical equipment, such as that used for robotic surgeries, requires extremely low latency and network consistency, which edge data centers can provide.
4. Autonomous vehicles: Edge data centers can be used to help collect, process and share data between vehicles and other networks, which also relies on low latency. A network of edge data centers can be used to collect data for auto manufacturers and emergency response services.
5. Smart factories: Edge data centers can be used for machine predictive maintenance, as well as predictive quality management. They can also be used for efficiency regarding robotics used within inventory management.
… platform, network connectivity, and application workload.
Edge computing uses multiple computers at the network edge to solve large-scale problems locally and over the Internet. Thus, distributed edge computing becomes data-intensive and network-centric.
The emergence of distributed edge computing clouds demands high-throughput computing (HTC) systems built with distributed computing technologies.
High-throughput computing (HTC) appears as computer clusters, service-oriented architectures, computational grids, peer-to-peer networks, Internet clouds and edge, and the future Internet of Things.
IDC in 2009: "Spending on IT cloud services will triple in the next 5 years, reaching $42 billion."
Forrester in 2010: Cloud computing will go from $40.7 billion in 2010 to $241 billion in 2020.
Companies and even federal/state governments are using cloud computing now: [Link]
– EBS: Elastic Block Storage
• Microsoft Azure
• Google Compute Engine/AppEngine
• Rightscale, Salesforce, EMC, Gigaspaces, 10gen, Datastax, Oracle, VMWare, Yahoo, Cloudera
• And 100s more…
datasets, pay per GB-month stored
"With Online Services, reduce the IT operational costs by roughly 30% of spending."
"A private cloud of virtual servers inside its datacenter has saved nearly crores of rupees annually, because the company can share computing power and storage resources across servers."
100s of startups can harness large computing resources without buying their own machines.
History:
In 1984, John Gage of Sun Microsystems gave the slogan, "The network is the computer."
In 2008, David Patterson of UC Berkeley said, "The data center is the computer."
Recently, Rajkumar Buyya of Melbourne University simply said: "The cloud is the computer."
Some people view clouds as grids or clusters with changes through virtualization, since clouds are anticipated to process huge data sets generated by the traditional Internet, social networks, and the future IoT.
Storage (backend) nodes connected to the network
Front-end for submitting jobs and receiving client requests
(Often called a "three-tier architecture")
Software services
A geographically distributed cloud consists of multiple such sites, each site perhaps with a different structure and services.
Cloud computing: Clouds can be built with physical or virtualized resources over large data centers that are distributed systems. Cloud computing is also considered to be a form of utility computing or service computing.
[Figure: timeline (1940s-2012) of the data processing industry: supercomputers, server farms (e.g., Oceano), clusters, grids, peer-to-peer systems (90s-00s), and PCs (not distributed!), leading to clouds.]
Moore's law indicates that processor speed has doubled roughly every 18 months.
Gilder's law indicates that network bandwidth has doubled each year in the past.
In 1965, MIT's Fernando Corbató of the Multics operating system envisioned a computer facility operating "like a power company or water company".
Plug your thin client into the computing utility and play your intensive compute & communicate application.
Utility computing focuses on a business model in which customers receive computing resources from a paid service provider.
All grid/cloud platforms are regarded as utility service providers.
Features of Today’s Clouds
I. Massive scale: Very large data centers, contain tens of thousands
sometimes hundreds of thousands of servers and you can run your
computation across as many servers as you want and as many servers
as your application will scale.
II. On-demand access: Pay-as-you-go, no upfront commitment.
– And anyone can access it
III. Data-intensive nature: What was MBs has now become TBs, PBs and XBs.
– Daily logs, forensics, Web data, etc.
IV. New cloud programming paradigms: MapReduce/Hadoop, NoSQL/Cassandra/MongoDB and many others.
– 100K machines, split into clusters of 4000
• AWS EC2 [Randy Bias, 2009]
– 40K machines
– 8 cores/machine
• eBay [2012]: 50K machines
• HP [2012]: 380K in 180 DCs
• Google: A lot
What does a datacenter look like from inside?
Lots of servers.
[Figure: datacenter cooling, with off-site and on-site components; air is sucked in and cool air is moved through the system.]
HaaS: Hardware as a Service
Get access to barebones hardware machines and do whatever you want with them. Ex: your own cluster.
Not always a good idea because of security risks.
IaaS: Infrastructure as a Service
Get access to flexible computing and storage infrastructure. Virtualization is one way of achieving this; IaaS subsumes HaaS.
Ex: Amazon Web Services (AWS: EC2 and S3), OpenStack, Eucalyptus, Rightscale, Microsoft Azure, Google Cloud.
PaaS: Platform as a Service
Get access to a software platform on top of flexible computing and storage infrastructure.
Ex: Google's AppEngine (Python, Java, Go).
SaaS: Software as a Service
Get access to software services when you need them; SaaS subsumes SOA (Service Oriented Architectures).
Ex: Google docs, MS Office on demand.
Data-Intensive Computing
Typically store data at datacenters.
Use compute nodes nearby.
Compute nodes run computation services.
In data-intensive computing, the focus shifts from computation to the data: CPU utilization is no longer the most important resource metric; instead, I/O is (disk and/or network).
Google (MapReduce)
• Indexing: a chain of 24 MapReduce jobs
• ~200K jobs processing 50 PB/month (in 2006)
Yahoo! (Hadoop + Pig)
• WebMap: a chain of several MapReduce jobs
• 300 TB of data, 10K cores, many tens of hours (~2008)
Facebook (Hadoop + Hive)
• ~300 TB total, adding 2 TB/day (in 2008)
• 3K jobs processing 55 TB/day
NoSQL: MySQL is an industry standard, but Cassandra is 2400 times faster (on write-heavy workloads).
Examples of popular vendors for creating private clouds are VMware, Microsoft Azure, Eucalyptus, etc.
Public clouds provide service to any paying customer.
– Total = Storage + CPUs = $62K + $0.10 x 1024 x 24 x 30 ~ $136K
• Own: monthly cost
… systems
• Characteristics of the cloud computing problem
– Scale, on-demand access, data-intensive, new programming
But distributed.
Joins infrequent
Tables
“Column families” in Cassandra, “Table” in HBase,
“Collection” in MongoDB
Like RDBMS tables, but …
May be unstructured: May not have schemas
• Some columns may be missing from some rows
Don’t always support joins or have foreign keys
Can have index tables, just like RDBMSs
[Figure: Cassandra ring with nodes (e.g., N80, N45); a client contacts a coordinator node, and backup replicas are kept for key K13.]
Cassandra uses a ring-based DHT but without finger tables or routing.
The key→server mapping is the "Partitioner".
Data Placement Strategies
Replication Strategy:
1. SimpleStrategy
2. NetworkTopologyStrategy
1. SimpleStrategy: uses the Partitioner, of which there are two kinds
1. RandomPartitioner: Chord-like hash partitioning
2. ByteOrderedPartitioner: Assigns ranges of keys to servers.
• Easier for range queries (e.g., Get me all twitter users starting
with [a-b])
2. NetworkTopologyStrategy: for multi-DC deployments
Two replicas per DC
Three replicas per DC
Per DC
• First replica placed according to Partitioner
• Then go clockwise around ring until you hit a different rack
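A minimal sketch of what the two replication strategies look like in practice, assuming the Python cassandra-driver and placeholder keyspace and datacenter names:
Python
# Creating keyspaces with the two replication strategies (pip install cassandra-driver).
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# SimpleStrategy: replica placement driven only by the Partitioner.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_simple
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# NetworkTopologyStrategy: per-datacenter replica counts for multi-DC deployments.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_multidc
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2, 'DC2': 3}
""")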
Snitches
Maps: IPs to racks and DCs. Configured in [Link]
config file
Some options:
SimpleSnitch: Unaware of Topology (Rack-unaware)
RackInferring: Assumes topology of network by octet of
server’s IP address
• [Link] = x.<DC octet>.<rack octet>.<node octet>
PropertyFileSnitch: uses a config file
EC2Snitch: uses EC2.
• EC2 Region = DC
• Availability zone = rack
Other snitch options available
Cluster membership uses a gossip-style protocol:
• Nodes periodically gossip their membership list
• On receipt, the local membership list is updated, as shown (clocks are asynchronous)
• If any heartbeat is older than Tfail, the node is marked as failed
[Figure: example membership lists with heartbeat counters and the current local time at each node.]
Suspicion Mechanisms in Cassandra
Suspicion mechanisms to adaptively set the timeout based on
underlying network and failure behavior
Accrual detector: Failure Detector outputs a value (PHI)
representing suspicion
Applications set an appropriate threshold
PHI calculation for a member
Inter-arrival times for gossip messages
PHI(t) = -log10( CDF(t_now - t_last) ), where the CDF is estimated from gossip inter-arrival times
PHI basically determines the detection timeout, but takes
into account historical inter-arrival time variations for
gossiped heartbeats
In practice, PHI = 5 => 10-15 sec detection time
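A small sketch of the PHI computation, assuming (as a simplification, not from the source) that inter-arrival times are modeled with an exponential distribution:
Python
# Accrual (PHI) failure detector sketch with an exponential inter-arrival model.
import math

def phi(t_now, t_last, mean_interarrival):
    """Suspicion level for a member whose last heartbeat arrived at t_last."""
    elapsed = t_now - t_last
    # Probability that the next heartbeat arrives even later than 'elapsed'.
    p_later = math.exp(-elapsed / mean_interarrival)
    return -math.log10(p_later)

# Example: with a 1 s mean gossip interval, 12 s of silence gives PHI ~ 5.2,
# so a threshold of 5 would mark the member as failed.
print(phi(t_now=112.0, t_last=100.0, mean_interarrival=1.0))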
Cassandra Vs. RDBMS
MySQL is one of the most popular (and has been for a
while)
On > 50 GB data
MySQL
Writes 300 ms avg
Reads 350 ms avg
Cassandra
Writes 0.12 ms avg
Reads 15 ms avg
Orders of magnitude faster
What’s the catch? What did we lose?
CAP Theorem
Cassandra
Eventual (weak) consistency, Availability, Partition-
tolerance
Traditional RDBMSs
Strong consistency over availability under a partition
Other consistency models in the spectrum include Red-Blue, Causal, and Probabilistic consistency.
Dr. Rajiv Misra
Professor, Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
After completion of this lecture you will know the following:
● Concepts of AWS IoT Core
● Understanding of AWS Greengrass
● Event-driven architecture with sensor data in AWS IoT
Recapitulate: Traditional IoT platform
Cloud: Globally available, unlimited compute resources
IoT: Harnessing signals from sensors and devices, managed centrally by the cloud
Edge: Intelligence offloaded from the cloud to IoT devices
ML: Breakthrough intelligence capabilities, in the cloud and on the edge
AWS IoT: Introduction
AWS IoT started in 2015 with Amazon acquiring a company called 2lemetry.
It started with several cloud services with very simple IoT device management and M2M.
Now it has been expanded significantly.
The AWS IoT architecture consists of three different layers:
● Things
● Cloud
● Intelligence
AWS IoT Architecture: Services Suite
AWS IoT Architecture: Things
● Things are the devices at the edge which actually sense data and act.
● Amazon offers a couple of products for this layer.
● The first one is Amazon FreeRTOS, which is a real-time operating system that can run on top of a microcontroller with 64 KB of memory or more.
● Then there is AWS Greengrass, which is the edge computing software that acts as an interface to the local devices running either Amazon FreeRTOS or the AWS IoT Device SDK.
AWS IoT Architecture: Cloud
When it comes to the cloud there are two important aspects:
The first one is AWS IoT Core and, as the name suggests, it is the core building block of the AWS IoT platform and is responsible for registering the devices, so it acts as the device registry.
It also exposes endpoints for MQTT, WebSockets and HTTP for the devices to talk to each other and to talk to the cloud, and it is also the touch point for applications that want to control the devices running in the field.
AWS IoT Core acts as an interface between the applications, for example …
It also maintains a highly secure footprint of all the devices, and if there is any anomaly it raises an alert; that is the fleet audit or protection service.
Finally, AWS IoT Analytics is an analytics solution responsible for analyzing trends, visualizing them, and from there feeding more powerful systems like QuickSight or Redshift and so on.
AWS IoT Core: Building Blocks
AWS IoT Core is all about connecting devices to the cloud; the moment you bring in your first device, you need to talk to AWS IoT Core.
The workflow is very straightforward: you register your device with AWS IoT Core, and that is going to act as the digital identity of your device.
The moment you register a device you receive a set of credentials for the device, and you embed those credentials in the device. Once the device has those credentials and it connects to the cloud, it gets authenticated and authorized, and it shows up in the device registry.
The device could be running a microcontroller, a single-board computer, a slightly more powerful machine that can talk to Modbus or CAN bus internally, or even an automobile device like a car.
After that it can send messages to the cloud and it can receive commands from the cloud.
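As a rough illustration of this workflow, a registered device with its certificate and private key might connect and publish telemetry using the AWS IoT Device SDK for Python (v1, AWSIoTPythonSDK); the endpoint, file paths, and topic below are placeholders:
Python
# Sketch: connect a registered device to AWS IoT Core and publish telemetry.
import json
from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTClient

client = AWSIoTMQTTClient("car42")                       # thing name / client id
client.configureEndpoint("xxxx-ats.iot.us-east-1.amazonaws.com", 8883)  # placeholder
client.configureCredentials("root-CA.crt", "car42.private.key", "car42.cert.pem")

client.connect()                                          # device is authenticated & authorized
client.publish("vehicles/car42/telemetry",
               json.dumps({"speed_kmh": 63, "battery": 0.81}),
               1)                                         # QoS 1
client.disconnect()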
AWS IoT Core: Building Blocks
When you zoom into AWS IoT Core, the first building block is all about authentication and authorization, and the second one is the device gateway, which is the cloud endpoint for talking to IoT Core.
The message broker is based on MQTT, WebSockets and HTTP for publishing and subscribing to messages or feeding data from the device to the cloud; it is predominantly used for communication between devices and the cloud to send metadata or telemetry and to receive settings or commands.
There is a rules engine which decides how the messages will flow into the rest of the system; the rules engine is ANSI SQL compliant.
The device shadow is the digital twin or digital identity of the physical device, and all the changes that are made to the device first get applied to the device shadow and then get propagated all the way to the device. When the device state changes it automatically gets synchronized with the device shadow. It acts as the buffer between the desired state and the current state.
The job of AWS IoT Core is to make sure that the desired configuration matches the current configuration.
The device registry is a huge database repository, meant for the devices or things that you connect to AWS IoT.
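Because the rules engine is ANSI-SQL compliant, a rule can be registered programmatically; a hedged boto3 sketch (the rule name, topic filter, stream, and IAM role are placeholders, and the WHERE clause is only illustrative) might look like:
Python
# Sketch: create an AWS IoT rule that forwards hot readings to a Kinesis stream.
import boto3

iot = boto3.client("iot")
iot.create_topic_rule(
    ruleName="forward_hot_readings",
    topicRulePayload={
        "sql": "SELECT temperature, deviceId FROM 'factory/+/telemetry' WHERE temperature > 50",
        "actions": [{
            "kinesis": {
                "streamName": "hot-readings",
                "roleArn": "arn:aws:iam::123456789012:role/iot-to-kinesis"  # placeholder role
            }
        }],
        "ruleDisabled": False,
    },
)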
AWS IoT Core: Summary
To put things in perspective, for using the multiple building blocks of AWS IoT there is the Device SDK, which is supported in a variety of languages like [Link], Python, C, and Java; the SDK is used to connect your device to the cloud.
The first touch point is authentication and authorization, then the device gateway for communication, and further it goes to the rules engine and the device shadow, which maintains a replica of the device state.
The rules engine is responsible for extending the IoT platform to the rest of the AWS services like DynamoDB, Neptune, Redshift, AWS SageMaker, and to third-party services.
AWS Greengrass: Building Blocks
AWS Greengrass extends AWS IoT to your devices so they can act locally, and the data that they generate is filtered before it is sent to the cloud.
Like AWS IoT Core, there is a message broker built into Greengrass so devices can continue to talk to each other. There is a compute layer, based on Lambda, to write functions that run locally and are triggered when a specific condition is met; these triggers fire Lambda functions that perform an action.
Greengrass also keeps data and state synchronized with the cloud with the help of local device shadows and the cloud device shadow. If something is updated locally, it first gets written to the device shadow running on the edge and then it eventually gets synchronized with the cloud.
Greengrass provides local resource access. For example, if you want to talk to a local database which already has some metadata or asset-tracking information, you can query that directly, and talk to the file system, databases, or anything that is accessible within the network.
AWS Greengrass: Building Blocks
The most recent feature of Greengrass is the ability to run machine learning inferencing on the edge, and this is one of the key drivers, because there are three aspects when it comes to IoT.
The first one is the learning part, which happens in the cloud, where you train machine learning models. Then you have decision-making that takes place at the edge, where fully trained machine learning models are used and make decisions on behalf of the cloud. Finally there is the action phase, which is directly performed by the devices.
For example, a machine learning model trained in the cloud to find an anomaly is deployed on the edge, and because an anomaly is found with a very critical device, the machine learning model decides that one of the other pieces of equipment needs to be shut down. That decision results in an action where an actuator, a relay, or another interface physically shuts down the malfunctioning or vulnerable machine to avoid any eventuality or fatalities.
Thus the learn, decide and act cycle happens across the cloud, edge and devices, with the decision part run locally by ML inferencing.
AWS Greengrass Group: Cloud Capabilities to the Edge
AWS IoT Greengrass Group: An AWS IoT Greengrass group is a collection of settings and components, such as an AWS IoT Greengrass core, devices, and subscriptions. Groups are used to define a scope of interaction. For example, a group might represent one floor of a building, one truck, or an entire mining site. Since the group acts as the logical boundary for all the devices, it enforces consistent configuration and policies on all the entities.
AWS IoT Greengrass Core: This is just a device in the AWS IoT Core registry that doubles up as an edge device. It is an x86 or ARM computing device running the Greengrass runtime. Local devices talk to the Core similar to the way they interact with AWS IoT Core.
AWS IoT Devices: These are the devices that are a part of the Greengrass group. Once devices become a part of the group, they automatically discover the Core to continue the communication. Each device has a unique identity and runs the AWS IoT Device SDK. Existing devices can be added to a Greengrass group.
AWS Greengrass Group: Cloud Capabilities to the Edge
Lambda Functions: As discussed earlier, Lambda provides the local compute capabilities for AWS IoT Greengrass. Each function running within the Core uses the Greengrass SDK to interact with the resources and devices. Lambda functions can be customized to run within the Greengrass sandbox container or directly as a process within the device OS.
Subscriptions: A subscription defines the publishers and subscribers that exchange messages. For example, a Lambda function may publish messages to a topic to which a device is subscribed. Subscriptions eliminate the strong dependency between publishers and consumers by effectively decoupling them.
Connectors: AWS IoT Greengrass Connectors allow developers to easily build complex workflows on AWS IoT Greengrass without having to worry about understanding device protocols, managing credentials, or interacting with external APIs. Based on a declarative mechanism, Connectors extend the edge computing scenarios to 3rd-party environments and services. Connectors rely on Secrets for maintaining the API keys, passwords, and credentials needed by external services.
ML Inferencing: This is one of the recent additions to AWS IoT Greengrass. The trained model is first uploaded to an Amazon S3 bucket and then downloaded locally. A Lambda function responsible for inferencing the inbound data stream publishes the predictions to an MQTT topic after loading the local model. Since Python is a first-class citizen in Lambda, many existing modules and libraries can be used to perform ML inferencing at the edge.
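A hedged sketch of such an inference Lambda on Greengrass (v1 Core SDK); the topic and the predict helper are placeholders, and the "model" is a trivial stand-in rather than a real ML resource:
Python
# Greengrass (v1) Lambda sketch: run inference on incoming telemetry and publish
# the prediction to an MQTT topic.
import json
import greengrasssdk

iot_client = greengrasssdk.client("iot-data")

def predict(features):
    # Placeholder for loading/invoking the real model downloaded as a local ML resource.
    return 1.0 if features.get("temperature", 0) > 50 else 0.0

def function_handler(event, context):
    # 'event' carries the device telemetry delivered via a Greengrass subscription.
    prediction = predict(event)
    iot_client.publish(topic="factory/predictions",
                       payload=json.dumps({"prediction": prediction}))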
AWS IoT: Event-driven architecture with sensor data
AWS IoT: Event-driven architecture with sensor data
Phase 1:
● Data originates in IoT devices such as medical devices, car sensors, industrial IoT sensors.
● This telemetry data is collected using AWS IoT Greengrass, an open-source IoT edge runtime and
cloud service that helps your devices collect and analyze data closer to where the data is generated.
● When an event arrives, AWS IoT Greengrass reacts autonomously to local events, filters and
aggregates device data, then communicates securely with the cloud and other local devices in your
network to send the data.
Phase 2:
●
Event data is ingested into the cloud using edge-to-cloud interface services such as AWS IoT Core,
a managed cloud platform that connects, manages, and scales devices easily and securely.
AWS IoT Core interacts with cloud applications and other devices.
● You can also use AWS IoT SiteWise, a managed service that helps you collect, model, analyze, and
visualize data from industrial equipment at scale.
AWS IoT: Event-driven architecture with sensor data
Phase 3:
● AWS IoT Core can directly stream ingested data into Amazon Kinesis Data Streams.
● The ingested data gets transformed and analyzed in near real time using Amazon
Kinesis Data Analytics with Apache Flink and Apache Beam frameworks.
● Stream data can further be enriched using lookup data hosted in a data warehouse such
as Amazon Redshift.
Phase 4:
● Amazon Kinesis Data Analytics can persist SQL results to Amazon Redshift after the customer's integration and stream aggregation (for example, one minute or five minutes).
● The results in Amazon Redshift can be used for further downstream business
intelligence (BI) reporting services, such as Amazon QuickSight.
● Amazon Kinesis Data Analytics can also write to an AWS Lambda function, which can
invoke Amazon SageMaker models.
● Amazon SageMaker is the most complete, end-to-end service for machine learning.
AWS IoT: Event-driven architecture with sensor data
Phase 5:
● Once the ML model is trained and deployed in SageMaker, inferences are invoked in a
micro batch using AWS Lambda.
● Inferenced data is sent to Amazon OpenSearch Service to create personalized
monitoring dashboards using Amazon OpenSearch Service dashboards.
● The transformed IoT sensor data can be stored in Amazon DynamoDB.
● Customers can use AWS AppSync to provide near real-time data queries to API
services for downstream applications.
● These enterprise applications can be mobile apps or business applications to track and
monitor the IoT sensor data in near real-time.
● Amazon Kinesis Data Analytics can write to an Amazon Kinesis Data Firehose stream,
which is a fully managed service for delivering near real-time streaming data to
destinations like Amazon Simple Storage Service (Amazon S3), Amazon Redshift,
Amazon OpenSearch Service, Splunk, and any custom HTTP endpoints or endpoints
owned by supported third-party service providers, including Datadog, Dynatrace,
LogicMonitor, MongoDB, New Relic, and Sumo Logic.
Use Case: Greengrass Machine Learning Inference
This use case describes the steps in setting up Greengrass Machine Learning Inference, using the Greengrass Image Classification ML Connector with a model trained with Amazon SageMaker, and the Greengrass ML Feedback connector to send data back to AWS for model retraining or prediction performance analysis.
Use Case: Greengrass Machine Learning Inference
The common design patterns of using Greengrass Connectors:
1. Create an Amazon SageMaker training job to create the model. When the Greengrass configuration is being deployed, the Greengrass Core will download the model from the Amazon SageMaker training job as a local machine learning resource.
2. Data acquisition - This function periodically acquires the raw data inputs from an image source. In this example, static images are used to simulate image sources.
3. Data preprocessor - This function pre-processes the image by resizing it to match the images used to train the model.
4. Estimator - This function runs prediction on the data input with the connector via IPC.
5. Greengrass ML Image Classification Connector - The Connector loads the model from the local Greengrass resource and invokes the model.
6. The process handles the prediction result, with the object detected and a confidence level.
7. The result can be used to trigger an action, or be sent back to the cloud for further processing.
8. Greengrass ML Feedback Connector - sends field data back to AWS according to the sampling strategy configured.
9. Greengrass ML Feedback Connector sends unlabeled data to AWS.
10. Unlabeled data can be labeled using Amazon SageMaker Ground Truth, and the labeled data can be used to retrain the model.
11. Greengrass ML Feedback Connector sends prediction performance data which can be used for real-time performance analysis.
Use Case Greengrass ML Inference: Deployment
The main steps for deployment are:
1. Prerequisites. Ensure there is an AWS IoT certificate and private key created and accessible locally for use.
2. Train the ML model. We will use an example notebook from Amazon SageMaker to train the model with the Image Classification Algorithm provided by Amazon SageMaker.
3. Generate and launch the CloudFormation stack. This will create the Lambda functions, the Greengrass resources, and an AWS IoT thing to be used as the Greengrass Core. The certificate will be associated with the newly created Thing. At the end, a Greengrass deployment will be created and ready to be pushed to the Greengrass Core hardware.
4. Create the [Link] file, using the outputs from the CloudFormation stack. Then place all files into the /greengrass/certs and /greengrass/config directories.
5. Deploy to Greengrass. From the AWS Console, perform a Greengrass deployment that will push all resources to the Greengrass Core and start the MLI operations.
Use Case Greengrass ML Inference: Deployment
Prerequisites:
● AWS Cloud. Ensure you have an AWS user account with permissions to manage iot, greengrass, lambda, cloudwatch, and other services during the deployment of the CloudFormation stack.
● Local Environment. Ensure a recent version of the AWS CLI is installed and a user profile with the permissions mentioned above is available for use.
The model is trained on the Caltech-256 dataset:
● Select Create notebook instance
● Enter a name in Notebook instance name, such as greengrass-connector-training
● Use the default [Link] instance type
● Leave all default options and select Create notebook instance
● Wait for the instance status to be InService, and select Open Jupyter
● Select the SageMaker Examples tab, expand Sagemaker Neo Compilation Jobs, Image-classification-fulltraining-highlevel-[Link], and select Use
● Keep the default option for the file name and select Create copy
Use Case Greengrass ML Inference: Deployment
● In the notebook, configure the hyper-parameters and add the additional use_pretrained_model=1. Details of the hyperparameters can be found in the Amazon SageMaker Developer Guide - Image Classification Hyperparameters.
● We will also set the prefix for our training job so that the CloudFormation Custom Resources are able to get the latest training job. Configure a base_job_name in the [Link]. Locate the cell that initializes the [Link] and add the base_job_name, for example, using greengrass-connector as the prefix. You will need this name prefix when creating the stack.
● Add a cell below the cell that does the training [Link]() and put the command ic.latest_training_job.name in the empty cell. This will give you the name of the training job, which you can verify to make sure the CloudFormation stack picks up the correct job.
● Select Cell from the notebook menu and Run All.
Use Case Greengrass ML Inference: Deployment
The Lambda functions are first packaged, and then the CloudFormation stack is launched from the Template. Follow the steps below to create the package via the command line, and then launch the stack via the CLI or AWS Console.
The CloudFormation template does most of the heavy lifting. Prior to running, each input template needs to be processed to an output template that is actually used. The package process uploads the Lambda functions to the S3 bucket and creates the output template with unique references to the uploaded assets.
Use Case Greengrass ML Inference: Deployment
With the stack deployed, we use one output from the CloudFormation stack, the GreengrassConfig value, along with the certificate and private key, to complete the [Link] so that the Greengrass Core can connect and authenticate.
Start the Greengrass Core: with the Greengrass configuration [Link] in place, start the Greengrass Core.
Use Case Greengrass ML Inference: Deployment
From the AWS Console for AWS IoT Greengrass, navigate to the Greengrass Group you created with the CloudFormation stack, and perform Actions -> Deploy to deploy to the Greengrass Core machine.
Use Case Greengrass ML Inference: Testing
To test out this accelerator without any hardware, you can install Greengrass on an EC2 instance to simulate a Greengrass Core:
1. Launch the EC2 instance using cfn/greengrass_core_on_ec2-s3_models.[Link]
2. Once the instance is created, copy the [Link] to the EC2 instance
3. In the EC2 instance, extract [Link] into the /greengrass folder using the command sudo unzip -o [Link] -d /greengrass
4. Restart the Greengrass daemon using the command sudo systemctl restart greengrass
Lecture Summary
● Layered architecture of AWS IoT
● Concepts of AWS IoT Core
● Understanding of AWS Greengrass
● Event-driven architecture with sensor data in AWS IoT
Introduction to Federated Learning at
IoT Edge
Dr. Rajiv Misra, Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
Federated Learning
Preface
● Current IoT scenarios
● Why there is a need to shift from centralized ML training to decentralized ML training of data
● Concepts of Federated Learning (i.e. Distributed ML)
● Several challenges of federated learning
Federated Learning
Current IoT Scenario
Explosion of the IoT Market
● McKinsey reported $11.1 trillion market value by 2025
● 14 billion connected devices - Bosch
● 5 billion connected devices - Cisco
● $309 billion IoT supplier revenue - Gartner
● $7.1 trillion IoT solutions revenue - IDC
A "deluge of data" was observed in 2020:
● 1.5 GB of traffic per day from the average internet user
● 3,000 GB per day - smart hospitals
● 4,000 GB per day - self-driving cars, EACH (radars ~ 10-100 kB per sec)
● 40,000 GB per day - connected aircraft
● 1,000,000 GB per day - connected factories
Federated Learning
Shift from Centralized to Decentralized data
● The standard setting in Machine Learning (ML) considers a centralized dataset processed in a tightly integrated system
● But in the real world, data is often decentralized across many IoT devices
● Sending the data to the Cloud for centralized ML may be too costly
  ○ Self-driving cars are expected to generate several TBs of data a day
  ○ Some wireless devices have limited bandwidth/power
● Data may sometimes be considered too sensitive, such as medical reports
  ○ We see a growing public awareness and regulations on data privacy
  ○ Keeping control of data can give a competitive advantage in business and research
Federated Learning
Federated Learning: Distributed ML
● 2016: the term FL was first coined by Google researchers; 2020: more than 1,000 papers on FL in the first half of the year (compared to just 180 in 2018)
● We have already seen some real-world deployments by companies and researchers for large-scale IoT devices
● Several open-source libraries are under development: PySyft, TensorFlow Federated, FATE, Flower, Substra...
● FL is highly multidisciplinary: it involves machine learning, numerical optimization, privacy & security, networks, systems, hardware...
Federated Learning
Federated Learning: Decentralised data
● Federated Learning (FL) aims to collaboratively train an ML model while keeping the data decentralized
● It enables devices to learn from each other (ML training is brought close to the data)
● There is a network of nodes, all connected to a central server; but instead of sharing data with the central server, we share the model: we don't send data from node to server, instead we send our model to the server
Federated Learning
Gradient Descent Procedure
The procedure starts off with initial values for the coefficient or coefficients of the function. These could be 0.0 or a small random value.
    coefficient = 0.0
The cost of the coefficients is evaluated by plugging them into the function and calculating the cost.
    cost = f(coefficient) or cost = evaluate(f(coefficient))
We need to know the slope so that we know the direction (sign) to move the coefficient values in order to get a lower cost on the next iteration.
    delta = derivative(cost)
We can now update the coefficient values. A learning rate parameter (alpha) must be specified that controls how much the coefficients can change on each update.
    coefficient = coefficient - (alpha * delta)
This process is repeated until the cost of the coefficients (cost) is 0.0 or close to 0.
It does require you to know the gradient of your cost function or the function you are optimizing.
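A minimal sketch of this procedure for a 1-D quadratic cost J(theta) = (theta - 3)^2, whose derivative is 2*(theta - 3); the learning rate and iteration count are illustrative:
Python
# Gradient-descent sketch for a single coefficient.
def gradient_descent(alpha=0.1, iterations=100):
    theta = 0.0                        # initial coefficient
    for _ in range(iterations):
        delta = 2 * (theta - 3.0)      # derivative of the cost at the current theta
        theta = theta - alpha * delta  # update rule: theta = theta - alpha * gradient
    return theta

print(gradient_descent())              # converges towards the minimizer theta = 3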
Federated Learning
Gradient Descent Algorithm
Gradient Descent
• Gradient Descent is the most basic but most used optimization algorithm. It's used heavily in linear regression and classification algorithms. Backpropagation in neural networks and Federated Learning also use a gradient descent algorithm.
• Gradient descent is a first-order optimization algorithm which depends on the first-order derivative of a loss function. It calculates which way the weights should be altered so that the function can reach a minimum. Through backpropagation, the loss is transferred from one layer to another, and the model's parameters, also known as weights, are modified depending on the losses so that the loss can be minimized.
Update rule: θ = θ - α⋅∇J(θ)
Advantages:
• Easy computation
• Easy to implement
• Easy to understand
The devices train the generic neural network model using the gradient descent algorithm, and the trained weights are sent back to the server. The server then takes the average of all such updates to return the final weights.
Federated Learning
Edge Computing ML: FL
• FL is a category of machine learning (ML) which moves the processing onto the edge nodes so that the clients' data privacy can be maintained. This approach is not only a precise algorithm but also a design framework for edge computing.
• Federated learning trains an ML algorithm with the local data samples distributed over multiple edge devices or servers without any exchange of data. This term was first introduced in 2016 by McMahan.
• Federated learning distributes deep learning by eliminating the necessity of pooling the data into a single place.
• In FL, the model is trained at different sites in numerous iterations. This method stands in contrast to other conventional techniques of ML, where the datasets are transferred to a single server, and to more traditional decentralized techniques that assume that local datasets are identically distributed.
Federated Learning
Edge Computing ML: FL
Finding the function: model training
[Figure: deep learning model training.]
Federated Learning
How is this aggregation applied? FedAvg Algo
Federated Learning
Example: FL with i.i.d. data
In FL, each client trains its model decentrally. In other words, the model training process is carried out separately for each client.
The locally trained models are then sent to a trusted center to combine and feed the aggregated main model. Then the trusted center sends the aggregated main model back to these clients, and this process is repeated.
A simple implementation with IID (independent and identically distributed) data shows how the parameters of hundreds of different models that are running on different nodes can be combined with the FedAvg method, and whether this model will give a reasonable result.
Federated Learning
Image Classifier using FedAvg
The MNIST data set does not contain an equal number of each label. Therefore, to fulfill the IID requirement, the dataset was grouped, shuffled, and then distributed so that each node contains an equal number of each label.
A simple 2-layer model can be used for the classification process with FedAvg.
Since the parameters of the main model and the parameters of all local models in the nodes are randomly initialized, all these parameters will be different from each other, so the main model sends its parameters to the nodes before the training of local models in the nodes begins.
Nodes start to train their local models over their own data using these parameters. Each node updates its parameters while training its own model. After the training process is completed, each node sends its parameters to the main model.
The main model takes the average of these parameters, sets them as its new weight parameters, and passes them back to the nodes for the next iteration.
The above flow is for one iteration. This iteration can be repeated over and over to improve the performance of the main model.
The accuracy of the centralized model was calculated as approximately 98%. The accuracy of the main model obtained by the FedAvg method started from 85% and improved to 94%.
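A minimal sketch of the FedAvg averaging step (plain NumPy, with a made-up one-step local update standing in for full local training of the 2-layer model described above):
Python
# FedAvg sketch: the server averages locally updated parameters, weighted by the
# number of samples at each node, and broadcasts the result back.
import numpy as np

def local_update(global_params, data, lr=0.1):
    # Stand-in for local training: one gradient step of a least-squares fit.
    X, y = data
    grad = X.T @ (X @ global_params - y) / len(y)
    return global_params - lr * grad

def fed_avg(global_params, node_datasets, rounds=50):
    for _ in range(rounds):
        updates, weights = [], []
        for data in node_datasets:                 # each node trains on its own data
            updates.append(local_update(global_params, data))
            weights.append(len(data[1]))
        # Weighted average of the node parameters becomes the new global model.
        global_params = np.average(updates, axis=0, weights=weights)
    return global_params

# Tiny example: three nodes, each with private samples of y = 2x.
rng = np.random.default_rng(0)
nodes = []
for _ in range(3):
    X = rng.normal(size=(20, 1))
    nodes.append((X, 2.0 * X[:, 0]))
print(fed_avg(np.zeros(1), nodes))                 # approaches [2.0]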
Federated Learning
Apple personalizes Siri without hoovering up data
The tech giant is using privacy-preserving machine learning to improve its voice assistant while keeping your data on your phone.
It allows Apple to train different copies of a speaker recognition model across all its users' devices, using only the audio data available locally.
It then sends just the updated models back to a central server to be combined into a master model.
Federated Learning
Federated Learning: Training
● There are connected devices; let's say we have a cluster of four IoT devices, and there is one central server that has an untrained model.
● We send a copy of the model to each of the nodes.
● Each node receives a copy of that model.
Federated Learning
Federated Learning: Training
● Now all the nodes in the network have the untrained model that was received from the server.
Federated Learning
Federated Learning: Training
● In the next step, we take data at each node; by taking data it doesn't mean that we are sharing data.
● Every node has its own data, based on which it is going to train a model.
Federated Learning
Federated Learning: Training
● Each node trains the model to fit the data that it has, and it will train the model according to its data.
Federated Learning
Federated Learning: Training
● Now the server combines all the models received from each node by taking an average, i.e., it aggregates all the models received from the nodes.
● The server then has a trained central model, obtained by aggregating the models from each node. It captures the patterns in the training data on all the nodes; it is an aggregated one.
Federated Learning
Federated Learning: Training
● Once the model is aggregated, the server sends a copy of the updated model back to the nodes.
● Everything is achieved at the edge, so no data sharing is done, which means there is privacy preservation and also very little communication overhead.
Federated Learning
Federated Learning: Challenges
Systems heterogeneity
● Size of data
● Computational power
● Network stability
● Local solver
● Learning rate
Expensive Communication
● Communication in the network can be slower than local computation by many orders of magnitude.
Federated Learning
Federated Learning: Challenges
Dealing with non-i.i.d. data (i.i.d. = independent and identically distributed)
● Learning from non-i.i.d. data is difficult/slow because each IoT device needs the model to go in a particular direction
● If data distributions are very different, learning a single model which performs well for all IoT devices may require a very large number of parameters
● Another direction to deal with non-i.i.d. data is thus to lift the requirement that the learned model should be the same for all IoT devices ("one size fits all")
● Instead, we can allow each IoT device k to learn a (potentially simpler) personalized model θk, but design the objective so as to enforce some kind of collaboration
● When local datasets are non-i.i.d., FedAvg suffers from client drift
● To avoid this drift, one must use fewer local updates and/or smaller learning rates, which hurts convergence
Federated Learning
Federated Learning: Challenges
Preserving Privacy
● ML models are susceptible to various attacks on data privacy
● Membership inference attacks try to infer the presence of a known individual in the training set, e.g., by exploiting the confidence in model predictions
● Reconstruction attacks try to infer some of the points used to train the model, e.g., by differencing attacks
● Federated Learning offers an additional attack surface because the server and/or other clients observe model updates (not only the final model)
Federated Learning
Key differences with Distributed Learning
Data distribution
● In distributed learning, data is centrally stored (e.g., in a data center)
  ○ The main goal is just to train faster
  ○ We control how data is distributed across workers: usually, it is distributed uniformly at random across workers
● In FL, data is naturally distributed and generated locally
  ○ Data is not independent and identically distributed (non-i.i.d.), and it is imbalanced
Additional challenges that arise in FL
● Enforcing privacy constraints
● Dealing with the possibly limited reliability/availability of participants
● Achieving robustness against malicious parties
● High cost of data transfer
Federated Learning
Federated Learning: Applications
● Enterprise/corporate IT (chat, issue trackers, emails, etc.)
Federated Learning
Lecture Summary
● Market trend of IoT platforms
● Why decentralized training is important
● Understanding of Federated Learning
● Different issues with federated learning
Federated Learning
Thank You!
ML for Autonomous Driving Car
Dr. Rajiv Misra, Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@[Link]
● Understanding of Autonomous Vehicles
● Role of Edge computing in the Automotive Industry
● How ML is trained in self-driving cars
● Use case of an LSTM model for self-driving cars
A big chunk of major automobile companies is trying to develop self-driving cars. Some big players are Tesla and Waymo; even Google, which has no presence in the automobile sector, is developing self-driving cars. These companies have invested a huge amount of money, manpower and engineering capability in developing such systems.
Designing policies for an autonomous driving system is particularly challenging due to demanding performance requirements, in terms of both making safe operational decisions and fast processing in real time.
Connected vehicles will continue to evolve at an exponential rate with V2V and V2X communication. This generates a large volume of data (every connected vehicle will generate up to 4 TB of data per day). How can the large amounts of data be handled, processed, and analysed to make critical decisions quickly and efficiently?
Automobile makers are focused on leveraging edge computing to address these ever-evolving challenges. A group of cross-industry global players has formed the Automotive Edge Computing Consortium (AECC) to drive best practices for the convergence between the vehicle and computing ecosystem.
When driving a vehicle, milliseconds matter. Autonomous vehicles are no different, even though it may be your AI that drives them. AI = data + compute, and you want your compute to be as close to your data as possible. Enter edge computing.
Sensor topology can vary widely amongst autonomous vehicles, even within the same sector.
Most self-driving sensors are fundamentally similar - they collect data about the world around them to help pilot the vehicle. For example, the Nuro vehicle contains cameras, radar, Lidar, and thermal cameras to provide a complete, multi-layered view of the vehicle's surroundings.
Currently, a Tesla utilizes eight cameras, 12 ultrasonic sensors, and a forward radar system, but relies much more heavily on camera visuals than Nuro vehicles. Google's Waymo Driver primarily relies on Lidar and uses cameras and radar sensors to help map the world around it.
● Pre-processing collected data. Autonomous vehicles have video cameras and a variety of sensors like ultrasonic, LiDAR, and radar to become aware of their surroundings and the internals of the vehicle. This data coming from different vehicle sources must be quickly processed through data aggregation and compression (see the sketch after this list). An in-vehicle computer needs to have multiple I/O ports for receiving and sending data.
● Secure network connectivity. The in-vehicle computing solution must remain securely connected to the Internet to upload the pre-processed data to the cloud. In this case, having multiple wireless connections for redundancy and speed is crucial. High-speed connectivity is also vital for continuous deployments of vehicle updates or "push" updates like location, on-road conditions, and vehicle telematics.
● High-performance computing. Autonomous vehicles may generate approximately 1 GB of data every second. Gathering and sending even a fraction of that data (for instance, 5 minutes of data) to a cloud-based server for analysis is impractical and quite challenging due to limited bandwidth and latency. Autonomous driving systems shouldn't always rely on network connectivity and cloud services for their data processing. Self-driving vehicles need real-time data processing to make crucial quick decisions according to their surroundings. In-vehicle edge computing is essential for reducing the need for network connectivity (offline decision-making) and for increasing decision-making accuracy.
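To make the data-aggregation and compression idea concrete, here is a minimal Python sketch (not from the lecture) of in-vehicle pre-processing: high-rate sensor samples are averaged into windows and compressed before they leave the car. The function names, the 10-sample window, and the use of zlib are illustrative assumptions.

import json
import zlib
from statistics import mean

def preprocess_window(readings, window_size=10):
    """Aggregate raw high-rate sensor readings into window averages.

    Averaging reduces the data volume sent over the vehicle's uplink
    while keeping the trend of the signal (e.g. wheel-speed samples).
    """
    return [
        mean(readings[i:i + window_size])
        for i in range(0, len(readings), window_size)
    ]

def compress_for_upload(aggregated):
    """Serialize and compress the aggregated window before cloud upload."""
    payload = json.dumps({"speed_avg": aggregated}).encode("utf-8")
    return zlib.compress(payload)

if __name__ == "__main__":
    raw = [27.0 + 0.1 * i for i in range(100)]   # simulated 100 Hz samples
    agg = preprocess_window(raw)                 # 10x fewer values
    blob = compress_for_upload(agg)
    print(len(raw), "samples ->", len(agg), "aggregates,", len(blob), "bytes")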
Today, all autonomous vehicles on the road utilize edge computing AI programs, which are often trained using data center machine learning models. Autonomous-car machine learning models are only made possible by the incredible computing power of modern data centers capable of hundreds of petaflops.
The computing requirements of these vast machine learning models well exceed the computing power of edge computers. Given this, data centers are typically used to train the models that are then deployed at the edge.
The problem of a self-driving car can be seen as a regression problem.
Training an AI algorithm takes hundreds of compute hours in a high-power data center. Yet once the model is trained, the vehicle can apply it quickly and accurately using much less computing power.
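As a hedged illustration of this "train in the data center, run inference at the edge" split (my own sketch, not the lecture's model), the snippet below trains a tiny steering-angle regressor on synthetic data and exports a frozen TorchScript artifact that an in-vehicle computer could load for low-cost per-frame inference. The 16-dimensional feature vector, layer sizes, and file name are assumptions.

import torch
import torch.nn as nn

# Hypothetical steering-angle regressor: heavy training happens in the data
# center, while the exported (frozen) model runs cheap inference at the edge.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Synthetic stand-in for logged sensor features and recorded steering angles.
features = torch.randn(1024, 16)
steering = torch.randn(1024, 1)

for epoch in range(5):                       # data-center training loop
    optimizer.zero_grad()
    loss = loss_fn(model(features), steering)
    loss.backward()
    optimizer.step()

# Freeze and export once training is done; only this artifact ships to the car.
edge_model = torch.jit.script(model.eval())
edge_model.save("steering_regressor.pt")

# On the vehicle: load the artifact and run low-cost inference per frame.
deployed = torch.jit.load("steering_regressor.pt")
with torch.no_grad():
    angle = deployed(torch.randn(1, 16))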
Cameras cannot tell the difference between actual pedestrians and pictures of humans. Additionally, camera sets cannot precisely measure distance or work at night. Lidar sensors usually emit high-frequency signals, and those high-frequency signals can be used for positioning and 3D modelling, and are able to tell the difference between an actual human and a picture of a human. Radar is a low-energy-cost solution for positioning because the radio waves it emits are usually of low frequency. A low-frequency wave cannot depict a detailed 3D shape, but it is enough for positioning. However, cameras are still needed because neither lidar nor radar can identify colors.
Vehicle-to-Vehicle (V2V) communication technology can significantly increase the accuracy of autonomous driving. When multiple cars share their information, they can calibrate according to their relative positions.
ML for Autonomous Vehicles
Key components of ML for self-driving cars
Perception: a core element of what the self-driving car needs in order to build an understanding of the world around it, using two major inputs:
Scene prior: a prior on the scene. For example, it would be a little silly to recompute the actual location of the road and the interconnectivity of every intersection on the fly. These are things you can pre-compute in advance, saving your onboard computing for the tasks that are more critical; this is often referred to as the mapping exercise.
Sensor data: the signals that tell you what differs from what you mapped, and things like whether the traffic light is red or green, where the pedestrians and cars are, and what they are doing.
Components
Scene Representation
Recurrent neural networks are essentially networks that build a state that gets better and better as they receive more sequential observations of the pattern.
Once the semantic representation is encoded in an embedding for the pedestrian and the car around it, the model will track that over time and build a state: a good understanding of what's going on in the scene.
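A minimal PyTorch sketch of this idea (illustrative only, not the lecture's architecture): an LSTM consumes one object embedding per frame and carries its hidden state forward, so its belief about the tracked pedestrian or car improves with each new observation. The embedding and state sizes are assumed.

import torch
import torch.nn as nn

# Illustrative tracker: an LSTM accumulates a hidden state over per-frame
# object embeddings (e.g. a pedestrian's appearance/position encoding), so
# scene understanding improves as more sequential observations arrive.
embedding_dim, state_dim = 32, 64
tracker = nn.LSTM(input_size=embedding_dim, hidden_size=state_dim, batch_first=True)

hidden = None                                # no state before the first frame
for t in range(10):                          # 10 consecutive frames
    frame_embedding = torch.randn(1, 1, embedding_dim)   # (batch, seq=1, features)
    output, hidden = tracker(frame_embedding, hidden)    # state carries over frames

print(output.shape)   # torch.Size([1, 1, 64]) -- current belief about the object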
The raw sensor data is preprocessed before being fed into several machine learning models.
subsequent feature extraction step.
Coordinate Systems: three coordinate systems are provided in this dataset: the global frame, the vehicle frame, and the sensor frame. Some raw features are represented in different coordinate systems, so in order to maintain consistency it is crucial to transform the data into the correct coordinate system. The dataset also provides the vehicle pose VP, a 4 × 4 matrix, to transform variables from one coordinate system to another; a minimal sketch of this transform follows.
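A small numpy sketch of such a transform, assuming VP is a standard 4 × 4 homogeneous pose matrix (rotation plus translation); the example pose and point below are made up for illustration.

import numpy as np

def vehicle_to_global(point_vehicle, vehicle_pose):
    """Transform a 3-D point from the vehicle frame into the global frame.

    vehicle_pose is a 4x4 homogeneous matrix; the point is lifted to
    homogeneous coordinates, multiplied, and projected back to 3-D.
    """
    p = np.append(np.asarray(point_vehicle, dtype=float), 1.0)   # [x, y, z, 1]
    return (vehicle_pose @ p)[:3]

# Example: vehicle rotated 90 degrees about z and translated to (10, 5, 0).
VP = np.array([[0.0, -1.0, 0.0, 10.0],
               [1.0,  0.0, 0.0,  5.0],
               [0.0,  0.0, 1.0,  0.0],
               [0.0,  0.0, 0.0,  1.0]])
print(vehicle_to_global([2.0, 0.0, 0.0], VP))   # -> [10.  7.  0.]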
Camera Images: the dataset provides images from cameras facing front, front-left, front-right, side-left, and side-right respectively. These images reflect the time-series information of the moving vehicle with relatively smoother variation than numerical data, which helps to prevent spiky predictions between consecutive frames.
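Since the stated use case is an LSTM model for self-driving cars, here is a self-contained, illustrative sketch (not the lecture's actual model) of how per-frame camera features could be encoded by a small CNN and then passed through an LSTM so that the predicted control signal (here a steering angle) varies smoothly between consecutive frames. All layer sizes, the clip length, and the output head are assumptions.

import torch
import torch.nn as nn

class CameraLSTMDriver(nn.Module):
    """Illustrative CNN + LSTM: encode each camera frame, then let an LSTM
    smooth predictions (e.g. steering angle) across consecutive frames."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                 # tiny per-frame CNN encoder
            nn.Conv2d(3, 8, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),    # -> 8-d frame feature
        )
        self.lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, 1)                  # steering-angle output

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        seq_out, _ = self.lstm(feats)
        return self.head(seq_out)                     # one prediction per frame

model = CameraLSTMDriver()
clip = torch.randn(2, 6, 3, 96, 96)                   # 2 clips of 6 frames each
print(model(clip).shape)                              # torch.Size([2, 6, 1])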
run out faster, so the energy crisis is always present.
First of all, the rise of autonomous driving cars can improve the energy efficiency of privately owned cars. Usually, an average family car can reach a maximum speed of about 200 to 250 km/h, but the usual city speed limit is about 60 km/h. That means the engine displacement of today's cars is mostly excessive. However, high engine displacement has been considered necessary because faster cars are safer for human drivers, who can overtake or change lanes more quickly. If autonomous vehicles took the place of privately owned vehicles, it would be pointless to use bigger and faster cars, because autonomous driving cars are much more reliable than human drivers.
energy over traditional energy sources, which will benefit the global climate as well.
Last but not least, when autonomous driving vehicles replace private cars, parking issues will be solved and people will have bigger houses and living areas because no garage is needed. There will be less traffic congestion, as routes will be pre-scheduled to ensure efficiency. Long-distance delivery will be more reliable because an auto-driving vehicle never gets tired.
● Different concepts of Autonomous Vehicles
● How Edge computing is important in Automotive Industry?
● How ML is trained in Self-driving cars?
● Use Case of LSTM model for self-driving cars