JNTUA Cloud Computing Notes - R19
Course Objectives:
Learning Outcomes:
Unit-II: Cloud Services and Platforms: Compute Services, Storage Services, Database Services,
Application Services, Content Delivery Services, Analytics Services, Deployment and
Management Services, Identity and Access Management Services, Open Source Private Cloud
Software, Apache Hadoop, Hadoop MapReduce Job Execution, Hadoop Schedulers, Hadoop
Cluster Setup.
Learning Outcomes:
Multimedia Cloud: Introduction, Case Study: Live Video Streaming App, Streaming Protocols,
Case Study: Video Transcoding APP.
Learning Outcomes:
Unit-V: Cloud Application Development in Python, Design Approaches, Image Processing APP,
Document Storage App, MapReduce App, Social Media Analytics App, Cloud Application
Benchmarking and Tuning, Cloud Security, Cloud Computing for Education.
Learning Outcomes:
Course Outcomes:
Textbooks:
Oracle and HP have all joined the game. This proves that today, cloud computing has
become mainstream.
Back End: The backend part helps you manage all the resources needed to provide Cloud
computing services. This Cloud architecture part includes a security mechanism, a large amount
of data storage, servers, virtual machines, traffic control mechanisms, etc.
3. What are the Important Components of Cloud Computing Architecture? What are its
benefits?
Here are some important components of Cloud computing architecture:
1. Client Infrastructure: Client Infrastructure is a front-end component that provides a GUI. It
helps users to interact with the Cloud.
2. Application: The application can be any software or platform which a client wants to access.
3. Service: The service component manages which type of service you can access according to
the client’s requirements.
Three Cloud computing services are:
Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
4. Runtime Cloud: Runtime cloud offers the execution and runtime environment to the virtual
machines.
5. Storage: Storage is another important Cloud computing architecture component. It provides
a large amount of storage capacity in the Cloud to store and manage data.
6. Infrastructure: It offers services on the host level, network level, and application level.
Cloud infrastructure includes hardware and software components like servers, storage, network
devices, virtualization software, and various other storage resources that are needed to support
the cloud computing model.
7. Management: This component manages components like application, service, runtime
cloud, storage, infrastructure, and other security matters in the backend. It also establishes
coordination between them.
8. Security: Security in the backend refers to implementing different security mechanisms to
secure Cloud systems, resources, files, and infrastructure for the end user.
9. Internet: Internet connection acts as the bridge or medium between frontend and backend. It
allows you to establish the interaction and communication between the frontend and backend.
Benefits of Cloud Computing Architecture: Following are the cloud computing architecture
benefits:
Makes the overall Cloud computing system simpler.
Helps to enhance your data processing.
Provides high security.
It has better disaster recovery.
Offers good user accessibility.
Significantly reduces IT operating costs.
4. What are the Characteristics of Cloud Computing?
The following are the essential characteristics of Cloud Computing:
Flexibility: Cloud Computing lets users access data or services using internet-enabled devices
(such as smartphones and laptops). Whatever you want is instantly available on the cloud, just a
click away. Sharing and working on data thus becomes easy and comfortable. Many
organizations these days prefer to store their work on cloud systems, as it makes collaboration
easy and saves them a lot of cost and resources. Its ever-increasing set of features and services
is also accelerating its growth.
Scalability: Scalability is the ability of the system to handle the growing amount of work by
adding resources to the system. Continuous business expansion demands a rapid expansion of
cloud services. One of the most versatile features of Cloud Computing is that it is scalable: it
can expand the number of servers, or the amount of infrastructure, according to demand, and
scale back down again when demand falls.
Resource pooling: Computing resources (like networks, servers, storage) that serve individual
users can be securely pooled to make it look like a large infrastructure. This can be done by
implementing a multiple-tenant model, just like a huge apartment where each individual has his
own flat but at the same time every individual shares the apartment. A cloud service provider
can share resources among clients, providing each client with services as per their requirements.
Broad network access: One of the most interesting features of cloud computing is that it
knows no geographical boundaries. Cloud computing has a vast access area and is accessible
via the internet. You can access your files and documents or upload your files from anywhere in
the world, all you need is a good internet connection and a device, and you are set to go.
On-demand self-service: It is based on a self-service model where users can manage their
services like- allotted storage, functionalities, server uptime, etc., making users their own boss.
The users can monitor their consumption and can select and use the tools and resources they
require right away from the cloud portal itself. This helps users make better decisions and
makes them responsible for their consumption.
Cost-effective: Since users can monitor and control their usage, they can also control the cost
factor. Cloud service providers do not charge any upfront cost and most of the time they
provide some space for free. The billing is transparent and entirely based upon their usage of
resources. Cloud computing reduces the expenditure of an organization considerably.
Security: Data security in cloud computing is a major concern among users. Cloud service
providers store encrypted data of users and provide additional security features such as user
authentication and security against breaches and other threats. Authentication refers to
identifying and confirming the user as an authorized user. If the user is not authorized, the
access is denied. Cloud vendors provide several layers of abstraction to improve the security
and speed of accessing data.
Automation: Automation enables IT teams and developers to create, modify, and maintain
cloud resources. Cloud infrastructure requires minimum human interaction. Everything, from
configuration to maintenance and monitoring, is most of the time automated. Automation is a
great characteristic of cloud computing and is very much responsible for the increase in demand
and rapid expansion of cloud services.
Maintenance: Maintenance of the cloud is an easy and automated process with minimum or no
extra cost requirements. With each upgrade in cloud infrastructure and software, maintenance is
becoming easier and more economical.
Measured services: Cloud resources and services such as storage, bandwidth, processing
power, networking capabilities, intelligence, software and services, development tools,
analytics, etc. used by the consumer are monitored and analyzed by the service providers. In
other words, the services you use are measured.
Resilience: Resilience in cloud computing means its ability to recover from any interruption. A
Cloud service provider has to be prepared against any disasters or unexpected circumstances
since a lot is at stake. Disaster management earlier used to pose problems for service providers
but now due to a lot of investments and advancements in this field, clouds have become a lot
more resilient. For example, cloud service providers arrange many backup nodes (servers).
Organizational demand for cloud computing has increased exponentially. Cloud service
providers, grasping the business opportunity, have continuously provided quality services to
their clients. Cloud technology has performed up to its potential and still, it has a huge prospect
for growth.
5. What is On-Demand Self-Service?
On-demand self-service: means that a consumer can request and receive access to a service
offering, without an administrator or some sort of support staff having to fulfill the request
manually. The request processes and fulfillment processes are all automated. This offers
advantages for both the provider and the consumer of the service.
Implementing user self-service allows customers to quickly procure and access the services they
want. This is a very attractive feature of the cloud. It makes getting the resources you need very
quick and easy. With traditional environments, requests often took days or weeks to be fulfilled,
causing delays in projects and initiatives. You don’t have to worry about that in cloud
environments.
User self-service also reduces the administrative burden on the provider. Administrators are
freed from the day-to-day activities around creating users and managing user requests. This
allows an organization’s IT staff to focus on other, hopefully more strategic, activities.
Self-service implementations can be difficult to build, but for cloud providers they are definitely
worth the time and money. User self-service is generally implemented via a user portal. There
are several out-of-the-box user portals that can be used to provide the required functionality, but
in some instances a custom portal will be needed. On the front end, users will be presented with
a template interface that allows them to enter the appropriate information. On the back end, the
portal will interface with management application programming interfaces (APIs) published by
the applications and services. It can present quite a challenge if the backend systems do not
have APIs or other methods that allow for easy automation.
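For illustration only, the Python sketch below shows how a self-service portal's back end might call a provider's management API to fulfil a user's request. The endpoint, request fields, and response format are hypothetical placeholders, not taken from any specific product.

import requests

# Hypothetical provisioning endpoint and token -- a real portal would call the
# management APIs published by the underlying applications and services.
PROVISIONING_API = "https://cloud.example.com/api/v1/vms"
API_TOKEN = "replace-with-a-real-token"

def provision_vm(owner, size="small"):
    # Submit a VM request on behalf of a portal user and return its request ID.
    response = requests.post(
        PROVISIONING_API,
        json={"owner": owner, "size": size},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()             # surface failures instead of hiding them
    return response.json()["request_id"]    # hypothetical response field

if __name__ == "__main__":
    print(provision_vm("student@example.com", size="medium"))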
When implementing user self-service, you need to be aware of potential compliance and
regulatory issues. Often, compliance programs like Sarbanes-Oxley (SOX) require controls be
in place to prevent a single user from being able to use certain services or perform certain
actions without approval. As a result, some processes cannot be completely automated. It's
important that you understand which processes can and cannot be automated when implementing
self-service in your environment.
While public clouds bring broad network access capabilities by default, private servers don’t do
the same. For instance, you could set up an on-premise server to connect with only local devices
close to the enterprise core. HR systems are a good example. In a traditional office, employees
would be able to log into their attendance portal and clock in only from their designated
workstations inside the office campus. The server where the attendance app is hosted does not
have broad network access, and therefore cannot be accessed from a remote location or a
mobile device.
However, broad network access is increasingly becoming a key demand for private cloud
solutions. This is due to three reasons:
Remote work and the occasional WFH were common even before 2020. Now, in the
wake of the pandemic, it is vital to support cloud access and app-based workflows from
any device.
Bring your own device (BYOD) allows employees to use a device of their choice, which
may be a personal device as well. Without broad network access, BYOD isn’t possible.
Companies may opt for a private instead of a public cloud landscape for security and
compliance reasons. However, they wouldn’t want to sacrifice the flexibility and
convenience of device-agnostic network access.
Therefore, it is vital to weave broad network access into your cloud SLAs so that employees
and business processes can easily access the resources they need to perform at their optimum.
In cloud environments, provisioning is abstracted from the consumer, so that the process of
delivery is opaque and the services seem to be automatically and infinitely available.
Manual Scaling: Capacity is planned and provisioned months in advance. It is mostly done using
physical servers, which are installed and configured manually.
Semi-automated Scaling: Semi-automated scalability takes advantage of virtual servers, which
are provisioned (installed) using predefined images. A manual forecast or automated warning of
system monitoring tooling will trigger operations to expand or reduce the cluster or farm of
resources.
Using predefined, tested, and approved images, every new virtual server will be the same as
others (except for some minor configuration), which gives you repeatable results. It also reduces
the manual labor on the systems significantly, and it is a well-known fact that manual actions on
systems cause around 70 to 80 percent of all errors. There are also huge benefits to using a
virtual server; this saves costs after the virtual server is de-provisioned. The freed resources can
be directly used for other purposes.
Elastic Scaling (fully automatic Scaling): Elasticity, or fully automatic scalability, takes
advantage of the same concepts that semi-automatic scalability does but removes any manual
labor required to increase or decrease capacity. Everything is controlled by a trigger from the
the System Monitoring tooling, which gives you this "rubber band" effect. If more capacity is
needed now, it is added and available within minutes; likewise, depending on the system
monitoring tooling, capacity is reduced as soon as it is no longer needed.
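As a minimal sketch of this "rubber band" behaviour, the Python loop below reacts to a monitored utilisation metric; the thresholds and the add/remove functions are hypothetical stand-ins for a real provider's monitoring and provisioning APIs.

import random
import time

SCALE_OUT_THRESHOLD = 0.80    # add capacity above 80% average utilisation
SCALE_IN_THRESHOLD = 0.30     # remove capacity below 30% average utilisation

def get_average_utilisation():
    # Stand-in for the system monitoring tooling; here it just returns a random value.
    return random.uniform(0.0, 1.0)

def add_server():
    print("trigger: provisioning one more virtual server")

def remove_server():
    print("trigger: de-provisioning an idle virtual server")

def autoscale(checks=5, interval_seconds=60):
    for _ in range(checks):
        load = get_average_utilisation()
        if load > SCALE_OUT_THRESHOLD:
            add_server()       # capacity is added within minutes, no manual labour
        elif load < SCALE_IN_THRESHOLD:
            remove_server()    # freed resources become available for other purposes
        time.sleep(interval_seconds)

if __name__ == "__main__":
    autoscale(checks=3, interval_seconds=1)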
It is important to remember that the items presented under each topic within this
article are not an exhaustive list and are aimed only at presenting a starting point
for a series of long and detailed conversations with the stakeholders of your
project, always the most important part of the design of any application. The aim
of these conversations should be to produce an initial high-level design and
architecture. This is achieved by considering these four key elements holistically
within the domain of the customer's project requirements, always remembering to
consider the side-effects and trade-offs of any design decision (i.e. what we gain
vs. what we lose, or what we make more difficult).
Scalability
Capacity
Will we need to scale individual application layers and, if so, how can we achieve
this without affecting availability?
How quickly will we need to scale individual services?
How do we add additional capacity to the application or any part of it?
Will the application need to run at scale 24x7, or can we scale-down outside
business hours or at weekends for example?
Platform / Data
Can we work within the constraints of our chosen persistence services while
working at scale (database size, transaction throughput, etc.)?
How can we partition our data to aid scalability within persistence platform
constraints (e.g. maximum database sizes, concurrent request limits, etc.)?
How can we ensure we are making efficient and effective use of platform
resources? As a rule of thumb, I generally tend towards a design based on many
small instances, rather than fewer large ones.
Can we collapse tiers to minimise internal network traffic and use of resources,
whilst maintaining efficient scalability and future code maintainability?
Load
How can we improve the design to avoid contention issues and bottlenecks? For
example, can we use queues or a service bus between services in a co-operating
producer, competing consumer pattern (see the sketch after this list)?
Which operations could be handled asynchronously to help balance load at peak
times?
How could we use the platform features for rate-leveling (e.g. Azure Queues,
Service Bus, etc.)?
How could we use the platform features for load-balancing (e.g. Azure Traffic
Manager, Load Balancer, etc.)?
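The following Python sketch illustrates the co-operating producer, competing consumer pattern referenced above using only the standard library; in a cloud deployment, a durable service such as an Azure Queue or Service Bus queue would replace the in-process queue.

import queue
import threading

work_queue = queue.Queue()

def producer(n_items):
    # Co-operating producer: enqueue work instead of calling consumers directly.
    for i in range(n_items):
        work_queue.put(f"order-{i}")

def consumer(name):
    # Competing consumer: several of these drain the same queue, levelling the load.
    while True:
        item = work_queue.get()
        if item is None:            # sentinel value: no more work
            work_queue.task_done()
            break
        print(f"{name} processed {item}")
        work_queue.task_done()

workers = [threading.Thread(target=consumer, args=(f"worker-{i}",)) for i in range(3)]
for w in workers:
    w.start()

producer(10)
for _ in workers:                   # one sentinel per worker so all of them exit
    work_queue.put(None)
for w in workers:
    w.join()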
Availability
Uptime Guarantees
What Service Level Agreements (SLAs) are the products required to meet?
Can these SLAs be met? Do the different cloud services we are planning to use all
conform to the levels required? Remember that SLAs are composite.
Are we restricted to specific geopolitical areas? If so, are all the services we are
planning to use available in those areas?
How do we prevent corrupt data from being replicated?
Will recovery from a failure put excess pressure on the system? Do we need to
implement retry policies and/or a circuit-breaker?
Disaster recovery
Performance
What are the acceptable levels of performance? How can we measure that? What
happens if we drop below this level?
Can we make any parts of the system asynchronous as an aid to performance?
Which parts of the system are the most highly contended, and therefore more
likely to cause performance issues?
Are we likely to hit traffic spikes which may cause performance issues? Can we
auto-scale or use queue-centric design to cover for this?
Security
This is clearly a huge topic in itself, but a few interesting items to explore which
relate directly to cloud-computing include:
What is the local law and jurisdiction where data is held? Remember to include the
countries where failover and metrics data are held too.
Is there a requirement for federated security (e.g. ADFS with Azure Active
Directory)?
Is this to be a hybrid-cloud application? How are we securing the link between our
corporate and cloud networks?
Manageability
This topic of conversation covers our ability to understand the health and
performance of the live system and manage site operations. Some useful cloud
specific considerations include:
Monitoring
Deployment
How many environments will we need (e.g. development, test, staging, production),
and how will we deploy to each of them?
Will each environment need separate data storage?
Will each environment need to be available 24x7?
Feasibility
When discussing feasibility we consider the ability to deliver and maintain the
system, within budgetary and time constraints. Items worth investigating include:
Can the SLAs ever be met (i.e. is there a cloud service provider that can give the
uptime guarantees that we need to provide to our customer)?
Do we have the necessary skills and experience in-house to design and build cloud
applications?
Can we build the application to the design we have within budgetary constraints
and a timeframe that makes sense to the business?
How much will we need to spend on operational costs (cloud providers often have
very complex pricing structures)?
What can we sensibly reduce (scope, SLAs, resilience)?
What trade-offs are we willing to accept?
Conclusion
that will not only save you time and money but will help you meet your security
and governance needs. So let’s get started.
When organizations start planning their cloud migration, and like anything else
new, they start by trying and testing some capabilities. Perhaps they start hosting
their development environment in the cloud while keeping their production one on-
premises.
It is also common to see small and isolated applications being migrated first,
perhaps because of their size, low criticality and to give the cloud a chance to
prove it is trustworthy. After all, migration to the cloud is a journey and doesn’t
happen overnight.
Then the benefits of cloud solutions become apparent, and companies start to
migrate multiple large-scale workloads. As more and more workloads move to the
cloud, many organizations find themselves dealing with workload islands that are
managed separately with different security models and independent data flows.
Even worse, with the pressure to quickly get new applications deployed in the
cloud with strict deadlines, developers find themselves rushing to consume new
cloud services without reasonable consideration of the organization’s security and
governance needs.
The unfortunate result in most cases is to end up with a cloud infrastructure that is
hard to manage and maintain. Each application could end up deployed in a separate
island with its own connectivity infrastructure and with poor access management.
Managing the cost of running workloads in the cloud also becomes a challenge. There is
no clear governance and accountability model which leads to a lot of management
overhead and security concerns.
Governance, automation, naming conventions, and security models are also hard to
introduce afterwards. In fact, it is a nightmare to look at a poorly managed cloud
infrastructure and then try to apply security and governance after the fact, because
these need to be planned ahead before even deploying any cloud resources.
Even worse, data can be hosted in geographies that violate corporate compliance
requirements, which is a big concern for most organizations. I remember asking
customers whether they knew where their cloud data was hosted, and most of them
simply did not know.
Functional View
Implementation View
Deployment View.
You can think of the Cloud Reference Architecture (CRA) Deployment View as the
blueprint for all cloud projects. What you get from this blueprint, the end goal if
you are wondering, is to help you quickly develop and implement cloud-based
solutions, while reducing complexity and risk.
Therefore, having a foundation architecture not only helps you ensure security,
manageability and compliance but also consistency for deploying resources. It
includes network, security, management infrastructure, naming convention, hybrid
connectivity and more.
I know what you might be thinking right now: how does one blueprint fit the needs
of organizations of different sizes? Since not all organizations are the same, the
Cloud Reference Architecture (CRA) Deployment View does not outline a single
design that fits all sizes. Rather, it provides a framework for decisions based on
core cloud services, features and capabilities.
The Need for Enterprise Scaffold
One of the main concepts of the Cloud Reference Architecture (CRA) that I would
like to share with you today is the concept of an enterprise scaffold.
Let’s start from the beginning. When you decide to migrate to the cloud and take
advantage of all that the cloud has to offer, there are a couple of concerns that you
should address first. Things like:
A way to manage and track cost effectively (how can you know what resources are
deployed so you can account for them and bill them back accurately).
Establishing a governance framework to address key issues like data sovereignty.
Deploying with a security-first mindset (defining clear management roles, access
management, and security controls across all deployments).
Building trust in the cloud (have peace of mind that cloud resources are managed
and protected from day one).
These concerns are top priority for every organization when migrating to the cloud
and should be addressed early in the cloud migration planning phase.
To address all these key concerns, you need to think of adopting a framework or
an enterprise scaffold that can help you move to the cloud with confidence. Think
about how engineers build a building. They start by creating the basis of the
structure (scaffold) that provides anchor points for more permanent systems to be
mounted.
The same applies when deploying workloads in the cloud. You need an enterprise
scaffold that provides structure to the cloud environment and anchors for services
built on top. It is the foundation that builders (IT teams) use to build services with
speed of delivery in mind. The enterprise scaffold ensures that workloads you
deploy in the cloud meet the minimum security and governance practices your
organization is adopting while giving developers the ability to deploy services and
applications quickly to meet their goals and deadlines, which is a win-win solution.
Core Networking: This layer includes the technologies that help you isolate and deploy
security controls to monitor and inspect traffic across your cloud infrastructure. One of
the best recommendations here is to use a hub-and-spoke topology and adopt the shared
service model, where common resources are consumed by different LOB applications,
which has many benefits that we will discuss in greater detail later.
In this layer, you decide how to extend your on-premises data center to the cloud.
You also define how to design and implement isolation using virtual networks and
user-defined routes. This is also the time when you deploy Network Virtual
Appliances (NVAs) and firewalls to inspect data flow inside your cloud
infrastructure.
Another key feature of the cloud is Software Defined Networking (SDN), which gives
you the opportunity to do micro-segmentation by implementing Network Security
Groups and Application Security Groups to better control traffic even within subnets,
not only at the edge of the network. This is an evolution of how we think about
isolation and protection in such an elastic cloud computing environment.
After you are done with the core networking layer, and just before deploying your
resources, you should consider how you are going to enforce Resource
Governance. This is important because the goal of the cloud reference architecture
is to give developers more control and freedom to deploy workloads quickly and
meet their deadlines, while adhering to corporate security and governance needs.
One way to achieve this balance is by applying resource tags, implementing cost
management controls, and also by translating your organizational governance rules
and policies into Azure policies that govern the usage of cloud resources.
Once all this foundation work is finished, you can start planning how to deploy
your line of business applications (LOB applications). Most likely you need to
define different application life-cycle environments like (Production, Dev, and
QA).
Here you can also establish a shared services workspace to host shared
infrastructure resources for your line of business applications to consume. If one of
Use declarative formats for setup automation, to minimize time and cost for
new developers joining the project
Have a clean contract with the underlying operating system, offering
maximum portability between execution environments
Are suitable for deployment on modern cloud platforms (Google Cloud,
Heroku, AWS, etc.), obviating the need for servers and systems
administration
Minimize divergence between development and production, enabling
continuous deployment for maximum agility and can scale up without
significant changes to tooling, architecture, or development practices
If we simplify this term further, 12 Factor App design methodology is nothing but
a collection of 12 factors which act as building blocks for deploying or developing
an app in the cloud. Listed below are the 12 Factors:
4. Build, Run, Release: It is important to keep the build and run stages
separate, making sure everything has the right libraries. For this, you can make
use of the required automation and tools to generate build and release packages
with proper tags. This is further backed up by running the app in the
execution environment while using proper release management tools like
Capistrano for ensuring timely rollback.
5. Stateless Processes: This factor is about making sure the app is executed in
the execution environment as one or more processes. In other words, you
want to make sure that all your data is stored in a backing store, which gives
you the freedom to scale out as needed. With stateless processes, you do not
want to have state that you need to pass along as you scale up and out.
6. Port Binding: Twelve factor apps are self-contained and do not rely on
runtime injection of a web server into the execution environment to create a
web-facing service. With the help of port binding, you can directly access
your app via a port to know whether it is your app or some other point in the
stack that is not working properly (see the sketch after this list).
7. Concurrency: This factor looks into the best practices for scaling the app.
These practices are used to manage each process in the app independently
i.e. start/stop it, clone it to different machines, etc. The factor also deals with
breaking your app into much smaller pieces and then looking for services out
there that you either have to write or can consume.
8. Disposability: Your app might have multiple processes handling
different tasks. So, this factor looks into the robustness of the app with
fast startup and shutdown methods. Disposability is about making sure your
app can start up and shut down fast and can handle a crash at any time. You
can use a high-quality, robust queuing backend (Beanstalk, RabbitMQ,
etc.) that would help return unfinished jobs back to the queue in the case of a
failure.
9. Dev/Prod Parity: Development, staging and production should be as similar
as possible. In case of continuous deployment, you need to have continuous
integration based on matching environments to limit deviation and errors.
Some of the features of keeping the gap between development and
production small are as follows:
1. Make the time gap small: a developer may write code and have it
deployed hours or even just minutes later.
2. Make the personnel gap small: developers who wrote code are
closely involved in deploying it and watching its behavior in
production.
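To illustrate the port-binding factor from the list above, here is a minimal, self-contained Python sketch that exports an HTTP service by binding to a port taken from the environment; the PORT variable and its default value are assumptions for the example.

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The app itself speaks HTTP; no external web server is injected at runtime.
        body = b"hello from a self-contained twelve-factor process\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    port = int(os.environ.get("PORT", "8080"))   # assumed PORT variable, default 8080
    HTTPServer(("0.0.0.0", port), HelloHandler).serve_forever()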
Cloud Storage is a service that allows you to save data on an offsite storage system
managed by a third party and made accessible through a web services API.
Storage Devices
Unmanaged Cloud Storage
Unmanaged cloud storage means the storage is preconfigured for the customer.
The customer can neither format the storage, nor install his own file system, nor
change drive properties.
Managed Cloud Storage
Managed cloud storage offers online storage space on-demand. The managed cloud
storage system appears to the user to be a raw disk that the user can partition and
format.
The cloud storage system stores multiple copies of data on multiple servers, at
multiple locations. If one system fails, only the pointer to the location where the
object is stored needs to be changed.
To aggregate storage assets into cloud storage systems, the cloud provider can use
storage virtualization software such as StorageGRID. It creates a virtualization layer
that pools storage from different storage devices into a single management system. It
can also manage data from CIFS and NFS file systems over the Internet. In this way,
StorageGRID virtualizes storage from many devices into a storage cloud.
Challenges
Storing data in the cloud is not a simple task. Despite its flexibility and
convenience, it also presents several challenges for customers. The customers
must be able to:
Get provision for additional storage on-demand.
Know and restrict the physical location of the stored data.
Verify how data was erased.
Transport/Package Format:
MPEG-2 TS
Transport/Package Format:
The transport/package format for RTMP is unavailable.
3. Secure Reliable Transport (SRT): Secure Reliable Transport
(SRT) is a relatively new streaming protocol from Haivision,
a leading player in the online streaming space. SRT is an open-source
protocol that is likely the future of live streaming.
This video streaming protocol is known for its security, reliability, and
low latency streaming.
SRT is still somewhat ahead of its time, and there are still some compatibility
limitations with this protocol.
Transport/Package Format:
SRT is media and content agnostic, so it supports all transport and package
formats.
Despite the failure of MSS, Microsoft is still behind a few other protocols
like MPEG DASH. Although MSS was promising in its early days, tech
enthusiasts could see that Silverlight wasn’t going to last long and as a result,
MSS came crashing down with it.
Transport/Package Format:
MP4 fragments
Transport/Package Format:
MP4 fragments
MPEG-2 TS
6. WebRTC:
Web Real-Time Communication (WebRTC) is relatively new compared to
the others on our list and technically not considered a streaming protocol, but
often talked about as though it is.
It is what’s largely responsible for your ability to participate in live video
conferences directly in your browser.
WebRTC supports adaptive bitrate streaming in the same way HLS and
MPEG-DASH do.
Microsoft Teams, which exploded in popularity during the pandemic, uses
WebRTC for both audio and video communications.
Video Codecs Supported:
H.264
VP8 + VP9
Playback Support:
Native support on Android devices
As of 2020, iOS Safari 11 and newer versions support WebRTC
Works on Google Chrome, Mozilla Firefox, and Microsoft Edge
Supported by YouTube and Google
What is Transcoding?
Firstly, transcoding needs to be differentiated from two other easily confused digital video
processes: compression and transmuxing/rewrapping.
“Transcoding” as an umbrella term
Essentially, transcoding is a two-step process in which (encoded) data is
decoded to an intermediate format and then encoded into a target format.
Three tasks might fall under the larger umbrella when someone refers to
transcoding video content:
Transrating
2. Decode the video: Decompress the file as close as possible to the original
uncompressed frames.
3. Process the video: Scale, interlace, deinterlace, or perform advanced image-
processing steps that change picture elements to improve the perceived result
of encoding.
4. Encode the video: Do that with the required destination codec.
5. Mux the video: Pack it into a wrapper or package, combining the video with
other components as required.
Transrating: changes the bitrate without modifying the video content, format, or
codec. For example, to ensure that the video can fit into less storage space or can be
broadcast over a lower bandwidth connection, an 8-Mbps bitrate might be reduced to
3 Mbps.
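As a minimal sketch of transrating, the Python snippet below invokes ffmpeg (assumed to be installed) to re-encode a local file at roughly 3 Mbps while copying the audio stream unchanged; the file names and bitrate values are placeholders.

import subprocess

# Transrate a video to roughly 3 Mbps; the audio stream is copied without re-encoding.
subprocess.run(
    [
        "ffmpeg",
        "-i", "input.mp4",        # source file (placeholder name)
        "-c:v", "libx264",        # re-encode the video with H.264
        "-b:v", "3M",             # target average video bitrate
        "-maxrate", "3M",
        "-bufsize", "6M",
        "-c:a", "copy",           # leave the audio stream untouched
        "output_3mbps.mp4",       # destination file (placeholder name)
    ],
    check=True,
)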
You can launch an EMR cluster in minutes. The service is designed to automate infrastructure
provisioning, cluster setup, configuration, and tuning. EMR takes care of these tasks allowing you to
focus your teams on developing differentiated big data applications.
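A minimal sketch of launching an EMR cluster programmatically with boto3 is shown below; the region, release label, instance types, and S3 log bucket are placeholder values, and the default EMR service and EC2 roles are assumed to already exist in the account.

import boto3

emr = boto3.client("emr", region_name="us-east-1")          # placeholder region

response = emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-6.10.0",                               # placeholder release label
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    LogUri="s3://my-log-bucket/emr-logs/",                   # placeholder bucket
    JobFlowRole="EMR_EC2_DefaultRole",                       # assumed default roles
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])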
You can set scale out and scale in using EMR Managed Scaling policies and let your EMR cluster
automatically manage the compute resources to meet your usage and performance needs. This
improves cluster utilization.
EMR Studio is an integrated development environment (IDE) that makes it easier for data scientists
and data engineers to develop, visualize, and debug data engineering and data science applications
written in R, Python, Scala, and PySpark. EMR Studio provides managed Jupyter Notebooks, and
tools like Spark UI and YARN Timeline Service to simplify debugging.
High availability
You can configure high availability for multi-master applications such as YARN, HDFS, Apache
Spark, Apache HBase, and Apache Hive. When you enable multi-master support, EMR is designed to
configure these applications for high availability; in the event of a failure, it automatically
fails over to a standby master so that your cluster is not disrupted, and it places your master
nodes in distinct racks to reduce the risk of simultaneous failure. Hosts are monitored to detect
failures, and when issues are detected, new hosts are provisioned and added to the cluster
automatically.
With EMR Managed Scaling you specify the minimum and maximum compute limits for your clusters
and Amazon EMR automatically resizes them for improved performance and resource utilization.
EMR Managed Scaling is designed to continuously sample key metrics associated with the
workloads running on clusters.
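The following boto3 sketch attaches a Managed Scaling policy to an existing cluster; the cluster ID and the 2-to-10 instance limits are placeholder values.

import boto3

emr = boto3.client("emr", region_name="us-east-1")           # placeholder region

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",                              # placeholder cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,     # never shrink below 2 instances
            "MaximumCapacityUnits": 10,    # never grow beyond 10 instances
        }
    },
)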
You can now modify the configuration of applications running on EMR clusters including Apache
Hadoop, Apache Spark, Apache Hive, and Hue without re-starting the cluster. EMR Application
Reconfiguration allows you to modify applications on the fly without needing to shut down or re-
create the cluster. Amazon EMR will apply your new configurations and gracefully restart the
reconfigured application. Configurations can be applied through the Console, SDK, or CLI.
Elastic
Amazon EMR enables you to provision capacity as you need it, and automatically or manually add
and remove capacity. This is useful if you have variable or unpredictable processing requirements.
For example, if the bulk of your processing occurs at night, you might need 100 instances during the
day and 500 instances at night. Alternatively, you might need a significant amount of capacity for a
short period of time. With Amazon EMR you can provision instances, automatically scale to match
compute requirements, and shut your cluster down when your job is complete.
Deploy multiple clusters: If you need more capacity, you can launch a new cluster and terminate it
when you no longer need it. There is no limit to how many clusters you can have. You may want to
use multiple clusters if you have multiple users or applications. For example, you can store your
input data in Amazon S3 and launch one cluster for each application that needs to process the data.
One cluster might be optimized for CPU, a second cluster might be optimized for storage, etc.
Resize a running cluster: EMR Managed Scaling is designed to automatically scale or manually
resize a running cluster. You may want to scale out a cluster to temporarily add more processing
power to the cluster, or scale in your cluster to save on costs when you have idle capacity. For
example, some customers add hundreds of instances to their clusters when their batch processing
occurs, and remove the extra instances when processing completes. When adding instances to your
cluster, EMR can now start utilizing provisioned capacity as soon as it becomes available. When
scaling in, EMR will proactively choose idle nodes to reduce the impact on running jobs.
Amazon EMR enables use of Spot instances so you can save both time and money. Amazon EMR
clusters include 'core nodes' that run HDFS and ‘task nodes’ that do not; task nodes are ideal for
Spot because if the Spot price increases and you lose those instances you will not lose data stored
in HDFS. With the combination of instance fleets, allocation strategies for spot instances, EMR
Managed Scaling and more diversification options, you can now optimize EMR for resilience and
cost.
Amazon S3 Integration
The EMR File System (EMRFS) allows EMR clusters to use Amazon S3 as an object store for
Hadoop. You can store your data in Amazon S3 and use multiple Amazon EMR clusters to process
the same data set. Each cluster can be optimized for a particular workload, which can be more
efficient than a single cluster serving multiple workloads with different requirements. For example,
you might have one cluster that is optimized for I/O and another that is optimized for CPU, each
processing the same data set in Amazon S3. In addition, by storing your input and output data in
Amazon S3, you can shut down clusters when they are no longer needed.
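On an EMR cluster, Spark jobs can address S3 objects through EMRFS using s3:// URIs, as in the short PySpark sketch below; the bucket and prefix names are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-example").getOrCreate()

# Read input that lives in S3 (placeholder bucket/prefix), filter it,
# and write the result back to S3 -- no data has to stay on the cluster.
logs = spark.read.text("s3://my-input-bucket/raw-logs/")
errors = logs.filter(logs.value.contains("ERROR"))
errors.write.mode("overwrite").text("s3://my-output-bucket/error-logs/")

spark.stop()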
EMRFS supports S3 server-side or S3 client-side encryption using AWS Key Management Service
(KMS) or customer-managed keys, and offers an optional consistent view which checks for list and
read-after-write consistency for objects tracked in its metadata. Also, Amazon EMR clusters can use
both EMRFS and HDFS, so you don’t have to choose between on-cluster storage and Amazon S3.
You can use the AWS Glue Data Catalog as a managed metadata repository to store external table
metadata for Apache Spark and Apache Hive. Additionally, it provides automatic schema discovery
and schema version history. This allows you to persist metadata for your external tables on Amazon
S3 outside of your cluster.
Amazon S3
Amazon S3 is a highly durable, scalable, secure, fast, and inexpensive storage service. With the
EMR File System (EMRFS), Amazon EMR can efficiently and securely use Amazon S3 as an object
store for Hadoop. Amazon EMR has made numerous improvements to Hadoop, allowing you to
process large amounts of data stored in Amazon S3. Also, EMRFS can enable consistent view to
check for list and read-after-write consistency for objects in Amazon S3. EMRFS supports S3 server-
side or S3 client-side encryption to process encrypted Amazon S3 objects, and you can use the
AWS Key Management Service (KMS) or a custom key vendor.
When you launch your cluster, Amazon EMR streams the data from Amazon S3 to each instance in
your cluster and begins processing. One advantage of storing your data in Amazon S3 and
processing it with Amazon EMR is you can use multiple clusters to process the same data. For
example, you might have a Hive development cluster that is optimized for memory and a Pig
production cluster that is optimized for CPU both using the same input data set.
HDFS is the Hadoop file system. Amazon EMR’s current topology groups its instances into 3 logical
instance groups: Master Group, which runs the YARN Resource Manager and the HDFS Name
Node Service; Core Group, which runs the HDFS DataNode Daemon and the YARN Node Manager
service; and Task Group, which runs the YARN Node Manager service. Amazon EMR installs HDFS
on the storage associated with the instances in the Core Group.
Each EC2 instance comes with a fixed amount of storage, referred to as the "instance store", attached
to the instance. You can also customize the storage on an instance by adding Amazon EBS
volumes to an instance. Amazon EMR allows you to add General Purpose (SSD), Provisioned (SSD)
and Magnetic volumes types. The EBS volumes added to an EMR cluster do not persist data after
the cluster is shutdown. EMR will automatically clean-up the volumes, once you terminate your
cluster.
You can also enable complete encryption for HDFS using an Amazon EMR security configuration, or
manually create HDFS encryption zones with the Hadoop Key Management Server. You can use a
security configuration option to encrypt EBS root device and storage volumes when you specify
AWS KMS as your key provider.
Amazon DynamoDB
Amazon DynamoDB is a managed NoSQL database service. Amazon EMR has direct integration
with Amazon DynamoDB so you can process data stored in Amazon DynamoDB and transfer data
between Amazon DynamoDB, Amazon S3, and HDFS in Amazon EMR.
You can also use Amazon Relational Database Service (a web service designed to set up, operate,
and scale a relational database in the cloud), Amazon Glacier (a storage service that provides
secure and durable storage for data archiving and backup), and Amazon Redshift (a managed data
warehouse service). AWS Data Pipeline is a web service that helps you process and move data
between different AWS compute and storage services (including Amazon EMR) as well as on-
premises data sources at specified intervals.
Amazon EMR supports Hadoop tools such as Apache Spark, Apache Hive, Presto, and Apache
HBase. Data scientists use EMR to run deep learning and machine learning tools such as
TensorFlow, Apache MXNet, and, using bootstrap actions, add use case-specific tools and libraries.
Data analysts use EMR Studio, Hue and EMR Notebooks for interactive development, authoring
Apache Spark jobs, and submitting SQL queries to Apache Hive and Presto. Data Engineers use
EMR for data pipeline development and data processing, and use Apache Hudi to simplify
incremental data management and data privacy use cases requiring record-level insert, updates,
and delete operations.
Apache Spark is an engine in the Hadoop ecosystem for processing large data sets. It uses in-
memory, fault-tolerant resilient distributed datasets (RDDs) and directed acyclic graphs (DAGs) to
define data transformations. Spark also includes Spark SQL, Spark Streaming, MLlib, and GraphX.
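As a brief illustration of Spark's lazily built DAG of transformations over in-memory RDDs, here is a self-contained PySpark word-count sketch (the input is inlined so no external data is assumed).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dag-example").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["to be or not to be", "that is the question"])

# Each transformation only extends the DAG; nothing executes yet.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# collect() is an action: it triggers execution of the whole DAG.
print(counts.collect())

spark.stop()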
Apache Flink is a streaming dataflow engine that enables you to run real-time stream processing on
high-throughput data sources. It supports event time semantics for out of order events, exactly-once
semantics, backpressure control, and APIs optimized for writing both streaming and batch
applications.
TensorFlow is an open source symbolic math library for machine intelligence and deep learning
applications. TensorFlow bundles together multiple machine learning and deep learning models and
algorithms and can train and run deep neural networks for many different use cases.
Apache Hudi is an open-source data management framework used to simplify incremental data
processing and data pipeline development. Apache Hudi enables you to manage data at the record-
level in Amazon S3 to simplify Change Data Capture (CDC) and streaming data ingestion, and
provides a framework to handle data privacy use cases requiring record level updates and deletes.
SQL
Apache Hive is an open source data warehouse and analytics package that runs on top of Hadoop.
Hive is operated by Hive QL, a SQL-based language which allows users to structure, summarize,
and query data. Hive QL goes beyond standard SQL, adding first-class support for map/reduce
functions and complex extensible user-defined data types like JSON and Thrift. This capability
allows processing of complex and unstructured data sources such as text documents and log files.
Hive allows user extensions via user-defined functions written in Java. Amazon EMR has made
numerous improvements to Hive, including direct integration with Amazon DynamoDB and Amazon
S3. For example, with Amazon EMR you can load table partitions automatically from Amazon S3,
you can write data to tables in Amazon S3 without using temporary files, and you can access
resources in Amazon S3 such as scripts for custom map/reduce operations and additional libraries.
Presto is an open-source distributed SQL query engine optimized for low-latency, ad-hoc analysis of
data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and
window functions. Presto can process data from multiple data sources including the Hadoop
Distributed File System (HDFS) and Amazon S3.
Apache Phoenix enables low-latency SQL with ACID transaction capabilities over data stored in
Apache HBase. You can create secondary indexes for additional performance, and create different
views over the same underlying HBase table.
NoSQL
Apache HBase is an open source, non-relational, distributed database modeled after Google's
BigTable. It was developed as part of Apache Software Foundation's Hadoop project and runs on
top of the Hadoop Distributed File System (HDFS) to provide BigTable-like capabilities for Hadoop.
HBase provides you a fault-tolerant, efficient way of storing large quantities of sparse data using
column-based compression and storage. In addition, HBase provides fast lookup of data because it
caches data in-memory. HBase is optimized for sequential write operations, and for batch inserts,
updates, and deletes. HBase works with Hadoop, sharing its file system and serving as a direct input
and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over
HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC).
With EMR, you can use S3 as a data store for HBase, enabling you to reduce operational
complexity. If you use HDFS as a data store, you can back up HBase to S3 and you can restore
from a previously created backup.
Interactive Analytics
EMR Studio is an integrated development environment (IDE) that enables data scientists and data
engineers to develop, visualize, and debug data engineering and data science applications written in
R, Python, Scala, and PySpark. EMR Studio provides managed Jupyter Notebooks, and tools like
Spark UI and YARN Timeline Service to simplify debugging.
Hue is an open source user interface for Hadoop that makes it easier to run and develop Hive
queries, manage files in HDFS, run and develop Pig scripts, and manage tables. Hue on EMR also
integrates with Amazon S3, so you can query directly against S3 and transfer files between HDFS
and Amazon S3.
Jupyter Notebook is an open-source web application that you can use to create and share
documents that contain live code, equations, visualizations, and narrative text. JupyterHub allows
you to host multiple instances of a single-user Jupyter notebook server. When you create an EMR
cluster with JupyterHub, EMR creates a Docker container on the cluster's master node. JupyterHub,
all the components required for Jupyter, and Sparkmagic run within the container.
Apache Zeppelin is an open source GUI which creates interactive and collaborative notebooks for
data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to
manipulate data and visualize results. Zeppelin notebooks can be shared among several users, and
visualizations can be published to external dashboards.
Apache Oozie is a workflow scheduler for Hadoop, where you can create Directed Acyclic Graphs
(DAGs) of actions. Also, you can trigger your Hadoop workflows by actions or time. AWS Step
Functions allows you to add serverless workflow automation to your applications. The steps of your
workflow can run anywhere, including in AWS Lambda functions, on Amazon Elastic Compute Cloud
(EC2), or on-premises.
EMR also supports a variety of other popular applications and tools, such as R, Apache Pig (data
processing and ETL), Apache Tez (complex DAG execution), Apache MXNet (deep learning),
Ganglia (monitoring), Apache Sqoop (relational database connector), HCatalog (table and storage
management), and more. The Amazon EMR team maintains an open source repository of bootstrap
actions that can be used to install additional software, configure your cluster, or serve as examples
for writing your own bootstrap actions.
Integration with AWS Lake Formation allows you to define and manage fine-grained authorization
policies in AWS Lake Formation to access databases, tables, and columns in AWS Glue Data
Catalog. You can enforce the authorization policies on jobs submitted through Amazon EMR
Notebooks and Apache Zeppelin for interactive EMR Spark workloads, and send auditing events to
AWS CloudTrail. By enabling this integration, you also enable federated Single Sign-On to EMR
Notebooks or Apache Zeppelin from enterprise identity systems compatible with Security Assertion
Markup Language (SAML) 2.0.
Native integration with Apache Ranger allows you to set up a new or an existing Apache Ranger
server to define and manage fine-grained authorization policies for users to access databases,
tables, and columns of Amazon S3 data via Hive Metastore. Apache Ranger is an open-source tool
to enable, monitor, and manage comprehensive data security across the Hadoop platform.
This native integration allows you to define three types of authorization policies on the Apache
Ranger Policy Admin server. You can set table, column, and row level authorization for Hive, table
and column level authorization for Spark, and prefix and object level authorization for Amazon S3.
Amazon EMR installs and configures the corresponding Apache Ranger plugins on the cluster.
These Ranger plugins sync up with the Policy Admin server for authorization policies, enforce data
access control, and send auditing events to Amazon CloudWatch Logs.
Additional features
Select the right instance for your cluster
You choose what types of EC2 instances to provision in your cluster (standard, high memory, high
CPU, high I/O, etc.) based on your application’s requirements. You have root access to every
instance and you can customize your cluster to suit your requirements.
When you enable debugging on a cluster, Amazon EMR archives the log files to Amazon S3 and
then indexes those files. You can then use a graphical interface in the console to browse the logs
and view job history in an intuitive way.
You can use Amazon CloudWatch to monitor custom Amazon EMR metrics, such as the average
number of running map and reduce tasks. You can also set alarms on these metrics.
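For example, the boto3 sketch below creates an alarm on a cluster metric; the cluster ID, SNS topic ARN, and thresholds are placeholders, and the metric shown (IsIdle in the AWS/ElasticMapReduce namespace) is one commonly used EMR metric.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")   # placeholder region

# Alarm when the cluster has been idle for three consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="emr-cluster-idle",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],  # placeholder ID
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:placeholder-topic"],
)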
Respond to events
You can use Amazon EMR event types in Amazon CloudWatch Events to respond to state changes
in your Amazon EMR clusters. Using simple rules that you can set up, match events and route them
to Amazon SNS topics, AWS Lambda functions, Amazon SQS queues, and more.
You can use AWS Data Pipeline to schedule recurring workflows involving Amazon EMR. AWS Data
Pipeline is a web service that helps you reliably process and move data between different AWS
compute and storage services as well as on-premise data sources at specified intervals.
Deep learning
Use popular deep learning frameworks like Apache MXNet to define, train, and deploy deep neural
networks. You can use these frameworks on Amazon EMR clusters with GPU instances.
You can launch your cluster in an Amazon Virtual Private Cloud (VPC), a logically isolated section of
the AWS cloud. You have control over your virtual networking environment, including selection of
your own IP address range, creation of subnets, and configuration of route tables and network
gateways.
You can use AWS Identity and Access Management (IAM) tools such as IAM Users and Roles to
control access and permissions. For example, you could give certain users read but not write access
to your clusters. Also, you can use Amazon EMR security configurations to set various encryption at-
rest and in-transit options, including support for Amazon S3 encryption, and Kerberos
authentication.
You can use bootstrap actions or a custom Amazon Machine Image (AMI) running Amazon Linux to
install additional software on your cluster. Bootstrap actions are scripts that are run on the cluster
nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node
begins processing data. You can also preload and use software on a custom Amazon Linux AMI.
Copy data
You can move data from Amazon S3 to HDFS, from HDFS to Amazon S3, and between Amazon S3
buckets using Amazon EMR’s S3DistCp, an extension of the open source tool Distcp, which uses
MapReduce to move large amounts of data.
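One common way to run S3DistCp is to submit it as a step on a running cluster, as in the boto3 sketch below; the cluster ID and the source and destination paths are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")            # placeholder region

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                               # placeholder cluster ID
    Steps=[{
        "Name": "copy-input-from-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",                       # runs s3-dist-cp on the cluster
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-input-bucket/raw-data/",     # placeholder source
                "--dest", "hdfs:///input/",                    # placeholder destination
            ],
        },
    }],
)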
Custom JAR
Write a Java program, compile against the version of Hadoop you want to use, and upload to
Amazon S3. You can then submit Hadoop jobs to the cluster using the Hadoop JobClient interface.
EMR Studio is an integrated development environment (IDE) that enables data scientists and data
engineers to develop, visualize, and debug data engineering and data science applications written in
R, Python, Scala, and PySpark.
EMR Studio provides managed Jupyter Notebooks, and tools like Spark UI and YARN Timeline
Service to simplify debugging. EMR Studio uses AWS Single Sign-On and allows you to log in
directly with your corporate credentials without logging into the AWS console. Data scientists and
analysts can install custom kernels and libraries, collaborate with peers using code repositories such
as GitHub and BitBucket, or execute parameterized notebooks as part of scheduled workflows using
orchestration services like Apache Airflow or Amazon Managed Workflows for Apache Airflow.
EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data
processing using the performance optimized Amazon EMR runtime for Apache
Spark. Administrators can set up EMR Studio such that analysts can run their applications on
existing EMR clusters or create new clusters using pre-defined AWS CloudFormation templates for
EMR.
Benefits:
Simple to use
EMR Studio is designed to make it simple to interact with applications on an EMR cluster. You can
access EMR Studio with your corporate credentials using AWS Single Sign-On, without logging into
the AWS console or the cluster. You can interactively explore, process and visualize data using
notebooks, build and schedule pipelines, and debug applications without logging into EMR clusters.
With EMR Studio, you can start developing analytics and data science applications in R, Python,
Scala, and PySpark with managed Jupyter Notebooks. You can attach notebooks to existing EMR
clusters or auto-provision clusters using pre-configured templates to run jobs. You can collaborate
with others using repositories, and install custom Python libraries or kernels directly from Notebooks.
EMR Studio enables you to move from prototyping to production. You can trigger pipelines from
code repositories, simply run Notebooks as pipelines using orchestration tools like Apache Airflow or
Amazon Managed Workflows for Apache Airflow, or attach notebooks to a bigger cluster using a
single click.
Simplified debugging
With EMR Studio, you can debug jobs and access logs without logging into the cluster for both
active and terminated clusters. You can use native application interfaces such as Spark UI and
YARN timeline service directly from EMR Studio. EMR Studio also allows you to locate the cluster or
job to debug by using filters such as cluster state, creation time, and cluster ID.
Use cases:
With EMR Studio, you can log in directly to managed notebooks without logging into the AWS
console, start notebooks in seconds, get onboarded with sample notebooks, and perform your data
exploration. You can collaborate with peers by sharing notebooks via GitHub and other repositories.
You can also customize your environment by loading custom kernels and Python libraries from
notebooks.
In EMR Studio, you can use code repository to trigger pipelines. You can also parameterize and
chain notebooks to build pipelines. You can integrate notebooks into scheduled workflows using
workflow orchestration services such as Apache Airflow or Amazon Managed Workflows for Apache
Airflow. EMR Studio also allows you to re-attach notebooks to a bigger cluster to run a job.
In EMR Studio, you can debug notebook applications from the notebook UI. You can also debug
pipelines by first narrowing down clusters using filters like cluster state, and diagnose jobs on both
active and terminated clusters with as few clicks as possible to open native debugging UIs like Spark
UI, Tez UI, and YARN Timeline Service.
With EMR Notebooks, there is no software or instances to manage. You can either attach the
notebook to an existing cluster or provision a new cluster directly from the console. You can attach
multiple notebooks to a single cluster, detach notebooks and re-attach them to new clusters.
Use cases:
Variable workloads
With EMR Serverless, you can automatically scale application resources as workload demands
change, without having to preconfigure how much compute power and memory you need.
You can pre-initialize application resources in EMR Serverless to help speed up response time for
SLA-sensitive data pipelines.
EMR Serverless can help you quickly spin up a development and test environment that automatically
scales with unpredictable usage.
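For example, a job can be submitted to an existing EMR Serverless application with the boto3 emr-serverless client. This is a hedged sketch; the application ID, IAM role ARN, and script locations are placeholders.
import boto3

serverless = boto3.client("emr-serverless")

# Submit a Spark job to an existing EMR Serverless application;
# EMR Serverless scales the underlying resources automatically.
response = serverless.start_job_run(
    applicationId="00abc123def456",                                    # placeholder
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/etl_job.py",
            "entryPointArguments": ["s3://my-bucket/output/"],
        }
    },
)
print(response["jobRunId"])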
With Amazon EMR on Amazon EKS, you can share compute and memory resources across all of
your applications and use a single set of Kubernetes tools to centrally monitor and manage your
infrastructure. You can also use a single EKS cluster to run applications that require different
Apache Spark versions and configurations, and take advantage of automated provisioning, scaling,
faster runtimes, and development and debugging tools that EMR provides.
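Similarly, a Spark job can be submitted to an EMR virtual cluster running on EKS with the boto3 emr-containers client. Again, a minimal sketch only; the virtual cluster ID, role ARN, release label, and script path are placeholders.
import boto3

containers = boto3.client("emr-containers")

# Submit a Spark job to an EMR on EKS virtual cluster.
containers.start_job_run(
    virtualClusterId="abcdef1234567890",                               # placeholder
    name="sample-spark-job",
    executionRoleArn="arn:aws:iam::123456789012:role/EMRContainersJobRole",
    releaseLabel="emr-6.2.0-latest",                                   # EMR release to use
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/etl_job.py",
        }
    },
)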
Benefits:
Simplify management
EMR benefits for Apache Spark on EKS include managed versions of Apache Spark 2.4 and 3.0,
automatic provisioning, scaling, performance optimized runtime, and tools like EMR Studio for
authoring jobs and an Apache Spark UI for debugging.
Optimize performance
By running analytics applications on EKS, you can reuse existing EC2 instances in your shared
Kubernetes cluster and avoid the startup time of creating a new cluster of EC2 instances dedicated
for analytics.
Use cases:
With EMR on EKS, you can automate the provisioning, management, and scaling of Apache Spark,
and use a single set of tools to centrally manage and monitor your infrastructure.
Co-location of workloads
Run multiple EMR workloads that require different frameworks, versions, and configurations on the
same EKS cluster as your other application workloads.
EMR on EKS provides a managed experience for developing, troubleshooting, and optimizing your
analytics. You can deploy configurations and start jobs to test new EMR versions on the same EKS
cluster without allocating dedicated resources.
Benefits of EMR on AWS Outposts:
Once your Outpost is set up, you can launch a new EMR cluster on-premises and connect to
existing HDFS storage. This allows you to respond when on-premises systems need additional
processing capacity. Adding capacity to on-premises Hadoop and Spark clusters helps meet
workload demands in periods of high utilization.
Apache Hadoop, Apache Hive, Apache Spark, and Presto are commonly used to process,
transform, and analyze data that is part of a larger data architecture. For data that needs to remain
on-premises for governance, compliance, or other reasons, you can use EMR to deploy and run
applications like Apache Hadoop and Apache Spark on-premises, close to your data. This reduces
the need to move large amounts of on-premises data to the cloud, reducing the overall time needed
to process that data.
If you’re in the process of migrating data and Apache Hadoop workloads to the cloud and want to
start using EMR before your migration is complete, you can use AWS Outposts to launch EMR
clusters on-premises that connect to your existing HDFS storage. You can then gradually migrate
your data to Amazon S3 as part of an evolution to a cloud architecture.
You're a Python developer, and you're ready to develop cloud applications for Microsoft Azure.
To help you prepare for a long and productive career, this series of three articles orients you to
the basic landscape of cloud development on Azure.
Microsoft's CEO, Satya Nadella, often refers to Azure as "the world's computer." A computer, as
you well know, is a collection of hardware components that are managed by an operating system.
The operating system provides a platform upon which you can build software that helps
people apply the system's computing power to any number of tasks. (That's why we use the word
"application" to describe such software.)
In Azure, the computer's hardware isn't a single machine but an enormous pool of virtualized
server computers contained in dozens of massive datacenters around the world. The Azure
"operating system" is then composed of services that dynamically allocate and de-allocate
different parts of that resource pool as applications need them. Those dynamic allocations allow
applications to respond quickly to any number of changing conditions, such as customer demand.
Each allocation is called a resource, and each resource is assigned both a unique object
identifier (a GUID) and a unique URL. Examples of resources include virtual machines (CPU
cores and memory), storage, databases, virtual networks, container registries, container
orchestrators, web hosts, and AI and analytics engines.
Resources are the building blocks of a cloud application. The cloud development process thus
begins with creating the appropriate environment into which you can deploy the different parts of
the application. Put simply, you can't deploy any code or data to Azure until you've allocated and
configured—that is provisioned—the suitable target resources.
The process of creating the environment for your application involves identifying the relevant
services and resource types involved, and then provisioning those resources. The provisioning
process is essentially how you construct the computing system to which you deploy your
application. Provisioning is also the point at which you begin renting those resources from
Azure.
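As a concrete illustration, provisioning can be done programmatically with the Azure SDK for Python. The following is a minimal sketch assuming you have already logged in (for example with the Azure CLI); the subscription ID, resource group name, and region are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Authenticate using whatever credential is available (CLI login, environment, etc.)
credential = DefaultAzureCredential()
subscription_id = "00000000-0000-0000-0000-000000000000"   # placeholder

client = ResourceManagementClient(credential, subscription_id)

# Provision a resource group: the container into which other resources are deployed
rg = client.resource_groups.create_or_update(
    "python-app-rg",                 # hypothetical resource group name
    {"location": "eastus"},
)
print(rg.name, rg.location)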
There are hundreds of different types of Azure resources at your disposal. You can choose a
basic "infrastructure" resource like a virtual machine when you need to retain full control and
responsibility for the software you deploy. In other scenarios, you can choose higher-level
"platform" services that provide a more managed environment in which you concern yourself
only with data and application code.
While finding the right services for your application and balancing their relative costs can be
challenging, it's also part of the creative fun of cloud development. To understand the many
choices, review the Azure developer's guide. Here, let's next discuss how you actually work with
all of these services and resources.
Note
You've probably seen and perhaps have grown weary of the terms IaaS (infrastructure-as-a-
service), PaaS (platform-as-a-service), and so on. The as-a-service part reflects the reality that
you generally don't have physical access to the datacenters themselves. Instead, you use tools
like the Azure portal, Visual Studio Code, the Azure CLI, or Azure's REST API to
provision infrastructure resources, platform resources, and so on. As a service, Azure is always
standing by waiting to receive your requests.
On this developer center, we spare you the IaaS, PaaS, etc. jargon because "as-a-service" is just
inherent to the cloud to begin with!
Note
A hybrid cloud refers to the combination of private computers and datacenters with cloud
resources like Azure, and has its own considerations beyond what's covered in the previous
discussion. Furthermore, this discussion assumes new application development; scenarios that
involve rearchitecting and migrating existing on-premises applications are not covered here. For
more information on those topics, see Get started with the Cloud Adoption Framework.
Note
You might hear the terms cloud native and cloud enabled applications, which are often discussed
as the same thing. There are differences, however. A cloud enabled application is often one that
is migrated, as a whole, from an on-premises datacenter to cloud-based servers. Oftentimes, such
applications retain their original structure and are simply deployed to virtual machines in the
cloud (and therefore across geographic regions). Such a migration allows the application to scale
to meet global demand without having to provision new hardware in your own datacenter.
However, scaling must be done at the virtual machine (or infrastructure) level, even if only one
part of the application needs increased performance.
A cloud native application, on the other hand, is written from the outset to take advantage of the
many different, independently scalable services available in a cloud such as Azure. Cloud native
applications are more loosely structured (using micro-service architectures, for example), which
allows you to more precisely configure deployment and scaling for each part. Such a structure
simplifies maintenance and often dramatically reduces costs because you need to pay for premium
services only where necessary.
For more information, see Build cloud-native applications in Azure and Introduction to
cloud-native applications, the principles of which apply to applications written in any language,
including Python.
Recommended content
The Azure development flow
An overview of the Azure cloud development cycle, which involves provisioning (creating and
configuring), coding, testing, deployment, and management of Azure resources.
Provisioning, accessing, and managing resources in Azure
An overview of ways you can work with Azure resources, including Azure portal, VS Code,
Azure CLI, Azure PowerShell, and Azure SDKs.
Getting started with hosting Python apps on Azure
Index of getting started material in the Azure documentation for hosting Python app code.
Use the Azure libraries (SDK) for Python
Overview of the features and capabilities of the Azure libraries for Python that help developers
be more productive when creating, using, and managing Azure resources.
>vi learn.sh
# inside learn.sh, add the following line, then save and exit:
echo "learn and share - Welcome"
>chmod 777 learn.sh
>ls -a
>./learn.sh
8. Stop the EC2 instance; the terminal connection to the instance will then be closed.
#Open Image
from PIL import Image   # needed for Image.open
import cv2

im = Image.open("TajMahal.jpg")                           # open the image with PIL
im = cv2.imread('TajMahal.jpg', cv2.IMREAD_GRAYSCALE)     # read it in grayscale with OpenCV
cv2.imshow('image', im)                                   # display in a window
cv2.waitKey(0)                                            # wait for a key press
cv2.destroyAllWindows()
Another way to write the above program is to draw a tick/line on the image to
mark it and display it with matplotlib:
import cv2
import numpy as np
from matplotlib import pyplot as plt

im = cv2.imread('TajMahal.jpg', cv2.IMREAD_GRAYSCALE)
cv2.line(im, (0, 0), (150, 150), 255, 15)   # draw a line to mark the image
plt.imshow(im, cmap='gray')
plt.show()
# Application definition (settings.py)
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'projectApp',            # register the new app with the project
]
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include("projectApp.urls")),   # include the app's URLs
]
Now you can use the default MVT model to create URLs,
models, views, etc. in your app and they will be
automatically included in your main project.
The main feature of Django Apps is independence, every
app functions as an independent unit in supporting the
main project.
At this point the project's urls.py cannot reach the app's
URLs on its own.
To run your Django web application properly, the following
actions must be taken:
1. Create a file in the app's directory called urls.py
2. Include the following code, which maps the app's root URL to an index view
(def index(request): ...) assumed to be defined in the app's views.py:
Python3
from django.urls import path
from . import views

urlpatterns = [
    path('', views.index),
]
Security Planning
Before deploying a particular resource to the cloud, one needs to analyze several
aspects of the resource, such as:
Select resource that needs to move to the cloud and analyze its sensitivity to risk.
Consider cloud service models such as IaaS, PaaS, and SaaS. These models
require the customer to be responsible for security at different levels of service.
Consider the cloud type to be used such as public, private,
community or hybrid.
Understand the cloud service provider's system about data storage and its
transfer into and out of the cloud.
The risk in cloud deployment mainly depends upon the service models and cloud types.
A cloud security plan should cover the following areas:
Access Control
Auditing
Authentication
Authorization
All of the service models should incorporate security mechanisms operating in all of the
above-mentioned areas.
Encryption
Encryption helps to protect data from being compromised. It protects data that is being
transferred as well as data stored in the cloud. Although encryption helps to protect data
from any unauthorized access, it does not prevent data loss.
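To make the idea concrete, here is a minimal sketch of symmetric encryption in Python using the cryptography library's Fernet class; it is only an illustration, not tied to any particular cloud provider.
from cryptography.fernet import Fernet

# Generate a secret key; losing it means losing access to the ciphertext,
# which is why encryption protects confidentiality but does not prevent data loss.
key = Fernet.generate_key()
f = Fernet(key)

token = f.encrypt(b"records to store or transfer in the cloud")   # protect data
plaintext = f.decrypt(token)                                      # recover it with the key
print(plaintext)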
2. Better Collaboration
Real-time collaboration is an important aspect of cloud computing in education. Cloud software lets students and teachers share resources, work on the same documents, and complete group projects together from any location.
4. Scalability
Compared with scaling on-premises data centers, cloud-based software helps to reduce costs associated
with facility growth. No matter how many students you have or higher education facilities you manage,
your cloud system can grow alongside you.
Despite its benefits, there are also a few cloud computing issues and challenges in education.
Working with a managed service provider helps to quickly determine whether the
source of the issue is the end user or the cloud provider. A solution can then be implemented to
provide you with improved access and connectivity.
2. Less Control
Although a benefit of cloud is accessibility to services and platforms in the education sector (like
Blackboard), the concern is you have less control over updates, training, and other features.
Since the solution is being handled “as a service,” the infrastructure is handled by the cloud
service provider and abstracted from your in-house team.
Everything is hosted off-site, so you’ll have less control over the infrastructure and the system
setup. These are handled by your cloud service provider.
3. Vendor Commitment
Cloud solutions for higher education depend on the services of a single vendor. You typically
can’t switch between service providers.
Working with an MSP can help you choose the right vendor for your needs. When moving
educational workloads to the cloud, picking the right provider is critical.
A good provider listens to you, understands your risk and manages it from beginning to end,
eliminating any unforeseen issues that may occur.
In most cases, once you sign with a provider, you will be locked into a service contract with
them. Most providers will let you out of a contract, but they will charge a penalty for breaking
it early. This may not be a problem if you’re satisfied with your services, but it’s worth
mentioning all the same.
4. Security
Cloud-based education technology is secure when set up correctly, but there are inherent security
risks when all assets are hosted online. Improperly-secured cloud systems may be vulnerable to
cyberattacks, and data security becomes a bigger concern.
This concern escalates when users access resources across devices. If a device with saved
credentials gets stolen, the cloud platform becomes accessible to an unauthorized user.
To avoid these issues, you’ll need to make security a priority. This begins with a proper setup of
your cloud infrastructure and ensuring that all users are trained in cloud security best practices.
5. Up-Front Costs
While cost reduction is one of the primary benefits of cloud computing in education, there are
also some up-front costs.
The migration may be costly, depending on how many applications or services you’re moving to
the cloud. There’s also an opportunity cost in the time required to train staff on the new system
and security best practices.
The savings come mostly from long-term reductions in operational IT costs, so administrators
need to be prepared to cover the up-front investment before those long-term savings are realized.