Unit 5
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon
Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management
Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run
ad-hoc queries and get results in seconds.
Amazon Athena also makes it easy to interactively run data analytics using Apache Spark without
having to plan for, configure, or manage resources. When you run Apache Spark applications on
Athena, you submit Spark code for processing and receive the results directly.
Athena SQL and Apache Spark on Amazon Athena are serverless, so there is no infrastructure to set
up or manage, and you pay only for the queries you run. Athena scales automatically—running
queries in parallel—so results are fast, even with large datasets and complex queries.
Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3.
Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC. You
can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data
into Athena.
Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to
generate reports or to explore data with business intelligence tools or SQL clients connected with a
JDBC or an ODBC driver.
Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your
data in Amazon S3. This allows you to create tables and query data in Athena based on a central
metadata store available throughout your Amazon Web Services account and integrated with the ETL
and data discovery features of AWS Glue.
Amazon Athena makes it easy to run interactive queries against data directly in Amazon S3 without
having to format data or manage infrastructure. For example, Athena is useful if you want to run a
quick query on web logs to troubleshoot a performance issue on your site. With Athena, you can get
started fast: you just define a table for your data and start querying using standard SQL.
You should use Amazon Athena if you want to run interactive ad hoc SQL queries against data in
Amazon S3 without having to manage any infrastructure or clusters. Amazon Athena provides the
easiest way to run ad hoc queries on data in Amazon S3 without the need to set up or manage any
servers.
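As a rough illustration, the sketch below runs an ad hoc Athena query from Python with boto3; the database name, table name, and S3 output location are assumptions for the example, not values from this unit.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit an ad hoc SQL query against data already cataloged in Amazon S3
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])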
AWS Athena is a powerful serverless query service provided by AWS for analyzing data directly in
Amazon S3 using standard SQL. It offers high scalability, cost-effectiveness, and an easy-to-use
platform for running complex queries without extensive infrastructure setup. This section discusses
what AWS Athena is, its architecture, benefits, limitations, advantages, and disadvantages, and how
it differs from Amazon Redshift, AWS Glue, and Microsoft SQL Server.
AWS Athena is a serverless interactive query service that enables standard SQL analysis of data in
Amazon S3. Athena is based on Presto, a distributed SQL query engine, and it can query data in
Amazon S3 quickly using conventional SQL syntax. There is no infrastructure to handle with Athena, so
you can focus on analyzing data at scale. To get a better idea of AWS Athena, let us understand the
architecture first.
When a query is run, Athena gets the metadata from the metastore to establish the data's location
and format. Athena also interfaces with AWS Glue, a fully managed extract, transform, and load (ETL)
service, allowing customers to create and manage data catalogs and ETL processes. Furthermore, we
will go through the various components of AWS Athena.
Amazon S3: Athena queries data stored in Amazon S3, an object storage service that is highly
durable, highly available, and virtually unlimited in scale.
AWS Glue: Athena leverages AWS Glue, a fully managed extract, transform, and load (ETL)
service, to catalog the data stored in S3 so that it can be queried.
Apache Presto: Apache Presto is Athena's distributed SQL query engine. Presto is well-suited for
querying data stored in distributed systems and can handle queries that require data from numerous
sources to be joined.
Amazon CloudWatch: Athena interacts with Amazon CloudWatch, a monitoring service that offers
metrics and logs for all of your AWS account's resources. CloudWatch may be used to track the
performance of your Athena queries and create alerts for specific query patterns.
Amazon VPC: Athena supports performing queries within an Amazon Virtual Private Cloud (VPC),
which allows you to isolate your data and limit access to it using Amazon VPC security groups and
network ACLs.
Encryption: Athena supports S3 server-side encryption with Amazon S3-managed keys (SSE-S3) or
AWS Key Management Service-managed keys (SSE-KMS), as well as SSL/TLS encryption of data in
transit.
• Cost-Effective: It only charges for the queries you run, with no upfront costs or resource
provisioning.
• Ease of Use: It queries data directly from Amazon S3 using standard SQL, which makes it
accessible to users familiar with SQL.
• No infrastructure setup: Athena is a serverless service that eliminates the need for users to
set up and manage infrastructure, making data querying easier and faster.
• Cost-effective: Athena charges customers solely for the quantity of data scanned by their
queries, making it an affordable solution for ad hoc and exploratory queries.
• SQL support: Since Athena supports ANSI SQL, users can query data in S3 using their existing
SQL knowledge and tools.
• Complex Queries: Performance may degrade with highly complex queries or very large
datasets that require multiple joins and aggregations.
• Cold Start Latency: The initial query execution may experience some delay due to cold start
latency, especially for infrequently accessed data.
• Limited Data Manipulation: Athena is primarily designed for querying and offers only limited
support for data modification operations such as INSERT, UPDATE, or DELETE.
• Restricted query performance: The volume of data scanned and the intricacy of the query
can limit Athena's speed, resulting in lengthier query times.
• No real-time querying: Because Athena is intended for batch processing, it may not be ideal
for real-time querying.
• Limited data types: In comparison to other database systems, Athena only supports a
restricted selection of data types.
• Serverless architecture: Athena is a fully-managed service that does not require any
infrastructure setup, management, or scaling.
• Standard SQL support: Since Athena supports ANSI SQL, users can easily query the
data in S3 using their existing SQL knowledge and tools.
• Connection with the AWS ecosystem: Athena interfaces with other AWS services such as
Amazon S3, AWS Glue, and AWS Lambda, enabling customers to import and convert data
from a variety of sources.
• Integration with BI tools: Athena provides connectivity with major business intelligence
tools such as Tableau, Power BI, and Amazon QuickSight, allowing users to build visualizations
and reports.
Comparison of Amazon Athena, AWS Redshift, Microsoft SQL Server, and AWS Glue:
• Service type: Amazon Athena is a serverless interactive query service; AWS Redshift is a fully
managed data warehouse; Microsoft SQL Server is a relational database management system;
AWS Glue is a serverless data integration service.
• Primary use case: Athena performs ad hoc querying on Amazon S3; Redshift facilitates data
warehousing and OLAP; SQL Server facilitates transactional and analytical processing; Glue
facilitates ETL and data cataloging.
• Data storage: Athena queries data in Amazon S3; Redshift uses Redshift managed storage and
integrates with S3; SQL Server uses local or cloud storage, depending on the setup; Glue works
with Amazon S3 and other data sources.
• Performance: Athena is optimized for quick queries on large datasets; Redshift delivers high
performance for complex queries and large datasets; SQL Server delivers high performance for
transactional and analytical workloads; Glue is optimized for ETL operations and data
transformation.
• Maintenance: Athena is fully managed with no maintenance required; Redshift is a managed
service but requires some administration; SQL Server requires regular maintenance and
updates; Glue is fully managed with minimal maintenance required.
• Data formats supported: Athena, Redshift, and Glue support JSON, CSV, Parquet, ORC, Avro,
and more; SQL Server supports traditional RDBMS formats.
What is AWS Data Exchange?
AWS Data Exchange is a service that helps AWS customers easily share and
manage data entitlements from other organizations at scale.
As a data receiver, you can track and manage all of your data grants and AWS
Marketplace data subscriptions in one place. When you have access to an AWS
Data Exchange data set, you can use compatible AWS or partner analytics and
machine learning to extract insights from it.
For data senders, AWS Data Exchange eliminates the need to build and maintain
any data delivery and entitlement infrastructure. Anyone with an AWS account can
create and send data grants to data receivers. To sell your data as a product in AWS
Marketplace, make sure that you follow the guidelines to determine eligibility.
A data grant is the unit of exchange in AWS Data Exchange that is created by a data
sender in order to grant a data receiver access to a data set. When a data sender
creates a data grant, a grant request is sent to the data receiver's AWS account. A
data receiver accepts the data grant to gain access to the underlying data.
• Data set – A data set in AWS Data Exchange is a resource curated by the
sender. It contains the data assets a receiver will gain access to after
accepting a data grant. AWS Data Exchange supports five types of data sets:
Files, API, Amazon Redshift, Amazon S3, and AWS Lake Formation
(Preview).
• Data grant details – This information includes a name and description of the
data grant that will be visible to data receivers.
• Recipient access details – This information includes the receiver’s AWS
account ID and specifies how long the receiver should have access to the
data.
Data receivers
As a data receiver, you can view all of your current, pending, and expired data grants
from the AWS Data Exchange console.
You can also discover and subscribe to new third-party data sets available through
AWS Data Exchange from the AWS Marketplace catalog.
As a data sender or provider, you can access AWS Data Exchange through the
following options:
• Directly through the AWS Data Exchange console (Publish data)
• Data providers with data products available in AWS Marketplace can access
programmatically using the following APIs:
o AWS Data Exchange API – Use the API operations to create, view,
update, and delete data sets and revisions. You can also use these
API operations to import and export assets to and from those revisions.
o AWS Marketplace Catalog API – Use the API operations to view and
update data products published to AWS Marketplace.
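As a rough sketch of the AWS Data Exchange API, the snippet below lists the data sets an account is entitled to and the revisions of one of them with boto3; the data set ID is a placeholder, not a real identifier.

import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# List the data sets this account is entitled to as a data receiver
for data_set in dx.list_data_sets(Origin="ENTITLED")["DataSets"]:
    print(data_set["Id"], data_set["Name"], data_set["AssetType"])

# Inspect the revisions (versions) published for one entitled data set
for revision in dx.list_data_set_revisions(DataSetId="example-data-set-id")["Revisions"]:
    print(revision["Id"], revision.get("Comment", ""))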
Supported Regions
AWS Data Exchange data grants, subscriptions, data sets, revisions, and assets are
Region resources that can be managed programmatically or through the AWS Data
Exchange console in supported Regions. Data products published to AWS
Marketplace are available in a single product catalog. Subscribers can see the same
catalog regardless of which supported AWS Region they are using.
• Amazon S3 – AWS Data Exchange allows providers to import and store data
files from their Amazon S3 buckets. Data recipients can export these files to
Amazon S3 programmatically. AWS Data Exchange also enables recipients to
directly access and use providers' Amazon S3 buckets.
• Amazon API Gateway – Another supported asset type for data sets is APIs.
Data recipients can call the API programmatically, call the API from the AWS
Data Exchange console, or download the OpenAPI specification file.
• Amazon Redshift – AWS Data Exchange supports Amazon Redshift data
sets. Data recipients can get read-only access to query the data in Amazon
Redshift without extracting, transforming, and loading data.
• AWS Marketplace – AWS Data Exchange allows data sets to be published
as products in AWS Marketplace. AWS Data Exchange data providers must
be registered as AWS Marketplace sellers, and can use the AWS Marketplace
Management Portal or the AWS Marketplace Catalog API.
• AWS Lake Formation – AWS Data Exchange supports AWS Lake Formation
data permission data sets (Preview). Data recipients get access to data stored
in a data provider's AWS Lake Formation data lake and can query, transform,
and share access to this data from their own AWS Lake Formation data set.
Amazon EMR
Amazon EMR makes it simple and cost effective to run highly distributed processing
frameworks such as Hadoop, Spark, and Presto when compared to on-premises
deployments. Amazon EMR is flexible – you can run custom applications and code,
and define specific compute, memory, storage, and application parameters to
optimize your analytic requirements.
In addition to running SQL queries, Amazon EMR can run a wide variety of scale-out
data processing tasks for applications such as machine learning, graph analytics,
data transformation, streaming data, and virtually anything you can code. You should
use Amazon EMR if you use custom code to process and analyze extremely large
datasets with the latest big data processing frameworks such as Spark, Hadoop,
Presto, or HBase. Amazon EMR gives you full control over the configuration of your
clusters and the software installed on them.
You can use Amazon Athena to query data that you process using Amazon EMR.
Amazon Athena supports many of the same data formats as Amazon EMR. Athena's
data catalog is Hive metastore compatible. If you use EMR and already have a Hive
metastore, you can run your DDL statements on Amazon Athena and query your
data immediately without affecting your Amazon EMR jobs.
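For example, a hedged sketch of such a DDL statement, registering Parquet output that an EMR job might have written to S3 so it becomes queryable in Athena, could look like the following; the database, table, columns, and bucket names are assumptions.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS emr_output.page_views (
    user_id   string,
    page      string,
    view_time timestamp
)
STORED AS PARQUET
LOCATION 's3://my-emr-output/page_views/'
"""

# Running DDL through Athena registers the table in the Hive-compatible data catalog
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "emr_output"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)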
Amazon EMR simplifies the complex processing of large datasets in the cloud. Users can create
clusters that take advantage of the elastic nature of Amazon EC2 instances, which come
pre-configured with frameworks such as Apache Hadoop and Apache Spark. By distributing
processing jobs across several nodes, these clusters handle work in parallel and deliver faster
results. EMR provides scalability by automatically adjusting the cluster size according to workload
needs, and it optimizes data storage by integrating with other AWS services. Users can focus on
their analytics rather than the complicated details of infrastructure and administration, which
makes EMR a simplified approach to big data analytics.
Amazon EMR (Elastic MapReduce) architecture is designed for efficient big data
processing using a distributed computing framework.
1. Clusters: Consist of a master node (manages the cluster), core nodes
(process data and store data in HDFS), and optional task nodes (handle
additional processing).
2. Hadoop Ecosystem: Utilizes tools like Apache Spark, HBase, and Hive, pre-
configured and optimized for big data analytics.
• Integration: EMR integrates with other AWS services, which enhances efficiency in data
processing; for example, its connection with Amazon S3 streamlines workflows.
• Ease Of Use: Amazon EMR makes big data deployments easier by offering pre-configured
environments for Apache Hadoop and Apache Spark. Setting up and maintaining clusters is
easier for users, without the need for complex manual setup.
Amazon EMR offers several deployment options to fit different business needs and preferences. The
following are a few of them:
• On-Demand Instances: Without making any advance commitment, users can create EMR
clusters using On-Demand Instances as they need them and pay for the resources on an hourly
basis. This is a flexible choice for shifting workloads.
• Spot Instances: By using Amazon EC2 Spot Instances, users can request unused EC2 capacity
at a potentially large discount. Spot Instances are best suited for workloads that are
fault-tolerant and can handle interruptions (a minimal cluster-creation sketch follows this list).
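The sketch below creates a small EMR cluster with boto3, combining an On-Demand master and core group with a Spot task group; the release label, instance types, and IAM role names are assumptions for illustration.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            # Master node manages the cluster; core nodes process and store data in HDFS
            {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "ON_DEMAND"},
            # Optional task nodes on Spot capacity handle additional processing at lower cost
            {"Name": "Task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])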
2. Cost Effectiveness: EMR allows users to pay for the resources they need,
when they need them, making it a cost-effective solution for big data
processing.
3. Integration With Other AWS Services: EMR can be easily integrated with
other AWS services such as Amazon S3, Amazon DynamoDB, and Amazon
Redshift for data storage and analysis.
5. Easy To Use: EMR provides an easy-to-use web interface that allows users
to launch and manage clusters, as well as monitor and troubleshoot
performance issues.
2. Latency: The latency of data processing tasks may increase as the size of
the data set increases.
3. Cost: EMR can be expensive for users with large amounts of data or high-
performance requirements, as costs are based on the number of instances
and the amount of storage used.
5. Limited Support For Certain Big Data Frameworks: EMR does not support
some big data frameworks such as Flink, which may be a deal breaker for
some organizations.
6. Limited Support For Certain Applications: EMR is not suitable for all types
of applications, it mainly supports big data processes and analytics.
• Data Analysis: EMR is well known for performing complicated data analytics with big data
frameworks such as Apache Spark. It helps companies make well-informed decisions by letting
them extract insightful information from many types of datasets.
Amazon Redshift
A data warehouse like Amazon Redshift is your best choice when you need to pull
together data from many different sources – like inventory systems, financial
systems, and retail sales systems – into a common format, and store it for long
periods of time. If you want to build sophisticated business reports from historical
data, then a data warehouse like Amazon Redshift is the best choice. The query
engine in Amazon Redshift has been optimized to perform especially well on running
complex queries that join large numbers of very large database tables. When you
need to run queries against highly structured data with lots of joins across lots of
very large tables, choose Amazon Redshift.
Amazon Redshift is a fast, fully managed data warehousing service in the cloud,
enabling businesses to execute complex analytic queries on volumes of data—thus
minimizing delays and ensuring sound support for decision-making across
organizations. It was released in 2013, built to remedy the problems associated with
traditional, on-premises data warehousing, such as scalability, cost, and complexity.
Amazon Redshift is a fully managed service in the cloud, dealing with petabyte-scale
warehouses of data made to store large-scale data and implement effective ways of
running even complex queries. Thus, it enables businesses to quickly and cost-
effectively analyze huge amounts of data by using SQL-based queries and business
intelligence tools.
• Clusters and Nodes: Redshift groups its resources into clusters. A cluster
consists of one or more compute nodes. A leader node manages client
connections and SQL processing. Compute nodes execute the queries and
store data.
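As a small illustration of how clients interact with the leader node, the sketch below submits a SQL statement through the Redshift Data API with boto3; the cluster identifier, database, user, and table are assumptions.

import time
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# The leader node accepts the SQL and coordinates the compute nodes that execute it
response = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="sales",
    DbUser="analyst",
    Sql="SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region;",
)

# Poll for completion, then fetch the result set
status = rsd.describe_statement(Id=response["Id"])["Status"]
while status not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)
    status = rsd.describe_statement(Id=response["Id"])["Status"]

if status == "FINISHED":
    for record in rsd.get_statement_result(Id=response["Id"])["Records"]:
        print(record)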
AWS Glue
With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage
your data in a centralized data catalog. You can visually create, run, and monitor extract, transform,
and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and
query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
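Alongside the visual tools, the same jobs can be started and scheduled programmatically. The hedged sketch below starts an existing AWS Glue ETL job and adds a nightly trigger with boto3; the job name, argument, and schedule are assumptions.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start an existing ETL job on demand
glue.start_job_run(
    JobName="clean-orders-etl",
    Arguments={"--target_path": "s3://my-curated-bucket/orders/"},
)

# Schedule the same job to run every night at 02:00 UTC
glue.create_trigger(
    Name="nightly-orders-etl",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "clean-orders-etl"}],
    StartOnCreation=True,
)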
AWS Glue consolidates major data integration capabilities into a single service. These include data
discovery, modern ETL, cleansing, transforming, and centralized cataloging. It's also serverless, which
means there's no infrastructure to manage. With flexible support for all workloads like ETL, ELT, and
streaming in one service, AWS Glue supports users across various workloads and types of users.
Also, AWS Glue makes it easy to integrate data across your architecture. It integrates with AWS
analytics services and Amazon S3 data lakes. AWS Glue has integration interfaces and job-authoring
tools that are easy to use for all users, from developers to business users, with tailored solutions for
varied technical skill sets.
With the ability to scale on demand, AWS Glue helps you focus on high-value activities that maximize
the value of your data. It scales for any data size, and supports all data types and schema variances.
To increase agility and optimize costs, AWS Glue provides built-in high availability and pay-as-you-go
billing.
• Unify and search across multiple data stores – Store, index, and search across multiple data
sources and sinks by cataloging all your data in AWS.
• Automatically discover data – Use AWS Glue crawlers to automatically infer schema
information and integrate it into your AWS Glue Data Catalog.
• Manage schemas and permissions – Validate and control access to your databases and
tables.
• Connect to a wide variety of data sources – Tap into multiple data sources, both on
premises and on AWS, using AWS Glue connections to build your data lake.
• Build complex ETL pipelines with simple job scheduling – Invoke AWS Glue jobs on a
schedule, on demand, or based on an event.
• Clean and transform streaming data in transit – Enable continuous data consumption, and
clean and transform it in transit. This makes it available for analysis in seconds in your target
data store.
• Deduplicate and cleanse data with built-in machine learning – Clean and prepare your data
for analysis without becoming a machine learning expert by using the FindMatches feature.
This feature deduplicates and finds records that are imperfect matches for each other.
• Built-in job notebooks – AWS Glue job notebooks provide serverless notebooks with
minimal setup in AWS Glue so you can get started quickly.
• Edit, debug, and test ETL code – With AWS Glue interactive sessions, you can interactively
explore and prepare data. You can explore, experiment on, and process data interactively
using the IDE or notebook of your choice.
• Define, detect, and remediate sensitive data – AWS Glue sensitive data detection lets you
define, identify, and process sensitive data in your data pipeline and in your data lake.
• Automatically scale based on workload – Dynamically scale resources up and down based
on workload. This assigns workers to jobs only when needed.
• Automate jobs with event-based triggers – Start crawlers or AWS Glue jobs with event-
based triggers, and design a chain of dependent jobs and crawlers.
• Run and monitor jobs – Run AWS Glue jobs with your choice of engine, Spark or Ray.
Monitor them with automated monitoring tools, AWS Glue job run insights, and AWS
CloudTrail. Improve your monitoring of Spark-backed jobs with the Apache Spark UI.
• Define workflows for ETL and integration activities – Define workflows for ETL and
integration activities for multiple crawlers, jobs, and triggers.
You can create, view, and manage your AWS Glue jobs using the following interfaces:
• AWS Glue console – Provides a web interface for you to create, view, and manage your AWS
Glue jobs.
• AWS Glue Studio – Provides a graphical interface for you to create and edit your AWS Glue
jobs visually.
• AWS Glue section of the AWS CLI Reference – Provides AWS CLI commands that you can use
with AWS Glue.
• Run serverless queries against our Amazon S3 data lake: AWS Glue can catalog our Amazon S3
data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum.
With crawlers, our metadata stays in synchronization with the underlying data, and Redshift
Spectrum can access and analyze the data through one unified interface without loading it into
multiple data silos.
• Creating event-driven ETL pipelines: We can run our ETL jobs as soon as new data becomes
available in Amazon S3 by invoking our AWS Glue ETL jobs from an AWS Lambda function (a
minimal Lambda sketch appears after these lists). We can also register this new data in the
AWS Glue Data Catalog as part of the ETL job.
• To understand our data assets: We can store our data using various AWS services and still
maintain a unified view of it using the AWS Glue Data Catalog. We can use the Data Catalog to
quickly search and discover the datasets that we own and keep the related metadata in one
central location.
• Less Hassle: AWS Glue is integrated across a wide range of AWS services. AWS Glue natively
supports data stored in Amazon Aurora and other Amazon Relational Database Service
engines, Amazon RedShift and Amazon S3 along with common database engines and
databases in our virtual private cloud running on Amazon EC2.
• Cost Effective: AWS Glue is serverless. There is no infrastructure to provision or manage; AWS
Glue handles provisioning, configuration, and scaling of the resources required to run our
ETL jobs. We only pay for the resources that we use while our jobs are running.
• More Power: AWS Glue automates much of the effort in building, maintaining, and running
ETL jobs. It identifies data formats and suggests schemas and transformations. Glue
automatically generates the code to execute our data transformations and loading processes.
• Amount of Work Involved: It is not a full-fledged ETL service. Hence in order to customize
the services as per our requirements, we need experienced and skillful candidates. And it
involves a huge amount of work to be done as well.
• Platform Compatibility: AWS Glue is specifically made for the AWS console and its
subsidiaries. And hence it isn’t compatible with other technologies.
• Limited Data Sources: It only supports a limited set of data sources, such as S3 and JDBC.
• High Skillset Requirement: AWS Glue is a serverless application, and it is still a new
technology. Hence, the skillset required to implement and operate the AWS Glue is high.
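The minimal Lambda sketch referenced above shows the event-driven ETL pattern: an S3 event notification invokes the function, which starts an AWS Glue job for each new object. The job name, argument, and bucket layout are assumptions.

import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Each record describes one new object that landed in the raw S3 bucket
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Kick off the Glue ETL job for the newly arrived object
        glue.start_job_run(
            JobName="raw-to-parquet-etl",
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )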
Amazon Kinesis
Amazon Kinesis streams data in real time, helps you handle it, and lets you decide what to do with
that data according to your organization's goals. The following are the broad categories to get you
started:
• Real-time data ingestion and processing: Amazon Kinesis ingests data in real time. This helps
in applications such as healthcare, where the data can be used to identify and monitor a
patient's condition and predict emergencies ahead of time. It can also be used by OTT
platforms to personalize content according to the user's behaviour.
• Streamlined data delivery and storage: Real-time data can be stored for further research and
later use, and Amazon Kinesis can also be integrated with other AWS services.
• Real-time insights and automation: The data collected in real time can be analyzed so that you
can react to anomalies, fraud attempts, or other critical events immediately. You can also
monitor key metrics to support data-driven decision making.
Types of Services Offered by Amazon Kinesis
Amazon Kinesis offers the following four types of services:
Kinesis Data Streams: It provides a platform for real-time and continuous processing of data. It can
also encrypt sensitive data using KMS master keys and server-side encryption for security purposes.
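A minimal producer sketch with boto3 is shown below; the stream name and record payload are assumptions. The partition key determines which shard receives the record.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Write one record to the stream; records with the same partition key go to the same shard
kinesis.put_record(
    StreamName="patient-vitals",
    Data=json.dumps({"patient_id": "p-42", "heart_rate": 88}).encode("utf-8"),
    PartitionKey="p-42",
)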
Amazon Kinesis Video Streams is a fully managed service that makes it easy to securely stream video
data from devices to the cloud for real-time processing and analytics. With Kinesis Video Streams you
can effortlessly capture, store, and process video streams for various use cases including surveillance,
media streaming, and IoT applications. It enables seamless integration with machine learning
models for video analysis and provides powerful tools for real-time video data processing without
worrying about scalability or storage limitations.
Amazon Kinesis Video Streams is a powerful AWS service that can deliver live and on-demand
streams in real time. The following are some of the key features used with Kinesis streams.
Kinesis Data Analytics: It allows the streams of data delivered by Kinesis Data Firehose and Kinesis
Data Streams to be analyzed and processed with standard SQL. It analyzes the data format,
automatically parses the data, and provides an interactive schema editor for adjusting the
recommended schema. It also provides pre-built stream processing templates, so users can select a
suitable template for their data analytics.
Kinesis Data Firehose: Firehose allows users to load or transform their streams of data into AWS
services for later use, such as analysis or storage. It does not require continuous management, as it
is fully automated and scales automatically according to the data volume.
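For example, a producer can hand records to a Firehose delivery stream and let the service batch, transform, and deliver them to the configured destination; the delivery stream name and payload below are assumptions.

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Firehose buffers the records and delivers them to the destination configured on the stream
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps({"user": "u1", "page": "/home"}) + "\n").encode("utf-8")},
)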
• Integrate with other AWS services: Amazon Kinesis integrates with other AWS services, such as
Amazon DynamoDB, Amazon Redshift, and other services that deal with large amounts of data.
• Availability: You can access it from anywhere and at any time; all you need is good network
connectivity.
• Real-time processing: It allows you to work on data that must be updated instantaneously as
changes occur. This is one of the most advantageous features of Kinesis, because real-time
processing matters when you are dealing with such a huge amount of data.
• Real-time application monitoring: Amazon Kinesis provides real-time data about your
applications. In a health application, for example, it can provide a live feed of the data so that
the issues it surfaces can be acted on immediately.
• Fraud detection and prevention: Amazon Kinesis helps you protect against fraudulent activity
by analyzing transaction data, so you can detect suspicious patterns and block fraudulent
transactions before they happen.
• Personalized recommendations and marketing: Amazon Kinesis helps you analyze customer
data so that you can understand your customers much better and recommend personalized
products to them in real time.
• IoT analytics and predictive maintenance: Kinesis unlocks the full potential of your connected
devices through the examination of sensor data from electronics, automobiles, or machinery.
• A limitation of Amazon Kinesis is that a stream retains records for only 24 hours by default;
retention can be extended for an additional cost, up to a maximum of 365 days.
• There is no upper limit on the number of streams that users can have in their accounts.
AWS Quicksight
AWS QuickSight is one of the most powerful business intelligence tools; it allows you to create
interactive dashboards within minutes to provide business insights to organizations. There are a
number of visualizations or graphical formats in which dashboards can be created. The dashboards
are updated automatically as the data is updated or on a schedule. You can also embed a
dashboard created in QuickSight in your own web application, as sketched below.
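A rough embedding sketch with boto3 is shown below; the account ID, user ARN, and dashboard ID are placeholders. The returned URL is then placed in an iframe in the web application.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="111122223333",
    UserArn="arn:aws:quicksight:us-east-1:111122223333:user/default/analyst",
    ExperienceConfiguration={"Dashboard": {"InitialDashboardId": "my-dashboard-id"}},
    SessionLifetimeInMinutes=60,
)
print(response["EmbedUrl"])  # embed this URL in the web application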
With the latest ML Insights, also known as Machine Learning Insights, QuickSight uses its built-in
algorithms to find anomalies or peaks in historical data. This helps you prepare for upcoming
business requirements based on these insights. Here is a quick guide to getting started with
QuickSight.
AWS QuickSight is an AWS-based business intelligence and visualization tool that is used to visualize
data and create stories that provide graphical details of the data. Data is imported as a dataset, and
you can apply filters, hierarchies, and columns to prepare documents. You can choose from various
charts, such as bar charts and pie charts, to visualize the data effectively. This basic tutorial will help
you understand and learn the AWS QuickSight tool.
Amazon QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-
to-understand insights to the people who you work with, wherever they are. Amazon QuickSight
connects to your data in the cloud and combines data from many different sources. In a single data
dashboard, QuickSight can include AWS data, third-party data, big data, spreadsheet data, SaaS data,
B2B data, and more. As a fully managed cloud-based service, Amazon QuickSight provides
enterprise-grade security, global availability, and built-in redundancy. It also provides the user-
management tools that you need to scale from 10 users to 10,000, all with no infrastructure to
deploy or manage.
Why QuickSight?
Every day, the people in your organization make decisions that affect your business. When they have
the right information at the right time, they can make the choices that move your company in the
right direction.
Here are some of the benefits of using Amazon QuickSight for analytics, data visualization, and
reporting:
• No upfront costs for licenses and a low total cost of ownership (TCO).
• No need to manage granular database permissions—dashboard viewers can see only what
you share.
For advanced users, QuickSight Enterprise edition offers even more features:
• Saves you time and money with automated and customizable data insights, powered by
machine learning (ML). This enables your organization to do the following, without requiring
any knowledge of machine learning:
o Translate data into easy-to-read narratives, like headline tiles for your dashboard.
• Federated users, groups, and single sign-on (IAM Identity Center) with AWS Identity
and Access Management (IAM) Federation, SAML, OpenID Connect, or AWS
Directory Service for Microsoft Active Directory.
• Access to AWS data and on-premises data in Amazon Virtual Private Cloud.
• Offers pay-per-session pricing for the users that you place in the "reader" security role—
readers are dashboard subscribers, people who view reports but don't create them.
• Empowers you to make QuickSight part of your own websites and applications by deploying
embedded console analytics and dashboard sessions.
• Makes our business your business with multitenancy features for value-added resellers
(VARs) of analytical services.
• Enables you to programmatically script dashboard templates that can be transferred to other
AWS accounts.
• Simplifies access management and organization with shared and personal folders for
analytical assets.
• Enables larger data import quotas for SPICE data ingestion and more frequently scheduled
data refreshes.
AWS Database Migration Service allows the movement of data across databases. AWS Database
Migration Service makes it possible to replicate data constantly. Continuous data replication serves a
variety of purposes, including synchronizing Dev/Test environment locales, global database
dispersion, and Disaster Recovery instance locations. The service also allows continuous data
replication to keep the source and target databases synchronized during the migration process.
• You can migrate between source and target endpoints that use the same kind of database
engine.
• An example would be migrating from one Oracle database to another Oracle database. The
only requirement for using AWS DMS is that one of your endpoints must be on an AWS service.
• You pay for it as you use it. No upfront license purchase or recurring maintenance fee is
required.
Understanding the key components of AWS DMS is crucial for planning and executing a successful
data migration. Let's break down these components and their roles in detail.
1. Source Database
The Source Database is where your original data resides before migration. AWS DMS supports a wide
range of database types for the source, including on-premises databases, Amazon Relational
Database Service (RDS), or data stored in other AWS services.
Example: You might have data in an on-premises MySQL database that you plan to migrate to AWS.
2. Source Endpoint
A Source Endpoint defines the connection between AWS DMS and your source database. It includes
the configuration details and authentication credentials required for AWS DMS to access the source
database.
Key Responsibilities:
• Ensuring that AWS DMS can securely read data from the source database.
3. Replication Instance
The Replication Instance is the core engine of AWS DMS. It connects to both the source and target
endpoints to migrate data. All data extraction, transformation, and loading tasks run on this instance.
Key Responsibilities:
Note: You can choose from various replication instance types and sizes depending on the amount of
data and the performance requirements of your migration.
4. Migration Tasks
Migration Tasks, also known as Replication Tasks, define how data is moved from the source to the
target. Each task controls the data flow and can be customized based on your migration needs.
o Full Load Tasks: Migrates the entire dataset from the source to the target. This is
typically used for the initial data load.
o Change Data Capture (CDC) Tasks: Continues to capture changes (inserts, updates,
deletes) made to the source database and replicates them to the target in real-time.
• Customization Options:
o Transforming data (e.g., renaming columns, changing data types) during the
migration process.
5. Target Endpoint
A Target Endpoint defines the connection to your target database. Like the source endpoint, it
includes the settings and credentials required for AWS DMS to write data to the target location.
Key Responsibilities:
• Ensuring secure and successful data loading into the target database.
• Managing configurations that affect how data is written to the target, such as compression or
encryption settings.
6. Target Database
The Target Database is where the data will reside after the migration. AWS DMS supports a variety of
target databases, which can be in the AWS cloud (like Amazon RDS, Amazon Aurora, or Amazon
Redshift) or on-premises.
Example: If you are migrating data from an on-premises MySQL database to AWS, the target
database might be Amazon RDS for MySQL or Amazon Aurora.
1. Source DB and Source Endpoint: AWS DMS uses the source endpoint to securely connect to
and read data from the source database.
2. Replication Instance and Migration Tasks: The replication instance hosts migration tasks that
carry out the data movement from the source to the target. Depending on the task type, the
replication instance either performs a full load, CDC, or both.
3. Target Endpoint and Target DB: The migration tasks write the data to the target database
through the target endpoint ensuring data integrity and consistency.
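These components can be wired together programmatically. The hedged sketch below creates and starts a full-load-plus-CDC replication task with boto3, assuming the source endpoint, target endpoint, and replication instance already exist; their ARNs are placeholders.

import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Table-mapping rules: include every table in every schema
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-aurora",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    MigrationType="full-load-and-cdc",   # full load first, then ongoing change data capture
    TableMappings=json.dumps(table_mappings),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)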
• Serverless: AWS DMS deploys, maintains, and monitors all the hardware and software it needs
for migration, which frees you from traditional tasks such as capacity analysis, hardware and
software procurement, system installation, and administration.
• Minimal downtime: The source remains operational while DMS continuously replicates changes
to the target throughout the migration process.
The following are the limitations of the AWS Database Migration Service:
• Performance: AWS DMS loads up to eight tables in parallel by default. This can be increased
somewhat by using a larger replication instance.
• Engine Name: The name of the target engine that the Fleet Advisor should use in its
recommendation.
• Data migration security: AWS DMS provides security for your data migration. As the data moves
from source to target, you can also encrypt it in flight using Secure Sockets Layer (SSL).
• Change data capture: Capturing incremental loads (change data) in AWS DMS requires some
programming; therefore, it takes more time and is labor-intensive.
1. Server and VM Migration: AWS SMS supports migration of both physical servers and VMs
from various sources, including VMware vSphere, Microsoft Hyper-V, and Microsoft Azure. It
allows you to consolidate and migrate these servers to the AWS Cloud.
2. Continuous Replication: AWS SMS offers continuous replication, which means your source
servers’ data is kept up-to-date in the AWS Cloud, reducing the risk of data loss during the
migration process.
3. Incremental Replication: After the initial replication, AWS SMS only replicates the changes
made to the source servers, minimizing the amount of data transferred and reducing
migration time.
4. Automated Replication and Testing: The service automates the replication and validation of
your server settings, making sure that they are compatible with the AWS environment before
the migration.
5. Flexible Cutover: You can schedule cutover for your migration, allowing you to choose the
right time to switch your applications from the source to the target environment.
6. Support for Large-scale Migrations: AWS SMS is designed to handle large-scale migrations,
making it suitable for enterprises with extensive on-premises infrastructure.
7. Integration with AWS Application Discovery Service: AWS SMS can work in conjunction with
AWS Application Discovery Service to provide a better understanding of your on-premises
applications, dependencies, and resource utilization before planning the migration.
It is essential to plan your migration carefully, considering factors like network connectivity, security,
and application dependencies to ensure a successful migration to the AWS Cloud. AWS SMS can
simplify the process and help you achieve a smooth and efficient migration while minimizing
downtime and risk.
What is AWS Snowball
AWS Snowball is a petabyte-scale offline data transfer device designed to securely and quickly
migrate large datasets to and from Amazon Web Services (AWS). Instead of transferring data over the
internet, Snowball allows organizations to physically ship encrypted storage devices for efficient data
ingestion.
• Snowball Edge Storage Optimized – Designed for large-scale data migration and storage,
offering up to 80 TB of usable storage per device.
• Snowball Edge Compute Optimized – Includes built-in compute capabilities such as AWS
Lambda and EC2 instances for local data processing before cloud transfer.
These rugged, portable devices are ideal for industries dealing with remote storage, edge
computing, and massive offline data transfers.
AWS Snowball is part of the AWS Snow Family, which also includes devices like Snowball Edge,
Snowcone, and Snowmobile. It is a rugged, secure physical device designed to help businesses
move large amounts of data into AWS quickly, without relying on slow internet connections. This
makes it perfect for large-scale data transfers, disaster recovery, media archives, and edge computing.
Here are the key features that make AWS Snowball a standout solution:
AWS Snowball transfers terabytes or petabytes of data up to 10 times faster than traditional
internet-based methods. It uses rugged, tamper-proof storage for secure and reliable transport.
Data is secured with 256-bit encryption through AWS Key Management Service (KMS).
Snowball’s tamper-proof hardware ensures the security of your data during offline migration.
The Snowball Edge Compute Optimized version includes support for AWS Lambda and EC2
instances. This allows you to process data locally before migrating it to the cloud—ideal for remote
locations and edge computing applications.
AWS Snowball functions as an NFS (Network File System) mount point, making it easy to integrate
with your on-premises servers and applications. This helps streamline file-based data migration to
AWS while preserving file system metadata.
5. S3 Compatibility
AWS Snowball is S3-compatible, ensuring your applications can interact with your data using
standard S3 APIs. This integration allows for seamless data management and smooth workflows.
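As a rough sketch, once a device has been unlocked on the local network, applications can point a standard S3 client at the device's local endpoint; the IP address, port, credentials, and bucket name below are placeholders obtained from the Snowball Edge client or AWS OpsHub.

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://192.0.2.10:8443",   # local Snowball Edge endpoint (placeholder)
    aws_access_key_id="LOCAL_ACCESS_KEY",
    aws_secret_access_key="LOCAL_SECRET_KEY",
    verify=False,  # or point 'verify' at the certificate exported from the device
)

# Copy a local file onto the device; it is imported into Amazon S3 after the device is returned
s3.upload_file("backup/archive-0001.tar", "import-bucket", "archive-0001.tar")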
6. GPU Support for High-Performance Workloads
The device supports GPU capabilities for processing complex tasks like machine learning, video
analysis, and other high-performance computing applications. This is perfect for resource-intensive
tasks in remote environments.
Once the data is successfully uploaded to AWS, the device automatically erases all data from the
Snowball, ensuring that no sensitive information is left behind. This provides complete protection for
your data.
The Snowball device comes with an E Ink shipping label that automatically updates with the correct
destination address, reducing the chances of logistics errors and simplifying the return process.
9. Cost-Efficient
AWS Snowball offers a cost-effective solution for large-scale data movement. It avoids the expensive
fees typically associated with high-bandwidth internet-based transfers, making it an affordable
alternative for businesses.
• Offline: Ship Snowball devices for bulk data transfer without an internet connection.
You can cluster multiple Snowball devices to increase both storage capacity and durability. This
ensures resilient data storage solutions for edge computing, even if one device fails.
AWS Snowball simplifies large-scale data migration by allowing you to transfer data securely using
a physical device instead of relying on slow internet connections. Follow the below steps to use AWS
Snowball:
o AWS Snowball Edge Compute Optimized (for running applications at the edge).
• Create a job:
o Start the data transfer or launch EC2 instances if using the compute-optimized
version.
• Use AWS OpsHub or the AWS Snowball Client to move data from your local storage to the
Snowball device.
• Monitor the transfer process to ensure all files are securely copied.
o The E Ink shipping label automatically updates to display the correct return address.
• Your data is automatically transferred from the Snowball device to your Amazon S3 bucket.
• Once the transfer is complete, AWS securely erases all data from the device and sanitizes it,
ensuring no customer data remains.
AWS Snowball is ideal for a variety of industries and use cases. Here are some examples:
• Cloud Migration: If your company needs to move large datasets from on-premises storage to
the cloud, Snowball makes it fast and easy.
• Disaster Recovery: Secure backups of business-critical data can be sent to AWS Snowball for
long-term storage and easy recovery.
• Media & Entertainment: Snowball is commonly used by movie studios, game developers,
and media companies to transfer high-resolution videos and raw media files for cloud
processing.
• Edge Computing: Industries like oil and gas, mining, and research use Snowball Edge to
process data locally before transferring it to the cloud.
• Healthcare & Genomics: Medical institutions and research facilities use Snowball to
move large genomic datasets securely and compliantly to AWS for analysis.
1. Simple Migration
AWS Snowball makes moving large amounts of data easy. With a pre-configured device, you can load
your data, ship it back to AWS, and let the service handle the transfer, saving time and avoiding
complex setups.
2. Faster Performance
Snowball accelerates data migration, allowing you to move data much faster than over the internet.
It reduces transfer time, making large-scale migrations more efficient.
Your data is safe with AWS Snowball. It’s encrypted with 256-bit encryption, and once the device is
returned, the data is securely erased, ensuring full protection throughout the process.
AWS provides multiple data transfer solutions to meet different business needs. Below is a
comparison of AWS Snowball, AWS DataSync, AWS Snowmobile, and AWS Direct Connect to help you
choose the best option.
• Internet needed: AWS Snowball – No; AWS DataSync – Yes; AWS Snowmobile – No; AWS Direct
Connect – Yes.
• Setup complexity: AWS Snowball – Simple (order, transfer, return); AWS DataSync – Moderate
(requires configuration); AWS Snowmobile – Complex (requires AWS approval); AWS Direct
Connect – Complex (needs network setup).
• Use cases: AWS Snowball – cloud migration, disaster recovery, remote locations; AWS DataSync
– regular file transfers, hybrid cloud sync; AWS Snowmobile – massive enterprise migrations;
AWS Direct Connect – real-time data transfer, low-latency workloads.
Features of Snowball
AWS Snowball with the snowball device has the following features:-
• Your data is secure both while it's at rest and when it's being physically moved.
• You can manage your jobs programmatically using the job management API or through the
AWS Snow Family Management Console.
• The 80 TB model is available in all AWS Regions; the 50 TB model is available only in the US
Regions.
• Local data transfers are possible between a Snowball and an on-premises data center. These
transfers can be carried out using the standalone, downloadable Snowball client. Or you can
use the free Amazon S3 Adapter for Snowball to transfer data programmatically using calls to
the Amazon S3 REST API. See Data Transmission with a Snowball for further details.
• When the Snowball is prepared to ship, the E Ink display on its own shipping container
switches to display your shipping label. See Shipping Factors for AWS Snowball for more
details.
AWS Snowmobile works by physically transporting large amounts of data from your location to AWS
data centers.
AWS Snowmobile is ideal for large-scale data migrations where transferring massive amounts of data
quickly is crucial. It is commonly used in the following scenarios:
• Data center migration: When organizations need to move petabytes or exabytes of data to
the cloud Snowmobile offers a fast and secure solution.
• Disaster recovery: Companies use Snowmobile to back up vast amounts of data to the cloud
ensuring quick recovery in case of a disaster.
• Video libraries: Media companies transferring large video archives to AWS for storage and
processing benefit from Snowmobile's massive capacity.
• Research data transfer: Industries like healthcare and genomics use Snowmobile to
transport enormous datasets, enabling advanced cloud analytics and processing.
• Government and security agencies: For secure, large-scale data transfer, Snowmobile
provides enhanced security features for sensitive data migration.
AWS Snowmobile offers several advantages for organizations with massive data migration needs:
• Massive Data Transfer: Snowmobile can transfer exabytes of data making it ideal for large-
scale migrations that would otherwise take months or years over standard internet
connections.
• Cost-Effective: It reduces the time and expense associated with transferring vast amounts of
data by eliminating the need for long-term network setups.
• Fast Migration: By moving data physically rather than over the internet Snowmobile
drastically shortens the data migration timeline.
• Scalability: Snowmobile is designed to handle the largest data migrations offering scalability
for organizations moving entire data centers to the AWS cloud.
Snowball: This is a petabyte-scale data migration device designed to transfer large amounts of data
to and from AWS. It's a rugged, secure appliance that can be physically shipped to your location.
Snowball Edge: This device combines compute and storage capabilities, making it suitable for hybrid
and edge workloads. It can process data locally and store it before transferring it to AWS.
Snowmobile: For exabyte-scale data migration, AWS offers Snowmobile. It's a semi-trailer truck
equipped with petabytes of storage capacity, allowing you to transfer massive amounts of data to
and from AWS.
Each Snowball Edge device can transport data at speeds faster than the internet. This transport is
done by shipping the data in the devices through a regional carrier. The appliances are rugged,
complete with E Ink shipping labels.
Snowball Edge devices have two options for device configurations—Storage Optimized 210
TB and Compute Optimized. When this guide refers to Snowball Edge devices, it is referring to all
options of the device. When specific information applies only to one or more optional configurations
of devices, it is called out specifically.
• Large amounts of storage capacity or compute functionality for devices. This depends on the
options you choose when you create your job.
• You can import or export data between your local environments and Amazon S3, and
physically transport the data with one or more devices without using the internet.
• Snowball Edge devices are their own rugged box. The built-in E Ink display changes to show
your shipping label when the device is ready to ship.
• Snowball Edge devices come with an on-board LCD display that can be used to manage
network connections and get service status information.
• You can cluster Snowball Edge devices for local storage and compute jobs to achieve data
durability across 3 to 16 devices and locally grow or shrink storage on demand.
• You can use Amazon EKS Anywhere on Snowball Edge devices for Kubernetes workloads.
• Snowball Edge devices have Amazon S3 and Amazon EC2 compatible endpoints available,
enabling programmatic use cases.
• Snowball Edge devices support the new sbe1, sbe-c, and sbe-g instance types, which you can
use to run compute instances on the device using Amazon Machine Images (AMIs).
• Snowball Edge supports these data transfer protocols for data migration:
o NFSv3
o NFSv4
o NFSv4.1
o Amazon S3 over HTTP or HTTPS (via API compatible with AWS CLI version 1.16.14
and earlier)
• Amazon S3 adapter — Use for programmatic data transfer in to and out of AWS using the
Amazon S3 API for Snowball Edge, which supports a subset of Amazon S3 API operations. In
this role, data is transferred to the Snow device by AWS on your behalf and the device is
shipped to you (for an export job), or AWS ships an empty Snow device to you and you
transfer data from your on-premises sources to the device and ship it back to AWS (for an
import job).
• Amazon S3 compatible storage on Snowball Edge — Use to support the data needs of
compute services such as Amazon EC2, Amazon EKS Anywhere on Snow, and others. This
feature is available on Snowball Edge devices and provides an expanded Amazon S3 API set
and features such as increased resiliency with flexible cluster setup for 3 to 16 nodes, local
bucket management, and local notifications.
• Amazon EC2 – Run compute instances on a Snowball Edge device using the Amazon EC2
compatible endpoint, which supports a subset of the Amazon EC2 API operations.
• Amazon EKS Anywhere on Snow – Create and operate Kubernetes clusters on Snowball Edge
devices.
• AWS Lambda powered by AWS IoT Greengrass – Invoke Lambda functions based on Amazon
S3 compatible storage on Snowball Edge storage actions made on an AWS Snowball Edge
device.
• Amazon Elastic Block Store (Amazon EBS) – Provide block-level storage volumes for use with
EC2-compatible instances.
• AWS Identity and Access Management (IAM) – Use this service to securely control access to
AWS resources.
• AWS Security Token Service (AWS STS) – Request temporary, limited-privilege credentials for
IAM users or for users that you authenticate (federated users).
• Amazon EC2 Systems Manager – Use this service to view and control your infrastructure on
AWS.