Unit 5
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon
Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management
Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run
ad-hoc queries and get results in seconds.
Amazon Athena also makes it easy to interactively run data analytics using Apache Spark without
having to plan for, configure, or manage resources. When you run Apache Spark applications on
Athena, you submit Spark code for processing and receive the results directly.
Athena SQL and Apache Spark on Amazon Athena are serverless, so there is no infrastructure to set
up or manage, and you pay only for the queries you run. Athena scales automatically—running
queries in parallel—so results are fast, even with large datasets and complex queries.
Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3.
Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC. You
can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data
into Athena.
Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to
generate reports or to explore data with business intelligence tools or SQL clients connected with a
JDBC or an ODBC driver.
Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your
data in Amazon S3. This allows you to create tables and query data in Athena based on a central
metadata store available throughout your Amazon Web Services account and integrated with the ETL
and data discovery features of AWS Glue.
Amazon Athena makes it easy to run interactive queries against data directly in Amazon S3 without
having to format data or manage infrastructure. For example, Athena is useful if you want to run a
quick query on web logs to troubleshoot a performance issue on your site. With Athena, you can get
started fast: you just define a table for your data and start querying using standard SQL.
You should use Amazon Athena if you want to run interactive ad hoc SQL queries against data in
Amazon S3 without having to manage any infrastructure or clusters. Amazon Athena provides the
easiest way to run ad hoc queries on data in Amazon S3 without the need to set up or manage any
servers.
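As a rough illustration, the sketch below runs an ad hoc Athena query from Python with boto3; the database name, table name, and S3 output location are assumptions for the example, not values from this unit.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit an ad hoc SQL query against data already cataloged in Amazon S3
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])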
AWS Athena is a powerful serverless query service provided by AWS for analyzing data directly in
Amazon S3 using standard SQL. It offers high scalability, cost-effectiveness, and an easy-to-use
platform for running complex queries without extensive infrastructure setup. This section discusses
what AWS Athena is, its architecture, benefits, limitations, advantages, and disadvantages, and how
it differs from Amazon Redshift, AWS Glue, and Microsoft SQL Server.
AWS Athena is a serverless interactive query service that enables standard SQL analysis of data in
Amazon S3. Athena is based on Presto, a distributed SQL query engine, and it can query data in
Amazon S3 quickly using conventional SQL syntax. There is no infrastructure to handle with Athena, so
you can focus on analyzing data at scale. To get a better idea of AWS Athena, let us understand the
architecture first.
When a query is run, Athena gets the metadata from the metastore to establish the data's location
and format. Athena also interfaces with AWS Glue, a fully managed extract, transform, and load (ETL)
service, allowing customers to create and manage data catalogs and ETL processes. Furthermore, we
will go through the various components of AWS Athena.
Amazon S3: Athena queries data stored in Amazon S3, an object storage service that is highly
durable, highly available, and virtually unlimited in scale.
AWS Glue: Athena leverages AWS Glue, a fully managed extract, transform, and load (ETL)
service, to catalog the data stored in S3 so that it can be queried.
Apache Presto: Apache Presto is Athena's distributed SQL query engine. Presto is well-suited for
querying data stored in distributed systems and can handle queries that require data from numerous
sources to be joined.
Amazon CloudWatch: Athena interacts with Amazon CloudWatch, a monitoring service that offers
metrics and logs for all of your AWS account's resources. CloudWatch may be used to track the
performance of your Athena queries and create alerts for specific query patterns.
Amazon VPC: Athena supports performing queries within an Amazon Virtual Private Cloud (VPC),
which allows you to isolate your data and limit access to it using Amazon VPC security groups and
network ACLs.
Encryption: Athena supports S3 server-side encryption with Amazon S3-managed keys (SSE-S3) or
AWS Key Management Service-managed keys (SSE-KMS), as well as SSL/TLS encryption of data in
transit.
• Cost-Effective: It only charges for the queries you run, with no upfront costs or resource
provisioning.
• Ease of Use: It queries data directly from Amazon S3 using standard SQL, which makes it
accessible to users familiar with SQL.
• No infrastructure setup: Athena is a serverless service that eliminates the need for users to
set up and manage infrastructure, making data querying easier and faster.
• Cost-effective: Athena charges customers solely for the quantity of data scanned by their
queries, making it an affordable solution for ad hoc and exploratory queries.
• SQL support: Since Athena supports ANSI SQL, users can query data in S3 using their existing
SQL knowledge and tools.
• Complex Queries: Performance may degrade with highly complex queries or very large
datasets that require multiple joins and aggregations.
• Cold Start Latency: The initial query execution may experience some delay due to cold start
latency, especially for infrequently accessed data.
• Limited Data Manipulation: Athena is primarily designed for querying and offers only limited
support for data modification operations such as INSERT, UPDATE, or DELETE.
• Restricted query performance: The volume of data scanned and the intricacy of the query
can limit Athena's speed, resulting in lengthier query times.
• No real-time querying: Because Athena is intended for batch processing, it may not be ideal
for real-time querying.
• Limited data types: In comparison to other database systems, Athena only supports a
restricted selection of data types.
• Serverless architecture: Athena is a fully-managed service that does not require any
infrastructure setup, management, or scaling.
• Standard SQL support: Since Athena supports ANSI SQL, users can easily query the
data in S3 using their existing SQL knowledge and tools.
• Connection with the AWS ecosystem: Athena interfaces with other AWS services such as
Amazon S3, AWS Glue, and AWS Lambda, enabling customers to import and convert data
from a variety of sources.
• Integration with BI tools: Athena provides connectivity with major business intelligence
tools such as Tableau, Power BI, and Amazon QuickSight, allowing users to build visualizations
and reports.
Comparison of Amazon Athena, AWS Redshift, Microsoft SQL Server, and AWS Glue:
• Service type: Amazon Athena is a serverless interactive query service; AWS Redshift is a fully
managed data warehouse; Microsoft SQL Server is a relational database management system;
AWS Glue is a serverless data integration service.
• Primary use case: Athena performs ad hoc querying on Amazon S3; Redshift facilitates data
warehousing and OLAP; SQL Server facilitates transactional and analytical processing; Glue
facilitates ETL and data cataloging.
• Data storage: Athena queries data in Amazon S3; Redshift uses Redshift managed storage and
integrates with S3; SQL Server uses local or cloud storage, depending on the setup; Glue works
with Amazon S3 and other data sources.
• Performance: Athena is optimized for quick queries on large datasets; Redshift delivers high
performance for complex queries and large datasets; SQL Server delivers high performance for
transactional and analytical workloads; Glue is optimized for ETL operations and data
transformation.
• Maintenance: Athena is fully managed with no maintenance required; Redshift is a managed
service but requires some administration; SQL Server requires regular maintenance and
updates; Glue is fully managed with minimal maintenance required.
• Data formats supported: Athena, Redshift, and Glue support JSON, CSV, Parquet, ORC, Avro,
and more; SQL Server supports traditional RDBMS formats.
What is AWS Data Exchange?
AWS Data Exchange is a service that helps AWS customers easily share and
manage data entitlements from other organizations at scale.
As a data receiver, you can track and manage all of your data grants and AWS
Marketplace data subscriptions in one place. When you have access to an AWS
Data Exchange data set, you can use compatible AWS or partner analytics and
machine learning to extract insights from it.
For data senders, AWS Data Exchange eliminates the need to build and maintain
any data delivery and entitlement infrastructure. Anyone with an AWS account can
create and send data grants to data receivers. To sell your data as a product in AWS
Marketplace, make sure that you follow the guidelines to determine eligibility.
A data grant is the unit of exchange in AWS Data Exchange that is created by a data
sender in order to grant a data receiver access to a data set. When a data sender
creates a data grant, a grant request is sent to the data receiver's AWS account. A
data receiver accepts the data grant to gain access to the underlying data.
• Data set – A data set in AWS Data Exchange is a resource curated by the
sender. It contains the data assets a receiver will gain access to after
accepting a data grant. AWS Data Exchange supports five types of data sets:
Files, API, Amazon Redshift, Amazon S3, and AWS Lake Formation
(Preview).
• Data grant details – This information includes a name and description of the
data grant that will be visible to data receivers.
• Recipient access details – This information includes the receiver’s AWS
account ID and specifies how long the receiver should have access to the
data.
Data receivers
As a data receiver, you can view all of your current, pending, and expired data grants
from the AWS Data Exchange console.
You can also discover and subscribe to new third-party data sets available through
AWS Data Exchange from the AWS Marketplace catalog.
As a data sender or provider, you can access AWS Data Exchange through the
following options:
• Directly through the AWS Data Exchange console (Publish data)
• Data providers with data products available in AWS Marketplace can access
programmatically using the following APIs:
o AWS Data Exchange API – Use the API operations to create, view,
update, and delete data sets and revisions. You can also use these
API operations to import and export assets to and from those revisions.
o AWS Marketplace Catalog API – Use the API operations to view and
update data products published to AWS Marketplace.
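As a rough sketch of the AWS Data Exchange API, the snippet below lists the data sets an account is entitled to and the revisions of one of them with boto3; the data set ID is a placeholder, not a real identifier.

import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# List the data sets this account is entitled to as a data receiver
for data_set in dx.list_data_sets(Origin="ENTITLED")["DataSets"]:
    print(data_set["Id"], data_set["Name"], data_set["AssetType"])

# Inspect the revisions (versions) published for one entitled data set
for revision in dx.list_data_set_revisions(DataSetId="example-data-set-id")["Revisions"]:
    print(revision["Id"], revision.get("Comment", ""))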
Supported Regions
AWS Data Exchange data grants, subscriptions, data sets, revisions, and assets are
Region resources that can be managed programmatically or through the AWS Data
Exchange console in supported Regions. Data products published to AWS
Marketplace are available in a single product catalog. Subscribers can see the same
catalog regardless of which supported AWS Region they are using.
• Amazon S3 – AWS Data Exchange allows providers to import and store data
files from their Amazon S3 buckets. Data recipients can export these files to
Amazon S3 programmatically. AWS Data Exchange also enables recipients to
directly access and use providers' Amazon S3 buckets.
• Amazon API Gateway – Another supported asset type for data sets is APIs.
Data recipients can call the API programmatically, call the API from the AWS
Data Exchange console, or download the OpenAPI specification file.
• Amazon Redshift – AWS Data Exchange supports Amazon Redshift data
sets. Data recipients can get read-only access to query the data in Amazon
Redshift without extracting, transforming, and loading data.
• AWS Marketplace – AWS Data Exchange allows data sets to be published
as products in AWS Marketplace. AWS Data Exchange data providers must
be registered as AWS Marketplace sellers, and can use the AWS Marketplace
Management Portal or the AWS Marketplace Catalog API.
• AWS Lake Formation – AWS Data Exchange supports AWS Lake Formation
data permission data sets (Preview). Data recipients get access to data stored
in a data provider's AWS Lake Formation data lake and can query, transform,
and share access to this data from their own AWS Lake Formation data set.
Amazon EMR
Amazon EMR makes it simple and cost effective to run highly distributed processing
frameworks such as Hadoop, Spark, and Presto when compared to on-premises
deployments. Amazon EMR is flexible – you can run custom applications and code,
and define specific compute, memory, storage, and application parameters to
optimize your analytic requirements.
In addition to running SQL queries, Amazon EMR can run a wide variety of scale-out
data processing tasks for applications such as machine learning, graph analytics,
data transformation, streaming data, and virtually anything you can code. You should
use Amazon EMR if you use custom code to process and analyze extremely large
datasets with the latest big data processing frameworks such as Spark, Hadoop,
Presto, or HBase. Amazon EMR gives you full control over the configuration of your
clusters and the software installed on them.
You can use Amazon Athena to query data that you process using Amazon EMR.
Amazon Athena supports many of the same data formats as Amazon EMR. Athena's
data catalog is Hive metastore compatible. If you use EMR and already have a Hive
metastore, you can run your DDL statements on Amazon Athena and query your
data immediately without affecting your Amazon EMR jobs.
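For example, a hedged sketch of such a DDL statement, registering Parquet output that an EMR job might have written to S3 so it becomes queryable in Athena, could look like the following; the database, table, columns, and bucket names are assumptions.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS emr_output.page_views (
    user_id   string,
    page      string,
    view_time timestamp
)
STORED AS PARQUET
LOCATION 's3://my-emr-output/page_views/'
"""

# Running DDL through Athena registers the table in the Hive-compatible data catalog
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "emr_output"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)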
Amazon EMR simplifies the complex processing of large datasets in the cloud. Users can create
clusters that take advantage of the elastic nature of Amazon EC2 instances, which come
pre-configured with frameworks such as Apache Hadoop and Apache Spark. By distributing
processing jobs across several nodes, these clusters handle work in parallel and deliver faster
results. EMR provides scalability by automatically adjusting the cluster size according to workload
needs, and it optimizes data storage by integrating with other AWS services. Users can focus on
their analytics rather than the complicated details of infrastructure and administration, which
makes EMR a simplified approach to big data analytics.
Amazon EMR (Elastic MapReduce) architecture is designed for efficient big data
processing using a distributed computing framework.
1. Clusters: Consist of a master node (manages the cluster), core nodes
(process data and store data in HDFS), and optional task nodes (handle
additional processing).
2. Hadoop Ecosystem: Utilizes tools like Apache Spark, HBase, and Hive, pre-
configured and optimized for big data analytics.
• Integration: EMR integrates with other AWS services, which enhances efficiency in data
processing; for example, its connection with Amazon S3 streamlines workflows.
• Ease Of Use: Amazon EMR makes big data deployments easier by offering pre-configured
environments for Apache Hadoop and Apache Spark. Setting up and maintaining clusters is
easier for users, without the need for complex manual setup.
Amazon EMR offers several deployment options to fit different business needs and preferences. The
following are a few of them:
• On-Demand Instances: Without making any advance commitment, users can create EMR
clusters using On-Demand Instances as they need them and pay for the resources on an hourly
basis. This is a flexible choice for shifting workloads.
• Spot Instances: By using Amazon EC2 Spot Instances, users can request unused EC2 capacity
at a potentially large discount. Spot Instances are best suited for workloads that are
fault-tolerant and can handle interruptions (a minimal cluster-creation sketch follows this list).
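The sketch below creates a small EMR cluster with boto3, combining an On-Demand master and core group with a Spot task group; the release label, instance types, and IAM role names are assumptions for illustration.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            # Master node manages the cluster; core nodes process and store data in HDFS
            {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "ON_DEMAND"},
            # Optional task nodes on Spot capacity handle additional processing at lower cost
            {"Name": "Task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])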
2. Cost Effectiveness: EMR allows users to pay for the resources they need,
when they need them, making it a cost-effective solution for big data
processing.
3. Integration With Other AWS Services: EMR can be easily integrated with
other AWS services such as Amazon S3, Amazon DynamoDB, and Amazon
Redshift for data storage and analysis.
5. Easy To Use: EMR provides an easy-to-use web interface that allows users
to launch and manage clusters, as well as monitor and troubleshoot
performance issues.
2. Latency: The latency of data processing tasks may increase as the size of
the data set increases.
3. Cost: EMR can be expensive for users with large amounts of data or high-
performance requirements, as costs are based on the number of instances
and the amount of storage used.
5. Limited Support For Certain Big Data Frameworks: EMR does not support
some big data frameworks such as Flink, which may be a deal breaker for
some organizations.
6. Limited Support For Certain Applications: EMR is not suitable for all types
of applications, it mainly supports big data processes and analytics.
• Data Analysis: EMR is well known for performing complicated data analytics with big data
frameworks such as Apache Spark. It helps companies make well-informed decisions by letting
them extract insightful information from many types of datasets.
Amazon Redshift
A data warehouse like Amazon Redshift is your best choice when you need to pull
together data from many different sources – like inventory systems, financial
systems, and retail sales systems – into a common format, and store it for long
periods of time. If you want to build sophisticated business reports from historical
data, then a data warehouse like Amazon Redshift is the best choice. The query
engine in Amazon Redshift has been optimized to perform especially well on running
complex queries that join large numbers of very large database tables. When you
need to run queries against highly structured data with lots of joins across lots of
very large tables, choose Amazon Redshift.
Amazon Redshift is a fast, fully managed data warehousing service in the cloud,
enabling businesses to execute complex analytic queries on volumes of data—thus
minimizing delays and ensuring sound support for decision-making across
organizations. It was released in 2013, built to remedy the problems associated with
traditional, on-premises data warehousing, such as scalability, cost, and complexity.
Amazon Redshift is a fully managed service in the cloud, dealing with petabyte-scale
warehouses of data made to store large-scale data and implement effective ways of
running even complex queries. Thus, it enables businesses to quickly and cost-
effectively analyze huge amounts of data by using SQL-based queries and business
intelligence tools.
• Clusters and Nodes: Redshift groups its resources into clusters. A cluster
consists of one or more compute nodes. A leader node manages client
connections and SQL processing. Compute nodes execute the queries and
store data.
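As a small illustration of how clients interact with the leader node, the sketch below submits a SQL statement through the Redshift Data API with boto3; the cluster identifier, database, user, and table are assumptions.

import time
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# The leader node accepts the SQL and coordinates the compute nodes that execute it
response = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="sales",
    DbUser="analyst",
    Sql="SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region;",
)

# Poll for completion, then fetch the result set
status = rsd.describe_statement(Id=response["Id"])["Status"]
while status not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)
    status = rsd.describe_statement(Id=response["Id"])["Status"]

if status == "FINISHED":
    for record in rsd.get_statement_result(Id=response["Id"])["Records"]:
        print(record)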
AWS Glue
With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage
your data in a centralized data catalog. You can visually create, run, and monitor extract, transform,
and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and
query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
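Alongside the visual tools, the same jobs can be started and scheduled programmatically. The hedged sketch below starts an existing AWS Glue ETL job and adds a nightly trigger with boto3; the job name, argument, and schedule are assumptions.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start an existing ETL job on demand
glue.start_job_run(
    JobName="clean-orders-etl",
    Arguments={"--target_path": "s3://my-curated-bucket/orders/"},
)

# Schedule the same job to run every night at 02:00 UTC
glue.create_trigger(
    Name="nightly-orders-etl",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "clean-orders-etl"}],
    StartOnCreation=True,
)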
AWS Glue consolidates major data integration capabilities into a single service. These include data
discovery, modern ETL, cleansing, transforming, and centralized cataloging. It's also serverless, which
means there's no infrastructure to manage. With flexible support for all workloads like ETL, ELT, and
streaming in one service, AWS Glue supports users across various workloads and types of users.
Also, AWS Glue makes it easy to integrate data across your architecture. It integrates with AWS
analytics services and Amazon S3 data lakes. AWS Glue has integration interfaces and job-authoring
tools that are easy to use for all users, from developers to business users, with tailored solutions for
varied technical skill sets.
With the ability to scale on demand, AWS Glue helps you focus on high-value activities that maximize
the value of your data. It scales for any data size, and supports all data types and schema variances.
To increase agility and optimize costs, AWS Glue provides built-in high availability and pay-as-you-go
billing.
• Unify and search across multiple data stores – Store, index, and search across multiple data
sources and sinks by cataloging all your data in AWS.
• Automatically discover data – Use AWS Glue crawlers to automatically infer schema
information and integrate it into your AWS Glue Data Catalog.
• Manage schemas and permissions – Validate and control access to your databases and
tables.
• Connect to a wide variety of data sources – Tap into multiple data sources, both on
premises and on AWS, using AWS Glue connections to build your data lake.
• Build complex ETL pipelines with simple job scheduling – Invoke AWS Glue jobs on a
schedule, on demand, or based on an event.
• Clean and transform streaming data in transit – Enable continuous data consumption, and
clean and transform it in transit. This makes it available for analysis in seconds in your target
data store.
• Deduplicate and cleanse data with built-in machine learning – Clean and prepare your data
for analysis without becoming a machine learning expert by using the FindMatches feature.
This feature deduplicates and finds records that are imperfect matches for each other.
• Built-in job notebooks – AWS Glue job notebooks provide serverless notebooks with
minimal setup in AWS Glue so you can get started quickly.
• Edit, debug, and test ETL code – With AWS Glue interactive sessions, you can interactively
explore and prepare data. You can explore, experiment on, and process data interactively
using the IDE or notebook of your choice.
• Define, detect, and remediate sensitive data – AWS Glue sensitive data detection lets you
define, identify, and process sensitive data in your data pipeline and in your data lake.
• Automatically scale based on workload – Dynamically scale resources up and down based
on workload. This assigns workers to jobs only when needed.
• Automate jobs with event-based triggers – Start crawlers or AWS Glue jobs with event-
based triggers, and design a chain of dependent jobs and crawlers.
• Run and monitor jobs – Run AWS Glue jobs with your choice of engine, Spark or Ray.
Monitor them with automated monitoring tools, AWS Glue job run insights, and AWS
CloudTrail. Improve your monitoring of Spark-backed jobs with the Apache Spark UI.
• Define workflows for ETL and integration activities – Define workflows for ETL and
integration activities for multiple crawlers, jobs, and triggers.
You can create, view, and manage your AWS Glue jobs using the following interfaces:
• AWS Glue console – Provides a web interface for you to create, view, and manage your AWS
Glue jobs.
• AWS Glue Studio – Provides a graphical interface for you to create and edit your AWS Glue
jobs visually.
• AWS Glue section of the AWS CLI Reference – Provides AWS CLI commands that you can use
with AWS Glue.
• Run serverless queries against our Amazon S3 data lake: AWS Glue can catalog our Amazon S3
data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum.
With crawlers, our metadata stays in synchronization with the underlying data, and Redshift
Spectrum can access and analyze the data through one unified interface without loading it into
multiple data silos.
• Creating event-driven ETL pipelines: We can run our ETL jobs as soon as new data becomes
available in Amazon S3 by invoking our AWS Glue ETL jobs from an AWS Lambda function (a
minimal Lambda sketch appears after these lists). We can also register this new data in the
AWS Glue Data Catalog as part of the ETL job.
• To understand our data assets: We can store our data using various AWS services and still
maintain a unified view of it using the AWS Glue Data Catalog. We can use the Data Catalog to
quickly search and discover the datasets that we own and keep the related metadata in one
central location.
• Less Hassle: AWS Glue is integrated across a wide range of AWS services. AWS Glue natively
supports data stored in Amazon Aurora and other Amazon Relational Database Service
engines, Amazon RedShift and Amazon S3 along with common database engines and
databases in our virtual private cloud running on Amazon EC2.
• Cost Effective: AWS Glue is serverless. There is no infrastructure to provision or manage; AWS
Glue handles provisioning, configuration, and scaling of the resources required to run our
ETL jobs. We only pay for the resources that we use while our jobs are running.
• More Power: AWS Glue automates much of the effort in building, maintaining, and running
ETL jobs. It identifies data formats and suggests schemas and transformations. Glue
automatically generates the code to execute our data transformations and loading processes.
• Amount of Work Involved: It is not a full-fledged ETL service. Hence in order to customize
the services as per our requirements, we need experienced and skillful candidates. And it
involves a huge amount of work to be done as well.
• Platform Compatibility: AWS Glue is specifically made for the AWS console and its
subsidiaries. And hence it isn’t compatible with other technologies.
• Limited Data Sources: It only supports a limited set of data sources, such as S3 and JDBC.
• High Skillset Requirement: AWS Glue is a serverless application, and it is still a new
technology. Hence, the skillset required to implement and operate the AWS Glue is high.
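The minimal Lambda sketch referenced above shows the event-driven ETL pattern: an S3 event notification invokes the function, which starts an AWS Glue job for each new object. The job name, argument, and bucket layout are assumptions.

import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Each record describes one new object that landed in the raw S3 bucket
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Kick off the Glue ETL job for the newly arrived object
        glue.start_job_run(
            JobName="raw-to-parquet-etl",
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )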
Amazon Kinesis
Amazon Kinesis streams data in real time, helps you handle it, and lets you decide what to do with
that data according to your organization's goals. The following are the broad categories to get you
started:
• Real-time data ingestion and processing: Amazon Kinesis ingests data in real time. This helps
in applications such as healthcare, where the data can be used to identify and monitor a
patient's condition and predict emergencies ahead of time. It can also be used by OTT
platforms to personalize content according to the user's behaviour.
• Streamlined data delivery and storage: Real-time data can be stored for further research and
later use, and Amazon Kinesis can also be integrated with other AWS services.
• Real-time insights and automation: The data collected in real time can be analyzed so that you
can react to anomalies, fraud attempts, or other critical events immediately. You can also
monitor key metrics to support data-driven decision making.
Types of Services Offered by Amazon Kinesis
Amazon Kinesis offers the following four types of services:
Kinesis Data Streams: It provides a platform for real-time and continuous processing of data. It can
also encrypt sensitive data using KMS master keys and server-side encryption for security purposes.
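A minimal producer sketch with boto3 is shown below; the stream name and record payload are assumptions. The partition key determines which shard receives the record.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Write one record to the stream; records with the same partition key go to the same shard
kinesis.put_record(
    StreamName="patient-vitals",
    Data=json.dumps({"patient_id": "p-42", "heart_rate": 88}).encode("utf-8"),
    PartitionKey="p-42",
)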
Amazon Kinesis Video Streams is a fully managed service that makes it easy to securely stream video
data from devices to the cloud for real-time processing and analytics. With Kinesis Video Streams you
can effortlessly capture, store, and process video streams for various use cases including surveillance,
media streaming, and IoT applications. It enables seamless integration with machine learning
models for video analysis and provides powerful tools for real-time video data processing without
worrying about scalability or storage limitations.
Amazon Kinesis Video Streams is a powerful AWS service that can deliver live and on-demand
streams in real time. The following are some of the key features used with Kinesis streams.
Kinesis Data Analytics: It allows the streams of data delivered by Kinesis Data Firehose and Kinesis
Data Streams to be analyzed and processed with standard SQL. It analyzes the data format,
automatically parses the data, and provides an interactive schema editor for adjusting the
recommended schema. It also provides pre-built stream processing templates, so users can select a
suitable template for their data analytics.
Kinesis Data Firehose: Firehose allows users to load or transform their streams of data into AWS
services for later use, such as analysis or storage. It does not require continuous management, as it
is fully automated and scales automatically according to the data volume.
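For example, a producer can hand records to a Firehose delivery stream and let the service batch, transform, and deliver them to the configured destination; the delivery stream name and payload below are assumptions.

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Firehose buffers the records and delivers them to the destination configured on the stream
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps({"user": "u1", "page": "/home"}) + "\n").encode("utf-8")},
)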
• Integrate with other AWS services: Amazon Kinesis integrates with other AWS services, such as
Amazon DynamoDB, Amazon Redshift, and other services that deal with large amounts of data.
• Availability: You can access it from anywhere and at any time; all you need is good network
connectivity.
• Real-time processing: It allows you to work on data that must be updated instantaneously as
changes occur. This is one of the most advantageous features of Kinesis, because real-time
processing matters when you are dealing with such a huge amount of data.
• Real-time application monitoring: Amazon Kinesis provides real-time data about your
applications. In a health application, for example, it can provide a live feed of the data so that
the issues it surfaces can be acted on immediately.
• Fraud detection and prevention: Amazon Kinesis helps you protect against fraudulent activity
by analyzing transaction data, so you can detect suspicious patterns and block fraudulent
transactions before they happen.
• Personalized recommendations and marketing: Amazon Kinesis helps you analyze customer
data so that you can understand your customers much better and recommend personalized
products to them in real time.
• IoT analytics and predictive maintenance: Kinesis unlocks the full potential of your connected
devices through the examination of sensor data from electronics, automobiles, or machinery.
• A limitation of Amazon Kinesis is that a stream retains records for only 24 hours by default;
retention can be extended for an additional cost, up to a maximum of 365 days.
• There is no upper limit on the number of streams that users can have in their accounts.
AWS Quicksight
AWS QuickSight is one of the most powerful business intelligence tools; it allows you to create
interactive dashboards within minutes to provide business insights to organizations. There are a
number of visualizations or graphical formats in which dashboards can be created. The dashboards
are updated automatically as the data is updated or on a schedule. You can also embed a
dashboard created in QuickSight in your own web application, as sketched below.
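A rough embedding sketch with boto3 is shown below; the account ID, user ARN, and dashboard ID are placeholders. The returned URL is then placed in an iframe in the web application.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="111122223333",
    UserArn="arn:aws:quicksight:us-east-1:111122223333:user/default/analyst",
    ExperienceConfiguration={"Dashboard": {"InitialDashboardId": "my-dashboard-id"}},
    SessionLifetimeInMinutes=60,
)
print(response["EmbedUrl"])  # embed this URL in the web application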
With the latest ML Insights, also known as Machine Learning Insights, QuickSight uses its built-in
algorithms to find anomalies or peaks in historical data. This helps you prepare for upcoming
business requirements based on these insights. Here is a quick guide to getting started with
QuickSight.
AWS QuickSight is an AWS-based business intelligence and visualization tool that is used to visualize
data and create stories that provide graphical details of the data. Data is imported as a dataset, and
you can apply filters, hierarchies, and columns to prepare documents. You can choose from various
charts, such as bar charts and pie charts, to visualize the data effectively. This basic tutorial will help
you understand and learn the AWS QuickSight tool.
Amazon QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-
to-understand insights to the people who you work with, wherever they are. Amazon QuickSight
connects to your data in the cloud and combines data from many different sources. In a single data
dashboard, QuickSight can include AWS data, third-party data, big data, spreadsheet data, SaaS data,
B2B data, and more. As a fully managed cloud-based service, Amazon QuickSight provides
enterprise-grade security, global availability, and built-in redundancy. It also provides the user-
management tools that you need to scale from 10 users to 10,000, all with no infrastructure to
deploy or manage.
Why QuickSight?
Every day, the people in your organization make decisions that affect your business. When they have
the right information at the right time, they can make the choices that move your company in the
right direction.
Here are some of the benefits of using Amazon QuickSight for analytics, data visualization, and
reporting:
• No upfront costs for licenses and a low total cost of ownership (TCO).
• No need to manage granular database permissions—dashboard viewers can see only what
you share.
For advanced users, QuickSight Enterprise edition offers even more features:
• Saves you time and money with automated and customizable data insights, powered by
machine learning (ML). This enables your organization to do the following, without requiring
any knowledge of machine learning:
o Translate data into easy-to-read narratives, like headline tiles for your dashboard.
• Federated users, groups, and single sign-on (IAM Identity Center) with AWS Identity
and Access Management (IAM) Federation, SAML, OpenID Connect, or AWS
Directory Service for Microsoft Active Directory.
• Access to AWS data and on-premises data in Amazon Virtual Private Cloud.
• Offers pay-per-session pricing for the users that you place in the "reader" security role—
readers are dashboard subscribers, people who view reports but don't create them.
• Empowers you to make QuickSight part of your own websites and applications by deploying
embedded console analytics and dashboard sessions.
• Makes our business your business with multitenancy features for value-added resellers
(VARs) of analytical services.
• Enables you to programmatically script dashboard templates that can be transferred to other
AWS accounts.
• Simplifies access management and organization with shared and personal folders for
analytical assets.
• Enables larger data import quotas for SPICE data ingestion and more frequently scheduled
data refreshes.
AWS Database Migration Service allows the movement of data across databases. AWS Database
Migration Service makes it possible to replicate data constantly. Continuous data replication serves a
variety of purposes, including synchronizing Dev/Test environment locales, global database
dispersion, and Disaster Recovery instance locations. The service also allows continuous data
replication to keep the source and target databases synchronized during the migration process.
• You can migrate between source and target endpoints that use the same kind of database
engine.
• An example would be migrating from one Oracle database to another Oracle database. The
only requirement for using AWS DMS is that one of your endpoints must be on an AWS service.
• You pay for it as you use it. No upfront license purchase or recurring maintenance fee is
required.
Understanding the key components of AWS DMS is crucial for planning and executing a successful
data migration. Let's break down these components and their roles in detail.
1. Source Database
The Source Database is where your original data resides before migration. AWS DMS supports a wide
range of database types for the source, including on-premises databases, Amazon Relational
Database Service (RDS), or data stored in other AWS services.
Example: You might have data in an on-premises MySQL database that you plan to migrate to AWS.
2. Source Endpoint
A Source Endpoint defines the connection between AWS DMS and your source database. It includes
the configuration details and authentication credentials required for AWS DMS to access the source
database.
Key Responsibilities:
• Ensuring that AWS DMS can securely read data from the source database.
3. Replication Instance
The Replication Instance is the core engine of AWS DMS. It connects to both the source and target
endpoints to migrate data. All data extraction, transformation, and loading tasks run on this instance.
Key Responsibilities:
Note: You can choose from various replication instance types and sizes depending on the amount of
data and the performance requirements of your migration.
4. Migration Tasks
Migration Tasks, also known as Replication Tasks, define how data is moved from the source to the
target. Each task controls the data flow and can be customized based on your migration needs.
o Full Load Tasks: Migrates the entire dataset from the source to the target. This is
typically used for the initial data load.
o Change Data Capture (CDC) Tasks: Continues to capture changes (inserts, updates,
deletes) made to the source database and replicates them to the target in real-time.
• Customization Options:
o Transforming data (e.g., renaming columns, changing data types) during the
migration process.
5. Target Endpoint
A Target Endpoint defines the connection to your target database. Like the source endpoint, it
includes the settings and credentials required for AWS DMS to write data to the target location.
Key Responsibilities:
• Ensuring secure and successful data loading into the target database.
• Managing configurations that affect how data is written to the target, such as compression or
encryption settings.
6. Target Database
The Target Database is where the data will reside after the migration. AWS DMS supports a variety of
target databases, which can be in the AWS cloud (like Amazon RDS, Amazon Aurora, or Amazon
Redshift) or on-premises.
Example: If you are migrating data from an on-premises MySQL database to AWS, the target
database might be Amazon RDS for MySQL or Amazon Aurora.
1. Source DB and Source Endpoint: AWS DMS uses the source endpoint to securely connect to
and read data from the source database.
2. Replication Instance and Migration Tasks: The replication instance hosts migration tasks that
carry out the data movement from the source to the target. Depending on the task type, the
replication instance either performs a full load, CDC, or both.
3. Target Endpoint and Target DB: The migration tasks write the data to the target database
through the target endpoint ensuring data integrity and consistency.
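These components can be wired together programmatically. The hedged sketch below creates and starts a full-load-plus-CDC replication task with boto3, assuming the source endpoint, target endpoint, and replication instance already exist; their ARNs are placeholders.

import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Table-mapping rules: include every table in every schema
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-aurora",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    MigrationType="full-load-and-cdc",   # full load first, then ongoing change data capture
    TableMappings=json.dumps(table_mappings),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)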
• Serverless: AWS DMS deploys, maintains, and monitors all the hardware and software it needs
for migration, which frees you from traditional tasks such as capacity analysis, hardware and
software procurement, system installation, and administration.
• Minimal downtime: The source remains operational while DMS continuously replicates changes
to the target throughout the migration process.
The following are the limitations of the AWS Database Migration Service:
• Performance: AWS DMS loads up to eight tables in parallel by default. This can be increased
somewhat by using a larger replication instance.
• Engine Name: The name of the target engine that the Fleet Advisor should use in its
recommendation.
• Data migration security: AWS DMS provides security for your data migration. As the data moves
from source to target, you can also encrypt it in flight using Secure Sockets Layer (SSL).
• Change data capture: Capturing incremental loads (change data) in AWS DMS requires some
programming; therefore, it takes more time and is labor-intensive.
1. Server and VM Migration: AWS SMS supports migration of both physical servers and VMs
from various sources, including VMware vSphere, Microsoft Hyper-V, and Microsoft Azure. It
allows you to consolidate and migrate these servers to the AWS Cloud.
2. Continuous Replication: AWS SMS offers continuous replication, which means your source
servers’ data is kept up-to-date in the AWS Cloud, reducing the risk of data loss during the
migration process.
3. Incremental Replication: After the initial replication, AWS SMS only replicates the changes
made to the source servers, minimizing the amount of data transferred and reducing
migration time.
4. Automated Replication and Testing: The service automates the replication and validation of
your server settings, making sure that they are compatible with the AWS environment before
the migration.
5. Flexible Cutover: You can schedule cutover for your migration, allowing you to choose the
right time to switch your applications from the source to the target environment.
6. Support for Large-scale Migrations: AWS SMS is designed to handle large-scale migrations,
making it suitable for enterprises with extensive on-premises infrastructure.
7. Integration with AWS Application Discovery Service: AWS SMS can work in conjunction with
AWS Application Discovery Service to provide a better understanding of your on-premises
applications, dependencies, and resource utilization before planning the migration.
It is essential to plan your migration carefully, considering factors like network connectivity, security,
and application dependencies to ensure a successful migration to the AWS Cloud. AWS SMS can
simplify the process and help you achieve a smooth and efficient migration while minimizing
downtime and risk.
What is AWS Snowball
AWS Snowball is a petabyte-scale offline data transfer device designed to securely and quickly
migrate large datasets to and from Amazon Web Services (AWS). Instead of transferring data over the
internet, Snowball allows organizations to physically ship encrypted storage devices for efficient data
ingestion.
• Snowball Edge Storage Optimized – Designed for large-scale data migration and storage,
offering up to 80 TB of usable storage per device.
• Snowball Edge Compute Optimized – Includes built-in compute capabilities such as AWS
Lambda and EC2 instances for local data processing before cloud transfer.
These rugged, portable devices are ideal for industries dealing with remote storage, edge
computing, and massive offline data transfers.
AWS Snowball is part of the AWS Snow Family, which also includes devices like Snowball Edge,
Snowcone, and Snowmobile. It is a rugged, secure physical device designed to help businesses
move large amounts of data into AWS quickly, without relying on slow internet connections. This
makes it perfect for large-scale data transfers, disaster recovery, media archives, and edge computing.
Here are the key features that make AWS Snowball a standout solution:
AWS Snowball transfers terabytes or petabytes of data up to 10 times faster than traditional
internet-based methods. It uses rugged, tamper-proof storage for secure and reliable transport.
Data is secured with 256-bit encryption through AWS Key Management Service (KMS).
Snowball’s tamper-proof hardware ensures the security of your data during offline migration.
The Snowball Edge Compute Optimized version includes support for AWS Lambda and EC2
instances. This allows you to process data locally before migrating it to the cloud—ideal for remote
locations and edge computing applications.
AWS Snowball functions as an NFS (Network File System) mount point, making it easy to integrate
with your on-premises servers and applications. This helps streamline file-based data migration to
AWS while preserving file system metadata.
5. S3 Compatibility
AWS Snowball is S3-compatible, ensuring your applications can interact with your data using
standard S3 APIs. This integration allows for seamless data management and smooth workflows.
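As a rough sketch, once a device has been unlocked on the local network, applications can point a standard S3 client at the device's local endpoint; the IP address, port, credentials, and bucket name below are placeholders obtained from the Snowball Edge client or AWS OpsHub.

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://192.0.2.10:8443",   # local Snowball Edge endpoint (placeholder)
    aws_access_key_id="LOCAL_ACCESS_KEY",
    aws_secret_access_key="LOCAL_SECRET_KEY",
    verify=False,  # or point 'verify' at the certificate exported from the device
)

# Copy a local file onto the device; it is imported into Amazon S3 after the device is returned
s3.upload_file("backup/archive-0001.tar", "import-bucket", "archive-0001.tar")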
6. GPU Support for High-Performance Workloads
The device supports GPU capabilities for processing complex tasks like machine learning, video
analysis, and other high-performance computing applications. This is perfect for resource-intensive
tasks in remote environments.
Once the data is successfully uploaded to AWS, the device automatically erases all data from the
Snowball, ensuring that no sensitive information is left behind. This provides complete protection for
your data.
The Snowball device comes with an E Ink shipping label that automatically updates with the correct
destination address, reducing the chances of logistics errors and simplifying the return process.
9. Cost-Efficient
AWS Snowball offers a cost-effective solution for large-scale data movement. It avoids the expensive
fees typically associated with high-bandwidth internet-based transfers, making it an affordable
alternative for businesses.
• Offline: Ship Snowball devices for bulk data transfer without an internet connection.
You can cluster multiple Snowball devices to increase both storage capacity and durability. This
ensures resilient data storage solutions for edge computing, even if one device fails.
AWS Snowball simplifies large-scale data migration by allowing you to transfer data securely using
a physical device instead of relying on slow internet connections. Follow the below steps to use AWS
Snowball:
o AWS Snowball Edge Compute Optimized (for running applications at the edge).
• Create a job:
o Start the data transfer or launch EC2 instances if using the compute-optimized
version.
• Use AWS OpsHub or the AWS Snowball Client to move data from your local storage to the
Snowball device.
• Monitor the transfer process to ensure all files are securely copied.
o The E Ink shipping label automatically updates to display the correct return address.
• Your data is automatically transferred from the Snowball device to your Amazon S3 bucket.
• Once the transfer is complete, AWS securely erases all data from the device and sanitizes it,
ensuring no customer data remains.
AWS Snowball is ideal for a variety of industries and use cases. Here are some examples:
• Cloud Migration: If your company needs to move large datasets from on-premises storage to
the cloud, Snowball makes it fast and easy.
• Disaster Recovery: Secure backups of business-critical data can be sent to AWS Snowball for
long-term storage and easy recovery.
• Media & Entertainment: Snowball is commonly used by movie studios, game developers,
and media companies to transfer high-resolution videos and raw media files for cloud
processing.
• Edge Computing: Industries like oil and gas, mining, and research use Snowball Edge to
process data locally before transferring it to the cloud.
• Healthcare & Genomics: Medical institutions and research facilities use Snowball to
move large genomic datasets securely and compliantly to AWS for analysis.
1. Simple Migration
AWS Snowball makes moving large amounts of data easy. With a pre-configured device, you can load
your data, ship it back to AWS, and let the service handle the transfer, saving time and avoiding
complex setups.
2. Faster Performance
Snowball accelerates data migration, allowing you to move data much faster than over the internet.
It reduces transfer time, making large-scale migrations more efficient.
Your data is safe with AWS Snowball. It’s encrypted with 256-bit encryption, and once the device is
returned, the data is securely erased, ensuring full protection throughout the process.
AWS provides multiple data transfer solutions to meet different business needs. Below is a
comparison of AWS Snowball, AWS DataSync, AWS Snowmobile, and AWS Direct Connect to help you
choose the best option.
• Internet needed: AWS Snowball – No; AWS DataSync – Yes; AWS Snowmobile – No; AWS Direct
Connect – Yes.
• Setup complexity: AWS Snowball – Simple (order, transfer, return); AWS DataSync – Moderate
(requires configuration); AWS Snowmobile – Complex (requires AWS approval); AWS Direct
Connect – Complex (needs network setup).
• Use cases: AWS Snowball – cloud migration, disaster recovery, remote locations; AWS DataSync
– regular file transfers, hybrid cloud sync; AWS Snowmobile – massive enterprise migrations;
AWS Direct Connect – real-time data transfer, low-latency workloads.
Features of Snowball
AWS Snowball with the snowball device has the following features:-
• Your data is secure both while it's at rest and when it's being physically moved.
• You can manage your jobs programmatically using the job management API or through the
AWS Snow Family Management Console.
• The 80 TB model is available in all AWS Regions; the 50 TB model is available only in the US
Regions.
• Local data transfers are possible between a Snowball and an on-premises data center. These
transfers can be carried out using the standalone, downloadable Snowball client. Or you can
use the free Amazon S3 Adapter for Snowball to transfer data programmatically using calls to
the Amazon S3 REST API. See Data Transmission with a Snowball for further details.
• When the Snowball is prepared to ship, the E Ink display on its own shipping container
switches to display your shipping label. See Shipping Factors for AWS Snowball for more
details.
AWS Snowmobile works by physically transporting large amounts of data from your location to AWS
data centers.
AWS Snowmobile is ideal for large-scale data migrations where transferring massive amounts of data
quickly is crucial. It is commonly used in the following scenarios:
• Data center migration: When organizations need to move petabytes or exabytes of data to
the cloud Snowmobile offers a fast and secure solution.
• Disaster recovery: Companies use Snowmobile to back up vast amounts of data to the cloud
ensuring quick recovery in case of a disaster.
• Video libraries: Media companies transferring large video archives to AWS for storage and
processing benefit from Snowmobile's massive capacity.
• Research data transfer: Industries like healthcare and genomics use Snowmobile to
transport enormous datasets, enabling advanced cloud analytics and processing.
• Government and security agencies: For secure, large-scale data transfer, Snowmobile
provides enhanced security features for sensitive data migration.
AWS Snowmobile offers several advantages for organizations with massive data migration needs:
• Massive Data Transfer: Snowmobile can transfer exabytes of data making it ideal for large-
scale migrations that would otherwise take months or years over standard internet
connections.
• Cost-Effective: It reduces the time and expense associated with transferring vast amounts of
data by eliminating the need for long-term network setups.
• Fast Migration: By moving data physically rather than over the internet Snowmobile
drastically shortens the data migration timeline.
• Scalability: Snowmobile is designed to handle the largest data migrations offering scalability
for organizations moving entire data centers to the AWS cloud.
Snowball: This is a petabyte-scale data migration device designed to transfer large amounts of data
to and from AWS. It's a rugged, secure appliance that can be physically shipped to your location.
Snowball Edge: This device combines compute and storage capabilities, making it suitable for hybrid
and edge workloads. It can process data locally and store it before transferring it to AWS.
Snowmobile: For exabyte-scale data migration, AWS offers Snowmobile. It's a semi-trailer truck
equipped with petabytes of storage capacity, allowing you to transfer massive amounts of data to
and from AWS.
Each Snowball Edge device can transport data at speeds faster than the internet. This transport is
done by shipping the data in the devices through a regional carrier. The appliances are rugged,
complete with E Ink shipping labels.
Snowball Edge devices have two options for device configurations—Storage Optimized 210
TB and Compute Optimized. When this guide refers to Snowball Edge devices, it is referring to all
options of the device. When specific information applies only to one or more optional configurations
of devices, it is called out specifically.
• Large amounts of storage capacity or compute functionality for devices. This depends on the
options you choose when you create your job.
• You can import or export data between your local environments and Amazon S3, and
physically transport the data with one or more devices without using the internet.
• Snowball Edge devices are their own rugged box. The built-in E Ink display changes to show
your shipping label when the device is ready to ship.
• Snowball Edge devices come with an on-board LCD display that can be used to manage
network connections and get service status information.
• You can cluster Snowball Edge devices for local storage and compute jobs to achieve data
durability across 3 to 16 devices and locally grow or shrink storage on demand.
• You can use Amazon EKS Anywhere on Snowball Edge devices for Kubernetes workloads.
• Snowball Edge devices have Amazon S3 and Amazon EC2 compatible endpoints available,
enabling programmatic use cases.
• Snowball Edge devices support the new sbe1, sbe-c, and sbe-g instance types, which you can
use to run compute instances on the device using Amazon Machine Images (AMIs).
• Snowball Edge supports these data transfer protocols for data migration:
o NFSv3
o NFSv4
o NFSv4.1
o Amazon S3 over HTTP or HTTPS (via API compatible with AWS CLI version 1.16.14
and earlier)
• Amazon S3 adapter — Use for programmatic data transfer in to and out of AWS using the
Amazon S3 API for Snowball Edge, which supports a subset of Amazon S3 API operations. In
this role, data is transferred to the Snow device by AWS on your behalf and the device is
shipped to you (for an export job), or AWS ships an empty Snow device to you and you
transfer data from your on-premises sources to the device and ship it back to AWS (for an
import job).
• Amazon S3 compatible storage on Snowball Edge — Use to support the data needs of
compute services such as Amazon EC2, Amazon EKS Anywhere on Snow, and others. This
feature is available on Snowball Edge devices and provides an expanded Amazon S3 API set
and features such as increased resiliency with flexible cluster setup for 3 to 16 nodes, local
bucket management, and local notifications.
• Amazon EC2 – Run compute instances on a Snowball Edge device using the Amazon EC2
compatible endpoint, which supports a subset of the Amazon EC2 API operations.
• Amazon EKS Anywhere on Snow – Create and operate Kubernetes clusters on Snowball Edge
devices.
• AWS Lambda powered by AWS IoT Greengrass – Invoke Lambda functions based on Amazon
S3 compatible storage on Snowball Edge storage actions made on an AWS Snowball Edge
device.
• Amazon Elastic Block Store (Amazon EBS) – Provide block-level storage volumes for use with
EC2-compatible instances.
• AWS Identity and Access Management (IAM) – Use this service to securely control access to
AWS resources.
• AWS Security Token Service (AWS STS) – Request temporary, limited-privilege credentials for
IAM users or for users that you authenticate (federated users).
• Amazon EC2 Systems Manager – Use this service to view and control your infrastructure on
AWS.