
AWS Delta Lake Solution

Uploaded by

satish
Copyright © All Rights Reserved

AWS Delta Lake

Data Archive Solution – Architecture (draft)


AWS Glue and Delta Lake
• Delta Lake
• An open-source storage layer that brings scalable ACID transactions to Apache Spark™ and big data workloads. It provides serializability, the strongest isolation level, along with scalable metadata handling and time travel, and it is 100% compatible with Apache Spark APIs.
• In practice, it lets you run DELETE and UPSERT operations directly against your data lake.

• AWS Glue
• A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
• AWS Glue 3.0 and later supports the Linux Foundation Delta Lake framework.

• Amazon Athena
• An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and
you pay only for the queries that you run.

• Amazon S3
• An object storage service that offers industry-leading scalability, data availability, security, and performance.
AWS Glue Pricing
• AWS Glue pricing is measured in DPU-hours. A single Data Processing Unit (DPU) provides 4 vCPUs and 16 GB of memory.
• AWS Glue Crawlers Pricing, Cost
• AWS Glue crawler runtime, used to discover data and populate the AWS Glue Data Catalog, is billed at an hourly rate: $0.44 per DPU-hour, billed per second, with a 10-minute minimum per crawler run.
• For example, 10 hours of crawling every day costs about 10 x $0.44 x 30 = $132 per month, assuming the crawler uses a single DPU.
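The crawler arithmetic can be sketched in a few lines of Python. This is a rough estimate only: the function names are hypothetical, the $0.44 rate and 10-minute minimum are the figures quoted above, and the single-DPU default is an assumption, since the slide does not say how many DPUs a crawler run consumes.

```python
CRAWLER_RATE_PER_DPU_HOUR = 0.44   # USD per DPU-hour, as quoted above
MIN_BILLED_SECONDS = 10 * 60       # 10-minute minimum per crawler run


def crawler_run_cost(runtime_seconds: float, dpus: float = 1.0) -> float:
    """Cost of one crawler run, billed per second with a 10-minute minimum.

    dpus=1.0 is an assumption for illustration, not a documented default.
    """
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    return dpus * (billed_seconds / 3600) * CRAWLER_RATE_PER_DPU_HOUR


def monthly_crawler_cost(hours_per_day: float, days: int = 30, dpus: float = 1.0) -> float:
    """Monthly cost for a fixed amount of crawling time per day."""
    return dpus * hours_per_day * days * CRAWLER_RATE_PER_DPU_HOUR


if __name__ == "__main__":
    print(f"10 h/day for 30 days: ${monthly_crawler_cost(10):.2f}")
    print(f"One 60-second run (billed as 10 minutes): ${crawler_run_cost(60):.4f}")
```

Note that many short runs can cost more than their raw runtime suggests, because each run is billed for at least 10 minutes.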

• AWS Glue Data Catalog Pricing, Cost


• With the AWS Glue Data Catalog, you can store up to a million objects for free. If you store more than a million objects, you will be charged $1.00 per
100,000 objects over a million, per month.
• The first million access requests to the AWS Glue Data Catalog per month are free. If you exceed a million requests in a month, you will be
charged $1.00 per million requests over the first million.
• For example, if you store 10 million objects, the first million are free and the remaining 9 million cost 90 x $1.00 = $90 per month for storage.
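A minimal sketch of the Data Catalog free-tier arithmetic follows. The function names are hypothetical, and the sketch assumes charges scale linearly above the free tier; whether partial blocks of 100,000 objects are rounded up is not stated here.

```python
FREE_OBJECTS = 1_000_000            # objects stored free of charge
FREE_REQUESTS = 1_000_000           # access requests per month free of charge
STORAGE_RATE_PER_100K = 1.00        # USD per 100,000 objects over the free tier
REQUEST_RATE_PER_MILLION = 1.00     # USD per million requests over the free tier


def catalog_storage_cost(objects: int) -> float:
    """Monthly storage charge; assumes linear proration above the free tier."""
    billable = max(objects - FREE_OBJECTS, 0)
    return billable / 100_000 * STORAGE_RATE_PER_100K


def catalog_request_cost(requests: int) -> float:
    """Monthly request charge; assumes linear proration above the free tier."""
    billable = max(requests - FREE_REQUESTS, 0)
    return billable / 1_000_000 * REQUEST_RATE_PER_MILLION


if __name__ == "__main__":
    print(f"10 million objects: ${catalog_storage_cost(10_000_000):.2f} per month")
    print(f"3 million requests: ${catalog_request_cost(3_000_000):.2f} per month")
```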

• AWS Glue ETL Jobs Pricing, Cost


• With AWS Glue, you only pay for the time your ETL job takes to run.
• There are three types of jobs in AWS Glue: Apache Spark, Spark Streaming, and Python shell.
• An Apache Spark job run in AWS Glue requires a minimum of 2 DPUs. By default, AWS Glue allocates 10 DPUs to each Apache Spark job.
• An AWS Glue job of type Spark Streaming requires a minimum of 2 DPUs. By default, AWS Glue allocates 5 DPUs to each Spark Streaming job.
• An AWS Glue job of type Python shell can be allocated either 1 DPU or 0.0625 DPU. By default, AWS Glue allocates 0.0625 DPU to each
Python shell job.
• For example, at the default allocations:
• 1 Apache Spark job is charged 10 x $0.44 = $4.40 per hour
• 1 Spark Streaming job is charged 5 x $0.44 = $2.20 per hour
• 1 Python shell job is charged 0.0625 x $0.44 = $0.0275 per hour (or 1 x $0.44 = $0.44 per hour if allocated 1 DPU)
• For example:
• A single Glue job runs for 10 minutes and is triggered every 15 minutes: 4 runs/hr x 24 hrs/day x 30 days/month = 2,880 runs, or 2,880 x 10 minutes = 480 billing hours. With 10 DPUs, the total monthly cost is 10 DPUs x 480 billing hours x $0.44 = $2,112.
Amazon Athena Pricing Example
• This example uses figures from the Amazon Athena pricing calculator. We assume 1 query per work day, so 20 queries a month, each scanning 4 TB of data. At $5 per TB scanned, a query that scans 4 TB costs $20. Running that query 20 times per month comes to 20 x $20 = $400 per month.

Price per TB scanned: $5
Queries per month: 20
TB of data scanned, per query: 4
Total monthly cost: $400

You can reduce these costs by storing your data compressed, if that is an option for you. A very conservative 2:1 compression ratio would cut the cost in half, to $200 per month.

If you also store the data in a columnar format such as ORC or Parquet, you can reduce costs further by scanning only the columns you need instead of entire rows. Assuming, again conservatively, that queries then read only half of the data, the cost drops to $100 per month.
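The three Athena scenarios above (raw, compressed, compressed and columnar) can be reproduced with a small function. The function and parameter names are hypothetical; the 2:1 compression ratio and the 50% column fraction are the illustrative assumptions used in the text, not measured values.

```python
PRICE_PER_TB_SCANNED = 5.0  # USD per TB scanned, as quoted above


def athena_monthly_cost(queries_per_month: int, tb_per_query: float,
                        compression_ratio: float = 1.0,
                        column_fraction: float = 1.0) -> float:
    """Estimated monthly Athena cost.

    compression_ratio: 2.0 means the data shrinks to half its raw size.
    column_fraction: share of the (compressed) data a query actually reads.
    """
    effective_tb = tb_per_query / compression_ratio * column_fraction
    return queries_per_month * effective_tb * PRICE_PER_TB_SCANNED


if __name__ == "__main__":
    print(f"Raw:                   ${athena_monthly_cost(20, 4):.2f}")
    print(f"Compressed 2:1:        ${athena_monthly_cost(20, 4, compression_ratio=2):.2f}")
    print(f"Compressed + columnar: ${athena_monthly_cost(20, 4, 2, 0.5):.2f}")
```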
Amazon S3 Cost Modeling
• Raw data received from source systems is stored in Amazon S3, then processed by AWS Glue and loaded into target storage. Amazon S3 incurs storage costs and request costs for PUT and GET operations. For a total dataset size of 3 TB, the estimated costs for this sample data are listed below.

Dimensions                     Cost
Amazon S3 Storage (3 TB)       $69
PUT requests (300,000)         $1.5
GET requests (3,000,000)       $1.2
Total                          ~$71.7

Monthly Total                  ~$2126.7
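The S3 line items in the table can be reproduced with the sketch below. The unit rates are assumptions inferred from the table itself ($69 for 3 TB implies $0.023 per GB with 1 TB counted as 1,000 GB; $1.5 for 300,000 PUTs implies $0.005 per 1,000; $1.2 for 3,000,000 GETs implies $0.0004 per 1,000). Actual rates vary by region and storage class, and the function name is hypothetical.

```python
# Assumed unit rates, inferred from the table above (not stated in the source).
STORAGE_RATE_PER_GB = 0.023    # USD per GB-month
PUT_RATE_PER_1000 = 0.005      # USD per 1,000 PUT requests
GET_RATE_PER_1000 = 0.0004     # USD per 1,000 GET requests


def s3_cost_breakdown(storage_gb: float, put_requests: int, get_requests: int) -> dict:
    """Return the storage, PUT, GET, and total cost estimates in USD."""
    storage = storage_gb * STORAGE_RATE_PER_GB
    puts = put_requests / 1000 * PUT_RATE_PER_1000
    gets = get_requests / 1000 * GET_RATE_PER_1000
    return {"storage": storage, "put": puts, "get": gets,
            "total": storage + puts + gets}


if __name__ == "__main__":
    # 3 TB counted as 3,000 GB, matching the $69 figure in the table.
    costs = s3_cost_breakdown(3000, 300_000, 3_000_000)
    for item, usd in costs.items():
        print(f"{item:>8}: ${usd:.2f}")
```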
