
How to Design AWS Data Architectures


narjes karmeni · Oct 21, 2020 · 16 min read


Introduction :

With the growth of cloud computing services, migration to the public cloud keeps increasing across many industries, thanks to its flexibility and the wide range of automated services it offers. Amazon Web Services (AWS) provides a broad platform of managed services that help build, secure, scale and maintain data architectures, so a company does not have to worry about server maintenance and failures. It offers a set of services for common data engineering tasks (collect, store, transform, analyze, search, visualize and ingest data) with scalable features adapted to different requirements and use cases.

In this article, I will explain the main services that every data engineer should know when deploying data architectures on AWS, describe the steps used to define an architecture adapted to the use case, and then show how to select the technical environment: S3/Glue/EMR/Athena/Kinesis/QuickSight/DataPipeline/EC2/Lambda.

I — Data Architectures :

When it comes to data processing, there are more ways to do it than ever. How you do it
and the tools you choose depend largely on what your purposes are for processing the
data in the first place. In many cases, you’re processing historical and archived data and
time isn’t so critical. You can wait a few hours for your answer, and if necessary, a few
days. Conversely, other processing tasks are crucial, and the answers need to be
delivered within seconds to be of value. Here are the differences among real-time, near
real-time, and batch processing, and when each is your best option.

1 — Batch Processing vs Real-time Processing :

Batch Processing : In batch processing, a scheduled system is triggered periodically to process large volumes of data that arrive in the datalake. In this type of architecture, the processing period can be on the order of days, weeks, months or quarters, and it is usually used to help decision-makers better understand data and make business recommendations.

Real-time (Online) Processing : Real-time data architecture is dedicated to processing a continuous real-time flow (IoT data, meteorology, ...) or an on-demand flow (web & mobile services); it can be used to reduce a cost, maximize a gain (online recommendation) or detect anomalies in real time. This type of architecture processes data that arrives within microseconds, seconds or minutes and is used to act/react in real time in order to maintain continuous control, which makes latency a key factor when evaluating the performance of such an architecture. We can also build dashboards to monitor the data and generate alerts.

2 — How to design an architecture?

When gathering specifications & business requirements, a first step is to identify existing data sources (if any) and to define the use case (simple projects) or multiple use cases (complex projects). Based on the expressed needs, one can decide whether processing will run continuously as data arrives or whether batch processing over a longer period will be adopted. These business requirements can then be translated into a functional architecture by defining the elementary components that transform the arriving raw data into an exploitable state that fulfills the requirements. While designing this prototype, it is important to take extensibility into consideration and propose a flexible architecture that can evolve easily. In this article, I will put into practice the design of a functional architecture, alongside best practices for designing low-cost, automated AWS data architectures, with two projects.

3.A — Functional Architecture for Real-Time Fraud Detection :

The first project consists of detecting fraud in payment requests arriving from web & mobile applications and then analyzing fraud through information search & statistical visualizations. Hence, we need a real-time architecture to react against fraud attempts and serve payment requests.

In this architecture prototype, payment requests continuously arrive from web and mobile; they are grouped and merged with the user's history (database) in order to match the requested payment operation against past user behavior.


Once a decision is delivered by a pre-trained machine learning classification model, we store the decision and update the user history if the operation is valid. Otherwise, we request a digital authentication (fingerprint, face recognition, speaker recognition, etc.) and again store the decision and update the history once the payment passes. Then, to run statistical analysis on user history data, we link the historical data to a visualization tool. We also want to analyze the performance of the fraud detection mechanisms and compute statistics on detected fraud attempts, so we connect the frauds detected by both mechanisms to a visualization tool.

3.B — Functional Architecture for Risk Prediction & Analysis : The second project consists of predicting, monthly and with a classification model, the risk quality of consumers, and then analyzing the profiles with bad risk to support better decisions.

Data is collected into a database and processed at each end of month in order to detect clients with bad risk scores and analyze their profiles. So data is loaded into the datalake, cleaned, and indicators are computed in order to feed the visualization dashboard.


A second step is to perform feature engineering on the cleaned data and then apply the pre-trained classification model. We also add a dashboard to analyze the detected profiles, discover groups and identify the factors impacting risk quality.

II — AWS Data Stack :

Let’s then make an overview and discuss features of data services in order to select the
best adapted services to realize a target function.

1 — Managing AWS Services :

There are several tools used to manage and control AWS services attached to a user account, given a service region and the account credentials: the access key AWS_ACCESS_KEY_ID and the secret key AWS_SECRET_ACCESS_KEY.

Awscli : the AWS Command Line Interface (CLI) is a unified tool to manage and control AWS services from the command line and to automate this control through scripts.

The simplest way to install this module on an Ubuntu machine is to launch one of these two commands:

$ sudo apt install awscli

$ sudo pip3 install awscli

Then add credentials by creating a config file in ~/.aws/credentials; to do so, run:

$ aws configure
AWS Access Key ID: your_access_key
AWS Secret Access Key: your_secret_key
Default region name [us-west-2]: your_aws_region
Default output format [None]: json

Another way to define the default credentials used by all tools is to set the environment variables:


$ export AWS_ACCESS_KEY_ID=<your_access_key>
$ export AWS_SECRET_ACCESS_KEY=<your_secret_key>
$ export AWS_REGION=<your_aws_region>

Terraform is software that lets users define infrastructure as code, manage its full lifecycle, update configuration without service interruption and delete resources that are no longer needed, through a template written in HashiCorp syntax. It supports many providers, including the aws provider.

This link provides a guide to install & set up Terraform on a local machine :
https://www.techrepublic.com/article/how-to-install-terraform-on-ubuntu-server/

Boto3 & aws-sdk :

AWS offers a set of SDKs for developers, such as boto3 for Python developers and aws-sdk for JavaScript developers.

These tools enable developers to create, delete, update and use AWS services.
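As a minimal illustration of this pattern (the region and calls below are just examples, not values from a specific project), a boto3 client is created per service and then exposes that service's API as Python methods:

import boto3

# A client exposes the API of one service as Python methods; credentials are
# resolved from environment variables, ~/.aws/credentials or an IAM role.
s3_client = boto3.client("s3", region_name="us-west-2")

# Example call: list the buckets owned by the account.
for bucket in s3_client.list_buckets()["Buckets"]:
    print(bucket["Name"])

# The same pattern applies to every service covered below, for example:
glue_client = boto3.client("glue")
emr_client = boto3.client("emr")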

2 — S3 bucket:

The first mandatory service that every AWS learner should know is the S3 bucket (a popular, "must-use" service). It is presented by AWS as a datalake alternative to HDFS on a Hadoop cluster and has several use cases: storing the activity logs of other services alongside user-application logs, storing application scripts to be executed on a VM, storing templates used to create resources and services, storing configurations, and finally storing data in different formats (CSV, Parquet, text, JSON, etc.) that other services read in order to run the hosted application.

It’s considered as one of the most preferred services to host large volumes of data due to
its low cost alongside the good data persistence properties; it’s as well highly used as a
data backup of other data sources in order to recover data in case of loss.

It has great integration with various Hadoop distributions such as Hortonworks, CDH,
MapR, EMR and DataProc.
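As a rough sketch of these storage use cases (the bucket name and key prefixes are placeholders), storing and reading datalake files from Python could look like this:

import boto3

s3 = boto3.client("s3", region_name="us-west-2")
bucket = "my-datalake-bucket"   # placeholder bucket name

# Store a raw data file under a "directory-like" key prefix.
s3.upload_file("payments_2020_10.csv", bucket, "raw/payments/payments_2020_10.csv")

# List everything stored under that prefix.
listing = s3.list_objects_v2(Bucket=bucket, Prefix="raw/payments/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download an object back to the local file system (e.g. for a backup restore).
s3.download_file(bucket, "raw/payments/payments_2020_10.csv", "restored.csv")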


Although they have similar use cases, it is worth keeping in mind some key differences between HDFS and the S3 bucket that should be taken into consideration when choosing a data architecture.

S3 is an object store, whereas HDFS is a distributed file system whose block replication guarantees fault tolerance.

S3 has unlimited storage capacity in the cloud, which is not the case for HDFS.

HDFS is hosted on physical machines; computing tasks can run on the same nodes that store the data.

Common commands: https://www.thegeekstuff.com/2019/04/aws-s3-cli-examples/

S3 pricing information: https://aws.amazon.com/s3/pricing/?nc1=h_ls

3 — Querying S3 data: Data Crawler & Catalog

Just like Hive in Hadoop, data stored in S3 can be queried with SQL.

AWS offers three main services for this, Glue, Athena and Redshift Spectrum, which transform data stored in S3 files into structured tables that can be partitioned & rapidly queried.

Which one to select depends on the user's purpose; this is explained in the rest of this article.

4 — AWS Glue:

Similarly, AWS Glue is a must-know service for AWS data engineers; it is a serverless service.

It provides the components that turn files on S3 into tables, namely the Crawler & the Data Catalog.

The Data Catalog is the persistent metadata store of AWS Glue; it contains table definitions, schemas, job definitions and other control information to manage the Glue environment. AWS presents it as a drop-in replacement for the Apache Hive Metastore.


A classifier infers the data schema from a data file. AWS Glue provides classifiers for the most commonly used file types such as CSV, JSON, AVRO, XML, and others.

More information is provided in the AWS Glue documentation:

https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
https://docs.aws.amazon.com/glue/latest/dg/components-key-concepts.html

Crawler: A crawler is a program that connects to the data store (source & target), uses
the list of classifiers to define the data schema and then creates the metadata tables in
the data catalog.
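For illustration, a crawler over an S3 prefix could be created and started with boto3; the IAM role and crawler name below are placeholders, not values from the article:

import boto3

glue = boto3.client("glue", region_name="us-west-2")

# Create a crawler that scans an S3 prefix and writes table definitions
# into the "default" database of the Data Catalog.
glue.create_crawler(
    Name="atm-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder IAM role
    DatabaseName="default",
    Targets={"S3Targets": [{"Path": "s3://testbucket/atm/"}]},
)

# Run it on demand; it can also be scheduled with the Schedule parameter.
glue.start_crawler(Name="atm-data-crawler")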

AWS Glue is then used to apply a set of transformations on the defined tables.

A Glue ETL job is composed of a data source, a data target and a transformation script.


The transformations applied can be a set of operations that clean raw data and transform it into an exploitable form.

Since Glue is a serverless service, there is no physical machine hosting the ETL jobs, and pricing is based only on the executed jobs.

The first million access requests to the AWS Glue Data Catalog per month are free. If you
exceed a million requests in a month, you will be charged $1.00 per million requests
over the first million.

https://aws.amazon.com/glue/pricing/?nc1=h_ls

Glue with Spark : A Glue package is provided to give Spark access to the Glue Data Catalog.

It extends Spark and provides an entry point to read and write Glue DynamicFrames from and to Amazon S3, the AWS Data Catalog and JDBC sources, through a class called GlueContext associated with a SparkContext.

Here are the steps used to develop PySpark code with a GlueContext.

Update PySpark driver environment variables: add these lines to your ~/.bashrc (or
~/.zshrc) file.

$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

With this, the pyspark command launches a Jupyter notebook with a default Spark session instead of the pyspark shell.

The gluepyspark command then launches a Jupyter notebook with the Glue jars and a Spark session, and the gluesparksubmit command is used to launch a Python script.
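Putting the pieces together, a minimal Glue PySpark script could look like the sketch below; the database, table and output path are placeholders, and it assumes the Glue jars are available (e.g. via gluepyspark or a Glue job environment):

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

# GlueContext wraps a SparkContext and gives access to the Data Catalog.
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Read a catalog table as a DynamicFrame, then work on it as a DataFrame.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="default", table_name="atmdata")
df = dyf.toDF().dropna()

# Write the result back to S3 as Parquet.
out_dyf = DynamicFrame.fromDF(df, glue_context, "out_dyf")
glue_context.write_dynamic_frame.from_options(
    frame=out_dyf,
    connection_type="s3",
    connection_options={"path": "s3://testbucket/atm-clean/"},
    format="parquet",
)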

5 — Athena Service :


AWS Athena is another serverless service (no server to manage) used to create and query S3 data with SQL syntax.

It is not used within a deployed application (in production) but as an interactive query tool to search & analyze information, and to feed statistical visualizations in the QuickSight service.

Athena uses the AWS Glue Data Catalog (by default) or a Hive metastore to store the metadata containing table definitions.

With the first option (Glue), the table is visible in the Glue console and ETL jobs can be run on it.

There are multiple methods to create tables and execute queries from a local machine on
Athena service.

With the AWS CLI:

$ aws athena start-query-execution \
    --query-string "CREATE EXTERNAL ... ;" \
    --query-execution-context Database=default \
    --result-configuration OutputLocation=s3://testbucket/atm/

With Terraform :

From the table configuration template athena.tf :

provider "aws" {
  region     = "your_region"
  access_key = "your_aws_access_id"
  secret_key = "your_aws_secret"
}

resource "aws_glue_catalog_table" "aws_glue_catalog_table" {
  name          = "atmdfromterra"
  database_name = "default"

  table_type = "EXTERNAL_TABLE"

  parameters = {
    EXTERNAL = "TRUE"
    # "field.delim" = ","
    # "skip.header.line.count" = "1"
  }

  storage_descriptor {
    location      = "s3://test/atm/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      name                  = "my-serde"
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters = {
        "serialization.format"   = ","
        "line.delim"             = "\n"
        "field.delim"            = ","
        "skip.header.line.count" = "1"
      }
    }

    columns {
      name = "DATE"
      type = "string"
    }

    columns {
      name = "ATM_ID"
      type = "int"
    }

    columns {
      name = "CLIENT_OUT"
      type = "int"
    }
  }
}


$ terraform plan -out=table.plan
$ terraform apply table.plan


With boto3 :

import boto3
from botocore.config import Config

config = Config(region_name='your_region')  # if no default region is defined

# ACCESS_KEY and SECRET_KEY are credential placeholders.
athena_client = boto3.client('athena', config=config,
                             aws_access_key_id=ACCESS_KEY,
                             aws_secret_access_key=SECRET_KEY)

response = athena_client.start_query_execution(
    QueryString="CREATE EXTERNAL TABLE IF NOT EXISTS atmdata(DATE STRING, ATM_ID INT, CLIENT_OUT INT) ... ;",
    QueryExecutionContext={'Database': 'default'},
    ResultConfiguration={'OutputLocation': 's3://testbucket/'}
)


6 — Redshift Spectrum Service:

AWS offers a smart feature called "Redshift Spectrum" that enables Redshift queries to run on data stored in an S3 bucket; we benefit from the low-cost hosting of S3 while keeping processing near real-time.

Redshift Spectrum works by introducing a compute layer between S3 and the Redshift cluster that executes the S3 part of Redshift queries. This layer is scaled and managed by Amazon according to query usage, and pricing is based on the amount of data scanned in S3.


AWS uses a data catalog to store the schema definition of the data accessed in S3; there are three options for the Redshift Spectrum data catalog: the Athena data catalog, the Glue Data Catalog, or the Hive metastore of an EMR cluster.
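As an illustration only (the cluster, database user, IAM role and schema names are assumptions, and this relies on the Redshift Data API), the external schema that exposes a Glue catalog database to Spectrum could be created from Python:

import boto3

redshift_data = boto3.client("redshift-data", region_name="us-west-2")

# Create an external schema backed by the Glue Data Catalog; Spectrum queries
# against tables of this schema scan the underlying S3 files directly.
redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder cluster
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum "
        "FROM DATA CATALOG DATABASE 'default' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole' "
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    ),
)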

7 — EC2 Service:

Amazon Elastic Compute Cloud (EC2) is historically one of the first Amazon services launched.

It is a standard, resizable computing service.

It is used to deploy on a user-selected machine, where the development team can choose the operating system and a machine size adapted to the deployed application.

It is used to run code tests and deploy high-performance applications.

Connecting to an EC2 instance is simply done by establishing an SSH connection:

ssh -i "your_aws_key.pem" ubuntu@ec2-xx-xx-x-xxx.compute.amazonaws.com

Data transfer with scp from EC2 to the local machine:

scp -i "your_aws_key.pem" ubuntu@ec2-xx-xx-x-xxx.compute.amazonaws.com:/path_to_your_files path_to_your_local_folder

We can also open a web-interface service (Jupyter, Zeppelin, Ambari, etc.) running on the EC2 instance from a local machine's browser, through an SSH tunnel:

ssh -i "your_aws_key.pem" -N -f -L 8888:localhost:8888 ubuntu@ec2-xx-xx-x-xxx.compute.amazonaws.com


8 — AWS ElasticMapReduce Cluster (EMR)

It’s the AWS platform of hadoop cluster,it contains the well-known hadoop components
to execute High performances MapReduce jobs ( HDFS, YARN,SPARK) on high vol data
alongside other big data services (Hive, Hbase, pig, Oozie, Zeppelin, Zookeeper, etc)

EMR uses EC2 as nodes (master and slave) to distribute data storage and processing.

As the ssh access is enabled by default, configuration of cluster components


(hive,yarn,spark, etc) is saved in a template containing the value of properties called
during the creation of cluster.

Creation, configuration and removal of an EMR cluster can be done with terraform/awscli and the Software Development Kits.

The good news is that, for architectures where distributed jobs are executed infrequently (once per day/week/month, etc.), the cluster can be automatically terminated after the job run, since the input and output data are stored in S3 rather than HDFS, as in the sketch below.
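A minimal boto3 sketch of such a transient cluster follows; the release label, instance types, roles, script and S3 paths are placeholders. Setting KeepJobFlowAliveWhenNoSteps to False makes the cluster terminate itself once the last step finishes:

import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="monthly-risk-scoring",
    ReleaseLabel="emr-6.2.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Terminate the cluster automatically when all steps are done.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-cleaning-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://testbucket/scripts/clean_and_score.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://testbucket/emr-logs/",
)
print(response["JobFlowId"])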

9 — AWS Lambda :

Lambda is a computing service; it is highly appreciated for its auto-scaling properties and low cost. It is triggered by an event (there are many possible sources), and jobs are executed only when the event occurs; hence there is nothing to pay during inactivity.

It is well adapted to automating application workflows and intermediate small tasks, and is also used to collect data from users together with Kinesis, as in the sketch below.
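As a hedged sketch (the handler and payload fields are hypothetical), a Lambda consuming a Kinesis stream receives batched, base64-encoded records in its event:

import base64
import json

def lambda_handler(event, context):
    """Triggered by a Kinesis stream; decodes and processes each record."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Placeholder processing step: e.g. enrich, filter or forward the event.
        print(payload)
    return {"processed": len(event["Records"])}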

10 — Kinesis Service:

Kinesis is a data ingestion service (equivalent to Kafka) used for data streaming in real-time applications; boto3 is a well-known client to publish & consume data in Python applications, and it works with Spark Structured Streaming to transform the buffered data flow.

Storing the received data into S3/Redshift/DynamoDB can be automated (without writing code) with Kinesis Firehose, or with a dedicated Lambda function that records the streams and periodically updates the data sources.
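For illustration (the delivery stream name and record fields are placeholders), pushing a record to a Firehose delivery stream that buffers and writes to S3/Redshift on its own could look like this:

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

record = {"thing_id": "aa-bb", "prop": 42}
firehose.put_record(
    DeliveryStreamName="payments-to-s3",   # placeholder delivery stream
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)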

A code for a kinesis producer with boto3 :

import boto3
import json
from datetime import datetime
import calendar
import random
import time

my_stream_name = 'python-stream'

kinesis_client = boto3.client('kinesis', region_name='us-east-1')

def put_to_stream(thing_id, property_value, property_timestamp):
    payload = {
        'prop': str(property_value),
        'timestamp': str(property_timestamp),
        'thing_id': thing_id
    }

    print(payload)

    put_response = kinesis_client.put_record(
        StreamName=my_stream_name,
        Data=json.dumps(payload),
        PartitionKey=thing_id)

while True:
    property_value = random.randint(40, 120)
    property_timestamp = calendar.timegm(datetime.utcnow().timetuple())
    thing_id = 'aa-bb'

    put_to_stream(thing_id, property_value, property_timestamp)

    # wait for 5 seconds
    time.sleep(5)


A code for a kinesis consumer with boto3 :

import boto3
import json
from datetime import datetime
import time

my_stream_name = 'python-stream'

kinesis_client = boto3.client('kinesis', region_name='us-east-1')

response = kinesis_client.describe_stream(StreamName=my_stream_name)

my_shard_id = response['StreamDescription']['Shards'][0]['ShardId']

shard_iterator = kinesis_client.get_shard_iterator(StreamName=my_stream_name,
                                                   ShardId=my_shard_id,
                                                   ShardIteratorType='LATEST')

my_shard_iterator = shard_iterator['ShardIterator']

record_response = kinesis_client.get_records(ShardIterator=my_shard_iterator,
                                             Limit=2)

while 'NextShardIterator' in record_response:
    record_response = kinesis_client.get_records(ShardIterator=record_response['NextShardIterator'],
                                                 Limit=2)

    print(record_response)

    # wait for 5 seconds
    time.sleep(5)


III — Putting it all together : design of a cloud-based data architecture

Once the functional architecture is established, it is translated into a technical architecture that respects the constraints and delivers the functional needs with minimum cost and a satisfactory latency.

This includes the technical environment (languages, frameworks, etc.) alongside the data storage systems.


Many criteria have to be taken into consideration when choosing a datastore/database and the computing services, mainly data persistence, hosting cost, security, latency and data structure.

1 — A Use case for Batch processing:

In this use case, data is loaded into a database monthly and data processing is then triggered on the new incoming data.

Therefore, we do not need a database that supports real-time processing to store the data flow.

We can use a datalake to execute Spark transformations that clean and transform the batch data, and then launch the classification model to evaluate risk quality.

HDFS or S3 for input data?

a- First scenario: Input and output sources are stored on S3 bucket:

Running Spark jobs on data stored on HDFS would allow using the YARN component of Hadoop, so we could benefit from YARN's scheduling capabilities and accelerate the Spark jobs.

b- Second scenario: Input sources on HDFS and output (target) sources on S3 bucket:

Launching Spark jobs on an EMR cluster would be an adequate choice, since large data volumes are transformed rapidly with the cluster's resources.

On the other hand, keeping the multiple EC2 instances of the EMR cluster running without activity (as processing is triggered once a month) would lead to a cost overflow and an expensive monthly bill.

c — Third scenario:

The evolved solution consists in using S3 to store input and output data for the long term and temporarily copying the input data to the HDFS nodes of an EMR cluster in order to run the Spark jobs.


The cluster is then auto-terminated once the last job has completed successfully.

How to generate the output source: In order to run Spark jobs on the EMR cluster we can use an EC2 instance as a client server, and in order to write a Spark dataframe directly into the Glue Data Catalog, we have to use the Spark-Glue connector.

Creating a dashboard: In order to create a dashboard and analyze the input and output data sources, we use the Athena service to launch search queries from an interactive console, and connect that source to a QuickSight service to build visualizations.

Automating the cluster-creation workflow & source updates?

We need a process that automates the creation of the cluster once data has been loaded into the S3 bucket.

This is feasible with a dedicated Lambda function triggered by an object-creation event on the input S3 bucket; the cluster configuration file is then loaded from a bucket and used to create the cluster with a boto3 client, as in the sketch below.
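A hedged sketch of that trigger follows; the bucket, config key and handler names are assumptions. The Lambda reads a cluster template stored as JSON in S3 and passes it to run_job_flow:

import json
import boto3

s3 = boto3.client("s3")
emr = boto3.client("emr")

CONFIG_BUCKET = "testbucket"              # placeholder bucket
CONFIG_KEY = "config/emr_cluster.json"    # placeholder run_job_flow template

def lambda_handler(event, context):
    """Triggered by an object-creation event on the input bucket."""
    new_object = event["Records"][0]["s3"]["object"]["key"]
    print(f"New data arrived: {new_object}, creating the EMR cluster")

    # Load the cluster definition (same structure as the run_job_flow arguments).
    body = s3.get_object(Bucket=CONFIG_BUCKET, Key=CONFIG_KEY)["Body"].read()
    cluster_config = json.loads(body)

    response = emr.run_job_flow(**cluster_config)
    return {"JobFlowId": response["JobFlowId"]}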

Once the cluster is ready, we start an EC2 machine to host the application, upload the application scripts from S3 to the EC2 instance and launch them.

Secondly, we need a process that automatically updates the Athena sources once new data arrives in the input and output buckets, and loads the new data.

There are two options to do that: the first is to launch the update scripts from the EC2 machine started above.

The second option is to create two dedicated Lambda functions that are woken up when new data arrives and automate the update of the existing tables with a boto3 client (if using Python).
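As a minimal sketch of the second option (the table name and output location are placeholders), such a Lambda could simply submit a partition-refresh query to Athena whenever new objects land:

import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    """Triggered by new objects in the input/output buckets; refreshes partitions."""
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE atmdata;",   # re-discovers new partitions
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://testbucket/athena-results/"},
    )
    return {"status": "partition refresh submitted"}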

Once the application workflow is terminated, the EC2 machine can be auto-stopped.


2 — A use case for online processing: Real-time fraud detection

In this architecture, we adopted two reactive components: the first matches transactions with the client's past behavior and outputs a first alert; the second identifies the client based on confidential information (biometrics, face identification, voice recognition, etc.).

As the historical data sources are requested by three clients (reading and writing) in real time, our data source should support real-time processing.

Redshift cluster or Redshift Spectrum?

Amazon Redshift excels at running complex analytics queries, joins and aggregations and provides a real-time response over large datasets thanks to its high-performance local disks.

Redshift is priced per storage node; it runs at about ~$1,000 / TB / year, roughly four times more expensive than the S3 bucket (~$250 / TB / year).

On the other hand, Redshift Spectrum is a serverless service; pricing is based on scanned data ($5 per terabyte of data scanned).

Redshift Spectrum users can benefit from the cheap storage price of S3 and still run analytics queries, filter, aggregate and group data with the Spectrum layer.

Note that Redshift Spectrum queries run slower than queries on the Amazon Redshift cluster itself.


Indeed, running frequent Spectrum queries would probably make the data-scanning cost higher than simply storing the data in Redshift; in that case it is better to move the data back into the cluster.

In a nutshell, there is no obvious choice between the two options; a smart approach is to simulate the daily incoming workload of the application and estimate the average daily cost and latency.

If scanning data costs more than storage, the Redshift cluster is the best option; otherwise, if Redshift Spectrum gives an acceptable latency when the incoming workload reaches its maximum and cost is your priority, then Spectrum is the good option for you.

The data collection operation can be hosted by a Lambda function connected to the Kinesis service, which then sends the result via a Kinesis stream to the data transformation unit.

Deploying the transformation jobs :

The data transformation jobs, alongside the classification algorithms, can be deployed on an EC2 machine with an adapted size, or in a separate Lambda function where data is ingested through Kinesis streams.

We can use an S3 bucket to store the classification algorithm's activity, and a Lambda function to update the Athena table with the new incoming data.

Backup data stores : using backups to replicate data is also a mandatory practice when designing architectures, and S3 is the most convenient choice.

Therefore, the technical architecture would look like this:

