0% found this document useful (0 votes)

396 views22 pages

How To Design AWS Data Architectures - by Narjes Karmeni - The Startup - Medium

This document provides an overview of how to design AWS data architectures. It discusses the differences between batch and real-time processing and provides examples of functional architectures for fraud detection and risk prediction. It also reviews common AWS services for data management, storage, processing and analytics including S3, Glue, EMR, Athena, Kinesis, QuickSight, Data Pipeline, EC2 and Lambda.

Uploaded by

Binesh Jos

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

396 views22 pages

How To Design AWS Data Architectures - by Narjes Karmeni - The Startup - Medium

Uploaded by

Binesh Jos

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

You have 2 free member-only stories left this month. Sign up for Medium and get an extra one

How to Design AWS Data Architectures

narjes karmeni Follow
Oct 21, 2020 · 16 min read

Article Content :

Introduction :
https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 1/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

Within the increased development of cloud computing services, migration to the public
cloud is continuously increasing amongst many industries due to its flexibility and the
large wide range of automated services it offers. Amazon Web Services (AWS) provides a
broad platform of managed services to help in building, securing, scaling and
maintaining data architectures; a company has not to care about server’s maintenance
and failures. It provides a set of services used to perform common data engineering
tasks ; collect, store, transform, analyze, search, visualize and ingest data with
scalable features adapted for different requirements and use-cases.

In this article, I will explain the main services that each data engineer should know
when deploying data architectures on an AWS platform and I will explain adopted steps
to define an architecture adapted to the use-case and then how to select the technical
environment S3/Glue/EMR/Athena/Kinesis/QuickSight/DataPipeline/EC2/Lambda.

I — Data Architectures :

When it comes to data processing, there are more ways to do it than ever. How you do it
and the tools you choose depend largely on what your purposes are for processing the
data in the first place. In many cases, you’re processing historical and archived data and
time isn’t so critical. You can wait a few hours for your answer, and if necessary, a few
days. Conversely, other processing tasks are crucial, and the answers need to be
delivered within seconds to be of value. Here are the differences among real-time, near
real-time, and batch processing, and when each is your best option.

1 — Batch Processing vs Real-time Processing :

Batch Processing : In batch processing, a planned system is triggered periodically to

process large volumes of data that arrives in datalake. In this type of architecture,
processing period could be at the order of days|weeks|months|quarters… and it’s
usually used to help deciders to better understand data and make business
recommendations.

Real-time (Online) Processing : Real-time data architecture is dedicated to process

real time continuous flow ( Iot data , meteorology, … ) or on-demand flow ( web &
mobile services ); it could be used to reduce a cost, maximize a gain (online
recommendation), real-time anomaly detection. This type of architecture treats data

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 2/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

that arrives in ( microsecond|second| minutes) and used to act/react in real time in

order to make a continuous control; what makes latency is a key factor to evaluate
performance of such architecture. We could as well build some dashboards in order to
make data monitoring and generate alerts.

2- How to design architecture ?

While getting specifications & business requirements, a first step would be defining
existing data sources (if they already exist) and the define use-case (simple projects) /
multiple use-cases (complex projects). Based on expressed needs, one could choose if
processing alongside data arrival would be continuously running or batch processing
with a large period of time would be adopted. And then, we could translate these
business requirements into a functional architecture by defining elementary
components that would transform arrived data with its raw form into an exploitable
state that fulfills requirements. While designing this prototype, it would be important to
take into consideration extensibility and then propose a flexible architecture that
would be easily evolved. In this article, I will put into practice the design of a functional
architecture alongside best practices to design low cost and automated AWS data
architectures with two projects.

3- A . Functional Architecture for real-time fraud detection :

The first project consists of detecting fraud from payments requests that arrive from web
& mobile applications and then make fraud analysis by making information search &
statistical visualizations. Hence, we need real time architecture to react against fraud
attempts and serve payments requests.

In this architecture prototype, payments requests continuously arrive from web and
mobile, they are grouped and merged with old user-history ( database ) in order to
match requested payment operation with old user behavior.

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 3/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

Once a decision is delivered by a pre-trained machine learning classification model, we

would store the decision and update user-history once the operation is valid. Otherwise,
we would demand a digital authentication [fingerprint, face recognition, voice speaker
recognition, etc) and again store decisions and update history once payment passes.
Then, we would like to make statistical analysis on user history data; we would link
historical data with a visualization tool. We as well would like to analyze performances
of fraud detector mechanisms and identify statistics on detected fraud attempts; we
would connect detected fraud by both mechanisms to a visualization tool.

3. B . Functional Architecture for Risk Prediction & Analyzes : The second project
consists of monthly predicting with a classification model consumers quality of risk and
then analyzes profiles with bad risk to better make decisions.

Data is collected into a database and processed at each end of month in order to detect
clients with bad risk scores and analyze their profiles. So data is charged in the datalake
then cleaned and then computing indicators in order to aliment visualization dashboard.

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 4/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

A second step was to develop feature engineering on cleaned data and then apply the
pretrained classification model. We indeed add a dashboard to analyze detected profiles,
discover groups and identify factors impacting quality of risk.

II — AWS Data Stack :

Let’s then make an overview and discuss features of data services in order to select the
best adapted services to realize a target function.

1 — Managing AWS Services :

There are several existing tools used to manage and control aws services attached to a
user account given service region alongside account credentials access key
AWS_ACCESS_KEY_ID and secret key AWS_SECRET_ACCESS_KEY.

Awscli : the AWS Command Line Interface (CLI) is a unified tool to manage and control
AWS services from command lines and then automate this control through scripts.

The simplest way to install this module on a ubuntu machine is to lunch this two
commands:

$ sudo apt install awscli

$ sudo pip3 install awscli

And the add credentials through adding a config file in ~/.aws/credentials

Hence you have to run:

$ aws configure
AWS Access Key ID: your_access_key
AWS Secret Access Key: your_secret_key
Default region name [us-west-2]: your_aws_region
Default output format [None]: json

Another way to define to define default credentials that would be used by all tools is to
set the environments variables as:

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 5/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

$ export AWS_ACCESS_KEY_ID=<your_access_key>
$ export AWS_SECRET_ACCESS_KEY=<your_secret_key>
$ export AWS_REGION=<your_aws_region>

Terraform is software that enables users to define infrastructure as a code, manage the
full lifecycle, update configuration without service interruption and delete no longer
resources through a template written with HashiCorp syntax. It supports many providers
including aws provider

The link provides a guide to install & setup terraform on local machine :
https://www.techrepublic.com/article/how-to-install-terraform-on-ubuntu-server/

Boto3 & sdk-aws :

AWS offers a set of SDK for developers such as boto3 (for python developers) and (aws-
sdk) for javaScript developers.

These tools enable developers to create, delete, update and use AWS services.

2 — S3 bucket:

The first mandatory service that should be known by each AWS learner is s3 bucket
(popular and a “must-used” service).It’s presented by AWS as an alternative datalake of
HDFS on a hadoop cluster; it has several use cases; roughly, it’s used for logging activity
store of other services alongside user-application generated logs, storing application
scripts that would be executed on a VM alongside storing templates to create resources
and services, configurations and finally to store data in different format (csv, parquet,
text, json, etc) that would be called from other services in order to run hosted
application.

It’s considered as one of the most preferred services to host large volumes of data due to
its low cost alongside the good data persistence properties; it’s as well highly used as a
data backup of other data sources in order to recover data in case of loss.

It has great integration with various Hadoop distributions such as Hortonworks, CDH,
MapR, EMR and DataProc.

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 6/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

Although they have similar use-cases, it’s worth noting to keep in mind some key
differences between hdfs and s3 bucket that would be taken into consideration when
choosing a data architecture.

S3 is an object store whereas HDFS is a distributed file system meaning that fault
tolerance is guaranteed.

S3 has an unlimited storage capacity in the cloud which is not the case on HDFS.

HDFS is hosted on physical machines; computing tasks could be done on hdfs aside from
storing data.

Common commands: https://www.thegeekstuff.com/2019/04/aws-s3-cli-examples/

S3 pricing informations are: https://aws.amazon.com/s3/pricing/?nc1=h_ls

3 — Querying S3 data: Data Crawler & Catalog

Just like hive in Hadoop , querying data stored in s3 using SQL queries is also possible

Three main services where SQL syntax from.

AWS offer Glue|Athena|Redshift Spectrum services to transform data stored in s3 files

into structured tables that would be partitioned & rapidly queried.

Selection would depend on the user’s purpose: that would be explained in the rest of
this article.

4 — AWS Glue:

Similarly AWS Glue is a must-known service for AWS Data Engineers; it’s a serverless
service

It has elements to transform files into tables on the s3 namely the Crawler & Data
Catalog

Data Catalog refers to the set of persistent metadata stored in AWS Glue; it contains
table definitions, schemas, job definitions and other control information to manage the

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 7/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

Glue environment. It’s considered by AWS as a drop-in replacement to the apache Hive
MetaStore,

The classifier defines the data schema from a data file.AWS Glue provides data
classifiers for mostly used files types such as CSV, JSON, AVRO, XML, and others.

More informations are provided on the AWS Glue documentation:

https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
https://docs.aws.amazon.com/glue/latest/dg/components-key-concepts.html

Crawler: A crawler is a program that connects to the data store (source & target), uses
the list of classifiers to define the data schema and then creates the metadata tables in
the data catalog.

The AWS Glue is used to apply a set of transformations on the defined tables.

Glue ETL job; it’s composed of target data, source data alongside transformation script.

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 8/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

The used transformations could be a set of operations on raw data to be cleaned and
transformed into an exploitable form.

Since Glue is a Serverless service; no physical machine that hosts ETL jobs and then
pricing is only based on the executed jobs.

The first million access requests to the AWS Glue Data Catalog per month are free. If you
exceed a million requests in a month, you will be charged $1.00 per million requests
over the first million.

https://aws.amazon.com/glue/pricing/?nc1=h_ls

Glue with Spark : A glue package is provided in order to enable spark accessibility to
Glue data catalog.

It extends spark and provides an entry point to read and write glue dynamicFrames from
and to Amazon S3, the AWS Data Catalog and the JDBC through a class called Glue
context associated with a Spark Context.

Here I explain how to develop steps used to develop pyspark code using a GlueContext.

Update PySpark driver environment variables: add these lines to your ~/.bashrc (or
~/.zshrc) file.

$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

This operation would launch jupyter with a default spark session instead of launching
pyspark shell.

Then gluepyspark command would enable launching jupyter notebook with glue jars
and a spark session.

And then gluesparksubmit command would be used while launching the python script.

5 — Athena Service :

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 9/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

AWS Athena is a similar serveless (No server to mange) service used to create and query
s3 data with an SQL syntax;

It’s not used in a code deployment service (in production) but used as an interactive
query search in order to search & analyze information alongside making statistical
visualizations in QuickSight service.

Athena uses AWS Glue data catalog (default) to store metadata containing table
definition information or Hive metastore.

By choosing the first option (glue), the table would be visible in the Glue console and
ETL jobs could be done on it.

There are multiple methods to create tables and execute queries from a local machine on
Athena service.

With AWS commands:

$ aws athena start-query-execution \

- query-string "CREATE EXTERNAL ... ;"\
- query-execution-context Database=default \
- result-configuration OutputLocation=s3://testbucket/atm/

With Terraform :

From the template of table configuration athena.tf :

1 provider "aws" {
2 region = "your_region"
3 access_key = "your_aws_access_id"
4 secret_key = "your_aws_secret"
5 }
6 resource "aws_glue_catalog_table" "aws_glue_catalog_table" {
7 name = "atmdfromterra"
8 database_name = "default"
9
10 table_type = "EXTERNAL_TABLE"
11
12 parameters = {

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 10/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

13 EXTERNAL = "FALSE"
14 # "field.delim" = ","
15 # "skip.header.line.count" = "1"
16 }
17
18 storage_descriptor {
19 location = "s3://test/atm/"
20 input_format = "org.apache.hadoop.mapred.TextInputFormat"
21 output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
22 ser_de_info {
23 name = "my-serde"
24 serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
25 parameters = {
26 "serialization.format" = ","
27 "line.delim" = "\n"
28 "field.delim" = ","
29 "skip.header.line.count" = "1"
30 }
31 }
32
33
34 columns {
35 name = "DATE"
36 type = "string"
37 }
38
39 columns {
40 name = "ATM_ID"
41 type = "int"
42 }
43
44 columns {
45 name = "CLIENT_OUT"
46 type = "int"
47 }
48
49 }
50 }

athena.tf hosted with ❤ by GitHub view raw

$ terraform plan -out=table.plan

$ terraform apply table.plan

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 11/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

With boto3 :

1 config = Config(region_name = 'your_region') # if no default region is defined

2 athena_client = boto3.client('athena',config = config, aws_access_key_id=ACCESS_KEY, aws_secret_ac
3 response = athena_client.start_query_execution(
4 QueryString="CREATE EXTERNAL TABLE IF NOT EXISTS atmdata(DATE STRING,ATM_ID INT,CLIENT_OUT INT)
5 QueryExecutionContext={ 'Database': 'default' },
6 ResultConfiguration={ 'OutputLocation':'s3://testbucket/'}
7 )

athena.py hosted with ❤ by GitHub view raw

6 — Redshift Spectrum Service:

AWS offer smart feature called “Redshift Spectrum” that enables application of Redshift
queries on data stored in S3 bucket; we would benefit from the low costing hosting of
the s3 while keeping processing near real-time

Redshift Spectrum works by introducing a layer between after that execute Redshift
queries on the Redshift cluster behave. This layer is scaled and managed by Amazon
according to the query usage and Price is charged based on the amount of data scanned
in S3.

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 12/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

AWS uses Data catalog to store schema definition of data accessed in s3 ; namely there
are two options to choose a data catalog for Redshift Spectrum that are : Athena data
catalog, Glue catalog and hive metastore of an EMR cluster.

7 — EC2 Service:

The Amazon Elastic Compute Cloud (EC2) is the historically the first launched Amazon
service.

It’s a standard resizable computing service.

It’s used to deploy on a user-selected machine, where the development team could select
an operating system alongside machine size adapted to the deployed application.

It’s used to make code tests and deploy high performance applications.

Connection to the EC2 is intuitively done by establishing and ssh connection

ssh -i "your_aws_key.pem" [email protected]

Data transfer scp from EC to local machine

scp -i "your_aws_key.pem" ubuntu@ec2-xx-xx-x-

xxx.compute.amazonaws.com:/path_to_your_files
path_to_your_local_folder

We could as well open a web-interface service (jupyter , zeppelin, ambari, etc ) running
on the EC2 from a local machine browser.

ssh -i "your_aws_key.pem" -N -f -L 8888:localhost:8888 ubuntu@ec2-xx-

xx-x-xxx.compute.amazonaws.com

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 13/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

8 — AWS ElasticMapReduce Cluster (EMR)

It’s the AWS platform of hadoop cluster,it contains the well-known hadoop components
to execute High performances MapReduce jobs ( HDFS, YARN,SPARK) on high vol data
alongside other big data services (Hive, Hbase, pig, Oozie, Zeppelin, Zookeeper, etc)

EMR uses EC2 as nodes (master and slave) to distribute data storage and processing.

As the ssh access is enabled by default, configuration of cluster components

(hive,yarn,spark, etc) is saved in a template containing the value of properties called
during the creation of cluster.

Creation, configuration and removal of the EMR cluster is enabled with

terraform/awscli and Software Development Kits.

The good news that , for architectures where distributed jobs are executed infrequently (
1 time per day / week /month/ etc), the cluster could be automatically terminated after
job run since hdfs data ( input and output data ) are stored in S3.

9 — AWS Lambda :

Lambda is a computing service; it’s highly loved due to its auto-scalability properties and
low cost. It’s triggered by an event (there many sources), and jobs are executed only if
this event occurs; hence there’s no service to pay during inactivity.

Adaptable to automate application workflow and intermediate small tasks, used as well
to collect data from users with kinesis.

10 — Kinesis Service:

Kinesis : is a data ingestion service ( equivalent to kafka )used in data streaming for real-
time application, boto3 is a well-known client to publish & consume data in python
application, it’s adaptable with structured spark streaming to transform the memorized
data flow.

Storing received data into s3/redshift/DynamoDB could be automated (without

developing code) with kinesis firehose and a dedicated lambda function to record
streams and then periodically update data sources.
https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 14/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

A code for a kinesis producer with boto3 :

1 import boto3
2 import json
3 from datetime import datetime
4 import calendar
5 import random
6 import time
7
8 my_stream_name = 'python-stream'
9
10 kinesis_client = boto3.client('kinesis', region_name='us-east-1')
11
12 def put_to_stream(thing_id, property_value, property_timestamp):
13 payload = {
14 'prop': str(property_value),
15 'timestamp': str(property_timestamp),
16 'thing_id': thing_id
17 }
18
19 print payload
20
21 put_response = kinesis_client.put_record(
22 StreamName=my_stream_name,
23 Data=json.dumps(payload),
24 PartitionKey=thing_id)
25
26 while True:
27 property_value = random.randint(40, 120)
28 property_timestamp = calendar.timegm(datetime.utcnow().timetuple())
29 thing_id = 'aa-bb'
30
31 put_to_stream(thing_id, property_value, property_timestamp)
32
33 # wait for 5 second
34 time.sleep(5)

kinesis_producer.py hosted with ❤ by GitHub view raw

A code for a kinesis consumer with boto3 :

1 import boto3
2 import json
https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 15/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

3 from datetime import datetime

4 import time
5
6 my_stream_name = 'python-stream'
7
8 kinesis_client = boto3.client('kinesis', region_name='us-east-1')
9
10 response = kinesis_client.describe_stream(StreamName=my_stream_name)
11
12 my_shard_id = response['StreamDescription']['Shards'][0]['ShardId']
13
14 shard_iterator = kinesis_client.get_shard_iterator(StreamName=my_stream_name,
15 ShardId=my_shard_id,
16 ShardIteratorType='LATEST')
17
18 my_shard_iterator = shard_iterator['ShardIterator']
19
20 record_response = kinesis_client.get_records(ShardIterator=my_shard_iterator,
21 Limit=2)
22
23 while 'NextShardIterator' in record_response:
24 record_response = kinesis_client.get_records(ShardIterator=record_response['NextShardIterato
25 Limit=2)
26
27 print record_response
28
29 # wait for 5 seconds
30 time.sleep(5)

kinesis_consumer.py hosted with ❤ by GitHub view raw

III — Putting it all together : design of a cloud-based data architecture

Once functional architecture is established; it would be translated into a technical

architecture that respects constraints and delivers the functional needs with a minimum
of cost and a satisfied latency.

This technical environment includes the technical environment (languages, frameworks,

etc) alongside the data storing systems.

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 16/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

Many criterions have to be put into consideration while choosing a datastore/database

and computing services mainly; data persistence, hosting cost, security, latency, data
structures.

1 — A Use case for Batch processing:

In this use-case, data is monthly charged into database and then data processing is
triggered with the new incoming data.

Therefore, we won’t need a database that supports real time processing to store data
flow.

We could use a datalake to execute spark transformations in order to clean and

transform the batch data and then launch the classification model to evaluate risk
quality.

HDFS or S3 for input data?

a- First scenario: Input and output sources are stored on S3 bucket:

Running spark jobs on data stored on Hdfs would enable using the yarn component of
hadoop and then we could benefit from Scheduling capabilities of yarn and then
accelerate spark jobs.

b- Second scenario: Input sources on HDFS and output (target) sources on S3 bucket:

Launching spark jobs on an EMR cluster would be an adequate choice since large data
volumes are rapidly transformed with the cluster resources.

On the other side, keeping running on multiple EC2 instances of the EMR without
activity (as processing is triggered once a month) would lead to a cost overflow and then
an expensive monthly billing.

c — Third scenario:

The evolved solutions consist in using s3 to store input and output data for a long time
and temporarily store input data in a created hdfs nodes of an EMR cluster in order to
run the spark jobs.

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 17/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

And then the cluster would be auto-terminated once last job is successfully done.

How to generate output source: In order to run spark jobs on the EMR cluster we could
use an EC2 as a client server and then in order to directly a spark dataframe into a glue
data catalog, we have to use a spark-glue connector.

Creating dashboard: In order to create a dashboard and make analysis on input and
output data sources we would use an Athena service to launch search Queries from an
interactive console as well as connecting that source to a QuickSight service in order to
make visualizations.

Automating the cluster creation workflow & sources alimentations?

We should think about a process to automate the creation of cluster once data are
charged into the S3 bucket

It’s feasible by introducing a dedicated lambda triggered once a creation event is done
on the input s3, then cluster configuration file the cluster would be charged from a
bucket to create the cluster with boto3 client.

Once the cluster is ready, we would start EC2 machine to launch the application and the
upload the application scripts from s3 to the EC2 and lunch it

Secondly, we have created a process that automatically updates Athena sources once
new data arrives on both input and output buckets and charges new data.

There are two scenarios to do that: lunch update scripts from the launch EC2 machine.

The second option is to create two dedicated lambda functions that are woken-up once
new data arrives and automate the update of existing data with a boto3 client ( if using
python ).

Once the workflow of application is terminated we could as (auto-stopped machine)

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 18/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

2- A use case for online processing: Real time fraud detection

In this architecture, we have adopted two reactive components (the first one to match
transactions with client old behavior and output a first alert), and then the second
component is to identify the client based on its confidential information (biometry, face
identification, voice recognition, etc).

As historical data sources are requested by 3 clients (reading, writing) in real time, our
data source should support real time processing.

Redshift cluster or Redshift Spectrum?

Amazon Redshift excels at running complex analytics queries, joins and aggregations
and provides a real time response over large datasets due its high performance local
disks.

Redshift service is per storage node pricing; it runs at about ~$1,000 / TB / Year which
is (x4 times) more expensive than the S3 bucket (~$250 / TB / Year).

On the other side, Redshift Spectrum is a serverless service; pricing is based on scanned
data ($5 per terabyte of data scanned)

Amazon Redshift spectrum users can benefit from the cheap storage price of the S3 and
then run analytics queries, filter, aggregate and group data with the spectrum layer

Noting that, Redshift spectrum queries do run queries slower than the Amazon Redshift
cluster.

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 19/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

Indeed, running frequent queries on Amazon Redshift would probably make scanning
data costs more than just storing in Redshift, it’s to move data back into the cluster.

In a nutshell, there is no intuitive choice between the two options; a smart way to
simulate the daily incoming workload of the application and estimate average daily cost
and latency.

If scanning data is more costly than storage, redshift cluster would be best option;
otherwise, if redshift spectrum gives an acceptable latency while incoming workload
reaches its maximum and the cost is your priority then Spectrum would be the good
option for you.

Data collection operation could be hosted by a lambda function connected with kinesis
service and then sent via a kinesis stream service result to the data transformation unit.

Deploying the transformation jobs :

Then data transformation results alongside classification algorithms could be deployed

on an EC2 machine with an adapted size.

Or with a separate lambda function where data is ingested through kinesis streams.

We could use an s3 bucket to store classification algorithm activity and then a lambda
function to update new incoming data into the Athena table.

Backup data stores : use of data backups to replicate data is as well a mandatory
practice while designing architectures and s3 is the most convenient choice.

Therefore, technical architecture would be like :

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 20/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

2 - A Friendly Intro to Helm

3 - Software Management Fundamentals

Sign up for Top 10 Stories

By The Startup

Get smarter at building your thing. Subscribe to receive The Startup's top 10 most read stories —
delivered straight into your inbox, once a week. Take a look.

Your email

Get this newsletter

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information
about our privacy practices.

AWS Big Data Data Science

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1
About Help Legal21/22
5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium
About Help Legal

Get the Medium app

https://medium.com/swlh/a-guide-on-designing-aws-data-architectures-b331ce9dbbc1 22/22

Aws Glu
No ratings yet
Aws Glu
17 pages
Data Lake On The Aws Cloud With Talend Big Data Platform
100% (1)
Data Lake On The Aws Cloud With Talend Big Data Platform
13 pages
AWS Glue for ETL Developers
No ratings yet
AWS Glue for ETL Developers
5 pages
Ram Manohar Bheemana: Contact About Me
No ratings yet
Ram Manohar Bheemana: Contact About Me
7 pages
WP 8 Tips To Simplify AWS Backup and Recovery
No ratings yet
WP 8 Tips To Simplify AWS Backup and Recovery
9 pages
1 AWS Analytics and Data Lakes
No ratings yet
1 AWS Analytics and Data Lakes
15 pages
Chapter 12 - Data Warehousing and Online Analytical Processing
No ratings yet
Chapter 12 - Data Warehousing and Online Analytical Processing
20 pages
Principal Architect - Big Data Architect - Solutions Architect Resume
No ratings yet
Principal Architect - Big Data Architect - Solutions Architect Resume
9 pages
SQL Server Migration Methodology
50% (2)
SQL Server Migration Methodology
10 pages
Apache NiFi Vs Streamsets
No ratings yet
Apache NiFi Vs Streamsets
6 pages
iCEDQ DataOps Platform Overview
No ratings yet
iCEDQ DataOps Platform Overview
5 pages
SQL Stored Procedure Optimization Tips
No ratings yet
SQL Stored Procedure Optimization Tips
3 pages
Redshift ETL with AWS Glue & Step Functions
No ratings yet
Redshift ETL with AWS Glue & Step Functions
31 pages
Python Data Pipeline Guide
No ratings yet
Python Data Pipeline Guide
38 pages
SSIS in The Cloud
No ratings yet
SSIS in The Cloud
17 pages
Data Skills & Roles Cheat Sheet
No ratings yet
Data Skills & Roles Cheat Sheet
11 pages
Data Warehousing On AWS
No ratings yet
Data Warehousing On AWS
27 pages
AWS Boto - 1
No ratings yet
AWS Boto - 1
55 pages
Data Warehouse Testing Challenges
No ratings yet
Data Warehouse Testing Challenges
10 pages
Database Migration-What Do You Need To Know Before You Start - AWS Database Blog
No ratings yet
Database Migration-What Do You Need To Know Before You Start - AWS Database Blog
8 pages
Datalakehouse Implementation Proposal - v1.1
No ratings yet
Datalakehouse Implementation Proposal - v1.1
5 pages
AWS Data Analytics - Technical - Student
No ratings yet
AWS Data Analytics - Technical - Student
160 pages
Meeting DWH QA Challenges Part 1
No ratings yet
Meeting DWH QA Challenges Part 1
9 pages
Big Data and Visualization
No ratings yet
Big Data and Visualization
141 pages
Snow SQL
No ratings yet
Snow SQL
3 pages
ETL QA Sample Scenario V3
100% (2)
ETL QA Sample Scenario V3
3 pages
ETL Project Lifecycle Guide
No ratings yet
ETL Project Lifecycle Guide
3 pages
Google Interview Prep Guide
No ratings yet
Google Interview Prep Guide
100 pages
Azure AI Solution Design Exam Prep
No ratings yet
Azure AI Solution Design Exam Prep
112 pages
Building Data Pipelines - 3
No ratings yet
Building Data Pipelines - 3
29 pages
Backend Database Stored Procedure Testing Strategy, Approach
No ratings yet
Backend Database Stored Procedure Testing Strategy, Approach
8 pages
Aws Glue Consulting - Helical IT Solutions
No ratings yet
Aws Glue Consulting - Helical IT Solutions
3 pages
Rahul Sharma
100% (1)
Rahul Sharma
2 pages
Glue DG
No ratings yet
Glue DG
639 pages
Databricks Academy Self Paced Content
No ratings yet
Databricks Academy Self Paced Content
18 pages
DW-BI Best Practices
100% (1)
DW-BI Best Practices
15 pages
Unite Real-Time and Batch Analytics With AWS Glue
No ratings yet
Unite Real-Time and Batch Analytics With AWS Glue
28 pages
Databricks Cloud Workshop: SF, 2015-05-20! Download Slides
100% (1)
Databricks Cloud Workshop: SF, 2015-05-20! Download Slides
168 pages
INFA ETL Standards and Guidelines
No ratings yet
INFA ETL Standards and Guidelines
55 pages
IT & Big Data Professional Profile
No ratings yet
IT & Big Data Professional Profile
7 pages
AWS 05 DataLake
No ratings yet
AWS 05 DataLake
78 pages
Database Migration Case Study
No ratings yet
Database Migration Case Study
3 pages
Trivago Pipeline
No ratings yet
Trivago Pipeline
18 pages
Cloudera Hive
No ratings yet
Cloudera Hive
132 pages
Ambari Operations
No ratings yet
Ambari Operations
194 pages
The Modernization of The Data Warehouse
No ratings yet
The Modernization of The Data Warehouse
17 pages
SQL Replication Setup Guide
No ratings yet
SQL Replication Setup Guide
22 pages
Top 5 Data Engineering Projects You Can't Afford To Miss by Yusuf Ganiyu Feb, 2024 Medium
No ratings yet
Top 5 Data Engineering Projects You Can't Afford To Miss by Yusuf Ganiyu Feb, 2024 Medium
23 pages
7 Best Practices For Building Data Applications On Snowflake
No ratings yet
7 Best Practices For Building Data Applications On Snowflake
15 pages
Talend Architecture White Paper - Branded - Final 11302020
No ratings yet
Talend Architecture White Paper - Branded - Final 11302020
18 pages
Cloud Integration Specialist Guide
No ratings yet
Cloud Integration Specialist Guide
6 pages
AWS - 06 - Best Practice To Secure DataLake
No ratings yet
AWS - 06 - Best Practice To Secure DataLake
75 pages
Hadoop/Spark Developer Resume
No ratings yet
Hadoop/Spark Developer Resume
7 pages
SQL Interview Qustion
No ratings yet
SQL Interview Qustion
47 pages
Google Data Engineer Certification Guide
No ratings yet
Google Data Engineer Certification Guide
4 pages
AWS Snow Family-Edited
No ratings yet
AWS Snow Family-Edited
6 pages
Data Lake Implementation Improved Processing Time by 4X
No ratings yet
Data Lake Implementation Improved Processing Time by 4X
5 pages
Systems Analysis and Design 3
No ratings yet
Systems Analysis and Design 3
5 pages
Design Data Architecture 1st Unit
No ratings yet
Design Data Architecture 1st Unit
58 pages
Aiesec X Aws Workshop
No ratings yet
Aiesec X Aws Workshop
45 pages
Web Design Company Leicestershire
No ratings yet
Web Design Company Leicestershire
2 pages
All Portions
No ratings yet
All Portions
85 pages
Literature Review On Online Bus Ticketing
100% (2)
Literature Review On Online Bus Ticketing
4 pages
Introduction To SQL: Database Languages
No ratings yet
Introduction To SQL: Database Languages
6 pages
COMP10001 Final Exam
No ratings yet
COMP10001 Final Exam
18 pages
Quantum Spark 1500, 1600 AND 1800 Appliance Series: CLI Reference Guide
No ratings yet
Quantum Spark 1500, 1600 AND 1800 Appliance Series: CLI Reference Guide
1,425 pages
Lab Report 1
No ratings yet
Lab Report 1
12 pages
Middleware Architecture Course Guide
No ratings yet
Middleware Architecture Course Guide
17 pages
An Independent Review of COOL:2E Release 7.0
No ratings yet
An Independent Review of COOL:2E Release 7.0
73 pages
Python Lab Manual
No ratings yet
Python Lab Manual
27 pages
THEORY of C
No ratings yet
THEORY of C
56 pages
Cygwin Build For Windows
No ratings yet
Cygwin Build For Windows
4 pages
Bootstrap Was Developed by Mark Otto and Jacob Thornton at Twitter. It Was Released As An Open Source Product in August 2011 On Github
No ratings yet
Bootstrap Was Developed by Mark Otto and Jacob Thornton at Twitter. It Was Released As An Open Source Product in August 2011 On Github
14 pages
Custom IDoc Type Creation Guide
100% (1)
Custom IDoc Type Creation Guide
8 pages
DSA Question
No ratings yet
DSA Question
12 pages
Cwsdpmi
No ratings yet
Cwsdpmi
3 pages
What Is Implicit Cursor in Oracle
No ratings yet
What Is Implicit Cursor in Oracle
6 pages
Data Lab Student Guide Part 1
No ratings yet
Data Lab Student Guide Part 1
6 pages
Database Systems Lab 7 PL/SQL Programming 1
No ratings yet
Database Systems Lab 7 PL/SQL Programming 1
3 pages
Building of Personal Ai Assistant Edt
No ratings yet
Building of Personal Ai Assistant Edt
5 pages
Appendix Tensorflow PDF
50% (8)
Appendix Tensorflow PDF
14 pages
JavaScript Async/Await Guide
No ratings yet
JavaScript Async/Await Guide
9 pages
SOLID Principles in JavaScript
No ratings yet
SOLID Principles in JavaScript
5 pages
Tutorial Windows
100% (1)
Tutorial Windows
3 pages
Oop ST
No ratings yet
Oop ST
10 pages
Buffer Overflow Attack Prevention
No ratings yet
Buffer Overflow Attack Prevention
17 pages
Fall 2011 COP 3223 (C Programming) Syllabus: Will Provide The Specifics To His Section
No ratings yet
Fall 2011 COP 3223 (C Programming) Syllabus: Will Provide The Specifics To His Section
5 pages
(Mastering Computer Science) Sufyan Bin Uzayr - Mastering Rust - A Beginner's Guide-CRC Press (2022)
No ratings yet
(Mastering Computer Science) Sufyan Bin Uzayr - Mastering Rust - A Beginner's Guide-CRC Press (2022)
319 pages
Guideline Development and Operation (DevOps) - V1.0 - 2
No ratings yet
Guideline Development and Operation (DevOps) - V1.0 - 2
33 pages
Data Insert/Update Using RESTFul Web Service (Hibernate + Spring + Maven)
No ratings yet
Data Insert/Update Using RESTFul Web Service (Hibernate + Spring + Maven)
15 pages

How To Design AWS Data Architectures - by Narjes Karmeni - The Startup - Medium

Uploaded by

How To Design AWS Data Architectures - by Narjes Karmeni - The Startup - Medium

Uploaded by

5/15/2021 How to Design AWS Data Architectures | by narjes karmeni | The Startup | Medium

How to Design AWS Data Architectures

1 — Batch Processing vs Real-time Processing :

Batch Processing : In batch processing, a planned system is triggered periodically to

Real-time (Online) Processing : Real-time data architecture is dedicated to process

that arrives in ( microsecond|second| minutes) and used to act/react in real time in

2- How to design architecture ?

3- A . Functional Architecture for real-time fraud detection :

Once a decision is delivered by a pre-trained machine learning classification model, we

II — AWS Data Stack :

1 — Managing AWS Services :

$ sudo apt install awscli

And the add credentials through adding a config file in ~/.aws/credentials

Hence you have to run:

Boto3 & sdk-aws :

Common commands: https://www.thegeekstuff.com/2019/04/aws-s3-cli-examples/

S3 pricing informations are: https://aws.amazon.com/s3/pricing/?nc1=h_ls

3 — Querying S3 data: Data Crawler & Catalog

Three main services where SQL syntax from.

AWS offer Glue|Athena|Redshift Spectrum services to transform data stored in s3 files

More informations are provided on the AWS Glue documentation:

With AWS commands:

$ aws athena start-query-execution \

From the template of table configuration athena.tf :

athena.tf hosted with ❤ by GitHub view raw

$ terraform plan -out=table.plan

1 config = Config(region_name = 'your_region') # if no default region is defined

athena.py hosted with ❤ by GitHub view raw

6 — Redshift Spectrum Service:

It’s a standard resizable computing service.

Connection to the EC2 is intuitively done by establishing and ssh connection

ssh -i "your_aws_key.pem" [email protected]

Data transfer scp from EC to local machine

scp -i "your_aws_key.pem" ubuntu@ec2-xx-xx-x-

ssh -i "your_aws_key.pem" -N -f -L 8888:localhost:8888 ubuntu@ec2-xx-

8 — AWS ElasticMapReduce Cluster (EMR)

As the ssh access is enabled by default, configuration of cluster components

Creation, configuration and removal of the EMR cluster is enabled with

Storing received data into s3/redshift/DynamoDB could be automated (without

A code for a kinesis producer with boto3 :

kinesis_producer.py hosted with ❤ by GitHub view raw

A code for a kinesis consumer with boto3 :

3 from datetime import datetime

kinesis_consumer.py hosted with ❤ by GitHub view raw

III — Putting it all together : design of a cloud-based data architecture

Once functional architecture is established; it would be translated into a technical

This technical environment includes the technical environment (languages, frameworks,

Many criterions have to be put into consideration while choosing a datastore/database

1 — A Use case for Batch processing:

We could use a datalake to execute spark transformations in order to clean and

HDFS or S3 for input data?

a- First scenario: Input and output sources are stored on S3 bucket:

Automating the cluster creation workflow & sources alimentations?

Once the workflow of application is terminated we could as (auto-stopped machine)

2- A use case for online processing: Real time fraud detection

Redshift cluster or Redshift Spectrum?

Deploying the transformation jobs :

Then data transformation results alongside classification algorithms could be deployed

Therefore, technical architecture would be like :

1 - A Quick Tutorial on Kubernetes for Developers

2 - A Friendly Intro to Helm

3 - Software Management Fundamentals

Sign up for Top 10 Stories

Get this newsletter

AWS Big Data Data Science

Get the Medium app

You might also like