How To Design AWS Data Architectures
by Narjes Karmeni | The Startup | Medium
Introduction:
With the continued growth of cloud computing services, migration to the public cloud keeps increasing across many industries thanks to its flexibility and the wide range of automated services it offers. Amazon Web Services (AWS) provides a broad platform of managed services that help build, secure, scale and maintain data architectures, so a company does not have to worry about server maintenance and failures. It offers a set of services for the common data engineering tasks (collect, store, transform, analyze, search, visualize and ingest data) with scalable features adapted to different requirements and use cases.
In this article, I will present the main services every data engineer should know when deploying data architectures on AWS, walk through the steps used to define an architecture adapted to a given use case, and then show how to select the technical environment among S3, Glue, EMR, Athena, Kinesis, QuickSight, Data Pipeline, EC2 and Lambda.
I — Data Architectures :
When it comes to data processing, there are more ways to do it than ever. How you do it
and the tools you choose depend largely on what your purposes are for processing the
data in the first place. In many cases, you’re processing historical and archived data and
time isn’t so critical. You can wait a few hours for your answer, and if necessary, a few
days. Conversely, other processing tasks are crucial, and the answers need to be
delivered within seconds to be of value. Here are the differences among real-time, near
real-time, and batch processing, and when each is your best option.
While gathering specifications and business requirements, a first step is to identify existing data sources (if any) and to define the use case (simple projects) or multiple use cases (complex projects). Based on the expressed needs, one can decide whether processing should run continuously as data arrives or as batch processing over a longer period of time. These business requirements can then be translated into a functional architecture by defining the elementary components that transform arriving raw data into an exploitable state that fulfills the requirements. While designing this prototype, it is important to take extensibility into consideration and propose a flexible architecture that can easily evolve. In this article, I will put the design of a functional architecture into practice, alongside best practices for designing low-cost and automated AWS data architectures, through two projects.
3. A. Functional Architecture for Fraud Detection: The first project consists of detecting fraud in payment requests arriving from web and mobile applications, and then performing fraud analysis through information search and statistical visualizations. Hence, we need a real-time architecture to react against fraud attempts while serving payment requests.
In this architecture prototype, payment requests continuously arrive from web and mobile clients; they are grouped and merged with the historical user data (a database) in order to match the requested payment operation against past user behavior.
3. B. Functional Architecture for Risk Prediction & Analysis: The second project consists of monthly predicting consumers' risk quality with a classification model and then analyzing the profiles flagged with bad risk in order to make better decisions.
Data is collected into a database and processed at each end of month in order to detect clients with bad risk scores and analyze their profiles. The data is loaded into the data lake, then cleaned, and indicators are computed in order to feed a visualization dashboard.
A second step is to apply feature engineering on the cleaned data and then run the pretrained classification model. We also add a dashboard to analyze the detected profiles, discover groups and identify the factors impacting risk quality.
Let's now give an overview of the data services and discuss their features in order to select the services best adapted to each target function.
There are several tools used to manage and control the AWS services attached to a user account, given a service region and the account credentials: the access key AWS_ACCESS_KEY_ID and the secret key AWS_SECRET_ACCESS_KEY.
Awscli: the AWS Command Line Interface (CLI) is a unified tool to manage and control AWS services from the command line and thus automate this control through scripts.
The simplest way to set it up on an Ubuntu machine is to install the awscli package (for example with pip or apt) and then run the configuration command:
$ aws configure
AWS Access Key ID: your_access_key
AWS Secret Access Key: your_secret_key
Default region name [us-west-2]: your_aws_region
Default output format [None]: json
Another way to define the default credentials used by all of these tools is to set the environment variables:
$ export AWS_ACCESS_KEY_ID=<your_access_key>
$ export AWS_SECRET_ACCESS_KEY=<your_secret_key>
$ export AWS_REGION=<your_aws_region>
Terraform is software that lets users define infrastructure as code: manage the full lifecycle, update configurations without service interruption and delete resources that are no longer needed, through templates written in the HashiCorp syntax. It supports many providers, including the AWS provider.
This link provides a guide to install and set up Terraform on a local machine:
https://www.techrepublic.com/article/how-to-install-terraform-on-ubuntu-server/
AWS offers a set of SDKs for developers, such as boto3 for Python developers and aws-sdk for JavaScript developers.
These tools enable developers to create, delete, update and use AWS services.
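As a quick taste of the SDK route, here is a minimal boto3 sketch that lists the buckets of an account and creates a new one (the bucket name is a placeholder and must be globally unique; credentials are read from the environment or ~/.aws):

import boto3

# boto3 picks up credentials from environment variables or ~/.aws/credentials
s3 = boto3.client("s3", region_name="us-east-1")

# List the buckets attached to the account
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Create a new bucket (placeholder name; must be globally unique)
s3.create_bucket(Bucket="my-example-data-bucket")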
2 — S3 bucket:
The first mandatory service that every AWS learner should know is the S3 bucket (a popular, must-use service). It is presented by AWS as an alternative data lake to HDFS on a Hadoop cluster and has several use cases: storing the activity logs of other services alongside user-application logs, storing application scripts to be executed on a VM as well as templates used to create resources and services, storing configurations, and finally storing data in different formats (CSV, Parquet, text, JSON, etc.) that other services read in order to run the hosted applications.
It’s considered as one of the most preferred services to host large volumes of data due to
its low cost alongside the good data persistence properties; it’s as well highly used as a
data backup of other data sources in order to recover data in case of loss.
It has great integration with various Hadoop distributions such as Hortonworks, CDH,
MapR, EMR and DataProc.
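For example, loading raw files into the data lake and reading them back takes one boto3 call each (bucket, key and file names below are placeholders):

import boto3

s3 = boto3.client("s3")

# Upload a local CSV file into the data lake
s3.upload_file("payments.csv", "my-datalake-bucket", "raw/payments/payments.csv")

# Read an object back into memory
obj = s3.get_object(Bucket="my-datalake-bucket", Key="raw/payments/payments.csv")
content = obj["Body"].read().decode("utf-8")
print(content[:200])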
Although they have similar use cases, it is worth keeping in mind some key differences between HDFS and the S3 bucket when choosing a data architecture.
S3 is an object store, whereas HDFS is a distributed file system whose replication guarantees fault tolerance.
S3 has unlimited storage capacity in the cloud, which is not the case for HDFS.
HDFS is hosted on physical machines, so computing tasks can run alongside the stored data.
Just like Hive in Hadoop, querying data stored in S3 with SQL queries is also possible.
Which one to select depends on the user's purpose, as explained in the rest of this article.
4 — AWS Glue:
Similarly, AWS Glue is a must-know service for AWS data engineers; it is a serverless service.
It has components that turn files on S3 into tables, namely the Crawler and the Data Catalog.
The Data Catalog is the set of persistent metadata stored in AWS Glue; it contains table definitions, schemas, job definitions and other control information used to manage the Glue environment. AWS considers it a drop-in replacement for the Apache Hive Metastore.
A classifier infers the data schema from a data file. AWS Glue provides built-in classifiers for the most common file types such as CSV, JSON, Avro, XML and others.
https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
https://docs.aws.amazon.com/glue/latest/dg/components-key-concepts.html
Crawler: a crawler is a program that connects to the data store (source and target), uses the list of classifiers to infer the data schema and then creates the metadata tables in the Data Catalog.
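A crawler can also be created and started programmatically; a minimal boto3 sketch (the IAM role, database name and S3 path are placeholders) might look like this:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans an S3 prefix and writes table metadata
# into a Data Catalog database
glue.create_crawler(
    Name="payments-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="default",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/raw/payments/"}]},
)

# Run it; the resulting tables become queryable from Glue, Athena and EMR
glue.start_crawler(Name="payments-crawler")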
AWS Glue is then used to apply a set of transformations on the defined tables through a Glue ETL job, which is composed of a data source, a data target and a transformation script.
These transformations are typically the operations that clean the raw data and turn it into an exploitable form.
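Such a job can be declared and launched with boto3 as well; in this sketch the script location, role and argument names are placeholders:

import boto3

glue = boto3.client("glue")

# Declare an ETL job whose transformation script is stored in S3
glue.create_job(
    Name="clean-payments-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",   # placeholder role
    Command={
        "Name": "glueetl",                               # Spark ETL job
        "ScriptLocation": "s3://my-scripts-bucket/clean_payments.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
)

# Trigger a run; Arguments are passed to the script as job parameters
run = glue.start_job_run(
    JobName="clean-payments-job",
    Arguments={"--input_path": "s3://my-datalake-bucket/raw/"})
print(run["JobRunId"])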
Since Glue is a serverless service, no physical machine hosts the ETL jobs and pricing is based only on the executed jobs.
The first million access requests to the AWS Glue Data Catalog per month are free. If you
exceed a million requests in a month, you will be charged $1.00 per million requests
over the first million.
https://aws.amazon.com/glue/pricing/?nc1=h_ls
Glue with Spark: a Glue package is provided to give Spark access to the Glue Data Catalog.
It extends Spark and provides an entry point, a class called GlueContext associated with a SparkContext, to read and write Glue DynamicFrames from and to Amazon S3, the AWS Data Catalog and JDBC sources.
Here are the steps used to develop PySpark code with a GlueContext.
Update PySpark driver environment variables: add these lines to your ~/.bashrc (or
~/.zshrc) file.
$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
This makes Jupyter start with a default Spark session instead of the plain PySpark shell.
The gluepyspark command then launches a Jupyter notebook with the Glue jars and a Spark session, and the gluesparksubmit command is used to launch a Python script.
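Inside such a notebook or script, the entry point looks roughly like this (a sketch assuming the Glue jars are available and that the catalog database and table already exist; the names are placeholders):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# GlueContext wraps the SparkContext and exposes the Glue Data Catalog
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Read a catalog table into a DynamicFrame and work on it as a Spark DataFrame
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="default", table_name="payments")   # placeholder names
df = dyf.toDF()
df.printSchema()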
5 — Athena Service :
AWS Athena is another serverless service (no server to manage) used to create tables over S3 data and query them with SQL syntax.
It is not used as a code deployment service (in production) but as an interactive query tool to search and analyze information, and to feed statistical visualizations in the QuickSight service.
Athena stores the metadata containing table definitions either in the AWS Glue Data Catalog (the default) or in a Hive metastore.
By choosing the first option (Glue), the table is visible in the Glue console and ETL jobs can be run on it.
There are multiple ways to create tables and execute queries on the Athena service from a local machine.
With Terraform :
provider "aws" {
  region     = "your_region"
  access_key = "your_aws_access_id"
  secret_key = "your_aws_secret"
}

resource "aws_glue_catalog_table" "aws_glue_catalog_table" {
  name          = "atmdfromterra"
  database_name = "default"

  table_type = "EXTERNAL_TABLE"

  parameters = {
    EXTERNAL = "FALSE"
    # "field.delim"            = ","
    # "skip.header.line.count" = "1"
  }

  storage_descriptor {
    location      = "s3://test/atm/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      name                  = "my-serde"
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters = {
        "serialization.format"   = ","
        "line.delim"             = "\n"
        "field.delim"            = ","
        "skip.header.line.count" = "1"
      }
    }

    columns {
      name = "DATE"
      type = "string"
    }

    columns {
      name = "ATM_ID"
      type = "int"
    }

    columns {
      name = "CLIENT_OUT"
      type = "int"
    }
  }
}
With boto3 :
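The boto3 snippet that originally followed here did not survive the extraction; a comparable sketch that creates the same table through Athena DDL and then queries it could look like this (the results bucket and database name are placeholders):

import boto3
import time

athena = boto3.client("athena", region_name="us-east-1")

def run_query(sql, database="default",
              output="s3://my-athena-results-bucket/queries/"):
    """Submit a query to Athena and poll until it finishes."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(1)

# Create an external table over the same S3 location as the Terraform example
run_query("""
CREATE EXTERNAL TABLE IF NOT EXISTS atmdfromboto (
  `date` string, atm_id int, client_out int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://test/atm/'
TBLPROPERTIES ('skip.header.line.count'='1')
""")

# Query it interactively
qid, state = run_query(
    "SELECT atm_id, SUM(client_out) FROM atmdfromboto GROUP BY atm_id")
print(state)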
6 — Redshift & Redshift Spectrum:
AWS offers a smart feature called Redshift Spectrum that makes it possible to run Redshift queries on data stored in an S3 bucket; we benefit from the low-cost hosting of S3 while keeping processing near real time.
Redshift Spectrum works by introducing a compute layer between S3 and the Redshift cluster that executes the queries against the data in S3. This layer is scaled and managed by Amazon according to query usage, and pricing is charged based on the amount of data scanned in S3.
AWS uses a data catalog to store the schema definitions of the data accessed in S3; the options for the Redshift Spectrum data catalog are the Athena data catalog, the Glue Data Catalog and the Hive metastore of an EMR cluster.
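Hooking Spectrum to the Glue catalog is done by creating an external schema from inside Redshift; here is a hedged sketch using the Redshift Data API (the cluster identifier, database, user and IAM role are placeholders):

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Map a Glue/Athena catalog database to an external schema visible in Redshift;
# Spectrum then scans the underlying S3 data directly.
rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'default'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
)

# External tables are then queried like local ones,
# e.g. SELECT ... FROM spectrum.atmdfromterra WHERE ...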
7 — EC2 Service:
Amazon Elastic Compute Cloud (EC2) is historically the first Amazon service launched.
It is used to deploy applications on a user-selected machine: the development team chooses an operating system and a machine size adapted to the deployed application.
It is used to run code tests and to deploy high-performance applications.
We can also reach a web interface (Jupyter, Zeppelin, Ambari, etc.) running on the EC2 instance from a local machine's browser.
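Launching such a machine programmatically is again a few boto3 calls (the AMI ID, key pair and instance type below are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Start a single instance; the AMI fixes the operating system,
# the instance type fixes CPU/RAM sizing.
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI (e.g. an Ubuntu image)
    InstanceType="t3.medium",
    KeyName="my-key-pair",             # placeholder key pair for SSH access
    MinCount=1,
    MaxCount=1,
)
print(resp["Instances"][0]["InstanceId"])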
It’s the AWS platform of hadoop cluster,it contains the well-known hadoop components
to execute High performances MapReduce jobs ( HDFS, YARN,SPARK) on high vol data
alongside other big data services (Hive, Hbase, pig, Oozie, Zeppelin, Zookeeper, etc)
EMR uses EC2 as nodes (master and slave) to distribute data storage and processing.
The good news that , for architectures where distributed jobs are executed infrequently (
1 time per day / week /month/ etc), the cluster could be automatically terminated after
job run since hdfs data ( input and output data ) are stored in S3.
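A transient cluster of this kind can be described in a single run_job_flow call: the steps point at scripts and data in S3 and the cluster shuts itself down when the last step finishes (names, roles, instance types and paths are placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster = emr.run_job_flow(
    Name="monthly-risk-scoring",
    ReleaseLabel="emr-6.4.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # When no steps remain, terminate the cluster instead of keeping it idle
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-risk-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-scripts-bucket/risk_scoring.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(cluster["JobFlowId"])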
9 — AWS Lambda :
Lambda is a computing service; it is much loved for its auto-scaling properties and low cost. It is triggered by an event (there are many possible sources), and jobs are executed only when this event occurs; hence there is nothing to pay during inactivity.
It is well suited to automating application workflows and intermediate small tasks, and is also used to collect data from users together with Kinesis.
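A handler consuming such an event stays very small; here is a sketch of a function wired to a Kinesis trigger (the record payloads are assumed to be the JSON documents produced by the stream writer shown in the next section):

import base64
import json

def lambda_handler(event, context):
    """Invoked by the Kinesis trigger with a batch of records."""
    for record in event["Records"]:
        # Kinesis data arrives base64-encoded inside the event
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # ... apply fraud rules or forward to the next stage here ...
        print(payload)
    return {"processed": len(event["Records"])}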
10 — Kinesis Service:
Kinesis is a data ingestion service (the equivalent of Kafka) used for data streaming in real-time applications. boto3 is a common client for publishing and consuming data in Python applications, and Kinesis also integrates with Spark Structured Streaming to transform the buffered data flow.
import boto3
import json
from datetime import datetime
import calendar
import random
import time

my_stream_name = 'python-stream'

kinesis_client = boto3.client('kinesis', region_name='us-east-1')

def put_to_stream(thing_id, property_value, property_timestamp):
    payload = {
        'prop': str(property_value),
        'timestamp': str(property_timestamp),
        'thing_id': thing_id
    }

    print(payload)

    # publish one JSON record; the partition key decides the target shard
    put_response = kinesis_client.put_record(
        StreamName=my_stream_name,
        Data=json.dumps(payload),
        PartitionKey=thing_id)

while True:
    property_value = random.randint(40, 120)
    property_timestamp = calendar.timegm(datetime.utcnow().timetuple())
    thing_id = 'aa-bb'

    put_to_stream(thing_id, property_value, property_timestamp)

    # wait for 5 seconds
    time.sleep(5)
On the consumer side, a small boto3 script can read the same stream back.
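A hedged sketch of such a consumer (it reuses the stream name above and reads from a single shard starting at the oldest available record):

import boto3
import json
import time

my_stream_name = 'python-stream'
kinesis_client = boto3.client('kinesis', region_name='us-east-1')

# Find the first shard and get an iterator starting at its oldest record
shard_id = kinesis_client.describe_stream(
    StreamName=my_stream_name)['StreamDescription']['Shards'][0]['ShardId']
shard_iterator = kinesis_client.get_shard_iterator(
    StreamName=my_stream_name,
    ShardId=shard_id,
    ShardIteratorType='TRIM_HORIZON')['ShardIterator']

while True:
    out = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in out['Records']:
        print(json.loads(record['Data']))
    shard_iterator = out['NextShardIterator']
    time.sleep(1)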
In the risk prediction use case, data is loaded monthly into a database and data processing is triggered by the newly arriving data.
Therefore, we do not need a database that supports real-time processing to store the data flow.
a — First scenario:
Running the Spark jobs on data stored in HDFS lets us use the YARN component of Hadoop, benefit from its scheduling capabilities and thus accelerate the Spark jobs.
b — Second scenario: input sources on HDFS and output (target) sources on an S3 bucket:
Launching the Spark jobs on an EMR cluster is an adequate choice, since large data volumes are transformed quickly with the cluster's resources.
On the other hand, keeping the multiple EC2 instances of the EMR cluster running without activity (as processing is triggered once a month) leads to a cost overflow and thus an expensive monthly bill.
c — Third scenario:
The evolved solution consists in using S3 to store the input and output data for the long term and temporarily staging the input data on the HDFS nodes of an EMR cluster created just to run the Spark jobs.
The cluster is then auto-terminated once the last job has completed successfully.
How to generate the output source: in order to run the Spark jobs on the EMR cluster, we can use an EC2 instance as a client server; and in order to write a Spark dataframe directly into a Glue Data Catalog table, we have to use the Spark-Glue connector.
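With the Glue libraries available on the cluster, writing the scored results straight into a catalog table is roughly as follows (database and table names are placeholders, and the target table is assumed to already exist in the catalog, for example created by a crawler):

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Stand-in for the DataFrame produced by the classification step
scored_df = spark.createDataFrame([("c1", 0.87)], ["client_id", "risk_score"])

# Convert it to a DynamicFrame and write it through the catalog definition
out = DynamicFrame.fromDF(scored_df, glue_context, "scored")
glue_context.write_dynamic_frame.from_catalog(
    frame=out,
    database="default",        # placeholder
    table_name="risk_scores",  # placeholder, must exist in the catalog
)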
Creating the dashboard: in order to create a dashboard and analyze the input and output data sources, we use the Athena service to launch search queries from an interactive console, and connect that source to the QuickSight service to build the visualizations.
We should also think about a process that automates the creation of the cluster once data has been loaded into the S3 bucket.
This is feasible by introducing a dedicated Lambda function triggered by the object-creation event on the input S3 bucket; the cluster configuration file is then loaded from a bucket and used to create the cluster with the boto3 client.
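A sketch of that Lambda: it is triggered by the S3 put event, loads a JSON cluster definition from a configuration bucket and forwards it to run_job_flow (bucket and key names are placeholders):

import json
import boto3

s3 = boto3.client("s3")
emr = boto3.client("emr")

CONFIG_BUCKET = "my-config-bucket"        # placeholder
CONFIG_KEY = "emr/cluster_config.json"    # placeholder: run_job_flow arguments as JSON

def lambda_handler(event, context):
    # The S3 put event tells us which input object just arrived
    record = event["Records"][0]["s3"]
    print("new object:", record["bucket"]["name"], record["object"]["key"])

    # Load the cluster definition (same structure as the run_job_flow arguments)
    config = json.loads(
        s3.get_object(Bucket=CONFIG_BUCKET, Key=CONFIG_KEY)["Body"].read())

    # Create the transient cluster; it terminates itself after the last step
    response = emr.run_job_flow(**config)
    return {"JobFlowId": response["JobFlowId"]}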
Once the cluster is ready, we start an EC2 machine to drive the application, upload the application scripts from S3 to the EC2 instance and launch them.
Secondly, we set up a process that automatically updates the Athena sources once new data arrives on both the input and output buckets and loads the new data.
There are two ways to do that: launch the update scripts from the EC2 machine started above, or create two dedicated Lambda functions that are woken up when new data arrives and automate the update of the existing tables with a boto3 client (if using Python).
In the fraud detection architecture, we have adopted two reactive components: the first one matches incoming transactions against the client's past behavior and outputs a first alert, and the second one identifies the client based on confidential information (biometrics, face identification, voice recognition, etc.).
As the historical data sources are requested by three clients (reading and writing) in real time, our data source should support real-time processing.
Amazon Redshift excels at running complex analytics queries, joins and aggregations, and provides real-time responses over large datasets thanks to its high-performance local disks.
The Redshift service is priced per storage node; it runs at about ~$1,000 / TB / year, which is roughly four times more expensive than the S3 bucket (~$250 / TB / year).
On the other side, Redshift Spectrum is a serverless service; pricing is based on the scanned data ($5 per terabyte of data scanned).
Amazon Redshift Spectrum users can thus benefit from the cheap storage price of S3 and still run analytics queries, filtering, aggregating and grouping data through the Spectrum layer.
Note, however, that Redshift Spectrum queries run slower than queries on the Amazon Redshift cluster itself.
Indeed, running frequent queries through Spectrum can make the data-scanning cost exceed the cost of simply storing the data in Redshift, in which case it is better to move the data back into the cluster.
In a nutshell, there is no obvious choice between the two options; a smart approach is to simulate the daily incoming workload of the application and estimate the average daily cost and latency.
If scanning the data costs more than storing it, a Redshift cluster is the best option; otherwise, if Redshift Spectrum gives an acceptable latency at peak incoming workload and cost is your priority, then Spectrum is the right option for you.
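As a rough illustration of that simulation, the list prices quoted above already give a break-even point (a back-of-the-envelope sketch that ignores Redshift compute sizing and S3 request costs):

# Rough yearly cost per TB of data, using the figures quoted above
redshift_storage = 1000.0   # $/TB/year on the Redshift cluster
s3_storage = 250.0          # $/TB/year on S3
spectrum_scan = 5.0         # $/TB scanned by Redshift Spectrum

def yearly_cost_spectrum(tb_scanned_per_day):
    return s3_storage + spectrum_scan * tb_scanned_per_day * 365

# Scanning roughly 0.4 TB/day per stored TB makes the two options comparable
for tb_per_day in (0.1, 0.25, 0.41, 1.0):
    print(tb_per_day, round(yearly_cost_spectrum(tb_per_day)), "vs", redshift_storage)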
The data collection operation can be hosted by a Lambda function connected to the Kinesis service, which sends the stream results to the data transformation unit, or by a separate Lambda function where data is ingested through Kinesis streams.
We can use an S3 bucket to store the classification algorithm's activity and a Lambda function to load the new incoming data into the Athena table.
Backup data stores: using data backups to replicate data is also a mandatory practice when designing architectures, and S3 is the most convenient choice.