Big Data on AWS
Johann Romefort

Agenda
• What is Big Data?
• What is AWS?
• Presenting the tools: How Big Data and AWS fit
together

What is Big Data?
• It sits at the intersection of the 3 V’s of data:
• Velocity (batch / real time / streaming)
• Volume (terabytes / petabytes)
• Variety (structured / semi-structured / unstructured)

Why is everybody talking about it?
• The cost of generating data has gone down
• By 2015, 3B people will be online, pushing data
volume created to 8 zettabytes
• More data = More insights = Better decisions
• Processing is getting easier and cheaper thanks to
cloud platforms

Data flow and constraints
Generate → Ingest / Store → Process → Visualize / Share
The 3 V’s introduce heterogeneity, which makes each of
these steps hard to achieve

What is AWS?
• AWS is a cloud computing platform
• On-demand delivery of IT resources
• Pay-as-you-go pricing model

Cloud Computing
Compute + Storage + Networking
Adapts dynamically to ever-changing needs, closely
tracking the requirements of user infrastructure and
applications

How does AWS help
with Big Data?
• Removes constraints on the ingestion, storage, and
processing layers and adapts closely to demand.
• Provides a collection of integrated tools that adapt to
the 3 V’s of Big Data
• Virtually unlimited storage capacity and processing power
fit well with changing data storage and analysis
requirements.

Computing Solutions
for Big Data on AWS
EC2 EMR
Kinesis
Redshift

Computing Solutions
for Big Data on AWS
EC2
All-purpose computing instances.
Dynamic provisioning and resizing
Lets you scale your infrastructure
at low cost
Use Case: Well suited for running custom or proprietary
applications (e.g. SAP HANA, Tableau…)

Computing Solutions
for Big Data on AWS
EMR
‘Hadoop in the cloud’
Adapts to the complexity of the analysis
and the volume of data to process
Use Case: Offline processing of very large volumes of data,
possibly unstructured (the Variety dimension)
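
To make "Hadoop in the cloud" a little more concrete, here is a minimal single-process sketch of the map / shuffle / reduce pattern that Hadoop-style batch jobs follow. It is plain Python for illustration only, not EMR or Hadoop API code; the function names and the sample log lines are hypothetical.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input standing in for raw, unstructured log lines.
sample = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(sample)
print(counts["error"], counts["disk"])
```

On a real cluster the map and reduce phases run in parallel across many nodes, which is what lets the same pattern absorb larger volumes of data.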

Computing Solutions
for Big Data on AWS
Kinesis
Stream Processing
Real-time data
Scales to adapt to the flow of
inbound data
Use Case: Complex event processing, click streams,
sensor data, computation over time windows
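
As an illustration of "computation over time windows", the sketch below counts click-stream events per fixed, non-overlapping (tumbling) window. It is plain Python and does not call Kinesis; the function name, the event format, and the sample clicks are assumptions made for the example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window.

    events: iterable of (timestamp_seconds, key) tuples.
    Returns {window_start_seconds: Counter({key: count})}.
    """
    windows = {}
    for timestamp, key in events:
        # Align each event to the start of the window it falls into.
        window_start = (timestamp // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[key] += 1
    return windows


# Hypothetical click stream: (seconds since start, page visited).
clicks = [(0, "/home"), (12, "/cart"), (59, "/home"), (61, "/home"), (130, "/cart")]
per_minute = tumbling_window_counts(clicks, 60)
```

In a streaming setup the same aggregation would run continuously on records as they arrive, rather than on a list held in memory.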

Computing Solutions
for Big Data on AWS
Redshift
Data Warehouse in the cloud
Scales to Petabytes
Supports SQL Querying
Start small for just $0.25/h
Use Case: BI analysis, using legacy ODBC/JDBC software
to analyze or visualize data
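
The pay-as-you-go arithmetic behind the $0.25/h entry price can be sketched as follows. The rate comes from the slide; the single-node setup, the 30-day month, and the 22 working days are assumptions for illustration, and storage or data-transfer charges are ignored.

```python
HOURLY_RATE_USD = 0.25  # entry price quoted on the slide
HOURS_PER_DAY = 24


def on_demand_cost(hourly_rate, hours):
    """Cost of running one node for the given number of hours."""
    return round(hourly_rate * hours, 2)


# Running around the clock (assumed 30-day month).
one_day = on_demand_cost(HOURLY_RATE_USD, HOURS_PER_DAY)            # 6.0
thirty_days = on_demand_cost(HOURLY_RATE_USD, HOURS_PER_DAY * 30)   # 180.0

# Running only 8 hours on each of 22 assumed working days.
business_hours = on_demand_cost(HOURLY_RATE_USD, 8 * 22)            # 44.0
```

The gap between the last two figures is the point of on-demand pricing: you pay only for the hours the cluster is actually running.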

Storage Solution
for Big Data on AWS
DynamoDB Redshift
S3 Glacier

Storage Solution
for Big Data on AWS
DynamoDB
NoSQL Database
Consistent
Low latency access
Column-based, flexible
data model
Use Case: Offline processing of very large volumes of data,
possibly unstructured (the Variety dimension)

Storage Solution
for Big Data on AWS
S3
Versatile storage system
Low-cost
Fast data retrieval
Use Case: Backups and Disaster recovery, Media storage,
Storage for data analysis

Storage Solution
for Big Data on AWS
Glacier
Archive storage of cold data
Extremely low-cost
Optimized for infrequently
accessed data
Use Case: Storing raw data logs. Storing media archives.
Magnetic tape replacement

What makes AWS different
when it comes to big data?

Integrated Environment for Big Data
Given the 3 V’s, you usually need a collection of tools
for your data processing and storage.
AWS Big Data solutions already come integrated with one
another
AWS Big Data solutions also integrate with the whole AWS
ecosystem (Security, Identity Management, Logging, Backups,
Management Console…)

Example of products interacting with
each other.

Tightly integrated rich
environment of tools
+
On-demand scaling sticking to
processing requirements
=
Extremely cost-effective and easy to
deploy solution for big data needs

Use Case:
Real-time IoT Analytics
Gathering data in real time from sensors deployed in a
factory and sending it for immediate processing
• Error Detection: Real-time detection of hardware
problems
• Optimization and Energy management
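
A rule of the kind used for real-time error detection could look like the sketch below: flag a sensor when its average reading over a sliding time window exceeds a threshold. This is a hypothetical example, not the rule engine used in the project; the class name, the 10-second window, the threshold, and the readings are all assumptions.

```python
from collections import deque


class SlidingWindowRule:
    """Fire when the average reading over the last `window_seconds` exceeds `threshold`."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.readings = deque()  # (timestamp, value) pairs, oldest first

    def observe(self, timestamp, value):
        """Record a reading and return True if the rule fires."""
        self.readings.append((timestamp, value))
        # Drop readings that have fallen out of the time window.
        while self.readings[0][0] <= timestamp - self.window_seconds:
            self.readings.popleft()
        average = sum(v for _, v in self.readings) / len(self.readings)
        return average > self.threshold


# Hypothetical temperature readings: (seconds, degrees).
rule = SlidingWindowRule(window_seconds=10, threshold=80.0)
readings = [(0, 70.0), (4, 75.0), (8, 90.0), (12, 95.0), (16, 99.0)]
alerts = [t for t, v in readings if rule.observe(t, v)]
```

Averaging over a window instead of testing each reading on its own keeps a single noisy spike from raising an alert.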

First Version of the
infrastructure
[Diagram] Components: sensors on customer site (aggregate
sensor data), nodejs stream processor, mongodb, in-house
hadoop cluster.
Labels: evaluate rules over time window; feed algorithm;
write raw data for further processing; backup.

Second Version of the
infrastructure
[Diagram] Components: sensors on customer site (aggregate
sensor data), Kinesis, Redshift (for BI analysis), Glacier.
Labels: evaluate rules over time window; write raw data
for archiving.

Thank You
romefort@gmail.com
follow me on @romefort
