Big Data on AWS 
Johann Romefort
Agenda 
• What is Big Data? 
• What is AWS? 
• Presenting the tools: How Big Data and AWS fit 
together
What is Big Data? 
• It’s at the intersection of the 3 V’s of data: 
• Velocity (Batch / Real time / Streaming) 
• Volume (Terabytes/Petabytes) 
• Variety (structured/semi-structured/unstructured)
Why is everybody talking about it? 
• The cost of generating data has gone down 
• By 2015, 3B people will be online, pushing data 
volume created to 8 zettabytes 
• More data = More insights = Better decisions 
• Processing is getting easier and cheaper thanks to 
cloud platforms
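
To put the 8 zettabytes figure next to the terabyte and petabyte scale mentioned earlier, here is a small conversion sketch. It assumes decimal (SI) units, where each step is a factor of 1000; the `convert` helper is hypothetical and only for illustration.

```python
# Decimal (SI) units: each step up the list is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def convert(value: float, src: str, dst: str) -> float:
    """Convert a data size between SI units (powers of 1000)."""
    steps = UNITS.index(src) - UNITS.index(dst)
    return value * (1000 ** steps)


zettabytes = 8  # figure quoted on the slide
print(f"{zettabytes} ZB = {convert(zettabytes, 'ZB', 'PB'):,.0f} PB")
print(f"{zettabytes} ZB = {convert(zettabytes, 'ZB', 'TB'):,.0f} TB")
```

In other words, 8 ZB is 8 million petabytes, or 8 billion terabytes.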
Data flow and constraints 
Generate 
Ingest / Store 
Process 
Visualize / Share 
The 3 V’s introduce 
heterogeneity, which 
makes these steps 
hard to achieve
What is AWS? 
• AWS is a cloud computing platform 
• On-demand delivery of IT resources 
• Pay-as-you-go pricing model
Cloud Computing 
Compute + Storage + Networking 
Adapts dynamically to ever-changing 
needs, closely tracking the 
requirements of user infrastructure 
and applications
How does AWS help 
with Big Data? 
• Removes constraints on the ingestion, storage, and 
processing layers and adapts closely to demand. 
• Provides a collection of integrated tools to address 
the 3 V’s of Big Data 
• Virtually unlimited storage capacity and processing power 
suit changing data storage and analysis 
requirements.
Computing Solutions 
for Big Data on AWS 
EC2 EMR 
Kinesis 
Redshift
Computing Solutions 
for Big Data on AWS 
EC2 
All-purpose computing instances. 
Dynamic Provisioning and resizing 
Lets you scale your infrastructure 
at low cost 
Use Case: Well suited for running custom or proprietary 
applications (e.g. SAP HANA, Tableau…)
Computing Solutions 
for Big Data on AWS 
EMR 
‘Hadoop in the cloud’ 
Adapts to the complexity of the analysis 
and the volume of data to process 
Use Case: Offline processing of very large volumes of data, 
possibly unstructured (Variety variable)
Computing Solutions 
for Big Data on AWS 
Kinesis 
Stream Processing 
Real-time data 
Scales to adapt to the flow of 
inbound data 
Use Case: Complex Event Processing, click streams, 
sensor data, computations over time windows
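
As an illustration of "computations over time windows" on a click stream, here is a minimal sketch in plain Python. It does not use the Kinesis API; the function name, the event format (timestamp in seconds, page) and the sample data are all assumptions made for the example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) over fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, key) pairs.
    """
    counts = Counter()
    for timestamp, key in events:
        # Align each timestamp to the start of the window it falls in.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return counts


# Hypothetical click stream: (seconds since start, page visited).
clicks = [(0, "/home"), (12, "/home"), (59, "/pricing"), (61, "/home"), (119, "/home")]
per_minute = tumbling_window_counts(clicks, window_seconds=60)
for (window_start, page), n in sorted(per_minute.items()):
    print(window_start, page, n)
```

In a streaming deployment the same aggregation would run continuously on records as they arrive, rather than on an in-memory list.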
Computing Solutions 
for Big Data on AWS 
Redshift 
Data Warehouse in the cloud 
Scales to Petabytes 
Supports SQL Querying 
Start small for just $0.25/h 
Use Case: BI analysis, using legacy ODBC/JDBC software 
to analyze or visualize data
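
To make the $0.25/h entry price concrete, here is a small arithmetic sketch. It assumes a single resource billed at that rate and running continuously, with a 30-day month and a 365-day year; actual pricing depends on node type, region and usage.

```python
HOURLY_RATE_USD = 0.25  # entry price quoted on the slide


def running_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of running for `hours` at a flat hourly rate."""
    return hours * rate


per_day = running_cost(24)
per_month = running_cost(24 * 30)   # assumes a 30-day month
per_year = running_cost(24 * 365)   # assumes a 365-day year

print(f"per day:   ${per_day:,.2f}")
print(f"per month: ${per_month:,.2f}")
print(f"per year:  ${per_year:,.2f}")
```

At this rate, running around the clock costs $6 per day, $180 per 30-day month and $2,190 per year.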
Storage Solution 
for Big Data on AWS 
DynamoDB Redshift 
S3 Glacier
Storage Solution 
for Big Data on AWS 
DynamoDB 
NoSQL Database 
Consistent 
Low latency access 
Column-based, flexible 
data model 
Use Case: Offline processing of very large volumes of data, 
possibly unstructured (Variety variable)
Storage Solution 
for Big Data on AWS 
S3 
Versatile storage system 
Low-cost 
Fast data retrieval 
Use Case: Backups and Disaster recovery, Media storage, 
Storage for data analysis
Storage Solution 
for Big Data on AWS 
Glacier 
Archive storage of cold data 
Extremely low-cost 
Optimized for infrequently 
accessed data 
Use Case: Storing raw data logs. Storing media archives. 
Magnetic tape replacement
What makes AWS different 
when it comes to big data?
Integrated Environment for Big Data 
Given the 3 V’s, a collection of tools is usually 
needed for your data processing and storage. 
AWS Big Data solutions come already integrated with each 
other. 
AWS Big Data solutions also integrate with the whole AWS 
ecosystem (Security, Identity Management, Logging, Backups, 
Management Console…)
Example of products interacting with 
each other.
Tightly integrated rich 
environment of tools 
+ 
On-demand scaling that tracks 
processing requirements 
= 
Extremely cost-effective and easy-to-deploy 
solution for big data needs
Use Case: 
Real-time IoT Analytics 
Gather data in real time from sensors deployed in a 
factory and send it for immediate processing 
• Error Detection: Real-time detection of hardware 
problems 
• Optimization and Energy management
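
The error-detection idea can be sketched as a rule evaluated over a sliding time window of sensor readings. This is a hypothetical illustration in plain Python, not the rule engine used in the project: the class name, the mean-above-threshold rule, the 10-second window and the readings are all assumptions, and readings are assumed to arrive in timestamp order.

```python
from collections import deque


class SlidingWindowRule:
    """Flag a sensor when its mean reading over the last
    `window_seconds` exceeds `threshold`."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.readings = deque()  # (timestamp, value), oldest first

    def observe(self, timestamp, value):
        """Add a reading and return True if the rule fires."""
        self.readings.append((timestamp, value))
        # Drop readings that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.readings and self.readings[0][0] <= cutoff:
            self.readings.popleft()
        mean = sum(v for _, v in self.readings) / len(self.readings)
        return mean > self.threshold


# Hypothetical temperature readings: (seconds, degrees).
rule = SlidingWindowRule(window_seconds=10, threshold=80.0)
stream = [(0, 70.0), (5, 75.0), (10, 95.0), (15, 99.0)]
alerts = [rule.observe(t, v) for t, v in stream]
print(alerts)
```

Using a windowed mean rather than a single reading is one way to avoid firing on an isolated spike; other rules (maximum, rate of change) fit the same structure.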
First Version of the 
infrastructure 
[Diagram] Components: sensors, nodejs stream processor, mongodb, in-house hadoop cluster, backup 
[Diagram] Labels: aggregate sensor data (on customer site); evaluate rules over time window; feed algorithm; write raw data for further processing
Second Version of the 
infrastructure 
[Diagram] Components: sensors, Kinesis, Redshift, Glacier 
[Diagram] Labels: aggregate sensor data (on customer site); evaluate rules over time window; write raw data for archiving; Redshift for BI analysis
Thank You 
romefort@gmail.com 
follow me on @romefort
