Supercharging ETL with Spark 
Rafal Kwasny 
First Spark London Meetup 
2014-05-28
Who are you?
About me 
• Sysadmin/DevOps background 
• Worked as DevOps @Visualdna 
• Now building a game analytics platform @Sony Computer Entertainment Europe
Outline 
• What is ETL 
• How do we do it in the standard Hadoop stack 
• How can we supercharge it with Spark 
• Real-life use cases 
• How to deploy Spark 
• Lessons learned
Standard technology stack 
Get the data
Standard technology stack 
Load into HDFS / S3
Standard technology stack 
Extract & Transform & Load
Standard technology stack 
Query, Analyze, train ML models
Standard technology stack 
Real Time pipeline
Hadoop 
• Industry standard 
• Have you ever looked at Hadoop code and tried to fix something?
How simple is simple? 
"Simple YARN application to run n copies of a unix command - deliberately kept simple (with minimal error handling etc.)" 
➜ $ git clone https://github.com/hortonworks/simple-yarn-app.git 
(…) 
➜ $ find simple-yarn-app -name "*.java" | xargs cat | wc -l 
232
ETL Workflow 
• Get some data from S3/HDFS 
• Map 
• Shuffle 
• Reduce 
• Save to S3/HDFS
ETL Workflow 
• Get some data from S3/HDFS 
• Map 
• Shuffle 
• Reduce 
• Save to S3/HDFS 
Repeat 10 times
Issue: Test run time 
• Job startup time: ~20s to run a job that does nothing 
• Hard to test the code without a cluster (Cascading simulation mode != real life)
Issue: new applications 
MapReduce is awkward for key big data workloads: 
• Low-latency dispatch (e.g. quick queries) 
• Iterative algorithms (e.g. ML, graph processing) 
• Streaming data ingest
Issue: hardware is moving on 
Hardware has advanced since Hadoop started: 
• Very large RAM, faster networks (10Gb+) 
• Bandwidth to disk not keeping up 
• 1 GB of RAM ~ $0.75/month * 
*based on the spot price of an AWS r3.8xlarge instance
How can we 
supercharge our ETL?
Use Spark 
• Fast and Expressive Cluster Computing Engine 
• Compatible with Apache Hadoop 
• In-memory storage 
• Rich APIs in Java, Scala, Python
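As a taste of the API, a minimal sketch counting hits per URL (the paths and column layout here are hypothetical, against the Spark 0.9-era Scala API): 
import org.apache.spark.{SparkConf, SparkContext} 
object UrlCounts { 
  def main(args: Array[String]) { 
    val conf = new SparkConf().setAppName("url-counts").setMaster("local[2]") // or your cluster master URL 
    val sc = new SparkContext(conf) 
    sc.textFile("hdfs:///logs/access.log")  // one event per line 
      .map(_.split("\t"))                   // tab-separated fields 
      .map(fields => (fields(3), 1))        // key by the URL column 
      .reduceByKey(_ + _)                   // count hits per URL 
      .saveAsTextFile("hdfs:///reports/url-counts") 
    sc.stop() 
  } 
}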
Why Spark? 
• Up to 40x faster than Hadoop MapReduce 
( for some use cases, see: https://amplab.cs.berkeley.edu/benchmark/ ) 
• Jobs can be scheduled and run in <1s 
• Typically 2-5x less code 
• Seamless Hadoop/HDFS integration 
• REPL 
• Accessible source code (in terms of LOC and modularity)
Why Spark? 
• Berkeley Data Analytics Stack ecosystem: 
• Spark, Spark Streaming, Shark, BlinkDB, MLlib 
• Deep integration into Hadoop ecosystem 
• Read/write Hadoop formats 
• Interoperability with other ecosystem components 
• Runs on Mesos & YARN, also MR1 
• EC2, EMR 
• HDFS, S3
Why Spark?
Using RAM for in-memory caching
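The core idea, in a minimal sketch (dataset path hypothetical): mark an RDD as cached and every action after the first reuses the in-memory copy instead of re-reading from disk: 
val events = sc.textFile("hdfs:///logs/2014-05-28").cache() // keep in RAM after first use 
events.count()                               // first action: reads from disk, fills the cache 
events.filter(_.contains("login")).count()   // later actions hit the cached copy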
Fault recovery
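(Spark recovers lost partitions by recomputing them from the RDD lineage, rather than by replicating the data itself.)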
Stack 
Also: 
• Shark (Hive on Spark) 
• Tachyon (off-heap caching) 
• SparkR (R wrapper) 
• BlinkDB (approximate queries)
Real-life use
Spark use-cases 
• Next-generation ETL platform 
• No more “multiple chained MapReduce jobs” architecture 
• Fewer jobs to worry about 
• Better sleep for your DevOps team
Sessionization 
Add session_id to events
Why add session id? 
Combine all user activity into user sessions
Adding session ID 
user_id | timestamp  | Referrer              | URL 
user1   | 1401207490 | http://fb.com         | http://webpage/ 
user2   | 1401207491 | http://twitter.com    | http://webpage/ 
user1   | 1401207543 | http://webpage/      | http://webpage/login 
user1   | 140120841  | http://webpage/login | http://webpage/add_to_cart 
user2   | 1401207491 | http://webpage/      | http://webpage/product1
Group by user 
user_id | timestamp  | Referrer              | URL 
user1   | 1401207490 | http://fb.com         | http://webpage/ 
user1   | 1401207543 | http://webpage/      | http://webpage/login 
user1   | 140120841  | http://webpage/login | http://webpage/add_to_cart 
user2   | 1401207491 | http://twitter.com    | http://webpage/ 
user2   | 1401207491 | http://webpage/      | http://webpage/product1
Add unique session id 
user_id | timestamp  | session_id                       | Referrer              | URL 
user1   | 1401207490 | 8fddc743bfbafdc45e071e5c126ceca7 | http://fb.com         | http://webpage/ 
user1   | 1401207543 | 8fddc743bfbafdc45e071e5c126ceca7 | http://webpage/      | http://webpage/login 
user1   | 140120841  | 8fddc743bfbafdc45e071e5c126ceca7 | http://webpage/login | http://webpage/add_to_cart 
user2   | 1401207491 | c00e7421525008584d9d1ff4201cbf65 | http://twitter.com    | http://webpage/ 
user2   | 1401207491 | c00e7421525008584d9d1ff4201cbf65 | http://webpage/      | http://webpage/product1
Join with external data 
user_id | timestamp  | session_id                       | new_user | Referrer              | URL 
user1   | 1401207490 | 8fddc743bfbafdc45e071e5c126ceca7 | TRUE     | http://fb.com         | http://webpage/ 
user1   | 1401207543 | 8fddc743bfbafdc45e071e5c126ceca7 | TRUE     | http://webpage/      | http://webpage/login 
user1   | 140120841  | 8fddc743bfbafdc45e071e5c126ceca7 | TRUE     | http://webpage/login | http://webpage/add_to_cart 
user2   | 1401207491 | c00e7421525008584d9d1ff4201cbf65 | FALSE    | http://twitter.com    | http://webpage/ 
user2   | 1401207491 | c00e7421525008584d9d1ff4201cbf65 | FALSE    | http://webpage/      | http://webpage/product1
Sessionize user clickstream 
• Filter interesting events 
• Group by user 
• Add unique sessionId 
• Join with external data sources 
• Write output
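In Spark the whole pipeline is a single job (Scala; the input is tab-separated, cogroup performs the join, and flatMapValues emits the enriched events):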
val input = sc.textFile("file:///tmp/input") 
val rawEvents = input 
  .map(line => line.split("\t"))                // tab-separated event fields 
val userInfo = sc.textFile("file:///tmp/userinfo") 
  .map(line => line.split("\t")) 
  .map(user => (user(0), user))                 // key user info by user_id 
val processedEvents = rawEvents 
  .map(arr => (arr(0), arr))                    // key events by user_id 
  .cogroup(userInfo)                            // all events + user info grouped per user 
  .flatMapValues(k => { 
    val new_user = k._2.length match {          // flag based on presence in the user-info set 
      case x if x > 0 => "true" 
      case _ => "false" 
    } 
    val session_id = java.util.UUID.randomUUID.toString  // one id per user group 
    k._1.map(line => 
      line.slice(0, 3) ++ Array(session_id) ++ Array(new_user) ++ line.drop(3) 
    ) 
  }) 
  .map(k => k._2)                               // drop the user_id key, keep the enriched event
Why is it better? 
• Single Spark job 
• Easier to maintain than three consecutive MapReduce stages 
• Can be unit tested
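For example, a minimal test sketch (assuming ScalaTest and a hypothetical sessionize() helper that wraps the logic above), run against an in-process local master: 
import org.apache.spark.{SparkConf, SparkContext} 
import org.scalatest.FunSuite 
class SessionizeSuite extends FunSuite { 
  test("all events of one user share a session_id") { 
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test")) 
    try { 
      val events = sc.parallelize(Seq( 
        Array("user1", "1401207490", "http://fb.com", "http://webpage/"), 
        Array("user1", "1401207543", "http://webpage/", "http://webpage/login"))) 
      val sessionIds = sessionize(events).map(_(3)).distinct().collect() // hypothetical helper 
      assert(sessionIds.length === 1) 
    } finally { 
      sc.stop() // always stop the context so the next test can start one 
    } 
  } 
}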
From the DevOps 
perspective
v1.0 - running on EC2 
• Start with the EC2 script: 
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name> 
If it does not work for you, modify it - it's just simple Python + boto
v2.0 - Autoscaling on spot instances 
1x Master - on-demand (c3.large) 
XX Slaves - spot instances depending on usage patterns (r3.*) 
• no HDFS 
• persistence in memory + S3
Other options 
• Mesos 
• YARN 
• MR1
Lessons learned
JVM issues 
• java.lang.OutOfMemoryError: GC overhead limit exceeded 
• Add more memory? 
val sparkConf = new SparkConf() 
  .set("spark.executor.memory", "120g") 
  .set("spark.storage.memoryFraction", "0.3") 
  .set("spark.shuffle.memoryFraction", "0.3") 
• Increase parallelism: 
sc.textFile("s3://..path", 10000) 
groupByKey(10000)
Full GC 
2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs] 
2014-05-21T10:16:42.580+0000: 280.087: Total time for which application threads were stopped: 79.3773830 seconds 
We want to avoid this: 
• Use G1GC + Java 8 
• Store data serialized: 
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") 
.set("spark.kryo.registrator", "scee.SceeKryoRegistrator")
Bugs 
• For example: CDH5 does not work with Amazon S3 out of the box (thanks to Sean, it will be fixed in the next release) 
• If in doubt, use the provided ec2/spark-ec2 script: 
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
Tips & Tricks 
• you do not need to package whole spark with your app, just 
specify dependencies as provided in sbt 
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % 
„provided" 
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % 
"provided" 
assembly jar size from 120MB -> 5MB 
• always ensure you are compiling agains the same version of 
artifacts, if not ”bad things will happen”™
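(With the sbt-assembly plugin, "provided" dependencies stay on the compile classpath but are left out of the fat jar - which is what shrinks it.)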
Future - Spark 1.0 
• Voting in progress to release Spark 1.0.0 RC11 
• Spark SQL 
• History server 
• Job Submission Tool 
• Java 8 support
Spark - Hadoop done right 
• Faster to run, less code to write 
• Deploying Spark can be easy and cost-effective 
• Still rough around the edges, but improving quickly
Thank you for listening 
:)


Editor's Notes 
• #2: My experience supercharging Extract Transform Load workloads with Spark 
• #6: Get the data (access logs + application logs) 
• #7: Put it into S3 / load into HDFS 
• #8: Transform using Hive/Streaming/Cascading/Scalding into a flat structure you can query 
• #9: Load into an MPP database / query using Hive 
• #10: Rewrite all the logic for real time, on top of completely different technology (Storm/Samza etc.) 
• #11: Is it the best option? 
• #20: read–eval–print loop