SlideShare a Scribd company logo
Parallelize R code Using
Apache Spark
Hossein Falaki
@mhfalaki
About me
• Former Data Scientist at Apple Siri
• Software Engineer at Databricks
• Started using Apache Spark since version 0.6
• Developed first version of Apache Spark CSV data source
• Worked on SparkR & Databricks R Notebooks
• Currently focusing on R experience at Databricks
What is SparkR
An R package distributed with Apache Spark:
• Provides R front-endto Apache Spark
• Exposes Spark DataFrames (inspired by R and Pandas)
• Convenient interoperability between R and Spark DataFrames
robust	distributed		
processing,	data	source,	off-
memory	data	
dynamic	environment,	
interactivity,	+10K	packages,	
visualizations
+
SparkR architecture
Spark Driver
JVM
Worker
JVM
Worker
DataSources
JVMR
RBackend
JVM
SparkR architecture (2.x)
Spark Driver
JVM
Worker
JVM
Worker
DataSources
JVMR
RBackend
R R
R R
IO
read.df / write.df /
createDataFrame / collect
Caching
cache / persist / unpersist /
cacheTable / uncacheTable
SQL
sql / table / saveAsTable /
registerTempTable / tables
Overview of SparkR API
http://spark.apache.org/docs/latest/api/R/
ML Lib
glm / kmeans / Naïve Bayes
Survival regression
DataFrame API
select / subset / groupBy /
head / avg / column / dim
UDF functionality (since 2.0)
spark.lapply/ dapply /
gapply / dapplyCollect
SparkR UDF API
spark.lapply
Runs a function over
a list of elements
spark.lapply()
dapply
Appliesa function to
each partition of a
SparkDataFrame
dapply()
dapplyCollect()
gapply
Appliesa function to
each group within a
SparkDataFrame
gapply()
gapplyCollect()
spark.lapply
Simplest SparkR UDF patter
For each element of a list
1. Sends the function to an R worker
2. Executesthe function
3. Returns the result of all workers as a list to R driver
spark.lapply(1:100, function(x) {
runBootstrap(x)
}
spark.lapply control flow
RWorker JVMR Driver JVM
1. serialize R closure
4. transfer over
local socket
7. serialize result
2. transfer over
local socket
8. transfer over
local socket
10. transfer over
local socket
11. de-serialize result
9. Transfer serialized closureover thenetwork
3. Transfer serialized closureover thenetwork
5. de-serialize closure
6.Execution
dapply
For each partition of a Spark DataFrame
1. Collects each partition as an R data.frame
2. Sends the R function to the R worker
3. Executesthe function
dapply(sparkDF, func, schema)
combines resultsas DataFrame
with provided schema
dapplyCollect(sparkDF, func)
combines resultsas R
data.frame
dapply control & data flow
RWorker JVMR Driver JVM
localsocket cluster network localsocket
input data
ser/de transfer
result data
ser/de transfer
dapply control & data flow
RWorker JVMR Driver JVM
localsocket cluster network localsocket
input data
ser/de transfer
result data
ser/de transferresult deserialize
gapply
Groups a Spark DataFrame on one or more columns
1. Collects each group as an R data.frame
2. Sends the R function to the R worker
3. Executesthe function
gapply(sparkDF, func, schema)
combines resultsas DataFrame
with provided schema
gapplyCollect(sparkDF, func)
combines resultsas R
data.frame
dapply control & data flow
RWorker JVMR Driver JVM
localsocket cluster network localsocket
input data
ser/de transfer
result data
ser/de transferresult deserialize
data
shuffle
gapply dapply
signature gapply(df, cols, func, schema)
gapply(gdf, func, schema)
dapply(df, func, schema)
user	function
signature
function(key, data) function(data)
data	partition controlled	by	grouping not	controlled
Parallelizing data
• Do not use spark.lapply() to distribute large data sets
• Do not pack data in the closure
• Watch for skew in data
– Are partitions evenlysized?
• Auxiliary data
– Can be joined with input DataFrame
– Can be distributed to all the workers
Packages on workers
• SparkR closure capture does not include packages
• You need to import package son each worker inside your
function
• You need to install packages on workers
– spark.lapply() can be used to install packages
Debugging user code
1. Verify code on the driver
2. Interactively execute code on the cluster
• When R worker fails, Spark Driver throws exception with R error text
3. Inspect details of failure of failed job in Spark UI
4. Inspect stdout/stderrr of worker
Demo
Thank You

More Related Content

What's hot (20)

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
Spark
SparkSpark
Spark
Koushik Mondal
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
Jordan Halterman
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
Databricks
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
Julian Hyde
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
Mateusz Buśkiewicz
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
Stamatis Zampetakis
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
Oracle E-Business Suite R12.2.6 on Database 12c: Install, Patch and Administer
Oracle E-Business Suite R12.2.6 on Database 12c: Install, Patch and AdministerOracle E-Business Suite R12.2.6 on Database 12c: Install, Patch and Administer
Oracle E-Business Suite R12.2.6 on Database 12c: Install, Patch and Administer
Andrejs Karpovs
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Pwning mobile apps without root or jailbreak
Pwning mobile apps without root or jailbreakPwning mobile apps without root or jailbreak
Pwning mobile apps without root or jailbreak
Abraham Aranguren
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
Jaemun Jung
 
Snowflake Data Loading.pptx
Snowflake Data Loading.pptxSnowflake Data Loading.pptx
Snowflake Data Loading.pptx
Parag860410
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
Jordan Halterman
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
Databricks
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
Julian Hyde
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
Stamatis Zampetakis
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
Oracle E-Business Suite R12.2.6 on Database 12c: Install, Patch and Administer
Oracle E-Business Suite R12.2.6 on Database 12c: Install, Patch and AdministerOracle E-Business Suite R12.2.6 on Database 12c: Install, Patch and Administer
Oracle E-Business Suite R12.2.6 on Database 12c: Install, Patch and Administer
Andrejs Karpovs
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Pwning mobile apps without root or jailbreak
Pwning mobile apps without root or jailbreakPwning mobile apps without root or jailbreak
Pwning mobile apps without root or jailbreak
Abraham Aranguren
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
Jaemun Jung
 
Snowflake Data Loading.pptx
Snowflake Data Loading.pptxSnowflake Data Loading.pptx
Snowflake Data Loading.pptx
Parag860410
 

Similar to Parallelize R Code Using Apache Spark (20)

Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
Databricks
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
Craig Warman
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
Databricks
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkR
Olgun Aydın
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkR
Ankara Big Data Meetup
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkR
Olgun Aydın
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scale
jeykottalam
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
Martin Toshev
 
Module01
 Module01 Module01
Module01
NPN Training
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재
hkyoon2
 
An Introduction to Apache spark with scala
An Introduction to Apache spark with scalaAn Introduction to Apache spark with scala
An Introduction to Apache spark with scala
johnn210
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache spark
Amine Sagaama
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
Edureka!
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
Databricks
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
Craig Warman
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
Databricks
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkR
Olgun Aydın
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkR
Olgun Aydın
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scale
jeykottalam
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
Martin Toshev
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재
hkyoon2
 
An Introduction to Apache spark with scala
An Introduction to Apache spark with scalaAn Introduction to Apache spark with scala
An Introduction to Apache spark with scala
johnn210
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache spark
Amine Sagaama
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
Edureka!
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Agile Software Engineering Methodologies
Agile Software Engineering MethodologiesAgile Software Engineering Methodologies
Agile Software Engineering Methodologies
Gaurav Sharma
 
COBOL Programming with VSCode - IBM Certificate
COBOL Programming with VSCode - IBM CertificateCOBOL Programming with VSCode - IBM Certificate
COBOL Programming with VSCode - IBM Certificate
VICTOR MAESTRE RAMIREZ
 
Automating Map Production With FME and Python
Automating Map Production With FME and PythonAutomating Map Production With FME and Python
Automating Map Production With FME and Python
Safe Software
 
How to Generate Financial Statements in QuickBooks Like a Pro (1).pdf
How to Generate Financial Statements in QuickBooks Like a Pro (1).pdfHow to Generate Financial Statements in QuickBooks Like a Pro (1).pdf
How to Generate Financial Statements in QuickBooks Like a Pro (1).pdf
QuickBooks Training
 
Build enterprise-ready applications using skills you already have!
Build enterprise-ready applications using skills you already have!Build enterprise-ready applications using skills you already have!
Build enterprise-ready applications using skills you already have!
PhilMeredith3
 
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Prachi Desai
 
Topic 26 Security Testing Considerations.pptx
Topic 26 Security Testing Considerations.pptxTopic 26 Security Testing Considerations.pptx
Topic 26 Security Testing Considerations.pptx
marutnand8
 
FME for Climate Data: Turning Big Data into Actionable Insights
FME for Climate Data: Turning Big Data into Actionable InsightsFME for Climate Data: Turning Big Data into Actionable Insights
FME for Climate Data: Turning Big Data into Actionable Insights
Safe Software
 
How AI Can Improve Media Quality Testing Across Platforms (1).pptx
How AI Can Improve Media Quality Testing Across Platforms (1).pptxHow AI Can Improve Media Quality Testing Across Platforms (1).pptx
How AI Can Improve Media Quality Testing Across Platforms (1).pptx
kalichargn70th171
 
Integrating Survey123 and R&H Data Using FME
Integrating Survey123 and R&H Data Using FMEIntegrating Survey123 and R&H Data Using FME
Integrating Survey123 and R&H Data Using FME
Safe Software
 
How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...
How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...
How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...
Insurance Tech Services
 
The rise of e-commerce has redefined how retailers operate—and reconciliation...
The rise of e-commerce has redefined how retailers operate—and reconciliation...The rise of e-commerce has redefined how retailers operate—and reconciliation...
The rise of e-commerce has redefined how retailers operate—and reconciliation...
Prachi Desai
 
Software Engineering Process, Notation & Tools Introduction - Part 4
Software Engineering Process, Notation & Tools Introduction - Part 4Software Engineering Process, Notation & Tools Introduction - Part 4
Software Engineering Process, Notation & Tools Introduction - Part 4
Gaurav Sharma
 
Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...
Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...
Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...
BradBedford3
 
Design by Contract - Building Robust Software with Contract-First Development
Design by Contract - Building Robust Software with Contract-First DevelopmentDesign by Contract - Building Robust Software with Contract-First Development
Design by Contract - Building Robust Software with Contract-First Development
Par-Tec S.p.A.
 
Neuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
Neuralink TemplateeeeeeeeeeeeeeeeeeeeeeeeeeNeuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
Neuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
alexandernoetzold
 
Rebuilding Cadabra Studio: AI as Our Core Foundation
Rebuilding Cadabra Studio: AI as Our Core FoundationRebuilding Cadabra Studio: AI as Our Core Foundation
Rebuilding Cadabra Studio: AI as Our Core Foundation
Cadabra Studio
 
Essentials of Resource Planning in a Downturn
Essentials of Resource Planning in a DownturnEssentials of Resource Planning in a Downturn
Essentials of Resource Planning in a Downturn
OnePlan Solutions
 
OpenTelemetry 101 Cloud Native Barcelona
OpenTelemetry 101 Cloud Native BarcelonaOpenTelemetry 101 Cloud Native Barcelona
OpenTelemetry 101 Cloud Native Barcelona
Imma Valls Bernaus
 
Online Queue Management System for Public Service Offices [Focused on Municip...
Online Queue Management System for Public Service Offices [Focused on Municip...Online Queue Management System for Public Service Offices [Focused on Municip...
Online Queue Management System for Public Service Offices [Focused on Municip...
Rishab Acharya
 
Agile Software Engineering Methodologies
Agile Software Engineering MethodologiesAgile Software Engineering Methodologies
Agile Software Engineering Methodologies
Gaurav Sharma
 
COBOL Programming with VSCode - IBM Certificate
COBOL Programming with VSCode - IBM CertificateCOBOL Programming with VSCode - IBM Certificate
COBOL Programming with VSCode - IBM Certificate
VICTOR MAESTRE RAMIREZ
 
Automating Map Production With FME and Python
Automating Map Production With FME and PythonAutomating Map Production With FME and Python
Automating Map Production With FME and Python
Safe Software
 
How to Generate Financial Statements in QuickBooks Like a Pro (1).pdf
How to Generate Financial Statements in QuickBooks Like a Pro (1).pdfHow to Generate Financial Statements in QuickBooks Like a Pro (1).pdf
How to Generate Financial Statements in QuickBooks Like a Pro (1).pdf
QuickBooks Training
 
Build enterprise-ready applications using skills you already have!
Build enterprise-ready applications using skills you already have!Build enterprise-ready applications using skills you already have!
Build enterprise-ready applications using skills you already have!
PhilMeredith3
 
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Prachi Desai
 
Topic 26 Security Testing Considerations.pptx
Topic 26 Security Testing Considerations.pptxTopic 26 Security Testing Considerations.pptx
Topic 26 Security Testing Considerations.pptx
marutnand8
 
FME for Climate Data: Turning Big Data into Actionable Insights
FME for Climate Data: Turning Big Data into Actionable InsightsFME for Climate Data: Turning Big Data into Actionable Insights
FME for Climate Data: Turning Big Data into Actionable Insights
Safe Software
 
How AI Can Improve Media Quality Testing Across Platforms (1).pptx
How AI Can Improve Media Quality Testing Across Platforms (1).pptxHow AI Can Improve Media Quality Testing Across Platforms (1).pptx
How AI Can Improve Media Quality Testing Across Platforms (1).pptx
kalichargn70th171
 
Integrating Survey123 and R&H Data Using FME
Integrating Survey123 and R&H Data Using FMEIntegrating Survey123 and R&H Data Using FME
Integrating Survey123 and R&H Data Using FME
Safe Software
 
How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...
How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...
How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...
Insurance Tech Services
 
The rise of e-commerce has redefined how retailers operate—and reconciliation...
The rise of e-commerce has redefined how retailers operate—and reconciliation...The rise of e-commerce has redefined how retailers operate—and reconciliation...
The rise of e-commerce has redefined how retailers operate—and reconciliation...
Prachi Desai
 
Software Engineering Process, Notation & Tools Introduction - Part 4
Software Engineering Process, Notation & Tools Introduction - Part 4Software Engineering Process, Notation & Tools Introduction - Part 4
Software Engineering Process, Notation & Tools Introduction - Part 4
Gaurav Sharma
 
Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...
Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...
Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...
BradBedford3
 
Design by Contract - Building Robust Software with Contract-First Development
Design by Contract - Building Robust Software with Contract-First DevelopmentDesign by Contract - Building Robust Software with Contract-First Development
Design by Contract - Building Robust Software with Contract-First Development
Par-Tec S.p.A.
 
Neuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
Neuralink TemplateeeeeeeeeeeeeeeeeeeeeeeeeeNeuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
Neuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
alexandernoetzold
 
Rebuilding Cadabra Studio: AI as Our Core Foundation
Rebuilding Cadabra Studio: AI as Our Core FoundationRebuilding Cadabra Studio: AI as Our Core Foundation
Rebuilding Cadabra Studio: AI as Our Core Foundation
Cadabra Studio
 
Essentials of Resource Planning in a Downturn
Essentials of Resource Planning in a DownturnEssentials of Resource Planning in a Downturn
Essentials of Resource Planning in a Downturn
OnePlan Solutions
 
OpenTelemetry 101 Cloud Native Barcelona
OpenTelemetry 101 Cloud Native BarcelonaOpenTelemetry 101 Cloud Native Barcelona
OpenTelemetry 101 Cloud Native Barcelona
Imma Valls Bernaus
 
Online Queue Management System for Public Service Offices [Focused on Municip...
Online Queue Management System for Public Service Offices [Focused on Municip...Online Queue Management System for Public Service Offices [Focused on Municip...
Online Queue Management System for Public Service Offices [Focused on Municip...
Rishab Acharya
 

Parallelize R Code Using Apache Spark

  • 1. Parallelize R code Using Apache Spark Hossein Falaki @mhfalaki
  • 2. About me • Former Data Scientist at Apple Siri • Software Engineer at Databricks • Started using Apache Spark since version 0.6 • Developed first version of Apache Spark CSV data source • Worked on SparkR & Databricks R Notebooks • Currently focusing on R experience at Databricks
  • 3. What is SparkR An R package distributed with Apache Spark: • Provides R front-endto Apache Spark • Exposes Spark DataFrames (inspired by R and Pandas) • Convenient interoperability between R and Spark DataFrames robust distributed processing, data source, off- memory data dynamic environment, interactivity, +10K packages, visualizations +
  • 5. SparkR architecture (2.x) Spark Driver JVM Worker JVM Worker DataSources JVMR RBackend R R R R
  • 6. IO read.df / write.df / createDataFrame / collect Caching cache / persist / unpersist / cacheTable / uncacheTable SQL sql / table / saveAsTable / registerTempTable / tables Overview of SparkR API http://spark.apache.org/docs/latest/api/R/ ML Lib glm / kmeans / Naïve Bayes Survival regression DataFrame API select / subset / groupBy / head / avg / column / dim UDF functionality (since 2.0) spark.lapply/ dapply / gapply / dapplyCollect
  • 7. SparkR UDF API spark.lapply Runs a function over a list of elements spark.lapply() dapply Appliesa function to each partition of a SparkDataFrame dapply() dapplyCollect() gapply Appliesa function to each group within a SparkDataFrame gapply() gapplyCollect()
  • 8. spark.lapply Simplest SparkR UDF patter For each element of a list 1. Sends the function to an R worker 2. Executesthe function 3. Returns the result of all workers as a list to R driver spark.lapply(1:100, function(x) { runBootstrap(x) }
  • 9. spark.lapply control flow RWorker JVMR Driver JVM 1. serialize R closure 4. transfer over local socket 7. serialize result 2. transfer over local socket 8. transfer over local socket 10. transfer over local socket 11. de-serialize result 9. Transfer serialized closureover thenetwork 3. Transfer serialized closureover thenetwork 5. de-serialize closure 6.Execution
  • 10. dapply For each partition of a Spark DataFrame 1. Collects each partition as an R data.frame 2. Sends the R function to the R worker 3. Executesthe function dapply(sparkDF, func, schema) combines resultsas DataFrame with provided schema dapplyCollect(sparkDF, func) combines resultsas R data.frame
  • 11. dapply control & data flow RWorker JVMR Driver JVM localsocket cluster network localsocket input data ser/de transfer result data ser/de transfer
  • 12. dapply control & data flow RWorker JVMR Driver JVM localsocket cluster network localsocket input data ser/de transfer result data ser/de transferresult deserialize
  • 13. gapply Groups a Spark DataFrame on one or more columns 1. Collects each group as an R data.frame 2. Sends the R function to the R worker 3. Executesthe function gapply(sparkDF, func, schema) combines resultsas DataFrame with provided schema gapplyCollect(sparkDF, func) combines resultsas R data.frame
  • 14. dapply control & data flow RWorker JVMR Driver JVM localsocket cluster network localsocket input data ser/de transfer result data ser/de transferresult deserialize data shuffle
  • 15. gapply dapply signature gapply(df, cols, func, schema) gapply(gdf, func, schema) dapply(df, func, schema) user function signature function(key, data) function(data) data partition controlled by grouping not controlled
  • 16. Parallelizing data • Do not use spark.lapply() to distribute large data sets • Do not pack data in the closure • Watch for skew in data – Are partitions evenlysized? • Auxiliary data – Can be joined with input DataFrame – Can be distributed to all the workers
  • 17. Packages on workers • SparkR closure capture does not include packages • You need to import package son each worker inside your function • You need to install packages on workers – spark.lapply() can be used to install packages
  • 18. Debugging user code 1. Verify code on the driver 2. Interactively execute code on the cluster • When R worker fails, Spark Driver throws exception with R error text 3. Inspect details of failure of failed job in Spark UI 4. Inspect stdout/stderrr of worker
  • 19. Demo