Parallelizing Existing R Packages with SparkR
Hossein Falaki
@mhfalaki
About me
• Former Data Scientist at Apple Siri
• Software Engineer at Databricks
• Has used Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR & the Databricks R Notebook feature
• Currently focusing on the R experience at Databricks
What is SparkR?
An R package distributed with Apache Spark (soon on CRAN):
- Provides an R frontend to Spark
- Exposes Spark DataFrames (inspired by R and Pandas)
- Convenient interoperability between R and Spark DataFrames
Spark: distributed/robust processing, data sources, off-memory data structures
+
R: dynamic environment, interactivity, packages, visualization
SparkR architecture
[Diagram: the Spark driver hosts an R process and a JVM connected through the R backend; the driver JVM coordinates JVM workers, which read from data sources.]
SparkR architecture (since 2.0)
[Diagram: the Spark driver hosts an R process and a JVM connected through the R backend; each JVM worker is paired with an R process and reads from data sources.]
Overview of SparkR API
IO: read.df / write.df / createDataFrame / collect
Caching: cache / persist / unpersist / cacheTable / uncacheTable
SQL: sql / table / saveAsTable / registerTempTable / tables
MLlib: glm / kmeans / Naïve Bayes / survival regression
DataFrame API: select / subset / groupBy / head / avg / column / dim
UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect
http://spark.apache.org/docs/latest/api/R/
Overview of SparkR API :: Session
The Spark session is your interface to Spark functionality in R
o SparkR DataFrames are implemented on top of Spark SQL tables
o All DataFrame operations go through a SQL optimizer (Catalyst)
o Since 2.0, sqlContext is wrapped in a new object called SparkSession
> spark <- sparkR.session()
All SparkR functions either take a session as an argument or assume an existing session.
Reading/Writing data
[Diagram: R process, R backend, and JVM on the driver; JVM workers; storage (HDFS/S3/…). read.df() and write.df() are shown between the JVM workers and storage.]
Moving data between R and JVM
[Diagram: SparkR::collect() moves data from the JVM to R; SparkR::createDataFrame() moves data from R to the JVM. Both go through the R backend.]
Overview of SparkR API :: DataFrame API
A SparkR DataFrame behaves much like an R data.frame
> sparkDF$newCol <- sparkDF$col + 1
> subsetDF <- sparkDF[, c("date", "type")]
> recentData <- subset(sparkDF, sparkDF$date == "2015-10-24")
> firstRow <- head(sparkDF, 1)
> names(subsetDF) <- c("Date", "Type")
> dim(recentData)
> head(count(group_by(subsetDF, "Date")))
Overview of SparkR API :: SQL
You can register a DataFrame as a table and query it in SQL
> logs <- read.df("data/logs", source = "json")
> registerTempTable(logs, "logsTable")
> errorsByCode <- sql("select count(*) as num, code from logsTable where type = 'error' group by code order by num desc")
> reviewsDF <- tableToDF("reviewsTable")
> registerTempTable(filter(reviewsDF, reviewsDF$rating == 5), "fiveStars")
Moving between languages
R (on Spark):
df <- read.df(...)
wiki <- filter(df, ...)
registerTempTable(wiki, "wiki")

Scala (on Spark):
val wiki = table("wiki")
val parsed = wiki.map {
  case Row(_, _, text: String, _, _) => text.split(' ')
}
val model = KMeans.train(parsed)
SparkR UDF API
spark.lapply(): runs a function over a list of elements
dapply() / dapplyCollect(): applies a function to each partition of a SparkDataFrame
gapply() / gapplyCollect(): applies a function to each group within a SparkDataFrame
spark.lapply
Simplest SparkR UDF pattern
For each element of a list:
1.  Sends the function to an R worker
2.  Executes the function
3.  Returns the results from all workers as a list to the R driver
spark.lapply(1:100, function(x) {
  runBootstrap(x)
})
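As a minimal sketch of the pattern above: the slide does not define runBootstrap, so run_bootstrap below is a hypothetical stand-in that draws one bootstrap replicate of the mean of mtcars$mpg. The Spark call is wrapped in a function (run_distributed, also a name chosen here) so the file can be sourced and checked on the driver without a cluster.

```r
# Hypothetical stand-in for the slide's runBootstrap(): one bootstrap
# replicate of the mean of mtcars$mpg, seeded by the list element x.
run_bootstrap <- function(x) {
  set.seed(x)
  mean(sample(datasets::mtcars$mpg, replace = TRUE))
}

# Local check on the driver with plain lapply; no Spark required.
local_results <- lapply(1:5, run_bootstrap)

# Distributed version. Assumes SparkR is installed and a Spark
# installation is reachable; call it only in that environment.
run_distributed <- function(n) {
  SparkR::sparkR.session()                         # start or reuse a session
  SparkR::spark.lapply(seq_len(n), run_bootstrap)  # returns an R list
}
```

Calling run_distributed(100) on a machine with Spark should return a list with one numeric value per element, in the same shape as the local lapply result.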
spark.lapply control flow
[Diagram: three R worker / JVM pairs and the R driver / JVM pair, annotated with the steps below.]
1. Serialize R closure
2. Transfer over local socket
3. Transfer serialized closure over the network
4. Transfer over local socket
5. Deserialize closure
6. Serialize result
7. Transfer over local socket
8. Transfer serialized result over the network
9. Transfer over local socket
10. Deserialize result
dapply
For each partition of a Spark DataFrame:
1.  collects each partition as an R data.frame
2.  sends the R function to the R worker
3.  executes the function
dapply(sparkDF, func, schema): combines results as a DataFrame with schema
dapplyCollect(sparkDF, func): combines results as an R data.frame
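The two call forms can be sketched as follows. The partition function add_kpl, the wrapper run_dapply, and the use of mtcars are assumptions made for illustration; the schema names the columns the function returns.

```r
# Partition function: receives one partition as an R data.frame and
# returns a data.frame whose columns match the declared schema.
add_kpl <- function(part) {
  data.frame(mpg = part$mpg, kpl = part$mpg * 0.425144)  # mpg (US) to km/l
}

# Verify the function locally on an ordinary data.frame first.
local_out <- add_kpl(datasets::mtcars[, "mpg", drop = FALSE])

# Assumes SparkR is installed and a Spark installation is reachable.
run_dapply <- function() {
  SparkR::sparkR.session()
  sdf <- SparkR::createDataFrame(datasets::mtcars[, "mpg", drop = FALSE])
  schema <- SparkR::structType(
    SparkR::structField("mpg", "double"),
    SparkR::structField("kpl", "double")
  )
  distributed <- SparkR::dapply(sdf, add_kpl, schema)  # SparkDataFrame
  collected <- SparkR::dapplyCollect(sdf, add_kpl)     # R data.frame
  list(distributed = distributed, collected = collected)
}
```

dapplyCollect needs no schema because the result is assembled directly as an R data.frame on the driver, so it only suits results small enough to fit there.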
dapply control & data flow
[Diagram: R workers and worker JVMs (local socket), cluster network, driver JVM and R driver (local socket). Steps shown: input data ser/de and transfer; result data ser/de and transfer.]
dapplyCollect control & data flow
[Diagram: R workers and worker JVMs (local socket), cluster network, driver JVM and R driver (local socket). Steps shown: input data ser/de and transfer; result transfer; result deserialization.]
gapply
Groups a Spark DataFrame on one or more columns:
1.  collects each group as an R data.frame
2.  sends the R function to the R worker
3.  executes the function
gapply(sparkDF, cols, func, schema): combines results as a DataFrame with schema
gapplyCollect(sparkDF, cols, func): combines results as an R data.frame
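A minimal sketch of a grouped function, assuming mtcars grouped by cyl. The names mean_mpg_by_group and run_gapply are hypothetical. The local check imitates the grouping with split() so the function can be verified without Spark.

```r
# Group function: key holds the grouping value(s) as a list, data holds
# the rows of that group as an R data.frame.
mean_mpg_by_group <- function(key, data) {
  data.frame(cyl = key[[1]], mean_mpg = mean(data$mpg))
}

# Local imitation of the grouping step, for verification on the driver.
groups <- split(datasets::mtcars, datasets::mtcars$cyl)
local_out <- do.call(rbind, lapply(groups, function(g) {
  mean_mpg_by_group(list(g$cyl[1]), g)
}))

# Assumes SparkR is installed and a Spark installation is reachable.
run_gapply <- function() {
  SparkR::sparkR.session()
  sdf <- SparkR::createDataFrame(datasets::mtcars)
  schema <- SparkR::structType(
    SparkR::structField("cyl", "double"),
    SparkR::structField("mean_mpg", "double")
  )
  distributed <- SparkR::gapply(sdf, "cyl", mean_mpg_by_group, schema)
  collected <- SparkR::gapplyCollect(sdf, "cyl", mean_mpg_by_group)
  list(distributed = distributed, collected = collected)
}
```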
gapply control & data flow
[Diagram: R workers and worker JVMs (local socket), cluster network, driver JVM and R driver (local socket). Steps shown: data shuffle; input data ser/de and transfer; result data ser/de and transfer.]
dapply vs. gapply
                          gapply                            dapply
signature                 gapply(df, cols, func, schema)    dapply(df, func, schema)
                          gapply(gdf, func, schema)
user function signature   function(key, data)               function(data)
data partition            controlled by grouping            not controlled
Parallelizing data
• Do not use spark.lapply() to distribute large data sets
• Do not pack data in the closure
• Watch for skew in data
– Are partitions evenly sized?
• Auxiliary data
– Can be joined with input DataFrame
– Can be distributed to all the workers using FileSystem
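One way to check whether partitions are evenly sized is to count rows per partition with dapplyCollect. This is a sketch rather than an established recipe: partition_sizes and skew_ratio are hypothetical helper names, and the ratio of largest partition to mean partition size is just one possible skew measure.

```r
# Ratio of the largest partition to the average partition size.
# A value near 1 means evenly sized partitions; larger means more skew.
skew_ratio <- function(rows) {
  max(rows) / mean(rows)
}

# Returns one row per partition with that partition's row count.
# sdf is a SparkDataFrame; assumes an active SparkR session.
partition_sizes <- function(sdf) {
  SparkR::dapplyCollect(sdf, function(part) {
    data.frame(rows = nrow(part))
  })
}

# Usage on a cluster (not run here):
# sizes <- partition_sizes(sdf)
# skew_ratio(sizes$rows)
```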
Packages on workers
• SparkR closure capture does not include packages
• You need to load packages on each worker inside your function
• If a package is not installed, install it on the workers out-of-band
• spark.lapply() can be used to install packages
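The last point can be sketched as below. ensure_package and install_on_workers are hypothetical helpers, and the CRAN mirror URL is an assumption. Spark does not guarantee that every worker receives one of the tasks, so treat this as best-effort rather than a reliable installer.

```r
# Load check with install fallback, run in whichever R process calls it.
# Returns TRUE when the package can be loaded afterwards.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# Local check on the driver with a package that ships with R.
local_ok <- ensure_package("stats")

# Runs ensure_package() in n_tasks Spark tasks. Assumes SparkR is
# installed and a Spark installation is reachable.
install_on_workers <- function(pkg, n_tasks = 8L) {
  SparkR::sparkR.session()
  SparkR::spark.lapply(seq_len(n_tasks), function(i) ensure_package(pkg))
}
```

Inside the function you later pass to spark.lapply, dapply, or gapply, the package still has to be loaded explicitly, for example with library() as the first statement of the function body.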
Debugging user code
1.  Verify your code on the driver
2.  Interactively execute the code on the cluster
–  When an R worker fails, the Spark driver throws an exception with the R error text
3.  Inspect the failure reason of the failed job in the Spark UI
4.  Inspect stdout/stderr of the workers
Demo
Notebooks available at:
•  http://bit.ly/2krYMwC
•  http://bit.ly/2ltLVKs
Thank you!
