
Apache Spark and Scala

Reynold Xin @rxin


2017-10-22, Scala 2017
Apache Spark

Started at UC Berkeley around 2010

Most popular and de facto standard framework in big data

One of the largest OSS projects written in Scala (but with user-facing
APIs in Scala, Java, Python, R, SQL)

Many companies introduced to Scala due to Spark


whoami

Databricks co-founder & Chief Architect


- Designed most of the major things in “modern day” Spark
- #1 contributor to Spark by commits and net lines deleted

UC Berkeley PhD in databases (on leave since 2013)


My Scala / PL background

Working with Scala day-to-day since 2010; previously mostly C, C++, Java, Python, Tcl …

Authored the “Databricks Scala Style Guide”, whose philosophy is roughly “Scala as a better Java”.

No PL background; from a PL perspective, I think mostly based on experience and use cases, not first principles.
“How do you compare this with X? Wasn’t this done in X in the 80s?”
Today’s Talk

Some archaeology
- IMS, relational databases
- MapReduce
- data frames

Last 7 years of Spark evolution (along with what Scala has enabled)
Databases
IBM IMS hierarchical database (1966)

Image from https://stratechery.com/2016/oracles-cloudy-future/


“Future users of large data banks must be protected from having to know how the data is organized in the machine. … most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

– E. F. Codd, “A Relational Model of Data for Large Shared Data Banks” (1970)
Two important ideas in RDBMS

Physical Data Independence: the ability to change the physical data layout without having to change the logical schema.

Declarative Query Language: the programmer specifies “what” rather than “how”.
Why?

Business applications outlive the environments they were created in:

- New requirements might surface
- Underlying hardware might change
- These require physical layout changes (indexing, different storage media, etc.)

Enabled a tremendous amount of innovation:

- Indexes, compression, column stores, etc.
Relational Database Pros vs Cons

+ Declarative and data independent
+ SQL is the universal interface everybody knows

- SQL is not a “real” PL
- Difficult to compose & build complex applications
- Lack of testing frameworks, IDEs
- Too opinionated and inflexible
- Require data modeling before putting any data in
Big Data, MapReduce,
Hadoop
The Big Data Problem

Semi-/Un-structured data doesn’t fit well with databases

Single machine can no longer process or even store all the data!

Only solution is to distribute general storage & processing over clusters.
Google Datacenter

How do we program this thing?

Data-Parallel Models

Restrict the programming interface so that the system can do more automatically

“Here’s an operation, run it on all of the data”


- I don’t care where it runs (you schedule that)
- In fact, feel free to run it twice on different nodes
- Leverage key concepts in functional programming
- Similar to “declarative programming” in databases
MapReduce Pros vs Cons

+ Massively parallel
+ Flexible programming model & schema-on-read
+ Type-safe programming language (great for large eng projects)
- Bad performance
- Extremely verbose
- Hard to compose, while most real apps require multiple MR steps
- 21 MR steps -> 21 mapper and reducer classes
R, Python, data frame
Data frames in R / Python

Developed by the stats community; concise syntax for ad-hoc analysis

Procedural (not declarative)

> head(filter(df, df$waiting < 50)) # an example in R


## eruptions waiting
##1 1.750 47
##2 1.750 47
##3 1.867 48
Traditional data frames

+ Built on “real” programming languages
+ Easier to learn

- No parallelism & doesn’t work well on medium/big data
- Lack sophisticated query optimization
- No compile-time type safety (great for data science, not so great for data eng)
“Are you going to talk
about Spark at all!?”
Which one is better?
Databases, R, MapReduce?
Declarative, functional, procedural?
A slide from 2013 …
Spark’s initial focus: a better MapReduce

Language-integrated API (RDD): similar to Scala’s collection library using functional programming; incredibly powerful and composable

val lines = spark.textFile("hdfs://...")           // RDD[String]
val points = lines.map(line => parsePoint(line))   // RDD[Point]
points.filter(p => p.x > 100).count()

Better performance: through a more general DAG abstraction, faster scheduling, and in-memory caching (i.e. “100X faster than Hadoop”)
Programmability

WordCount in 3 lines of Spark

WordCount in 50+ lines of Java MR
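
A minimal sketch of the Spark version (assuming sc is an already-created SparkContext and the HDFS path is a placeholder):

val counts = sc.textFile("hdfs://...")   // RDD[String]
  .flatMap(line => line.split(" "))      // split lines into words
  .map(word => (word, 1))                // pair each word with a count of 1
  .reduceByKey(_ + _)                    // RDD[(String, Int)]: word -> count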


Why Scala (circa 2010)?

JVM-based, integrates well with existing Hadoop stack

Concise syntax

Interactive REPL
Challenge 1. Lack of Structure

Most data is structured (JSON, CSV, Parquet, Avro, …)


• Defining case classes for every step is too verbose
• Programming RDDs inevitably ends up with a lot of tuples (_1, _2, …)

Functional transformations not as intuitive to data scientists


• E.g. map, reduce
// RDD API: average age per department
data
  .map(x => (x.dept, (x.age, 1)))
  .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))
  .map { case (k, v) => (k, v._1.toDouble / v._2) }
  .collect()

// Equivalent with the DataFrame API
data.groupBy("dept").avg()
Challenge 2. Performance

Closures are black boxes to Spark, and can’t be optimized

On data-heavy computation, small overheads add up


• Iterators
• Null checks
• Physical immutability, object allocations

Python/R (the data science languages) are 10X slower than Scala


Demo

[Chart: runtime to count 1 billion elements (secs), comparing the RDD API with the DataFrame API]
Solution:

Structured APIs
DataFrames + Spark SQL
DataFrames and Spark SQL

Efficient library for structured data (data with a known schema)


• Two interfaces: SQL for analysts + apps, DataFrames for programmers

Optimized computation and storage, similar to an RDBMS

(Spark SQL paper, SIGMOD 2015)
Execution Steps

SQL or DataFrames -> Logical Plan -> Optimizer -> Physical Plan -> Code Generator -> RDDs

(the planner consults the Catalog and the Data Source API along the way)
DataFrame API

DataFrames hold rows with a known schema and offer relational operations on them through a DSL

val users = spark.sql("select * from users")

// users("country") === "Canada" builds an expression AST, not a Boolean
val massUsers = users.filter(users("country") === "Canada")

massUsers.count()

massUsers.groupBy("name").avg("age")
Spark RDD Execution

The Java/Scala frontend and the Python frontend both hand the engine opaque closures (user-defined functions); the Java/Scala closures run on the JVM backend, and the Python closures run on a separate Python backend.
Spark DataFrame Execution

Python DF / Java/Scala DF / R DF: simple wrappers that create a logical plan
-> Logical Plan: the intermediate representation for the computation
-> Catalyst optimizer
-> Physical execution
Structured API Example

DataFrame API:

events = sc.read.json("/logs")

stats = events.join(users)
  .groupBy("loc", "status")
  .avg("duration")

errors = stats.where(stats.status == "ERR")
...

Optimized Plan:

SCAN logs -> FILTER -> JOIN (with SCAN users) -> AGG

Generated Code*:

while(logs.hasNext) {
  e = logs.next
  if(e.status == "ERR") {
    u = users.get(e.uid)
    key = (u.loc, e.status)
    sum(key) += e.duration
    count(key) += 1
  }
}

* Thomas Neumann. Efficiently Compiling Efficient Query Plans for Modern Hardware. VLDB 2011.
What has Scala enabled?

Spark becomes effectively a compiler.

Pattern matching, case classes, tree manipulation invaluable.

Much more difficult to express the compiler part in Java.
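
A toy sketch of why (illustrative names only, not Spark’s actual Catalyst classes): optimizer rules become small recursive pattern matches over an immutable tree of case classes.

// Hypothetical expression tree plus a constant-folding rule
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

def constantFold(e: Expr): Expr = e match {
  case Add(Literal(a), Literal(b)) => Literal(a + b)   // fold two constants
  case Add(l, r) =>                                    // fold children, then retry
    val folded = Add(constantFold(l), constantFold(r))
    if (folded != e) constantFold(folded) else folded
  case other => other                                  // leave everything else alone
}

// constantFold(Add(Literal(1), Add(Literal(2), Literal(3)))) == Literal(6)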


Type-safety strikes back

DataFrames are type-checked only at runtime, which makes it harder to ensure correctness for large data engineering pipelines.

They also lack the ability to reuse existing classes and functions.

Datasets
Dataset API

Runs on the same optimizer and execution engine as DataFrames

An “Encoder” (supplied via context bounds) describes the structure of user-defined classes to Spark and code-generates their serializers.
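
A minimal sketch (assuming a SparkSession named spark and a JSON file whose fields match the case class):

case class Event(uid: Long, status: String, duration: Double)

import spark.implicits._                      // derives an Encoder[Event]

val events = spark.read.json("/logs/events.json").as[Event]   // Dataset[Event]

// Typed, compile-time-checked transformations that still run on the
// DataFrame optimizer and execution engine
val errors = events.filter(e => e.status == "ERR")
errors.groupBy("status").avg("duration").show()
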
What are Spark’s structured APIs?

Multi-faceted APIs for different big data use cases:


- SQL: “lingua franca” of data analysis
- R / Python: data science
- Scala Dataset API: type safety for data engineering

Internals that achieve this:


- declarativity & data independence from databases – easy to optimize
- flexibility & parallelism from MapReduce – massively scalable & flexible
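
To make the multi-faceted point concrete, here is the same aggregation through two of the frontends (a sketch, assuming a SparkSession named spark and a registered events table); both reach the same optimized plan:

val bySql = spark.sql(
  "SELECT loc, status, avg(duration) FROM events GROUP BY loc, status")

val byApi = spark.table("events")
  .groupBy("loc", "status")
  .avg("duration")
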
Future possibilities from decoupled frontend/backend

Spark as a fast, multi-core data collection library


- Spark running on my laptop is already much faster than Pandas

Spark as a performant streaming engine

Spark as a GPU/vectorized engine

All using the same API


No language is perfect, but things
I wished were designed differently in Scala
(I realize most of them have trade-offs that are difficult to make)
Binary Compatibility

Scala’s own binary compatibility (2.9 -> 2.10 -> 2.11 -> 2.12 …)
- Huge maintenance cost for PaaS provider (Databricks)

Case classes
- Incredibly powerful for internal use, but virtually impossible to guarantee forward compatibility (e.g. adding a field); see the sketch below

Traits with default implementations
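
A small illustration of the case-class problem (hypothetical class, not from Spark):

// v1 of a public API
case class UserEvent(uid: Long, status: String)

// v2 adds a field:
//   case class UserEvent(uid: Long, status: String, duration: Long = 0L)
// Even with a default value, the generated apply/copy/constructor signatures
// change, and patterns such as `case UserEvent(uid, status) => ...` written
// against v1 no longer compile against v2, so forward compatibility is lost.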


Java APIs

Spark defines one API usable for both Scala and Java
- Everything needs to be defined twice (APIs, tests)
- Have to use weird return types, e.g. array
- Docs don’t work for Java
- Kotlin’s idea to reuse Java collection library can simplify this (although it
might come with other hassles)
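
A small sketch of the effect (hypothetical builder class, not Spark’s actual API): JVM-native types such as Array get chosen because they are directly usable from both languages, even where a Scala Seq would read better.

class Pipeline {
  private var stages: Array[String] = Array.empty

  // Array is clumsy in Scala (mutable, invariant) but trivially usable from
  // Java; Seq[String] would force Java callers through collection converters.
  def setStages(value: Array[String]): this.type = { stages = value; this }
  def getStages: Array[String] = stages
}
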
Exception Handling

Spark often uses lots of Java libraries, especially for disk I/O and networking

No good way to ensure exceptions are handled correctly:

- Create Scala shims for all libraries to turn return types into Try values
- Write low-level I/O code in Java and rely on checked exceptions
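
One flavor of the shim approach, sketched (hypothetical wrapper, not a Spark API):

import java.nio.file.{Files, Paths}
import scala.util.Try

object SafeIO {
  // Surface the failure in the return type instead of letting an
  // unchecked IOException escape unnoticed
  def readBytes(path: String): Try[Array[Byte]] =
    Try(Files.readAllBytes(Paths.get(path)))
}

// SafeIO.readBytes("/some/path").recover { case _ => Array.emptyByteArray }
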
Tooling so the project can be more opinionated

Need to restrict and enforce consistency


- Otherwise impossible to train 1000+ OSS contributors (or even 100+
employees) on all language features properly

Lack of great tooling to enforce standards or disable features


Recap

The latest Spark takes the best ideas from earlier systems


- data frame from R as the “interface” – easy to learn
- declarativity & data independence from databases -- easy to optimize &
future-proof
- parallelism from functional programming -- massively scalable & flexible

Scala’s a critical part of all of these!


Thank you & we are hiring!
@rxin
