
Apache Spark and Scala

Reynold Xin @rxin


2017-10-22, Scala 2017
Apache Spark

Started at UC Berkeley around 2010

Most popular and de facto standard framework in big data

One of the largest OSS projects written in Scala (but with user-facing
APIs in Scala, Java, Python, R, SQL)

Many companies introduced to Scala due to Spark


whoami

Databricks co-founder & Chief Architect


- Designed most of the major things in “modern day” Spark
- #1 contributor to Spark by commits and net lines deleted

UC Berkeley PhD in databases (on leave since 2013)


My Scala / PL background

Working with Scala day-to-day since 2010; previously mostly C, C++, Java, Python, Tcl …

Authored the “Databricks Scala Style Guide”, whose philosophy is roughly “Scala as a better Java”.

No PL background; from a PL perspective, I think mostly based on experience and use cases, not first principles.
“How do you compare this with X? Wasn’t this done in X in the 80s?”
Today’s Talk

Some archaeology
- IMS, relational databases
- MapReduce
- data frames

Last 7 years of Spark evolution (along with what Scala has enabled)
Databases
IBM IMS hierarchical database (1966)

Image from https://stratechery.com/2016/oracles-cloudy-future/


“Future users of large data banks must be protected from having to know how the data is organized in the machine. … most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

– E. F. Codd, “A Relational Model of Data for Large Shared Data Banks” (1970)
Two important ideas in RDBMS

Physical Data Independence: the ability to change the physical data layout without having to change the logical schema.

Declarative Query Language: the programmer specifies “what” rather than “how”.
Why?

Business applications outlive the environments they were created in:

- New requirements might surface
- Underlying hardware might change
- These require physical layout changes (indexing, different storage media, etc.)

Enabled a tremendous amount of innovation:

- Indexes, compression, column stores, etc.
Relational Database Pros vs Cons

+ Declarative and data independent
+ SQL is the universal interface everybody knows

- SQL is not a “real” PL
- Difficult to compose & build complex applications
- Lack of testing frameworks, IDEs
- Too opinionated and inflexible
- Require data modeling before putting any data in
Big Data, MapReduce,
Hadoop
The Big Data Problem

Semi-/Un-structured data doesn’t fit well with databases

Single machine can no longer process or even store all the data!

Only solution is to distribute general storage & processing over clusters.
Google Datacenter

How do we program this thing?

Data-Parallel Models

Restrict the programming interface so that the system can do more automatically

“Here’s an operation, run it on all of the data”


- I don’t care where it runs (you schedule that)
- In fact, feel free to run it twice on different nodes
- Leverage key concepts in functional programming
- Similar to “declarative programming” in databases
MapReduce Pros vs Cons

+ Massively parallel
+ Flexible programming model & schema-on-read
+ Type-safe programming language (great for large eng projects)
- Bad performance
- Extremely verbose
- Hard to compose, while most real apps require multiple MR steps
- 21 MR steps -> 21 mapper and reducer classes
R, Python, data frame
Data frames in R / Python

Developed by the stats community; concise syntax for ad-hoc analysis

Procedural (not declarative)

> head(filter(df, df$waiting < 50)) # an example in R


## eruptions waiting
##1 1.750 47
##2 1.750 47
##3 1.867 48
Traditional data frames

+ Built on “real” programming languages
+ Easier to learn

- No parallelism & doesn’t work well on medium/big data
- Lack sophisticated query optimization
- No compile-time type safety (great for data science, not so great for data eng)
“Are you going to talk
about Spark at all!?”
Which one is better?
Databases, R, MapReduce?
Declarative, functional, procedural?
A slide from 2013 …
Spark’s initial focus: a better MapReduce

Language-integrated API (RDD): similar to Scala’s collection library using functional programming; incredibly powerful and composable

val lines = spark.textFile("hdfs://...")           // RDD[String]
val points = lines.map(line => parsePoint(line))   // RDD[Point]
points.filter(p => p.x > 100).count()

Better performance: through a more general DAG abstraction, faster scheduling, and in-memory caching (i.e. “100X faster than Hadoop”)
Programmability

WordCount in 3 lines of Spark

WordCount in 50+ lines of Java MR
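
A minimal sketch of the Spark version (assuming sc is an already-created SparkContext and the HDFS path is a placeholder):

val counts = sc.textFile("hdfs://...")   // RDD[String]
  .flatMap(line => line.split(" "))      // split lines into words
  .map(word => (word, 1))                // pair each word with a count of 1
  .reduceByKey(_ + _)                    // RDD[(String, Int)]: word -> count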


Why Scala (circa 2010)?

JVM-based, integrates well with existing Hadoop stack

Concise syntax

Interactive REPL
Challenge 1. Lack of Structure

Most data is structured (JSON, CSV, Parquet, Avro, …)


• Defining case classes for every step is too verbose
• Programming RDDs inevitably ends up with a lot of tuples (_1, _2, …)

Functional transformations not as intuitive to data scientists


• E.g. map, reduce
// RDD API: average age per department
data
  .map(x => (x.dept, (x.age, 1)))
  .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))
  .map { case (k, v) => (k, v._1.toDouble / v._2) }
  .collect()

// Equivalent with the DataFrame API
data.groupBy("dept").avg()
Challenge 2. Performance

Closures are black boxes to Spark, and can’t be optimized

On data-heavy computation, small overheads add up


• Iterators
• Null checks
• Physical immutability, object allocations

Python/R (the data science languages) are 10X slower than Scala


Demo

[Chart: runtime to count 1 billion elements (secs), comparing the RDD API with the DataFrame API]
Solution:

Structured APIs
DataFrames + Spark SQL
DataFrames and Spark SQL

Efficient library for structured data (data with a known schema)


• Two interfaces: SQL for analysts + apps, DataFrames for programmers

Optimized computation and storage, similar to an RDBMS

(Spark SQL paper, SIGMOD 2015)
Execution Steps

SQL or DataFrames -> Logical Plan -> Optimizer -> Physical Plan -> Code Generator -> RDDs

(the planner consults the Catalog and the Data Source API along the way)
DataFrame API

DataFrames hold rows with a known schema and offer relational operations on them through a DSL

val users = spark.sql("select * from users")

// users("country") === "Canada" builds an expression AST, not a Boolean
val massUsers = users.filter(users("country") === "Canada")

massUsers.count()

massUsers.groupBy("name").avg("age")
Spark RDD Execution

The Java/Scala frontend and the Python frontend both hand the engine opaque closures (user-defined functions); the Java/Scala closures run on the JVM backend, and the Python closures run on a separate Python backend.
Spark DataFrame Execution

Python DF / Java/Scala DF / R DF: simple wrappers that create a logical plan
-> Logical Plan: the intermediate representation for the computation
-> Catalyst optimizer
-> Physical execution
Structured API Example

DataFrame API:

events = sc.read.json("/logs")

stats = events.join(users)
  .groupBy("loc", "status")
  .avg("duration")

errors = stats.where(stats.status == "ERR")
...

Optimized Plan:

SCAN logs -> FILTER -> JOIN (with SCAN users) -> AGG

Generated Code*:

while(logs.hasNext) {
  e = logs.next
  if(e.status == "ERR") {
    u = users.get(e.uid)
    key = (u.loc, e.status)
    sum(key) += e.duration
    count(key) += 1
  }
}

* Thomas Neumann. Efficiently Compiling Efficient Query Plans for Modern Hardware. VLDB 2011.
What has Scala enabled?

Spark becomes effectively a compiler.

Pattern matching, case classes, tree manipulation invaluable.

Much more difficult to express the compiler part in Java.
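
A toy sketch of why (illustrative names only, not Spark’s actual Catalyst classes): optimizer rules become small recursive pattern matches over an immutable tree of case classes.

// Hypothetical expression tree plus a constant-folding rule
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

def constantFold(e: Expr): Expr = e match {
  case Add(Literal(a), Literal(b)) => Literal(a + b)   // fold two constants
  case Add(l, r) =>                                    // fold children, then retry
    val folded = Add(constantFold(l), constantFold(r))
    if (folded != e) constantFold(folded) else folded
  case other => other                                  // leave everything else alone
}

// constantFold(Add(Literal(1), Add(Literal(2), Literal(3)))) == Literal(6)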


Type-safety strikes back

DataFrames are type-checked only at runtime, which makes it harder to ensure correctness for large data engineering pipelines.

They also lack the ability to reuse existing classes and functions.

Datasets
Dataset API

Runs on the same optimizer and execution engine as DataFrames

An “Encoder” (supplied via context bounds) describes the structure of user-defined classes to Spark and code-generates their serializers.
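
A minimal sketch (assuming a SparkSession named spark and a JSON file whose fields match the case class):

case class Event(uid: Long, status: String, duration: Double)

import spark.implicits._                      // derives an Encoder[Event]

val events = spark.read.json("/logs/events.json").as[Event]   // Dataset[Event]

// Typed, compile-time-checked transformations that still run on the
// DataFrame optimizer and execution engine
val errors = events.filter(e => e.status == "ERR")
errors.groupBy("status").avg("duration").show()
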
What are Spark’s structured APIs?

Multi-faceted APIs for different big data use cases:


- SQL: “lingua franca” of data analysis
- R / Python: data science
- Scala Dataset API: type safety for data engineering

Internals that achieve this:


- declarativity & data independence from databases – easy to optimize
- flexibility & parallelism from MapReduce – massively scalable & flexible
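
To make the multi-faceted point concrete, here is the same aggregation through two of the frontends (a sketch, assuming a SparkSession named spark and a registered events table); both reach the same optimized plan:

val bySql = spark.sql(
  "SELECT loc, status, avg(duration) FROM events GROUP BY loc, status")

val byApi = spark.table("events")
  .groupBy("loc", "status")
  .avg("duration")
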
Future possibilities from decoupled frontend/backend

Spark as a fast, multi-core data collection library


- Spark running on my laptop is already much faster than Pandas

Spark as a performant streaming engine

Spark as a GPU/vectorized engine

All using the same API


No language is perfect, but things
I wished were designed differently in Scala
(I realize most of them have trade-offs that are difficult to make)
Binary Compatibility

Scala’s own binary compatibility (2.9 -> 2.10 -> 2.11 -> 2.12 …)
- Huge maintenance cost for PaaS provider (Databricks)

Case classes
- Incredibly powerful for internal use, but virtually impossible to guarantee forward compatibility (e.g. adding a field); see the sketch below

Traits with default implementations
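
A small illustration of the case-class problem (hypothetical class, not from Spark):

// v1 of a public API
case class UserEvent(uid: Long, status: String)

// v2 adds a field:
//   case class UserEvent(uid: Long, status: String, duration: Long = 0L)
// Even with a default value, the generated apply/copy/constructor signatures
// change, and patterns such as `case UserEvent(uid, status) => ...` written
// against v1 no longer compile against v2, so forward compatibility is lost.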


Java APIs

Spark defines one API usable for both Scala and Java
- Everything needs to be defined twice (APIs, tests)
- Have to use weird return types, e.g. array
- Docs don’t work for Java
- Kotlin’s idea to reuse Java collection library can simplify this (although it
might come with other hassles)
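
A small sketch of the effect (hypothetical builder class, not Spark’s actual API): JVM-native types such as Array get chosen because they are directly usable from both languages, even where a Scala Seq would read better.

class Pipeline {
  private var stages: Array[String] = Array.empty

  // Array is clumsy in Scala (mutable, invariant) but trivially usable from
  // Java; Seq[String] would force Java callers through collection converters.
  def setStages(value: Array[String]): this.type = { stages = value; this }
  def getStages: Array[String] = stages
}
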
Exception Handling

Spark often uses lots of Java libraries, especially for disk I/O and networking

No good way to ensure exceptions are handled correctly:

- Create Scala shims for all libraries to turn return types into Try values
- Write low-level I/O code in Java and rely on checked exceptions
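
One flavor of the shim approach, sketched (hypothetical wrapper, not a Spark API):

import java.nio.file.{Files, Paths}
import scala.util.Try

object SafeIO {
  // Surface the failure in the return type instead of letting an
  // unchecked IOException escape unnoticed
  def readBytes(path: String): Try[Array[Byte]] =
    Try(Files.readAllBytes(Paths.get(path)))
}

// SafeIO.readBytes("/some/path").recover { case _ => Array.emptyByteArray }
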
Tooling so the project can be more opinionated

Need to restrict and enforce consistency


- Otherwise impossible to train 1000+ OSS contributors (or even 100+
employees) on all language features properly

Lack of great tooling to enforce standards or disable features


Recap

The latest Spark takes the best ideas from earlier systems


- data frame from R as the “interface” – easy to learn
- declarativity & data independence from databases -- easy to optimize &
future-proof
- parallelism from functional programming -- massively scalable & flexible

Scala’s a critical part of all of these!


Thank you & we are hiring!
@rxin
