Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley

Apache Spark MLlib's
past trajectory and new directions
Joseph K. Bradley
Spark Summit 2017

2
About me
Software engineer at Databricks
Apache Spark committer & PMC member
Ph.D. in Machine Learning from Carnegie Mellon

3
About Databricks
TEAM
Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple

4
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
•  Collaborative cloud environment
•  Free version (community edition)
DATABRICKS RUNTIME 3.0
•  Apache Spark - optimized for the cloud
•  Caching and optimization layer - DBIO
•  Enterprise security - DBES
Try for free today.
databricks.com

5
5 years ago…
Zaharia et al.
NSDI, 2012
Scalable ML
using Spark!

6
3 ½ years ago until today
People added some algorithms…
Classification
•  Logistic regression & linear SVMs
•  L1, L2, or elastic net regularization
•  Decision trees
•  Random forests
•  Gradient-boosted trees
•  Naive Bayes
•  Multilayer perceptron
•  One-vs-rest
•  Streaming logistic regression
Regression
•  Least squares
•  L1, L2, or elastic net regularization
•  Decision trees
•  Random forests
•  Gradient-boosted trees
•  Isotonic regression
•  Streaming least squares
Feature extraction, transformation & selection
•  Binarizer
•  Bucketizer
•  Chi-Squared selection
•  CountVectorizer
•  Discrete cosine transform
•  ElementwiseProduct
•  Hashing term frequency
•  Inverse document frequency
•  MinMaxScaler
•  Ngram
•  Normalizer
•  VectorAssembler
•  VectorIndexer
•  VectorSlicer
•  Word2Vec
Clustering
•  Gaussian mixture models
•  K-Means
•  Streaming K-Means
•  Latent Dirichlet Allocation
•  Power Iteration Clustering
Recommendation
•  Alternating Least Squares (ALS)
Frequent itemsets
•  FP-growth
•  Prefix span
Statistics
•  Pearson correlation
•  Spearman correlation
•  Online summarization
•  Chi-squared test
•  Kernel density estimation
Linear Algebra
•  Local & distributed dense & sparse matrices
•  Matrix decompositions (PCA, SVD, QR)

7
Today
Apache Spark is widely used for ML.
•  Many production use cases
•  1000s of commits
•  100s of contributors
•  10,000s of users (on Databricks alone)

8
This talk
What?
How has MLlib developed, and what has it become?
So what?
What are the implications for the dev & user communities?
Now what?
Where could or should it go?

9
What?
How has MLlib developed, and what has it become?

10
Let’s do some data analytics
50+ algorithms and featurizers
Rapid development
Shift from adding algorithms to improving algorithms & infrastructure
Growing dev community

14
Major projects
Timeline by release: 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0, 2.1, master
DataFrame-based API
ML persistence
Trees & ensembles
GLMs
SparkR
Pipelines
Featurizers
Note: These are unofficial “project” labels based on JIRA/PR activity.

15
So what?
What are the implications for the dev & user communities?

16
Integrating ML with the Big Data world
Seamless integration with SQL, DataFrames, Graphs, Streaming,
both within MLlib…
Data sources: CSV, JSON, Parquet, …
DataFrames: Simple, scalable pre/post-processing
Graphs: Graph-based ML implementations
Streaming: ML models deployed with Streaming

17
Integrating ML with the Big Data world
Seamless integration with SQL, DataFrames, Graphs, Streaming,
both within MLlib…and outside MLlib
E.g., on spark-packages.org
•  scikit-learn integration package (“spark-sklearn”)
•  Stanford CoreNLP integration (“spark-corenlp”)
Deep Learning libraries
•  Deep Learning Pipelines (“spark-deep-learning”)
•  BigDL
•  TensorFlowOnSpark
•  Distributed training with Keras (“dist-keras”)
Framework for integrating non-“big data” or specialized ML libraries

18
Scaling
Workflow scalability
Same code on laptop with small data → cluster with big data
Big data scalability
Scale-out implementations
4x faster than xgboost
8x slower than xgboost
Sources: Abuzaid et al. (2016); Chen & Guestrin (2016)

19
Building workflows
Simple concepts: Transformers, Estimators, Models, Evaluators
•  Unified, standardized APIs, including for algorithm parameters
Pipelines
•  Repeatable workflows across Spark deployments and languages
•  DataFrame APIs and optimizations
•  Automated model tuning
Workflow stages: ETL, Featurization, Training, Evaluation, Tuning
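
To make the four concepts on this slide concrete, here is a minimal pure-Python sketch of how they relate. It is not the MLlib API: every class and function name below (`Scaler`, `MeanThresholdClassifier`, `accuracy`, and so on) is a hypothetical stand-in, and plain lists of dicts stand in for DataFrames. The idea it tries to show is that an Estimator's `fit` returns a Model, a Model is itself a Transformer, and a Pipeline chains both kinds of stage.

```python
# Toy illustration only. These classes are hypothetical stand-ins,
# not the pyspark.ml classes, and rows are plain dicts rather than DataFrames.

class Transformer:
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    def fit(self, rows):
        raise NotImplementedError


class Scaler(Transformer):
    """Featurization stage: writes x * factor into a new "feature" column."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [dict(r, feature=r["x"] * self.factor) for r in rows]


class ThresholdModel(Transformer):
    """A fitted model is a Transformer that appends a "prediction" column."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(r, prediction=1 if r["feature"] > self.threshold else 0)
                for r in rows]


class MeanThresholdClassifier(Estimator):
    """Estimator: learns a threshold halfway between the two class means."""

    def fit(self, rows):
        pos = [r["feature"] for r in rows if r["label"] == 1]
        neg = [r["feature"] for r in rows if r["label"] == 0]
        return ThresholdModel((sum(pos) / len(pos) + sum(neg) / len(neg)) / 2)


class PipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Fits each Estimator stage in order, passing transformed rows along."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)
        return PipelineModel(fitted)


def accuracy(rows):
    """Evaluator: fraction of rows whose prediction matches the label."""
    return sum(r["prediction"] == r["label"] for r in rows) / len(rows)


train = [{"x": 1.0, "label": 0}, {"x": 2.0, "label": 0},
         {"x": 8.0, "label": 1}, {"x": 9.0, "label": 1}]
model = Pipeline([Scaler(0.1), MeanThresholdClassifier()]).fit(train)
predictions = model.transform(train)
train_accuracy = accuracy(predictions)
```

Because every stage exposes the same small interface, a tuning loop only has to call `fit`, `transform`, and an evaluator over candidate parameter settings, which is roughly what makes automated model tuning possible on top of Pipelines.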

20
Now what?
Where could or should it go?

21
Top items from the dev@ list…
• Continued speed & scalability improvements
• Enhancements to core algorithms
• ML building blocks: linear algebra, optimization
• Extensible and customizable APIs
Goals
Scale-out ML
Standard library
Extensible API
These are unofficial goals based on my experience + discussions on the dev@ mailing list.

22
Scale-out ML
General improvements
• DataFrame-based implementations
Targeted improvements
• Gradient-boosted trees: feature subsampling
• Linear models: vector-free L-BFGS for billions of features
• …
E.g., 10x scaling for Connected Components in GraphFrames

23
Standard library
General improvements
•  NaN/null handling
•  Instance weights
•  Warm starts / initial models
•  Multi-column feature transforms
Algorithm-specific improvements
•  Trees: access tree structure from Python
•  …

24
Extensible and customizable APIs
MLlib has ~50 algorithms.
CRAN lists 10,750 packages.
→ Users must be able to modify algorithms or write their own.
(Figure: plot with axis labels “algorithms” and “usage”)

25
Spark Packages
340+ packages written for Spark
80+ packages for ML and Graphs
E.g.:
• GraphFrames (“graphframes”): DataFrame-based graphs
• Bisecting K-Means (“bisecting-kmeans”): now part of MLlib
• Stanford CoreNLP wrapper (“spark-corenlp”): UDFs for NLP
spark-packages.org

26
How to write a custom algorithm
You need to:
•  Extend an API like Transformer, Estimator, Model, or Evaluator
MLlib provides some help:
•  UnaryTransformer
•  DefaultParamsWritable/Readable
•  defaultCopy
You get:
•  Natural integration with DataFrames, Pipelines, model tuning
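
As a rough illustration of what a helper like UnaryTransformer saves you, the sketch below mimics the idea in plain Python. In Spark's Scala API, `UnaryTransformer` asks the subclass for a `createTransformFunc` and an `outputDataType` and handles the column plumbing itself. The Python class here is a hypothetical stand-in written for this example (including the `create_transform_func` and `copy` names), with lists of dicts in place of DataFrames.

```python
# Hypothetical stand-in for the UnaryTransformer idea; not the MLlib class.

class UnaryTransformer:
    """Maps one input column to one output column.

    Subclasses supply only the per-value function; the column handling
    and copying live here, which is the boilerplate the helper removes.
    """

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def create_transform_func(self):
        raise NotImplementedError

    def transform(self, rows):
        func = self.create_transform_func()
        # Build new row dicts so the input rows are left untouched.
        return [dict(r, **{self.output_col: func(r[self.input_col])})
                for r in rows]

    def copy(self, **overrides):
        """Return a new instance with some parameters replaced."""
        params = {"input_col": self.input_col, "output_col": self.output_col}
        params.update(overrides)
        return type(self)(**params)


class LowercaseTokenizer(UnaryTransformer):
    """A custom algorithm: only the transform function is written by hand."""

    def create_transform_func(self):
        return lambda text: text.lower().split()


rows = [{"text": "Scalable ML using Spark"}]
tokenizer = LowercaseTokenizer("text", "tokens")
tokenized = tokenizer.transform(rows)
renamed = tokenizer.copy(output_col="words").transform(rows)
```

The custom class is four lines long because everything generic sits in the base class, which is the design choice the slide points at.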

27
Challenges in customization
Modify existing algorithms with custom:
•  Loss functions
•  Optimizers
•  Stopping criteria
•  Callbacks
Implement new algorithms without worrying about:
•  Boilerplate
•  Python API
•  ML building blocks
•  Feature attributes/metadata
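
One way to picture the customization hooks listed above is a training loop that takes the loss function, stopping criterion, and callback as arguments. The sketch below is a generic pure-Python gradient descent on a one-parameter linear model, written only to illustrate that shape. It is not how MLlib's optimizers are implemented, and the names `fit`, `squared_loss`, `should_stop`, and `callback` are assumptions made for this example.

```python
# Illustrative only: a one-parameter model y ~ w * x trained by plain
# gradient descent, with the loss, stopping rule, and callback pluggable.

def squared_loss(w, x, y):
    """Return (loss, gradient w.r.t. w) for 0.5 * (w * x - y) ** 2."""
    err = w * x - y
    return 0.5 * err * err, err * x


def fit(data, loss_fn, step=0.1, max_iter=1000, should_stop=None, callback=None):
    w, prev_loss = 0.0, None
    for iteration in range(max_iter):
        # Average the per-example losses and gradients at the current w.
        pairs = [loss_fn(w, x, y) for x, y in data]
        loss = sum(l for l, _ in pairs) / len(data)
        grad = sum(g for _, g in pairs) / len(data)
        if callback is not None:
            callback(iteration, w, loss)          # custom callback hook
        if should_stop is not None and should_stop(prev_loss, loss):
            break                                 # custom stopping criterion
        w -= step * grad
        prev_loss = loss
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]       # generated from y = 2 * x
history = []
w = fit(
    data,
    squared_loss,                                  # custom loss function
    should_stop=lambda prev, cur: prev is not None and prev - cur < 1e-12,
    callback=lambda i, w_now, loss: history.append(loss),
)
```

Swapping in a different `loss_fn` or stopping rule requires no change to `fit` itself, which is the kind of extensibility the slide asks for.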

28
Example of MLlib APIs
Deep Learning Pipelines for Apache Spark
spark-packages.org/package/databricks/spark-deep-learning
APIs are great for users:
•  Plug & play algorithms
•  Standard API
But challenges remain:
•  Development requires expertise
•  Community under development

29
My challenge to you
Can we fill out the long tail of ML use cases?
(Figure: long-tail plot with axis labels “algorithms” and “usage”)

30
Resources
Spark Packages: spark-packages.org
•  Plugin for building packages:
https://github.com/databricks/sbt-spark-package
•  Tool for generating package templates:
https://github.com/databricks/spark-package-cmd-tool
Example packages:
•  scikit-learn integration (“spark-sklearn”) – Python-only
•  GraphFrames – Scala/Java/Python

Thank you!
Office hours today @ 3:50pm at Databricks booth
Twitter: @jkbatcmu
