Apache Spark MLlib's
past trajectory and new directions
Joseph K. Bradley
Spark Summit 2017
About me
Software engineer at Databricks
Apache Spark committer & PMC member
Ph.D. in Machine Learning from Carnegie Mellon
About Databricks
TEAM
Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
•  Collaborative cloud environment
•  Free version (community edition)
DATABRICKS RUNTIME 3.0
•  Apache Spark - optimized for the cloud
•  Caching and optimization layer - DBIO
•  Enterprise security - DBES
Try for free today.
databricks.com
5 years ago…
Zaharia et al., NSDI 2012
Scalable ML using Spark!
3 ½ years ago until today
People added some algorithms…
Classification
•  Logistic regression & linear SVMs
•  L1, L2, or elastic net regularization
•  Decision trees
•  Random forests
•  Gradient-boosted trees
•  Naive Bayes
•  Multilayer perceptron
•  One-vs-rest
•  Streaming logistic regression
Regression
•  Least squares
•  L1, L2, or elastic net regularization
•  Decision trees
•  Random forests
•  Gradient-boosted trees
•  Isotonic regression
•  Streaming least squares
Feature extraction, transformation & selection
•  Binarizer
•  Bucketizer
•  Chi-Squared selection
•  CountVectorizer
•  Discrete cosine transform
•  ElementwiseProduct
•  Hashing term frequency
•  Inverse document frequency
•  MinMaxScaler
•  NGram
•  Normalizer
•  VectorAssembler
•  VectorIndexer
•  VectorSlicer
•  Word2Vec
Clustering
•  Gaussian mixture models
•  K-Means
•  Streaming K-Means
•  Latent Dirichlet Allocation
•  Power Iteration Clustering
Recommendation
•  Alternating Least Squares (ALS)
Frequent itemsets
•  FP-growth
•  PrefixSpan
Statistics
•  Pearson correlation
•  Spearman correlation
•  Online summarization
•  Chi-squared test
•  Kernel density estimation
Linear Algebra
•  Local & distributed dense & sparse matrices
•  Matrix decompositions (PCA, SVD, QR)
Today
Apache Spark is widely used for ML.
•  Many production use cases
•  1000s of commits
•  100s of contributors
•  10,000s of users (on Databricks alone)
This talk
What?
How has MLlib developed, and what has it become?
So what?
What are the implications for the dev & user communities?
Now what?
Where could or should it go?
What?
How has MLlib developed, and what has it become?
Let’s do some data analytics
50+ algorithms and featurizers
Rapid development
Shift from adding algorithms to improving algorithms & infrastructure
Growing dev community
Major projects
[Timeline chart over releases 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0, 2.1, and master; projects shown:]
•  DataFrame-based API
•  ML persistence
•  Trees & ensembles
•  GLMs
•  SparkR
•  Pipelines
•  Featurizers
Note: These are unofficial “project” labels based on JIRA/PR activity.
So what?
What are the implications for the dev & user communities?
Integrating ML with the Big Data world
Seamless integration with SQL, DataFrames, Graphs, Streaming,
both within MLlib…
•  Data sources: CSV, JSON, Parquet, …
•  DataFrames: simple, scalable pre/post-processing
•  Graphs: graph-based ML implementations
•  Streaming: ML models deployed with Streaming
Integrating ML with the Big Data world
Seamless integration with SQL, DataFrames, Graphs, Streaming,
both within MLlib…and outside MLlib
E.g., on spark-packages.org
•  scikit-learn integration package (“spark-sklearn”)
•  Stanford CoreNLP integration (“spark-corenlp”)
Deep Learning libraries
•  Deep Learning Pipelines (“spark-deep-learning”)
•  BigDL
•  TensorFlowOnSpark
•  Distributed training with Keras (“dist-keras”)
Framework for integrating non-“big data” or specialized ML libraries
Scaling
Workflow scalability
Same code on laptop with small data → cluster with big data
Big data scalability
Scale-out implementations
•  4x faster than xgboost
•  8x slower than xgboost
Sources: Abuzaid et al. (2016); Chen & Guestrin (2016)
Building workflows
Simple concepts: Transformers, Estimators, Models, Evaluators
•  Unified, standardized APIs, including for algorithm parameters
Pipelines
•  Repeatable workflows across Spark deployments and languages
•  DataFrame APIs and optimizations
•  Automated model tuning
[Diagram labels: ETL, Featurization, Training, Evaluation, Tuning]
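The concepts on this slide can be illustrated without Spark. What follows is a toy sketch in plain Python, not MLlib's actual classes: `Doubler`, `MeanCenterer`, `ToyPipeline`, and the list-of-dicts stand-in for a DataFrame are all hypothetical names invented for illustration. It shows an Estimator whose `fit` returns a Model, the Model acting as a Transformer, and a Pipeline chaining both kinds of stage.

```python
# Toy illustration of the Transformer / Estimator / Model / Pipeline pattern.
# Rows are plain dicts standing in for DataFrame rows; nothing here is Spark API.

class Transformer:
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    def fit(self, rows):
        raise NotImplementedError


class Doubler(Transformer):
    """Stateless Transformer: adds column 'x2' equal to 2 * 'x'."""

    def transform(self, rows):
        return [dict(r, x2=2 * r["x"]) for r in rows]


class MeanCentererModel(Transformer):
    """Model returned by MeanCenterer.fit; it stores the learned mean."""

    def __init__(self, col, mean):
        self.col, self.mean = col, mean

    def transform(self, rows):
        return [dict(r, centered=r[self.col] - self.mean) for r in rows]


class MeanCenterer(Estimator):
    """Estimator: learns the mean of one column from training rows."""

    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        mean = sum(r[self.col] for r in rows) / len(rows)
        return MeanCentererModel(self.col, mean)


class ToyPipelineModel(Transformer):
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline(Estimator):
    """Fits Estimator stages in order, feeding each the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = ToyPipeline([Doubler(), MeanCenterer("x2")]).fit(train)
scored = model.transform([{"x": 5.0}])
```

Because the fitted pipeline is itself a Transformer, the same object can be reused on new rows without refitting, which is the property that makes such workflows repeatable.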
Now what?
Where could or should it go?
Top items from the dev@ list…
• Continued speed & scalability improvements
• Enhancements to core algorithms
• ML building blocks: linear algebra, optimization
• Extensible and customizable APIs
Goals
Scale-out ML
Standard library
Extensible API
These are unofficial goals based on my experience + discussions on the dev@ mailing list.
Scale-out ML
General improvements
• DataFrame-based implementations
Targeted improvements
• Gradient-boosted trees: feature subsampling
• Linear models: vector-free L-BFGS for billions of features
• …
E.g., 10x scaling for Connected Components in GraphFrames
Standard library
General improvements
•  NaN/null handling
•  Instance weights
•  Warm starts / initial models
•  Multi-column feature transforms
Algorithm-specific improvements
•  Trees: access tree structure from Python
•  …
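To make two of the general improvements concrete, here is a minimal plain-Python sketch of what instance weights and warm starts mean for an iterative learner. It is not MLlib code: `fit_slope`, its parameters, and the toy data are assumptions made for illustration, and the model is a one-parameter least-squares fit solved by gradient descent.

```python
def fit_slope(data, weights=None, init_w=0.0, step=0.1, tol=1e-8, max_iter=1000):
    """Fit y ~ w * x by gradient descent on weighted squared error.

    Returns (w, iterations_used). `weights` are per-instance weights and
    `init_w` is the starting model (a warm start when taken from a prior fit).
    """
    if weights is None:
        weights = [1.0] * len(data)
    total = sum(weights)
    w = init_w
    for it in range(1, max_iter + 1):
        grad = sum(c * (w * x - y) * x for c, (x, y) in zip(weights, data)) / total
        if abs(grad) < tol:
            return w, it
        w -= step * grad
    return w, max_iter


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 9.0)]  # last point is off the y = 2x line

# Unweighted fit from a cold start (w = 0).
w_all, cold_iters = fit_slope(data)

# Instance weights: a zero weight removes the third point's influence.
w_weighted, _ = fit_slope(data, weights=[1.0, 1.0, 0.0])

# Warm start: begin the unweighted fit from the previously fitted model.
w_warm, warm_iters = fit_slope(data, init_w=w_weighted)
```

In this toy setting the warm-started fit reaches the same solution as the cold-started one while needing fewer iterations, because it begins closer to the optimum.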
Extensible and customizable APIs
MLlib has ~50 algorithms.
CRAN lists 10,750 packages.
→ Users must be able to modify algorithms or write their own.
[Figure: axes labeled “algorithms” and “usage”]
Spark Packages
340+ packages written for Spark
80+ packages for ML and Graphs
E.g.:
• GraphFrames (“graphframes”): DataFrame-based graphs
• Bisecting K-Means (“bisecting-kmeans”): now part of MLlib
• Stanford CoreNLP wrapper (“spark-corenlp”): UDFs for NLP
spark-packages.org
How to write a custom algorithm
You need to:
•  Extend an API like Transformer, Estimator, Model, or Evaluator
MLlib provides some help:
•  UnaryTransformer
•  DefaultParamsWritable/Readable
•  defaultCopy
You get:
•  Natural integration with DataFrames, Pipelines, model tuning
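The shape of this recipe can be sketched in plain Python. The code below is a hypothetical analogue and not MLlib's API: `ToyUnaryTransformer`, `create_transform_func`, and `TokenCounter` are invented names that loosely mirror the idea of a base class handling the input/output column plumbing and copying, so that a custom algorithm only supplies a single function.

```python
from copy import copy as shallow_copy


class ToyUnaryTransformer:
    """Base class: handles column plumbing; subclasses supply one function."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def create_transform_func(self):
        raise NotImplementedError

    def transform(self, rows):
        # Apply the subclass's function to the input column of every row.
        func = self.create_transform_func()
        return [dict(r, **{self.output_col: func(r[self.input_col])}) for r in rows]

    def copy(self, **overrides):
        # Default copy behaviour: clone the instance, then override parameters.
        clone = shallow_copy(self)
        for name, value in overrides.items():
            setattr(clone, name, value)
        return clone


class TokenCounter(ToyUnaryTransformer):
    """Custom algorithm: counts whitespace-separated tokens in a text column."""

    def create_transform_func(self):
        return lambda text: len(text.split())


counter = TokenCounter("text", "n_tokens")
out = counter.transform([{"text": "spark makes ml simple"}])
renamed = counter.copy(output_col="length")
```

The subclass contains only the logic specific to the new algorithm; everything a workflow needs in order to treat it like any other stage lives in the base class.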
Challenges in customization
Modify existing algorithms with custom:
•  Loss functions
•  Optimizers
•  Stopping criteria
•  Callbacks
Implement new algorithms without worrying about:
•  Boilerplate
•  Python API
•  ML building blocks
•  Feature attributes/metadata
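As an illustration of the first wish list, the sketch below shows one way an optimizer could expose a loss function, a stopping criterion, and callbacks as plug-in points. It is plain Python written for this example and does not reflect how MLlib is structured; `gradient_descent`, `squared_loss_grad`, and all parameter names are assumptions.

```python
def squared_loss_grad(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def gradient_descent(data, loss_grad, step=0.1, max_iter=100,
                     should_stop=None, callbacks=()):
    """One-parameter gradient descent with pluggable pieces.

    loss_grad(w, x, y)       -> per-example gradient (custom loss)
    should_stop(it, w, grad) -> True to stop early (custom stopping criterion)
    callbacks                -> functions called as cb(it, w, grad) each iteration
    """
    w = 0.0
    for it in range(max_iter):
        grad = sum(loss_grad(w, x, y) for x, y in data) / len(data)
        w -= step * grad
        for cb in callbacks:
            cb(it, w, grad)
        if should_stop is not None and should_stop(it, w, grad):
            break
    return w


history = []  # filled by the callback with the parameter after each update
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x

w = gradient_descent(
    data,
    loss_grad=squared_loss_grad,
    should_stop=lambda it, w, grad: abs(grad) < 1e-8,
    callbacks=[lambda it, w, grad: history.append(w)],
)
```

Swapping in a different loss or stopping rule requires no change to the optimizer itself, which is the kind of customization the slide asks for.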
Example of MLlib APIs
Deep Learning Pipelines for Apache Spark
spark-packages.org/package/databricks/spark-deep-learning
APIs are great for users:
•  Plug & play algorithms
•  Standard API
But challenges remain:
•  Development requires expertise
•  Community under development
My challenge to you
Can we fill out the long tail of ML use cases?
[Figure: axes labeled “algorithms” and “usage”]
Resources
Spark Packages: spark-packages.org
•  Plugin for building packages: https://github.com/databricks/sbt-spark-package
•  Tool for generating package templates: https://github.com/databricks/spark-package-cmd-tool
Example packages:
•  scikit-learn integration (“spark-sklearn”) – Python-only
•  GraphFrames – Scala/Java/Python
Thank you!
Office hours today @ 3:50pm at Databricks booth
Twitter: @jkbatcmu
