Scala Data Pipelines @ Spotify
Neville Li
@sinisa_lyh
Who am I?
‣ Spotify NYC since 2011
‣ Formerly Yahoo! Search
‣ Music recommendations
‣ Data infrastructure
‣ Scala since 2013
Spotify in numbers
• Started in 2006, 58 markets
• 75M+ active users, 20M+ paying
• 30M+ songs, 20K new per day
• 1.5 billion playlists
• 1 TB logs per day
• 1200+ node Hadoop cluster
• 10K+ Hadoop jobs per day
Music recommendation @ Spotify
• Discover Weekly
• Radio
• Related Artists
• Discover Page
Recommendation systems
A little teaser
PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn,
CombineFn<K,V> reduceFn)
Crunch: CombineFns are used to represent the associative operations…
Grouped[K, +V]::reduce[U >: V](fn: (U, U) => U)
Scalding: reduce with fn which must be associative and commutative…
PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
Spark: Merge the values for each key using an associative reduce function…
Monoid!
enables map-side reduce
Actually it’s a semigroup
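A minimal plain-Scala sketch (not framework code) of why associativity is what makes map-side reduce safe:
// Each "mapper" pre-aggregates its own partition, then the partial results
// are merged; associativity of + guarantees the same answer as a single pass.
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))   // hypothetical mapper inputs
val perMapper  = partitions.map(_.reduce(_ + _))        // combine locally on each mapper
val total      = perMapper.reduce(_ + _)                // merge the partials on the reducer
assert(total == partitions.flatten.reduce(_ + _))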
One more teaser
Linear equation in Alternating Least Squares (ALS) matrix factorization
xᵤ = (YᵀY + Yᵀ(Cᵘ − I)Y)⁻¹ Yᵀ Cᵘ p(u)
vectors.map { case (id, v) => (id, v * v) }.map(_._2).reduce(_ + _) // YtY
ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY
.map { case (_, (r, op)) =>
(solveKey(r), op * (r.rating * alpha))
}.reduceByKey(_ + _)
ratings.keyBy(fixedKey).join(vectors) // YtCupu
.map { case (_, (r, v)) =>
val (Cui, pui) = (r.rating * alpha + 1, if (r.rating > 0.0) 1.0 else 0.0)
(solveKey(r), v * (Cui * pui))
}.reduceByKey(_ + _)
http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
Success story
• Mid 2013: 100+ Python Luigi M/R jobs, few tests
• 10+ new hires since, most fresh grads
• Few with Java experience, none with Scala
• Now: 300+ Scalding jobs, 400+ tests
• More ad-hoc jobs untracked
• Spark also taking off
First 10 months
Activity over time
Guess how many jobs written by yours truly?
Performance vs. Agility
https://nicholassterling.wordpress.com/2012/11/16/scala-performance/
Let’s dive into something technical
To join or not to join?
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.join(tgp)
.values // (user, genre)
.group
.mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
Hash join
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.hashJoin(tgp.forceToDisk) // tgp replicated to all mappers
.values // (user, genre)
.group
.mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
CoGroup
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.cogroup(tgp) { case (_, users, genres) =>
users.map((_, genres.toSet))
} // (track, (user, genres))
.values // (user, genres)
.group
.reduce(_ ++ _) // map-side reduce!
CoGroup
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.cogroup(tgp) { case (_, users, genres) =>
users.map((_, genres.toSet))
} // (track, (user, genres))
.values // (user, genres)
.group
.sum // SetMonoid[Set[T]] from Algebird
* sum[U >: V](implicit sg: Semigroup[U])
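Plain-Scala sketch of what .sum relies on here: Set union is associative and commutative with Set.empty as identity, so partial unions can be done map-side per key:
val genreSets = Seq(Set("metal"), Set("rock", "metal"), Set("industrial"))
val merged    = genreSets.foldLeft(Set.empty[String])(_ ++ _)   // what sum does per key
// merged == Set("metal", "rock", "industrial")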
Key-value file as distributed cache
val streams: TypedPipe[(String, String)] = _ // (gid, user)
val tgp: SparkeyManager = _ // tgp replicated to all mappers
streams
.map { case (track, user) =>
(user, tgp.get(track).split(",").toSet)
}
.group
.sum
https://github.com/spotify/sparkey
SparkeyManager wraps DistributedCacheFile
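Rough sketch of the mapper-side lookup with sparkey-java; the exact calls (Sparkey.open, getAsString) are recalled from memory, so treat them as an assumption and check the repo above.
import java.io.File
import com.spotify.sparkey.{Sparkey, SparkeyReader}

// Open the memory-mapped index shipped via distributed cache, then look up
// one track's genres ("some-track-gid" is just a placeholder key).
val reader: SparkeyReader = Sparkey.open(new File("track-genres.spi"))
val genres: Set[String] =
  Option(reader.getAsString("some-track-gid"))
    .map(_.split(",").toSet)
    .getOrElse(Set.empty)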
Joins and CoGroups
• Require shuffle and reduce step
• Some ops force everything to reducers, e.g. mapGroup, mapValueStream
• CoGroup more flexible for complex logic
• Scalding flattens a.join(b).join(c)… into MultiJoin(a, b, c, …) (sketch below)
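Sketch of the chained typed join the last bullet refers to, mirroring the join syntax used in the earlier slides (pipe contents and names are illustrative); the nested value tuples are the part that usually surprises people.
import com.twitter.scalding._

// Written as a chain, planned by Scalding as one MultiJoin / co-group;
// the values come out nested as ((Int, Long), Double).
def joinThree(a: TypedPipe[(String, Int)],
              b: TypedPipe[(String, Long)],
              c: TypedPipe[(String, Double)]) =
  a.join(b).join(c)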
Distributed cache
• Faster with off-heap binary files
• Building cache = more wiring
• Memory mapping may interfere with YARN
• E.g. 64GB nodes with 48GB for containers (no cgroup)
• 12 × 2GB containers each with 2GB JVM heap + mmap cache
• OOM and swap!
• Keep files small (< 1GB) or fallback to joins…
Analyze your jobs
• Concurrent Driven
• Visualize job execution
• Workflow optimization
• Bottlenecks
• Data skew
Not enough math?
Recommending tracks
• User listened to Rammstein - Du Hast
• Recommend 10 similar tracks
• 40-dimension feature vectors for tracks
• Compute cosine similarity between all pairs (sketch below)
• O(n) lookup per user, where n ≈ 30M
• Try that with 50M users × 10 seed tracks each
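Minimal sketch of the brute-force step being described, to make the O(n)-per-seed cost concrete (names illustrative):
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
}

// Score one seed track against every candidate and keep the 10 best.
def top10(seed: Array[Double], tracks: Map[String, Array[Double]]): Seq[(String, Double)] =
  tracks.toSeq
    .map { case (id, v) => (id, cosine(seed, v)) }
    .sortBy(-_._2)
    .take(10)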
ANNOY - cheat by approximation
• Approximate Nearest Neighbor Oh Yeah
• Random projections and binary tree search
• Build index on single machine
• Load in mappers via distributed cache
• O(log n) lookup
https://github.com/spotify/annoy
https://github.com/spotify/annoy-java
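Not Annoy itself — just a toy sketch of the random-projection idea: split points by which side of a random hyperplane they fall on, recurse to build a tree, and only descend one branch at query time.
import scala.util.Random

def randomUnit(dim: Int, rng: Random): Array[Double] = {
  val v = Array.fill(dim)(rng.nextGaussian())
  val norm = math.sqrt(v.map(x => x * x).sum)
  v.map(_ / norm)
}

// One split: points on either side of a random hyperplane through the origin.
def split(points: Seq[Array[Double]], dim: Int, rng: Random)
    : (Seq[Array[Double]], Seq[Array[Double]]) = {
  val plane = randomUnit(dim, rng)
  points.partition(p => p.zip(plane).map { case (a, b) => a * b }.sum >= 0.0)
}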
ANN Benchmark
https://github.com/erikbern/ann-benchmarks
Filtering candidates
• Users don’t like seeing artists/albums/tracks they already know
• But may forget what they listened to long ago
• 50M users × thousands of items each
• Over 5 years of streaming logs
• Need to update daily
• Need to purge old items per user
Options
• Aggregate all logs daily
• Aggregate last x days daily
• CSV of artist/album/track ids
• Bloom filters
Decayed value with cutoff
• Compute new user-item score daily
• Weighted on context, e.g. radio, search, playlist
• score’ = score + previous * 0.99
• half-life = log(0.5) / log(0.99) ≈ 69 days
• Cut off at top 2000
• Items that users might remember seeing recently
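Sketch of the update rule above (field names hypothetical):
val decay = 0.99

// Yesterday's score decays by 0.99, today's context-weighted plays add on top.
def updateScore(previous: Double, today: Double): Double =
  today + previous * decay

// Half-life check: 0.99^t = 0.5  =>  t = log(0.5) / log(0.99) ≈ 69 days
val halfLifeDays = math.log(0.5) / math.log(decay)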
Bloom filters
• Probabilistic data structure
• Encoding set of items with m bits and k hash functions
• No false negatives
• Tunable false positive probability
• Size proportional to capacity & FP probability
• Let’s build one per user-{artists,albums,tracks}
• Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR
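A hedged sketch with Algebird's Bloom filter; the factory signature has changed across Algebird versions (older ones take no type parameter), so verify against the version you use.
import com.twitter.algebird._

val bfMonoid = BloomFilter[String](2000, 0.01)   // ~2000 items per user, 1% FP target

// One filter per user; plus is bitwise OR, zero is the all-zero filter.
val yesterday = bfMonoid.create("track-a", "track-b")
val today     = bfMonoid.create("track-c")
val merged    = bfMonoid.plus(yesterday, today)

merged.contains("track-a").isTrue   // true; no false negatives, tunable FP rate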
Size versus max items & FP prob
• User-item distribution is uneven
• Assuming same setting for all users
• # items << capacity → wasting space
• # items > capacity → high FP rate
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
(diagram: items go into a BF with n=1k; when it is full, a new BF with n=10k is added, then n=100k, then n=1m)
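Toy sketch of the scalable-filter bookkeeping above (plain Sets stand in for real BFs so it stays self-contained); note that contains has to consult every filter, which is the lookup overhead mentioned.
final case class ToyBF(capacity: Int, items: Set[String]) {
  def full: Boolean = items.size >= capacity
  def +(x: String): ToyBF = copy(items = items + x)
  def contains(x: String): Boolean = items.contains(x)
}

final case class ScalableBF(filters: List[ToyBF]) {
  def add(x: String): ScalableBF = filters match {
    case head :: tail if !head.full => ScalableBF((head + x) :: tail)
    case _ => ScalableBF(ToyBF(nextCapacity, Set(x)) :: filters)   // grow: 1k -> 10k -> 100k -> 1m
  }
  private def nextCapacity: Int =
    filters.headOption.map(_.capacity * 10).getOrElse(1000)
  def contains(x: String): Boolean = filters.exists(_.contains(x))  // must check every BF
}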
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smallest one with capacity > items inserted
• Expensive to build
• Cheap to store and lookup
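Toy sketch of the selection step (again with plain Sets as stand-ins): build one filter per capacity in parallel, then keep the smallest whose capacity exceeds the number of items actually inserted.
val capacities = Seq(1000, 10000, 100000, 1000000)

def pickFilter(items: Set[String]): (Int, Set[String]) = {
  val candidates = capacities.map(cap => cap -> items)   // in the real job: n BFs built in parallel
  candidates
    .find { case (cap, _) => cap > items.size }          // smallest with capacity > items inserted
    .getOrElse(candidates.last)                          // power users fall back to the largest
}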
(diagram: each item is inserted into BFs with n=1k, 10k, 100k and 1m in parallel; their fill ratios grow together, e.g. 80% / 8% / 0.8% / 0.08%, and the smaller BFs fill up first)
Track metadata
• Label dump → content ingestion
• Third party track genres, e.g. GraceNote
• Audio attributes, e.g. tempo, key, time signature
• Cultural data, e.g. popularity, tags
• Latent vectors from collaborative filtering
• Many sources for album, artist, user metadata too
Multiple data sources
• Big joins
• Complex dependencies
• Wide rows with few columns accessed
• Wasting I/O
Apache Parquet
• Pre-join sources into mega-datasets
• Store as Parquet columnar storage
• Column projection
• Predicate pushdown
• Avro within Scalding pipelines
Projection
pipe.map(a => (a.getName, a.getAmount))
versus
Parquet.project[Account]("name", "amount")
• Strings → unsafe and error prone
• No IDE auto-completion → finger injury
• my_fancy_field_name → .getMyFancyFieldName
• Hard to migrate existing code
Predicate
pipe.filter(a => a.getName == "Neville" && a.getAmount > 100)
versus
FilterApi.and(
  FilterApi.eq(FilterApi.binaryColumn("name"), Binary.fromString("Neville")),
  FilterApi.gt(FilterApi.floatColumn("amount"), 100f.asInstanceOf[java.lang.Float]))
Macro to the rescue
Code → AST → (pattern matching) → (recursion) → (quasi-quotes) → Code
Projection[Account](_.getName, _.getAmount)
Predicate[Account](x => x.getName == "Neville" && x.getAmount > 100)
https://github.com/nevillelyh/parquet-avro-extra
http://www.lyh.me/slides/macros.html
What else?
‣ Analytics
‣ Ads targeting, prediction
‣ Metadata quality
‣ Zeppelin
‣ More cool stuff in the works