Streaming Algorithms
Joe Kelley
Data Engineer
July 2013
Accelerating Your Time to Value
• IMAGINE: Strategy and Roadmap
• ILLUMINATE: Training and Education
• IMPLEMENT: Hands-On Data Science and Data Engineering
Leading Provider of Data Science & Engineering for Big Analytics
What is a Streaming Algorithm?
• Operates on a continuous stream of data
• Unknown or infinite size
• Only one pass; options:
• Store it
• Lose it
• Store an approximation
• Limited processing time per item
• Limited total memory
(Diagram: Input stream → Algorithm → Output, with Standing Query, Ad-hoc Query, limited Memory, and Disk.)
Why use a Streaming Algorithm?
• Compare to typical “Big Data” approach: store
everything, analyze later, scale linearly
• Streaming Pros:
• Lower latency
• Lower storage cost
• Streaming Cons:
• Less flexibility
• Lower precision (sometimes)
• Answer?
• Why not both?
(Diagram: the stream feeds a Streaming Algorithm whose Result is the Initial Answer; Long-term Storage feeds a Batch Algorithm whose Result is the Authoritative Answer.)
General Techniques
1. Tunable Approximation
2. Sampling
• Sliding window
• Fixed number
• Fixed percentage
3. Hashing: useful randomness
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
• Not enough space to store everything
• Simple queries → storing 1% is good enough
Device-1
(Device-1, event-1, 10001123)
(Device-1, event-3, 10001126)
(Device-1, event-1, 10001129)
...
Device-2
(Device-2, event-2, 10001124)
(Device-2, ERROR, 10001130)
(Device-2, event-4, 10001132)
...
Device-3
(Device-3, event-3, 10001122)
(Device-3, event-1, 10001127)
(Device-3, ERROR, 10001135)
...
Input (merged stream):
(Device-3, event-3, 10001122)
(Device-1, event-1, 10001123)
(Device-2, event-2, 10001124)
(Device-1, event-3, 10001126)
(Device-3, event-1, 10001127)
(Device-1, event-1, 10001129)
(Device-2, ERROR, 10001130)
(Device-2, event-4, 10001132)
(Device-3, ERROR, 10001135)
...
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
• Not enough space to store everything
• Simple queries → storing 1% is good enough
Algorithm:
for each element e:
    with probability 0.01:
        store e
    else:
        throw out e
Can lead to some insidious statistical “bugs”…
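A direct Python rendering of this naive per-event sampler, purely as an illustration (the stream is assumed to yield the (device_id, event, timestamp) tuples shown above):

import random

def sample_events(stream, rate=0.01):
    # Keep each event independently with probability `rate`.
    for event in stream:
        if random.random() < rate:
            yield event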
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
• Not enough space to store everything
• Simple queries → storing 1% is good enough
Query:
How many errors has the average device encountered?
Answer:
SELECT AVG(n) FROM (
    SELECT COUNT(*) AS n FROM events
    WHERE event = 'ERROR'
    GROUP BY device_id
) AS per_device
Simple… but off by up to 100x: each device had only 1% of its events sampled.
Can we just multiply by 100? Not reliably: devices whose sampled events happen to include no errors drop out of the GROUP BY entirely, so even the scaled average is biased.
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
• Not enough space to store everything
• Simple queries → storing 1% is good enough
Better Algorithm:
for each element e:
    if (hash(e.device_id) mod 100) == 0:
        store e
    else:
        throw out e
Choose what you hash on carefully... or keep a separate sample hashed each different way you need.
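Not from the slides, but as a concrete sketch: the same keyed sampling in Python, assuming a stream of (device_id, event, timestamp) tuples and using md5 purely as an example of a stable hash (Python's built-in hash() is salted per process, so it would not give repeatable buckets):

import hashlib

def device_sampled(device_id, percent=1):
    # A device is either entirely in or entirely out of the sample.
    digest = hashlib.md5(device_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

def sample_by_device(stream):
    for device_id, event, timestamp in stream:
        if device_sampled(device_id):
            yield device_id, event, timestamp

Because every event from a sampled device is kept, per-device statistics such as error counts are exact for the devices that made it into the sample.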
Example 2: Sampling fixed number
Want to sample a fixed count (k), not a fixed percentage.
Algorithm:
Let arr = array of size k
for each element e:
    if arr is not yet full:
        add e to arr
    else:
        with probability p:
            replace a random element of arr with e
        else:
            throw out e
Choice of p is crucial:
• p = constant → prefer more recent elements. Higher p = more recent
• p = k/n → sample uniformly from the entire stream
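The p = k/n variant is the classic reservoir sampling scheme. A minimal runnable Python sketch (names are illustrative):

import random

def reservoir_sample(stream, k):
    # Returns k elements drawn uniformly at random from a stream of unknown length.
    arr = []
    n = 0
    for e in stream:
        n += 1
        if len(arr) < k:
            arr.append(e)
        elif random.random() < k / n:          # p = k/n
            arr[random.randrange(k)] = e       # replace a random slot
    return arr

# e.g. 5 elements sampled uniformly from a stream of 1000
print(reservoir_sample(range(1000), 5))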
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over
a time period
• Naïve approach:
• Store all user_id’s in a list/tree/hashtable
• Millions of users = a lot of memory
• Better approach:
• Store all user_id’s in a database
• Good, but maybe it’s not fast enough…
• What if an approximate count is ok?
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
• Approximate count is ok
• Flajolet-Martin Idea:
• Hash each user_id into a bit string
• Count the trailing zeros
• Remember maximum number of trailing zeros seen
user_id      H(user_id)   trailing zeros   max(trailing zeros)
john_doe     0111001001   0                0
jane_doe     1011011100   2                2
alan_t       0010111000   3                3
EWDijkstra   1101011110   1                3
jane_doe     1011011100   2                3
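As an illustration of the hashing step (md5 here is just one convenient stable hash, not what the original paper prescribes):

import hashlib

def trailing_zeros(user_id, bits=32):
    # Count trailing zero bits in a stable hash of user_id.
    h = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
    if h == 0:
        return bits                       # all-zero hash: cap at the hash width
    return (h & -h).bit_length() - 1      # index of the lowest set bit

print(trailing_zeros("john_doe"), trailing_zeros("jane_doe"))

The hash values in the table above are illustrative 10-bit strings, so the numbers printed here will differ.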
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
• Intuition:
• If we had seen 2 distinct users, we would expect 1 trailing zero
• If we had seen 4, we would expect 2 trailing zeros
• If we had seen 2^r, we would expect r trailing zeros
• In general, if the maximum number of trailing zeros seen is R, then 2^R is a reasonable estimate of the number of distinct users
• Want more precision? Use more independent hash functions, and combine the results
• Median = only get powers of two
• Mean = subject to skew
• Median of means of groups works well in practice
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
Flajolet-Martin, all together:
arr = int[k]
for each item e:
    for i in 0...k-1:
        z = trailing_zeros(hash_i(e))
        if z > arr[i]:
            arr[i] = z
means = group_means(arr)
median = median(means)
return pow(2, median)
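A runnable Python version of the same sketch. The k hash functions are simulated by salting md5 with the index i, and the group size is an arbitrary illustrative choice; the original paper also applies a bias-correction factor that this simple 2^R estimator omits.

import hashlib
from statistics import mean, median

def trailing_zeros(x, cap=32):
    return cap if x == 0 else (x & -x).bit_length() - 1

def fm_estimate(stream, k=64, group_size=8):
    # One max-trailing-zeros register per (simulated) hash function.
    arr = [0] * k
    for item in stream:
        for i in range(k):
            h = int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16) & 0xFFFFFFFF
            arr[i] = max(arr[i], trailing_zeros(h))
    # Median of group means: the mean smooths the powers-of-two effect,
    # the median protects against a few wildly high registers.
    groups = [arr[g:g + group_size] for g in range(0, k, group_size)]
    return 2 ** median(mean(g) for g in groups)

# e.g. roughly 1000 distinct users, each seen several times
print(fm_estimate(f"user{i % 1000}" for i in range(5000)))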
Example 3: Counting unique users
Flajolet-Martin in practice
• Devil is in the details
• Tunable precision
• more hash functions = more precise
• See the paper for bounds on precision
• Tunable latency
• more hash functions = higher latency
• faster hash functions = lower latency
• faster hash functions = more possibility of
correlation = less precision
Remember: streaming algorithm for quick, imprecise
answer. Back-end batch algorithm for slower, exact
answer
Example 4: Counting Individual Item Frequencies
Want to keep track of how many times each item has
appeared in the stream
Many applications:
• How popular is each search term?
• How many times has this hashtag been tweeted?
• Which IP addresses are DDoS’ing me?
Again, two obvious approaches:
• In-memory hashmap of item → count
• Database
But can we be more clever?
Example 4: Counting Individual Item Frequencies
Want to keep track of how many times each item has appeared in the stream
Idea:
• Maintain array of counts
• Hash each item, increment array at that index
To check the count of an item, hash again and check
array at that index
• Over-estimates because of hash “collisions”
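A minimal sketch of this single-array idea in Python (the array width and the md5-based hash are illustrative choices):

import hashlib

W = 1 << 16                 # width of the count array
counts = [0] * W

def bucket(item):
    return int(hashlib.md5(item.encode("utf-8")).hexdigest(), 16) % W

def add(item):
    counts[bucket(item)] += 1

def estimate(item):
    # Never under-estimates: collisions can only add to a bucket.
    return counts[bucket(item)]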
Example 4: Counting Individual Item Frequencies
Count-Min Sketch algorithm:
• Maintain a 2-d array with d rows, each of width w
• Choose d different hash functions; each row in the array corresponds to one hash function
• Hash each item with every hash function, increment the appropriate
position in each row
• To query an item, hash it d times again, take the minimum value from all
rows
Example 4: Counting Individual Item Frequencies
Want to keep track of how many times each item has appeared in the stream
Count-Min Sketch, all together:
arr = int[d][w]
for each item e:
    for i in 0...d-1:
        j = hash_i(e) mod w
        arr[i][j]++

def frequency(q):
    min = +infinity
    for i in 0...d-1:
        j = hash_i(q) mod w
        if arr[i][j] < min:
            min = arr[i][j]
    return min
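The same sketch as runnable Python, again simulating the d hash functions by salting md5 with the row index (w, d, and the example data are illustrative):

import hashlib

class CountMinSketch:
    def __init__(self, w=2048, d=5):
        self.w, self.d = w, d
        self.arr = [[0] * w for _ in range(d)]

    def _hash(self, i, item):
        return int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16) % self.w

    def add(self, item):
        for i in range(self.d):
            self.arr[i][self._hash(i, item)] += 1

    def frequency(self, item):
        # The smallest of the d counters is the least-inflated estimate;
        # it can over-count due to collisions but never under-counts.
        return min(self.arr[i][self._hash(i, item)] for i in range(self.d))

cms = CountMinSketch()
for term in ["cat", "dog", "cat", "cat", "bird"]:
    cms.add(term)
print(cms.frequency("cat"), cms.frequency("dog"), cms.frequency("fish"))   # expected: 3 1 0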
Example 4: Counting Individual Item Frequencies
Count-Min Sketch in practice
• Devil is in the details
• Tunable precision
• Bigger array = more precise
• See the paper for bounds on precision
• Tunable latency
• more hash functions = higher latency
• Better at estimating more frequent items
• Can subtract out estimation of collisions
Remember: streaming algorithm for quick, imprecise
answer. Back-end batch algorithm for slower, exact
answer
Questions?
• Feel free to reach out
• www.thinkbiganalytics.com
• joe.kelley@thinkbiganalytics.com
• www.slideshare.net/jfkelley1
• References:
• http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
• http://infolab.stanford.edu/~ullman/mmds.html
We’re hiring! Engineers and Data Scientists