Streaming Algorithms
Joe Kelley
Data Engineer
July 2013
Accelerating Your Time to Value
• IMAGINE: Strategy and Roadmap
• ILLUMINATE: Training and Education
• IMPLEMENT: Hands-On Data Science and Data Engineering
Leading Provider of Data Science & Engineering for Big Analytics
What is a Streaming Algorithm?
• Operates on a continuous stream of data
• Unknown or infinite size
• Only one pass; options:
• Store it
• Lose it
• Store an approximation
• Limited processing time per item
• Limited total memory
(Diagram: Input stream → Algorithm → Output, with Standing Query, Ad-hoc Query, limited Memory, and Disk.)
Why use a Streaming Algorithm?
• Compare to typical “Big Data” approach: store
everything, analyze later, scale linearly
• Streaming Pros:
• Lower latency
• Lower storage cost
• Streaming Cons:
• Less flexibility
• Lower precision (sometimes)
• Answer?
• Why not both?
(Diagram: the stream feeds a Streaming Algorithm whose Result is the Initial Answer; Long-term Storage feeds a Batch Algorithm whose Result is the Authoritative Answer.)
General Techniques
1. Tunable Approximation
2. Sampling
• Sliding window
• Fixed number
• Fixed percentage
3. Hashing: useful randomness
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
• Not enough space to store everything
• Simple queries → storing 1% is good enough
Device-1
(Device-1, event-1, 10001123)
(Device-1, event-3, 10001126)
(Device-1, event-1, 10001129)
...
Device-2
(Device-2, event-2, 10001124)
(Device-2, ERROR, 10001130)
(Device-2, event-4, 10001132)
...
Device-3
(Device-3, event-3, 10001122)
(Device-3, event-1, 10001127)
(Device-3, ERROR, 10001135)
...
Input (merged stream):
(Device-3, event-3, 10001122)
(Device-1, event-1, 10001123)
(Device-2, event-2, 10001124)
(Device-1, event-3, 10001126)
(Device-3, event-1, 10001127)
(Device-1, event-1, 10001129)
(Device-2, ERROR, 10001130)
(Device-2, event-4, 10001132)
(Device-3, ERROR, 10001135)
...
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
• Not enough space to store everything
• Simple queries → storing 1% is good enough
Algorithm:
for each element e:
    with probability 0.01:
        store e
    else:
        throw out e
Can lead to some insidious statistical “bugs”…
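A direct Python rendering of this naive per-event sampler, purely as an illustration (the stream is assumed to yield the (device_id, event, timestamp) tuples shown above):

import random

def sample_events(stream, rate=0.01):
    # Keep each event independently with probability `rate`.
    for event in stream:
        if random.random() < rate:
            yield event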
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
• Not enough space to store everything
• Simple queries → storing 1% is good enough
Query:
How many errors has the average device encountered?
Answer:
SELECT AVG(n) FROM (
    SELECT COUNT(*) AS n FROM events
    WHERE event = 'ERROR'
    GROUP BY device_id
) AS per_device
Simple… but off by up to 100x: each device had only 1% of its events sampled.
Can we just multiply by 100? Not reliably: devices whose sampled events happen to include no errors drop out of the GROUP BY entirely, so even the scaled average is biased.
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
• Not enough space to store everything
• Simple queries → storing 1% is good enough
Better Algorithm:
for each element e:
    if (hash(e.device_id) mod 100) == 0:
        store e
    else:
        throw out e
Choose what you hash on carefully... or keep a separate sample hashed each different way you need.
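Not from the slides, but as a concrete sketch: the same keyed sampling in Python, assuming a stream of (device_id, event, timestamp) tuples and using md5 purely as an example of a stable hash (Python's built-in hash() is salted per process, so it would not give repeatable buckets):

import hashlib

def device_sampled(device_id, percent=1):
    # A device is either entirely in or entirely out of the sample.
    digest = hashlib.md5(device_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

def sample_by_device(stream):
    for device_id, event, timestamp in stream:
        if device_sampled(device_id):
            yield device_id, event, timestamp

Because every event from a sampled device is kept, per-device statistics such as error counts are exact for the devices that made it into the sample.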
Example 2: Sampling fixed number
Want to sample a fixed count (k), not a fixed percentage.
Algorithm:
Let arr = array of size k
for each element e:
    if arr is not yet full:
        add e to arr
    else:
        with probability p:
            replace a random element of arr with e
        else:
            throw out e
Choice of p is crucial:
• p = constant → prefer more recent elements. Higher p = more recent
• p = k/n → sample uniformly from the entire stream
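The p = k/n variant is the classic reservoir sampling scheme. A minimal runnable Python sketch (names are illustrative):

import random

def reservoir_sample(stream, k):
    # Returns k elements drawn uniformly at random from a stream of unknown length.
    arr = []
    n = 0
    for e in stream:
        n += 1
        if len(arr) < k:
            arr.append(e)
        elif random.random() < k / n:          # p = k/n
            arr[random.randrange(k)] = e       # replace a random slot
    return arr

# e.g. 5 elements sampled uniformly from a stream of 1000
print(reservoir_sample(range(1000), 5))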
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over
a time period
• Naïve approach:
• Store all user_id’s in a list/tree/hashtable
• Millions of users = a lot of memory
• Better approach:
• Store all user_id’s in a database
• Good, but maybe it’s not fast enough…
• What if an approximate count is ok?
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
• Approximate count is ok
• Flajolet-Martin Idea:
• Hash each user_id into a bit string
• Count the trailing zeros
• Remember maximum number of trailing zeros seen
user_id      H(user_id)   trailing zeros   max(trailing zeros)
john_doe     0111001001   0                0
jane_doe     1011011100   2                2
alan_t       0010111000   3                3
EWDijkstra   1101011110   1                3
jane_doe     1011011100   2                3
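As an illustration of the hashing step (md5 here is just one convenient stable hash, not what the original paper prescribes):

import hashlib

def trailing_zeros(user_id, bits=32):
    # Count trailing zero bits in a stable hash of user_id.
    h = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
    if h == 0:
        return bits                       # all-zero hash: cap at the hash width
    return (h & -h).bit_length() - 1      # index of the lowest set bit

print(trailing_zeros("john_doe"), trailing_zeros("jane_doe"))

The hash values in the table above are illustrative 10-bit strings, so the numbers printed here will differ.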
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
• Intuition:
• If we had seen 2 distinct users, we would expect 1 trailing zero
• If we had seen 4, we would expect 2 trailing zeros
• If we had seen 2^r, we would expect r trailing zeros
• In general, if the maximum number of trailing zeros seen is R, then 2^R is a reasonable estimate of the number of distinct users
• Want more precision? Use more independent hash functions, and combine the results
• Median = only get powers of two
• Mean = subject to skew
• Median of means of groups works well in practice
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
Flajolet-Martin, all together:
arr = int[k]
for each item e:
    for i in 0...k-1:
        z = trailing_zeros(hash_i(e))
        if z > arr[i]:
            arr[i] = z
means = group_means(arr)
median = median(means)
return pow(2, median)
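A runnable Python version of the same sketch. The k hash functions are simulated by salting md5 with the index i, and the group size is an arbitrary illustrative choice; the original paper also applies a bias-correction factor that this simple 2^R estimator omits.

import hashlib
from statistics import mean, median

def trailing_zeros(x, cap=32):
    return cap if x == 0 else (x & -x).bit_length() - 1

def fm_estimate(stream, k=64, group_size=8):
    # One max-trailing-zeros register per (simulated) hash function.
    arr = [0] * k
    for item in stream:
        for i in range(k):
            h = int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16) & 0xFFFFFFFF
            arr[i] = max(arr[i], trailing_zeros(h))
    # Median of group means: the mean smooths the powers-of-two effect,
    # the median protects against a few wildly high registers.
    groups = [arr[g:g + group_size] for g in range(0, k, group_size)]
    return 2 ** median(mean(g) for g in groups)

# e.g. roughly 1000 distinct users, each seen several times
print(fm_estimate(f"user{i % 1000}" for i in range(5000)))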
Example 3: Counting unique users
Flajolet-Martin in practice
• Devil is in the details
• Tunable precision
• more hash functions = more precise
• See the paper for bounds on precision
• Tunable latency
• more hash functions = higher latency
• faster hash functions = lower latency
• faster hash functions = more possibility of
correlation = less precision
Remember: streaming algorithm for quick, imprecise
answer. Back-end batch algorithm for slower, exact
answer
Example 4: Counting Individual Item Frequencies
Want to keep track of how many times each item has
appeared in the stream
Many applications:
• How popular is each search term?
• How many times has this hashtag been tweeted?
• Which IP addresses are DDoS’ing me?
Again, two obvious approaches:
• In-memory hashmap of item → count
• Database
But can we be more clever?
Example 4: Counting Individual Item Frequencies
Want to keep track of how many times each item has appeared in the stream
Idea:
• Maintain array of counts
• Hash each item, increment array at that index
To check the count of an item, hash again and check
array at that index
• Over-estimates because of hash “collisions”
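A minimal sketch of this single-array idea in Python (the array width and the md5-based hash are illustrative choices):

import hashlib

W = 1 << 16                 # width of the count array
counts = [0] * W

def bucket(item):
    return int(hashlib.md5(item.encode("utf-8")).hexdigest(), 16) % W

def add(item):
    counts[bucket(item)] += 1

def estimate(item):
    # Never under-estimates: collisions can only add to a bucket.
    return counts[bucket(item)]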
Example 4: Counting Individual Item Frequencies
Count-Min Sketch algorithm:
• Maintain a 2-d array with d rows, each of width w
• Choose d different hash functions; each row in the array corresponds to one hash function
• Hash each item with every hash function, increment the appropriate
position in each row
• To query an item, hash it d times again, take the minimum value from all
rows
Example 4: Counting Individual Item Frequencies
Want to keep track of how many times each item has appeared in the stream
Count-Min Sketch, all together:
arr = int[d][w]
for each item e:
    for i in 0...d-1:
        j = hash_i(e) mod w
        arr[i][j]++

def frequency(q):
    min = +infinity
    for i in 0...d-1:
        j = hash_i(q) mod w
        if arr[i][j] < min:
            min = arr[i][j]
    return min
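The same sketch as runnable Python, again simulating the d hash functions by salting md5 with the row index (w, d, and the example data are illustrative):

import hashlib

class CountMinSketch:
    def __init__(self, w=2048, d=5):
        self.w, self.d = w, d
        self.arr = [[0] * w for _ in range(d)]

    def _hash(self, i, item):
        return int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16) % self.w

    def add(self, item):
        for i in range(self.d):
            self.arr[i][self._hash(i, item)] += 1

    def frequency(self, item):
        # The smallest of the d counters is the least-inflated estimate;
        # it can over-count due to collisions but never under-counts.
        return min(self.arr[i][self._hash(i, item)] for i in range(self.d))

cms = CountMinSketch()
for term in ["cat", "dog", "cat", "cat", "bird"]:
    cms.add(term)
print(cms.frequency("cat"), cms.frequency("dog"), cms.frequency("fish"))   # expected: 3 1 0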
Example 4: Counting Individual Item Frequencies
Count-Min Sketch in practice
• Devil is in the details
• Tunable precision
• Bigger array = more precise
• See the paper for bounds on precision
• Tunable latency
• more hash functions = higher latency
• Better at estimating more frequent items
• Can subtract out estimation of collisions
Remember: streaming algorithm for quick, imprecise
answer. Back-end batch algorithm for slower, exact
answer
Questions?
• Feel free to reach out
• www.thinkbiganalytics.com
• joe.kelley@thinkbiganalytics.com
• www.slideshare.net/jfkelley1
• References:
• http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
• http://infolab.stanford.edu/~ullman/mmds.html
We’re hiring! Engineers and Data Scientists