© Cloudera, Inc. All rights reserved.
Marton	Balassi	|	Solutions	Architect	@	Cloudera*
@MartonBalassi |	mbalassi@cloudera.com
Judit Feher |	Data	Scientist	@	MTA	SZTAKI
jfeher@ilab.sztaki.hu
*Work	carried	out	while	employed	by	MTA	SZTAKI	on	the	Streamline	project.
Streaming	ML	with	Flink
This	project	has	received	funding	from	the	European	Union’s	Horizon	2020
research	and	innovation	program	under	grant	agreement	No	688191.
Outline
• Current	FlinkML API	through	an	example
• Adding	streaming	predictors
• Online	learning
• Use	cases	in	the	Streamline	project
• Summary
FlinkML example usage
import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

val env = ExecutionEnvironment.getExecutionEnvironment
// (user, item, rating) triples used for training
val trainData = env.readCsvFile[(Int, Int, Double)](trainFile)
// users to recommend for
val testData = env.readTextFile(testFile).map(_.toInt)
val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)
model.fit(trainData)
val prediction = model.predict(testData)
prediction.print()
Given a historical (training) dataset of user preferences,
let us recommend desirable items for a set of users.
Design motivated by the scikit-learn API. More at http://arxiv.org/abs/1309.0238.
This	is	a	batch	input.	
But	does	it	need	to	be?
A	little	recommender	theory
[Figure: user-item matrix R (users U × items I), user factors P, item factors Q, user side information, item side information]
• R	is	potentially	huge,	approximate	it	with	P∗Q
• Prediction	is	TopK(user’s	row	∗ Q)
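The TopK(user's row ∗ Q) step can be sketched in plain Scala as follows. This is a minimal illustration, not FlinkML code: the object name, the toy factor values and the array-based layout (one latent vector per item) are assumptions made for the example.

```scala
object TopKSketch {
  // Dot product of two equally sized factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Score every item for one user (the user's row of P times Q) and keep the k best.
  // itemFactors(i) holds the latent vector of item i, i.e. column i of Q.
  def topK(userFactors: Array[Double],
           itemFactors: Array[Array[Double]],
           k: Int): Seq[(Int, Double)] =
    itemFactors.zipWithIndex
      .map { case (q, item) => (item, dot(userFactors, q)) }
      .sortBy { case (_, score) => -score }
      .take(k)
      .toSeq

  // Hypothetical toy factors with two latent dimensions.
  val user = Array(1.0, 0.5)
  val items = Array(Array(0.2, 0.1), Array(0.9, 0.4), Array(0.1, 0.8))

  // Ids of the two best-scoring items for this user.
  val recommended: Seq[Int] = topK(user, items, 2).map(_._1)
}
```

Only the user's own factor vector and Q are needed per request, which is why the full matrix R never has to be materialized at prediction time.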
Prediction	is	a	natural	fit	for	streaming.
A closer (schematic) look at the API
trait PredictDataSetOperation[Self, Testing, Prediction] {
  def predictDataSet(instance: Self, input: DataSet[Testing]): DataSet[Prediction]
}
trait PredictOperation[Instance, Model, Testing, Prediction] {
  def getModel(instance: Instance): DataSet[Model]
  // record level: one test value in, one prediction out
  def predict(value: Testing, model: Model): Prediction
}
A DataSet-level and a record-level API to implement the algorithm
(prediction is always done on a model that is already trained)
The record-level version is arguably more convenient
It is wrapped into a default DataSet-level implementation
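How a record-level operation might be wrapped into a default dataset-level one can be sketched without Flink. In this sketch `DataSet` is only a type alias for `Seq`, the record-level `predict` is assumed to return a single prediction, and `defaultPredictDataSet`, `Scaler` and `scalerOp` are hypothetical names introduced for the example.

```scala
object RecordLevelSketch {
  // Stand-in for Flink's DataSet so that the sketch runs without a cluster.
  type DataSet[T] = Seq[T]

  trait PredictDataSetOperation[Self, Testing, Prediction] {
    def predictDataSet(instance: Self, input: DataSet[Testing]): DataSet[Prediction]
  }

  trait PredictOperation[Instance, Model, Testing, Prediction] {
    def getModel(instance: Instance): DataSet[Model]
    def predict(value: Testing, model: Model): Prediction
  }

  // Default wrapper: fetch the trained model once, then apply the
  // record-level predict to every element of the input.
  def defaultPredictDataSet[Instance, Model, Testing, Prediction](
      op: PredictOperation[Instance, Model, Testing, Prediction]
  ): PredictDataSetOperation[Instance, Testing, Prediction] =
    new PredictDataSetOperation[Instance, Testing, Prediction] {
      def predictDataSet(instance: Instance, input: DataSet[Testing]): DataSet[Prediction] = {
        val model = op.getModel(instance).head
        input.map(value => op.predict(value, model))
      }
    }

  // Hypothetical toy predictor whose "model" is just a slope.
  case class Scaler(slope: Double)

  val scalerOp = new PredictOperation[Scaler, Double, Double, Double] {
    def getModel(instance: Scaler): DataSet[Double] = Seq(instance.slope)
    def predict(value: Double, model: Double): Double = value * model
  }

  val predictions: DataSet[Double] =
    defaultPredictDataSet(scalerOp).predictDataSet(Scaler(2.0), Seq(1.0, 2.5))
}
```

The algorithm author only writes `getModel` and the per-record `predict`; the collection-level plumbing is shared.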
A closer (schematic) look at the API
// schematic: parameter maps are omitted
trait Estimator[Self] { that: Self =>
  def fit[Training](training: DataSet[Training])(implicit f: FitOperation[Self, Training]): Unit = {
    f.fit(this, training)
  }
}
trait Transformer[Self] extends Estimator[Self] { that: Self =>
  def transform[I, O](input: DataSet[I])(implicit t: TransformDataSetOperation[Self, I, O]): DataSet[O] = {
    t.transformDataSet(this, input)
  }
}
trait Predictor[Self] extends Estimator[Self] { that: Self =>
  def predict[Testing, Prediction](testing: DataSet[Testing])(implicit p: PredictDataSetOperation[Self, Testing, Prediction]): DataSet[Prediction] = {
    p.predictDataSet(this, testing)
  }
}
Three well-picked traits go a long way
Could	we	share	the	model	with	a	streaming	job?
Learn in batch, predict in streaming
val env = ExecutionEnvironment.getExecutionEnvironment
val strEnv = StreamExecutionEnvironment.getExecutionEnvironment
val trainData = env.readCsvFile[(Int, Int, Double)](trainFile)
// the test input now comes from the streaming environment
val testData = strEnv.socketTextStream(...).map(_.toInt)
val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)
model.fit(trainData)
// predictStream is the streaming counterpart of predict proposed in this talk
val prediction = model.predictStream(testData)
prediction.print()
A closer (schematic) look at the streaming API
trait PredictDataSetOperation[Self, Testing, Prediction] {
  def predictDataSet(instance: Self, input: DataSet[Testing]): DataSet[Prediction]
}
trait PredictDataStreamOperation[Self, Testing, Prediction] {
  def predictDataStream(instance: Self, input: DataStream[Testing]): DataStream[Prediction]
}
• Implicit conversions from the batch Predictors to StreamPredictors
• The model is stored, then loaded into a stateful RichMapFunction that processes the input stream
• Default wrapper implementations support both the DataStream-level and the record-level implementations
• Given the batch predictor of an algorithm, adding its streaming predictor is trivial
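The "store the model, then load it into a stateful mapper" idea can be sketched in plain Scala. This is an illustration under assumptions rather than Flink code: `StatefulMapper` is a hypothetical stand-in for a RichMapFunction (an `open()` that runs once, a `map()` that runs per record), `storedModel` stands in for wherever the batch job persisted the model, and an `Iterator` plays the role of the stream.

```scala
object StreamPredictSketch {
  // Stand-in for a stateful RichMapFunction: open() runs once, map() runs per record.
  abstract class StatefulMapper[In, Out] {
    def open(): Unit
    def map(value: In): Out
  }

  // Hypothetical model store: a score per user, as written by the batch job.
  val storedModel: Map[Int, Double] = Map(1 -> 0.9, 2 -> 0.1)

  class ModelMapper extends StatefulMapper[Int, (Int, Double)] {
    private var model: Map[Int, Double] = Map.empty
    var loads = 0

    // Load the batch-trained model once, before the first record arrives.
    def open(): Unit = { model = storedModel; loads += 1 }

    // Record-level prediction against the already loaded model;
    // users unknown to the model get a default score of 0.0.
    def map(user: Int): (Int, Double) = (user, model.getOrElse(user, 0.0))
  }

  // Drive the mapper over an (in principle unbounded) stream of records.
  def predictStream[In, Out](mapper: StatefulMapper[In, Out],
                             input: Iterator[In]): Iterator[Out] = {
    mapper.open()
    input.map(mapper.map)
  }

  val mapper = new ModelMapper
  val predictions: List[(Int, Double)] =
    predictStream(mapper, Iterator(2, 1, 3)).toList
}
```

Because the per-record logic is the same as in the batch case, only the loading step and the driver differ between the two predictors.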
Recommender systems in batch vs online learning
• “30M” music listening dataset crawled by the CrowdRec team
• Implicit, timestamped music listening dataset
• Each record contains: [ timestamp, user, artist, album, track, … ]
• We always recommend and learn when the user interacts with an item for the first time
• ~50,000 users, ~100,000 artists, ~500,000 tracks
• This happens when we shuffle the time
• A partially batch online system
Use cases in the Streamline project
Judit Fehér
Hungarian Academy of Sciences
How iALS works and why it is different from ALS
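ALS
Problem to solve: $R_i = P^T Q_i$
– Linear regression
Error function
$$L = \lVert R - \hat{R} \rVert_{Frob}^2 + \lambda_P \lVert P \rVert_{Frob}^2 + \lambda_Q \lVert Q \rVert_{Frob}^2$$
Implicit error function
$$L = \sum_{u=1,\,i=1}^{S_U,\,S_I} w_{u,i} \left( \hat{r}_{u,i} - r_{u,i} \right)^2 + \lambda_P \sum_{u=1}^{S_U} \lVert P_u \rVert^2 + \lambda_Q \sum_{i=1}^{S_I} \lVert Q_i \rVert^2$$
• Weighted MSE
• $w_{u,i} = \begin{cases} w_{u,i} & \text{if } (u, i) \in T \\ w_0 & \text{otherwise} \end{cases}$, with $w_0 \ll w_{u,i}$
• Typical weights: $w_0 = 1$, $w_{u,i} = 100 \cdot supp(u, i)$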
• What does it mean?
– Create two matrices from the events
– (1) Preference matrix
• Binary
• 1 represents the presence of an event
– (2) Confidence matrix
• Interprets our certainty on the
corresponding values in the first matrix
• Negative feedback is much less certain
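The two matrices and the weighted error can be made concrete with a small plain-Scala sketch. It is an illustration only: the events, the matrix sizes and all names are hypothetical, the weights follow the typical values quoted above (1 for missing pairs, 100 times the support for observed ones), and the regularization terms are left out.

```scala
object ImplicitWeightsSketch {
  // Observed events as (user, item) pairs; a repeated pair raises supp(u, i).
  val events: Seq[(Int, Int)] = Seq((0, 0), (0, 0), (1, 2))
  val numUsers = 2
  val numItems = 3

  // supp(u, i): how many times user u interacted with item i.
  val support: Map[(Int, Int), Int] =
    events.groupBy(identity).map { case (pair, occurrences) => (pair, occurrences.size) }

  // (1) Preference matrix: 1 where an event was observed, 0 elsewhere.
  val preference: Array[Array[Double]] =
    Array.tabulate(numUsers, numItems)((u, i) => if (support.contains((u, i))) 1.0 else 0.0)

  // (2) Confidence matrix: w0 for missing pairs, 100 * supp(u, i) for observed ones.
  val w0 = 1.0
  val confidence: Array[Array[Double]] =
    Array.tabulate(numUsers, numItems)((u, i) =>
      support.get((u, i)).map(100.0 * _).getOrElse(w0))

  // Weighted squared error of a predicted matrix (regularization terms left out).
  def weightedError(predicted: Array[Array[Double]]): Double =
    (for (u <- 0 until numUsers; i <- 0 until numItems)
      yield confidence(u)(i) * math.pow(predicted(u)(i) - preference(u)(i), 2)).sum
}
```

Predicting zero everywhere is penalized almost entirely by the few observed pairs, which is the sense in which negative feedback counts as much less certain.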
Machine learning: batch, streaming? Combined?
Streaming recommender
• Online learning
• Update immediately, e.g. with large
learning rate
• Data streaming
• Read training/testing data only once, no
chance to store
• Real time / Interactive
+ More timely, adapts fast
- Challenging to implement
Batch recommender
• Read all training data multiple times
• Stochastic gradient: use multiple
times in random order
• Elaborate optimization procedures,
e.g. SVM
+ More accurate (?)
+ Easy to implement (?)
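The "update immediately, read every event only once" behaviour of the streaming recommender can be sketched as a single stochastic gradient step per event. This is a minimal sketch, not the Streamline implementation: the learning rate of 0.1, the regularization of 0.01 and all names are assumptions chosen for the example.

```scala
object OnlineSgdSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // One stochastic gradient step on a single rating event for one user-item pair.
  // Returns the updated user and item vectors; the inputs are not modified.
  def update(p: Array[Double], q: Array[Double], rating: Double,
             learningRate: Double, lambda: Double): (Array[Double], Array[Double]) = {
    val error = rating - dot(p, q)
    val newP = p.zip(q).map { case (pu, qi) => pu + learningRate * (error * qi - lambda * pu) }
    val newQ = q.zip(p).map { case (qi, pu) => qi + learningRate * (error * pu - lambda * qi) }
    (newP, newQ)
  }

  // Each event is read once, in arrival order, and changes the model immediately.
  def consume(events: Iterator[Double],
              p0: Array[Double],
              q0: Array[Double]): (Array[Double], Array[Double]) =
    events.foldLeft((p0, q0)) { case ((p, q), rating) => update(p, q, rating, 0.1, 0.01) }
}
```

A batch recommender would instead loop over the stored events several times, typically in random order; here nothing is stored beyond the current factor vectors.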
Contextualized recommendation (NMusic)
Social recommendation Geo recommendation
R. Palovics, A. A. Benczur, L. Kocsis, T. Kiss, E. Frigo. "Exploiting temporal influence in online recommendation", ACM RecSys (2014)
Palovics, Szalai, Kocsis, Pap, Frigo, Benczur. "Location-Aware Online Learning for Top-k Hashtag Recommendation", LocalRec (2015)
Internet Memory Research use cases
Identify events that influence consumer behavior (product purchases, media consumption)
Events influence people
Before a football match, people buy beer, chips, …
Specific events influence specific people (requires user profiles)
A football fan does not play Angry Birds during a football match
Annotation by logistic regression
Train over data at rest
Streaming prediction at crawl time
Portugal Telecom use cases
MEO quadruple-play
Features
Internet
TV (IPTV)
Mobile phone
Landline phone
Current challenges
Heterogeneous data
Heterogeneous technical solutions
Customer profiling
Cross-domain recommendation
1TB/day
Rovio use cases
Development at SZTAKI
iALS
- Flink already has explicit ALS
- The implementation of the implicit version is done
- Currently testing the algorithm's accuracy
Matrix factorization
- Distributed algorithm*
- We have a working prototype tested on smaller matrices but it still needs optimization
Logistic regression
- Implementation in progress
- It is based on stochastic gradient descent, but in Flink there is only a batch version
- Currently working on the gradient descent implementation
Metrics
- Implementation and testing are finished
- We need to create a pull request
*R. Gemulla et al., "Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent", KDD 2011.
Summary
• Scala	is	a	great tool for building	DSLs
• FlinkML’s API	is	motivated by scikit-learn
• Streaming	is	a	natural	fit	for	ML	predictors
• Online	learning	can	outperform	batch	in	certain	cases
• The	Streamline project	builds on Flink,	aims to contribute back	
as much of	the results as possible
