RECOMMENDER SYSTEMS
Jesse Davis
Economics:
Traditional Retail vs. The Web
Physical retail: Space is limited and expensive
People are not willing to travel far for products
Products stocked must make money
Implication: Must focus on popular products
Web: Storage space is cheap
Sites cater to everyone
Implication: Low cost and easy access make it possible to offer far more choice
Problem: Too many products, what interests me?
Solution: Systems that can recommend products
Sales vs. Products: The Long Tail
[Figure: the long tail. Sales (y-axis) versus products, e.g., songs, books, etc. (x-axis): physical stores carry only the popular head of the curve, mixed retailers (e.g., Amazon) cover more of it, and online-only catalogs (e.g., iTunes) extend far into the tail.]
Making Recommendations
“Into Thin Air” and “Touching the Void” are similar books: the recommendation link between them helped “Into Thin Air” make “Touching the Void” a bestseller.
Wired article: [Link]
Recommendation Types
Editorial:
Lists of favorites
Essential items
Aggregates:
Top 10 lists
Most emailed articles
Most recent posts
Personalized user recommendations
Amazon
Movie sites
Key Challenges
How do we get user feedback?
How do we evaluate predictions?
How do we predict an unknown rating?
Main interest: Highly rated products
Options for Obtaining Ratings
Explicit: users rate products directly
Implicit signals:
Bought items
Items on “wish lists”
Recently clicked product pages/links
Length of time spent on a product page
Printed links
Etc.
Evaluation
[Figure: a users × items ratings matrix. Most cells hold known ratings (1–5); each user's most recent ratings are withheld (shown as ?) and used as the test set.]
Metrics: Compare with Known Rating
Precision at top 10: fraction of the top-10 recommendations that are relevant (highly rated)
Root-mean-square error:
$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (P_i - R_i)^2}$
Spearman's rank correlation between the model's and the user's complete rankings:
$R_s = 1 - \frac{6 \sum_{i=1}^{n} (U_i - M_i)^2}{n(n^2 - 1)}$
For 0/1 data: Precision, ROC, etc.
Coverage: number of items/users for which the system can make a prediction
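As a concrete illustration, here is a minimal Python sketch of the two numeric metrics above; the rating and ranking arrays are made-up toy values.

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-square error between predicted and known ratings."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

def spearman(model_ranks, user_ranks):
    """Spearman's rank correlation from two complete rankings (no ties)."""
    d = np.asarray(model_ranks) - np.asarray(user_ranks)
    n = len(d)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Toy example (made-up numbers):
print(rmse([3.5, 2.0, 4.8], [4, 2, 5]))      # ~0.31
print(spearman([1, 2, 3, 4], [2, 1, 3, 4]))  # 0.8
```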
Weakness with Metrics
Focusing on accuracy misses important points
Prediction diversity
Prediction context
Order of prediction
Only high ratings matter
RMSE might penalize a method that does well for
high ratings and badly for others
Recommending Products
Idea 1: This is a machine learning problem
Collect (a lot of) ratings for the product from users
Define a set of features (i.e., profile for item)
Learn a model
Idea 2: Content-based filtering
Define a profile for an item
Define a profile for a user
Compare similarity between users and items
Key Challenge: Building an Item Profile
Item profile: A way to describe each item
These are usually hand-crafted
For a movie:
Actors
Genre
Director
Etc.
News articles:
Words, title, author, etc.
User Profiles and
Content-Based Prediction
Main idea: Recommend items to customer x that
are similar to previous items rated highly by x
User profile possibilities:
Weighted average of rated item profiles
Variation: weight by difference from average
rating for item
Prediction heuristic: Given user profile x and
item profile i, estimate
$u(\mathbf{x}, \mathbf{i}) = \cos(\mathbf{x}, \mathbf{i}) = \frac{\mathbf{x} \cdot \mathbf{i}}{\|\mathbf{x}\| \, \|\mathbf{i}\|}$
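A minimal sketch of this heuristic, assuming hypothetical three-dimensional item profiles and a user profile built as the rating-weighted average of the items the user rated:

```python
import numpy as np

def cosine(x, i):
    """u(x, i) = x·i / (|x||i|), the prediction heuristic above."""
    return np.dot(x, i) / (np.linalg.norm(x) * np.linalg.norm(i))

# Hypothetical 3-feature item profiles (e.g., action, comedy, drama weights).
rated_items = np.array([[0.9, 0.1, 0.0],
                        [0.8, 0.0, 0.2]])
ratings = np.array([5.0, 4.0])

# User profile: weighted average of the rated item profiles.
user_profile = (ratings[:, None] * rated_items).sum(0) / ratings.sum()

candidate = np.array([0.7, 0.1, 0.1])   # a new item's profile
print(cosine(user_profile, candidate))  # high score -> recommend
```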
Pros and Cons of
Content-Based Approaches
Advantages
Only need data about one user
More personalized approach
More easily recommend new/unpopular items
Can provide context for recommendation
Disadvantages
Must manually construct meaningful features
Never recommends items outside of a
user’s content profile
Hard to build a profile for a new user
Netflix and Recent Advances
Presenting Recommendations in the Context of the Netflix Challenge
While the challenge is now older, it highlights a number of important data mining issues that are applicable to many other problems
It has driven much of the advance in recommender systems: we are still building on these ideas
Netflix: The $1 Million Question
Given: A training set of 100 million ratings
Do: Build a recommendation system that
improves the root-mean squared error by
10% over Netflix’s system
Netflix Data and Evaluation
Each rating: a (user, movie, rating, timestamp) tuple
Train set: 100M ratings
• 99% sparse
• 480,000 users
• 17,700 movies
Test set: 3M ratings known only to Netflix
• 1.5M for the public leaderboard
• 1.5M for the final winner
Score: test-set RMSE
The Competition
Register on the competition site and download the anonymized data
Submit predictions for the test set at most once per day
Winning algorithm: software and a non-exclusive license go to Netflix; the algorithm must be published
$50K improvement prizes awarded every year
Once the 10% threshold is met: 30-day final competition period
Started in October 2006: more popular than anticipated (eventually around 40,000 teams!)
Pros and Cons of a Challenge?
Advantages
Developing such a system internally would be expensive
Publicity is good
Awarding prize will pay for itself:
$1 million ≪ value of software to Netflix
Disadvantages
Privacy concerns (user backlash to data release)
Prize won too quickly or no one wins the prize
Time and effort to run the competition
Results not useful in practice (e.g., too slow)
What Worked Well:
Or General Data Analysis Advice
Try the obvious thing first
Predict the right thing
Think outside the box
Know your data
The more (models), the merrier
Try Obvious Approaches First
Classic Recommendation Approach:
Collaborative Filtering
Intuition: Find users with similar tastes and
recommend products they liked
Insight: Can do this just by looking at ratings!
Big idea:
Find other users whose ratings are similar to the
current user
Propagate the (dis)likes to the current user
Pictorial Overview
       R1  R2  R3  R4  R5  R6
Alice   2   -   3   2   -   1
Bob     2   5   4   -   -   2
Chris   4   3   -   -   -   5
Diana   3   -   2   4   -   5
Eve     2   5   ?   ?   ?   ?   ← active user
Find the users whose ratings correlate with Eve’s and propagate their ratings: here, predict R3 = 4 for Eve.
General Algorithm
Step 1: Measure the similarity between
the user of interest and all other users
Step 2 (Optional): Select a smaller subset
consisting of the most similar users
Step 3: Predict ratings as a weighted
combination of “nearest neighbors”
Step 4: Return the highest-rated items
Step 1: Similarity Weighting
Question: How can we compute the similarity
between two users?
Idea 1: Jaccard similarity: $\mathrm{sim}(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}$
Problem: Ignores the actual ratings
Idea 2: Cosine similarity: $\mathrm{sim}(S_i, S_j) = \frac{S_i \cdot S_j}{\|S_i\| \, \|S_j\|}$
Problem: Treats missing ratings as zero
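A small sketch contrasting the two ideas on made-up ratings stored as item → rating dictionaries; note how the cosine version must materialize zeros for unrated items:

```python
import numpy as np

# Ratings as {item: rating} dicts for two hypothetical users.
a = {"R1": 2, "R2": 5, "R3": 4}
b = {"R1": 2, "R3": 3, "R6": 2}

def jaccard(a, b):
    """Overlap of rated items; the rating values themselves are ignored."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def cosine_sim(a, b, items):
    """Cosine over full item vectors: missing ratings become zeros."""
    va = np.array([a.get(i, 0) for i in items], float)
    vb = np.array([b.get(i, 0) for i in items], float)
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

items = ["R1", "R2", "R3", "R4", "R5", "R6"]
print(jaccard(a, b))            # 2/4 = 0.5
print(cosine_sim(a, b, items))  # distorted by the implicit zeros
```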
Step 1: Similarity Weighting
Pearson Correlation
$W_{ij} = \frac{\sum_k (R_{ik} - \bar{R}_i)(R_{jk} - \bar{R}_j)}{\sqrt{\sum_k (R_{ik} - \bar{R}_i)^2} \, \sqrt{\sum_k (R_{jk} - \bar{R}_j)^2}}$
$\bar{R}_i = \frac{1}{m} \sum_{e=1}^{m} R_{ie}$ (average rating given by user i)
$R_{ik}$ = user i’s rating on item k
Consider only items k rated by both users!!!
What Does Pearson Correlation
Mean?
$W_{ij} = \frac{\sum_k (R_{ik} - \bar{R}_i)(R_{jk} - \bar{R}_j)}{\sqrt{\sum_k (R_{ik} - \bar{R}_i)^2} \, \sqrt{\sum_k (R_{jk} - \bar{R}_j)^2}}$
Positive if both predictions on same side of average
1 = perfect linear relationship with Rik increasing with Rjk
-1 = perfect linear relationship with Rik decreasing
when Rjk increases
Step 2: Selecting Neighborhood
Could use the whole database
Could ignore “far away” users
Pick “k” nearest users
Include all users above a predetermined weight
threshold
Note: Could use any k-NN strategy here
Step 3: Predicting Ratings
Predict a rating
Account for different rating levels by looking at
difference from a user’s average rating
Weight each user’s contribution by similarity
$P_{ik} = \bar{R}_i + \alpha \sum_j W_{ij} (R_{jk} - \bar{R}_j)$
where $\alpha = \frac{1}{\sum_j |W_{ij}|}$ and $W_{ij}$ is the Pearson correlation
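Putting steps 1 and 3 together, a minimal Python sketch on made-up ratings; for simplicity the "neighborhood" here is every other user who rated the item (step 2 is skipped):

```python
import numpy as np

def pearson(ri, rj):
    """Similarity W_ij, computed over items rated by BOTH users."""
    common = sorted(set(ri) & set(rj))
    if len(common) < 2:
        return 0.0                       # too little overlap to correlate
    ri_bar = np.mean(list(ri.values()))  # averages over ALL of each user's ratings
    rj_bar = np.mean(list(rj.values()))
    di = np.array([ri[k] for k in common]) - ri_bar
    dj = np.array([rj[k] for k in common]) - rj_bar
    denom = np.sqrt((di ** 2).sum() * (dj ** 2).sum())
    return float(di @ dj / denom) if denom else 0.0

def predict(user, item, ratings):
    """P_ik = R̄_i + α Σ_j W_ij (R_jk − R̄_j) with α = 1 / Σ_j |W_ij|."""
    r_bar = np.mean(list(ratings[user].values()))
    num = den = 0.0
    for other, r_other in ratings.items():
        if other == user or item not in r_other:
            continue
        w = pearson(ratings[user], r_other)
        num += w * (r_other[item] - np.mean(list(r_other.values())))
        den += abs(w)
    return r_bar + num / den if den else r_bar

# Made-up ratings, dicts of item -> rating.
ratings = {"Alice": {"R1": 2, "R3": 3, "R4": 2, "R6": 1},
           "Bob":   {"R1": 2, "R2": 5, "R3": 4, "R6": 2},
           "Eve":   {"R1": 2, "R2": 5}}
print(predict("Eve", "R3", ratings))  # ~4.25, driven by Bob, Eve's closest match
```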
Step 4: Return Items
Usually only interested in highly rated items
Do not want to overload the user
Therefore, usually return only the top 2, 5, or 10 items
Can give user the option to view more items
Practical Issues
Potentially very large datasets
Number of users: Millions
Number of items: 100,000s
Key issue: How to efficiently find similar users?
Pairwise similarity for 1M users: ~5 days
Pairwise similarity for 10M users: ~1 year!
Big trick: Locality Sensitive Hashing
(See programming for big data)
Pros and Cons of
Collaborative Filtering
Advantages
Simple and intuitive approach for any item type
No feature construction and selection
Exploits information about other users
Disadvantages
Data is sparse: Hard to find similar users
Cold start: Need enough users in the database
First rater: Can’t recommend unrated items,
e.g., new or unique items
Popularity bias: Favors items that lots of people like
(i.e., bad if you have unique taste)
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514
Basic Collaborative filtering: 0.94
Grand Prize: 0.8563
Predict the Right Thing
Predict the “Right Thing”
Our task
Predict(user_id, movie_id, ?)
Minimize RMSE of predictions
Question: What will produce low RMSE?
Two points:
Obvious: Better results if model optimized
towards the given objective
Subtle: To get good results, often have to derive
a new target variable to predict
What Affects a User’s Rating?
Hypothesis: Multiple factors affect rating
Goal: Isolate the portion of the rating that
captures the user-movie effect
Two big effects:
User – User’s rating scale
Bias – Values of other ratings user gave recently
(user’s mood, anchoring, multi-user accounts)
Movie – (Recent) popularity of movie i
Bias – Selection bias (“frequency”)
Capturing Global Effects:
Better Baseline
$r_{xi} = b_{xi} = \mu + b_x + b_i$
$\mu$: overall mean rating (average rating over all movies in the data)
$b_x$: rating deviation of user x = (average rating of user x) − $\mu$
$b_i$: rating deviation of movie i = (average rating of movie i) − $\mu$
Example of Baseline
Mean movie rating: 3.7 stars
The Sixth Sense is 0.5 stars above average
Joe rates 0.2 stars below average
Baseline estimate: 3.7 + 0.5 + (−0.2) = 4.0
Joe will rate The Sixth Sense 4 stars
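A small sketch of this baseline computed from made-up (user, movie, rating) triples:

```python
import numpy as np

# Made-up (user, movie, rating) triples.
ratings = [("Joe", "Sixth Sense", 4), ("Joe", "Titanic", 3),
           ("Ann", "Sixth Sense", 5), ("Ann", "Titanic", 4)]

mu = np.mean([r for _, _, r in ratings])  # overall mean rating

def bias(key, idx):
    """Average rating of a user (idx=0) or a movie (idx=1), minus mu."""
    return np.mean([t[2] for t in ratings if t[idx] == key]) - mu

b_x = bias("Joe", 0)          # Joe's rating deviation
b_i = bias("Sixth Sense", 1)  # the movie's rating deviation
print(mu + b_x + b_i)         # baseline estimate b_xi = mu + b_x + b_i
```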
Three Problems with the
Collaborative Filtering Model
Similarities are arbitrary: Many choices and
not tied to our ultimate objective
Do not account for interactions between
neighbors: May “double count’’
Weighted average: Ignores differences between high- and low-confidence neighbors.
Example: 3-NN with similarities 0.1, 0.05, and 0.15:
$2 + \frac{0.1(2) + 0.05(3) + 0.15(2)}{0.1 + 0.05 + 0.15} \approx 4.17$
3-NN with similarities 0.6, 0.3, and 0.9 (six times larger, i.e., far more confident):
$2 + \frac{0.6(2) + 0.3(3) + 0.9(2)}{0.6 + 0.3 + 0.9} \approx 4.17$
The predictions are identical: only the ratios of the weights matter.
Idea: Weighted Sum
Weighted sum rather than weighted avg.:
$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij} (r_{xj} - b_{xj})$
A few notes:
𝒃𝒙𝒊 is the new, more complex baseline with biases
𝑵 𝒊; 𝒙 : movies rated by user x that are similar to
movie i
𝒘𝒊𝒋 is the interpolation weight (some real number)
𝒘𝒊𝒋 models the interaction between pairs of movies
(it does not depend on user x)
Picking Weights: Optimization
Competition objective: sum of squared errors
$\sum_{(i,x) \in R} (\hat{r}_{xi} - r_{xi})^2$
Idea: Pick the weights to minimize this objective!
$J(w) = \sum_{x,i} \Big( b_{xi} + \sum_{j \in N(i;x)} w_{ij}(r_{xj} - b_{xj}) - r_{xi} \Big)^2$
(the expression before $r_{xi}$ is the predicted rating; $r_{xi}$ is the true rating)
Posed as an optimization problem!
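A toy sketch of minimizing $J(w)$ by gradient descent; for illustration it assumes every training pair shares the same three neighbors, so the weights form one small vector (in the real model each $w_{ij}$ is tied to a pair of movies):

```python
import numpy as np

# Toy setup (made-up numbers): for each training pair (x, i) we have
# precomputed the residual r_xi - b_xi and the vector of neighbor
# deviations d_j = r_xj - b_xj for j in N(i; x).
D = np.array([[0.5, -1.0, 0.2],        # one row per training pair (x, i)
              [1.0,  0.3, -0.5],
              [-0.2, 0.8,  0.1]])
residual = np.array([0.7, -0.4, 0.3])  # r_xi - b_xi for each pair

w = np.zeros(3)                        # interpolation weights
eta = 0.1
for _ in range(500):                   # gradient descent on J(w)
    err = D @ w - residual             # per-pair prediction error
    w -= eta * 2 * (D.T @ err) / len(residual)
print(w, D @ w)                        # learned weights and fitted residuals
```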
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514
Basic Collaborative filtering: 0.94
CF+Biases+learned weights: 0.91
Grand Prize: 0.8563
Think Outside the Box
Let’s Think about the Task
Goal: Fill in the “missing entries” of the users × movies ratings matrix:
    1  3  ?  5
    5  4  ?  ?
    2  4  1  2
    2  4  5  ?
    4  3  4  2
    1  3  3  ?
Likely some shared hidden structure for users and movies
Implication: Could represent the data using “fewer” dimensions
Latent Factor Models
[Figure: movies such as Sense and Sensibility, Amadeus, Braveheart, Lethal Weapon, Dumb and Dumber, Syriana, and Ocean’s 11 plotted in a two-dimensional latent space: Factor 1 runs from serious to light-hearted, Factor 2 from drama to action.]
Latent Factor Model
Approximate the ratings matrix as a product of two thin matrices: $R \approx Q P^{\mathsf{T}}$
$R$ = u × m ratings matrix (users × movies)
$Q$ = u × d matrix (users × “topics”)
$P$ = m × d matrix (movies × “topics”)
Key idea: d ≪ m, u
“Topics”: shared hidden structure (e.g., how much each user likes each genre)
Latent Factor Models
[Figure: a numeric example. A users × movies ratings matrix is approximated by the product of a user-factor matrix and a factor-item matrix with three factors. A missing rating is predicted by the dot product of the corresponding user- and item-factor vectors, e.g. ? = (−0.5)(−2) + (0.6)(0.3) + (0.5)(2.4) = 2.38.]
Optimization Problem
Goal: Find matrices P and Q according to:
$\min_{P,Q} \sum_{(i,j) \in R} \big(R_{ij} - (QP^{\mathsf{T}})_{ij}\big)^2$
Three things to think about
R has many missing entries
Objective is non-convex
R is sparse, so worried about overfitting
Intuition of a Solution
[Figure: the ratings matrix approximated by the product of two factor matrices whose entries are all still unknown (shown as ?).]
Intuition of a Solution
[Figure: fix the factor matrix on one side to concrete values; each user's known ratings then define a small least-squares problem for that user's factor vector.]
For one user, the known ratings give:
X1     X2     Y
0.1   −1.4    2
0.1    0.8    4
1.2    0.4    1
0.1   −0.3    2
Fit the least-squares solution.
Idea: Alternating Least Squares
$\min_{P,Q} \sum_{(i,j) \in R} \big(R_{ij} - (QP^{\mathsf{T}})_{ij}\big)^2 + \lambda_1 \|Q\|_F^2 + \lambda_2 \|P\|_F^2$
The sum runs over the non-missing entries only; the Frobenius norm is the matrix equivalent of the L2 norm.
Input: R, d
Randomly initialize Q, P
for i = 0, 1, … do
    Fix Q: optimize P (solve m d-dimensional ridge regression problems)
    Fix P: optimize Q (solve u d-dimensional ridge regression problems)
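A compact sketch of this loop, assuming a small dense matrix R with zeros marking the unknown entries and a single regularizer `lam` playing the role of both λ1 and λ2:

```python
import numpy as np

def als(R, mask, d=2, lam=0.1, iters=20):
    """Alternating least squares for R ≈ Q P^T on the observed entries.

    R    : u x m ratings matrix (zeros where unknown)
    mask : u x m boolean matrix, True where a rating is observed
    """
    u, m = R.shape
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((u, d))   # user factors
    P = rng.standard_normal((m, d))   # item factors
    I = lam * np.eye(d)
    for _ in range(iters):
        for i in range(m):            # fix Q: ridge regression per movie
            obs = mask[:, i]
            A = Q[obs]
            P[i] = np.linalg.solve(A.T @ A + I, A.T @ R[obs, i])
        for x in range(u):            # fix P: ridge regression per user
            obs = mask[x, :]
            A = P[obs]
            Q[x] = np.linalg.solve(A.T @ A + I, A.T @ R[x, obs])
    return Q, P

R = np.array([[1., 3., 0., 5.],
              [5., 4., 0., 0.],
              [2., 4., 1., 2.]])
mask = R > 0
Q, P = als(R, mask)
print(Q @ P.T)  # reconstructed matrix; unknown cells are now predictions
```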
Latent Factor Models with Biases
[Figure: the ratings matrix R factored into an item-factor matrix Q and a user-factor matrix P^T.]
$\hat{r}_{xi} = q_i \cdot p_x = \sum_f q_{if} \, p_{xf}$
Adding the global effects back in:
$\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x$
($\mu$: overall mean rating; $b_x$: bias for user x; $b_i$: bias for movie i; $q_i \cdot p_x$: user-movie interaction)
Koren, Bell, Volinsky, IEEE Computer, 2009
Notes On Latent Factors
Regularization is key to avoiding overfitting
Often want to preprocess or normalize the data
(Note: Must convert back to original scale when
making a prediction)
Alternating least squares is easy to parallelize
Can also solve via stochastic gradient descent
Non-Negative Matrix Factorization
Often want the entries of P, Q to be
non-negative: Easier to interpret and visualize
Easy solution: Projected gradient methods
Normal gradient: $w_{i+1} \leftarrow w_i - \eta \, \frac{\partial \ell(w)}{\partial w_i}$
Projected gradient: $w_{i+1} \leftarrow \max\!\Big(0,\; w_i - \eta \, \frac{\partial \ell(w)}{\partial w_i}\Big)$
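The two update rules side by side, applied to a made-up weight vector and gradient:

```python
import numpy as np

def gradient_step(w, grad, eta):
    """Standard gradient descent update."""
    return w - eta * grad

def projected_step(w, grad, eta):
    """Same update, then project back onto the non-negative orthant."""
    return np.maximum(0.0, w - eta * grad)

w = np.array([0.05, 0.4])
grad = np.array([1.0, -2.0])        # made-up gradient values
print(gradient_step(w, grad, 0.1))  # [-0.05, 0.6]  (negative entry appears)
print(projected_step(w, grad, 0.1)) # [ 0.0 , 0.6]  (clipped back to zero)
```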
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514
Basic Collaborative filtering: 0.94
CF+Biases+learned weights: 0.91
Latent factors: 0.90
Latent factors+Biases: 0.89
Grand Prize: 0.8563
Know Your Data
Temporal Effect: Early 2004 Sudden
Jump in Average Movie Rating
Pre-2004: average rating ≈ 3.4 stars
Post-2004: average rating > 3.6 stars
Netflix improved matching, leading to higher ratings?
People became biased towards higher ratings, changing the meaning of a rating?
Temporal Effect: Movie Age
High initial ratings
Ratings increase with movie age
Older movies rated by users better matched to
them?
Older movies are inherently better than newer
movies?
Temporal Biases Of Users
[Koren, KDD’09]
Original model: $r_{xi} = \mu + b_x + b_i + q_i \cdot p_x$
Add time dependence to the biases:
$r_{xi} = \mu + b_x(t) + b_i(t) + q_i \cdot p_x$
Make the parameters $b_x$ and $b_i$ depend on time:
(1) Parameterize the time dependence by linear trends
(2) Bin the timeline, each bin covering 10 consecutive weeks
Add temporal dependence to the factors as well:
$p_x(t)$: user preference vector on day t
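A tiny sketch of option (2), the binned parameterization $b_i(t) = b_i + b_{i,\mathrm{Bin}(t)}$, with hypothetical learned values:

```python
WEEKS_PER_BIN = 10  # each bin covers 10 consecutive weeks, as above

def time_bin(week):
    """Map a rating's week index to its bias bin."""
    return week // WEEKS_PER_BIN

# Hypothetical learned parameters for one movie.
b_i = 0.3                            # static movie bias
b_i_bin = {0: -0.1, 1: 0.0, 2: 0.2}  # per-bin corrections

def movie_bias(week):
    """b_i(t) = b_i + b_{i,Bin(t)}: static bias plus a binned offset."""
    return b_i + b_i_bin.get(time_bin(week), 0.0)

print(movie_bias(5))   # week 5 -> bin 0 -> 0.2
print(movie_bias(25))  # week 25 -> bin 2 -> 0.5
```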
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514
Basic Collaborative filtering: 0.94
CF+Biases+learned weights: 0.91
Latent factors: 0.90
Latent factors+Biases: 0.89
Latent factors+Biases+Time: 0.876
Grand Prize: 0.8563
The More (Models) the Merrier
What Now??
Tried lots of things, but still have not reached the
10% threshold
Getting desperate…
Idea: “Kitchen Sink Approach”
Build lots and lots of predictors
◼ Classifiers
◼ Collaborative filters
◼ Ensembles
Come up with clever ways to blend the results
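One simple blending scheme (a sketch, not necessarily what the teams did) is a linear combination whose weights are fit by least squares on held-out ratings; the predictions below are made up:

```python
import numpy as np

# Held-out predictions from three hypothetical models (rows = ratings).
preds = np.array([[3.8, 4.1, 3.5],
                  [2.2, 2.0, 2.6],
                  [4.9, 4.6, 4.8],
                  [1.4, 1.9, 1.6]])
truth = np.array([4.0, 2.0, 5.0, 1.5])

# Linear blend: solve least squares for the mixing weights.
w, *_ = np.linalg.lstsq(preds, truth, rcond=None)
blend = preds @ w
print(w)      # how much each model contributes
print(blend)  # blended predictions, often beating any single model
```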
2009
At the end of June, the leading team
(BellKorPragmaticChaos) submits results that
exceed the 10% threshold
Competition enters 30 day final period
A new “Ensemble” team forms from a collaboration of other teams near the top of the leaderboard and quickly beats the 10% threshold too
The race is on…
Standing on June 26th 2009
June 26th submission triggers 30-day “last call”
The Final Countdown
Direct competition between two teams
Teams can only submit results once per day, so each team gets exactly one submission on the final day
A day before the deadline, BellKorPragmaticChaos notices that Ensemble is in the lead
Each team prepares final results
BellKorPragmaticChaos submits 40 mins early
Ensemble submits 20 mins early
And they wait…
September 2009: Prize Awarded
The teams are tied: BellKorPragmaticChaos wins by submitting earlier
Conclusions and Perspectives
Perspectives
Must account for unpredictable user behavior
Think about what you really want to predict and
whether you need a new target
Knowing your data is crucial
When in doubt, combine lots of models
Privacy and ethics are tricky but important
Summary
Recommender systems are an important and active area of research and use
Data is really messy
Paradigms
Content: Based on designing features and
applying classification/regression algorithms
Collaborative: Based on comparing users/items
(aka nearest neighbor)
Latent factor approaches
Baselines important in practice
Questions
A number of slides are based on the [Link]
lectures on recommender systems
[Link]