Designing a Personalized Content Recommendation System
Problem, Data, Cleaning, Models, Inference, and Deployment
Chandan J
July 23, 2025
1 / 20
Agenda
1 Problem Definition
2 Data & Datasets
3 Data Cleaning & Feature Engineering
4 Modeling Approaches
5 Training & Inference
6 Evaluation
7 Deployment & MLOps
8 Summary
2 / 20
What Are We Building?
Goal
Develop an algorithm that personalizes content (media, articles, products, etc.) for each user
to maximize engagement, satisfaction, or business KPIs.
Key Questions
What content types? (videos, news, songs, courses, products)
Which signals? (clicks, watch time, ratings, purchases, dwell time)
Which metric defines success? (CTR, NDCG@10, retention, revenue)
Real-time vs. batch; on-device vs. cloud; latency constraints?
3 / 20
Example Use Cases
Domain | Personalization Task
News App | Rank daily articles per user based on reading history and topics of interest.
OTT/Streaming | Recommend next movies/episodes; continue watching; cold-start for new users.
E-Learning | Suggest courses/modules matching skills and completed lessons.
E-Commerce | “Customers like you also bought”; re-rank search results for conversion.
Social Media Feed | Order posts/stories balancing relevance, freshness, and diversity.
4 / 20
Data Sources
User Signals:
Explicit: ratings, likes/dislikes, thumbs up.
Implicit: clicks, watch time, scroll depth, add-to-cart.
Context: time, device, location, session info.
Item Metadata:
Text (title, description, tags, categories).
Audio/Video features (embeddings).
Creator info, publish time, popularity.
5 / 20
Public Benchmark Datasets
Dataset | Domain | Users/Items | Signals
MovieLens (100K/1M/20M) | Movies | 943/6k ... | Ratings (1–5)
Amazon Reviews (2018) | E-commerce | Millions | Ratings, reviews, timestamps
GoodBooks-10k | Books | 53k/10k | Ratings
Netflix Prize | Movies | 480k/17k | Ratings
Last.fm 1K | Music | 1k/65k | Play counts
Yelp Open Dataset | Local biz | 1.6M/200k | Ratings, reviews
RecSys Challenge sets | Varies yearly | Varies | Clicks, orders, add-to-cart
6 / 20
Building the Interaction Log
1. Define a unified schema: user id, item id, timestamp, event type, value.
2. Convert raw events to implicit scores (e.g., view → 1, complete → 3).
3. Handle missing/erroneous IDs, timestamps, duplicates.
4. Filter bots/outliers (excessive clicks in short time).
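A minimal sketch of steps 1–4 in pandas, assuming a raw clickstream export with columns user_id, item_id, ts, and event; the event-to-score weights and the bot threshold are illustrative choices, not fixed rules.

import pandas as pd

# Raw events, one row per user action (column names are assumptions).
events = pd.read_csv("events.csv")  # user_id, item_id, ts, event

# 1-2. Unified schema with implicit scores (example weights; tune per product).
event_weights = {"view": 1, "click": 2, "complete": 3, "purchase": 5}
log = pd.DataFrame({
    "user_id": events["user_id"],
    "item_id": events["item_id"],
    "timestamp": pd.to_datetime(events["ts"], errors="coerce"),
    "value": events["event"].map(event_weights),
})

# 3. Drop missing/erroneous IDs, unparseable timestamps, and exact duplicates.
log = log.dropna(subset=["user_id", "item_id", "timestamp", "value"]).drop_duplicates()

# 4. Crude bot filter: users with an implausible number of events in one hour.
per_hour = log.groupby(["user_id", log["timestamp"].dt.floor("h")]).size()
bots = per_hour[per_hour > 500].index.get_level_values("user_id").unique()
log = log[~log["user_id"].isin(bots)]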
7 / 20
Cleaning & Splitting
Temporal split: train on past, validate/test on future to avoid leakage.
Minimum interaction thresholds (e.g., users with ≥5 actions).
Negative sampling for implicit data (items the user didn’t interact with).
Normalize continuous features (popularity, recency).
Text cleanup: lowercase, stopwords, n-grams, embeddings.
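A sketch of the threshold, split, and sampling steps, assuming the log DataFrame from the previous slide (columns user_id, item_id, timestamp, value); the 80/20 time cut, the ≥5-action threshold, and one negative per positive are illustrative.

import numpy as np

# Minimum interaction threshold: keep users with at least 5 actions.
log = log[log.groupby("user_id")["item_id"].transform("count") >= 5]

# Temporal split: train on the past, evaluate on the future (no leakage).
cutoff = log["timestamp"].quantile(0.8)
train, test = log[log["timestamp"] <= cutoff], log[log["timestamp"] > cutoff]

# Negative sampling: for each positive, draw an item the user never touched.
rng = np.random.default_rng(0)
all_items = log["item_id"].unique()
seen = train.groupby("user_id")["item_id"].agg(set)

def sample_negative(user):
    while True:
        cand = rng.choice(all_items)
        if cand not in seen[user]:
            return cand

train = train.assign(neg_item_id=[sample_negative(u) for u in train["user_id"]])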
8 / 20
Baseline Methods
Non-personalized: top popular, trending, newest.
Content-based: TF-IDF / embedding similarity of item metadata to user profile.
Neighborhood CF: User-based or item-based kNN using cosine/Pearson similarity.
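A minimal content-based baseline with scikit-learn, assuming a list of item descriptions and the indices of items a user has consumed; using the mean TF-IDF vector as the user profile is one simple choice among many.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def content_based_topk(item_texts, user_history, k=10):
    vec = TfidfVectorizer(stop_words="english")
    item_vecs = vec.fit_transform(item_texts)                    # (n_items, vocab)
    profile = np.asarray(item_vecs[user_history].mean(axis=0))   # user profile = mean of consumed items
    scores = cosine_similarity(profile, item_vecs).ravel()
    scores[user_history] = -np.inf                               # never re-recommend seen items
    return np.argsort(-scores)[:k]                               # indices of the top-k items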
9 / 20
Matrix Factorization Family
ALS / SGD MF: Learn latent user/item vectors minimizing MSE.
BPR-MF: Pairwise ranking loss for implicit feedback.
SVD++: Incorporates implicit signals (clicks) + explicit ratings.
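A sketch of BPR-MF in PyTorch: latent user/item vectors scored by dot product and trained with the pairwise BPR loss (the same bpr_loss the training loop later calls). Embedding size and initialization scale are arbitrary here.

import torch
import torch.nn as nn

class BPRMF(nn.Module):
    def __init__(self, n_users, n_items, dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        nn.init.normal_(self.user_emb.weight, std=0.01)
        nn.init.normal_(self.item_emb.weight, std=0.01)

    def forward(self, users, items):
        # Predicted preference = dot product of the latent vectors.
        return (self.user_emb(users) * self.item_emb(items)).sum(dim=-1)

def bpr_loss(pos_scores, neg_scores):
    # Push observed items to score higher than sampled negatives.
    return -torch.log(torch.sigmoid(pos_scores - neg_scores)).mean()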
10 / 20
Neural Recommenders
Two-Tower / NCF:
Separate user and item encoders.
Dot product / MLP for matching.
Good for ANN retrieval (FAISS, ScaNN).
Sequence Models:
GRU4Rec, SASRec, Transformer4Rec.
Predict the next item from session history.
Handle context and order of interactions.
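A minimal two-tower sketch in PyTorch, assuming dense user and item feature vectors; the layer sizes are illustrative. Only the item tower has to be run over the whole catalog, and its outputs can be indexed with FAISS/ScaNN for retrieval.

import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self, user_feat_dim, item_feat_dim, dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Linear(user_feat_dim, 128), nn.ReLU(), nn.Linear(128, dim))
        self.item_tower = nn.Sequential(nn.Linear(item_feat_dim, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, user_feats, item_feats):
        u = self.user_tower(user_feats)    # user embedding
        v = self.item_tower(item_feats)    # item embedding
        return (u * v).sum(dim=-1)         # dot-product match score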
11 / 20
Advanced/Hybrid Approaches
Graph-based: GCNs/LightGCN on user–item bipartite graphs.
Context-aware: Wide & Deep, DeepFM, xDeepFM.
Knowledge Graph Recsys: leverage entity relations.
Hybrid: Combine collaborative + content signals.
Re-ranking: Diversity, novelty, fairness constraints (an MMR-style sketch follows).
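As one concrete re-ranking example, a maximal-marginal-relevance (MMR) style pass that trades relevance against similarity to items already selected; lambda and the dot-product similarity are illustrative choices.

import numpy as np

def mmr_rerank(candidates, relevance, item_vecs, k=10, lam=0.7):
    # candidates: item ids; relevance: model score per candidate; item_vecs: L2-normalized embeddings.
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            sim = max((float(item_vecs[candidates[i]] @ item_vecs[candidates[j]]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]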
12 / 20
Typical Training Loop (Ranking Model)
for epoch in range(E):
    model.train()
    for users, pos_items, neg_items in loader:
        pos_scores = model(users, pos_items)        # scores for observed (positive) items
        neg_scores = model(users, neg_items)        # scores for sampled negatives
        loss = bpr_loss(pos_scores, neg_scores)     # or CE, MSE, etc.
        loss.backward()
        optimizer.step(); optimizer.zero_grad()
    val_ndcg = evaluate(model, val_data, k=10)      # offline ranking metric on held-out data
    early_stopping(val_ndcg)                        # stop when validation NDCG stops improving
    save_checkpoint(...)
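The loop assumes a model, an optimizer, a loader yielding (user, positive item, negative item) index batches, and a bpr_loss such as the one sketched on the matrix-factorization slide; one possible wiring, with placeholder sizes and randomly generated triples standing in for the sampled data:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder triples from negative sampling (user, positive item, sampled negative item).
n = 100_000
user_ids = torch.randint(0, 10_000, (n,))
pos_item_ids = torch.randint(0, 50_000, (n,))
neg_item_ids = torch.randint(0, 50_000, (n,))

model = BPRMF(n_users=10_000, n_items=50_000, dim=64)   # hypothetical class from the MF sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(user_ids, pos_item_ids, neg_item_ids), batch_size=1024, shuffle=True)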
13 / 20
Serving / Inference Pipeline
Two-Stage Architecture:
1. Candidate Generation (fast, approximate)
   ANN search on item embeddings
   Retrieve top 200–1000 candidates
2. Ranking (slower, accurate)
   Rich features + deep model
   Output final top-k list
Online Considerations:
Latency budgets (e.g., < 100 ms)
Caching popular results
Real-time feature updates (streaming)
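A sketch of the candidate-generation stage with FAISS: index the item embeddings once, then pull a few hundred nearest items per user vector for the ranker. The flat inner-product index and the top-500 cut are illustrative; at catalog scale an IVF or HNSW index would replace the exact one.

import faiss
import numpy as np

dim = 64
item_vecs = np.random.rand(50_000, dim).astype("float32")  # placeholder item embeddings
faiss.normalize_L2(item_vecs)                              # cosine similarity via inner product
index = faiss.IndexFlatIP(dim)
index.add(item_vecs)

user_vec = np.random.rand(1, dim).astype("float32")        # placeholder user embedding
faiss.normalize_L2(user_vec)
scores, candidate_ids = index.search(user_vec, 500)        # stage 1: retrieve top-500 candidates
# Stage 2: re-score candidate_ids with the full ranking model and return the final top-k.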
14 / 20
Offline Metrics
Ranking: HitRate@k, NDCG@k, MRR, MAP.
Classification/AUC: ROC-AUC, PR-AUC for click prediction.
Rating Prediction: RMSE, MAE.
Beyond-accuracy: Diversity, novelty, serendipity, coverage.
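A minimal NDCG@k and HitRate@k for a single user with binary relevance, where ranked is the model’s ordered item list and relevant is the held-out set; real evaluation averages these over all test users.

import numpy as np

def ndcg_at_k(ranked, relevant, k=10):
    gains = [1.0 if item in relevant else 0.0 for item in ranked[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

def hitrate_at_k(ranked, relevant, k=10):
    return float(any(item in relevant for item in ranked[:k]))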
15 / 20
Online Testing
A/B testing on production traffic: CTR, retention, revenue uplift.
Interleaving tests for fine-grained pairwise comparison.
Guardrail metrics: latency, complaint rate, content policy violations.
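As a rough sketch of reading an A/B result, a two-proportion z-test on CTR between control and treatment, assuming independent impressions; the counts are made up, and in practice an experimentation platform with proper power analysis and guardrail checks handles this.

from statsmodels.stats.proportion import proportions_ztest

clicks = [4_820, 5_110]            # control, treatment (made-up numbers)
impressions = [100_000, 100_000]
z_stat, p_value = proportions_ztest(clicks, impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # small p-value suggests a real CTR difference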
16 / 20
Production Stack
Feature store (Feast), model registry (MLflow), experiment tracker (W&B).
Batch (Spark) + stream (Kafka/Flink) pipelines.
Model versioning, canary releases.
17 / 20
Monitoring & Ethics
Drift detection: user taste shifts, new items.
Bias/fairness: exposure imbalance, filter bubbles.
Privacy: GDPR/CCPA; minimize PII, anonymize logs.
Feedback loops: integrate user feedback/corrections.
18 / 20
Takeaways
Start with clear objectives and measurable metrics.
Build a robust data pipeline: clean, temporal splits, negative samples.
Compare baselines (popularity, CF) before complex neural models.
Two-stage serving (retrieve & rank) is practical at scale.
Continuous monitoring, ethical checks, and iteration are essential.
19 / 20
Questions?
20 / 20