Recommender System using PySpark – Python
Last Updated: 24 Apr, 2025
A recommender system is a type of information filtering system that provides personalized recommendations to users based on their preferences, interests, and past behaviors. Recommender systems come in a variety of forms, such as content-based, collaborative filtering, and hybrid systems. Content-based systems make recommendations for products based on how closely their characteristics match those of products the user has previously expressed interest in. Collaborative filtering systems recommend items based on the preferences of users who have similar interests to the user being recommended. Hybrid systems combine both content-based and collaborative filtering approaches to make recommendations.
We will implement this with the help of collaborative filtering. Collaborative filtering makes predictions (filtering) about a user's interests by compiling preference or taste data from many users (collaborating). The underlying premise is that if two users A and B share the same opinion on one subject, A is more likely to share B's opinion on a different subject x than the opinion of a randomly selected user.
Recommender System using PySpark
Spark's machine learning library, MLlib, implements collaborative filtering using Alternating Least Squares (ALS). The MLlib implementation takes the following parameters:
- numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- rank is the number of latent factors in the model.
- iterations is the number of iterations of ALS to run.
- lambda specifies the regularization parameter in ALS.
- implicitPrefs specifies whether to use the explicit-feedback ALS variant or the variant adapted for implicit feedback data.
- alpha is a parameter of the implicit-feedback variant that governs the baseline confidence in preference observations.
Here we will use a dataset of book ratings.
Step 1: Import the necessary libraries and set up the Spark session
Python3
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName('Recommender').getOrCreate()
spark
Output:
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.3.1
Master: local[*]
AppName: Recommender
Step 2: Read the data from the dataset
Python3
data = spark.read.csv('book_ratings.csv',
                      inferSchema=True, header=True)
data.show(5)
Output:
+-------+-------+------+
|book_id|user_id|rating|
+-------+-------+------+
| 1| 314| 5|
| 1| 439| 3|
| 1| 588| 5|
| 1| 1169| 4|
| 1| 1185| 4|
+-------+-------+------+
only showing top 5 rows
Describe the dataset
Python3
data.describe().show()
Output:
+-------+-----------------+------------------+------------------+
|summary| book_id| user_id| rating|
+-------+-----------------+------------------+------------------+
| count| 981756| 981756| 981756|
| mean|4943.275635697668|25616.759933221696|3.8565335989797873|
| stddev|2873.207414896143|15228.338825882149|0.9839408559619973|
| min| 1| 1| 1|
| max| 10000| 53424| 5|
+-------+-----------------+------------------+------------------+
Step 3: Split the data into training and test sets
Python3
train_data, test_data = data.randomSplit([0.8, 0.2])
Step 4: Build and fit the Alternating Least Squares (ALS) model
Python3
als = ALS(maxIter=5,
          regParam=0.01,
          userCol="user_id",
          itemCol="book_id",
          ratingCol="rating")
model = als.fit(train_data)
Step 5: Predictions
Python3
predictions = model.transform(test_data)
predictions.show()
Output:
+-------+-------+------+----------+
|book_id|user_id|rating|prediction|
+-------+-------+------+----------+
| 2| 6342| 3| 4.8064413|
| 1| 17984| 5| 4.9681554|
| 1| 38475| 4| 4.4078903|
| 2| 6630| 5| 4.344222|
| 1| 32055| 4| 3.990228|
| 1| 33697| 4| 3.7945805|
| 1| 18313| 5| 4.533183|
| 1| 5461| 3| 3.8614116|
| 1| 47800| 5| 4.914357|
| 2| 10751| 3| 4.160536|
| 1| 16377| 4| 5.304298|
| 1| 45493| 5| 3.998557|
| 2| 10509| 2| 1.8626969|
| 1| 33890| 3| 3.6022692|
| 1| 37284| 5| 4.8147345|
| 1| 1185| 4| 3.7463336|
| 1| 44397| 5| 5.0251017|
| 1| 46977| 4| 4.0746284|
| 1| 10944| 5| 4.343548|
| 2| 8167| 2| 3.705464|
+-------+-------+------+----------+
only showing top 20 rows
Evaluation
Python3
evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))
Output:
Root-mean-square error = nan
Step 6: Recommendations
Now, we will recommend books to a single user (let's say userId 5461) with the help of our trained model.
Python3
user1 = test_data.filter(test_data['user_id'] == 5461).select(['book_id', 'user_id'])
user1.show()
Output:
+-------+-------+
|book_id|user_id|
+-------+-------+
| 1| 5461|
| 11| 5461|
| 19| 5461|
| 46| 5461|
| 60| 5461|
| 66| 5461|
| 93| 5461|
| 111| 5461|
| 121| 5461|
| 172| 5461|
| 194| 5461|
| 212| 5461|
| 222| 5461|
| 245| 5461|
| 264| 5461|
| 281| 5461|
| 301| 5461|
| 354| 5461|
| 388| 5461|
| 454| 5461|
+-------+-------+
only showing top 20 rows
Python3
recommendations = model.transform(user1)
recommendations.orderBy('prediction', ascending=False).show()
Output:
+-------+-------+----------+
|book_id|user_id|prediction|
+-------+-------+----------+
| 19| 5461| 5.3429904|
| 11| 5461| 4.830688|
| 66| 5461| 4.804107|
| 245| 5461| 4.705879|
| 388| 5461| 4.6276107|
| 1161| 5461| 4.612251|
| 60| 5461| 4.5895457|
| 1402| 5461| 4.5184|
| 1088| 5461| 4.454755|
| 5152| 5461| 4.415825|
| 121| 5461| 4.3423634|
| 93| 5461| 4.3357944|
| 1796| 5461| 4.30891|
| 172| 5461| 4.2679276|
| 454| 5461| 4.245925|
| 1211| 5461| 4.2431927|
| 731| 5461| 4.1873074|
| 1094| 5461| 4.1829815|
| 222| 5461| 4.182873|
| 264| 5461| 4.1469045|
+-------+-------+----------+
only showing top 20 rows
The output above shows the predicted ratings for each book ID for the user with userId 5461, sorted from highest to lowest.
Step 7: Stop the Spark session
Python3
spark.stop()