DA Lab Program-6
6. Use Apache Spark to perform the following tasks on the given dataset:
a) Find the movie with the lowest average rating with RDD.
Steps to be followed:
Prerequisites:
• Apache Spark Setup: Ensure you have Apache Spark installed and configured.
(If it is not installed, open a terminal and run: pip install pyspark)
• Loading Data: First, load the data into Spark RDDs or DataFrames.
from pyspark.sql import SparkSession
# Create the Spark session
spark = SparkSession.builder.appName("MovieRatingsAnalysis").getOrCreate()
# Load datasets
movies_df = spark.read.csv("/Users/amithpradhaan/Desktop/ml-latest-small/movies.csv",
                           header=True, inferSchema=True)
ratings_df = spark.read.csv("/Users/amithpradhaan/Desktop/ml-latest-small/ratings.csv",
                            header=True, inferSchema=True)
# Create RDDs
movies_rdd = movies_df.rdd
ratings_rdd = ratings_df.rdd
a) Find the Movie with the Lowest Average Rating Using RDD.
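The handout does not show the code for this task, so the following is only a minimal sketch of one way it could be done with RDD operations. It assumes the ratings rows expose movieId and rating and the movies rows expose movieId and title (the MovieLens ml-latest-small column names). The helper names to_sum_count, merge_sum_count, to_average and lowest_average_movie are hypothetical and do not come from the lab material.

```python
# Sketch only: all function names here are hypothetical, not part of the handout.

def to_sum_count(row):
    """Map one ratings row to (movieId, (rating, 1))."""
    return (row["movieId"], (row["rating"], 1))


def merge_sum_count(a, b):
    """Add two (sum, count) pairs; intended for use with reduceByKey."""
    return (a[0] + b[0], a[1] + b[1])


def to_average(sum_count):
    """Turn a (sum, count) pair into an average rating."""
    total, count = sum_count
    return total / count


def lowest_average_movie(ratings_rdd, movies_rdd):
    """Return (movieId, title, average) for the movie with the lowest average.

    Assumes ratings_rdd rows have 'movieId' and 'rating', and movies_rdd rows
    have 'movieId' and 'title'. If several movies tie, one of them is returned.
    """
    avg_ratings = (
        ratings_rdd.map(to_sum_count)       # (movieId, (rating, 1))
        .reduceByKey(merge_sum_count)       # (movieId, (sum, count))
        .mapValues(to_average)              # (movieId, average)
    )
    # Pick the pair with the smallest average rating.
    movie_id, average = avg_ratings.min(key=lambda kv: kv[1])
    # Look up the title of that movie in the movies RDD.
    title = (
        movies_rdd.filter(lambda m: m["movieId"] == movie_id)
        .map(lambda m: m["title"])
        .first()
    )
    return movie_id, title, average
```

With the RDDs created above, this could be called as lowest_average_movie(ratings_rdd, movies_rdd). Carrying a (sum, count) pair through reduceByKey avoids grouping all ratings of a movie in memory. Note that a movie with a single low rating can win, since no minimum number of ratings is applied here.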
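# Top users by number of ratings. Assumption: only the reduceByKey step appears in the handout; the map and sortBy steps are reconstructed.
user_ratings_count = ratings_rdd.map(lambda r: (r['userId'], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .sortBy(lambda x: x[1], ascending=False)
top_users = user_ratings_count.take(10)
print(f"Top users by number of ratings: {top_users}")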
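from pyspark.sql.functions import from_unixtime, month
# Assumption: only the withColumn step appears in the handout; the start of the statement and the groupBy/count step are reconstructed.
ratings_over_time = ratings_df \
    .withColumn("month", month(from_unixtime(ratings_df['timestamp']))) \
    .groupBy("month").count().orderBy("month")
ratings_over_time.show()  # Show distribution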
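min_ratings = 100
# Assumption: the handout does not show how qualified_movies is built; here it holds (movieId, (average, count)) pairs.
qualified_movies = ratings_rdd.map(lambda r: (r['movieId'], (r['rating'], 1))) \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
    .mapValues(lambda s: (s[0] / s[1], s[1])).filter(lambda x: x[1][1] >= min_ratings)
highest_rated_movies = qualified_movies.sortBy(lambda x: x[1][0], ascending=False).take(10)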