
Rotten Tomatoes

December 23, 2024

Rotten Tomatoes - Audience Rating Prediction

0.1 Author
James Jeberson M - (for feedback reach out to [email protected])

1 Introduction
1.1 Objective:
1. Preprocess and transform data, including handling missing values and encoding features.
2. Build and evaluate multiple regression models to predict audience ratings.
• Linear Regression (Ridge & Lasso)
• Random Forest Regressor
• XGBoost Regressor
• CatBoost Regressor
• Neural Networks
3. Compare model performance using metrics like RMSE and R² to identify the best model.
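
As a point of reference for the two metrics named in objective 3, here is a minimal plain-Python sketch of how RMSE and R² are computed. The four rating pairs are hypothetical values chosen for illustration, not rows from the dataset; the notebook itself relies on scikit-learn's root_mean_squared_error and r2_score.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical audience ratings and predictions, for illustration only
actual = [50.0, 60.0, 70.0, 80.0]
predicted = [55.0, 60.0, 65.0, 80.0]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"R Squared: {r_squared(actual, predicted):.2f}")
```

RMSE is expressed in the same units as the rating (points on a 0 to 100 scale), while R² is the share of the target's variance that the predictions explain.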

2 Importing the required libraries

[1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

from textblob import TextBlob

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Concatenate, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from sklearn.metrics import root_mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

3 Data Loading
[2]: # loading the dataset as a pandas dataframe
rt_df = pd.read_excel("Rotten_Tomatoes_Movies3.xls")
print(f"The Rotten Tomatoes Dataset contains {rt_df.shape[0]} rows and {rt_df.shape[1]} columns")

The Rotten Tomatoes Dataset contains 16638 rows and 16 columns

[3]: # looking at the dataset

rt_df.head()

[3]: movie_title \
0 Percy Jackson & the Olympians: The Lightning T…
1 Please Give
2 10
3 12 Angry Men (Twelve Angry Men)
4 20,000 Leagues Under The Sea

movie_info \
0 A teenager discovers he's the descendant of a …
1 Kate has a lot on her mind. There's the ethics…
2 Blake Edwards' 10 stars Dudley Moore as George…
3 A Puerto Rican youth is on trial for murder, a…
4 This 1954 Disney version of Jules Verne's 20,0…

critics_consensus rating \
0 Though it may seem like just another Harry Pot… PG
1 Nicole Holofcener's newest might seem slight i… R
2 NaN R
3 Sidney Lumet's feature debut is a superbly wri… NR
4 One of Disney's finest live-action adventures,… G

genre directors \
0 Action & Adventure, Comedy, Drama, Science Fic… Chris Columbus

1 Comedy Nicole Holofcener
2 Comedy, Romance Blake Edwards
3 Classics, Drama Sidney Lumet
4 Action & Adventure, Drama, Kids & Family Richard Fleischer

writers cast \
0 Craig Titley Logan Lerman, Brandon T. Jackson, Alexandra Da…
1 Nicole Holofcener Catherine Keener, Amanda Peet, Oliver Platt, R…
2 Blake Edwards Dudley Moore, Bo Derek, Julie Andrews, Robert …
3 Reginald Rose Martin Balsam, John Fiedler, Lee J. Cobb, E.G…
4 Earl Felton James Mason, Kirk Douglas, Paul Lukas, Peter L…

in_theaters_date on_streaming_date runtime_in_minutes \

0 2010-02-12 2010-06-29 83.0
1 2010-04-30 2010-10-19 90.0
2 1979-10-05 1997-08-27 118.0
3 1957-04-13 2001-03-06 95.0
4 1954-01-01 2003-05-20 127.0

studio_name tomatometer_status tomatometer_rating \

0 20th Century Fox Rotten 49
1 Sony Pictures Classics Certified Fresh 86
2 Waner Bros. Fresh 68
3 Criterion Collection Certified Fresh 100
4 Disney Fresh 89

tomatometer_count audience_rating
0 144 53.0
1 140 64.0
2 22 53.0
3 51 97.0
4 27 74.0

[4]: # looking at the schema of the dataset

rt_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16638 entries, 0 to 16637
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 movie_title 16638 non-null object
1 movie_info 16614 non-null object
2 critics_consensus 8309 non-null object
3 rating 16638 non-null object
4 genre 16621 non-null object
5 directors 16524 non-null object
6 writers 15289 non-null object

7 cast 16354 non-null object
8 in_theaters_date 15823 non-null datetime64[ns]
9 on_streaming_date 16636 non-null datetime64[ns]
10 runtime_in_minutes 16483 non-null float64
11 studio_name 16222 non-null object
12 tomatometer_status 16638 non-null object
13 tomatometer_rating 16638 non-null int64
14 tomatometer_count 16638 non-null int64
15 audience_rating 16386 non-null float64
dtypes: datetime64[ns](2), float64(2), int64(2), object(10)
memory usage: 2.0+ MB

4 Handling Duplicates
4.1 Handling Duplicates in Rows
[5]: # checking for duplicates in the dataset
rt_df.duplicated().max()

[5]: True

There are duplicate rows present in the dataset

[6]: # Checking for the count of duplicate rows

rt_df.duplicated().sum()

[6]: 1

There is only 1 duplicated row

[7]: # viewing the duplicated rows

rt_df[rt_df.duplicated(keep=False)]

[7]: movie_title movie_info \

8495 King Charles III An adaptation of the Broadway drama about Prin…
8496 King Charles III An adaptation of the Broadway drama about Prin…

critics_consensus rating genre directors writers \

8495 NaN NR Drama Rupert Goold Mike Bartlett
8496 NaN NR Drama Rupert Goold Mike Bartlett

cast in_theaters_date \
8495 Oliver Chris, Richard Goulding, Charlotte Rile… 2017-05-14
8496 Oliver Chris, Richard Goulding, Charlotte Rile… 2017-05-14

on_streaming_date runtime_in_minutes studio_name tomatometer_status \

8495 2017-06-27 88.0 NaN Fresh
8496 2017-06-27 88.0 NaN Fresh

tomatometer_rating tomatometer_count audience_rating
8495 100 9 48.0
8496 100 9 48.0

[8]: # dropping the duplicated row

rt_df = rt_df.drop_duplicates()

[9]: # verifying for duplicates

rt_df.duplicated().max()

[9]: False

No duplicate rows are present in the dataset

5 Exploring the Dataset

[10]: # extracting numerical and categorical/Textual columns
txt_col = []
num_col = []
for col in rt_df.columns:
    if rt_df[col].dtype == 'O':
        txt_col.append(col)
    else:
        num_col.append(col)

[11]: print(f"Categorical/Textual columns: {txt_col}")

print(f"Numerical columns: {num_col}")

Categorical/Textual columns: ['movie_title', 'movie_info', 'critics_consensus',

'rating', 'genre', 'directors', 'writers', 'cast', 'studio_name',
'tomatometer_status']
Numerical columns: ['in_theaters_date', 'on_streaming_date',
'runtime_in_minutes', 'tomatometer_rating', 'tomatometer_count',
'audience_rating']

5.1 Exploring Numerical Columns

[12]: # visualizing correlation between numerical columns
plt.figure(figsize = (16, 10))
sns.heatmap(rt_df[[col for col in num_col]].corr(), annot = True, fmt = '.2f')
plt.show()

Observation
1. High correlation between tomatometer_rating & audience_rating
2. Negligible correlation between other features/columns with audience_rating
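
The heatmap values are Pearson correlation coefficients (the default of DataFrame.corr). As a rough illustration, the sketch below computes the coefficient by hand for the five tomatometer_rating / audience_rating pairs visible in the rt_df.head() output above. Five rows are far too few to reproduce the heatmap value, so treat this only as a demonstration of the formula; the helper name pearson is an assumption of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# The five rows shown by rt_df.head()
tomatometer_rating = [49, 86, 68, 100, 89]
audience_rating = [53.0, 64.0, 53.0, 97.0, 74.0]

r = pearson(tomatometer_rating, audience_rating)
print(f"Pearson r on the five head() rows: {r:.2f}")
```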

5.2 Exploring Categorical/Textual Columns

[13]: txt_status = {'Feature': [], 'Unique Values': []}
for col in txt_col:
    txt_status['Feature'].append(col)
    txt_status['Unique Values'].append(len(rt_df[col].dropna().unique()))
pd.DataFrame(txt_status)

[13]: Feature Unique Values

0 movie_title 16106
1 movie_info 16613
2 critics_consensus 8307
3 rating 8
4 genre 1080
5 directors 8314
6 writers 12121
7 cast 16326
8 studio_name 2886
9 tomatometer_status 3

Observation

Except for ‘rating’ and ‘tomatometer_status’, the remaining columns have too many unique values to be treated as categories, which suggests that we treat them as textual columns, while ‘rating’ and ‘tomatometer_status’ are treated as categorical columns.

[14]: # Visualizing categorical columns 'rating' & 'tomatometer_status'

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

sns.histplot(rt_df, x='rating', hue='rating', ax=ax1)

ax1.set_title('Distribution of Rating', fontsize=14)

sns.histplot(rt_df, x='tomatometer_status', hue='tomatometer_status', ax=ax2)

ax2.set_title('Distribution of Tomatometer Status', fontsize=14)

fig.suptitle('Distribution of Categorical Data', fontsize=16)

plt.tight_layout()
plt.show()

Observation
In the Distribution of Rating plot above, we can see that the values ‘PG-13)’ and ‘R)’ contain a stray closing parenthesis. Let's fix them.

[15]: # fixing typo in dataset

rt_df['rating'] = rt_df['rating'].replace({'PG-13)': 'PG-13', 'R)': 'R'})

[16]: # visualizing the result

plt.figure(figsize = (7,5))
sns.histplot(rt_df, x='rating', hue='rating')
plt.title('Distribution of Rating', fontsize=14)
plt.show()

The typos have been fixed.

6 Handling missing values

6.1 Handling missing values in Target variable
[17]: # checking for missing values in target variable 'audience_rating'
print(f"No of missing values in audience_rating: {rt_df['audience_rating'].isna().sum()}")

No of missing values in audience_rating: 252

Observation
There are 252 missing values in audience_rating. Since it is the target, filling these missing values would introduce bias into the model. Hence we will drop the rows with missing values in audience_rating.

[18]: # dropping rows with missing values in audience_rating

rt_df = rt_df.dropna(subset=['audience_rating'])

[19]: # verifying
print(f"No of missing values in audience_rating: {rt_df['audience_rating'].isna().sum()}")

No of missing values in audience_rating: 0

Missing values in audience_rating (the target) have been handled

6.2 Handling missing values in numerical columns

[20]: # total missing values in numerical columns
pd.concat([rt_df[num_col].count(),
           rt_df[num_col].isna().sum(),
           round((rt_df[num_col].isna().sum() / len(rt_df)) * 100, 2)],
          axis=1,
          keys=["Total Count of Values", "Total Missing Values", "Percent of Missing Values"])

[20]:                     Total Count of Values  Total Missing Values  Percent of Missing Values
      in_theaters_date                    15666                   719                       4.39
      on_streaming_date                   16384                     1                       0.01
      runtime_in_minutes                  16238                   147                       0.90
      tomatometer_rating                  16385                     0                       0.00
      tomatometer_count                   16385                     0                       0.00
      audience_rating                     16385                     0                       0.00

[21]: # visualizing missing values in the dataset with numerical columns

msno.matrix(rt_df[num_col], color=(0.4,0.2,0.5))

[21]: <Axes: >

[22]: missing_num_col = [col for col in num_col if rt_df[col].isna().sum() > 0]
missing_num_col

[22]: ['in_theaters_date', 'on_streaming_date', 'runtime_in_minutes']

[23]: # creating histograms for all the numerical columns with missing values
missing_num_col = [col for col in num_col if rt_df[col].isna().sum() > 0]

number_cols = len(missing_num_col)
cols = 2
rows = (number_cols // cols + 1)

fig, axis = plt.subplots(nrows=rows, ncols=cols, figsize=(12, 4 * rows))
axis = axis.flatten()

for i, col in enumerate(missing_num_col):
    sns.histplot(rt_df, x=col, kde=True, bins=20,
                 color=sns.color_palette('hls', len(missing_num_col))[i], ax=axis[i])

    axis[i].set_title(col)
    axis[i].set_xlabel(col)
    axis[i].set_ylabel('Count')

    # annotating mean, median and mode in the top-right corner of each plot
    axis[i].text(0.95, 0.95, f"Mean: {rt_df[col].mean()}", ha='right', va='top',
                 transform=axis[i].transAxes, fontsize=10, color='blue')
    axis[i].text(0.95, 0.85, f"Median: {rt_df[col].median()}", ha='right', va='top',
                 transform=axis[i].transAxes, fontsize=10, color='green')
    axis[i].text(0.95, 0.75, f"Mode: {rt_df[col].mode()[0]}", ha='right', va='top',
                 transform=axis[i].transAxes, fontsize=10, color='red')

    # marking mean, median and mode with dashed vertical lines
    axis[i].axvline(rt_df[col].mean(), color='blue', linestyle='dashed', linewidth=1)
    axis[i].axvline(rt_df[col].median(), color='green', linestyle='dashed', linewidth=1)
    axis[i].axvline(rt_df[col].mode()[0], color='red', linestyle='dashed', linewidth=1)

# hiding the unused subplots
for i in range(number_cols, len(axis)):
    axis[i].axis('off')

fig.suptitle('Distribution of Numerical columns', fontsize=16)
plt.tight_layout()
plt.show()

Observation
1. runtime_in_minutes has an almost symmetrical distribution, hence we use mean imputation to fill its missing values.
2. in_theaters_date and on_streaming_date have skewed distributions, hence we use median imputation to fill their missing values.
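
The reasoning behind choosing mean or median can be seen on a toy example. The sketch below uses Python's statistics module and two hypothetical columns (not taken from the dataset): for a symmetric column the mean and median fills agree, while for a skewed column a single large value drags the mean fill far away from the typical value.

```python
from statistics import mean, median


def impute(values, strategy):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns, for illustration only
symmetric = [88.0, 90.0, None, 92.0, 94.0]
skewed = [1.0, 2.0, None, 3.0, 100.0]

print("symmetric, mean fill:  ", impute(symmetric, "mean")[2])
print("symmetric, median fill:", impute(symmetric, "median")[2])
print("skewed, mean fill:     ", impute(skewed, "mean")[2])
print("skewed, median fill:   ", impute(skewed, "median")[2])
```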

[24]: # mean imputation

# filling missing values in 'runtime_in_minutes' using mean
rt_df['runtime_in_minutes'] = rt_df['runtime_in_minutes'].fillna(rt_df['runtime_in_minutes'].mean())

[25]: # median imputation

# filling missing values using median
for col in ['in_theaters_date', 'on_streaming_date']:
    rt_df[col] = rt_df[col].fillna(rt_df[col].median())

[26]: # verifying (total number of missing cells across the numeric columns)
print(f"Missing values in numeric columns: {rt_df[num_col].isna().sum().sum()}")

Missing values in numeric columns: 0

Missing values in numeric columns have been handled

6.3 Handling missing values in Categorical/Textual columns

[27]: # total missing values in Categorical/Textual columns
pd.concat([rt_df[txt_col].count(),
           rt_df[txt_col].isna().sum(),
           round((rt_df[txt_col].isna().sum() / len(rt_df)) * 100, 2)],
          axis=1,
          keys=["Total Count of Values", "Total Missing Values", "Percent of Missing Values"])

[27]:                     Total Count of Values  Total Missing Values  Percent of Missing Values
      movie_title                         16385                     0                       0.00
      movie_info                          16367                    18                       0.11
      critics_consensus                    8281                  8104                      49.46
      rating                              16385                     0                       0.00
      genre                               16368                    17                       0.10
      directors                           16281                   104                       0.63
      writers                             15108                  1277                       7.79
      cast                                16125                   260                       1.59
      studio_name                         16010                   375                       2.29
      tomatometer_status                  16385                     0                       0.00

[28]: # visualizing missing values in the dataset with Categorical/Textual columns
msno.matrix(rt_df[txt_col], color=(0.4,0.2,0.5))

[28]: <Axes: >

[29]: missing_txt_col = [col for col in txt_col if rt_df[col].isna().sum() > 0]

print(missing_txt_col)

['movie_info', 'critics_consensus', 'genre', 'directors', 'writers', 'cast',

'studio_name']
Observation
1. critics_consensus has nearly 50% missing values, but it contains information which might be relevant to the target (audience_rating). Hence, let's transform its values using sentiment analysis.
2. ‘movie_info’, ‘genre’, ‘directors’, ‘writers’, ‘cast’ and ‘studio_name’ have considerably fewer missing values, but mean, median or mode imputation cannot be used because each value is specific to its movie. Hence we will use placeholders here.

6.3.1 Transforming ‘critics_consensus’

[30]: # taking a copy of the current dataset before processing 'critics_consensus', as it will be needed later
# (.copy() creates an independent DataFrame; plain assignment would only alias rt_df)
critics_consensus_df = rt_df.copy()

[31]: # transforming 'critics_consensus' column

def classify_sentiment(text):
    # missing consensus text gets its own category
    if pd.isnull(text):
        return 'Unknown'
    polarity = TextBlob(str(text)).sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

rt_df['critics_sentiment'] = rt_df['critics_consensus'].apply(classify_sentiment)

rt_df['critics_sentiment'].value_counts()

[31]: critics_sentiment
Unknown 8104
Positive 5717
Negative 2122
Neutral 442
Name: count, dtype: int64

[32]: # Dropping 'critics_consensus' column

rt_df = rt_df.drop(columns=['critics_consensus'])

# Fill missing values in text columns with placeholders

rt_df['movie_info'] = rt_df['movie_info'].fillna('Unknown')
rt_df['genre'] = rt_df['genre'].fillna('Unknown Genre')
rt_df['directors'] = rt_df['directors'].fillna('Unknown Director')
rt_df['writers'] = rt_df['writers'].fillna('Unknown Writer')
rt_df['cast'] = rt_df['cast'].fillna('Unknown Cast')
rt_df['studio_name'] = rt_df['studio_name'].fillna('Unknown Studio')

[33]: # verifying (total number of missing cells across the remaining text columns)
txt_col.remove('critics_consensus')
print(f"Missing values in Categorical/Textual columns: {rt_df[txt_col].isna().sum().sum()}")

Missing values in Categorical/Textual columns: 0

6.4 Verifying whole dataset for missing values

[34]: print(f"Missing values: {rt_df.isna().sum().sum()}")

Missing values: 0

[35]: msno.matrix(rt_df, color=(0.4,0.2,0.5))

[35]: <Axes: >

All missing values have been handled

7 Feature Encoding
7.1 Encoding Date Columns
[36]: rt_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16385 entries, 0 to 16637
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 movie_title 16385 non-null object
1 movie_info 16385 non-null object
2 rating 16385 non-null object
3 genre 16385 non-null object
4 directors 16385 non-null object
5 writers 16385 non-null object
6 cast 16385 non-null object
7 in_theaters_date 16385 non-null datetime64[ns]
8 on_streaming_date 16385 non-null datetime64[ns]
9 runtime_in_minutes 16385 non-null float64
10 studio_name 16385 non-null object
11 tomatometer_status 16385 non-null object
12 tomatometer_rating 16385 non-null int64
13 tomatometer_count 16385 non-null int64
14 audience_rating 16385 non-null float64
15 critics_sentiment 16385 non-null object
dtypes: datetime64[ns](2), float64(2), int64(2), object(10)

memory usage: 2.1+ MB

[37]: # Encoding Dates into day, month and year

# Encoding in_theaters_date
rt_df['in_theaters_day'] = rt_df['in_theaters_date'].dt.day
rt_df['in_theaters_month'] = rt_df['in_theaters_date'].dt.month
rt_df['in_theaters_year'] = rt_df['in_theaters_date'].dt.year

# Encoding on_streaming_date
rt_df['on_streaming_day'] = rt_df['on_streaming_date'].dt.day
rt_df['on_streaming_month'] = rt_df['on_streaming_date'].dt.month
rt_df['on_streaming_year'] = rt_df['on_streaming_date'].dt.year

# dropping in_theaters_date & on_streaming_date
rt_df.drop(columns=['in_theaters_date', 'on_streaming_date'], inplace=True)

7.2 Encoding Category Columns

[38]: # OneHot encoding 'rating', 'tomatometer_status' & 'critics_sentiment'
rt_df = pd.get_dummies(rt_df, columns=['rating', 'tomatometer_status', 'critics_sentiment'])

# Replacing True & False with 1 & 0

rt_df = rt_df.replace({True: 1, False: 0})

[39]: # Let's check if we can encode the other textual columns;
# most of them seem to be separated by commas
for column in [col for col in rt_df.columns if rt_df[col].dtype == 'O']:
    unique_values = set(
        str(value).strip()
        for values in rt_df[column]
        for value in str(values).split(',')
    )
    print(f"Number of unique values in {column}: {len(unique_values)}")

Number of unique values in movie_title: 16068

Number of unique values in movie_info: 129530
Number of unique values in genre: 22
Number of unique values in directors: 8848
Number of unique values in writers: 14514
Number of unique values in cast: 197742
Number of unique values in studio_name: 2827
Observations & Actions
1. genre has 22 unique values; this needs to be encoded.
2. movie_title can be dropped as it does not have an impact on audience_rating.
3. movie_info can also be dropped, as transforming it does not make sense and most of its content is already captured in the genre column.
4. directors, writers, cast and studio_name: let's try to frequency encode these columns.

7.2.1 Multi Label Encoding

[40]: # Encoding 'genre' column
genre_lists = rt_df['genre'].apply(lambda x: [i.strip() for i in str(x).split(',')])

mlb = MultiLabelBinarizer()
genre_encoded = pd.DataFrame(mlb.fit_transform(genre_lists), columns=mlb.classes_, index=rt_df.index)

rt_df = pd.concat([rt_df, genre_encoded], axis=1).drop(columns=['genre'])

[41]: # dropping 'movie_title' and 'movie_info' columns

rt_df.drop(columns=['movie_title', 'movie_info'], inplace=True)

7.2.2 Frequency Encoding

[42]: # encoding directors, writers, cast & studio_name

def frequency_encode(column):
    # count how often each comma-separated name occurs across the whole column
    all_values = [value.strip() for item in column for value in str(item).split(',')]
    frequency = pd.Series(all_values).value_counts()
    # each row gets the sum of the counts of the names it lists
    return column.apply(lambda x: sum(frequency.get(value.strip()) for value in str(x).split(',')))

rt_df['directors_freq'] = frequency_encode(rt_df['directors'])
rt_df['writers_freq'] = frequency_encode(rt_df['writers'])
rt_df['cast_freq'] = frequency_encode(rt_df['cast'])
rt_df['studio_name_freq'] = frequency_encode(rt_df['studio_name'])

[43]: rt_df[['directors', 'directors_freq', 'writers', 'writers_freq', 'cast', 'cast_freq', 'studio_name', 'studio_name_freq']].head()

[43]: directors directors_freq writers writers_freq \

0 Chris Columbus 13 Craig Titley 2
1 Nicole Holofcener 5 Nicole Holofcener 6
2 Blake Edwards 27 Blake Edwards 21
3 Sidney Lumet 30 Reginald Rose 3
4 Richard Fleischer 17 Earl Felton 3

cast cast_freq \
0 Logan Lerman, Brandon T. Jackson, Alexandra Da… 719
1 Catherine Keener, Amanda Peet, Oliver Platt, R… 260
2 Dudley Moore, Bo Derek, Julie Andrews, Robert … 315
3 Martin Balsam, John Fiedler, Lee J. Cobb, E.G… 208
4 James Mason, Kirk Douglas, Paul Lukas, Peter L… 242

studio_name studio_name_freq
0 20th Century Fox 414
1 Sony Pictures Classics 259
2 Waner Bros. 1
3 Criterion Collection 110
4 Disney 26

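Observation
The columns directors, writers, cast and studio_name have been frequency encoded: each value, for example ‘Chris Columbus’, is replaced with the number of times it appears in the dataset. For cells that list several comma-separated names, the counts of the individual names are summed.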
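
A minimal plain-Python sketch of the same idea, using collections.Counter and three hypothetical cast strings (the names are made up for illustration). It mirrors the logic of frequency_encode: every comma-separated name is counted across the whole column, and each row receives the sum of the counts of its names, which is why the cast_freq values run into the hundreds.

```python
from collections import Counter


def frequency_encode_list(cells):
    """Encode each comma-separated cell as the summed column-wide counts of its names."""
    all_names = [name.strip() for cell in cells for name in cell.split(",")]
    counts = Counter(all_names)
    return [sum(counts[name.strip()] for name in cell.split(",")) for cell in cells]


# Hypothetical cast column, for illustration only
cast = ["Ann Lee, Bo Kim", "Ann Lee", "Bo Kim, Cy Diaz, Ann Lee"]

encoded = frequency_encode_list(cast)
print(encoded)  # Ann Lee appears 3 times, Bo Kim 2, Cy Diaz 1
```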

[44]: rt_df.loc[rt_df['directors'] == 'Unknown Director', ['directors', 'directors_freq']].sample(5)

[44]: directors directors_freq

12444 Unknown Director 104
8226 Unknown Director 104
3940 Unknown Director 104
11384 Unknown Director 104
5878 Unknown Director 104

Observation
The placeholder values which were used to fill the missing values in directors, writers, cast and studio_name have also been frequency encoded, which we do not want. Hence we will set the frequency for these placeholders to 0.

[45]: # replacing frequency of unknown values with 0
# (each condition checks the placeholder in its own column and resets the matching *_freq column)

rt_df.loc[rt_df['directors'] == 'Unknown Director', 'directors_freq'] = 0
rt_df.loc[rt_df['writers'] == 'Unknown Writer', 'writers_freq'] = 0
rt_df.loc[rt_df['cast'] == 'Unknown Cast', 'cast_freq'] = 0
rt_df.loc[rt_df['studio_name'] == 'Unknown Studio', 'studio_name_freq'] = 0

[46]: # dropping directors, writers, cast & studio_name
rt_df.drop(columns=['directors', 'writers', 'cast', 'studio_name'], inplace=True)

Data Cleaning and Processing Complete

[47]: # final dataset

print(f"The dataset has {rt_df.shape[0]} rows and {rt_df.shape[1]} Columns")

The dataset has 16385 rows and 49 Columns

8 Train Test Split
[48]: # splitting dataset to train and test datasets

X = rt_df.drop(columns = 'audience_rating')
y = rt_df['audience_rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Train dataset shape = {X_train.shape}")

print(f"Test dataset shape = {X_test.shape}")

Train dataset shape = (13108, 48)

Test dataset shape = (3277, 48)

9 Feature Scaling - Normalization (Min-Max Scaling)

[49]: scaler = MinMaxScaler()
scaled_X = scaler.fit_transform(X)

[50]: # scaled training data will be used only in the ML models that require it
scaled_X_train, scaled_X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.2, random_state=42)

print(f"Scaled train dataset shape = {scaled_X_train.shape}")

Scaled train dataset shape = (13108, 48)
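
Min-max scaling maps each column to the [0, 1] range via (x - min) / (max - min). The plain-Python sketch below applies that formula to the five runtime_in_minutes values shown in rt_df.head(), treating the first four as training rows and the last as a test row (an arbitrary split chosen for illustration). Here min and max are learned from the training rows only and then reused for the test row, a common variant that keeps test-set ranges out of preprocessing; the function names are assumptions of this sketch, not scikit-learn API.

```python
def fit_min_max(column):
    """Learn the scaling range from a column of training values."""
    return min(column), max(column)


def apply_min_max(column, lo, hi):
    """Rescale values with a previously learned range: (x - lo) / (hi - lo)."""
    return [(value - lo) / (hi - lo) for value in column]


# runtime_in_minutes from the rt_df.head() output; the train/test split is illustrative
train_runtime = [83.0, 90.0, 118.0, 95.0]
test_runtime = [127.0]

lo, hi = fit_min_max(train_runtime)
scaled_train = apply_min_max(train_runtime, lo, hi)
scaled_test = apply_min_max(test_runtime, lo, hi)

print(scaled_train)
print(scaled_test)
```

A test value outside the training range (127 minutes here) maps outside [0, 1]; MinMaxScaler behaves the same way unless its clip option is enabled.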

10 Linear Regression
[51]: # creating a variable to store the results of all the models
model_perf = []
def evaluate_model(model, pred):
rmse = round(root_mean_squared_error(y_test, pred), 2)
r2 = round(r2_score(y_test, pred), 2)
model_perf.append({'Model': model, 'RMSE': rmse, 'R Squared': r2})
print(f"RMSE of {model}: {rmse}")
print(f"R Squared of {model}: {r2}")

[52]: # initializing Linear Regression model
lreg = LinearRegression()

# training model with normalized dataset (since linear regression is sensitive to scale of values)
lreg.fit(scaled_X_train, y_train)

# predicting on scaled test dataset
pred_lreg = lreg.predict(scaled_X_test)

# evaluating model
evaluate_model('Linear Regression', pred_lreg)

RMSE of Linear Regression: 14.01

R Squared of Linear Regression: 0.53

[53]: # checking for overfitting
print(f"RMSE with train data: {root_mean_squared_error(y_train, lreg.predict(scaled_X_train)):.2f}")
print(f"R Squared with train data: {r2_score(y_train, lreg.predict(scaled_X_train)):.2f}")

RMSE with train data: 14.00

R Squared with train data: 0.53

[54]: # let's check the consistency of the model across different splits of the dataset
cv_scores = cross_val_score(lreg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print(f"Mean Cross-Validation RMSE: {(np.sqrt(-cv_scores)).mean():.2f}")

cv_r2 = cross_val_score(lreg, X_train, y_train, cv=5, scoring='r2')
print(f"Mean Cross-Validation R²: {cv_r2.mean():.2f}")

Mean Cross-Validation RMSE: 14.23

Mean Cross-Validation R²: 0.52
Observation
The RMSE and R Squared on the train dataset (14.00, 0.53) and the test dataset (14.01, 0.53) are comparable, which indicates that the model is not overfitting. From K-fold cross-validation (mean RMSE 14.23, mean R² 0.52), the model is consistent across data splits.
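
For a regressor, cross_val_score with cv=5 uses unshuffled K-fold splitting. The sketch below is a simplified plain-Python stand-in (k_fold_indices is a hypothetical helper, not the scikit-learn implementation) showing how the 13108 training rows would be divided into five validation folds, with each model fit on the remaining four.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # the first n_samples % k folds receive one extra row
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(13108, 5))
for fold_no, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_no}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows")
```

Every row is used for validation exactly once, so the mean of the five scores summarises how stable the model is across different subsets of the training data.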

11 Ridge and Lasso Regression

[55]: # Ridge Regression
alpha = [0.1, 1.0, 10.0, 100.0]
for i in alpha:
    Rreg = Ridge(alpha=i, random_state=42)
    Rreg.fit(scaled_X_train, y_train)
    print(f"alpha = {i}")
    print(f"RMSE with test data: {root_mean_squared_error(y_test, Rreg.predict(scaled_X_test)):.2f}")
    print(f"R Squared with test data: {r2_score(y_test, Rreg.predict(scaled_X_test)):.2f}\n")

alpha = 0.1
RMSE with test data: 14.00
R Squared with test data: 0.53

alpha = 1.0
RMSE with test data: 14.03
R Squared with test data: 0.53

alpha = 10.0
RMSE with test data: 14.09
R Squared with test data: 0.52

alpha = 100.0
RMSE with test data: 14.23
R Squared with test data: 0.51

[56]: # Lasso Regression
alpha = [0.1, 1.0, 10.0, 100.0]
for i in alpha:
    Lareg = Lasso(alpha=i, random_state=42)
    Lareg.fit(scaled_X_train, y_train)
    print(f"alpha = {i}")
    print(f"RMSE with test data: {root_mean_squared_error(y_test, Lareg.predict(scaled_X_test)):.2f}")
    print(f"R Squared with test data: {r2_score(y_test, Lareg.predict(scaled_X_test)):.2f}\n")

alpha = 0.1
RMSE with test data: 14.32
R Squared with test data: 0.51

alpha = 1.0
RMSE with test data: 15.62
R Squared with test data: 0.41

alpha = 10.0
RMSE with test data: 20.36
R Squared with test data: -0.00

alpha = 100.0
RMSE with test data: 20.36
R Squared with test data: -0.00
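
At alpha = 10 and alpha = 100 the Lasso results are identical (RMSE 20.36, R Squared of about 0), which is what one would expect if the L1 penalty has shrunk every coefficient to exactly zero so that the model predicts a constant. The notebook does not print the coefficients, so this is an interpretation rather than a confirmed finding. The sketch below illustrates the mechanism for a single centered feature with hypothetical data: the Lasso solution is a soft-thresholded value that becomes exactly zero once alpha is large enough, whereas the Ridge solution only shrinks toward zero. The closed forms assume scikit-learn's objective scaling (Lasso divides the squared error by 2n, Ridge does not) and no intercept.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha, returning exactly 0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coef_1d(x, y, alpha):
    """Minimiser of (1 / 2n) * sum((y - w*x)^2) + alpha * |w| for one feature."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n
    z = sum(a * a for a in x) / n
    return soft_threshold(rho, alpha) / z


def ridge_coef_1d(x, y, alpha):
    """Minimiser of sum((y - w*x)^2) + alpha * w^2 for one feature."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


# Hypothetical centered data with true slope 2, for illustration only
x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]

for alpha in [0.1, 1.0, 10.0, 100.0]:
    print(f"alpha = {alpha}: lasso = {lasso_coef_1d(x, y, alpha):.3f}, "
          f"ridge = {ridge_coef_1d(x, y, alpha):.3f}")
```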

[57]: # best alpha value
Rreg = Ridge(alpha=0.1, random_state=42)
Rreg.fit(scaled_X_train, y_train)
evaluate_model('Ridge Linear Regression', Rreg.predict(scaled_X_test))

Lareg = Lasso(alpha=0.1, random_state=42)
Lareg.fit(scaled_X_train, y_train)
evaluate_model('Lasso Linear Regression', Lareg.predict(scaled_X_test))

RMSE of Ridge Linear Regression: 14.0

R Squared of Ridge Linear Regression: 0.53
RMSE of Lasso Linear Regression: 14.32
R Squared of Lasso Linear Regression: 0.51

12 Random Forest Regression

[58]: # Initialize Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=42)

# Training the model

rf_reg.fit(X_train, y_train)

# Predicting on the test dataset

pred_rf_reg = rf_reg.predict(X_test)

# Evaluating the model

print(f"RMSE: {root_mean_squared_error(y_test, pred_rf_reg):.2f}")
print(f"R Squared: {r2_score(y_test, pred_rf_reg):.2f}")

RMSE: 13.59
R Squared: 0.55

[59]: # hyperparameter tuning
result_rf_reg = []
for criterion in ['squared_error', 'friedman_mse', 'poisson']:
    for n_estimators in [100, 200, 300]:
        for max_depth in [10, 15, None]:
            rf_reg = RandomForestRegressor(n_estimators=n_estimators, criterion=criterion,
                                           max_depth=max_depth, random_state=42)
            rf_reg.fit(X_train, y_train)
            pred_rf_reg = rf_reg.predict(X_test)

            result_rf_reg.append(
                {
                    'No of Estimators': n_estimators,
                    'Criterion': criterion,
                    'Max Depth': max_depth,
                    'RMSE': round(root_mean_squared_error(y_test, pred_rf_reg), 2),
                    'R Squared': r2_score(y_test, pred_rf_reg)
                }
            )

print("Training Done!")

Training Done!

[60]: # best RMSE values

result_rf_reg = pd.DataFrame(result_rf_reg)
result_rf_reg.loc[result_rf_reg['RMSE'].idxmin()]

[60]: No of Estimators 300

Criterion squared_error
Max Depth NaN
RMSE 13.51
R Squared 0.559636
Name: 8, dtype: object

[61]: # best R Squared values

result_rf_reg.loc[result_rf_reg['R Squared'].idxmax()]

[61]: No of Estimators 300

Criterion squared_error
Max Depth NaN
RMSE 13.51
R Squared 0.559636
Name: 8, dtype: object

Observation
From the above we can observe that the same parameters give the best RMSE and the best R Squared (n_estimators=300, criterion=‘squared_error’, max_depth=None).

[62]: # Random Forest Regressor with optimal parameters

rf_reg = RandomForestRegressor(n_estimators=300, criterion='squared_error', max_depth=None, random_state=42)

rf_reg.fit(X_train, y_train)

evaluate_model('Random Forest Regression', rf_reg.predict(X_test))

RMSE of Random Forest Regression: 13.51

R Squared of Random Forest Regression: 0.56

13 XGBoost Regressor
[63]: # Initialize XGBoost Regressor
xgb_reg = XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42)

# Training the model

xgb_reg.fit(X_train, y_train)

# Predict on test data

y_pred = xgb_reg.predict(X_test)

# Evaluate the model

print(f"RMSE: {root_mean_squared_error(y_test, y_pred):.2f}")
print(f"R Squared: {r2_score(y_test, y_pred):.2f}")

RMSE: 13.25
R Squared: 0.58

[64]: # hyperparameter tuning
result_xgb_reg = []
for n_estimators in [100, 200, 300]:
    for learning_rate in [0.01, 0.05, 0.1, 0.2]:
        for max_depth in [3, 5, 7, 9]:
            xgb_reg = XGBRegressor(n_estimators=n_estimators, learning_rate=learning_rate,
                                   max_depth=max_depth, random_state=42)
            xgb_reg.fit(X_train, y_train)
            pred_xgb_reg = xgb_reg.predict(X_test)

            result_xgb_reg.append(
                {
                    'No of Estimators': n_estimators,
                    'Learning Rate': learning_rate,
                    'Max Depth': max_depth,
                    'RMSE': round(root_mean_squared_error(y_test, pred_xgb_reg), 2),
                    'R Squared': r2_score(y_test, pred_xgb_reg)
                }
            )

print("Training Done!")

Training Done!

[65]: # best RMSE values

result_xgb_reg = pd.DataFrame(result_xgb_reg)
result_xgb_reg.loc[result_xgb_reg['RMSE'].idxmin()]

[65]: No of Estimators 300.0000
Learning Rate 0.0500
Max Depth 5.0000
RMSE 13.2000
R Squared 0.5795
Name: 37, dtype: float64

[66]: # best R Squared values

result_xgb_reg.loc[result_xgb_reg['R Squared'].idxmax()]

[66]: No of Estimators 300.0000

Learning Rate 0.0500
Max Depth 5.0000
RMSE 13.2000
R Squared 0.5795
Name: 37, dtype: float64
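
The row label 37 in the two outputs above can be mapped back to a parameter combination, because the results were appended in nested-loop order. The sketch below reproduces that order with itertools.product (used here in place of the notebook's explicit loops) and looks up position 37.

```python
from itertools import product

# Same grids, in the same nesting order, as the tuning loop above
n_estimators_grid = [100, 200, 300]
learning_rate_grid = [0.01, 0.05, 0.1, 0.2]
max_depth_grid = [3, 5, 7, 9]

# product() varies the last grid fastest, exactly like the innermost loop
grid = list(product(n_estimators_grid, learning_rate_grid, max_depth_grid))

n_estimators, learning_rate, max_depth = grid[37]
print(f"row 37 -> n_estimators={n_estimators}, learning_rate={learning_rate}, max_depth={max_depth}")
```

Row 37 corresponds to n_estimators=300, learning_rate=0.05, max_depth=5, in agreement with the values printed in the outputs.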

Observation
From the above we can observe that the same parameters give the best RMSE and the best R Squared (n_estimators=300, learning_rate=0.05, max_depth=5).

[67]: # XGBoost Regressor refit with n_estimators=200, learning_rate=0.1, max_depth=5
# (these differ from the best grid-search row above, which has RMSE 13.20)
xgb_reg = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=5, random_state=42)

xgb_reg.fit(X_train, y_train)

evaluate_model('XGBoost Regressor', xgb_reg.predict(X_test))

RMSE of XGBoost Regressor: 13.22

R Squared of XGBoost Regressor: 0.58

14 CatBoost Regressor
[68]: # Initialize CatBoost Regressor
catboost_reg = CatBoostRegressor(iterations=100, depth=6, learning_rate=0.1, verbose=0, random_state=42)

# Training the model

catboost_reg.fit(X_train, y_train)

# predicting on test data

y_pred = catboost_reg.predict(X_test)

# Evaluate the model

print(f"RMSE: {root_mean_squared_error(y_test, y_pred):.2f}")
print(f"R²: {r2_score(y_test, y_pred):.2f}")

RMSE: 13.27
R²: 0.58

[69]: # hyperparameter tuning
result_catboost_reg = []
for iterations in [100, 200, 300]:
    for learning_rate in [0.01, 0.05, 0.1, 0.2]:
        for depth in [4, 6, 8, 10]:
            catboost_reg = CatBoostRegressor(iterations=iterations, learning_rate=learning_rate,
                                             depth=depth, verbose=0, random_state=42)
            catboost_reg.fit(X_train, y_train)
            pred_catboost_reg = catboost_reg.predict(X_test)

            result_catboost_reg.append(
                {
                    'Iterations': iterations,
                    'Learning Rate': learning_rate,
                    'Depth': depth,
                    'RMSE': round(root_mean_squared_error(y_test, pred_catboost_reg), 2),
                    'R Squared': r2_score(y_test, pred_catboost_reg)
                }
            )

print("Training Done!")

Training Done!

[70]: # best RMSE values

result_catboost_reg = pd.DataFrame(result_catboost_reg)
result_catboost_reg.loc[result_catboost_reg['RMSE'].idxmin()]

[70]: Iterations 200.000000
Learning Rate 0.100000
Depth 8.000000
RMSE 13.110000
R Squared 0.585722
Name: 26, dtype: float64
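
The row label 26 in the output above can be traced back to a parameter combination, because the tuning loop appends one row per combination in a fixed order. A minimal sketch, assuming the three grids and the loop nesting are exactly those of the tuning cell (the variable names here are illustrative, not from the notebook):

```python
from itertools import product

# Same grids as the tuning cell, listed from outermost to innermost loop.
iterations_grid = [100, 200, 300]
learning_rate_grid = [0.01, 0.05, 0.1, 0.2]
depth_grid = [4, 6, 8, 10]

# product() varies its last argument fastest, like the innermost for-loop,
# so position k in this list matches row k of the results DataFrame.
grid = list(product(iterations_grid, learning_rate_grid, depth_grid))

best_row = 26  # the "Name: 26" label reported by idxmin()
print(len(grid))        # 48 combinations in total
print(grid[best_row])   # (200, 0.1, 8)
```

This agrees with the Iterations, Learning Rate and Depth values printed for row 26.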

[71]: # best R Squared values

result_catboost_reg.loc[result_catboost_reg['R Squared'].idxmax()]

[71]: Iterations 200.000000
Learning Rate 0.100000
Depth 8.000000
RMSE 13.110000
R Squared 0.585722
Name: 26, dtype: float64

Observation

From the above, the same parameter set gives both the best RMSE and the best R Squared
(iterations=200, learning rate=0.1, depth=8).
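
One detail worth keeping in mind when reading the two lookups: the tuning loop stores RMSE rounded to two decimals but stores R Squared unrounded, and `idxmin()` returns the first row when several rows tie. The sketch below illustrates the effect in plain Python. The scores are hypothetical and are not taken from the notebook, and `first_argmin` / `first_argmax` are illustrative helpers that mimic the first-occurrence behaviour.

```python
# Hypothetical scores for three parameter sets (not from the notebook).
raw_rmse = [13.114, 13.106, 13.109]
r_squared = [0.58560, 0.58572, 0.58566]

# Rounding to two decimals, as the tuning loop does, makes all three tie.
rounded_rmse = [round(v, 2) for v in raw_rmse]

def first_argmin(values):
    """Index of the first minimum, like Series.idxmin() on a default index."""
    return min(range(len(values)), key=values.__getitem__)

def first_argmax(values):
    """Index of the first maximum, like Series.idxmax() on a default index."""
    return max(range(len(values)), key=values.__getitem__)

print(rounded_rmse)                # [13.11, 13.11, 13.11]
print(first_argmin(rounded_rmse))  # 0 (first of the tied rows)
print(first_argmin(raw_rmse))      # 1
print(first_argmax(r_squared))     # 1
```

In the run above both lookups returned row 26, so no such tie affected the result.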

[72]: # CatBoost Regressor with optimal parameters

catboost_reg = CatBoostRegressor(iterations=200, depth=8, learning_rate=0.1,
                                 verbose=0, random_state=42)

catboost_reg.fit(X_train, y_train)

evaluate_model('CatBoost Regressor', catboost_reg.predict(X_test))

RMSE of CatBoost Regressor: 13.11

R Squared of CatBoost Regressor: 0.59

15 Feedforward Neural Network

[73]: model = Sequential([
Input(shape=(scaled_X_train.shape[1],)),
Dense(64, activation='relu'),
Dense(32, activation='relu'),
Dense(1)
])

model.summary()

Model: "sequential"

Layer (type)       Output Shape    Param #
dense (Dense)      (None, 64)      3,136
dense_1 (Dense)    (None, 32)      2,080
dense_2 (Dense)    (None, 1)       33

Total params: 5,249 (20.50 KB)
Trainable params: 5,249 (20.50 KB)
Non-trainable params: 0 (0.00 B)
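
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. A small sketch (`dense_params` is an illustrative helper, not part of Keras), which also recovers the number of input features implied by the first layer:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units

# 3,136 parameters in a 64-unit first layer implies (n_features + 1) * 64 = 3136.
n_features = 3136 // 64 - 1

layers = [(n_features, 64), (64, 32), (32, 1)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layers]

print(n_features)   # 48
print(counts)       # [3136, 2080, 33]
print(sum(counts))  # 5249
```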

[74]: # Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model

model.fit(scaled_X_train, y_train, epochs=30, verbose=1, validation_split=0.2)

# Predict on test data

y_pred = model.predict(scaled_X_test)

# Evaluate the model

evaluate_model('Feedforward Neural Network', y_pred)

Epoch 1/30
328/328 - 1s 2ms/step - loss: 2286.9290 - val_loss: 250.0674
Epoch 2/30
328/328 - 1s 2ms/step - loss: 246.5836 - val_loss: 227.1913
Epoch 3/30
328/328 - 1s 2ms/step - loss: 230.4257 - val_loss: 220.9194
Epoch 4/30
328/328 - 1s 2ms/step - loss: 223.1775 - val_loss: 216.1244
Epoch 5/30
328/328 - 1s 2ms/step - loss: 217.4891 - val_loss: 212.0494
Epoch 6/30
328/328 - 1s 2ms/step - loss: 205.9429 - val_loss: 210.7139
Epoch 7/30
328/328 - 1s 2ms/step - loss: 209.4912 - val_loss: 205.6074
Epoch 8/30
328/328 - 1s 2ms/step - loss: 200.7118 - val_loss: 203.2866
Epoch 9/30
328/328 - 1s 2ms/step - loss: 201.7865 - val_loss: 202.4568
Epoch 10/30
328/328 - 1s 2ms/step - loss: 199.6100 - val_loss: 200.0547
Epoch 11/30
328/328 - 1s 2ms/step - loss: 192.4956 - val_loss: 199.5509
Epoch 12/30
328/328 - 1s 2ms/step - loss: 196.8385 - val_loss: 202.0047
Epoch 13/30
328/328 - 1s 2ms/step - loss: 190.3293 - val_loss: 198.7908
Epoch 14/30
328/328 - 1s 2ms/step - loss: 192.3974 - val_loss: 197.0974
Epoch 15/30
328/328 - 1s 2ms/step - loss: 188.1279 - val_loss: 196.5793
Epoch 16/30
328/328 - 1s 2ms/step - loss: 192.7969 - val_loss: 204.0867
Epoch 17/30
328/328 - 1s 2ms/step - loss: 189.1077 - val_loss: 200.2438
Epoch 18/30
328/328 - 1s 2ms/step - loss: 187.1580 - val_loss: 198.2013
Epoch 19/30
328/328 - 1s 2ms/step - loss: 183.3916 - val_loss: 196.3865
Epoch 20/30
328/328 - 1s 2ms/step - loss: 180.5508 - val_loss: 194.4440
Epoch 21/30
328/328 - 1s 2ms/step - loss: 185.5836 - val_loss: 196.2199
Epoch 22/30
328/328 - 1s 2ms/step - loss: 180.4846 - val_loss: 193.9772
Epoch 23/30
328/328 - 1s 2ms/step - loss: 185.6906 - val_loss: 194.2294
Epoch 24/30
328/328 - 1s 2ms/step - loss: 184.7639 - val_loss: 197.1824
Epoch 25/30
328/328 - 1s 2ms/step - loss: 180.2869 - val_loss: 193.0731
Epoch 26/30
328/328 - 1s 2ms/step - loss: 181.7159 - val_loss: 192.7819
Epoch 27/30
328/328 - 1s 2ms/step - loss: 177.1915 - val_loss: 192.9699
Epoch 28/30
328/328 - 1s 2ms/step - loss: 180.7197 - val_loss: 192.7080
Epoch 29/30
328/328 - 1s 2ms/step - loss: 176.0547 - val_loss: 192.8633
Epoch 30/30
328/328 - 1s 2ms/step - loss: 178.3805 - val_loss: 192.9083
103/103 - 0s 1ms/step
RMSE of Feedforward Neural Network: 13.75
R Squared of Feedforward Neural Network: 0.54

16 Multi-Input Neural Network

[75]: critics_consensus_df['critics_consensus'] = (
    critics_consensus_df['critics_consensus'].fillna('Unknown')
)

# Tokenize the text

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(critics_consensus_df['critics_consensus'])
sequences = tokenizer.texts_to_sequences(critics_consensus_df['critics_consensus'])

[76]: # padding the sequences

padded_sequences = pad_sequences(sequences, maxlen=88, padding='post')
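
To make the two preprocessing steps above concrete, here is a simplified pure-Python imitation of what tokenizing and post-padding do. It is a sketch under assumptions: the texts are hypothetical, punctuation filtering and the `num_words` cap are left out, and `to_sequence` / `pad_post` are illustrative helpers, not Keras functions. Index 0 is kept free for padding, and sequences longer than `maxlen` lose their leading tokens, which matches the default truncation side of `pad_sequences`.

```python
from collections import Counter

# Hypothetical consensus texts (not from the dataset).
texts = ["a fun and clever film", "a dull film", "Unknown"]

# Rank words by frequency; the most frequent word gets index 1.
counts = Counter(word for text in texts for word in text.lower().split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text):
    """Replace each known word by its integer index."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad_post(seq, maxlen):
    """Keep the last maxlen tokens, then pad with zeros at the end."""
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

padded = [pad_post(to_sequence(t), maxlen=4) for t in texts]
print(padded)  # [[3, 4, 5, 2], [1, 6, 2, 0], [7, 0, 0, 0]]
```

Every row ends up with the same length, which is what the `Input(shape=(88,))` text branch later relies on.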

[77]: # creating the feature datasets

X_text = np.array(padded_sequences)
# drop the sentiment dummies; the raw consensus text is fed to the model instead
X_other = X.drop(columns=['critics_sentiment_Negative', 'critics_sentiment_Neutral',
                          'critics_sentiment_Positive', 'critics_sentiment_Unknown']).values

y = rt_df['audience_rating'].values

X_combined = [X_text, X_other]

[78]: # normalizing (min-max scaling)

scaler = MinMaxScaler()
X_other_normalized = scaler.fit_transform(X_other)
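
Min-max scaling maps each column to the range [0, 1] with x' = (x - min) / (max - min). A minimal sketch of that formula for a single column, with hypothetical values; returning 0.0 for a constant column is a choice made for this sketch:

```python
def min_max_scale(column):
    """Rescale one column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

runtime_minutes = [90, 120, 150, 105]  # hypothetical feature values
print(min_max_scale(runtime_minutes))  # [0.0, 0.5, 1.0, 0.25]
```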

[79]: # splitting dataset to train and test datasets

X_train_text, X_test_text, X_train_other, X_test_other, y_train, y_test = train_test_split(
    X_text, X_other_normalized, y, test_size=0.2, random_state=42)
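
Passing several arrays to `train_test_split` returns a train/test pair for each of them, all cut with the same row permutation, so the text, the numeric features and the target of one movie stay together. The idea can be sketched in plain Python as follows. `aligned_split` is an illustrative helper and does not reproduce scikit-learn's exact shuffling, and the toy data is hypothetical.

```python
import math
import random

def aligned_split(*arrays, test_size=0.2, seed=42):
    """Split equal-length sequences using one shared shuffle of row indices."""
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "all inputs need the same length"
    order = list(range(n))
    random.Random(seed).shuffle(order)  # one permutation for every array
    n_test = math.ceil(n * test_size)
    test_idx, train_idx = order[:n_test], order[n_test:]
    result = []
    for a in arrays:
        result.append([a[i] for i in train_idx])
        result.append([a[i] for i in test_idx])
    return result

# Hypothetical toy data: row i carries the value i in all three inputs.
text = [f"review {i}" for i in range(10)]
nums = [[i, i * 2] for i in range(10)]
target = [i * 10 for i in range(10)]

tr_text, te_text, tr_nums, te_nums, tr_y, te_y = aligned_split(text, nums, target)
print(len(tr_y), len(te_y))  # 8 2
```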

[80]: # textual input branch

text_input = Input(shape=(88,), name="text_input")
embedding = Embedding(input_dim=10000, output_dim=128,
                      input_length=88)(text_input)

lstm_out = LSTM(64)(embedding)

# numerical input branch
num_input = Input(shape=(X_other.shape[1],), name="num_input")
dense_num = Dense(64, activation='relu')(num_input)

# combining text and numerical features

merged = Concatenate()([lstm_out, dense_num])
dense = Dense(128, activation='relu')(merged)
output = Dense(1, activation='linear')(dense)

# defining the model

model = Model(inputs=[text_input, num_input], outputs=output)
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

model.summary()

Model: "functional_1"

Layer (type)                 Output Shape       Param #      Connected to
text_input (InputLayer)      (None, 88)         0            -
embedding (Embedding)        (None, 88, 128)    1,280,000    text_input[0][0]
num_input (InputLayer)       (None, 44)         0            -
lstm (LSTM)                  (None, 64)         49,408       embedding[0][0]
dense_3 (Dense)              (None, 64)         2,880        num_input[0][0]
concatenate (Concatenate)    (None, 128)        0            lstm[0][0], dense_3[0][0]
dense_4 (Dense)              (None, 128)        16,512       concatenate[0][0]
dense_5 (Dense)              (None, 1)          129          dense_4[0][0]

Total params: 1,348,929 (5.15 MB)
Trainable params: 1,348,929 (5.15 MB)
Non-trainable params: 0 (0.00 B)
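
Almost all of the 1.35 million parameters sit in the embedding table. The counts in the summary can be reproduced with the standard formulas: an embedding holds one vector per vocabulary entry, and an LSTM has four gates, each with input weights, recurrent weights and a bias. A short sketch with illustrative helper names:

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (input_dim + units + 1) * units

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return (n_inputs + 1) * n_units

counts = {
    "embedding": embedding_params(10000, 128),
    "lstm": lstm_params(128, 64),
    "dense_3": dense_params(44, 64),        # numerical branch, 44 features
    "dense_4": dense_params(64 + 64, 128),  # after concatenating both branches
    "dense_5": dense_params(128, 1),
}
print(counts)
print(sum(counts.values()))  # 1348929
```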

[81]: # training the model

history = model.fit(
    [X_train_text, X_train_other], y_train,
    validation_data=([X_test_text, X_test_other], y_test),
    epochs=10, batch_size=32
)

Epoch 1/10
410/410 - 15s 32ms/step - loss: 1140.8567 - mae: 25.6652 - val_loss: 221.6878 - val_mae: 11.9149
Epoch 2/10
410/410 - 13s 32ms/step - loss: 217.3896 - mae: 11.8196 - val_loss: 212.7557 - val_mae: 11.6834
Epoch 3/10
410/410 - 13s 31ms/step - loss: 215.1940 - mae: 11.7355 - val_loss: 206.2878 - val_mae: 11.4945
Epoch 4/10
410/410 - 12s 30ms/step - loss: 203.9770 - mae: 11.3881 - val_loss: 203.9777 - val_mae: 11.4025
Epoch 5/10
410/410 - 13s 30ms/step - loss: 202.5593 - mae: 11.2855 - val_loss: 197.7494 - val_mae: 11.1343
Epoch 6/10
410/410 - 13s 31ms/step - loss: 195.9349 - mae: 11.0397 - val_loss: 202.5623 - val_mae: 11.2185
Epoch 7/10
410/410 - 13s 31ms/step - loss: 194.5986 - mae: 11.0365 - val_loss: 194.2616 - val_mae: 11.0375
Epoch 8/10
410/410 - 13s 31ms/step - loss: 191.5667 - mae: 10.9208 - val_loss: 193.5592 - val_mae: 11.0496
Epoch 9/10
410/410 - 12s 30ms/step - loss: 194.7535 - mae: 10.9932 - val_loss: 192.5889 - val_mae: 10.9288
Epoch 10/10
410/410 - 12s 30ms/step - loss: 189.6154 - mae: 10.8478 - val_loss: 192.2870 - val_mae: 10.9272

[82]: # predicting on test dataset
y_pred = model.predict([X_test_text, X_test_other])

103/103 - 1s 8ms/step

[83]: # Evaluate the model

evaluate_model('Multi-Input Neural Network', y_pred)

RMSE of Multi-Input Neural Network: 13.87

R Squared of Multi-Input Neural Network: 0.54
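
Because this model was trained with the test split as `validation_data` and mean squared error as the loss, the final `val_loss` should be the square of the RMSE reported here. A quick arithmetic check using the value logged at epoch 10/10:

```python
import math

final_val_loss = 192.2870  # val_loss logged at epoch 10/10 (mean squared error)
rmse_from_loss = math.sqrt(final_val_loss)

print(round(rmse_from_loss, 2))  # 13.87
```

The two figures agree, which suggests that `evaluate_model` and the training log are measuring the same predictions.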

17 Model Comparison
[84]: pd.DataFrame(model_perf)

[84]:    Model                         RMSE    R Squared
      0  Linear Regression            14.01    0.53
      1  Ridge Linear Regression      14.00    0.53
      2  Lasso Linear Regression      14.32    0.51
      3  Random Forest Regression     13.51    0.56
      4  XGBoost Regressor            13.22    0.58
      5  CatBoost Regressor           13.11    0.59
      6  Feedforward Neural Network   13.75    0.54
      7  Multi-Input Neural Network   13.87    0.54

[85]: # Visualizing the results

model_perf = pd.DataFrame(model_perf)

fig, ax1 = plt.subplots(figsize=(12, 8))

# plot RMSE
sns.lineplot(data=model_perf, x="Model", y="RMSE", marker="o", color="blue",
             ax=ax1, label="RMSE")

ax1.set_ylabel("Root Mean Squared Error", fontsize=12)

ax1.set_xlabel("Model", fontsize=12)
ax1.legend(loc="upper left")
ax1.set_ylim(0, max(model_perf['RMSE']) + 5)

# highlighting the model with minimum RMSE

min_rmse_idx = model_perf['RMSE'].idxmin()
min_rmse_value = model_perf.loc[min_rmse_idx, 'RMSE']
min_rmse_model = model_perf.loc[min_rmse_idx, 'Model']
ax1.scatter(x=[min_rmse_model], y=[min_rmse_value], color="red", s=100,
            label="Min RMSE")

# plot R²
ax2 = ax1.twinx()

sns.lineplot(data=model_perf, x="Model", y="R Squared", marker="o",
             color="green", ax=ax2, label="R²")

ax2.set_ylabel("R Squared", fontsize=12)

ax2.legend(loc="upper right")
ax2.set_ylim(0, 1)

# highlighting the model with maximum R²

max_r2_idx = model_perf['R Squared'].idxmax()
max_r2_value = model_perf.loc[max_r2_idx, 'R Squared']
max_r2_model = model_perf.loc[max_r2_idx, 'Model']
ax2.scatter(x=[max_r2_model], y=[max_r2_value], color="red", s=100,
            label="Max R²")

# Rotate x-axis values explicitly

ax1.set_xticklabels(model_perf["Model"], rotation=45, ha='right')

# Add title and adjust layout

fig.suptitle('Model Performance', fontsize=16)
fig.tight_layout()
plt.show()

18 Conclusion
[86]: # best performing model
model_perf.loc[model_perf['Model'] == 'CatBoost Regressor']

[86]:    Model                 RMSE    R Squared
      5  CatBoost Regressor   13.11    0.59

The CatBoost Regressor has the highest performance among the models compared, with a Root
Mean Squared Error of 13.11 and an R Squared of 0.59.
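
For reference, the two metrics used throughout can be computed by hand: RMSE is the square root of the mean squared error, and R Squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal sketch with hypothetical ratings (the function names are illustrative, not the scikit-learn ones):

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical audience ratings and predictions (not from the dataset).
y_true = [60, 80, 40, 90]
y_pred = [65, 75, 50, 85]

print(round(rmse(y_true, y_pred), 2))       # 6.61
print(round(r_squared(y_true, y_pred), 2))  # 0.88
```

RMSE is expressed in the same units as the audience rating, so 13.11 means predictions are off by roughly 13 rating points in the root-mean-square sense.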

[87]: # Plotting Original vs Predicted values (CatBoost Regressor)

df = pd.DataFrame({
'Index': np.arange(len(y_test)),
'Original': y_test,
'Predicted': catboost_reg.predict(X_test)
})

df['Bucket'] = (df['Index'] // 50) * 50

bucketed_df = df.groupby('Bucket').agg({'Original': 'mean', 'Predicted': 'mean'}).reset_index()

plt.figure(figsize=(10, 6))
sns.lineplot(data=bucketed_df, x='Bucket', y='Original', label='Original',
             color='blue', linewidth=2)

sns.lineplot(data=bucketed_df, x='Bucket', y='Predicted', label='Predicted',
             color='green', linestyle='--', linewidth=2)

plt.xlabel('Index')
plt.ylabel('Audience Rating')
plt.title('Original Audience Rating vs Predicted Audience Rating')
plt.legend()

plt.show()
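
The plot above averages the test rows in groups of 50 so that the two lines stay readable. The bucketing step, `(Index // 50) * 50` followed by a group mean, can be sketched in plain Python. `bucket_means` is an illustrative helper and the input is a hypothetical stand-in for the ratings.

```python
from collections import defaultdict

def bucket_means(values, size=50):
    """Average consecutive runs of `size` values, keyed by the run's first index."""
    buckets = defaultdict(list)
    for index, value in enumerate(values):
        buckets[(index // size) * size].append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}

values = list(range(120))    # hypothetical stand-in for 120 ratings
print(bucket_means(values))  # {0: 24.5, 50: 74.5, 100: 109.5}
```

The last bucket may hold fewer than 50 rows, as in this example, so its mean is based on a smaller sample.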
