Rotten Tomatoes Audience Rating Prediction
0.1 Author
James Jeberson M - (for feedback reach out to [email protected])
1 Introduction
1.1 Objective:
1. Preprocess and transform data, including handling missing values and encoding features.
2. Build and evaluate multiple regression models to predict audience ratings.
• Linear Regression (Ridge & Lasso)
• XGBoost Regressor
• CatBoost Regressor
• Neural Networks
3. Compare model performance using metrics like RMSE and R² to identify the best model.
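The two comparison metrics named in the objective can be computed by hand. The sketch below is a minimal NumPy illustration; the function names and the four sample values are made up for the example and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# illustrative values only (hypothetical ratings and predictions)
y_true = [50.0, 60.0, 70.0, 80.0]
y_pred = [52.0, 58.0, 73.0, 77.0]

print(f"RMSE: {rmse(y_true, y_pred):.3f}")
print(f"R Squared: {r_squared(y_true, y_pred):.3f}")
```

A lower RMSE and a higher R Squared both indicate a better fit; RMSE is expressed in the same units as the audience rating.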
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Concatenate, LSTM
3 Data Loading
[2]: # loading the dataset as a pandas dataframe
rt_df = pd.read_excel("Rotten_Tomatoes_Movies3.xls")
print(f"The Rotten Tomatoes Dataset contains {rt_df.shape[0]} rows and {rt_df.shape[1]} columns")
[3]: movie_title \
0 Percy Jackson & the Olympians: The Lightning T…
1 Please Give
2 10
3 12 Angry Men (Twelve Angry Men)
4 20,000 Leagues Under The Sea
movie_info \
0 A teenager discovers he's the descendant of a …
1 Kate has a lot on her mind. There's the ethics…
2 Blake Edwards' 10 stars Dudley Moore as George…
3 A Puerto Rican youth is on trial for murder, a…
4 This 1954 Disney version of Jules Verne's 20,0…
critics_consensus rating \
0 Though it may seem like just another Harry Pot… PG
1 Nicole Holofcener's newest might seem slight i… R
2 NaN R
3 Sidney Lumet's feature debut is a superbly wri… NR
4 One of Disney's finest live-action adventures,… G
genre directors \
0 Action & Adventure, Comedy, Drama, Science Fic… Chris Columbus
1 Comedy Nicole Holofcener
2 Comedy, Romance Blake Edwards
3 Classics, Drama Sidney Lumet
4 Action & Adventure, Drama, Kids & Family Richard Fleischer
writers cast \
0 Craig Titley Logan Lerman, Brandon T. Jackson, Alexandra Da…
1 Nicole Holofcener Catherine Keener, Amanda Peet, Oliver Platt, R…
2 Blake Edwards Dudley Moore, Bo Derek, Julie Andrews, Robert …
3 Reginald Rose Martin Balsam, John Fiedler, Lee J. Cobb, E.G…
4 Earl Felton James Mason, Kirk Douglas, Paul Lukas, Peter L…
tomatometer_count audience_rating
0 144 53.0
1 140 64.0
2 22 53.0
3 51 97.0
4 27 74.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16638 entries, 0 to 16637
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 movie_title 16638 non-null object
1 movie_info 16614 non-null object
2 critics_consensus 8309 non-null object
3 rating 16638 non-null object
4 genre 16621 non-null object
5 directors 16524 non-null object
6 writers 15289 non-null object
7 cast 16354 non-null object
8 in_theaters_date 15823 non-null datetime64[ns]
9 on_streaming_date 16636 non-null datetime64[ns]
10 runtime_in_minutes 16483 non-null float64
11 studio_name 16222 non-null object
12 tomatometer_status 16638 non-null object
13 tomatometer_rating 16638 non-null int64
14 tomatometer_count 16638 non-null int64
15 audience_rating 16386 non-null float64
dtypes: datetime64[ns](2), float64(2), int64(2), object(10)
memory usage: 2.0+ MB
4 Handling Duplicates
4.1 Handling Duplicates in Rows
[5]: # checking for duplicates in the dataset
rt_df.duplicated().any()
[5]: True
[6]: 1
cast in_theaters_date \
8495 Oliver Chris, Richard Goulding, Charlotte Rile… 2017-05-14
8496 Oliver Chris, Richard Goulding, Charlotte Rile… 2017-05-14
tomatometer_rating tomatometer_count audience_rating
8495 100 9 48.0
8496 100 9 48.0
[9]: False
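The check, drop, and re-check sequence for duplicate rows can be sketched as follows. This is a minimal illustration on a toy frame; the column values are invented and `keep="first"` is an assumption about which copy is retained.

```python
import pandas as pd

# toy frame with one fully duplicated row (hypothetical values)
df = pd.DataFrame({
    "movie_title": ["A", "B", "B", "C"],
    "audience_rating": [53.0, 48.0, 48.0, 74.0],
})

# count rows that are exact copies of an earlier row
n_duplicates = int(df.duplicated().sum())

# keep the first occurrence of each duplicated row
deduped = df.drop_duplicates(keep="first")

# verify that no duplicates remain
has_duplicates_after = bool(deduped.duplicated().any())

print(n_duplicates, len(deduped), has_duplicates_after)
```

Note that `drop_duplicates` keeps the original index labels, which is why a later `info()` call can report fewer entries than the index range suggests.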
Observation
1. High correlation between tomatometer_rating & audience_rating
2. Negligible correlation between the other features/columns and audience_rating
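One way to rank numeric features by their correlation with the target is shown below. This is a sketch on a small invented frame (the numbers are not from the dataset); only the column names are borrowed from it.

```python
import pandas as pd

# toy numeric frame (hypothetical values)
toy = pd.DataFrame({
    "tomatometer_rating": [10, 40, 60, 80, 95],
    "runtime_in_minutes": [100, 95, 120, 90, 110],
    "audience_rating": [30.0, 45.0, 62.0, 75.0, 90.0],
})

# Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["audience_rating"]
    .drop("audience_rating")
    .sort_values(ascending=False)
)
print(corr_with_target)
```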
Observation
Except for ‘rating’ & ‘tomatometer_status’, the remaining columns have too many unique values, which suggests that we treat them as textual columns, while ‘rating’ & ‘tomatometer_status’ are treated as categorical columns.
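The split between categorical and textual columns can be made explicit by counting unique values per column. The sketch below uses a toy frame, and the threshold `MAX_CATEGORIES` is a hypothetical cut-off chosen for illustration; the notebook does not state one.

```python
import pandas as pd

# toy frame (hypothetical values)
toy = pd.DataFrame({
    "rating": ["PG", "R", "R", "NR", "G", "PG"],
    "tomatometer_status": ["Rotten", "Fresh", "Rotten", "Certified Fresh", "Fresh", "Rotten"],
    "movie_title": ["t1", "t2", "t3", "t4", "t5", "t6"],
})

MAX_CATEGORIES = 5  # hypothetical threshold, not taken from the notebook

unique_counts = toy.nunique()
cat_col = unique_counts[unique_counts <= MAX_CATEGORIES].index.tolist()
txt_col = unique_counts[unique_counts > MAX_CATEGORIES].index.tolist()

print("categorical:", cat_col)
print("textual:", txt_col)
```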
Observation
From the Distribution of Rating above, we can see that ‘PG-13)’ and ‘R)’ contain a typo (a stray closing parenthesis). Let's fix them.
The typos have been fixed.
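A stray trailing parenthesis in the rating labels can be removed with a vectorised string operation. This is a sketch of one possible fix on a toy Series; the notebook's own fix is not shown, so `str.rstrip` here is an assumption.

```python
import pandas as pd

# toy ratings including the two malformed labels
ratings = pd.Series(["PG-13", "PG-13)", "R", "R)", "NR", "G"])

# strip any trailing ')' so 'PG-13)' and 'R)' merge with 'PG-13' and 'R'
cleaned = ratings.str.rstrip(")")

print(cleaned.value_counts())
```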
[19]: # verifying
print(f"No of missing values in audience_rating: {rt_df['audience_rating'].isna().sum()}")
axis = 1,
keys = ["Total Count of Values", "Total Missing Values", "Percent of Missing Values"])
[22]: missing_num_col = [col for col in num_col if rt_df[col].isna().sum() > 0]
missing_num_col
[23]: # creating histograms for all the numerical columns with missing values
missing_num_col = [col for col in num_col if rt_df[col].isna().sum() > 0]
number_cols = len(missing_num_col)
cols = 2
rows = (number_cols//cols+1)
axis = axis.flatten()
axis[i].set_title(col)
axis[i].set_xlabel(col)
axis[i].set_ylabel('Count')
axis[i].axvline(rt_df[col].mean(), color='blue', linestyle='dashed', linewidth=1)
axis[i].axvline(rt_df[col].median(), color='green', linestyle='dashed', linewidth=1)
rt_df['runtime_in_minutes'] = rt_df['runtime_in_minutes'].fillna(rt_df['runtime_in_minutes'].mean())
[26]: # verifying
print(f"Missing values in numeric columns: {rt_df[num_col].isna().sum().sum()}")
axis = 1,
keys = ["Total Count of Values", "Total Missing Values", "Percent of Missing Values"])
[28]: # visualizing missing values in the dataset with Categorical/Textual columns
msno.matrix(rt_df[txt_col], color=(0.4,0.2,0.5))
[30]: # taking a copy of the current dataset before processing 'critics_consensus' as it will be needed later
critics_consensus_df = rt_df.copy()
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'
rt_df['critics_sentiment'] = rt_df['critics_consensus'].apply(classify_sentiment)
rt_df['critics_sentiment'].value_counts()
[31]: critics_sentiment
Unknown 8104
Positive 5717
Negative 2122
Neutral 442
Name: count, dtype: int64
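The labelling rule behind these four classes can be sketched as a small pure function. This is an illustration only: `label_from_polarity` is a hypothetical helper, the polarity score is assumed to come from a sentiment library and to lie in [-1, 1], and a missing consensus is assumed to map to 'Unknown'.

```python
import math

def label_from_polarity(polarity):
    """Map a polarity score to a sentiment label; None or NaN means no text was available."""
    if polarity is None or (isinstance(polarity, float) and math.isnan(polarity)):
        return "Unknown"
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

# hypothetical scores, including two missing ones
scores = (0.6, -0.25, 0.0, None, float("nan"))
labels = [label_from_polarity(p) for p in scores]
print(labels)
```

Mapping missing text to its own class keeps those rows in the dataset instead of dropping roughly half of it.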
[33]: # verifying
txt_col.remove('critics_consensus')
print(f"Missing values in Categorical/Textual columns: {rt_df[txt_col].isna().sum().sum()}")
Missing values: 0
Missing values have been handled.
7 Feature Encoding
7.1 Encoding Date Columns
[36]: rt_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 16385 entries, 0 to 16637
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 movie_title 16385 non-null object
1 movie_info 16385 non-null object
2 rating 16385 non-null object
3 genre 16385 non-null object
4 directors 16385 non-null object
5 writers 16385 non-null object
6 cast 16385 non-null object
7 in_theaters_date 16385 non-null datetime64[ns]
8 on_streaming_date 16385 non-null datetime64[ns]
9 runtime_in_minutes 16385 non-null float64
10 studio_name 16385 non-null object
11 tomatometer_status 16385 non-null object
12 tomatometer_rating 16385 non-null int64
13 tomatometer_count 16385 non-null int64
14 audience_rating 16385 non-null float64
15 critics_sentiment 16385 non-null object
dtypes: datetime64[ns](2), float64(2), int64(2), object(10)
memory usage: 2.1+ MB
# Encoding in_theaters_date
rt_df['in_theaters_day'] = rt_df['in_theaters_date'].dt.day
rt_df['in_theaters_month'] = rt_df['in_theaters_date'].dt.month
rt_df['in_theaters_year'] = rt_df['in_theaters_date'].dt.year
# Encoding on_streaming_date
rt_df['on_streaming_day'] = rt_df['on_streaming_date'].dt.day
rt_df['on_streaming_month'] = rt_df['on_streaming_date'].dt.month
rt_df['on_streaming_year'] = rt_df['on_streaming_date'].dt.year
7.2.1 Multi-Label Encoding
mlb = MultiLabelBinarizer()
genre_encoded = pd.DataFrame(mlb.fit_transform(genre_lists), columns=mlb.classes_, index=rt_df.index)
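For reference, a self-contained sketch of multi-label encoding on a toy genre column follows. How `genre_lists` is built is not shown in the notebook, so splitting on ", " is an assumption; the sample genres and index labels are invented.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# toy genre column with a non-default index (hypothetical values)
genres = pd.Series(["Comedy, Romance", "Classics, Drama", "Comedy"], index=[10, 11, 12])

# assumption: genres are separated by a comma followed by a space
genre_lists = genres.str.split(", ")

mlb = MultiLabelBinarizer()
genre_encoded = pd.DataFrame(
    mlb.fit_transform(genre_lists),
    columns=mlb.classes_,
    index=genres.index,  # keep the original index so the result can be joined back
)
print(genre_encoded)
```

Passing the original index matters when rows were dropped earlier, because a default `RangeIndex` would no longer line up with the source frame.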
rt_df['directors_freq'] = frequency_encode(rt_df['directors'])
rt_df['writers_freq'] = frequency_encode(rt_df['writers'])
rt_df['cast_freq'] = frequency_encode(rt_df['cast'])
rt_df['studio_name_freq'] = frequency_encode(rt_df['studio_name'])
cast cast_freq \
0 Logan Lerman, Brandon T. Jackson, Alexandra Da… 719
1 Catherine Keener, Amanda Peet, Oliver Platt, R… 260
2 Dudley Moore, Bo Derek, Julie Andrews, Robert … 315
3 Martin Balsam, John Fiedler, Lee J. Cobb, E.G… 208
4 James Mason, Kirk Douglas, Paul Lukas, Peter L… 242
studio_name studio_name_freq
0 20th Century Fox 414
1 Sony Pictures Classics 259
2 Waner Bros. 1
3 Criterion Collection 110
4 Disney 26
Observation
The columns directors, writers, cast & studio_name have been frequency encoded: each value, for example ‘Chris Columbus’, is replaced with the number of times it appears in the dataset.
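The body of `frequency_encode` is not shown in the notebook. A minimal sketch of what such a helper could look like, assuming it maps each value to its row count, is given below with invented sample names.

```python
import pandas as pd

def frequency_encode(column: pd.Series) -> pd.Series:
    """Replace each value with the number of rows in which it appears (assumed behaviour)."""
    counts = column.value_counts()
    return column.map(counts)

# toy column (hypothetical values)
directors = pd.Series(["Chris Columbus", "Blake Edwards", "Chris Columbus", "Sidney Lumet"])
directors_freq = frequency_encode(directors)
print(directors_freq.tolist())
```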
Observation
The placeholder values that were used to fill the missing values in directors, writers, cast & studio_name have also been frequency encoded, which we do not want. Hence we will set the frequency for these placeholders to 0.
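Resetting the frequency of the fill value can be sketched as below. The actual placeholder string is not visible in the notebook, so `PLACEHOLDER = "Unknown"` is purely hypothetical, as are the sample names.

```python
import pandas as pd

PLACEHOLDER = "Unknown"  # hypothetical fill value, not confirmed by the notebook

# toy column in which missing writers were filled with the placeholder
writers = pd.Series(["Unknown", "Blake Edwards", "Unknown", "Reginald Rose"])

# plain frequency encoding counts the placeholder like any other value
writers_freq = writers.map(writers.value_counts())

# overwrite the frequency with 0 wherever the original value was the placeholder
writers_freq = writers_freq.mask(writers == PLACEHOLDER, 0)
print(writers_freq.tolist())
```

Without this step the placeholder would look like a very prolific writer, since it is shared by every row that had a missing value.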
8 Train Test Split
[48]: # splitting dataset to train and test datasets
X = rt_df.drop(columns = 'audience_rating')
y = rt_df['audience_rating']
[50]: # scaled training data will be used only in the ML models that require it
scaled_X_train, scaled_X_test, y_train, y_test = train_test_split(scaled_X, y, test_size = 0.2, random_state = 42)
10 Linear Regression
[51]: # creating a variable to store the results of all the models
model_perf = []
def evaluate_model(model, pred):
    rmse = round(root_mean_squared_error(y_test, pred), 2)
    r2 = round(r2_score(y_test, pred), 2)
    model_perf.append({'Model': model, 'RMSE': rmse, 'R Squared': r2})
    print(f"RMSE of {model}: {rmse}")
    print(f"R Squared of {model}: {r2}")
lreg.fit(scaled_X_train, y_train)
# predicting on scaled test dataset
pred_lreg = lreg.predict(scaled_X_test)
# evaluating model
evaluate_model('Linear Regression', pred_lreg)
[54]: # let's check the consistency of the model across different splits of the dataset
cv_scores = cross_val_score(lreg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
alpha = 0.1
RMSE with test data: 14.00
R Squared with test data: 0.53
alpha = 1.0
RMSE with test data: 14.03
R Squared with test data: 0.53
alpha = 10.0
RMSE with test data: 14.09
R Squared with test data: 0.52
alpha = 100.0
RMSE with test data: 14.23
R Squared with test data: 0.51
alpha = 0.1
RMSE with test data: 14.32
R Squared with test data: 0.51
alpha = 1.0
RMSE with test data: 15.62
R Squared with test data: 0.41
alpha = 10.0
RMSE with test data: 20.36
R Squared with test data: -0.00
alpha = 100.0
RMSE with test data: 20.36
R Squared with test data: -0.00
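The second sweep, where R Squared collapses to about zero from alpha = 10 onward, is the pattern one would expect from an L1 (Lasso) penalty that has shrunk every coefficient to exactly zero, while an L2 (Ridge) penalty only shrinks coefficients toward zero. The sketch below illustrates that difference on synthetic data; the data, coefficients, and variable names are all assumptions made for the example, not the notebook's own code.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# synthetic regression problem (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

alphas = [0.1, 1.0, 10.0, 100.0]
nonzero_counts = {}
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    # count how many coefficients each penalty leaves non-zero
    nonzero_counts[alpha] = {
        "ridge": int(np.sum(ridge.coef_ != 0)),
        "lasso": int(np.sum(lasso.coef_ != 0)),
    }

for alpha, counts in nonzero_counts.items():
    print(f"alpha = {alpha}: {counts}")
```

Once every Lasso coefficient is zero the model predicts a constant (the training mean), which yields an R Squared of roughly zero on held-out data.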
evaluate_model('Ridge Linear Regression', Rreg.predict(scaled_X_test))
RMSE: 13.59
R Squared: 0.55
rf_reg = RandomForestRegressor(n_estimators=n_estimators, criterion=criterion, max_depth=max_depth, random_state=42)
rf_reg.fit(X_train, y_train)
pred_rf_reg = rf_reg.predict(X_test)
result_rf_reg.append(
    {
        'No of Estimators': n_estimators,
        'Criterion': criterion,
        'Max Depth': max_depth,
        'RMSE': round(root_mean_squared_error(y_test, pred_rf_reg), 2),
        'R Squared': r2_score(y_test, pred_rf_reg)
    }
)
print("Training Done!")
Training Done!
Observation
From the above we can observe that the parameters with the best RMSE and the best R Squared are the same (n_estimators=300, criterion=‘squared_error’, max_depth=None).
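That both metrics pick the same parameters is expected: on a fixed test set, R Squared equals 1 minus the mean squared error divided by a constant, so the lowest RMSE and the highest R Squared always coincide (rounding aside). The sketch below shows a small grid search that records both metrics and confirms this; the synthetic data and the tiny grid are assumptions chosen to keep it fast.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# synthetic data (hypothetical)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rows = []
for n_estimators, max_depth in product([10, 30], [3, None]):
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "No of Estimators": n_estimators,
        "Max Depth": max_depth,
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R Squared": r2_score(y_test, pred),
    })

results = pd.DataFrame(rows)
best_by_rmse = results["RMSE"].idxmin()
best_by_r2 = results["R Squared"].idxmax()
print(results.loc[best_by_rmse])
```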
rf_reg.fit(X_train, y_train)
13 XGBoost Regressor
[63]: # Initialize XGBoost Regressor
xgb_reg = XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42)
RMSE: 13.25
R Squared: 0.58
xgb_reg.fit(X_train, y_train)
pred_xgb_reg = xgb_reg.predict(X_test)
result_xgb_reg.append(
    {
        'No of Estimators': n_estimators,
        'Learning Rate': learning_rate,
        'Max Depth': max_depth,
        'RMSE': round(root_mean_squared_error(y_test, pred_xgb_reg), 2),
Training Done!
[65]: No of Estimators 300.0000
Learning Rate 0.0500
Max Depth 5.0000
RMSE 13.2000
R Squared 0.5795
Name: 37, dtype: float64
Observation
From the above we can observe that the parameters with the best RMSE and the best R Squared are the same (n_estimators=200, Learning Rate=0.1, max_depth=5).
xgb_reg.fit(X_train, y_train)
14 CatBoost Regressor
[68]: # Initialize CatBoost Regressor
catboost_reg = CatBoostRegressor(iterations=100, depth=6, learning_rate=0.1, verbose=0, random_state=42)
RMSE: 13.27
R²: 0.58
catboost_reg.fit(X_train, y_train)
pred_catboost_reg = catboost_reg.predict(X_test)
result_catboost_reg.append(
    {
        'Iterations': iterations,
        'Learning Rate': learning_rate,
        'Depth': depth,
        'RMSE': round(root_mean_squared_error(y_test, pred_catboost_reg), 2),
Training Done!
Observation
From the above we can observe that the parameters with the best RMSE and the best R Squared are the same (iterations=200, Learning Rate=0.1, depth=8).
catboost_reg.fit(X_train, y_train)
model.summary()
Model: "sequential"
Layer (type)        Output Shape    Param #
-------------------------------------------
dense (Dense)       (None, 64)      3,136
dense_1 (Dense)     (None, 32)      2,080
dense_2 (Dense)     (None, 1)       33
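The Param # column of the sequential model can be reproduced by hand: a Dense layer has inputs × units weights plus one bias per unit. The sketch below is plain arithmetic; the input width of 48 is not stated in the notebook and is inferred here from 3,136 = (48 + 1) × 64, so treat it as an assumption.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# infer the number of input features from the first layer's parameter count
n_features = 3136 // 64 - 1

layer_params = [
    dense_params(n_features, 64),  # dense
    dense_params(64, 32),          # dense_1
    dense_params(32, 1),           # dense_2
]
total_params = sum(layer_params)
print(n_features, layer_params, total_params)
```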
[74]: # Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
Epoch 1/30
328/328 - 1s 2ms/step - loss: 2286.9290 - val_loss: 250.0674
Epoch 2/30
328/328 - 1s 2ms/step - loss: 246.5836 - val_loss: 227.1913
Epoch 3/30
328/328 - 1s 2ms/step - loss: 230.4257 - val_loss: 220.9194
Epoch 4/30
328/328 - 1s 2ms/step - loss: 223.1775 - val_loss: 216.1244
Epoch 5/30
328/328 - 1s 2ms/step - loss: 217.4891 - val_loss: 212.0494
Epoch 6/30
328/328 - 1s 2ms/step - loss: 205.9429 - val_loss: 210.7139
Epoch 7/30
328/328 - 1s 2ms/step - loss: 209.4912 - val_loss: 205.6074
Epoch 8/30
328/328 - 1s 2ms/step - loss: 200.7118 - val_loss: 203.2866
Epoch 9/30
328/328 - 1s 2ms/step - loss: 201.7865 - val_loss: 202.4568
Epoch 10/30
328/328 - 1s 2ms/step - loss: 199.6100 - val_loss: 200.0547
Epoch 11/30
328/328 - 1s 2ms/step - loss: 192.4956 - val_loss: 199.5509
Epoch 12/30
328/328 - 1s 2ms/step - loss: 196.8385 - val_loss: 202.0047
Epoch 13/30
328/328 - 1s 2ms/step - loss: 190.3293 - val_loss: 198.7908
Epoch 14/30
328/328 - 1s 2ms/step - loss: 192.3974 - val_loss: 197.0974
Epoch 15/30
328/328 - 1s 2ms/step - loss: 188.1279 - val_loss: 196.5793
Epoch 16/30
328/328 - 1s 2ms/step - loss: 192.7969 - val_loss: 204.0867
Epoch 17/30
328/328 - 1s 2ms/step - loss: 189.1077 - val_loss: 200.2438
Epoch 18/30
328/328 - 1s 2ms/step - loss: 187.1580 - val_loss: 198.2013
Epoch 19/30
328/328 - 1s 2ms/step - loss: 183.3916 - val_loss: 196.3865
Epoch 20/30
328/328 - 1s 2ms/step - loss: 180.5508 - val_loss: 194.4440
Epoch 21/30
328/328 - 1s 2ms/step - loss: 185.5836 - val_loss: 196.2199
Epoch 22/30
328/328 - 1s 2ms/step - loss: 180.4846 - val_loss: 193.9772
Epoch 23/30
328/328 - 1s 2ms/step - loss: 185.6906 - val_loss: 194.2294
Epoch 24/30
328/328 - 1s 2ms/step - loss: 184.7639 - val_loss: 197.1824
Epoch 25/30
328/328 - 1s 2ms/step - loss: 180.2869 - val_loss: 193.0731
Epoch 26/30
328/328 - 1s 2ms/step - loss: 181.7159 - val_loss: 192.7819
Epoch 27/30
328/328 - 1s 2ms/step - loss: 177.1915 - val_loss: 192.9699
Epoch 28/30
328/328 - 1s 2ms/step - loss: 180.7197 - val_loss: 192.7080
Epoch 29/30
328/328 - 1s 2ms/step - loss: 176.0547 - val_loss: 192.8633
Epoch 30/30
328/328 - 1s 2ms/step - loss: 178.3805 - val_loss: 192.9083
103/103 - 0s 1ms/step
RMSE of Feedforward Neural Network: 13.75
R Squared of Feedforward Neural Network: 0.54
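Because the network is trained with a mean squared error loss, the logged losses can be compared with the reported RMSE by taking a square root. The sketch below does this for the final validation loss and, in the other direction, squares the reported test RMSE; both input numbers are copied from the output above.

```python
import math

final_val_loss = 192.9083  # val_loss (MSE) logged at epoch 30
reported_test_rmse = 13.75  # RMSE reported on the test set

# MSE -> RMSE for the validation split
val_rmse = math.sqrt(final_val_loss)

# RMSE -> MSE for the test set, to compare on the loss scale
test_mse = reported_test_rmse ** 2

print(f"validation RMSE: {val_rmse:.2f}")
print(f"test MSE: {test_mse:.2f}")
```

The validation and test figures land close to each other, which suggests the two splits behave similarly for this model.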
↪'critics_sentiment_Unknown',}).values
y = rt_df['audience_rating'].values
↪random_state=42)
lstm_out = LSTM(64)(embedding)
# numerical input branch
num_input = Input(shape=(X_other.shape[1],), name="num_input")
dense_num = Dense(64, activation='relu')(num_input)
model.summary()
Model: "functional_1"
Layer (type)                Output Shape       Param #      Connected to
--------------------------------------------------------------------------------------
text_input (InputLayer)     (None, 88)         0            -
embedding (Embedding)       (None, 88, 128)    1,280,000    text_input[0][0]
num_input (InputLayer)      (None, 44)         0            -
lstm (LSTM)                 (None, 64)         49,408       embedding[0][0]
dense_3 (Dense)             (None, 64)         2,880        num_input[0][0]
concatenate (Concatenate)   (None, 128)        0            lstm[0][0], dense_3[0][0]
dense_4 (Dense)             (None, 128)        16,512       concatenate[0][0]
dense_5 (Dense)             (None, 1)          129          dense_4[0][0]
Total params: 1,348,929 (5.15 MB)
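The total can be checked layer by layer with the standard parameter-count formulas for Embedding, LSTM, and Dense layers. The sketch below is plain arithmetic; the vocabulary size of 10,000 is not stated in the notebook and is inferred from 1,280,000 / 128, so treat it as an assumption.

```python
def embedding_params(vocab_size: int, embed_dim: int) -> int:
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units + 1) * units)

def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

vocab_size = 1_280_000 // 128  # inferred, see the note above

counts = {
    "embedding": embedding_params(vocab_size, 128),
    "lstm": lstm_params(128, 64),
    "dense_3": dense_params(44, 64),
    "dense_4": dense_params(64 + 64, 128),  # input is the concatenation of both branches
    "dense_5": dense_params(128, 1),
}
total = sum(counts.values())
print(counts, total)
```

Almost all of the parameters sit in the embedding table, so the text branch dominates the model size.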
Epoch 1/10
410/410 - 15s 32ms/step - loss: 1140.8567 - mae: 25.6652 - val_loss: 221.6878 - val_mae: 11.9149
Epoch 2/10
410/410 - 13s 32ms/step - loss: 217.3896 - mae: 11.8196 - val_loss: 212.7557 - val_mae: 11.6834
Epoch 3/10
410/410 - 13s 31ms/step - loss: 215.1940 - mae: 11.7355 - val_loss: 206.2878 - val_mae: 11.4945
Epoch 4/10
410/410 - 12s 30ms/step - loss: 203.9770 - mae: 11.3881 - val_loss: 203.9777 - val_mae: 11.4025
Epoch 5/10
410/410 - 13s 30ms/step - loss: 202.5593 - mae: 11.2855 - val_loss: 197.7494 - val_mae: 11.1343
Epoch 6/10
410/410 - 13s 31ms/step - loss: 195.9349 - mae: 11.0397 - val_loss: 202.5623 - val_mae: 11.2185
Epoch 7/10
410/410 - 13s 31ms/step - loss: 194.5986 - mae: 11.0365 - val_loss: 194.2616 - val_mae: 11.0375
Epoch 8/10
410/410 - 13s 31ms/step - loss: 191.5667 - mae: 10.9208 - val_loss: 193.5592 - val_mae: 11.0496
Epoch 9/10
410/410 - 12s 30ms/step - loss: 194.7535 - mae: 10.9932 - val_loss: 192.5889 - val_mae: 10.9288
Epoch 10/10
410/410 - 12s 30ms/step - loss: 189.6154 - mae: 10.8478 - val_loss: 192.2870 - val_mae: 10.9272
[82]: # predicting on test dataset
y_pred = model.predict([X_test_text, X_test_other])
17 Model Comparison
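[84]: # converting the list of results to a DataFrame so it can be plotted and filtered later
model_perf = pd.DataFrame(model_perf)
model_perf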
# plot RMSE
sns.lineplot(data=model_perf, x="Model", y="RMSE", marker="o", color="blue", ax=ax1, label="RMSE")
# plot R²
ax2 = ax1.twinx()
sns.lineplot(data=model_perf, x="Model", y="R Squared", marker="o", color="green", ax=ax2, label="R²")
18 Conclusion
[86]: # best performing model
model_perf.loc[model_perf['Model'] == 'CatBoost Regressor']
CatBoost Regressor has the highest performance among the models compared, with a Root Mean Squared Error of 13.11 and an R Squared of 0.59.
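Rather than looking up the winner by name, the best model can be selected programmatically by sorting on RMSE. The sketch below uses only the two labelled results reported in this notebook (Feedforward Neural Network and CatBoost Regressor); the real table would hold one row per model.

```python
import pandas as pd

# the two labelled results reported in this notebook
model_perf = pd.DataFrame([
    {"Model": "Feedforward Neural Network", "RMSE": 13.75, "R Squared": 0.54},
    {"Model": "CatBoost Regressor", "RMSE": 13.11, "R Squared": 0.59},
])

# lowest RMSE first; the top row is the best-performing model
ranked = model_perf.sort_values("RMSE").reset_index(drop=True)
best = ranked.iloc[0]
print(f"{best['Model']}: RMSE {best['RMSE']}, R Squared {best['R Squared']}")
```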
plt.figure(figsize=(10, 6))
sns.lineplot(data=bucketed_df, x='Bucket', y='Original', label='Original', color='blue', linewidth=2)
plt.xlabel('Index')
plt.ylabel('Audience Rating')
plt.title('Original Audience Rating vs Predicted Audience Rating')
plt.legend()
plt.show()