Documentation
Overview
This project aims to predict student success (i.e., whether a student will complete or drop out of a
course) on an online learning platform. The project involves merging several datasets related to
student profiles, engagement data, and historical performance. A classification model (Random
Forest) is built to predict student completion status, and feature importance analysis is conducted to
determine which factors contribute most to the outcome. The goal is to identify at-risk students and
suggest interventions to help them succeed.
Code Breakdown
1. Loading Libraries
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
2. Loading Datasets
The three CSV files are loaded and then merged into a single dataframe using student_id as the common key.
profile_df = pd.read_csv('student_profile_data.csv')
engagement_df = pd.read_csv('course_engagement_data.csv')
historical_df = pd.read_csv('historical_data.csv')
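The merge step itself can be sketched as follows; a minimal example assuming each file shares a student_id column (the columns and values below are illustrative, not from the real datasets):

```python
import pandas as pd

# Small in-memory stand-ins for the three CSV files (illustrative data only).
profile_df = pd.DataFrame({'student_id': [1, 2], 'age': [21, 25]})
engagement_df = pd.DataFrame({'student_id': [1, 2], 'logins': [14, 3]})
historical_df = pd.DataFrame({'student_id': [1, 2], 'past_gpa': [3.1, 2.4]})

# Chain inner joins on the shared key so only students present in all
# three sources are kept.
merged_df = (profile_df
             .merge(engagement_df, on='student_id', how='inner')
             .merge(historical_df, on='student_id', how='inner'))
```

An inner join keeps the feature matrix free of rows that are missing an entire source; a left join on profile_df would instead keep every student and leave gaps for the imputation step.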
3. Creating the Target Variable
A helper function derives the binary completion label for each student. The exact rule is not preserved in this write-up; one plausible version, assuming a course_progress column, is:
def determine_completion_status(row):
    # Assumed rule: full progress counts as completion (1), anything
    # less as dropout (0); adjust to the dataset's actual columns.
    if row['course_progress'] >= 100:
        return 1
    else:
        return 0
merged_df['completion_status'] = merged_df.apply(determine_completion_status, axis=1)
4. Handling Missing Values
Numerical columns are filled with their mean, and categorical columns are filled with their most
frequent value (mode).
numerical_cols = merged_df.select_dtypes(include='number').columns
categorical_cols = merged_df.select_dtypes(include='object').columns
merged_df[numerical_cols] = merged_df[numerical_cols].fillna(merged_df[numerical_cols].mean())
merged_df[categorical_cols] = merged_df[categorical_cols].fillna(merged_df[categorical_cols].mode().iloc[0])
5. Data Encoding
Categorical variables are converted into one-hot encoded columns for compatibility with machine
learning algorithms.
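The one-hot step can be done with pd.get_dummies; a minimal sketch with an illustrative column name:

```python
import pandas as pd

df = pd.DataFrame({'education_level': ['HS', 'BSc', 'HS'],
                   'hours_studied': [5, 9, 2]})

# get_dummies replaces each categorical column with one indicator column
# per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=['education_level'])
```

Note that tree-based models such as Random Forests tolerate one-hot columns well, although they can also work with ordinal integer codes.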
6. Defining Features and Splitting the Data
The features (X) and target variable (y) are defined. The data is split into training and test sets using a 70/30 split.
X = merged_df.drop(columns=['student_id', 'completion_status'])
y = merged_df['completion_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
7. Model Training and Hyperparameter Tuning
A Random Forest classifier is trained, with hyperparameter tuning performed using GridSearchCV. The best model is selected based on F1-score.
param_grid = {
    # Only min_samples_leaf survives from the original write-up; the
    # other grid entries here are illustrative placeholders.
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, scoring='f1', cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
8. Model Evaluation
The model's performance is evaluated on the test data using accuracy and a classification report.
y_pred_best = best_model.predict(X_test)
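The metrics themselves come from scikit-learn's helpers; a sketch with small illustrative label vectors standing in for the real y_test and y_pred_best:

```python
from sklearn.metrics import accuracy_score, classification_report

# Illustrative labels: six test students, one misclassified.
y_test = [1, 0, 1, 1, 0, 1]
y_pred_best = [1, 0, 0, 1, 0, 1]

acc = accuracy_score(y_test, y_pred_best)
print(f'Accuracy: {acc:.2f}')
# Per-class precision, recall, F1, and support in one table.
print(classification_report(y_test, y_pred_best))
```

For an at-risk-student use case, recall on the dropout class is usually the metric to watch, since missing a struggling student is costlier than a false alarm.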
9. Feature Importance
The importance of each feature is visualized to determine which factors contribute the most to
predicting student success.
importances = best_model.feature_importances_
plt.figure(figsize=(10, 6))
plt.barh(X.columns, importances)
plt.xlabel('Importance')
plt.title('Feature Importances')
plt.show()
10. Predicting Likely Dropouts
The model is used to predict the completion status for all students, and the students likely to drop
out are saved in a separate file.
merged_df['predicted_completion_status'] = best_model.predict(X)
dropout_students = merged_df[merged_df['predicted_completion_status'] == 0]
dropout_students.to_csv('likely_dropout_students.csv', index=False)
11. Ranking Metrics
Functions are provided to calculate MAP@K (Mean Average Precision at K) and NDCG@K (Normalized Discounted Cumulative Gain at K) for ranking evaluation.
def apk(actual, predicted, k):
    # Average precision at k, reconstructed from the standard definition.
    if len(predicted) > k:
        predicted = predicted[:k]
    score = 0.0
    num_hits = 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    return score / min(len(actual), k) if actual else 0.0

def dcg_at_k(relevance_scores, k):
    # Discounted cumulative gain over the top k positions.
    relevance_scores = np.array(relevance_scores)[:k]
    if relevance_scores.size == 0:
        return 0.0
    return np.sum(relevance_scores / np.log2(np.arange(2, relevance_scores.size + 2)))

def ndcg_at_k(relevance_scores, k):
    # DCG normalized by the DCG of the ideal (sorted) ordering.
    idcg = dcg_at_k(sorted(relevance_scores, reverse=True), k)
    if idcg == 0:
        return 0.0
    dcg = dcg_at_k(relevance_scores, k)
    return dcg / idcg
The MAP@K and NDCG@K functions are tested with example data (the ground-truth and relevance lists below are illustrative).
actual_list = [[1, 2, 3], [1, 2], [1, 2, 3], [1, 3], [1, 2]]
predicted_list = [[1, 2, 4], [2], [1, 2, 3], [1, 3], [1, 2]]
map_scores = [apk(a, p, 3) for a, p in zip(actual_list, predicted_list)]
print(f'MAP@3: {np.mean(map_scores)}')
relevance_list = [[1, 1, 0], [0, 1, 0], [1, 1, 1], [1, 0, 1], [1, 1, 0]]
ndcg_scores = [ndcg_at_k(r, 3) for r in relevance_list]
print(f'NDCG@3: {np.mean(ndcg_scores)}')
Conclusion
The project successfully predicts student completion status with a Random Forest classifier, analyzes
feature importance, and provides insights into student behavior.