Machine Learning Final Report
SECOND YEAR SECOND SEMESTER FOR BSC. DATA SCIENCE AND ANALYTICS
APRIL, 2025
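import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report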
url = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0120ENv3/Dataset/ML0101EN_EDX_skill_up/cbb.csv'
df = pd.read_csv(url)
print(df.head())
print("\nData Info:")
print(df.info())
print("\nStatistical Summary:")
print(df.describe())
print("\nMissing Values:")
print(df.isnull().sum())
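# Fill missing values: 'None' for text columns, the median for numeric columns
for col in df.columns:
    if df[col].isnull().sum() > 0:
        if df[col].dtype == 'object':
            df[col] = df[col].fillna('None')
        else:
            df[col] = df[col].fillna(df[col].median())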
le = LabelEncoder()
df['CONF'] = le.fit_transform(df['CONF'])
# Target Variable: Did the team reach POSTSEASON (1) or not (0)
# Feature selection
X = df[features]
# Step 4: Data Visualization
plt.figure(figsize=(12, 8))
plt.tight_layout()
plt.show()
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['ADJOE'], kde=True)
plt.subplot(1, 2, 2)
sns.histplot(df['ADJDE'], kde=True)
plt.tight_layout()
plt.show()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
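models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    # ... remaining models (Decision Tree, SVM, Logistic Regression)
}
trained_models = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    trained_models[name] = model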
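results = {}
plt.figure(figsize=(20, 5))
# One confusion-matrix heatmap per trained model, side by side
for i, (name, model) in enumerate(trained_models.items(), start=1):
    plt.subplot(1, 4, i)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    # Accuracy: fraction of test rows predicted correctly
    acc = np.mean(y_pred == y_test)
    results[name] = acc
plt.tight_layout()
plt.show()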
for name, model in trained_models.items():
    print(f"\n{name}:")
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])
print("\nComparative Accuracy:")
print(results_df.sort_values(by='Accuracy', ascending=False))
# Conclusion
The dataset contained missing values, which were replaced with the placeholder string 'None'
(for categorical columns) or the column median (for numerical columns). This keeps the dataset
interpretable and preserves the central tendency of each numerical column.
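The imputation rule described above can be illustrated on a small table. This is a minimal sketch, not the report's exact code: the helper name `impute_missing` and the toy values are hypothetical, and it uses `pd.api.types.is_numeric_dtype` instead of a `dtype == 'object'` comparison so that it behaves the same regardless of how pandas stores text columns.

```python
import pandas as pd

def impute_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with gaps filled: median for numeric columns, 'None' otherwise."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isnull().sum() > 0:
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna('None')
    return out

# Hypothetical toy rows; values are illustrative, not taken from cbb.csv
toy = pd.DataFrame({
    'POSTSEASON': ['R64', None, 'F4', None],
    'ADJOE': [110.0, None, 104.0, 98.0],
})
filled = impute_missing(toy)
print(filled)
```

The median is used rather than the mean because it is less sensitive to extreme values, which is one reasonable way to keep a column's central tendency stable after filling.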
Preprocessing:
Categorical Encoding: The column 'CONF' was label-encoded.
Target Variable: The 'POSTSEASON' column was transformed into binary labels (1 for
postseason, 0 otherwise).
Feature Selection: Key features such as 'ADJOE', 'ADJDE', 'BARTHAG', etc., were
extracted for analysis.
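The three preprocessing steps above might look like the following on a toy table. This is a sketch under stated assumptions: the rows are invented, `pd.factorize(..., sort=True)` stands in for `LabelEncoder` (both assign integer codes in sorted label order), the string 'None' is assumed to mark teams without a postseason appearance, and only the three features named in the report are selected because the full feature list is not given.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative, not taken from cbb.csv
toy = pd.DataFrame({
    'CONF': ['ACC', 'B10', 'ACC', 'SEC'],
    'POSTSEASON': ['F4', 'None', 'R64', 'None'],
    'ADJOE': [118.2, 101.5, 110.3, 99.8],
    'ADJDE': [92.1, 100.4, 95.7, 103.2],
    'BARTHAG': [0.95, 0.52, 0.84, 0.41],
})

# Categorical encoding: integer codes assigned in sorted label order
codes, labels = pd.factorize(toy['CONF'], sort=True)
toy['CONF'] = codes

# Target variable: 1 if the team reached the postseason, 0 otherwise
# (assumption: 'None' is the placeholder for "no postseason")
y = (toy['POSTSEASON'] != 'None').astype(int)

# Feature selection: only the features named in the report
features = ['ADJOE', 'ADJDE', 'BARTHAG']
X = toy[features]

print(toy['CONF'].tolist(), y.tolist(), X.shape)
```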
Visualization:
Comparative Accuracy:
   Model                Accuracy
3  Logistic Regression  0.932624
2  SVM                  0.918440
0  KNN                  0.897163
1  Decision Tree        0.879433
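For reference, a ranking like the one above can be rebuilt from a dictionary of accuracies. The sketch below copies the four reported values and orders them by the row indices shown (0 to 3); the names `ranked`, `best`, and `spread` are illustrative and do not come from the report's code.

```python
import pandas as pd

# Accuracies copied from the comparison table, keyed in row-index order (0 to 3)
results = {
    'KNN': 0.897163,
    'Decision Tree': 0.879433,
    'SVM': 0.918440,
    'Logistic Regression': 0.932624,
}

results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])
ranked = results_df.sort_values(by='Accuracy', ascending=False)

best = ranked.iloc[0]
# Gap between the strongest and weakest model
spread = ranked['Accuracy'].max() - ranked['Accuracy'].min()

print(ranked)
print(f"Best: {best['Model']} ({best['Accuracy']:.3f}), spread: {spread:.3f}")
```

Sorting keeps the original integer index, which is why the rows appear as 3, 2, 0, 1 rather than 0 to 3.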