Machine Learning Final Report

The document details a comprehensive assignment on machine learning classification for a BSc in Data Science and Analytics, focusing on a dataset related to postseason performance. It includes steps for data loading, preprocessing, visualization, model training, and evaluation using various classifiers, ultimately identifying Logistic Regression as the best-performing model with an accuracy of 0.93. Key insights include handling missing values, feature selection, and the correlation between offensive and defensive metrics.

Uploaded by

jorambwana

ESTHER AWINO - SCT213-C002-0089/2023

SECOND YEAR, SECOND SEMESTER FOR BSC. DATA SCIENCE AND ANALYTICS

APRIL, 2025

MACHINE LEARNING COMPREHENSIVE ASSIGNMENT 1 - CLASSIFICATION

# Import required libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.svm import SVC

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Step 1 & 2: Load and describe dataset

url = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0120ENv3/Dataset/ML0101EN_EDX_skill_up/cbb.csv'

df = pd.read_csv(url)

print("Dataset Shape:", df.shape)

print("\nFirst five rows:")

print(df.head())

print("\nData Info:")
print(df.info())

print("\nStatistical Summary:")

print(df.describe())

print("\nMissing Values:")

print(df.isnull().sum())

# Step 3: Data Preprocessing and Visualization

# Handle missing values (if any)

for col in df.columns:
    if df[col].isnull().sum() > 0:
        if df[col].dtype == 'object':
            # Text columns: use the placeholder string 'None'
            df[col] = df[col].fillna('None')
        else:
            # Numeric columns: use the column median
            df[col] = df[col].fillna(df[col].median())

# Encode 'CONF' categorical column

le = LabelEncoder()

df['CONF'] = le.fit_transform(df['CONF'])

# Target Variable: Did the team reach POSTSEASON (1) or not (0)

y = df['POSTSEASON'].apply(lambda x: 0 if x == 'None' else 1)

# Feature selection

features = ['ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D', 'TOR', 'TORD',
            'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O', '3P_D',
            'ADJ_T', 'WAB', 'CONF']

X = df[features]
# Step 4: Data Visualization

# Create correlation heatmap

plt.figure(figsize=(12, 8))

correlation_matrix = df[['ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D', 'TOR',
                         'TORD']].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)

plt.title('Correlation Heatmap of Key Features')

plt.tight_layout()

plt.show()

# Distribution of offensive and defensive efficiency

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)

sns.histplot(df['ADJOE'], kde=True)

plt.title('Distribution of Adjusted Offensive Efficiency')

plt.subplot(1, 2, 2)

sns.histplot(df['ADJDE'], kde=True)

plt.title('Distribution of Adjusted Defensive Efficiency')

plt.tight_layout()

plt.show()

print("Feature distributions and correlations visualized above.")


# Step 5: Normalization

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

# Step 6: Training and Validation

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42)
}

# Fit every model on the training split and keep the fitted estimators
trained_models = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    trained_models[name] = model
    print(f"{name} model trained successfully.")

# Step 7: Model Evaluation

results = {}

# One confusion-matrix panel per model, side by side
plt.figure(figsize=(20, 5))

for i, (name, model) in enumerate(trained_models.items(), 1):
    plt.subplot(1, 4, i)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'{name} Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    acc = accuracy_score(y_test, y_pred)
    results[name] = acc

plt.tight_layout()

plt.show()

# Per-class precision, recall and F1 for each model
for name, model in trained_models.items():
    print(f"Classification Report for {name}:")
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

# Step 8: Comparative Analysis

print("\nComparative Accuracy:")

results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])

print(results_df.sort_values(by='Accuracy', ascending=False))

# Conclusion

best_model = results_df.sort_values(by='Accuracy', ascending=False).iloc[0]

print(f"\nBest Performing Model: {best_model['Model']} with Accuracy: {best_model['Accuracy']:.2f}")

Results and Key Insights

• Missing values: The dataset had missing values, which were imputed rather than dropped. Categorical
columns were filled with the placeholder string 'None' and numerical columns with the column median,
so the dataset remains interpretable and keeps its central tendency.
• Preprocessing:
Categorical Encoding: The 'CONF' column was label-encoded.
Target Variable: The 'POSTSEASON' column was transformed into binary labels (1 for
postseason, 0 otherwise).
Feature Selection: 18 features, including 'ADJOE', 'ADJDE' and 'BARTHAG', were
selected for analysis.
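
The imputation and target-encoding steps described above can be sketched on a small example. The toy DataFrame and its values are invented for illustration and are not rows from the dataset; the sketch also uses `pd.api.types.is_numeric_dtype` instead of a dtype comparison, which is an assumption made here for robustness across pandas versions.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, mimicking one numeric and one text column.
toy = pd.DataFrame({
    "ADJOE": [110.0, np.nan, 98.0, 104.0],
    "POSTSEASON": ["R64", np.nan, np.nan, "Champions"],
})

# Median for numeric columns, the placeholder string 'None' for text columns.
for col in toy.columns:
    if toy[col].isnull().any():
        if pd.api.types.is_numeric_dtype(toy[col]):
            toy[col] = toy[col].fillna(toy[col].median())
        else:
            toy[col] = toy[col].fillna("None")

# Binary target: 1 if the team reached the postseason, 0 otherwise.
y_toy = (toy["POSTSEASON"] != "None").astype(int)

print(toy)
print(y_toy.tolist())
```

Because the placeholder 'None' is written before the target is built, every originally missing POSTSEASON entry ends up in class 0.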

• Visualization:

From the correlation heatmap:

Strong positive correlations:
BARTHAG correlates positively with ADJOE and EFG_O.
EFG_O and ADJOE are positively linked, showing that better effective field goal percentages go with
higher offensive efficiency.
Negative correlations:
ADJDE correlates negatively with the offensive metrics (for ADJDE, lower values indicate better defense).
TOR correlates negatively with the offensive metrics, meaning higher turnover rates go with lower
offensive efficiency.

From the distribution plots:

Offensive Efficiency (ADJOE):
The histogram is approximately normal, indicating most teams sit near the league average with few extremes.
Defensive Efficiency (ADJDE):
Also approximately normal, but with a tighter spread, suggesting less variation across teams.
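
The visual reading of the histograms can be backed with summary statistics. The following is a minimal sketch on synthetic samples; the means and spreads are invented stand-ins, not values from the dataset. It compares standard deviation (spread) and uses skewness and excess kurtosis as rough indicators of how close each distribution is to normal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two columns; the parameters are assumptions.
demo = pd.DataFrame({
    "ADJOE": rng.normal(loc=103.0, scale=7.0, size=1000),
    "ADJDE": rng.normal(loc=103.0, scale=5.0, size=1000),
})

summary = pd.DataFrame({
    "std": demo.std(),               # smaller value means a tighter spread
    "skew": demo.skew(),             # near 0 for a symmetric distribution
    "excess_kurtosis": demo.kurt(),  # near 0 for a normal distribution
})
print(summary.round(3))
```

Running the same three calls on `df[['ADJOE', 'ADJDE']]` would show whether the "tighter spread" and "normal" descriptions hold numerically for the real data.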

• Normalization: StandardScaler was used to standardize the features (zero mean, unit variance) before training.
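
In the script, the scaler is fitted on all rows before the train/test split, so test-set statistics influence the scaling. A common alternative, sketched below under assumptions, is to wrap the scaler and the classifier in a pipeline so the scaler is fitted on the training split only. This is not the method used in the report; the data is synthetic (generated with `make_classification`), and `max_iter=1000` and `stratify` are choices made for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the 18 selected features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=18, n_informative=6, random_state=42
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() learns the scaling parameters from X_tr only; score() reuses them on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```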


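• Model training and validation:
Four classifiers were trained on an 80/20 train/test split (282 test samples): KNN (k = 5), Decision Tree,
SVM (RBF kernel) and Logistic Regression. Their classification reports on the test set follow.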
Classification Report for KNN:

              precision    recall  f1-score   support

           0       0.91      0.96      0.94       223
           1       0.81      0.66      0.73        59

    accuracy                           0.90       282
   macro avg       0.86      0.81      0.83       282
weighted avg       0.89      0.90      0.89       282

Classification Report for Decision Tree:

              precision    recall  f1-score   support

           0       0.92      0.92      0.92       223
           1       0.71      0.71      0.71        59

    accuracy                           0.88       282
   macro avg       0.82      0.82      0.82       282
weighted avg       0.88      0.88      0.88       282

Classification Report for SVM:

              precision    recall  f1-score   support

           0       0.93      0.97      0.95       223
           1       0.88      0.71      0.79        59

    accuracy                           0.92       282
   macro avg       0.90      0.84      0.87       282
weighted avg       0.92      0.92      0.92       282

Classification Report for Logistic Regression:

              precision    recall  f1-score   support

           0       0.94      0.97      0.96       223
           1       0.88      0.78      0.83        59

    accuracy                           0.93       282
   macro avg       0.91      0.88      0.89       282
weighted avg       0.93      0.93      0.93       282
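
To show how the figures in these reports relate to a confusion matrix, the sketch below recomputes the Logistic Regression row for class 1. The four counts are not printed in the report; they are inferred here from the rounded precision, recall, supports (223 and 59) and the accuracy of 0.932624, so they should be read as an assumption consistent with those numbers.

```python
# Counts inferred from the rounded Logistic Regression report (an assumption).
tn, fp, fn, tp = 217, 6, 13, 46

# Class 1 (reached the postseason)
precision_1 = tp / (tp + fp)   # of predicted postseason teams, share that were right
recall_1 = tp / (tp + fn)      # of actual postseason teams, share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Overall accuracy across both classes
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.6f}")
```

The lower recall on class 1 in every report means each model misses a noticeable share of actual postseason teams, which accuracy alone does not show.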

Comparative Accuracy:
                 Model  Accuracy
3  Logistic Regression  0.932624
2                  SVM  0.918440
0                  KNN  0.897163
1        Decision Tree  0.879433

Best Performing Model: Logistic Regression with Accuracy: 0.93
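
The ranking above comes from a single 80/20 split, and the classes are imbalanced: 223 of the 282 test samples are class 0, so always predicting "no postseason" would already score about 0.79. One way to check how stable a model's accuracy is would be stratified k-fold cross-validation against a majority-class baseline. The sketch below is an illustration on synthetic data with a similar class ratio, not a rerun of the assignment; the fold count and `max_iter=1000` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (roughly 79% class 0) standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=18, n_informative=6,
    weights=[0.79, 0.21], random_state=42,
)

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, est in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Replacing `scoring="accuracy"` with `"f1"` or `"recall"` would compare the models on how well they find the minority class.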
