Machine Learning Final Report

The document details a comprehensive assignment on machine learning classification for a BSc in Data Science and Analytics, focusing on a dataset related to postseason performance. It includes steps for data loading, preprocessing, visualization, model training, and evaluation using various classifiers, ultimately identifying Logistic Regression as the best-performing model with an accuracy of 0.93. Key insights include handling missing values, feature selection, and the correlation between offensive and defensive metrics.

Uploaded by

jorambwana


ESTHER AWINO- SCT213-C002-0089/2023

SECOND YEAR SECOND SEMESTER FOR BSC. DATA SCIENCE AND ANALYTICS

APRIL, 2025

MACHINE LEARNING COMPREHENSIVE ASSIGNMENT 1-CLASSIFICATION

# Import required libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.svm import SVC

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Step 1 & 2: Load and describe dataset

url = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0120ENv3/Dataset/ML0101EN_EDX_skill_up/cbb.csv'

df = pd.read_csv(url)

print("Dataset Shape:", df.shape)

print("\nFirst five rows:")

print(df.head())

print("\nData Info:")
df.info()  # info() prints its summary directly and returns None

print("\nStatistical Summary:")

print(df.describe())

print("\nMissing Values:")

print(df.isnull().sum())

# Step 3: Data Preprocessing and Visualization

# Handle missing values (if any)

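for col in df.columns:
    if df[col].isnull().sum() > 0:
        if df[col].dtype == 'object':
            # Categorical column: mark missing entries with the string 'None'
            df[col] = df[col].fillna('None')
        else:
            # Numeric column: fill with the median to preserve central tendency
            df[col] = df[col].fillna(df[col].median())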

# Encode 'CONF' categorical column

le = LabelEncoder()

df['CONF'] = le.fit_transform(df['CONF'])

# Target variable: did the team reach the postseason (1) or not (0)?
# Relies on the fill above, which turns missing POSTSEASON entries into 'None'.

y = df['POSTSEASON'].apply(lambda x: 0 if x == 'None' else 1)

# Feature selection

features = ['ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D', 'TOR', 'TORD',
            'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O', '3P_D',
            'ADJ_T', 'WAB', 'CONF']

X = df[features]
# Step 4: Data Visualization

# Create correlation heatmap

plt.figure(figsize=(12, 8))

correlation_matrix = df[['ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D', 'TOR',
                         'TORD']].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)

plt.title('Correlation Heatmap of Key Features')

plt.tight_layout()

plt.show()

# Distribution of offensive and defensive efficiency

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)

sns.histplot(df['ADJOE'], kde=True)

plt.title('Distribution of Adjusted Offensive Efficiency')

plt.subplot(1, 2, 2)

sns.histplot(df['ADJDE'], kde=True)

plt.title('Distribution of Adjusted Defensive Efficiency')

plt.tight_layout()

plt.show()

print("Feature distributions and correlations visualized above.")

# Step 5: Normalization

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

# Step 6: Training and Validation

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42)
}

trained_models = {}

# Fit each model on the training split and keep the fitted estimator
for name, model in models.items():
    model.fit(X_train, y_train)
    trained_models[name] = model
    print(f"{name} model trained successfully.")

# Step 7: Model Evaluation

results = {}

plt.figure(figsize=(20,5))

for i, (name, model) in enumerate(trained_models.items(), 1):
    plt.subplot(1, 4, i)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'{name} Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    acc = accuracy_score(y_test, y_pred)
    results[name] = acc

plt.tight_layout()
plt.show()

for name, model in trained_models.items():
    print(f"Classification Report for {name}:")
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

# Step 8: Comparative Analysis

print("\nComparative Accuracy:")

results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])

print(results_df.sort_values(by='Accuracy', ascending=False))

# Conclusion

best_model = results_df.sort_values(by='Accuracy', ascending=False).iloc[0]

print(f"\nBest Performing Model: {best_model['Model']} with Accuracy: {best_model['Accuracy']:.2f}")

Results and Key Insights

• Missing values: The dataset contained missing values. Categorical columns were filled with the
string 'None' and numerical columns with the column median, so that the dataset remains
interpretable and keeps its central tendency.
• Preprocessing:
Categorical encoding: The 'CONF' column was label-encoded.
Target variable: The 'POSTSEASON' column was converted to binary labels (1 for teams that
reached the postseason, 0 otherwise).
Feature selection: Eighteen features, including 'ADJOE', 'ADJDE', and 'BARTHAG', were
selected for analysis.
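
The missing-value and target-encoding steps described above can be sketched on a small made-up frame. The toy values and the helper name `fill_missing` are illustrative assumptions and are not part of the original notebook; only the column names `POSTSEASON` and `ADJOE` come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) that mimics two columns of the dataset.
toy = pd.DataFrame({
    "POSTSEASON": ["R64", np.nan, "F4", np.nan],
    "ADJOE": [110.0, np.nan, 120.0, 100.0],
})

def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and all other columns with 'None'."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("None")
    return out

filled = fill_missing(toy)

# Binary target: 1 if the team has any postseason result, 0 otherwise.
target = (filled["POSTSEASON"] != "None").astype(int)

print(filled)
print(target.tolist())
```

Checking the dtype with `pd.api.types.is_numeric_dtype` rather than comparing against `'object'` is a design choice of this sketch; it may behave more predictably across pandas versions that represent text columns differently.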

• Visualization:

From the correlation heatmap:

Strong positive correlations:
BARTHAG correlates positively with ADJOE and EFG_O.
EFG_O and ADJOE are positively linked: teams with a higher effective field goal percentage tend to
have higher offensive efficiency.
Negative correlations:
ADJDE correlates negatively with the offensive metrics (lower ADJDE values indicate better defense).
TOR correlates negatively with the offensive metrics: teams with higher turnover rates tend to have
lower offensive efficiency.

Distribution plots:
Offensive efficiency (ADJOE): approximately normal, indicating that most teams sit near the league
average with few extremes.
Defensive efficiency (ADJDE): also approximately normal but with a tighter spread, suggesting less
variation across teams.
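
The heatmap and histograms support these readings visually; a numerical check can back them up. The sketch below uses synthetic data (the arrays, the seed, and the column names `shooting` and `offense` are assumptions for illustration, not values from the dataset) to show how a Pearson correlation and a skewness value might be computed with pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: 'shooting' drives 'offense' plus some noise.
shooting = rng.normal(loc=50.0, scale=3.0, size=1000)
offense = 2.0 * shooting + rng.normal(loc=0.0, scale=2.0, size=1000)
demo = pd.DataFrame({"shooting": shooting, "offense": offense})

# Pearson correlation between the two columns (what a heatmap cell shows).
r = demo["shooting"].corr(demo["offense"])

# Sample skewness: values near 0 are consistent with a symmetric,
# roughly normal shape; large absolute values indicate a long tail.
skew = demo["offense"].skew()

print(f"correlation = {r:.2f}, skewness = {skew:.2f}")
```

On the real data, the same two calls could be applied to `df['EFG_O']`, `df['ADJOE']`, and `df['ADJDE']` to put numbers behind the "approximately normal" and "positively linked" observations.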

• Normalization: StandardScaler was used to standardize the features (zero mean, unit variance) before training.
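
One caveat: the code above fits the scaler on the full dataset before splitting, so test-set statistics influence the scaling. A common alternative is to fit the scaler on the training split only, for example through a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in; it illustrates the technique and is not the procedure used for the reported results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and binary target.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training split only, then reuses
# those training statistics to transform the test split at predict time.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_tr, y_tr)

scaler_step = pipe.named_steps["standardscaler"]
print("Scaler means match training means:",
      np.allclose(scaler_step.mean_, X_tr.mean(axis=0)))
print("Test accuracy:", pipe.score(X_te, y_te))
```

With about 1,400 rows the effect of this leakage on the reported accuracies is probably small, but the pipeline form also keeps scaling and model together when cross-validating.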

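• Model training and validation:
Four models were trained on an 80/20 train/test split: KNN (k = 5), Decision Tree, SVM (RBF kernel),
and Logistic Regression. Their classification reports on the 282-row test set follow.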

Classification Report for KNN:

              precision    recall  f1-score   support

           0       0.91      0.96      0.94       223
           1       0.81      0.66      0.73        59

    accuracy                           0.90       282
   macro avg       0.86      0.81      0.83       282
weighted avg       0.89      0.90      0.89       282

Classification Report for Decision Tree:

              precision    recall  f1-score   support

           0       0.92      0.92      0.92       223
           1       0.71      0.71      0.71        59

    accuracy                           0.88       282
   macro avg       0.82      0.82      0.82       282
weighted avg       0.88      0.88      0.88       282

Classification Report for SVM:

              precision    recall  f1-score   support

           0       0.93      0.97      0.95       223
           1       0.88      0.71      0.79        59

    accuracy                           0.92       282
   macro avg       0.90      0.84      0.87       282
weighted avg       0.92      0.92      0.92       282

Classification Report for Logistic Regression:

              precision    recall  f1-score   support

           0       0.94      0.97      0.96       223
           1       0.88      0.78      0.83        59

    accuracy                           0.93       282
   macro avg       0.91      0.88      0.89       282
weighted avg       0.93      0.93      0.93       282
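
Each figure in these reports follows from confusion-matrix counts. As a worked example, the counts below are inferred from the rounded Logistic Regression report (a reconstruction, not output copied from the notebook): 217 true negatives, 6 false positives, 13 false negatives, and 46 true positives are consistent with the reported class-1 row and overall accuracy.

```python
# Counts inferred from the rounded Logistic Regression report (an assumption,
# not output from the original run). Supports: 223 negatives, 59 positives.
tn, fp, fn, tp = 217, 6, 13, 46

precision = tp / (tp + fp)   # of predicted postseason teams, how many were right
recall = tp / (tp + fn)      # of actual postseason teams, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.6f}")
# prints: precision=0.88 recall=0.78 f1=0.83 accuracy=0.932624
```

Under these inferred counts, recall on class 1 is the weakest figure: 13 of the 59 postseason teams in the test set would have been missed.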

Comparative Accuracy:

                 Model  Accuracy
3  Logistic Regression  0.932624
2                  SVM  0.918440
0                  KNN  0.897163
1        Decision Tree  0.879433

Best Performing Model: Logistic Regression with Accuracy: 0.93
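
Because the classes are imbalanced (223 non-postseason teams versus 59 postseason teams in the test set), accuracy is easier to interpret next to a majority-class baseline. A minimal sketch of that comparison, using only the supports and accuracies reported above; the variable names are illustrative.

```python
# Test-set class supports and accuracies as reported in the document.
support = {"no_postseason": 223, "postseason": 59}
reported_accuracy = {
    "Logistic Regression": 0.932624,
    "SVM": 0.918440,
    "KNN": 0.897163,
    "Decision Tree": 0.879433,
}

# A classifier that always predicts the majority class scores this accuracy.
baseline = max(support.values()) / sum(support.values())
print(f"Majority-class baseline: {baseline:.3f}")

# Improvement of each model over the baseline, in percentage points.
gains = {name: (acc - baseline) * 100 for name, acc in reported_accuracy.items()}
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: +{gain:.1f} points")
```

All four models beat the baseline of about 0.79, and Logistic Regression does so by the widest margin, which supports the ranking above.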

