G. H. RAISONI COLLEGE OF ENGINEERING, NAGPUR
(An Autonomous Institute affiliated to RTM Nagpur University)
Department of Computer Science & Engineering
Session: Summer 2023
Date :
Practical Details: Practical No. 3
Student Details:
Roll Number 56
Name Sanket Jambhulkar
Semester 5th
Section B
Branch CSE
Subject MLA
Aim: Write a Python program to classify the given dataset using Logistic Regression and
evaluate the model.
Tools: PIMA diabetes Dataset, Python, Kaggle Jupyter Notebook
Theory: Introduction:
Logistic Regression is a fundamental algorithm used for binary classification tasks. Despite its
name, it's a classification algorithm rather than a regression one. In this theoretical explanation,
we will delve into the concept of Logistic Regression, its underlying principles, and how it is
used for classification tasks. Furthermore, we will discuss the process of evaluating a Logistic
Regression model's performance.
1) Logistic Regression:
Logistic Regression is a statistical method used for predicting the probability of a binary
outcome.
It models the probability that a given input belongs to a particular class.
The logistic (sigmoid) function maps a linear combination of the input features to the open
interval (0, 1), so its output can be read as a probability.
Mathematically, the logistic function is expressed as:
σ(z) = 1 / (1 + e^(-z)), where z = w^T x + b, w is the weight vector, x is the feature
vector, and b is the bias term.
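As a small illustration of the formula above, the following sketch evaluates σ(z) with NumPy. The weights, bias and feature vector are hypothetical values chosen only to make the arithmetic easy to follow; they are not taken from the practical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and feature vector (illustration only).
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([2.0, 4.0])

z = w @ x + b      # linear score z = w^T x + b
p = sigmoid(z)     # estimated probability of class 1
print("z =", z, "probability =", p)
```

A score of z = 0 gives a probability of exactly 0.5; larger positive scores push the probability towards 1 and larger negative scores push it towards 0.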
2) Training Process:
In the training process, the Logistic Regression model learns the optimal weights and
bias that minimize a predefined loss function, typically the logistic loss or cross-entropy
loss.
This process involves iterative optimization algorithms such as gradient descent, where
the model iteratively updates the weights and bias to minimize the loss function.
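The training loop described above can be sketched with plain batch gradient descent on the cross-entropy loss. This is a minimal sketch on synthetic data, not the procedure scikit-learn uses (its LogisticRegression relies on other solvers by default); the function names, learning rate and iteration count are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Mean cross-entropy between labels y and probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Batch gradient descent; lr and n_iter are illustrative choices."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)               # current predicted probabilities
        error = p - y                        # gradient of the loss w.r.t. z
        w -= lr * (X.T @ error) / n_samples  # update weights
        b -= lr * error.mean()               # update bias
    return w, b

# Synthetic two-feature data (assumption: not the practical's dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(float)

w, b = fit_logistic(X_toy, y_toy)
initial_loss = log_loss(y_toy, sigmoid(np.zeros(len(y_toy))))
final_loss = log_loss(y_toy, sigmoid(X_toy @ w + b))
print("loss before training:", initial_loss)
print("loss after training: ", final_loss)
```

With all parameters at zero every probability is 0.5, so the starting loss is ln 2; the loop then lowers the loss by moving the weights against the gradient.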
3) Classification:
After training, the Logistic Regression model uses the learned parameters to predict the
probability that a given input belongs to the positive class (class 1).
If the predicted probability is greater than a predefined threshold (usually 0.5), the input
is classified as belonging to the positive class; otherwise, it is classified as belonging to
the negative class (class 0).
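The thresholding rule above takes only a line of NumPy. The probabilities below are made-up values for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class.
probs = np.array([0.10, 0.45, 0.50, 0.51, 0.90])

threshold = 0.5
labels = (probs > threshold).astype(int)   # 1 only if strictly above the threshold
print(labels)                              # [0 0 0 1 1]

# Lowering the threshold labels more inputs as positive.
labels_low = (probs > 0.3).astype(int)
print(labels_low)                          # [0 1 1 1 1]
```

Moving the threshold trades false positives against false negatives, which is why 0.5 is a common default rather than a fixed rule.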
4) Model Evaluation:
Several metrics are commonly used to evaluate the performance of a Logistic Regression
model, including accuracy, precision, recall, F1-score, and area under the ROC curve
(AUC-ROC).
Accuracy measures the proportion of correctly classified instances out of the total
instances.
Precision measures the proportion of true positive predictions among all positive
predictions.
Recall measures the proportion of true positive predictions among all actual positive
instances.
F1-score is the harmonic mean of precision and recall and provides a balanced measure
of a model's performance.
AUC-ROC measures the area under the Receiver Operating Characteristic curve and
provides a comprehensive evaluation of the model's ability to discriminate between
positive and negative instances across different threshold values.
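To make the definitions above concrete, the following sketch computes accuracy, precision, recall and F1-score directly from the counts of true and false positives and negatives. The label arrays are hypothetical and chosen so that the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical true labels and predictions for ten instances.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:", accuracy)    # 0.8
print("precision:", precision)  # 0.75
print("recall:", recall)        # 0.75
print("f1:", f1)                # 0.75
```

AUC-ROC is not included here because it needs predicted probabilities or scores rather than hard labels.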
Predict diabetes using the Logistic Regression Classifier
Importing necessary libraries
scikit-learn ships several datasets that can be loaded by name. The following example
imports its diabetes dataset. This dataset contains data from diabetic patients, with
features such as BMI, age, blood pressure and glucose level. These features are used to
predict disease progression, which the dataset records as a continuous value rather than a
binary label.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import load_diabetes
Load the diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
Splitting the dataset into training and testing sets
To understand model performance, dividing the dataset into a training set and a test set is a
good strategy.
Let's split the dataset using the function train_test_split(). It takes the features, the
target, and the test set size (test_size=0.2 holds out 20% of the records). Additionally,
random_state fixes the seed of the random shuffle so that the same split is reproduced on
every run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Initialize and train the logistic regression model
Logistic Regression is a statistical analysis method adopted by Machine Learning.
It is used when the dependent variable is dichotomous (binary), meaning that it has only
two possible outcomes: for example, whether a person will survive an accident or not, or
whether a student will pass an exam or not. The target loaded above is not binary, so
scikit-learn fits a multiclass model that treats each distinct target value as a separate
class.
model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression()
Predicting on the test set
y_pred = model.predict(X_test)
y_pred
array([200., 178., 178., 178., 178., 200., 178., 200., 71., 200., 200.,
71., 71., 178., 71., 71., 178., 178., 71., 178., 200., 200.,
71., 178., 200., 178., 178., 178., 71., 71., 200., 200., 71.,
178., 200., 200., 71., 178., 178., 71., 71., 71., 71., 178.,
200., 200., 71., 71., 71., 200., 71., 71., 71., 71., 200.,
200., 200., 178., 200., 71., 200., 200., 71., 71., 200., 200.,
200., 178., 71., 71., 71., 178., 178., 71., 71., 178., 200.,
200., 178., 200., 200., 71., 71., 200., 71., 71., 71., 71.,
200.])
Evaluating the model
A classification report summarizes a classification model's precision, recall, F1 score and
support, and is the usual way to evaluate a classifier that predicts discrete class labels.
The code below instead computes three error measures on the numeric target values: mean
squared error (MSE), mean absolute error (MAE) and R-squared. These treat the predicted
labels as numbers, so they show how far the predictions fall from the true values rather
than how often the predicted class is correct.
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)
print("R-squared:", r2)
Mean Squared Error: 5691.91011235955
Mean Absolute Error: 61.640449438202246
R-squared: 0.07431994826369315
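For comparison, the sketch below shows how the same scikit-learn diabetes data could be evaluated as a binary classification task with accuracy and a classification report. The binary label is an assumption introduced only for this illustration: patients are marked 1 when their progression value lies above the median and 0 otherwise. A dataset that already carries a binary outcome, such as the PIMA dataset named under Tools, would not need this step.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# ASSUMPTION: a median split turns the continuous target into a 0/1 label.
y_binary = (y > np.median(y)).astype(int)

X_train, X_test, yb_train, yb_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, yb_train)
yb_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(yb_test, yb_pred))
print(classification_report(yb_test, yb_pred))
```

The report lists precision, recall, F1-score and support for each of the two classes, which matches the metrics described in the theory section.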
Plotting ROC and Precision-Recall curves
Predicting probabilities instead of class labels gives flexibility: the probabilities can be
interpreted using different thresholds, which lets the operator of the model trade off the
errors it makes, such as the number of false positives against the number of false negatives.
This matters when one type of error costs more than the other.
Two diagnostic tools that help interpret probabilistic forecasts for binary (two-class)
classification problems are ROC curves and Precision-Recall curves.
Both curves require binary labels and predicted probabilities. The target used here is not
binary, so the code below plots the predicted values against the true values instead.
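import matplotlib.pyplot as plt
# Scatter of the true target values against the model's predictions
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred)
# Dashed diagonal marks perfect agreement between true and predicted values
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=4)
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.show()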
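ROC and Precision-Recall curves need binary labels and predicted probabilities. The sketch below computes the points of both curves with scikit-learn. As an assumption made only for this illustration, the continuous diabetes target is binarized at its median; the variable names are likewise illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
y_binary = (y > np.median(y)).astype(int)  # ASSUMPTION: median split

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_binary, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]     # probability of the positive class

# Points of the two curves, one per candidate threshold.
fpr, tpr, roc_thresholds = roc_curve(y_te, scores)
precision, recall, pr_thresholds = precision_recall_curve(y_te, scores)
auc_value = roc_auc_score(y_te, scores)

print("AUC-ROC:", auc_value)
```

The curves can then be drawn with plt.plot(fpr, tpr) for the ROC curve and plt.plot(recall, precision) for the Precision-Recall curve.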
Confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Define number of bins and bin edges
num_bins = 10
bin_edges = np.linspace(min(min(y_test), min(y_pred)),
                        max(max(y_test), max(y_pred)), num_bins + 1)
# Create bins
y_test_bins = np.digitize(y_test, bin_edges)
y_pred_bins = np.digitize(y_pred, bin_edges)
# Create confusion matrix
conf_matrix = np.zeros((num_bins, num_bins))
for i in range(len(y_test)):
    # Adjust for potential boundary cases
    actual_index = min(y_test_bins[i], num_bins) - 1
    pred_index = min(y_pred_bins[i], num_bins) - 1
    conf_matrix[actual_index, pred_index] += 1
# Plot confusion matrix heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt='.0f', cmap='Blues', cbar=False)
plt.xlabel('Predicted Bin')
plt.ylabel('Actual Bin')
plt.title('Confusion Matrix (Binned)')
plt.xticks(np.arange(num_bins) + 0.5, np.arange(1, num_bins + 1))
plt.yticks(np.arange(num_bins) + 0.5, np.arange(1, num_bins + 1))
plt.show()
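When the labels are binary, no binning is needed and the confusion matrix reduces to a 2 x 2 table. The sketch below uses scikit-learn's confusion_matrix on hypothetical labels that are not taken from the practical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (illustration only).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Rows are actual classes, columns are predicted classes, ordered 0 then 1.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
# [[5 1]
#  [1 3]]
```

The resulting array can be passed to sns.heatmap in the same way as the binned matrix.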
Conclusion: Logistic Regression is a powerful algorithm for binary classification tasks,
providing interpretable results and efficient computation. Understanding its principles
and the process of evaluating its performance is crucial for effectively applying it to
real-world datasets. By comprehensively evaluating the model's performance, we can
assess its suitability for the given task and make informed decisions about its
deployment.