Diabetes prediction using logistic regression

Last Updated : 31 Dec, 2025

Logistic regression is a classification algorithm that passes a weighted sum of the input features through a sigmoid function to estimate the probability of the positive class. It is simple, fast to train and easy to interpret. In this article we use logistic regression to predict diabetes: the model learns patterns from clinical features and estimates the likelihood that a patient has the disease.
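
To make the role of the sigmoid concrete, the short sketch below maps a few linear scores to probabilities and thresholds them at 0.5. The scores are made-up values for illustration, not outputs of the model trained later in this article.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score z to a value in the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical linear scores (weighted sum of features plus bias) for three patients
scores = np.array([-2.0, 0.0, 2.0])
probabilities = sigmoid(scores)

# Label a patient as positive (1) when the estimated probability is at least 0.5
predicted_labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predicted_labels)
```

A score of 0 sits exactly on the decision boundary (probability 0.5); negative scores fall below it and positive scores above it.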

Step-by-Step Implementation

Step 1: Import Required Libraries

Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Load the Dataset

  • The dataset is loaded from a CSV file using pandas
  • head() is used to preview the first few records

Python
df = pd.read_csv("diabetes_dataset.csv")
df.head(10)

Output:

[Figure: top 10 rows of the dataset]

Step 3: Data Inspection and Statistical Summary

  • isna() checks for missing values in each column
  • describe() provides statistical information like mean and standard deviation
Python
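# Count missing (NaN) values in each column
print(df.isna().sum())
# Summary statistics (count, mean, std, min, quartiles, max) per numeric column
print(df.describe())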
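
Note that isna() only counts values stored as NaN. In some clinical datasets a missing measurement is recorded as 0 instead, so it can be worth counting zeros in columns where 0 is not physiologically plausible. The sketch below uses a tiny made-up frame; the column names Glucose and BMI are assumptions and may not match your file.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for df; Glucose and BMI are assumed column names
sample = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# isna() finds nothing because the gaps are stored as 0, not NaN
missing_counts = sample.isna().sum()

# Count zeros in columns where a value of 0 is implausible
suspect_cols = ["Glucose", "BMI"]
zero_counts = (sample[suspect_cols] == 0).sum()

# Optionally mark those zeros as NaN so they can be imputed or dropped later
cleaned = sample.copy()
cleaned[suspect_cols] = cleaned[suspect_cols].replace(0, np.nan)

print(missing_counts)
print(zero_counts)
print(cleaned.isna().sum())
```

Whether zeros really mean "missing" depends on how the dataset was collected, so check its documentation before replacing them.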

Step 4: Feature and Target Separation

  • Independent variables are stored in X
  • Dependent variable (Outcome) is stored in y
  • Supervised models are fitted on feature-label pairs, so X and y are kept separate
  • Both are split into training and testing sets in the next step
Python
X = df.drop(columns=['Outcome'])
y = df['Outcome']

Step 5: Train-Test Split

  • The data is split into training and testing sets
  • 70% of the data is used for training and 30% for testing
  • random_state fixes the shuffle so the split is reproducible
  • Holding out a test set lets us evaluate the model on data it has not seen
Python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=11
)
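
If the two classes are imbalanced, a purely random split can leave the training and test sets with different class proportions. One option, not used in the code above, is the stratify parameter of train_test_split, which keeps the class ratio the same in both sets. A minimal sketch on synthetic data (X_demo and y_demo are stand-ins, not the diabetes dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 4 features, 30% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo preserves the 30/70 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=11, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

To apply this to the article's split, you would pass stratify=y in the call above.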

Step 6: Feature Scaling

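  • Standardization is applied using StandardScaler, which rescales each feature to zero mean and unit variance
  • Scaling helps the solver converge faster and keeps the default L2 regularization from penalizing features unevenly because of their units
  • The scaler is fitted on the training data only; the same transformation is then applied to the test data so that no test-set statistics leak into training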
Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
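
The following toy example illustrates what fitting on the training data only means in practice: the mean and standard deviation learned from the training rows are reused for the test row, so the test values are not forced to zero mean themselves. The arrays are made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test)        # reuses the training mean and std

print("Learned means:", scaler_demo.mean_)
print("Scaled training column means:", train_scaled.mean(axis=0))
print("Scaled test row:", test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the test row is simply expressed in training-set standard deviations from the training mean.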

Step 7: Logistic Regression Model Training

  • LogisticRegression from scikit-learn is used for binary classification
  • max_iter is raised from the default of 100 to 2000 to give the solver enough iterations to converge
  • The model is trained on the scaled training data
Python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=2000)
model.fit(X_train_scaled, y_train)
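
Once fitted, the model exposes its learned weights, which is where the interpretability of logistic regression comes from. The sketch below is an illustration on synthetic data generated with make_classification (not the diabetes dataset): it reads the coefficients, converts them to odds ratios, and uses predict_proba to get probabilities instead of hard labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem standing in for the scaled training data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=11
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

clf = LogisticRegression(max_iter=2000)
clf.fit(X_demo_scaled, y_demo)

# One coefficient per feature; exp(coefficient) is the odds ratio
odds_ratios = np.exp(clf.coef_[0])

# Each row holds [P(class 0), P(class 1)] for one sample
proba = clf.predict_proba(X_demo_scaled[:5])

print("Coefficients:", clf.coef_[0])
print("Odds ratios:", odds_ratios)
print("Probabilities for the first 5 samples:")
print(proba)
```

On standardized features, an odds ratio above 1 means that a one standard deviation increase in that feature raises the odds of the positive class, and a value below 1 lowers them. On the article's model, the same information is available through model.coef_ and model.predict_proba(X_test_scaled).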

Step 8: Model Prediction and Evaluation

  • Predictions for the test set are stored in y_pred
  • Accuracy score measures the fraction of test samples classified correctly
  • Classification report shows precision, recall and F1-score for each class
  • confusion_matrix is imported here and used in the next step to break the results down by actual and predicted class
Python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict class labels for the scaled test set
y_pred = model.predict(X_test_scaled)

print("Accuracy of this model:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

[Figure: model prediction and evaluation output]

Step 9: Confusion Matrix Visualization

  • Confusion matrix is visualized using a heatmap
  • It shows true positives, true negatives, false positives, and false negatives
Python
cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.xlabel("Predicted Labels")
plt.ylabel("Actual Labels")
plt.show()

Output:

[Figure: confusion matrix heatmap]
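
The four cells of the confusion matrix are enough to recompute the headline metrics by hand, which can help when reading the heatmap. A small sketch with made-up labels (for binary labels 0 and 1, scikit-learn orders the flattened matrix as TN, FP, FN, TP):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up actual and predicted labels for ten patients
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many were correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

In a diabetes screening setting, false negatives (patients with the disease predicted as healthy) are usually the costlier error, so recall for the positive class deserves as much attention as overall accuracy.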
