linear_merged_pagenumber

The document outlines a Jupyter notebook for performing logistic regression analysis on a diabetes dataset. It includes steps for importing libraries, loading data, exploring the dataset, handling missing values, and visualizing data relationships. The notebook also demonstrates the use of LassoCV for feature selection and importance visualization.

Uploaded by

rihab.ahmed.ug23
Copyright
© All Rights Reserved
4/3/25, 2:14 AM diabetesLogisticRegression.ipynb - Colab

Diabetes Logistic Regression


Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
%matplotlib inline
from sklearn.preprocessing import StandardScaler

Importing and getting info on Data


# prompt: open a .csv file with pandas

import pandas as pd

# Assumes 'diabetes.csv' is in the current working directory;
# change the path if the file is stored elsewhere.
try:
    df = pd.read_csv('diabetes.csv')
except FileNotFoundError:
    print("Error: 'diabetes.csv' not found. Please upload the file or provide the correct path.")
except pd.errors.ParserError:
    print("Error: Could not parse the CSV file. Please check its format.")

getting info on data


#Explore the dataset
df.head(10)

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
5            5      116             74              0        0  25.6                     0.201   30        0
6            3       78             50             32       88  31.0                     0.248   26        1
7           10      115              0              0        0  35.3                     0.134   29        0
8            2      197             70             45      543  30.5                     0.158   53        1
9            8      125             96              0        0   0.0                     0.232   54        1


 


df.tail(10)

https://colab.research.google.com/drive/1sFPGSs1MHNRsQl-nBT-IG-pm5vsLLP1D#scrollTo=W-eyPJRTKema&printMode=true 1/8 14

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
758            1      106             76              0        0  37.5                     0.197   26        0
759            6      190             92              0        0  35.5                     0.278   66        1
760            2       88             58             26       16  28.4                     0.766   22        0
761            9      170             74             31        0  44.0                     0.403   43        1
762            9       89             62              0        0  22.5                     0.142   33        0
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0


 

df.describe() #To know more about the dataset

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000
 

df.shape  # (number of rows, number of columns) of the DataFrame

(768, 9)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

df.eq(0).sum()

Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64
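
With 768 rows in the dataset, the zero counts above translate directly into a share of each column. The following is a minimal sketch that uses only the counts printed above; the names `N_ROWS`, `zero_counts` and `zero_share` are introduced here for illustration and are not part of the notebook.

```python
# Zero counts copied from the df.eq(0).sum() output; 768 rows comes from df.shape.
N_ROWS = 768
zero_counts = {
    "Glucose": 5,
    "BloodPressure": 35,
    "SkinThickness": 227,
    "Insulin": 374,
    "BMI": 11,
}

# Percentage of rows in which each column is zero, rounded to one decimal place.
zero_share = {col: round(100 * n / N_ROWS, 1) for col, n in zero_counts.items()}

for col, pct in zero_share.items():
    print(f"{col:<15}{pct:>5}%")
```

Nearly half of the Insulin values are zero, so any fill value chosen for that column ends up standing in for a large part of it.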
 

prepare data


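split
# Features are every column except the target; 'Outcome' is the label
X = df.drop(columns=["Outcome"])
y = df["Outcome"]

# Split the data into 80% training and 20% testing.
# Note: this split runs before the missing-value handling in the next cell,
# so X_train and X_test keep the original zero values.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)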

handling missing values


# Replace zero values with NaN for columns where zero is not meaningful
columns_to_check = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[columns_to_check] = df[columns_to_check].replace(0, np.nan)

# Impute missing values with each column's median (computed over all rows of df)
df.fillna(df.median(), inplace=True)

df.head(10)

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6    148.0           72.0           35.0    125.0  33.6                     0.627   50        1
1            1     85.0           66.0           29.0    125.0  26.6                     0.351   31        0
2            8    183.0           64.0           29.0    125.0  23.3                     0.672   32        1
3            1     89.0           66.0           23.0     94.0  28.1                     0.167   21        0
4            0    137.0           40.0           35.0    168.0  43.1                     2.288   33        1
5            5    116.0           74.0           29.0    125.0  25.6                     0.201   30        0
6            3     78.0           50.0           32.0     88.0  31.0                     0.248   26        1
7           10    115.0           72.0           29.0    125.0  35.3                     0.134   29        0
8            2    197.0           70.0           45.0    543.0  30.5                     0.158   53        1
9            8    125.0           96.0           29.0    125.0  32.3                     0.232   54        1
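
The imputation above takes medians over the whole DataFrame, so rows that later land in the test set also contribute to the fill values. One possible alternative, sketched here on a small made-up frame rather than diabetes.csv, is to compute the medians on the training rows only and reuse them for the test rows. The helper name `impute_with_train_medians` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_with_train_medians(train, test, columns):
    """Treat zeros in `columns` as missing and fill them with training medians."""
    train = train.copy()
    test = test.copy()
    train[columns] = train[columns].replace(0, np.nan)
    test[columns] = test[columns].replace(0, np.nan)
    medians = train[columns].median()  # statistics come from the training rows only
    train[columns] = train[columns].fillna(medians)
    test[columns] = test[columns].fillna(medians)
    return train, test, medians


# Toy data (hypothetical values, not taken from diabetes.csv)
toy_train = pd.DataFrame({"Glucose": [100, 0, 140, 120], "BMI": [30.0, 25.0, 0.0, 35.0]})
toy_test = pd.DataFrame({"Glucose": [0, 90], "BMI": [28.0, 0.0]})

filled_train, filled_test, medians = impute_with_train_medians(
    toy_train, toy_test, ["Glucose", "BMI"]
)
print(medians)
print(filled_test)
```

In the notebook this would mean splitting first and imputing second, with the same medians applied to both splits.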


visualising data


# prompt: get categorical columns of the df and visualise them

# Get categorical columns (every column in this dataset is numeric, so the result is empty)
categorical_cols = df.select_dtypes(include=['object']).columns
print(categorical_cols)

# Visualize categorical columns using countplots
for col in categorical_cols:
    plt.figure(figsize=(8, 6))
    sns.countplot(x=col, data=df, hue='Outcome')  # 'Outcome' is the target variable
    plt.title(f'Distribution of {col} by Outcome')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.show()

Index([], dtype='object')

# Pairplot to visualize relationships between features


sns.pairplot(df, hue=df.columns[-1])
plt.show()
print()
print()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()
print()
print()

# Split into features and target (df now holds the imputed values)
X = df.iloc[:, :-1]  # All columns except last one as features
y = df.iloc[:, -1]   # Last column as target

# Split into train and test sets.
# Lowercase x_train/x_test are separate from the X_train/X_test created earlier.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data: fit the scaler on the training set only
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Apply LassoCV
lassocv = LassoCV(alphas=None, cv=10, max_iter=100000)
lassocv.fit(x_train, y_train)

# Print best alpha


print("Best Alpha from LassoCV:", lassocv.alpha_)

# Visualizing feature importance


feature_importance = pd.Series(abs(lassocv.coef_), index=X.columns)
feature_importance = feature_importance.sort_values(ascending=False)

plt.figure(figsize=(10, 5))
sns.barplot(x=feature_importance.values, y=feature_importance.index, palette="viridis")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("Feature Importance from Lasso Regression")
plt.show()

Best Alpha from LassoCV: 0.016416819292315595
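
LassoCV fits a linear model with an L1 penalty, so the coefficients of weak features can shrink to exactly zero, and ranking by absolute coefficient gives the importance chart. The following is a hedged sketch on synthetic data: the `make_classification` settings and the `f0` to `f7` feature names are assumptions for illustration, not the diabetes features. It lists which coefficients survive the penalty.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the data: 8 features, only 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

# Scale first so the L1 penalty treats every feature on the same footing.
X_scaled = StandardScaler().fit_transform(X_demo)

lasso = LassoCV(cv=5, max_iter=100000, random_state=0).fit(X_scaled, y_demo)

# Features whose coefficient is exactly zero were dropped by the penalty.
kept = [n for n, c in zip(feature_names, lasso.coef_) if c != 0]
dropped = [n for n, c in zip(feature_names, lasso.coef_) if c == 0]
print("alpha:", lasso.alpha_)
print("kept:", kept)
print("dropped:", dropped)
```

Because the target is a 0/1 label, the Lasso fit here serves as a feature-screening step rather than as the classifier itself.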


<ipython-input-15-3a22acdb2300>:39: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend

sns.barplot(x=feature_importance.values, y=feature_importance.index, palette="viridis")


plt.figure(figsize=(8, 6))
sns.scatterplot(x='Glucose', y='BMI', hue='Outcome', data=df, palette='Set2')
plt.title('Glucose vs. BMI by Outcome')
plt.show()

 

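Train the Logistic Regression Model


# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model.
# X_train is the unscaled split created before imputation, not the standardized
# x_train used for LassoCV; unscaled features are a common cause of the lbfgs
# ConvergenceWarning.
model.fit(X_train, y_train)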

/usr/local/lib/python3.11/dist-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(

LogisticRegression()
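
The ConvergenceWarning above suggests either raising `max_iter` or scaling the data. The following is a minimal sketch of the scaling route, assuming synthetic data from `make_classification` in place of diabetes.csv. Chaining `StandardScaler` and `LogisticRegression` with `make_pipeline` fits the scaler on the training split only and applies it again at prediction time.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight diabetes features (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_demo = X_demo * 100 + 50  # deliberately put the features on a large, unscaled range

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted inside the pipeline, on the training rows only.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, clf.predict(X_te))
n_iter = clf.named_steps["logisticregression"].n_iter_[0]
print(f"accuracy={demo_accuracy:.2f}, lbfgs iterations={n_iter}")
```

The numbers printed by this sketch describe the synthetic data only and say nothing about the diabetes results reported in this notebook.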
 

Make Predictions


# Make predictions on the test set
y_pred = model.predict(X_test)

Evaluate the Model


Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


Accuracy: 0.75

Confusion Matrix


conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(conf_matrix)

Confusion Matrix:
[[78 21]
 [18 37]]

Classification Report


class_report = classification_report(y_test, y_pred)
print('Classification Report:')
print(class_report)

Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.79      0.80        99
           1       0.64      0.67      0.65        55

    accuracy                           0.75       154
   macro avg       0.73      0.73      0.73       154
weighted avg       0.75      0.75      0.75       154
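
Every figure in the classification report can be recomputed by hand from the four cells of the confusion matrix. The following is a small sketch of that arithmetic; the variable names are chosen here for illustration, and the layout (rows are actual classes, columns are predicted classes) follows scikit-learn's `confusion_matrix` convention.

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted.
conf = [[78, 21],
        [18, 37]]

tn, fp = conf[0]
fn, tp = conf[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Class 1: precision = TP / (TP + FP), recall = TP / (TP + FN)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 treats "0" as the positive label, so the roles of the cells swap.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy  {accuracy:.2f}")
print(f"class 0   precision {precision_0:.2f}  recall {recall_0:.2f}  f1 {f1_0:.2f}")
print(f"class 1   precision {precision_1:.2f}  recall {recall_1:.2f}  f1 {f1_1:.2f}")
```

The row sums, 99 and 55, are the support values in the report, and the unweighted mean of the two per-class scores gives the macro average.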

Visualize the confusion matrix


sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Naive Bayes
