0% found this document useful (0 votes)

7 views

Session-11 Machine Learning - Jupyter Notebook

The document discusses the differences between linear regression and logistic regression, emphasizing that linear regression is not suitable for classification problems due to the nature of categorical data. It explains logistic regression as a binary classifier that calculates probabilities using the sigmoid function, which ensures outputs are between 0 and 1. Additionally, it covers evaluation metrics for classification models, such as accuracy, recall, precision, F1 score, and the confusion matrix.

Uploaded by

shrikantrathod9663

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views

Session-11 Machine Learning - Jupyter Notebook

Uploaded by

shrikantrathod9663

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook

Introduction
In linear regression, the type of data we deal with is quantitative, whereas we use classification models to deal with qualitative data or categorical data.
The algorithms used for solving a classification problem first predict the probability of each of the categories of the qualitative variables, as the basis for
making the classification. And, as the probabilities are continuous numbers, classification using probabilities also behave like regression methods.
Logistic regression is one such type of classification model which is used to classify the dependent variable into two or more classes or categories.

Why don’t we use Linear regression for classification problems?

Let’s suppose you took a survey and noted the response of each person as satisfied, neutral or Not satisfied. Let’s map each category:

Satisfied – 2

Neutral – 1

Not Satisfied – 0

But this doesn’t mean that the gap between Not satisfied and Neutral is same as Neutral and satisfied. There is no mathematical significance of these
mapping. We can also map the categories like:

Satisfied – 0

Neutral – 1

Not Satisfied – 2

It’s completely fine to choose the above mapping. If we apply linear regression to both the type of mappings, we will get different sets of predictions. Also,
we can get prediction values like 1.2, 0.8, 2.3 etc. which makes no sense for categorical values. So, there is no normal method to convert qualitative data
into quantitative data for use in linear regression. Although, for binary classification, i.e. when there only two categorical values, using the least square
method can give decent results. Suppose we have two categories Black and White and we map them as follows:

Black – 0

White - 1

We can assign predicted values for both the categories such as Y> 0.5 goes to class white and vice versa. Although, there will be some predictions for
which the value can be greater than 1 or less than 0 making them hard to classify in any class. Nevertheless, linear regression can work decently for
binary classification but not that well for multi-class classification. Hence, we use classification methods for dealing with such problems.

Logistic Regression
Logistic regression is one such regression algorithm which can be used for performing classification problems. It calculates the probability that a given
value belongs to a specific class. If the probability is more than 50%, it assigns the value in that particular class else if the probability is less than 50%, the
value is assigned to the other class. Therefore, we can say that logistic regression acts as a binary classifier.

Working of a Logistic Model

For linear regression, the model is defined by: 𝑦 = 𝛽0 + 𝛽1 𝑥 - (i)

and for logistic regression, we calculate probability, i.e. y is the probability of a given variable x belonging to a certain class. Thus, it is obvious that the
value of y should lie between 0 and 1.

But, when we use equation(i) to calculate probability, we would get values less than 0 as well as greater than 1. That doesn’t make any sense . So, we
need to use such an equation which always gives values between 0 and 1, as we desire while calculating the probability.

Sigmoid function

We use the sigmoid function as the underlying function in Logistic regression. Mathematically and graphically.

Why do we use the Sigmoid Function?

1. The sigmoid function’s range is bounded between 0 and 1. Thus it’s useful in calculating the probability for the Logistic function.
2. It’s derivative is easy to calculate than other functions which is useful during gradient descent calculation.
3. It is a simple way of introducing non-linearity to the model.

Although there are other functions as well, which can be used, but sigmoid is the most common function used for logistic regression. We will talk about
the rest of the functions in the neural network section.

Evaluation of a Classification Model

In machine learning, once we have a result of the classification problem, how do we measure how accurate our classification is? For a regression
problem, we have different metrics like R Squared score, Mean Squared Error etc. what are the metrics to measure the credibility of a classification
model?

Metrics In a regression problem, the accuracy is generally measured in terms of the difference in the actual values and the predicted values. In a
classification problem, the credibility of the model is measured using the confusion matrix generated, i.e., how accurately the true positives and true
negatives were predicted. The different metrics used for this purpose are:

Accuracy
Recall

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 1/11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook
Precision
F1 Score
Specifity
AUC( Area Under the Curve)
ROC(Receiver Operator Characteristic)
Classification Report

Confusion Matrix
Where the terms have the meaning:

 True Positive(TP): A result that was predicted as positive by the classification model and also is positive

 True Negative(TN): A result that was predicted as negative by the classification model and also is negative

 False Positive(FP): A result that was predicted as positive by the classification model but actually is negative

 False Negative(FN): A result that was predicted as negative by the classification model but actually is positive.

The Credibility of the model is based on how many correct predictions did the model do.

Accuracy
The mathematical formula is :
(𝑇𝑃+𝑇𝑁)
Accuracy= (𝑇𝑃+𝑇𝑁+𝐹𝑃+𝐹𝑁)
Or, it can be said that it’s defined as the total number of correct classifications divided by the total number of classifications. Its is not the correct for
inbalanc data beacause its always show you high accurancy becoz its bais to the high count data in binary classification becoz its not calculate the error /
its won't count the error

Recall or Sensitivity

The mathematical formula is:

𝑇𝑃
Recall= (𝑇𝑃+𝐹𝑁)
Or, as the name suggests, it is a measure of: from the total number of positive results how many positives were correctly predicted by the model.

It shows how relevant the model is, in terms of positive results only.

Consider a classification model , the model gave 50 correct predictions(TP) but failed to identify 200 cancer patients(FN). Recall in that case will be:

50
Recall= (50+200) = 0.2 (The model was able to recall only 20% of the cancer patients)
Precision
Precision is a measure of amongst all the positive predictions, how many of them were actually positive. Mathematically,

𝑇𝑃
Precision= (𝑇𝑃+𝐹𝑃)
Let’s suppose in the previous example, the model identified 50 people as cancer patients(TP) but also raised a false alarm for 100 patients(FP). Hence,

50
Precision= (50+100) =0.33 (The model only has a precision of 33%)
But we have a problem!!
As evident from the previous example, the model had a very high Accuracy but performed poorly in terms of Precision and Recall. So, necessarily
Accuracy is not the metric to use for evaluating the model in this case.

Imagine a scenario, where the requirement was that the model recalled all the defaulters who did not pay back the loan. Suppose there were 10 such
defaulters and to recall those 10 defaulters, and the model gave you 20 results out of which only the 10 are the actual defaulters. Now, the recall of the
model is 100%, but the precision goes down to 50%.

F1 Score
From the previous examples, it is clear that we need a metric that considers both Precision and Recall for evaluating a model. One such metric is the F1
score.

F1 score is defined as the harmonic mean of Precision and Recall.

2∗((𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛∗𝑅𝑒𝑐𝑎𝑙𝑙)
The mathematical formula is: F1 score= (𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛+𝑅𝑒𝑐𝑎𝑙𝑙))
Specificity or True Negative Rate
This represents how specific is the model while predicting the True Negatives. Mathematically,

𝑇𝑁
Specificity= (𝑇𝑁+𝐹𝑃) Or, it can be said that it quantifies the total number of negatives predicted by the model with respect to the total number of actual
negative or non favorable outcomes.

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 2/11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook
𝐹𝑃
Similarly, False Positive rate can be defined as: (1- specificity) Or, (𝑇𝑁+𝐹𝑃)
ROC(Receiver Operator Characteristic)
We know that the classification algorithms work on the concept of probability of occurrence of the possible outcomes. A probability value lies between 0
and 1. Zero means that there is no probability of occurrence and one means that the occurrence is certain.

But while working with real-time data, it has been observed that we seldom get a perfect 0 or 1 value. Instead of that, we get different decimal values
lying between 0 and 1. Now the question is if we are not getting binary probability values how are we actually determining the class in our classification
problem?

There comes the concept of Threshold. A threshold is set, any probability value below the threshold is a negative outcome, and anything more than the
threshold is a favourable or the positive outcome. For Example, if the threshold is 0.5, any probability value below 0.5 means a negative or an
unfavourable outcome and any value above 0.5 indicates a positive or favourable outcome.

Now, the question is, what should be an ideal threshold?

The horizontal lines represent the various values of thresholds ranging from 0 to 1.
Let’s suppose our classification problem was to identify the obese people from the given data.
The green markers represent obese people and the red markers represent the non-obese people.
Our confusion matrix will depend on the value of the threshold chosen by us.
For Example, if 0.25 is the threshold then TP(actually obese)=3 TN(Not obese)=2 FP(Not obese but predicted obese)=2(the two red squares above
the 0.25 line) FN(Obese but predicted as not obese )=1(Green circle below 0.25line )

business case: to predict weather a a patient will have a diaabetes or not

In [4]: # libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [5]: data = pd.read_csv('diabetes.csv')

In [6]: data

Out[6]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

0 6 148 72 35 0 33.6 0.627 50 1

1 1 85 66 29 0 26.6 0.351 31 0

2 8 183 64 0 0 23.3 0.672 32 1

3 1 89 66 23 94 28.1 0.167 21 0

4 0 137 40 35 168 43.1 2.288 33 1

... ... ... ... ... ... ... ... ... ...

763 10 101 76 48 180 32.9 0.171 63 0

764 2 122 70 27 0 36.8 0.340 27 0

765 5 121 72 23 112 26.2 0.245 30 0

766 1 126 60 0 0 30.1 0.349 47 1

767 1 93 70 31 0 30.4 0.315 23 0

768 rows × 9 columns

In [7]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 3/11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook

In [10]: data.describe()

Out[10]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000

mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958

std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951

min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000

25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000

50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000

75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000

max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

EDA
In [12]: # Univariate analysis

In [13]: !pip install sweetviz

Requirement already satisfied: sweetviz in c:\users\shrik\anaconda3\lib\site-packages (2.3.0)

Requirement already satisfied: pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3 in c:\users\shrik\anaconda3\lib\site-packages (from s
weetviz) (2.0.3)
Requirement already satisfied: numpy>=1.16.0 in c:\users\shrik\anaconda3\lib\site-packages (from sweetviz) (1.24.3)
Requirement already satisfied: matplotlib>=3.1.3 in c:\users\shrik\anaconda3\lib\site-packages (from sweetviz) (3.7.2)
Requirement already satisfied: tqdm>=4.43.0 in c:\users\shrik\anaconda3\lib\site-packages (from sweetviz) (4.65.0)
Requirement already satisfied: scipy>=1.3.2 in c:\users\shrik\anaconda3\lib\site-packages (from sweetviz) (1.11.1)
Requirement already satisfied: jinja2>=2.11.1 in c:\users\shrik\anaconda3\lib\site-packages (from sweetviz) (3.1.2)
Requirement already satisfied: importlib-resources>=1.2.0 in c:\users\shrik\anaconda3\lib\site-packages (from sweetviz) (6.
1.1)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\shrik\anaconda3\lib\site-packages (from jinja2>=2.11.1->sweetvi
z) (2.1.1)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\shrik\anaconda3\lib\site-packages (from matplotlib>=3.1.3->swee
tviz) (1.0.5)
Requirement already satisfied: cycler>=0.10 in c:\users\shrik\anaconda3\lib\site-packages (from matplotlib>=3.1.3->sweetvi
z) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\shrik\anaconda3\lib\site-packages (from matplotlib>=3.1.3->swe
etviz) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\shrik\anaconda3\lib\site-packages (from matplotlib>=3.1.3->swe
etviz) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\shrik\anaconda3\lib\site-packages (from matplotlib>=3.1.3->sweet
viz) (23.1)
Requirement already satisfied: pillow>=6.2.0 in c:\users\shrik\anaconda3\lib\site-packages (from matplotlib>=3.1.3->sweetvi
z) (9.4.0)
Requirement already satisfied: pyparsing<3.1,>=2.3.1 in c:\users\shrik\anaconda3\lib\site-packages (from matplotlib>=3.1.3-
>sweetviz) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\shrik\anaconda3\lib\site-packages (from matplotlib>=3.1.3->
sweetviz) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\shrik\anaconda3\lib\site-packages (from pandas!=1.0.0,!=1.0.1,!=1.
0.2,>=0.25.3->sweetviz) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in c:\users\shrik\anaconda3\lib\site-packages (from pandas!=1.0.0,!=1.0.1,!=
1.0.2,>=0.25.3->sweetviz) (2023.3)
Requirement already satisfied: colorama in c:\users\shrik\anaconda3\lib\site-packages (from tqdm>=4.43.0->sweetviz) (0.4.6)
Requirement already satisfied: six>=1.5 in c:\users\shrik\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotli
b>=3.1.3->sweetviz) (1.16.0)

In [14]: import sweetviz as sv

my_report = sv.analyze(data)
my_report.show_html()

Done! Use 'show' commands to display/save. [100%] 00:01 -> (00:00 left)

Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS
saved in your notebook/colab files.

In [15]: data.head()

Out[15]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

0 6 148 72 35 0 33.6 0.627 50 1

1 1 85 66 29 0 26.6 0.351 31 0

2 8 183 64 0 0 23.3 0.672 32 1

3 1 89 66 23 94 28.1 0.167 21 0

4 0 137 40 35 168 43.1 2.288 33 1

In [16]: # bi variate analysis

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 4/11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook

In [17]: sns.countplot(x= data.Pregnancies, hue = data.Outcome)

plt.show()

In [18]: # missing values

In [19]: data.isnull().sum()

Out[19]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

In [20]: data.loc[data['BMI']==0]

Out[20]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

9 8 125 96 0 0 0.0 0.232 54 1

49 7 105 0 0 0 0.0 0.305 24 0

60 2 84 0 0 0 0.0 0.304 21 0

81 2 74 0 0 0 0.0 0.102 22 0

145 0 102 75 23 0 0.0 0.572 21 0

371 0 118 64 23 89 0.0 1.731 21 0

426 0 94 0 0 0 0.0 0.256 25 0

494 3 80 0 0 0 0.0 0.174 22 0

522 6 114 0 0 0 0.0 0.189 26 0

684 5 136 82 0 0 0.0 0.640 69 0

706 10 115 0 0 0 0.0 0.261 30 1

In [21]: data['BMI'].mean()

Out[21]: 31.992578124999998

In [22]: # replacing 0 values with either mean or meadian

data['BMI'] = data['BMI'].replace(0, data['BMI'].mean())
data['Glucose'] = data['Glucose'].replace(0, data['Glucose'].mean())
data['BloodPressure'] = data['BloodPressure'].replace(0, data['BloodPressure'].mean())
data['SkinThickness'] = data['SkinThickness'].replace(0, data['SkinThickness'].mean())
data['Insulin'] = data['Insulin'].replace(0, data['Insulin'].mean())

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 5/11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook

In [23]: data.describe()

Out[23]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000

mean 3.845052 121.681605 72.254807 26.606479 118.660163 32.450805 0.471876 33.240885 0.348958

std 3.369578 30.436016 12.115932 9.631241 93.080358 6.875374 0.331329 11.760232 0.476951

min 0.000000 44.000000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000 0.000000

25% 1.000000 99.750000 64.000000 20.536458 79.799479 27.500000 0.243750 24.000000 0.000000

50% 3.000000 117.000000 72.000000 23.000000 79.799479 32.000000 0.372500 29.000000 0.000000

75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000

max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

In [24]: data

Out[24]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

0 6 148.0 72.0 35.000000 79.799479 33.6 0.627 50 1

1 1 85.0 66.0 29.000000 79.799479 26.6 0.351 31 0

2 8 183.0 64.0 20.536458 79.799479 23.3 0.672 32 1

3 1 89.0 66.0 23.000000 94.000000 28.1 0.167 21 0

4 0 137.0 40.0 35.000000 168.000000 43.1 2.288 33 1

... ... ... ... ... ... ... ... ... ...

763 10 101.0 76.0 48.000000 180.000000 32.9 0.171 63 0

764 2 122.0 70.0 27.000000 79.799479 36.8 0.340 27 0

765 5 121.0 72.0 23.000000 112.000000 26.2 0.245 30 0

766 1 126.0 60.0 20.536458 79.799479 30.1 0.349 47 1

767 1 93.0 70.0 31.000000 79.799479 30.4 0.315 23 0

768 rows × 9 columns

outliers
In [30]: data.columns
data1 = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 6/11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook

In [32]: plt.figure(figsize=(20,25), facecolor= 'White')

plotnumber =1

for column in data:
if plotnumber <= 9:
ax = plt.subplot(3,3, plotnumber)
sns.boxplot(data[column])
plt.xlabel(column, fontsize = 20)
plotnumber +=1
plt.show()

feature selection

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 7/11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook

In [39]: tc = data.drop('Outcome', axis =1)

sns.heatmap(tc.corr(), annot = True)
plt.show()

In [40]: # Model Creation

In [41]: data

Out[41]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

0 6 148.0 72.0 35.000000 79.799479 33.6 0.627 50 1

1 1 85.0 66.0 29.000000 79.799479 26.6 0.351 31 0

2 8 183.0 64.0 20.536458 79.799479 23.3 0.672 32 1

3 1 89.0 66.0 23.000000 94.000000 28.1 0.167 21 0

4 0 137.0 40.0 35.000000 168.000000 43.1 2.288 33 1

... ... ... ... ... ... ... ... ... ...

763 10 101.0 76.0 48.000000 180.000000 32.9 0.171 63 0

764 2 122.0 70.0 27.000000 79.799479 36.8 0.340 27 0

765 5 121.0 72.0 23.000000 112.000000 26.2 0.245 30 0

766 1 126.0 60.0 20.536458 79.799479 30.1 0.349 47 1

767 1 93.0 70.0 31.000000 79.799479 30.4 0.315 23 0

768 rows × 9 columns

In [42]: X = data.drop('Outcome', axis=1)

y = data['Outcome']

In [43]: X

Out[43]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age

0 6 148.0 72.0 35.000000 79.799479 33.6 0.627 50

1 1 85.0 66.0 29.000000 79.799479 26.6 0.351 31

2 8 183.0 64.0 20.536458 79.799479 23.3 0.672 32

3 1 89.0 66.0 23.000000 94.000000 28.1 0.167 21

4 0 137.0 40.0 35.000000 168.000000 43.1 2.288 33

... ... ... ... ... ... ... ... ...

763 10 101.0 76.0 48.000000 180.000000 32.9 0.171 63

764 2 122.0 70.0 27.000000 79.799479 36.8 0.340 27

765 5 121.0 72.0 23.000000 112.000000 26.2 0.245 30

766 1 126.0 60.0 20.536458 79.799479 30.1 0.349 47

767 1 93.0 70.0 31.000000 79.799479 30.4 0.315 23

768 rows × 8 columns

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 8/11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook

In [44]: # scale the individual features

# 1. min-max scaler
# 2. standard scaler

In [45]: from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [46]: X_scaled

Out[46]: array([[ 0.63994726, 0.86527574, -0.0210444 , ..., 0.16725546,

0.46849198, 1.4259954 ],
[-0.84488505, -1.20598931, -0.51658286, ..., -0.85153454,
-0.36506078, -0.19067191],
[ 1.23388019, 2.01597855, -0.68176235, ..., -1.33182125,
0.60439732, -0.10558415],
...,
[ 0.3429808 , -0.02240928, -0.0210444 , ..., -0.90975111,
-0.68519336, -0.27575966],
[-0.84488505, 0.14197684, -1.01212132, ..., -0.34213954,
-0.37110101, 1.17073215],
[-0.84488505, -0.94297153, -0.18622389, ..., -0.29847711,
-0.47378505, -0.87137393]])

In [47]: from sklearn.model_selection import train_test_split

X_train, x_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25)

In [48]: from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

Out[48]: ▾ LogisticRegression
LogisticRegression()

In [50]: y_train_pred = log_reg.predict(X_train)

In [51]: y_train_pred

Out[51]: array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
1, 0, 0, 0], dtype=int64)

In [52]: y_pred = log_reg.predict(x_test)

In [53]: y_pred

Out[53]: array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0], dtype=int64)

In [55]: y_pred.shape

Out[55]: (192,)

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 9/11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook

In [56]: data

Out[56]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

0 6 148.0 72.0 35.000000 79.799479 33.6 0.627 50 1

1 1 85.0 66.0 29.000000 79.799479 26.6 0.351 31 0

2 8 183.0 64.0 20.536458 79.799479 23.3 0.672 32 1

3 1 89.0 66.0 23.000000 94.000000 28.1 0.167 21 0

4 0 137.0 40.0 35.000000 168.000000 43.1 2.288 33 1

... ... ... ... ... ... ... ... ... ...

763 10 101.0 76.0 48.000000 180.000000 32.9 0.171 63 0

764 2 122.0 70.0 27.000000 79.799479 36.8 0.340 27 0

765 5 121.0 72.0 23.000000 112.000000 26.2 0.245 30 0

766 1 126.0 60.0 20.536458 79.799479 30.1 0.349 47 1

767 1 93.0 70.0 31.000000 79.799479 30.4 0.315 23 0

768 rows × 9 columns

In [57]: # evaluating logistic regression model

In [72]: from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score, classif
 

In [59]: accuracy = accuracy_score(y_test, y_pred)

accuracy

Out[59]: 0.78125

In [60]: precision = precision_score(y_test, y_pred)

In [61]: precision

Out[61]: 0.7708333333333334

In [63]: recall = recall_score(y_test, y_pred)

In [64]: recall

Out[64]: 0.5441176470588235

In [66]: f1_score= f1_score(y_test, y_pred)

In [67]: f1_score

Out[67]: 0.6379310344827587

In [68]: pd.crosstab(y_test, y_pred)

Out[68]:
col_0 0 1

Outcome

0 113 11

1 31 37

In [70]: auc = roc_auc_score(y_test, y_pred)

In [71]: auc

Out[71]: 0.7277039848197343

In [73]: report = classification_report(y_test,y_pred)

In [75]: print(report)

precision recall f1-score support

0 0.78 0.91 0.84 124

1 0.77 0.54 0.64 68

accuracy 0.78 192

macro avg 0.78 0.73 0.74 192
weighted avg 0.78 0.78 0.77 192

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 10/11

1/31/24, 7:44 PM Session-11 Machine Learning - Jupyter Notebook

In [ ]:

localhost:8888/notebooks/Downloads/Session-11 Machine Learning.ipynb 11/11

2306 8MA0-01 AS Pure Mathematics - June 2023 (Word)
100% (6)
2306 8MA0-01 AS Pure Mathematics - June 2023 (Word)
10 pages
Session-11 Machine Learning
No ratings yet
Session-11 Machine Learning
27 pages
Practical - Logistic Regression
No ratings yet
Practical - Logistic Regression
84 pages
13 Logistic Regression Main
No ratings yet
13 Logistic Regression Main
14 pages
Part A Assignment - No - 5 PDF
No ratings yet
Part A Assignment - No - 5 PDF
8 pages
Lecture Notes 6 Logistic Regression
No ratings yet
Lecture Notes 6 Logistic Regression
8 pages
ML CLASS 5 Logistic Regression Algorithm
No ratings yet
ML CLASS 5 Logistic Regression Algorithm
16 pages
Logistic Regression
No ratings yet
Logistic Regression
5 pages
Unit II
100% (1)
Unit II
13 pages
Information Securtiy
No ratings yet
Information Securtiy
8 pages
Logistic+regression Data
No ratings yet
Logistic+regression Data
13 pages
Classification
100% (2)
Classification
105 pages
Session9-LogisticRegression_a6c5bc556df30fa3eb779e22e464a08a - Copy
No ratings yet
Session9-LogisticRegression_a6c5bc556df30fa3eb779e22e464a08a - Copy
33 pages
Logistic Regression
No ratings yet
Logistic Regression
25 pages
Logistic Regression
No ratings yet
Logistic Regression
13 pages
Classification Algorithms
100% (2)
Classification Algorithms
23 pages
3-Intro-to-Logistic-Regression-LT
No ratings yet
3-Intro-to-Logistic-Regression-LT
18 pages
DS203 2024 01 02 LogisticRegression
No ratings yet
DS203 2024 01 02 LogisticRegression
38 pages
Logistic Regression
No ratings yet
Logistic Regression
25 pages
Logistic Regression
No ratings yet
Logistic Regression
8 pages
Chp2 Logistic Regression
No ratings yet
Chp2 Logistic Regression
6 pages
Module 2
No ratings yet
Module 2
72 pages
FAM Unit6
No ratings yet
FAM Unit6
32 pages
07 Logistics Regression
No ratings yet
07 Logistics Regression
23 pages
Classification-Introduction, Logistic Regression
No ratings yet
Classification-Introduction, Logistic Regression
26 pages
1694600777-Unit2.2 Logistic Regression CU 2.0
100% (1)
1694600777-Unit2.2 Logistic Regression CU 2.0
37 pages
ML_MU_Unit_2 - Supervised Learning-Classification Techniques
No ratings yet
ML_MU_Unit_2 - Supervised Learning-Classification Techniques
153 pages
Chapter 4 Statistical Classification Methods
No ratings yet
Chapter 4 Statistical Classification Methods
63 pages
08 Logistic Regression
No ratings yet
08 Logistic Regression
19 pages
Model Evaluation - II
No ratings yet
Model Evaluation - II
12 pages
DMML Unit4
No ratings yet
DMML Unit4
77 pages
Logistic Regression
No ratings yet
Logistic Regression
18 pages
Unit 2 Chap 4
No ratings yet
Unit 2 Chap 4
14 pages
Advanced Regression
No ratings yet
Advanced Regression
13 pages
Lecture Material 11
No ratings yet
Lecture Material 11
14 pages
ML CH 5
No ratings yet
ML CH 5
45 pages
Deep Learning Week 204-4
No ratings yet
Deep Learning Week 204-4
1 page
ML Linear Model
No ratings yet
ML Linear Model
10 pages
Evaluation Measures
No ratings yet
Evaluation Measures
8 pages
Ai DS 2 Book-Chpt-5
No ratings yet
Ai DS 2 Book-Chpt-5
17 pages
AI & ML Notes
No ratings yet
AI & ML Notes
22 pages
Logistic regression by Nirzona
No ratings yet
Logistic regression by Nirzona
11 pages
Logistic Regression
No ratings yet
Logistic Regression
8 pages
B-56 Sanket Jambhulkar MLA-3
No ratings yet
B-56 Sanket Jambhulkar MLA-3
7 pages
Mathematics Behind Logistic Regression Model 1598272636
No ratings yet
Mathematics Behind Logistic Regression Model 1598272636
6 pages
DSML Clasification
No ratings yet
DSML Clasification
44 pages
### Data Exploration: 'Yes' 'No' 'Agency' 'Direct' 'Employee Referral' 'Yes' 'No'
100% (1)
### Data Exploration: 'Yes' 'No' 'Agency' 'Direct' 'Employee Referral' 'Yes' 'No'
6 pages
Day_3,Task_2
No ratings yet
Day_3,Task_2
2 pages
ML Assignment
No ratings yet
ML Assignment
20 pages
Confusion Matrix & Evaluation Metrics in Machine Learning
No ratings yet
Confusion Matrix & Evaluation Metrics in Machine Learning
23 pages
ML2
No ratings yet
ML2
8 pages
09_23ECE216_LogisticRegression
No ratings yet
09_23ECE216_LogisticRegression
40 pages
Logistic Regression Notes
No ratings yet
Logistic Regression Notes
23 pages
Intermediate Analytics-Regression-Week 3-1
No ratings yet
Intermediate Analytics-Regression-Week 3-1
44 pages
Logistic Regression
No ratings yet
Logistic Regression
36 pages
Binary Classification PDF
No ratings yet
Binary Classification PDF
27 pages
Logistic Regression and Naive Bayes
No ratings yet
Logistic Regression and Naive Bayes
4 pages
MISY 631 Final Review Calculators Will Be Provided For The Exam
No ratings yet
MISY 631 Final Review Calculators Will Be Provided For The Exam
9 pages
FALLSEM2024-25 BCSE401L TH VL2024250102078 2024-09-04 Reference-Material-I
No ratings yet
FALLSEM2024-25 BCSE401L TH VL2024250102078 2024-09-04 Reference-Material-I
27 pages
ML-Unit 4
No ratings yet
ML-Unit 4
29 pages
Multi-dimensional Monte Carlo Integrations Utilizing Mathematica
From Everand
Multi-dimensional Monte Carlo Integrations Utilizing Mathematica
SUJAUL CHOWDHURY
No ratings yet
Welcome To Bedini Technology
No ratings yet
Welcome To Bedini Technology
38 pages
GBVG Proceedings v1
No ratings yet
GBVG Proceedings v1
196 pages
Worksheet 89FundamentalsofMathematicsi
No ratings yet
Worksheet 89FundamentalsofMathematicsi
12 pages
Summary Product Design and Development
67% (3)
Summary Product Design and Development
14 pages
Theory of Errors in Observations
0% (1)
Theory of Errors in Observations
45 pages
Mental Maths Homework Year 2
100% (1)
Mental Maths Homework Year 2
6 pages
Matlab-Signals and Systems
No ratings yet
Matlab-Signals and Systems
26 pages
BE Lab Manual 2016
No ratings yet
BE Lab Manual 2016
90 pages
CHAPTER 3 - Non - Uniform Flow in Open Channel (Part 1)
No ratings yet
CHAPTER 3 - Non - Uniform Flow in Open Channel (Part 1)
16 pages
Lecture 4 - Single & Multi-Port Networks
No ratings yet
Lecture 4 - Single & Multi-Port Networks
27 pages
Lesson 1: Exponential Notation: Student Outcomes
No ratings yet
Lesson 1: Exponential Notation: Student Outcomes
9 pages
First National Metrobank - Mtap - Deped Math Challenge Grade 3
100% (2)
First National Metrobank - Mtap - Deped Math Challenge Grade 3
3 pages
Instant Download for Calculus for Scientists and Engineers 1st Edition Briggs Test Bank 2024 Full Chapters in PDF
100% (5)
Instant Download for Calculus for Scientists and Engineers 1st Edition Briggs Test Bank 2024 Full Chapters in PDF
70 pages
Sample Thesis Chapter 5 Conclusion and Recommendation
100% (1)
Sample Thesis Chapter 5 Conclusion and Recommendation
7 pages
SP Iii-39
No ratings yet
SP Iii-39
3 pages
Mat565 - Table of Laplace Transforms
No ratings yet
Mat565 - Table of Laplace Transforms
1 page
Engg. Mathematics I (MAT - 101) RCS (Makeup) PDF
No ratings yet
Engg. Mathematics I (MAT - 101) RCS (Makeup) PDF
2 pages
AT - Chapter 10-Notes - Part 3
No ratings yet
AT - Chapter 10-Notes - Part 3
3 pages
FEA of Nonlinear Problems
No ratings yet
FEA of Nonlinear Problems
62 pages
Physics 2121 Lab Manual 11 0e
No ratings yet
Physics 2121 Lab Manual 11 0e
121 pages
Chapter 04 R
0% (1)
Chapter 04 R
31 pages
4 Dim
No ratings yet
4 Dim
8 pages
15MAY Depthmap Network Analysis Tutorial PDF
No ratings yet
15MAY Depthmap Network Analysis Tutorial PDF
10 pages
MCNP5 Manual VOL I
No ratings yet
MCNP5 Manual VOL I
416 pages
MATH 250 Midterm Exam I October 12, 2004 Name: Student Number: Instructor: Section
No ratings yet
MATH 250 Midterm Exam I October 12, 2004 Name: Student Number: Instructor: Section
10 pages
Modelling of Synchronized Jumping Crowds On Grandstands
No ratings yet
Modelling of Synchronized Jumping Crowds On Grandstands
8 pages
Dividing by 10, 100 and 1000 Activity Sheet
No ratings yet
Dividing by 10, 100 and 1000 Activity Sheet
4 pages
Practice Exercise For Validity
No ratings yet
Practice Exercise For Validity
5 pages
Fluke PM6666
No ratings yet
Fluke PM6666
8 pages