Session-11 Machine Learning - Jupyter Notebook
Session-11 Machine Learning - Jupyter Notebook
Introduction
In linear regression, the type of data we deal with is quantitative, whereas we use classification models to deal with qualitative data or categorical data.
The algorithms used for solving a classification problem first predict the probability of each of the categories of the qualitative variables, as the basis for
making the classification. And, as the probabilities are continuous numbers, classification using probabilities also behave like regression methods.
Logistic regression is one such type of classification model which is used to classify the dependent variable into two or more classes or categories.
Let’s suppose you took a survey and noted the response of each person as satisfied, neutral or Not satisfied. Let’s map each category:
Satisfied – 2
Neutral – 1
Not Satisfied – 0
But this doesn’t mean that the gap between Not satisfied and Neutral is same as Neutral and satisfied. There is no mathematical significance of these
mapping. We can also map the categories like:
Satisfied – 0
Neutral – 1
Not Satisfied – 2
It’s completely fine to choose the above mapping. If we apply linear regression to both the type of mappings, we will get different sets of predictions. Also,
we can get prediction values like 1.2, 0.8, 2.3 etc. which makes no sense for categorical values. So, there is no normal method to convert qualitative data
into quantitative data for use in linear regression. Although, for binary classification, i.e. when there only two categorical values, using the least square
method can give decent results. Suppose we have two categories Black and White and we map them as follows:
Black – 0
White - 1
We can assign predicted values for both the categories such as Y> 0.5 goes to class white and vice versa. Although, there will be some predictions for
which the value can be greater than 1 or less than 0 making them hard to classify in any class. Nevertheless, linear regression can work decently for
binary classification but not that well for multi-class classification. Hence, we use classification methods for dealing with such problems.
Logistic Regression
Logistic regression is one such regression algorithm which can be used for performing classification problems. It calculates the probability that a given
value belongs to a specific class. If the probability is more than 50%, it assigns the value in that particular class else if the probability is less than 50%, the
value is assigned to the other class. Therefore, we can say that logistic regression acts as a binary classifier.
But, when we use equation(i) to calculate probability, we would get values less than 0 as well as greater than 1. That doesn’t make any sense . So, we
need to use such an equation which always gives values between 0 and 1, as we desire while calculating the probability.
Sigmoid function
We use the sigmoid function as the underlying function in Logistic regression. Mathematically and graphically.
1. The sigmoid function’s range is bounded between 0 and 1. Thus it’s useful in calculating the probability for the Logistic function.
2. It’s derivative is easy to calculate than other functions which is useful during gradient descent calculation.
3. It is a simple way of introducing non-linearity to the model.
Although there are other functions as well, which can be used, but sigmoid is the most common function used for logistic regression. We will talk about
the rest of the functions in the neural network section.
Metrics In a regression problem, the accuracy is generally measured in terms of the difference in the actual values and the predicted values. In a
classification problem, the credibility of the model is measured using the confusion matrix generated, i.e., how accurately the true positives and true
negatives were predicted. The different metrics used for this purpose are:
Accuracy
Recall
Confusion Matrix
Where the terms have the meaning:
True Positive(TP): A result that was predicted as positive by the classification model and also is positive
True Negative(TN): A result that was predicted as negative by the classification model and also is negative
False Positive(FP): A result that was predicted as positive by the classification model but actually is negative
False Negative(FN): A result that was predicted as negative by the classification model but actually is positive.
The Credibility of the model is based on how many correct predictions did the model do.
Accuracy
The mathematical formula is :
(𝑇𝑃+𝑇𝑁)
Accuracy= (𝑇𝑃+𝑇𝑁+𝐹𝑃+𝐹𝑁)
Or, it can be said that it’s defined as the total number of correct classifications divided by the total number of classifications. Its is not the correct for
inbalanc data beacause its always show you high accurancy becoz its bais to the high count data in binary classification becoz its not calculate the error /
its won't count the error
Recall or Sensitivity
𝑇𝑃
Recall= (𝑇𝑃+𝐹𝑁)
Or, as the name suggests, it is a measure of: from the total number of positive results how many positives were correctly predicted by the model.
It shows how relevant the model is, in terms of positive results only.
Consider a classification model , the model gave 50 correct predictions(TP) but failed to identify 200 cancer patients(FN). Recall in that case will be:
50
Recall= (50+200) = 0.2 (The model was able to recall only 20% of the cancer patients)
Precision
Precision is a measure of amongst all the positive predictions, how many of them were actually positive. Mathematically,
𝑇𝑃
Precision= (𝑇𝑃+𝐹𝑃)
Let’s suppose in the previous example, the model identified 50 people as cancer patients(TP) but also raised a false alarm for 100 patients(FP). Hence,
50
Precision= (50+100) =0.33 (The model only has a precision of 33%)
But we have a problem!!
As evident from the previous example, the model had a very high Accuracy but performed poorly in terms of Precision and Recall. So, necessarily
Accuracy is not the metric to use for evaluating the model in this case.
Imagine a scenario, where the requirement was that the model recalled all the defaulters who did not pay back the loan. Suppose there were 10 such
defaulters and to recall those 10 defaulters, and the model gave you 20 results out of which only the 10 are the actual defaulters. Now, the recall of the
model is 100%, but the precision goes down to 50%.
F1 Score
From the previous examples, it is clear that we need a metric that considers both Precision and Recall for evaluating a model. One such metric is the F1
score.
2∗((𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛∗𝑅𝑒𝑐𝑎𝑙𝑙)
The mathematical formula is: F1 score= (𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛+𝑅𝑒𝑐𝑎𝑙𝑙))
Specificity or True Negative Rate
This represents how specific is the model while predicting the True Negatives. Mathematically,
𝑇𝑁
Specificity= (𝑇𝑁+𝐹𝑃) Or, it can be said that it quantifies the total number of negatives predicted by the model with respect to the total number of actual
negative or non favorable outcomes.
But while working with real-time data, it has been observed that we seldom get a perfect 0 or 1 value. Instead of that, we get different decimal values
lying between 0 and 1. Now the question is if we are not getting binary probability values how are we actually determining the class in our classification
problem?
There comes the concept of Threshold. A threshold is set, any probability value below the threshold is a negative outcome, and anything more than the
threshold is a favourable or the positive outcome. For Example, if the threshold is 0.5, any probability value below 0.5 means a negative or an
unfavourable outcome and any value above 0.5 indicates a positive or favourable outcome.
The horizontal lines represent the various values of thresholds ranging from 0 to 1.
Let’s suppose our classification problem was to identify the obese people from the given data.
The green markers represent obese people and the red markers represent the non-obese people.
Our confusion matrix will depend on the value of the threshold chosen by us.
For Example, if 0.25 is the threshold then TP(actually obese)=3 TN(Not obese)=2 FP(Not obese but predicted obese)=2(the two red squares above
the 0.25 line) FN(Obese but predicted as not obese )=1(Green circle below 0.25line )
In [6]: data
Out[6]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
1 1 85 66 29 0 26.6 0.351 31 0
3 1 89 66 23 94 28.1 0.167 21 0
... ... ... ... ... ... ... ... ... ...
In [7]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
In [10]: data.describe()
Out[10]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
EDA
In [12]: # Univariate analysis
Done! Use 'show' commands to display/save. [100%] 00:01 -> (00:00 left)
Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS
saved in your notebook/colab files.
In [15]: data.head()
Out[15]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
1 1 85 66 29 0 26.6 0.351 31 0
3 1 89 66 23 94 28.1 0.167 21 0
In [19]: data.isnull().sum()
Out[19]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
In [20]: data.loc[data['BMI']==0]
Out[20]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
60 2 84 0 0 0 0.0 0.304 21 0
81 2 74 0 0 0 0.0 0.102 22 0
In [21]: data['BMI'].mean()
Out[21]: 31.992578124999998
In [23]: data.describe()
Out[23]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 121.681605 72.254807 26.606479 118.660163 32.450805 0.471876 33.240885 0.348958
std 3.369578 30.436016 12.115932 9.631241 93.080358 6.875374 0.331329 11.760232 0.476951
min 0.000000 44.000000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000 0.000000
25% 1.000000 99.750000 64.000000 20.536458 79.799479 27.500000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 79.799479 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
In [24]: data
Out[24]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
... ... ... ... ... ... ... ... ... ...
outliers
In [30]: data.columns
data1 = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
feature selection
In [41]: data
Out[41]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
... ... ... ... ... ... ... ... ... ...
In [43]: X
Out[43]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
In [46]: X_scaled
Out[48]: ▾ LogisticRegression
LogisticRegression()
In [51]: y_train_pred
Out[51]: array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
1, 0, 0, 0], dtype=int64)
In [53]: y_pred
Out[53]: array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0], dtype=int64)
In [55]: y_pred.shape
Out[55]: (192,)
In [56]: data
Out[56]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
... ... ... ... ... ... ... ... ... ...
In [72]: from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score, classif
Out[59]: 0.78125
In [61]: precision
Out[61]: 0.7708333333333334
In [64]: recall
Out[64]: 0.5441176470588235
In [67]: f1_score
Out[67]: 0.6379310344827587
Out[68]:
col_0 0 1
Outcome
0 113 11
1 31 37
In [71]: auc
Out[71]: 0.7277039848197343
In [75]: print(report)
In [ ]: