Statistical Modelling
1 Multiple Linear Regression
1.1 Introduction
1.2 Assumptions
2 Logistic Regression
2.1 Introduction to LR
3 Problem Statement
3.1 Problem statement
3.2 Objective
4 Dataset
5 Methodology
(Steps In Statistical Model-Building)
5.1 Define problem
5.2 Develop a conceptual model
5.3 Design study
6 Conclusion
1. Multiple Linear Regression
1.1 Introduction:
Multiple linear regression (MLR) is a statistical method used to examine the
relationship between two or more independent variables (X), also called
predictor or causal variables, and a dependent variable (Y), also called the
response or criterion variable. In MLR the relationship is assumed to be linear,
and the error term, and hence Y for given values of X, is assumed to follow a
normal distribution. MLR not only describes the pattern of the relationship
between Y and X but also supports inference about the strength of that
relationship. In addition, MLR is used for prediction.
The general form of a multiple linear regression model with p independent
variables is given by the equation:
$$Y = \beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p + \epsilon$$
Here:
Y is the dependent variable. The purpose of MLR is to explain Y’s
relationship with the independent variables (IVs), to predict Y’s future
values, or both.
X1, X2, …, Xp are the independent variables.
X0 always takes the value 1; including it lets the model quantify the
value of Y when all IVs equal 0 (zero).
β1, β2, …, βp are constants called regression coefficients. Each βj
represents the contribution of Xj in explaining or predicting Y: the expected
change in Y for a one-unit increase in Xj when the other IVs are held fixed.
β0 is a constant equal to the expected value of Y when all the IVs X1, X2,
…, Xp equal 0 (zero). β0 is also called the regression intercept.
ϵ is a random error term that represents the part of Y that cannot be
explained or predicted by the IVs X1, X2, …, Xp.
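As a minimal sketch of the model above, the following R code fits an MLR with two independent variables. The data are simulated and the coefficient values (2, 1.5, -0.8) are assumptions chosen only for illustration; they are not taken from this report.

```r
# Synthetic data only: all numbers below are made-up illustration values.
set.seed(1)
n   <- 200
x1  <- rnorm(n)
x2  <- rnorm(n)
eps <- rnorm(n, sd = 0.5)              # random error term
y   <- 2 + 1.5 * x1 - 0.8 * x2 + eps   # beta0 = 2, beta1 = 1.5, beta2 = -0.8

# lm() adds the constant column X0 = 1 automatically (the intercept).
fit <- lm(y ~ x1 + x2)

coef(fit)                # estimates of beta0, beta1, beta2
summary(fit)$r.squared   # share of the variability of Y explained by the IVs
```

With enough observations the estimates returned by coef(fit) should lie close to the values used in the simulation, which is one way to see what each coefficient measures.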
1.2 Assumptions:
A model is a representation of a real-world phenomenon and is established
under certain assumptions. The assumptions in MLR are as follows:
Linearity of the phenomenon measured, i.e., Y is linearly related to the
IVs.
Constant error variance across observations of the IVs, known as
homoscedasticity.
Uncorrelated error terms, i.e., Cov(ϵi, ϵk) = 0 for i ≠ k.
Normal distribution of the error terms, i.e., ϵi ~ N(0, σ²).
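The assumptions listed above can be checked numerically on the residuals of a fitted model. The sketch below uses simulated data and a few simple base-R checks chosen for illustration; the report does not state which tests were actually used.

```r
# Synthetic data only: a made-up linear relationship with normal errors.
set.seed(2)
n   <- 150
x   <- runif(n, 0, 10)
y   <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)
res <- residuals(fit)

# Homoscedasticity: compare residual variance in the upper and lower half of x.
upper     <- x > median(x)
var_ratio <- var(res[upper]) / var(res[!upper])   # near 1 if variance is constant

# Uncorrelated errors: lag-1 correlation of residuals in observation order.
lag1_cor <- cor(res[-1], res[-n])                 # near 0 if uncorrelated

# Normality of errors: Shapiro-Wilk test
# (a large p-value gives no evidence against normality).
normality <- shapiro.test(res)

c(var_ratio = var_ratio, lag1_cor = lag1_cor, shapiro_p = normality$p.value)
```

Calling plot(fit) on the same object draws the standard diagnostic plots (residuals against fitted values, normal Q-Q plot, and others), which give a visual check of linearity and constant variance.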
2. Logistic Regression
2.1 Introduction:
Logistic regression models the probability that a given instance belongs to a
particular category. The logistic function (sigmoid function) is used to
transform the linear combination of the input features into a value between 0
and 1. The logistic regression model predicts the probability of the positive
class (1), and if this probability is above a certain threshold (commonly 0.5), the
instance is classified as belonging to the positive class; otherwise, it is classified
as belonging to the negative class (0).
The logistic regression equation is given by:
$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p)}}$$
Here:
P(Y=1) is the probability of the positive class.
e is the base of the natural logarithm.
β0 is the regression intercept.
β1, β2, …, βp are constants called regression coefficients that
represent the contribution of X1, X2, …, Xp respectively in
explaining or predicting Y.
The logistic function ensures that the output is bounded between 0 and 1,
making it suitable for representing probabilities. The probability of the negative
class, P(Y=0), is simply 1 - P(Y=1).
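A short derivation, offered as a reading aid rather than as part of the original analysis, shows what the coefficients act on. Writing z for the linear combination of the inputs, the two class probabilities give the odds and the log-odds:

$$
\begin{aligned}
z &= \beta_0 X_0 + \beta_1 X_1 + \dots + \beta_p X_p \\
P(Y=1) &= \frac{1}{1 + e^{-z}}, \qquad P(Y=0) = 1 - P(Y=1) = \frac{e^{-z}}{1 + e^{-z}} \\
\frac{P(Y=1)}{P(Y=0)} &= e^{z} \\
\log \frac{P(Y=1)}{1 - P(Y=1)} &= \beta_0 X_0 + \beta_1 X_1 + \dots + \beta_p X_p
\end{aligned}
$$

So the model is linear in the log-odds, and a one-unit increase in Xj changes the log-odds of the positive class by βj when the other inputs are held fixed.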
During training, the logistic regression model estimates the coefficients
(β0, β1, β2, …, βp), typically by maximum likelihood. The logistic function
then transforms the linear combination of input features into a probability,
and a threshold (commonly 0.5) determines the predicted class.
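The transformation and thresholding described above can be sketched in a few lines of R. The coefficient vector and the three input rows are hypothetical values chosen for illustration, and sigmoid is a helper name introduced here.

```r
# Logistic (sigmoid) function: maps any real number into (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients (beta0, beta1, beta2) and three input rows.
# The first column of X is X0 = 1, which multiplies the intercept.
beta <- c(-1.0, 0.8, 0.5)
X <- rbind(c(1,  0.2, 1.0),
           c(1,  2.5, 1.5),
           c(1, -1.0, 0.3))

z    <- as.vector(X %*% beta)   # linear combination of the inputs
p    <- sigmoid(z)              # P(Y = 1) for each row
pred <- as.integer(p > 0.5)     # classify with the common 0.5 threshold

data.frame(z = z, p = round(p, 3), pred = pred)
```

Only the second row has a positive linear combination (z = 1.75), so it is the only one whose probability exceeds 0.5 and is classified as positive.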
3. Problem statement
3.1 Problem statement:
To develop a model that predicts the likelihood of diabetes in patients from
diagnostic measurements: pregnancies, glucose levels, blood pressure, skin
thickness, insulin levels, BMI, diabetes pedigree function, and age.
3.2 Objective:
The objective is to identify and quantify the impact of each diagnostic
parameter on the outcome variable (diabetes) for accurate prediction and
improved understanding of the disease.
4. Dataset
5. Methodology
Steps In Statistical Model-Building:
Step 1: Define problem:
The problem is to predict the likelihood of diabetes in patients based on diagnostic
measurements including pregnancies, glucose levels, blood pressure, skin
thickness, insulin levels, BMI, diabetes pedigree function, and age. The
objective is to identify and quantify the impact of each diagnostic parameter on
the outcome variable (diabetes) for accurate prediction and improved
understanding of the disease.
Step 2: Develop a conceptual model:
A conceptual model describes the relationships among the variables of interest
in the system under investigation and is often represented pictorially.
The given problem can be conceptualized in the following steps:
i. Identify the variables of interest: pregnancies, glucose levels, blood
pressure, skin thickness, insulin levels, BMI, diabetes pedigree function,
age, and outcome (diabetes).
ii. Identify the dependent variable (Y): outcome (diabetes).
iii. Identify the independent variables (X): pregnancies, glucose levels, blood
pressure, skin thickness, insulin levels, BMI, diabetes pedigree function,
and age.
iv. Determine the dependence relationship: Y = f(X).
[Figure: conceptual model diagram. The inputs X0 to X8, weighted by the coefficients β0 to β8, combine with the error term ϵ to give the outcome Y.]
So, the statistical problem is as follows: the changes in the response
(dependent) variable Y (outcome) are explained by the explanatory
(independent) variables X1 (pregnancies), X2 (glucose levels), X3 (blood
pressure), X4 (skin thickness), X5 (insulin levels), X6 (BMI), X7 (diabetes
pedigree function), and X8 (age), through the relationship Y = f(X).
The linear relationship can be statistically modeled as:
$$Y = \beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \beta_8 X_8 + \epsilon$$
The logistic regression equation is:
$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \beta_8 X_8)}}$$
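A logistic regression of this form can be fitted in R with glm() and family = binomial. The sketch below is not the report's own code: it uses a simulated data frame with only three of the eight predictors, and the column names, sample size, and coefficient values are all assumptions made for illustration.

```r
# Synthetic stand-in data: column names and all numbers are hypothetical.
set.seed(3)
n <- 500
d <- data.frame(
  glucose = rnorm(n, mean = 120, sd = 30),
  bmi     = rnorm(n, mean = 32, sd = 7),
  age     = round(runif(n, 21, 80))
)

# Made-up coefficients, used only to simulate a binary outcome.
z <- -8 + 0.04 * d$glucose + 0.06 * d$bmi + 0.02 * d$age
d$outcome <- rbinom(n, size = 1, prob = 1 / (1 + exp(-z)))

# Fit the logistic regression; the intercept (X0 = 1) is added automatically.
fit <- glm(outcome ~ glucose + bmi + age, data = d, family = binomial)

summary(fit)$coefficients                 # estimates, std. errors, z values, p-values
exp(coef(fit))                            # odds ratios per one-unit increase
p_hat <- predict(fit, type = "response")  # fitted P(Y = 1) for each row
```

With the full set of predictors the formula would simply list all eight variables, or use outcome ~ . when the data frame contains only the outcome and the predictors.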
TEST OF INDEPENDENCE
The data are not suitable for an MLR model: the fitted MLR model does not
satisfy the tests of its assumptions. Transforming the response variable could
reduce this problem, but the problem statement requires predicting whether a
patient is diabetes positive or negative, and a logistic regression model is
better suited to such a binary outcome. All further analysis was therefore
carried out with logistic regression.
Step 8: Verify model
The model summary provides the estimated coefficients, their
significance levels, and goodness-of-fit statistics. This information is crucial for
understanding the contribution of each variable to the model.
Diagnostic plots were created to assess the model's assumptions and to identify
potential issues, such as heteroscedasticity or outliers.
Step 9: Validate model
The model's performance was evaluated on the testing data using a confusion
matrix, from which the accuracy was calculated. This step shows how well
the model generalizes to new, unseen data.
A sum of squares analysis was performed to quantify the total variability
(SST), the variability explained by the model (SSR), and the unexplained
variability (SSE).
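The confusion matrix and accuracy calculation can be sketched as follows. The ten actual labels and predicted probabilities are hypothetical values invented for the example; they are not results from this study.

```r
# Hypothetical test-set labels and predicted probabilities (not report results).
actual <- c(1, 0, 0, 1, 1, 0, 0, 1, 0, 0)
p_hat  <- c(0.81, 0.22, 0.55, 0.35, 0.90, 0.10, 0.62, 0.74, 0.05, 0.30)

# Classify with the 0.5 threshold.
predicted <- as.integer(p_hat > 0.5)

# Confusion matrix: rows are predicted classes, columns are actual classes.
cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
            Actual    = factor(actual,    levels = c(0, 1)))

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

cm
accuracy
```

In this made-up example 7 of the 10 cases fall on the diagonal, so the accuracy is 0.7. Fixing the factor levels keeps the table 2 x 2 even if one class is never predicted.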
Step 10: Interpret result
The logistic regression model provides a way to estimate the probability
of diabetes based on the given predictor variables.
The model has been refined by excluding certain variables, suggesting a
focus on more relevant features.
Diagnostic plots and the confusion matrix provide insights into the
model's performance and potential areas for improvement.
The Sum of Squares analysis offers a quantitative assessment of the
model's goodness of fit.
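The sum of squares decomposition can be illustrated with a small sketch on simulated data. One caveat is assumed here and stated plainly: the identity SST = SSR + SSE holds exactly for a least-squares linear model with an intercept, which is what the example fits. For a logistic regression fitted by maximum likelihood it does not hold exactly in general, and the deviance is the more usual measure of fit.

```r
# Synthetic data only: a made-up linear relationship.
set.seed(4)
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

SST <- sum((y - mean(y))^2)       # total variability
SSR <- sum((y_hat - mean(y))^2)   # variability explained by the model
SSE <- sum((y - y_hat)^2)         # unexplained (residual) variability

r2 <- SSR / SST                   # proportion of variability explained

c(SST = SST, SSR = SSR, SSE = SSE, R2 = r2)
```

For this linear fit the ratio SSR / SST equals the R-squared reported by summary(fit).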
R LANGUAGE CODE
nrow(MLR_data_training)
nrow(MLR_data_testing)
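The two calls above report the sizes of the training and testing sets. How those sets were created is not shown in the report; one common approach is a random split, sketched below. The data frame MLR_data, its columns, and the 80/20 ratio are assumptions introduced for illustration.

```r
# Hypothetical stand-in data frame; the report's actual data are not shown.
set.seed(5)
MLR_data <- data.frame(x = rnorm(100), outcome = rbinom(100, 1, 0.35))

# Assumed 80/20 random split; the ratio used in the report is not stated.
n_total   <- nrow(MLR_data)
train_idx <- sample(seq_len(n_total), size = floor(0.8 * n_total))

MLR_data_training <- MLR_data[train_idx, ]    # rows used to fit the model
MLR_data_testing  <- MLR_data[-train_idx, ]   # held-out rows for validation

nrow(MLR_data_training)
nrow(MLR_data_testing)
```

Setting the seed makes the split reproducible, so the same rows land in the testing set each time the script is run.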
4. Testing of LR Model
6. Sum of Squares Calculation:
7. Prediction:
6. Conclusion
The logistic regression model, after refinement and evaluation, appears to be a
reasonable choice for predicting diabetes outcomes in the Pima Indian dataset.
The model classifies approximately 80% of cases correctly. Further refinements
and evaluations could be considered based on the diagnostic plots and other
model performance metrics.