UNIT 2 - Linear & Logistic Regression

Simple Linear Regression is a statistical method that models the relationship between a dependent variable and a single independent variable, with the aims of explaining that relationship and forecasting new observations. It assumes a linear relationship, little or no multicollinearity, homoscedasticity, normally distributed error terms, and no autocorrelation. Logistic Regression, on the other hand, is used for classification tasks to predict probabilities of categorical outcomes, fitting an 'S'-shaped logistic function instead of a straight regression line.

Simple Linear Regression in Machine Learning

Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear (a sloped straight line), hence the name Simple Linear Regression. The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. The independent variable, however, can be continuous or categorical.

The Simple Linear Regression algorithm has two main objectives:

➢ Modelling the relationship between the two variables, such as the relationship between income and expenditure, or between experience and salary.
➢ Forecasting new observations, such as forecasting the weather from the temperature, or a company's revenue from its investments.

Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,

y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error

The values of the x and y variables form the training dataset used to fit the Linear Regression model.
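As a minimal sketch (assuming NumPy; the sample data reuses the weekly-sales figures worked through later in this unit), the coefficients a0 and a1 can be estimated with the closed-form least-squares formulas:

import numpy as np

# Sample data (same x and y as the weekly-sales worked example below)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.8, 2.6, 3.2, 3.8])

# Least-squares estimates:
# a1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# a0 = mean(y) - a1 * mean(x)
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

print(f"y = {a0:.2f} + {a1:.2f}x")  # prints: y = 0.54 + 0.66x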
Linear Regression Line
A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:

• Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a positive linear relationship.

• Negative Linear Relationship: If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, the relationship is called a negative linear relationship.
Assumptions of Linear Regression
Below are some important assumptions of Linear Regression. These are formal checks to perform while building a Linear Regression model, and they help ensure the best possible result from the given dataset.

o Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent and independent variables.

o Little or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable; in other words, it is difficult to determine which predictor variables are affecting the target variable and which are not. So, the model assumes either little or no multicollinearity between the features or independent variables.

o Homoscedasticity assumption:
Homoscedasticity means that the variance of the error term is the same across all values of the independent variables. With homoscedasticity, there should be no clear pattern in the distribution of points in a residual scatter plot.

o Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal distribution. If the error terms are not normally distributed, confidence intervals become either too wide or too narrow, which may cause difficulties in estimating the coefficients reliably. This can be checked using a q-q plot: if the plot shows a straight line without large deviations, the errors are approximately normally distributed (a sketch of such checks follows this list).

o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. Any correlation in the error terms will drastically reduce the accuracy of the model. Autocorrelation usually occurs when there is a dependency between residual errors.
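As a minimal illustration of two of these checks (a sketch with synthetic data, assuming NumPy, SciPy, and Matplotlib; the feature matrix and residuals here are made-up stand-ins for a real model's inputs and errors):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # two hypothetical features
residuals = rng.normal(size=100)     # stand-in for a fitted model's residuals

# Multicollinearity check: correlation between the two features (should be low)
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])

# Normality check: q-q plot of residuals against the normal distribution
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()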
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line, meaning that the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line. To calculate these, we use a cost function.

Cost function
o Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
o The cost function optimizes the regression coefficients or weights and measures how well a linear regression model is performing.
o We can use the cost function to measure the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the hypothesis function. A small sketch of one such cost function follows this list.
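The notes do not name a specific cost function, so the following is a minimal sketch assuming the commonly used Mean Squared Error (MSE) and NumPy:

import numpy as np

def mse_cost(a0: float, a1: float, x: np.ndarray, y: np.ndarray) -> float:
    """Mean Squared Error of the hypothesis y_hat = a0 + a1 * x."""
    y_hat = a0 + a1 * x          # hypothesis function
    return float(np.mean((y - y_hat) ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.8, 2.6, 3.2, 3.8])
print(mse_cost(0.54, 0.66, x, y))   # cost at the least-squares coefficients
print(mse_cost(0.0, 1.0, x, y))     # a worse line gives a higher cost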
Worked Example
Given data (weeks, sales in thousands):
x: 1, 2, 3, 4, 5
y: 1.2, 1.8, 2.6, 3.2, 3.8

1. Compute means
x̄ = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3
ȳ = (1.2 + 1.8 + 2.6 + 3.2 + 3.8) / 5 = 12.6 / 5 = 2.52

2. Compute slope b and intercept a
Use b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², a = ȳ − b·x̄.

xi    yi    xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
1     1.2   −2       −1.32    2.64               4
2     1.8   −1       −0.72    0.72               1
3     2.6    0        0.08    0.00               0
4     3.2    1        0.68    0.68               1
5     3.8    2        1.28    2.56               4
Sum                           6.60               10

So b = 6.60 / 10 = 0.66
and a = ȳ − b·x̄ = 2.52 − 0.66 × 3 = 2.52 − 1.98 = 0.54

3. Regression line
ŷ = a + b·x = 0.54 + 0.66x

4. Predictions
• For week x = 7: ŷ = 0.54 + 0.66 × 7 = 0.54 + 4.62 = 5.16 (thousands) → 5,160 units
• For week x = 12: ŷ = 0.54 + 0.66 × 12 = 0.54 + 7.92 = 8.46 (thousands) → 8,460 units

Answer: predicted sales are 5.16k for week 7 and 8.46k for week 12.
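As a quick cross-check of the hand computation (a sketch assuming NumPy; np.polyfit performs a least-squares fit of a degree-1 polynomial and returns the slope first, then the intercept):

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 1.8, 2.6, 3.2, 3.8])

b, a = np.polyfit(x, y, 1)   # slope ≈ 0.66, intercept ≈ 0.54
print(a, b)                  # matches the hand computation
print(a + b * 7)             # ≈ 5.16
print(a + b * 12)            # ≈ 8.46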


Linear Regression Problem
QUESTION 3: Given the data points:
x: 1, 2, 3, 4, 5
y: 1.2, 1.9, 3.2, 3.8, 5.1
1. Fit a simple linear regression model y = a + bx. Find the slope b and intercept a.
2. Compute the predicted ŷ for each x and the residuals.
3. Compute R² (the coefficient of determination).
4. Predict y when x = 6.

1) Useful sums and means

Number of observations: n = 5.
Σx = 1 + 2 + 3 + 4 + 5 = 15
Σy = 1.2 + 1.9 + 3.2 + 3.8 + 5.1 = 15.2
x̄ = Σx / n = 15 / 5 = 3.0
ȳ = Σy / n = 15.2 / 5 = 3.04
Σx² = 1² + 2² + 3² + 4² + 5² = 55
Σxy = 1·1.2 + 2·1.9 + 3·3.2 + 4·3.8 + 5·5.1 = 55.3

Compute the centered sums:
SSxx = Σ(x − x̄)² = 10.0
(Check: SSxx = Σx² − n·x̄² = 55 − 5·3² = 55 − 45 = 10)
SSxy = Σ(x − x̄)(y − ȳ) = 9.7
(Check: SSxy = Σxy − n·x̄·ȳ = 55.3 − 5·3·3.04 = 55.3 − 45.6 = 9.7)

2) Regression coefficients

Slope: b = SSxy / SSxx = 9.7 / 10.0 = 0.97
Intercept: a = ȳ − b·x̄ = 3.04 − 0.97·3.0 = 3.04 − 2.91 = 0.13

So the fitted line is: ŷ = 0.13 + 0.97x
3) Predicted values and residuals
Compute ŷi = a + b·xi:
• For x = 1: ŷ = 0.13 + 0.97·1 = 1.10. Residual e = y − ŷ = 1.20 − 1.10 = 0.10.
• For x = 2: ŷ = 0.13 + 0.97·2 = 2.07. Residual = 1.90 − 2.07 = −0.17.
• For x = 3: ŷ = 0.13 + 0.97·3 = 3.04. Residual = 3.20 − 3.04 = 0.16.
• For x = 4: ŷ = 0.13 + 0.97·4 = 4.01. Residual = 3.80 − 4.01 = −0.21.
• For x = 5: ŷ = 0.13 + 0.97·5 = 4.98. Residual = 5.10 − 4.98 = 0.12.
(Residuals sum to ≈ 0, as expected.)

4) Coefficient of determination R²
SSE = Σe² = 0.10² + (−0.17)² + 0.16² + (−0.21)² + 0.12² ≈ 0.123
SST = Σ(y − ȳ)² ≈ 9.532
R² = 1 − SSE/SST = 1 − 0.123/9.532 ≈ 0.987
So about 98.7% of the variation in y is explained by the fitted line.

5) Predict at x = 6
ŷ6 = 0.13 + 0.97·6 = 0.13 + 5.82 = 5.95
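The same results can be reproduced in code (a sketch assuming NumPy; np.polyfit returns the slope and intercept of the least-squares line):

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

b, a = np.polyfit(x, y, 1)               # slope ≈ 0.97, intercept ≈ 0.13
y_hat = a + b * x
residuals = y - y_hat

ss_res = np.sum(residuals ** 2)          # SSE ≈ 0.123
ss_tot = np.sum((y - y.mean()) ** 2)     # SST ≈ 9.532
r2 = 1 - ss_res / ss_tot                 # R² ≈ 0.987

print(f"b={b:.2f}, a={a:.2f}, R^2={r2:.3f}")
print(a + b * 6)                         # prediction at x = 6 ≈ 5.95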


LOGISTIC REGRESSION
• Logistic regression is a type of machine learning model used for classification tasks, where the goal is to predict whether something belongs to one of two categories, like yes/no, true/false, or spam/not spam.
• Instead of predicting a continuous number as in linear regression, logistic regression predicts the probability that an input belongs to a particular class (like the probability that an email is spam). The result is a value between 0 and 1.
What is Logistic Regression?
Logistic regression is a statistical method used for building machine learning models where the dependent variable is dichotomous, i.e. binary. Logistic regression is used to describe data and the relationship between one dependent variable and one or more independent variables. The independent variables can be nominal, ordinal, or of interval type.
The name "logistic regression" is derived from the logistic function that it uses. The logistic function is also known as the sigmoid function, and its value lies between zero and one.
As an example, a logistic function can model the probability of a vehicle breaking down, depending on how many years it has been since it was last serviced: the longer since the last service, the higher the predicted probability of a breakdown.
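A minimal sketch of the logistic (sigmoid) function, applied to the vehicle example (assuming NumPy; the coefficients w0 and w1 are made up for illustration):

import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up coefficients for the vehicle example: z = w0 + w1 * years_since_service
w0, w1 = -4.0, 1.0
years = np.array([0, 2, 4, 6, 8])
print(sigmoid(w0 + w1 * years))  # breakdown probability rises with years since service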
Logistic Regression in Machine Learning
1. Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables.
2. Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
3. Logistic Regression is very similar to Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
4. In Logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function (the sigmoid function), which is bounded by the two limiting values 0 and 1.
5. The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
6. Logistic Regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete datasets.
7. Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification.
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep" (a short code sketch follows this list).
o Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
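As a minimal sketch of the multinomial case (assuming scikit-learn and NumPy; the features and labels are made up for illustration), scikit-learn's LogisticRegression handles a 3-class target directly:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 3-class data: labels 0="cat", 1="dog", 2="sheep"; features made up
X = np.array([[1.0, 0.2], [1.2, 0.1], [0.2, 1.1], [0.3, 0.9], [2.0, 2.1], [2.2, 1.9]])
y = np.array([0, 0, 1, 1, 2, 2])

clf = LogisticRegression()   # scikit-learn fits a multinomial model for 3+ classes
clf.fit(X, y)

print(clf.predict([[1.1, 0.15]]))        # predicted class label
print(clf.predict_proba([[1.1, 0.15]]))  # one probability per class, summing to 1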

Assumptions of the Logistic Regression Algorithm

• In a binary logistic regression, the dependent variable must be binary.
• For a binary regression, factor level 1 of the dependent variable should represent the desired outcome.
• Only meaningful variables should be included.
• The independent variables should be independent of each other; that is, the model should have little or no multicollinearity.
• The independent variables are linearly related to the log odds.
• Logistic regression requires fairly large sample sizes.
How Does the Logistic Regression Algorithm Work?

Consider the following example: an organization wants to determine an employee's salary increase based on their performance.

For this purpose, a linear regression algorithm will help them decide. Plotting a regression line, with the employee's performance as the independent variable and the salary increase as the dependent variable, makes the task straightforward.

Now, what if the organization wants to know whether an employee will get a promotion or not based on their performance? A straight regression line is not suitable in this case, so we clip the line at zero and one and convert it into a sigmoid curve (S-curve).

Based on a threshold value, the organization can then decide whether an employee will get the promotion or not.
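As a minimal illustration of that thresholding step (the cutoff logic and example probabilities are made up, not tied to any specific dataset):

def classify(probability: float, threshold: float = 0.5) -> int:
    """Turn a predicted probability into a 0/1 decision using a cutoff threshold."""
    return 1 if probability >= threshold else 0

print(classify(0.82))   # 1, e.g. "gets the promotion"
print(classify(0.31))   # 0, e.g. "does not get the promotion"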

To understand logistic regression, let's go over the odds of success.

Odds (θ) = probability of the event happening / probability of the event not happening

θ = p / (1 − p)

The values of odds range from zero to ∞, while the values of probability lie between zero and one.
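As a minimal sketch (assuming NumPy; the probabilities are made up), the odds and the log-odds (the logit, which logistic regression models as a linear function of the inputs, per the assumption listed above) can be computed as:

import numpy as np

def odds(p: float) -> float:
    """Odds of success: P(event) / P(no event)."""
    return p / (1.0 - p)

for p in (0.1, 0.5, 0.9):
    # The log-odds (logit) is the quantity logistic regression models linearly.
    print(f"p={p}: odds={odds(p):.3f}, log-odds={np.log(odds(p)):.3f}")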
Use case of Logistic Regression
Logistic regression can be used to predict if a student will pass or fail an
exam based on the number of hours they spent studying. The dependent
variable is "pass" or "fail", which are represented by the values 1 and 0,
respectively.
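Below is a minimal sketch of this use case (assuming scikit-learn and NumPy; the study hours and pass/fail labels are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0), made up for illustration
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Probability of passing for a student who studied 2.75 hours
print(model.predict_proba([[2.75]])[0, 1])
# Hard 0/1 prediction using the default 0.5 threshold
print(model.predict([[2.75]]))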
Advantages of Logistic Regression:
• Easy to Use: Simple to understand and explain.
• Good for Yes/No Questions: Well suited to predicting things like "Will it rain?" or "Will I buy this?"
• Gives Probabilities: Shows how likely something is to happen.
• Handles an S-shaped Relationship: The predicted probability follows a curve rather than a straight line.
• Better with Outliers: Less affected by extreme values than some other methods.

Disadvantages of Logistic Regression:
• Needs Lots of Data: Works best with a good amount of information.
• Assumes a Simple Relationship: It assumes a specific (linear-in-the-log-odds) way the factors relate to the outcome.
• Only for Two Options: Mainly used for two choices, not for multiple options.
• Struggles with Complex Patterns: Might not capture very complicated relationships well.
• Sensitive to Similar Variables: If the factors are too alike (multicollinear), it can cause problems.
Difference Between Linear Regression and Logistic Regression

In short: Linear Regression is used for solving regression problems, predicting a continuous output, whereas Logistic Regression is used for solving classification problems, predicting the probability of a categorical outcome.