Understanding Linear Regression Basics

The document provides an overview of Linear Regression, a statistical technique used to predict a continuous variable based on the relationship between dependent and independent variables. It discusses the assumptions of Simple Linear Regression, the mathematical model, and the Ordinary Least Squares method for fitting a regression line. Additionally, it outlines the steps for building a model using Scikit-Learn, including data preparation and visualization techniques.

Uploaded by

Manoj B R

26/09/2022, 16:58 Linear_Regression_withoutcode.ipynb - Colaboratory

LINEAR REGRESSION

Total points = 51

Linear Regression
Linear Regression is a statistical technique used to find the linear relationship between a dependent variable and one or more independent
variables. It applies to supervised-learning regression problems, where we try to predict a continuous variable.

Linear Regression can be further classified into two types: Simple and Multiple Linear Regression. Simple Linear Regression is the simplest form,
where we fit a straight line to the data.

Read this blog in incognito mode: [Link]

Simple Linear Regression - Model Assumptions


The Linear Regression Model is based on several assumptions which are listed below:-

i. Linear relationship
ii. Multivariate normality
iii. No or little multicollinearity
iv. No auto-correlation
v. Homoscedasticity

i. Linear relationship
The relationship between response and feature variables should be linear. This linear relationship assumption can be tested by plotting a
scatter-plot between response and feature variables.

ii. Multivariate normality


The linear regression model requires all variables to be multivariate normal. A multivariate normal distribution means a vector in multiple
normally distributed variables, where any linear combination of the variables is also normally distributed.

iii. No or little multicollinearity


It is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are
highly correlated.

iv. No auto-correlation
Also, it is assumed that there is little or no auto-correlation in the data. Autocorrelation occurs when the residual errors are not independent
from each other.

v. Homoscedasticity
Homoscedasticity describes a situation in which the error term (that is, the noise in the model) is the same across all values of the
independent variables. It means the residuals have constant spread along the regression line. It can be checked by looking at a scatter plot.

WATCH ALL VIDEOS IN THE PORTAL

Video 1 : Assumptions of Linear Regression

Simple Linear Regression (SLR)


Simple Linear Regression (or SLR) is the simplest model in machine learning. It models the linear relationship between the independent and
dependent variables.


This assignment is based on the TV and Sales data. There is one independent or input variable, which represents the TV data and is denoted by
X. Similarly, there is one dependent or output variable, which represents Sales and is denoted by y. We want to model a linear relationship
between these variables. This relationship can be modelled by a mathematical equation of the form:-

Y = β0 + β1*X ------------- (1)

In this equation, X and Y are called independent and dependent variables respectively,

β1 is the coefficient for independent variable and

β0 is the constant term.

β0 and β1 are called parameters of the model.

For simplicity, we can compare the above equation with the basic line equation of the form:-

y = ax + b ----------------- (2)

We can see that

slope of the line is given by, a = β1, and

intercept of the line by b = β0.

In this Simple Linear Regression model, we want to fit a line which estimates the linear relationship between X and Y. So, the question of fitting
reduces to estimating the parameters of the model β0 and β1.

Ordinary Least Square Method


The TV and Sales data are given by X and y respectively. We can draw a scatter plot between X and y which shows the relationship between
them.

Now, our task is to find a line which best fits this scatter plot. This line will help us to predict the value of any Target variable for any given
Feature variable. This line is called Regression line.

We can define an error function for any line. Then, the regression line is the one which minimizes the error function. Such an error function is
also called a Cost function.
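To make the cost-function idea concrete, here is a sketch of the OLS closed-form estimates computed directly with NumPy. The numbers are made up for illustration; they are not the TV/Sales data.

```python
import numpy as np

# Toy data standing in for TV spend (X) and Sales (y); values are illustrative only
X = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([12.0, 14.0, 15.0, 17.0, 18.0])

# OLS closed form: beta1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#                  beta0 = ybar - beta1 * xbar
x_bar, y_bar = X.mean(), y.mean()
beta1 = np.sum((X - x_bar) * (y - y_bar)) / np.sum((X - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

print(round(beta0, 4), round(beta1, 4))  # 10.7 0.15
```

These are the same estimates that minimise the sum of squared residuals, so sklearn's LinearRegression would return the same line for this toy data.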

The chart below might make this clearer:

[Link]
Problem Statement
Build a model which predicts sales based on the money spent on different platforms for marketing.

Understanding the Data Let's start with the following steps:

1. Importing data using the pandas library


2. Understanding the structure of the data

2*2=4 points

# Import necessary libraries numpy as np, pandas as pd, pyplot as plt

%matplotlib inline

# The above command sets the backend of matplotlib to the 'inline' backend.
# It means the output of plotting commands is displayed inline.

2*6 = 12 points

About the dataset



Let's import data from the following url:-

[Link]

Data Dict:
There are 3 Input Variables and 1 Output Variable (Sales).
The data type of all the input variables is float64. The data type of the output variable (Sales) is also float64.

# Import the data

df =

#drop radio and newspaper column


df =
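One possible way to fill in the cells above, sketched on a tiny stand-in frame. The real assignment would load the full 200-row CSV with pd.read_csv and the course-provided URL, which is not reproduced here; the five rows below echo the dataset head shown later.

```python
import pandas as pd

# Stand-in for the advertising dataset; the real notebook would use
# df = pd.read_csv(url) with the course-provided URL
df = pd.DataFrame({
    "TV":        [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio":     [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales":     [22.1, 10.4, 12.0, 16.5, 17.9],
})

# Drop the Radio and Newspaper columns, keeping only TV and Sales
df = df.drop(columns=["Radio", "Newspaper"])
print(df.shape)  # (5, 2)
```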

pandas shape attribute


The shape attribute of the pandas dataframe gives the dimensions of the dataframe.

# View the dimensions of df

(200, 2)

pandas head() method

# View the top 5 rows of df


TV Sales
0 230.1 22.1
1 44.5 10.4
2 17.2 12.0
3 151.5 16.5
4 180.8 17.9

pandas info() method

# View dataframe summary

<class '[Link]'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TV 200 non-null float64
1 Sales 200 non-null float64
dtypes: float64(2)
memory usage: 3.2 KB
None

pandas describe() method

# View descriptive statistics


TV Sales
count 200.000000 200.000000
mean 147.042500 15.130500
std 85.854236 5.283892
min 0.700000 1.600000
25% 74.375000 11.000000
50% 149.750000 16.000000
75% 218.825000 19.050000
max 296.400000 27.000000

Independent and Dependent Variables

Independent variable
Independent variable is also called Input variable and is denoted by X. In practical applications, independent variable is also called Feature
variable or Predictor variable. We can denote it as:-

Independent or Input variable (X) = Feature variable = Predictor variable

Dependent variable
Dependent variable is also called Output variable and is denoted by y.

Dependent variable is also called Target variable or Response variable. It can be denoted it as follows:-

Dependent or Output variable (y) = Target variable = Response variable

Video 2 : Linear Regression-Splitting and describing dataframe

2 points

# Declare feature variable and target variable


# TV and Sales data values are given by X and y respectively.

# Values attribute of pandas dataframe returns the numpy arrays.

X =

y =
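A sketch of how the two cells above might be filled in, assuming df holds the TV and Sales columns (a three-row stand-in is used here):

```python
import pandas as pd

df = pd.DataFrame({"TV": [230.1, 44.5, 17.2], "Sales": [22.1, 10.4, 12.0]})

# The .values attribute returns the underlying NumPy arrays
X = df["TV"].values
y = df["Sales"].values
print(X.shape, y.shape)  # (3,) (3,)
```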

Visual exploratory data analysis


Visualize the relationship between X and y by plotting a scatterplot between X and y.

Video 3: Linear Regression-EDA on dataset

2 points

# Plot scatter plot between X and y
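One way to draw that scatter plot with matplotlib, sketched on the five TV/Sales pairs from the dataset head. The Agg backend keeps the sketch headless; in Colab, %matplotlib inline handles display instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; notebooks use %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# First five TV/Sales pairs from the dataset head shown earlier
X = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
y = np.array([22.1, 10.4, 12.0, 16.5, 17.9])

fig, ax = plt.subplots()
ax.scatter(X, y)
ax.set_xlabel("TV advertising spend")
ax.set_ylabel("Sales")
ax.set_title("Sales vs TV")
fig.savefig("scatter.png")
```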


Hey buddy! Did you notice? The graph above shows some sort of relationship between Sales and TV. Don't you think this shows a positive linear
relation? That is, as TV's value increases, Sales increases, and vice versa.

Visualising Data Using Seaborn

Video 4 : Linear Regression-Reshaping concept

2*2=4 points

# import seaborn with alias sns

# import %matplotlib inline to visualise in the notebook


%matplotlib inline

# Visualise the relationship between the features and the response using scatterplots

# plot a pairplot also for df



Ohoo! We can see very well that you got good practice with visualisation in your EDA assignment. Anyway, the above graph also shows a
positive linear relation between TV and Sales.

Checking dimensions of X and y


We need to check the dimensions of X and y to make sure they are in right format for Scikit-Learn API.

2 points

# Print the dimensions of X and y

(200,)
(200,)

Reshaping X and y
Since we are working with only one feature variable, we need to reshape it using the NumPy reshape() method.

E.g., if you have an array of shape (3, 2) and reshape it with (-1, 1), the resulting array has only 1 column; since there are 6 elements in total, it
must have 6 rows, giving shape (6, 1).

You have seen the example above. Now, smarty, try reshaping your own data.
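The reshape step for this dataset might look like the following sketch, with arange standing in for the real (200,) TV and Sales arrays:

```python
import numpy as np

# Stand-ins for the (200,) TV and Sales arrays
X = np.arange(200, dtype=float)
y = np.arange(200, dtype=float)

# -1 lets NumPy infer the number of rows; 1 forces a single column
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)
print(X.shape, y.shape)  # (200, 1) (200, 1)
```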


2*2 = 4 points
# Reshape X and y

X =

y =

# Print the dimensions of X and y after reshaping

(200, 1)
(200, 1)

Cool right!

Difference in dimensions of X and y after reshaping


Hey! You can see the difference in the dimensions of X and y before and after reshaping.

It is essential in this case because getting the feature and target variable right is an important precursor to model building.

Performing Simple Linear Regression

Equation of linear regression


y = c + m1*x1 + m2*x2 + ... + mn*xn

y is the response


c is the intercept
m1 is the coefficient for the first feature
mn is the coefficient for the nth feature

In our case:

y = c + m1 × TV

The m values are called the model coefficients or model parameters.

Video 5 : Linear Regression-Fitting The Regression Model

Mechanics of the model


Hey! Before you read further, it is good to understand the generic structure of modelling with the scikit-learn library. Broadly, the steps to build
any model can be divided as follows:

Split the dataset into two sets: the training set and the test set. Then, instantiate the regressor lm and fit it on the training set with the fit
method.

In this step, the model learns the relationships in the training data (X_train, y_train).

Oh yeah! Now the model is ready to make predictions on the test data (X_test). Hence, predict on the test data using the predict method.

The steps are as follows:

Train test split


Split the dataset into two sets namely - train set and test set.

The model learns the relationships from the training data and predicts on the test data.

Hey Smarty!! It's absolutely fine if you didn't understand the theory well! We are here to help you get comfortable with all the concepts gradually
as we proceed through the upcoming assignments.

No fear when AI_4_All is here :)

2+2+3=7 points

# import train test split

# Split X and y into training and test data sets

X_train,X_test,y_train,y_test =

# print shapes of X_train,y_train, X_test, y_test

(140, 1)
(140, 1)
(60, 1)
(60, 1)
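A sketch of the split cell on stand-in arrays. test_size=0.3 and random_state=42 are assumptions, since the notebook does not state them, but a 70/30 split reproduces the 140/60 shapes printed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the reshaped (200, 1) arrays
X = np.arange(200, dtype=float).reshape(-1, 1)
y = np.arange(200, dtype=float).reshape(-1, 1)

# test_size=0.3 gives the 140/60 split; random_state is an assumption for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (140, 1) (60, 1)
```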

# Fit the linear model

# Instantiate the linear regression object lm

lm =


# Train the model using training data sets

# Predict on the test data


y_pred =
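A sketch of the fit and predict cells, run here on synthetic data because the real CSV is not bundled with this notebook. The true coefficients 0.055 and 7.2 below are made up to roughly echo the fitted values later in the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 7.2 + 0.055 * x plus a little noise (values are illustrative)
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 300, size=(140, 1))
y_train = 7.2 + 0.055 * X_train + rng.normal(0, 0.5, size=(140, 1))
X_test = rng.uniform(0, 300, size=(60, 1))

# Instantiate the regressor, train it, then predict on the test data
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
print(y_pred.shape)  # (60, 1)
```

After fitting, lm.coef_ and lm.intercept_ hold the estimated slope and intercept, which should land close to the true 0.055 and 7.2 used to generate the data.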

Model slope and intercept term


The model slope is given by lm.coef_ and model intercept term is given by lm.intercept_.

For example, if the estimated model slope and intercept values are 1.60509347 and -11.16003616, then the equation of the fitted regression
line will be:-

y = 1.60509347 * x - 11.16003616

2 points

# Compute model slope and intercept

a =
b =

# also print a and b

Estimated model slope, a: [[0.05483488]]


Estimated model intercept, b: (array([7.20655455]),)

# So comment below, our fitted regression line here is ?

#y=0.05483488 * x + 7.20655455

# That is our linear model.

Wohoo! Awesome job done!

Making predictions
To make a prediction on an individual TV value:

lm.predict(Xi)

where Xi is the TV data value of the ith observation.

2 points

# Predict Sales values for the first 5 TV data points only

array([[19.82406131],
[ 9.64670688],
[ 8.14971455],
[15.51403944],
[17.12070154]])

We know that you can also do prediction for all values of TV available in our dataset.

Can you show it now?


# prediction for all X present in the dataset

array([[19.82406131],
[ 9.64670688],
[ 8.14971455],
[15.51403944],
[17.12070154],
[ 7.68361804],
[10.35956037],
[13.79770758],
[ 7.67813455],
[18.16256433],
[10.83114037],
[18.9796041 ],
[ 8.51162478],
[12.55295572],
[18.39835433],
[17.92129084],
[10.92435967],
[22.63709085],
[11.00112851],
[15.28373293],
[19.18249317],
[20.22435596],
[ 7.93037501],
[19.72535852],
[10.62276781],
[21.6226455 ],
[15.04245944],
[20.37241015],
[20.84947364],
[11.07789734],
[23.26769201],
[13.39741293],
[12.53650525],
[21.77069968],
[12.45425293],


[23.14705527],
[21.84198503],
[11.30272037],
[ 9.56993804],
[19.70890805],
[18.31061852],
[16.91232898],
[23.30607643],
[18.55189201],
[ 8.58291013],
[16.8081427 ],
[12.12524362],
[20.36144317],
[19.66504015],
[10.87500827],
[18.16256433],
[12.71197688],
[19.0728234 ],
[17.21940433],
[21.61167852],
[18.11321294],
[ 7.6068492 ],
[14.67506572],
...])

Regression metrics for model performance


Now, it is the time to evaluate model performance.

For regression problems, there are two common metrics to evaluate model performance: RMSE (Root Mean Square Error) and the R-Squared
value. These are explained below:-

RMSE

RMSE is the standard deviation of the residuals. So, RMSE gives us the standard deviation of the variance left unexplained by the model. It is measured in the same units as the response variable.
RMSE is an absolute measure of fit. It tells us how spread out the residuals are, given by their standard deviation. The lower the RMSE, the better the model fits the data.


Formula: [Link]

R-Squared

(R2) Correlation explains the strength of the relationship between an independent and a dependent variable, whereas R-squared explains to what extent the variance of the dependent variable is explained by the independent variable(s).
So, if the R2 of a model is 0.50, then approximately half of the observed variation can be explained by the model's inputs.
In general, the higher the R2 score, the better the model fits the data. Its value usually ranges from 0 to 1, so we want it to be as close to 1 as possible.

Formula:

[Link]
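A sketch of how RMSE and the R2 score can be computed with scikit-learn, using five made-up actual/predicted pairs (not the notebook's real test set):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual vs predicted Sales values, for illustration only
y_test = np.array([22.1, 10.4, 12.0, 16.5, 17.9])
y_pred = np.array([19.8,  9.6,  8.1, 15.5, 17.1])

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(round(rmse, 4), round(r2, 4))  # 2.1345 0.7417
```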

2*2 = 4 points

Read this blog for metrics in regression: [Link]

Video 6 : Linear Regression-Metrics In Regression

# Calculate and print Root Mean Square Error(RMSE)

RMSE value: 2.2759

# Calculate and print r2_score


R2 Score value: 0.8149

Interpretation and Conclusion


The RMSE value has been found to be 2.2759. It means the standard deviation of our prediction errors is 2.2759, which is quite low. A lower
RMSE would indicate an even better fit. So, the model is a good fit to the data.

In business decisions, a common benchmark for the R2 score is 0.7: if the R2 score is >= 0.7, the model is considered good enough to deploy
on unseen data, whereas if it is < 0.7, the model is not good enough to deploy. Our R2 score has been found to be 0.8149. It means that this
model explains 81.49% of the variance in our dependent variable. So, the R2 score confirms that the model is good enough to deploy, because
it provides a good fit to the data.

Wohoo! Really good job done!

2 points

# Plot the Regression Line between X and Y as shown in below output.


As you can see above, the regression line fits the data quite well. Wow!

Residual analysis
A linear regression model may not represent the data appropriately. The model may be a poor fit to the data. So, we should validate our model
by defining and examining residual plots.

The difference between the observed value of the dependent variable (y) and the predicted value (ŷi) is called the residual and is denoted by e
or error. The scatter-plot of these residuals is called residual plot.

If the data points in a residual plot are randomly dispersed around the horizontal axis, with an approximately zero residual mean, a linear
regression model may be appropriate for the data. Otherwise, a non-linear model may be more appropriate.
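A residual plot along these lines can be sketched as follows, with made-up actual/predicted values; the Agg backend keeps the sketch headless, whereas the notebook relies on %matplotlib inline.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; notebooks use %matplotlib inline
import matplotlib.pyplot as plt

# Made-up actual vs predicted values, for illustration only
y_test = np.array([22.1, 10.4, 12.0, 16.5, 17.9])
y_pred = np.array([19.8,  9.6,  8.1, 15.5, 17.1])

# Residual = observed - predicted; random scatter around 0 suggests a linear fit
residuals = y_test - y_pred
fig, ax = plt.subplots()
ax.scatter(y_pred, residuals)
ax.axhline(0, color="red", linestyle="--")
ax.set_xlabel("Predicted Sales")
ax.set_ylabel("Residual")
fig.savefig("residuals.png")
```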

If we take a look at the generated 'Residual errors' plot, we can clearly see that the train data plot pattern is non-random. The same is true of
the test data plot pattern. So, it suggests that a non-linear model might provide a better fit.

Check this blog for residual analysis: [Link]

Video 7 : Linear Regression-Residual Errors

# Plotting residual errors



Checking for Overfitting and Underfitting


We will see training set score and test set score.

You can expect the training set score to be 0.7996, which is reasonably good. So, the model learned the relationships quite well from the
training data. The model also performs well on the test data, with a test score of 0.8149. This is a clear sign of a good, balanced fit. Hence,
we can validate our finding that the linear regression model provides a good fit to the data.

Underfitting: Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is
unable to capture the relationship between the input examples (often called X) and the target values (often called Y).

Overfitting: Your model is overfitting your training data when you see that the model performs well on the training data but does not perform
well on the evaluation data. This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.

You see the difference visually as below:

[Link]
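The score check in the next cell could look like the sketch below, run here on synthetic data. The 0.7996/0.8149 scores in the notebook come from the real TV/Sales data, so the exact numbers here will differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic linear data with a little noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 1))
y = 7.2 + 0.055 * X + rng.normal(0, 0.5, size=(200, 1))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
lm = LinearRegression().fit(X_train, y_train)

# .score returns R^2; similar train and test scores indicate a balanced fit
train_score = lm.score(X_train, y_train)
test_score = lm.score(X_test, y_test)
print(train_score > 0.9, test_score > 0.9)
```

A much higher training score than test score would point to overfitting; both scores being low would point to underfitting.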

2 points

# Checking for Overfitting or Underfitting the data by calculation score using score function.

Training set score: 0.7996


Test set score: 0.8149

Summary
We learnt the assumptions required for a linear regression model.
We built a linear regression model using sklearn and also got a basic idea of overfitting and underfitting.
We also did residual analysis to cross-check one of the linear regression assumptions.

Congratulations on building your first machine learning model! Smile please! :)

FeedBack
We hope you’ve enjoyed this course so far. We’re committed to helping you use the AIforAll course to its full potential so you can grow with us.
That’s why we need your help in the form of feedback here.

We appreciate the time you take to leave a thoughtful comment here.

[Link]


