26/09/2022, 16:58 Linear_Regression_withoutcode.ipynb - Colaboratory
LINEAR REGRESSION
Total points =51
Linear Regression
Linear Regression is a statistical technique used to find the linear relationship between a dependent variable and one or more independent
variables. It applies to supervised learning regression problems, where we try to predict a continuous variable.
Linear Regression can be further classified into two types: Simple and Multiple Linear Regression. Simple Linear Regression is the simplest form,
where we fit a straight line to the data.
Read this blog in incognito mode: [Link]
7e64afd614e1#:~:text=What%20is%20Linear%20Regression%3F,represented%20with%20a%20straight%20line.
Simple Linear Regression - Model Assumptions
The Linear Regression Model is based on several assumptions which are listed below:-
i. Linear relationship
ii. Multivariate normality
iii. No or little multicollinearity
iv. No auto-correlation
v. Homoscedasticity
i. Linear relationship
The relationship between response and feature variables should be linear. This linear relationship assumption can be tested by plotting a
scatter-plot between response and feature variables.
ii. Multivariate normality
This assumption is often stated as requiring all variables to be multivariate normal. A multivariate normal distribution means a vector of multiple
normally distributed variables, where any linear combination of the variables is also normally distributed. Strictly, what matters for inference is that the residuals are approximately normally distributed.
iii. No or little multicollinearity
It is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are
highly correlated.
iv. No auto-correlation
Also, it is assumed that there is little or no auto-correlation in the data. Autocorrelation occurs when the residual errors are not independent
from each other.
v. Homoscedasticity
Homoscedasticity describes a situation in which the variance of the error term (that is, the noise in the model) is the same across all values of the
independent variables. It means the spread of the residuals is roughly constant along the regression line. It can be checked by looking at a scatter plot of the residuals.
WATCH ALL VIDEOS IN THE PORTAL
Video 1 : Assumptions of Linear Regression
Simple Linear Regression (SLR)
Simple Linear Regression (or SLR) is one of the simplest models in machine learning. It models the linear relationship between one independent variable and
one dependent variable.
This assignment is based on the TV and Sales data. There is one independent or input variable, which represents the TV data and is denoted by
X. Similarly, there is one dependent or output variable, which represents the Sales and is denoted by y. We want to model the linear relationship
between these variables. This linear relationship can be written as a mathematical equation of the form:-
Y = β0 + β1*X ------------- (1)
In this equation, X and Y are called independent and dependent variables respectively,
β1 is the coefficient for independent variable and
β0 is the constant term.
β0 and β1 are called parameters of the model.
For simplicity, we can compare the above equation with the basic line equation of the form:-
y = ax + b ----------------- (2)
We can see that
slope of the line is given by, a = β1, and
intercept of the line by b = β0.
In this Simple Linear Regression model, we want to fit a line which estimates the linear relationship between X and Y. So, the question of fitting
reduces to estimating the parameters of the model β0 and β1.
Ordinary Least Square Method
The TV and Sales data are given by X and y respectively. We can draw a scatter plot between X and y which shows the relationship between
them.
Now, our task is to find a line which best fits this scatter plot. This line will help us to predict the value of any Target variable for any given
Feature variable. This line is called Regression line.
We can define an error function for any line. Then, the regression line is the one which minimizes the error function. Such an error function is
also called a Cost function.
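As a minimal sketch of the idea above, assume the cost function is the sum of squared errors, which is the usual choice for ordinary least squares. For a single feature, the slope and intercept that minimise it have a closed form. The five (TV, Sales) pairs are the first rows of the dataset shown later in this notebook; the function and variable names are illustrative only.

```python
import numpy as np

# First five (TV, Sales) rows of the dataset, used only as a tiny illustration.
x = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
y = np.array([22.1, 10.4, 12.0, 16.5, 17.9])


def sse(beta0, beta1):
    """Cost function: sum of squared errors of the line y = beta0 + beta1 * x."""
    residuals = y - (beta0 + beta1 * x)
    return np.sum(residuals ** 2)


# Closed-form least-squares estimates for a single feature.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print("slope:", beta1, "intercept:", beta0)
print("cost at the fitted line:", sse(beta0, beta1))
# Moving away from the fitted line increases the cost.
print("cost with a perturbed slope:", sse(beta0, beta1 + 0.01))
```

Five points are far too few for a reliable fit; the sketch only shows that the regression line is the one with the lowest cost.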
The chart below may make this clearer:
[Link]
Problem Statement
Build a model which predicts sales based on the money spent on different platforms for marketing.
Understanding the Data
Let's start with the following steps:
1. Importing data using the pandas library
2. Understanding the structure of the data
2*2=4 points
# Import necessary libraries numpy as np, pandas as pd, pyplot as plt
%matplotlib inline
# The above command sets the backend of matplotlib to the 'inline' backend.
# It means the output of plotting commands is displayed inline.
2*6 = 12 points
About the dataset
Let's import data from the following url:-
[Link]
Data Dict:
There are 3 input variables and 1 output variable (Sales).
The data type of all the input variables is float64. The data type of the output variable (Sales) is also float64.
# Import the data
df =
# Drop the radio and newspaper columns
df =
pandas shape attribute
The shape attribute of the pandas dataframe gives the dimensions of the dataframe.
# View the dimensions of df
(200, 2)
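The import, drop and shape steps can be sketched as follows. Because the dataset URL is not reproduced here, this sketch reads a small in-memory CSV instead. The TV and Sales values are the first five rows shown in this notebook; the column names Radio and Newspaper and their 0.0 values are placeholders assumed for illustration, not real data.

```python
import io
import pandas as pd

# Stand-in for the real CSV: TV and Sales are the first rows shown in this
# notebook; the Radio and Newspaper column names and their 0.0 values are
# placeholders, not real data.
csv_text = """TV,Radio,Newspaper,Sales
230.1,0.0,0.0,22.1
44.5,0.0,0.0,10.4
17.2,0.0,0.0,12.0
151.5,0.0,0.0,16.5
180.8,0.0,0.0,17.9
"""

# pd.read_csv also accepts a URL or a file path in place of the buffer.
df = pd.read_csv(io.StringIO(csv_text))

# Keep only TV and Sales by dropping the other two columns.
df = df.drop(columns=["Radio", "Newspaper"])

print(df.shape)   # (rows, columns)
print(df.head())
```

On the full dataset the same calls would report 200 rows instead of 5.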
pandas head() method
# View the top 5 rows of df
TV Sales
0 230.1 22.1
1 44.5 10.4
2 17.2 12.0
3 151.5 16.5
4 180.8 17.9
pandas info() method
# View dataframe summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TV 200 non-null float64
1 Sales 200 non-null float64
dtypes: float64(2)
memory usage: 3.2 KB
None
pandas describe() method
# View descriptive statistics
TV Sales
count 200.000000 200.000000
mean 147.042500 15.130500
std 85.854236 5.283892
min 0.700000 1.600000
25% 74.375000 11.000000
50% 149.750000 16.000000
75% 218.825000 19.050000
max 296.400000 27.000000
Independent and Dependent Variables
Independent variable
Independent variable is also called Input variable and is denoted by X. In practical applications, independent variable is also called Feature
variable or Predictor variable. We can denote it as:-
Independent or Input variable (X) = Feature variable = Predictor variable
Dependent variable
Dependent variable is also called Output variable and is denoted by y.
The dependent variable is also called the Target variable or Response variable. It can be denoted as follows:-
Dependent or Output variable (y) = Target variable = Response variable
Video 2 : Linear Regression-Splitting and describing dataframe
2 points
# Declare feature variable and target variable
# TV and Sales data values are given by X and y respectively.
# Values attribute of pandas dataframe returns the numpy arrays.
X =
y =
Visual exploratory data analysis
Visualize the relationship between X and y by plotting a scatterplot between X and y.
Video 3: Linear Regression-EDA on dataset
2 points
# Plot scatter plot between X and y
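One possible way to draw such a scatter plot with matplotlib is sketched below. It uses only the first five (TV, Sales) rows shown earlier, so the picture is much sparser than the one for the full dataset; the colour, labels and title are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# First five rows of the dataset, for illustration only.
X = np.array([230.1, 44.5, 17.2, 151.5, 180.8])  # TV
y = np.array([22.1, 10.4, 12.0, 16.5, 17.9])     # Sales

fig, ax = plt.subplots()
ax.scatter(X, y, color="blue", label="observations")
ax.set_xlabel("TV")
ax.set_ylabel("Sales")
ax.set_title("Sales vs TV")
ax.legend()
# With %matplotlib inline the figure is displayed when the cell finishes.
```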
Hey buddy! Did you notice? The above graph shows some sort of relationship between Sales and TV. Don't you think it shows a positive linear
relation? That is, as the TV value increases, Sales increases, and vice versa.
Visualising Data Using Seaborn
Video 4 : Linear Regression-Reshaping concept
2*2=4 points
# import seaborn with alias sns
# use %matplotlib inline to visualise in the notebook
%matplotlib inline
# Visualise the relationship between the features and the response using scatterplots
# plot a pairplot also for df
Ohoo! We can see that you got good visualisation practice in your EDA assignment. Anyway, the above graph also shows a
positive linear relation between TV and Sales.
Checking dimensions of X and y
We need to check the dimensions of X and y to make sure they are in the right format for the Scikit-Learn API.
2 points
# Print the dimensions of X and y
(200,)
(200,)
Reshaping X and y
Since we are working with only one feature variable, we need to reshape the arrays using the NumPy reshape() method.
E.g., if you have an array of shape (3,2) and reshape it with (-1, 1), the resulting array has only 1 column. With 6 elements, this is
only possible by having 6 rows, hence the shape (6,1).
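The example can be checked directly in NumPy. This is a small sketch; the array contents are arbitrary, and the second part uses the first five TV values shown earlier.

```python
import numpy as np

a = np.arange(6).reshape(3, 2)   # shape (3, 2), six elements in total
b = a.reshape(-1, 1)             # -1 lets NumPy infer the number of rows

print(a.shape)
print(b.shape)

# The same call turns a 1-D feature array into a single-column 2-D array.
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
print(tv.shape)
print(tv.reshape(-1, 1).shape)
```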
You have seen the above example. Now, smarty, try reshaping your own data.
2*2 = 4 points
# Reshape X and y
X =
y =
# Print the dimensions of X and y after reshaping
(200, 1)
(200, 1)
Cool right!
Difference in dimensions of X and y after reshaping
Hey! You can see the difference in dimensions of X and y before and after reshaping.
It is essential in this case because Scikit-Learn expects the feature array X to be two-dimensional, with shape (n_samples, n_features), and getting the feature and target variables right is an important precursor to model building.
Performing Simple Linear Regression
Equation of linear regression
y = c + m1*x1 + m2*x2 + ... + mn*xn
y is the response
c is the intercept
m1 is the coefficient for the first feature
mn is the coefficient for the nth feature
In our case:
y = c + m1*TV
The m values are called the model coefficients or model parameters.
Video 5 : Linear Regression-Fitting The Regression Model
Mechanics of the model
Hey! before you read further, it is good to understand the generic structure of modeling using the scikit-learn library. Broadly, the steps to build
any model can be divided as follows:
Split the dataset into two sets – the training set and the test set. Then, instantiate the regressor lm and fit it on the training set with the fit
method.
In this step, the model learns the relationships in the training data (X_train, y_train).
Oh yeah! Now the model is ready to make predictions on the test data (X_test), so predict on the test data using the predict method.
The steps are as follows:
Train test split
Split the dataset into two sets, namely the train set and the test set.
The model learns the relationships from the training data and predicts on the test data.
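The split, fit and predict steps described above can be sketched with scikit-learn as follows. The data here is synthetic stand-in data, not the real TV and Sales values. The test_size of 0.3 is inferred from the 140/60 shapes printed in this notebook, and random_state is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real TV/Sales values): 200 rows, one feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 1))
y = 7.0 + 0.05 * X + rng.normal(0, 2, size=(200, 1))

# test_size=0.3 gives a 140/60 split on 200 rows; random_state is an
# arbitrary choice made here for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

lm = LinearRegression()        # instantiate the regressor
lm.fit(X_train, y_train)       # learn from the training set
y_pred = lm.predict(X_test)    # predict on the held-out test set
print(y_pred[:5])
```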
Hey Smarty!! It's absolutely fine if you didn't understand the theory well! We are here to help you get comfortable with all the concepts slowly
as we proceed towards our upcoming assignments.
No fear when AI_4_All is here :)
2+2+3=7 points
# import train test split
# Split X and y into training and test data sets
X_train,X_test,y_train,y_test =
# print shapes of X_train,y_train, X_test, y_test
(140, 1)
(140, 1)
(60, 1)
(60, 1)
# Fit the linear model
# Instantiate the linear regression object lm
lm =
# Train the model using training data sets
# Predict on the test data
y_pred =
Model slope and intercept term
The model slope is given by lm.coef_ and the model intercept term is given by lm.intercept_.
For example, if the estimated model slope and intercept values are 1.60509347 and -11.16003616,
the equation of the fitted regression line will be:-
y = 1.60509347 * x - 11.16003616
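As a small sketch of reading these two attributes, the example below fits a model to noise-free points on the line y = 2x + 1, so the recovered slope and intercept should be 2 and 1. The points are made up for illustration and are unrelated to the TV and Sales data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free points on the line y = 2x + 1 (illustrative values), so the
# fitted slope and intercept should recover 2 and 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X + 1.0

lm = LinearRegression().fit(X, y)

a = lm.coef_        # slope; shape (1, 1) because y is a 2-D column
b = lm.intercept_   # intercept; shape (1,)
print("Estimated model slope, a:", a)
print("Estimated model intercept, b:", b)
```

Because y was passed as a 2-D column, the slope comes back as a nested array, which is why the printed slope in this notebook appears inside double brackets.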
2 points
# Compute model slope and intercept
a =
b =
# also print a and b
Estimated model slope, a: [[0.05483488]]
Estimated model intercept, b: (array([7.20655455]),)
# So, comment below: what is our fitted regression line here?
# y = 0.05483488 * x + 7.20655455
# That is our linear model.
Wohoo! Awesome job done!
Making predictions
To make a prediction on an individual TV value, use
lm.predict(Xi)
where Xi is the TV data value of the ith observation.
2 points
# Predict Sales values for the first five TV values only
array([[19.82406131],
[ 9.64670688],
[ 8.14971455],
[15.51403944],
[17.12070154]])
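These five predictions can be cross-checked by hand with the fitted line y = 0.05483488 * x + 7.20655455 and the first five TV values from the head() output. The sketch below does the arithmetic with NumPy; the tolerance is an assumption chosen to absorb the rounding of the printed coefficients.

```python
import numpy as np

a, b = 0.05483488, 7.20655455          # printed slope and intercept
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])   # first five TV values

manual = a * tv + b                    # y = a * x + b, applied element-wise
printed = np.array([19.82406131, 9.64670688, 8.14971455,
                    15.51403944, 17.12070154])

print(manual)
# The two agree up to the rounding of the printed coefficients.
print(np.allclose(manual, printed, atol=1e-5))
```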
We know that you can also do prediction for all values of TV available in our dataset
Can you show it now?
# prediction for all X present in the dataset
array([[19.82406131],
[ 9.64670688],
[ 8.14971455],
[15.51403944],
[17.12070154],
[ 7.68361804],
[10.35956037],
[13.79770758],
[ 7.67813455],
[18.16256433],
[10.83114037],
[18.9796041 ],
[ 8.51162478],
[12.55295572],
[18.39835433],
[17.92129084],
[10.92435967],
[22.63709085],
[11.00112851],
[15.28373293],
[19.18249317],
[20.22435596],
[ 7.93037501],
[19.72535852],
[10.62276781],
[21.6226455 ],
[15.04245944],
[20.37241015],
[20.84947364],
[11.07789734],
[23.26769201],
[13.39741293],
[12.53650525],
[21.77069968],
[12.45425293],
[23.14705527],
[21.84198503],
[11.30272037],
[ 9.56993804],
[19.70890805],
[18.31061852],
[16.91232898],
[23.30607643],
[18.55189201],
[ 8.58291013],
[16.8081427 ],
[12.12524362],
[20.36144317],
[19.66504015],
[10.87500827],
[18.16256433],
[12.71197688],
[19.0728234 ],
[17.21940433],
[21.61167852],
[18.11321294],
[ 7.6068492 ],
[14.67506572]
Regression metrics for model performance
Now, it is the time to evaluate model performance.
For regression problems, two common ways to measure model performance are RMSE (Root Mean Square Error) and the R-Squared
value. These are explained below:-
RMSE
RMSE is the standard deviation of the residuals. So, RMSE gives us the standard deviation of the variance left unexplained by the model.
RMSE is an absolute measure of fit. It tells us how spread out the residuals are, given by the standard deviation of the residuals.
Formula: [Link]
R-Squared
(R2) Correlation explains the strength of the relationship between an independent and dependent variable,whereas R-square explains to
So, if the R2 of a model is 0.50, then approximately half of the observed variation can be explained by the model's inputs.
In general, the higher the R2 Score value, the better the model fits the data. Usually, its value ranges from 0 to 1. So, we want its
Formula:
[Link]
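Both metrics can be computed by hand or with scikit-learn. The sketch below uses the standard definitions RMSE = sqrt(mean((y - ŷ)^2)) and R2 = 1 - SS_res / SS_tot on a few made-up numbers, which are not values from this notebook, and shows that the manual results agree with sklearn.metrics.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small illustrative arrays, not values from the notebook.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

# RMSE = sqrt(mean((y - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_hat)

print("RMSE value: {:.4f}".format(rmse_sklearn))
print("R2 Score value: {:.4f}".format(r2_sklearn))
```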
2*2 = 4 points
Read this blog for metrics in regression: [Link]
Video 6 : Linear Regression-Metrics In Regression
# Calculate and print Root Mean Square Error(RMSE)
RMSE value: 2.2759
# Calculate and print r2_score
R2 Score value: 0.8149
Interpretation and Conclusion
The RMSE value has been found to be 2.2759. It means the standard deviation of our prediction errors is 2.2759, which is fairly small relative to the Sales range of 1.6 to 27.0. The RMSE
could be lower still with a different model or split, but this value indicates that the model is a good fit to the data.
In business decisions, a commonly used rule-of-thumb benchmark for the R2 score value is 0.7. It means if R2 score value >= 0.7, then the model is good enough to deploy
on unseen data, whereas if R2 score value < 0.7, then the model is not good enough to deploy. Our R2 score value has been found to be 0.8149.
It means that this model explains 81.49 % of the variance in our dependent variable. So, the R2 score value confirms that the model is good
enough to deploy because it provides a good fit to the data.
Wohoo! Really good job done!
2 points
# Plot the Regression Line between X and Y as shown in below output.
As you can see above, the regression line fits the data quite well. Wow!
Residual analysis
A linear regression model may not represent the data appropriately. The model may be a poor fit to the data. So, we should validate our model
by defining and examining residual plots.
The difference between the observed value of the dependent variable (yi) and the predicted value (ŷi) is called the residual and is denoted by e
or error. The scatter plot of these residuals is called a residual plot.
If the data points in a residual plot are randomly dispersed around the horizontal axis and the residual mean is approximately zero, a linear regression
model may be appropriate for the data. Otherwise, a non-linear model may be more appropriate.
If we take a look at the generated ‘Residual errors’ plot, we can see that the train data plot pattern is non-random. The same is the case with
the test data plot pattern. So, it suggests that a non-linear model might give a better fit.
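A residual plot can be produced along the following lines. This is a sketch on synthetic data that is linear by construction, so the residuals here should scatter randomly around zero, unlike the pattern described for the real data. Plotting residuals against predicted values and the styling choices are assumptions made for this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with a truly linear relationship plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 300, size=(200, 1))
y = 7.0 + 0.05 * X + rng.normal(0, 2, size=(200, 1))

lm = LinearRegression().fit(X, y)
y_fit = lm.predict(X)
residuals = y - y_fit              # e = observed - predicted

# With an intercept in the model, least-squares residuals average to zero
# on the data used for fitting.
print("mean residual:", residuals.mean())

fig, ax = plt.subplots()
ax.scatter(y_fit, residuals, s=10)
ax.axhline(0, color="red", linewidth=1)   # reference line at zero residual
ax.set_xlabel("Predicted value")
ax.set_ylabel("Residual")
ax.set_title("Residual errors")
```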
Check this blog for residual analysis: [Link]
c3c70e8ab378#:~:text=Residuals,and%20the%20observed%20actual%20value.
Video 7 : Linear Regression-Residual Errors
# Plotting residual errors
Checking for Overfitting and Underfitting
We will look at the training set score and the test set score.
You can expect the training set score to be 0.7996, which is reasonably good. So, the model learned the relationships quite appropriately from
the training data. The model also performs well on the test data, as the test score will be 0.8149. Similar scores on both sets are a clear sign of a good, balanced fit. Hence,
we can validate our finding that the linear regression model provides a good fit to the data.
Underfitting: Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is
unable to capture the relationship between the input examples (often called X) and the target values (often called Y).
Overfitting: Your model is overfitting your training data when you see that the model performs well on the training data but does not perform
well on the evaluation data. This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.
You can see the difference visually below:
[Link]
2 points
# Check for overfitting or underfitting by calculating the scores with the score function.
Training set score: 0.7996
Test set score: 0.8149
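The comparison of training and test scores can be sketched as below. The data is synthetic stand-in data, so the printed scores will differ from the values above, and random_state is an arbitrary choice. For a scikit-learn regressor, the score method returns the R2 value on the data passed to it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, so the printed scores will differ from the
# 0.7996 and 0.8149 reported in this notebook.
rng = np.random.default_rng(2)
X = rng.uniform(0, 300, size=(200, 1))
y = 7.0 + 0.05 * X + rng.normal(0, 2, size=(200, 1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
lm = LinearRegression().fit(X_train, y_train)

# For a regressor, score() returns R^2 on the data passed in.
train_score = lm.score(X_train, y_train)
test_score = lm.score(X_test, y_test)
print("Training set score: {:.4f}".format(train_score))
print("Test set score: {:.4f}".format(test_score))
```

A training score far above the test score would hint at overfitting, while low scores on both would hint at underfitting.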
Summary
We learnt the assumptions required for a linear regression model.
We built a linear regression model using sklearn and also got the basic idea of overfitting and underfitting.
We also did residual analysis to cross-check one of the linear regression assumptions.
Congratulations on building your first machine learning model! Smile please! :)
FeedBack
We hope you’ve enjoyed this course so far. We’re committed to helping you use the AIforAll course to its full potential so you can grow with us.
That’s why we need your help in the form of feedback here.
We appreciate your time and your thoughtful comments here:
[Link]