DA unit-III
UNIT-III
Regression Concepts
• It is a Predictive modeling technique where the target variable to be estimated is continuous.
Examples of applications of regression
• Applications of regression are numerous and occur in almost every field, including engineering, the
physical and social sciences, and the biological sciences.
• Predicting a stock market index using other economic indicators.
• Projecting the total sales of a company based on the amount spent on advertising.
Regression
Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable
(often called the ‘outcome variable’) and one or more independent variables (often called ‘predictors’,
‘covariates’, or ‘features’).
The terminology you will often hear in connection with regression analysis is:
Dependent variable or target variable: Variable to predict.
Independent variable or predictor variable: Variables to estimate the dependent variable.
Outlier: An observation that differs significantly from other observations. It should be handled carefully since it may
distort the results.
Multicollinearity: Situation in which two or more independent variables are highly linearly related.
Regression is the task of learning a target function ‘f’ that maps each attribute set x into a continuous-
valued output y.
For an input x, if the output is continuous, this is called a regression problem. For example, based on historical
information about smartphone demand in our mobile shop, you are asked to predict the demand for the next
month. Regression is concerned with the prediction of continuous quantities.
Why do we use Regression Analysis?
As mentioned above, regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:
Let’s say, you want to estimate growth in sales of a company based on current economic conditions.
You have the recent company data which indicates that the growth in sales is around two and a half times the
growth in the economy. Using this insight, we can predict future sales of the company based on current & past
information.
There are multiple benefits of using regression analysis. They are as follows:
1. It indicates the significant relationships between the dependent variable and the independent variables.
2. It indicates the strength of impact of multiple independent variables on a dependent variable.
Regression analysis also allows us to compare the effects of variables measured on different scales, such as the
effect of price changes and the number of promotional activities. These benefits help market researchers / data
analysts / data scientists to evaluate and select the best set of variables to be used for building predictive
models.
How many types of regression techniques do we have?
There are various kinds of regression techniques available to make predictions. These techniques are mostly
driven by three metrics (number of independent variables, type of dependent variables and shape of regression
line).
For the creative ones, you can even cook up new regressions, if you feel the need to use a combination of the
parameters above, which people haven’t used before. But before you start that, let us understand the most
commonly used regressions:
The goal of regression
• To find a target function that can fit the input data with minimum error.
• The error function for a regression task can be expressed in terms of the sum of absolute or
squared errors:
Error = Σi |yi − f(xi)|   or   Error = Σi (yi − f(xi))²
Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually among the first few topics
which people pick while learning predictive modeling. In this technique, the dependent variable is continuous,
independent variable(s) can be continuous or discrete, and nature of regression line is linear.
Linear Regression establishes a relationship between dependent variable (Y) and one or more
independent variables (X) using a best fit straight line (also known as regression line).
It is represented by an equation Y = a + b*X + e, where a is the intercept, b is the slope of the line and e is the error term.
This equation can be used to predict the value of target variable based on given predictor
variable(s).
The difference between simple linear regression and multiple linear regression is that multiple linear
regression has more than one independent variable, whereas simple linear regression has only one independent
variable. Now, the question is: "How do we obtain the best fit line?"
In the simplest case, the regression model allows for a linear relationship between the forecast variable y
and a single predictor variable x:
y_t = β0 + β1x_t + ε_t.
An artificial example of data from such a model is shown in the figure. The coefficients β0 and β1 denote the
intercept and the slope of the line respectively. The intercept β0 represents the predicted value of y when
x = 0. The slope β1 represents the average predicted change in y resulting from a one-unit increase in x.
The simplest case of linear regression is to find a relationship using a linear model (i.e., a line) between one input
independent variable (a single input feature) and an output dependent variable. This is called Bivariate Linear
Regression.
On the other hand, when a linear model represents the relationship between a dependent output and
multiple independent input variables, it is called Multivariate Linear Regression.
The dependent variable is continuous and independent variables may or may not be continuous. We find the
relationship between them with the help of the best fit line which is also known as the Regression line.
This task can be easily accomplished by Least Square Method. It is the most common method used for fitting a
regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the
vertical deviations from each data point to the line. Because the deviations are first squared, when added, there
is no cancelling out between positive and negative values.
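As a quick sketch (not part of the original notes), the same least-squares fit can be obtained with NumPy; the data below is hypothetical:

import numpy as np

# Hypothetical paired observations (x, y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

# np.polyfit with degree 1 minimizes the sum of squared vertical deviations,
# returning the slope and intercept of the best-fit regression line.
slope, intercept = np.polyfit(x, y, 1)
print(f"y = {slope:.3f}x + {intercept:.3f}")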
• We can evaluate the model performance using the metric R-square. In multiple linear regression,
multiple predictor terms are added together but the model is still linear in the parameters.
Important Points:
There must be a linear relationship between the independent and dependent variables.
Multiple regression can suffer from multicollinearity, autocorrelation, and heteroskedasticity.
Linear regression is very sensitive to outliers. They can terribly affect the regression line and
eventually the forecasted values.
Multicollinearity can increase the variance of the coefficient estimates and make the estimates very
sensitive to minor changes in the model. The result is that the coefficient estimates are unstable.
In the case of multiple independent variables, we can go with forward selection, backward
elimination, or the stepwise approach for selection of the most significant independent variables.
Imagine you have some points, and want to have a line that best fits them like this:
We can place the line "by eye": try to have the line as close as possible to all points, and a similar number of
points above and below the line.
But for better accuracy let's see how to calculate the line using Least Squares Regression.
The Line
Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a line
y = mx + b
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)
Steps
To find the line of best fit for N points:
Step 1: For each (x,y) point calculate x² and xy
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy (Σ means "sum up")
Step 3: Calculate Slope m:
m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
(N is the number of points.)
Step 4: Calculate Intercept b:
b = (Σy − m Σx) / N
Step 5: Assemble the equation of a line:
y = mx + b
Done!
Example
Let's have an example to see how to do it!
Example: Sam found how many hours of sunshine vs how many ice creams were sold at the shop from Monday to Friday:

Hours of Sunshine (x) | Ice Creams Sold (y)
2 | 4
3 | 5
5 | 7
7 | 10
9 | 15
Let us find the best m (slope) and b (y-intercept) that suits that data
y = mx + b
Step 1: For each (x,y) calculate x² and xy:

x | y | x² | xy
2 | 4 | 4 | 8
3 | 5 | 9 | 15
5 | 7 | 25 | 35
7 | 10 | 49 | 70
9 | 15 | 81 | 135
Step 2: Sum all x, y, x² and xy:
Σx = 26, Σy = 41, Σx² = 168, Σxy = 263 (and N = 5)
Step 3: Calculate Slope m:
m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²) = (5×263 − 26×41) / (5×168 − 26²) = 249 / 164 ≈ 1.518
Step 4: Calculate Intercept b:
b = (Σy − m Σx) / N = (41 − 1.518×26) / 5 ≈ 0.305
Step 5: Assemble the equation of a line:
y = mx + b
y = 1.518x + 0.305
Comparing each observed y with the value predicted by the line (error = predicted − observed):

x | y | ŷ = 1.518x + 0.305 | error
2 | 4 | 3.34 | −0.66
3 | 5 | 4.86 | −0.14
5 | 7 | 7.89 | 0.89
7 | 10 | 10.93 | 0.93
9 | 15 | 13.97 | −1.03
Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:
Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the above equation to
estimate that he will sell y = 1.518 × 8 + 0.305 ≈ 12.45 ice creams.
Sam makes fresh waffle cone mixture for 14 ice creams just in case.
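The same calculation can be scripted. A minimal Python sketch of the formulas above using the sunshine/ice-cream data (the variable names are my own, not from the notes):

xs = [2, 3, 5, 7, 9]          # hours of sunshine
ys = [4, 5, 7, 10, 15]        # ice creams sold
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
b = (sum_y - m * sum_x) / n                                    # intercept

print(f"y = {m:.3f}x + {b:.3f}")                          # roughly y = 1.518x + 0.305
print(f"Prediction for 8 hours of sun: {m * 8 + b:.2f}")  # about 12.45 ice creams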
And for Multiple Linear Regression, since we have more than one independent variable, the equation
becomes:
Y = β0 + β1X1 + β2X2 + … + βnXn + e
Let’s interpret the graph above. In linear regression the best fit line will look somewhat like this; the only
difference will be the number of data points. To make it easier, a small number of data points has been taken.
Suppose there’s an observed value Yi. The squared distance between Yi and the predicted value, summed over all points, is what we call the
“SUM OF SQUARED ERRORS” (SSE). This is the unexplained variance and we have to minimize it
to get the best accuracy.
The squared distance between the predicted value y_hat and the mean of the dependent variable, summed over all points, is called the “SUM OF
SQUARES DUE TO REGRESSION” (SSR). This is the explained variance of our model and we want to maximize it.
The total variation in the model (SSR + SSE = SST) is called the “SUM OF SQUARES TOTAL”.
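These quantities connect to the R-square metric mentioned earlier: R² = SSR / SST = 1 − SSE / SST. A small Python sketch, reusing the fitted ice-cream line from above (variable names are my own):

xs = [2, 3, 5, 7, 9]
ys = [4, 5, 7, 10, 15]
m, b = 1.518, 0.305                       # slope and intercept from the example
y_hat = [m * x + b for x in xs]           # predicted values
y_bar = sum(ys) / len(ys)                 # mean of the observed values

sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sst = sum((y - y_bar) ** 2 for y in ys)                # total variation (≈ SSR + SSE)

print(f"R-square = {ssr / sst:.3f}")      # close to 1 for this well-fitting line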
1. Positive Relationship – When the two variables move in the same direction and the regression line has an upward slope,
the variables are said to be in a Positive Relationship; it means that if we increase the
value of x (the independent variable) then we will see an increase in our dependent variable.
2. Negative Relationship – When the two variables move in opposite directions and the regression line has a downward slope,
the variables are said to be in a Negative Relationship; it means that if we increase
the value of the independent variable (x) then we will see a decrease in our dependent variable (y).
3. No Relationship – If the best fit line is flat (not sloped) then we can say that there is no relationship among
the variables. It means there will be no change in our dependent variable (y) by increasing or decreasing our
independent variable (x) value.
Correlation
When two sets of data are strongly linked together we say they have a High Correlation.
The word Correlation is made of Co- (meaning "together"), and Relation
Note:
Model building is the process of choosing methods to collect data, and then understanding and focusing on that
data. The importance of the data must be understood in order to find a statistical or simulation model that gives understanding
and can even make predictions. For all these reasons, model building is an important skill to acquire in
every field of science.
This process is very much in line with the scientific method: we learn about the things under investigation through models, use them to
gain understanding, and make predictions that can be tested. The
process of building viable models involves asking questions, gathering and manipulating data, building
models, and ultimately testing and evaluating them.
We are going to discuss the life cycle phases of data analytics and cover them one by one.
Phase 1: Discovery –
The team comes to know about the data sources needed and available for the project.
The team formulates an initial hypothesis that can later be tested with data.
Phase 2: Data Preparation –
Steps to explore, preprocess, and condition data prior to modeling and analysis.
It requires the presence of an analytic sandbox; the team extracts, loads, and transforms data to get it
into the sandbox.
Data preparation tasks are likely to be performed multiple times and not in a predefined order.
Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
Logistic Regression
Classification techniques are an essential part of machine learning and data mining applications. Approximately
70% of problems in data science are classification problems. There are many classification methods
available, but logistic regression is a common and useful regression method for solving the binary
classification problem. Another category of classification is multinomial classification, which handles the cases
where multiple classes are present in the target variable.
Logistic regression can be used for various classification problems such as spam detection, diabetes prediction,
whether a given customer will purchase a particular product or churn to a competitor, whether a user
will click on a given advertisement link or not, and many more examples are in the bucket.
Logistic Regression is one of the simplest and most commonly used machine learning algorithms for two-class
classification. It is easy to implement and can be used as the baseline for any binary classification problem. Its
basic fundamental concepts are also constructive in deep learning. Logistic regression describes and estimates
the relationship between one dependent binary variable and the independent variables.
Definition of Logistic Regression
Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is
dichotomous in nature. Dichotomous means there are only two possible classes. For example, it can be used for
cancer detection problems. It computes the probability of an event occurrence.
We can call Logistic Regression a linear regression model, but Logistic Regression uses a more complex
cost function and passes the linear output through the ‘sigmoid function’, also known as the ‘logistic
function’, instead of using a linear function directly.
The hypothesis of logistic regression limits the output to values between 0 and 1. Therefore linear
functions fail to represent it, as they can produce values greater than 1 or less than 0, which is not possible under the
hypothesis of logistic regression.
What is the Sigmoid Function?
In order to map predicted values to probabilities, we use the sigmoid function. The function maps any real value
into another value between 0 and 1:
sigmoid(z) = 1 / (1 + e^(−z))
The sigmoid function, also called the logistic function, gives an ‘S’-shaped curve that can take any real-valued
number and map it into a value between 0 and 1. As the curve goes to positive infinity, the predicted y approaches
1, and as the curve goes to negative infinity, the predicted y approaches 0. If the output of the
sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can
classify it as 0 or NO. The output can also be read directly as a probability. For example: if the output is 0.75, we can say
there is a 75 percent chance that the patient will suffer from cancer.
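A minimal Python sketch of the sigmoid mapping and the 0.5 threshold described above (the function names are illustrative, not from a particular library):

import math

def sigmoid(z: float) -> float:
    # Map any real value into the (0, 1) interval.
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability: float, threshold: float = 0.5) -> int:
    # Return 1 (YES) if the probability exceeds the threshold, else 0 (NO).
    return 1 if probability > threshold else 0

p = sigmoid(1.1)      # some linear score z = b0 + b1*x fed through the sigmoid
print(round(p, 2))    # about 0.75, i.e. a 75 percent chance
print(classify(p))    # 1, classified as YES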
Binary Logistic Regression: The target variable has only two possible outcomes such as Spam or Not
Spam, Cancer or No Cancer.
Example:
Whether or not to lend to a bank customer (outcomes are yes or no).
Assessing cancer risk (outcomes are high or low).
Will a team win tomorrow’s game (outcomes are yes or no).
Multinomial Logistic Regression: In such a kind of classification, dependent variable can have 3 or
more possible unordered types or the types having no quantitative significance. For example, these
variables may represent “Type A” or “Type B” or “Type C”.
Example:
Color(Red,Blue, Green)
School Subjects (Science, Math and Art)
Ordinal Logistic Regression: In such a kind of classification, dependent variable can have 3 or more
possible ordered types or the types having a quantitative significance. For example, these variables may
represent “poor” or “good”, “very good”, “Excellent” and each category can have the scores like 0,1,2,3.
Example:
Medical condition (Critical, Serious, Stable, Good)
Survey results (Disagree, Neutral, Agree)
Linear regression gives you a continuous output, but logistic regression provides a discrete output. An example
of a continuous output is a house price or a stock price. Examples of discrete output are predicting whether
a patient has cancer or not, and predicting whether a customer will churn.
Linear regression is estimated using Ordinary Least Squares (OLS) while logistic regression is estimated using
Maximum Likelihood Estimation (MLE) approach.
Linear Regression vs Logistic Regression:
• Linear regression is used to predict a continuous dependent variable using a given set of independent variables; logistic regression is used to predict a categorical dependent variable using a given set of independent variables.
• Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
• In linear regression, we predict the value of continuous variables; in logistic regression, we predict the values of categorical variables.
• In linear regression, we find the best fit line, by which we can easily predict the output; in logistic regression, we find the S-curve, by which we can classify the samples.
• The least squares estimation method is used to estimate the parameters of linear regression; the maximum likelihood estimation method is used to estimate the parameters of logistic regression.
• The output of linear regression must be a continuous value, such as price, age, etc.; the output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
• In linear regression, there may be collinearity between the independent variables; in logistic regression, there should not be collinearity between the independent variables.
What Is Logistic Regression Used For?
Here is a more realistic and detailed scenario for when logistic regression might be used:
Logistic regression may be used when predicting whether bank customers are likely to default on their
loans. This is a calculation a bank makes when deciding if it will or will not lend to a customer and
assessing the maximum amount the bank will lend to those it has already deemed to be creditworthy. In
order to make this calculation, the bank will look at several factors. Lend is the target in this logistic
regression, and based on the likelihood of default that is calculated, a lender will choose whether to take
the risk of lending to each customer.
These factors, also known as features or independent variables, might include credit score,
income level, age, job status, marital status, gender, the neighborhood of current residence and
educational history.
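As an illustration of this scenario, here is a hedged sketch using scikit-learn; the feature names and the tiny dataset below are hypothetical, invented only to show the shape of such a model:

from sklearn.linear_model import LogisticRegression

# Each row: [credit_score, income_in_thousands, age]; label 1 = defaulted, 0 = repaid.
X = [
    [580, 25, 23],
    [720, 80, 45],
    [640, 40, 31],
    [760, 95, 52],
    [600, 30, 27],
    [700, 70, 40],
]
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Probability that a new applicant defaults, then a lend / decline decision.
new_applicant = [[660, 55, 35]]
p_default = model.predict_proba(new_applicant)[0][1]
print(f"Estimated default probability: {p_default:.2f}")
print("Decision:", "decline" if p_default > 0.5 else "lend")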
Logistic regression is also often used for medical research and by insurance companies. In order to
calculate cancer risks, researchers would look at certain patient habits and genetic predispositions as
predictive factors. To assess whether or not a patient is at a high risk of developing cancer, factors such
as age, race, weight, smoking status, drinking status, exercise habits, overall medical history, family
history of cancer and place of residence and workplace, accounting for environmental factors, would be
considered.
Logistic regression is used in many other fields and is a common tool of data scientists.
Here the likelihood function can be used in hypothesis testing for finding the probability of various outcomes
using the set of parameters defined in the null hypothesis.
The main goal of maximum likelihood estimation is to make inferences about the population that
generated the sample, by evaluating the joint density at the observed data set. As the likelihood function above suggests,
it is maximized by choosing the parameter values under which the observed data are most probable.
The motive of the estimation is therefore to select the best-fitting parameters for the model so as to make the data most
probable. The specific value that maximizes the likelihood function Ln is called the maximum
likelihood estimate.
The MLE is a "likelihood" maximization method, while OLS is a distance-minimizing approximation method.
Maximizing the likelihood function determines the parameters that are most likely to produce the observed data.
From a statistical point of view, MLE sets the mean and variance as parameters in determining the specific
parametric values for a given model. This set of parameters can be used for predicting the data needed in a
normal distribution.
Ordinary least squares estimates are computed by fitting a regression line on given data points that has the
minimum sum of squared deviations (least square error). Both are used to estimate the parameters of a
linear regression model. MLE assumes a joint probability mass or density function, while OLS doesn't require any
stochastic assumptions for minimizing distance.
Maximum likelihood estimation, or MLE, is a method used in estimating the parameters of a statistical model,
and for fitting a statistical model to data. If you want to find the height measurement of every basketball player
in a specific location, you can use the maximum likelihood estimation. Normally, you would encounter
problems such as cost and time constraints. If you could not afford to measure all of the basketball players’
heights, the maximum likelihood estimation would be very handy. Using the maximum
likelihood estimation, you can estimate the mean and variance of the height of your subjects. The MLE would
set the mean and variance as parameters in determining the specific parametric values in a given model.
To sum it up, maximum likelihood estimation gives a set of parameters which can be used for predicting
the data needed in a normal distribution: the parameters under which the given, fixed set of data and its probability model are most likely
to have been produced. MLE gives us a unified approach to estimation. But in
some cases we cannot use maximum likelihood estimation, because of recognized errors or because the problem
does not actually exist in reality.
“OLS” stands for “ordinary least squares” while “MLE” stands for “maximum likelihood estimation.”
The ordinary least squares, or OLS, can also be called the linear least squares. This is a method for
approximately determining the unknown parameters located in a linear regression model.
Maximum likelihood estimation, or MLE, is a method used in estimating the parameters of a
statistical model and for fitting a statistical model to data.
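As a hedged illustration of the basketball-heights example above: for a normal model, the maximum likelihood estimates of the mean and variance are simply the sample mean and the (biased) sample variance. A small Python sketch, with made-up height data:

# Hypothetical sample of player heights in centimetres (invented for illustration).
heights = [198.0, 185.5, 210.2, 192.3, 201.7, 188.9, 205.4]

n = len(heights)
mu_hat = sum(heights) / n                               # MLE of the mean
var_hat = sum((h - mu_hat) ** 2 for h in heights) / n   # MLE of the variance (divides by n, not n - 1)

print(f"Estimated mean height: {mu_hat:.1f} cm")
print(f"Estimated variance:    {var_hat:.1f} cm^2")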
Model Theory
Model Theory is the part of mathematics which shows how to apply logic to the study of structures in pure
mathematics. On the one hand it is the ultimate abstraction; on the other, it has immediate applications to
everyday mathematics.
The fundamental tenet of Model Theory is that mathematical truth, like all truth, is relative. A statement may be
true or false, depending on how and where it is interpreted.
This isn't necessarily due to mathematics itself, but is a consequence of the language that we use to express
mathematical ideas.
Model Theory is divided into two parts, namely pure and applied. Pure model theory studies the abstract
properties of first-order theories and from there derives structure theorems for their models. Applied model
theory studies concrete algebraic structures from a model-theoretic point of view and then uses results
from pure model theory, together with uniformities of definition. Applied model theory is connected
strongly with other branches of mathematics.
Model fit statistics describe and test the overall fit of the model.
They measure the similarity between the fitted model and the actual outcome values.
Assessing model fit can be challenging when researchers want a fit measure that
accounts for variability in model complexity, model misspecification, and sample size.
Fit model describes the relationship between a response variable and one or more predictor variables.
There are many different models that you can fit including simple linear regression, multiple linear regression,
analysis of variance (ANOVA), analysis of covariance (ANCOVA), and binary logistic regression.
Linear fit
A linear model describes the relationship between a continuous response variable and the explanatory variables
using a linear function.
Logistic fit
A logistic model describes the relationship between a categorical response variable and the explanatory
variables using a logistic function.
• Sum of Squared Errors (SSE): Sum of squared differences between predicted and observed values.
Measures deviation from the actual values.
• Log-likelihood (LL): A Kullback-Leibler-based measure of model fit to the observed data. It favours
the model that is most likely to have generated the in-sample data.
• Akaike Information Criterion (AIC): AIC allows comparison between nested, overlapping, or
non-nested models that have different numbers of parameters. It selects the model that makes out-of-sample
data most likely, and it assumes that the models are correctly specified.
• Akaike Information Criterion with finite sample correction (AICc): AICc enables comparison
between nested, overlapping, or non-nested models that have different numbers of parameters, with a
correction for small sample sizes. It assumes that the models are correctly specified.
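For reference, the standard formulas are AIC = 2k − 2 ln(L) and AICc = AIC + 2k(k + 1)/(n − k − 1), where k is the number of estimated parameters, n the sample size, and ln(L) the maximized log-likelihood. A minimal Python sketch (the example numbers are hypothetical):

def aic(log_likelihood: float, k: int) -> float:
    # Akaike Information Criterion: AIC = 2k - 2 ln(L)
    return 2 * k - 2 * log_likelihood

def aicc(log_likelihood: float, k: int, n: int) -> float:
    # Finite-sample correction: AICc = AIC + 2k(k + 1) / (n - k - 1)
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical comparison: the model with the lower criterion value is preferred.
print(aic(-120.5, k=3))         # 247.0
print(aicc(-120.5, k=3, n=30))  # about 247.9, penalising the small sample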
Construction
Data modelling is the process of creating a data model for storing data in a database.
A model is nothing but a representation of data objects, the associations between different data objects,
and the rules.
Data modelling helps in visually representing the data and in enforcing business rules, regulatory
compliance, and government policies on the data.
The logical designs are translated into physical models which specify the storage devices, databases, and the
files that hold the data. Businesses earlier used relational database technology such as SQL to
build data models, since it is uniquely suited to flexibly linking data set keys and data types to
support the requirements of business processes.
Traditionally, fixed-record data is stable and even predictable in its growth, which makes data modelling
easy. Big data upsets this, so the modelling effort must concentrate on
building open and elastic data interfaces, since users do not know when a new data source or form of
data may emerge.
Design a system rather than a schema: In the era of traditional data, a relational database schema could
cover the relationships and links between the data needed by the business for its information support. This is
not the case with big data, which may have no database or may use a database such as NoSQL. Big data
models must therefore be created on systems rather than on databases.
Use data modelling tools: IT decision makers must include the ability to create data models
for big data among their requirements when considering big data tools and methodologies.
Focus on the core data of the business: Enterprises receive data in large volumes, and most of it is
extraneous. The best approach is to identify the core data of the business and focus the big data tools and methodologies on it.
Deliver quality data: Better data models and relationships result for big data when
organisations focus on developing sound definitions for the data. Thorough metadata describes the
source of the data and its purpose. This knowledge about the data helps in placing it properly in data
models that support the business.
Search for key inroads into the data: A commonly used vector into big data today is geographical
location. Depending on the business, industries have other common keys into the big data required by users.
Data models can be created which support the company's information access paths by identifying
the common entry points into the data.