
AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS UNIT 5

UNIT V – PREDICTIVE ANALYTICS


SYLLABUS:
Linear least squares – implementation – goodness of fit – testing a linear
model – weighted resampling. Regression using StatsModels – multiple
regression – nonlinear relationships – logistic regression – estimating
parameters – Time series analysis – moving averages – missing values – serial
correlation – autocorrelation. Introduction to survival analysis.

PART A
1. Define predictive analytics.
 Predictive analytics is the process of using data to forecast future
outcomes.
 The process uses data analysis, machine learning, artificial intelligence,
and statistical models to find patterns that might predict future
behavior.
 Data scientists use historical data as their source and utilize various
regression models and machine learning techniques to detect patterns
and trends in the data.

2. List the Steps in Predictive Analytics


1. Define the problem
2. Acquire and organize data
3. Pre-process data
4. Develop predictive models
5. Validate and deploy results

3. What are the Predictive Analytics Techniques available? List the techniques used for Predictive Analytics.
1. Regression analysis
2. Decision trees
3. Neural networks

4. List the uses and examples of predictive analytics


 Fraud detection
 Conversion and purchase prediction
 Risk reduction
 Operational improvement
 Customer segmentation
 Maintenance forecasting


5. Define Least squares fit


 A “linear fit” is a line intended to model the relationship between
variables.
 A “least squares” fit is one that minimizes the mean squared error (MSE)
between the line and the data.

6. Define Residuals
 The deviation of an actual value from a model.
 The difference between the actual values and the fitted line.
 thinkstats2 provides a function that computes residuals:
def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res
It returns the differences between the actual values and the fitted line.

7. What is Goodness of fit in predictive analytics?


Goodness of fit
 A goodness-of-fit test is a statistical test that tries to determine whether a set
of observed values matches those expected under the applicable model.
 They can show whether your sample data fit an expected set of data from
a population with normal distribution.

8. Mention the types of goodness-of-fit tests


 The chi-square test determines if a relationship exists between
categorical data.
Variables must be mutually exclusive in order to qualify for the chi-
square test for independence, and the chi-square goodness-of-fit test should
not be used for continuous data.
 The Kolmogorov-Smirnov test determines whether a sample comes
from a specific distribution of a population.

9. What are the different ways to measure the quality of a linear model, or
goodness of fit?
 Standard deviation of the residuals
 Coefficient of determination, usually denoted R2 and called “R-squared”:
def CoefDetermination(ys, res):
    return 1 - Var(res) / Var(ys)
Var(res) is the MSE of guesses using the model, Var(ys) is the MSE
without it.


10. Differentiate Goodness-of-Fit Test vs. Independence Test


 Goodness-of-fit test and independence test are both statistical tests
used to assess the relationship between variables.
 A goodness-of-fit test is used to evaluate how well a set of observed
data fits a particular probability distribution.
 An independence test is used to assess the relationship between two
variables. It is used to test whether there is any association between
two variables.
 The primary purpose of an independence test is to see whether a
change in one variable is related to a change in another variable.
 An independence test is pointed towards two specific variables. A
goodness-of-fit test is used on an entire set of observed data to
evaluate the appropriateness of a specific model.

11. Define Regression and list its types.


Regression
 The linear least squares fit is an example of regression, which is fitting
any kind of model to any kind of data.
 The goal of regression analysis is to describe the relationship between
one set of variables, called the dependent variables, and another set of
variables, called independent or explanatory variables.
 When there is only one dependent and one explanatory variable, that’s
simple regression.
 If there is more than one dependent variable, that's multivariate
regression; if there is more than one explanatory variable, that's multiple regression.
 If the relationship between the dependent and explanatory variable is
linear, that’s linear regression.

12. Define StatsModels and mention its purpose.


 statsmodels provides two interfaces (APIs); the “formula” API uses
strings to identify the dependent and explanatory variables.
It uses a syntax called patsy; in this example, the ~ operator separates
the dependent variable on the left from the explanatory variables on
the right.
 smf.ols takes the formula string and a DataFrame and returns
an OLS object that represents the model.
The name ols stands for “ordinary least squares.”
Given a sequence of values for y and sequences for x1 and x2, find the
parameters β0, β1, and β2 that minimize the sum of the squared residuals ε².
This process is called ordinary least squares.


13. How to implement Regression in Python using StatsModels?


1. Step 1: Import packages.
2. Step 2: Loading data.
3. Step 3: Setting a hypothesis.
4. Step 4: Fitting the model
5. Step 5: Summary of the model.

14. Define R- squared value, F- statistic and Predictions.


R- squared value
 R-squared value ranges between 0 and 1.
 An R-squared of 100 percent indicates that all changes in the dependent
variable are completely explained by changes in the independent
variable(s).
F- statistic:
 The F statistic simply compares the combined effect of all variables.
Predictions:
 If the significance level (alpha) is 0.05 and p < 0.05, reject the null
hypothesis and accept the alternative hypothesis; that is, conclude that there is a
relationship between head size and brain weight.

15. Define multiple linear regression or Multiple Regression using Statsmodels in Python.
Multiple linear regression (MLR)
 Multiple linear regression (MLR), also known simply as multiple
regression, is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable.
 The goal of multiple linear regression is to model the linear relationship
between the explanatory (independent) variables and response (dependent)
variables.
 MLR is used extensively in econometrics and financial inference.

Formula and Calculation of Multiple Linear Regression

yi = β0 + β1 xi1 + β2 xi2 + ... + βp xip + ε

where yi is the dependent (response) variable, xi1 ... xip are the explanatory
variables, β0 is the intercept, β1 ... βp are the slope coefficients, and ε is the
residual (error) term.

16. Define logistic regression.


 If the dependent variable is boolean, the generalized model is called
logistic regression.
 Logistic regression is a supervised machine learning algorithm that
accomplishes binary classification tasks by predicting the probability of
an outcome, event, or observation.
 The model delivers a binary or dichotomous outcome limited to two
possible outcomes: yes/no, 0/1, or true/false.
 Logistic regression is commonly used in binary classification problems
where the outcome variable reveals either of the two categories (0 and 1).

17. Define Sigmoid Function


 Logistic regression uses a logistic function called a sigmoid function to
map predictions and their probabilities. Refer figure 5.3 for Sigmoid
function.
 The sigmoid function refers to an S-shaped curve that converts any real
value to a range between 0 and 1.
 If the output of the sigmoid function (estimated probability) is greater
than a predefined threshold on the graph, the model predicts that the
instance belongs to that class.
 If the estimated probability is less than the predefined threshold, the
model predicts that the instance does not belong to the class.
The sigmoid function is referred to as an activation function for logistic
regression and is defined as:

f(value) = 1 / (1 + e^(-value))

where,
e = base of natural logarithms
value = numerical value one wishes to transform

18. List the types of Logistic Regression with Examples


Logistic regression is classified into binary, multinomial, and ordinal.
Binary logistic regression
 Binary logistic regression predicts the relationship between the
independent and binary dependent variables.
 Some examples of the output of this regression type may be,
success/failure, 0/1, or true/false.
Examples:
1. Deciding on whether or not to offer a loan to a bank customer:
Outcome = yes or no.
2. Evaluating the risk of cancer: Outcome = high or low.


3. Predicting a team’s win in a football match: Outcome = yes or no.


Multinomial logistic regression
 A categorical dependent variable has two or more discrete outcomes in
a multinomial regression type.
 This implies that this regression type has more than two possible
outcomes.
Ordinal logistic regression
 Ordinal logistic regression applies when the dependent variable is in
an ordered state (i.e., ordinal). The dependent variable (y) specifies an
order with two or more categories or levels.

19. Define time series and time series analysis.


 Time Series
 A time series is a sequence of measurements from a system that
varies in time.
 An ordered sequence of values of a variable at equally spaced time
intervals.
 Time Series Analysis
 Time series analysis is a specific way of analyzing a sequence of data
points collected over an interval of time.
 In time series analysis, analysts record data points at consistent
intervals over a set period of time rather than just recording the data
points intermittently or randomly.
 Time series analysis has become a crucial tool for companies looking
to make better decisions based on data.

20. Mention the components of Time Series Data


 Trends: Long-term increases, decreases, or stationary movement
 Seasonality: Predictable patterns at fixed intervals
 Cycles: Fluctuations without a consistent period
 Noise: Residual unexplained variability

21. List the different types of data used for predictive analysis.
 Types of Data
 Time Series Data: Comprises observations collected at different time
intervals. It's geared towards analyzing trends, cycles, and other
temporal patterns.
 Cross-Sectional Data: Involves data points collected at a single
moment in time. Useful for understanding relationships or
comparisons between different entities or categories at that specific
point.


Pooled Data: A combination of Time Series and Cross-Sectional data.


This hybrid enriches the dataset, allowing for more nuanced and
comprehensive analyses.

22. Mention the different types of time series analysis.


 Time Series Analysis Types
 Classification
 Curve fitting
 Descriptive analysis
 Explanative analysis
 Exploratory analysis
 Forecasting
 Intervention analysis
 Segmentation.

23. List the Time Series Analysis Techniques


 Moving Average
 Exponential Smoothing
 Autoregression
 Decomposition
 Time Series Clustering
 Wavelet Analysis
 Intervention Analysis
 Box-Jenkins ARIMA models
 Box-Jenkins Multivariate models
 Holt-Winters Exponential Smoothing

24. List the Advantages of Time Series Analysis


1. Data Cleansing
2. Understanding Data
3. Forecasting
4. Identifying Trends and Seasonality
5. Visualizations
6. Efficiency
7. Risk Assessment

25. List the Challenges of Time Series Analysis


1. Limited Scope
2. Noise Introduction
3. Interpretation Challenges
4. Generalization Issues


5. Model Complexity
6. Non-Independence of Data
7. Data Availability

26. Define Serial Correlation and Auto Correlation in Time Series Analysis.
Serial Correlation
 Serial correlation is the relationship between a given variable and
a lagged version of itself over various time intervals.
 It measures the relationship between a variable's current value
given its past values.
 A variable that is serially correlated indicates that it may not be
random.
 Serial correlation occurs in a time series when a variable and a
lagged version of itself (for instance a variable at times T and at T-
1) are observed to be correlated with one another over periods of
time.
 lag: the size of the shift when the time series is shifted by an interval, as
used in a serial correlation or autocorrelation.

Autocorrelation
 Autocorrelation, refers to the degree of correlation of the same
variables between two successive time intervals.
 Autocorrelation represents the degree of similarity between a given
time series and a lagged version of itself over successive time
intervals.
 Autocorrelation measures the relationship between a variable's
current value and its past values.

27. Define Survival Analysis and Survival Curve.
Survival Analysis
 Survival analysis is a field of statistics that focuses on analysing
the expected time until a certain event happens.
 Survival analysis can be used, for example, for analysing the results of a
medical treatment in terms of the patients’ life expectancy.
 The term 'survival time' specifies the length of time taken for
failure to occur.


Survival curves
o The fundamental concept in survival analysis is the survival curve,
S(t), which is a function that maps from a duration, t, to the
probability of surviving longer than t. It is simply the complement of
the CDF:
S(t) = 1 − CDF(t)
where CDF(t) is the probability of a lifetime less than or equal to t.
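As a minimal illustration (not part of the syllabus text), the empirical survival
curve can be computed from a set of observed durations; the lifetimes below are
hypothetical and censoring is ignored:

import numpy as np

lifetimes = np.array([2, 3, 3, 5, 7, 8, 8, 9, 12])  # hypothetical durations

def Survival(t, durations):
    # S(t) = 1 - CDF(t): the fraction of lifetimes strictly greater than t
    return 1 - np.mean(durations <= t)

for t in [3, 8, 12]:
    print(t, Survival(t, lifetimes))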

28. Define missing value and narrate the reason for missing value.
Missing Value
 Missing data is defined as the values or data that is not stored
for some variable/s in the given dataset.
Reason for Missing Values
 Past data might get corrupted due to improper maintenance.
 Observations are not recorded for certain fields due to some
reasons. There might be a failure in recording the values due to
human error.
 The user has not provided the values intentionally
 Item nonresponse: This means the participant refused to
respond.

29. Why the missing data should be handled?


 The missing data will decrease the predictive power of the
model. If the algorithms are applied with missing data, then
there will be bias in the estimation of parameters.
 The results are not confident if the missing data is not handled
properly.

30. List the types of Missing Values


 Missing completely at random (MCAR): Missing data are randomly
distributed across the variable and unrelated to other variables.
 Missing at random (MAR): Missing data are not randomly distributed,
but they are accounted for by other observed variables.
 Missing not at random (MNAR): Missing data systematically differ
from the observed values.


31. List the methods for identifying missing data

Functions      Descriptions
.isnull()      Returns a pandas DataFrame where each value is a boolean:
               True if the value is missing, False otherwise.
.notnull()     The complement of .isnull(): values are False if either a NaN
               or a None value is detected.
.info()        Generates a summary that includes a "Non-Null Count" column,
               which shows the number of non-missing values for each column.
.isna()        An alias of .isnull(); returns True wherever a value is missing
               (NaN or None).
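A small illustrative sketch (with a hypothetical DataFrame) of these functions in use:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40], 'city': ['Chennai', 'Madurai', None]})

print(df.isnull())        # True where a value is missing
print(df.notnull())       # False where a value is missing
print(df.isna().sum())    # number of missing values per column
df.info()                 # includes the "Non-Null Count" column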


PART B

1. Give a brief introduction about predictive analytics.


 Predictive analytics
 Predictive analytics is the process of using data to forecast future
outcomes.
 The process uses data analysis, machine learning, artificial intelligence,
and statistical models to find patterns that might predict future
behavior.
 Data scientists use historical data as their source and utilize various
regression models and machine learning techniques to detect patterns
and trends in the data.

 Steps in Predictive Analytics


 The workflow for building predictive analytics frameworks follows five
basic steps:
1. Define the problem:
 A prediction starts with a good thesis and set of requirements.
 A distinct problem to solve will help determine what method of
predictive analytics should be used.
2. Acquire and organize data:
 An organization may have decades of data to draw upon, or a
continual flood of data from customer interactions.
 Before predictive analytics models can be developed, data flows must
be identified, and then datasets can be organized in a repository such
as a data warehouse like BigQuery.
3. Pre-process data:
 To prepare the data for the predictive analytics models, it should be
cleaned to remove anomalies, missing data points, or extreme outliers,
any of which might be the result of input or measurement errors.
4. Develop predictive models:
 Data scientists have a variety of tools and techniques to develop
predictive models depending on the problem to be solved and nature
of the dataset.
 Machine learning, regression models, and decision trees are some of
the most common types of predictive models.
5. Validate and deploy results:
 Check on the accuracy of the model and adjust accordingly.
 Once acceptable results have been achieved, make them available to
stakeholders via an app, website, or data dashboard.


 Predictive Analytics Techniques


Predictive analytics tends to be performed with three main types of
techniques:
1. Regression analysis
 Regression is a statistical analysis technique that estimates
relationships between variables.
 Regression is useful to determine patterns in large datasets to
determine the correlation between inputs.
 Regression is often used to determine how one or more independent
variables affects another, such as how a price increase will affect the
sale of a product.
2. Decision trees
 Decision trees are classification models that place data into different
categories based on distinct variables.
 The model looks like a tree, with each branch representing a potential
choice, with the leaf of the branch representing the result of the
decision.
3. Neural networks
 Neural networks are machine learning methods that are useful in
predictive analytics when modeling very complex relationships.
 Neural networks are best used to determine nonlinear relationships in
datasets, especially when no known mathematical formula exists to
analyze the data.
 Neural networks can be used to validate the results of decision trees
and regression models.

 Uses and examples of predictive analytics


 Predictive analytics can be used to streamline operations, boost revenue,
and mitigate risk for almost any business or industry, including banking,
retail, utilities, public sector, healthcare, and manufacturing.
Fraud detection
 Predictive analytics examines all actions on a company’s network
in real time to pinpoint abnormalities that indicate fraud and
other vulnerabilities.
Conversion and purchase prediction
 Companies can take actions, like retargeting online ads to
visitors, with data that predicts a greater likelihood of
conversion and purchase intent.


Risk reduction
 Credit scores, insurance claims, and debt collections all use
predictive analytics to assess and determine the likelihood of
future defaults.
Operational improvement
 Companies use predictive analytics models to forecast
inventory, manage resources, and operate more efficiently.
Customer segmentation
 By dividing a customer base into specific groups, marketers
can use predictive analytics to make forward-looking decisions
to tailor content to unique audiences.
Maintenance forecasting
 Organizations use data to predict when routine equipment
maintenance will be required and can then schedule it before a
problem or malfunction arises.

2. Explain linear least squares and its implementation in detail.

Least squares fit


 A “linear fit” is a line intended to model the relationship between
variables.
 A “least squares” fit is one that minimizes the mean squared error
(MSE) between the line and the data.
 The more general problem is that of fitting a straight line to a
collection of pairs of observations (x, y)

 The most commonly used method for finding a model is that of least
squares estimation.
 It is supposed that x is an independent (or predictor) variable which is
known exactly, while y is a dependent (or response) variable.

 The least squares (LS) estimates of the parameters β0 (intercept) and β1
(slope) are those for which the predicted values of the curve minimize the
sum of the squared deviations from the observations.

 That is, the problem is to find the values of β0 and β1 that minimize
the residual sum of squares:

RSS(β0, β1) = Σ [yi − (β0 + β1 xi)]²


Implementation

thinkstats2 provides simple functions that demonstrate linear least


squares:

def LeastSquares(xs, ys):
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)
    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx
    return inter, slope

LeastSquares takes sequences xs and ys and returns the estimated
parameters inter and slope.

thinkstats2 also provides FitLine, which takes inter and slope and
returns the fitted line for a sequence of xs.

def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys

Residuals
 The deviation of an actual value from a model.
 The difference between the actual values and the fitted line.
 thinkstats2 provides a function that computes residuals:

def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res

Residuals takes sequences xs and ys and estimated parameters inter
and slope. It returns the differences between the actual values and the
fitted line.
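The following is a minimal, self-contained sketch of how these functions fit
together on synthetic data; the helpers are re-implemented with NumPy here
rather than imported from thinkstats2:

import numpy as np

def LeastSquares(xs, ys):
    # slope = Cov(xs, ys) / Var(xs); intercept = mean(ys) - slope * mean(xs)
    meanx, meany = np.mean(xs), np.mean(ys)
    slope = np.sum((xs - meanx) * (ys - meany)) / np.sum((xs - meanx) ** 2)
    inter = meany - slope * meanx
    return inter, slope

def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    return fit_xs, inter + slope * fit_xs

def Residuals(xs, ys, inter, slope):
    return ys - (inter + slope * xs)

# synthetic data: ys is roughly 1 + 2*xs plus noise
xs = np.linspace(0, 10, 50)
ys = 1 + 2 * xs + np.random.normal(0, 1, size=xs.size)

inter, slope = LeastSquares(xs, ys)
res = Residuals(xs, ys, inter, slope)
print(inter, slope, np.mean(res ** 2))   # estimated parameters and the MSE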


Figure 5.1 – Linear Least Square

 A plot in the figure 5.1 depicts the data points (in red), the least squares
line of best fit (in blue), and the residuals (in green)
 The parameters slope and inter are estimates based on a sample; like
other estimates, they are vulnerable to sampling bias, measurement
error, and sampling error.
 Sampling bias is caused by non-representative sampling, measurement
error is caused by errors in collecting and recording data, and sampling
error is the result of measuring a sample rather than the entire
population.

3. Explain in detail about Goodness of fit.


Goodness of fit
 A goodness-of-fit test is a statistical test that tries to determine whether a set
of observed values matches those expected under the applicable model.
 They can show whether sample data fit an expected set of data from a
population with normal distribution.

Types of goodness-of-fit tests


 The chi-square test determines if a relationship exists between
categorical data.
Variables must be mutually exclusive in order to qualify for the chi-
square test for independence, and the chi-square goodness-of-fit test should
not be used for continuous data.
Goodness of fit is a measure of how well a statistical model fits a set of
observations.


 When goodness of fit is high, the values expected based on the


model are close to the observed values.
 When goodness of fit is low, the values expected based on the
model are far from the observed values.
 The Kolmogorov-Smirnov test determines whether a sample comes
from a specific distribution of a population.

To conduct the test, need a certain variable, along with an assumption of


how it is distributed.
 The observed values, which are derived from the actual data set
 The expected values, which are taken from the assumptions made
 The total number of categories in the set

Ways to measure the quality of a linear model, or goodness of fit.


 Standard deviation of the residuals - Std(res) is the root mean
squared error (RMSE) of predictions.
 Coefficient of determination, usually denoted R2 and called “R-
squared”:
def CoefDetermination(ys, res):
    return 1 - Var(res) / Var(ys)
Var(res) is the MSE of guesses using the model, Var(ys) is the MSE
without it.
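A small self-contained sketch (hypothetical data and assumed parameter
estimates) showing both measures computed with NumPy:

import numpy as np

def CoefDetermination(ys, res):
    return 1 - np.var(res) / np.var(ys)

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
inter, slope = 0.1, 1.96                  # assumed estimates from a least squares fit
res = ys - (inter + slope * xs)           # residuals

print(np.std(res))                        # standard deviation of the residuals
print(CoefDetermination(ys, res))         # coefficient of determination, R-squared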

Importance of Goodness-of-Fit Tests


 Provide a way to assess how well a statistical model fits a set of
observed data.
 To determine whether the observed data are consistent with the
assumed statistical model
 Useful in choosing between different models which may better fit the
data.
 Help to identify outliers or market abnormalities that may be affecting
the fit of the model
 Provide information about the variability of the data and the
estimated parameters of the model.
 Can be useful for making predictions and understanding the behavior
of the system being modeled.

Goodness-of-Fit Test vs. Independence Test


 Goodness-of-fit test and independence test are both statistical tests
used to assess the relationship between variables.
 A goodness-of-fit test is used to evaluate how well a set of observed
data fits a particular probability distribution.


 An independence test is used to assess the relationship between two


variables. It is used to test whether there is any association between
two variables.
 The primary purpose of an independence test is to see whether a
change in one variable is related to a change in another variable.
 An independence test is pointed towards two specific variables. A
goodness-of-fit test is used on an entire set of observed data to
evaluate the appropriateness of a specific model.

4. Discuss in detail about Regression using StatsModels.


Regression
 The linear least squares fit is an example of regression, which is
fitting any kind of model to any kind of data.
 The goal of regression analysis is to describe the relationship
between one set of variables, called the dependent variables, and
another set of variables, called independent or explanatory
variables.
 When there is only one dependent and one explanatory variable,
that’s simple regression.
 If there is more than one dependent variable, that's multivariate
regression; if there is more than one explanatory variable, that's multiple regression.
 If the relationship between the dependent and explanatory variable
is linear, that’s linear regression.
 For example, if the dependent variable is y and the explanatory
variables are x1 and x2, the linear regression model is:

y = β0 + β1 x1 + β2 x2 + ε

where β0 is the intercept, β1 is the parameter associated with x1, β2 is
the parameter associated with x2, and ε is the residual.

StatsModels
 statsmodels provides two interfaces (APIs); the “formula” API uses strings
to identify the dependent and explanatory variables.
It uses a syntax called patsy; in this example, the ~ operator separates
the dependent variable on the left from the explanatory variables on the
right.
 smf.ols takes the formula string and a DataFrame and returns an
OLS object that represents the model.


The name ols stands for “ordinary least squares.”


Given a sequence of values for y and sequences for x1 and x2, find the
parameters β0, β1, and β2 that minimize the sum of the squared residuals ε².
This process is called ordinary least squares.
 The fit method fits the model to the data and returns a RegressionResults
object that contains the results.

Stepwise Implementation in Python


Step 1: Import packages.
Step 2: Loading data.
Step 3: Setting a hypothesis.
Step 4: Fitting the model
statsmodels.regression.linear_model.OLS() is used to build the
ordinary least squares model, and the fit() method is used to fit the
data to it. The ols method takes in the data and performs linear
regression.
dependent_column ~ independent_columns:
the left side of the ~ operator contains the dependent variable (the
predicted column) and the right side of the operator contains the
independent variables.
Step 5: Summary of the model.
All the summary statistics of the linear regression model are
returned by the model.summary() method.

Example Program

# import packages
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# loading the csv file
df = pd.read_csv('headbrain1.csv')
print(df.head())

# fitting the model
df.columns = ['Head_size', 'Brain_weight']
model = smf.ols(formula='Head_size ~ Brain_weight', data=df).fit()

# model summary
print(model.summary())


Output

R- squared value:
 R-squared value ranges between 0 and 1.
 An R-squared of 100 percent indicates that all changes in the dependent
variable are completely explained by changes in the independent
variable(s).

F- statistic:
 The F statistic simply compares the combined effect of all variables.

Predictions:
 If the significance level (alpha) is 0.05 and p < 0.05, reject the null
hypothesis and accept the alternative hypothesis; that is, conclude that there is a
relationship between head size and brain weight.

5. Discuss in detail about multiple linear regression or Multiple Regression using Statsmodels in Python.
Multiple linear regression (MLR)
 Multiple linear regression (MLR), also known simply as multiple
regression, is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable.
 The goal of multiple linear regression is to model the linear relationship
between the explanatory (independent) variables and response
(dependent) variables.
 MLR is used extensively in econometrics and financial inference.


Formula and Calculation of Multiple Linear Regression

yi = β0 + β1 xi1 + β2 xi2 + ... + βp xip + ε

where yi is the dependent (response) variable, xi1 ... xip are the explanatory
variables, β0 is the intercept, β1 ... βp are the slope coefficients, and ε is the
residual (error) term.

Figure 5.2 – Simple Linear Regression Vs Multiple Linear Regression

Example
import statsmodels.api as sm

# 'advertising' is assumed to be a DataFrame with TV, Newspaper, Radio and Sales columns
X = advertising[['TV', 'Newspaper', 'Radio']]
y = advertising['Sales']

# Add a constant to get an intercept
X_sm = sm.add_constant(X)

# Fit the regression line using OLS
lr = sm.OLS(y, X_sm).fit()
print(lr.summary())


Output

Understanding the results:


 The R-squared value is 91%, which is good. It means that 91% of the variance
in the Y variable is explained by the X variables.
 The adjusted R-squared value is also good, although it penalizes additional
predictors more than R-squared does.
 Looking at the p-values, we can see that 'Newspaper' is not a
significant X variable, since its p-value is greater than 0.05.
 The coefficient values are reliable, as they fall within the 5% and 95%
confidence bounds, except for the Newspaper variable.

6. Discuss in detail about logistic regression with suitable case study.


LOGISTIC REGRESSION
 Linear regression can be generalized to handle other kinds of dependent
variables.
 If the dependent variable is boolean, the generalized model is called
logistic regression.
 If the dependent variable is an integer count, it’s called Poisson
regression.
 Logistic regression is a supervised machine learning algorithm that
accomplishes binary classification tasks by predicting the probability of
an outcome, event, or observation.
 The model delivers a binary or dichotomous outcome limited to two
possible outcomes: yes/no, 0/1, or true/false.
 Logistic regression is commonly used in binary classification problems
where the outcome variable reveals either of the two categories (0 and 1).


Example:
1. Determine the probability of heart attacks:
With the help of a logistic model, medical practitioners can determine
the relationship between variables such as the weight, exercise, etc.,
of an individual and use it to predict whether the person will suffer
from a heart attack or any other medical complication.
2. Identifying spam emails:
Email inboxes are filtered to determine if the email communication is
promotional/spam by understanding the predictor variables and
applying a logistic regression algorithm to check its authenticity.
Sigmoid Function
 Logistic regression uses a logistic function called a sigmoid function to
map predictions and their probabilities. Refer figure 5.3 for Sigmoid
function.
 The sigmoid function refers to an S-shaped curve that converts any real
value to a range between 0 and 1.
 If the output of the sigmoid function (estimated probability) is greater
than a predefined threshold on the graph, the model predicts that the
instance belongs to that class.
 If the estimated probability is less than the predefined threshold, the
model predicts that the instance does not belong to the class.
 The sigmoid function is referred to as an activation function for logistic
regression and is defined as:

f(value) = 1 / (1 + e^(-value))

where,
e = base of natural logarithms
value = numerical value one wishes to transform

The following equation represents logistic regression:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

 x = input value
 y = predicted output
 b0 = bias or intercept term
 b1 = coefficient for input (x)
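A brief sketch (with hypothetical coefficients b0 and b1) of the sigmoid function
and the prediction it produces:

import numpy as np

def sigmoid(value):
    # S-shaped curve that maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-value))

b0, b1 = -4.0, 1.2            # hypothetical intercept and coefficient
x = np.array([1.0, 3.0, 5.0])

p = sigmoid(b0 + b1 * x)      # estimated probability for each input
print(p)
print(p > 0.5)                # class prediction with a 0.5 threshold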


Figure 5.3 – Sigmoid Function

Key Assumptions for implementing Logistic Regression


1. The dependent/response variable is binary or dichotomous
 The first assumption of logistic regression is that response
variables can only take on two possible outcomes – pass/fail,
male/female, and malignant/benign.
2. Little or no multicollinearity between the predictor/explanatory
variables
 This assumption implies that the predictor variables (or the
independent variables) should be independent of each other.
Multicollinearity relates to two or more highly correlated
independent variables.
3. Linear relationship of independent variables to log odds
 Log odds refer to the ways of expressing probabilities. Log odds are
different from probabilities. Odds refer to the ratio of success to
failure, while probability refers to the ratio of success to everything
that can occur.
 For example, consider that you play twelve tennis games with your
friend. Here, the odds of you winning are 5 to 7 (or 5/7), while the
probability of you winning is 5 to 12 (as the total games played =
12).
4. Prefers large sample size
 Logistic regression analysis yields reliable, robust, and valid
results when a larger sample size of the dataset is considered.
5. Problem with extreme outliers
 Another critical assumption of logistic regression is the
requirement of no extreme outliers in the dataset.


6. Consider independent observations


 This assumption states that the dataset observations should be
independent of each other. The observations should not be related
to each other or emerge from repeated measurements of the same
individual type.

Types of Logistic Regression with Examples


Logistic regression is classified into binary, multinomial, and ordinal.
Binary logistic regression
 Binary logistic regression predicts the relationship between the
independent and binary dependent variables.
 Some examples of the output of this regression type may be,
success/failure, 0/1, or true/false.
Examples:
1. Deciding on whether or not to offer a loan to a bank customer:
Outcome = yes or no.
2. Evaluating the risk of cancer: Outcome = high or low.
3. Predicting a team’s win in a football match: Outcome = yes or no.
Multinomial logistic regression
 A categorical dependent variable has two or more discrete
outcomes in a multinomial regression type.
 This implies that this regression type has more than two possible
outcomes.
Examples:
1. Let’s say you want to predict the most popular transportation type
for 2040. Here, transport type equates to the dependent variable,
and the possible outcomes can be electric cars, electric trains,
electric buses, and electric bikes.
2. Predicting whether a student will join a college, vocational/trade
school, or corporate industry.
3. Estimating the type of food consumed by pets, the outcome may be
wet food, dry food, or junk food.
Ordinal logistic regression
 Ordinal logistic regression applies when the dependent variable is in
an ordered state (i.e., ordinal). The dependent variable (y) specifies an
order with two or more categories or levels.
Examples: Dependent variables represent,
1. Formal shirt size: Outcomes = XS/S/M/L/XL
2. Survey answers: Outcomes = Agree/Disagree/Unsure
3. Scores on a math test: Outcomes = Poor/Average/Good


Logistic regression works in the following steps:


1. Prepare the data: The data should be in a format where each row
represents a single observation and each column represents a
different variable. The target variable (the variable you want to predict)
should be binary (yes/no, true/false, 0/1).
2. Train the model: We teach the model by showing it the training data.
This involves finding the values of the model parameters that minimize
the error in the training data.
3. Evaluate the model: The model is evaluated on the held-out test data
to assess its performance on unseen data.
4. Use the model to make predictions: After the model has been
trained and assessed, it can be used to forecast outcomes on new
data.

ESTIMATING PARAMETERS
Given a probability p, compute the odds like this:

o = p / (1 - p)

Given odds in favor, convert to probability like this:

p = o / (o + 1)

Logistic regression is based on the following model:

log o = β0 + β1 x1 + β2 x2 + ε

where o is the odds in favor of a particular outcome.

Suppose we have estimated the parameters β0, β1, and β2. Given
values for x1 and x2, we can compute the predicted value of log o, and then
convert it to a probability:

o = np.exp(log_o)
p = o / (o + 1)

The usual goal is to find the maximum-likelihood estimate (MLE), which


is the set of parameters that maximizes the likelihood of the data.


Example
Suppose the following data:
>>> y = np.array([0, 1, 0, 1])
>>> x1 = np.array([0, 0, 0, 1])
>>> x2 = np.array([0, 1, 1, 1])
And start with the initial guesses β0 = -1.5, β1 = 2.8, and β2 = 1.1:

>>> beta = [-1.5, 2.8, 1.1]

Then for each row we can compute log_o:

>>> log_o = beta[0] + beta[1] * x1 + beta[2] * x2


[-1.5 -0.4 -0.4 2.4]

And convert from log odds to probabilities:

>>> o = np.exp(log_o)
[ 0.223 0.670 0.670 11.02 ]

>>> p = o / (o+1)
[ 0.182 0.401 0.401 0.916 ]

Notice that when log_o is greater than 0, o is greater than 1 and p is


greater than 0.5.
The likelihood of an outcome is p when y==1 and 1-p when y==0.

If we think the probability of a boy is 0.8 and the outcome is a boy, the
likelihood is 0.8; if the outcome is a girl, the likelihood is 0.2.

Compute that like this:


>>> likes = y * p + (1-y) * (1-p)
[ 0.817 0.401 0.598 0.916 ]

The overall likelihood of the data is the product of likes:


>>> like = np.prod(likes)
0.18

For these values of beta, the likelihood of the data is 0.18. The goal of
logistic regression is to find parameters that maximize this likelihood.


IMPLEMENTATION
StatsModels provides an implementation of logistic regression called logit,
named for the function that converts from probability to log odds.

import statsmodels.formula.api as smf

model = smf.logit('boy ~ agepreg', data=df)
results = model.fit()
SummarizeResults(results)

The result is a Logit object that represents the model.


It contains attributes called endog and exog that contain the endogenous
variable, another name for the dependent variable, and the exogenous
variables, another name for the explanatory variables.
The result of model.fit is a BinaryResults object.

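The NSFG DataFrame used above is not reproduced here, so the following is a
minimal sketch of the same logit interface on synthetic data (the column names
x and y are made up for illustration):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data: the probability of y == 1 increases with x
rng = np.random.default_rng(0)
x = rng.normal(size=500)
prob = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
df = pd.DataFrame({'x': x, 'y': rng.binomial(1, prob)})

model = smf.logit('y ~ x', data=df)   # dependent variable on the left of ~
results = model.fit()
print(results.params)                 # maximum-likelihood estimates of the parameters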
7. Discuss in detail about time series analysis with a suitable case study.
 Time Series
 A time series is a sequence of measurements from a system that
varies in time.
 An ordered sequence of values of a variable at equally spaced time
intervals.
 Time Series Analysis
 Time series analysis is a specific way of analyzing a sequence of data
points collected over an interval of time.
 In time series analysis, analysts record data points at consistent
intervals over a set period of time rather than just recording the data
points intermittently or randomly.
 Time series analysis has become a crucial tool for companies looking
to make better decisions based on data.
 Examples of time series analysis in action include:
 Weather data
 Rainfall measurements
 Temperature readings
 Heart rate monitoring (EKG)
 Brain monitoring (EEG)
 Quarterly sales
 Stock prices
 Automated stock trading
 Industry forecasts


 Components of Time Series Data


 Trends: Long-term increases, decreases, or stationary movement
 Seasonality: Predictable patterns at fixed intervals
 Cycles: Fluctuations without a consistent period
 Noise: Residual unexplained variability

 Types of Data
 Time Series Data: Comprises observations collected at different time
intervals. It's geared towards analyzing trends, cycles, and other
temporal patterns.
 Cross-Sectional Data: Involves data points collected at a single
moment in time. Useful for understanding relationships or
comparisons between different entities or categories at that specific
point.
Pooled Data: A combination of Time Series and Cross-Sectional data.
This hybrid enriches the dataset, allowing for more nuanced and
comprehensive analyses.

 Time Series Analysis Types


 Classification: Identifies and assigns categories to the data.
 Curve fitting: Plots the data along a curve to study the relationships
of variables within the data.
 Descriptive analysis: Identifies patterns in time series data, like
trends, cycles, or seasonal variation.
 Explanative analysis: Attempts to understand the data and the
relationships within it, as well as cause and effect.
 Exploratory analysis: Highlights the main characteristics of the time
series data, usually in a visual format.
 Forecasting: Predicts future data. This type is based on historical
trends. It uses the historical data as a model for future data,
predicting scenarios that could happen along future plot points.
 Intervention analysis: Studies how an event can change the data.
 Segmentation: Splits the data into segments to show the underlying
properties of the source information.

 Time Series Analysis Techniques


 Moving Average: Useful for smoothing out long-term trends. It is ideal
for removing noise and identifying the general direction in which
values are moving.


 Exponential Smoothing: Suited for univariate data with a systematic


trend or seasonal component. Assigns higher weight to recent
observations, allowing for more dynamic adjustments.
 Autoregression: Leverages past observations as inputs for a regression
equation to predict future values. It is good for short-term forecasting
when past data is a good indicator.
 Decomposition: This breaks down a time series into its core
components—trend, seasonality, and residuals—to enhance the
understanding and forecast accuracy.
 Time Series Clustering: Unsupervised method to categorize data points
based on similarity, aiding in identifying archetypes or trends in
sequential data.
 Wavelet Analysis: Effective for analyzing non-stationary time series data.
It helps in identifying patterns across various scales or resolutions.
 Intervention Analysis: Assesses the impact of external events on a time
series, such as the effect of a policy change or a marketing campaign.
 Box-Jenkins ARIMA models: Focuses on using past behavior and errors
to model time series data. Assumes data can be characterized by a linear
function of its past values.
 Box-Jenkins Multivariate models: Similar to ARIMA, but accounts for
multiple variables. Useful when other variables influence one time series.
 Holt-Winters Exponential Smoothing: Best for data with a distinct
trend and seasonality. Incorporates weighted averages and builds upon
the equations for exponential smoothing.

 The Advantages of Time Series Analysis


1. Data Cleansing: Time series analysis techniques such as smoothing and
seasonality adjustments help remove noise and outliers, making the data
more reliable and interpretable.
2. Understanding Data: Models like ARIMA or exponential smoothing
provide insight into the data's underlying structure. Autocorrelations and
stationary measures can help understand the data's true nature.
3. Forecasting: One of the primary uses of time series analysis is to predict
future values based on historical data. Forecasting is invaluable for
business planning, stock market analysis, and other applications.
4. Identifying Trends and Seasonality: Time series analysis can uncover
underlying patterns, trends, and seasonality in data that might not be
apparent through simple observation.
5. Visualizations: Through time series decomposition and other techniques,
it's possible to create meaningful visualizations that clearly show trends,
cycles, and irregularities in the data.


6. Efficiency: With time series analysis, less data can sometimes be more.
Focusing on critical metrics and periods can often derive valuable
insights without getting bogged down in overly complex models or
datasets.
7. Risk Assessment: Volatility and other risk factors can be modeled over
time, aiding financial and operational decision-making processes.

 Challenges of Time Series Analysis


1. Limited Scope: Time series analysis is restricted to time-dependent data.
It's not suitable for cross-sectional or purely categorical data.
2. Noise Introduction: Techniques like differencing can introduce
additional noise into the data, which may obscure fundamental patterns
or trends.
3. Interpretation Challenges: Some transformed or differenced values may
lack intuitive meaning, making it harder to understand the real-
world implications of the results.
4. Generalization Issues: Results may not always be generalizable,
particularly when the analysis is based on a single, isolated dataset or
period.
5. Model Complexity: The choice of model can greatly influence the results,
and selecting an inappropriate model can lead to unreliable or misleading
conclusions.
6. Non-Independence of Data: Unlike other types of statistical analysis,
time series data points are not always independent, which can introduce
bias or error in the analysis.
7. Data Availability: Time series analysis often requires many data points
for reliable results, and such data may not always be easily accessible or
available.

8. Explain in detail about Time Series Analysis Technique – Moving Average and exponentially-weighted moving average with an example.
 Moving Average
 A moving average divides the series into overlapping regions, called
windows, and computes the average of the values in each window.
 One of the simplest moving averages is the rolling mean, which computes
the mean of the values in each window.
 For example, if the window size is 3, the rolling mean computes the mean
of values 0 through 2, 1 through 3, 2 through 4, etc.
 pandas provides rolling_mean, which takes a Series and a window size
and returns a new Series.
>>> series = np.arange(10)


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> pandas.rolling_mean(series, 3)
array([ nan, nan, 1, 2, 3, 4, 5, 6, 7, 8])
 The first two values are nan; the next value is the mean of the first three
elements, 0, 1, and 2. The next value is the mean of 1, 2, and 3. And so
on.
 The rolling mean does a good job of smoothing out the noise and extracting the
trend.
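Note that pandas.rolling_mean comes from an older pandas API; in current
pandas releases the same computation is written with the Series.rolling method.
A short sketch:

import numpy as np
import pandas as pd

series = pd.Series(np.arange(10))
print(series.rolling(window=3).mean())   # first two values are NaN, then 1, 2, 3, ...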
 Exponentially-weighted moving average (EWMA)
 The Exponentially Weighted Moving Average (EWMA) is a quantitative or
statistical measure used to model or describe a time series.
 The moving average is designed as such that older observations are given
lower weights. The weights fall exponentially as the data point gets older –
hence the name exponentially weighted.
 An alternative is the exponentially-weighted moving average (EWMA),
which has two advantages.
 First, it computes a weighted average where the most recent value has
the highest weight and the weights for previous values drop off
exponentially.
 Second, the pandas implementation of EWMA handles missing values
better.
EWMA Formula

EWMA_t = alpha * r_t + (1 - alpha) * EWMA_(t-1)

Where:
Alpha = the weight decided by the user (0 < alpha <= 1)
r = value of the series in the current period

ewma = pandas.ewma(reindexed.ppg, span=30)
thinkplot.Plot(ewma.index, ewma)

 The span parameter corresponds roughly to the window size of a moving
average; it controls how fast the weights drop off, so it determines the
number of points that make a non-negligible contribution to each average.

 Figure 5.4 (right) shows the EWMA for the same data.


 It is similar to the rolling mean, where they are both defined, but it has
no missing values, which makes it easier to work with.

Figure 5.4: Daily price and a rolling mean (left) and exponentially-weighted
moving average (right).

Missing values
 A simple and common way to fill missing data is to use a moving average.
 The Series method fillna:
reindexed.ppg.fillna(ewma, inplace=True)

Wherever reindexed.ppg is nan, fillna replaces it with the corresponding
value from ewma. The inplace flag tells fillna to modify the existing Series
rather than create a new one.
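pandas.ewma comes from an older pandas API; in current pandas the equivalent
is Series.ewm(...).mean(), and fillna works the same way. A rough sketch on a
synthetic series (the ppg values below are made up):

import numpy as np
import pandas as pd

ppg = pd.Series(np.linspace(10, 12, 60) + np.random.normal(0, 0.1, 60))
ppg.iloc[[5, 6, 20]] = np.nan             # introduce some missing values

ewma = ppg.ewm(span=30).mean()            # exponentially-weighted moving average
ppg_filled = ppg.fillna(ewma)             # fill the gaps with the EWMA values
print(ppg_filled.isna().sum())            # 0 missing values remain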

9. Discuss in detail about Serial Correlation and Auto Correlation in Time Series Analysis with a suitable example.
Serial Correlation
 Serial correlation is the relationship between a given variable and a
lagged version of itself over various time intervals.
 It measures the relationship between a variable's current value given its
past values.
 A variable that is serially correlated indicates that it may not be random.
 Serial correlation occurs in a time series when a variable and a lagged
version of itself (for instance a variable at times T and at T-1) are
observed to be correlated with one another over periods of time.
 lag: the size of the shift when the time series is shifted by an interval, as
used in a serial correlation or autocorrelation.


 One example of serial correlation is found in stock prices.


 Stock prices tend to go up and down together over time, which is said to
be “serially correlated.” This means that if stock prices go up today, they
are likely to also go up tomorrow. Similarly, if stock prices go down today, they
are likely to go down tomorrow.
 The degree of serial correlation can be measured using the
autocorrelation coefficient.
 The autocorrelation coefficient measures how closely related a series of
data points are to each other.

Types of Serial Correlation


Positive Serial Correlation
 Positive serial correlation occurs when a positive error for one observation
increases the chance of a positive error for another observation.
 In other words, if there is a positive error in one period, there is a greater
likelihood of a positive error in the next period as well.
 Positive serial correlation also means that a negative error for one
observation increases the chance of a negative error for another
observation.
 So, if there is a negative error in one period, there is a greater likelihood
of a negative error in the next period. Refer Figure 5.5

Figure 5.5 – Positive Serial Correlation

Negative Serial Correlation


 A negative serial correlation occurs when a positive error for one
observation increases the chance of a negative error for another
observation.


 In other words, if there is a positive error in one period, there is a greater


likelihood of a negative error in the next period.
 A negative serial correlation also means that a negative error for one
observation increases the chance of a positive error for another
observation.
 So, if there is a negative error in one period, there is a greater likelihood
of a positive error in the next period. Refer Figure 5.6

Figure 5.6 – Negative Serial Correlation

def SerialCorr(series, lag=1):
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    corr = thinkstats2.Corr(xs, ys)
    return corr
After the shift, the first lag values are nan, so a slice is used to remove
them before computing Corr.
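A sketch of the same computation without thinkstats2, using pandas' built-in
correlation on a synthetic random walk (a series that is strongly serially
correlated):

import numpy as np
import pandas as pd

def SerialCorr(series, lag=1):
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    return xs.corr(ys)                      # Pearson correlation, like thinkstats2.Corr

steps = np.random.normal(size=200)
walk = pd.Series(np.cumsum(steps))          # synthetic random walk

print(SerialCorr(walk, lag=1))              # close to 1 for a random walk
print(SerialCorr(pd.Series(steps), lag=1))  # close to 0 for white noise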

Testing for Serial Correlation


Durbin-Watson Test
 The Durbin-Watson test is a statistical test used to determine whether or
not there is a serial correlation in a data set.
 It tests the null hypothesis of no serial correlation against the alternative
positive or negative serial correlation hypothesis.
 The test is named after James Durbin and Geoffrey Watson, who
developed it in 1950.


The Durbin-Watson Statistic (DW) is approximated by:

$$ DW = 2(1 − r) $$

Where:
\(r\) is the sample correlation between regression residuals from one
period and the previous period.

 The test statistic can take on values ranging from 0 to 4.


 A value of 2 indicates no serial correlation, a value between 0 and 2
indicates a positive serial correlation, and a value between 2 and 4
indicates a negative serial correlation:

 If there is no autocorrelation, the regression errors will be uncorrelated,
and thus \(DW = 2\):
$$ DW = 2(1 − r) = 2(1 − 0) = 2 $$

 For positive serial autocorrelation, \(DW < 2\). For example, if the serial
correlation of the regression residuals = 1, \(DW = 2(1 − 1) = 0\).

 For negative autocorrelation, \(DW > 2\). For example, if the serial
correlation of the regression residual = −1, \(DW = 2(1 − (−1)) = 4\).

 To decide whether to reject the null hypothesis of no serial correlation, the
calculated DW statistic is compared against critical values from the
Durbin-Watson table.

Define \(d_l\) as the lower value and \(d_u\) as the upper value:
o If the DW statistic is less than \(d_l\), we reject the null hypothesis of no
positive serial correlation.
o If the DW statistic is greater than \((4 – d_l)\), we reject the null
hypothesis, indicating a significant negative serial correlation.
o If the DW statistic falls between \(d_l\) and \(d_u\), the test results are
inconclusive.
o If the DW statistic is greater than \(d_u\), we fail to reject the null
hypothesis of no positive serial correlation.
o Refer Figure 5.7


Figure 5.7 - Durbin-Watson Test for Serial Correlation


Example 5.1:
The Durbin-Watson Test for Serial Correlation
Consider a regression output with two independent variables that
generate a DW statistic of 0.654. Assume that the sample size is
15. Test for serial correlation of the error terms at the 5%
significance level.

Solution
From the Durbin-Watson table with \(n = 15\) and \(k = 2\),
\(d_l = 0.95\) and \(d_u = 1.54\).
Since \(d = 0.654 < 0.95 = d_l\), reject the null hypothesis and conclude
that there is significant positive autocorrelation.
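
In Python, one way (a sketch, not part of the example above) to obtain the DW
statistic is statsmodels’ durbin_watson function applied to the residuals of a
fitted regression; the data below are simulated and purely illustrative:
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
x = rng.normal(size=(15, 2))                   # two independent variables, n = 15
y = 1.0 + x @ np.array([0.5, -0.3]) + rng.normal(scale=0.5, size=15)

X = sm.add_constant(x)                         # add the intercept column
results = sm.OLS(y, X).fit()                   # fit the regression
dw = durbin_watson(results.resid)              # DW statistic of the residuals
print(dw)                                      # compare with d_l and d_u from the table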

Example 5.2
Consider a regression model with 80 observations and two independent
variables. Assume that the correlation between the error term and the
first lagged value of the error term is 0.18. The most
appropriate decision is:
A. reject the null hypothesis of positive serial correlation.
B. fail to reject the null hypothesis of positive serial correlation.
C. declare that the test results are inconclusive.
Solution
The correct answer is C.
The test statistic is:
$$ DW \approx 2(1 − r) = 2(1 − 0.18) = 1.64 $$
The critical values from the Durbin-Watson table with \(n = 80\) and \(k =
2\) are \(d_l = 1.59\) and \(d_u = 1.69\).
Because \(d_l = 1.59 < 1.64 < 1.69 = d_u\), the test results are inconclusive.


10. Discuss in detail about autocorrelation and differentiate between
serial correlation and autocorrelation.
 Autocorrelation
 Autocorrelation refers to the degree of correlation of the same variable
between two successive time intervals.
 Autocorrelation represents the degree of similarity between a given time
series and a lagged version of itself over successive time intervals.
 Autocorrelation measures the relationship between a variable's current
value and its past values.
 The value of autocorrelation ranges from -1 to 1.
 An autocorrelation of +1 represents a perfect positive correlation, while
an autocorrelation of -1 represents a perfect negative correlation.
 A value between -1 and 0 represents negative autocorrelation.
 A value between 0 and 1 represents positive autocorrelation.
 Autocorrelation gives information about the trend of a set of historical
data, which makes it useful in technical analysis.

Types of Autocorrelation
 Positive autocorrelation
The observations with positive autocorrelation can be plotted into a
smooth curve. By adding a regression line, it can be observed that
a positive error is followed by another positive one, and a negative
error is followed by another negative one. Refer Figure 5.8

Figure 5.8 – Positive Autocorrelation

 Negative autocorrelation
Conversely, negative autocorrelation indicates that an increase
observed in one time interval leads to a proportionate decrease in the
lagged time interval. By plotting the observations with a regression

line, it can be seen that a positive error is followed by a negative one
and vice versa. Refer Figure 5.9

Figure 5.9 – Negative Autocorrelation

 Autocorrelation can be computed for different numbers of time gaps; this
gap is known as the lag.
 A lag-1 autocorrelation measures the correlation between observations
that are one time step apart.
 For example, to learn the correlation between the temperatures of one
day and the corresponding day in the next month, a lag 30
autocorrelation should be used (assuming 30 days in that month).
 Autocorrelation refers to the correlation between a time series variable
and its own lagged values over time. In other words, it measures the
degree of similarity between observations of a variable at different
points in time.
 Autocorrelation is an important concept in time series analysis as it
helps to identify patterns and relationships within the data.
 Positive autocorrelation occurs when a time series variable is positively
correlated with its past values, while negative autocorrelation occurs
when it is negatively correlated with its past values.
 Zero autocorrelation indicates that there is no correlation between the
variable and its lagged values.

Benefits of Autocorrelation
 Autocorrelation has several benefits in time series analysis:
 Identifying patterns – Autocorrelation helps to identify patterns in
the time series data, which can provide insights into the behavior of
the variable over time.
 Model selection – Autocorrelation can be used to select appropriate
models for time series analysis.


 Forecasting – Autocorrelation can help to forecast future values of a


time series variable.
 Validating assumptions – Autocorrelation can be used to validate
assumptions of statistical models.
 Hypothesis testing – Autocorrelation can affect the results of
hypothesis tests, such as t-tests and F-tests.

Test for Autocorrelation


 Autocorrelation can be assessed using a variety of statistical techniques
such as the autocorrelation function (ACF), partial autocorrelation
function (PACF), and the Durbin-Watson statistic.
 These methods help to quantify the strength and direction of the
autocorrelation and can be used to model and forecast time series data.

 The Durbin-Watson statistic is commonly used to test for autocorrelation.


 It can be applied to a data set using statistical software.
 The outcome of the Durbin-Watson test ranges from 0 to 4.
 An outcome close to 2 means a very low level of autocorrelation.
 An outcome closer to 0 suggests a stronger positive autocorrelation, and
an outcome closer to 4 suggests a stronger negative autocorrelation.

 The autocorrelation function (ACF) assesses the correlation between
observations in a time series for a set of lags. The ACF for a time series
\(y\) is given by:
$$ \mathrm{Corr}(y_t, y_{t-k}), \quad k = 1, 2, \ldots $$
Analysts typically use graphs to display this function.

Computation of Autocorrelation in Python


The pandas.Series.autocorr() function lets you compute the lag-N
(default=1) autocorrelation on a given series.
Code Snippet:
df['series'].autocorr(lag=1)
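
As a small, self-contained illustration (the monthly values below are made up),
the same idea can be checked with pandas and, for several lags at once, with
statsmodels’ acf function:
import pandas as pd
from statsmodels.tsa.stattools import acf

series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118])

print(series.autocorr(lag=1))     # lag-1 autocorrelation
print(acf(series, nlags=4))       # autocorrelations for lags 0 through 4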

 Serial Correlation Versus Autocorrelation


1. Serial correlation is a statistical concept that refers to the correlation
between a variable and itself over time. It is used to measure the
degree to which a variable's values at one point in time are related to
its values at another point in time. Serial correlation is often used in
time-series analysis to detect patterns in data and to test whether a
model is appropriate for the data.


2. Autocorrelation is a specific type of serial correlation that measures
the correlation between a variable and its lagged values. In other
words, autocorrelation measures the degree to which a variable's
values at one point in time are related to its values at previous points
in time. Autocorrelation is often used to assess whether a time-series
model is appropriate for the data.
3. Serial correlation is a more general term that refers to the correlation
between a variable and itself over time, whereas autocorrelation
specifically refers to the correlation between a variable and its lagged
values.
4. In terms of applications, serial correlation is often used to analyze
patterns in data over time, such as trends and seasonality, while
autocorrelation is often used in time-series analysis to assess the fit of
a model and to make predictions about future values.
5. For example, in finance, serial correlation might be used to analyze
the daily returns of a stock or portfolio over time to detect trends and
seasonality. Autocorrelation might be used to test whether a time-
series model is appropriate for the data and to make predictions about
future returns based on past values.

11. Give a brief introduction about Survival Analysis.


Survival Analysis
 Survival analysis is a field of statistics that focuses on analysing the
expected time until a certain event happens.
 Survival analysis can be used, for example, to analyse the results of a
medical treatment in terms of the patients’ life expectancy.
 The term ‘survival time’ specifies the length of time taken for failure to
occur.
 Survival analysis is used to analyse data in which the time until the event
is of interest.
 The response is often referred to as a failure time, survival time, or event
time.
 This branch of statistics developed around measuring the effects of
medical treatment on patients’ survival in clinical trials.
 Examples
o Time until tumour recurrence
o Time until a machine part fails


Survival curves
 The fundamental concept in survival analysis is the survival curve, S(t),
as in Figure 5.10, which is a function that maps from a duration, t, to the
probability of surviving longer than t. It is simply the complement of the CDF:
S(t) = 1 − CDF(t)
where CDF(t) is the probability of a lifetime less than or equal to t.

Figure 5.10 - Survival curves

 For example, the NSFG dataset gives the durations of 11,189 complete
pregnancies.
 This data can be read and its CDF computed as follows:
preg = nsfg.ReadFemPreg()                                # load the pregnancy data
complete = preg.query('outcome in [1, 3, 4]').prglngth   # lengths of complete pregnancies
cdf = thinkstats2.Cdf(complete, label='cdf')             # CDF of pregnancy length
 The outcome codes 1, 3, 4 indicate live birth, stillbirth, and miscarriage.
 The DataFrame method query takes a boolean expression and evaluates it
for each row, selecting the rows that yield True.
class SurvivalFunction(object):
    def __init__(self, cdf, label=''):
        self.cdf = cdf
        self.label = label or cdf.label

    @property
    def ts(self):
        return self.cdf.xs

    @property
    def ss(self):
        return 1 - self.cdf.ps

 SurvivalFunction provides two properties:
o ts, which is the sequence of lifetimes,
o ss, which is the survival curve (a brief usage sketch follows below).
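
A brief, hypothetical usage sketch, assuming the cdf computed above and the
thinkplot module that accompanies thinkstats2 are available:
import thinkplot   # plotting helpers that accompany thinkstats2

sf = SurvivalFunction(cdf, label='survival')      # build S(t) from the CDF computed above
thinkplot.Plot(sf.ts, sf.ss)                      # survival probability versus duration
thinkplot.Show(xlabel='t (weeks)', ylabel='Prob(duration > t)')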


 From the survival curve we can derive the hazard function; for pregnancy
lengths, the hazard function maps from a time, t, to the fraction of
pregnancies that continue until t and then end at t; in other words, it is
the fraction of lifetimes ending at t among those lasting at least t.
 The numerator of this fraction, the proportion of lifetimes that end at t,
is also PMF(t).

Figure 5.5 - Hazard curve
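
A minimal numpy sketch of this idea (the lifetimes below are made-up values,
and this is not thinkstats2’s implementation): estimate PMF(t), S(t), and the
hazard from a sample of complete lifetimes.
import numpy as np

lifetimes = np.array([37, 38, 39, 39, 39, 39, 40, 40, 40, 41])   # hypothetical durations
n = len(lifetimes)

ts = np.sort(np.unique(lifetimes))
pmf = np.array([(lifetimes == t).sum() / n for t in ts])      # PMF(t)
surv = 1 - np.cumsum(pmf)                                     # S(t) = 1 - CDF(t)
at_risk = np.array([(lifetimes >= t).sum() / n for t in ts])  # fraction lasting at least t
hazard = pmf / at_risk                                        # fraction ending at t among those at risk

for t, s, lam in zip(ts, surv, hazard):
    print(t, round(s, 2), round(lam, 2))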

Censoring
 In longitudinal studies exact survival time is only known for those
individuals who show the event of interest during the follow-up period.
These individuals are called censored observations.
 The following terms are used in relation to censoring:
 Right censoring: a subject is right censored if it is known that failure
occurs some time after the recorded follow-up period.
 Left censoring: a subject is left censored if it is known that the failure
occurs some time before the recorded follow-up period.
 Interval censoring: a subject is interval censored if it is known that the
event occurs between two times, but the exact time of failure is not
known.

Truncation
 A truncation period means that the outcome of interest cannot possibly
occur.
 A censoring period means that the outcome of interest may have
occurred.
 There are two types of truncation:
 Left truncation: a subject is left truncated if it enters the population at
risk some stage after the start of the follow-up period.


 Right truncation: a subject is right truncated if it leaves the population
at risk some stage after the study start.

Figure 5.6: Left-, right-censoring, and truncation

 An ‘X’ indicates that the subject has experienced the outcome of
interest; an ‘O’ indicates censoring.
 Subject A experiences the event of interest on day 7.
 Subject B does not experience the event during the study period and is
right censored on day 12 (this implies that subject B experienced the
event sometime after day 12).
 Subject C does not experience the event of interest during its period of
observation and is censored on day 10.
 Subject D is interval censored: this subject is observed intermittently
and experiences the event of interest sometime between days 5 and 6 and
between days 7 and 8.
 Subject E is left censored: it has been found to have already
experienced the event of interest when it enters the study on day 1.
 Subject F is interval truncated: there is no way possible that the event
of interest could occur to this individual between days 4 and 6.
 Subject G is left truncated: there is no way possible that the event of
interest could have occurred before the subject enters the study on
day 3.


12. What are the Effective Strategies for Handling Missing Values in Data
Analysis?
Missing Value
 Missing data is defined as the values or data that is not stored for
some variable/s in the given dataset.
 For example, in the Titanic dataset, the columns ‘Age’ and ‘Cabin’ have
some missing values.

Reason for Missing Values


 Past data might get corrupted due to improper maintenance.
 Observations may not be recorded for certain fields; for example, there
might be a failure in recording the values due to human error.
 The user has not provided the values intentionally
 Item nonresponse: This means the participant refused to respond.

Reason to handle missing data


 The missing data will decrease the predictive power of the model. If
the algorithms are applied with missing data, then there will be bias
in the estimation of parameters.
 The results are not reliable if the missing data is not handled
properly.

Types of Missing Values


 Missing completely at random (MCAR): Missing data are randomly
distributed across the variable and unrelated to other variables.
 Missing at random (MAR): Missing data are not randomly distributed,
but they are accounted for by other observed variables.
 Missing not at random (MNAR): Missing data systematically differ from
the observed values.

Missing Completely At Random (MCAR)


 In MCAR, the probability of data being missing is the same for all the
observations.
 In this case, there is no relationship between the missing data and
any other values observed or unobserved within the given dataset.
 That is, missing values are completely independent of other
data. There is no pattern.
 In the case of MCAR data, the value could be missing due to human
error, some system/equipment failure, loss of sample, or some
unsatisfactory technicalities while recording the values.
 For Example, suppose in a library there are some overdue books.
Some values of overdue books in the computer system are missing.
The reason might be a human error, like the librarian forgetting to
type in the values.

Missing At Random (MAR)


 MAR data means that the reason for missing values can be explained
by variables which have complete information, as there is some
relationship between the missing data and other values/data.
 In this case, the data is not missing for all the observations.
 It is missing only within sub-samples of the data, and there is some
pattern in the missing values.
 For example, if you check the survey data, you may find that all the
people have answered their ‘Gender,’ but ‘Age’ values
are mostly missing for people who have answered their ‘Gender’ as
‘female.’ (The reason being most of the females don’t want to reveal
their age.)
 So, the probability of data being missing depends only on the observed
value or data. In this case, the variables ‘Gender’ and ‘Age’ are related.


The reason for missing values of the ‘Age’ variable can be explained by
the ‘Gender’ variable, but you cannot predict the missing value itself.

Missing Not At Random (MNAR)


 Missing values depend on the unobserved data.
 If there is some structure/pattern in missing data and other observed
data can not explain it, then it is considered to be Missing Not At
Random (MNAR).
 If the missing data does not fall under the MCAR or MAR, it can be
categorized as MNAR.
 It can happen due to the reluctance of people to provide the required
information.
 A specific group of respondents may not answer some questions in a
survey.

Methods for identifying missing data

 .isnull() – Returns a pandas DataFrame (or Series) of boolean values,
True where a value is missing and False otherwise.
 .notnull() – The opposite of .isnull(); its values are False where a NaN
or None value is detected.
 .info() – Prints a summary whose “Non-Null Count” column shows the
number of non-missing values in each column.
 .isna() – An alias of .isnull(); returns True where a value is missing
(NaN or None).
A short usage example of these functions is shown below.
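
A small, self-contained illustration of these functions (the DataFrame below is
made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': ['C85', None, None, 'E46'],
})

print(df.isnull())           # True where a value is missing
print(df.isnull().sum())     # number of missing values per column
print(df.notnull().sum())    # number of non-missing values per column
df.info()                    # includes the "Non-Null Count" column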

Approaches to handle missing values in a dataset (a brief pandas sketch follows the list).


 Deleting Rows with missing values
 Impute missing values for continuous variable
 Impute missing values for categorical variable
 Other Imputation Methods
 Using Algorithms that support missing values


 Prediction of missing values


 Imputation using Deep Learning Library — Datawig
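
As a hedged sketch of the first three approaches (the column names and values
are hypothetical; mean and mode imputation are only two of many possible
strategies):
import pandas as pd

df = pd.DataFrame({
    'Age':      [22.0, None, 26.0, None, 35.0],
    'Embarked': ['S', 'C', None, 'S', None],
})

dropped = df.dropna()                                    # 1) delete rows with missing values

df['Age'] = df['Age'].fillna(df['Age'].mean())           # 2) impute continuous variable with its mean
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])   # 3) impute categorical variable with its mode

print(dropped)
print(df)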
