
Unit 2

Data Analysis
Learning Objectives
 The importance of data analysis in several classes of applications.
 Concepts of regression and its several variants.
 Bayes' rule and how it can be used to perform Bayesian inference.
 The basic concepts of Support Vector Machines.
 The meaning of time series analysis, the various components of a time series and
how decomposition can help prediction.
 How to extract rules describing data from a large data set.
Introduction
 Recent rapid advances in computing, data storage, networks and
sensors have dramatically increased our ability to access, store and
process huge amounts of data.

 The fields of scientific research and business applications are both constantly
challenged with the need to extract relevant information from huge amounts of
data and from heterogeneous data sources, such as sensors, databases, text
archives, images, and audio and video streams.
Data Analysis
 It is a process of inspecting, cleaning, transforming and modelling data with
the goal of discovering useful information, suggesting conclusions and
supporting decision-making.
 Intelligent data analysis (IDA) uses concepts from artificial intelligence, information
retrieval, machine learning, pattern recognition, visualization and distributed
programming.
 The process of IDA typically consists of the following three stages:
 Data preparation
 Data mining and rule finding
 Result validation and interpretation
 It has multiple facets and approaches, encompassing diverse techniques under a
variety of names, in different business, science and social science domains.
Importance of Data Analysis
 Data analysis offers the following benefits:
 Structures the findings from survey research or other means of data collection
 Provides a picture of the data at several levels of granularity, from a macro view down to a
micro one
 Acquires meaningful insights from the data set, which can be exploited to take critical
decisions that improve productivity
 Helps to remove human bias in decision making, through proper statistical treatment
 With the advent of big data, it is even more vital for organizations to find a way to analyze
the ever faster-growing, disparate data coursing through their environments and give it
meaning
Data Analytics Applications
 Understanding and targeting customers
 Understanding and optimizing business processes
 Personal quantification and performance optimization
 Improving healthcare and public health
 Improving sports performance
 Improving science and research
 Optimizing machine and device performance
 Improving security and law enforcement
 Improving and optimizing cities and countries
 Financial trading
Regression Modelling Techniques
 Linear Regression
 Developing a Linear Regression Model
 Multiple Linear Regression (MLR)
 Non-Linear Regression
 Logistic Regression
 Classical Multivariate Analysis
Regression

Basic idea:
Use data to identify relationships among
variables and use these relationships to make
predictions.
Regression (Meaning)

 “To move backwards”.

 “Return to an earlier time or stage”.

 Recurrence of trends.
 One fundamental task in data analysis is to attempt to find how different
variables are related to each other, and one of the central tools in statistics for
learning about such relationships is regression.

 The basic idea behind regression is "Use the existing historical data to
identify potential relationships among variables" and then "use these
relationships to make predictions" about the future.

 Regression analysis is a statistical process for estimating the
relationships among variables.

 It helps to model and analyze several variables when the focus is on the
relationship between a dependent variable and one or more independent
variables (or "predictors").
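As a minimal sketch of this idea (with made-up data standing in for historical observations), the following Python snippet fits a straight line to a set of (x, y) points by least squares and then uses the fitted relationship to predict an unseen value:

```python
import numpy as np

# Hypothetical historical data, e.g. a predictor x vs. an outcome y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.1, 8.0, 9.9])

# Fit y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted relationship to predict y for a new x
y_pred = slope * 6.0 + intercept
```

Here the "relationship identified from data" is just the slope and intercept; prediction is plugging a new predictor value into the fitted equation.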
Example
 For example, the effect of a petrol price increase upon the demand for petrol-run
cars, or the effect of a global fall in oil prices on the inflation rate, etc.

 It also helps to study how the changing behaviour of a set of predictors affects
the behaviour of the dependent variable, and allows us to use numerical values
to model these effects.

 In simple terms, regression analysis allows us to model the dependent
variable as a function of its predictors.

 Regression techniques assume the existence of a large volume of data on the
underlying variables of interest, and use this data to estimate the quantitative
effect of the causal variables upon the variable that they influence.

 Regression analysis is widely used for prediction and forecasting, and is thus
a useful data analysis technique. It can also be used for causal inference
and descriptive modelling.

 There are various kinds of regression techniques available. These
techniques are mostly driven by three metrics:
 the number of independent variables,
 the type of dependent variable, and
 the shape of the regression line.
 R (coefficient of correlation) measures the degree of relationship between two
variables, say x and y. It can range between -1 and 1.
 R squared (coefficient of determination) shows the percentage of variation in y
which is explained by all the x variables together. Higher is better. It always lies
between 0 and 1 and can never be negative, since it is a squared value.
 A model that needs a two-factor interaction can obtain it through a cross-product term.
 Logistic regression is the standard way to model binary
outcomes. It is used to find the probability of event = success
and event = failure.
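As a sketch (the coefficients below are invented for illustration, not fitted from data): logistic regression passes a linear combination of the predictors through the sigmoid function, turning an unbounded score into a probability of success between 0 and 1:

```python
import math

def predict_probability(x, weight, bias):
    """Logistic (sigmoid) transform of a linear score into P(success)."""
    score = weight * x + bias
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients: larger x pushes the outcome toward "success"
p = predict_probability(x=2.0, weight=1.5, bias=-1.0)
print(round(p, 3))  # -> 0.881
```

P(failure) is simply 1 - P(success), which is why the sigmoid output models a binary outcome directly.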
Regression models
 Linear regression establishes a relationship between dependent variable (Y) and one or
more independent variables (X) using a best fit straight line (also known as regression line).

 This includes simple linear regression analysis, in which there is a single independent
variable and the equation for predicting a dependent variable Y is a linear function of a
given independent variable X.

 The multiple linear regression (MLR) model finds the relationship of a variable Y to a set of
k quantitative explanatory variables, but still in a linear fashion.
Regression models
 Polynomial regression: If the relationship between the variables being
analyzed is not linear, a number of non-linear regression techniques may be
used to obtain a more accurate fit. Modelling the dependent variable as a
polynomial function of the predictor is called polynomial regression.

 Multivariate regression: There may be situations where the number of
dependent variables is more than one. We then need models which
jointly determine the values of two or more dependent variables using two or
more equations.
 Such models are called multivariate regression models because they attempt
to explain multiple dependent variables.
Classical Multivariate Analysis
 Multiple regression analysis
 Logistic regression analysis
 Discriminant analysis
 Multivariate analysis of variance
 Factor analysis
 Cluster analysis
 Multidimensional scaling
 Correspondence analysis
 Conjoint analysis
 Canonical correlation
 Structural equation modelling


Bayesian Modelling, Inference and Bayesian Networks
 The notation "|" is read "given", i.e. "under the condition that".
 Inference: a conclusion reached on the basis of evidence and reasoning.
 H: Hypothesis, D: Observed Data. P(H) is the prior distribution over hypotheses,
and P(H|D) is the posterior distribution after observing the data D.
 Bayes' rule connects them: P(H|D) = P(D|H) x P(H) / P(D).
 The posterior cannot simply be equated with the prior in any realistic scenario.
Instead, the denominator P(D) acts as a normalizing constant (sometimes written
as a coefficient alpha), ensuring that the posterior probabilities P(H|D) over all
hypotheses sum to 1.
Example
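As a hypothetical worked example (the numbers below are made up for illustration): suppose 1% of incoming mail is spam, a filter flags 90% of spam, and it falsely flags 5% of legitimate mail. Bayes' rule gives the probability that a flagged message is actually spam:

```python
# Hypothetical numbers for illustration only
p_spam = 0.01              # prior P(H): fraction of mail that is spam
p_flag_given_spam = 0.90   # likelihood P(D|H)
p_flag_given_ham = 0.05    # false-positive rate P(D|not H)

# Normalizing constant P(D): total probability that a message gets flagged
p_flag = p_flag_given_spam * p_spam + p_flag_given_ham * (1 - p_spam)

# Bayes' rule: posterior P(H|D)
p_spam_given_flag = p_flag_given_spam * p_spam / p_flag
print(round(p_spam_given_flag, 3))  # -> 0.154
```

Despite the filter's high hit rate, the posterior is only about 15%, because the prior probability of spam is so low; dividing by P(D) is what keeps the posterior a proper probability.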
Bayesian Model – Naive Bayes Classifier
Zero-frequency problem in Naive Bayes
 Naive Bayes Algorithm: based on the Bayes theorem of Probability and
Statistics with a naive assumption that the features are independent of each
other.
 Bayes' theorem describes the probability of an event, based on prior
knowledge of conditions that might be related to the event.

 The assumption of independence of the features is ‘naive’ because such
scenarios are highly unlikely to be encountered in real life.
The zero-frequency problem
 One of the disadvantages of Naive Bayes is that if you have no occurrences
of a class label and a certain attribute value together, then the frequency-
based probability estimate will be zero. This yields a zero result when all the
probabilities are multiplied together.
 An approach to overcome this ‘zero-frequency problem’ in a Bayesian
environment is to add one to the count for every attribute value-class
combination when an attribute value doesn’t occur with every class value.
 For example, say your training data looked like this:

P(TimeZone=US | Spam=yes) = 10/10 = 1
P(TimeZone=EU | Spam=yes) = 0/10 = 0
Solution: Add one to every value in this table when you’re using it to
calculate probabilities:

P(TimeZone=US | Spam=yes) = 11/12
P(TimeZone=EU | Spam=yes) = 1/12

This way we avoid ever multiplying by a zero probability estimate.
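The add-one (Laplace) correction can be sketched in Python as follows; the counts mirror the hypothetical TimeZone table above:

```python
def smoothed_probability(count, total, num_values):
    """Add-one (Laplace) estimate: add 1 to the observed count and
    num_values (the number of distinct attribute values) to the total."""
    return (count + 1) / (total + num_values)

# TimeZone has two values (US, EU); 10 spam examples, all from US
p_us = smoothed_probability(count=10, total=10, num_values=2)  # 11/12
p_eu = smoothed_probability(count=0, total=10, num_values=2)   # 1/12
```

The unseen (EU, spam) combination now gets a small but non-zero probability, so the product of probabilities in the classifier can no longer collapse to zero.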


Support Vector Machine
 Support Vector Machine (SVM) is a supervised machine
learning algorithm capable of performing classification,
regression and even outlier detection.
 The linear SVM classifier works by drawing a straight line
(a hyperplane) between two classes.

 All the data points that fall on one side of the line will
be labeled as one class and all the points that fall on the
other side will be labeled as the second.
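As a minimal sketch of the decision rule (the hyperplane weights below are made up, standing in for what SVM training would produce), classifying a point reduces to checking which side of the hyperplane w·x + b = 0 it falls on:

```python
# Hypothetical hyperplane learned by a linear SVM: w . x + b = 0
w = (1.0, -1.0)
b = 0.0

def classify(point):
    """Label a 2-D point by the sign of its distance from the hyperplane."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score >= 0 else -1

print(classify((3.0, 1.0)), classify((1.0, 3.0)))  # -> 1 -1
```

Points on one side of the line get label +1, points on the other side get -1; training an SVM amounts to choosing w and b so this line separates the classes with the widest possible margin.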
ARMA: Autoregressive–moving-average model
Mean Square Error (MSE)

MSE is defined as the mean (average) of the squares of the differences between the actual and estimated values.
Mathematically it is represented as:

MSE = (1/n) * Σ_{t=1}^{n} (A(t) - F(t))^2

where A(t) is the actual value and F(t) the forecasted value at point t, and n is the number of observations.

Month               1   2   3   4   5   6   7   8   9  10  11  12
Actual Demand      42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand  44  46  48  50  55  60  64  60  53  48  42  38
Error              -2  -1   1   5   2   0  -2  -2   1   2   2   2
Squared Error       4   1   1  25   4   0   4   4   1   4   4   4

Sum of Squared Errors = 56, so MSE = 56 / 12 = 4.6667
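The computation in the table can be reproduced with a short Python snippet:

```python
# Monthly demand data from the worked example above
actual   = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
forecast = [44, 46, 48, 50, 55, 60, 64, 60, 53, 48, 42, 38]

# MSE: mean of the squared differences between actual and forecasted values
errors = [a - f for a, f in zip(actual, forecast)]
mse = sum(e ** 2 for e in errors) / len(errors)
print(round(mse, 4))  # -> 4.6667
```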

School of Computer Engineering


Root Mean Square Error (RMSE)

It is just the square root of the mean square error. Mathematically it is represented as:

RMSE = sqrt(MSE) = sqrt( (1/n) * Σ_{t=1}^{n} (A(t) - F(t))^2 )

Month               1   2   3   4   5   6   7   8   9  10  11  12
Actual Demand      42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand  44  46  48  50  55  60  64  60  53  48  42  38
Error              -2  -1   1   5   2   0  -2  -2   1   2   2   2
Squared Error       4   1   1  25   4   0   4   4   1   4   4   4

Sum of Squared Errors = 56, MSE = 56 / 12 = 4.6667, RMSE = SQRT(4.6667) ≈ 2.16
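Continuing the same worked example in Python, RMSE is just the square root of the MSE:

```python
import math

# Same monthly demand data as in the MSE example
actual   = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
forecast = [44, 46, 48, 50, 55, 60, 64, 60, 53, 48, 42, 38]

mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
rmse = math.sqrt(mse)   # square root of the mean squared error
print(round(rmse, 2))   # -> 2.16
```

A practical advantage of RMSE over MSE is that it is expressed in the same units as the demand values themselves.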



Mean Absolute Percentage Error (MAPE)

The formula to calculate MAPE is:

MAPE = (100 / n) * Σ_{t=1}^{n} |X(t) - X'(t)| / X(t)

Here, X'(t) represents the forecasted data value of point t and X(t) represents the actual data value of point t. Calculate
MAPE for the below dataset.
Month               1   2   3   4   5   6   7   8   9  10  11  12
Actual Demand      42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand  44  46  48  50  55  60  64  60  53  48  42  38
 MAPE is commonly used because it’s easy to interpret and easy to explain. For example, a MAPE value of 11.5%
means that the average difference between the forecasted value and the actual value is 11.5%.
 The lower the value of MAPE, the better a model is able to forecast values; e.g., a model with a MAPE of 2% is more
accurate than a model with a MAPE of 10%.
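Applying the formula to the dataset above, a short Python snippet computes the MAPE:

```python
# Same monthly demand data as in the MSE/RMSE examples
actual   = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
forecast = [44, 46, 48, 50, 55, 60, 64, 60, 53, 48, 42, 38]

# MAPE: mean of |actual - forecast| / actual, expressed as a percentage
mape = 100 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)
print(round(mape, 2))  # -> 3.64
```

So for this dataset the forecasts deviate from the actual demand by about 3.64% on average.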