Chapter 2
Data Analysis
Learning Objectives
The importance of data analysis in several classes of applications.
Concepts of regression and its several variants.
Bayes' Rule and how it can be used to perform Bayesian inference.
The basic concepts of Support Vector Machines.
The meaning of time series analysis, the various components of a time series and
how decomposition can help prediction.
How to extract rules to describe data from a large data set.
Introduction
Recent rapid advances in computing, data storage, networks and
sensors have dramatically increased our ability to access, store and
process huge amounts of data.
Basic idea:
Use data to identify relationships among
variables and use these relationships to make
predictions.
Regression (Meaning)
Literally, a re-occurrence or return of earlier trends.
One fundamental task in data analysis is to attempt to find how different
variables are related to each other, and one of the central tools in statistics for
learning about such relationships is regression.
The basic idea behind regression is "Use the existing historical data to
identify potential relationships among variables" and then "use these
relationships to make predictions" about the future.
It helps to model and analyze several variables when the focus is on the
relationship between a dependent variable and one or more independent
variables (or "predictors").
Example
For example, the effect of an increase in petrol prices on the demand for petrol-run
cars, or the effect of a global fall in oil prices on the inflation rate.
It also helps to study how the changing behaviour of a set of predictors affects
the behaviour of the dependent variable and allows us to use numerical values
to model these effects.
This includes simple linear regression analysis, in which there is a single independent
variable and the equation for predicting a dependent variable Y is a linear function of a
given independent variable X.
The multiple linear regression (MLR) model finds the relationship between a variable Y and a set of
k quantitative explanatory variables, still in a linear fashion.
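To make this concrete, the following is a minimal sketch of fitting both models by ordinary least squares, assuming NumPy is available; the small data arrays are invented purely for illustration.

    # Simple and multiple linear regression via NumPy's least-squares solver.
    import numpy as np

    # Simple linear regression: Y = b0 + b1 * X
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # single predictor
    Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])       # dependent variable

    A = np.column_stack([np.ones_like(X), X])      # design matrix [1, X]
    (b0, b1), *_ = np.linalg.lstsq(A, Y, rcond=None)
    print(f"Y = {b0:.2f} + {b1:.2f} * X (fitted)")

    # Multiple linear regression: Y = b0 + b1*X1 + b2*X2 (k = 2 predictors)
    X_multi = np.array([[1.0, 0.5],
                        [2.0, 1.5],
                        [3.0, 1.0],
                        [4.0, 2.5],
                        [5.0, 2.0]])
    A_multi = np.column_stack([np.ones(len(X_multi)), X_multi])
    coefs, *_ = np.linalg.lstsq(A_multi, Y, rcond=None)
    print("MLR coefficients (b0, b1, b2):", coefs)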
Regression models
Polynomial regression: If the relationship between the variables being
analyzed is not a straight line, the dependent variable can instead be modelled as a
polynomial function of the independent variable to obtain a more accurate fit. This is
called polynomial regression; the model is still linear in its parameters, so it can be
fitted by ordinary least squares (see the sketch after the list below).
Discriminant analysis
Factor analysis
Cluster analysis
Multidimensional scaling
Correspondence analysis
Conjoint analysis
Canonical correlation
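As noted above, a polynomial in the independent variable can be fitted with the same least-squares machinery. The sketch below assumes NumPy is available and uses invented, roughly quadratic data for illustration.

    # Polynomial regression with NumPy's polyfit: Y = b0 + b1*X + b2*X^2
    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.2, 5.1, 10.3, 17.0, 26.2, 37.1])   # roughly quadratic data

    coeffs = np.polyfit(X, Y, deg=2)    # coefficients, highest power first
    poly = np.poly1d(coeffs)            # convenient callable polynomial
    print("Fitted polynomial:\n", poly)
    print("Prediction at X = 7:", poly(7.0))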
P(TimeZone = US | Spam = yes) = 10/10 = 1
P(TimeZone = EU | Spam = yes) = 0/10 = 0
Solution: add one to every value in this table when using it to calculate
probabilities (Laplace, or add-one, smoothing):
P(TimeZone = US | Spam = yes) = 11/12
P(TimeZone = EU | Spam = yes) = 1/12
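A minimal sketch of this add-one adjustment, using the counts from the example above (10 spam messages, all observed in the US time zone, none in the EU):

    # Add-one (Laplace) smoothing of conditional probabilities.
    counts = {"US": 10, "EU": 0}      # counts of TimeZone given Spam = yes
    total = sum(counts.values())      # 10 spam messages in total
    k = len(counts)                   # number of time-zone categories

    for zone, c in counts.items():
        unsmoothed = c / total                 # 10/10 = 1, 0/10 = 0
        smoothed = (c + 1) / (total + k)       # 11/12, 1/12
        print(f"P(TimeZone={zone} | Spam=yes): raw {unsmoothed:.2f}, smoothed {smoothed:.3f}")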
All the data points that fall on one side of the line will
be labelled as one class, and all the points that fall on the
other side will be labelled as the second class.
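A minimal sketch of this idea, assuming scikit-learn is installed; the toy points and labels are invented so that the two classes sit on opposite sides of a straight line.

    # A linear SVM that separates two classes with a line in 2-D.
    from sklearn import svm

    X = [[1, 1], [2, 1], [1, 2],      # class 0 (one side of the line)
         [5, 5], [6, 5], [5, 6]]      # class 1 (other side of the line)
    y = [0, 0, 0, 1, 1, 1]

    clf = svm.SVC(kernel="linear")    # fit a separating hyperplane (a line in 2-D)
    clf.fit(X, y)

    # New points are labelled according to which side of the line they fall on.
    print(clf.predict([[2, 2], [6, 6]]))    # expected: [0 1]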
ARMA:
Autoregressive–moving-average model
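As an illustration of the model's structure, the sketch below simulates an ARMA(1, 1) series with NumPy; the coefficient values are arbitrary example choices, not taken from the text.

    # Simulate an ARMA(1, 1) process:
    #   X(t) = c + phi * X(t-1) + theta * e(t-1) + e(t)
    import numpy as np

    rng = np.random.default_rng(0)
    n, c, phi, theta = 200, 0.5, 0.6, 0.3

    e = rng.normal(0.0, 1.0, n)       # white-noise shocks
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = c + phi * x[t - 1] + theta * e[t - 1] + e[t]

    print("Simulated series mean:", x.mean())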
Mean Square Error (MSE)
MSE is defined as the mean, or average, of the squared differences between the actual and estimated values.
Mathematically it is represented as:
MSE = (1/n) Σ (X(t) − X'(t))²
where X(t) is the actual value at point t, X'(t) is the forecasted (estimated) value at point t, and n is the number of data points.
Month                1   2   3   4   5   6   7   8   9  10  11  12
Actual Demand       42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand   44  46  48  50  55  60  64  60  53  48  42  38
Error               -2  -1   1   5   2   0  -2  -2   1   2   2   2
Squared Error        4   1   1  25   4   0   4   4   1   4   4   4

Sum of Squared Errors = 56 and MSE = 56 / 12 = 4.6667
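A minimal sketch that reproduces the MSE calculation for the table above:

    # MSE for the monthly demand example.
    actual     = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
    forecasted = [44, 46, 48, 50, 55, 60, 64, 60, 53, 48, 42, 38]

    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecasted)]
    sse = sum(squared_errors)         # 56
    mse = sse / len(actual)           # 56 / 12 = 4.6667
    print("SSE:", sse, "MSE:", round(mse, 4))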
Root Mean Square Error (RMSE)
RMSE is just the square root of the mean square error. Mathematically it is represented as:
RMSE = √MSE = √( (1/n) Σ (X(t) − X'(t))² )
Using the same data as above: Sum of Squared Errors = 56, MSE = 56 / 12 = 4.6667, and RMSE = √4.6667 ≈ 2.16.
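Continuing the sketch above, RMSE is obtained by taking the square root of the MSE:

    # RMSE for the monthly demand example.
    import math

    mse = 56 / 12
    rmse = math.sqrt(mse)             # about 2.16
    print("RMSE:", round(rmse, 2))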
Mean Absolute Percentage Error (MAPE)
MAPE is the mean of the absolute percentage differences between the actual and forecasted values. Mathematically it is represented as:
MAPE = (100/n) Σ |X(t) − X'(t)| / X(t)
Here, X'(t) represents the forecasted data value of point t and X(t) represents the actual data value of point t.
Calculate MAPE for the demand dataset given above (Actual vs. Forecasted Demand for months 1 to 12).
MAPE is commonly used because it is easy to interpret and easy to explain. For example, a MAPE value of 11.5%
means that, on average, the forecasted values differ from the actual values by 11.5%.
The lower the MAPE, the better a model is able to forecast values; e.g. a model with a MAPE of 2% is more
accurate than a model with a MAPE of 10%.
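A minimal sketch of the MAPE calculation for the same demand data, expressed as a percentage:

    # MAPE for the monthly demand example.
    actual     = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
    forecasted = [44, 46, 48, 50, 55, 60, 64, 60, 53, 48, 42, 38]

    mape = 100 * sum(abs(a - f) / a for a, f in zip(actual, forecasted)) / len(actual)
    print(f"MAPE: {mape:.2f}%")       # about 3.64% for this data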