Multivariate Analysis
An Introduction to Multivariate Analysis
Data analytics is all about looking at various factors to see how
they impact certain situations and outcomes.
When dealing with data that contains more than two
variables, you’ll use multivariate analysis.
Multivariate analysis isn’t just one specific method—rather, it
encompasses a whole range of statistical techniques.
These techniques allow you to gain a deeper understanding of
your data in relation to specific business or real- world
scenarios.
So, if you’re an aspiring data analyst or data scientist,
multivariate analysis is an important concept to get to grips
with.
1. What is multivariate analysis?
In data analytics, we look at different variables (or factors) and how
they might impact certain situations or outcomes.
For example, in marketing, you might look at how the variable “money
spent on advertising” impacts the variable “number of sales.”
In the healthcare sector, you might want to explore whether there’s a
correlation between “weekly hours of exercise” and “cholesterol
level.”
1. What is multivariate analysis?
This helps us to understand why certain outcomes occur, which in turn
allows us to make informed predictions and decisions for the future.
There are three categories of analysis to be aware of:
Univariate analysis, which looks at just one variable
Bivariate analysis, which analyzes two variables
Multivariate analysis, which looks at more than two variables
1. What is multivariate analysis?
As you can see, multivariate analysis encompasses all statistical
techniques that are used to analyze more than two variables at once.
The aim is to find patterns and correlations between several variables
simultaneously—
Allowing for a much deeper, more complex understanding of a given
scenario than you’ll get with bivariate analysis.
An example of multivariate analysis
Let’s imagine you’re interested in the relationship between a person’s
social media habits and their self-esteem.
You could carry out a bivariate analysis, comparing the following two
variables:
1. How many hours a day a person spends on Instagram
2. Their self-esteem score (measured using a self-esteem scale)
An example of multivariate analysis
An example of multivariate analysis
You may or may not find a relationship between the two
variables;
However, you know that, in reality, self-esteem is a complex
concept.
It’s
likely impacted by many different factors—not just how
many hours a person spends on Instagram.
You might also want to consider factors such as age,
employment status, how often a person exercises, and
relationship status (for example).
An example of multivariate analysis
In order to deduce the extent to which each of these variables
correlates with self-esteem, and with each other, you’d need to
run a multivariate analysis.
So we know that multivariate analysis is used when you want to
explore more than two variables at once.
Now let’s consider some of the different techniques you might use to
do this.
Multivariate data analysis techniques and examples
There are many different techniques for multivariate analysis,
and they can be
divided into two categories:
1. Dependence techniques
2. Interdependence techniques
Multivariate analysis techniques: Dependence vs.
interdependence
When we use the terms “dependence” and “interdependence,”
we’re referring to different types of relationships within the
data.
Dependence methods
Dependence methods are used when one or some of the
variables are dependent on others.
Dependence looks at cause and effect;
In other words, can the values of two or more independent
variables be used to explain, describe, or predict the value of
another, dependent variable?
Dependence methods
To give a simple example, the dependent variable of “weight”
might be predicted by independent variables such as “height”
and “age.”
In machine learning, dependence techniques are used to build
predictive models.
The analyst enters input data into the model, specifying which
variables are independent and which ones are dependent
Dependence methods
—in other words, which variables they want the model to
predict, and which variables they want the model to use to
make those predictions
Interdependence methods
Interdependence methods are used to understand the
structural makeup and underlying patterns within a dataset.
In this case, no variables are dependent on others, so you’re
not looking for causal relationships.
Rather, interdependence methods seek to give meaning to a
set of variables or to group them together in meaningful ways.
Interdependence methods
So: One is about the effect of certain variables on others, while the
other is all about the structure of the dataset.
Some useful multivariate analysis techniques are
Multiple linear regression
Multiple logistic regression
Multivariate analysis of variance (MANOVA)
Factor analysis
Cluster analysis
Multiple linear regression
Multiple linear regression is a dependence method which looks at the
relationship between one dependent variable and two or more
independent variables.
A multiple regression model will tell you the extent to which each
independent variable has a linear relationship with the dependent
variable.
This is useful as it helps you to understand which factors are likely to
influence a certain outcome, allowing you to estimate future outcomes.
Example of multiple regression:
As a data analyst, you could use multiple regression to predict crop
growth.
In this example, crop growth is your dependent variable and you want
to see how different factors affect it.
Your independent variables could be rainfall, temperature, amount of
sunlight, and amount of fertilizer added to the soil.
A multiple regression model would show you the proportion of
variance in crop growth that each independent variable accounts for.
Multiple logistic regression
Logistic regression analysis is used to calculate (and predict) the
probability of a binary event occurring.
A binary outcome is one where there are only two possible outcomes;
either the event occurs (1) or it doesn’t (0).
So, based on a set of independent variables, logistic regression can
predict how likely it is that a certain scenario will arise.
It is also used for classification.
Example of logistic regression:
Let’s imagine you work as an analyst within the insurance sector and
you need to predict how likely it is that each potential customer will
make a claim.
You might enter a range of independent variables into your model,
such as age, whether or not they have a serious health condition, their
occupation, and so on.
Using these variables, a logistic regression analysis will calculate the
probability of the event (making a claim) occurring.
Another oft-cited example is the filters used to classify email as
“spam” or “not spam.”
Cluster analysis
Another interdependence technique, cluster analysis is used to group
similar items within a dataset into clusters.
When grouping data into clusters, the aim is for the variables in one
cluster to be more similar to each other than they are to variables in other
clusters.
This is measured in terms of intracluster and intercluster distance.
Intracluster distance looks at the distance between data points within one
cluster.
This should be small.
Cluster analysis
Intercluster distance looks at the distance between data points in
different clusters.
This should ideally be large.
Cluster analysis helps you to understand how data in your sample is
distributed, and to find patterns.
Cluster analysis example:
A prime example of cluster analysis is audience segmentation.
If you were working in marketing, you might use cluster analysis to
define different customer groups which could benefit from more targeted
campaigns.
As a healthcare analyst, you might use cluster analysis to explore
whether certain lifestyle factors or geographical locations are associated
with higher or lower cases of certain illnesses.
Because it’s an interdependence technique, cluster analysis is often
carried out in the early stages of data analysis.