Calculating Covariance and Correlation
Covariance and correlation are two statistical concepts used to analyze the relationship between variables and understand datasets better. The two concepts are distinct but related. In this article, we cover both covariance and correlation with examples and see how they are used in data science.
Covariance
Covariance measures how two variables change together indicating the direction of their linear relationship. To understand this, think of features as the different columns in your dataset, such as age, income, or height, which describe various aspects of the data. The covariance matrix helps us measure how these features vary together—whether they are positively related (increase together), negatively related (one increases while the other decreases), or unrelated.
The value of covariance can range from -\infty to +\infty. Its sign indicates whether the relationship between the variables is positive, negative, or absent.
- Positive covariance suggests that as one variable increases, the other tends to increase, while a negative covariance indicates an inverse relationship.
- Zero covariance suggests that the two variables have no linear relationship and do not move together.
However, covariance is not standardized, meaning its value depends on the scale of the variables, making it less interpretable across datasets.
The formula for calculating covariance between two variables X and Y is given below:
\text{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})
where
- X_i: Represents the ith value of variable X.
- Y_i: Represents the ith value of variable Y.
- \bar{X}, \bar{Y}: Represent the means of variables X and Y.
- N: Represents the total number of data points.
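As a quick illustration of the formula, here is a minimal Python sketch (the study-hours and exam-score data are made up for this example). It computes the population covariance by hand and checks it against NumPy's np.cov, passing bias=True so NumPy also divides by N rather than N - 1:

```python
import numpy as np

# Hypothetical sample data: X = hours studied, Y = exam score
X = np.array([2, 4, 6, 8, 10])
Y = np.array([50, 60, 65, 80, 95])

# Population covariance per the formula: mean of (X_i - X_bar) * (Y_i - Y_bar)
cov_manual = np.mean((X - X.mean()) * (Y - Y.mean()))

# np.cov divides by N - 1 by default; bias=True makes it divide by N
cov_numpy = np.cov(X, Y, bias=True)[0, 1]

print(cov_manual, cov_numpy)  # both print 44.0
```

The positive value tells us the two variables tend to increase together, but 44.0 by itself says nothing about strength, since it depends on the units of X and Y.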
While covariance is useful for determining the type of relationship between variables, it is not suitable for interpreting the magnitude of that relationship because it is scale-dependent: covariance shows whether two variables move together, but since its value depends on the units of measurement, it doesn't clearly indicate how strong the relationship is.
Now, when we need to understand the relationships between multiple features in a dataset, we use a covariance matrix, which is a square matrix where each element represents the covariance between a pair of features. The diagonal elements of the matrix capture the variance of individual features, while the off-diagonal elements indicate how pairs of features vary together.
The covariance matrix also plays a key role in feature selection and feature extraction by identifying redundant or highly correlated features that may introduce noise or multicollinearity into models. For example, highly correlated features can be removed to improve model interpretability and reduce overfitting.
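A minimal sketch of building a covariance matrix with NumPy (the dataset and feature names are hypothetical):

```python
import numpy as np

# Hypothetical dataset: rows are observations, columns are the features
# age (years), income ($1000s), and height (cm)
data = np.array([
    [25, 40, 170],
    [32, 55, 165],
    [47, 80, 180],
    [51, 72, 175],
])

# rowvar=False treats each column as a feature; bias=True divides by N
cov_matrix = np.cov(data, rowvar=False, bias=True)
print(cov_matrix)
# Diagonal entries are the variances of age, income, and height;
# off-diagonal entries are the pairwise covariances.
```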
Correlation
Covariance tells us the direction of the relationship between variables but not its strength. Correlation resolves this problem: it is a standardized version of covariance that measures both the direction and strength of the linear relationship between two variables, obtained by dividing the covariance by the product of the standard deviations of the variables.
This results in a dimensionless value ranging from -1 to +1, where values close to +1 or -1 indicate strong positive or negative linear relationships, respectively, and 0 implies no linear relationship. In a correlation matrix, the diagonal always contains 1s, as each feature is perfectly correlated with itself.
The formula of correlation is given below:
\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
Where,
- Cov(X, Y): Represents the covariance between variables X and Y.
- \sigma_X: Represents the standard deviation of variable X.
- \sigma_Y: Represents the standard deviation of variable Y.
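Continuing with the earlier hypothetical data, a short sketch showing that correlation is simply covariance rescaled by the two standard deviations:

```python
import numpy as np

X = np.array([2, 4, 6, 8, 10])
Y = np.array([50, 60, 65, 80, 95])

cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))  # population covariance
corr = cov_xy / (X.std() * Y.std())                # divide by sigma_X * sigma_Y

print(corr)                     # manual calculation, about 0.984
print(np.corrcoef(X, Y)[0, 1])  # NumPy's built-in, matches
```

Unlike the raw covariance of 44.0, the value of about 0.984 immediately reads as a strong positive linear relationship.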
We can classify correlation into three types:
- Simple Correlation: It measures the relationship between two variables with one number.
- Partial Correlation: This helps to show the relationship between two variables while removing the influence of a third variable.
- Multiple Correlation: It is a technique that uses two or more variables to predict a single outcome.
Correlation is particularly useful for comparing relationships across different datasets or features in machine learning.
The correlation matrix is particularly useful for feature selection and exploratory data analysis. By identifying highly correlated features, data scientists can detect redundancy in the dataset and decide which features to keep or remove. For example, in regression models, eliminating highly correlated features helps avoid multicollinearity, which can distort the model's predictions. Additionally, the matrix provides insights into which features are most influential for predicting the target variable.
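As a sketch of this workflow with pandas (the feature table and the 0.9 threshold are hypothetical choices):

```python
import pandas as pd

# Hypothetical feature table
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38],
    "income": [40, 55, 80, 72, 60],
    "height": [170, 165, 180, 175, 172],
})

corr_matrix = df.corr()  # Pearson correlation by default
print(corr_matrix)

# Flag feature pairs whose absolute correlation exceeds the chosen
# threshold as candidates for removal
threshold = 0.9
for i, a in enumerate(corr_matrix.columns):
    for b in corr_matrix.columns[i + 1:]:
        if abs(corr_matrix.loc[a, b]) > threshold:
            print(f"Highly correlated pair: {a}, {b}")
```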
Difference Between Covariance and Correlation
| Aspect | Covariance | Correlation |
|---|---|---|
| Definition | Measures how two variables vary together, indicating the direction of their linear relationship. | Measures both the direction and strength of the linear relationship between two variables. |
| Unit Dependency | Units depend on the product of the units of the two variables. | Unitless, making it easier to compare across datasets or features. |
| Range | Lies between -\infty and +\infty. | Lies between -1 and +1. |
Methods of Calculating Correlation
The Graphic Method and the Scatter Plot Method are simple, visual ways to understand the correlation between variables.
1. Graphic Method: In this method, the values of two variables are plotted on a graph, with one variable on the X-axis and the other on the Y-axis.
- By observing the trend of the plotted points:
- An upward trend (from bottom-left to top-right) indicates a positive correlation.
- A downward trend (from top-left to bottom-right) indicates a negative correlation.
- However, this method only gives an idea about the direction of the relationship and does not provide any information about its magnitude or strength.
2. Scatter Plot Method: A scatter plot is a more detailed version of the graphic method, where individual data points are plotted on a graph.
- Observing the arrangement of points:
- If the points form an upward pattern from left to right, there is a positive correlation.
- If they form a downward pattern from left to right, there is a negative correlation.
- If there is no discernible pattern, it suggests no correlation.
- Scatter plots are widely used in exploratory data analysis because they visually highlight relationships, outliers, and trends in data.
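A minimal matplotlib sketch of the scatter plot method, using randomly generated data to illustrate the three cases described above:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(x, 2 * x + rng.normal(0, 2, 50))   # upward trend: positive correlation
axes[1].scatter(x, -2 * x + rng.normal(0, 2, 50))  # downward trend: negative correlation
axes[2].scatter(x, rng.normal(0, 2, 50))           # no pattern: no correlation
for ax, title in zip(axes, ["Positive", "Negative", "No correlation"]):
    ax.set_title(title)
plt.show()
```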
Scatter plots showing correlation
Karl Pearson's Coefficient and Spearman's Rank Correlation
When analyzing relationships between variables, it is important to choose the right correlation method based on the type of data and its distribution. Two widely used methods are Karl Pearson's Coefficient of Correlation and Spearman’s Rank Correlation.
1. Karl Pearson's Coefficient: This is the most common method to calculate correlation. It provides a standardized value (ranging from -1 to +1) that indicates both the direction (positive or negative) and the strength of the relationship. This method assumes that the data follows a normal distribution and that the relationship between variables is linear.
It is also the default method for computing correlation in many programming libraries. The formula is:
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \cdot \sum (Y - \bar{Y})^2}}
where \bar{X} and \bar{Y} are the means of X and Y.
- It only works for linear relationships.
- Sensitive to outliers, which can distort results.
- Assumes normality in data distribution.
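A short sketch computing Pearson's r both directly from the formula and with SciPy (the data is the same hypothetical sample used earlier):

```python
import numpy as np
from scipy import stats

X = np.array([2, 4, 6, 8, 10])
Y = np.array([50, 60, 65, 80, 95])

# Direct translation of the formula above
num = np.sum((X - X.mean()) * (Y - Y.mean()))
den = np.sqrt(np.sum((X - X.mean()) ** 2) * np.sum((Y - Y.mean()) ** 2))
print(num / den)

# scipy.stats.pearsonr also returns a p-value for testing the correlation
r, p_value = stats.pearsonr(X, Y)
print(r)
```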
2. Spearman's Rank Correlation: It is used when data does not follow a normal distribution or when dealing with ordinal data (data that can be ranked but not measured precisely). Unlike Pearson's correlation, Spearman's method measures the strength and direction of a monotonic relationship, meaning that as one variable increases, the other consistently increases or decreases, even if the relationship is not linear. The formula is:
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
where d_i is the difference between the ranks of X and Y, and n is the total number of observations.
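A sketch of Spearman's rho, computed both from the rank-difference formula and with scipy.stats.spearmanr (the data is made up and deliberately monotonic but non-linear):

```python
import numpy as np
from scipy import stats

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 9, 20, 55])  # monotonic but clearly non-linear

# Rank-difference formula (valid when there are no tied ranks)
rank_x = stats.rankdata(X)
rank_y = stats.rankdata(Y)
d = rank_x - rank_y
n = len(X)
rho_manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho, p_value = stats.spearmanr(X, Y)
print(rho_manual, rho)  # both 1.0: a perfect monotonic relationship
```

Note that Pearson's r on this data would be well below 1, since the relationship is not linear; Spearman's rho captures the monotonic pattern exactly.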
Both methods are essential tools in data science:
- Karl Pearson's Coefficient is ideal for analyzing continuous variables with linear relationships.
- Spearman’s Rank Correlation is better suited for non-linear or ranked data.
Limitations of Covariance and Correlation
Covariance:
- Covariance depends on the units used, making it hard to interpret when the variables are measured in different units.
- It only works for straight-line relationships and doesn’t capture more complex or curved relationships.
- Covariance doesn't show how strong the relationship is.
Correlation:
- Correlation assumes a straight-line relationship and doesn’t work well with non-linear relationships.
- Correlation can be heavily affected by outliers.
- Correlation is best for normally distributed data and may not be reliable for skewed data.