Detecting Multicollinearity with VIF – Python
Last Updated: 02 Jan, 2025
Multicollinearity is a common problem in regression analysis, where two or more independent variables are highly correlated. This issue can lead to inaccurate predictions and misinterpretation of the model, as it inflates the standard errors of regression coefficients, causing unreliable estimates. To effectively detect multicollinearity, the Variance Inflation Factor (VIF) is widely used.
VIF quantifies the level of multicollinearity by measuring how much the variance of a regression coefficient increases due to the correlation between predictor variables. A high VIF value signals that multicollinearity is present, affecting the stability of the model. In this article, we will explore how to detect multicollinearity using the Variance Inflation Factor (VIF).
In the Variance Inflation Factor (VIF) method, we assess the degree of multicollinearity by selecting each feature and regressing it against all other features in the model. This process calculates how much the variance of a regression coefficient is inflated due to the correlations between independent variables.
The VIF for the i-th feature is given by the following formula:

\[ \text{VIF}_i = \frac{1}{1 - R_i^2} \]
Here, \(R_i^2\) is the coefficient of determination obtained by regressing the i-th feature on all the other features; its value lies between 0 and 1. It measures the proportion of variance in that feature which is explained by the remaining predictors, so a higher R-squared means the feature is more strongly related to the others and therefore has a higher VIF.
As the formula shows, the greater the R-squared, the greater the VIF, and hence the stronger the correlation of that feature with the rest. If R-squared is close to 1, the feature can be almost entirely explained by the other variables, indicating a high degree of multicollinearity. As a rule of thumb, a VIF above 5 indicates high multicollinearity.
By understanding the VIF formula, we can accurately detect multicollinearity in our regression models and take the necessary steps to address it. The interpretation of VIF values is key to ensuring that multicollinearity does not affect the reliability of your regression analysis.
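To see how the formula works in practice, here is a short, illustrative sketch (the DataFrame X and the column name are placeholders, not part of the original example) that computes the VIF of a single feature by hand: it regresses that column on the remaining columns with statsmodels and plugs the resulting R-squared into the formula.
Python
import statsmodels.api as sm

def manual_vif(X, column):
    """VIF of one column of DataFrame X, computed from first principles."""
    y = X[column]                                        # feature under test
    others = sm.add_constant(X.drop(columns=[column]))   # remaining features + intercept
    r_squared = sm.OLS(y, others).fit().rsquared         # R^2 of the auxiliary regression
    return 1.0 / (1.0 - r_squared)                       # VIF = 1 / (1 - R^2)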
Detect Multicollinearity Using VIF in Python
To detect multicollinearity in regression analysis, we can compute the Variance Inflation Factor (VIF) with the statsmodels library. The statsmodels package provides the function variance_inflation_factor(), which calculates the VIF for each feature in the dataset and thereby indicates the presence of multicollinearity.
Syntax : statsmodels.stats.outliers_influence.variance_inflation_factor(exog, exog_idx)
Parameters :
- exog : an array containing features on which linear regression is performed.
- exog_idx : index of the feature (column of exog) whose VIF is to be calculated, i.e. how strongly that feature is explained by the others.
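As a minimal, standalone illustration of the call (the array below is synthetic, purely to show the signature), the VIF of a single column can be obtained as follows; for roughly independent random columns the value comes out close to 1.
Python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
exog = rng.normal(size=(100, 3))                     # 100 observations, 3 synthetic features
vif_of_first = variance_inflation_factor(exog, 0)    # VIF of the column at index 0
print(vif_of_first)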
The dataset : Let’s walk through an example using a dataset that contains information about 500 individuals: their gender, height, weight, and a BMI category stored in the Index column. Here, Index is the dependent variable and the remaining columns serve as independent features. Our goal is to check for multicollinearity among these features using VIF.
Python
import pandas as pd
# the dataset
data = pd.read_csv('BMI.csv')
# printing first few rows
print(data.head())
Output :
Gender Height Weight Index
0 Male 174 96 4
1 Male 189 87 2
2 Female 185 110 4
3 Female 195 104 3
4 Male 149 61 3
Approach :
- Each feature index is passed to variance_inflation_factor() to obtain the corresponding VIF.
- These values are stored in the form of a Pandas DataFrame.
Python
from statsmodels.stats.outliers_influence import variance_inflation_factor
# creating dummies for gender
data['Gender'] = data['Gender'].map({'Male':0, 'Female':1})
# the independent variables set
X = data[['Gender', 'Height', 'Weight']]
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
for i in range(len(X.columns))]
print(vif_data)
Output :
feature VIF
0 Gender 2.028864
1 Height 11.623103
2 Weight 10.688377
- As we can see, height and weight have very high values of VIF, indicating that these two variables are highly correlated.
- This is expected as the height of a person does influence their weight. Hence, considering these two features together leads to a model with high multicollinearity.
By implementing VIF using statsmodels, you can efficiently assess the presence of multicollinearity in your regression models, leading to more robust and reliable analyses.
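One practical caveat: variance_inflation_factor() simply regresses each column of exog on the others, so the auxiliary regressions only include an intercept if you add a constant column yourself. A common pattern, sketched below reusing the X DataFrame from above (not part of the original example), is to add the constant with statsmodels' add_constant and then report VIFs for the original features only.
Python
import pandas as pd
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

# add an explicit intercept column named "const"
X_const = add_constant(X)

vif_with_const = pd.DataFrame({
    "feature": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(X_const.shape[1])]
})

# the constant's own VIF row is not informative, so drop it before inspecting
print(vif_with_const[vif_with_const["feature"] != "const"])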
Addressing Multicollinearity: What to Do If VIF Is High
When conducting regression analysis, encountering high Variance Inflation Factor (VIF) values indicates the presence of multicollinearity, which can compromise the reliability of your model. Here are several effective strategies to address high VIF values and improve model performance:
1. Removing Highly Correlated Features
One of the most straightforward approaches to dealing with multicollinearity is to remove one or more of the highly correlated features from your dataset. This can be accomplished by:
- Analyzing the correlation matrix to identify pairs of features with high correlation coefficients (generally above 0.7 or 0.8).
- Dropping the feature that is less important for your analysis or has a higher VIF value.
By removing features with high multicollinearity, you can reduce redundancy and make your regression model more interpretable, ultimately leading to more reliable coefficient estimates.
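As a rough illustration of this workflow (reusing X and the imports from the example above; the choice to drop Height rather than Weight is arbitrary here), one could inspect the correlation matrix, drop the feature with the higher VIF, and recompute the VIFs on the reduced feature set.
Python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# pairwise correlations help confirm which features are redundant
print(X.corr())

# drop the feature with the higher VIF and recompute
X_reduced = X.drop(columns=["Height"])
vif_reduced = pd.DataFrame({
    "feature": X_reduced.columns,
    "VIF": [variance_inflation_factor(X_reduced.values, i)
            for i in range(X_reduced.shape[1])]
})
print(vif_reduced)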
2. Combining Variables or Using Dimensionality Reduction Techniques
Another effective strategy for addressing multicollinearity is to combine correlated features into a single variable. This can involve:
- Creating composite scores by averaging or summing related features. For example, if height and weight are highly correlated, you might create a new feature representing a Body Mass Index (BMI).
- Utilizing dimensionality reduction techniques like Principal Component Analysis (PCA). PCA transforms correlated variables into a set of uncorrelated components, which can capture most of the variance in the original dataset while eliminating multicollinearity issues. The principal components can then be used as features in your regression model.
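The fragment below is a minimal sketch of the PCA route, assuming scikit-learn is available (it is not used elsewhere in this article). It standardizes the correlated Height and Weight columns of the earlier X DataFrame and replaces them with uncorrelated principal components.
Python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# standardize the two correlated features before PCA
scaled = StandardScaler().fit_transform(X[["Height", "Weight"]])

# the resulting components are uncorrelated with each other by construction
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

X_pca = pd.DataFrame(components, columns=["PC1", "PC2"], index=X.index)
X_pca["Gender"] = X["Gender"]
print(pca.explained_variance_ratio_)   # share of variance captured by each component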
Beyond these approaches, regularization techniques such as ridge regression shrink correlated coefficients and make their estimates more stable. By incorporating these strategies into your regression analysis, you can enhance model performance and interpretability while addressing the challenges posed by multicollinearity.
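As a brief, hedged sketch of the regularization idea (using scikit-learn's Ridge, which is an assumption about the toolchain rather than part of the original example), a ridge model keeps all features but shrinks correlated coefficients, which stabilizes the estimates even when VIFs are high.
Python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Index is the target from the BMI dataset; X holds the collinear predictors
y = data["Index"]
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(X, y)

# shrunken coefficients for Gender, Height and Weight
print(ridge_model.named_steps["ridge"].coef_)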
Conclusion
Detecting multicollinearity using VIF allows data scientists and statisticians to address collinearity issues and enhance the performance of their regression models. Understanding and correcting multicollinearity in regression is crucial for improving model accuracy, especially in fields like econometrics, where variable relationships play a key role.