Ordinary Least Squares (OLS) using statsmodels


Ordinary Least Squares (OLS) is a widely used statistical method for estimating the parameters of a linear regression model. It minimizes the sum of squared residuals between the observed and predicted values. In this article, we will learn how to implement OLS regression using Python's statsmodels module.

Overview of Linear Regression Model

A linear regression model establishes the relationship between a dependent variable (y) and one or more independent variables (x):

\hat{y} = b_1 x + b_0

Where:

  • \hat{y} : Predicted value of y
  • b_1 : Slope of the line (coefficient of x)
  • b_0 : Intercept (value of y when x = 0)

The OLS method minimizes the residual sum of squares (S), defined as:

S = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

To find the optimal values of b_0 and b_1, the partial derivatives of S with respect to each coefficient are taken and set to zero.
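Solving the resulting equations gives the familiar closed-form estimates for simple linear regression:

b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}

As a quick numerical check, these estimates can be computed by hand with numpy; the values below are toy data, for illustration only.

Python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy predictor values
y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])   # toy response values

# Closed-form OLS estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # intercept and slope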

Implementing OLS Regression Using statsmodels

Step 1: Import Required Libraries

Before starting, we need to import the necessary libraries: statsmodels, pandas, numpy and matplotlib.

Python
import statsmodels.api as sm     # OLS model and add_constant
import pandas as pd              # data loading
import matplotlib.pyplot as plt  # plotting
import numpy as np               # numeric helpers

Step 2: Load and Prepare the Data

We load the dataset from a CSV file using pandas. You can download the dataset from here. The dataset contains two columns:

  • x: Independent variable (predictor).
  • y: Dependent variable (response).

Python
data = pd.read_csv('train.csv')

x = data['x'].tolist()  # predictor values
y = data['y'].tolist()  # response values
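Note that real-world CSVs sometimes contain missing values, and sm.OLS does not drop them by default, so they will propagate into the fit. If that applies to your file, dropping incomplete rows before extracting the columns is a safe precaution; this is a hedged sketch and is harmless if the data is already clean.

Python
data = data.dropna()    # remove rows with missing x or y, if any

x = data['x'].tolist()  # re-extract the cleaned columns
y = data['y'].tolist()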

Step 3: Add a Constant Term

In linear regression, the equation includes an intercept term (b_0). To include this term in the model, we use the add_constant() function from statsmodels.

Python
x = sm.add_constant(x)
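To see what add_constant() actually does, here is a tiny illustration with toy values (not the dataset): it prepends a column of ones, which acts as the regressor for the intercept b_0.

Python
demo = sm.add_constant([1.0, 2.0, 3.0])
print(demo)
# [[1. 1.]
#  [1. 2.]
#  [1. 3.]]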

Step 4: Perform OLS Regression

Now we fit the OLS regression model using the OLS() function. This function takes the dependent variable (y) and the independent variables (x, now including the constant column) as inputs.

Python
result = sm.OLS(y, x).fit()

print(result.summary())

Output:

[Output: OLS regression results summary table]
  • The output shows that the regression model fits the data very well, with an R-squared of 0.989.
  • The independent variable x1 is highly significant (p < 0.001) and has a strong positive effect on the target variable.
  • The intercept (const) is not statistically significant (p = 0.200), meaning it may not contribute meaningfully.
  • Residuals are approximately normally distributed, as indicated by the Omnibus and Jarque-Bera test p-values (> 0.05).
  • The Durbin-Watson statistic is ~2, indicating no autocorrelation in the residuals.
  • The overall model is statistically significant, with a very high F-statistic and a near-zero p-value.
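Beyond the printed summary, individual statistics can be read directly from the fitted results object, and the model can generate predictions for new inputs. The attributes and predict() method below are part of the standard statsmodels results API; the new x values are hypothetical, for illustration only.

Python
print(result.params)    # estimated coefficients: [b0, b1] (constant first)
print(result.rsquared)  # coefficient of determination
print(result.pvalues)   # coefficient p-values

new_x = sm.add_constant(np.array([10.0, 20.0, 30.0]))  # hypothetical new x values
print(result.predict(new_x))                           # predicted y for each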

Step 5: Visualize the Regression Line

To better understand the relationship between x and y, we plot the original data points and the fitted regression line.

Python
plt.scatter(data['x'], data['y'], color='blue', label='Data Points')

x_range = np.linspace(data['x'].min(), data['x'].max(), 100)  # evenly spaced x values
y_pred = result.params[0] + result.params[1] * x_range        # intercept + slope * x

plt.plot(x_range, y_pred, color='red', label='Regression Line')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (Y)')
plt.title('OLS Regression Fit')
plt.legend()
plt.show()

Output:

[Figure: Regression Line]

The above plot shows a strong linear relationship between the independent variable (X) and the dependent variable (Y). The blue dots represent the actual data points, which are closely aligned with the red regression line, indicating a good model fit.
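Rather than recomputing predictions from the coefficients, the regression line can also be drawn from result.fittedvalues, which statsmodels computes during fitting. A short equivalent sketch, sorting by x so the line segments connect in order:

Python
order = np.argsort(data['x'].to_numpy())  # indices that sort the points by x

plt.scatter(data['x'], data['y'], color='blue', label='Data Points')
plt.plot(data['x'].to_numpy()[order], result.fittedvalues[order],
         color='red', label='Regression Line')
plt.legend()
plt.show()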

