Ordinary Least Squares (OLS) Regression in R
Last Updated: 28 May, 2024
Ordinary Least Squares (OLS) Regression allows researchers to understand the impact of independent variables on the dependent variable and make predictions based on the model.
Ordinary Least Squares (OLS) regression is a powerful statistical method used to analyze the relationship between one or more independent variables and a dependent variable. It's a cornerstone of regression analysis and is widely utilized across various disciplines, including economics, social sciences, finance, and more. OLS regression aims to find the best-fitting line (or hyperplane in multiple dimensions) through a set of data points, minimizing the sum of squared differences between observed and predicted values.
How is OLS Regression different from other regression algorithms?
Ordinary Least Squares (OLS) regression is a specific type of regression algorithm that differs from other regression algorithms in several ways:
| Aspect | OLS Regression | Other Regression Algorithms |
|---|---|---|
| Linear relationship | OLS regression assumes a linear relationship between the independent and dependent variables. | Other algorithms can capture non-linear relationships by including higher-order terms or using non-linear functions. |
| Minimization objective | OLS regression minimizes the sum of squared differences between observed and predicted values of the dependent variable. | Other algorithms may use different optimization objectives, such as minimizing absolute errors (as in least absolute deviations regression), minimizing a penalized sum of squared errors (as in Lasso and Ridge regression), or maximizing the likelihood (as in logistic regression). |
| Assumptions | OLS regression relies on several assumptions, including linearity, homoscedasticity, independence of errors, and normality of errors. | Other regression algorithms may have different sets of assumptions or may be more robust to violations of these assumptions. |
| Interpretability | OLS regression provides easily interpretable coefficients that represent the effect of each independent variable on the dependent variable. | Other algorithms, such as decision trees or neural networks, may produce less interpretable models with complex structures. |
| Complexity | OLS regression is relatively simple and computationally efficient, making it suitable for small to moderately sized datasets with a limited number of predictors. | Other algorithms may be more complex and computationally intensive, allowing for more flexibility and scalability but requiring larger datasets and more computational resources. |
Mathematically, the OLS estimation formula can be represented as:
Given a dataset with n observations and p independent variables, denoted by X_1, X_2, ..., X_p, and a dependent variable Y, the OLS estimation formula for the coefficient vector \beta is:
\hat{\beta} = (X^T X)^{-1} X^T Y
Where:
- \hat{\beta} is the vector of estimated coefficients.
- X is the design matrix containing the independent variables (with dimensions n × (p + 1), including a column of ones for the intercept).
- Y is the vector of observed values of the dependent variable (with dimensions n × 1).
- X^T represents the transpose of the matrix X.
- (X^T X)^{-1} denotes the inverse of the matrix X^T X.
This formula yields the coefficient estimates \hat{\beta} that minimize the sum of squared differences between the observed values of the dependent variable and the values predicted by the regression model.
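To make the estimation formula concrete, here is a minimal sketch on simulated data (the variable names and sample size are illustrative) that computes \hat{\beta} by matrix algebra and checks the result against R's built-in lm():
R
# Simulate a small dataset: one predictor plus Gaussian noise
set.seed(42)
n <- 100
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)

# Design matrix X with a column of ones for the intercept
X <- cbind(1, x)

# OLS estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

# The same estimates from lm()
coef(lm(y ~ x))
Both approaches should agree (intercept near 2, slope near 3). In practice, solve(crossprod(X), crossprod(X, y)) is numerically preferable to forming the inverse explicitly.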
The following step-by-step example shows how to perform OLS regression in R.
Step 1. Install and load the required libraries
We load the library required for this walkthrough. Only ggplot2 is needed (for the visualization step); the dataset itself is read with base R's read.csv(), so no additional package is required for loading the data.
R
# Load the necessary library for plotting
library(ggplot2)
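If ggplot2 is not already installed on your system, install it once before loading:
R
# One-time installation (skip if the package is already available)
install.packages("ggplot2")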
Step 2. Load the Dataset
Here we perform Ordinary Least Squares regression on the Weather History dataset to analyze the relationship between temperature and humidity. The model in the next step is fitted on the full file; a sketch for drawing a random subsample follows the preview output below.
Link: WeatherHistory
R
# Load the dataset; note that base R's read.csv() converts a column
# header such as "Temperature (C)" into the syntactic name "Temperature..C."
weather_data <- read.csv("E:/weatherHistory.csv")
head(weather_data)
Output:
Formatted.Date Summary Precip.Type Temperature..C.
1 2006-04-01 00:00:00.000 +0200 Partly Cloudy rain 9.472222
2 2006-04-01 01:00:00.000 +0200 Partly Cloudy rain 9.355556
3 2006-04-01 02:00:00.000 +0200 Mostly Cloudy rain 9.377778
4 2006-04-01 03:00:00.000 +0200 Partly Cloudy rain 8.288889
5 2006-04-01 04:00:00.000 +0200 Mostly Cloudy rain 8.755556
6 2006-04-01 05:00:00.000 +0200 Partly Cloudy rain 9.222222
Apparent.Temperature..C. Humidity Wind.Speed..km.h. Wind.Bearing..degrees.
1 7.388889 0.89 14.1197 251
2 7.227778 0.86 14.2646 259
3 9.377778 0.89 3.9284 204
4 5.944444 0.83 14.1036 269
5 6.977778 0.83 11.0446 259
6 7.111111 0.85 13.9587 258
Visibility..km. Loud.Cover Pressure..millibars. Daily.Summary
1 15.8263 0 1015.13 Partly cloudy throughout the day.
2 15.8263 0 1015.63 Partly cloudy throughout the day.
3 14.9569 0 1015.94 Partly cloudy throughout the day.
4 15.8263 0 1016.41 Partly cloudy throughout the day.
5 15.8263 0 1016.51 Partly cloudy throughout the day.
6 14.9569 0 1016.66 Partly cloudy throughout the day.
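If you prefer to explore a smaller random subset before fitting the model (the regression output below was produced on the full dataset), base R's sample() can draw one reproducibly. A minimal sketch; the sample size of 1,000 is illustrative:
R
# Draw a reproducible random sample of 1,000 rows (size is illustrative)
set.seed(123)
weather_sample <- weather_data[sample(nrow(weather_data), 1000), ]
head(weather_sample)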
Step 3. Perform OLS Regression
Now we create a model and perform Ordinary Least Squares (OLS) Regression in R Programming language.
R
# Fit the OLS model with Temperature..C. as the dependent variable
# and Humidity as the predictor
model <- lm(Temperature..C. ~ Humidity, data = weather_data)
# Summary of the regression model
summary(model)
Output:
Call:
lm(formula = Temperature..C. ~ Humidity, data = weather_data)
Residuals:
Min 1Q Median 3Q Max
-52.415 -5.091 0.378 5.741 18.804
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.6369 0.0927 373.7 <2e-16 ***
Humidity -30.8944 0.1219 -253.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.4 on 96451 degrees of freedom
Multiple R-squared: 0.3997, Adjusted R-squared: 0.3997
F-statistic: 6.423e+04 on 1 and 96451 DF, p-value: < 2.2e-16
In this linear model, Humidity has a strong negative relationship with Temperature..C.. The estimated coefficient for Humidity is -30.8944; since humidity is recorded on a 0 to 1 scale, a 0.1 increase in humidity is associated with a temperature drop of about 3.09 °C. The model explains approximately 39.97% of the variance in temperature, and both coefficients are highly statistically significant (p < 2e-16).
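Beyond summary(), the fitted lm object can be queried directly for coefficients, confidence intervals, and predictions. A minimal sketch; the humidity values passed to predict() are illustrative:
R
# Estimated coefficients
coef(model)

# 95% confidence intervals for the coefficients
confint(model)

# Predicted temperature at a few illustrative humidity levels
new_data <- data.frame(Humidity = c(0.3, 0.6, 0.9))
predict(model, newdata = new_data)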
Step 4. Visualize the OLS model
Now we will visualize the Ordinary Least Squares (OLS) Regression model.
R
# Scatter plot of temperature against humidity with the fitted regression line
ggplot(data = weather_data, aes(x = Humidity, y = Temperature..C.)) +
  geom_point() +                                 # Observed data points
  geom_smooth(method = "lm", se = FALSE) +       # Fitted OLS regression line
  labs(x = "Humidity", y = "Temperature (C)") +  # Axis labels
  ggtitle("OLS Regression of Temperature vs Humidity")  # Plot title
Output:
The plot shows the linear relationship between Humidity and Temperature..C.. The regression line's negative slope visually confirms that higher humidity levels are associated with lower temperatures, consistent with the model coefficients. This visualization helps in understanding the strength and direction of the relationship between the two variables.
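Because OLS relies on the assumptions listed earlier (linearity, homoscedasticity, independence of errors, and normality of errors), it is good practice to inspect the standard diagnostic plots that R produces for any lm object:
R
# Four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # Reset the plotting layout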
Importance of OLS Regression in Data Analysis
- Foundation of Statistical Modeling: OLS regression is a fundamental technique that forms the basis for many other statistical methods and machine learning algorithms.
- Simplicity and Interpretability: The method is straightforward to implement and the results are easy to interpret. Coefficients provide clear insights into the relationships between independent and dependent variables.
- Diagnostic Insights: OLS regression provides valuable diagnostic statistics, such as standard errors, t-values, p-values, and R-squared, which help assess model performance and predictor significance (see the sketch after this list).
- Versatility: It can be applied to various fields, including economics, finance, healthcare, and social sciences, to understand and predict outcomes based on input data.
- Model Validation: The technique helps validate hypotheses and informs decision-making by quantifying the strength and direction of relationships between variables.
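As noted in the diagnostics point above, these statistics can also be extracted programmatically from the model summary. A minimal sketch using the model fitted earlier:
R
model_summary <- summary(model)

# Coefficient table: estimates, standard errors, t-values, p-values
model_summary$coefficients

# R-squared and adjusted R-squared
model_summary$r.squared
model_summary$adj.r.squared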
Conclusion
In conclusion, Ordinary Least Squares (OLS) regression is a fundamental technique in statistics and machine learning for modeling relationships between variables. Despite its simplicity, users must be aware of its assumptions and of challenges such as non-linearity and multicollinearity. While OLS is valuable for its interpretability, it is important to supplement it with more advanced methods when dealing with complex datasets or non-linear relationships.