Lasso Regression in R Programming
Last Updated: 23 May, 2024
Lasso regression is a regression algorithm that uses shrinkage to produce simple, sparse models (i.e., models with fewer parameters). In shrinkage, coefficient estimates are shrunk towards a central point, typically zero. Lasso regression is a regularized regression algorithm that performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients.
“LASSO” stands for Least Absolute Shrinkage and Selection Operator. Lasso regression works well for models showing high levels of multicollinearity, or when you want to automate parts of model selection, i.e., variable selection or parameter elimination. Lasso regression solutions are quadratic programming problems that can be solved with software such as R or MATLAB, and the method has the built-in ability to select predictors.
\[
\text{minimize} \left( \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right)
\]
Here:
- N: the number of observations.
- p: the number of predictors.
- yi: the response variable for the i-th observation.
- xij: the value of the j-th predictor for the i-th observation.
- β0: the intercept term.
- βj: the coefficients of the predictors.
- λ: the regularization parameter controlling the strength of the L1 penalty term.
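With these symbols defined, the penalized objective can be evaluated directly. The short sketch below (a hypothetical helper on simulated data, not part of the Big Mart example) simply computes the quantity being minimized:
R
# Sketch: evaluate the lasso objective above for a given coefficient vector:
# (1 / 2N) * residual sum of squares + lambda * L1-norm of the coefficients
lasso_objective <- function(y, X, beta0, beta, lambda) {
  residuals <- y - beta0 - X %*% beta
  sum(residuals^2) / (2 * length(y)) + lambda * sum(abs(beta))
}

# Example on simulated data (N = 100 observations, p = 3 predictors)
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)
y <- 2 + X %*% c(1.5, 0, -0.5) + rnorm(100)
lasso_objective(y, X, beta0 = 2, beta = c(1.5, 0, -0.5), lambda = 0.1)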
The algorithm minimizes the residual sum of squares subject to an L1 constraint on the coefficients. Some of the β coefficients are shrunk exactly to zero, which yields a sparse regression model. The tuning parameter lambda controls the strength of the L1 regularization penalty, i.e., the amount of shrinkage (a small glmnet illustration follows the list below):
- When lambda = 0, no parameters are eliminated and the fit is equivalent to ordinary least squares.
- As lambda increases, more and more coefficients are set to zero and eliminated, and bias increases.
- When lambda = infinity, all coefficients are eliminated.
- As lambda decreases, variance increases.
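This shrinkage behaviour can be seen directly with glmnet. The minimal sketch below uses simulated data (illustrative only): at a larger lambda the coefficients are shrunk harder and more of them are exactly zero.
R
library(glmnet)

# Simulated data: 5 candidate predictors, only two truly matter
set.seed(42)
x <- matrix(rnorm(200 * 5), ncol = 5)
y <- 3 + 2 * x[, 1] - x[, 2] + rnorm(200)

fit <- glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso penalty

coef(fit, s = 0.01)              # small lambda: little shrinkage
coef(fit, s = 0.5)               # larger lambda: more coefficients set to zero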
Also, if an intercept is included in the model, it is left unpenalized. Now let's implement Lasso regression in the R programming language.
Implementation in R
The Dataset
The Big Mart dataset consists of 1559 products across 10 stores in different cities, with certain attributes of each product and store defined. It consists of 12 features:
- Item_Identifier: a unique product ID assigned to every distinct item.
- Item_Weight: the weight of the product.
- Item_Fat_Content: describes whether the product is low fat or not.
- Item_Visibility: the percentage of the total display area of all products in a store allocated to the particular product.
- Item_Type: the food category to which the item belongs.
- Item_MRP: Maximum Retail Price (list price) of the product.
- Outlet_Identifier: a unique store ID, an alphanumeric string of length 6.
- Outlet_Establishment_Year: the year in which the store was established.
- Outlet_Size: the size of the store in terms of ground area covered.
- Outlet_Location_Type: the size of the city in which the store is located.
- Outlet_Type: whether the outlet is just a grocery store or some sort of supermarket.
- Item_Outlet_Sales: sales of the product in the particular store.
Load the required packages
R
library(data.table)   # fast file reading (fread) and data manipulation
library(dplyr)        # data manipulation verbs
library(glmnet)       # lasso, ridge and elastic-net models
library(ggplot2)      # plotting
library(caret)        # model training, tuning and preprocessing
library(xgboost)      # gradient boosting (loaded alongside the other modelling packages)
library(e1071)        # skewness() and other statistics utilities
library(cowplot)      # arranging ggplot figures
After loading all the required libraries, we load our dataset.
Loading Dataset
R
train = fread("C:\\Users\\GFG19565\\Downloads\\Train.csv")
test = fread("C:\\Users\\GFG19565\\Downloads\\Test.csv")
str(train)
Output:
Classes ‘data.table’ and 'data.frame': 8523 obs. of 12 variables:
$ Item_Identifier : chr "FDA15" "DRC01" "FDN15" "FDX07" ...
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : chr "Low Fat" "Regular" "Low Fat" "Regular" ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
$ Item_Type : chr "Dairy" "Soft Drinks" "Meat" "Fruits and Vegetables" ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Identifier : chr "OUT049" "OUT018" "OUT049" "OUT010" ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
$ Outlet_Size : chr "Medium" "Medium" "Medium" "" ...
$ Outlet_Location_Type : chr "Tier 1" "Tier 3" "Tier 1" "Tier 3" ...
$ Outlet_Type : chr "Supermarket Type1" "Supermarket Type2" "Supermarket Type1" "Grocery Store"
$ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
- attr(*, ".internal.selfref")=<externalptr>
Performing EDA and Basic operations on dataset
R
# Add the target column to the test data so train and test can be combined
test[, Item_Outlet_Sales := NA]
combi = rbind(train, test)

# Impute missing Item_Weight values with the mean weight of the same item
missing_index = which(is.na(combi$Item_Weight))
for (i in missing_index) {
  item = combi$Item_Identifier[i]
  combi$Item_Weight[i] = mean(combi$Item_Weight[combi$Item_Identifier == item],
                              na.rm = TRUE)
}

# Count remaining missing values per column
colSums(is.na(combi))
Output:
Item_Identifier Item_Outlet_Sales
0 5681
Item_Weight Item_Fat_Contentlow fat
0 0
Item_Fat_ContentLow Fat Item_Fat_Contentreg
0 0
Item_Fat_ContentRegular Item_Visibility
0 0
Item_MRP Outlet_IdentifierOUT013
0 0
Outlet_IdentifierOUT017 Outlet_IdentifierOUT018
0 0
Outlet_IdentifierOUT019 Outlet_IdentifierOUT027
0 0
Outlet_IdentifierOUT035 Outlet_IdentifierOUT045
0 0
Outlet_IdentifierOUT046 Outlet_IdentifierOUT049
0 0
Outlet_TypeSupermarket Type1 Outlet_TypeSupermarket Type2
0 0
Outlet_TypeSupermarket Type3 Outlet_Size_num
0 0
Outlet_Location_Type_num
0
Here we combine the train and test data and then check the column-wise missing values in the combined dataset.
Replacing 0 in Item_Visibility with mean
R
# Replace zero Item_Visibility values with the mean visibility of the same item
zero_index = which(combi$Item_Visibility == 0)
for (i in zero_index) {
  item = combi$Item_Identifier[i]
  combi$Item_Visibility[i] = mean(combi$Item_Visibility[combi$Item_Identifier == item],
                                  na.rm = TRUE)
}
This code loops over the rows where “Item_Visibility” is zero and replaces each zero with the mean visibility of the same “Item_Identifier” in the combined data.
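An equivalent vectorized version using the base ave function is sketched below. It is not part of the original code; it performs the same replacement on the combi table, using the mean of the non-zero visibilities per item.
R
# Sketch: vectorized replacement of zero Item_Visibility values with the mean
# of the non-zero visibilities of the same Item_Identifier
combi[, Item_Visibility := ifelse(Item_Visibility == 0,
                                  ave(Item_Visibility, Item_Identifier,
                                      FUN = function(v) mean(v[v > 0], na.rm = TRUE)),
                                  Item_Visibility)]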
Perform Encoding
R
# Label-encode the ordinal variables Outlet_Size and Outlet_Location_Type
combi[, Outlet_Size_num := ifelse(Outlet_Size == "Small", 0,
                           ifelse(Outlet_Size == "Medium", 1, 2))]
combi[, Outlet_Location_Type_num := ifelse(Outlet_Location_Type == "Tier 3", 0,
                                    ifelse(Outlet_Location_Type == "Tier 2", 1, 2))]
combi[, c("Outlet_Size", "Outlet_Location_Type") := NULL]

# One-hot encode the remaining categorical variables
ohe_1 = dummyVars("~.", data = combi[, -c("Item_Identifier",
                                          "Outlet_Establishment_Year",
                                          "Item_Type")], fullRank = T)
ohe_df = data.table(predict(ohe_1, combi[, -c("Item_Identifier",
                                              "Outlet_Establishment_Year",
                                              "Item_Type")]))
combi = cbind(combi[, "Item_Identifier"], ohe_df)

head(combi)
Output:
Item_Identifier Item_Weight Item_Fat_Contentlow fat
1: FDA15 9.300 0
2: DRC01 5.920 0
3: FDN15 17.500 0
4: FDX07 19.200 0
5: NCD19 8.930 0
6: FDP36 10.395 0
Item_Fat_ContentLow Fat Item_Fat_Contentreg Item_Fat_ContentRegular
1: 1 0 0
2: 0 0 1
3: 1 0 0
4: 0 0 1
5: 1 0 0
6: 0 0 1
Item_Visibility Item_MRP Outlet_IdentifierOUT013
1: 0.016047301 249.8092 0
2: 0.019278216 48.2692 0
3: 0.016760075 141.6180 0
4: 0.017834338 182.0950 0
5: 0.009779829 53.8614 1
6: 0.057058689 51.4008 0
Outlet_IdentifierOUT017 Outlet_IdentifierOUT018
1: 0 0
2: 0 1
3: 0 0
4: 0 0
5: 0 0
6: 0 1
Outlet_IdentifierOUT019 Outlet_IdentifierOUT027
1: 0 0
2: 0 0
3: 0 0
4: 0 0
5: 0 0
6: 0 0
Outlet_IdentifierOUT035 Outlet_IdentifierOUT045
1: 0 0
2: 0 0
3: 0 0
4: 0 0
5: 0 0
6: 0 0
Outlet_IdentifierOUT046 Outlet_IdentifierOUT049
1: 0 1
2: 0 0
3: 0 1
4: 0 0
5: 0 0
6: 0 0
Outlet_TypeSupermarket Type1 Outlet_TypeSupermarket Type2
1: 1 0
2: 0 1
3: 1 0
4: 0 0
5: 1 0
6: 0 1
Outlet_TypeSupermarket Type3 Item_Outlet_Sales Outlet_Size_num
1: 0 3735.1380 1
2: 0 443.4228 1
3: 0 2097.2700 1
4: 0 732.3800 2
5: 0 994.7052 2
6: 0 556.6088 1
Outlet_Location_Type_num
1: 2
2: 0
3: 2
4: 0
5: 0
6: 0
This process helps convert categorical information about outlet size and location into a format that machine learning models can understand, where numbers represent different categories. The new numerical variables, “Outlet_Size_num” and “Outlet_Location_Type_num,” are then used in the subsequent analysis.
- It transforms the categorical variables in the dataset, combi, into a format suitable for machine learning by creating binary columns for each category, a process called one-hot encoding.
- The transformation is applied to all columns except “Item_Identifier,” “Outlet_Establishment_Year,” and “Item_Type.” The resulting dataset, ohe_df, contains these newly created binary columns. Finally, the original “Item_Identifier” is combined with the one-hot encoded variables to produce an extended dataset for machine learning analysis.
R
# Check skewness of the numeric features. (price_per_unit_wt is an engineered
# feature used in some versions of this workflow; it is not created in the
# code above, so this line assumes it exists.)
skewness(combi$Item_Visibility)
skewness(combi$price_per_unit_wt)

# Log-transform Item_Visibility to reduce its skewness
combi[, Item_Visibility := log(Item_Visibility + 1)]

# Separate the numeric predictors (excluding the target) for preprocessing
num_vars = which(sapply(combi, is.numeric))
num_vars_names = names(num_vars)
combi_numeric = combi[, setdiff(num_vars_names, "Item_Outlet_Sales"), with = F]

# Center and scale the numeric predictors
prep_num = preProcess(combi_numeric, method = c("center", "scale"))
combi_numeric_norm = predict(prep_num, combi_numeric)

# Replace the original numeric columns with their normalized versions
combi[, setdiff(num_vars_names, "Item_Outlet_Sales") := NULL]
combi = cbind(combi, combi_numeric_norm)

# Split back into training and test sets; drop the target from the test set
train = combi[1:nrow(train)]
test = combi[(nrow(train) + 1):nrow(combi)]
test[, Item_Outlet_Sales := NULL]
In this code we address skewness in the numeric variables of the dataset combi. It computes the skewness of two numeric features, “Item_Visibility” and “price_per_unit_wt,” and applies a log transformation to “Item_Visibility” to reduce its skewness.
The data is then scaled and centered using z-score normalization. The numeric independent variables are separated for preprocessing, and the preProcess function is applied for centering and scaling. The transformed numeric features are integrated back into the dataset. Finally, the data is split into training and testing sets, and the target variable “Item_Outlet_Sales” is removed from the test set in preparation for modelling.
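As an optional sanity check (not part of the original pipeline), one can re-compute the skewness after the log transform and confirm that the scaled columns have mean close to 0 and standard deviation close to 1:
R
# Optional check (sketch): skewness after the log transform, and the scale of
# one normalized column
skewness(combi$Item_Visibility)
mean(combi_numeric_norm$Item_Weight)
sd(combi_numeric_norm$Item_Weight)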
Create Lasso Regression Model
R
set.seed(123)

# 5-fold cross-validation
control = trainControl(method = "cv", number = 5)

# alpha = 1 corresponds to the lasso penalty; tune lambda over a grid
Grid_la_reg = expand.grid(alpha = 1,
                          lambda = seq(0.001, 0.1, by = 0.0002))

# Train the lasso model with glmnet through caret
lasso_model = train(x = train[, -c("Item_Identifier", "Item_Outlet_Sales")],
                    y = train$Item_Outlet_Sales,
                    method = "glmnet",
                    trControl = control,
                    tuneGrid = Grid_la_reg)
lasso_model
Output:
glmnet
8523 samples
21 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 6819, 6819, 6819, 6817, 6818
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.0010 1129.244 0.5623967 837.116
0.0012 1129.244 0.5623967 837.116
0.0014 1129.244 0.5623967 837.116
0.0016 1129.244 0.5623967 837.116
0.0018 1129.244 0.5623967 837.116
0.0020 1129.244 0.5623967 837.116
0.0022 1129.244 0.5623967 837.116
0.0024 1129.244 0.5623967 837.116
0.0026 1129.244 0.5623967 837.116
0.0028 1129.244 0.5623967 837.116.........
Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.1.
These results come from applying the glmnet algorithm to a dataset with 8523 samples and 21 predictors. The model is tuned with 5-fold cross-validated resampling, which gives a more robust estimate of its performance.
The table reports the root mean squared error (RMSE), R-squared and mean absolute error (MAE) across different values of the regularization parameter lambda. The goal is to find the lambda value that minimizes prediction error. In the snippet shown, the results for lambda from 0.001 to 0.0028 are identical: the RMSE, R-squared and MAE do not change with lambda over this range, suggesting a stable fit for the glmnet model on this data.
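One quick way to inspect this behaviour is to plot the cross-validated RMSE against lambda; caret's plot method for train objects does this directly (an optional check):
R
# Cross-validated RMSE across the lambda grid
plot(lasso_model)

# The same information as a table
head(lasso_model$results[, c("lambda", "RMSE", "Rsquared", "MAE")])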
Calculate mean validation score and summary
R
# Structure of the underlying glmnet fit, and the mean cross-validated RMSE
summary(lasso_model$finalModel)
mean(lasso_model$resample$RMSE)
Output:
Length Class Mode
a0 68 -none- numeric
beta 1428 dgCMatrix S4
df 68 -none- numeric
dim 2 -none- numeric
lambda 68 -none- numeric
dev.ratio 68 -none- numeric
nulldev 1 -none- numeric
npasses 1 -none- numeric
jerr 1 -none- numeric
offset 1 -none- logical
call 5 -none- call
nobs 1 -none- numeric
lambdaOpt 1 -none- numeric
xNames 21 -none- character
problemType 1 -none- character
tuneValue 2 data.frame list
obsLevels 1 -none- logical
param 0 -none- list
mean(lasso_model$resample$RMSE)
[1] 1129.244
- The ‘a0’ is a numeric vector of length 68, representing the intercept term in the model.
- ‘beta’ is a sparse matrix (dgCMatrix class) with 1428 elements, indicating the coefficients for each predictor.
- ‘df’ is a numeric vector of length 68, denoting the degrees of freedom for each lambda value.
- ‘dim’ is a numeric vector of length 2, providing the dimensions of the ‘beta’ matrix.
- ‘lambda’ is a numeric vector of length 68, representing the regularization parameter values.
- ‘dev.ratio’ is a numeric vector of length 68, indicating the ratio of deviances for each lambda value.
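These components can also be pulled out of the fitted object directly, for example to see which lambda was selected and how many coefficients survive at that value (a short sketch):
R
# Lambda selected by cross-validation
lasso_model$bestTune$lambda
lasso_model$finalModel$lambdaOpt

# Number of non-zero coefficients (including the intercept) at that lambda
sum(as.matrix(coef(lasso_model$finalModel,
                   s = lasso_model$bestTune$lambda)) != 0)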
Visualize the Lasso Model
R
# Extract the coefficients at the optimal lambda chosen during tuning
lasso_coefs <- as.matrix(coef(lasso_model$finalModel,
                              s = lasso_model$bestTune$lambda))

# Organize the coefficients into a data frame (predictor names are row names)
lasso_coefs_df <- data.frame(
  Predictor = rownames(lasso_coefs),
  Coefficient = as.numeric(lasso_coefs)
)

# Keep only the predictors with non-zero coefficients
lasso_coefs_df <- lasso_coefs_df[lasso_coefs_df$Coefficient != 0, ]

library(ggplot2)

# Horizontal bar plot of the non-zero coefficients, ordered by magnitude
lasso_plot <- ggplot(lasso_coefs_df, aes(x = reorder(Predictor, -Coefficient),
                                         y = Coefficient)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  coord_flip() +
  labs(title = "Lasso Regression Coefficients",
       x = "Predictor",
       y = "Coefficient") +
  theme_minimal()

print(lasso_plot)
Output:
[Figure: bar plot of the non-zero lasso regression coefficients by predictor]
The lasso model coefficients are extracted with the coef function at the optimal lambda value determined during model tuning.
- The coefficients are organized into a data frame for plotting, including the predictor names and their corresponding coefficients.
- Coefficients that are precisely zero are filtered out, focusing on the relevant predictors with non-zero coefficients.
- The non-zero coefficients are presented in a horizontal bar plot, with the predictors reordered by the magnitude of their coefficients.
- The plot uses sky-blue bars, black borders, and a minimal theme for clarity.
The resulting plot provides a visual representation of the lasso regression coefficients, aiding in the interpretation of predictor importance in the model.
As the regularization parameter increases over the tuned range, the cross-validated RMSE remains essentially constant.
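As a possible next step (not shown in the original article), the tuned model can be used to generate sales predictions for the test set:
R
# Sketch: predict Item_Outlet_Sales for the test data with the tuned model
test_pred <- predict(lasso_model, newdata = test[, -c("Item_Identifier")])
head(test_pred)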