How to Use Repeated Random Training/Test Splits Inside train() Function in R
When building machine learning models, it's crucial to have reliable estimates of model performance. One effective way to achieve this is by using repeated random training/test splits, also known as repeated holdout validation. The caret package in R Programming Language provides a convenient function called train() to facilitate this process.
This article will guide you through the steps to use repeated random training/test splits inside the train() function, with a practical example.
Introduction to train()
The train() function from the caret package is a powerful tool for training and tuning machine learning models. It supports various resampling methods, including repeated holdout validation, k-fold cross-validation, and bootstrapping. This flexibility allows you to select the most appropriate resampling technique for your specific needs.
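As a quick reference, the sketch below (with illustrative settings, not recommendations) shows how the most common resampling schemes are requested through trainControl(), whose result is later passed to train() via the trControl argument.
R
# Common resampling schemes accepted by trainControl()
library(caret)

ctrl_cv    <- trainControl(method = "cv", number = 10)                       # 10-fold cross-validation
ctrl_rcv   <- trainControl(method = "repeatedcv", number = 10, repeats = 5)  # repeated 10-fold cross-validation
ctrl_lgocv <- trainControl(method = "LGOCV", p = 0.8, number = 25)           # 25 repeated random 80/20 splits
ctrl_boot  <- trainControl(method = "boot", number = 25)                     # 25 bootstrap resamples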
Repeated Random Training and Test Splits
Repeated random training/test splits, also known as repeated holdout validation or Monte Carlo cross-validation, involve splitting the dataset into training and test sets multiple times, training the model on each training set, and evaluating it on the corresponding test set. This approach provides a more robust estimate of model performance by reducing the variance associated with a single train/test split.
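To make the idea concrete, here is a minimal hand-rolled sketch of repeated random splits using caret's createDataPartition() on the built-in mtcars data (the dataset, model formula, and number of splits are illustrative choices):
R
# Hand-rolled repeated random train/test splits (Monte Carlo cross-validation)
library(caret)

set.seed(42)
# Draw 5 independent 80% training partitions of mtcars
splits <- createDataPartition(mtcars$mpg, p = 0.8, times = 5)

rmse_per_split <- sapply(splits, function(train_idx) {
  train_set <- mtcars[train_idx, ]
  test_set  <- mtcars[-train_idx, ]
  fit       <- lm(mpg ~ wt + hp, data = train_set)  # fit on the training portion
  preds     <- predict(fit, newdata = test_set)     # score on the held-out portion
  sqrt(mean((test_set$mpg - preds)^2))              # RMSE for this split
})

mean(rmse_per_split)  # performance estimate averaged over the repeated splits
The train() function automates exactly this loop, along with the bookkeeping needed to summarise the results.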
Implementation of Repeated Random Training and Test Splits in R
Consider a scenario where you have a dataset containing information about houses, including features like the number of bedrooms, bathrooms, and square footage, and you want to predict the house price. We'll use the train() function with repeated random training/test splits to evaluate a linear regression model.
Step 1: Load Necessary Packages
First, load the necessary packages, including caret and dplyr.
R
# Install and load necessary packages
install.packages("caret")
install.packages("dplyr")
library(caret)
library(dplyr)
Step 2: Generate Example Dataset
Create a synthetic dataset with features such as the number of bedrooms, bathrooms, and square footage, as well as the corresponding house prices.
R
# Set seed for reproducibility
set.seed(123)
# Number of observations
n <- 100
# Generate synthetic data
bedrooms <- sample(1:5, n, replace = TRUE)
bathrooms <- sample(1:3, n, replace = TRUE)
sq_footage <- rnorm(n, mean = 2000, sd = 500)
price <- 100000 + 50000 * bedrooms + 30000 * bathrooms + 100 * sq_footage +
rnorm(n, mean = 0, sd = 50000)
# Create dataset
house_data <- data.frame(Bedrooms = bedrooms, Bathrooms = bathrooms,
SqFootage = sq_footage, Price = price)
head(house_data)
Output:
Bedrooms Bathrooms SqFootage Price
1 3 1 2281.495 562189.5
2 3 3 1813.781 552915.8
3 2 1 2488.487 473166.7
4 2 3 1812.710 394625.9
5 3 2 2526.356 536579.7
6 5 2 1475.411 533047.6
Step 3: Define the Train Control Parameters
Specify the resampling scheme with trainControl(), including the resampling method, the number of folds, and the number of repeats. This object is passed to train() in the next step.
R
# Define the train control parameters
train_control <- trainControl(
  method = "repeatedcv",  # Repeated k-fold cross-validation
  number = 10,            # Number of folds
  repeats = 5,            # Number of repeats
  verboseIter = TRUE      # Print training log
)
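Note that method = "repeatedcv" performs repeated k-fold cross-validation. If you want literal repeated random training/test splits (leave-group-out or Monte Carlo cross-validation), caret exposes this as method = "LGOCV"; the sketch below uses illustrative settings, and the resulting object can be passed to train() via trControl in exactly the same way as in Step 4.
R
# Alternative: repeated random 80/20 training/test splits (Monte Carlo CV)
train_control_lgocv <- trainControl(
  method = "LGOCV",    # Leave-group-out CV = repeated random splits
  p = 0.8,             # Proportion of data used for training in each split
  number = 25,         # Number of random splits
  verboseIter = TRUE   # Print training log
)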
Step 4: Train the Model
Use the train() function to train the model with repeated random training/test splits.
R
# Train the model using the train() function
set.seed(123)
fit <- train(
  Price ~ .,                  # Model formula
  data = house_data,          # Dataset
  method = "lm",              # Linear regression model
  trControl = train_control,  # Train control parameters
  metric = "RMSE"             # Performance metric
)
Output:
+ Fold01.Rep1: intercept=TRUE
- Fold01.Rep1: intercept=TRUE
+ Fold02.Rep1: intercept=TRUE
- Fold02.Rep1: intercept=TRUE
+ Fold03.Rep1: intercept=TRUE
- Fold03.Rep1: intercept=TRUE
+ Fold04.Rep1: intercept=TRUE
- Fold04.Rep1: intercept=TRUE
+ Fold05.Rep1: intercept=TRUE
Step 5: Evaluate the Model
Review the model performance metrics, including the RMSE and MAE, to evaluate how well the model performs across different train/test splits.
R
# Print model summary
print(fit)
# Get model performance metrics
results <- fit$results
print(results)
Output:
Linear Regression
100 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 88, 89, 90, 92, 90, 92, ...
Resampling results:
RMSE Rsquared MAE
50347.17 0.7764902 40486.79
Tuning parameter 'intercept' was held constant at a value of TRUE
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 50347.17 0.7764902 40486.79 10739.16 0.1106781 10816.58
The output of the train() function includes a summary of the model and the resampled performance metrics. The results object holds the metrics averaged across all resamples (RMSE, Rsquared, MAE) together with their standard deviations (RMSESD, RsquaredSD, MAESD); the performance of each individual resample is stored in fit$resample.
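If you want to see how performance varies across the individual resamples rather than just the averages, the per-resample values kept in fit$resample can be inspected directly, as in this short sketch:
R
# Performance of each individual resample (one row per fold/repeat)
head(fit$resample)

# Spread of RMSE across the resamples
summary(fit$resample$RMSE)
sd(fit$resample$RMSE)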
Conclusion
Using repeated random training/test splits inside the train() function in R allows for robust performance estimation of machine learning models. This approach reduces the variance associated with a single train/test split and provides more reliable insights into model performance. By following the step-by-step guide provided in this article, you can effectively use repeated holdout validation in your machine learning projects, ensuring accurate and trustworthy model evaluations.