Cross validation in R without caret package
Cross-validation is a technique for evaluating the performance of a machine learning model by training it on a subset of the data and evaluating it on the remaining data. It is a useful method for estimating the performance of a model when you don't have a separate test set, or when you want to get a better estimate of the model's performance than a single train/test split would provide.
There are several ways to perform cross-validation in R:
- Manually splitting the data into folds: You can write your own code with base R functions such as sample() and split() to divide the row indices into the desired number of folds. You can then loop through the folds, using each one as the test set and the remaining folds as the training set, and evaluate the model on each test set. This is the approach used in the example below.
- Using the train function from the caret package: The caret package provides a convenient function, train, that can be used to train and evaluate machine learning models using a variety of cross-validation methods. It also provides a number of functions for generating the folds, such as createFolds and createMultiFolds.
- Using the cv.glm function from the boot package: The boot package provides a function, cv.glm, that can be used to perform k-fold (or leave-one-out) cross-validation for generalized linear models. It allows you to specify the number of folds K and a custom cost function for measuring prediction error (see the short sketch after this list).
- Using the cv.lm function from the DAAG package: Similar in spirit to cv.glm, the DAAG package provides a function, cv.lm, that can be used to perform k-fold cross-validation for linear models.
- Using the resample function from the mlr package: The mlr package provides a general resample function that, together with a cross-validation resampling description created by makeResampleDesc("CV", iters = k), performs k-fold cross-validation for a wide variety of machine learning models, with optional stratified sampling.
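As a quick illustration of the boot approach, the following is a minimal sketch of 10-fold cross-validation with cv.glm; the mtcars data and the model formula are arbitrary choices made for this example.
R
# Minimal sketch: 10-fold cross-validation with boot::cv.glm
library(boot)

# Fit a logistic regression on the full data
# (mtcars and this formula are only illustrative choices)
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)

# K = 10 requests 10-fold cross-validation; delta contains the raw
# and bias-corrected estimates of the prediction error
cv_result <- cv.glm(mtcars, fit, K = 10)
cv_result$delta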
Cross-validation in R without caret package
There are several ways to perform cross-validation on datasets in the R programming language. In this example, we will use base R to split the data into folds and the e1071 library to train the model. Load the tidyverse and e1071 libraries. The tidyverse library is not necessary for cross-validation and is loaded here only for general data handling; the e1071 library is used to train the SVM model.
R
# Load the required libraries
library(tidyverse)
library(e1071)
Split the data into 10 folds. The createFolds function that is often used for this step belongs to the caret package, which this example avoids, so instead we shuffle the row indices with sample() and divide them into 10 groups with split(). The result is a list of vectors, where each vector contains the indices of the rows that belong to one fold. Also initialize an empty vector to store the results of the cross-validation.
R
# Shuffle the row indices and split them into 10 folds
set.seed(123)  # for a reproducible fold assignment
folds <- split(sample(seq_len(nrow(iris))),
               rep(1:10, length.out = nrow(iris)))

# Initialize a vector to store the results
results <- c()
Loop through the folds. In each iteration, split the data into training and test sets using the indices in the current fold, train the SVM model on the training data with the svm function from the e1071 library, and make predictions on the test data with the predict function. Calculate the accuracy by comparing the predictions to the true labels and taking the mean, then append the accuracy to the results vector.
R
# Loop through the folds
for (i in 1:10) {
  # Split the data into training and test sets
  train <- iris[-folds[[i]], ]
  test  <- iris[folds[[i]], ]

  # Train the model on the training data
  model <- svm(Species ~ ., data = train)

  # Make predictions on the test data
  predictions <- predict(model, test)

  # Calculate the accuracy
  accuracy <- mean(predictions == test$Species)

  # Store the result
  results <- c(results, accuracy)
}
After the loop finishes, calculate the mean of the results using the mean() function.
R
# Calculate the mean of the results
mean(results)
Output:
0.933333333333333
The output is the mean accuracy across the 10 folds. Here the mean accuracy is roughly 0.93, which indicates that the model correctly classified about 93% of the test samples on average across all folds; the exact value will vary slightly with the random fold assignment. To see how much the accuracy varies between folds, you can additionally compute sd(results).
This code splits the data into 10 folds with base R (caret's createFolds would do the same job, but the goal here is to avoid caret), then trains a support vector machine (SVM) model for each fold, using the remaining nine folds as the training data, and evaluates the model on the fold held out as the test data. Finally, it averages the accuracy scores obtained on the folds to get an estimate of the model's performance.
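The simple split above assigns rows to folds completely at random. If you want each fold to preserve the class proportions of Species (the stratified behaviour that caret's createFolds provides), the assignment can still be done without caret. The following is a minimal sketch assuming the same iris data; the seed and the fold_id and stratified_folds names are just illustrative.
R
# Minimal sketch: stratified 10-fold assignment without caret
set.seed(42)

# Assign a fold number to every row, sampling separately within each
# Species level so class proportions stay balanced across the folds
fold_id <- integer(nrow(iris))
for (sp in levels(iris$Species)) {
  idx <- which(iris$Species == sp)
  fold_id[idx] <- sample(rep(1:10, length.out = length(idx)))
}

# Same list-of-row-indices format used in the loop above
stratified_folds <- split(seq_len(nrow(iris)), fold_id)
Replacing folds with stratified_folds in the loop above then gives a stratified 10-fold cross-validation.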