Machine Learning
using caret API
Week 14
Machine learning in R
• There is a LARGE assortment of ML packages in R: essentially one for
every possible learning method.
• Each package has its own unique ways of reading data in, outputting
results, and post-processing.
• This can make it difficult to implement different types of models
quickly.
• The caret package eases this process by providing a system of wrapper
functions that make it very easy to implement models.
• Caret is short for Classification And REgression Training.
Steps for Building a Machine
Learning Model
1. Data Preparation and Preprocessing
a) Split the dataset into training and validation
b) Descriptive statistics
c) Impute missing values
d) One-Hot Encoding (dummy variables)
e) Variable transformations
2. Feature Selection (Optional)
3. Training and Tuning the model
4. Ensemble the predictions
5. Performance Optimization
6. Evaluating performance of multiple machine learning algorithms
Let us start with a dataset
• We will use an Orange Juice Dataset from this link:
https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv
• The goal of this dataset is to predict which of the two brands of
orange juice the customer buys.
• The predictor variables are characteristics of the customer and the
product itself.
• The response variable is 'Purchase', which takes either the value
'CH' (Citrus Hill) or 'MM' (Minute Maid).
Let us start with a dataset
library(caret)
orange <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv')
str(orange)
head(orange[, 1:10])
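• Before any preprocessing, it is worth a quick look at how the two brands are distributed in the response:
# How balanced are the two brands?
table(orange$Purchase)
prop.table(table(orange$Purchase))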
Data Preparation and
Preprocessing
Split the dataset into training
and validation
• createDataPartition() takes as input the Y variable in the source
dataset and the percentage of data that should go into training as the p
argument. It returns the row numbers that should form the training
dataset.
• Plus, you need to set list=FALSE to prevent the result from being returned as a list.
• We will split the dataset into training (80%) and test (20%) datasets
using caret's createDataPartition() function.
Split the dataset into training
and validation
# Create the training and test datasets
set.seed(100)
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(orange$Purchase, p=0.8, list=FALSE)
# Step 2: Create the training dataset
trainData <- orange[trainRowNumbers,]
# Step 3: Create the test dataset
testData <- orange[-trainRowNumbers,]
# Store X and Y for later use.
x <- trainData[, 2:18]   # predictor columns
y <- trainData$Purchase  # response
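• createDataPartition() samples within the classes of Y (stratified sampling), so the Purchase proportions should be nearly identical in the two splits; a quick sanity check:
# The stratified split should preserve the class proportions
prop.table(table(trainData$Purchase))
prop.table(table(testData$Purchase))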
Descriptive statistics
• The skimr package provides a nice solution to show key descriptive
stats for each column.
• skimr::skim_to_wide() produces a dataframe containing the
descriptive stats of each of the columns (in newer versions of skimr,
skim() returns this wide format directly). The output even includes an
inline histogram for the numeric columns, drawn without any plotting help.
Descriptive statistics
library(skimr)
skimmed <- skim_to_wide(trainData)
skimmed[, c(1:5, 9:11, 13, 15:16)]
• Notice the number of missing values for each feature, the mean and median,
the proportion split of categories in the factor variables, the percentiles,
and the histogram in the last column.
Handling missing values using
preProcess()
• It is a common practice to replace the missing values in a numeric
column with the mean of that column.
• If it is a categorical variable, it is common to replace the missing values
with the most frequently occurring value, aka the mode.
• But this is quite a basic and rather rudimentary approach (sketched below).
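• For illustration only, a base-R sketch of this basic approach (PriceCH and Store7 are columns of this dataset; you would skip this if you plan to use the model-based imputation shown next):
# Sketch: fill missing numeric values with the column mean
orange$PriceCH[is.na(orange$PriceCH)] <- mean(orange$PriceCH, na.rm = TRUE)
# Sketch: fill missing categorical values with the mode (most frequent value)
mode_val <- names(which.max(table(orange$Store7)))
orange$Store7[is.na(orange$Store7)] <- mode_val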
Handling missing values using
preProcess() cont.
• We can actually predict the missing values by treating the rest of
the available variables as predictors.
• A popular algorithm for imputation is k-Nearest Neighbors. This
can be done quickly and easily using caret.
• caret offers a convenient preProcess() function that can predict
missing values, in addition to other preprocessing.
What is k-Nearest Neighbors?
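• For a row with a missing entry, k-Nearest Neighbors finds the k rows most
similar to it on the observed columns and fills the gap from those neighbors,
e.g. with their average.
• A toy base-R sketch of the idea (illustration only, not caret's actual implementation):
# Toy kNN imputation: impute the missing 'b' in row 4 from its k=2 nearest rows
df <- data.frame(a = c(1.0, 1.2, 5.0, 0.9),
                 b = c(2.0, 2.2, 9.0, NA))
k <- 2
d <- abs(df$a[1:3] - df$a[4])  # distance to the complete rows, using observed column 'a'
nn <- order(d)[1:k]            # indices of the k nearest rows
df$b[4] <- mean(df$b[nn])      # impute from the neighbors' 'b' values
df$b[4]                        # 2.1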
Handling missing values using
preProcess() cont.
• To predict the missing values with k-Nearest Neighbors using
preProcess():
• Set method='knnImpute' and apply preProcess() to the training data. This
creates a preprocessing model.
• Then use predict() on the created preprocessing model, setting the newdata
argument to the same training data.
• Caret also provides bagImpute as an alternative imputation algorithm.
Handling missing values using
preProcess() cont.
# Create the knn imputation model on the training data
# (note: knnImpute also centers and scales the numeric variables)
preProcess_missingdata_model <- preProcess(trainData, method='knnImpute')
preProcess_missingdata_model

# Use the imputation model to predict the values of missing data points
library(RANN)  # required for knnImpute
trainData <- predict(preProcess_missingdata_model, newdata = trainData)
anyNA(trainData)
One-Hot Encoding (dummy
variables)
• Convert the categorical columns to numeric so they can be used by the
machine learning algorithms.
• Just replacing the categories with a number may not be meaningful, since
it imposes an artificial ordering.
• So what you can do instead is convert the categorical variable into
as many binary (1 or 0) variables as there are categories.
One-Hot Encoding (dummy
variables)
• In caret, one-hot-encodings can be created using dummyVars(). Just pass in all the
features to dummyVars() as the training data and all the factor columns will
automatically be converted to one-hot-encodings.
dummies_model <- dummyVars(Purchase ~ ., data=trainData)
trainData_mat <- predict(dummies_model, newdata = trainData)
trainData <- data.frame(trainData_mat)
str(trainData)
One-Hot Encoding (dummy
variables)
• In the above case, we had one categorical variable, Store7, with 2
categories. It was one-hot-encoded to produce two new columns:
Store7.No and Store7.Yes.
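• One more preprocessing step is needed before modeling: the prediction code later applies a preProcess_range_model, and dummyVars() also dropped the response. A sketch of that step using preProcess() with method='range' (the object name matches what the test-set code below expects; re-attaching Purchase as a factor is an assumption for the classification steps that follow):
# Scale every feature to lie between 0 and 1
preProcess_range_model <- preProcess(trainData, method='range')
trainData <- predict(preProcess_range_model, newdata = trainData)
# Append the Y variable back, since dummyVars() removed it
trainData$Purchase <- as.factor(y)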
Visualize the importance of
variables (featurePlot)
• Simply set the x and y parameters and pick a plot type, e.g. plot='box' or
plot='density'. You can additionally adjust the label font size (using strip)
and set the scales to be free.
featurePlot(x = trainData[, 1:18],
            y = as.factor(trainData$Purchase),
            plot = "density",
            strip = strip.custom(par.strip.text = list(cex = .7)),
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")))
Feature Selection Using Recursive
Feature Elimination (rfe)
• RFE works in 3 broad steps:
• Step 1: Build an ML model on a training dataset and estimate the feature
importances on held-out (test) data.
• Step 2: Giving priority to the most important variables, iterate through by
building models of given subset sizes, that is, subgroups of the most important
predictors determined in Step 1. The ranking of the predictors is recalculated
in each iteration.
• Step 3: Compare the model performances across the different subset sizes
to arrive at the optimal number and list of final predictors.
Recursive Feature Elimination (rfe)
code
set.seed(100)
options(warn=-1)  # suppress warnings

subsets <- c(1:5, 10, 15, 18)

ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)

lmProfile <- rfe(x = trainData[, 1:18], y = as.factor(trainData$Purchase),
                 sizes = subsets,
                 rfeControl = ctrl)
lmProfile
Recursive Feature Elimination (rfe)
output
rfeControl parameter
• rfeControl sets what type of algorithm and what cross-validation method
should be used.
• In the above case, the cross-validation method is repeatedcv, which
implements repeated k-fold cross validation (here, repeated 5 times).
• The underlying model is random forest based (rfFuncs).
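• To list which variables RFE actually selected, caret provides the predictors() accessor:
# Variables chosen by RFE
predictors(lmProfile)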
Selecting a Machine
Learning Model
Machine Learning Models in R
• Now comes the important stage where you actually build the machine learning
model.
• To know what models caret supports, run the following:
modelnames <- paste(names(getModelInfo()), collapse=', ')
modelnames
• If you want to know more details, like the hyperparameters and whether it
can be used for a regression or classification problem, then do:
modelLookup("algorithm name in caret")
Step 1: Understand your ingredients
• Data
• How were they collected?
• Are the features well defined?
• What type of data is it? (assumptions, distributions, image pixels, etc.)
• Goal
• Predict?
• Describe?
• Explore?
• Metrics
• Expected accuracy level
• Scoring speed
• Interpretability
• Ease of implementation and maintenance
Step 2: Match the Ingredients to ML
Algorithms
Step 3: Assess the Results in
Systematic Approach
• After selecting the set of models of interest, build these models
• Compare these models
• Choose the one that best fits your metrics
Train a model
modelLookup('earth')
# Set the seed for reproducibility
set.seed(100)
# Train the MARS model (method='earth') and predict on the training data itself.
model_mars = train(Purchase ~ ., data=trainData, method='earth')
fitted <- predict(model_mars)
model_mars
Compute variable importance
• MARS supports computing variable importances. Let's extract them using
varImp() to understand which variables came out to be useful.
varimp_mars <- varImp(model_mars)
plot(varimp_mars, main="Variable Importance with MARS")
Prepare the test dataset and predict
• A default MARS model has been selected.
• Now, in order to use the model to predict on new data, the test data has to
be preprocessed and transformed just the way we did the training
data.
• If you recall, we did the pre-processing in the following sequence:
• Missing Value Imputation –> One-Hot Encoding –> Range
Normalization
Prepare the test dataset and predict
code
# Step 1: Impute missing values
testData2 <- predict(preProcess_missingdata_model, testData)
# Step 2: Create one-hot encodings (dummy variables)
testData3 <- predict(dummies_model, testData2)
# Step 3: Transform the features to range between 0 and 1
testData4 <- predict(preProcess_range_model, testData3)
# View
head(testData4[, 1:10])
Predict on testData
• The test dataset is prepared. Let’s predict the Y.
# Predict on testData
predicted <- predict(model_mars, testData4)
head(predicted)
Confusion Matrix
• The confusion matrix is a tabular comparison of the
predictions (data) vs the actuals (reference).
confusionMatrix(reference = as.factor(testData$Purchase), data = predicted,
                mode = 'everything', positive = 'MM')
Optimizing the Model for Better
Performance
• The train() function takes a trControl argument that accepts the
output of trainControl().
• Inside trainControl() you can specify:
• The cross-validation method to use.
• How the results should be summarised, using a summary function.
• The summaryFunction can be twoClassSummary if Y is binary class, or
multiClassSummary if Y has more than 2 categories.
Optimizing the Model for Better
Performance
# Define the training control
fitControl <- trainControl(
    method = 'cv',                     # k-fold cross validation
    number = 5,                        # number of folds
    savePredictions = 'final',         # save predictions for the optimal tuning parameters
    classProbs = TRUE,                 # should class probabilities be returned
    summaryFunction = twoClassSummary  # results summary function
)

# Step 1: Tune hyperparameters by setting tuneLength
set.seed(100)
model_mars2 <- train(Purchase ~ ., data=trainData, method='earth',
                     tuneLength = 5, metric='ROC', trControl = fitControl)
model_mars2

# Step 2: Predict on testData and compute the confusion matrix
predicted2 <- predict(model_mars2, testData4)
confusionMatrix(reference = as.factor(testData$Purchase), data = predicted2,
                mode='everything', positive='MM')
How to evaluate performance of
multiple machine learning
algorithms?
• Caret provides the resamples() function where you can provide
multiple machine learning models and collectively evaluate them.
• Let’s first train some more algorithms.
Training Adaboost
set.seed(100)
# Train the model using adaboost
model_adaboost = train(Purchase ~ ., data=trainData,
method='adaboost', tuneLength=2, trControl = fitControl)
model_adaboost
Training Random Forest
set.seed(100)
# Train the model using rf
model_rf = train(Purchase ~ ., data=trainData, method='rf',
tuneLength=5, trControl = fitControl)
model_rf
Training xgBoost Dart
set.seed(100)
# Train the model using xgbDART
model_xgbDART = train(Purchase ~ ., data=trainData,
method='xgbDART', tuneLength=5, trControl = fitControl, verbose=F)
model_xgbDART
Training SVM
set.seed(100)
# Train the model using svmRadial
model_svmRadial = train(Purchase ~ ., data=trainData,
method='svmRadial', tuneLength=15, trControl = fitControl)
model_svmRadial
Run resamples() to compare the
models
# Compare model performances using resamples()
models_compare <- resamples(list(ADABOOST=model_adaboost, RF=model_rf,
                                 XGBDART=model_xgbDART, MARS=model_mars2, SVM=model_svmRadial))

# Summary of the model performances
summary(models_compare)
# Draw box plots to compare models
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(models_compare, scales=scales)
Conclusion
• This was just an overview of how caret can be used in R.
• A deep understanding of machine learning is necessary to utilize this
tool effectively.
• Over time, you will get used to these steps, and you will realize that the
most difficult part of any project is acquiring and cleaning the data.
Resources
• https://www.machinelearningplus.com/machine-learning/caret-package/
• https://blogs.sas.com/content/subconsciousmusings/2020/12/09/machine-learning-algorithm-use/#prettyPhoto
• https://www.youtube.com/watch?v=-oZcf0QEzYM