Machine Learning
using caret API
Week 14
Machine learning in R
• There is a LARGE assortment of ML packages in R: essentially one for
every possible learning method.
• Each package has its own unique ways of reading data in, outputting
results, and post-processing.
• This can make it difficult to implement different types of models
quickly.
• The caret package eases this process by providing a system of wrapper
functions that make it very easy to implement models.
• Caret is short for Classification And REgression Training.
Steps for Building a Machine
Learning Model
1. Data Preparation and Preprocessing
a) Split the dataset into training and validation
b) Descriptive statistics
c) Impute missing values
d) One-Hot Encoding (dummy variables)
e) Variable transformations
2. Feature Selection (Optional)
3. Training and Tuning the model
4. Ensemble the predictions
5. Performance Optimization
6. Evaluating performance of multiple machine learning algorithms
Let us start with a dataset
• We will use an Orange Juice Dataset from this link:
https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv
• The goal of this dataset is to predict which of the two brands of
orange juice the customer buys.
• The predictor variables are characteristics of the customer and the
product itself.
• The response variable is 'Purchase', which takes either the value
'CH' (Citrus Hill) or 'MM' (Minute Maid).
Let us start with a dataset
library(caret)
orange <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv')
str(orange)
head(orange[, 1:10])
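• Before any preprocessing, it is worth a quick look at how the two brands are distributed in the response:
# How balanced are the two brands?
table(orange$Purchase)
prop.table(table(orange$Purchase))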
Data Preparation and
Preprocessing
Split the dataset into training
and validation
• createDataPartition() takes as input the Y variable in the source
dataset and the percentage of data that should go into training as the p
argument. It returns the row numbers that should form the training
dataset.
• Plus, you need to set list=FALSE to prevent the result from being returned as a list.
• We will split the dataset into training (80%) and test (20%) datasets
using caret's createDataPartition() function.
Split the dataset into training
and validation
# Create the training and test datasets
set.seed(100)
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(orange$Purchase, p=0.8, list=FALSE)
# Step 2: Create the training dataset
trainData <- orange[trainRowNumbers,]
# Step 3: Create the test dataset
testData <- orange[-trainRowNumbers,]
# Store X and Y for later use.
x <- trainData[, 2:18]   # predictor columns
y <- trainData$Purchase  # response
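• createDataPartition() samples within the classes of Y (stratified sampling), so the Purchase proportions should be nearly identical in the two splits; a quick sanity check:
# The stratified split should preserve the class proportions
prop.table(table(trainData$Purchase))
prop.table(table(testData$Purchase))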
Descriptive statistics
• The skimr package provides a nice solution to show key descriptive
stats for each column.
• skimr::skim_to_wide() produces a dataframe containing the
descriptive stats of each of the columns (in newer versions of skimr,
skim() returns this wide format directly). The output even includes an
inline histogram for the numeric columns, drawn without any plotting help.
Descriptive statistics
library(skimr)
skimmed <- skim_to_wide(trainData)
skimmed[, c(1:5, 9:11, 13, 15:16)]
• Notice the number of missing values for each feature, the mean and median,
the proportion split of categories in the factor variables, the percentiles,
and the histogram in the last column.
Handling missing values using
preProcess()
• It is a common practice to replace the missing values in a numeric
column with the mean of that column.
• If it is a categorical variable, it is common to replace the missing values
with the most frequently occurring value, aka the mode.
• But this is quite a basic and rather rudimentary approach (sketched below).
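• For illustration only, a base-R sketch of this basic approach (PriceCH and Store7 are columns of this dataset; you would skip this if you plan to use the model-based imputation shown next):
# Sketch: fill missing numeric values with the column mean
orange$PriceCH[is.na(orange$PriceCH)] <- mean(orange$PriceCH, na.rm = TRUE)
# Sketch: fill missing categorical values with the mode (most frequent value)
mode_val <- names(which.max(table(orange$Store7)))
orange$Store7[is.na(orange$Store7)] <- mode_val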
Handling missing values using
preProcess() cont.
• We can actually predict the missing values by treating the rest of
the available variables as predictors.
• A popular algorithm for imputation is k-Nearest Neighbors. This
can be done quickly and easily using caret.
• caret offers a convenient preProcess() function that can predict
missing values, in addition to other preprocessing.
What is k-Nearest Neighbors?
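• For a row with a missing entry, k-Nearest Neighbors finds the k rows most
similar to it on the observed columns and fills the gap from those neighbors,
e.g. with their average.
• A toy base-R sketch of the idea (illustration only, not caret's actual implementation):
# Toy kNN imputation: impute the missing 'b' in row 4 from its k=2 nearest rows
df <- data.frame(a = c(1.0, 1.2, 5.0, 0.9),
                 b = c(2.0, 2.2, 9.0, NA))
k <- 2
d <- abs(df$a[1:3] - df$a[4])  # distance to the complete rows, using observed column 'a'
nn <- order(d)[1:k]            # indices of the k nearest rows
df$b[4] <- mean(df$b[nn])      # impute from the neighbors' 'b' values
df$b[4]                        # 2.1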
Handling missing values using
preProcess() cont.
• To predict the missing values with k-Nearest Neighbors using
preProcess():
• Set method='knnImpute' and apply preProcess() to the training data. This
creates a preprocessing model.
• Then use predict() on the created preprocessing model, setting the newdata
argument to the same training data.
• Caret also provides bagImpute as an alternative imputation algorithm.
Handling missing values using
preProcess() cont.
# Create the knn imputation model on the training data
# (note: knnImpute also centers and scales the numeric variables)
preProcess_missingdata_model <- preProcess(trainData, method='knnImpute')
preProcess_missingdata_model

# Use the imputation model to predict the values of missing data points
library(RANN)  # required for knnImpute
trainData <- predict(preProcess_missingdata_model, newdata = trainData)
anyNA(trainData)
One-Hot Encoding (dummy
variables)
• Convert the categorical columns to numeric so they can be used by the
machine learning algorithms.
• Just replacing the categories with a number may not be meaningful, since
it imposes an artificial ordering.
• So what you can do instead is convert the categorical variable into
as many binary (1 or 0) variables as there are categories.
One-Hot Encoding (dummy
variables)
• In caret, one-hot-encodings can be created using dummyVars(). Just pass in all the
features to dummyVars() as the training data and all the factor columns will
automatically be converted to one-hot-encodings.
dummies_model <- dummyVars(Purchase ~ ., data=trainData)
trainData_mat <- predict(dummies_model, newdata = trainData)
trainData <- data.frame(trainData_mat)
str(trainData)
One-Hot Encoding (dummy
variables)
• In the above case, we had one categorical variable, Store7, with 2
categories. It was one-hot-encoded to produce two new columns:
Store7.No and Store7.Yes.
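• One more preprocessing step is needed before modeling: the prediction code later applies a preProcess_range_model, and dummyVars() also dropped the response. A sketch of that step using preProcess() with method='range' (the object name matches what the test-set code below expects; re-attaching Purchase as a factor is an assumption for the classification steps that follow):
# Scale every feature to lie between 0 and 1
preProcess_range_model <- preProcess(trainData, method='range')
trainData <- predict(preProcess_range_model, newdata = trainData)
# Append the Y variable back, since dummyVars() removed it
trainData$Purchase <- as.factor(y)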
Visualize the importance of
variables (featurePlot)
• Simply set the x and y parameters and pick a plot type, e.g. plot='box' or
plot='density'. You can additionally adjust the label font size (using strip)
and set the scales to be free.
featurePlot(x = trainData[, 1:18],
            y = as.factor(trainData$Purchase),
            plot = "density",
            strip = strip.custom(par.strip.text = list(cex = .7)),
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")))
Feature Selection Using Recursive
Feature Elimination (rfe)
• RFE works in 3 broad steps:
• Step 1: Build an ML model on a training dataset and estimate the feature
importances on held-out (test) data.
• Step 2: Giving priority to the most important variables, iterate through by
building models of given subset sizes, that is, subgroups of the most important
predictors determined in Step 1. The ranking of the predictors is recalculated
in each iteration.
• Step 3: Compare the model performances across the different subset sizes
to arrive at the optimal number and list of final predictors.
Recursive Feature Elimination (rfe)
code
set.seed(100)
options(warn=-1)  # suppress warnings

subsets <- c(1:5, 10, 15, 18)

ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)

lmProfile <- rfe(x = trainData[, 1:18], y = as.factor(trainData$Purchase),
                 sizes = subsets,
                 rfeControl = ctrl)
lmProfile
Recursive Feature Elimination (rfe)
output
rfeControl parameter
• rfeControl sets what type of algorithm and what cross-validation method
should be used.
• In the above case, the cross-validation method is repeatedcv, which
implements repeated k-fold cross validation (here, repeated 5 times).
• The underlying model is random forest based (rfFuncs).
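• To list which variables RFE actually selected, caret provides the predictors() accessor:
# Variables chosen by RFE
predictors(lmProfile)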
Selecting a Machine
Learning Model
Machine Learning Models in R
• Now comes the important stage where you actually build the machine learning
model.
• To know what models caret supports, run the following:
modelnames <- paste(names(getModelInfo()), collapse=', ')
modelnames
• If you want to know more details, like the hyperparameters and whether it
can be used for a regression or classification problem, then do:
modelLookup("algorithm name in caret")
Step 1: Understand your ingredients
• Data
• How were they collected?
• Are the features well defined?
• What type of data is it? (assumptions, distributions, image pixels, etc.)
• Goal
• Predict?
• Describe?
• Explore?
• Metrics
• Expected accuracy level
• Scoring speed
• Interpretability
• Ease of implementation and maintenance
Step 2: Match the Ingredients to ML
Algorithms
Step 3: Assess the Results in
Systematic Approach
• After selecting the set of models of interest, build these models
• Compare these models
• Choose the one that best fits your metrics
Train a model
modelLookup('earth')
# Set the seed for reproducibility
set.seed(100)
# Train the MARS model (method='earth') and predict on the training data itself.
model_mars = train(Purchase ~ ., data=trainData, method='earth')
fitted <- predict(model_mars)
model_mars
Compute variable importance
• MARS supports computing variable importances. Let's extract them using
varImp() to understand which variables came out to be useful.
varimp_mars <- varImp(model_mars)
plot(varimp_mars, main="Variable Importance with MARS")
Prepare the test dataset and predict
• A default MARS model has been selected.
• Now, in order to use the model to predict on new data, the test data has to
be preprocessed and transformed just the way we did the training
data.
• If you recall, we did the pre-processing in the following sequence:
• Missing Value Imputation –> One-Hot Encoding –> Range
Normalization
Prepare the test dataset and predict
code
# Step 1: Impute missing values
testData2 <- predict(preProcess_missingdata_model, testData)
# Step 2: Create one-hot encodings (dummy variables)
testData3 <- predict(dummies_model, testData2)
# Step 3: Transform the features to range between 0 and 1
testData4 <- predict(preProcess_range_model, testData3)
# View
head(testData4[, 1:10])
Predict on testData
• The test dataset is prepared. Let’s predict the Y.
# Predict on testData
predicted <- predict(model_mars, testData4)
head(predicted)
Confusion Matrix
• The confusion matrix is a tabular comparison of the
predictions (data) vs the actuals (reference).
confusionMatrix(reference = as.factor(testData$Purchase), data = predicted,
                mode = 'everything', positive = 'MM')
Optimizing the Model for Better
Performance
• The train() function takes a trControl argument that accepts the
output of trainControl().
• Inside trainControl() you can specify:
• The cross-validation method to use.
• How the results should be summarised, using a summary function.
• The summaryFunction can be twoClassSummary if Y is binary class, or
multiClassSummary if Y has more than 2 categories.
Optimizing the Model for Better
Performance
# Define the training control
fitControl <- trainControl(
    method = 'cv',                     # k-fold cross validation
    number = 5,                        # number of folds
    savePredictions = 'final',         # save predictions for the optimal tuning parameters
    classProbs = TRUE,                 # should class probabilities be returned
    summaryFunction = twoClassSummary  # results summary function
)

# Step 1: Tune hyperparameters by setting tuneLength
set.seed(100)
model_mars2 <- train(Purchase ~ ., data=trainData, method='earth',
                     tuneLength = 5, metric='ROC', trControl = fitControl)
model_mars2

# Step 2: Predict on testData and compute the confusion matrix
predicted2 <- predict(model_mars2, testData4)
confusionMatrix(reference = as.factor(testData$Purchase), data = predicted2,
                mode='everything', positive='MM')
How to evaluate performance of
multiple machine learning
algorithms?
• Caret provides the resamples() function where you can provide
multiple machine learning models and collectively evaluate them.
• Let’s first train some more algorithms.
Training Adaboost
set.seed(100)
# Train the model using adaboost
model_adaboost = train(Purchase ~ ., data=trainData,
method='adaboost', tuneLength=2, trControl = fitControl)
model_adaboost
Training Random Forest
set.seed(100)
# Train the model using rf
model_rf = train(Purchase ~ ., data=trainData, method='rf',
tuneLength=5, trControl = fitControl)
model_rf
Training xgBoost Dart
set.seed(100)
# Train the model using xgbDART
model_xgbDART = train(Purchase ~ ., data=trainData,
method='xgbDART', tuneLength=5, trControl = fitControl, verbose=F)
model_xgbDART
Training SVM
set.seed(100)
# Train the model using svmRadial
model_svmRadial = train(Purchase ~ ., data=trainData,
method='svmRadial', tuneLength=15, trControl = fitControl)
model_svmRadial
Run resamples() to compare the
models
# Compare model performances using resamples()
models_compare <- resamples(list(ADABOOST=model_adaboost, RF=model_rf,
                                 XGBDART=model_xgbDART, MARS=model_mars2, SVM=model_svmRadial))

# Summary of the model performances
summary(models_compare)
# Draw box plots to compare models
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(models_compare, scales=scales)
Conclusion
• This was just an overview of how caret can be used in R.
• A deep understanding of machine learning is necessary to utilize this
tool effectively.
• Over time, you will get used to these steps, and you will realize that the
most difficult part of any project is acquiring and cleaning the data.
Resources
• https://www.machinelearningplus.com/machine-learning/caret-package/
• https://blogs.sas.com/content/subconsciousmusings/2020/12/09/machine-learning-algorithm-use/#prettyPhoto
• https://www.youtube.com/watch?v=-oZcf0QEzYM