What You Need To Know About R (Ebook)
Raghav Bali
Dipanjan Sarkar
BIRMINGHAM - MUMBAI
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the authors, nor Packt
Publishing, nor its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
Raghav Bali has a master's degree (gold medalist) in Information Technology from
the International Institute of Information Technology, Bangalore. He is an IT engineer
at Intel, the world's largest silicon company, where he works on analytics, business
intelligence, and application development, building scalable machine learning-based
solutions. He has worked as an analyst and developer in domains such as ERP,
finance, and BI with some of the top companies in the world.
Raghav is a technology enthusiast who loves reading and playing around with new
gadgets and technologies. He recently co-authored a book on machine learning
titled R Machine Learning by Example, published by Packt Publishing. He is a
shutterbug, capturing moments when he isn't busy solving problems.
Andrea has written and contributed to a few useful R packages and regularly shares
insightful advice and tutorials about R programming.
https://www.packtpub.com/books/subscription/packtlib
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Table of Contents
R Cheat Sheets
Data processing and transformation
Data handling
Basic data types
Data structures
General utilities
Math and modeling
Math and modeling utilities
Math and modeling packages
Plotting
Plotting packages
Summary
What to do next?
Broaden your horizons with Packt
What you need to know about R
This eGuide is designed to act as a brief, practical introduction to R. It is full of
practical examples that will get you up and running quickly with the core tasks
of R.
We assume that you know a bit about what R is, what it does, and why you want to
use it, so this eGuide won't give you a history lesson in the background of R. What
this eGuide will give you, however, is a greater understanding of the key basics of R
so that you have a good idea of how to advance after you've read the guide. We can
then point you in the right direction of what to learn next after giving you the basic
knowledge to do so. This eGuide will:
Cover the fundamentals and the things you really need to know, rather than
niche or specialized areas.
Assume that you come from a fairly technical background and so understand
what the technology is and what it broadly does.
Focus on what things are and how they work.
Include practical examples to get you up, running, and productive quickly.
Preface
Overview
R is a scripting language that is aimed at performing statistical analysis. It draws
inspiration from S, a statistical programming language that was developed by
AT&T. It also provides a multitude of options, tools, and libraries to make statistical
analysis easy and effective. R has grown over the years as a result of its open source
nature. It is a community-driven language that provides powerful tools for data
processing, manipulation, visualization, and publishing. It continues to evolve with
an ever-increasing list of packages and libraries, along with constant improvements
to the overall language.
Statistical analysis was the reason for R's inception and it has grown both in
importance and functionality over the years to become a go-to language. Data
scientists and statisticians alike use it to quickly prototype as well as to build
complex models and analyses. R finds applications in financial analysis and
modeling, food and drug data analysis, clinical trial analysis, and so on.
R Ecosystem
This is an introductory section, and it will get you started with the basics of R along
with its ecosystem. It will also prepare you for some exciting features and examples
in the coming section. In this section, we will cover installation, configuration,
startup modes, the workspace, operators, data types, data structures, installing
packages, getting help, and the available IDEs.
Installation
R is an open source and free software environment (and an interpreted language)
that is available for all major operating systems, such as UNIX/Linux, Windows,
and OS X. As of this writing, the current version is 3.2.5 (code named Very, Very
Secure Dishes), and it is available at https://www.r-project.org/.
R's setup is straightforward and nicely outlined at the preceding link. For
the Windows environment, the setup requires downloading the setup file
and following the instructions from the executable. For Unix and Unix-like
environments, R can be installed from the prompt directly (wherever prebuilt
binaries are available). Enthusiasts can also build R from source by following
the steps outlined in the FAQ section of the R project website.
Configuration
R can be configured in a simple way to personalize startup. A file named
Rprofile.site exists in the installation directory. This simple R script file is checked
each time R is loaded into memory; hence, it executes any instructions (for instance,
functions, default directories, and so on) that are mentioned in this file. Another level
of customization can be achieved, where each user of the system can personalize R's
startup by adding a file named .Rprofile to their home directory.
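For instance, a personal .Rprofile might contain entries such as the following (an
illustrative sketch; the messages and option values are placeholders):
# illustrative ~/.Rprofile customizations
options(prompt="R> ", digits=4)  # customize the prompt and number printing
.First <- function() {
  message("Welcome back to R!")  # runs at the start of each session
}
.Last <- function() {
  message("Goodbye!")            # runs at the end of each session
}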
Startup modes
R supports the following two execution modes:
Interactive mode: R runs as an interpreter with a command prompt (for instance,
by typing R in a terminal or by working inside an IDE), and it executes commands
as you type them.
Batch mode: R scripts run without user interaction, for example, via the Rscript
front end or the R CMD BATCH command.
Workspace
R has robust memory management, where it allocates and keeps track of all the
objects in the environment. R's workspace is nothing but the current working
environment, which holds user-defined objects, such as variables, data, functions,
and so on. R provides various utility functions to manipulate the workspace.
Some of the standard utilities are as follows:
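For instance (a representative sketch, not an exhaustive list):
# inspecting and managing the workspace
x <- 10                      # create an object
ls()                         # list the objects in the current workspace
rm(x)                        # remove the object x
getwd()                      # print the current working directory
# setwd("~/projects")        # change the working directory (path is illustrative)
save.image("session.RData")  # save the entire workspace to a file
load("session.RData")        # restore a saved workspace
history()                    # show recently executed commands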
There are many other utility functions that are available. Readers are urged to
explore them using the help() command and R's documentation.
Operators
Programming languages require operators to perform actions, such as calculations,
transformations, and so on. Like other programming languages, R supports
operators for logical, mathematical, and conditional operations, among others. As R
was designed with statistical analysis in mind, the operators are robust enough to
handle not just basic data types, such as integers, floats, and characters, but also
matrices, vectors, strings, and arrays.
=: This can often be used in place of the <- operator for assignment; however,
R's style standards reserve the usage of the = operator for parameter passing.
When used in a function call, this operator assigns a value to a parameter of the
function without creating a variable in the user's workspace. The following is an
example of this:
> foo(x=1)
> foo_1(y <- 2)
> ls()
In this example, y would be present in the list of variables, but x will not be!
<- or ->: This is the default assignment operator that is used in R, and it is
part of R's lineage from the days of S. It works in both directions, as the
following example shows:
> x <- 3
> 15 -> y
There's even an <<- operator to access objects in the parent scope. Readers are
encouraged to go through R's documentation for more details.
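To make these concrete, here is a small sketch of assignment and element-wise
operators at work:
# assignment and element-wise operators
x <- c(1, 2, 3)      # default assignment
c(4, 5, 6) -> y      # right-hand assignment works too
x + y                # 5 7 9
x * 2                # 2 4 6
x > 1                # FALSE TRUE TRUE
(x > 1) & (y < 6)    # FALSE TRUE FALSE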
Data types
To enable statistical analysis, R supports all the basic data types, such as numeric
(integer, double), character, logical (boolean), and so on. Apart from these basic data
types, R also provides a data type called factor for categorical data and complex for
the storage of complex numbers. R also handles missing values and nonexistent
objects differently, using the NA and NULL keywords, respectively. You should
not confuse NA with NaN: NA denotes missing values, while NaN (which stands
for not a number and is a keyword in R) is used to represent undefined or
unrepresentable values. R also handles infinity using the Inf keyword (-Inf for
negative infinity). Each of these data types has a number of utility functions to check
for missing values, length, and so on. A common pair of function families for each
data type are as and is, which help in typecasting and checking the data type,
respectively. For instance, as.character() typecasts the input to character, while
is.character() is used to check whether the input is of the character type or not.
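The following sketch illustrates these types and utilities:
# basic data types and utility functions
i <- 42L                              # integer
d <- 3.14                             # double (numeric)
s <- "hello"                          # character
b <- TRUE                             # logical (boolean)
f <- factor(c("low", "high", "low"))  # factor for categorical data
z <- complex(real=1, imaginary=2)     # complex number
is.character(s)                       # TRUE
as.character(d)                       # "3.14"
c(NA, NaN, Inf, -Inf)                 # missing value, not a number, infinities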
Data structures
Data structures are at the core of R, and they provide a very powerful foundation
to the language. Any object or variable is a vector (as in the mathematical vector) by
default unless specified otherwise. Lists, arrays (n-dimensional), matrices, and so
on are available out of the box. Lists are recursive in nature, that is, they can contain
other lists as elements, while vectors, arrays, and matrices are atomic in nature. R also
provides a unique tabular data structure called data frames. Data frames represent a
two-dimensional structure just like a matrix. However, unlike matrices, a data frame
can have different columns containing different data types. All the components of a
data frame must be of equal length. Consider the following example:
> book.sections <- c("section 1", "section 2", "section 3")
> section.pages <- c(6, 26, 10)
> book <- data.frame(sections=book.sections, pages=section.pages) # combine into a data frame
More on each of these data structures is covered in the upcoming sections.
Installing packages
As mentioned earlier, R is a community-driven language, and it owes its immense
power to an ever-increasing list of packages that add to the capabilities of the
platform. In the R community, the terms package and library are often used
interchangeably, although, strictly speaking, a package is the installable unit of
code, while a library is the directory where packages are installed. R provides the
following utilities to handle packages (and many more):
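Commonly used examples include the following (a representative sketch):
# managing packages
install.packages("ggplot2")  # install a package from CRAN
library(ggplot2)             # load an installed package (errors if missing)
require(ggplot2)             # load a package, returning FALSE if missing
installed.packages()         # details of all the installed packages
update.packages()            # update outdated packages
remove.packages("ggplot2")   # uninstall a package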
Getting help
It is very quick and simple to check documentation or get help related to R,
its packages, or utilities. The following utilities are available to get help:
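Commonly used examples include the following (a representative sketch):
# getting help
help(mean)       # documentation for a function
?mean            # shorthand for help(mean)
??regression     # search the help system; same as help.search("regression")
example(mean)    # run the examples from a help page
apropos("mean")  # find object names matching a pattern
vignette()       # list the vignettes available in installed packages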
RStudio
RStudio provides various features, such as a syntax-highlighting editor with code
completion, an integrated R console, workspace and history browsers, and dedicated
panes for plots, packages, and help.
Other IDEs
Apart from RStudio, various other IDEs are also available:
RCommander: http://www.rcommander.com/
Eclipse R StatET: http://www.walware.de/goto/statet
ESS or Emacs Speaks Stats: http://ess.r-project.org/
There are many more specialized and fully-loaded R IDEs. Similar to Eclipse, Visual
Studio also provides R and R-related plugins for .NET developers. Use the one that
suits you the best. For the purpose of this book, we will stick with RStudio.
From RStudio, go to the File menu and select New File | R Markdown. Then, choose
the output format from the dialog box and provide a title and author name. Add
content in standard Markdown syntax in the script pane or window, and click on
the Knit HTML icon.
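For reference, a minimal R Markdown document might look like the following (an
illustrative sketch; the title, author, and content are placeholders):
---
title: "Sample Report"
author: "A. Author"
output: html_document
---
## A quick look at mtcars
The following chunk runs R code and embeds its output in the report:
```{r}
summary(mtcars$mpg)
```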
The markdown script is processed, and a preview window pops up. You can publish
to the Web directly from this pop-up by clicking on the Publish button:
RPubs works without RStudio as well. This requires the installation of packages
such as knitr, rmarkdown, and so on.
R Shiny requires the shiny package to be added to the list of packages. The
documentation for R Shiny is fairly detailed and easy to understand. Refer to the
tutorial and examples mentioned at the official website at
http://shiny.rstudio.com/tutorial/.
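A minimal single-file Shiny app might look like the following (an illustrative sketch;
the input and output names are placeholders):
library(shiny)
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min=5, max=50, value=20),
  plotOutput("histPlot")
)
server <- function(input, output) {
  output$histPlot <- renderPlot({
    hist(mtcars$mpg, breaks=input$bins, col="lightblue",
         main="MPG Distribution", xlab="Miles per Gallon (mpg)")
  })
}
shinyApp(ui=ui, server=server)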
The following screenshot displays a sample Shiny app with its rendered output and
code side by side:
Data Analysis
In the previous section, you got a quick glance into the entire ecosystem of tools and
frameworks that R offers to analyze data and present your findings in various ways,
including reproducible Markdown documents as well as web applications. R is a
programming language at heart, but it is also a software environment that was
primarily built for the statistical analysis of data, leveraging a wide variety of
statistical techniques and graphical methods to visualize results. In this section, we
will look at what a typical data analysis workflow looks like, and then we will
analyze a real dataset using exploratory and statistical analysis techniques.
Business Understanding: This is the initial stage. It focuses on the business
context of the problem at hand and uses domain and business knowledge to
plan out the main objectives and results that are intended from the data
analysis workflow.
Data Acquisition and Understanding: This stage's main focus is to
acquire data of interest and understand the meaning and semantics of the
various data points and attributes that are present in the data. Some initial
exploration of the data may also be done at this stage.
Data Preparation: This stage usually involves data munging, cleaning,
and transformation. Data quality issues are also dealt with in this stage.
The final dataset produced here is then used for analysis and modeling.
Modeling and Analysis: This stage mainly focuses on analyzing the data and
building models using specific techniques. Often, we need to apply further
data transformations that are based on different modeling algorithms.
Evaluation: This is perhaps one of the most crucial stages. Building models
and analyzing the data for patterns and insights are not the end of the
analysis. In this stage, we evaluate the results that are obtained from different
techniques and iterations, and then we select the best possible method or
analysis, which gives us the insights that we need based on our business
requirements. Often, this stage involves reiterating through the previous two
steps to reach a final agreement based on the results.
Deployment: This is the final stage, where decision systems that are based
on the analysis are deployed so that end users can start consuming the results
and utilizing them. The deployed system can be as complex as a real-time
prediction system or as simple as an ad-hoc report.
The following figure shows the relationship between the various stages in the
CRISP-DM model:
In principle, the CRISP-DM model is very clear and concise, and this makes it easy
for data science practitioners and analysts to follow in their daily processes. We will
look at a real dataset in the next section, and apply some of these principles in our
data analysis process to get valuable insights from analyzing data.
Our main objective is to explore and analyze the mtcars dataset, which is readily
available in R. This dataset contains data about several automobiles, specifically
cars, and various attributes that are related to each car. Some of our main objectives
are as follows:
Exploring the dataset and visualizing the relationships between its attributes
Testing, using statistical inference, whether the transmission type makes a
difference to fuel efficiency (mpg)
Building and evaluating regression models to predict mpg from the other
attributes
Next, we will focus on getting our dataset and understanding the semantics of each
attribute in the dataset and what they indicate.
Next, we will inspect some details about the dataset, which can be done using the
str command, as follows:
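# dataset structure
str(mtcars)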
This gives us the following output, which tells us the various attributes in the dataset
and gives us a quick peek at their values.
We observe that the dataset is stored in a data structure of the data.frame type,
which is basically a two-dimensional tabular structure that is similar to a spreadsheet,
where each row is a particular data point that consists of different attributes that are
represented by different columns. In our dataset, we have 32 observations or data
points that form the rows and 11 attributes that form the columns. Each data point
or row is for a particular car, and each column is a specific attribute that is related to
the car, such as mpg, which indicates the Miles per Gallon of this car. We will now
understand the data in more detail by accessing the dataset metadata information
using the help command, as follows:
# detailed information about the dataset
help(mtcars)
This command displays detailed information regarding the data, which was originally
extracted from the 1974 issue of Motor Trend US magazine. This data has information
for a total of 32 cars. Each car is described by a total of 11 attributes. As documented
in help(mtcars), these are mpg (miles per US gallon), cyl (number of cylinders), disp
(displacement in cubic inches), hp (gross horsepower), drat (rear axle ratio), wt
(weight in 1000 lbs), qsec (quarter-mile time), vs (engine shape), am (transmission:
0 = automatic, 1 = manual), gear (number of forward gears), and carb (number of
carburetors). They are pretty self-explanatory, except perhaps the vs attribute, which
indicates whether the car has a V-shaped (0) or a straight (1) engine.
We will now look at what the actual data looks like in the dataset using the
following command:
# view the raw data
head(mtcars, 5)
This command shows us the top five rows in the dataset, which are shown in the
following snapshot:
Now that we have a good understanding of the data and what it looks like, we will
proceed to the next step of data preparation before analyzing it.
If you closely observe the attribute data types from str(mtcars), which we executed
earlier, you will see that each attribute has been declared as num, which is a numeric
type, by R. However, in reality several variables are not of the numeric type, and we
have to change this based on the variable semantics and values. If you have taken
a basic course on statistics, you might know that we usually deal with two types of
variables or attributes most of the time: numeric variables, which take continuous or
discrete numeric values (for example, mpg or wt), and categorical variables, which
take values from a fixed set of categories or levels (for example, am or gear).
As all the variables or attributes in our dataset were converted to numeric by default,
we will only need to convert the categorical variables from numeric data types to
factors, which is how R represents categorical attributes.
We will first implement our own utility function to carry out this data type conversion
using the following code snippet. A function is basically a block of code that usually
takes some input, performs some operations, and may or may not return an output:
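# utility function: convert the given columns of a data frame to factors
# (a sketch reconstructing the function used below; its signature follows
# the way it is invoked in the next snippet)
to.factors <- function(df, variables) {
  for (variable in variables) {
    df[[variable]] <- as.factor(df[[variable]])
  }
  return(df)
}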
We will use this function on our existing mtcars data frame to transform the cyl,
vs, am, gear, and carb attributes into categorical attributes using the following
code snippet:
## perform data type transformation
categorical.vars <- c("cyl", "vs", "am", "gear", "carb")
mtcars <- to.factors(mtcars, categorical.vars)
Now, we will observe whether this data type transformation was successful using
the following snippet:
# verify transformation
str(mtcars)
We can then see the attribute details in the data frame with the transformed
data types in the following snapshot, which indicates that our transformations
were successful:
This brings us to the end of our data preparation stage, and we will now perform
some analysis on our dataset in the next section.
One basic visualization is the scatter plot, where we usually have one attribute on
the x axis and another on the y axis, and we plot the various data points in the
two-dimensional space to see the relationship between the attributes. We will plot a
pairs scatterplot between all possible attributes in our mtcars dataset with the
following code snippet:
# pairs plot observing relationships between variables
pairs(mtcars, panel = panel.smooth,
main = "Pairs plot for mtcars data set")
This gives us the following scatterplot, which shows the relationship between each
pair of attributes in the dataset:
Next, we will leverage the use of a dot chart to plot the Miles per Gallon (mpg) value
for all the cars in our dataset using the following code snippet:
# mpg of cars
dotchart(mtcars$mpg, labels=row.names(mtcars),
cex=0.7, pch=16,
main="Miles per Gallon (mpg) of Cars",
xlab = "Miles per Gallon (mpg)")
This gives us a dot plot of the mpg values of each car, as shown in the following figure:
Some interesting insights that we can see from the previous chart are that the two
cars with the lowest mpg are the Lincoln Continental and Cadillac Fleetwood.
Similarly, the two cars with the highest mpg are the Toyota Corolla and Fiat 128.
We can also compute
this using R to prove our observations using the following code snippets, where we
use the order function to sort the mpg values before filtering out the necessary data:
head(mtcars[order(mtcars$mpg),], 2)
This gives us the two cars with the lowest miles per gallon, as seen in the
following snapshot:
To get the two cars with the highest miles per gallon, we use the following
code snippet:
tail(mtcars[order(mtcars$mpg),], 2)
Thus, we see that our observations from the visualization were correct, and we got
the same results from our R code snippets.
We will now plot some simple bar charts for car frequencies that are related to
several attributes in the dataset. The next plot shows us the car counts grouped by
cylinders (cyl) using the following code. You can look up further details, such as
the cex parameters that we use for sizing labels, axes, and titles, by checking the
documentation using the ?barplot command:
# cylinder counts
barplot(table(mtcars$cyl),
col="lightblue",
main="Car Cylinder Counts Distribution",
xlab="Number of Cylinders", ylab="Total Cars",
cex.main = 0.8, cex.axis=0.6,
cex.names=0.6, cex.lab=0.8)
We observe that cars with eight cylinders are the most numerous followed by cars
with four cylinders. Next, we plot a similar bar plot of car counts that are grouped
by gear using the following code:
# gear counts
barplot(table(mtcars$gear),
col="lightblue",
main="Car Gear Counts Distribution",
xlab="Number of Gears", ylab="Total Cars",
cex.main = 0.8, cex.axis=0.6,
cex.names=0.6, cex.lab=0.8)
This gives us the following bar plot, where we observe cars with three gears are most
numerous:
The last simple bar chart will depict car counts that are grouped by the type
of transmission. For this, we relabel the factor variable levels from 0 and 1
to Automatic and Manual first, and then we plot the chart, as shown in the
following code:
# transmission counts
mtcars$am <- factor(mtcars$am, labels=c('Automatic','Manual'))
barplot(table(mtcars$am),
        col="lightblue",
        main="Car Transmission Type",
        xlab="Transmission Type", ylab="Total Cars",
        cex.main = 0.8, cex.axis=0.6,
        cex.names=0.6, cex.lab=0.8)
This gives us the following plot, which clearly depicts that there are more cars with
automatic transmission in our dataset:
We will now visualize the data using some more complex visualizations in this
segment. To start off, let's visualize the car distribution by cylinders as well as
transmission using the following code snippet:
# visualizing cars distribution by cylinders and transmission
counts <- table(mtcars$am, mtcars$cyl)
barplot(counts, main="Car Distribution by Cylinders and Transmission",
xlab="Number of Cylinders", ylab="Total Cars",
col=c("steelblue","lightblue"),
        legend=rownames(counts), beside=TRUE)
This gives us the following plot, where we see that most cars with automatic
transmission have eight cylinders and most cars with manual transmission
have four cylinders:
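We can build a similar grouped bar plot for cylinders and gears (a minimal sketch
along the lines of the previous snippet; the exact styling may differ):
# visualizing cars distribution by cylinders and gears
counts <- table(mtcars$gear, mtcars$cyl)
barplot(counts, main="Car Distribution by Cylinders and Gears",
        xlab="Number of Cylinders", ylab="Total Cars",
        col=c("darkblue","steelblue","lightblue"),
        legend=rownames(counts), beside=TRUE)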
This gives us the following plot where we observe most cars with three gears have
eight cylinders:
Now, we will create a grouped dot plot showing the miles per gallon of various cars
that are grouped by number of cylinders using the following code snippet:
# visualizing car mpg distribution by cylinder
# add a color column within the data frame for plotting
mtcars <- within(mtcars, {
  color <- ifelse(cyl == 4, "coral",
                  ifelse(cyl == 6, "cadetblue", "darkolivegreen"))
})
dotchart(mtcars$mpg, labels=row.names(mtcars),
         groups=mtcars$cyl,
         color=mtcars$color,
         cex=0.7, pch=16,
         main="Miles per Gallon (mpg) of Cars\nby Cylinders",
         xlab = "Miles per Gallon (mpg)")
# remove the color column within the data frame after plotting
mtcars <- within(mtcars, rm("color"))
This gives us the following dot plot, where we observe that cars with four cylinders
tend to have the highest miles per gallon compared to the other cars!
We will now leverage a visualization library, called ggplot2, to plot some boxplots
that display the relationship of mpg with some other car attributes. Here, mpg is our
variable of interest. We will try to perform some statistical inference and regression
modeling later taking mpg as the response variable, which we will try to predict
based on the other attributes of the various cars. The ggplot2 visualization library is
an excellent plotting system for R. It is based on the grammar of graphics, which
helps in building extremely complex plots with minimal effort, and it is often used
to produce publication-quality plots.
We will start by observing car mpg distributions over the number of cylinders using
the following code snippets:
# load visualization dependencies
library(ggplot2)
theme <- theme_set(theme_minimal())
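# Car MPGs by cylinder count visualization (a sketch mirroring the
# transmission-type boxplot shown later; the exact styling may differ)
ggplot(mtcars,
       mapping=aes_string(y = "mpg", x = "cyl")) +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Distribution of Miles per Gallon (mpg)\nby cylinders") +
  geom_boxplot(aes_string(colour="cyl", fill="cyl"), alpha=0.8)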
This gives us the following visualization, where we see that the median mpg of
cars with four cylinders is the maximum, followed by cars with six and eight
cylinders, respectively.
We can also prove that our observations are programmatically correct using the
aggregate function in the following code snippet:
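# insights into average/median mpg of cars by cylinders (a sketch mirroring
# the transmission-type aggregation used later; the exact call may differ)
aggregate(list(mpg=mtcars$mpg),
          list(cylinders=mtcars$cyl),
          FUN=function(mpg) {
            c(avg=mean(mpg),
              median=median(mpg))
          })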
This gives us the following table where we observe that the median values are the
same as the ones that we observed in the plot, and the mean values are also quite
similar to the median:
We will now observe car mpg distributions over transmission type, using the
following code snippet:
# Car MPGs by Transmission type visualization
ggplot(mtcars,
       mapping=aes_string(y = "mpg", x = "am")) +
  xlab("Transmission Type") +
  ylab("Miles per Gallon (mpg)") +
  ggtitle("Distribution of Miles per Gallon (mpg)\nby transmission type") +
  geom_boxplot(outlier.colour = NULL,
               aes_string(colour="am", fill="am"),
               alpha=0.8) +
  stat_summary(geom = "crossbar",
               width=0.70,
               fatten=0.5,
               color="white",
               fun.data = function(x) {
                 return(c(y=median(x),
                          ymin=median(x),
                          ymax=median(x)))
               }) +
  stat_summary(fun.data = function(x) {
                 return(c(y = median(x)*1.03,
                          label = round(median(x), 2)))
               },
               geom = "text",
               colour = "white")
This gives us the following chart where we see the median mpg for cars with manual
transmission is 22.8, which is much higher than 17.3, the median mpg for cars with
automatic transmission:
We can verify these statistics using the aggregate function, as we did earlier using
the following code:
# insights into average/median mpg of cars by transmission
aggregate(list(mpg=mtcars$mpg),
          list(transmission=mtcars$am),
          FUN=function(mpg) {
            c(avg=mean(mpg),
              median=median(mpg))
          })
This gives us the following table, where we clearly observe that the mean and median
mpg for cars with manual transmission are higher than those for automatic
transmission cars:
Based on the preceding data, we will now try to test the hypothesis that the mpg
statistic (mean) is different for cars with manual and automatic transmission, using
the principles of statistical inference, in the next section.
Statistical inference
Statistical inference is the process of inferring or deducing patterns, insights, and
properties of a dataset using methods such as hypothesis testing. In the previous
segment, we used visualizations and aggregations to see that the average miles per
gallon were significantly different for cars with automatic and manual transmission.
We will now use a statistical test to prove this.
We will start off with a null hypothesis (H0) that the difference in mpg means for
automatic and manual transmission cars is zero. Our alternative hypothesis will be
that the difference in means is not zero. As our sample size is quite small, using a
t-test would be appropriate here.
A t-test, often called a Student's t-test, is a statistical hypothesis test that can be used
to determine whether two sets of data are significantly different from each other
using a test statistic, which is the average mpg in our case. An underlying assumption
is that this test statistic follows a normal distribution. We will start by viewing
the data distribution for car miles per gallon using the following code snippet:
# view data distribution
ggplot(mtcars, aes(x=mpg)) +
geom_density(colour="steelblue",
fill="lightblue", alpha=0.8) +
expand_limits(x = 0, y = 0)
This gives us the following distribution of car mpg values in the form of a density
plot, and we see that the distribution is almost perfectly bell shaped, which is
characteristic of a normal distribution.
We will now perform the t-test using the following code snippet:
# t-test
t.test(mpg ~ am, data = mtcars)
This gives us the following output, which clearly shows us that the mean mpg in the
automatic transmission group of cars is much lower as compared to the mean mpg in
the manual transmission group of cars:
We can also visualize the t-test results using some density plots using the
following code:
# visualizing t-test results
aggr <- aggregate(list(mpg=mtcars$mpg),
                  list(transmission=mtcars$am),
                  FUN=function(mpg) { c(avg=mean(mpg)) })
ggplot(mtcars, aes(x=mpg)) +
  geom_density(aes(group=am, colour=am, fill=am),
               alpha=0.6) +
  geom_vline(data=aggr, aes(xintercept=mpg, color=transmission),
             linetype="dashed", size=1)
This gives us the following density plot, where the dashed lines indicate the mean
mpg for manual and automatic transmission cars; these are the exact figures that
we obtained earlier from our t-test and aggregations:
This finalizes our segment about statistical inference. Next, we will focus on the final
section of our analysis, which is related to statistical modeling.
As we have a fairly small number of samples in our dataset, we will train our model
on almost all the samples and try to predict the mpg value for one held-out sample.
First, we will prepare our datasets using the following code:
# prepare datasets
car.to.predict <- mtcars[15, ]
training.data <- mtcars[-15, ]
Next, we will build our first regression model on only the training data using the
following code:
# build initial model
initial_model <- lm(mpg ~ ., data = training.data)
Now, we can view the details of the model we just built using the
summary(initial_model) command, which gives us detailed information regarding
the model, the coefficients of the various variables, and different metrics. You will
observe that the adjusted R-squared value is 0.7832, which indicates that 78.32%
of variation in our response variable (mpg) is explained by our input variables.
The higher this value, the better our model, because it will be able to explain
more of the variability that is observed in the response variable, which we
want to predict.
Now, we will try to build a series of regression models and select the best model
from them on the basis of an evaluation metric called Akaike Information Criterion
(AIC), which we will inspect in detail. The following code snippet
steps through multiple regression models and finally selects the best one:
# best model selection
best_model <- step(initial_model, direction = "both")
This generates an output for a series of steps, and we depict the output of the final
step by selecting the best model in the following snapshot:
We see that wt, qsec, and am were the most important attributes, used as input
variables to build the best regression model, and the AIC value of the model is 60.51,
which is the lowest of all the models that were generated. AIC is used as a measure
to evaluate the quality of various regression models against each other and then
select the best model, that is, the one that minimizes the loss of information with a
minimal number of parameters. We can view the details of the best model using the
following code:
# view model details
summary(best_model)
This gives us the following information, where we observe that the three input
variables that were used to create the final model were wt, qsec, and am. We also notice
that the adjusted R-squared value is 0.8179, which is better than the value that we
obtained in our initial model. We can observe this in the following output snapshot:
We will now look at the car whose mpg value we want to predict using the
following code:
# MPG of car to predict
print(data.frame(car.to.predict=data.matrix(
        list(rownames(car.to.predict),
             car.to.predict[, "mpg"]))),
      row.names = FALSE)
This gives us the following output, showing the actual mpg of the car:
We now predict the value of mpg for this car using our initial model, as follows:
# initial model prediction
predict(initial_model, car.to.predict)
Next, we make another prediction using our best model using the following code:
# best model prediction
predict(best_model, car.to.predict)
We observe that our best model predicts much more effectively than our initial
model: the predicted mpg value for the Cadillac Fleetwood (11.3) is much closer
to its true mpg value (10.4) than the prediction (16.1) that was obtained from the
initial model.
Finally, we will look at some regression model diagnostics and residual plots using
the following code snippet:
# best model diagnostics
par(mfrow = c(2, 2))
plot(best_model)
The points in the Residuals vs. Fitted plot seem to be randomly scattered on
the plot, and they verify homoscedasticity, which indicates the variance of
error is uniform across all x values
The Normal Q-Q plot consists of the points that mostly fall on the line,
indicating that the residuals are almost normally distributed
The Scale-Location plot consists of points that are scattered in a constant
band pattern, indicating constant variance
There are also some distinct points of interest (outliers or leverage points) in
the plots, which we shall discuss next
You may have noticed some specific data points with the names of cars mentioned
in the preceding plots. These points are often known as outliers, and they can be
separated into two types: influential points and leverage points.
Influential points are data points that, if removed from the dataset, change the
parameter estimates of the regression model by a significant amount and cause a
notable change in the computation results by shifting the position of the
regression line. We can compute the influential data points in our dataset using the
following code snippet:
# influence points
influential <- dfbetas(best_model)
tail(sort(influential[, 4]), 4)
The dfbetas function is usually used to find out the extent to which one of these
influential points has affected the estimate of the regression line and corresponding
coefficients. This gives us the top four influential points, as shown in the following
snapshot:
Leverage points are data points that have high or extreme values of the
independent variables, such that they might have a greater ability to move the
regression line, based on their position as compared to the other data points. These
points can also be influential if they fall outside the general pattern of the other data
points, thus greatly affecting the position of the regression line. We will compute the
high leverage data points in our dataset using the following code snippet:
# leverage points
leverage <- hatvalues(best_model)
tail(sort(leverage), 4)
We use the hatvalues function to get the following top four high-leverage points,
which are depicted in the following snapshot:
You will notice that several of these cars are depicted in the diagnostic plots of our
model, which we saw earlier.
This brings us to the end of our data analysis process. By now, you have seen the
benefits of exploratory analysis, visualizations, and modeling to get interesting
insights from data.
R Cheat Sheets
We just learned about the CRISP-DM model while utilizing and understanding
various R constructs and libraries to solve a real-world problem. This concluding
section presents various utilities, tricks, and techniques in the form of cheat sheets
to facilitate a quick look-up.
In this section, we will present cheat sheets that are organized in the following manner:
Data handling
To extract and load data for any kind of analysis, R provides pretty powerful and
easy-to-use utility functions. Some commonly used examples are as follows:
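# reading and writing data (a representative sketch; file paths are illustrative)
df <- read.csv("data.csv")                           # comma-separated files
df <- read.table("data.txt", header=TRUE, sep="\t")  # delimited text files
lines <- readLines("notes.txt")                      # raw lines of text
write.csv(df, "out.csv", row.names=FALSE)            # write a CSV file
saveRDS(df, "df.rds")                                # serialize a single R object
df <- readRDS("df.rds")                              # read it back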
numeric (integer and double), character, and logical: These are the basic data
types that are available in R
factor: This allows you to store categorical data
complex: This data type is used for complex numbers
is.<data_type> and as.<data_type>: These are used to check data types
and perform type conversion, respectively
length(<variable>): This gives you the number of elements in a variable
(use nchar() to count the characters in a string)
Data structures
R provides many data structures out of the box, which we discuss in the
following subsections.
Vectors
This is the most basic data structure in R. It is similar to a mathematical vector. The
following are ways to interact with a vector in R:
r[1]: This allows you to access elements using square braces. The element
count begins from 1.
r[ x > 100 ]: Vectors support logical expressions as indices.
r[5:10]: Vectors support subselection. The given example returns the
vector values between the indices 5 and 10.
r[-1]: This returns all the elements except the one at index 1.
factor(x): This converts the vector x to a factor.
which.max(x) and which.min(x): These return the index of the maximum and
minimum element of x, respectively.
rev(x): This reverses the elements of x.
table(x): This gives you the frequency table for the elements of the x vector.
match(a,b): This returns the positions of the elements of a in b, or NA for
elements that do not exist in b.
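For instance:
# vector indexing and utilities in action
r <- c(12, 7, 150, 98)
r[1]                     # 12 (indexing starts at 1)
r[r > 100]               # 150
r[2:3]                   # 7 150
r[-1]                    # 7 150 98
which.max(r)             # 3, the index of the maximum value
rev(r)                   # 98 150 7 12
table(c("a", "b", "a"))  # frequency table: a appears twice, b once
match(c(7, 5), r)        # 2 NA (positions of 7 and 5 in r)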
Lists
A list is an ordered collection of named or unnamed objects, which may or may not
be homogeneous. Lists are recursive data structures; that is, a list's element can itself
be a list. A list can be manipulated using utilities such as the following:
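For example (a small sketch):
# creating and manipulating a list
l <- list(name="mtcars", rows=32, tags=c("cars", "1974"))
l$name                   # access an element by name: "mtcars"
l[["rows"]]              # double brackets also access by name: 32
l[1]                     # single brackets return a sub-list
names(l)                 # "name" "rows" "tags"
l$rows <- NULL           # remove an element
nested <- list(inner=l)  # lists are recursive: a list inside a list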
Data frames
Data frames are tabular structures that can have columns of different data types and
attributes. A data frame may contain components of the numeric, character, factor,
or list types, or it may contain other data frames. The following utilities help in
manipulating data frames:
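For example (a representative sketch):
# creating and manipulating a data frame
df <- data.frame(car=c("Fiat 128", "Valiant"),
                 mpg=c(32.4, 18.1),
                 stringsAsFactors=FALSE)
nrow(df); ncol(df)        # dimensions: 2 rows, 2 columns
df$mpg                    # access a column as a vector
df[df$mpg > 20, ]         # filter rows by a condition
df$kpl <- df$mpg * 0.425  # add a derived column (kilometres per litre)
head(df, 1)               # peek at the first row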
General utilities
Apart from the utilities and the other constructs that we just discussed, R provides a
rich set of general utilities to make data analysis even easier. Check out the following
utilities:
c(1:5): This is a generic function that combines values into a vector. The given
example generates a vector with the values 1 to 5.
rep(<value>,<count>): This generates a vector with the <value> element
repeated <count> times.
seq(from,to): This generates a sequence vector starting with from and ending
with to. You can also specify an increment via by; the default is 1.
sort(c(10,9,8,7)): This returns the sorted vector 7,8,9,10.
order(c(10,9,1,2)): This returns the indices that sort the vector in ascending
order, that is, 3,4,2,1.
rank(c(10,5,6,9)): This returns the rank order of the elements, that is, 4,1,2,3.
summary(<object>): This gives summary details, such as min, max, mean,
median, and so on, for the object.
choose(n,k): This returns the number of ways to choose k elements from n
(the binomial coefficient).
na.omit(x): This drops all the missing values (NAs) from x.
na.fail(x): This raises an error if x contains even a single missing value.
unique(x): This returns only the distinct or unique values of x. This works with
vectors and data frames.
paste(...): This converts objects to strings and concatenates them.
Math and modeling
R also ships with a comprehensive set of utility functions for mathematical
operations and statistical modeling, along with a rich ecosystem of packages
that extends them.
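The following sketch illustrates a few of the most commonly used ones (a
representative sample, not an exhaustive list):
# math and modeling utilities: a quick sketch
x <- mtcars$wt; y <- mtcars$mpg
mean(y); median(y); sd(y); var(y)  # central tendency and spread
cor(x, y)                          # correlation between weight and mpg
quantile(y)                        # quantiles of mpg
model <- lm(y ~ x)                 # fit a simple linear regression
summary(model)                     # coefficients, R-squared, and so on
predict(model, data.frame(x=3.5))  # predict mpg for a 3,500 lb car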
Plotting
Statistical analysis and data science are way too difficult without graphs and
visualization. R has a rich set of utilities and libraries for plotting. Let's have a look
at a few of these:
plot(y): This plots the values of y on the y axis ordered by indices on the
x axis.
plot(x,y): This plots values on the x and y axis, respectively.
barplot(x): This is a bar plot of the values of x.
hist(x): This is a histogram of frequencies of the elements of x.
pie(x): This is a pie chart for the elements of x.
boxplot(x): This is a boxplot for the elements of x.
plot.ts(x): This is a plot with respect to time.
mosaicplot(x): This is a mosaic plot of a contingency table (it can also shade
the residuals of a log-linear model).
contour(x,y,z): This is a contour plot of x and y, where x and y must be
vectors and z must be a matrix of dimension length(x) by length(y).
qqplot(x,y): This is a quantile-quantile plot of y with respect to x.
abline(a,b): This draws a line with intercept a and slope b. This can also be
used to draw horizontal, vertical, and regression lines.
rect(x1,y1,x2,y2): This draws a rectangle, based on the bottom-left (x1,y1)
and top-right (x2,y2) coordinates.
polygon(x,y): This draws a polygon, connecting the points given by x and y.
xlim,ylim: These parameters set the x and y limits of a graph.
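For instance, a few of these in action:
# a quick base-graphics sketch using mtcars
hist(mtcars$mpg, main="MPG Histogram", xlab="Miles per Gallon")
boxplot(mtcars$mpg, main="MPG Boxplot")
plot(mtcars$wt, mtcars$mpg, xlab="Weight (1000 lbs)", ylab="MPG")
abline(lm(mpg ~ wt, data=mtcars))  # overlay the fitted regression line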
Plotting packages
Beyond base graphics, popular plotting packages include ggplot2 (a grammar of
graphics based plotting system), lattice (trellis graphics for multivariate data),
and plotly (interactive, web-based charts).
Summary
Using this guide, we went on a journey from the origins of R to using it to analyze
the mtcars dataset available in R. Throughout the guide, we learned about the R
ecosystem, its tools and services, and we understood the basic constructs of R along
with the CRISP-DM data analysis model or life cycle to perform different analyses.
We performed exploratory analysis, and we went on to draw relationships and
insights using various packages for regression modeling and visualization.
We also looked at the model evaluation methods for such models. We concluded
the guide by listing neat tips and tricks, along with popular and standard sets of
packages and utilities, for quick reference.
What to do next?
Broaden your horizons with Packt
If you're interested in R, then you've come to the right place. We've got a diverse
range of products that should appeal to budding as well as proficient specialists in
the field of R.
To learn more about R and find out what you want to learn next, visit the R technology
page at https://www.packtpub.com/tech/r
If you have any feedback on this eBook, or are struggling with something we haven't
covered, let us know at [email protected].
Get a 50% discount on your next eBook or video from www.packtpub.com using
the code: