R for Data Analysis
Trevor French
11/13/22
Table of contents
I Introduction
Prerequisites
Structure of the Book
License
About Me
1 What is R?
1.1 History
1.2 Resources
3 Setup
3.1 Install R
3.2 Install R Studio
3.3 Alternatives
3.3.1 Posit Cloud
3.3.2 Replit
3.3.3 Kaggle
3.4 Resources
II Part I: Fundamentals
4 Getting Familiar with RStudio
4.1 Customization
4.2 Source Pane
4.3 Console
4.4 Environment
4.5 Files
4.6 Resources
5 Programming Basics
5.1 Executing Code
5.1.1 Console
5.1.2 Script
5.2 Comments
5.3 Variables
5.4 Operators
5.4.1 Arithmetic Operators
5.4.2 Comparison Operators
5.4.3 Logical Operators
5.4.4 Assignment Operators
5.4.5 Miscellaneous Operators
5.5 Functions
5.6 Loops
5.6.1 While Loops
5.6.2 For Loops
5.7 Conditionals
5.8 R packages
5.9 Resources
6 Data Types
6.1 Numeric
6.1.1 Double
6.1.2 Integer
6.2 Complex
6.3 Character
6.4 Logical
6.5 Raw
6.6 Resources
7 Data Structure
7.1 Vectors
7.2 Lists
7.3 Matrices
7.4 Factors
7.5 Data Frames
7.6 Arrays
7.7 Resources
Exercises
Questions
Answers
Exercises
Questions
Answers
13 Outliers
13.1 Finding Outliers Visually
13.1.1 Scatter Plot
13.1.2 Box Plot
13.1.3 Histogram
13.1.4 Density Plot
13.2 Finding Outliers Statistically
13.2.1 Standard Deviation
13.3 Removing Outliers
13.4 Resources
Exercises
Questions
Answers
16 Regression
16.1 Linear Regression
16.2 Multiple Regression
16.3 Logistic Regression
16.4 Resources
17 Plotting
17.1 Plotting your Regression Model
17.2 Plots Available in Base R
17.2.1 Box Plot
17.2.2 Plot Matrix
17.2.3 Pie Chart
17.2.4 Bar Plot
17.2.5 Histogram
17.2.6 Density Plot
Exercises
Questions
Answers
19 R Markdown
19.1 Format Options
19.2 HTML Document Example
19.3 R Notebook
19.4 Resources
20 R Shiny
20.1 Quickstart
20.2 Basic Components of a Shiny Application
20.2.1 Libraries
20.2.2 UI
20.2.3 Server
20.2.4 Putting it Together
20.3 Deploying Application
20.3.1 ShinyApps.io
20.3.2 Configuring Account
20.4 Resources
Exercises
Questions
Answers
References
Part I
Introduction
Prerequisites
No prior knowledge is required to begin this book. The content will start at
the very beginning by showing you how to set up your R environment and the
basics of programming in R. By the end of the book, you will be able to perform
intermediate analytics techniques such as linear regression and automatic report
generation.
You will need an environment in which to run your code. It is recommended
that you install R and RStudio locally for this purpose. This book will walk
you through how to do that and offers alternatives if a local installation
is not an option for you.
License
This work is free to use, and is licensed under a Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License.
About Me
I have an M.S. in Data Analytics, a B.S. in Business Analytics, and currently
work in industry as an Analytics Manager for a software company. I began
my journey into analytics by working as a Data Analyst for the university I
was attending. This role allowed me to automate processes, build dashboards,
deliver reports to executive stakeholders, and provide insight on how operations
might be improved. I performed this role until I was promoted to lead the
team. Later, I worked for a major CPG company driving pricing and promotion
strategy for a large piece of the business.
Despite my education, most of my basic analytics knowledge was hard-won
through self-study. I created this resource to be what I wish I had when I
started my journey into the analytics domain. Additionally, I don’t believe that
one must be a domain expert to be effective at analyzing data. In fact, I think
most people can quickly learn the skills necessary to be very effective at it.
Physical copies of this book are not currently available; however, you can
download a PDF in the top left corner of this site. Feel free to contribute by reporting a
typo or opening a pull request at https://github.com/TrevorFrench/R-for-Data-Analysis.
Chapter 1
What is R?
1.1 History
R was built by Ross Ihaka and Robert Gentleman at the University of Auckland
and was first released in 1993.
Robert Gentleman and Ross Ihaka “both had an interest in statistical comput-
ing and saw a common need for a better software environment in [their] Macin-
tosh teaching laboratory. [They] saw no suitable commercial environment and
[they] began to experiment to see what might be involved in developing one
[them]selves.” (Ihaka 1998)
While R was officially first released in 1993, it wasn’t until 1995 that Ross Ihaka
and Robert Gentleman were convinced by Martin Mächler to release the source
code freely (Ihaka 1998).
1.2 Resources
• You can learn more about R here: https://www.r-project.org/
• Read Ross Ihaka’s account of R’s origination: https://www.stat.auckland.ac.nz/~ihaka/downloads/Interface98.pdf
Chapter 2
What is Data Analysis?
Data analysis, in its simplest form, is the process of searching for meaning in
data with the ultimate goal of drawing insight from that meaning.
2. Data Acquisition - As you might imagine, you must acquire your data
before conducting an analysis. This may be done through methods such as
manual creation of datasets, importing pre-constructed data, or leveraging
APIs.
3. Data Preparation - Most data will not be received in the precise format
you need to begin your analysis. The process of data preparation involves
structuring and adding features to your data.
2.2 Resources
• “Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing
and Presenting Data” by EMC Education Services: https://onlinelibrary.wiley.com/doi/book/10.1002/9781119183686
• “Managing the Analytics Life Cycle for Decisions at Scale” by SAS:
https://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/manage-analytical-life-cycle-continuous-innovation-106179.pdf
Chapter 3
Setup
This chapter will walk you through downloading the R programming language
as well as RStudio, a popular tool for interacting with the R ecosystem.
Alternatives to RStudio are listed at the end of the chapter; however,
RStudio is the recommended environment for completing this book.
3.1 Install R
Before you do anything, you’ll need to download R. This download will allow
your computer to interpret the R code you write later on.
1. Download R from “R: The R Project for Statistical Computing” (https://www.r-project.org/)
2. Select “download R”
3. Choose any link but preferably the one closest to your physical location
6. Press “download”
7. Open installer
8. Follow the prompts and leave all options set as their default values
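Once the installers have finished, one quick way to confirm that R works is to run a couple of commands in the console. This is only a minimal sanity check under the assumption that a default installation succeeded; it is not part of the installation steps themselves.

```r
# Print the version string of the R installation that is currently running
print(R.version.string)

# Run a small calculation to confirm that the interpreter evaluates code
print(3 + 2)
```

If both lines print a result without an error, your environment is ready for the rest of the book.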
3.3 Alternatives
3.3.1 Posit Cloud
Posit Cloud offers users a way to replicate the full RStudio experience without
having to download or set anything up on your personal computer. You can
sign up for a free account here:
3.3.2 Replit
Replit allows users to code in 50+ languages in the browser. While you won’t
be able to follow along with the RStudio specific examples, you will be able to
run R code. You can sign up for a free account here:
3.3.3 Kaggle
Kaggle is one of the most popular sites for data analysts to compete in data
competitions, find data, and discuss data topics. They also have a feature that
allows you to write and run R (and Python) code. You can sign up for a free
account here:
3.4 Resources
• “R Installation and Administration” by the R Core Team: https://cran.r-project.org/doc/manuals/r-release/R-admin.html
Part II
Part I: Fundamentals
This section will introduce you to the basics of programming in the context of R.
There are four chapters in this section. Each chapter has a brief description listed
below. After you have finished reading through each of them, you will have
the opportunity to attempt practical exercises to reinforce your newly gained
knowledge.
Chapter 4
Getting Familiar with RStudio
4.1 Customization
You are able to customize how your version of RStudio looks by following these
steps:
3. Choose ‘Appearance’ and select your favorite theme from the ‘Editor
Theme’ section
4. Press ‘Apply’
There are other customization options available as well. Feel free to explore
the “Global Options” section to make your version of RStudio your own.
4.2 Source Pane
If you don’t see the source pane, you may need to create a new R script by
pressing “Ctrl + Shift + N” (“Cmd + Shift + N” on Mac) or by selecting “R
Script” from the “New File” dropdown in the top left corner.
a. Show in New Window- This allows you to pop the source pane into a
new window by itself.
b. Save Current Document- This saves the file contained in the tab you
currently have active.
c. Source on Save- Automatically sources your file every time you hit save.
“Sourcing” is similar to “Running” in the sense that both will execute your
code; however, sourcing will execute your saved file rather than copying
lines of code into the console.
d. Find/Replace- This feature allows you to find and replace specified text,
similar to find and replace features in other tools such as Excel.
e. Code Tools- This brings up a menu of options which help you to code
more efficiently. Some of these tools include formatting your code and
help with function definitions.
f. Compile Report- This allows you to compile a report directly from an R
script without needing to use additional frameworks such as R Markdown.
g. Run Current Selection- This allows you to highlight a portion of your
code and run only that portion.
h. Re-run Previous Code Region- This option will execute the last sec-
tion of code that you ran.
i. Go to Previous/Next Section/Chunk- These up and down arrows
allow you to navigate through sections of your code without needing to
scroll.
j. Source Contents- This option will save your active document if it isn’t
already saved and then source the file.
k. Outline- Pressing this option will pop open an outline of your current
file.
l. Adjust Frame Size- These two options will adjust the size of the source
pane inside of RStudio.
4.3 Console
The console pane is the bottom left pane in RStudio. This pane has three tabs:
“Console”, “Terminal”, and “Background Jobs”.
• The “Console” tab is where you will be able to run R code directly without
writing a script (this will be covered in the next chapter).
• The “Terminal” tab gives you access to your computer’s system shell. The
shell it uses can be changed in the global options.
• The “Background Jobs” tab is where you can start and manage processes
that need to run behind the scenes.
4.4 Environment
The environment pane is the top right pane in RStudio. This is where you will
manage all things related to your development environment. This pane has four
tabs: “Environment”, “History”, “Connections”, and “Tutorial”.
• The “Environment” tab will display all information relevant to your cur-
rent environment. This includes data, variables, and functions. This is
also the place where you can view and manage your memory usage as well
as your workspace.
• The “History” tab allows you to view the history of your executed code.
You can search through these commands and even select and re-execute
them.
• The “Connections” tab is where you can create and manage connections
to databases.
4.5 Files
The files pane is the bottom right pane in RStudio. This pane has six tabs:
“Files”, “Plots”, “Packages”, “Help”, “Viewer”, and “Presentation”.
• The “Files” tab is a file explorer of sorts. You can view the contents of a
directory, navigate to new directories, and manage files here.
• The “Plots” tab is where the output of your generated plots will show up.
You can also export your plots from this tab.
• The “Packages” tab allows you to view all available packages within your
environment. From this tab, you can read more about each package as
well as update and access packages.
• The “Help” tab allows you to search for information about functions,
including examples, descriptions, and available parameters.
• The “Viewer” tab is where certain types of content, such as Quarto
documents, will be displayed when rendered.
• The “Presentation” tab is similar to the “Viewer” tab except the content
type will be presentations.
4.6 Resources
• “Editing and Executing Code in the RStudio IDE” from the RStudio Support team: https://support.rstudio.com/hc/en-us/articles/200484448-Editing-and-Executing-Code
• “Code Folding and Sections in the RStudio IDE” from the RStudio Support team: https://support.rstudio.com/hc/en-us/articles/200484568-Code-Folding-and-Sections-in-the-RStudio-IDE
• “Keyboard Shortcuts in the RStudio IDE” from the RStudio Support team: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts-in-the-RStudio-IDE
• “Navigating Code in the RStudio IDE” from the RStudio Support team: https://support.rstudio.com/hc/en-us/articles/200710523-Navigating-Code-in-the-RStudio-IDE
Chapter 5
Programming Basics
This chapter will walk you through executing code and writing scripts in R. You
will then build upon that knowledge by learning about comments, variables,
operators, functions, loops, conditionals, and libraries. While this chapter is
titled “Programming Basics”, the knowledge you will have learned by the end
of this chapter is enough for you to accomplish a huge variety of tasks.
5.1 Executing Code
You can run R code in two ways:
• in the console
• in a script
5.1.1 Console
The first way to run code is directly in the console. If you’re working in RStudio,
you will access the console through the “console” pane.
In the following example, the text “print(3+2)” is typed into the console. The
user then presses Enter and sees the result: “[1] 5”.
print(3+2)
[1] 5
You may be wondering what “[1]” represents. It is the index of the first value
printed on that line of output; here the result contains a single value, so the line
starts at index 1. It can be ignored for most practical purposes. Additionally, most
of the examples in this book will be structured in this way: formatted code
immediately followed by the code output.
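To see what the bracketed prefix refers to, it can help to print a result that is too long to fit on one line. The sketch below is an illustration that is not part of the original example: each output line starts with the position of its first value, and the exact places where the lines wrap depend on the width of your console.

```r
# Print forty values; the output wraps onto several lines, and each line
# begins with the index (position) of the first value shown on that line
print(101:140)
```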
5.1.2 Script
You likely will be using scripts most of the time when working in R. A script
is just a file that allows you to type out longer sequences of code and execute
them all at once.
For those of you following along in RStudio, you can create a script by pressing
“Ctrl + Shift + N” on Windows or by selecting “R Script” from the “New File”
dropdown in the top left corner.
From here you can type the same command from before into the source pane.
Next, you’ll want to save your file by pressing “Ctrl + S” on Windows or by
selecting “Save” from the “File” dropdown in the top left corner. Now just give
your file a name and your file will automatically be saved as a “.R” file.
Finally, run your newly created R script by pressing the “source” button.
5.2 Comments
Comments are present in most (if not all) programming languages. They allow
the user to write text in their code that isn’t executed or read by computers.
Comments can serve many purposes such as notes, instructions, or formatting.
Comments are created in R by using the “#” symbol. Here’s an example:
# This is a comment
print(3+2)
[1] 5
Some programming languages offer a “bulk-comment” feature that lets
you quickly comment out multiple consecutive lines of text. R has no such
syntax: each commented line must begin with a “#” symbol, as such:
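A minimal sketch of a multi-line comment follows. The wording of the comments is an assumption made for illustration; only the final line is executed.

```r
# This is the first line of a comment
# This is the second line of a comment
# Every commented line needs its own "#" symbol
print(3+2)
```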
[1] 5
Comments don’t have to start at the beginning of a line. You are able to start
comments anywhere on a line like in this example:
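As a small sketch (the comment text is again illustrative), everything to the right of the “#” symbol is ignored, while the code to its left still runs:

```r
print(3+2) # This comment starts after the code on the same line
```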
[1] 5
5.3 Variables
Variables are used in programming to give a name to a value. In the following
example we have a variable named “rate” which is equal to 15, a variable named
“hours” which is equal to 4, and a variable named “total_cost” which is equal
to rate * hours.
rate <- 15
hours <- 4
total_cost <- rate * hours
print(total_cost)
[1] 60
5.4 Operators
An operator is a symbol that allows you to perform an action or define some
sort of logic. The following image demonstrates the operators that are available
to you in R.
5.4.1 Arithmetic Operators
3 + 3
[1] 6
3 - 3
[1] 0
3 * 3
[1] 9
3 ^ 3
[1] 27
10 / 7
[1] 1.428571
10 %% 7
[1] 3
10 %/% 7
[1] 1
5.4.2 Comparison Operators
3 == 3
[1] TRUE
3 != 3
[1] FALSE
3 > 3
[1] FALSE
3 < 3
[1] FALSE
3 >= 3
[1] TRUE
3 <= 3
[1] TRUE
5.4.3 Logical Operators
[1] TRUE
This example is the same as the previous one with the exception that we have
negated the second condition with a “NOT” operator.
[1] FALSE
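The “AND” and “NOT” examples described above, together with their “OR” counterparts, can be sketched as follows. The specific comparisons are assumptions chosen for illustration; in order, the four statements print TRUE, FALSE, TRUE, and TRUE.

```r
# AND: TRUE only when both conditions are TRUE
print(3 == 3 & 4 > 2)

# AND with the second condition negated by the NOT operator
print(3 == 3 & !(4 > 2))

# OR: TRUE when at least one condition is TRUE
print(3 == 3 | 4 > 2)

# OR with the second condition negated; still TRUE because the first holds
print(3 == 3 | !(4 > 2))
```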
The following two examples are essentially the same as the first two except that
we are using “OR” operators rather than “AND” operators.
[1] TRUE
5.4.4 Assignment Operators
global_var
[1] "global"
local_var
[1] "local"
Next, let’s create a function to test out the global assignment operator (“<<-”).
Inside this function, we will assign a new value to both of the variables we just
created; however, we will use the “<-” operator for the local_var and the “<<-”
operator for the global_var so that we can observe the difference in behavior.
Note
Functions are covered directly after this section. If the concept of functions
is unfamiliar to you, feel free to jump ahead and come back later.
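A function consistent with this description could look like the sketch below. The two starting assignments and the function body are assumptions reconstructed from the surrounding text and from the output that follows; only the names “global_var”, “local_var”, and “my_function” come from the example itself.

```r
# Create the two variables that the function will try to change
global_var <<- "global"
local_var <- "local"

my_function <- function() {
  # "<-" creates a new local_var that exists only inside this function
  local_var <- "na"
  # "<<-" changes the global_var that was defined outside this function
  global_var <<- "na"
  print(global_var)
  print(local_var)
}
```

Calling the function prints “na” twice, because both names resolve to “na” while the function is running.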
my_function()
[1] "na"
[1] "na"
This function performs how you would expect it to intuitively, right? The
interesting part comes next when we print out the values of these variables
again.
global_var
[1] "na"
local_var
[1] "local"
From this result, we can see the difference in behavior caused by the differing
assignment operators. When using the “<-” operator inside the function, its
scope is limited to just the function that it lives in. On the other hand, the “<<-”
operator has the ability to edit the value of the variable outside of the function
as well.
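You may now be wondering why both the local and the global assignment operators
come in two forms: a leftward form (“<-” and “<<-”) and a rightward form
(“->” and “->>”). The following example demonstrates that the direction of the
arrow only changes which side the variable name is written on.
x <- 3
3 -> y
print(x)
[1] 3
print(y)
[1] 3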
There is also a third assignment operator that can be used: “=”. You will
generally use the local assignment operator; however, you may notice that the
“=” operator is used within certain functions as you progress. You can find
more information about these three operators in the resources section.
5.4.5 Miscellaneous Operators
3 %in% 1:3
[1] TRUE
x <- matrix(
c(1,3,3,7)
, nrow = 2
, ncol = 2
, byrow = TRUE)
x %*% x
     [,1] [,2]
[1,]   10   24
[2,]   24   58
5.5 Functions
Functions allow you to bundle a predefined set of operations into one command.
The syntax of functions in R is as follows.
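As a minimal sketch, a function is created by assigning the result of the “function” keyword to a name, listing its arguments in parentheses, and placing the body in curly braces. The name “add_numbers” appears later in this section; the argument names and the values passed in the call are assumptions chosen to match the output shown next.

```r
# General form: function_name <- function(arguments) { body }
add_numbers <- function(x, y) {
  # Print the sum of the two arguments
  print(x + y)
}

# Call the function with two numbers
add_numbers(2, 3)
```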
[1] 5
add_numbers(50, 50)
[1] 100
Finally, you can return a value from a function as such:
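One way to write such a function is sketched below. The function name “calculate_raise”, its arguments, and the salary figures are hypothetical; they were picked so that both results equal the 4500 printed in the output that follows.

```r
# Hypothetical helper: compute a raise from a salary and a percentage
calculate_raise <- function(salary, percent_increase) {
  raise <- salary * percent_increase
  # Hand the computed value back to the caller instead of printing it
  return(raise)
}

# Store the returned values in variables for later use
johns_raise <- calculate_raise(90000, 0.05)
janes_raise <- calculate_raise(45000, 0.10)
```

Because the function returns its result rather than printing it, the value can be saved in a variable and reused.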
print("John's Raise:")
print(johns_raise)
[1] 4500
print("Jane's Raise:")
print(janes_raise)
[1] 4500
5.6 Loops
There are two types of loops in R: while loops and for loops.
# Set i equal to 1
i <- 1
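Starting from this value, a while loop that produces the output below can be sketched as follows. The loop condition is an assumption inferred from that output: the body repeats for as long as the condition in parentheses remains TRUE.

```r
# Set i equal to 1
i <- 1

# Keep looping while i is less than or equal to 3
while (i <= 3) {
  print(i)
  # Increase i by 1 so that the condition eventually becomes FALSE
  i <- i + 1
}
```

Without the line that increases “i”, the condition would never become FALSE and the loop would run forever.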
[1] 1
[1] 2
[1] 3
Additionally, you can add ‘break’ statements to while loops to stop the loop
early.
i <- 1
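A loop with a “break” statement that matches the output below might look like this sketch. The upper bound of 10 is an assumption suggested by the “Stopping halfway” message.

```r
i <- 1

# The condition allows ten iterations, but the loop stops early
while (i <= 10) {
  print(i)
  if (i == 5) {
    print("Stopping halfway")
    # Leave the loop immediately, skipping the remaining iterations
    break
  }
  i <- i + 1
}
```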
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] "Stopping halfway"
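A for loop runs its body once for each element of a vector. The sketch below would print the two names shown next; the variable names “people” and “person” are assumptions made for illustration.

```r
# A character vector to loop over
people <- c("jane", "john")

# The body runs once per element, with "person" set to that element
for (person in people) {
  print(person)
}
```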
[1] "jane"
[1] "john"
5.7 Conditionals
You are also able to execute a command if a condition is met by using “if”
statements.
if (2 > 0) {
print("true")
}
[1] "true"
if (2 > 3) {
print("two is greater than three")
} else if (2 < 3) {
print("two is not greater than three")
}
x <- 20
if (x < 20) {
print("x is less than 20")
} else if (x > 20) {
print("x is greater than 20")
} else {
print("x is equal to 20")
}
5.8 R packages
Packages allow you to access functions other people have created and shared
in a standard format, for example via the Comprehensive R Archive Network (CRAN),
Bioconductor, R-universe, or GitHub repositories.
To access a package’s functionality, you first have to add it to your system’s
library. Afterward, you can check it out for use in your current session with the
library() command.
In this example, we will be installing and loading a common package named
“dplyr”.
You first retrieve it from CRAN with the following command.
install.packages("dplyr")
Next, you make it available in your R session with the library() command.
(Alternatively, you can also use the require() command.)
library(dplyr)
You are now able to access all of the functions available in the dplyr library!
Sometimes users in the R community create their own packages that aren’t
distributed through CRAN. You can still use these packages, but
you’ll have to perform an extra step or two. One of the most common
places to host packages is GitHub. The following example will demonstrate how
to install and load a package that I created from GitHub.
First you’ll need to install the “remotes” package. As the name might suggest,
this package allows you to access other packages from remote locations.
install.packages("remotes")
Next you’ll need to install the remote package of your choosing. In our case,
we’ll execute the following code.
remotes::install_github("TrevorFrench/trevoR")
In the previous example, we used the “install_github” function from the “remotes”
package and then specified the GitHub path of the remote repository by
typing “TrevorFrench/trevoR”. This code serves the same purpose as the
“install.packages” function. You may have noticed a new piece of syntax though.
The “::” in between “remotes” and “install_github” tells R to use the
“install_github” function from the “remotes” package without first attaching
the package via the “library” function. This syntax can be used with any
exported function from any installed package.
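To try the “::” syntax without installing anything, you can use packages that ship with R. The sketch below uses the “stats” and “utils” packages purely as an illustration; the input values are arbitrary.

```r
# Call median() from the stats package by naming the package explicitly
stats::median(c(1, 3, 3, 7))

# Call head() from the utils package to get the first three letters
utils::head(letters, 3)
```

Writing the package name in front of the function also makes it clear to readers where a function comes from, which helps when two packages export functions with the same name.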
Now that the remote package is installed, we can require it in the same way we
would any other package.
library(trevoR)
5.9 Resources
• W3 Schools R Tutorial: https://www.w3schools.com/r/
• Assignment Operators: https://stat.ethz.ch/R-manual/R-devel/library/base/html/assignOps.html
Chapter 6
Data Types
6.1 Numeric
6.1.1 Double
Let’s explore the “double” data type by assigning a number to a variable and
then check its type by using the “typeof” function. Alternatively, we can use
the “is.double” function to check whether or not the variable is a double.
x <- 6.2
typeof(x)
[1] "double"
is.double(x)
[1] TRUE
Next, let’s check whether or not the variable is numeric by using the “is.numeric”
function.
is.numeric(x)
[1] TRUE
This function should return “TRUE” as well, which demonstrates the fact that
a double is a subset of the numeric data type.
6.1.2 Integer
Let’s explore the “integer” data type by assigning a whole number followed by
the capital letter “L” to a variable and then check its type by using the “typeof”
function. Alternatively, we can use the “is.integer” function to check whether
or not the variable is an integer.
x <- 6L
# By using the "typeof" function, we can check the data type of x
typeof(x)
[1] "integer"
is.integer(x)
[1] TRUE
Next, let’s check whether or not the variable is numeric by using the “is.numeric”
function.
is.numeric(x)
[1] TRUE
This function should return “TRUE” as well, demonstrating that an integer is
also a subset of the numeric data type.
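The difference between the two numeric types can be seen side by side in the following sketch. The values are arbitrary; the point is that a bare number is stored as a double unless the “L” suffix is used.

```r
# A number written without a suffix is stored as a double
typeof(6)

# The L suffix asks R to store the number as an integer
typeof(6L)

# Both types count as numeric
is.numeric(6)
is.numeric(6L)

# Dividing two integers with "/" produces a double
typeof(6L / 2L)
```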
6.2 Complex
Complex data types make use of the mathematical concept of an imaginary
number through the use of the lowercase letter “i”. The following example sets
“x” equal to “6i” and then checks its type.
x <- 6i
typeof(x)
[1] "complex"
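A complex number can also have a real part. As a short sketch with arbitrary values, the base functions “Re” and “Im” extract the two parts, and multiplying an imaginary number by itself gives a negative real result.

```r
# A complex number with a real part of 3 and an imaginary part of 6
z <- 3 + 6i

# Extract the real and imaginary parts
Re(z)
Im(z)

# 6i multiplied by 6i equals -36, because i squared is -1
6i * 6i
```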
6.3 Character
Character data types store text data. When creating characters, make sure you
wrap your text in quotation marks.
x <- "Hello!"
typeof(x)
[1] "character"
6.4 Logical
Logical data types store either “TRUE” or “FALSE”. Unlike characters, these
data should not be wrapped in quotation marks.
x <- TRUE
typeof(x)
[1] "logical"
6.5 Raw
Used less often, the raw data type will store data as raw bytes. You can convert
character data types to raw data types by using the “charToRaw” function.
Similarly, you can convert integer data types to raw data types through the use
of the “intToBits” function.
x <- charToRaw("Hello!")
print(x)
[1] 48 65 6c 6c 6f 21
typeof(x)
[1] "raw"
x <- intToBits(6L)
print(x)
[1] 00 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[26] 00 00 00 00 00 00 00
typeof(x)
[1] "raw"
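Both conversions can be reversed. The sketch below, which reuses the values from the examples above, uses the base functions “rawToChar” and “packBits” to get back the original text and integer; the variable names are assumptions.

```r
# Convert text to raw bytes and back again
raw_bytes <- charToRaw("Hello!")
rawToChar(raw_bytes)

# View the same bytes as decimal numbers instead of hexadecimal
as.integer(raw_bytes)

# Convert an integer to its 32 bits and pack them back into an integer
bits <- intToBits(6L)
packBits(bits, type = "integer")
```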
6.6 Resources
• W3 Schools: https://www.w3schools.com/r/r_data_types.asp
• “Advanced R” by Hadley Wickham: https://adv-r.hadley.nz/vectors-chap.html#atomic-vectors
• “Bits and Bytes” from Stanford CS 101: https://web.stanford.edu/class/cs101/bits-bytes.html
Chapter 7
Data Structure
In computer science, a data structure refers to the method one uses to
organize data. Six basic data structures are commonly used in R:
• Vectors
• Lists
• Matrices
• Factors
• Data Frames
• Arrays
7.1 Vectors
We can create a vector by using the “c” function to combine multiple values
into a single vector. In the following example, we will combine four separate
numbers into a single vector and then output the resulting vector to see what it
looks like.
x <- c(1, 3, 3, 7)
print(x)
[1] 1 3 3 7
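A few common operations on this vector are sketched below as an illustration. Note that positions start at 1 in R, and that every element of a vector shares one data type, so mixing types causes R to convert them to a common type.

```r
x <- c(1, 3, 3, 7)

# Access the second element (R counts positions starting from 1)
x[2]

# Arithmetic is applied to every element at once
x * 2

# Count the number of elements
length(x)

# Mixing types converts every element to a common type (here, character)
mixed <- c(1, "a", TRUE)
typeof(mixed)
```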
7.2 Lists
Lists are a collection of objects. This means that each element can be a different
data type (unlike vectors). In the following example we’ll create a list containing
two character objects and one vector with the “list” function.
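A list matching this description can be created as in the sketch below. The name “person” and the values are taken from the output that follows; the exact call is an assumption.

```r
# Two character objects and one numeric vector stored in a single list
person <- list("John", "Smith", c(1, 3, 3, 7))
```

Individual elements can then be retrieved with double square brackets, for example “person[[3]]” for the vector.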
print(person)
[[1]]
[1] "John"
[[2]]
[1] "Smith"
[[3]]
[1] 1 3 3 7
7.3 Matrices
A matrix is a two-dimensional array where the data is all of the same type. In
the following example, we’ll create a matrix with three rows and four columns.
x <- matrix(
c(1,3,3,7,1,3,3,7,1,3,3,7)
, nrow = 3
, ncol = 4
, byrow = TRUE)
print(x)
     [,1] [,2] [,3] [,4]
[1,]    1    3    3    7
[2,]    1    3    3    7
[3,]    1    3    3    7
7.4 Factors
Factors are used to designate levels within categorical data. In the following
example, we’ll use the “factor” function on a vector of assorted color names to
receive the “levels” which it contains.
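As a rough sketch of that idea, the block below builds a factor from a hypothetical vector of color names (the values are an assumption, not the original example's data):

```r
# Hypothetical color names; the original vector is not shown here.
color_names <- c("red", "blue", "green", "blue", "red", "red")

# Convert the character vector to a factor.
colors <- factor(color_names)

# The levels are the unique categories, sorted alphabetically.
color_levels <- levels(colors)

# table() counts how many observations fall in each level.
color_counts <- table(colors)
```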
print(colors)
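A data frame stores tabular data as columns of equal length, and each column can have its own type. The sketch below is a reconstruction, assumed from the printed output that follows rather than taken from the original code:

```r
# Two columns of different types: a numeric id and a character name.
df <- data.frame(
  id = c(1, 2),
  person = c("John", "Jane")
)
```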
print(df)
id person
1 1 John
2 2 Jane
7.6 Arrays
Arrays are objects that can have more than two dimensions. This is sometimes
referred to as being “n-dimensional”. The dimensions of the following example
are 1 x 4 x 3. You’ll see that the data consist of one row and four columns
spread out over a third dimension.
x <- array(
c(1,3,3,7,1,3,3,7,1,3,3,7)
, dim = c(1,4,3))
print(x)
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    3    3    7

, , 2

     [,1] [,2] [,3] [,4]
[1,]    1    3    3    7

, , 3

     [,1] [,2] [,3] [,4]
[1,]    1    3    3    7
7.7 Resources
• W3 Schools: https://www.w3schools.com/r/r_vectors.asp
Exercises
Questions
Exercise: 5-A
Write a function called “multiply” that accepts two numbers as arguments
and outputs the product of those two numbers when called as is demon-
strated below.
multiply(3, 3)
# [1] 9
Exercise: 5-B
Write an equation that returns the remainder of 12 divided by 8.
Exercise: 5-C
Write an equation that returns the remainder of 36 divided by 10.
Exercise: 5-D
Write a “while” loop that prints all even numbers from 0 to 10.
It’s possible for this task to be accomplished in several ways; however, the
output of your program should always look like this:
# [1] 0
# [1] 2
# [1] 4
# [1] 6
# [1] 8
# [1] 10
Exercise: 5-E
You are given a vector that looks like this:
Write a for loop that loops through your vector and prints any element
greater than or equal to 3.
It’s possible for this task to be accomplished in several ways; however, the
output of your program should always look like this:
# [1] 3
# [1] 4
# [1] 5
# [1] 6
# [1] 7
# [1] 8
# [1] 9
# [1] 10
# [1] 11
# [1] 12
Exercise: 6-A
Convert the following character variable to a variable with the data type
“raw”:
You should store your raw data in a variable named “raw_data”, print the
data to the console, and check the data type with the “typeof” function.
Your output should look like the following:
print(raw_data)
# [1] 54 72 65 76 6f 72 20 72 6f 63 6b 73
typeof(raw_data)
# [1] "raw"
Exercise: 6-B
Create a variable named “spending” and give it a value of 120. Then create
a variable named “budget” and give it a value of 100. Next, check whether
spending is greater than budget and store the resulting logical data in a
variable named “over_budget”. Finally, print the value of the “over_budget”
variable and check its data type with the “typeof” function.
Your final output should look like this:
print(over_budget)
# [1] TRUE
typeof(over_budget)
# [1] "logical"
Exercise: 7-A
Create a vector named “animal” and give it the following three values:
“cow”, “cat”, “pig”. Create a second vector named “sound” and give it
the following three values: “moo”, “meow”, “oink”. Finally, create a data
frame named “animal_sounds” and assign each of these vectors to be a
column.
After printing the resulting data frame to the console, you should get the
following output:
# animal sound
# 1 cow moo
# 2 cat meow
# 3 pig oink
Answers
Answer: 5-A
One way you could accomplish this task is demonstrated in the following
solution.
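One possible definition, offered as a sketch rather than the original answer (the argument names “x” and “y” are assumptions):

```r
# Accept two numbers and return their product.
multiply <- function(x, y) {
  x * y
}
```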
multiply(3, 3)
[1] 9
Answer: 5-B
A remainder is referred to as “modulus” in programming. We can use the
“%%” operator to accomplish this. For this example, the output of your
equation should be 4.
12 %% 8
[1] 4
Answer: 5-C
A remainder is referred to as “modulus” in programming. We can use the
“%%” operator to accomplish this. For this example, the output of your
equation should be 6.
36 %% 10
[1] 6
Answer: 5-D
Here’s one way you could write your while loop to achieve this output:
i <- 0
[1] 0
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
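A complete loop consistent with the output above might look like the following. This is a sketch, since only the first line of the original solution is shown:

```r
# Start at 0 and step by 2 until we pass 10.
i <- 0
while (i <= 10) {
  print(i)
  i <- i + 2
}
```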
Answer: 5-E
Here’s one way you could write your for loop to achieve this output:
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
Answer: 6-A
You can accomplish this task with the “charToRaw” function.
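A sketch of one solution. The input string is an assumption: it was inferred by decoding the expected bytes, because the exercise's own input line is not shown:

```r
# Assumed input, inferred from the expected output bytes.
text <- "Trevor rocks"

# Convert the character string to its raw bytes.
raw_data <- charToRaw(text)
print(raw_data)
```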
[1] 54 72 65 76 6f 72 20 72 6f 63 6b 73
typeof(raw_data)
[1] "raw"
Answer: 6-B
The following example demonstrates how you can accomplish this task.
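One way the steps in the exercise could be written (a sketch, not necessarily the original answer):

```r
spending <- 120
budget <- 100

# A comparison returns a logical value.
over_budget <- spending > budget
print(over_budget)
```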
[1] TRUE
typeof(over_budget)
[1] "logical"
Answer: 7-A
The following example demonstrates how you can accomplish this task.
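One way the steps in the exercise could be written (a sketch, not necessarily the original answer):

```r
animal <- c("cow", "cat", "pig")
sound <- c("moo", "meow", "oink")

# Each vector becomes a column named after the variable.
animal_sounds <- data.frame(animal, sound)
print(animal_sounds)
```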
animal sound
1 cow moo
2 cat meow
3 pig oink
Part III
Before conducting an analysis you must first acquire your data, e.g. via manual
creation, importing pre-constructed data, or leveraging APIs.
• Included Datasets: R comes with a variety of built-in datasets. This
chapter will teach you how to view the catalog of included datasets, preview
individual datasets, and begin working with the data.
• Import from Spreadsheets: Most R users will have to work with spreadsheets
at some point in their careers. This chapter will teach you how to
import data from spreadsheets, e.g. from a .csv or .xlsx file, and get the
imported data into a format that’s easy to work with.
• Working with APIs: API stands for Application Programming Interface.
These sorts of tools are commonly used to programmatically pull data from
a third party resource. This chapter demonstrates how you can begin to
leverage these tools in your own workflows.
Chapter 8
Included Datasets
R comes with a variety of datasets already built in. This chapter will teach you
how to view the catalog of included datasets, preview individual datasets, and
begin working with the data.
You can view the complete list of datasets available along with a brief description
for each one by typing “data()” into your console.
data()
This will open a new tab in your RStudio instance that looks similar to the
following image:
Note
It may not be necessary to load a dataset with the “data” function
before using it. Additionally, some datasets may require you to add
them to your search path with the “attach” function (conversely,
you can remove datasets from your search path with the “detach” function).
data("iris")
This command will add a new object “iris” to our R session. Let’s preview the
“iris” dataset by using the “head” function.
head(iris)
Finally, you can view more information about any given dataset by typing its
name into the “Help” tab in the “Files” pane.
8.3.1 mtcars
head(mtcars)
8.3.2 faithful
head(faithful)
eruptions waiting
3.600 79
1.800 54
3.333 74
2.283 62
4.533 85
2.883 55
8.3.3 ChickWeight
head(ChickWeight)
8.3.4 Titanic
head(Titanic)
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20
8.4 Resources
• List of datasets available in Base R: https://www.rdocumentation.org/packages/datasets/versions/3.6.2
Chapter 9
Import from Spreadsheets
Most R users will have to work with spreadsheets at some point in their careers.
This chapter will teach you how to import data from a .csv or .xlsx file, and how
to get the imported data into a format that’s easy to work with. Additionally,
this chapter will demonstrate how to import multiple files at once and combine
them all into a single dataframe.
Note
It isn’t necessary to store the file path as a variable before calling the
function; however, this habit may save you time down the road.
Alternatively, if you have multiple files from the same directory that need to be
imported, you could do something more like the following code snippet.
library(readxl)
input <- "C:/File Location/example.xlsx"
df <- read_excel(input)
install.packages("readr")
library(readr)
You can list the paths to all .csv files in a directory with the dir() command:
wd <- "C:/YOURWORKINGDIRECTORY"
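# The pattern is a regular expression: "\\." matches a literal dot and "$" anchors to the end of the name
dir(wd, full.names = TRUE, pattern = "\\.csv$")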
Note
All of the headers in your CSV files must match exactly for
this function to work as expected.
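The idea of listing every .csv file in a directory and combining them into one data frame can be sketched in base R as follows. This is not the function referred to above: the temporary directory, file names, and data are all hypothetical, chosen only to make the example self-contained.

```r
# Write two small CSV files with identical headers into a
# temporary directory (hypothetical demo data).
wd <- file.path(tempdir(), "csv_demo")
dir.create(wd, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(wd, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(wd, "b.csv"), row.names = FALSE)

# List every file whose name ends in ".csv".
paths <- dir(wd, full.names = TRUE, pattern = "\\.csv$")

# Read each file into its own data frame, then stack the rows.
# rbind() requires the column names to match across files.
tables <- lapply(paths, read.csv)
combined <- do.call(rbind, tables)
```

Because “rbind” matches columns by name, a file with a differently spelled header would raise an error rather than silently misalign the data.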
9.4 Resources
• trevoR package documentation: https://github.com/TrevorFrench/trevoR
Chapter 10
Working with APIs
API stands for Application Programming Interface. These sorts of tools are
commonly used to programmatically pull data from a third party resource. This
chapter demonstrates how one can begin to leverage these tools in their own
workflows.
The following example uses the Helium API to return data about its blockchain
network.
install.packages(c('httr', 'jsonlite'))
library('httr')
library('jsonlite')
res = GET("https://api.helium.io/v1/stats")
print(res)
Response [https://api.helium.io/v1/stats]
Date: 2022-08-04 01:25
Status: 200
Content-Type: application/json; charset=utf-8
Size: 922 B
data = fromJSON(rawToChar(res$content))
names(data)
[1] "data"
Go one level deeper into the data set and print out the names again.
data = data$data
names(data)
[1] "token_supply"
[1] "election_times"
[1] "counts"
[1] "challenge_counts"
[1] "block_times"
token_supply = data$token_supply
print(token_supply)
[1] 124675821
res = GET("https://api.helium.io/v1/dc_burns/sum",
query = list(min_time = "2020-07-27T00:00:00Z"
, max_time = "2021-07-27T00:00:00Z"))
data = fromJSON(rawToChar(res$content))
fee = data$data$fee
print(fee)
[1] 10112755000
res = GET("https://api.helium.io/v1/dc_burns/sum",
query = list(min_time = "2020-07-27T00:00:00Z"
, max_time = "2021-07-27T00:00:00Z"),
add_headers(`Accept`='application/json', `Connection`='keep-alive'))
data = fromJSON(rawToChar(res$content))
fee = data$data$fee
print(fee)
[1] 10112755000
10.7 Resources
• Blog post by Trevor French: https://medium.com/trevor-french/api-calls-in-r-136290ead81d
Questions
Exercise: 8-A
Create a data frame called “cars” that contains the first five rows of the
mtcars dataset by using the “head” function. After printing to the console,
you should get the following result:
Exercise: 9-A
Write a function named “read_file” which will accept a file name as a
parameter named “file_name”. The function should then read in a csv
with the specified name, store it as a data frame named “df”, and return
“df” as the final output.
Exercise: 9-B
In exercise 9-A you created a function that will allow you to read a csv file.
Extend this function by adding a second parameter named “csv” which will
accept either “TRUE” or “FALSE”. The functionality shouldn’t change if
the parameter is equal to “TRUE”; however, if the parameter is equal to
“FALSE”, the function should read in an xlsx file instead.
For example, if a user wanted to read in a csv file they would use the
function in this way:
read_file("iris.csv", TRUE)
If the user wanted to read in an xlsx file they would use the function in
this way:
read_file("iris.xlsx", FALSE)
Answers
Answer: 8-A
This task can be accomplished with the following code:
Answer: 9-A
This task can be accomplished with the following code:
Answer: 9-B
Here’s one way you could write your function to accomplish this task:
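library(readxl)
read_file <- function(file_name, csv) {
# Read a csv file when "csv" is TRUE
if (csv == TRUE) {
df <- read.csv(file_name)
return(df)
}
# Otherwise read an xlsx file
if (csv == FALSE) {
df <- read_excel(file_name)
return(df)
}
}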
Part IV
Most data will not be received in the precise format you need to begin your
analysis. The process of data preparation is where you will structure and add
features to your data.
• Data Cleaning: This chapter will cover the basics of cleaning your data,
including renaming variables, splitting text, replacing values, dropping
columns, and dropping rows. These basic actions will be essential to
preparing your data prior to developing insights.
• Handling Missing Data: You may encounter situations where some of
your data are missing. This chapter will cover best practices on dealing
with missing data and introduce the tools to do so.
• Outliers: Outliers are observations that fall outside the expected scope of
the dataset. It’s important to identify outliers and either choose analysis
strategies that are robust to their presence or deal with them appropriately
before moving into the next analysis phase.
• Organizing Data: This chapter will focus on sorting, filtering, and grouping
your datasets.
Chapter 11
Data Cleaning
This chapter will cover the basics of cleaning your data including renaming
variables, splitting text, replacing values, dropping columns, and dropping rows.
These basic actions will be essential to preparing your data prior to developing
insights.
df <- head(iris)
print(df)
Now, let’s change our column names (which contain different properties of iris
species) into “snake case”, i.e. all words are lowercase and separated by
underscores. We’ll do this with the “colnames” function. In the following
example, we rename each column individually by specifying the number of the
column to adjust.
Let’s change the column names again, but use “camel case” this time, i.e. the
first word is lowercase and each subsequent word has its first letter
capitalized. Instead of using the column number, this time we’ll use the
actual name of the column we want to adjust.
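Both renaming styles can be sketched as follows. The new column names are assumptions chosen to fit the description (the original code is not shown), although “species” agrees with the “rename” call used later in this chapter.

```r
df <- head(iris)

# Rename by column number: snake case.
colnames(df)[1] <- "sepal_length"
colnames(df)[2] <- "sepal_width"
colnames(df)[3] <- "petal_length"
colnames(df)[4] <- "petal_width"
colnames(df)[5] <- "species"

# Rename by current column name: camel case.
colnames(df)[colnames(df) == "sepal_length"] <- "sepalLength"
colnames(df)[colnames(df) == "sepal_width"] <- "sepalWidth"
colnames(df)[colnames(df) == "petal_length"] <- "petalLength"
colnames(df)[colnames(df) == "petal_width"] <- "petalWidth"
```

Renaming by name is usually the safer choice, because it keeps working if the column order changes.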
Alternatively, you can use the “rename” function from the “dplyr” package.
library(dplyr)
df <- rename(df, "plantSpecies" = "species")
library(tidyr)
df <- data.frame(person = c("John_Doe", "Jane_Doe"))
person
John_Doe
Jane_Doe
We now have a data frame with one column that contains a first name and a
last name combined by an underscore. Let’s now split the two names into their
own separate columns.
first_name last_name
John Doe
Jane Doe
Let’s break down what just happened. We first declared that “df” was going
to be equal to the output of the function that followed by typing “df <-”. Next
we told the separate function that it would be altering the existing dataframe
called “df” by typing “df %>%”.
We then gave the separate function three arguments. The first argument was
the column we were going to be editing, “person”. The second argument was
the names of our two new columns, “first_name” and “last_name”. Finally, the
third argument was our desired delimiter, “_”.
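The call described above can be sketched as follows. The exact original line is an assumption; it is reconstructed from the three arguments named in the explanation.

```r
library(dplyr)  # provides the %>% pipe
library(tidyr)  # provides separate()

df <- data.frame(person = c("John_Doe", "Jane_Doe"))

# Split "person" into two new columns at the underscore.
df <- df %>% separate(person, c("first_name", "last_name"), "_")
```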
student grade
John 83
Jane 97
Joe 74
Janet 27
Now that our dataset is assembled, let’s decide that we’re going to institute
a minimum grade of 60. To do this we’re going to need to replace any grade
lower than 60 with 60. The following example demonstrates one way you could
accomplish that.
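A minimal sketch of that replacement, using the grades shown above (the original code is not shown, so the exact approach is an assumption):

```r
df <- data.frame(
  student = c("John", "Jane", "Joe", "Janet"),
  grade = c(83, 97, 74, 27)
)

# Select the grades below 60 and overwrite them with 60.
df$grade[df$grade < 60] <- 60
print(df)
```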
student grade
John 83
Jane 97
Joe 74
Janet 60
df <- head(mtcars)
print(df)
Next, we can either drop columns by specifying the columns we want to keep
or by specifying the ones we want to drop. The following example will get rid
of the “carb” column by specifying that we want to keep every other column.
df <- subset(df, select = c(mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear))
Alternatively, let’s try getting rid of the “gear” column directly. We can do this
by putting a “-” in front of the “c” function.
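A sketch of that approach, assuming the same “subset” function used in the previous example:

```r
# Start from the first rows of mtcars and drop "carb".
df <- subset(head(mtcars), select = -c(carb))

# Drop the "gear" column by negating the selection.
df <- subset(df, select = -c(gear))
```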
One other way you could drop columns if you wanted to use index numbers
rather than column names is demonstrated below.
df <- df[,-c(1,3:7)]
cyl vs am
Mazda RX4 6 0 1
Mazda RX4 Wag 6 0 1
Datsun 710 4 1 1
Hornet 4 Drive 6 1 0
Hornet Sportabout 8 0 0
Valiant 6 1 0
As you can see, we used square brackets to select a subset of our dataframe
and placed our values after the comma to indicate that we were choosing
columns rather than rows. We then used the “-” symbol to say that we
were choosing columns to drop rather than columns to keep. Finally, we chose
to drop column 1 as well as columns 3 through 7.
df <- df[-c(1:2),]
cyl vs am
Datsun 710 4 1 1
Hornet 4 Drive 6 1 0
Hornet Sportabout 8 0 0
Valiant 6 1 0
11.6 Resources
• “Separate” function documentation: https://tidyr.tidyverse.org/reference/separate.html
Chapter 12
Handling Missing Data
You may encounter situations where some of your data are missing. This chapter
will cover best practices for handling these situations as well as the
technical details of how to remedy the data.
Missing data will often be represented by either “NA” or an empty string (“”)
in R. Sometimes you can simply ignore these data; other times you will need to
“impute” the missing data, which means choosing a sensible value to use in
place of each missing one. The three imputation methods covered in this
chapter are constant value imputation, central tendency imputation, and
multiple imputation.
print(blanks)
print(nas)
We can use the “is.na” function to identify data with “NA” values. The following
example demonstrates how the function works. The output ends up being a
“TRUE” or “FALSE” to designate whether each observation is an “NA” value.
is.na(nas)
We can then take this one step further and use the function to filter for “NA”
values.
[1] NA
This works great; however, it’s more likely that you would want to see the values
which aren’t equal to “NA”. This can be accomplished by using the “NOT”
operator “!”.
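Both kinds of filtering can be sketched with a small hypothetical vector (the values below are assumptions, not the chapter's original data):

```r
# Hypothetical vector with one missing value.
nas <- c(1, NA, 3)

# Keep only the NA values.
missing_values <- nas[is.na(nas)]

# Keep only the values that are not NA.
present_values <- nas[!is.na(nas)]
```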
If your missing data is just an empty string (“”) rather than an “NA” value, you
can use simple comparison operators to accomplish the same thing.
blanks == ""
[1] ""
When working with dataframes rather than just vectors, you can also use the
“na.omit” function to remove complete rows with “NA” values.
student score
1 John 100
2 Jane 80
3 Joe NA
df <- na.omit(df)
print(df)
student score
1 John 100
2 Jane 80
print(df)
student score
1 John 100
2 Jane 80
3 Joe NA
Depending on the context, it may make sense for you to ignore this observation
prior to calculating the average score. It could also make sense for you to assign
a value of “0” to this student’s test score.
Let’s demonstrate how you would replace “NA” values with a constant value of
“0”.
df[is.na(df)] <- 0
print(df)
student score
1 John 100
2 Jane 80
3 Joe 0
print(df)
employee hours_spent
1 John 12
2 Jane 14
3 Joe NA
4 Janet 9
The following example demonstrates how you can replace missing values with
an average of the rest of the employees’ time spent.
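A sketch of mean imputation using the data shown above. The variable name “avg_hours” is an assumption; the original code is not shown.

```r
hours_spent <- c(12, 14, NA, 9)
df <- data.frame(
  employee = c("John", "Jane", "Joe", "Janet"),
  hours_spent = hours_spent
)

# Average of the observed values; na.rm = TRUE skips the NA.
avg_hours <- mean(df$hours_spent, na.rm = TRUE)
print(avg_hours)

# Replace each missing value with that average.
df$hours_spent[is.na(df$hours_spent)] <- avg_hours
print(df)
```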
[1] 11.66667
employee hours_spent
1 John 12.00000
2 Jane 14.00000
3 Joe 11.66667
4 Janet 9.00000
Alternatively, we can reset our dataframe and replace “NA” values with the
median value by doing the following.
# RESET DATAFRAME
df$hours_spent <- hours_spent
[1] 12
employee hours_spent
1 John 12
2 Jane 14
3 Joe 12
4 Janet 9
Note
Linear regression is covered in more depth later in this book. Don’t worry
if this example feels completely unfamiliar at this point.
We’ll begin by creating a dataframe with both an “x” and a “y” variable.
print(df)
y x
1 10 8
2 8 6
3 NA 9
4 9 7
5 4 2
6 NA 12
Next, let’s use the “lm” function to create a linear model and then print out a
summary of that model.
Residuals:
1 2 4 5
0 0 0 0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2 0 Inf <2e-16 ***
x 1 0 Inf <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
y x
1 10 8
2 8 6
3 11 9
4 9 7
5 4 2
6 14 12
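The steps above can be sketched as a single regression-based imputation: fit a linear model on the complete rows, then predict the missing “y” values from “x”. The original code is not shown, so the variable names below are assumptions.

```r
df <- data.frame(
  y = c(10, 8, NA, 9, 4, NA),
  x = c(8, 6, 9, 7, 2, 12)
)

# lm() drops rows with a missing response by default,
# so the model is fitted on rows 1, 2, 4, and 5.
model <- lm(y ~ x, data = df)

# Predict y for the rows where it is missing and fill them in.
missing <- is.na(df$y)
df$y[missing] <- predict(model, newdata = df[missing, ])
print(df)
```

In this example the observed points lie exactly on the line y = x + 2, which is why the summary above reports residuals of zero; real data will rarely fit this cleanly.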
12.5 Resources
• “Missing-data Imputation” from Columbia: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
Chapter 13
Outliers
Outliers are observations that fall outside the expected scope of the dataset.
It’s important to identify outliers in your data and determine the necessary
treatment for them before moving into the next analysis phase.
plot(mtcars$mpg)
[Figure: scatterplot of mtcars$mpg (y-axis) against Index (x-axis)]
[Figure: a second scatterplot against Index (x-axis)]
Another way to quickly visualize outliers is to use the “boxplot” function. This
plot will allow you to evaluate outliers in a more systematic way.
boxplot(mtcars$mpg)
[Figure: boxplot of mtcars$mpg]
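The solid black line represents the median value of your dataset. The bottom
and top of the “box” represent the first and third quartiles, respectively. By
default, the “whiskers” extend to the most extreme values that lie within 1.5
times the interquartile range of the box. Any points beyond the whiskers are
drawn individually and are candidates for being outliers.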
boxplot(data)
[Figure: boxplot of data]
13.1.3 Histogram
Histograms will allow you to see how often values occur within certain buckets.
hist(mtcars$mpg)
[Figure: “Histogram of mtcars$mpg”; x-axis: mtcars$mpg, y-axis: Frequency]
hist(data)
[Figure: “Histogram of data”; x-axis: data, y-axis: Frequency]
plot(density(mtcars$mpg))
[Figure: density plot, “density.default(x = mtcars$mpg)”; N = 32, Bandwidth = 2.477; y-axis: Density]
plot(density(data))
[Figure: density plot, “density.default(x = data)”; N = 12, Bandwidth = 1.839; y-axis: Density]
sd <- sd(data)
print(sd)
[1] 27.31078
Next, let’s calculate the mean of our dataset.
[1] 12.66667
Finally, for each record in our vector, let’s calculate how many standard devia-
tions it falls from the mean.
[1] 1 4 7 9 2 6 3 4 2 7 8
[1] 1 4 7 9 2 6 3 4 2 7 8
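The standard-deviation approach described above, along with the 1.5 x IQR rule listed in the resources below, can be sketched as follows. The vector is hypothetical: it was chosen to be consistent with the mean and standard deviation printed above, and the cutoff of 3 standard deviations is an assumption.

```r
# Hypothetical data: eleven small values and one extreme value.
data <- c(1, 4, 7, 9, 2, 6, 3, 4, 2, 7, 8, 99)

# Number of standard deviations each value falls from the mean.
z_scores <- (data - mean(data)) / sd(data)

# Keep values within 3 standard deviations of the mean.
no_outliers_sd <- data[abs(z_scores) < 3]

# 1.5 x IQR rule: keep values inside the lower and upper fences.
q1 <- quantile(data, 0.25)
q3 <- quantile(data, 0.75)
fence <- 1.5 * IQR(data)
no_outliers_iqr <- data[data >= q1 - fence & data <= q3 + fence]
```

Both filters drop only the value 99 here, but on other datasets the two rules can disagree, since the mean and standard deviation are themselves pulled toward extreme values.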
13.4 Resources
• “Statistics - Standard Deviation” by W3 Schools: https://www.w3schools.com/statistics/statistics_standard_deviation.php
• “Identifying outliers with the 1.5xIQR rule”: https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule
Chapter 14
Organizing Data
This chapter will focus on sorting, filtering, and grouping your datasets.
[1] 5 9 3 2 7
Next we’ll sort our data by using the “sort” function. This function will return
your original data but sorted in ascending order.
sort(completed_tasks)
[1] 2 3 5 7 9
Alternatively, you can set the “decreasing” parameter to “TRUE” to sort your
data in descending order.
[1] 9 7 5 3 2
The “order” function will return the index of each item in your vector in sorted
order. This function also has a “decreasing” parameter which can be set to
“TRUE”.
order(completed_tasks)
[1] 4 3 1 5 2
Finally, the “rank” function will return the rank of each item in your vector in
ascending order.
rank(completed_tasks)
[1] 3 5 2 1 4
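The three functions can be compared side by side in a short sketch that uses the values printed at the start of this chapter (the result variable names are assumptions):

```r
completed_tasks <- c(5, 9, 3, 2, 7)

# sort() returns the values themselves, reordered.
ascending <- sort(completed_tasks)
descending <- sort(completed_tasks, decreasing = TRUE)

# order() returns the positions that would sort the vector,
# so indexing with it gives the same result as sort().
positions <- order(completed_tasks)
sorted_by_index <- completed_tasks[positions]

# rank() returns each element's rank, in the original order.
ranks <- rank(completed_tasks)
```

“order” is the one to reach for when sorting a data frame, because its result can be used to reorder whole rows, e.g. df[order(df$column), ].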
14.2 Filtering
You may have noticed in previous chapters that we’ve used comparison operators
to filter our data. Let’s review by filtering out completed tasks greater than or
equal to 7.
completed_tasks[completed_tasks < 7]
[1] 5 3 2
Alternatively, you can use the “filter” function from the “dplyr” library. Let’s
use this function with the “iris” dataset to filter out any species other than
virginica.
head(iris)
library(dplyr)
virginica <- filter(iris, Species == "virginica")
14.3 Grouping
One final resource for you to leverage as you organize your data is the
“group_by” function from the “dplyr” library.
If we wanted to group the iris dataset by species we might do something similar
to the following example.
library(dplyr)
grouped_species <- iris %>% group_by(Species)
Now if we print out our resulting dataset you’ll notice that the “group_by”
operation we just performed doesn’t change how the data looks by itself.
head(grouped_species)
In order to change the structure of our dataset we’ll need to specify how our
groups should be treated by combining the “group_by” function with another
dplyr “verb” such as “summarise”.
head(grouped_species)
Now each of the three species in the iris dataset has its average sepal length,
sepal width, petal length, and petal width displayed.
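Combining “group_by” with “summarise” might look like the following sketch. The output column names are assumptions; the original code is not shown.

```r
library(dplyr)

# Group by species, then collapse each group to one row of means.
species_means <- iris %>%
  group_by(Species) %>%
  summarise(
    sepal_length = mean(Sepal.Length),
    sepal_width = mean(Sepal.Width),
    petal_length = mean(Petal.Length),
    petal_width = mean(Petal.Width)
  )
print(species_means)
```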
You can find more information about the “group_by” function and other dplyr
“verbs” in the resources section below.
14.4 Resources
• dplyr “filter” function documentation: https://dplyr.tidyverse.org/reference/filter.html
• dplyr “group_by” function documentation: https://dplyr.tidyverse.org/reference/group_by.html
Exercises
Questions
Exercise: 11-A
Create a dataframe named “df” which is equal to the first three columns
and the first five rows of the “mtcars” dataset. Next, rename the “mpg”
column to “miles_per_gallon”.
After printing the resulting dataframe to the console you should have the
following results:
Exercise: 12-A
You are given the following dataframe:
# var_1 var_2
# 1 3 8
# 3 2 6
# 4 9 4
# 6 2 5
# 7 7 5
Exercise: 12-B
Take the original “df” dataframe from exercise 12-A and apply a constant
value of “5” to each “NA” value. Store this new dataframe in a variable
named “constant_value”.
Your final output after printing “constant_value” to the console should
look like this:
print(constant_value)
# var_1 var_2
# 1 3 8
# 2 4 5
# 3 2 6
# 4 9 4
# 5 5 8
# 6 2 5
# 7 7 5
Exercise: 12-C
Take the same “df” dataframe from exercises 12-A and 12-B and apply an
average value of each column to “NA” values in each respective column.
Store this new dataframe in a variable named “mean_value”.
Your final output after printing “mean_value” to the console should look
like this:
print(mean_value)
# var_1 var_2
# 1 3.0 8
# 2 4.0 6
# 3 2.0 6
# 4 9.0 4
# 5 4.5 8
# 6 2.0 5
# 7 7.0 5
Exercise: 13-A
Use the “Nile” dataset to create a histogram to view the distribution of
its data.
Exercise: 14-A
Take the dataframe created in exercise 11-A and drop any row where the
“disp” column is equal to “160”.
You should receive the following results when you print the resulting
dataframe to the console.
Answers
Answer: 11-A
This task could be accomplished in the following way:
library(dplyr)
Answer: 12-A
This task could be accomplished in the following way:
var_1 var_2
1 3 8
3 2 6
4 9 4
6 2 5
7 7 5
Answer: 12-B
There are several ways this task could be accomplished; however, the
following example demonstrates one way to do it.
constant_value <- df
constant_value[is.na(constant_value)] <- 5
print(constant_value)
var_1 var_2
1 3 8
2 4 5
3 2 6
4 9 4
5 5 8
6 2 5
7 7 5
Answer: 12-C
There are several ways this task could be accomplished; however, the
following example demonstrates one way to do it.
mean_value <- df
mean_value$var_1[is.na(mean_value$var_1)] <- mean_1
mean_value$var_2[is.na(mean_value$var_2)] <- mean_2
print(mean_value)
var_1 var_2
1 3.0 8
2 4.0 6
3 2.0 6
4 9.0 4
5 4.5 8
6 2.0 5
7 7.0 5
Answer: 13-A
hist(Nile)
[Figure: “Histogram of Nile”; x-axis: Nile, y-axis: Frequency]
Answer: 14-A
This task could be accomplished in the following way:
library(dplyr)
df <- mtcars[1:5, 1:3]
df <- rename(df, "miles_per_gallon" = "mpg")
Chapter 15
Summary Statistics
summary(mtcars$mpg)
[1] 20.09062
[1] 19.2
[1] 6.026948
[1] 36.3241
[1] 10.4
[1] 33.9
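The six values printed above match the mean, median, standard deviation, variance, minimum, and maximum of “mtcars$mpg”. The calls that produced them are not shown, so the pairing in this sketch is an assumption:

```r
mpg <- mtcars$mpg

mpg_mean <- mean(mpg)      # arithmetic average
mpg_median <- median(mpg)  # middle value
mpg_sd <- sd(mpg)          # standard deviation
mpg_var <- var(mpg)        # variance (the standard deviation squared)
mpg_min <- min(mpg)        # smallest value
mpg_max <- max(mpg)        # largest value
```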
levels(iris$Species)
table(iris$Species)
15.3 Resources
• “Exploring Data and Descriptive Statistics (using R)” from Princeton:
https://www.princeton.edu/~otorres/sessions/s2r.pdf
Chapter 16
Regression
Note
y = mx + b
This is the simple linear model many people begin with, where x and y are the
independent and dependent variables, respectively, m is the slope (the
coefficient of x), and b is the intercept.
To perform linear regression in R, you’ll use the “lm” function. Let’s try it out
on the “faithful” dataset.
head(faithful)
eruptions waiting
3.600 79
1.800 54
3.333 74
2.283 62
4.533 85
2.883 55
The “lm” function will accept at least two parameters which represent “y” and
“x” in this format:
lm(y ~ x)
Let’s try this out by setting the y variable to eruptions and the x variable
to waiting. We can then view the output of our linear model by using the
“summary” function.
Call:
lm(formula = faithful$eruptions ~ faithful$waiting)
Residuals:
Min 1Q Median 3Q Max
-1.29917 -0.37689 0.03508 0.34909 1.19329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
faithful$waiting 0.075628 0.002219 34.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Additionally, you can use the “data” parameter rather than putting the name
of the dataset before every variable.
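Both ways of writing the model can be sketched as follows (the model variable names are assumptions). The first form matches the “Call” line in the output above; the second uses the “data” parameter.

```r
# Prefix every variable with the dataset name.
model_prefixed <- lm(faithful$eruptions ~ faithful$waiting)

# Equivalent model using the "data" parameter.
model_data <- lm(eruptions ~ waiting, data = faithful)

# The fitted intercept and slope are the same either way.
coef(model_data)
summary(model_data)
```

The “data” form is usually preferable, because the coefficient names stay short (“waiting” rather than “faithful$waiting”) and “predict” can then be given new data frames.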
head(mtcars)
Now, let’s try to predict mpg and use every other column as a variable then see
what the results look like.
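A sketch of that model, with the formula written out to match the “Call” line in the output that follows. The “mpg ~ .” shorthand is an addition here, not something the original text uses.

```r
# Predict mpg from every other column in mtcars.
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
              am + gear + carb, data = mtcars)
summary(model)

# Shorthand: "." stands for all columns other than the response.
model_all <- lm(mpg ~ ., data = mtcars)
```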
Call:
lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
am + gear + carb, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.4506 -1.6044 -0.1196 1.2193 4.6271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337 18.71788 0.657 0.5181
cyl -0.11144 1.04502 -0.107 0.9161
disp 0.01334 0.01786 0.747 0.4635
hp -0.02148 0.02177 -0.987 0.3350
drat 0.78711 1.63537 0.481 0.6353
wt -3.71530 1.89441 -1.961 0.0633 .
qsec 0.82104 0.73084 1.123 0.2739
vs 0.31776 2.10451 0.151 0.8814
am 2.52023 2.05665 1.225 0.2340
gear 0.65541 1.49326 0.439 0.6652
carb -0.19942 0.82875 -0.241 0.8122
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
16.4 Resources
• “Lecture 9 - Linear regression in R” by Professor Alexandra Chouldechova at Carnegie Mellon University: https://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture09/lecture09-94842.html
• “Logistic Regression” by Erin Bugbee and Jared Wilber: https://mlu-explain.github.io/logistic-regression/
• “Visualizing OLS Linear Regression Assumptions in R” by Trevor French: https://medium.com/trevor-french/visualizing-ols-linear-regression-assumptions-in-r-e762ba7afaff
Chapter 17
Plotting
This chapter will cover the basics of creating plots in R. It will begin by demon-
strating the plotting capabilities available in R out of the box. These capabilities
are often referred to as “Base R”. In the resources section, you can also find re-
sources to learn more about “ggplot2” which is one of the most common plotting
libraries in R.
y x
-4.400327 1
5.428396 2
1.401835 3
8.347445 4
4.653595 5
1.768966 6
plot(df$x, df$y)
[Figure: scatterplot of df$y (y-axis) against df$x (x-axis)]
Additionally, you can alter the appearance of your points by using the “pch”,
“cex”, and “col” options. PCH stands for Plot Character and will adjust the
symbol used for your points. The available point shapes are listed in the image
below.
ggpubr::show_point_shapes()
[Figure: the available point shapes, numbered 0 through 25]
The “cex” option allows you to adjust the symbol size. The default value is 1.
If you were to change the value to .75, for example, the plot symbol would be
scaled down to 3/4 of the default size. The “col” option allows you to adjust
the color of your plot symbols.
plot(df$x
, df$y
, col=rgb(0.4,0.4,0.8,0.6)
, pch=16
, cex=1.2)
[Figure: scatterplot of df$y against df$x with semi-transparent filled points]
You can adjust the axes with the “xlab”, “ylab”, “xaxt”, and “yaxt” options
(amongst other available options). In the following example we will remove the
axes altogether.
plot(df$x
, df$y
, col=rgb(0.4,0.4,0.8,0.6)
, pch=16
, cex=1.2
, xlab=""
, ylab=""
, xaxt="n"
, yaxt="n")
Finally, you can add a trend line by creating a model and adding the fitted
values to the graph. We’ll also adjust the line width and color with the “lwd”
and “col” parameters, respectively.
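# Fit a simple linear regression of y on x
model <- lm(y ~ x, data = df)
plot(df$x
, df$y
, col=rgb(0.4,0.4,0.8,0.6)
, pch=16
, cex=1.2
, xlab=""
, ylab=""
, xaxt="n"
, yaxt="n")
# Draw the fitted line; "col" sets its color and "lwd" its width
abline(model, col=2, lwd=2)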
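The “predict” function can also return intervals around the fitted values, which
can be added to the plot. Setting “interval” to “prediction” gives 95% prediction
intervals, which describe where individual new observations are expected to fall.
# Extract the fitted values with the lower and upper 95% prediction bounds
conf_interval <- predict(
model,
newdata=data.frame(x=df$x),
interval = "prediction",
level = 0.95)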
plot(df$x
, df$y
, col=rgb(0.4,0.4,0.8,0.6)
, pch=16
, cex=1.2
, xlab=""
, ylab=""
, xaxt="n"
, yaxt="n")
abline(model, col=2, lwd=2)
lines(df$x, conf_interval[,2], col="blue", lty=2)
lines(df$x, conf_interval[,3], col="blue", lty=2)
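The “interval” argument of “predict” accepts both “confidence” and “prediction”, and the two are easy to confuse. The sketch below, which uses hypothetical data in place of the chapter's “df”, draws both bands so the difference is visible: the confidence band describes uncertainty about the fitted line itself, while the prediction band describes where a single new observation may fall and is therefore wider.

```r
# Hypothetical data standing in for the df used in this chapter
set.seed(1)
df <- data.frame(x = 1:100)
df$y <- 0.8 * df$x + rnorm(100, sd = 5)
model <- lm(y ~ x, data = df)

new_x <- data.frame(x = df$x)
# Interval for the mean response at each x
conf_band <- predict(model, newdata = new_x, interval = "confidence", level = 0.95)
# Interval for a single new observation at each x
pred_band <- predict(model, newdata = new_x, interval = "prediction", level = 0.95)

plot(df$x, df$y, pch = 16, col = rgb(0.4, 0.4, 0.8, 0.6))
abline(model, col = 2, lwd = 2)
# Confidence band: dotted green lines
lines(df$x, conf_band[, "lwr"], col = "darkgreen", lty = 3)
lines(df$x, conf_band[, "upr"], col = "darkgreen", lty = 3)
# Prediction band: dashed blue lines
lines(df$x, pred_band[, "lwr"], col = "blue", lty = 2)
lines(df$x, pred_band[, "upr"], col = "blue", lty = 2)
```

Note that “lines” connects points in the order given, so this approach assumes the x values are already sorted.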
17.2 Plots Available in Base R
Now that you’ve seen how to build a scatterplot in R, let’s take a look at other
plots available in Base R.
One plot you’ve already seen in the outliers chapter is the box plot. These plots
can be created via the “boxplot” function.
boxplot(mtcars$mpg)
[Figure: box plot of mtcars$mpg]
We can build on this plot by specifying the dataset with the “data” parameter,
removing the “mtcars$” prefix from our variable, adding a plot title with the
“main” parameter, and adding axis labels with the “xlab” and “ylab” parameters.
We will also add a second variable, “gear”, to group the data by.
boxplot(mpg ~ gear
, data = mtcars
, main = "Car Mileage by Gear"
, xlab = "Number of Forward Gears"
, ylab = "Miles Per Gallon")
[Figure: box plots of miles per gallon for cars with 3, 4, and 5 forward gears]
Finally, we can set the box colors with the “col” parameter and set “notch”
equal to “TRUE” to give our boxes notches. If the notches of two boxes do
not overlap, this is ‘strong evidence’ that the two medians differ (Chambers and
Tukey 1983).
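# In mtcars, "am" is the transmission type: 0 = automatic, 1 = manual
boxplot(mpg ~ am
, data = mtcars
, notch = TRUE
, col = c("blue", "grey")
, main = "Car Mileage by Transmission"
, xlab = "Transmission (0 = Automatic, 1 = Manual)"
, ylab = "Miles Per Gallon")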
[Figure: notched box plots of miles per gallon for am = 0 and am = 1]
pairs(iris)
[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species]
This plot lets us see how each pair of variables relates to one another.
Let’s try plotting a pie chart of species in the iris dataset via the “pie” function.
This function expects numeric values, so we’ll first use the “table” function
to count the number of rows for each species.
pie(table(iris$Species))
[Figure: pie chart with three equal slices labeled setosa, versicolor, and virginica]
You can view the full list of available parameters for this and other functions
through the help tab in the files pane in R Studio.
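The same information is also available from the console. As a small sketch, “args” and “formals” (both part of base R) list the parameters a function accepts:

```r
# Print the argument list of pie() in the console
args(pie)

# The parameter names as a character vector
pie_params <- names(formals(pie))
pie_params

# In an interactive session, ?pie or help("pie") opens the full help page
```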
Let’s try a bar plot on the same dataset with the “barplot” function.
barplot(table(iris$Species))
[Figure: bar plot of the number of rows for each species]
17.2.5 Histogram
You may recall that we also used histograms in the outliers chapter to try to
visually identify extreme values. Here’s a quick recap:
hist(mtcars$mpg)
[Figure: histogram titled “Histogram of mtcars$mpg”, with mtcars$mpg on the x-axis and Frequency on the y-axis]
We also used the following example in the outliers chapter to create a density
plot:
plot(density(mtcars$mpg))
[Figure: density plot titled “density.default(x = mtcars$mpg)”; N = 32, Bandwidth = 2.477]
We can take this one step further by adding a title and a shape to the plot.
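The code for this step is not shown here. One way to do it, assuming the shape is drawn with the base R “polygon” function (the fill and border colors below are arbitrary choices), is sketched as follows:

```r
# Estimate the density once and reuse it for the outline and the fill
d <- density(mtcars$mpg)

# "main" sets the plot title
plot(d, main = "MPG Distribution")

# polygon() fills the area under the curve
polygon(d, col = "lightblue", border = "black")
```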
[Figure: density plot titled “MPG Distribution”; N = 32, Bandwidth = 2.477]
dotchart(df$sales)
[Figures: dot charts of sales by salesperson, first ungrouped and then grouped under the headings Hardware, Professional Services, and Software]
dotchart(df$sales
, labels = df$salesperson
, groups = groups
, gcolor = group_colors
, color = group_colors[groups]
, pch = 16)
[Figure: dot chart of sales by salesperson, grouped under Hardware, Professional Services, and Software, with each group drawn in its own color]
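The definitions of “df”, “groups”, and “group_colors” used in the call above are not shown. The following is a self-contained sketch under stated assumptions: the salesperson names and group headings follow the chart labels, but the “department” column name, the sales figures, and the colors are invented for illustration.

```r
# Hypothetical sales data: names and groups follow the chart labels,
# but the sales figures are made up for illustration only
df <- data.frame(
  salesperson = c("Susan", "Taylor", "Steven", "Michael",
                  "Reagan", "Michael", "Alaka", "Trevor",
                  "Isaac", "Jordan", "Aaron", "Miles"),
  department = c(rep("Professional Services", 4),
                 rep("Software", 4),
                 rep("Hardware", 4)),
  sales = c(9, 12, 15, 18, 8, 11, 14, 20, 10, 13, 16, 19)
)

# dotchart() expects the grouping variable to be a factor
groups <- factor(df$department)

# One color per group, in the order of levels(groups)
group_colors <- c("steelblue", "darkorange", "darkgreen")

dotchart(df$sales
         , labels = df$salesperson
         , groups = groups
         , gcolor = group_colors
         , color = group_colors[groups]
         , pch = 16)
```

Indexing “group_colors” by the factor works because a factor is stored as integer codes, so each row picks up the color of its group.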
17.3 Resources
• ggplot2 documentation: https://ggplot2.tidyverse.org/
• ggplot2 cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf
• ggplot2 extension gallery: https://exts.ggplot2.tidyverse.org/gallery/
• R colors: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
Exercises
Questions
Exercise: 15-A
Use the “summary” function to get summary statistics for all columns in
the “mtcars” dataset.
Your final output should resemble the following:
Exercise: 16-A
Use the “lm” function to create a linear model using the “ChickWeight”
dataset. Your model should predict the “weight” variable using the “Diet”
and “Time” variables.
Name your linear model “lm” and then view a summary of your model
using the “summary” function. The output of your summary should look
like this:
# Call:
# lm(formula = weight ~ Diet + Time, data = ChickWeight)
# Residuals:
# Min 1Q Median 3Q Max
# -136.851 -17.151 -2.595 15.033 141.816
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 10.9244 3.3607 3.251 0.00122 **
# Diet2 16.1661 4.0858 3.957 8.56e-05 ***
# Diet3 36.4994 4.0858 8.933 < 2e-16 ***
# Diet4 30.2335 4.1075 7.361 6.39e-13 ***
# Time 8.7505 0.2218 39.451 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Exercise: 17-A
Create a density plot using the “Nile” dataset.
Answers
Answer: 15-A
Here’s how you can accomplish this task:
summary(mtcars)
Answer: 16-A
You can create your model with the following code:
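The code listing itself is not shown here. A sketch consistent with the exercise text and with the “Call” line of the summary output would be:

```r
# Fit the model; Diet is a factor, so lm() estimates one coefficient
# for each diet level other than the baseline (Diet 1)
lm <- lm(weight ~ Diet + Time, data = ChickWeight)

# View the model summary
summary(lm)
```

Naming the model “lm” follows the exercise instructions. R can still find the “lm” function afterwards, but a more distinctive name avoids confusion in longer scripts.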
Call:
lm(formula = weight ~ Diet + Time, data = ChickWeight)
Residuals:
Min 1Q Median 3Q Max
-136.851 -17.151 -2.595 15.033 141.816
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.9244 3.3607 3.251 0.00122 **
Diet2 16.1661 4.0858 3.957 8.56e-05 ***
Diet3 36.4994 4.0858 8.933 < 2e-16 ***
Diet4 30.2335 4.1075 7.361 6.39e-13 ***
Time 8.7505 0.2218 39.451 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: 17-A
You can create your density plot with the following code:
plot(density(Nile))
[Figure: density plot titled “density.default(x = Nile)”]
Part V: Reporting
“It feels like we’re all suffering from information overload or data glut.
And the good news is there might be an easy solution to that, and
that’s using our eyes more. So, visualizing information, so that we
can see the patterns and connections that matter and then designing
that information so it makes more sense, or it tells a story, or allows
us to focus only on the information that’s important. Failing that,
visualized information can just look really cool.” -David McCandless
(McCandless 2010)
Finally, it’s important to report on your data to make it easy for others to
extract and understand the information that is most relevant.
• Spreadsheets: Spreadsheets are a common way to communicate information
to stakeholders. This chapter will go over how to export .xlsx and .csv
files from R, how to format those spreadsheets, and how to add formulas
to them.
• R Markdown: R Markdown allows you to create documents programmatically,
which improves reproducibility. This chapter will cover some of the
different formats that are available in R as well as how to create them.
• R Shiny: R Shiny is a tool used to develop web applications and is
commonly used to build dashboards, host static reports, and create
custom tooling.
Chapter 18
Spreadsheets
18.1 Export
18.1.1 Export .csv Files
In order to export a dataframe to a CSV file, you can use the “write.csv” function.
This function will accept a dataframe followed by the desired output location
of your file. Let’s start by creating a sample dataframe to work with.
id person
12 John
27 Jane
23 NA
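The code that creates this dataframe is not shown. Based on the table above, a minimal sketch would be:

```r
# Sample dataframe matching the table above; NA marks the missing name
df <- data.frame(
  id = c(12, 27, 23),
  person = c("John", "Jane", NA)
)

df
```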
Now, let’s specify where we want to store the CSV file and execute
the “write.csv” function. (We use file.path() to build a path to the
example.csv file in a temporary directory that will automatically be erased
when your R session ends.)
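The export code is not shown here. A sketch of this step, assuming the temporary directory comes from the base R “tempdir” function, might look like this:

```r
# Sample dataframe from the table above
df <- data.frame(id = c(12, 27, 23), person = c("John", "Jane", NA))

# tempdir() is removed automatically when the R session ends
output <- file.path(tempdir(), "example.csv")

# Write the dataframe to the CSV file
write.csv(df, output)

# Inspect the raw contents of the file
readLines(output)
```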
This will give you a file that looks like the following image.
You’ll notice that the first column contains the row numbers of the dataframe.
This can be remedied by setting “row.names” to “FALSE” as follows.
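As a sketch of that change, using the same sample dataframe and a temporary output path (an assumption, as before):

```r
# Sample dataframe and temporary output path
df <- data.frame(id = c(12, 27, 23), person = c("John", "Jane", NA))
output <- file.path(tempdir(), "example.csv")

# row.names = FALSE leaves out the leading column of row numbers
write.csv(df, output, row.names = FALSE)

readLines(output)
```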
Finally, you’ll notice that one of the names is an “NA” value. You can tell
R how to handle these values at the time of exporting your file with the “na”
argument. This argument will replace any “NA” values with the value of your
choice. Let’s try replacing the “NA” value with “Unspecified”.
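A sketch of this final version, again with the sample dataframe and a temporary output path as assumptions:

```r
# Sample dataframe and temporary output path
df <- data.frame(id = c(12, 27, 23), person = c("John", "Jane", NA))
output <- file.path(tempdir(), "example.csv")

# na = "Unspecified" writes that text wherever the dataframe contains NA
write.csv(df, output, row.names = FALSE, na = "Unspecified")

csv_lines <- readLines(output)
csv_lines
```

The “na” argument only changes what is written to the file. The dataframe in your R session still contains the original “NA” value.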
library(writexl)
output <- "C:/File Location/example.xlsx"
write_xlsx(df, output)
18.2 Formatting
When saving Excel workbooks, you can also leverage the “openxlsx” library
to format and add formulas to your workbook. Let’s use the iris dataset to
demonstrate these capabilities.
library(openxlsx)
Next, let’s break down the iris dataset into three separate datasets based on
species.
Now, we’ll use the “createWorkbook” function from the “openxlsx” library to
create a blank workbook object.
wb <- createWorkbook()
We’ll now add three worksheets to our workbook. These worksheets will ulti-
mately be tabs in our Excel workbook.
addWorksheet(wb, "Setosa")
addWorksheet(wb, "Versicolor")
addWorksheet(wb, "Virginica")
We can also create styles to apply to our workbook. Let’s create a style for our
headers as well as a style for the body of our data.
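The style definitions are not shown at this point. In the sketch below, the “heading” style copies the settings that appear in the full script in this chapter, while the “body” style settings are an assumption chosen only for illustration.

```r
library(openxlsx)

# Header style, using the settings from the full script in this chapter
heading <- createStyle(fontName = "Segoe UI"
                       , fontSize = 12
                       , fontColour = "#FFFFFF"
                       , bgFill = "#244062"
                       , textDecoration = "bold")

# Body style: these settings are assumed, not taken from the book
body <- createStyle(fontName = "Segoe UI", fontSize = 11)
```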
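Let’s now write our three datasets to the worksheets in the workbook object we
previously created.
writeData(wb, "Setosa", setosa, startCol = 1, startRow = 1, rowNames = FALSE)
writeData(wb, "Versicolor", versicolor, startCol = 1, startRow = 1, rowNames = FALSE)
writeData(wb, "Virginica", virginica, startCol = 1, startRow = 1, rowNames = FALSE)
We can then apply the heading style to the first row of each worksheet and the
body style to the data rows, which run from row 2 to row nrow + 1 because row 1
holds the header.
addStyle(wb, "Setosa", cols = 1:length(setosa), rows = 1, style = heading, gridExpand = TRUE)
addStyle(wb, "Versicolor", cols = 1:length(versicolor), rows = 1, style = heading, gridExpand = TRUE)
addStyle(wb, "Virginica", cols = 1:length(virginica), rows = 1, style = heading, gridExpand = TRUE)
addStyle(wb
, "Setosa"
, cols = 1:length(setosa)
, rows = 2:(nrow(setosa) + 1)
, style = body
, gridExpand = TRUE)
addStyle(wb
, "Versicolor"
, cols = 1:length(versicolor)
, rows = 2:(nrow(versicolor) + 1)
, style = body
, gridExpand = TRUE)
addStyle(wb
, "Virginica"
, cols = 1:length(virginica)
, rows = 2:(nrow(virginica) + 1)
, style = body
, gridExpand = TRUE)
# Save the workbook (the file name is an example)
saveWorkbook(wb, "iris.xlsx", overwrite = TRUE)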
This will result in a workbook that looks like the following image.
library(openxlsx)
# Create datasets
setosa <- iris[which(iris$"Species" == "setosa"),]
versicolor <- iris[which(iris$"Species" == "versicolor"),]
virginica <- iris[which(iris$"Species" == "virginica"),]
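# Create workbook
wb <- createWorkbook()
# Add worksheets
addWorksheet(wb, "Setosa")
addWorksheet(wb, "Versicolor")
addWorksheet(wb, "Virginica")
# Create styles
heading <- createStyle(fontName = "Segoe UI"
, fontSize = 12
, fontColour = "#FFFFFF"
, bgFill = "#244062"
, textDecoration = "bold")
# The body style settings are an assumption; the original definition is not shown
body <- createStyle(fontName = "Segoe UI", fontSize = 11)
# Write datasets to worksheets
writeData(wb, "Setosa", setosa, startCol = 1, startRow = 1, rowNames = FALSE)
writeData(wb, "Versicolor", versicolor, startCol = 1, startRow = 1, rowNames = FALSE)
writeData(wb, "Virginica", virginica, startCol = 1, startRow = 1, rowNames = FALSE)
# Style the header row of each worksheet
addStyle(wb, "Setosa", cols = 1:length(setosa), rows = 1, style = heading, gridExpand = TRUE)
addStyle(wb, "Versicolor", cols = 1:length(versicolor), rows = 1, style = heading, gridExpand = TRUE)
addStyle(wb, "Virginica", cols = 1:length(virginica), rows = 1, style = heading, gridExpand = TRUE)
# Style the data rows (rows 2 through nrow + 1, since row 1 holds the header)
addStyle(wb
, "Setosa"
, cols = 1:length(setosa)
, rows = 2:(nrow(setosa) + 1)
, style = body
, gridExpand = TRUE)
addStyle(wb
, "Versicolor"
, cols = 1:length(versicolor)
, rows = 2:(nrow(versicolor) + 1)
, style = body
, gridExpand = TRUE)
addStyle(wb
, "Virginica"
, cols = 1:length(virginica)
, rows = 2:(nrow(virginica) + 1)
, style = body
, gridExpand = TRUE)
# Save the workbook (the file name is an example)
saveWorkbook(wb, "iris.xlsx", overwrite = TRUE)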
You may notice that this script is a little longer than it needs to be. Let’s try
to simplify it with a loop.
The following script will accomplish the exact same thing as the first script.
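library(openxlsx)
# Datasets and worksheet names to loop over. The setosa, versicolor, virginica,
# heading, and body objects are the ones created in the first script.
datasets <- list(setosa, versicolor, virginica)
worksheets <- c("Setosa", "Versicolor", "Virginica")
wb <- createWorkbook()
for (i in 1:3) {
# Double brackets extract the dataframe itself from the list
df <- datasets[[i]]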
addWorksheet(wb, worksheets[i])
writeData(wb
, worksheets[i]
, df
, startCol = 1
, startRow = 1
, rowNames = FALSE)
addStyle(wb
, worksheets[i]
, cols = 1:length(df)
, rows = 1
, style = heading
, gridExpand = TRUE)
addStyle(wb
, worksheets[i]
, cols = 1:length(df)
, rows = 2:(nrow(df) + 1) # data fills rows 2 to nrow + 1; row 1 is the header
, style = body
, gridExpand = TRUE)
}
# Save the workbook (the file name is an example)
saveWorkbook(wb, "iris.xlsx", overwrite = TRUE)
18.3 Formulas
If we wanted to add another column to each of our worksheets that used an
Excel formula to determine the ratio between the sepal length and the sepal
width, we could use the “writeFormula” function to accomplish that.
The following example uses a loop that creates a formula for each row, dividing
the value in column A (sepal length) by the value in column B (sepal width).
Next we add the heading style to the first row in column six and add a
header named “Sepal.Ratio”. Finally, we write the formula vector to column six
beginning on row 2.
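library(openxlsx)
# Datasets and worksheet names to loop over. The setosa, versicolor, virginica,
# heading, and body objects are the ones created in the first script.
datasets <- list(setosa, versicolor, virginica)
worksheets <- c("Setosa", "Versicolor", "Virginica")
wb <- createWorkbook()
for (i in 1:3) {
# Double brackets extract the dataframe itself from the list
df <- datasets[[i]]
# Build one formula per data row, e.g. "A2/B2" (reconstructed; data starts on row 2)
formula <- c()
for (j in 2:(nrow(df) + 1)) {
  formula <- c(formula, paste0("A", j, "/B", j))
}
addWorksheet(wb, worksheets[i])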
writeData(wb
, worksheets[i]
, df
, startCol = 1
, startRow = 1
, rowNames = FALSE)
addStyle(wb
, worksheets[i]
, cols = 1:length(df)
, rows = 1
, style = heading
, gridExpand = TRUE)
addStyle(wb
, worksheets[i]
, cols = 1:length(df)
, rows = 2:(nrow(df) + 1) # data fills rows 2 to nrow + 1; row 1 is the header
, style = body
, gridExpand = TRUE)
addStyle(wb
, worksheets[i]
, cols = 6
, rows = 1
, style = heading
, gridExpand = TRUE)
writeData(wb
, worksheets[i]
, "Sepal.Ratio"
, startCol = 6
, startRow = 1
, rowNames = FALSE)
writeFormula(wb
, worksheets[i]
, formula
, startCol = 6
, startRow = 2)
}
# Save the workbook (the file name is an example)
saveWorkbook(wb, "iris.xlsx", overwrite = TRUE)
This gives us an Excel workbook that looks like the following image.
18.4 Resources
• openxlsx documentation: https://cran.r-project.org/web/packages/openxlsx/openxlsx.pdf
Chapter 19
R Markdown
19.1 Format Options
We’ll begin by creating a new document by selecting the “New File” button
towards the top left corner of R Studio and choosing “R Markdown” from the
dropdown menu.
This will display a menu that looks like the following image.
You’ll notice that you have four main options on the left-hand side: “Document”,
“Presentation”, “Shiny”, and “From Template”.
Each of these options will have several sub-options. The “Document” option,
for example, is selected by default, and you can see there are three sub-options
on the right-hand side: “HTML”, “PDF”, and “Word”.
The “Shiny” option allows you to create either presentations or documents which
include interactive Shiny components.
Finally, the “From Template” option will display several options for you to
leverage pre-made templates.
19.2 HTML Document Example
Let’s choose the HTML sub-option from the Document option and select “OK”.
This will result in a new file in your source pane that looks similar to the
following image.
You can either continue to edit your document with markdown code or you can
select the “visual” option towards the top-left corner of the source pane to have
more of a traditional text editor experience.
Selecting this will prompt you to save your file. After you do so, your rendered
document will appear in your viewer tab.
In addition to the preview being displayed in your viewer tab, you should now
also have an HTML file located in the same place as you saved your R Markdown
file. You can select this file to preview it in your browser as well as send it to
others for them to preview.
19.3 R Notebook
Let’s try creating a notebook by selecting the “New File” button towards the
top left corner of R Studio and choosing “R Notebook” from the dropdown
menu.
This will generate a new file in your source pane that looks like the following
image.
You’ll notice that there is no “knit” option like there is in an ordinary R Mark-
down file. This is because this file is meant to be shared in its current format
rather than as a rendered document. The “knit” option is replaced by a “pre-
view” option. Selecting this option will result in the following output.
This generates a preview of your file in the viewer tab. You may also notice
that the output of the plot(cars) code has not been rendered in the preview.
This is because code has to be explicitly run in R Notebooks in order for it to
be displayed in the rendered preview.
Let’s run the code by pressing the green play button inside the code chunk.
Now if you preview the notebook again you’ll see the plot output included.
19.4 Resources
• “Document Templates” from “R Markdown: The Definitive Guide”:
https://bookdown.org/yihui/rmarkdown/document-templates.html?
version=2022.07.2%2B576&mode=desktop
• R Markdown Formats: https://rmarkdown.rstudio.com/formats.html
• R Markdown Home Page: https://rmarkdown.rstudio.com/
• R Markdown Notebooks: https://rmarkdown.rstudio.com/lesson-10.html
Chapter 20
R Shiny
20.1 Quickstart
Let’s create a new project containing a Shiny application. Projects allow you to
bundle multiple files into a single workspace. You can create a new project
via the “Create a new project” button towards the top left corner in RStudio.
Since we are starting this project from scratch, let’s choose the “New Directory”
option.
Now you can see there are many types of projects that you can create (not just
Shiny Applications). However, we are going to choose “Shiny Application” for
this example.
This is going to create a new folder containing your project files. Choose what
you would like to name that folder and where you would like for it to be saved.
If you’re working in RStudio, you should now have a sample application in your
source pane. We’ll go more in depth into what all of this means later.
For now, let’s demo what this app looks like by pressing the “Run App” button
towards the top right corner of your source pane. You should see a screen pop
up that looks like this.
We can see that the application is using the faithful dataset to create a histogram
which accepts user input to dynamically adjust the number of bins presented in
the histogram.
20.2 Basic Components of a Shiny Application
20.2.1 Libraries
One library you will always need to include in your shiny applications is the
“shiny” library. Make sure you include any other libraries you plan on using in
your code.
library(shiny)
20.2.2 UI
The next thing we see in our code is the creation of our UI object. This is where
the application layout is created. The first function is the “fluidPage” function.
This is probably one of the most common ways to create user interfaces in Shiny
applications. Layouts created with the fluid page methodology are organized
into rows and columns and scale to fit varying browser sizes.
The “titlePanel” function creates a panel with your title inside of it. In our case,
this function is responsible for “Old Faithful Geyser Data” being displayed at
the top of the page.
Next, we see the “sidebarLayout” function. This is essentially a pre-constructed
layout which consists of a “sidebar” panel and a “main” panel which are created
using the “sidebarPanel” and “mainPanel” functions, respectively. You’ll notice
that our sidebar is actually located above our main panel rather than to the
side. This is just because the size of our browser was small enough that they
collapsed to be stacked on top of each other. If you increase the size of your
browser, you will see the sidebar return to its original location.
Inside of the “sidebarPanel” function, we have a function called “sliderInput”.
The “sliderInput” function creates the component which allows the user to select
the number of bins in our app. We can see this function gives the component the
name “bins”, the title “Number of Bins”, a minimum input of “1”, a maximum
value of “50”, and a default value of “30”.
The last component of our UI object is the “mainPanel” function. This main
panel designates the section where our output plot will ultimately go as can be
observed by the “plotOutput” function nested inside of it. This “plotOutput”
function is given the name “distPlot”. This is done so that it can be referenced
later in our server function.
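ui <- fluidPage(
  # Application title
  titlePanel("Old Faithful Geyser Data"),
  # Sidebar with the slider input, main panel with the plot output
  sidebarLayout(
    sidebarPanel(sliderInput("bins", "Number of Bins", min = 1, max = 50, value = 30)),
    mainPanel(plotOutput("distPlot"))
  )
)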
20.2.3 Server
After we create the UI object, we’ll need to create our server function. We’ll pass
two arguments into the function: “input” and “output”. The input argument
allows us to access data from the user interface while the output argument allows
us to pass data back to the user interface.
Note
The UI can only accept the plot we are going to send it because it is using
the “plotOutput” function. If you were going to send a different form of
data, the UI would need to have the corresponding function in order to
accept it.
For example, if your server was going to send a table to the UI your server
would need to use the “renderTable” function and your UI would need to
use the “tableOutput” function.
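The server function itself is not shown in this section. The sketch below is one way to write it, based on the behavior described in the Quickstart: a histogram of the faithful dataset whose number of bins comes from the “bins” slider. The “bin_breaks” helper, the choice of the “waiting” column, and the colors are assumptions made for illustration.

```r
# Helper: break points that split the range of x into n_bins equal-width bins
bin_breaks <- function(x, n_bins) {
  seq(min(x), max(x), length.out = n_bins + 1)
}

# Server function: reads input$bins (the slider) and fills output$distPlot
server <- function(input, output) {
  output$distPlot <- shiny::renderPlot({
    x <- faithful$waiting  # assumed column: waiting time between eruptions
    hist(x
         , breaks = bin_breaks(x, input$bins)
         , col = "darkgray"
         , border = "white"
         , main = "Histogram of waiting times")
  })
}

# With the shiny package loaded and a ui object defined,
# the app is launched with: shinyApp(ui = ui, server = server)
```

The name after “output$” must match the name given to “plotOutput” in the UI (“distPlot”), and the name after “input$” must match the name given to “sliderInput” (“bins”).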
20.3 Deploying Application
20.3.1 ShinyApps.io
Once you create your account and see your dashboard, you can navigate to your
“tokens” by selecting your name in the top right corner and choosing “tokens”
from the dropdown menu.
Now that your token has been generated, select the blue “Show” button to view
it.
You should now have a window that resembles the following image. Select the
“Show secret” button and then copy the code to your clipboard for use later.
The next thing we’ll need to do is to link RStudio to your ShinyApps.io account.
You can do this by navigating back to RStudio and choosing the dropdown
menu next to the publish button. From here, select the “Manage Accounts”
option.
You’ll then get a window that resembles the following image. Choose the
“Connect” button to continue.
Now you’ll have the opportunity to paste your token from your ShinyApps.io
account. After you do so, press the “Connect Account” button.
Now that RStudio is linked to your ShinyApps.io account, you can press the
publish button. You’ll then get a window which allows you to name your app
before publishing. Once you are satisfied with the name you choose, select
“Publish”.
After a few moments, your browser should launch displaying your newly created
Shiny App!
20.4 Resources
• Shiny Home Page: https://shiny.rstudio.com/
• Shiny UI Editor: https://rstudio.github.io/shinyuieditor/
Exercises
Questions
Exercise: 18-A
Write the first seven rows of the “faithful” dataset to a csv file named
“faithful.csv”. Make sure you do not include any row names in your output
file.
Exercise: 18-B
Write the entire “faithful” dataset to an xlsx file using the “saveWorkbook”
function. Name the tab (worksheet) that the data is on “data” and make
the text in the header row bold.
Answers
Answer: 18-A
You can accomplish this through the use of the “write.csv” function.
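The code for this answer is not shown. One possible solution, using “head” to keep the first seven rows, is sketched below:

```r
# head() with n = 7 keeps the first seven rows;
# row.names = FALSE leaves the row numbers out of the file
write.csv(head(faithful, 7), "faithful.csv", row.names = FALSE)
```

This writes the file to your current working directory. Use “getwd()” to check where that is.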
Answer: 18-B
The following code will allow you to accomplish this task.
library(openxlsx)
wb <- createWorkbook()
heading <- createStyle(textDecoration = "bold")
addWorksheet(wb, "data")
writeData(wb
, "data"
, faithful
, startCol = 1
, startRow = 1
, rowNames = FALSE)
addStyle(wb
, "data"
, cols = 1:length(faithful)
, rows = 1
, style = heading
, gridExpand = TRUE)
saveWorkbook(wb, "faithful.xlsx", overwrite = TRUE)
References