R Programming
What is R?
R is a popular programming language used for statistical computing and
graphical presentation.
Why Use R?
It is a great resource for data analysis, data visualization, data science
and machine learning
It provides many statistical techniques (such as statistical tests,
classification, clustering and data reduction)
It is easy to draw graphs in R, like pie charts, histograms, box plots,
scatter plots, etc.
It works on different platforms (Windows, Mac, Linux)
It is open-source and free
It has a large support community
It has many packages (libraries of functions) that can be used to solve
different problems
R Syntax
Syntax
To output text in R, use single or double quotes:
Example
"Hello World!"
Example
5
10
25
Example
5 + 5
Print
Unlike many other programming languages, you can output code in R without
using a print function:
Example
"Hello World!"
However, R does have a print() function available if you want to use it. This
might be useful if you are familiar with other programming languages, such
as Python, which often uses the print() function to output code.
Example
print("Hello World!")
Comments
Comments can be used to explain R code, and to make it more readable. They
can also be used to prevent execution when testing alternative code.
Comments start with a #. When executing code, R will ignore anything that
follows a # on the same line.
Example
# This is a comment
"Hello World!"
Example
"Hello World!" # This is a comment
Comments do not have to be text that explains the code; they can also be used
to prevent R from executing the code:
Example
# "Good morning!"
"Good night!"
Multiline Comments
Unlike other programming languages, such as Java, R has no syntax
for multiline comments. However, we can just insert a # at the start of each
line to create multiline comments:
Example
# This is a comment
# written in
# more than just one line
"Hello World!"
R Variables
Creating Variables in R
Variables are containers for storing data values.
R does not have a command for declaring a variable. A variable is created the
moment you first assign a value to it. To assign a value to a variable, use
the <- sign. To output (or print) the variable value, just type the variable
name:
Example
name <- "John"
age <- 40
Example
name <- "John Doe"
name # auto-print the value of the name variable
However, R does have a print() function available if you want to use it. This
might be useful if you are familiar with other programming languages, such
as Python, which often use a print() function to output variables.
Example
name <- "John Doe"
print(name) # print the value of the name variable
Concatenate Elements
You can also concatenate, or join, two or more elements, by using
the paste() function.
Example
text <- "awesome"
paste("R is", text)
Example
text1 <- "R is"
text2 <- "awesome"
paste(text1, text2)
Example
num1 <- 5
num2 <- 10
num1 + num2
If you try to combine a string (text) and a number, R will give you an error:
Example
num <- 5
text <- "Some text"
num + text
Result:
Error in num + text : non-numeric argument to binary operator
Multiple Variables
R allows you to assign the same value to multiple variables in one line:
Example
# Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"
Variable Names
A variable can have a short name (like x and y) or a more descriptive name
(age, carname, total_volume). Rules for R variables are:
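In brief, a name must start with a letter or a period (and a leading period cannot be followed by a digit), may contain letters, digits, periods and underscores, is case-sensitive, and cannot be a reserved word such as TRUE or NULL. A minimal sketch of these rules follows. The helper is_valid_name is hypothetical, and it assumes that base R's make.names() leaves a name unchanged only when the name is already valid:

```r
# Hypothetical helper: a name is treated as valid when make.names()
# leaves it unchanged (make.names() rewrites invalid names).
is_valid_name <- function(name) {
  make.names(name) == name
}

is_valid_name("myvar")    # starts with a letter
is_valid_name("my_var")   # underscores are allowed
is_valid_name(".myvar")   # may start with a period
is_valid_name("myVar2")   # digits are allowed after the first character
is_valid_name("2myvar")   # cannot start with a digit
is_valid_name("my-var")   # a hyphen is not allowed
is_valid_name("my var")   # spaces are not allowed
is_valid_name("_my_var")  # cannot start with an underscore
is_valid_name("TRUE")     # reserved words cannot be used
```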
Variables can store data of different types, and different types can do
different things.
In R, variables do not need to be declared with any particular type, and can
even change type after they have been set:
Example
my_var <- 30 # my_var is type of numeric
my_var <- "Sally" # my_var is now of type character (aka string)
R has a variety of data types and object classes. You will learn much more
about these as you continue to get to know R.
We can use the class() function to check the data type of a variable:
Example
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
R Numbers
Numbers
There are three number types in R:
numeric
integer
complex
Variables of number types are created when you assign a value to them:
Example
x <- 10.5 # numeric
y <- 10L # integer
z <- 1i # complex
Numeric
A numeric data type is the most common type in R, and contains any number
with or without a decimal, like: 10.5, 55, 787:
Example
x <- 10.5
y <- 55
Example
x <- 1000L
y <- 55L
Complex
A complex number is written with an "i" as the imaginary part:
Example
x <- 3+5i
y <- 5i
Type Conversion
You can convert from one type to another with the following functions:
as.numeric()
as.integer()
as.complex()
Example
x <- 1L # integer
y <- 2 # numeric
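As a minimal sketch of type conversion, assuming the base R functions as.numeric() and as.integer(), the two values can be converted and checked with class(). The variable names a and b are only illustrative:

```r
x <- 1L  # integer
y <- 2   # numeric

# Convert from integer to numeric
a <- as.numeric(x)

# Convert from numeric to integer
b <- as.integer(y)

# Check the resulting classes
class(a)
class(b)
```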
Example
10 + 5
Example
10 - 5
For example, the min() and max() functions can be used to find the lowest or
highest number in a set:
Example
max(5, 10, 15)
Example
sqrt(16)
abs()
The abs() function returns the absolute (positive) value of a number:
Example
abs(-4.7)
Example
ceiling(1.4)
floor(1.4)
R Strings
String Literals
Strings are used for storing text.
Example
"hello"
'hello'
Example
str <- "Hello"
str # print the value of str
Multiline Strings
You can assign a multiline string to a variable like this:
Example
str <- "Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua."
If you want the line breaks to be inserted at the same position as in the code,
use the cat() function:
Example
str <- "Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua."
cat(str)
String Length
There are many useful string functions in R. For example, the nchar() function
returns the number of characters in a string:
Example
str <- "Hello World!"
nchar(str)
Check a String
Use the grepl() function to check if a character or a sequence of characters
is present in a string:
Example
str <- "Hello World!"
grepl("H", str)
grepl("Hello", str)
grepl("X", str)
Example
str1 <- "Hello"
str2 <- "World"
paste(str1, str2)
Escape Characters
To insert characters that are illegal in a string, you must use an escape
character.
Example
str <- "We are the so-called "Vikings", from the north."
str
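Result: R reports an error (an unexpected symbol), because the second double
quote ends the string.
To fix this, escape each inner double quote with a backslash (\"):
Example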
str <- "We are the so-called \"Vikings\", from the north."
str
cat(str)
Note that auto-printing the str variable will print the backslash in the output.
You can use the cat() function to print it without backslash.
Code Result
\\ Backslash
\n New Line
\r Carriage Return
\t Tab
\b Backspace
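The escape codes in the table can be tried out with a short sketch. The string content below is made up for illustration:

```r
# A string containing a tab (\t), a new line (\n), and a backslash (\\)
msg <- "Name:\tR\nPath:\tC:\\temp"

msg       # auto-printing shows the escape sequences as typed
cat(msg)  # cat() renders the tab, the line break, and a single backslash
```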
R Booleans / Logical Values
Booleans (Logical Values)
In programming, you often need to know if an expression is true or false.
When you compare two values, the expression is evaluated and R returns the
logical answer:
Example
10 > 9 # TRUE because 10 is greater than 9
10 == 9 # FALSE because 10 is not equal to 9
10 < 9 # FALSE because 10 is greater than 9
Example
a <- 10
b <- 9
a > b
You can also run a condition in an if statement, which you will learn much
more about in the if..else chapter.
Example
a <- 200
b <- 33
if (b > a) {
print ("b is greater than a")
} else {
print("b is not greater than a")
}
R Operators
Operators
Operators are used to perform operations on variables and values.
In the example below, we use the + operator to add together two values:
Example
10 + 5
Arithmetic operators
Assignment operators
Comparison operators
Logical operators
Miscellaneous operators
R Arithmetic Operators
Arithmetic operators are used with numeric values to perform common
mathematical operations:
Operator  Name                               Example
+         Addition                           x + y
-         Subtraction                        x - y
*         Multiplication                     x * y
/         Division                           x / y
^         Exponent                           x ^ y
%%        Modulus (remainder from division)  x %% y
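A short sketch of the arithmetic operators listed above, using two illustrative values:

```r
x <- 7
y <- 2

x + y   # addition
x - y   # subtraction
x * y   # multiplication
x / y   # division
x ^ y   # exponent
x %% y  # modulus (remainder from division)
```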
R Assignment Operators
Assignment operators are used to assign values to variables:
Example
my_var <- 3
my_var <<- 3
3 -> my_var
3 ->> my_var
R Comparison Operators
Comparison operators are used to compare two values:
Operator Name Example
== Equal x == y
!= Not equal x != y
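Besides == and !=, base R also provides >, <, >= and <=. A brief sketch with illustrative values:

```r
x <- 5
y <- 8

x == y  # equal
x != y  # not equal
x > y   # greater than
x < y   # less than
x >= y  # greater than or equal to
x <= y  # less than or equal to
```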
R Logical Operators
Logical operators are used to combine conditional statements:
Operator Description
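As a brief sketch of how logical operators behave, assuming the standard base R operators &, |, !, && and || (the values are illustrative):

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

a & b  # element-wise AND
a | b  # element-wise OR
!a     # NOT: reverses each logical value

# && and || work on single values and are typically used in if conditions
x <- 5
x > 1 && x < 10
x < 1 || x > 3
```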
R Miscellaneous Operators
Miscellaneous operators are used to manipulate data:
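Operator  Description                                  Example
:         Creates a series of numbers in a sequence    x <- 1:10
%in%      Finds out if an element belongs to a vector  x %in% y
%*%       Matrix multiplication                        x <- Matrix1 %*% Matrix2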
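A minimal sketch of three miscellaneous operators from base R: the sequence operator (:), the membership operator (%in%) and matrix multiplication (%*%). The variable names m1, m2 and product are illustrative:

```r
# : creates a series of numbers in a sequence
x <- 1:5

# %in% checks whether an element belongs to a vector
3 %in% x
9 %in% x

# %*% performs matrix multiplication
m1 <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
m2 <- matrix(c(5, 6, 7, 8), nrow = 2, ncol = 2)
product <- m1 %*% m2
product
```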
The if Statement
An "if statement" is written with the if keyword, and it is used to specify a
block of code to be executed if a condition is TRUE:
Example
a <- 33
b <- 200
if (b > a) {
print("b is greater than a")
}
In this example we use two variables, a and b, which are used as a part of the
if statement to test whether b is greater than a. As a is 33, and b is 200, we
know that 200 is greater than 33, and so we print to screen that "b is greater
than a".
Else If
The else if keyword is R's way of saying "if the previous conditions were not
true, then try this condition":
Example
a <- 33
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print ("a and b are equal")
}
In this example a is equal to b, so the first condition is not true, but the else
if condition is true, so we print to screen that "a and b are equal".
If Else
The else keyword catches anything which isn't caught by the preceding
conditions:
Example
a <- 200
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print("a and b are equal")
} else {
print("a is greater than b")
}
In this example, a is greater than b, so the first condition is not true, also
the else if condition is not true, so we go to the else condition and print to
screen that "a is greater than b".
Example
a <- 200
b <- 33
if (b > a) {
print("b is greater than a")
} else {
print("b is not greater than a")
}
Nested If Statements
You can also have if statements inside if statements. These are
called nested if statements.
Example
x <- 41
if (x > 10) {
print("Above ten")
if (x > 20) {
print("and also above 20!")
} else {
print("but not above 20.")
}
} else {
print("below 10.")
}
AND
The & symbol (and) is a logical operator, and is used to combine conditional
statements:
Example
Test if a is greater than b, AND if c is greater than a:
a <- 200
b <- 33
c <- 500
if (a > b & c > a) {
print("Both conditions are true")
}
OR
The | symbol (or) is a logical operator, and is used to combine conditional
statements:
Example
Test if a is greater than b, or if c is greater than a:
a <- 200
b <- 33
c <- 500
if (a > b | a > c) {
print("At least one of the conditions is true")
}
R While Loop
Loops
Loops can execute a block of code as long as a specified condition is reached.
Loops are handy because they save time, reduce errors, and they make code
more readable.
while loops
for loops
R While Loops
With the while loop we can execute a set of statements as long as a condition
is TRUE:
Example
Print i as long as i is less than 6:
i <- 1
while (i < 6) {
print(i)
i <- i + 1
}
In the example above, the loop will continue to produce numbers ranging
from 1 to 5. The loop will stop at 6 because 6 < 6 is FALSE.
Example
Exit the loop if i is equal to 4.
i <- 1
while (i < 6) {
print(i)
i <- i + 1
if (i == 4) {
break
}
}
The loop will stop at 3 because we have chosen to finish the loop by using
the break statement when i is equal to 4 (i == 4).
Next
With the next statement, we can skip an iteration without terminating the
loop:
Example
Skip the value of 3:
i <- 0
while (i < 6) {
i <- i + 1
if (i == 3) {
next
}
print(i)
}
When the loop passes the value 3, it will skip it and continue to loop.
Yahtzee!
If .. Else Combined with a While Loop
To demonstrate a practical example, let us say we play a game of Yahtzee!
Example
Print "Yahtzee!" If the dice number is 6:
dice <- 1
while (dice <= 6) {
if (dice < 6) {
print("No Yahtzee")
} else {
print("Yahtzee!")
}
dice <- dice + 1
}
If the loop passes the values ranging from 1 to 5, it prints "No Yahtzee".
Whenever it passes the value 6, it prints "Yahtzee!".
For Loops
A for loop is used for iterating over a sequence:
Example
for (x in 1:10) {
print(x)
}
This is less like the for keyword in other programming languages, and works
more like an iterator method as found in other object-orientated
programming languages.
With the for loop we can execute a set of statements, once for each item in
a vector, array, list, etc.
Example
Print every item in a list:
fruits <- list("apple", "banana", "cherry")
for (x in fruits) {
print(x)
}
Example
Print each dice number:
dice <- c(1, 2, 3, 4, 5, 6)
for (x in dice) {
print(x)
}
The for loop does not require an indexing variable to be set beforehand, unlike
while loops.
Break
With the break statement, we can stop the loop before it has looped through
all the items:
Example
Stop the loop at "cherry":
fruits <- list("apple", "banana", "cherry")
for (x in fruits) {
if (x == "cherry") {
break
}
print(x)
}
The loop will stop at "cherry" because we have chosen to finish the loop by
using the break statement when x is equal to "cherry" (x == "cherry").
Next
With the next statement, we can skip an iteration without terminating the
loop:
Example
Skip "banana":
fruits <- list("apple", "banana", "cherry")
for (x in fruits) {
if (x == "banana") {
next
}
print(x)
}
When the loop passes "banana", it will skip it and continue to loop.
Yahtzee!
If .. Else Combined with a For Loop
To demonstrate a practical example, let us say we play a game of Yahtzee!
Example
Print "Yahtzee!" If the dice number is 6:
If the loop reaches the values ranging from 1 to 5, it prints "No Yahtzee" and
its number. When it reaches the value 6, it prints "Yahtzee!" and its number.
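The loop described above can be sketched as follows. The dice vector and the exact message wording are assumptions:

```r
dice <- 1:6

for (x in dice) {
  if (x == 6) {
    print(paste("The dice number is", x, "Yahtzee!"))
  } else {
    print(paste("The dice number is", x, "No Yahtzee"))
  }
}
```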
Nested Loops
It is also possible to place a loop inside another loop. This is called a nested
loop:
Example
Print the adjective of each fruit in a list:
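A minimal sketch of such a nested loop, assuming two illustrative lists named adj and fruits:

```r
adj <- list("red", "big", "tasty")
fruits <- list("apple", "banana", "cherry")

# The outer loop walks through the adjectives,
# the inner loop walks through the fruits
for (x in adj) {
  for (y in fruits) {
    print(paste(x, y))
  }
}
```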
Creating a Function
To create a function, use the function() keyword:
Example
my_function <- function() { # create a function with the name my_function
print("Hello World!")
}
Call a Function
To call a function, use the function name followed by parentheses,
like my_function():
Example
my_function <- function() {
print("Hello World!")
}
my_function() # call the function named my_function
Arguments are specified after the function name, inside the parentheses. You
can add as many arguments as you want, just separate them with a comma.
The following example has a function with one argument (fname). When the
function is called, we pass along a first name, which is used inside the
function to print the full name:
Example
my_function <- function(fname) {
paste(fname, "Griffin")
}
my_function("Peter")
my_function("Lois")
my_function("Stewie")
Parameters or Arguments?
The terms "parameter" and "argument" can be used for the same thing:
information that is passed into a function.
Number of Arguments
By default, a function must be called with the correct number of arguments.
Meaning that if your function expects 2 arguments, you have to call the
function with 2 arguments, not more, and not less:
Example
This function expects 2 arguments, and gets 2 arguments:
my_function("Peter", "Griffin")
If you try to call the function with 1 or 3 arguments, you will get an error:
Example
This function expects 2 arguments, and gets 1 argument:
my_function("Peter")
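A sketch of both cases, assuming a hypothetical two-argument definition of my_function (fname and lname). tryCatch() is used only so that the script keeps running and the error message can be inspected:

```r
# Hypothetical two-argument function
my_function <- function(fname, lname) {
  paste(fname, lname)
}

my_function("Peter", "Griffin")  # two arguments: works

# One argument: R raises an error when lname is first used.
# tryCatch() is used here only so the script keeps running.
result <- tryCatch(
  my_function("Peter"),
  error = function(e) conditionMessage(e)
)
result
```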
Example
my_function <- function(country = "Norway") {
paste("I am from", country)
}
my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")
Return Values
To let a function return a result, use the return() function:
Example
my_function <- function(x) {
return (5 * x)
}
print(my_function(3))
print(my_function(5))
print(my_function(9))
Nested Functions
There are two ways to create a nested function:
Example
Call a function within another function:
Nested_function(Nested_function(2,2), Nested_function(3,3))
Example Explained
You cannot directly call the function because the Inner_func has been defined
(nested) inside the Outer_func.
We need to create a new variable called output and give it a value, which is 3
here.
We then print the output with the desired value of "y", which in this case is 5.
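The two approaches can be sketched as follows: calling a function within another function, and writing a function within a function. The function bodies (adding x and y) are assumptions chosen for illustration. Only the names Nested_function, Outer_func, Inner_func and output come from the text:

```r
# 1. Call a function within another function
Nested_function <- function(x, y) {
  a <- x + y
  return(a)
}

Nested_function(Nested_function(2, 2), Nested_function(3, 3))

# 2. Write a function within a function
Outer_func <- function(x) {
  Inner_func <- function(y) {
    a <- x + y
    return(a)
  }
  return(Inner_func)
}

output <- Outer_func(3)  # Outer_func returns Inner_func with x set to 3
output(5)                # call the inner function with y set to 5
```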
Recursion
R also accepts function recursion, which means a defined function can call
itself.
The developer should be very careful with recursion as it can be quite easy to
slip into writing a function which never terminates, or one that uses excess
amounts of memory or processor power. However, when written correctly,
recursion can be a very efficient and mathematically-elegant approach to
programming.
It can take a new developer some time to work out exactly how this works;
the best way to find out is by testing and modifying it.
Example
tri_recursion <- function(k) {
if (k > 0) {
result <- k + tri_recursion(k - 1)
print(result)
} else {
result = 0
return(result)
}
}
tri_recursion(6)
Global Variables
Variables that are created outside of a function are known
as global variables.
Example
Create a variable outside of a function and use it inside the function:
my_function()
If you create a variable with the same name inside a function, this variable
will be local, and can only be used inside the function. The global variable
with the same name will remain as it was, global and with the original value.
Example
Create a variable inside of a function with the same name as the global
variable:
txt # print txt
If you try to print txt, it will return "global variable" because we are
printing txt outside the function.
To create a global variable inside a function, you can use the global
assignment operator <<-
Example
If you use the assignment operator <<-, the variable belongs to the global
scope:
my_function()
print(txt)
Also, use the global assignment operator if you want to change a global
variable inside a function:
Example
To change the value of a global variable inside a function, refer to the
variable by using the global assignment operator <<-:
my_function()
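The scoping behaviour described in this section can be sketched in one place. The function names local_example and global_example and the string values are hypothetical:

```r
txt <- "global variable"

# A local variable with the same name does not change the global one
local_example <- function() {
  txt <- "local variable"
  paste("Inside the function:", txt)
}
local_example()
txt  # still the original global value

# <<- assigns to the global variable instead
global_example <- function() {
  txt <<- "changed inside the function"
  paste("Inside the function:", txt)
}
global_example()
txt  # the global value has been replaced
```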
To combine the list of items to a vector, use the c() function and separate
the items by a comma.
In the example below, we create a vector variable called fruits, that combines
strings:
Example
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits
Example
# Vector of numerical values
numbers <- c(1, 2, 3)
# Print numbers
numbers
Example
# Vector with numerical values in a sequence
numbers <- 1:10
numbers
You can also create numerical values with decimals in a sequence, but note
that if the last element does not belong to the sequence, it is not used:
Example
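# Vector with numerical decimals in a sequence
numbers1 <- 1.5:6.5
numbers1
# Vector with numerical decimals in a sequence where the last element is not used
numbers2 <- 1.5:6.3
numbers2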
Example
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)
log_values
Vector Length
To find out how many items a vector has, use the length() function:
Example
fruits <- c("banana", "apple", "orange")
length(fruits)
Sort a Vector
To sort items in a vector alphabetically or numerically, use
the sort() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)
sort(fruits) # sort a vector of strings alphabetically
sort(numbers) # sort a vector of numbers
Access Vectors
You can access the vector items by referring to its index number inside
brackets []. The first item has index 1, the second item has index 2, and so
on:
Example
fruits <- c("banana", "apple", "orange")
fruits[1] # access the first item (banana)
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
fruits[c(1, 3)] # access the first and third items (banana and orange)
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
fruits[c(-1)] # access all items except the first
Change an Item
To change the value of a specific item, refer to the index number:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
fruits[1] <- "pear" # change "banana" to "pear"
# Print fruits
fruits
Repeat Vectors
To repeat vectors, use the rep() function:
Example
Repeat each value:
repeat_each <- rep(c(1, 2, 3), each = 3)
repeat_each
Example
Repeat the sequence of the vector:
repeat_times <- rep(c(1, 2, 3), times = 3)
repeat_times
Example
Repeat each value independently:
repeat_indepent <- rep(c(1, 2, 3), times = c(5, 2, 1))
repeat_indepent
Example
numbers <- 1:10
numbers
Example
numbers <- seq(from = 0, to = 100, by = 20)
numbers
Note: The seq() function has three parameters: from is where the sequence
starts, to is where the sequence stops, and by is the interval of the
sequence.
R Lists
Lists
A list in R can contain many different data types inside it. A list is a collection
of data which is ordered and changeable.
Example
# List of strings
thislist <- list("apple", "banana", "cherry")
Access Lists
You can access the list items by referring to its index number, inside
brackets. The first item has index 1, the second item has index 2, and so on:
Example
thislist <- list("apple", "banana", "cherry")
thislist[1]
Example
thislist <- list("apple", "banana", "cherry")
thislist[1] <- "blackcurrant"
# Print the updated list
thislist
List Length
To find out how many items a list has, use the length() function:
Example
thislist <- list("apple", "banana", "cherry")
length(thislist)
Example
Check if "apple" is present in the list:
Example
Add "orange" to the list:
thislist <- list("apple", "banana", "cherry")
append(thislist, "orange")
To add an item to the right of a specified index, add " after=index number"
in the append() function:
Example
Add "orange" to the list after "banana" (index 2):
Example
Remove "apple" from the list:
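The three list operations above (membership check, inserting after an index, and removal) can be sketched together. The result names longer_list and shorter_list are illustrative:

```r
thislist <- list("apple", "banana", "cherry")

# Check if "apple" is present in the list
"apple" %in% thislist

# Add "orange" after "banana" (index 2)
longer_list <- append(thislist, "orange", after = 2)
longer_list

# Remove "apple" (index 1) with a negative index
shorter_list <- thislist[-1]
shorter_list
```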
Range of Indexes
You can specify a range of indexes by specifying where to start and where to
end the range, by using the : operator:
Example
Return the second, third, fourth and fifth item:
thislist <- list("apple", "banana", "cherry", "orange", "kiwi", "melon",
"mango")
(thislist)[2:5]
Example
Print all items in the list, one by one:
for (x in thislist) {
print(x)
}
The most common way is to use the c() function, which combines two
elements together:
Example
list1 <- list("a", "b", "c")
list2 <- list(1,2,3)
list3 <- c(list1,list2)
list3
R Matrices
Matrices
A matrix is a two dimensional data set with columns and rows.
Example
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"),
nrow = 2, ncol = 2)
thismatrix
thismatrix[1, 2]
The whole row can be accessed if you specify a comma after the number in
the bracket:
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"),
nrow = 2, ncol = 2)
thismatrix[2,]
The whole column can be accessed if you specify a comma before the
number in the bracket:
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"),
nrow = 2, ncol = 2)
thismatrix[,2]
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape",
"pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
thismatrix[c(1,2),]
Access More Than One Column
More than one column can be accessed if you use the c() function:
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape",
"pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
thismatrix[, c(1,2)]
Example
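thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape",
"pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
# Add a new column with cbind()
newmatrix <- cbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
newmatrix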
Note: The cells in the new column must be of the same length as the
existing matrix.
Example
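thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape",
"pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
# Add a new row with rbind()
newmatrix <- rbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
newmatrix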
Note: The cells in the new row must be of the same length as the existing
matrix.
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "mango",
"pineapple"), nrow = 3, ncol = 2)
thismatrix
Example
Check if "apple" is present in the matrix:
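A minimal sketch of this check, assuming the %in% operator and the same small fruit matrix used in the other examples:

```r
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)

"apple" %in% thismatrix  # TRUE when at least one cell equals "apple"
"kiwi" %in% thismatrix
```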
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"),
nrow = 2, ncol = 2)
dim(thismatrix)
Matrix Length
Use the length() function to find the total number of items (cells) in a matrix:
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"),
nrow = 2, ncol = 2)
length(thismatrix)
Example
Loop through the matrix items and print them:
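One way to sketch this is with two nested for loops, one over the row indexes and one over the column indexes. The loop variable names rows and columns are illustrative:

```r
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)

# Visit every cell by row index and column index
for (rows in 1:nrow(thismatrix)) {
  for (columns in 1:ncol(thismatrix)) {
    print(thismatrix[rows, columns])
  }
}
```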
Example
# Combine matrices
Matrix1 <- matrix(c("apple", "banana", "cherry", "grape"),
nrow = 2, ncol = 2)
Matrix2 <- matrix(c("orange", "mango", "pineapple", "watermelon"),
nrow = 2, ncol = 2)
# Adding it as a rows
Matrix_Combined <- rbind(Matrix1, Matrix2)
Matrix_Combined
# Adding it as a columns
Matrix_Combined <- cbind(Matrix1, Matrix2)
Matrix_Combined
R Arrays
Arrays
Compared to matrices, arrays can have more than two dimensions.
We can use the array() function to create an array, and the dim parameter to
specify the dimensions:
Example
# An array with one dimension with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray
Example Explained
In the example above we create an array with the values 1 to 24.
Example
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[2, 3, 2]
Example
thisarray <- c(1:24)
# Access all the items from the first row from matrix one
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[c(1),,1]
# Access all the items from the first column from matrix one
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[,c(1),1]
A comma (,) before c() means that we want to access the column.
A comma (,) after c() means that we want to access the row.
Example
Check if the value "2" is present in the array:
2 %in% multiarray
dim(multiarray)
Array Length
Use the length() function to find the total number of items in an array:
Example
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
length(multiarray)
Example
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
for(x in multiarray){
print(x)
}
R Data Frames
Data Frames
Data Frames are data displayed in a format as a table.
Data Frames can hold different types of data. While the first column
can be character, the second and third can be numeric or logical. However,
each column should have the same type of data.
Example
# Create a data frame
Data_Frame <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Example
Data_Frame <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame
summary(Data_Frame)
Access Items
We can use single brackets [ ], double brackets [[ ]] or $ to access columns
from a data frame:
Example
Data_Frame <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame[1]
Data_Frame[["Training"]]
Data_Frame$Training
Add Rows
Use the rbind() function to add new rows in a Data Frame:
Example
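Data_Frame <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
# Add a new row
New_row_DF <- rbind(Data_Frame, c("Strength", 110, 110))
# Print the new data frame
New_row_DF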
Example
Data_Frame <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Example
Data_Frame <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Example
Data_Frame <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
dim(Data_Frame)
You can also use the ncol() function to find the number of columns
and nrow() to find the number of rows:
Example
Data_Frame <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
ncol(Data_Frame)
nrow(Data_Frame)
Example
Data_Frame <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
length(Data_Frame)
Example
Data_Frame1 <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
And use the cbind() function to combine two or more data frames in R
horizontally:
Example
Data_Frame3 <- data.frame(
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
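A sketch of combining data frames both ways, vertically with rbind() and horizontally with cbind(). Data_Frame2 and Data_Frame4, their values, and the result names Combined_rows and Combined_cols are hypothetical and chosen only for illustration:

```r
Data_Frame1 <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Hypothetical second data frame with the same columns
Data_Frame2 <- data.frame(
  Training = c("Stamina", "Stamina", "Strength"),
  Pulse = c(140, 150, 160),
  Duration = c(30, 30, 20)
)

# Combine vertically: rows of Data_Frame2 are placed below Data_Frame1
Combined_rows <- rbind(Data_Frame1, Data_Frame2)
Combined_rows

# Hypothetical data frame with different columns but the same number of rows
Data_Frame4 <- data.frame(
  Steps = c(3000, 6000, 2000),
  Calories = c(300, 400, 300)
)

# Combine horizontally: columns of Data_Frame4 are placed beside Data_Frame1
Combined_cols <- cbind(Data_Frame1, Data_Frame4)
Combined_cols
```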
Demography: Male/Female
Music: Rock, Pop, Classic, Jazz
Training: Strength, Stamina
To create a factor, use the factor() function and add a vector as argument:
Example
# Create a factor
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
# Print the factor
music_genre
You can see from the example above that the factor has four levels
(categories): Classic, Jazz, Pop and Rock.
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
levels(music_genre)
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"), levels = c("Classic", "Jazz", "Pop", "Rock", "Other"))
levels(music_genre)
Factor Length
Use the length() function to find out how many items there are in the factor:
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
length(music_genre)
Result:
[1] 8
Access Factors
To access the items in a factor, refer to the index number, using [] brackets:
Example
Access the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
music_genre[3]
Result:
[1] Classic
Levels: Classic Jazz Pop Rock
Example
Change the value of the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
music_genre[3] <- "Pop"
music_genre[3]
Result:
[1] Pop
Levels: Classic Jazz Pop Rock
Note that you cannot change the value of a specific item to a value that is not
already a level of the factor. The following example will produce a warning
and set the item to NA:
Example
Trying to change the value of the third item ("Classic") to an item that does
not exist/not predefined ("Opera"):
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
music_genre[3] <- "Opera"
music_genre[3]
Result:
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "Opera") :
invalid factor level, NA generated
However, if you have already specified it inside the levels argument, it will
work:
Example
Change the value of the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"), levels = c("Classic", "Jazz", "Pop", "Rock", "Opera"))
music_genre[3] <- "Opera"
music_genre[3]
Result:
[1] Opera
Levels: Classic Jazz Pop Rock Opera
R Graphics
R Plotting
Plot
The plot() function is used to draw points (markers) in a diagram.
At its simplest, you can use the plot() function to plot two numbers against
each other:
Example
Draw one point in the diagram, at position (1) and position (3):
plot(1, 3)
To draw more points, use vectors:
Example
Draw two points in the diagram, one at position (1, 3) and one at position (8,
10):
plot(c(1, 8), c(3, 10))
Multiple Points
You can plot as many points as you like, just make sure you have the same
number of points on both axes:
Example
plot(c(1, 2, 3, 4, 5), c(3, 7, 8, 9, 12))
For better organization, when you have many values, it is better to use
variables:
Example
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)
plot(x, y)
Sequences of Points
If you want to draw dots in a sequence, on both the x-axis and the y-axis,
use the : operator:
Example
plot(1:10)
Draw a Line
The plot() function also takes a type parameter with the value l to draw a
line to connect all the points in the diagram:
Example
plot(1:10, type="l")
Plot Labels
The plot() function also accepts other parameters, such
as main, xlab and ylab, if you want to customize the graph with a main title
and different labels for the x-axis and y-axis:
Example
plot(1:10, main="My Graph", xlab="The x-axis", ylab="The y axis")
Graph Appearance
There are many other parameters you can use to change the appearance of
the points.
Colors
Use col="color" to add a color to the points:
Example
plot(1:10, col="red")
Size
Use cex=number to change the size of the points (1 is default, while 0.5 means
50% smaller, and 2 means 100% larger):
Example
plot(1:10, cex=2)
Point Shape
Use pch with a value from 0 to 25 to change the point shape format:
Example
plot(1:10, pch=25, cex=2)
The values of the pch parameter range from 0 to 25, which means that we
can choose between 26 different point shapes.
R Line
Line Graphs
A line graph has a line that connects all the points in a diagram.
To create a line, use the plot() function and add the type parameter with a
value of "l":
Example
plot(1:10, type="l")
Line Color
The line color is black by default. To change the color, use the col parameter:
Example
plot(1:10, type="l", col="blue")
Line Width
To change the width of the line, use the lwd parameter (1 is default,
while 0.5 means 50% thinner, and 2 means 100% thicker):
Example
plot(1:10, type="l", lwd=2)
Line Styles
The line is solid by default. Use the lty parameter with a value from 0 to 6 to
specify the line format.
For example, lty=3 will display a dotted line instead of a solid line:
Example
plot(1:10, type="l", lwd=5, lty=3)
Multiple Lines
To display more than one line in a graph, use the plot() function together
with the lines() function:
Example
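line1 <- c(1,2,3,4,5,10)
line2 <- c(2,5,7,8,9,10)
plot(line1, type = "l", col = "blue") # draw the first line
lines(line2, type = "l", col = "red") # add the second line to the same graph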
R Scatter Plot
Scatter Plots
You learned from the Plot chapter that the plot() function is used to plot
numbers against each other.
A "scatter plot" is a type of plot used to display the relationship between two
numerical variables, and plots one dot for each observation.
It needs two vectors of the same length, one for the x-axis (horizontal) and one
for the y-axis (vertical):
Example
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y)
Result:
The observations in the example above represent 12 cars passing by, with
the age of each car on the x-axis and its speed on the y-axis.
That might not be clear for someone who sees the graph for the first time, so
let's add a header and different labels to describe the scatter plot better:
Example
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y, main="Observation of Cars", xlab="Car age", ylab="Car speed")
Result:
It seems that the newer the car, the faster it drives, but that could be a
coincidence. After all, we only registered 12 cars.
Compare Plots
In the example above, there seems to be a relationship between the car
speed and age, but what if we plot the observations from another day as
well? Will the scatter plot tell us something else?
To compare the plot with another plot, use the points() function:
Example
Draw two plots on the same figure:
Result:
Note: To be able to see the difference of the comparison, you must assign
different colors to the plots (by using the col parameter). Red represents the
values of day 1, while blue represents day 2. Note that we have also added
the cex parameter to increase the size of the dots.
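A minimal sketch of such a comparison is shown below. The day 1 values are the ones used earlier in this chapter, while the day 2 values (x2 and y2) are hypothetical numbers invented for this illustration:

```r
# Day 1: age and speed of 12 cars (values from the earlier example)
x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)

# Day 2: hypothetical observations, made up for this sketch
x2 <- c(2,2,8,1,15,8,12,9,7,3,11,4)
y2 <- c(100,105,84,105,90,99,90,95,94,100,79,112)

# Draw day 1 in red; the axis limits cover both days so no point is cut off
plot(x1, y1, main = "Observation of Cars", xlab = "Car age", ylab = "Car speed",
     col = "red", cex = 2,
     xlim = range(c(x1, x2)), ylim = range(c(y1, y2)))

# Add day 2 in blue to the same figure
points(x2, y2, col = "blue", cex = 2)
```

Setting xlim and ylim from both data sets is a design choice: plot() sizes the axes from the first data set only, so points added later with points() can otherwise fall outside the figure.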
R Pie Charts
Pie Charts
Use the pie() function to draw a pie chart:
Example
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart
pie(x)
Result:
Example Explained
As you can see, the pie chart draws one pie for each value in the vector (in
this case 10, 20, 30, 40).
By default, the plotting of the first pie starts from the x-axis and
moves counterclockwise.
Note: The size of each pie is determined by comparing the value with all the
other values: each value is divided by the sum of all values (x/sum(x)).
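As a quick check of how the sizes come about, the share of each pie can be computed directly. The variable name share is an assumption made for this sketch:

```r
x <- c(10, 20, 30, 40)

# Each value divided by the sum of all values
share <- x / sum(x)
share          # 0.1 0.2 0.3 0.4

# The same shares expressed as angles of the circle, in degrees
share * 360    # 36 72 108 144
```

The shares always add up to 1, so the four pies together fill the whole circle.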
Start Angle
You can change the start angle of the pie chart with
the init.angle parameter.
The value of init.angle is defined with angle in degrees, where default angle
is 0.
Example
Start the first pie at 90 degrees:
# Display the pie chart and start the first pie at 90 degrees
pie(x, init.angle = 90)
Result:
Labels and Header
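Use the labels parameter to add a label to each pie, and use
the main parameter to add a header:
Example
# Create a vector of pies
x <- c(10,20,30,40)
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
pie(x, labels = mylabel, main = "Fruits")  # "Fruits" is an illustrative header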
Result:
Colors
You can add a color to each pie with the col parameter:
Example
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors (x is the vector of pies created earlier)
pie(x, col = colors)
Result:
Legend
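To add a list of explanations for each pie, use the legend() function:
Example
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
pie(x, labels = mylabel, col = colors)         # x and colors as created earlier
legend("bottomright", mylabel, fill = colors)  # position chosen for illustration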
R Bar Charts
Bar Charts
Use the barplot() function to draw a bar chart:
Example
# x-axis values
x <- c("A", "B", "C", "D")
# y-axis values
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x)
Result:
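Example Explained
The x variable holds the names shown on the x-axis (A, B, C, D).
The y variable holds the heights of the bars (2, 4, 6, 8).
The barplot() function draws the bars, and names.arg sets the name shown
below each bar.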
Bar Color
Use the col parameter to change the color of the bars:
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, col = "red")
Result:
Bar Width
Use the width parameter to change the width of the bars:
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, width = c(1,2,3,4))  # one illustrative width per bar
Result:
Horizontal Bars
If you want the bars to be displayed horizontally instead of vertically,
use horiz=TRUE:
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, horiz = TRUE)
Result:
R Statistics
Statistics Introduction
Statistics is the science of analyzing, reviewing, and drawing conclusions from data.
R Data Set
Data Set
A data set is a collection of data, often presented in a table.
There is a popular built-in data set in R called "mtcars" (Motor Trend Car
Road Tests), which is retrieved from the 1974 Motor Trend US Magazine.
In the examples below (and for the next chapters), we will use
the mtcars data set, for statistical purposes:
Example
# Print the mtcars data set
mtcars
Result:
Example
# Use the question mark to get information about the data set
?mtcars
Result:
Usage
mtcars
Format
A data frame with 32 observations on 11 (numeric) variables.
[, 1] mpg   Miles/(US) gallon
[, 2] cyl   Number of cylinders
[, 3] disp  Displacement (cu.in.)
[, 4] hp    Gross horsepower
[, 5] drat  Rear axle ratio
[, 6] wt    Weight (1000 lbs)
[, 7] qsec  1/4 mile time
[, 8] vs    Engine (0 = V-shaped, 1 = straight)
[, 9] am    Transmission (0 = automatic, 1 = manual)
[,10] gear  Number of forward gears
[,11] carb  Number of carburetors
Note
Henderson and Velleman (1981) comment in a footnote to Table 1: 'Hocking
[original transcriber]'s noncrucial coding of the Mazda's rotary engine as a
straight six-cylinder engine and the Porsche's flat engine as a V engine, as
well as the inclusion of the diesel Mercedes 240D, have been retained to
enable direct comparisons to be made with previous analyses.'
Source
Henderson and Velleman (1981), Building multiple regression models
interactively. Biometrics, 37, 391-411.
Examples
require(graphics)
pairs(mtcars, main = "mtcars data", gap = 1/4)
coplot(mpg ~ disp | as.factor(cyl), data = mtcars,
       panel = panel.smooth, rows = 1)
## possibly more meaningful, e.g., for summary() or bivariate plots:
mtcars2 <- within(mtcars, {
vs <- factor(vs, labels = c("V", "S"))
am <- factor(am, labels = c("automatic", "manual"))
cyl <- ordered(cyl)
gear <- ordered(gear)
carb <- ordered(carb)
})
summary(mtcars2)
Get Information
Use the dim() function to find the dimensions of the data set, and
the names() function to view the names of the variables:
Example
Data_Cars <- mtcars # create a variable of the mtcars data set for better organization
# Use dim() to find the dimensions of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)
Result:
[1] 32 11
[1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
Use the rownames() function to get the name of each row in the first column,
which is the name of each car:
Example
Data_Cars <- mtcars
rownames(Data_Cars)
Result:
From the examples above, we have found out that the data set
has 32 observations (Mazda RX4, Mazda RX4 Wag, Datsun 710, etc)
and 11 variables (mpg, cyl, disp, etc).
Here is a brief explanation of the variables from the mtcars data set:
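Variable  Description
mpg       Miles/(US) gallon
cyl       Number of cylinders
disp      Displacement
hp        Gross horsepower
drat      Rear axle ratio
wt        Weight (1000 lbs)
qsec      1/4 mile time
vs        Engine (0 = V-shaped, 1 = straight)
am        Transmission (0 = automatic, 1 = manual)
gear      Number of forward gears
carb      Number of carburetors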
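To print all the values of one variable, access the data frame with the $ sign
and the name of the variable (for example cyl, the number of cylinders):
Example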
Data_Cars <- mtcars
Data_Cars$cyl
Result:
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6
8 4
To sort the values, use the sort() function:
Example
Data_Cars <- mtcars
sort(Data_Cars$cyl)
Result:
[1] 4 4 4 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8
8 8
From the examples above, we see that most cars have either 4 or 8 cylinders.
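Instead of reading the counts off the sorted output, you could let R count them. A minimal sketch using the table() function; the variable name cyl_counts is an assumption made for this illustration:

```r
Data_Cars <- mtcars

# Count how many cars have each number of cylinders
cyl_counts <- table(Data_Cars$cyl)
cyl_counts
```

The result shows 11 cars with 4 cylinders, 7 cars with 6 cylinders and 14 cars with 8 cylinders, which matches the sorted list above.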
Analyzing the Data
Now that we have some information about the data set, we can start to
analyze it with some statistical numbers.
For example, we can use the summary() function to get a statistical summary
of the data:
Example
Data_Cars <- mtcars
summary(Data_Cars)
Do not worry if you do not understand the output numbers. You will master
them shortly.
The summary() function returns six statistical numbers for each variable:
Min
First quartile (25th percentile)
Median
Mean
Third quartile (75th percentile)
Max
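The summary() function can also be applied to a single variable, which makes the six numbers easier to pick out. A small sketch, with hp_summary as a variable name chosen for this illustration:

```r
Data_Cars <- mtcars

# Summary of one variable only: horsepower
hp_summary <- summary(Data_Cars$hp)
hp_summary

# The labels R uses for the six numbers
names(hp_summary)
```

The names are "Min.", "1st Qu.", "Median", "Mean", "3rd Qu." and "Max.", in the same order as the list above.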
R Max and Min
Max Min
In the previous chapter, we introduced the mtcars data set. We will continue
to use this data set throughout the next pages.
You learned from the R Math chapter that R has several built-in math
functions. For example, the min() and max() functions can be used to find the
lowest or highest value in a set:
Example
Find the largest and smallest value of the variable hp (horsepower).
max(Data_Cars$hp)
min(Data_Cars$hp)
Result:
[1] 335
[1] 52
Now we know that the largest horsepower value in the set is 335, and the
lowest 52.
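If you want both values at once, the built-in range() function returns the minimum and the maximum together. A short sketch, where hp_range is a variable name chosen for this illustration:

```r
Data_Cars <- mtcars

# range() returns the lowest and the highest value as one vector
hp_range <- range(Data_Cars$hp)
hp_range          # 52 335

# Difference between the highest and the lowest value
diff(hp_range)    # 283
```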
We could take a look at the data set and try to find out which car these two
values belong to:
Observation of cars
                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4            21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag        21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710           22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive       21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout    18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant              18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360           14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D            24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230             22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280             19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C            17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE           16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL           17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC          15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood   10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial    14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128             32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic          30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla       33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona        21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger     15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin          15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28           13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird     19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9            27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2        26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa         30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L       15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino         19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora        15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E           21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
By observing the table, it looks like the largest hp value belongs to a Maserati
Bora, and the lowest belongs to a Honda Civic.
However, it is much easier (and safer) to let R find this out for us.
For example, we can use the which.max() and which.min() functions to find the
index position of the max and min value in the table:
Example
Data_Cars <- mtcars
which.max(Data_Cars$hp)
which.min(Data_Cars$hp)
Result:
[1] 31
[1] 19
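Better still, combine which.max() and which.min() with the rownames() function
to get the names of the cars with the largest and smallest horsepower:
Example
Data_Cars <- mtcars
rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]
Result:
[1] "Maserati Bora"
[1] "Honda Civic"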
Example of data points that could have been outliers in the mtcars data set:
Mean
To calculate the average value (mean) of a variable from the mtcars data set,
find the sum of all values, and divide the sum by the number of values.
Example
Find the average weight (wt) of a car:
mean(Data_Cars$wt)
Result:
[1] 3.21725
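To connect the mean() function with the description above, the same number can be computed by hand. This is only a sketch, and manual_mean is a variable name chosen for this illustration:

```r
Data_Cars <- mtcars

# Sum of all values divided by the number of values
manual_mean <- sum(Data_Cars$wt) / length(Data_Cars$wt)
manual_mean

# Compare with the built-in function
all.equal(manual_mean, mean(Data_Cars$wt))
```

Both approaches give 3.21725, so in practice mean() is simply the shorter way to write it.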
Median
The median value is the value in the middle, after you have sorted all the
values.
If we take a look at the values of the wt variable (from the mtcars data set),
we will see that there are two numbers in the middle:
Note: If there are two numbers in the middle, you must divide the sum of
those numbers by two, to find the median.
Luckily, R has a function that does all of that for you: Just use
the median() function to find the middle value:
Example
Find the mid point value of weight (wt):
median(Data_Cars$wt)
Result:
[1] 3.325
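The manual steps described above can be sketched as follows. The variable names sorted_wt, n and middle are assumptions made for this illustration:

```r
Data_Cars <- mtcars

# Sort the weights and count them
sorted_wt <- sort(Data_Cars$wt)
n <- length(sorted_wt)                      # 32, an even number

# With an even count, the two middle values are number n/2 and n/2 + 1
middle <- sorted_wt[c(n / 2, n / 2 + 1)]
middle                                      # 3.215 3.435

# The median is the sum of the two middle values divided by two
sum(middle) / 2
```

The result agrees with median(Data_Cars$wt). For an odd number of values there is a single middle value, and no division is needed.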
Mode
The mode value is the value that appears most often.
R does not have a built-in function to calculate the statistical mode (the
mode() function in R returns the storage type of an object instead). However,
we can create our own function to find it.
If we take a look at the values of the wt variable (from the mtcars data set),
we will see that the number 3.440 appears often:
Instead of counting it ourselves, we can use the following code to find the
mode:
Example
Data_Cars <- mtcars
names(sort(-table(Data_Cars$wt)))[1]
Result:
[1] "3.44"
From the example above, we now know that the number that appears most often
in the mtcars wt variable is 3.44, that is, 3,440 lbs (wt is given in units
of 1,000 lbs).
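If you need the mode more than once, the logic can be wrapped in a small function. The sketch below is one possible approach; stat_mode is a hypothetical helper name, not a function that ships with R:

```r
# stat_mode is a hypothetical helper, written for this illustration
stat_mode <- function(x) {
  values <- unique(x)                    # distinct values, in order of first appearance
  counts <- tabulate(match(x, values))   # how many times each distinct value occurs
  values[which.max(counts)]              # the first value with the highest count
}

stat_mode(mtcars$wt)
```

Unlike the one-line version above, this helper returns the value in its original type (a number here, not a text string). If several values share the highest count, it returns the one that appears first in the data.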
R Percentiles
Percentiles
Percentiles are used in statistics to give you a number that describes the
value that a given percent of the values are lower than.
If we take a look at the values of the wt (weight) variable from the mtcars data
set:
Observation of wt (weight)
What is the 75th percentile of the weight of the cars? The answer is 3.61, or
3,610 lbs, meaning that 75% of the cars weigh 3,610 lbs or less:
Example
Data_Cars <- mtcars
quantile(Data_Cars$wt, c(0.75))
Result:
75%
3.61
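To check what the percentile means, you could count the share of cars that weigh no more than that value. A minimal sketch, with p75 as a variable name chosen for this illustration:

```r
Data_Cars <- mtcars

# The 75th percentile of the weights
p75 <- quantile(Data_Cars$wt, 0.75)

# Share of cars whose weight is at or below the 75th percentile
mean(Data_Cars$wt <= p75)
```

The comparison produces TRUE or FALSE for each car, and mean() of those values gives the fraction that is TRUE, which is 0.75 here.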
If you run the quantile() function without the second argument (the c() vector
of percentiles), you will get the 0th, 25th, 50th, 75th and 100th percentiles:
Example
Data_Cars <- mtcars
quantile(Data_Cars$wt)
Result:
     0%     25%     50%     75%    100%
1.51300 2.58125 3.32500 3.61000 5.42400
Quartiles
Quartiles are data divided into four parts, when sorted in an ascending order:
1. The value of the first quartile cuts off the first 25% of the data
2. The value of the second quartile cuts off the first 50% of the data
3. The value of the third quartile cuts off the first 75% of the data
4. The value of the fourth quartile cuts off the 100% of the data
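The quartiles in the list above can be read off with the quantile() function. This is a sketch only; the variable name quartiles is an assumption made for this illustration:

```r
Data_Cars <- mtcars

# First, second and third quartile of the weights
quartiles <- quantile(Data_Cars$wt, c(0.25, 0.50, 0.75))
quartiles

# The second quartile is the same value as the median
unname(quartiles[2]) == median(Data_Cars$wt)

# The fourth quartile (100%) is the same value as the maximum
unname(quantile(Data_Cars$wt, 1)) == max(Data_Cars$wt)
```

For the wt variable the three quartiles are 2.58125, 3.325 and 3.61.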