R Programming Course Notes: Overview and History of R
Contents
Overview and History of R
Coding Standards
Workspace and Files
R Console and Evaluation
R Objects and Data Structures
Missing Values
Sequence of Numbers
Subsetting
Logic
Understanding Data
Split-Apply-Combine Functions
Simulation
Dates and Times
Base Graphics
Reading Tabular Data
Control Structures
Functions
Scoping
Debugging
R Profiler
– 1996 public mailing list created R-help and R-devel
– 1997 R Core Group formed
– 2000 R v1.0.0 released
• R Features
– Syntax similar to S, semantics similar to S, runs on almost any platform, frequent releases
– lean software, functionalities in modular packages, sophisticated graphics capabilities
– useful for interactive work, powerful programming language
– active user community and FREE (4 freedoms)
∗ freedom to run the program
∗ freedom to study how the program works and adapt it
∗ freedom to redistribute copies
∗ freedom to improve the program
• R Drawbacks
– 40 year-old technology
– little built-in support for dynamic/3D graphics
– functionality based on consumer demand
– objects generally stored in physical memory (limited by hardware)
• Design of the R system
– 2 conceptual parts: base R from CRAN vs. everything else
– functionality divided into different packages
∗ base R contains core functionality and fundamental functions
∗ other utility packages included in the base install: utils, stats, datasets, …
∗ Recommended packages: boot, class, KernSmooth, etc.
– 5000+ packages available
Coding Standards
R Console and Evaluation
• implicit coercion
– matrix/vector can only contain one data type, so when attempting to create matrix/vector with
different classes, forced coercion occurs to make every element to same class
∗ least common denominator is the approach used (basically everything is converted to a class
that all values can take, numbers → characters) and no errors generated
∗ coercion occurs to make every element to same class (implicit)
∗ x <- c(NA, 2, "D") will create a vector of character class
• list() = special vector with different classes of elements
– list = vector of objects of different classes
– elements of list use [[]], elements of other vectors use []
• logical vectors = contain values TRUE, FALSE, and NA, values are generated as result of logical condi-
tions comparing two objects/values
• paste(characterVector, collapse = " ") = join together elements of the vector and separating
with the collapse parameter
• paste(vec1, vec2, sep = " ") = join together different vectors and separating with the sep pa-
rameter
– Note: vector recycling applies here too
– LETTERS, letters= predefined vectors for all 26 upper and lower letters
• unique(values) = returns vector with all duplicates removed
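A quick sketch of these functions, with the vectors chosen for illustration (outputs shown as comments):

```r
# collapse joins one vector's elements into a single string
paste(c("a", "b", "c"), collapse = "-")   # "a-b-c"

# sep combines corresponding elements of several vectors
paste(1:3, c("x", "y", "z"), sep = "_")   # "1_x" "2_y" "3_z"

# vector recycling: the shorter vector (1:4) is reused
paste(LETTERS[1:4], 1:4, sep = "")        # "A1" "B2" "C3" "D4"

# duplicates removed, order of first appearance kept
unique(c(1, 2, 2, 3))                     # 1 2 3
```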
# initiate a vector
x <- c(NA, 1, "cx", NA, 2, "dsa")
class(x)
## [1] "character"

# convert to matrix
dim(x) <- c(3, 2)
class(x)
## [1] "matrix"
x
##      [,1] [,2]
## [1,] NA   NA
## [2,] "1"  "2"
## [3,] "cx" "dsa"
Arrays
Factors
• factors are used to represent categorical data (integer vector where each value has a label)
• 2 types: unordered vs ordered
• treated specially by lm(), glm()
• factors are easier to understand because they are self-describing (labels vs. bare 1 and 2)
• factor(c("yes", "no"), levels = c("yes", "no")) = creates factor
– the levels argument can be used to specify the baseline level vs other levels
∗ Note: without explicit specification, R uses alphabetical order
– table(factorVar) = how many of each are in the factor
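A small sketch of factor creation and inspection, with made-up labels:

```r
# a factor is an integer vector with labels; levels control ordering/baseline
f <- factor(c("yes", "no", "yes", "yes"), levels = c("yes", "no"))
levels(f)       # "yes" "no" (explicit order, not alphabetical)
table(f)        # yes: 3, no: 1
as.integer(f)   # 1 2 1 1 -- the integer codes behind the labels
```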
Missing Values
• is.na(), is.nan() = use to test whether each element of a vector is NA or NaN
– Note: cannot compare with NA (using ==) as it is not a value but a placeholder for a quantity that is not available (NA == NA returns NA, not TRUE)
• sum(my_na) = sum of a logical vector (TRUE = 1 and FALSE = 0) is effectively the number of TRUEs
• Removing NA Values
– complete.cases(obj1, obj2) = creates logical vector that is TRUE where all values exist, and FALSE where any is NA
∗ can be used on data frames as well
∗ complete.cases(data.frame) = creates logical vectors indicating which observation/row is
good
∗ data.frame[logicalVector, ] = returns all observations with complete data
• Imputing Missing Values = replacing missing values with estimates (e.g. averages from all other data with similar conditions)
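The missing-value tools above can be sketched on two toy vectors:

```r
x <- c(1, NA, 3, NA, 5)
y <- c("a", "b", NA, "d", "e")

is.na(x)               # TRUE at positions 2 and 4
sum(is.na(x))          # 2 -- logical TRUEs count as 1
complete.cases(x, y)   # TRUE FALSE FALSE FALSE TRUE
x[!is.na(x)]           # 1 3 5 -- drop the NA values
```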
Sequence of Numbers
Subsetting
Vectors
– names(vec) <- c("a", "b", "c") = assign/change names of vector
• identical(obj1, obj2) = returns TRUE if two objects are exactly equal
• all.equal(obj1, obj2) = returns TRUE if two objects are near equal
Lists
Matrices
Partial Matching
Logic
• <, >= = less than, greater or equal to
• == = exact equality
• != = inequality
• A | B = union
• A & B = intersection
• ! = negation
• & or | evaluates every instance/element in vector
• && or || evaluate only first element
– Note: All AND operators are evaluated before OR operators
• isTRUE(condition) = returns TRUE or FALSE of the condition
• xor(arg1, arg2) = exclusive OR, TRUE when exactly one of the two arguments is TRUE
• which(condition) = find the indices of elements that satisfy the condition (TRUE)
• any(condition) = TRUE if one or more of the elements in logical vector is TRUE
• all(condition) = TRUE if all of the elements in logical vector is TRUE
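These operators can be sketched on a small vector:

```r
v <- c(2, 9, 4, 7)
v > 5              # element-wise: FALSE TRUE FALSE TRUE
which(v > 5)       # 2 4 -- indices of the TRUEs
any(v > 8)         # TRUE (at least one element qualifies)
all(v > 1)         # TRUE (every element qualifies)
xor(TRUE, TRUE)    # FALSE -- exclusive OR needs exactly one TRUE

c(TRUE, FALSE) & c(TRUE, TRUE)   # & checks every element: TRUE FALSE
TRUE && FALSE                    # && checks only the first elements: FALSE
```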
Understanding Data
Split-Apply-Combine Functions
• loop functions = convenient ways of implementing the Split-Apply-Combine strategy for data analysis
split()
apply()
– … = other arguments that need to be passed to other functions
• examples
– apply(x, 1, sum) or apply(x, 1, mean) = find row sums/means
– apply(x, 2, sum) or apply(x, 2, mean) = find column sums/means
– apply(x, 1, quantile, probs = c(0.25, 0.75)) = find the 25th and 75th percentiles of each row
– a <- array(rnorm(2*2*10), c(2, 2, 10)) = creates an array of ten 2x2 matrices
– apply(a, c(1, 2), mean) = collapses the third dimension, returning the 2x2 matrix of means across the 10 matrices
lapply()
• loops over a list, evaluates a function on each element, and always returns a list
– Note: since input must be a list, it is possible that conversion may be needed
• lapply(x, FUN, ...) = takes list/vector as input, applies a function to each element of the list,
returns a list of the same length
– x = list (if not a list, it will be coerced into one through as.list(); if not possible → error)
∗ data.frame are treated as collections of lists and can be used here
– FUN = function (without parentheses)
∗ anonymous functions are acceptable here as well (e.g. function(x) x[, 1])
– … = other/additional arguments to be passed for FUN (i.e. min, max for runif())
• example
– lapply(data.frame, class) = the data.frame is a list of vectors, so the class of each vector is returned in a list (the function name, class, is given without parentheses)
– lapply(values, function(elem) elem[2]) = example of an anonymous function
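A minimal sketch of lapply() on a made-up list (names are preserved in the result):

```r
x <- list(a = 1:5, b = c(10, 20), d = c(2, 4, 6))

lapply(x, mean)                     # list: a = 3, b = 15, d = 4
lapply(x, function(elem) elem[1])   # anonymous function: first element of each
sapply(x, length)                   # simplified to a named vector: 5 2 3
```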
sapply()
vapply()
• safer version of sapply() in that it allows you to specify the format of the result
– vapply(flags, class, character(1)) = returns the class of values in the flags variable in
the form of character of length 1 (1 value)
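The flags dataset above comes from the course exercises; the same idea on a made-up list:

```r
# vapply() errors if any result doesn't match the declared template
x <- list(a = 1:5, b = "text", d = TRUE)
vapply(x, class, character(1))   # a: "integer", b: "character", d: "logical"
```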
tapply()
• split data into groups, and apply the function to data within each subgroup
• tapply(data, INDEX, FUN, ..., simplify = TRUE) = apply a function over subsets of a vector
– data = vector
– INDEX = factor/list of factors
– FUN = function
– … = arguments to be passed to function
– simplify = whether to simplify the result
• example
– x <- c(rnorm(10), runif(10), rnorm(10, 1))
– f <- gl(3, 10); tapply(x, f, mean) = returns the mean of each group (f level) of x data
mapply()
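The list below appears to be the output of the classic mapply() example; a sketch of the call the notes seem to have omitted:

```r
# mapply() is a multivariate lapply: it applies FUN to the first elements of
# each argument, then the seconds, and so on -- here a shortcut for
# list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
mapply(rep, 1:4, 4:1)
```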
## [[1]]
## [1] 1 1 1 1
##
## [[2]]
## [1] 2 2 2
##
## [[3]]
## [1] 3 3
##
## [[4]]
## [1] 4
aggregate()
• aggregate computes summary statistics of data subsets (similar to multiple tapply at the same time)
• aggregate(list(name = dataToCompute), list(name = factorVar1, name = factorVar2), function, na.rm = TRUE)
– dataToCompute = this is what the function will be applied on
– factorVar1, factorVar2 = factor variables to split the data by
– Note: order matters here in terms of how to break down the data
– function = what is applied to the subsets of data, can be sum/mean/median/etc
– na.rm = TRUE → removes NA values
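A sketch on a made-up data frame, naming the output columns through the lists:

```r
df <- data.frame(value = c(1, 2, 3, 4, 5, 6),
                 group = c("a", "a", "b", "b", "c", "c"))

aggregate(list(avg = df$value), list(group = df$group), mean)
##   group avg
## 1     a 1.5
## 2     b 3.5
## 3     c 5.5
```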
Simulation
– sample(10) = a random permutation of the integers 1 to 10 (sampling without replacement)
• Each probability distribution usually has 4 functions associated with it:
– r*** function (for “random”) → random number generation (ex. rnorm)
– d*** function (for “density”) → calculate density (ex. dunif)
– p*** function (for “probability”) → cumulative distribution (ex. ppois)
– q*** function (for “quantile”) → quantile function (ex. qbinom)
• If Φ is the cumulative distribution function for a standard Normal distribution, then pnorm(q) = Φ(q) and qnorm(p) = Φ−1(p).
• set.seed() = sets the seed for the random number generator to ensure that the same data/analysis can be reproduced
Simulation Examples
• rbinom(1, size = 100, prob = 0.7) = returns a binomial random variable that represents the number of successes in a given number of independent trials
– 1 = corresponds number of observations
– size = 100 = corresponds with the number of independent trials that culminate to each resultant
observation
– prob = 0.7 = probability of success
• rnorm(n, mean = m, sd = s) = generate n random samples from a normal distribution (defaults mean = 0, sd = 1, i.e. the standard normal)
– rnorm(1000) = 1000 draws from the standard normal distribution
– n = number of observation generated
– mean = m = specified mean of distribution
– sd = s = specified standard deviation of distribution
• dnorm(x, mean = 0, sd = 1, log = FALSE)
– log = evaluate on log scale
• pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
– lower.tail = left side, FALSE = right
• qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
– lower.tail = left side, FALSE = right
• rpois(n, lambda) = generate random samples from the Poisson distribution
– n = number of observations generated
– lambda = the λ parameter (rate) for the Poisson distribution
• ppois(q, lambda) = cumulative distribution
– ppois(2, 2) = Pr(X ≤ 2) for a Poisson with rate λ = 2
• replicate(n, rpois()) = repeat operation n times
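A sketch putting these pieces together (the seed value is arbitrary):

```r
set.seed(10)                         # make the draws reproducible
rnorm(3)                             # 3 draws from the standard normal
rbinom(1, size = 100, prob = 0.7)    # successes in 100 independent trials
ppois(2, lambda = 2)                 # Pr(X <= 2) for Poisson(2), about 0.677
replicate(5, rpois(1, 4))            # repeat the single draw 5 times
```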
Generate Numbers for a Linear Model
• Linear model
set.seed(20)
x <- rnorm(100)            # continuous predictor (normal)
# x <- rbinom(100, 1, 0.5) # alternative: binary predictor (binomial)
e <- rnorm(100, 0, 2)      # noise term
y <- 0.5 + 2 * x + e       # beta0 = 0.5, beta1 = 2
• Poisson model
Y ∼ Poisson(µ), log(µ) = β0 + β1x where β0 = 0.5, β1 = 0.3
x <- rnorm(100)
log.mu <- 0.5 + 0.3 * x
y <- rpois(100, exp(log.mu))
Base Graphics
Reading Tabular Data
• read.table(), read.csv() = most common, read text files (rows, col) return data frame
• readLines() = read lines of text, returns character vector
• source(file) = read R code
• dget() = read R code files (R objects that have been deparsed with dput())
• load(), unserialize() = read binary objects
• writing data
– write.table(), writeLines(), dump(), dput(), save(), serialize()
• read.table() arguments:
– file = name of file/connection
– header = indicator if file contains header
– sep = string indicating how columns are separated
– colClasses = character vector indicating what each column is in terms of class
– nrows = number of rows in dataset
– comment.char = char indicating beginning of comment
– skip = number of lines to skip in the beginning
– stringsAsFactors = should character columns be coded as factors (defaulted to TRUE before R 4.0, FALSE since)
• read.table can be used without any other argument to create data.frame
– telling R what type of variables are in each column is helpful for larger datasets (efficiency)
– read.csv() = read.table() except the default sep is a comma and header = TRUE (read.table defaults are sep = "" and header = FALSE)
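A self-contained sketch: write a small CSV to a temporary file, then read it back with explicit column classes (the file contents are made up):

```r
path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,90", "2,85"), path)

# header = TRUE is the read.csv default; colClasses skips class guessing
df <- read.csv(path, colClasses = c("integer", "numeric"))
str(df)   # 2 obs. of 2 variables: id (int), score (num)
```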
Larger Tables
Interfaces to the Outside World
Control Structures
• Common structures are
– if, else = testing a condition
– for = execute a loop a fixed number of times
– while = execute a loop while a condition is true
– repeat = execute an infinite loop
– break = break the execution of a loop
– next = skip an iteration of a loop
– return = exit a function
• Note: Control structures are primarily useful for writing programs; for command-line interactive work,
the apply functions are more useful
if - else
# basic structure
if(<condition>) {
    ## do something
} else {
    ## do something else
}

# if tree
if(<condition1>) {
    ## do something
} else if(<condition2>) {
    ## do something different
} else {
    ## do something different
}
• y <- if(x>3){10} else {0} = slightly different implementation than normal, focus on assigning
value
for
# basic structure
for(i in 1:10) {
    # print(i)
}
while
count <- 0
while(count < 10) {
    # print(count)
    count <- count + 1
}
x0 <- 1
tol <- 1e-8
repeat {
    x1 <- computeEstimate()
    if(abs(x1 - x0) < tol) {
        break
    } else {
        x0 <- x1 # requires algorithm to converge
    }
}
• Note: The above loop is a bit dangerous because there’s no guarantee it will stop
– Better to set a hard limit on the number of iterations (e.g. using a for loop) and then report
whether convergence was achieved or not.
next and return
for(i in 1:100) {
    if(i <= 20) {
        ## Skip the first 20 iterations
        next
    }
    ## Do something here
}
Functions
• name <- function(arg1, arg2, …){ }
– inputs can be specified with default values by arg1 = 10
– it is possible to set an argument's default to NULL
– returns last expression of function
– many functions have na.rm, can be set to TRUE to remove NA values from calculation
• structure
f <- function(<arguments>) {
## Do something interesting
}
• functions are first-class objects and can be treated like other objects (e.g. passed into other functions)
– functions can be nested, so that you can define a function inside of another function
• functions have named arguments (e.g. x = mydata), which may have default values
– sd(x = mydata) (matching by name)
• formal arguments = arguments included in the functional definition
– formals() = returns all formal arguments
– not every function call specifies all arguments; some can be missing and may have default values
– args() = return all arguments you can specify
– arguments can be passed in arbitrary order (R performs positional matching), but this is not recommended
– argument matching order: exact → partial → positional
∗ partial = instead of typing data = x, use d = x
• Lazy Evaluation
– R will evaluate as needed, so everything executes until an error occurs
∗ f <- function (a, b) {a^2}
∗ if b is not used in the function, calling f(5) will not produce an error
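The lazy-evaluation point can be sketched directly:

```r
# arguments are only evaluated when actually used
f <- function(a, b) {
    a^2    # b is never touched inside the body
}
f(5)       # 25 -- the missing b causes no error
```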
• ... argument
– used to extend other functions by representing the rest of the arguments
– generic functions use ... to pass extra arguments (i.e. mean = 1, sd = 2)
– necessary when the number of arguments passed can not be known in advance
∗ functions that use ... = paste(), cat()
– Note: arguments coming after ... must be explicitly matched and cannot be partially matched
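A sketch of forwarding with ...; shout() and its punct argument are hypothetical names for illustration:

```r
# extra arguments pass straight through to paste()
shout <- function(..., punct = "!") {
    paste(..., punct, sep = "")
}

shout("hey", "ho")              # "heyho!"
shout("hey", "ho", punct = "?") # "heyho?" -- punct, coming after ...,
                                # must be matched by its full name
```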
Scoping
• scoping rules determine how a value is associated with a free variable in a function
• free variables = variables not explicitly defined in the function (not arguments, or local variables -
variable defined in the function)
• R uses lexical/static scoping
– common alternative = dynamic scoping
– lexical scoping = values of free vars are searched in the environment in which the function is
defined
∗ environment = collection of symbol/value pairs (x = 3.14)
· each package has its own environment
· only environment without parent environment is the empty environment
– closure/function closure = function + associated environment
• search order for free variable
1. environment where the function is defined
2. parent environment
3. … (repeat if multiple parent environments)
4. top level environment: global environment (workspace) or the namespace of a package
5. empty environment → produce error
• when a function/variable is called, R searches through the following list to match the first result
1. .GlobalEnv
2. package:stats
3. package:graphics
4. package:grDevices
5. package:utils
6. package:datasets
7. package:methods
8. Autoloads
9. package:base
• order matters
– .GlobalEnv = everything defined in the current workspace
– any package that gets loaded with library() gets put in position 2 of the above search list
– namespaces are separate for functions and non-functions
∗ possible for object c and function c to coexist
Scoping Example
## [1] 27
square(3) # defines x = 3
## [1] 9
## [1] 3
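The stray outputs above (27, 9, 3) match the classic make.power closure example from the lectures; a sketch reconstructing the code the notes appear to have lost:

```r
make.power <- function(n) {
    pow <- function(x) {
        x^n    # n is a free variable, found in the defining environment
    }
    pow
}

cube <- make.power(3)
square <- make.power(2)
cube(3)                        # 27
square(3)                      # 9
get("n", environment(cube))    # 3 -- the value captured in cube's closure
```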
y <- 10

f <- function(x) {
    y <- 2
    y^2 + g(x)
}

g <- function(x) {
    x*y
}
• Lexical Scoping
1. f(3) → calls g(x)
2. y isn’t defined locally in g(x) → searches in parent environment (working environment/global
workspace)
3. finds y → y = 10
• Dynamic Scoping
1. f(3) → calls g(x)
2. y isn’t defined locally in g(x) → searches in calling environment (f function)
3. find y → y <- 2
– parent frame = refers to calling environment in R, environment from which the function was
called
• Note: when the defining environment and the calling environment are the same, lexical and dynamic scoping produce the same result
• Consequences of Lexical Scoping
– all objects must be carried in memory
– all functions carry pointer to their defining environment (memory address)
Optimization
• optimization routines in R (optim, nlm, optimize) require you to pass a function whose argument is
a vector of parameters
– Note: these functions minimize, so use the negative constructs to maximize a normal likelihood
• constructor functions = functions that build the objective functions to be fed into the optimization routines
• example
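The printed closure below is the output of such a constructor; a sketch of its definition, following the standard negative log-likelihood example (normals is assumed to be data simulated with rnorm, and the seed is arbitrary):

```r
make.NegLogLik <- function(data, fixed = c(FALSE, FALSE)) {
    params <- fixed
    function(p) {
        params[!fixed] <- p               # fill in only the free parameters
        mu <- params[1]
        sigma <- params[2]
        a <- -0.5*length(data)*log(2*pi*sigma^2)
        b <- -0.5*sum((data-mu)^2) / (sigma^2)
        -(a + b)                          # negative log-likelihood
    }
}

set.seed(1)
normals <- rnorm(100, 1, 2)
nLL <- make.NegLogLik(normals)
```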
## function(p) {
## params[!fixed] <- p
## mu <- params[1]
## sigma <- params[2]
## a <- -0.5*length(data)*log(2*pi*sigma^2)
## b <- -0.5*sum((data-mu)^2) / (sigma^2)
## -(a + b)
## }
## <bytecode: 0x7fd9698ff5a8>
## <environment: 0x7fd9690721e0>
# Estimating Parameters
optim(c(mu = 0, sigma = 1), nLL)$par
## mu sigma
## 1.218239 1.787343
# Fixing sigma = 2
nLL <- make.NegLogLik(normals, c(FALSE, 2))
optimize(nLL, c(-1, 3))$minimum
## [1] 1.217775
# Fixing mu = 1
nLL <- make.NegLogLik(normals, c(1, FALSE))
optimize(nLL, c(1e-6, 10))$minimum
## [1] 1.800596
Debugging
R Profiler
# system.time example
system.time({
    n <- 1000
    r <- numeric(n)
    for (i in 1:n) {
        x <- rnorm(n)
        r[i] <- mean(x)
    }
})
• system.time(expression)
– takes an R expression, returns the amount of time needed to evaluate it (useful when you already know where the bottleneck is)
– computes time (in seconds); if an error occurs, gives the time until the error
– can wrap multiple lines of code with {}
– returns object of class proc_time
∗ user time = CPU time charged to the R process
∗ elapsed time = "wall clock" time the user experiences
∗ usually close for standard computations
· elapsed > user = CPU waits on other processes in the background (e.g. reading a web page)
· elapsed < user = multiple processors/cores in use (e.g. multi-threaded libraries)
– Note: base R doesn’t multi-thread (perform multiple calculations at the same time)
∗ optimized Basic Linear Algebra Subprograms (BLAS) libraries do, e.g. for prediction, regression, and matrix routines
∗ e.g. vecLib/Accelerate, ATLAS, ACML, MKL
• Rprof() = useful for complex code only
– keeps track of functional call stack at regular intervals and tabulates how much time is spent in
each function
– default sampling interval = 0.02 second
– calling Rprof() generates Rprof.out file by default
∗ Rprof("output.out") = specify the output file
– Note: should NOT be used with system.time()
• summaryRprof() = summarizes Rprof() output, 2 methods for normalizing data
– loads the Rprof.out file by default, can specify output file summaryRprof("output.out")
– by.total = divide time spent in each function by total run time
– by.self = first subtracts out time spent in functions above in call stack, and calculates ratio to
total
– $sample.interval = 0.02 → interval
– $sampling.time = 7.41 → seconds, elapsed time
∗ Note: by.total often attributes all time to a top-level function (e.g. lm()), but that function simply calls helper functions to do the work, so top-level times are not very informative
– Note: by.self = more useful as it focuses on each individual call/function
– Note: R must be compiled with profiler support (generally the case)
• good to break code into functions so profilers can give useful information about where time is spent
• C/FORTRAN code is not profiled
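A sketch of a complete profiling session (the output file name and the profiled loop are made up for illustration):

```r
Rprof("profile.out")            # start sampling the call stack
invisible(replicate(50, {
    x <- rnorm(1e4)
    sort(x)
}))
Rprof(NULL)                     # stop profiling

# by.self subtracts time spent in callees, so it points at the real hot spots
summaryRprof("profile.out")$by.self
```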
Miscellaneous