R is a programming language for statistical analysis and graphics. It is an open-source language developed by statisticians to allow for easy statistical analysis and visualization of data. The document provides an overview of R, discussing its origins, functionality, uses in data science, and popular packages and IDEs used with R. Examples are given of basic R syntax for vectors, matrices, data frames, plotting, and applying functions to data.
What is R?
• GNU project; an implementation of the S language, which was developed by John Chambers at Bell Labs
• Free software environment for statistical computing and graphics
• Functional programming language, implemented primarily in C and Fortran
4/23/2013 3Confidential | Copyright 2012 Trend Micro Inc.
R Language
• R is a functional programming language
• R is an interpreted language
• R is an object-oriented language
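A minimal sketch of what these three properties look like at the interpreter prompt. The helper names (`square`, `compose`, `add_one_then_square`) are made up for illustration and are not part of R.

```r
# Functional: functions are ordinary values that can be passed and returned.
square <- function(x) x^2
compose <- function(f, g) function(x) g(f(x))   # hypothetical helper
add_one_then_square <- compose(function(x) x + 1, square)
result <- add_one_then_square(2)                # (2 + 1)^2

# Interpreted: each expression is evaluated as soon as it is entered,
# with no separate compile step.
total <- sum(c(1, 2, 3))

# Object-oriented: every value has a class that generic functions dispatch on.
num_class <- class(1:3)
chr_class <- class("hat")
```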
Why Use R
• Statistical analysis on the fly
• Built-in mathematical functions and graphics modules
• FREE! & Open Source!
– http://cran.r-project.org/src/base/
Data Scientists at These Companies Use R
What is your programming language of
choice, R, Python or something else?
“I use R, and occasionally matlab, for data analysis. There is
a large, active and extremely knowledgeable R community at
Google.”
http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/
“Expert knowledge of SAS (With Enterprise
Guide/Miner) required and candidates with
strong knowledge of R will be preferred”
http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook
Commercial Support for R
• In 2007, Revolution Analytics started providing commercial support for Revolution R
– http://www.revolutionanalytics.com/products/revolution-r.php
– http://www.revolutionanalytics.com/why-revolution-r/which-r-is-right-for-me.php
• Oracle Big Data Appliance integrates R, Apache Hadoop, Oracle Enterprise Linux, and a NoSQL database with the Exadata hardware
– http://www.oracle.com/us/products/database/big-data-appliance/overview/index.html
Revolution R
• Free for the Community version
– http://www.revolutionanalytics.com/downloads/
– http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
Benchmark | Base R 2.14.2 64 | Revolution R (1-core) | Revolution R (4-core) | Speedup (4 core)
Matrix Calculation | 17.4 sec | 2.9 sec | 2.0 sec | 7.9x
Matrix Functions | 10.3 sec | 2.0 sec | 1.2 sec | 7.8x
Program Control | 2.7 sec | 2.7 sec | 2.7 sec | Not Appreciable
Web App Development
Shiny makes it super simple for R users like you to turn analyses into interactive web applications that anyone can use
http://www.rstudio.com/shiny/
Lists
• Contain a heterogeneous selection of objects
– e <- list(thing="hat", size="8.25"); e
– l <- list(a=1,b=2,c=3,d=4,e=5,f=6,g=7,h=8,i=9,j=10)
– l$j
– man = list(name="Qoo", height=183); man$name
Factor
• Ordered collection of items used to represent categorical values
• The different values that a factor can take are called levels
• Factors
– phone = factor(c('iphone', 'htc', 'iphone', 'samsung', 'iphone', 'samsung'))
– levels(phone)
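As a rough illustration of how levels relate to the stored values, the sketch below reuses the `phone` factor from the slide. The variable names `phone_levels`, `phone_counts`, and `phone_codes` are made up for this example.

```r
phone <- factor(c("iphone", "htc", "iphone", "samsung", "iphone", "samsung"))

# Levels are the distinct values, sorted alphabetically by default.
phone_levels <- levels(phone)

# table() counts how often each level occurs.
phone_counts <- table(phone)

# Internally a factor stores integer codes that index into the levels.
phone_codes <- as.integer(phone)
```

Because the values are stored as integer codes, a factor with many repeated values is compact, and functions such as `table()` can group by level directly.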
Matrices & Arrays
• Array
– An extension of a vector to more than two dimensions
– a <- array(c(1,2,3,4,5,6,7,8,9,10,11,12), dim=c(2,3,2))
• Matrices
– An extension of a vector to two dimensions (a 2-D array)
– x = c(1,2,3); y = c(4,5,6); rbind(x,y); cbind(x,y)
– x = rbind(c(1,2,3),c(4,5,6)); dim(x)
– x <- matrix(c(1,2,3,4,5,6), nrow=3)
– x <- matrix(c(1,2,3,4,5,6), nrow=3, byrow=TRUE)
– x <- matrix(c(1,2,3,4), nrow=2); y <- matrix(c(5,6), nrow=2); x %*% y
– t(matrix(c(1,2,3,4), nrow=2))
– solve(matrix(c(1,2,3,4), nrow=2))
Data Frame
• Useful way to represent tabular data
• Essentially a matrix with named columns; may also include non-numerical variables
• Example
– df = data.frame(a=c(1,2,3,4,5),b=c(2,3,4,5,6));df
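A small sketch of a data frame that mixes a non-numerical column with a numerical one, as the bullets describe. The column names and values here are made up for illustration.

```r
# Each column is a vector; columns may have different types.
df <- data.frame(
  name  = c("david", "andy", "amy"),   # illustrative values
  score = c(80, 90, 85),
  stringsAsFactors = FALSE             # keep 'name' as character
)

col_types  <- sapply(df, class)        # class of each column
high       <- df[df$score > 82, ]      # filter rows with a logical condition
mean_score <- mean(df$score)           # work on one column as a vector
```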
Function
• Function
– `%myop%` <- function(a, b) {2*a + 2*b}; 1 %myop% 1
– f <- function(x) {return(x^2 + 3)}
– create.vector.of.ones <- function(n) {
    return.vector <- numeric(n)   # preallocate the result
    for (i in seq_len(n)) {
      return.vector[i] <- 1
    }
    return.vector                 # value of the last expression is returned
  }
– create.vector.of.ones(3)
• Control Structures
– if … else …
– repeat, for, while
• Catch errors – tryCatch
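The control structures and `tryCatch` listed above can be sketched as follows. The function names `sign_label` and `safe_log` are hypothetical and exist only for this example.

```r
# if ... else: an if expression returns the value of the branch taken.
sign_label <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

# for: iterate over the elements of a vector.
total <- 0
for (i in 1:5) total <- total + i

# while: loop until the condition becomes FALSE.
n <- 0
while (2^n < 100) n <- n + 1

# repeat: loop forever until an explicit break.
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}

# tryCatch: run an expression and handle warnings or errors it signals.
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) NA_real_,  # e.g. log(-1) signals a warning
    error   = function(e) NA_real_   # e.g. log("a") signals an error
  )
}
```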
Objects and Classes
• All R code manipulates objects.
• Every object in R has a type.
• In assignment statements, R copies the object, not just the reference to the object.
• Attributes
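A short sketch of the copy semantics described above: after an assignment, changing one name does not affect the other. The function `set_first` is a made-up helper.

```r
a <- c(1, 2, 3)
b <- a            # b behaves as an independent copy of a
b[1] <- 100       # modifying b leaves a untouched

# The same holds for function arguments: the caller's object is not changed.
set_first <- function(v) {   # hypothetical helper
  v[1] <- -1
  v
}
changed <- set_first(a)
```

In practice R delays the actual copy until one of the objects is modified, so a plain assignment is cheap. Environments are the main exception: they are passed by reference.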
S3 & S4 Objects
• Many R functions were implemented using S3 methods
• In S version 4 (hence S4), formal classes and methods were introduced that allow
– Methods that dispatch on multiple arguments
– Abstract types
– Inheritance
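For comparison with S4, here is a minimal sketch of the informal S3 style. The class name `student` and the generic `describe` are invented for this example; only `structure()`, `UseMethod()`, and `inherits()` are part of base R.

```r
# An S3 object is an ordinary value with a class attribute.
student <- structure(list(name = "david", score = 80), class = "student")

# A generic function dispatches on the class of its first argument.
describe <- function(x, ...) UseMethod("describe")

# Methods are plain functions named generic.class.
describe.default <- function(x, ...) "some object"
describe.student <- function(x, ...) paste(x$name, "has score", x$score)

described_student <- describe(student)
described_number  <- describe(42)
is_student        <- inherits(student, "student")
```

Nothing checks that a `student` object really has `name` and `score` fields, which is the kind of gap the formal S4 class definitions address.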
OOP with S4
• S4 OOP Example
– setClass("Student", representation(name = "character", score = "numeric"))
– studenta = new("Student", name="david", score=80)
– studentb = new("Student", name="andy", score=90)
– setMethod("show", signature("Student"),
    function(object) {
      cat(object@score + 100)
    })
– setGeneric("getscore", function(object) standardGeneric("getscore"))
– studenta
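The slide declares the generic `getscore` but does not attach a method to it. One way this might be completed is sketched below; the `GradStudent` subclass and its `advisor` slot are assumptions added only to illustrate inheritance.

```r
library(methods)

setClass("Student", representation(name = "character", score = "numeric"))

# Declare the generic, then register a method for the Student class.
setGeneric("getscore", function(object) standardGeneric("getscore"))
setMethod("getscore", signature("Student"), function(object) object@score)

studenta <- new("Student", name = "david", score = 80)
score_a  <- getscore(studenta)

# Inheritance: a subclass reuses the slots and methods of its parent.
# GradStudent and advisor are hypothetical names.
setClass("GradStudent", contains = "Student",
         representation(advisor = "character"))
grad <- new("GradStudent", name = "andy", score = 90, advisor = "unknown")
score_grad <- getscore(grad)
```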
Packages
• A package is a related set of functions, help files, and data files that have been bundled together.
• Basic Commands
– library(rpart)
– CRAN
– Install
– (.packages())
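A minimal sketch of the basic package commands, assuming a standard R installation. The `install.packages()` call is commented out because it needs network access to CRAN.

```r
# Install a package from CRAN (requires network access):
# install.packages("rpart")

# Check whether a package is installed without attaching it.
has_rpart <- requireNamespace("rpart", quietly = TRUE)

# Attach a package so its functions are available by name.
# 'stats' ships with R, so this works on any installation.
library(stats)

# List the packages that are currently attached.
attached <- (.packages())
```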
Packages Used in Machine Learning for Hackers
Apply
• Apply
– Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
– data <- cbind(c(1,2),c(3,4))
– data.rowsum <- apply(data,1,sum)
– data.colsum <- apply(data,2,sum)
– data
Apply
• lapply
– Returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
• sapply
– A user-friendly version and wrapper of lapply that by default returns a vector, matrix, or array where appropriate.
• vapply
– Similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
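The difference between the three functions is easiest to see on the same input. This is a small sketch; the list `words` and the result names are made up.

```r
words <- list(a = "hat", b = "iphone", c = "R")

len_list <- lapply(words, nchar)              # always a list
len_vec  <- sapply(words, nchar)              # simplified to a named vector
len_safe <- vapply(words, nchar, integer(1))  # result type is checked

# On empty input, sapply falls back to a list,
# while vapply still returns the declared type.
empty_s <- sapply(list(), nchar)
empty_v <- vapply(list(), nchar, integer(1))
```

The third argument of `vapply` is a template for each result, here a single integer. If `FUN` returned anything else, `vapply` would stop with an error instead of silently changing the shape of the output.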
File IO
• Save and Load
– x = USPersonalExpenditure
– save(x, file="~/test.RData")
– rm(x)
– load("~/test.RData")
– x
IRIS Dataset
• The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by Sir Ronald Fisher (1936) as an example of discriminant analysis. It is sometimes called Anderson's Iris data set.
– http://en.wikipedia.org/wiki/Iris_flower_data_set
[Images: Iris setosa, Iris versicolor, Iris virginica]
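The data set ships with R as the data frame `iris`, so it can be explored without downloading anything. A brief sketch, with made-up result names:

```r
# iris is provided by the 'datasets' package that is attached by default.
dims    <- dim(iris)              # number of rows and columns
species <- levels(iris$Species)   # the three species, stored as a factor

# Mean petal length per species.
mean_petal <- tapply(iris$Petal.Length, iris$Species, mean)
```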
Performance Tips
• Use Built-in Math Functions
• Use Environments for Lookup Tables
• Use a Database to Query Large Data Sets
• Preallocate Memory
• Monitor How Much Memory You Are Using
• Cleaning Up Objects
• Functions for Big Data Sets
• Parallel Computation with R
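A few of these tips can be sketched in a handful of lines. The variable names are illustrative, and the sketch shows only the pattern, without measuring the speed difference.

```r
n <- 10000

# Preallocate memory: create the result at full size, then fill it in,
# instead of growing a vector one element at a time.
pre <- numeric(n)
for (i in seq_len(n)) pre[i] <- i^2

# Use built-in math functions: the vectorised form avoids the loop entirely.
vec <- seq_len(n)^2

# Use environments for lookup tables: an environment is a hashed
# name-to-value map.
lookup <- new.env(hash = TRUE)
assign("iphone", 1, envir = lookup)
lookup[["htc"]] <- 2
value   <- get("iphone", envir = lookup)
present <- exists("samsung", envir = lookup, inherits = FALSE)

# Monitor memory use, then clean up objects that are no longer needed.
size_bytes <- object.size(pre)
rm(vec_tmp <- NULL, list = character(0))
```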
R for Machine Learning
Getting Help on a Topic
• ?read.delim
– # Access a function's help file
• ??base::delim
– # Search for 'delim' in all help files for functions in 'base'
• help.search("delimited")
– # Search for 'delimited' in all help files
• RSiteSearch("parsing text")
– # Search for the term 'parsing text' on the R site.
Resources (Cont'd)
• Conferences
– useR!
– R in Finance
– R in Insurance
– Others
– Joint Statistical Meetings
– Royal Statistical Society Conference
• Local User Group
– http://blog.revolutionanalytics.com/local-r-groups.html
• Taiwan R User Group
– http://www.facebook.com/Tw.R.User
– http://www.meetup.com/Taiwan-R/