R - scripted data
History
Language
Packages
Tools
RPubs
Slidify
Shiny
A Brief History of R
– 1976 S - Bell Labs; Fortran
– John Chambers
– 1988 S Version 3; C language
● 1991 R Created
– Ross Ihaka and Robert Gentleman
● 1993 R Announced
– 1993 S licensed to StatSci (now Insightful)
● 2000 R Version 1.0.0 released
– 2004 S purchased from Lucent by Insightful ($2 million)
– 2008 TIBCO acquires Insightful ($25 million)
Other “Stats” Tools
● R – additional, commercial support
Oracle: “Big Data Appliance” - R + Hadoop
+ Linux + NoSQL + Exadata(H/W)
IBM: R executing in Hadoop (massively
parallel in-database analytics)
● SAS (SAS Institute) dev. 1966, 1st rel 1972
● SPSS (IBM) 1st rel 1968
Model Development and
Execution Comparison
http://inside-bigdata.com/2014/06/25/revolution-r-enterprise-vs-sas-performance/
Oracle + INTEL Libraries
https://blogs.oracle.com/R/entry/oracle_r_distribution_performance_benchmark
Language
● Derivative of S (S-PLUS)
● Portable (includes PlayStation 3)
● Interpreted, calls into C libraries
● Functional!
● GPL
● 40 year old technology
● Open Source (you want it, you do it)
Data Types
● Symbols refer to objects
● Object attributes
– names
– dimnames
– dimensions
– class
– length
– user defined attributes/metadata
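A minimal sketch of the attributes listed above, using only base R. The variable names and the "units" attribute are illustrative assumptions, not part of the slides.

```r
# 'names' attribute on a numeric vector
x <- c(a = 1, b = 2, c = 3)
names(x)                      # "a" "b" "c"
length(x)                     # 3

# Setting 'dim' turns a plain vector into a matrix; 'dimnames' labels it
m <- 1:6
dim(m) <- c(2, 3)
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2", "c3"))
is.matrix(m)                  # TRUE

# User-defined metadata: "units" is a made-up attribute name
attr(x, "units") <- "cm"
attributes(x)                 # lists both 'names' and 'units'
```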
Data Types
● Object types – single class, except list
– List
(may have mixed classes)
– Vectors
(scalar is a vector of length 1)
– Matrices
(vector with 'dimension' attribute)
(column major order)
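The object types above can be sketched in a few lines of base R. The values are arbitrary examples chosen for illustration.

```r
v <- c(1.5, 2, 3)             # atomic vector: all elements share one class
s <- 42                       # a "scalar" is a vector of length 1
length(s)                     # 1

# c() coerces mixed input to a single common class
mixed <- c(1, "a", TRUE)
class(mixed)                  # "character"

# A list keeps the class of each element
l <- list(1, "a", TRUE)
elem_classes <- sapply(l, class)   # "numeric" "character" "logical"

# A matrix is filled column by column (column major order)
m <- matrix(1:6, nrow = 2)
m[, 1]                        # first column: 1 2
m[1, 2]                       # 3
```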
Data Types
● Object types
– Factors
● Categorical data (like an enumeration)
– Data frames
● Special list, each element has same length
● Elements are columns; their length is the number of rows
● Each element (column) has its own type
● row.names() attribute to name the rows
● Convert to matrix with data.matrix()
● Load with read.table(), read.csv()
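A short sketch of factors and data frames as described above, assuming made-up column names and values. read.csv() is given inline text here only so the example is self-contained; in practice it usually reads a file.

```r
# Factor: categorical data stored as integer codes plus a set of levels
size <- factor(c("low", "high", "low", "mid"),
               levels = c("low", "mid", "high"))
levels(size)                       # "low" "mid" "high"
codes <- as.integer(size)          # 1 3 1 2

# Data frame: a list of equal-length columns, each with its own type
df <- data.frame(id    = 1:3,
                 grade = factor(c("a", "b", "a")),
                 score = c(90.5, 80, 70.25))
row.names(df) <- c("x", "y", "z")
col_classes <- sapply(df, class)   # integer, factor, numeric

# data.matrix() converts every column to numbers (factors become codes)
dm <- data.matrix(df)

# read.csv() with inline text instead of a file path
csv <- read.csv(text = "id,score\n1,2.5\n2,3.5")
```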
Data Types
● Object “atomic” classes
– character
– numeric (double precision real)
– integer
– complex
– logical (booleans)
Numeric includes Inf and NaN (integer has neither)
1 / Inf == 0 !
any class can be NA
is.na(NaN) is TRUE, is.nan(NA) is FALSE
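The atomic classes and the special values can be checked directly at the prompt. This is a small base R sketch; the variable names are illustrative.

```r
# One value of each atomic class
classes <- c(class("a"), class(1.5), class(2L), class(1+2i), class(TRUE))
# "character" "numeric" "integer" "complex" "logical"

# Special double values
inv_inf <- 1 / Inf            # 0
pos_inf <- 1 / 0              # Inf
not_num <- 0 / 0              # NaN

# NA versus NaN
is.na(NaN)                    # TRUE:  NaN counts as missing
is.nan(NA)                    # FALSE: NA is not NaN

# NA exists for every atomic class
class(NA_integer_)            # "integer"
class(NA_character_)          # "character"
```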
Data Types
● Dates
– “Date” class
– Days since epoch (1970-01-01)
● Times
– “POSIXct” or “POSIXlt” class
– Seconds since epoch
● Coerce from string with as.Date(); back to string with format()
● Generic functions include 'weekdays()',
'months()', 'quarters()'
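A minimal sketch of the date and time classes, using arbitrary example dates. weekdays() and months() return locale-dependent names, so their output is not shown.

```r
# Date: stored as days since 1970-01-01
d <- as.Date("2014-06-25")              # parse a string into a Date
day_one <- as.numeric(as.Date("1970-01-02"))   # 1
as_text <- format(d, "%Y-%m-%d")        # back to a string: "2014-06-25"

# POSIXct: seconds since the epoch; POSIXlt: a list of date-time fields
tm <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
secs <- as.numeric(tm)                  # 60
lt <- as.POSIXlt(tm)
lt$min                                  # 1

# Generic helpers work on both dates and times
weekdays(d)
months(d)
qtr <- quarters(d)                      # "Q2"
```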
Operators
● Grouping: ()
● Assignment: to<-from AND from->to
● Vectorized: + - ! * / ^ %% & |
● ~ ? : %/% %*% %o% %x% %in% < > == >=
<= && ||
● Element access: [[]] [] $
● Function argument types:
– symbol, symbol=default, ...
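A few of the operators above in use, as a hedged base R sketch. The function power_of() and all variable names are hypothetical and exist only for this illustration.

```r
x <- c(1, 2, 3, 4)            # assignment, right to left
10 -> y                       # assignment, left to right

doubled <- x * 2              # vectorized: 2 4 6 8
rem     <- x %% 2             # modulo: 1 0 1 0
quot    <- 7 %/% 2            # integer division: 3
mask    <- x > 2 & x < 4      # element-wise AND: FALSE FALSE TRUE FALSE
both    <- (y > 5) && (quot == 3)   # single-value AND: TRUE

m  <- matrix(1:4, nrow = 2)
mm <- m %*% m                 # matrix product
op <- 1:3 %o% 1:2             # outer product, a 3 x 2 matrix
found <- 3 %in% x             # TRUE

l <- list(a = 1, b = "two")
l[["a"]]                      # the element itself: 1
l$b                           # element by name: "two"
l["a"]                        # a sub-list of length 1

# Argument forms: plain symbol, symbol = default, and ...
power_of <- function(n, power = 2, ...) n^power
power_of(3)                   # 9
power_of(3, power = 3)        # 27
```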
Control Structures
● if, else
● for
● while
● repeat
● break, next, return
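The control structures above, sketched with small arbitrary loops. first_negative() is a hypothetical helper used only to show return().

```r
# for + if + next: sum the odd numbers from 1 to 10
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next       # skip even values
  total <- total + i
}
total                         # 25

# repeat + break: loop until a condition is met
n <- 0
repeat {
  n <- n + 1
  if (n >= 3) break
}

# while: halve k until it is no longer greater than 1
k <- 10
while (k > 1) k <- k / 2      # ends at 0.625

# return: leave a function early
first_negative <- function(v) {
  for (e in v) {
    if (e < 0) return(e)
  }
  NULL                        # nothing negative found
}
first_negative(c(3, -1, -5))  # -1
```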
Apply
● apply – apply functions over arrays
● lapply – apply functions over list / vector
● sapply – like lapply, but simplifies the result
● tapply – apply function over ragged array
● mapply – apply function over multiple lists / vectors in parallel
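One call per member of the apply family, as a rough base R sketch on small made-up inputs.

```r
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)          # over rows:    9 12
col_sums <- apply(m, 2, sum)          # over columns: 3 7 11

parts <- list(a = 1:3, b = 4:6)
as_list <- lapply(parts, mean)        # list: a = 2, b = 5
as_vec  <- sapply(parts, mean)        # named vector: a = 2, b = 5

# tapply: groups may have different sizes ("ragged")
values <- c(10, 20, 30, 40, 50)
groups <- c("x", "y", "x", "y", "x")
by_group <- tapply(values, groups, sum)   # x = 90, y = 60

# mapply: walk several vectors in parallel
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)   # 5 7 9
```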
Functions
● Functions are objects
● Functional closure consists of:
– Formal argument list
– Function body (definition)
– Environment
● Each of these can be assigned to
● Assigning to the environment can eliminate
unwanted environment capture
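A minimal sketch of the three parts of a closure and of assigning to each. All function and variable names here (make_counter, add, make_fn, big) are hypothetical.

```r
# The environment part: 'count' lives in the enclosing environment
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()
first  <- counter()           # 1
second <- counter()           # 2

# Formals and body can be inspected and assigned to
add <- function(x, y = 1) x + y
formals(add)$y <- 10          # change the default of y
body(add) <- quote(x * y)     # replace the definition
add(2)                        # 20

# Unwanted capture: the returned function drags 'big' along with it
make_fn <- function() {
  big <- numeric(1e6)
  function(x) x + 1
}
g <- make_fn()
captured <- exists("big", envir = environment(g), inherits = FALSE)  # TRUE

# Reassigning the environment drops the reference to 'big'
environment(g) <- globalenv()
g(1)                          # still 2
```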
Packages
● CRAN (Comprehensive R Archive Network)
– Main site, includes R download
● Bioconductor
– Analysis of genomic data
– Next generation high-throughput
sequencing
● R-forge
● GitHub and Personal repositories
Packages
● Analysis
– Statistical analysis (stats, linprog)
● Linear (and generalized linear) modeling
● Tree models
● Analysis of variance
– Machine learning (caret, kernlab)
● Classification and clustering (random forests, k-means, knn, etc.)
● Training and predictions
● Cross validation and error analysis
Packages
● Graphics
– Base graphics
● Plot: plot, hist, ...
● Annotate: text, lines, points, axis, ...
– Lattice
● Single command: xyplot, bwplot, ...
– ggplot2
● Single command: qplot
● Defining objects: aesthetics, geoms
● Chain commands: ggplot, geom_*, ...
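The base graphics pattern of "plot, then annotate" can be sketched as follows. It uses the built-in cars data set and writes to a temporary PDF (an assumption made so the sketch runs without a display); lattice and ggplot2 are not shown, to keep the example to base R.

```r
# Draw to a temporary PDF so the sketch also runs without a display
out <- tempfile(fileext = ".pdf")
pdf(out)

# Plot: create the figure
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Annotate: add to the existing figure
lines(lowess(cars$speed, cars$dist), col = "blue")
points(mean(cars$speed), mean(cars$dist), pch = 19, col = "red")
text(mean(cars$speed), mean(cars$dist), labels = "mean", pos = 4)

# A second plot goes on the next page of the PDF
hist(cars$dist, main = "Stopping distance")

invisible(dev.off())          # close the device and write the file
```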
Packages
● Data visualization
– rCharts (GitHub), converts visualizations to
Javascript (e.g. d3.js)
http://www.google.com/trends/explore#q=R%20language%2C%20Data%20Visualization%2C%20D3.js%2C%20Processing.js&cmpt=q
Tools
● Command line
● RStudio (can run on remote Linux server)
● RKWard
● R Commander (Tcl/Tk)
● JGR – Java (GUI for R)
● Rattle - RGtk2
Tools
● Debugging
– Print statements!
– Interactive tools:
● traceback() – stack trace on error
● debug() – flags function for stepping
● browser() - stops function and enters debug
● trace() - insert trace statements
● recover() - modify error behavior, can
browse call stack
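Most of the debugging tools above are interactive, so this sketch only flags and unflags a function; the interactive steps are left as comments. The function f() is a hypothetical example.

```r
f <- function(x) {
  y <- x * 2
  y + 1
}

debug(f)                      # flag f: the next call would step through it
flagged <- isdebugged(f)      # TRUE
undebug(f)                    # remove the flag again

f(1)                          # runs normally now: 3

# In an interactive session one might try:
# debug(f); f(1)              # step through f line by line
# options(error = recover)    # pick a frame to browse after an error
# traceback()                 # print the call stack of the last error
```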
Tools
● Profiling
– “We should forget about small efficiencies,
say about 97% of the time: premature
optimization is the root of all evil”
– Donald Knuth
– system.time() - CPU, wall times
– Rprof() - use summaryRprof() to see results
● Do not use Rprof() and system.time()
together
● Calls to C/Fortran libraries not profiled
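A rough sketch of both profiling tools, run one after the other rather than together. slow_sum() is a deliberately slow hypothetical function; on a very fast machine a larger n may be needed before Rprof() records any samples.

```r
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}

# system.time(): user CPU, system CPU, and elapsed (wall) time
timing <- system.time(slow_sum(1e6))
timing[["elapsed"]]

# Rprof(): sample the call stack into a file, then summarize it
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)              # start profiling
invisible(slow_sum(2e7))
Rprof(NULL)                   # stop profiling
prof <- summaryRprof(prof_file)
head(prof$by.self)            # time spent in each function itself
```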
Data Exploration
● Script it!
– If you can't repeat it, it didn't happen
● Get the data (ingest)
– Functions to download, uncompress,
unarchive, store, read, and organize
● Clean the data
– Handle missing and incomplete data,
impute values, identify outliers
Data Exploration
● Look at the data (models, visualization)
– Model – regressions (linear, logistic),
clustering, ANOVA
– Refine models and plot the result
● Look for systematic issues – unexpected
trends, bias, unexplained variance, error
estimates, residual analysis
● Explore complexity – number of explanatory
factors
– Plot the models
● What does it look like?
Reproducible Research
● Allows others to validate the work
● Ensures that the results are accepted
● Reduces the chance of errors propagating
– http://youtu.be/7gYIs7uYbMo
– 2010 Anil Potti resigns from Duke after
research was found flawed (off by 1!)
● Clinical trials based on the flawed research
were finally cancelled
● Closed data, non-reproducible research
exacerbated the problem
Reproducible Research
● Don't do things by hand – especially editing
spreadsheets to “clean up” data (removing
outliers, validating, editing) or downloading
files
● Actions taken by hand need very detailed
documentation to reproduce – such as
download sites and where the files were
downloaded to
● GUIs are convenient, but can't be repeated
Reproducible Research
● Capture the steps in a script:
– download.file("http://...", "localfile.zip")
● Can be repeated as long as the link is
available. Can keep and manage the
downloaded file if that is an issue
– Use version control
● Capture small steps at a time (git is good
for this!)
● Can track changes and revert if needed
● Can use GitHub, Bitbucket, SourceForge to
publish the results as well
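One way to script the download step so that it can be re-run safely is sketched below. fetch_once() is a hypothetical helper, not something from the slides, and the URL is left elided as in the original.

```r
# Download only if the local copy is missing, so re-running the
# script reuses the file that was already fetched
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

# Usage, with the elided URL from the slide:
# fetch_once("http://...", "localfile.zip")
```

Keeping the downloaded file next to the script (or under version control, if it is small) covers the case where the link later disappears.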
Reproducible Research
● Capture environment – OS, tools, versions
● Don't save outputs – regenerate
– Ok to cache results while in use, but don't
store the results, just the code+data that
produced it
– If you keep intermediate files, document
how they were created
● Set random seed
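Setting the seed and recording the environment can both be done from inside the script. This is a small base R sketch; the seed value 42 is arbitrary.

```r
# Same seed, same "random" numbers
set.seed(42)
a <- rnorm(3)
set.seed(42)
b <- rnorm(3)
identical(a, b)               # TRUE

# Record the environment alongside the results
R.version.string              # R version in use
Sys.info()[["sysname"]]       # operating system
sessionInfo()                 # R version, platform, attached packages
```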
Sharing Research
● Rmarkdown – markdown with embedded R
– knitr package executes the R fragments
and embeds the code and results into
markdown, which can convert to HTML or
PDF
– Literate programming!
● Hosted documentation
– RPubs (rpubs.com)
– GitHub gh-pages (github.io)
Sharing Research
● Embedded presentations
– Author using slidify package
– Rmarkdown with embedded R code
– Creates HTML5 presentation slide deck
– Can include inline quizzes
Data Products
● Interactive visualizations
– shiny, shinyapps packages
– RStudio includes interactive display of
shiny applications during development
– Generates bootstrap + HTML5 + javascript
+ d3 application
● Hosted!
– Hosted at shinyapps.io
– Private? Server images available (for
purchase)

R - the language
