An overview of R: Text Analytics
Ashraf Uddin
PhD Scholar, Dept. of Computer Science
South Asian University, New Delhi
https://sites.google.com/site/ashrafuddininfo/
About R
What is R?
R is a dialect of the S language.
R is a free software programming language and
software environment for statistical computing and graphics
It is widely used among statisticians and data miners for
developing statistical software and data analysis
The source code for the R software environment is written primarily
in C, Fortran, and R
History of R
1991: Created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand
1993: First announcement of R to the public.
1995: R is released as free software.
1997: The R Core Group is formed (containing some people
associated with S-PLUS). The core group controls the source
code for R.
2000: R version 1.0.0 is released.
2014: R version 3.1.2 is released (2014-10-31).
Statistical features of R
provides a wide variety of statistical and graphical techniques:
linear and nonlinear modelling
classical statistical tests
time-series analysis
classification, clustering
Others
easily extensible through functions and extensions
Many of R's standard functions are written in R itself
C, C++, and Fortran code can be linked and called at run time
A strength of R is static graphics
Dynamic and interactive graphics are available through additional
packages
Programming features of R
R is an interpreted language; users typically access it through
a command-line interpreter.
Like other similar languages such as MATLAB, R supports matrix
arithmetic
R supports procedural programming with functions and,
for some functions, object-oriented programming with generic
functions
Features of R continued...
functionality is divided into modular packages
Graphics capabilities are very sophisticated.
Useful for interactive work, but contains a powerful
programming language for developing new tools
Very active and vibrant user community; R-help and R-devel
mailing lists and Stack Overflow
Design of the R System
The R system is divided into 2 conceptual parts:
The “base” R system that you download from CRAN
Everything else.
R functionality is divided into a number of packages
The “base” R system contains, among other things, the base
package which is required to run R and contains the most
fundamental functions.
The other packages contained in the “base” system include utils,
stats, datasets, graphics, grDevices, grid, methods, tools, parallel,
compiler, splines, tcltk, stats4.
There are also other packages: tm, stringr, boot, class, cluster,
codetools, foreign, KernSmooth, lattice, mgcv, nlme, rpart, survival,
MASS, spatial, nnet, Matrix.
Design of the R System continued...
And there are many other packages available:
There are about 4000 packages on CRAN that have been developed
by users and programmers around the world.
Start Working in R
Download & Install R: http://www.r-project.org/
Download & Install RStudio: http://www.rstudio.com/products/rstudio/download/,
Wikipedia
Materials:
Chambers (2008). Software for Data Analysis, Springer. (your textbook)
Chambers (1998). Programming with Data, Springer.
Venables & Ripley (2002). Modern Applied Statistics with S, Springer.
Venables & Ripley (2000). S Programming, Springer.
Pinheiro & Bates (2000). Mixed-Effects Models in S and S-PLUS, Springer.
Murrell (2005). R Graphics, Chapman & Hall/CRC Press.
Springer has a series of books called Use R!.
A longer list of books is at http://www.r-project.org/doc/bib/R-books.html
Course on R: https://www.coursera.org/course/rprog
Data Types and Basic Operations
Objects
R has five basic or “atomic” classes of objects:
1. character
2. numeric (real numbers)
3. integer
4. complex
5. logical (TRUE/FALSE)
The most basic object is a vector
A vector can only contain objects of the same class
BUT: The one exception is a list, which is represented as
a vector but can contain objects of different classes.
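The five atomic classes can be checked at the console with class(). The following is a minimal sketch; the variable names are made up for illustration.

```r
# One vector of each atomic class; class() reports the class of an object.
chr <- c("a", "b")        # character
num <- c(0.5, 1.5)        # numeric (double precision)
int <- c(1L, 2L)          # integer (the L suffix requests integers)
cpx <- c(1+0i, 2+4i)      # complex
lgl <- c(TRUE, FALSE)     # logical

classes <- sapply(list(chr, num, int, cpx, lgl), class)
print(classes)

# A list may hold elements of different classes.
mixed <- list(1, "a", TRUE)
mixed_classes <- sapply(mixed, class)
print(mixed_classes)
```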
Data Types and Basic Operations continued...
Numbers
Numbers in R are generally treated as numeric objects (i.e. double precision real numbers)
There is also a special number Inf which represents infinity; e.g. 1 / 0. Inf can be used
in ordinary calculations; e.g. 1 / Inf is 0
The value NaN represents an undefined value (“not a number”); e.g. 0 / 0.
NaN can also be thought of as a missing value
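The behaviour of Inf and NaN can be tried directly. This short sketch uses illustrative variable names and also shows how is.nan() and is.na() treat NaN.

```r
pos_inf   <- 1 / 0        # Inf
zero      <- 1 / Inf      # 0
undefined <- 0 / 0        # NaN

print(c(pos_inf, zero, undefined))

# is.nan() is TRUE only for NaN; is.na() is TRUE for both NaN and NA.
print(is.nan(undefined))
print(is.na(undefined))
print(is.nan(NA))
```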
Data Types and Basic Operations continued...
Attributes
R objects can have attributes
names, dimnames
dimensions (e.g. matrices, arrays)
class
length
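Attributes can be inspected with attributes() and the accessor functions named above. A small sketch follows; the row, column and element names are arbitrary.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
dimnames(m) <- list(c("r1", "r2"), c("a", "b", "c"))

v <- c(first = 1, second = 2)

print(attributes(m))   # shows dim and dimnames
print(names(v))
print(class(m))
print(length(m))
```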
Data Types and Basic Operations continued...
Entering Input
The <- symbol is the assignment operator.
The grammar of the language determines whether an
expression is complete or not.
The # character indicates a comment. Anything to the right of
the # (including the # itself) is ignored.
> x <- 1
> print(x)
[1] 1
> x
[1] 1
> msg <- "hello"
> x <- ## Incomplete expression
Data Types and Basic Operations continued...
Printing
The : operator is used to create integer sequences.
> x <- 1:20
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[16] 16 17 18 19 20
Data Types and Basic Operations continued...
Creating Vectors
The c() function can be used to create vectors of objects.
The vector() function can also be used to create vectors.
> x <- c(0.5, 0.6) ## numeric
> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29 ## integer
> x <- c(1+0i, 2+4i) ## complex
> x <- vector("numeric", length = 10)
> x
[1] 0 0 0 0 0 0 0 0 0 0
Data Types and Basic Operations continued...
Mixing Objects
What about the following?
When different objects are mixed in a vector, coercion occurs
so that every element in the vector is of the same class.
> y <- c(1.7, "a") ## character
> y <- c(TRUE, 2) ## numeric
> y <- c("a", TRUE) ## character
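Coercion can also be requested explicitly with the as.* functions. The sketch below is illustrative; it shows that a coercion with no sensible result gives NA (with a warning, suppressed here).

```r
x <- 0:3
x_lgl <- as.logical(x)     # 0 becomes FALSE, non-zero becomes TRUE
x_chr <- as.character(x)
print(x_lgl)
print(x_chr)

# "a" has no numeric interpretation, so it becomes NA.
bad <- suppressWarnings(as.numeric(c("1.7", "a")))
print(bad)

# Implicit coercion when classes are mixed inside c().
print(class(c(1.7, "a")))
print(class(c(TRUE, 2)))
```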
Data Types and Basic Operations continued...
Matrices
Matrices are vectors with a dimension attribute. The dimension attribute
is itself an integer vector of length 2 (nrow, ncol)
> m <- matrix(nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)$dim
[1] 2 3
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Data Types and Basic Operations continued...
Matrices: Matrix sum & multiplication
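> m <- matrix(data = c(1, 0, 0, 4, 4, 3), nrow = 2, ncol = 3)
> n <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
> m + n     ## element-wise sum
     [,1] [,2] [,3]
[1,]    2    3    9
[2,]    2    8    9
> m * n     ## element-wise product
     [,1] [,2] [,3]
[1,]    1    0   20
[2,]    0   16   18
> m %*% n   ## matrix product: 2 x 3 times 2 x 3 is not defined
Error in m %*% n : non-conformable arguments
> n <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
> n %*% m   ## 2 x 2 times 2 x 3 gives a 2 x 3 matrix
     [,1] [,2] [,3]
[1,]    1   12   13
[2,]    2   16   20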
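The error above comes from the conformability rule for %*%: the number of columns on the left must equal the number of rows on the right. One way to make two 2 x 3 matrices conformable is to transpose one of them with t(), as in this sketch.

```r
m <- matrix(c(1, 0, 0, 4, 4, 3), nrow = 2, ncol = 3)
n <- matrix(1:6, nrow = 2, ncol = 3)

# %*% needs ncol(left) == nrow(right); here 3 != 2.
conformable <- ncol(m) == nrow(n)
print(conformable)

# t(n) is 3 x 2, so m %*% t(n) is defined and has dimension 2 x 2.
p <- m %*% t(n)
print(dim(p))
print(p)
```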
Data Types and Basic Operations continued...
Lists
Lists are a special type of vector that can contain elements of different
classes. Lists are a very important data type in R and you should get to
know them well.
> x <- list(1, "a", TRUE, 1 + 4i)
> x
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
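List elements are usually extracted with [[ ]] or, for named lists, with $. The following sketch uses made-up element names and shows that single brackets return a sub-list rather than the element itself.

```r
x <- list(id = 1, label = "a", flag = TRUE)

print(x[["label"]])      # the element itself
print(x$flag)            # same idea, by name

sub_class  <- class(x["id"])     # single brackets keep the list wrapper
elem_class <- class(x[["id"]])   # double brackets return the element
print(sub_class)
print(elem_class)
```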
Data Types and Basic Operations continued...
Data Frames
Data frames are used to store tabular data
They are represented as a special type of list where every element of
the list has to have the same length
Each element of the list can be thought of as a column and the length of
each element of the list is the number of rows
Unlike matrices, data frames can store different classes of objects in
each column (just like lists); matrices must have every element be the
same class
Data Types and Basic Operations continued...
Data Frames
Data frames (from a CSV file)
> x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
> data <- read.csv("G:/records.csv")
> cd <- data[data$PY == 2000, ]   # rows where PY equals 2000
> cd <- data[data$PY == 2012, ]   # rows where PY equals 2012 (overwrites cd)
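The CSV example above depends on a local file. A self-contained variant might look like the sketch below, which writes a few hypothetical records to a temporary file first; only the column name PY is taken from the slide, and the title column and values are invented for illustration.

```r
# Hypothetical records; only the column name PY comes from the slide.
records <- data.frame(
  PY    = c(2000, 2005, 2012, 2012),
  title = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
write.csv(records, csv_path, row.names = FALSE)

data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Logical row indexing: keep the rows where the condition is TRUE.
cd2000 <- data[data$PY == 2000, ]
cd2012 <- data[data$PY == 2012, ]

print(nrow(cd2000))
print(cd2012$title)
```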
Reading and Writing Data
Reading Data
There are a few principal functions for reading data into R.
read.table, read.csv, for reading tabular data
readLines, for reading lines of a text file
Reading and Writing Data continued...
Writing Data
There are analogous functions for writing data to files
write.table
writeLines
save
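Each writing function pairs with a reading function. The sketch below round-trips a small data frame and a few text lines through temporary files; the object names and file contents are illustrative.

```r
df <- data.frame(foo = 1:3, bar = c("x", "y", "z"), stringsAsFactors = FALSE)

# write.table / read.table: tabular text
tab_path <- tempfile(fileext = ".txt")
write.table(df, tab_path, row.names = FALSE)
df_back <- read.table(tab_path, header = TRUE, stringsAsFactors = FALSE)

# writeLines / readLines: plain lines of text
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_back <- readLines(txt_path)

# save / load: R objects in binary form; load() restores them by name
rda_path <- tempfile(fileext = ".RData")
save(df, file = rda_path)
rm(df)
load(rda_path)

print(df_back)
print(lines_back)
```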
Reading and Writing Data continued...
Reading Lines of a Text File
> con <- file("foo.txt", "r")
> x <- readLines(con, 10)
> close(con)
> x
[1] "1080" "10-point" "10th" "11-point"
[5] "12-point" "16-point" "18-point" "1st"
[9] "2" "20-point"
## This might take time
> con <- url("http://www.jhsph.edu", "r")
> x <- readLines(con)
> close(con)
> head(x)
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">"
[2] ""
[3] "<html>"
[4] "<head>"
[5] "\t<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8
Functions
Functions are created using the function() directive and are
stored as R objects just like anything else.
Functions can be passed as arguments to other functions
Functions can be nested
The return value of a function is the last expression in the function body to be
evaluated.
f <- function(<arguments>) {
## Do something interesting
}
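The points above (functions as arguments, nested functions, implicit return value) can be illustrated with a short sketch. The names apply_twice and make_adder are hypothetical helpers, not part of base R.

```r
# apply_twice() takes another function as an argument.
apply_twice <- function(f, x) {
  f(f(x))   # the last evaluated expression is the return value
}

# make_adder() defines a function inside a function and returns it.
make_adder <- function(n) {
  function(x) x + n
}

add3 <- make_adder(3)
r1 <- apply_twice(add3, 1)    # (1 + 3) + 3
r2 <- apply_twice(sqrt, 16)   # sqrt(sqrt(16))
print(r1)
print(r2)
```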
Functions continued...
Defining a Function
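average <- function(x = numeric(1)) {
  total <- 0
  # seq_along() is safe for zero-length input, unlike 1:length(x)
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total / length(x)   # the last expression is the return value
}
> m <- c(10, 11, 2)
> average(m)
[1] 7.666667
> average(10)
[1] 10
> average()
[1] 0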
Implementation: Word Frequency
text.files <- list.files(path = "C:/Users/Ashraf/Desktop/txt", full.names = TRUE)
for (fp in text.files) {
  lines <- readLines(con = fp)           # read the text file line by line
  words <- character()
  # extract words from each non-empty line
  for (i in seq_along(lines)) {
    if (lines[i] != "") {
      tokens <- unlist(strsplit(lines[i], " "))
      tokens <- tokens[tokens != ""]     # remove the empty strings
      words <- c(words, tokens)
    }
  }
  # print the word counts of this file, most frequent first
  print(sort(table(words), decreasing = TRUE))
}
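The same word count can be tried without a local folder of text files. The sketch below writes two small made-up documents to a temporary directory and uses a hypothetical helper, word_freq(), that relies on strsplit() being vectorised over lines instead of an explicit loop.

```r
# Two small hypothetical documents written to a temporary directory.
txt_dir <- file.path(tempdir(), "txt_demo")
dir.create(txt_dir, showWarnings = FALSE)
writeLines(c("R is free", "", "R is fun"), file.path(txt_dir, "a.txt"))
writeLines("text mining in R", file.path(txt_dir, "b.txt"))

# word_freq(): split every line on white space, drop empties, count.
word_freq <- function(path) {
  lines <- readLines(path)
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[words != ""]
  sort(table(words), decreasing = TRUE)
}

freqs <- lapply(file.path(txt_dir, c("a.txt", "b.txt")), word_freq)
print(freqs[[1]])
print(freqs[[2]])
```

Splitting on "[[:space:]]+" rather than a single space also handles tabs and repeated blanks; punctuation and letter case are left untouched here, as in the loop version.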
Implementation: POS Tagging
## requires packages tm (for the acq example corpus), NLP and openNLP
library("tm")
library("NLP")
library("openNLP")
## Some text.
data("acq")
s <- as.String(acq[[10]])
## Need sentence and word token annotations.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, list(sent_token_annotator,
word_token_annotator))
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
#pos_tag_annotator
a3 <- annotate(s, pos_tag_annotator, a2)
a3w <- subset(a3, type == "word")
tags <- sapply(a3w$features, "[[", "POS")
show(sprintf("%s/%s", s[a3w], tags))
Implementation: Text Classification
Training data set
Test data set
Data set (Training + Test data set)
Example: Sports, News, Opinion/Reviews
Two basic steps
Representation of text documents (TDM)
Supervised/Unsupervised algorithm
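Splitting a labelled data set into training and test parts can be sketched in base R as follows. The 80/20 ratio, the seed and the toy document table are assumptions made for illustration; the class labels follow the example categories above.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

# Hypothetical labelled documents: 10 per class.
labels <- rep(c("Sports", "News", "Opinion"), each = 10)
docs   <- data.frame(id = seq_along(labels), label = labels,
                     stringsAsFactors = FALSE)

# Draw 80% of the row indices at random for training.
n_train   <- round(0.8 * nrow(docs))
train_idx <- sample(nrow(docs), size = n_train)

train <- docs[train_idx, ]
test  <- docs[-train_idx, ]   # everything not selected for training

print(c(train = nrow(train), test = nrow(test)))
```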
Implementation: Text Classification
Making TDM (Term Document Matrix):
Making Corpus
Clean Corpus (remove punctuation, stop words and extra white space; convert to lower case)
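The cleaning and TDM steps can be sketched in base R to show what the matrix contains. In practice the tm package offers corpus and term-document-matrix tools for this; the version below deliberately avoids it, and the three documents, the tiny stop-word list and the helper clean() are invented for illustration.

```r
docs <- c(d1 = "The match was won by the home team.",
          d2 = "The review praised the new phone!",
          d3 = "Home team loses the match")
stop_words <- c("the", "was", "by")   # tiny hypothetical stop-word list

clean <- function(x) {
  x <- tolower(x)                       # lower case
  x <- gsub("[[:punct:]]", " ", x)      # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)     # collapse white space
  trimws(x)
}

# Tokenise each cleaned document and drop the stop words.
tokens <- lapply(strsplit(clean(docs), " "),
                 function(w) w[!w %in% stop_words])
terms  <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are term counts.
tdm <- vapply(tokens,
              function(w) as.integer(table(factor(w, levels = terms))),
              integer(length(terms)))
dimnames(tdm) <- list(terms, names(docs))
print(tdm)
```

The resulting matrix is the document representation that a supervised or unsupervised algorithm would take as input.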
Your suggestions are highly appreciated.
Thank You
