CHAPTER 02
INTRODUCTION TO R
17-12-2024
R is a programming language and environment specifically designed for statistical
computing and data analysis.
It is case sensitive.
● It is widely used for data visualization, statistical modeling, machine learning,
and scientific research.
● It includes powerful libraries for statistical techniques.
● R first appeared in 1993 and was released as open-source software in 1995.
● R was inspired by the S programming language.
● R was created by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand.
WHY WAS R DEVELOPED?
● Improvement Over S Language
● Ease of Use for Statisticians
● Reproducibility of Research
● Extensibility
● Academic and Scientific Focus
BASIC MATHS
> 1+1
[1] 2
> 3*6
[1] 18
> 9/2
[1] 4.5
> 8*2-9
[1] 7
> 5^2
[1] 25
> "monica"
[1] "monica"
VARIABLES
DECLARATION
A value can be assigned to a variable in three ways: with <-, with = or with ->.
> x<-90
> y=8
> 15->z
> sum(x,y,z)
[1] 113
assign() assigns a value to a variable whose name is given as a string.
> assign("q",40)
To know the datatype of a variable we use
● class(__)
Variable names can contain any combination of alphanumeric characters, periods (.)
and underscores (_).
A name cannot start with a number or an underscore.
> 8num=9
Error: unexpected symbol in "8num"
> _num=1
Error: unexpected symbol in "_num"
> num1=30
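The naming rules, assign() and class() can be tried together in a short sketch. The variable names used here (student.count, total_marks, .hidden) are made up for illustration and are not part of the notes above.

```r
# Valid names: letters, digits, periods and underscores,
# starting with a letter (or with a period that is not followed by a digit).
student.count <- 5
total_marks <- 450
.hidden <- TRUE

# assign() creates a variable from a name given as a string;
# get() reads it back in the same way.
assign("q", 40)
print(get("q"))               # [1] 40

# class() reports the data type of each value.
print(class(student.count))   # [1] "numeric"
print(class("monica"))        # [1] "character"
print(class(.hidden))         # [1] "logical"
```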
DATATYPES
4 TYPES
● Numeric (int,float)
● Character (strings)
● Date/POSIXct (time based)
● Logical (TRUE/FALSE)
Numeric data
● The most commonly used numeric data type is numeric.
● This is similar to a float or double in other languages.
● It handles integers and decimals, both positive and negative, and of course,
zero.
● A numeric value stored in a variable is automatically assumed to be numeric.
● Testing whether a variable is numeric is done with the function is.numeric.
> is.numeric(num1)
[1] TRUE
Integer
● As the name implies this is for whole numbers only, no decimals.
● To assign an integer to a variable it is necessary to append a capital L to the
value, for example 5L.
● Without the L, a whole number is stored as numeric.
● Numeric is used when decimals are needed.
> j<-3L
> class(j)
[1] "integer"
> j=3
> class(j)
[1] "numeric"
> is.numeric(j)
[1] TRUE
> is.integer(j)
[1] FALSE
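The difference between integer and numeric can be explored a little further. The sketch below uses only base R functions; the variable names k and m are illustrative.

```r
j <- 3L
k <- 3

print(is.integer(j))     # [1] TRUE
print(is.integer(k))     # [1] FALSE, k is stored as a double

# as.integer() converts a numeric value to integer; the decimal part is dropped.
m <- as.integer(7.9)
print(m)                 # [1] 7
print(class(m))          # [1] "integer"

# Dividing two integers gives a numeric result.
print(5L / 2L)           # [1] 2.5
print(class(5L / 2L))    # [1] "numeric"
```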
TO REMOVE A VARIABLE
rm(x) removes the variable x (any variable, not only integers).
CHARACTER DATA TYPE
● Character
● Factor
> x="monica"
> x
[1] "monica"
> class(x)
[1] "character"
> y=factor("welcome")
> y
[1] welcome
Levels: welcome
Length of character/number
nchar() returns the number of characters in a string, or the number of digits when
it is given a number.
> nchar(x)
[1] 6
DATE
● R has numerous different types of dates. The most useful are Date and POSIXct.
● Date stores just a date, in the format yyyy-mm-dd.
● POSIXct stores a date and time.
● Both objects are actually represented as numbers: the number of days (Date) or
seconds (POSIXct) since 1970-01-01.
Example
> date=as.Date("2024-12-19")
> date
[1] "2024-12-19"
> class(date)
[1] "Date"
as.numeric gives the number of days from 1970-01-01 to the given date
> as.numeric(date)
[1] 20076
POSIXct
> date1=as.POSIXct("2024-12-19 8:28")
> date1
[1] "2024-12-19 08:28:00 IST"
> class(date1)
[1] "POSIXct" "POSIXt"
> date3=as.Date("2024-12-19 3:30")
> date3
[1] "2024-12-19"
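The difference between Date and POSIXct shows up when they are converted to numbers or used in arithmetic. This is a minimal sketch; the time zone is fixed to "UTC" here as an assumption so that the result does not depend on the computer's settings.

```r
d <- as.Date("2024-12-19")
p <- as.POSIXct("2024-12-19 08:28:00", tz = "UTC")

print(as.numeric(d))   # days since 1970-01-01
print(as.numeric(p))   # seconds since 1970-01-01 00:00:00 UTC

# Adding 1 moves a Date by one day and a POSIXct by one second.
print(d + 1)           # [1] "2024-12-20"
print(p + 1)           # [1] "2024-12-19 08:28:01 UTC"

# Subtracting two dates gives the number of days between them.
gap <- as.numeric(as.Date("2024-12-25") - d)
print(gap)             # [1] 6
```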
LOGICAL
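Logical values are TRUE and FALSE. is.logical tests whether a value is logical.
In arithmetic, TRUE counts as 1 and FALSE counts as 0.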
> TRUE*4
[1] 4
> FALSE*3
[1] 0
> Y=TRUE*29
> Y
[1] 29
> is.logical(Y)
[1] FALSE
is.logical(Y) is FALSE because TRUE*29 is the numeric value 29, not a logical value.
> class(Y)
[1] "numeric"
> 8==3
[1] FALSE
> 8>9
[1] FALSE
> 9>12
[1] FALSE
> 12>6
[1] TRUE
> 5>=6
[1] FALSE
> 7<=4
[1] FALSE
> 7<=9
[1] TRUE
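The link between logical values and numbers can be checked directly. A small sketch using base R only; the name flag is illustrative.

```r
Y <- TRUE * 29
print(is.logical(Y))         # [1] FALSE, Y holds the number 29

flag <- 12 > 6               # the result of a comparison is logical
print(is.logical(flag))      # [1] TRUE
print(as.numeric(flag))      # [1] 1
print(as.logical(0))         # [1] FALSE
print(TRUE + TRUE + FALSE)   # [1] 2
```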
DATA STRUCTURES
A vector is similar to a one-dimensional array in other languages.
A vector is a collection of elements of the same type.
c(elements)
● Elements should be of a common type.
● Vectors cannot be of mixed type. If values of different types are combined, c()
converts them to one common type.
● Vectors do not have dimensions in the R programming language.
● The most common way to create a vector is c().
● No for loop or range function is needed.
Example
> a=c(2,4,6,8)
> a
[1] 2 4 6 8
No loop is needed, just mention the range
> b=1:10
> b
[1] 1 2 3 4 5 6 7 8 9 10
VECTOR OPERATIONS
Arithmetic operators are applied to each element of a vector. For example, the
multiplication operator (*) below multiplies each element by 3.
> s=1:10
> s
[1] 1 2 3 4 5 6 7 8 9 10
> s*3
[1] 3 6 9 12 15 18 21 24 27 30
> s+2
[1] 3 4 5 6 7 8 9 10 11 12
> s-1
[1] 0 1 2 3 4 5 6 7 8 9
> s/3
[1] 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667 2.0000000 2.3333333
[8] 2.6666667 3.0000000 3.3333333
> s^5
[1] 1 32 243 1024 3125 7776 16807 32768 59049 100000
A range can also start from a negative number, for example -5:4.
A range can be in decreasing order as well as increasing order, for example 5:1.
VECTOR ADDITION
The two vectors should be of the same length. If they are not, the elements of the
shorter vector are reused (recycled), with a warning when the longer length is not a
multiple of the shorter length.
> x=-5:4
> y=1:10
> x
[1] -5 -4 -3 -2 -1 0 1 2 3 4
> y
[1] 1 2 3 4 5 6 7 8 9 10
> x+y
[1] -4 -2 0 2 4 6 8 10 12 14
> m=1:5
> n=0:2
> m+n
[1] 1 3 5 4 6
Warning message:
In m + n : longer object length is not a multiple of shorter object length
> x-y
[1] -6 -6 -6 -6 -6 -6 -6 -6 -6 -6
> x*y
[1] -5 -8 -9 -8 -5 0 7 16 27 40
> x/y
[1] -5.0000000 -2.0000000 -1.0000000 -0.5000000 -0.2000000 0.0000000 0.1428571
[8] 0.2500000 0.3333333 0.4000000
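When the longer length is an exact multiple of the shorter one, the shorter vector is reused without any warning. A minimal sketch with made-up values:

```r
m <- 1:6
n <- c(10, 20)

# n is reused three times: 10 20 10 20 10 20
total <- m + n
print(total)    # values: 11 22 13 24 15 26

# length() can be used to compare sizes before combining vectors.
print(length(m) %% length(n) == 0)   # [1] TRUE, so no warning is raised
```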
READ LINE
readline() reads one line typed by the user and returns it as a string.
cat() concatenates any number of values and prints them.
#read a number
x=readline(prompt ="enter a number a=")
x=as.integer(x)
y=readline(prompt="enter a number b=")
y=as.integer(y)
z=x+y
cat("sum is",z)
enter a number a=2
enter a number b=3
sum is 5
Write an R SCRIPT TO ACCEPT THE TWO INTEGERS AND PRINT THE PRODUCT
OF NUMBERS
a=readline(prompt="enter number a=")
a=as.integer(a)
b=readline(prompt="enter number b=")
b=as.integer(b)
d=a*b
cat("product is",d)
enter number a=3
enter number b=9
product is 27
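as.integer() returns NA (with a warning) when the typed text is not a number, so a script may want to check the value before using it. The helper to_integer below is hypothetical, written only for this illustration; the text passed to it stands in for what readline() would return.

```r
# Hypothetical helper: convert text typed by the user to an integer and
# stop with a clear message when the text cannot be read as a number.
to_integer <- function(text) {
  value <- suppressWarnings(as.integer(text))
  if (is.na(value)) {
    stop("not a valid number: ", text)
  }
  value
}

a <- to_integer("3")   # in a real script this text would come from readline()
b <- to_integer("9")
cat("product is", a * b, "\n")   # product is 27
```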
SIMPLE IF
#check whether a given number is greater than 5
a=readline(prompt="enter a number=")
a=as.integer(a)
if(a>5)
{
cat(a,"is greater than 5")
}
enter a number=8
8 is greater than 5
IF ELSE
#CHECK GIVEN NUMBER IS POSITIVE OR NEGATIVE
a=readline(prompt="enter a number=")
a=as.integer(a)
if(a>0)
{
cat(a,"is positive")
}else
{
cat(a,"is negative")
}
enter a number=6
6 is positive
enter a number=-1
-1 is negative
#find the greatest of 2 numbers
a=readline(prompt="enter first number=")
a=as.integer(a)
b=readline(prompt="enter second number=")
b=as.integer(b)
if(a>b)
{
cat(a,"is largest")
}else
{
cat(b,"is largest")
}
enter first number=3
enter second number=6
6 is largest
Write an R script to find sum and average of first 10 numbers
a=1:10
sum=0
for(i in 1:10)
{
sum=sum+i
}
avg=sum/10
cat("Sum of the first 10 numbers:", sum,"\n")
cat("Average of the first 10 numbers:",avg)
Sum of the first 10 numbers: 55
Average of the first 10 numbers: 5.5
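The same result can be obtained without a loop, because sum() and mean() work on a whole vector at once. A short sketch:

```r
a <- 1:10
total <- sum(a)      # adds all the elements of the vector
average <- mean(a)   # sum divided by the number of elements
cat("Sum of the first 10 numbers:", total, "\n")        # 55
cat("Average of the first 10 numbers:", average, "\n")  # 5.5
```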
Write an R script to print numbers from 5 to 0 using while loop
i=5
while(i>=0)
{
cat(i,"")
i=i-1
}
5 4 3 2 1 0
break
Terminates the execution of the loop
next
Skips the rest of the current iteration
i=5
while(i>=0)
{
cat(i,"")
if(i==2)
break
i=i-1
}
5 4 3 2
next
i=5
while(i>=0)
{
if(i==2)
{
i=i-1 #decrement first, otherwise i stays 2 and the loop never ends
next
}
cat(i,"")
i=i-1
}
5 4 3 1 0
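In a for loop the loop variable moves to the next value automatically, so next and break can be used without updating a counter by hand. A minimal sketch:

```r
for (i in 5:0) {
  if (i == 2) next     # skip 2 and continue with the next value
  cat(i, "")
}
cat("\n")              # prints: 5 4 3 1 0

for (i in 5:0) {
  if (i == 2) break    # leave the loop as soon as 2 is reached
  cat(i, "")
}
cat("\n")              # prints: 5 4 3
```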
Accessing individual elements of a vector is done using square brackets ([]).
The first element of x is retrieved by typing x[1].
The first 2 elements by x[1:2].
Non-consecutive elements by x[c(1,4)].
> y=5:10
> y
[1] 5 6 7 8 9 10
> y[2]
[1] 6
> y[c(6,9)]
[1] 10 NA
The second result is NA because y has only 6 elements, so there is no 9th element.
This works for all types of vectors.
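Square brackets also accept negative indices and logical conditions, and they can be used to change an element. The sketch below uses base R only.

```r
y <- 5:10

rest <- y[-1]        # a negative index drops that element
print(rest)          # values: 6 7 8 9 10

big <- y[y > 7]      # a condition keeps the elements where it is TRUE
print(big)           # values: 8 9 10

y[2] <- 60           # replace the second element
print(y)             # values: 5 60 7 8 9 10
```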
DATA STRUCTURES
Vectors
Lists
Data frames
Matrices
Arrays
Strings
Factors
DATA FRAMES
The data is represented in the form of rows and columns
x=1:5
y=5:1
z=c("c","c++","java","python","r")
df=data.frame(x,y,z)
print(df)
x y z
1 1 5 c
2 2 4 c++
3 3 3 java
4 4 2 python
5 5 1 r
TO FIND THE SHAPE
x=1:5
y=5:1
z=c("c","c++","java","python","r")
df=data.frame(x,y,z)
print(df)
cat("number of rows",nrow(df))
cat("\nnumber of columns",ncol(df))
cat("\ndimension",dim(df))
number of rows 5
number of columns 3
dimension 5 3
Columns can be given any name when the data frame is created
df=data.frame(first=x,second=y,course=z)
print(df)
first second course
1 1 5 c
2 2 4 c++
3 3 3 java
4 4 2 python
5 5 1 r
WRITE AN R SCRIPT TO CREATE DATA FRAME WITH COLUMNS AS ROLL
NUMBER,NAME,SGPA,CGPA FOR 5 STUDENTS
rollno=1:5
name=c("nithya","harika","monica","arun","vinay")
sgpa=c(8.9,7.6,6.5,8,9)
cgpa=c(8.8,9,7.9,8.8,7.5)
df=data.frame(rollno,name,sgpa,cgpa)
print(df)
cat("number of rows",nrow(df))
cat("\nnumber of columns",ncol(df))
cat("\ndimension",dim(df))
print(head(df)) #prints the first 6 rows by default (here all 5 rows)
rollno name sgpa cgpa
1 1 nithya 8.9 8.8
2 2 harika 7.6 9.0
3 3 monica 6.5 7.9
4 4 arun 8.0 8.8
5 5 vinay 9.0 7.5
number of rows 5
number of columns 4
dimension 5 4
> class(rollno)
[1] "integer"
print(head(df,n=3))
rollno name sgpa cgpa
1 1 nithya 8.9 8.8
2 2 harika 7.6 9.0
3 3 monica 6.5 7.9
print(tail(df,n=3))
rollno name sgpa cgpa
3 3 monica 6.5 7.9
4 4 arun 8.0 8.8
5 5 vinay 9.0 7.5
rownames(df)=c("1st","2nd","3rd","4th","5th")
print(df)
rollno name sgpa cgpa
1st 1 nithya 8.9 8.8
2nd 2 harika 7.6 9.0
3rd 3 monica 6.5 7.9
4th 4 arun 8.0 8.8
5th 5 vinay 9.0 7.5
print(df$cgpa) #to access specific column
[1] 8.8 9.0 7.9 8.8 7.5
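Rows and single values of a data frame can be selected with square brackets, in the form df[rows, columns]. The sketch below rebuilds the student data frame from above; the name toppers and the cut-off of 8 are chosen only for illustration.

```r
df <- data.frame(
  rollno = 1:5,
  name = c("nithya", "harika", "monica", "arun", "vinay"),
  sgpa = c(8.9, 7.6, 6.5, 8, 9),
  cgpa = c(8.8, 9, 7.9, 8.8, 7.5),
  stringsAsFactors = FALSE
)

# Rows where sgpa is at least 8, showing only two columns.
toppers <- df[df$sgpa >= 8, c("name", "sgpa")]
print(toppers)

# A single cell: row 3, column "name".
print(df[3, "name"])     # [1] "monica"

# Summary of one column.
print(mean(df$cgpa))     # [1] 8.4
```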
ARRAY
An array can have more than 2 dimensions
array(vector,dim=c(nrow,ncol,nmat))
nrow=number of rows
ncol=number of columns
nmat=number of matrices
a=array(c(1:6),dim=c(2,3,2))
print(a)
,,1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
,,2
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
The 6 values are reused to fill the second matrix.
a=array(c(1:12),dim=c(3,2,2))
print(a)
,,1
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
,,2
[,1] [,2]
[1,] 7 10
[2,] 8 11
[3,] 9 12
a=array(c(1:12),dim=c(2,2,3))
print(a)
,,1
[,1] [,2]
[1,] 1 3
[2,] 2 4
,,2
[,1] [,2]
[1,] 5 7
[2,] 6 8
,,3
[,1] [,2]
[1,] 9 11
[2,] 10 12
b=array(c(1,0,0,2,0,1,0,1,1,1),dim=c(2,3,3))
print(b)
,,1
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 2 1
,,2
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 1 0
,,3
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 2 1 1
cat ("\n 1st row 2nd col mat1",b[1,2,1])
cat ("\n 1st row 3rd col mat3",b[1,3,3])
cat ("\n 1st row of matrix 2",b[c(1), ,2])
1st row 2nd col mat1 0
1st row 3rd col mat3 0
1st row of matrix 2 0 1 1
cat ("\n 2nd col of matrix 3:",b[ ,c(2),3])
2nd col of matrix 3: 0 1
cat("\n 2 is in array b:",2%in%b)
cat("\n 6 is in array b:",6%in%b)
2 is in array b: TRUE
6 is in array b: FALSE
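The shape of an array and a summary of each matrix inside it can be obtained with dim() and apply(). This is a small sketch on a fresh array; apply() is a base R function that is not covered in the notes above.

```r
a <- array(1:12, dim = c(2, 3, 2))

print(dim(a))           # [1] 2 3 2
print(a[2, 3, 1])       # [1] 6  (row 2, column 3, matrix 1)
print(a[, , 2])         # the whole second matrix

# apply() over the 3rd dimension gives one sum per matrix.
sums <- apply(a, 3, sum)
print(sums)             # [1] 21 57
```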
STRING
Any value within a pair of single quotes or double quotes in R is treated as a string.
s1="hello"
print(s1)
s2="welcome to data visualization"
print(s2)
[1] "hello"
[1] "welcome to data visualization"
Length of the string
cat("\n length of the string1:",nchar(s1))
length of the string1: 5
Concatenation
print(paste("\n joining two strings (concatenation):",s1,s2))
[1] "\n joining two strings (concatenation): hello welcome to data visualization"
Equal and upper
cat("\n s1 and s2 equal:",s1==s2)
cat("\n s1 in upper case:",toupper(s1))
s1 and s2 equal: FALSE
s1 in upper case: HELLO
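A few more base R string functions follow the same pattern. The sketch below reuses s1 and s2 from above; the names joined, glued and first_word are illustrative.

```r
s1 <- "hello"
s2 <- "welcome to data visualization"

joined <- paste(s1, s2)               # joined with a space
glued <- paste0(s1, s2)               # joined with no separator
print(joined)
print(glued)
print(paste(s1, s2, sep = "-"))       # joined with a custom separator

print(tolower("HELLO") == s1)         # [1] TRUE
first_word <- substr(s2, 1, 7)        # characters 1 to 7
print(first_word)                     # [1] "welcome"
```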
PLOT
The plot() function is used to plot points in a graph.
par(mar=c(1,1,1,1)) sets the margins of the plot.
plot(3,4)
x=c(1,3,5,7)
y=c(2,4,6,8)
plot(x,y,col="red")
LINE CHART USING PLOT FUNCTION
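The type argument of plot() selects the kind of chart:
type="l" line plot
type="b" both lines and points
type="s" step plot
type="n" no plotting
type="h" histogram-like vertical lines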
x=c(2,3,4,6,8)
plot(x,type="l",col="black")
x=c(2,3,4,6,8)
plot(x,type="b",col="black")
x=c(2,3,4,6,8)
plot(x,type="h",col="red")
x=c(2,3,4,6,8)
plot(x,type="s",col="blue",main="line chart",xlab="x",ylab="y",lwd=3,lty=2)
x=c(2,3,4,6,8)
plot(x,type="l",col="red",main="line chart")
main denotes the title of the chart.
x=c(2,3,4,6,8)
plot(x,type="l",col="red",main="line chart",xlab="x",ylab="y")
1 col = colour of the line
2 lwd = line width (changes the width or size of the line)
3 lty = line style (between 0-6)
FOR MULTIPLE LINES
line1=c(1,3,5,7,9)
line2=c(2,4,6,8,10)
plot(line1,type="l",col="blue",lwd=3,lty=3)
lines(line2,col="red",lwd=3,lty=3)
line1=c(1,3,5,7,9)
line2=c(2,4,6,8,10)
line3=c(2,3,4,5,6)
plot(line1,type="l",col="blue",lwd=3,lty=3)
lines(line2,col="red",lwd=3,lty=3)
lines(line3,col="green",lwd=3,lty=3)
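plot() chooses the y-axis range from the first line only, so a later line with larger values can run off the top of the chart. One possible way to handle this, sketched below, is to pass ylim covering all lines and add a legend. The name y_range is illustrative.

```r
line1 <- c(1, 3, 5, 7, 9)
line2 <- c(2, 4, 6, 8, 10)

# range() over both vectors gives the smallest and largest value overall.
y_range <- range(line1, line2)

plot(line1, type = "l", col = "blue", lwd = 3, lty = 3,
     ylim = y_range, ylab = "value", main = "two lines")
lines(line2, col = "red", lwd = 3, lty = 3)

# legend() labels each line by its colour.
legend("topleft", legend = c("line1", "line2"),
       col = c("blue", "red"), lwd = 3, lty = 3)
```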
PIE PLOT
x=c(50,60,70,80,90)
pie(x)
x=c(50,60,70,80,90)
lab=c("c","c++","java","r","python")
pie(x,labels=lab)
x=c(50,60,70,80,90)
lab=c("c","c++","java","r","python")
colors=c("green","blue","purple","yellow","red")
pie(x,labels=lab,col=colors)
pie(x,labels=x,col=colors,main="MARKS")
BAR PLOT
y=c(4,8,9,12,15)
barplot(y,main="bar plot")
HORIZONTAL
y=c(4,8,9,12,15)
#barplot(y,main="bar plot")
barplot(y,main="bar plot",horiz=TRUE)
BOX PLOT
A box plot is a chart that displays the distribution of a set of values. One box can be
drawn for each group of data.
It is based on 5 values:
Min
First Quartile
Median
Third quartile
Max
boxplot(x,data,notch,varwidth,names,main)
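The five values can be computed with fivenum() and drawn with boxplot(). This is a minimal sketch with made-up marks; note that fivenum() reports hinges, which are close to but not always identical to the first and third quartiles.

```r
marks <- c(50, 60, 70, 80, 90, 65, 75)

# Minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(marks)
print(five)     # [1] 50.0 62.5 70.0 77.5 90.0

# Draw the box plot for the same values.
boxplot(marks, main = "MARKS", ylab = "marks")
```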
MAPS
leaflet package
Leaflet is one of the most popular open-source JavaScript libraries for mobile-friendly
interactive maps. The leaflet package makes it available in R.
Popups and map tiles can be added.
To load any package
library(package_name)
library(leaflet)
addTiles(): by default, if no argument is passed, it adds OpenStreetMap tiles to the map.
Pipe operator (%>%)
The output of one function is taken as the input to the next function.
map=leaflet()%>%addTiles()%>%addMarkers(lng,lat,popup)
library(leaflet)
map=leaflet()%>%addTiles()%>%addMarkers(lng=77.5946,lat=12.9716,popup='bengaluru')
par(mar=c(1,1,1,1))
print(map)
Create a data frame named city with columns for city name, latitude and longitude
cityname latitude longitude
1 bengaluru 12.9716 77.5946
2 hyderabad 17.4065 78.4772
3 mysuru 12.2958 76.6394
4 chennai 13.0843 80.2705
5 kochi 9.9312 76.2673
Multiple city
library(leaflet)
city=data.frame(
name=c("bengaluru","hyderabad","mysuru","chennai","kochi"),
lat=c(12.9716,17.4065,12.2958,13.0843,9.9312),
lng=c(77.5946,78.4772,76.6394,80.2705,76.2673))
city_map=leaflet()%>%addTiles()
city_map <- city_map %>%addCircleMarkers(
data = city,
lat = ~lat,
lng = ~lng,
color = "red",
popup = ~paste("City:", name)
)
print(city_map)
CREATE A DF WITH COL NAMES AS ROLLNO , SEM, SGPA, CGPA FOR 5
STUDENTS
SCATTER PLOT
rollno=c(314,310,322,335,334)
sem=c(5,6,4,5,3)
sgpa=c(8.9,7.6,6.5,8,9)
cgpa=c(8.8,9,7.9,8.8,7.5)
df=data.frame(rollno,sem,sgpa,cgpa)
print(df)
par(mar=c(1,1,1,1))
plot(df[['sgpa']],df[['cgpa']]) #plot() draws the chart itself, print() is not needed
x=c(1,2,3,4)
y=2*x
plot(x,y)
rbind is used to add rows to a data frame
cbind is used to add columns to a data frame
rbind
data=data.frame(fruit=c("apple","orange","kiwi"),
color=c("red","orange",'brown'))
print(data)
new_df=rbind(data,c("grapes","green"))
#print(new_df)
fruit color
1 apple red
2 orange orange
3 kiwi brown
data=data.frame(fruit=c("apple","orange","kiwi"),
color=c("red","orange",'brown'))
#print(data)
new_df=rbind(data,c("grapes","green"))
print(new_df)
fruit color
1 apple red
2 orange orange
3 kiwi brown
4 grapes green
cbind
data=data.frame(fruit=c("apple","orange","kiwi"),
color=c("red","orange",'brown'))
#print(data)
new_df=rbind(data,c("grapes","green"))
#print(new_df)
new_col=cbind(data,price=c(250,300,150))
print(new_col)
fruit color price
1 apple red 250
2 orange orange 300
3 kiwi brown 150
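A row written with c() that mixes text and numbers is itself a character vector, so adding it with rbind can turn a numeric column into character. One way to keep the column types, sketched below, is to pass the new row as a one-row data frame. The price of 90 for grapes is a made-up value.

```r
data <- data.frame(fruit = c("apple", "orange", "kiwi"),
                   price = c(250, 300, 150),
                   stringsAsFactors = FALSE)

# The new row is a data frame with the same column names.
new_row <- data.frame(fruit = "grapes", price = 90,
                      stringsAsFactors = FALSE)

new_df <- rbind(data, new_row)
print(new_df)
print(class(new_df$price))   # [1] "numeric"
```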
QUESTIONS
2m
1. List out the basic arithmetic operators
2. List out the data types in R programming
3. Describe how to convert integer to character data type
4. Describe the difference between Date and POSIXct
5. How do you identify the datatype of a variable
6. Define a vector
7. List out any 3 vector operations
8. What is readline()
9. Differentiate between vector and list.
10. List out any 3 functions of list data structure
11. Create a matrix with 2 rows 3 columns using an R statement
12. Difference between matrices and arrays
13. List out any 3 functions of string data type
14. List out any 3 functions of data frames
15. Describe the functions to create scatter plot and box plot
16. What is the leaflet package
5m
1. Create a data frame with column names as state, city, population (any 5 states)
2. Write an R program to print even and odd numbers from 1 to 20
3. Write an R program to find the reverse of a given number
4. Implement a map with markers for any 5 cities in India
5. Consider 2 matrices, a with 3 rows 3 cols and b with 3 rows 3 cols, and perform matrix
addition and multiplication
6. Consider a sample data frame of your choice and implement the following graphs
using an R program (line, bar, scatter, box)
7. Replace the given string with new string
str_replace() from the stringr package replaces part of a string
str_sub() from the stringr package extracts a substring of a given string
str_c() from the stringr package joins strings
library(stringr)
print (str_c("data", "science"))
print (str_sub("data science", 4,8))
print (str_replace ("data science", "data", "political"))