DS Lab

Uploaded by 018 Neelima

St. Ann's College for Women
(Autonomous), Affiliated to Osmania University
NAAC Reaccredited with 'A+' Grade, College with Potential for Excellence by UGC
Mehdipatnam, Hyderabad - 500028

Certificate of Completion
This is to certify that………….………………….………………………
of MCA bearing Hall Ticket Number….…….….….…………….
has successfully completed the necessary practical
record work in the subject: Data Science. Her work has
been duly corrected and certified.

Signature of Internal Examiner Signature of External Examiner

Head of the Department

INDEX

1. R AS CALCULATOR APPLICATION
   a. Using with and without R objects on console
   b. Using mathematical functions on console
   c. Write an R script to create R objects for a calculator application and save it in a specified location on disk.

2. DESCRIPTIVE STATISTICS IN R
   a. Write an R script to find basic descriptive statistics using the summary, str and quantile functions on the mtcars and cars datasets.
   b. Write an R script to find a subset of a dataset by using the subset() and aggregate() functions on the iris dataset.

3. READING AND WRITING DIFFERENT TYPES OF DATASETS
   a. Reading different types of data sets (.txt, .csv) from the web and disk, and writing a file to a specific disk location.
   b. Reading an Excel data sheet in R.
   c. Reading an XML dataset in R.

4. VISUALIZATIONS
   a. Find the data distributions using box and scatter plots.
   b. Find the outliers using plots.
   c. Plot the histogram, bar chart and pie chart on sample data.

5. CORRELATION AND COVARIANCE
   a. Find the correlation matrix.
   b. Plot the correlation plot on the dataset and visualize it, giving an overview of the relationships among the iris data.
   c. Analysis of covariance/variance (ANOVA), if the data have categorical variables, on the iris data.

6. REGRESSION MODEL
   Import data from web storage. Name the dataset and perform logistic regression to find the relation between the variables affecting the admission of a student to an institute, based on his or her GRE score, GPA obtained and rank. Also check whether the model fits. Require(foreign), require(MASS).

7. MULTIPLE REGRESSION MODEL
   Apply multiple regression if the data have a continuous independent variable. Apply it on the above dataset.

8. REGRESSION MODEL FOR PREDICTION
   Apply regression model techniques to predict the data on the above dataset.

9. CLASSIFICATION MODEL
   a. Install the relevant package for classification.
   b. Choose a classifier for the classification problem.
   c. Evaluate the performance of the classifier.

10. CLUSTERING MODEL
   a. Clustering algorithms for unsupervised classification.
   b. Plot the cluster data using R visualizations.
1. R AS CALCULATOR APPLICATION
a. Using with and without R objects on console
b. Using mathematical functions on console
c. Write an R script to create R objects for a calculator application and save it in a
specified location on disk.

R AS CALCULATOR APPLICATION:
# Write an R script to create R objects for a calculator application
add <- function(x, y) {
return(x + y)
}
subtract <- function(x, y) {
return(x - y)
}
multiply <- function(x, y) {
return(x * y)
}
divide <- function(x, y) {
return(x / y)
}
# take input from the user
print("Select operation.")
print("1.Add")
print("2.Subtract")
print("3.Multiply")
print("4.Divide")
choice = as.integer(readline(prompt="Enter choice[1/2/3/4]: "))
num1 = as.integer(readline(prompt="Enter first number: "))
num2 = as.integer(readline(prompt="Enter second number: "))
operator <- switch(choice,"+","-","*","/")
result <- switch(choice, add(num1, num2), subtract(num1, num2), multiply(num1, num2),
divide(num1, num2))
print(paste(num1,operator, num2, "=", result))
Output:

[1] "Select operation."

[1] "1.Add"

[1] "2.Subtract"

[1] "3.Multiply"

[1] "4.Divide"

Enter choice[1/2/3/4]: 4

Enter first number: 20

Enter second number: 4

[1] "20 / 4 = 5"
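Part (c) also asks for the R objects to be saved to a specified location on disk. A minimal sketch, assuming the four calculator functions above are already defined in the session (the file path below is an example, not from the record):

```r
# Save the calculator functions to a chosen location on disk
# (the path is an example; adjust it to your system)
save(add, subtract, multiply, divide, file = "D:/DSLab/calculator.RData")

# In a later session the objects can be restored with:
# load("D:/DSLab/calculator.RData")
```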

2. DESCRIPTIVE STATISTICS IN R
a. Write an R script to find basic descriptive statistics using the summary, str and quantile
functions on the mtcars & cars datasets.
Compute the minimum, 1st quartile, median, mean, 3rd quartile and the maximum for all numeric
variables of a dataset at once using summary():

step1: summarydata = summary(mtcars)
step2: write.csv(summarydata, "x.csv", row.names = FALSE)
step3: read.csv("x.csv")
OUTPUT:
      mpg             cyl             disp             hp             drat             wt             qsec             vs
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695   Median :3.325   Median :17.71   Median :0.0000
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000

summary(cars)
step1: summarydata = summary(cars)
step2: write.csv(summarydata, "x.csv", row.names = FALSE)
step3: read.csv("x.csv")
output:
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00

> str(mtcars)

The str() function in R is used to compactly display the internal structure of an R
object. It can display even the internal structure of large nested lists. It provides a one-line
output for basic R objects, letting the user know about the object and its constituents.

RANGE: The range can then be easily computed, as you may have guessed, by subtracting the
minimum from the maximum; range() returns both values at once.

> range(mtcars$mpg)

Output: [1] 10.4 33.9

The quantile() function in R is used to compute sample quantiles of a data set, for
probabilities in [0, 1].

The first quartile is at 0.25 (25%), the second at 0.50 (50%), and the third at 0.75 (75%).

Step1: quantile(mtcars$cyl)

  0%  25%  50%  75% 100%
   4    4    6    8    8
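quantile() also accepts a probs argument for arbitrary probabilities, not just the default quartiles; for example:

```r
# 10th, 50th and 90th percentiles of miles-per-gallon in mtcars
quantile(mtcars$mpg, probs = c(0.1, 0.5, 0.9))
```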

b. Write an R script to find a subset of a dataset using the subset() and aggregate() functions on the
iris dataset.

Subsetting in R is a useful indexing feature for accessing object elements. It can be used to
select and filter variables and observations.

The aggregate() function in R splits the data into subsets, computes summary statistics for each
subset, and returns the result in a grouped form. It is similar to GROUP BY in SQL and is useful for
performing aggregate operations such as sum, count, mean, minimum and maximum.

# load dataset iris into r
> r = iris
> s = subset(r, r$Species == "virginica")
> write.csv(s, file = "su.csv")
> read.csv("su.csv")
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
101 6.3 3.3 6 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3 5.8 2.2 virginica
106 7.6 3 6.6 2.1 virginica
107 4.9 2.5 4.5 1.7 virginica
108 7.3 2.9 6.3 1.8 virginica
109 6.7 2.5 5.8 1.8 virginica
110 7.2 3.6 6.1 2.5 virginica
111 6.5 3.2 5.1 2 virginica
112 6.4 2.7 5.3 1.9 virginica
113 6.8 3 5.5 2.1 virginica
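The subset() output above covers one half of the task; a short aggregate() sketch for the other half:

```r
# aggregate(): mean of every numeric column of iris, grouped by Species
aggregate(. ~ Species, data = iris, FUN = mean)
```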

3A. Reading different types of data sets (.txt, .csv) from Web and disk and writing in file in
specific disk location

Create a Fruit.csv file with the following data:

Fruit Name   Fruit Color   Fruit Price

Apple        Red           100
Banana       Yellow        60
Watermelon   Green         120
Pineapple    Yellow        80
Grapes       Green         130
Banana       Green         50
Apple        Green         150

Reading CSV file using read.csv() function

read.csv("Fruit.csv")

Reading CSV file using read.table() function

read.table ("Fruit.csv", header=TRUE, sep=",")

OUTPUT:

Fruit.Name Fruit.Color Fruit.Price


1 Apple Red 100
2 Banana Yellow 60
3 Watermelon Green 120
4 Pineapple Yellow 80
5 Grapes Green 130
6 Banana Green 50
7 Apple Green 150
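The exercise also calls for reading from the web and writing to a specific disk location. A minimal sketch (the output path and the URL are examples, not from the record):

```r
fruits <- read.csv("Fruit.csv")

# Write the data frame to a chosen disk location
write.csv(fruits, "D:/DSLab/fruits_copy.csv", row.names = FALSE)

# Reading a .csv (or .txt) directly from the web works the same way,
# given a URL that serves the file (hypothetical URL):
# web_data <- read.csv("https://example.com/Fruit.csv")
```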

3b Reading Excel data sheet in R.

Softdrinks.xlsx

Soft drinks Price


Pepsi 40
ThumpsUP 60
Maaza 70
Limca 80
Sprite 90

Convert an Excel worksheet to a text file by using the Save As command:
1. Go to File > Save As.
2. Click Browse.
3. In the Save As dialog box, under the Save as type box, choose the text file format for the
worksheet; for example, click Text (Tab delimited) or CSV (Comma delimited).

Reading the spreadsheet (.xlsx) after converting it into a .csv file:

read.csv("softdrinks.csv")

OUTPUT:

Soft.drinks Price
1 Pepsi 40
2 ThumpsUP 60
3 Maaza 70
4 Limca 80
5 Sprite 90
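Alternatively, an .xlsx file can be read directly, without converting it first, using the readxl package (an assumption — this package is not part of the original record):

```r
# install.packages("readxl")   # once, if not already installed
library(readxl)

# read_excel() reads the first sheet by default
drinks <- read_excel("Softdrinks.xlsx")
print(drinks)
```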

3c. Reading XML dataset in R

<RECORDS>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>

</RECORDS>

Load the package required to read XML files:

library("XML")

Load the other required package:

library("methods")

Give the input file name to the function:

xmldataframe <- xmlToDataFrame("employee.xml")

Print the output:

print(xmldataframe)

Output:
ID NAME SALARY STARTDATE DEPT
1 5 Gary 843.25 3/27/2015 Finance
2 6 Nina 578 5/21/2013 IT
3 7 Simon 632.8 7/30/2013 Operations
4 8 Guru 722.5 6/17/2014 Finance

VISUALIZATIONS
4a. Find the data distributions using box and scatter plot.

In R, a box-and-whisker plot is created using the boxplot() function.


The boxplot() function takes in any number of numeric vectors, drawing a boxplot for each
vector.
You can also pass in a list (or data frame) with numeric vectors as its components. Let us use the
built-in dataset airquality which has “Daily air quality measurements in New York, May to
September”
> str(airquality)

OUTPUT:

'data.frame': 153 obs. of 6 variables:

$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...

$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...

$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...

$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...

$ Month : int 5 5 5 5 5 5 5 5 5 5 ...

$ Day : int 1 2 3 4 5 6 7 8 9 10 ...

>boxplot (airquality)
>boxplot(airquality$Ozone)
OUTPUT:

>boxplot(airquality$Ozone,main = "Mean ozone in parts per billion at Roosevelt


Island",xlab = "Parts Per Billion",ylab = "Ozone",col = "orange",border =
"brown",horizontal = TRUE,notch = TRUE)
OUTPUT:

Scatterplot:

Plotting a scatter plot: the function is plot(), which takes two vectors, one for the x axis and
one for the y axis. The objective is to understand the relationship between numbers and their
sines. We will use two vectors: x, which holds a sequence of values between 1 and 25 at an
interval of 0.1, and y, which stores the sines of all the values held in x.

> x <-seq(1, 25, 0.1)


> y <-sin(x)

The plot function takes the values in the vector x and plots them on the horizontal axis, with the
values of y on the vertical axis.

> plot(x, y)
OUTPUT:

Scatter Diagram

4b. Find the outliers using plot.

Remove all the existing objects:


>rm(list = ls())

#Setting the working directory


>setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
>getwd()

#Load the dataset


>bike_data = read.csv("day.csv",header=TRUE)

### Missing Value Analysis ###


>sum(is.na(bike_data))
>summary(is.na(bike_data))

#From the above result, it is clear that the dataset contains NO Missing Values.
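The record stops at the missing-value check. A sketch of actually locating outliers with a plot, using the built-in airquality data since the bike-rental file is not available here (dataset choice is an assumption):

```r
# Points beyond the boxplot whiskers are the outliers;
# boxplot.stats() returns them in the $out component
boxplot(airquality$Ozone, main = "Ozone readings")
outliers <- boxplot.stats(airquality$Ozone)$out
print(outliers)   # the outlying Ozone readings
```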

4c. Plot the histogram, bar chart and pie chart on sample data

# Create data for the histogram

h <- c(8, 13, 30, 5, 28)

# Create the histogram for h

hist(h)
Output:

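The task also asks for a bar chart and a pie chart; on the same sample vector (the category labels are an assumption for illustration):

```r
h <- c(8, 13, 30, 5, 28)
labels <- c("A", "B", "C", "D", "E")   # example category names

# Bar chart of the sample data
barplot(h, names.arg = labels, main = "Bar chart")

# Pie chart of the same values
pie(h, labels = labels, main = "Pie chart")
```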
CORRELATION AND COVARIANCE
5a. Find the correlation matrix.
A correlation matrix is a table of correlation coefficients for a set of variables used to
determine if a relationship exists between the variables. The coefficient indicates both the
strength of the relationship as well as the direction (positive vs. negative correlations).

install.packages("corrplot")

source ("http://www.sthda.com/upload/rquery_cormat.r")
mydata <- mtcars[, c(1,3,4,5,6,7)]
head(mydata)
mpg disp hp drat wt qsec
Mazda RX4 21.0 160 110 3.90 2.620 16.46
Mazda RX4 Wag 21.0 160 110 3.90 2.875 17.02
Datsun 710 22.8 108 93 3.85 2.320 18.61
Hornet 4 Drive 21.4 258 110 3.08 3.215 19.44
Hornet Sportabout 18.7 360 175 3.15 3.440 17.02
Valiant 18.1 225 105 2.76 3.460 20.22
>rquery.cormat(mydata)
$r
hp disp wt qsec mpg drat
hp 1
disp 0.79 1
wt 0.66 0.89 1
qsec -0.71 -0.43 -0.17 1
mpg -0.78 -0.85 -0.87 0.42 1
drat -0.45 -0.71 -0.71 0.091 0.68 1
$p
hp disp wt qsec mpg drat
hp 0
disp 7.1e-08 0
wt 4.1e-05 1.2e-11 0
qsec 5.8e-06 0.013 0.34 0
mpg 1.8e-07 9.4e-10 1.3e-10 0.017 0
drat 0.01 5.3e-06 4.8e-06 0.62 1.8e-05 0
$sym
hp disp wt qsec mpg drat
hp 1
disp , 1
wt , + 1
qsec , . 1
mpg , + + . 1
drat . , , , 1
attr(,"legend")
[1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

mydata <- iris[, c(1,2,3,4)]
head(mydata)
>rquery.cormat(mydata)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4

5b. Plot the correlation plot on dataset and visualize giving an overview of relationships
among data on iris data.
Step 1- Load the relevant libraries

>library(ggplot2)
>library(tidyr)
>library(datasets)
>data("iris")
>summary(iris)
Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
Median :5.800   Median :3.000   Median :4.350   Median :1.300
Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
      Species
 setosa    :50
 versicolor:50
 virginica :50

Step 2 - Create a correlation matrix of the iris dataset using the plot_correlation() function
from the DataExplorer package (the function used in class in lab 3). Include only continuous
variables in the correlation plot to avoid confusion, as factor variables do not make sense in a
correlation plot.

>library(corrplot)
>library(DataExplorer)
>plot_correlation(iris[, 1:4])

Step 3 - Create three separate correlation matrices for each species of iris flower
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

>m <- levels(iris$Species)
>title0 <- "Setosa"
>setosaCorr = cor(iris[iris$Species == m[1], 1:4])
>corrplot(setosaCorr, method = "number", title = title0, mar = c(0, 0, 1, 0))
Output:
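The same procedure extends to the other two species (versicolor, virginica); a sketch looping over all Species levels:

```r
library(corrplot)

# One numeric correlation plot per species
for (sp in levels(iris$Species)) {
  corr <- cor(iris[iris$Species == sp, 1:4])
  corrplot(corr, method = "number", title = sp, mar = c(0, 0, 1, 0))
}
```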

5c. Analysis of covariance: variance (ANOVA), if data have categorical variables on iris
data.
>input <- mtcars[,c("am","mpg","hp")]
>print(head(input))

                  am  mpg  hp
Mazda RX4          1 21.0 110
Mazda RX4 Wag      1 21.0 110
Datsun 710         1 22.8  93
Hornet 4 Drive     0 21.4 110
Hornet Sportabout  0 18.7 175
Valiant            0 18.1 105
Model with interaction between categorical variable and predictor variable

>input <- mtcars


>result <- aov(mpg~hp*am,data = input)
>print(summary(result))

OUTPUT:

Df Sum Sq Mean Sq F value Pr(>F)


hp 1 678.4 678.4 77.391 1.50e-09 ***
am 1 202.2 202.2 23.072 4.75e-05 ***
hp:am 1 0.0 0.0 0.001 0.981
Residuals 28 245.4 8.8
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model without interaction between categorical variable and predictor variable


>input <- mtcars
>result <- aov(mpg~hp+am,data = input)
>print(summary(result))

OUTPUT:

Df Sum Sq Mean Sq F value Pr(>F)


hp 1 678.4 678.4 80.15 7.63e-10 ***
am 1 202.2 202.2 23.89 3.46e-05 ***
Residuals 29 245.4 8.5
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’

Comparing Two Models


>input <- mtcars
>result1 <- aov(mpg~hp*am,data = input)
>result2 <- aov(mpg~hp+am,data = input)
>print(anova(result1,result2))

OUTPUT:

Analysis of Variance Table


Model 1: mpg ~ hp * am
Model 2: mpg ~ hp + am
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 245.43
2 29 245.44 -1 -0.0052515 6e-04 0.9806

REGRESSION MODEL
6. Import data from web storage. Name the dataset and now do logistic regression to
find out the relation between the variables that are affecting the admission of a student to an
institute based on his or her GRE score, GPA obtained and rank. Also check whether the
model fits or not. Require(foreign), require(MASS).

>library(rio)
>data <- import("binary.sas7bdat")

Data Cleaning:
Looking at the structure of data set

>str(data)

## 'data.frame': 400 obs. of 4 variables:


## $ ADMIT: num 0 1 1 1 0 1 1 0 1 0 ...
## $ GRE : num 380 660 800 640 520 760 560 400 540 700 ...
## $ GPA : num 3.61 3.67 4 3.19 2.93 ...
## $ RANK : num 3 3 1 4 4 2 1 2 3 2 ...
## - attr(*, "label")= chr "LOGIT"

Variables ADMIT and RANK are of type numeric, but they should be factor variables since
we are not going to perform any mathematical operations on them.

>data$ADMIT<-as.factor(data$ADMIT)
>data$RANK<- as.factor(data$RANK)
>str (data)
# 'data.frame': 400 obs. of 4 variables:
## $ ADMIT: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
## $ GRE : num 380 660 800 640 520 760 560 400 540 700 ...
## $ GPA : num 3.61 3.67 4 3.19 2.93 ...
## $ RANK : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
## - attr(*, "label")= chr "LOGIT"
Looking at the summary of the dataset
>summary (data)
##  ADMIT        GRE             GPA         RANK
##  0:273   Min.   :220.0   Min.   :2.260   1: 61
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151
##          Median :580.0   Median :3.395   3:121
##          Mean   :587.7   Mean   :3.390   4: 67
##          3rd Qu.:660.0   3rd Qu.:3.670
##          Max.   :800.0   Max.   :4.000

From the summary statistics we observe:

• Most of the students did not get admitted.

• There are no missing data values (NAs).

Checking for multicollinearity:

>plot(data$GPA, data$GRE, col = "red")
>cor(data$GRE, data$GPA)
Exploratory Data Analysis
We will explore the relationship between the dependent and independent variables by way of
visualization.
GRE
Since GRE is a numeric variable and the dependent variable is a factor variable, we plot a box plot:

library(ggplot2) # For plotting
ggplot(data,aes(ADMIT,GRE,fill=ADMIT))+
geom_boxplot()+
theme_bw()+
xlab("Admit")+
ylab("GRE")+
ggtitle("ADMIT BY GRE")

The two box plots differ in terms of displacement, and hence GRE is a significant
variable.
GPA
ggplot(data,aes(ADMIT,GPA,fill=ADMIT))+
geom_boxplot()+
theme_bw()+
xlab("Admit")+
ylab("GPA")+
ggtitle("ADMIT BY GPA")
There is a clear difference in displacement between the two box plots; hence GPA is an important
predictor.
RANK
RANK is a factor variable, and since the dependent variable is also a factor variable, we plot a bar plot.
ggplot(data,aes(RANK,ADMIT,fill=ADMIT))+
geom_col()+
xlab("RANK")+
ylab("COUNT-ADMIT")+
ggtitle("ADMIT BY RANK")

Modelling
Data Splitting
Before we fit a model, we need to split the dataset into training and test dataset to be able to
assess the performance of the model with the unseen test dataset.
library(caret)   # For data splitting
set.seed(125)    # For reproducibility
ind <- createDataPartition(data$ADMIT, p = 0.80, list = FALSE)
training <- data[ind,]    # Training data set
testing <- data[-ind,]    # Testing data set
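The record ends at the data split; the logistic regression itself can be sketched as follows (the model formula is assumed from the task statement, not shown in the original):

```r
# Fit a logistic regression model on the training data
model <- glm(ADMIT ~ GRE + GPA + RANK, data = training,
             family = binomial(link = "logit"))
summary(model)   # Wald z-tests for each coefficient

# A rough goodness-of-fit check: chi-squared tests comparing
# the deviance reduction as each term is added
anova(model, test = "Chisq")
```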

MULTIPLE REGRESSION MODEL
7. Apply multiple regression, if the data have a continuous independent variable. Apply it on the
above dataset.
tidyverse is used for data manipulation and visualization:

>library(tidyverse)

We’ll use the marketing data set (from the datarium package), which contains the impact of the
amount of money spent on three advertising media (youtube, facebook and newspaper) on sales.
First install the datarium package using devtools::install_github("kassambara/datarium"), then load
and inspect the marketing data as follows:
>data("marketing", package = "datarium")
> head(marketing, 4)

## youtube facebook newspaper sales


## 1 276.1 45.4 83.0 26.5
## 2 53.4 47.2 54.1 12.5
## 3 20.6 55.1 83.2 11.2
## 4 181.8 49.6 70.2 22.2

Building the model
We want to build a model for estimating sales based on the advertising budget invested in youtube,
facebook and newspaper:
sales = b0 + b1*youtube + b2*facebook + b3*newspaper
You can compute the model coefficients in R as follows:
>model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
>summary(model)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.59 -1.07 0.29 1.43 3.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.52667 0.37429 9.42 <2e-16 ***
## youtube 0.04576 0.00139 32.81 <2e-16 ***
## facebook 0.18853 0.00861 21.89 <2e-16 ***
## newspaper -0.00104 0.00587 -0.18 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 196 degrees of freedom
## Multiple R-squared: 0.897, Adjusted R-squared: 0.896
## F-statistic: 570 on 3 and 196 DF, p-value: <2e-16

Interpretation
The first step in interpreting the multiple regression analysis is to examine the F-statistic and the
associated p-value, at the bottom of model summary.
In our example, it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly significant.
This means that, at least, one of the predictor variables is significantly related to the outcome
variable.
To see which predictor variables are significant, you can examine the coefficients table, which
shows the estimates of the regression beta coefficients and the associated t-statistic p-values:
>summary(model)$coefficient

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 3.52667 0.37429 9.422 1.27e-17
## youtube 0.04576 0.00139 32.809 1.51e-81
## facebook 0.18853 0.00861 21.893 1.51e-54
## newspaper -0.00104 0.00587 -0.177 8.60e-01

As the newspaper variable is not significant, it is possible to remove it from the model:
>model <- lm(sales ~ youtube + facebook, data = marketing)
>summary(model)
##

## Call:
## lm(formula = sales ~ youtube + facebook, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.557 -1.050 0.291 1.405 3.399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.50532 0.35339 9.92 <2e-16 ***
## youtube 0.04575 0.00139 32.91 <2e-16 ***
## facebook 0.18799 0.00804 23.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 197 degrees of freedom
## Multiple R-squared: 0.897, Adjusted R-squared: 0.896
## F-statistic: 860 on 2 and 197 DF, p-value: <2e-16
Finally, our model equation can be written as follows: sales = 3.5 + 0.045*youtube +
0.187*facebook.
The confidence intervals of the model coefficients can be extracted as follows:
>confint(model)

## 2.5 % 97.5 %
## (Intercept) 2.808 4.2022
## youtube 0.043 0.0485
## facebook 0.172 0.2038

Model accuracy assessment


Residual Standard Error (RSE), or sigma:
The RSE estimate gives a measure of error of prediction. The lower the RSE, the more accurate
the model (on the data in hand).
The error rate can be estimated by dividing the RSE by the mean outcome variable:
>sigma(model)/mean(marketing$sales)

## [1] 0.12

That is, the average prediction error is about 12% of the mean sales.

REGRESSION MODEL FOR PREDICTION

8. Apply regression Model techniques to predict the data on above dataset.

Predicting Blood pressure using Age by Regression in R


We take a dataset of blood pressure and age, and with its help train a linear regression model
in R that will be able to predict blood pressure at ages that are not present in our dataset.

Equation of the regression line in our dataset

BP = 98.7147 + 0.9709 Age

Importing dataset
Importing a dataset of Age vs Blood Pressure which is a CSV file using function read.csv( ) in R
and storing this dataset into a data frame bp.

>bp <- read.csv("bp.csv")

Creating data frame for predicting values

Creating a data frame that stores Age 53; it will be used to predict blood pressure at Age 53
after the linear regression model is created.

>p <- as.data.frame(53)


>colnames(p) <- "Age"
Calculating the correlation between Age and Blood pressure

We can also verify our above analysis that there is a correlation between Blood pressure and Age
by taking the help of cor( ) function in R which is used to calculate the correlation between two
variables.

>cor(bp$BP,bp$Age)

[1] 0.6575673
Creating a Linear regression model
Now, with the help of the lm() function, we are going to make a linear model. lm() takes two
arguments: first a formula, where we use "BP ~ Age" because Age is the independent variable and
blood pressure the dependent variable; and second data, the name of the data frame containing
the data, which in this case is bp.

model <- lm(BP ~ Age, data = bp)

Summary of our linear regression model
summary(model)
Output:

##
## Call:
## lm(formula = BP ~ Age, data = bp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.724 -6.994 -0.520 2.931 75.654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.7147 10.0005 9.871 1.28e-10 ***
## Age 0.9709 0.2102 4.618 7.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121
## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05
Interpretation of the model

## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.7147 10.0005 9.871 1.28e-10 ***
## Age 0.9709 0.2102 4.618 7.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
B0 = 98.7147 (Y- intercept)
B1 = 0.9709 (Age coefficient)
BP = 98.7147 + 0.9709 Age
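The data frame p created earlier can now be passed to predict() to obtain the fitted blood pressure at Age 53 (a sketch; the record does not show this final step):

```r
# Predict blood pressure at Age 53 using the fitted model
predict(model, newdata = p)
# By the fitted equation: 98.7147 + 0.9709 * 53 ≈ 150.2
```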

CLASSIFICATION MODEL
9a. Install relevant package for classification.
9b. Choose classifier for classification problem
9c. Evaluate the performance of classifier
The R package "party" is used to create decision trees.
Install R Package
Use the below command in R console to install the package. You also have to install the dependent
packages if any.
>install.packages("party")
The package "party" has the function ctree() which is used to create and analyze decision trees.
Syntax
The basic syntax for creating a decision tree in R is −
>ctree(formula, data)
INPUT DATA:
>library(party)
>print(head(readingSkills))
When we execute the above code, it produces the following result and chart −
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................

Example
We will use the ctree() function to create the decision tree and see its graph.
# Load the party package. It will automatically load other
# dependent packages.
library(party)

# Create the input data frame.


input.data <- readingSkills[c(1:105),]

# Give the chart file a name.


png(file = "decision_tree.png")

# Create the tree.
output.tree <- ctree(
  nativeSpeaker ~ age + shoeSize + score,
  data = input.data)

# Plot the tree.


plot(output.tree)

# Save the file.


dev.off()
When we execute the above code, it produces the following result −
null device
1
Loading required package: methods
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

as.Date, as.Date.numeric

Loading required package: sandwich
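Part (c) — the record builds the tree but never evaluates it. A sketch of evaluating the classifier on held-out rows (the 106–200 train/test split is an assumption; readingSkills has 200 rows):

```r
library(party)

input.data <- readingSkills[1:105, ]      # training rows (as above)
test.data  <- readingSkills[106:200, ]    # held-out rows

output.tree <- ctree(nativeSpeaker ~ age + shoeSize + score,
                     data = input.data)

# Predict on the unseen rows, then summarize performance
pred <- predict(output.tree, newdata = test.data)
print(table(Predicted = pred, Actual = test.data$nativeSpeaker))  # confusion matrix
cat("Accuracy:", mean(pred == test.data$nativeSpeaker), "\n")
```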

CLUSTERING MODEL
10a. Clustering algorithms for unsupervised classification
10b. Plot the cluster data using R visualizations

k-means clustering
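The matrix x is assumed to already exist in the session; one way such 300×2 sample data could be simulated (a sketch, not the original data):

```r
set.seed(1)
# Three roughly separated 2-D clusters, 300 points in total
x <- rbind(
  cbind(rnorm(100, mean = 2),   rnorm(100, mean = 2)),   # cluster near (2, 2)
  cbind(rnorm(150, mean = -5),  rnorm(150, mean = 2)),   # cluster near (-5, 2)
  cbind(rnorm(50,  mean = 0.7), rnorm(50,  mean = 0))    # cluster near (0.7, 0)
)
```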

str(x)
## num [1:300, 1:2] 3.37 1.44 2.36 2.63 2.4 ...
head(x)
## [,1] [,2]
## [1,] 3.370958 1.995379
## [2,] 1.435302 2.760242
## [3,] 2.363128 2.038991
## [4,] 2.632863 2.735072
## [5,] 2.404268 1.853527
## [6,] 1.893875 1.942113
# Create the k-means model: km.out
km.out <- kmeans(x, centers = 3, nstart = 20)

# Inspect the result


summary(km.out)
## Length Class Mode
## cluster 300 -none- numeric
## centers 6 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric

– Results of kmeans()

# Print the cluster membership component of the model


km.out$cluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 3 3 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2
## [281] 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 2 3 3
# Print the km.out object
km.out
## K-means clustering with 3 clusters of sizes 150, 98, 52
##
## Cluster means:
## [,1] [,2]
## 1 -5.0556758 1.96991743
## 2 2.2171113 2.05110690
## 3 0.6642455 -0.09132968
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 3 3 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2
## [281] 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 2 3 3
##
## Within cluster sum of squares by cluster:
## [1] 295.16925 148.64781 95.50625
## (between_SS / total_SS = 87.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"

– Visualizing and interpreting results of kmeans()

# Scatter plot of x
plot(x,
col = km.out$cluster,
main = "k-means with 3 clusters",
xlab = "",
ylab = "")
