BDA LAB Experiments

Big Data Analytics lab

1. SETTING UP AND INSTALLING HADOOP

HADOOP
Hadoop is an open-source framework designed for processing and storing
large datasets across distributed computing clusters. It enables the handling of big
data by providing a scalable and fault-tolerant solution. The core components of
Hadoop are the Hadoop Distributed File System (HDFS) and the MapReduce
programming model. HDFS divides data into blocks and replicates them across
multiple nodes for reliability. MapReduce allows for parallel processing of data by
splitting it into smaller tasks that can be executed in parallel across the cluster.
Hadoop is widely used in various industries for tasks like data analysis, machine
learning, and log processing, providing a cost-effective and efficient solution for big
data processing.
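The MapReduce flow described above (split the data, map in parallel, shuffle by key, reduce) can be sketched with a word count, the canonical Hadoop example. The following is a minimal Python simulation of the two phases, not the lab's own code; on a cluster, the mapper and reducer would run as separate processes connected by the framework:

```python
# wordcount.py: word-count mapper and reducer in the Hadoop Streaming style.
# Streaming runs the mapper and reducer as separate processes that read
# stdin and write "key<TAB>value" lines; here the two phases are plain
# functions so the flow can be followed (and tested) locally.

def mapper(lines):
    # Map phase: emit (word, 1) for every word seen
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key, so equal words are adjacent
    current, total = None, 0
    for word, count in pairs:
        if word != current:
            if current is not None:
                yield current, total
            current, total = word, 0
        total += count
    if current is not None:
        yield current, total

# Local simulation of map -> shuffle/sort -> reduce
data = ["hadoop stores big data", "big data needs hadoop"]
shuffled = sorted(mapper(data))  # sorting stands in for the shuffle step
for word, count in reducer(shuffled):
    print(f"{word}\t{count}")
```

On a real cluster the same two phases would be submitted through the streaming jar configured below (HADOOP_STREAMING), roughly `hadoop jar $HADOOP_STREAMING -mapper mapper.py -reducer reducer.py -input /in -output /out`; the script names and HDFS paths here are illustrative.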

Step 1
sudo apt install openjdk-11-jdk

Step 2
#Open the ~/.bashrc file and append the following lines
nano ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-11-openjdk-amd64/bin
export HADOOP_HOME=~/hadoop-3.3.4/
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PDSH_RCMD_TYPE=ssh
Save: Control+O
Exit: Control+X
Apply the changes: source ~/.bashrc

Department of Artificial Intelligence and Data Science



Step 3:
( ssh — secure shell — protocol used to securely connect to remote server/system —
transfers data in encrypted form)

sudo apt-get install ssh

Step 4:
#now open hadoop-env.sh
sudo nano $HADOOP_CONF_DIR/hadoop-env.sh

Step 5:
#set JAVA_HOME inside hadoop-env.sh, then save and exit
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

step 6:
ssh localhost

step 7:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

step 8:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

step 9:
chmod 0600 ~/.ssh/authorized_keys

step 10:
hadoop-3.3.4/bin/hdfs namenode -format

step 11:
export PDSH_RCMD_TYPE=ssh

step 12:
start-all.sh

***** Once installed, how to restart Hadoop *****


stop-all.sh
hadoop namenode -format


start-all.sh

2. SHELL COMMANDS IN HADOOP

Shell commands in Hadoop provide a convenient and efficient way to interact


with the Hadoop ecosystem through the command line interface. These commands
allow users to perform various tasks related to managing and manipulating data in
Hadoop. For example, the "hadoop fs" command is used to interact with the Hadoop
Distributed File System (HDFS), allowing users to create, delete, copy, and move
files and directories. The "hadoop jar" command is used to submit MapReduce jobs to
the Hadoop cluster, enabling the processing of large datasets in a distributed manner.
Additionally, there are commands for monitoring cluster status, managing
permissions, and configuring Hadoop settings. Shell commands in Hadoop streamline
the management and execution of tasks, providing a flexible and powerful toolset for
working with big data.

1. mkdir: To make a directory


Syntax: hadoop fs -mkdir /<directory name>

2. nano: to create a file locally


Syntax: nano <filename>

3. rmdir: to remove a directory


Syntax: hadoop fs -rmdir /<directory name>

4. version: to get the current version of hadoop


Syntax: hadoop version

5. ls: To list out the files


Syntax: hadoop fs -ls

3. FILE MANAGEMENT TASKS IN HADOOP


1. put: To copy a file from the local file system into a hadoop directory


Syntax: hadoop fs -put <filename> /<directory_name>

2. get: To get a file from the hadoop directory to a local directory


Syntax: hadoop fs -get /<directory_name>/<file_name> <directory_name>

3. rm: To remove a file from hadoop directory


Syntax: hadoop fs -rm /<directory_name>/<file_name>

4. cp: to copy file from one directory to another


Syntax: hadoop fs -cp /<directory_name>/<file_name> /<directory_name>

4. PROGRAM TO FIND FACTORIAL AND CHECK PALINDROME

AIM
Write an R program to find factorial and check palindrome.

ALGORITHM

1. Start
2. Import library stringi
3. Algorithm for palindrome (x)
3.1 using stri_reverse function; if stri_reverse(x) is equal to x then
3.1.1 print that the number is a palindrome
3.2 else
3.2.1 print that the number is not a palindrome
4. Algorithm for factorial (y)
4.1 factt=1
4.2 if y<0 then
4.2.1 print that y is negative and factorial is not possible
4.3 else if y=0 then
4.3.1 print that the factorial of 0 is 1
4.3.2 Call function palindrome(0)
4.4 else
4.4.1 for i from 1 to y do
4.4.1.1 factt=factt*i
4.4.2 print factt
4.4.3 Call function palindrome(factt)


5. Read a number, k, from the user


6. call function factorial (k)
7. Stop.

PROGRAM
library(stringi)
palin<-function(x){
if(stri_reverse(x)==x){
print(paste(x," is a palindrome"))
}else{
print(paste(x," is not a palindrome"))
}
}

fact=function(y){
factt=1
if(y<0){
print(paste(y, "is a negative number"))
}else if(y==0) {
print("The factorial of 0 is 1")
palin(y)
}else{
for (i in 1:y){
factt=factt*i
}
print(paste("The factorial of ", y," is ",factt ))
palin(factt)
}

}
k=as.integer(readline("Enter a number: "))
fact(k)

OUTPUT

Enter a number: 1


[1] "The factorial of 1 is 1"


[1] "1 is a palindrome"

Enter a number: 11
[1] "The factorial of 11 is 39916800"
[1] "39916800 is not a palindrome"

Enter a number: 0
[1] "The factorial of 0 is 1"
[1] "0 is a palindrome"

Enter a number: -9
[1] "-9 is a negative number"
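The printed results can be cross-checked in a few lines of Python (used here only as an independent check; the lab program itself is the R code above):

```python
import math

def is_palindrome(n: int) -> bool:
    # A number is a palindrome if its decimal digits read the same reversed
    s = str(n)
    return s == s[::-1]

print(math.factorial(11))                 # 39916800, as in the output above
print(is_palindrome(1))                   # True
print(is_palindrome(math.factorial(11)))  # False
```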

5. PROGRAM TO CHECK IF A NUMBER IS PRIME

AIM
Write an R program to check if a number is prime

ALGORITHM
1. Start
2. Algorithm for prime (x)
2.1 flag=0
2.2 for i from 2 to x-1
2.2.1 if x mod i is equal to 0 then
2.2.1.1 flag=1
2.3 if x=2 then
2.3.1 flag=0
2.4 if flag=0 then
2.4.1 print x is a prime number
2.5 else
2.5.1 print x is not a prime number
3. Read a number, k
4. Call function prime(k)
5. Stop


PROGRAM
prim=function(x){
flag=0

for(i in 2:(x-1)){
if((x%%i)==0){
flag=1
}
}
if(x==2){
flag=0
}
if(flag==0){
print(paste(x," is a prime number"))
}else{
print(paste(x," is not a prime number"))
}
}

k=as.integer(readline("Enter a number: "))


prim(k)

OUTPUT

Enter a number: 6
[1] "6 is not a prime number"
> source("~/Desktop/07-05/primmm.R")
Enter a number: 5
[1] "5 is a prime number"
> source("~/Desktop/07-05/primmm.R")
Enter a number: 2
[1] "2 is a prime number"
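Trial division only needs to test divisors up to the square root of x, since any factor above it pairs with one below it. A Python sketch of that optimization, for comparison only (not part of the lab record):

```python
def is_prime(x: int) -> bool:
    # Numbers below 2 are not prime
    if x < 2:
        return False
    # Test divisors only up to the integer square root of x
    i = 2
    while i * i <= x:
        if x % i == 0:
            return False
        i += 1
    return True

print([n for n in range(2, 20) if is_prime(n)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```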


6. PROGRAM TO PRINT A PATTERN

AIM
Write an R program to print a pattern

ALGORITHM

1. Start
2. for i from 1 to 5
2.1 v=c()
2.2 for j from i to 1
2.2.1 v=c(v,c(“*”))
2.3 print v
3. for i from 1 to 5
3.1 v=c()
3.2 for j from i to 5
3.2.1 v=c(v,c(“*”))
3.3 print v
4. Stop

PROGRAM
print("pattern")
for(i in 1:5){
v=c()
for (j in i:1){
v=c(v,c("*"))
}
print(v)
}
for(i in 1:5){
v=c()
for (j in i:5){
v=c(v,c("*"))
}
print(v)
}

OUTPUT


[1] "pattern"
[1] "*"
[1] "*" "*"
[1] "*" "*" "*"
[1] "*" "*" "*" "*"
[1] "*" "*" "*" "*" "*"
[1] "*" "*" "*" "*" "*"
[1] "*" "*" "*" "*"
[1] "*" "*" "*"
[1] "*" "*"
[1] "*"

7. PROGRAM TO IMPLEMENT A SIMPLE CALCULATOR

AIM
Write an R program to implement a simple calculator

ALGORITHM

1. Start
2. Algorithm for add(a,b)
a. Return a+b
3. Algorithm for subtract(a,b)
a. Return a-b
4. Algorithm for multiply(a,b)
a. Return a*b
5. Algorithm for divide(a,b)
a. If b=0 then return 'Not Possible'
b. Else return a/b
6. Read a choice ‘ch’; 1 for addition, 2 for subtraction, 3 for multiplication and 4 for
division
7. Using switch() choose the operator
8. Read 2 values a and b
9. Using switch() call function based on the corresponding function is to be performed
and store the return value to a variable r
10. Print r
11. Stop


PROGRAM

add=function(a,b){
return(a+b)
}
sub=function(a,b){
return(a-b)
}
mul=function(a,b){
return(a*b)
}
div=function(a,b){
if(b==0){
return("Not Possible")
}
return(a/b)
}
print("Enter the choice: ")
print("1. Addition")
print("2. Subtraction")
print("3. Multiplication")
print("4. Division")
ch=as.integer(readline("Enter the choice: "))
a=as.integer(readline("Enter the first number: "))
b=as.integer(readline("Enter the second number: "))
op=switch(ch,"+","-","*","/")
r=switch(ch,add(a,b),sub(a,b),mul(a,b),div(a,b))
print(paste(a,op,b,"=",r))

OUTPUT

[1] "Enter the choice: "


[1] "1. Addition"
[1] "2. Subtraction"
[1] "3. Multiplication"
[1] "4. Division"
Enter the choice: 4
Enter the first number: 5
Enter the second number: 0


[1] "5 / 0 = Not Possible"

8. PROGRAM TO PRINT FIBONACCI SERIES

AIM
Write an R program to print fibonacci series of a number
ALGORITHM
1. Start
2. Algorithm for fib(x)
a. Assign x1=0 and x2=1
b. l=empty vector
c. l=c(l,x1)
d. l=c(l,x2)
e. For i from 3 to x do
i. xn=x1+x2
ii. l=c(l,xn)
iii. x1=x2
iv. x2=xn
f. Print l
3. Read an integer n
4. Call function fib(n)
5. Stop

PROGRAM

fib=function(x)#Function which prints the fibonacci series

{
x1=0
x2=1
l=c()
print(paste("The fibonacci series for ",x," numbers is: \n"))
l=c(l,x1)
l=c(l,x2)#Inserting the first two into the vector
for (i in 3:x)
{
xn=x1+x2
l=c(l,xn)


x1=x2
x2=xn
}
print(l)#printing the series as a vector
}
n=as.integer(readline(prompt = "Enter a number:"))
fib(n)

OUTPUT
Enter a number:8
[1] "The fibonacci series for 8 numbers is: \n"
[1] 0 1 1 2 3 5 8 13

9. PROGRAM TO FIND GCD OF 2 NUMBERS

AIM
Write an R program to print the GCD of 2 numbers

ALGORITHM

1. Start
2. Algorithm gcd(a,b)
a. Assign gcdd=1
b. If a<b then l=a else l=b
c. for i from 1 to l do
i. if a mod i = 0 and b mod i = 0 then gcdd=i
d. return gcdd
3. Read 2 integers a and b
4. Call function gcd(a,b) and store the result to a variable k
5. print k
6. Stop

PROGRAM
#greatest common divisor
gcd=function(a,b){
gcdd=1
if(a<b)
{


l=a
}
else
{
l=b
}
for (i in 1:l)
{
if((a%%i==0)&&(b%%i==0))
{
gcdd=i
}
}
return(gcdd)
}
a=as.integer(readline("Enter a number:"))
b=as.integer(readline("Enter another number:"))
k=gcd(a,b)
print(paste("The GCD is: ",k))

OUTPUT
Enter a number:6
Enter another number:5
[1] "The GCD is: 1"
> source("~/Desktop/14-03/gcd.R")
Enter a number:6
Enter another number:4
[1] "The GCD is: 2"
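The linear scan above works, but Euclid's algorithm is the standard faster route, and the LCM of the next experiment follows directly from lcm(a,b) = a*b / gcd(a,b). A Python sketch for comparison only:

```python
def gcd(a: int, b: int) -> int:
    # Euclid's algorithm: replace (a, b) with (b, a mod b) until b is 0
    while b != 0:
        a, b = b, a % b
    return a

def lcm(a: int, b: int) -> int:
    # The LCM follows directly from the GCD
    return a * b // gcd(a, b)

print(gcd(6, 4))  # 2, matching the output above
print(lcm(6, 5))  # 30
```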

10. PROGRAM TO FIND LCM OF 2 NUMBERS


AIM
Write an R program to find LCM of 2 numbers
ALGORITHM
1. Start
2. Algorithm lcm(a,b)
a. Assign lcmm=1
b. If a<b then g=b else g=a


c. While true do
i. if g mod a = 0 and g mod b = 0 then lcmm=g and break
ii. g=g+1
d. return lcmm
3. Read 2 integers a and b
4. Call function lcm(a,b) and store the result to a variable k
5. print k
6. Stop

PROGRAM

#lcm
lcm=function(a,b){
lcmm=1
if(a<b)
{
g=b
}
else
{
g=a
}
while(TRUE)
{
if((g%%a==0)&&(g%%b==0))
{
lcmm=g
break
}
g=g+1
}
return(lcmm)
}
a=as.integer(readline("Enter a number:"))
b=as.integer(readline("Enter another number:"))
k=lcm(a,b)
print(paste("The LCM is: ",k))

OUTPUT


Enter a number:6
Enter another number:5
[1] "The LCM is: 30"

11. PROGRAM TO PRINT THE SUM OF N NATURAL NUMBERS

AIM
Write an R program to print the sum of N natural numbers
ALGORITHM
1. Start
2. Read an integer n
3. Assign sum=0
4. For i from 1 to n do
a. sum=sum+i
5. Print sum
6. Stop

PROGRAM
#sum of n natural numbers
n=as.integer(readline("Enter a natural number:"))
sum=0
for (i in 1:n)
{
sum=sum+i
}
cat("The sum is: ",sum)

OUTPUT

Enter a natural number:5


The sum is: 15
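The loop result can be verified against the closed form n(n+1)/2. A quick Python check (illustrative only):

```python
def sum_natural(n: int) -> int:
    # Same accumulation loop as the R program
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

n = 5
print(sum_natural(n))    # 15, matching the output above
print(n * (n + 1) // 2)  # the closed form gives the same value
```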

12. PROGRAM TO PRINT OCCURRENCES OF N RANDOM NUMBERS


AIM


Write an R program to create a list of random numbers from a normal distribution and count
the occurrences of each value
ALGORITHM

1. Start
2. Using rnorm() function, generate 50 random values and take the floor of each value
3. Using table() function get the occurrences of each value
4. Print the table
5. Stop

PROGRAM

x=floor(rnorm(n=50))
t = table(x)
print("Occurrences of each value:")
print(t)

OUTPUT
[1] "Occurrences of each value:"
x
-3 -2 -1 0 1 2
1 10 19 15 4 1
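The same draw-and-tally can be sketched in Python, with random.gauss and collections.Counter playing the roles of rnorm() and table() (illustrative only; the counts differ from the run above because the generators differ):

```python
import math
import random
from collections import Counter

random.seed(0)  # fixed seed so the tally is reproducible
# 50 standard-normal draws, floored to integers as in the R program
x = [math.floor(random.gauss(0, 1)) for _ in range(50)]
counts = Counter(x)
print(dict(sorted(counts.items())))
```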

13. PROGRAM TO PLOT A BARPLOT


AIM
Write an R program to plot a barplot
ALGORITHM

1. Start
2. Read the subjects
3. Read the corresponding marks
4. Using barplot() function, plot the graph
5. Stop

PROGRAM
subject=c("BDA","PR","RIS","IEFT","AAD")
mark=c(47,49,34,46,48)
barplot(mark,names.arg=subject,main="Bar Plot",xlab="Subject",ylab="Mark",col="royalblue2")


OUTPUT

(Bar plot figure: marks per subject, with subjects on the x-axis and marks on the y-axis)
14. PROGRAM TO PRINT SUM, MEAN AND PRODUCT OF A VECTOR


AIM
Write an R program to compute sum, mean and product of a given vector
ALGORITHM
1. Start
2. Read a vector, v
3. Sum of that vector can be found using sum() function
4. Mean can be found using mean() function
5. To a variable 'size', store the length of the vector v
6. Assign prod=1
7. For i from 1 to size do
a. prod=prod*v[i]
8. Print prod
9. Stop
PROGRAM
#sum mean and product
v=c(23,12,34,54,1,2,7,8)
cat("The Vector is: ",v,"\n")
cat("The sum of the vector is: ",sum(v),"\n")
cat("The mean of the vector is: ",mean(v),"\n")


size = length(v)
prod = 1
for(i in 1:size)
{
prod = v[i]*prod
}
cat("The product of the vector is: ",prod,"\n")

OUTPUT

The Vector is: 23 12 34 54 1 2 7 8


The sum of the vector is: 141
The mean of the vector is: 17.625
The product of the vector is: 56754432

15. PROGRAM TO PRINT A DATAFRAME


AIM
Write an R program to create a dataframe which contains the details of 5 employees and
display the details
ALGORITHM
1. Start
2. Read names to a vector
3. Read gender to a vector
4. Read age to a vector
5. Read designation to a vector
6. Using data.frame() function, make the vectors into a single dataframe and print it
7. Stop

PROGRAM
Name=c("John","Christy","Ivy","Maggie","Zayn")
Gender=c("M","M","F","F","M")
Age=c(21,24,25,27,36)
Designation=c("Clerk","Manager","Executive","CEO","Assistant")
Employees = data.frame(Name,Gender,Age,Designation);
print("EMPLOYEE DETAILS:")
print(Employees)


OUTPUT
[1] "EMPLOYEE DETAILS:"
Name Gender Age Designation
1 John M 21 Clerk
2 Christy M 24 Manager
3 Ivy F 25 Executive
4 Maggie F 27 CEO
5 Zayn M 36 Assistant

16. PROGRAM TO IMPLEMENT LINEAR REGRESSION


AIM
Write an R program to implement simple linear regression

ALGORITHM

1. Start
2. Read height values
3. Read weight values
4. Make the height and weight into a single data frame
5. Print the data frame
6. Write the data frame to a csv file
7. Create a linear regression model with height as independent variable and weight as
dependent variable using lm() function
8. Print the Slope and Y intercept of the line
9. Predict the weight of a given height using predict() function
10. Plot the linear regression graph using abline
11. Stop
PROGRAM
x=c(150,152,153,160,157,158,166,170,156,154)
y=c(45,50,49,59,53,55,55,67,54,52)
df=data.frame(x,y)
print(df)
write.csv(df,"file.csv")
v=lm(y~x)
print("The Formula, Slope and Y intercept of the line is: ")
print(v$coefficients)
pred=data.frame(x=156)
xpred=predict(v,pred)


print(paste("Predicted Weight of height 156 is:",xpred))


plot(x,y,col='blue',main='Height and Weight Regression',abline(v,col='red'),xlab="Height",ylab="Weight")

OUTPUT

x y
1 150 45
2 152 50
3 153 49
4 160 59
5 157 53
6 158 55
7 166 55
8 170 67
9 156 54
10 154 52
[1] "The Formula, Slope and Y intercept of the line is: "
(Intercept) x
-80.3518519 0.8518519
[1] "Predicted Weight of height 156 is: 52.537037037037"
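The slope and intercept that lm() reports are the ordinary least-squares estimates b = Sxy/Sxx and a = ȳ − b·x̄. A Python check against the printed coefficients, using the same x and y data (illustrative only):

```python
x = [150, 152, 153, 160, 157, 158, 166, 170, 156, 154]
y = [45, 50, 49, 59, 53, 55, 55, 67, 54, 52]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Sxy and Sxx are the centered cross- and self-products
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sxy / sxx
intercept = my - slope * mx
print(round(slope, 7), round(intercept, 7))  # 0.8518519 -80.3518519
print(round(intercept + slope * 156, 6))     # 52.537037
```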


17. PROGRAM TO IMPLEMENT LOGISTIC REGRESSION

AIM
Write an R program to implement Logistic Regression
ALGORITHM
1. Start
2. Import libraries caTools and ROCR
3. Load a csv file
4. Split the data into testing and training data with split ratio 80%
5. Create a logistic regression model with the training data using the function glm()
6. Print the summary of the model using the function summary()
7. Predict using the model the test values using the function predict()
8. Print the confusion matrix with the actual values and predicted values using
table() function
9. Predict the Outcome of the training data also, using the function predict(), and store to
predict1 variable
10. Create an ROC curve using the predict1 values using the prediction() function
11. Find the Area under the curve, AUC using the function performance()
12. Plot the ROC graph using plot() function
13. Stop

PROGRAM

#logistic regression
library(caTools)
library(ROCR)
df=read.csv("diabetes.csv")
spl=sample.split(df,SplitRatio=0.8)
train_reg <- subset(df, spl == "TRUE")
test_reg <- subset(df, spl == "FALSE")
model=glm(Outcome~.,data=train_reg,family="binomial")
print("The model is: ")
print(model)
summ=summary(model)
print("The summary of the model is")
print(summ)


predict<- predict(model,test_reg, type = "response")


p1=data.frame(predict)
print("The predicted values of Test Data is: ")

print(head(p1,10))
x<-table(test_reg$Outcome, predict>0.4)
print("The Confusion Matrix is: ")
print(x)
acc <-(x[[1,1]]+x[[2,2]])/sum(x)
print(paste("The Accuracy is:",acc))
predict1<- predict(model,train_reg, type = "response")
p2=data.frame(predict1)
print("The predicted values of Train Data is: ")
print(head(p2,10))
ROC1 <- prediction(predict1, train_reg$Outcome)
ROC2 <- performance(ROC1, measure = "tpr", x.measure = "fpr")
auc <- performance(ROC1, measure = "auc")
auc <- auc@y.values[[1]]
cat("The Area under the curve is: ",auc)
auc <- round(auc, 4)
plot(ROC2, colorize=TRUE,print.cutoffs.at = seq(0.1, by = 0.1), main = "ROC CURVE")
abline(a = 0, b = 1)

OUTPUT
[1] "The model is: "

Call: glm(formula = Outcome ~ ., family = "binomial", data = train_reg)

Coefficients:
(Intercept) Pregnancies Glucose
-8.639132 0.143143 0.037756
BloodPressure SkinThickness Insulin
-0.016310 0.001434 -0.001336
BMI DiabetesPedigreeFunction Age
0.096936 0.633853 0.011697

Degrees of Freedom: 596 Total (i.e. Null); 588 Residual


Null Deviance: 775.6


Residual Deviance: 550.2 AIC: 568.2
[1] "The summary of the model is"

Call:
glm(formula = Outcome ~ ., family = "binomial", data = train_reg)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.4706 -0.7023 -0.3893 0.7345 2.6794

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.639132 0.824683 -10.476 < 2e-16 ***
Pregnancies 0.143143 0.037795 3.787 0.000152 ***
Glucose 0.037756 0.004350 8.680 < 2e-16 ***
BloodPressure -0.016310 0.006011 -2.714 0.006656 **
SkinThickness 0.001434 0.007827 0.183 0.854605
Insulin -0.001336 0.001014 -1.318 0.187473
BMI 0.096936 0.017400 5.571 2.53e-08 ***
DiabetesPedigreeFunction 0.633853 0.331976 1.909 0.056219 .
Age 0.011697 0.010617 1.102 0.270576
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 775.56 on 596 degrees of freedom


Residual deviance: 550.22 on 588 degrees of freedom
AIC: 568.22

Number of Fisher Scoring iterations: 5

[1] "The predicted values of Test Data is: "


predict
3 0.8069695
8 0.7271615
12 0.9129396
17 0.3447163


21 0.3609391
26 0.4695308
30 0.2606044
35 0.4248690
39 0.1589572
44 0.9251604
[1] "The Confusion Matrix is: "

FALSE TRUE
0 99 15
1 29 28
[1] "The Accuracy is: 0.742690058479532"
[1] "The predicted values of Train Data is: "
predict
3 0.8069695
8 0.7271615
12 0.9129396
17 0.3447163
21 0.3609391
26 0.4695308
30 0.2606044
35 0.4248690
39 0.1589572
44 0.9251604
The Area under the curve is: 0.8481448
> source("~/.active-rstudio-document")
[1] "The model is: "

Call: glm(formula = Outcome ~ ., family = "binomial", data = train_reg)

Coefficients:
(Intercept) Pregnancies Glucose
-8.231923 0.133019 0.034235
BloodPressure SkinThickness Insulin
-0.012922 0.002832 -0.001240
BMI DiabetesPedigreeFunction Age
0.087621 0.968626 0.013071

Degrees of Freedom: 596 Total (i.e. Null); 588 Residual


Null Deviance: 781.4


Residual Deviance: 571.8 AIC: 589.8
[1] "The summary of the model is"

Call:
glm(formula = Outcome ~ ., family = "binomial", data = train_reg)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.5474 -0.7373 -0.4114 0.7502 2.8689

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.2319229 0.7966276 -10.333 < 2e-16 ***
Pregnancies 0.1330191 0.0372710 3.569 0.000358 ***
Glucose 0.0342347 0.0040348 8.485 < 2e-16 ***
BloodPressure -0.0129215 0.0058050 -2.226 0.026018 *
SkinThickness 0.0028324 0.0078585 0.360 0.718529
Insulin -0.0012398 0.0009722 -1.275 0.202237
BMI 0.0876209 0.0165019 5.310 1.1e-07 ***
DiabetesPedigreeFunction 0.9686256 0.3299302 2.936 0.003326 **
Age 0.0130708 0.0103317 1.265 0.205831
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 781.42 on 596 degrees of freedom


Residual deviance: 571.84 on 588 degrees of freedom
AIC: 589.84

Number of Fisher Scoring iterations: 5

[1] "The predicted values of Test Data is: "


predict
2 0.05296349
6 0.15162920
11 0.22250460
15 0.62678551


20 0.24266690
24 0.32726113
29 0.56138003
33 0.05283638
38 0.42753236
42 0.69997193
[1] "The Confusion Matrix is: "

FALSE TRUE
0 100 19
1 16 36
[1] "The Accuracy is: 0.795321637426901"
[1] "The predicted values of Train Data is: "
predict
2 0.05296349
6 0.15162920
11 0.22250460
15 0.62678551
20 0.24266690
24 0.32726113
29 0.56138003
33 0.05283638
38 0.42753236
42 0.69997193
The Area under the curve is: 0.8351925
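The accuracy figures printed above come straight from the confusion matrices as (TN+TP)/total. A Python re-check of both runs, with the matrices copied from the output (illustrative only):

```python
def accuracy(cm):
    # cm is [[TN, FP], [FN, TP]] as printed by table(actual, predicted > 0.4)
    tn, fp = cm[0]
    fn, tp = cm[1]
    return (tn + tp) / (tn + fp + fn + tp)

run1 = [[99, 15], [29, 28]]
run2 = [[100, 19], [16, 36]]
print(round(accuracy(run1), 4))  # 0.7427
print(round(accuracy(run2), 4))  # 0.7953
```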


18. PROGRAM TO IMPLEMENT STATISTICAL ANALYSIS


AIM
Write an R program to perform statistical Analysis

ALGORITHM
1. Start
2. Load datasets mtcars and iris
3. Perform one sample t test, 2 sample t test and paired t test using t.test() function
4. Perform anova test using aov() function
5. Perform Shapiro Normality test using shapiro.test() function
6. Perform Kolmogorov-Smirnov test using ks.test() function
7. Perform Kruskal test using kruskal.test() function
8. Perform Wilcoxon test using wilcox.test() function
9. Perform Fligner test using fligner.test() function
10. Perform Ansari test using ansari.test() function


11. Read two vectors, smokers and patients, and using this data perform a Proportion test
using prop.test() function
12. Perform Binomial test using binom.test() function
13. Stop

PROGRAM
#statistical analysis
data(mtcars)
data(iris)
print(t.test(mtcars$mpg, y=NULL)) # One sample
print(t.test(mpg ~ cyl, data = mtcars, subset = cyl %in% c(4, 6))) # Two sample
print(t.test(mtcars$mpg, mtcars$am, data = mtcars, paired = T)) # Paired t-test
print(aov(mpg ~ cyl, data = mtcars)) # ANOVA Test
print(shapiro.test(mtcars$wt)) # Shapiro Normality Test
print(ks.test(mtcars$wt, mtcars$disp)) # Kolmogorov-Smirnov test
print(kruskal.test(mpg ~ am, data = mtcars)) # Kruskal Test
print(wilcox.test(iris$Sepal.Length)) # Wilcoxon Test
print(fligner.test(mtcars$mpg, mtcars$am)) # Fligner Test
print(ansari.test(rnorm(20), rnorm(10, 0, 5), conf.int = T)) #Ansari Test
smokers <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)
print(prop.test(smokers, patients)) # Proportion Test
print(binom.test(64, 100, 0.5)) # Binomial Test

OUTPUT
One Sample t-test

data: mtcars$mpg
t = 18.857, df = 31, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
17.91768 22.26357
sample estimates:
mean of x
20.09062

Welch Two Sample t-test


data: mpg by cyl


t = 4.7191, df = 12.956, p-value = 0.0004048
alternative hypothesis: true difference in means between group 4 and group 6 is not equal to
0
95 percent confidence interval:
3.751376 10.090182
sample estimates:
mean in group 4 mean in group 6
26.66364 19.74286

Paired t-test

data: mtcars$mpg and mtcars$am


t = 19.394, df = 31, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
17.61433 21.75442
sample estimates:
mean difference
19.68437

Call:
aov(formula = mpg ~ cyl, data = mtcars)

Terms:
cyl Residuals
Sum of Squares 817.7130 308.3342
Deg. of Freedom 1 30

Residual standard error: 3.205902


Estimated effects may be unbalanced

Shapiro-Wilk normality test

data: mtcars$wt
W = 0.94326, p-value = 0.09265


Exact two-sample Kolmogorov-Smirnov test

data: mtcars$wt and mtcars$disp


D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided

Kruskal-Wallis rank sum test

data: mpg by am
Kruskal-Wallis chi-squared = 9.7914, df = 1, p-value = 0.001753

Wilcoxon signed rank test with continuity correction

data: iris$Sepal.Length
V = 11325, p-value < 2.2e-16
alternative hypothesis: true location is not equal to 0

Fligner-Killeen test of homogeneity of variances

data: mtcars$mpg and mtcars$am


Fligner-Killeen:med chi-squared = 4.4929, df = 1, p-value = 0.03404

Ansari-Bradley test

data: rnorm(20) and rnorm(10, 0, 5)


AB = 193, p-value = 0.002937
alternative hypothesis: true ratio of scales is not equal to 1
95 percent confidence interval:
0.05364466 0.73505807
sample estimates:
ratio of scales
0.1959577


4-sample test for equality of proportions without continuity correction

data: smokers out of patients


X-squared = 12.6, df = 3, p-value = 0.005585
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.9651163 0.9677419 0.9485294 0.8536585

Exact binomial test

data: 64 and 100


number of successes = 64, number of trials = 100, p-value = 0.006637
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5378781 0.7335916
sample estimates:
probability of success
0.64

19. PROGRAM TO PRINT VARIANCE, COVARIANCE AND CORRELATION OF A DATASET

AIM
Write an R program to print variance, correlation and covariance of a data

ALGORITHM
1. Start
2. Load the first 4 columns of iris data and store that to a variable ‘data’
3. Find variance using apply(data,margin,function), where function =var
4. Find the covariance matrix using cov() function
5. Find the correlation matrix using cor() function
6. Stop
PROGRAM

#covariance & correlation


data=iris[1:4]
print("Variance:")
var = apply(data,2,var)
print(var)
print("Covariance:")
print(cov(data))
print("Correlation:")
print(cor(data,method='pearson'))

OUTPUT

[1] "Variance:"
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.6856935 0.1899794 3.1162779 0.5810063
[1] "Covariance:"
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
[1] "Correlation:"
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
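Correlation is just covariance rescaled by the standard deviations, r = cov(x,y)/(σx·σy). A quick Python check using the printed Sepal.Length/Sepal.Width values (illustrative only):

```python
import math

var_sl = 0.6856935      # variance of Sepal.Length, from the output above
var_sw = 0.1899794      # variance of Sepal.Width
cov_sl_sw = -0.0424340  # their covariance
r = cov_sl_sw / (math.sqrt(var_sl) * math.sqrt(var_sw))
print(round(r, 5))  # -0.11757, matching the correlation matrix above
```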

20. PROGRAM TO IMPLEMENT SVM CLASSIFIER


AIM
Write an R program to implement SVM classifier

ALGORITHM
1. Start
2. Import libraries e1071 and caTools
3. Read a csv file to a variable ‘data’
4. Set the seed value to 123
5. Split the data into testing and training data with split ratio 80%
6. Create an SVM classifier model using the data, and svm() function


7. Predict the values of test data using the model


8. Create a confusion matrix with actual values and predicted values of test data
9. Find the accuracy
10. Stop
PROGRAM
library(e1071)
library(caTools)
file='/home/student/Desktop/27-04/social1.csv' #Social network ads
data=read.csv(file)
set.seed(123)
spl=sample.split(data,SplitRatio=0.75)
train_reg <- subset(data, spl == "TRUE")
test_reg <- subset(data, spl == "FALSE")
model=svm(Purchased~.,train_reg,type='C-classification',kernel='linear')
print(model)
pred=predict(model,test_reg)
cm=table(test_reg$Purchased,pred)
print(cm)
accu=(sum(diag(cm)))/sum(cm)
cat("Accuracy:",accu)

OUTPUT

Call:
svm(formula = Purchased ~ ., data = train_reg, type = "C-classification",
kernel = "linear")

Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 1

Number of Support Vectors: 120

pred
0 1
0 54 1


1 18 27
Accuracy: 0.81

21. PROGRAM TO IMPLEMENT DECISION TREE ALGORITHM


AIM
Write an R program to implement decision tree algorithm

ALGORITHM

1. Start
2. Import libraries rpart, rpart.plot and caTools
3. Load iris dataset
4. Set seed value to 123
5. Split the data into training and testing data with split ratio 80%
6. Create a decision tree model using train data using the function rpart()
7. Predict the values of test data using model
8. Create a confusion matrix using actual values and predicted values of test data
9. Find the accuracy of the model
10. Stop
PROGRAM
library(rpart)
library(rpart.plot)
library(caTools)
data=iris
set.seed(123)
spl=sample.split(data,SplitRatio=0.8)
train_reg <- subset(data, spl == "TRUE")
test_reg <- subset(data, spl == "FALSE")
model<-rpart(Species~.,train_reg,method="class")
rpart.plot(model,type=4,extra=101)
pred=predict(model,test_reg,type='class')
cm=table(test_reg$Species,pred)
print(cm)
accu=(cm[1]+cm[5]+cm[9])/sum(cm)
cat("The Accuracy is:",accu)

OUTPUT
pred


setosa versicolor virginica


setosa 10 0 0
versicolor 0 10 0
virginica 0 3 7
The Accuracy is: 0.9

22. PROGRAM TO IMPLEMENT K-MEANS CLUSTERING

AIM
Write an R program to implement k-means clustering
ALGORITHM

1. Start
2. Import library cluster
3. Load iris data from columns 1 to 4
4. Create a k-means clustering model with number of clusters 3 using the kmeans() function
5. Using autoplot() plot the graph
6. Print the confusion matrix with species and the clusters of the model
7. Print the cluster centers
8. Stop
PROGRAM


library(ggplot2)
library(ggfortify) #provides autoplot() for kmeans objects
library(cluster)
data=iris
df=data[1:4] #remove the label bc unsupervised
print(head(df))
km=kmeans(df,centers = 3)
print("The Model is: ")
print(km)
plot(autoplot(km,df,frame=TRUE)) # to visually display that clusters are distinct
print("The Cluster Centers are:")
print(km$centers)
cm=table(data$Species,km$cluster)
print("The Confusion Matrix is:")
print(cm)

OUTPUT
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
[1] "The Model is: "
K-means clustering with 3 clusters of sizes 50, 38, 62

Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.006000 3.428000 1.462000 0.246000
2 6.850000 3.073684 5.742105 2.071053
3 5.901613 2.748387 4.393548 1.433871

Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[37] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[73] 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2
[109] 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2
[145] 2 2 3 2 2 3


Within cluster sum of squares by cluster:


[1] 15.15100 23.87947 39.82097
(between_SS / total_SS = 88.4 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss"


[6] "betweenss" "size" "iter" "ifault"
[1] "The Cluster Centers are:"
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.006000 3.428000 1.462000 0.246000
2 6.850000 3.073684 5.742105 2.071053
3 5.901613 2.748387 4.393548 1.433871
[1] "The Confusion Matrix is:"

1 2 3
setosa 50 0 0
versicolor 0 2 48
virginica 0 36 14
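Because k-means cluster numbers are arbitrary, agreement with the species labels is read off by giving each cluster its majority species. A Python sketch using the confusion matrix printed above (illustrative only):

```python
# Rows: actual species; columns: clusters 1..3 (copied from the output above)
cm = {
    "setosa":     [50, 0, 0],
    "versicolor": [0, 2, 48],
    "virginica":  [0, 36, 14],
}
n_clusters = 3
total = sum(sum(row) for row in cm.values())
# For each cluster, count only its majority-species members as correct
correct = sum(max(cm[s][j] for s in cm) for j in range(n_clusters))
print(correct, total, round(correct / total, 4))  # 134 150 0.8933
```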
