BDA LAB Experiments
HADOOP
Hadoop is an open-source framework designed for processing and storing
large datasets across distributed computing clusters. It enables the handling of big
data by providing a scalable and fault-tolerant solution. The core components of
Hadoop are the Hadoop Distributed File System (HDFS) and the MapReduce
programming model. HDFS divides data into blocks and replicates them across
multiple nodes for reliability. MapReduce allows for parallel processing of data by
splitting it into smaller tasks that can be executed in parallel across the cluster.
Hadoop is widely used in various industries for tasks like data analysis, machine
learning, and log processing, providing a cost-effective and efficient solution for big
data processing.
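The map/shuffle/reduce pipeline described above can be sketched in a few lines of plain R; this is a conceptual word-count illustration only, since real Hadoop jobs run the map and reduce phases in parallel across the cluster:

```r
# Conceptual word count in the MapReduce style (single machine, plain R)
lines <- c("big data big cluster", "data nodes store data")

# Map phase: emit one (word, 1) pair per word
words <- unlist(strsplit(lines, " "))

# Shuffle phase: group the emitted 1s by their key (the word)
groups <- split(rep(1, length(words)), words)

# Reduce phase: sum the counts within each group
counts <- sapply(groups, sum)
print(counts)   # big=2, cluster=1, data=3, nodes=1, store=1
```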
Step 1
sudo apt install openjdk-11-jdk
Step 2
#Open the .bashrc file and add the following lines
nano ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-11-openjdk-amd64/bin
export HADOOP_HOME=~/hadoop-3.3.4/
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PDSH_RCMD_TYPE=ssh
Save: Control+O
Exit: Control+X
Step 3:
(ssh — Secure Shell — a protocol used to securely connect to a remote server/system;
it transfers data in encrypted form)
Step 4:
#now open hadoop-env.sh
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Step 5:
#Add this line inside hadoop-env.sh, then save and exit
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
step 6:
ssh localhost
step 7:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
step 8:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
step 9:
chmod 0600 ~/.ssh/authorized_keys
step 10:
hadoop-3.3.4/bin/hdfs namenode -format
step 11:
export PDSH_RCMD_TYPE=ssh
step 12:
start-all.sh
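After start-all.sh, the install can be sanity-checked; jps (bundled with the JDK) should list the Hadoop daemons, and a few basic HDFS commands confirm the filesystem is usable (these assume the cluster started successfully):

```shell
# Expect NameNode, DataNode, SecondaryNameNode,
# ResourceManager and NodeManager in the jps listing
jps

# Basic HDFS smoke test
hdfs dfs -mkdir -p /user/test        # create a directory in HDFS
hdfs dfs -put ~/.bashrc /user/test   # copy a local file into HDFS
hdfs dfs -ls /user/test              # list it back
```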
AIM
Write an R program to find factorial and check palindrome.
ALGORITHM
1. Start
2. Import library stringi
3. Algorithms for palindrome (x)
3.1 using stri_reverse function; if stri_reverse(x) is equal to x then
3.1.1 print that the number is palindrome
3.2 else
3.2.1 print that the number is not palindrome
4. Algorithm for factorial (y)
4.1 factt=1
4.2 if y<0 then
4.2.1 print that y is negative and factorial is not possible
4.3 if y=0 then
4.3.1 print that the factorial of 0 is 1
4.3.2 Call function palindrome(y)
4.4 else
4.4.1 for i from 1 to y do
4.4.1.1 factt=factt*i
4.4.2 print factt
4.4.3 Call function palindrome(factt)
PROGRAM
library(stringi)
palin<-function(x){
if(stri_reverse(x)==x){
print(paste(x," is a palindrome"))
}else{
print(paste(x," is not a palindrome"))
}
}
fact=function(y){
factt=1
if(y<0){
print(paste(y, "is a negative number"))
}else if(y==0) {
print("The factorial of 0 is 1")
palin(y)
}else{
for (i in 1:y){
factt=factt*i
}
print(paste("The factorial of ", y," is ",factt ))
palin(factt)
}
}
k=as.integer(readline("Enter a number: "))
fact(k)
OUTPUT
Enter a number: 11
[1] "The factorial of 11 is 39916800"
[1] "39916800 is not a palindrome"
Enter a number: 0
[1] "The factorial of 0 is 1"
[1] "0 is a palindrome"
Enter a number: -9
[1] "-9 is a negative number"
AIM
Write an R program to check if a number is prime
ALGORITHM
1. Start.
2. Algorithm for prime (x)
2.1 flag=0
2.2 for i from 2 to x-1
2.2.1 if x mod i is equal to 0 then
2.2.1.1 flag=1
2.3 if x=2 then
2.3.1 flag=0
2.4 if flag=0 then
2.4.1 Print x is a prime number
2.5 else
2.5.1 Print x is not a prime number
3. Read a number, k
4. Call function prime(k)
5. Stop
PROGRAM
prim=function(x){
flag=0
for(i in 2:(x-1)){
if((x%%i)==0){
flag=1
}
}
if(x==2){
flag=0
}
if(flag==0){
print(paste(x," is a prime number"))
}else{
print(paste(x," is not a prime number"))
}
}
k=as.integer(readline("Enter a number: "))
prim(k)
OUTPUT
Enter a number: 6
[1] "6 is not a prime number"
> source("~/Desktop/07-05/primmm.R")
Enter a number: 5
[1] "5 is a prime number"
> source("~/Desktop/07-05/primmm.R")
Enter a number: 2
[1] "2 is a prime number"
AIM
Write an R program to print a pattern
ALGORITHM
1. Start
2. for i from 1 to 5
2.1 v=c()
2.2 for j from i to 1
2.2.1 v=c(v,c("*"))
2.3 print v
3. for i from 1 to 5
3.1 v=c()
3.2 for j from i to 5
3.2.1 v=c(v,c("*"))
3.3 print v
4. Stop
PROGRAM
print("pattern")
for(i in 1:5){
v=c()
for (j in i:1){
v=c(v,c("*"))
}
print(v)
}
for(i in 1:5){
v=c()
for (j in i:5){
v=c(v,c("*"))
}
print(v)
}
OUTPUT
[1] "pattern"
[1] "*"
[1] "*" "*"
[1] "*" "*" "*"
[1] "*" "*" "*" "*"
[1] "*" "*" "*" "*" "*"
[1] "*" "*" "*" "*" "*"
[1] "*" "*" "*" "*"
[1] "*" "*" "*"
[1] "*" "*"
[1] "*"
AIM
Write an R program to implement a simple calculator
ALGORITHM
1. Start
2. Algorithm for add(a,b)
a. return (a+b)
3. Algorithm for subtract(a,b)
a. Return a-b
4. Algorithm for multiply(a,b)
a. Return a*b
5. Algorithm for divide (a,b)
a. If b=0 then return ‘Not possible’
b. Else return a/b
6. Read a choice ‘ch’; 1 for addition, 2 for subtraction, 3 for multiplication and 4 for
division
7. Using switch() choose the operator
8. Read 2 values a and b
9. Using switch() call function based on the corresponding function is to be performed
and store the return value to a variable r
10. Print r
11. Stop
PROGRAM
add=function(a,b){
return(a+b)
}
sub=function(a,b){
return(a-b)
}
mul=function(a,b){
return(a*b)
}
div=function(a,b){
if(b==0){
return("Not possible")
}
return(a/b)
}
print("Enter the choice: ")
print("1. Addition")
print("2. Subtraction")
print("3. Multiplication")
print("4. Division")
ch=as.integer(readline("Enter the choice: "))
a=as.integer(readline("Enter the first number: "))
b=as.integer(readline("Enter the second number: "))
op=switch(ch,"+","-","*","/")
r=switch(ch,add(a,b),sub(a,b),mul(a,b),div(a,b))
print(paste(a," ",op, " = ",r ))
OUTPUT
AIM
Write an R program to print fibonacci series of a number
ALGORITHM
1. Start
2. Algorithm for fib(x)
a. Assign x1=0 and x2=1
b. l=empty vector
c. Append x1 to l
d. Append x2 to l
e. For i from 3 to x do
i. xn=x1+x2
ii. l=c(l,xn)
iii. x1=x2
iv. x2=xn
f. Print l
3. Read an integer n
4. Call function fib(n)
5. Stop
PROGRAM
fib=function(x)
{
x1=0
x2=1
l=c()
print(paste("The fibonacci series for ",x," numbers is: \n"))
l=c(l,x1)
l=c(l,x2)#Inserting the first two into the vector
for (i in 3:x)
{
xn=x1+x2
l=c(l,xn)
x1=x2
x2=xn
}
print(l)#printing the series as a vector
}
n=as.integer(readline(prompt = "Enter a number:"))
fib(n)
OUTPUT
Enter a number:8
[1] "The fibonacci series for 8 numbers is: \n"
[1] 0 1 1 2 3 5 8 13
AIM
Write an R program to print the GCD of 2 numbers
ALGORITHM
1. Start
2. Algorithm gcd(a,b)
a. Assign gcdd=1
b. If a<b then l=a else l=b
c. for i from 1 to l do
i. if a mod i and b mod i = 0 then gcdd=i
d. return gcdd
3. Read 2 integers a and b
4. Call function gcd(a,b) and store the result to a variable k
5. print k
6. Stop
PROGRAM
#greatest common divisor
gcd=function(a,b){
gcdd=1
if(a<b)
{
l=a
}
else
{
l=b
}
for (i in 1:l)
{
if((a%%i==0)&&(b%%i==0))
{
gcdd=i
}
}
return(gcdd)
}
a=as.integer(readline("Enter a number:"))
b=as.integer(readline("Enter another number:"))
k=gcd(a,b)
print(paste("The GCD is: ",k))
OUTPUT
Enter a number:6
Enter another number:5
[1] "The GCD is: 1"
> source("~/Desktop/14-03/gcd.R")
Enter a number:6
Enter another number:4
[1] "The GCD is: 2"
AIM
Write an R program to print the LCM of 2 numbers
ALGORITHM
1. Start
2. Algorithm lcm(a,b)
a. Assign lcmm=1
b. If a<b then g=b else g=a
c. While true do
i. if g mod a and g mod b = 0 then lcmm=g and break
ii. g=g+1
d. return lcmm
3. Read 2 integers a and b
4. Call function lcm(a,b) and store the result to a variable k
5. print k
6. Stop
PROGRAM
#lcm
lcm=function(a,b){
lcmm=1
if(a<b)
{
g=b
}
else
{
g=a
}
while(TRUE)
{
if((g%%a==0)&&(g%%b==0))
{
lcmm=g
break
}
g=g+1
}
return(lcmm)
}
a=as.integer(readline("Enter a number:"))
b=as.integer(readline("Enter another number:"))
k=lcm(a,b)
print(paste("The LCM is: ",k))
OUTPUT
Enter a number:6
Enter another number:5
[1] "The LCM is: 30"
AIM
Write an R program to print the sum of N natural numbers
ALGORITHM
1. Start
2. Read an integer n
3. Assign sum=0
4. For i from 1 to n do
a. Sum =sum+i
5. Print sum
6. stop
PROGRAM
#sum of n natural numbers
n=as.integer(readline("Enter a natural number:"))
sum=0
for (i in 1:n)
{
sum=sum+i
}
cat("The sum is: ",sum)
OUTPUT
AIM
Write an R program to create a list of random numbers in normal distribution and count
the occurrences of each value
ALGORITHM
1. Start
2. Using rnorm() function, get 50 random values and take the floor() of each
3. Using table() function get the occurrences of each value
4. Print the table
5. Stop
PROGRAM
x=floor(rnorm(n=50))
t = table(x)
print("Occurrences of each value:")
print(t)
OUTPUT
[1] "Occurrences of each value:"
x
-3 -2 -1 0 1 2
1 10 19 15 4 1
AIM
Write an R program to plot a bar graph
ALGORITHM
1. Start
2. Read the subjects
3. Read the corresponding marks
4. Using barplot() function, plot the graph
5. Stop
PROGRAM
subject=c("BDA","PR","RIS","IEFT","AAD")
mark=c(47,49,34,46,48)
barplot(mark,names.arg=subject,main="Bar Plot",xlab="Subject",ylab="Mark",col="royalblue2")
OUTPUT
AIM
Write an R program to find the product of the elements of a vector
PROGRAM
v = c(2, 4, 6) # sample vector (the original definition is missing from this copy)
size = length(v)
prod = 1
for(i in 1:size)
{
prod = v[i]*prod
}
cat("The product of the vector is: ",prod,"\n")
OUTPUT
AIM
Write an R program to create a data frame of employee details
PROGRAM
Name=c("John","Christy","Ivy","Maggie","Zayn")
Gender=c("M","M","F","F","M")
Age=c(21,24,25,27,36)
Designation=c("Clerk","Manager","Executive","CEO","Assistant")
Employees = data.frame(Name,Gender,Age,Designation);
print("EMPLOYEE DETAILS:")
print(Employees)
OUTPUT
[1] "EMPLOYEE DETAILS:"
Name Gender Age Designation
1 John M 21 Clerk
2 Christy M 24 Manager
3 Ivy F 25 Executive
4 Maggie F 27 CEO
5 Zayn M 36 Assistant
AIM
Write an R program to implement Linear Regression
ALGORITHM
1. Start
2. Read height values
3. Read weight values
4. Make the height and weight into a single data frame
5. Print the data frame
6. Write the data frame to a csv file
7. Create a linear regression model with height as independent variable and weight as
dependent variable using lm() function
8. Print the Slope and Y intercept of the line
9. Predict the weight of a given height using predict() function
10. Plot the linear regression graph using abline
11. Stop
PROGRAM
x=c(150,152,153,160,157,158,166,170,156,154)
y=c(45,50,49,59,53,55,55,67,54,52)
df=data.frame(x,y)
print(df)
write.csv(df,"file.csv")
v=lm(y~x)
print("The Formula, Slope and Y intercept of the line is: ")
print(v$coefficients)
pred=data.frame(x=156)
xpred=predict(v,pred)
print(paste("Predicted Weight of height 156 is:",xpred))
plot(x,y,main="Linear Regression",xlab="Height",ylab="Weight")
abline(v,col="red")
OUTPUT
x y
1 150 45
2 152 50
3 153 49
4 160 59
5 157 53
6 158 55
7 166 55
8 170 67
9 156 54
10 154 52
[1] "The Formula, Slope and Y intercept of the line is: "
(Intercept) x
-80.3518519 0.8518519
[1] "Predicted Weight of height 156 is: 52.537037037037"
AIM
Write an R program to implement Logistic Regression
ALGORITHM
1. Start
2. Import libraries caTools and ROCR
3. Load a csv file
4. Split the data into testing and training data with split ratio 80%
5. Create a logistic regression model with the training data using the function glm()
6. Print the summary of the model using the function summary()
7. Predict using the model the test values using the function predict()
8. Print the confusion matrix matrix with the actual values and predicted values using
table() function
9. Predict the Outcome of the training data also, using the function predict(), and store to
predict1 variable
10. Create an ROC curve using the predict1 values using the prediction() function
11. Find the Area under the curve, AUC using the function performance()
12. Plot the ROC graph using plot() function
13. Stop
PROGRAM
#logistic regression
library(caTools)
library(ROCR)
df=read.csv("diabetes.csv")
spl=sample.split(df,SplitRatio=0.8)
train_reg <- subset(df, spl == "TRUE")
test_reg <- subset(df, spl == "FALSE")
model=glm(Outcome~.,data=train_reg,family="binomial")
print("The model is: ")
print(model)
summ=summary(model)
print("The summary of the model is")
print(summ)
predict <- predict(model, test_reg, type = "response")
print(head(predict,10))
x<-table(test_reg$Outcome, predict>0.4)
print("The Confusion Matrix is: ")
print(x)
acc <-(x[[1,1]]+x[[2,2]])/sum(x)
print(paste("The Accuracy is:",acc))
predict1<- predict(model,train_reg, type = "response")
print("The predicted values of Train Data is: ")
print(head(predict1,10))
ROCPred <- prediction(predict1, train_reg$Outcome)
ROCPer <- performance(ROCPred, measure = "tpr", x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
cat("The Area under the curve is:", auc, "\n")
plot(ROCPer, main = "ROC Curve")
OUTPUT
[1] "The model is: "
Coefficients:
(Intercept) Pregnancies Glucose
-8.639132 0.143143 0.037756
BloodPressure SkinThickness Insulin
-0.016310 0.001434 -0.001336
BMI DiabetesPedigreeFunction Age
0.096936 0.633853 0.011697
Call:
glm(formula = Outcome ~ ., family = "binomial", data = train_reg)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4706 -0.7023 -0.3893 0.7345 2.6794
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.639132 0.824683 -10.476 < 2e-16 ***
Pregnancies 0.143143 0.037795 3.787 0.000152 ***
Glucose 0.037756 0.004350 8.680 < 2e-16 ***
BloodPressure -0.016310 0.006011 -2.714 0.006656 **
SkinThickness 0.001434 0.007827 0.183 0.854605
Insulin -0.001336 0.001014 -1.318 0.187473
BMI 0.096936 0.017400 5.571 2.53e-08 ***
DiabetesPedigreeFunction 0.633853 0.331976 1.909 0.056219 .
Age 0.011697 0.010617 1.102 0.270576
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
21 0.3609391
26 0.4695308
30 0.2606044
35 0.4248690
39 0.1589572
44 0.9251604
[1] "The Confusion Matrix is: "
FALSE TRUE
0 99 15
1 29 28
[1] "The Accuracy is: 0.742690058479532"
[1] "The predicted values of Train Data is: "
predict
3 0.8069695
8 0.7271615
12 0.9129396
17 0.3447163
21 0.3609391
26 0.4695308
30 0.2606044
35 0.4248690
39 0.1589572
44 0.9251604
The Area under the curve is: 0.8481448
> source("~/.active-rstudio-document")
[1] "The model is: "
Coefficients:
(Intercept) Pregnancies Glucose
-8.231923 0.133019 0.034235
BloodPressure SkinThickness Insulin
-0.012922 0.002832 -0.001240
BMI DiabetesPedigreeFunction Age
0.087621 0.968626 0.013071
Call:
glm(formula = Outcome ~ ., family = "binomial", data = train_reg)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5474 -0.7373 -0.4114 0.7502 2.8689
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.2319229 0.7966276 -10.333 < 2e-16 ***
Pregnancies 0.1330191 0.0372710 3.569 0.000358 ***
Glucose 0.0342347 0.0040348 8.485 < 2e-16 ***
BloodPressure -0.0129215 0.0058050 -2.226 0.026018 *
SkinThickness 0.0028324 0.0078585 0.360 0.718529
Insulin -0.0012398 0.0009722 -1.275 0.202237
BMI 0.0876209 0.0165019 5.310 1.1e-07 ***
DiabetesPedigreeFunction 0.9686256 0.3299302 2.936 0.003326 **
Age 0.0130708 0.0103317 1.265 0.205831
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
20 0.24266690
24 0.32726113
29 0.56138003
33 0.05283638
38 0.42753236
42 0.69997193
[1] "The Confusion Matrix is: "
FALSE TRUE
0 100 19
1 16 36
[1] "The Accuracy is: 0.795321637426901"
[1] "The predicted values of Train Data is: "
predict
2 0.05296349
6 0.15162920
11 0.22250460
15 0.62678551
20 0.24266690
24 0.32726113
29 0.56138003
33 0.05283638
38 0.42753236
42 0.69997193
The Area under the curve is: 0.8351925
AIM
Write an R program to perform statistical analysis using built-in tests
ALGORITHM
1. Start
2. Load datasets mtcars and iris
3. Perform one sample t test, 2 sample t test and paired t test using t.test() function
4. Perform anova test using aov() function
5. Perform Shapiro Normality test using shapiro.test() function
6. Perform Kolmogorav-Smirnov test using ks.test() function
7. Perform Kruskal test using kruskal.test() function
8. Perform Wilcoxon test using wilcox.test() function
9. Perform Fligner test using fligner.test() function
10. Perform Ansari test using ansari.test() function
11. Read two vectors smokers and patients and using this data perform Proportion test
using prop.test() function
12. Perform Binomial test using binom.test() function
13. Stop
PROGRAM
#statistical analysis
data(mtcars)
data(iris)
print(t.test(mtcars$mpg, y=NULL)) # One sample
print(t.test(mpg ~ cyl, data = mtcars, subset = cyl %in% c(4, 6))) # Two sample
print(t.test(mtcars$mpg, mtcars$am, data = mtcars, paired = T)) # Paired t-test
print(aov(mpg ~ cyl, data = mtcars)) # ANOVA Test
print(shapiro.test(mtcars$wt)) # Shapiro Normality Test
print(ks.test(mtcars$wt, mtcars$disp)) # Kolmogorov-Smirnov test
print(kruskal.test(mpg ~ am, data = mtcars)) # Kruskal Test
print(wilcox.test(iris$Sepal.Length)) # Wilcoxon Test
print(fligner.test(mtcars$mpg, mtcars$am)) # Fligner Test
print(ansari.test(rnorm(20), rnorm(10, 0, 5), conf.int = T)) #Ansari Test
smokers <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)
print(prop.test(smokers, patients)) # Proportion Test
print(binom.test(64, 100, 0.5)) # Binomial Test
OUTPUT
One Sample t-test
data: mtcars$mpg
t = 18.857, df = 31, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
17.91768 22.26357
sample estimates:
mean of x
20.09062
Paired t-test
Call:
aov(formula = mpg ~ cyl, data = mtcars)
Terms:
cyl Residuals
Sum of Squares 817.7130 308.3342
Deg. of Freedom 1 30
data: mtcars$wt
W = 0.94326, p-value = 0.09265
data: mpg by am
Kruskal-Wallis chi-squared = 9.7914, df = 1, p-value = 0.001753
data: iris$Sepal.Length
V = 11325, p-value < 2.2e-16
alternative hypothesis: true location is not equal to 0
Ansari-Bradley test
AIM
Write an R program to print variance, correlation and covariance of a data
ALGORITHM
1. Start
2. Load the first 4 columns of iris data and store that to a variable ‘data’
3. Find variance using apply(data,margin,function), where function =var
4. Find the covariance matrix using cov() function
5. Find the correlation matrix using cor() function
6. Stop
PROGRAM
data=iris[1:4]
print("Variance:")
var = apply(data,2,var)
print(var)
print("Covariance:")
print(cov(data))
print("Correlation:")
print(cor(data, method='pearson'))
OUTPUT
[1] "Variance:"
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.6856935 0.1899794 3.1162779 0.5810063
[1] "Covariance:"
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
[1] "Correlation:"
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
AIM
Write an R program to implement SVM classification
ALGORITHM
1. Start
2. Import libraries e1071 and caTools
3. Read a csv file to a variable ‘data’
4. Set the seed value to 123
5. Split the data into testing and training data with split ratio 80%
6. Create an SVM classifier model using the training data and the svm() function
7. Predict the values of the test data using the model
8. Create a confusion matrix using the actual and predicted values of the test data
9. Find the accuracy of the model
10. Stop
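The PROGRAM section for this experiment is missing from this copy; the sketch below is reconstructed from the Call and confusion matrix shown in the OUTPUT (Purchased ~ ., C-classification, linear kernel). The csv file name is an assumption.

```r
# Reconstructed sketch of the missing PROGRAM (the data file name is assumed;
# the model settings come from the Call printed in the OUTPUT below)
library(e1071)
library(caTools)
data=read.csv("Social_Network_Ads.csv")   # assumed file name
set.seed(123)
spl=sample.split(data,SplitRatio=0.8)
train_reg <- subset(data, spl == "TRUE")
test_reg <- subset(data, spl == "FALSE")
model<-svm(Purchased~.,data=train_reg,type="C-classification",kernel="linear")
print(model)
pred=predict(model,test_reg)
cm=table(test_reg$Purchased,pred)
print(cm)
accu=(cm[1,1]+cm[2,2])/sum(cm)
cat("Accuracy:",accu)
```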
OUTPUT
Call:
svm(formula = Purchased ~ ., data = train_reg, type = "C-classification",
kernel = "linear")
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 1
pred
0 1
0 54 1
1 18 27
Accuracy: 0.81
ALGORITHM
1. Start
2. Import libraries rpart, rpart.plot and caTools
3. Load iris dataset
4. Set seed value to 123
5. Split the data into training and testing data with split ratio 80%
6. Create a decision tree model using train data using the function rpart()
7. Predict the values of test data using model
8. Create a confusion matrix using actual values and predicted values of test data
9. Find the accuracy of the model
10. Stop
PROGRAM
library(rpart)
library(rpart.plot)
library(caTools)
data=iris
set.seed(123)
spl=sample.split(data,SplitRatio=0.8)
train_reg <- subset(data, spl == "TRUE")
test_reg <- subset(data, spl == "FALSE")
model<-rpart(Species~.,train_reg,method="class")
rpart.plot(model,type=4,extra=101)
pred=predict(model,test_reg,type='class')
cm=table(test_reg$Species,pred)
print(cm)
accu=(cm[1]+cm[5]+cm[9])/sum(cm)
cat("The Accuracy is:",accu)
OUTPUT
pred
AIM
Write an R program to implement k-means clustering
ALGORITHM
1. Start
2. Import library cluster
3. Load iris data from columns 1 to 4
4. Create a k-means clustering model with number of clusters 3 using the function kmeans()
5. Using autoplot() plot the graph
6. Print the confusion matrix with species and the clusters of the model
7. Print the cluster centers
8. Stop
PROGRAM
library(ggplot2)
library(ggfortify) # provides autoplot() for kmeans objects
library(cluster)
data=iris
df=data[1:4] #remove the label bc unsupervised
print(head(df))
km=kmeans(df,centers = 3)
print("The Model is: ")
print(km)
plot(autoplot(km,df,frame=TRUE)) # to visually display that clusters are distinct
print("The Cluster Centers are:")
print(km$centers)
cm=table(data$Species,km$cluster)
print("The Confusion Matrix is:")
print(cm)
OUTPUT
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
[1] "The Model is: "
K-means clustering with 3 clusters of sizes 50, 38, 62
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.006000 3.428000 1.462000 0.246000
2 6.850000 3.073684 5.742105 2.071053
3 5.901613 2.748387 4.393548 1.433871
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[37] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[73] 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2
[109] 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2
[145] 2 2 3 2 2 3
Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
[1] "The Confusion Matrix is:"
              1  2  3
  setosa     50  0  0
  versicolor  0  2 48
  virginica   0 36 14