ML Disha
• WEKA is a popular open-source machine learning and data mining software that provides
a collection of machine learning algorithms for data mining tasks.
• The name "WEKA" stands for "Waikato Environment for Knowledge Analysis," as it was
developed at the University of Waikato in New Zealand.
• User Interface: WEKA provides a graphical user interface (GUI) that allows users to
interact with machine learning algorithms, build models, and evaluate results visually.
• Data Preprocessing Tools: WEKA offers various tools for data preprocessing, including
options for handling missing values, transforming data, and selecting relevant features.
• Extensibility: WEKA is extensible, allowing users to implement and integrate their own
algorithms.
• Data Visualization: It provides visualization tools to help users understand the data and
model results.
Advantages of WEKA:
• Wide Range of Algorithms: WEKA provides a broad collection of algorithms for classification, regression, clustering, and association rule
mining. This diversity makes it suitable for various data mining and machine learning
applications.
• User-Friendly Interface: WEKA features a graphical user interface (GUI) that makes it
accessible to users with varying levels of technical expertise. The GUI facilitates the
exploration of algorithms, building and evaluating models, and experimenting with
different approaches.
• Data Preprocessing Tools: WEKA offers tools for data preprocessing, including handling
missing values, transforming data, and feature selection. These capabilities contribute to
the overall data preparation process.
Disadvantages of WEKA:
• Limited Scalability: While WEKA is suitable for small to medium-sized datasets, it may
face challenges with very large datasets due to limitations in scalability.
• Advanced Features: Advanced machine learning practitioners may find WEKA lacking
some of the more recent and sophisticated algorithms and features available in other
specialized tools and libraries.
• Steep Learning Curve for Advanced Features: While the basic functionalities are
user-friendly, mastering advanced features and customizing algorithms may require a
steeper learning curve for users who are new to the software.
• Limited Support for Deep Learning: WEKA has historically been focused on traditional
machine learning algorithms and may not provide extensive support for deep learning
techniques.
Minimum Hardware Requirement
Here are general guidelines for the minimum hardware requirements for running WEKA:
• Processor (CPU): A modern multi-core processor is recommended. The more cores, the
better, as certain machine learning tasks can benefit from parallel processing.
• Storage:
Disk Space: A few gigabytes of free disk space should be sufficient for installing WEKA and
storing datasets.
Solid State Drive (SSD): While not strictly necessary, using an SSD can enhance the speed of data
access and program loading.
• Java Runtime Environment (JRE): WEKA requires Java to be installed on your system.
Ensure that you have a compatible version of Java installed.
Installation steps
Windows Installation:
• Download WEKA: Visit the official WEKA website: WEKA Download Page
• Choose the version of WEKA you want to download (e.g., stable version).
• Java Installation: WEKA requires Java. If Java is not already installed on your system,
the installer may prompt you to download and install Java.
• Launch WEKA: After installation, you can launch WEKA from the Start menu or desktop
shortcut.
Experiment no- 2
4. Click Next.
6. It is recommended that you install for Just Me, which will install Anaconda Distribution to
just the current user account. Only select an install for All Users if you need to install for
all users’ accounts on the computer (which requires Windows Administrator privileges).
7. Click Next.
8. Select a destination folder to install Anaconda and click Next. Install Anaconda to a
directory path that does not contain spaces or Unicode characters. (For more information on
destination folders, see the Anaconda installation documentation.)
9. Choose whether to add Anaconda to your PATH environment variable or register Anaconda
as your default Python. We don't recommend adding Anaconda to your PATH environment
variable, since this can interfere with other software; instead, use Anaconda by opening
Anaconda Navigator or the Anaconda Prompt from the Start Menu. Unless you plan on
installing and running multiple versions of Anaconda or multiple versions of Python, accept
the default and leave the "Register Anaconda as my default Python" box checked.
10. Click Install. If you want to watch the packages Anaconda is installing, click Show
Details.
11. Click Next.
12. After a successful installation you will see the “Thanks for installing Anaconda” dialog
box:
13. If you wish to read more about Anaconda.org and how to get started with Anaconda, check
the boxes “Anaconda Distribution Tutorial” and “Learn more about Anaconda”. Click the
Finish button.
(b) Get familiarized with arff file format. Create an arff file on your system and save in the WEKA
installed drive of your system.
Structure of an ARFF file:
1. Header Section
2. Data Section
1. Header Section
This section contains information about the dataset, such as the relation (table) name, the
columns, and the type of each column. The header section has two parts: the relation
declaration and the attribute declarations.
@relation: used to give the table (relation) name
@attribute: used to declare a column name and its datatype
Datatypes:
nominal: a fixed set of allowed values, listed inside curly brackets (like constants)
string: a datatype which accepts only string values
numeric: a datatype which accepts integer or real values
Syntax:
@relation <tablename>
Example:
@relation "employee"
2. Data section
Data section is used to represents the data or entries for available columns. (according to the
order in header section data would be inserted).
data section starts with @data, and this section must be added after Header section. only single
record can be written in single line.
Syntax:
@data
<record1>
<record2>
<record N>
All records must be in the same format and order as the attributes defined in the header
section. For example:
1,naman,N,1234556678,IT,02-08-2000,rjt
2,yash,M,1234556679,HR,04-05-2001,amd
3,kishan,G,1214556678,MANAGEMENT,02-11-2001,pbr
4,?,?,5234556678,IT,03-05-2000,amd
file:
@relation "employee"
@attribute id numeric
@data
1,naman,N,1234556678,IT,02-08-2000,rjt
2,yash,M,1234556679,HR,04-05-2001,amd
3,kishan,G,1214556678,MANAGEMENT,02-11-2001,pbr
4,?,?,5234556678,IT,03-05-2000,amd
Values are separated by commas (,), and a question mark (?) is used to represent an empty or
missing value for a particular column.
How to create and open an ARFF file: you need to have the WEKA tool installed on your
machine.
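For reference, a complete version of the employee file might look like the following sketch. The attribute names and types other than id are assumptions, chosen only to match the sample records shown above.

@relation "employee"
@attribute id numeric
@attribute name string
@attribute initial string
@attribute phone numeric
@attribute dept {IT, HR, MANAGEMENT}
@attribute dob date "dd-MM-yyyy"
@attribute city {rjt, amd, pbr}
@data
1,naman,N,1234556678,IT,02-08-2000,rjt
2,yash,M,1234556679,HR,04-05-2001,amd
3,kishan,G,1214556678,MANAGEMENT,02-11-2001,pbr
4,?,?,5234556678,IT,03-05-2000,amd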
Step 1: Open any text editor and paste the above code.
Step 3: Open the WEKA tool.
Step 6: The file is now loaded; click Edit in the Preprocess tab.
Step 7: The dataset is displayed as shown.
Experiment no- 3
(a) Execute the Linear Regression algorithm on WEKA with the help of a suitable data set.
When you select your data set, split it for training and testing as: i) Training 80% and
Testing 20%, ii) Training 60% and Testing 40%.
(b) Implement linear regression using Python.
(a) Execute the Linear Regression algorithm on WEKA with the help of a suitable data set,
splitting it for training and testing as: i) Training 80% and Testing 20%, ii) Training 60% and
Testing 40%.
When you start WEKA, the GUI chooser pops up and lets you choose four ways to work with
WEKA and your data. For all the examples in this article series, we will choose only the Explorer
option.
Figure 2. WEKA Explorer
Now that you're familiar with how to install and start up WEKA, let's get into our first data
mining technique: regression.
@RELATION house
@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000
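The @ATTRIBUTE declarations are not visible in this excerpt; they belong between the @RELATION and @DATA lines. A plausible header, assuming six numeric columns for house size, lot size, bedrooms, granite, bathroom, and selling price (the attribute names are assumptions), would be:

@ATTRIBUTE houseSize NUMERIC
@ATTRIBUTE lotSize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingPrice NUMERIC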
Now that the data file has been created, it's time to create our regression model. Start WEKA, then
choose the Explorer. You'll be taken to the Explorer screen, with the Preprocess tab selected.
Select the Open File button and select the ARFF file you created in the section above. After
selecting the file, your WEKA Explorer should look similar to the screenshot in Figure 3.
Figure 3. WEKA with house data loaded
To create the model, click on the Classify tab. The first step is to select the model we want to build,
so WEKA knows how to work with the data and how to create the appropriate model: click the
Choose button, expand the functions branch, and select LinearRegression.
This tells WEKA that we want to build a regression model. As you can see from the other choices,
though, there are lots of possible models to build. Lots! This should give you a good indication of
how we are only touching the surface of this subject. Also of note: There is another choice called
SimpleLinearRegression in the same branch. Do not choose this because simple regression only
looks at one variable, and we have six.
Figure 4. Linear regression model in WEKA
Now that the desired model has been chosen, we have to tell WEKA where the data is that it should
use to build the model. Though it may be obvious to us that we want to use the data we supplied
in the ARFF file, there are actually different options, some more advanced than what we'll be using.
The other three choices are Supplied test set, where you can supply a different set of data to build
the model; Cross-validation, which lets WEKA build a model based on subsets of the supplied
data and then average them out to create a final model; and Percentage split, where WEKA uses
a given percentage of the supplied data to build the model and holds back the rest for testing.
Now we are ready to create our model. For this experiment, choose Percentage split and set it to
80% (Training 80%, Testing 20%); afterwards, repeat the run with 60% to obtain the 60/40 split.
Click Start. Figure 5 shows what the output should look like.
Figure 5. House price regression model in WEKA
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
Program: -
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Get dataset
df_sal = pd.read_csv('Salary_Data.csv')
df_sal.head()
# Describe data
df_sal.describe()
# Data distribution
plt.title('Salary Distribution Plot')
sns.distplot(df_sal['Salary'])
plt.show()
# Relationship between Salary and Experience
plt.scatter(df_sal['YearsExperience'], df_sal['Salary'], color = 'lightcoral')
plt.title('Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.box(False)
plt.show()
# Splitting variables
X = df_sal.iloc[:, :1]   # independent variable (YearsExperience)
y = df_sal.iloc[:, 1:]   # dependent variable (Salary)
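The train/test split and model-fitting steps are not visible in this excerpt; a minimal sketch that defines the names used below (X_train, y_train, y_pred_train, regressor), with the split ratio as an assumption:

# Split the data into training and test sets (80/20 split assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the linear regression model on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict salaries for the training and test sets
y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)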
# Prediction on training set
plt.scatter(X_train, y_train, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['X_train/Pred(y_test)', 'X_train/y_train'], title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()
# Regressor coefficients and intercept
print(f'Coefficient: {regressor.coef_}')
print(f'Intercept: {regressor.intercept_}')
Experiment no- 4
(a) Execute Logistic Regression with the help of a properly identified data set. Analyse the result
and identify how well the model performed on the test set. Briefly describe the steps you followed
to analyse the data set.
(b) Implement Logistic Regression using python.
The logistic regression model relates the log-odds of the outcome to the predictors:
log(y / (1 − y)) = b0 + b1x1 + b2x2 + … + bnxn,
where y is the dependent variable and x1, x2, …, xn are explanatory variables.
Sigmoid Function:
Logistic Function – Sigmoid Function
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into a value between 0 and 1: σ(z) = 1 / (1 + e^(−z)). Since the output
of logistic regression must lie between 0 and 1 and cannot go beyond this limit, it forms a
curve shaped like an "S".
• The S-form curve is called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which decides between class 0
and class 1: values above the threshold tend towards 1, and values below the threshold tend
towards 0.
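As a small illustration of the sigmoid and threshold idea (the input values below are purely illustrative):

import numpy as np

def sigmoid(z):
    # maps any real value into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
probabilities = sigmoid(z)
predicted_classes = (probabilities >= 0.5).astype(int)   # threshold at 0.5
print(probabilities)       # values between 0 and 1
print(predicted_classes)   # 0 below the threshold, 1 above it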
Program: -
# importing libraries
import numpy as np                 # linear algebra
import pandas as pd                # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
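The line that loads the data set is not visible here; assuming the Iris data set saved as iris.csv with a 'variety' column (the file name is an assumption), it would be something like:

df = pd.read_csv("iris.csv")   # assumed file name; must contain a 'variety' column
df.head()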
df.info() #gives information about the columns
print(df["variety"].value_counts())
sns.countplot(df["variety"])
plt.figure(figsize=(8,4))
# draws a heatmap with the correlation matrix calculated by df.corr() as input
sns.heatmap(df.corr(), annot=True, fmt=".0%")
plt.show()
from sklearn.linear_model import LogisticRegression     # for the Logistic Regression algorithm
from sklearn.model_selection import train_test_split    # to split the dataset for training and testing
from sklearn import metrics                              # for checking the model accuracy
X=df.iloc[:,0:4]
Y=df["variety"]
X.head()
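The split into training and test sets is not visible on this page; a sketch consistent with the variable names used below (the split ratio is an assumption):

# hold out part of the data for testing (ratio assumed)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)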
log = LogisticRegression()
log.fit(X_train,Y_train)
prediction=log.predict(X_test)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(prediction,Y_test))
Experiment no- 5
Execute the Naïve Bayes algorithm with a suitable data set and do a proper analysis of the result.
Also implement the Naïve Bayes algorithm using Python.
Naïve Bayes Classifier Algorithm
• Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training dataset.
• The Naïve Bayes Classifier is one of the simplest and most effective classification algorithms;
it helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
• Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be described
as:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent
of the occurrence of the other features. For example, if a fruit is identified on the basis of colour,
shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature
individually contributes to identifying it as an apple, without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' law; it is used to determine the
probability of a hypothesis given prior knowledge, and it depends on conditional probability.
• The formula for Bayes' theorem is:
P(A|B) = P(B|A) × P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Program: -
# importing libraries
import numpy as np                 # linear algebra
import pandas as pd                # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.model_selection import train_test_split    # to split the dataset for training and testing
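The data-loading step is not visible here; as in the previous experiment, it is presumably something like (the file name is an assumption):

df = pd.read_csv("iris.csv")   # assumed file name; the dataset has a 'variety' column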
df.shape #tells us about no. of rows and column [rows , columns]
(150, 5)
print(df["variety"].value_counts())
sns.countplot(df["variety"])
plt.figure(figsize=(8,4))
# draws a heatmap with the correlation matrix calculated by df.corr() as input
sns.heatmap(df.corr(), annot=True, fmt=".0%")
plt.show()
# We'll use seaborn's FacetGrid to color the scatterplot by species
sns.FacetGrid(df, hue="variety", height=5).map(plt.scatter, "sepal.length",
"sepal.width").add_legend()
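The feature/target split, train/test split, and classifier construction are not visible on this page; a sketch consistent with the names used below (the split ratio is an assumption):

X = df.iloc[:, 0:4]          # the four measurement columns as features
Y = df["variety"]            # species label as target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
gnb = GaussianNB()           # Gaussian Naive Bayes classifier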
gnb.fit(X_train, Y_train)
GaussianNB(priors=None, var_smoothing=1e-09)
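The prediction and evaluation steps are likewise not shown; a minimal sketch for analysing the result on the test set:

prediction = gnb.predict(X_test)
print('The accuracy of Naive Bayes is', metrics.accuracy_score(Y_test, prediction))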
Experiment no- 6
Program: -
# Load libraries
import pandas as pd
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier          # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split     # Import train_test_split function
from sklearn import metrics                              # Import scikit-learn metrics module for accuracy calculation

# load dataset
pima = pd.read_csv("diabetes.csv")
pima.head()
pima.describe()
pima.info()
print(pima["Outcome"].value_counts())
sns.countplot(pima["Outcome"])
feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure',
'SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']
X = pima[feature_cols]   # Features
y = pima.Outcome         # Target variable
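The remaining steps of the decision-tree program (split, fit, predict, evaluate) are not visible in this excerpt; a minimal sketch using the names defined above (the split ratio and random_state are assumptions):

# split the data, train the tree, and measure accuracy on the held-out set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))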
Experiment no- 7
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category
in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider
the diagram below, in which two different categories are classified using a decision boundary, or
hyperplane:
Program: -
# Load libraries
import pandas as pd
import seaborn as sns
from sklearn import svm # Import SVM Classifier
from sklearn.model_selection import train_test_split     # Import train_test_split function
from sklearn import metrics                              # Import scikit-learn metrics module for accuracy calculation

# load dataset
pima = pd.read_csv("diabetes.csv")
pima.head()
pima.describe()
pima.info()
print(pima["Outcome"].value_counts())
sns.countplot(pima["Outcome"])
#split dataset in features and target variable
feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure',
'SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']
X = pima[feature_cols]   # Features
y = pima.Outcome         # Target variable
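The training and evaluation steps are not visible here; a sketch consistent with the accuracy and precision values printed below (the kernel choice, split ratio, and random_state are assumptions, so the exact numbers from the original run may differ):

# split the data, train a support vector classifier, and evaluate it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))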
Accuracy: 0.7922077922077922
Precision 0.7936507936507936
Experiment no- 8
Hence each cluster contains data points with some commonalities and is kept apart from the other
clusters. The diagram below explains the working of the K-means clustering algorithm:
• Step-2: Select K random points or centroids. (They need not come from the input dataset.)
• Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.
• Step-4: Calculate the variance and place a new centroid for each cluster.
• Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its
cluster.
• Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH.
• Step-7: The model is ready.
How to choose the value of "K number of clusters" in K-means Clustering?
• The performance of the K-means clustering algorithm depends on how good (compact and
well-separated) the clusters it forms are.
• But choosing the optimal number of clusters is a big task.
• There are some different ways to find the optimal number of clusters, but here we are
discussing the most appropriate method to find the number of clusters or value of K.
Elbow Method
• The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster.
• The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
where Ci is the centroid of cluster i and each sum runs over the points Pi assigned to that cluster.
• Since the graph shows a sharp bend that looks like an elbow, the method is known as the
elbow method. The graph for the elbow method looks like the image below:
Program: -
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
from sklearn.cluster import KMeans
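The data-loading step is not visible here; a sketch assuming the commonly used Mall_Customers.csv file (the file name and column positions are assumptions), whose Annual Income (k$) and Spending Score (1-100) columns match the axis labels used in the cluster plot below:

# load the customer data set (file name assumed)
dataset = pd.read_csv('Mall_Customers.csv')
# take the Annual Income and Spending Score columns as the feature matrix
x = dataset.iloc[:, [3, 4]].values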
# Using a for loop for iterations from 1 to 10
wcss_list = []   # initializing the list for the WCSS values
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()
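The step that trains the final model with the chosen number of clusters and produces y_predict is not visible here; a sketch consistent with the plotting code below (5 clusters, matching the five plotted clusters):

# train the final K-means model and get a cluster index for every data point
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)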
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')      # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')     # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')       # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')      # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')   # fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend(loc='lower center')
mtp.show()
Experiment no- 9
What are Activation Functions in a Neural Network?
Activation functions are applied to the weighted sum of a neuron's inputs and bias and are used to
decide whether, and how strongly, the neuron is activated. They transform the presented data and
produce the output that the network passes on to the next layer. Activation functions are also
referred to as transfer functions in some literature. They can be either linear or nonlinear,
depending on the function they represent, and they are used to control the outputs of neural
networks across different domains.
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
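The import statements and the rest of this program are not visible in the excerpt; a minimal sketch that continues from the load above and uses the activation functions discussed earlier (the import line, layer sizes, and epoch count are assumptions):

# assumed import, needed for the keras calls above and below
from tensorflow import keras

# Scale pixel values to [0, 1] and flatten the 28x28 images to 784-length vectors
x_train_flatten = (x_train / 255).reshape(len(x_train), 28 * 28)
x_test_flatten = (x_test / 255).reshape(len(x_test), 28 * 28)

# ReLU activation in the hidden layer, softmax at the output over the 10 digit classes
model = keras.Sequential([
    keras.layers.Dense(100, input_shape=(784,), activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train_flatten, y_train, epochs=5)
model.evaluate(x_test_flatten, y_test)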
Experiment no- 10
# Normalizing the dataset
x_train = x_train/255
x_test = x_test/255
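The steps between normalization and the evaluate call are not visible here; a sketch that defines the names used below (x_test_flatten and model), assuming the same MNIST data and keras import as in Experiment 9 and a single sigmoid output layer (the layer size and epoch count are assumptions):

# flatten the normalized 28x28 images into 784-length vectors
x_train_flatten = x_train.reshape(len(x_train), 28 * 28)
x_test_flatten = x_test.reshape(len(x_test), 28 * 28)

# a single dense layer with sigmoid activation over the 10 digit classes
model = keras.Sequential([
    keras.layers.Dense(10, input_shape=(784,), activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train_flatten, y_train, epochs=5)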
model.evaluate(x_test_flatten, y_test)