Aim: Write a program to implement k-Nearest Neighbor algorithm to classify the iris data
set. Print both correct and wrong predictions
Description:
The K-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm that
classifies data points based on the class of their closest neighbors:

How it works:

The KNN algorithm classifies new data points by looking at the labels of the closest
neighbours in the training dataset's feature space. It is based on the principle of similarity:
points that lie close together in feature space are likely to share the same class, so a new
point is assigned the label held by the majority of its k nearest neighbours.

When to use it:

KNN is useful as a simple baseline because it has no explicit training phase and makes few
assumptions about the data, and it can be applied to a wide variety of prediction problems.
It is used in many areas, including image recognition, handwriting recognition and video
recognition.

Dataset:
{'data': array([[5.1, 3.5, 1.4, 0.2],
                [4.9, 3. , 1.4, 0.2],
                [4.7, 3.2, 1.3, 0.2],
                [4.6, 3.1, 1.5, 0.2],
                [5. , 3.6, 1.4, 0.2],
                [5.4, 3.9, 1.7, 0.4],
                ...
                [6.3, 2.5, 5. , 1.9],
                [6.5, 3. , 5.2, 2. ],
                [6.2, 3.4, 5.4, 2.3],
                [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                  ...
                  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 ...}

Program:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=25)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy", accuracy)
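
The aim also asks for the correct and wrong predictions to be printed; a short loop such as the one below (a sketch reusing the variables defined above) can be appended to the program:

for actual, predicted in zip(y_test, y_pred):
    result = "Correct" if actual == predicted else "Wrong"
    print(result, "- predicted:", iris.target_names[predicted], ", actual:", iris.target_names[actual])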

Output:
KNeighborsClassifier(n_neighbors=3)

Accuracy 0.9666666666666667

Viva Questions:

1. What is K-Nearest Neighbors (KNN)?


Ans: KNN is a supervised learning algorithm used for both classification and regression. It
works by finding the 'k' closest data points (neighbors) to a query point and predicting the
class or value based on the majority label or average of these neighbors.

2. How does KNN work?


Ans: KNN identifies the 'k' nearest data points to a given input based on a distance metric
(such as Euclidean distance) and assigns the class that occurs most often among those
neighbours (or, for regression, the average of their values).

EXPERIMENT-5
Aim: To solve real-world problems using Logistic Regression.
Description:
Logistic regression is a machine learning algorithm that uses a statistical method to
predict the probability of a binary outcome from one or more independent variables.
Purpose: Logistic regression is used to classify data into categories and to understand the
relationship between the input variables and the outcome.
How it works: Logistic regression passes a linear combination of the input variables through
a sigmoid function, producing a probability between 0 and 1 that is then thresholded to
assign a class.
When to use it: Logistic regression is used when the outcome variable is binary or
categorical, such as yes or no.
Applications: Logistic regression is used in many fields, including medical research and
insurance. For example, researchers can use logistic regression to estimate the risk of cancer
from patient habits and genetic predispositions.
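
The sigmoid mapping described above can be sketched in a few lines (illustrative only, separate from the program below):

import numpy as np

def sigmoid(z):
    # Squashes any real-valued score z into a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-3), sigmoid(0), sigmoid(3))   # approximately 0.047, 0.5, 0.953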

Dataset:
     Unnamed: 0  Age  Sex     ChestPain  RestBP  Chol  Fbs  RestECG  MaxHR  ExAng  Oldpeak  Slope   Ca        Thal  AHD
0             1   63    1       typical     145   233    1        2    150      0      2.3      3  0.0       fixed   No
1             2   67    1  asymptomatic     160   286    0        2    108      1      1.5      2  3.0      normal  Yes
2             3   67    1  asymptomatic     120   229    0        2    129      1      2.6      2  2.0  reversable  Yes
3             4   37    1    nonanginal     130   250    0        0    187      0      3.5      3  0.0      normal   No
4             5   41    0    nontypical     130   204    0        2    172      0      1.4      1  0.0      normal   No
..          ...  ...  ...           ...     ...   ...  ...      ...    ...    ...      ...    ...  ...         ...  ...
298         299   45    1       typical     110   264    0        0    132      0      1.2      2  0.0  reversable  Yes
299         300   68    1  asymptomatic     144   193    1        0    141      0      3.4      2  2.0  reversable  Yes
300         301   57    1  asymptomatic     130   131    0        0    115      1      1.2      2  1.0  reversable  Yes
301         302   57    0    nontypical     130   236    0        2    174      0      0.0      2  1.0      normal  Yes
302         303   38    1    nonanginal     138   175    0        0    173      0      0.0      1  NaN      normal   No

303 rows × 15 columns
Program:
import pandas as pd
df = pd.read_csv("Heart.csv")
df.info()
df = df.drop(columns = "Unnamed: 0")
df
df['ChestPain'] = df['ChestPain'].astype('category')
df['ChestPain'] = df['ChestPain'].cat.codes
df
df['Thal'] = df['Thal'].astype('category')
df['Thal'] = df['Thal'].cat.codes
df['AHD'] = df['AHD'].astype('category')
df['AHD'] = df['AHD'].cat.codes
df
df.isnull().sum()
df = df.dropna()
df

X = df.drop(columns = 'AHD')
X
y = df['AHD']
y
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)
X_train
X_test

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled
X_test_scaled

from sklearn.linear_model import LogisticRegression


log_reg = LogisticRegression(random_state = 0).fit(X_train_scaled,y_train)
log_reg.predict(X_train_scaled)
log_reg.score(X_train_scaled,y_train)
log_reg.score(X_test_scaled,y_test)
from sklearn.linear_model import Lasso
Lasso_reg = Lasso(alpha = 50 , max_iter = 100,tol = 0.1)
Lasso_reg.fit(X_train_scaled,y_train)
Lasso_reg.score(X_test_scaled,y_test)
Lasso_reg.score(X_test_scaled,y_test)

X_train_scaled,y_train
Output:
array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       ...
       0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1], dtype=int8)
0.8755980861244019
0.8111111111111111
Lasso(alpha=50, max_iter=100, tol=0.1)
-0.0002953787464659019
-0.0002953787464659019
(array([[ 0.77904095,  0.73264228,  0.17796069, ...,  0.57260251,  2.42972109,  1.13611108],
        [-0.08933003,  0.73264228,  0.17796069, ..., -1.02304981, ...],
        ...,
        [ 1.43031918,  0.73264228,  0.17796069, ...,  0.57260251, -0.69848245,  1.13611108]]),
 92     0
 67     0
        ..
 83     1
 Name: AHD, Length: 209, dtype: int8)

DECISION TREE
Aim: Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to classify a
new sample.
Description: The ID3 (Iterative Dichotomiser 3) algorithm is a supervised learning
algorithm used to create decision trees for classification tasks. It works by recursively
partitioning the dataset based on the feature that provides the highest information gain, which is
a measure of how well a feature separates the data into distinct classes.
• Entropy is a measure of disorder or uncertainty in a dataset. It helps determine the
impurity in a set of examples.
• Information Gain (IG) is a measure of the effectiveness of a feature in classifying the
training data. It quantifies the reduction in entropy after the dataset is split based on a
specific feature.
Dataset:

Advantages of ID3:
• Simple and easy to understand.
• Requires little training data.
• Can work well with data with discrete and continuous attributes.
Disadvantages of ID3:
• Can lead to overfitting.
• May not be effective with data with many attributes.
Applications of ID3:
1. Fraud detection
2. Medical diagnosis
3. Customer segmentation
4. Risk assessment
5. Recommendation systems
Formulas:
1. Entropy:
A measure of disorder or uncertainty in a set of data is called Entropy.

Entropy = -(P/(P+N))·log2(P/(P+N)) - (N/(P+N))·log2(N/(P+N))

where P and N are the numbers of positive and negative examples in the set.

Information Gain:
A measure of how well a given attribute reduces uncertainty is called Information Gain. ID3
splits the data at each stage on the attribute that maximizes Information Gain.

Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) · Entropy(Sv)

where the sum runs over every value v of attribute A and Sv is the subset of S for which A = v.
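
A quick hand calculation for the PlayTennis data used in the program below (9 Yes and 5 No examples) illustrates both formulas:

Entropy(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940

For the attribute Outlook (Sunny: 2 Yes / 3 No, Overcast: 4 Yes / 0 No, Rainy: 3 Yes / 2 No):

Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0 - (5/14)·0.971 ≈ 0.247

which is the largest gain of the four attributes, so Outlook becomes the root of the tree.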


rDA
Program:
import pandas as pd
import math

# Calculate the entropy of a dataset with respect to the PlayTennis label
def entropy(data):
    total_samples = len(data)
    if total_samples == 0:
        return 0
    positive_samples = sum(data['PlayTennis'] == 'Yes')
    negative_samples = sum(data['PlayTennis'] == 'No')
    p_positive = positive_samples / total_samples
    p_negative = negative_samples / total_samples
    if p_positive == 0 or p_negative == 0:
        return 0
    return -p_positive * math.log2(p_positive) - p_negative * math.log2(p_negative)

# Information gain obtained by splitting the data on a given attribute
def information_gain(data, attribute):
    total_entropy = entropy(data)
    values = data[attribute].unique()
    weighted_entropy = 0
    for value in values:
        subset = data[data[attribute] == value]
        subset_entropy = entropy(subset)
        subset_weight = len(subset) / len(data)
        weighted_entropy += subset_weight * subset_entropy
    return total_entropy - weighted_entropy

# Recursively build the decision tree using the ID3 algorithm
def id3(data, target_attribute, attributes):
    if len(data[target_attribute].unique()) == 1:
        return data[target_attribute].iloc[0]
    if len(attributes) == 0:
        return data[target_attribute].mode()[0]
    best_attribute = max(attributes, key=lambda a: information_gain(data, a))
    tree = {best_attribute: {}}
    remaining_attributes = [a for a in attributes if a != best_attribute]
    for value in data[best_attribute].unique():
        subset = data[data[best_attribute] == value]
        subtree = id3(subset, target_attribute, remaining_attributes)
        tree[best_attribute][value] = subtree
    return tree

# Classify a sample by walking down the tree
def classify(tree, sample):
    if not isinstance(tree, dict):
        return tree
    attribute = next(iter(tree))
    subtree = tree[attribute]
    sample_value = sample[attribute]
    if sample_value not in subtree:
        return None
    return classify(subtree[sample_value], sample)

if __name__ == "__main__":
    # Load the dataset
    data = pd.DataFrame({
        'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
                    'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
        'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild',
                        'Mild', 'Hot', 'Mild'],
        'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal',
                     'Normal', 'Normal', 'High', 'Normal', 'High'],
        'Windy': [False, True, False, False, False, True, True, False, False, False, True, True, False,
                  True],
        'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
    })
    target_attribute = 'PlayTennis'
    attributes = ['Outlook', 'Temperature', 'Humidity', 'Windy']
    # Build the decision tree
    decision_tree = id3(data, target_attribute, attributes)
    print("Decision Tree:")
    print(decision_tree)
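
To classify a new sample, as the aim requires, the classify function defined above can be used; for example, the lines below (with an illustrative, made-up sample) can be added at the end of the main block:

    # Classify a new, illustrative sample with the learned tree
    new_sample = {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High', 'Windy': True}
    print("Prediction for new sample:", classify(decision_tree, new_sample))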
Output:



import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from google.colab import files

# Upload and read the dataset


uploaded = files.upload()
msg = pd.read_csv('document.csv', names=['message', 'label'])
print("Total Instances of Dataset: ", msg.shape[0])

# Map labels to numeric values


msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})

# Define features and labels


X = msg.message
y = msg.labelnum

# Split the data into training and testing sets


Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)

# Convert text data to feature vectors


count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)

# Convert training data to a DataFrame for inspection (optional)


df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())
print(df.head()) # Display the first 5 rows

# Train the Naive Bayes classifier


clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)

# Make predictions on the test data


pred = clf.predict(Xtest_dm)

# Display predictions for each document


print("\nPredictions:")
for doc, p in zip(Xtest, pred):
    label = 'pos' if p == 1 else 'neg'
    print(f"{doc} -> {label}")

# Evaluate the model


print("\nAccuracy Metrics:\n")
print("Accuracy: ", accuracy_score(ytest, pred))
print("Recall: ", recall_score(ytest, pred))
print("Precision: ", precision_score(ytest, pred))
print("Confusion Matrix: \n", confusion_matrix(ytest, pred))

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load the iris dataset


iris = load_iris()

# Convert the dataset into a DataFrame


df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
df['target'] = df['target'].astype(int)
print(df.head())
print(df.info())

# Remove duplicates
df.drop_duplicates(inplace=True)

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(df[iris['feature_names']], df['target'], test_size=0.4, random_state=0)

# Fit a Random Forest Classifier model


model = RandomForestClassifier()
model.fit(X_train, y_train)

# Estimate bias and variance (rough approximation: bias as the mean of train and test error, variance as the test error)


train_error = 1 - model.score(X_train, y_train)
test_error = 1 - model.score(X_test, y_test)
bias = np.mean([train_error, test_error])
variance = test_error

print("Bias: {:.2f}".format(bias))
print("Variance: {:.2f}".format(variance))

# Cross-validation
cv_scores = cross_val_score(model, df[iris['feature_names']], df['target'], cv=5)
print("\nCross-validation scores:", cv_scores)
print("Mean CV Accuracy: {:.2f}".format(np.mean(cv_scores)))

# Confusion matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 3))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Accuracy, Precision, and Recall


accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

print("\nAccuracy: {:.2f}".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))

Week – 11

Aim: Write a program to implement One-hot Encoding

Description: One-hot encoding is a technique used in machine learning to convert
categorical variables into a numerical format that can be used by algorithms. Each unique
category value is converted into a new binary column, where a '1' indicates the presence of
that category and a '0' indicates its absence. This is particularly useful for models that require
numerical input but cannot handle categorical variables directly.
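
For example (a small sketch with made-up values, separate from the program below), a Gender column with the values M and F becomes two indicator columns:

import pandas as pd

demo = pd.DataFrame({'Gender': ['M', 'F', 'F']})
# get_dummies creates one indicator column per category (Gender_F, Gender_M)
print(pd.get_dummies(demo, columns=['Gender']))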

Dataset:

Program:

import numpy as np

import pandas as pd

data = pd.read_csv(r"C:\Program Files\employee_data.csv")

print(data.head())

print(data['Gender'].unique())

print(data['Remarks'].unique())

data['Gender'].value_counts()

data['Remarks'].value_counts()

one_hot_encoded_data = pd.get_dummies(data, columns = ['Remarks', 'Gender'])

print(one_hot_encoded_data)

from sklearn.preprocessing import OneHotEncoder

data = {'Employee id': [10, 20, 15, 25, 30],
        'Gender': ['M', 'F', 'F', 'M', 'F'],
        'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice']}

df = pd.DataFrame(data)

print(f"Employee data : \n{df}")

categorical_columns = df.select_dtypes(include=['object']).columns.tolist()

encoder = OneHotEncoder(sparse_output=False)

one_hot_encoded = encoder.fit_transform(df[categorical_columns])

one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

df_encoded = pd.concat([df, one_hot_df], axis=1)

df_encoded = df_encoded.drop(categorical_columns, axis=1)

print(f"Encoded Employee data : \n{df_encoded}")

Output:
