Date:
Aim: Write a program to implement k-Nearest Neighbor algorithm to classify the iris data
set. Print both correct and wrong predictions
Description:
The K-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm that
classifies data points based on the class of their closest neighbors.
How it works:
The KNN algorithm classifies a new data point by looking at the labels of its k closest
neighbors in the training dataset's feature space. Closeness is measured with a distance
metric, most commonly Euclidean distance, and the new point is assigned the class held by
the majority of those neighbors. KNN is simple to implement, requires no explicit training
phase, and can be used for a wide variety of prediction problems. It is used in many areas,
including image recognition, handwriting detection, and video recognition.
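To make the neighbor-voting idea concrete, here is a minimal sketch of the classification step in plain Python (the sample points and k=3 are illustrative assumptions, not part of the original record):
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Euclidean distance from the query to every training point
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Labels of the k nearest neighbors
    k_labels = [label for _, label in sorted(dists)[:k]]
    # Majority vote decides the predicted class
    return Counter(k_labels).most_common(1)[0][0]

# Illustrative 2-feature points (assumed, not from the iris data)
train_X = [(1.0, 1.1), (1.2, 0.9), (3.0, 3.2), (3.1, 2.9)]
train_y = ['A', 'A', 'B', 'B']
print(knn_predict(train_X, train_y, (1.1, 1.0)))  # -> 'A'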
Dataset:
{'data': array([[5.1, 3.5, 1.4, 0.2],
                ...,
                [6.5, 3. , 5.2, 2. ],
                ...]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                  ...,
                  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])}
Program:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
# test_size=0.2 inferred from the 30-sample test set implied by the printed
# accuracy (29/30); random_state assumed, the original value was cut off
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Print both correct and wrong predictions, as the aim requires
for actual, predicted in zip(y_test, y_pred):
    print("Correct" if actual == predicted else "Wrong", "- actual:", actual, "predicted:", predicted)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy", accuracy)
Output:
KNeighborsClassifier
KNeighborsClassifier(n_neighbors=3)
Accuracy 0.9666666666666667
Viva Questions:
EXPERIMENT-5
Aim: To solve real-world problems using Logistic Regression
Description:
Logistic regression is a machine learning algorithm that uses a statistical method to
predict the probability of a binary outcome based on independent variables:
Purpose
Logistic regression is used to classify data into categories and understand the
relationship between variables.
How it works
Logistic regression uses the sigmoid function to map a linear combination of the input
variables to a probability between 0 and 1.
When to use it
Logistic regression is used when the outcome variable is binary or categorical, such as
yes or no.
Applications
Logistic regression is used in many fields, including medical research and insurance.
For example, researchers can use logistic regression to calculate the risk of cancer by
considering patient habits and genetic predispositions.
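As a quick illustration of the sigmoid step described above, the following sketch (with made-up coefficients, not values fitted to any dataset) turns a linear combination of inputs into a probability:
import numpy as np

def sigmoid(z):
    # Maps any real number to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients and one input row (assumed, not fitted)
w = np.array([0.8, -0.4])
b = -0.1
x = np.array([1.5, 2.0])
p = sigmoid(np.dot(w, x) + b)      # predicted probability of the positive class
print(p, "-> class", int(p >= 0.5))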
Dataset:
     Unnamed: 0  Age  Sex     ChestPain  RestBP  Chol  Fbs  RestECG  MaxHR  ExAng  Oldpeak  Slope   Ca        Thal  AHD
0             1   63    1       typical     145   233    1        2    150      0      2.3      3  0.0       fixed   No
1             2   67    1  asymptomatic     160   286    0        2    108      1      1.5      2  3.0      normal  Yes
2             3   67    1  asymptomatic     120   229    0        2    129      1      2.6      2  2.0  reversable  Yes
3             4   37    1    nonanginal     130   250    0        0    187      0      3.5      3  0.0      normal   No
4             5   41    0    nontypical     130   204    0        2    172      0      1.4      1  0.0      normal   No
..          ...  ...  ...           ...     ...   ...  ...      ...    ...    ...      ...    ...  ...         ...  ...
298         299   45    1       typical     110   264    0        0    132      0      1.2      2  0.0  reversable  Yes
299         300   68    1  asymptomatic     144   193    1        0    141      0      3.4      2  2.0  reversable  Yes
300         301   57    1  asymptomatic     130   131    0        0    115      1      1.2      2  1.0  reversable  Yes
301         302   57    0    nontypical     130   236    0        2    174      0      0.0      2  1.0      normal  Yes
302         303   38    1    nonanginal     138   175    0        0    173      0      0.0      1  NaN      normal   No
303 rows × 15 columns
Program:
import pandas as pd
df = pd.read_csv("Heart.csv")
df.info()
df = df.drop(columns = "Unnamed: 0")
df
df['ChestPain'] = df['ChestPain'].astype('category')
df['ChestPain'] = df['ChestPain'].cat.codes
df
df['Thal'] = df['Thal'].astype('category')
df['Thal'] = df['Thal'].cat.codes
df['AHD'] = df['AHD'].astype('category')
df['AHD'] = df['AHD'].cat.codes
df
df.isnull().sum()
df = df.dropna()
df
X = df.drop(columns = 'AHD')
X
y = df['AHD']
y
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)
X_train
X_test
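The record stops after the train/test split; a minimal sketch of the remaining steps, assuming scikit-learn's LogisticRegression and the variables defined above:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=1000)   # max_iter raised so the solver converges
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy", accuracy_score(y_test, y_pred))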
DECISION TREE
Aim: Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to classify a
new sample.
Description: The ID3 (Iterative Dichotomiser 3) algorithm is a supervised learning
algorithm used to create decision trees for classification tasks. It works by recursively
partitioning the dataset based on the feature that provides the highest information gain, which is
a measure of how well a feature separates the data into distinct classes.
• Entropy is a measure of disorder or uncertainty in a dataset. It helps determine the
impurity in a set of examples.
• Information Gain (IG) is a measure of the effectiveness of a feature in classifying the
training data. It quantifies the reduction in entropy after the dataset is split based on a
specific feature.
Dataset:
Advantages of ID3:
• Simple and easy to understand.
• Requires little training data.
• Works well with discrete (categorical) attributes; continuous attributes must be
discretized before splitting.
Disadvantages of ID3:
• Can lead to overfitting.
• May not be effective with data with many attributes.
Applications of ID3:
1. Fraud detection
2. Medical diagnosis
3. Customer segmentation
4. Risk assessment
5. Recommendation systems
Formulas:
1. Entropy:
A measure of disorder or uncertainty in a set of data is called Entropy. For a dataset S
whose classes occur with proportions p_i:
Entropy(S) = -Σ p_i log2(p_i)
2. Information Gain:
A measure of how well an attribute reduces uncertainty is called Information Gain. ID3
splits the data at each stage, choosing the attribute A that maximizes it:
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) · Entropy(S_v)
where S_v is the subset of S for which attribute A takes value v.
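A small sketch of these two formulas in Python (the toy labels and the two-way split are illustrative assumptions):
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    # Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v))
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

# Toy example: 9 "yes" / 5 "no", split by an attribute into two subsets
S = ['yes'] * 9 + ['no'] * 5
split = [['yes'] * 6 + ['no'] * 1, ['yes'] * 3 + ['no'] * 4]
print(entropy(S))                  # ~0.940
print(information_gain(S, split))  # entropy reduction from the split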
Program:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Load the iris data into a DataFrame and remove duplicates
iris = load_iris()
df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df['target'] = iris['target']
df.drop_duplicates(inplace=True)

# criterion='entropy' gives ID3-style information-gain splits
X_train, X_test, y_train, y_test = train_test_split(
    df[iris['feature_names']], df['target'], test_size=0.3, random_state=42)
model = DecisionTreeClassifier(criterion='entropy', random_state=42).fit(X_train, y_train)

# Bias/variance taken here as training error and train-test gap
# (an assumption; the original fragment did not show how they were computed)
bias = 1 - model.score(X_train, y_train)
variance = model.score(X_train, y_train) - model.score(X_test, y_test)
print("Bias: {:.2f}".format(bias))
print("Variance: {:.2f}".format(variance))

# Cross-validation
cv_scores = cross_val_score(model, df[iris['feature_names']], df['target'], cv=5)
print("\nCross-validation scores:", cv_scores)
print("Mean CV Accuracy: {:.2f}".format(np.mean(cv_scores)))

# Confusion matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 3))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Metrics (macro-averaged over the three iris classes)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
print("\nAccuracy: {:.2f}".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
Date:
Week – 11
Dataset:
Program:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data (assumed; the original record does not show the dataset)
data = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F', 'M'],
                     'Remarks': ['Good', 'Nice', 'Great', 'Good', 'Nice']})
print(data.head())
print(data['Gender'].unique())
print(data['Remarks'].unique())
print(data['Gender'].value_counts())
print(data['Remarks'].value_counts())

# One-hot encoding with pandas (get_dummies assumed for the
# undefined one_hot_encoded_data in the original fragment)
one_hot_encoded_data = pd.get_dummies(data, columns=['Gender', 'Remarks'])
print(one_hot_encoded_data)

# One-hot encoding with scikit-learn
df = pd.DataFrame(data)
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = encoder.fit_transform(df[categorical_columns])
one_hot_df = pd.DataFrame(one_hot_encoded,
                          columns=encoder.get_feature_names_out(categorical_columns))
print(one_hot_df)
Output: