
CMR Institute of Technology, Bengaluru

Department of Computer Science and Technology


Lab Manual, VI Semester, 2022 Scheme
BCSL606 - Machine Learning Lab

Lab Manual
Machine Learning Lab

Semester VI

Course Code: BCSL606
Teaching Hours/Week: 2
CIE Marks: 50
SEE Marks: 50
Credits: 04
Exam Hours:

(Applicable for 6th Sem CSE)


Course Objectives (Defined by the university):

CLO 1: To become familiar with data and visualize univariate, bivariate, and multivariate data using
statistical techniques and dimensionality reduction.
CLO 2: To understand various machine learning algorithms such as similarity-based learning,
regression, decision trees, and clustering.
CLO 3: To become familiar with learning theories and probability-based models, and to develop the skills required for decision-making in dynamic environments.

Note: A two-hour tutorial is suggested for each laboratory session.

Pedagogy: For the above experiments, the following pedagogical approaches can be considered: problem-based learning, active learning, MOOCs, and chalk & talk.
PART A – List of problems for which students should develop programs and execute them in the laboratory.
Course outcomes (Course Skill Set):
At the end of the course, the student will be able to:
CO1: Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
CO2: Demonstrate similarity-based learning methods and perform regression analysis.
CO3: Develop decision trees for classification and regression problems, and Bayesian models for
probabilistic learning.
CO4: Implement clustering algorithms to group unlabelled data.


List of Problems/Experiments

List of problems for which students should develop the program and execute it in the laboratory:

1 Develop a program to create histograms for all numerical features and analyze the distribution of
each feature. Generate box plots for all numerical features and identify any outliers. Use California
Housing dataset.

2 Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to identify which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.

3 Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.

4 For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.

5 Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following based on the generated dataset:

a. Label the first 50 points {x1, ..., x50} as follows: if (xi <= 0.5), then xi belongs to Class1, else xi belongs to Class2
b. Classify the remaining points, x51, ..., x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30

6 Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.

7 Develop a program to demonstrate the working of Linear Regression and Polynomial Regression.
Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel
efficiency prediction) for Polynomial Regression.

8 Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer
Data set for building the decision tree and apply this knowledge to classify a new sample.

9 Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data set for training. Compute the accuracy of the classifier, considering a few test data sets.

10 Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set and
visualize the clustering result.


Experiment 1: Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify any
outliers. Use California Housing dataset.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
data = fetch_california_housing(as_frame=True)
df = data.frame

# Display the first few rows of the dataset
print("Dataset Sample:")
print(df.head())

# Plot histograms for all numerical features
def plot_histograms(df):
    df.hist(figsize=(12, 10), bins=30, color='skyblue', edgecolor='black')
    plt.suptitle('Histograms of Numerical Features', fontsize=16)
    plt.tight_layout(rect=[0, 0, 1, 0.97])
    plt.show()

# Plot box plots for all numerical features
def plot_boxplots(df):
    plt.figure(figsize=(14, 10))
    for i, column in enumerate(df.columns, 1):
        plt.subplot(3, 3, i)
        sns.boxplot(y=df[column], color='skyblue')
        plt.title(f"Box Plot of {column}", fontsize=12)
    plt.tight_layout()
    plt.show()

# Analyze distribution and detect outliers using the 1.5 * IQR rule
def analyze_features(df):
    print("\nFeature Analysis:")
    for column in df.columns:
        print(f"\nFeature: {column}")
        print(f"Mean: {df[column].mean():.2f}, Median: {df[column].median():.2f}, Std Dev: {df[column].std():.2f}")
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
        print(f"Number of Outliers: {len(outliers)}")

# Plot histograms and boxplots, and analyze features
plot_histograms(df)
plot_boxplots(df)
analyze_features(df)
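Note: several California Housing features are heavily right-skewed, which compresses their histograms. A minimal sketch, assuming df is loaded as above and that Population, AveOccup, and AveRooms are among the skewed columns, that replots them on a log scale for readability:

# Hypothetical extension: log-transform assumed-skewed columns before plotting
skewed = ['Population', 'AveOccup', 'AveRooms']
np.log1p(df[skewed]).hist(figsize=(12, 4), bins=30, layout=(1, 3))
plt.suptitle('Log-Transformed Histograms (skewed features)')
plt.tight_layout()
plt.show()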

Output:


Experiment 2: Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to identify which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.
Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
data = fetch_california_housing(as_frame=True)
df = data.frame

# Display the first few rows of the dataset
print("Dataset Sample:")
print(df.head())

# Compute the correlation matrix
correlation_matrix = df.corr()

# Print the correlation matrix
print("\nCorrelation Matrix:")
print(correlation_matrix)

# Visualize the correlation matrix using a heatmap
def plot_heatmap(corr_matrix):
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
                cbar=True, square=True, linewidths=0.5)
    plt.title("Correlation Matrix Heatmap", fontsize=16)
    plt.show()

# Create a pair plot to visualize pairwise relationships
def plot_pairplot(df):
    sns.pairplot(df, diag_kind="kde", corner=True, plot_kws={'alpha': 0.5},
                 diag_kws={'fill': True})
    plt.suptitle("Pair Plot of Numerical Features", y=1.02, fontsize=16)
    plt.show()

# Plot the heatmap and pair plot
plot_heatmap(correlation_matrix)
plot_pairplot(df)
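Note: a pair plot over all 20,640 rows of the California Housing data can take a long time to render. A minimal sketch, reusing the plot_pairplot function above, that draws a random subsample instead:

# Hypothetical speed-up: pair plot on a 1,000-row sample
plot_pairplot(df.sample(n=1000, random_state=42))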

Output


Experiment 3: Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.
Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Standardize the features (mean = 0, variance = 1)
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Apply PCA to reduce dimensions from 4 to 2
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_standardized)

# Create a DataFrame for the PCA results
pca_df = pd.DataFrame(data=X_pca, columns=['Principal Component 1', 'Principal Component 2'])
pca_df['Target'] = y

# Plot the PCA results
plt.figure(figsize=(8, 6))
for target, color, label in zip(range(len(target_names)), ['r', 'g', 'b'], target_names):
    plt.scatter(
        pca_df.loc[pca_df['Target'] == target, 'Principal Component 1'],
        pca_df.loc[pca_df['Target'] == target, 'Principal Component 2'],
        color=color,
        alpha=0.6,
        label=label
    )
plt.title('PCA of Iris Dataset (2 Components)', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.legend(title='Target', loc='best')
plt.grid(alpha=0.3)
plt.show()
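It is worth checking how much of the original variance the two components retain. A minimal sketch, assuming the fitted pca object from the code above:

# Fraction of variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())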

Output:

Experiment 4: For a given set of training data examples stored in a .CSV
file, implement and demonstrate the Find-S algorithm to output a description of the set of all
hypotheses consistent with the training examples.
Code:
import pandas as pd

# Define the training data as a dictionary
data = {
    "Weather": ["Sunny", "Sunny", "Rainy", "Sunny"],
    "Temperature": ["Warm", "Warm", "Cold", "Warm"],
    "Humidity": ["Normal", "High", "High", "High"],
    "Wind": ["Strong", "Strong", "Strong", "Weak"],
    "PlayTennis": ["Yes", "Yes", "No", "Yes"]
}

# Convert the data into a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
file_path = "training_data.csv"
df.to_csv(file_path, index=False)
print("CSV file 'training_data.csv' has been created successfully!")

def find_s_algorithm(data):
    # Extract features and labels
    features = data.iloc[:, :-1].values  # All columns except the last (attributes)
    labels = data.iloc[:, -1].values     # The last column (class/label)

    # Initialize the hypothesis with the first positive example
    hypothesis = None
    for i, label in enumerate(labels):
        if label == "Yes":  # Look for a positive example
            hypothesis = features[i].copy()
            break

    # If no positive examples, return empty hypothesis
    if hypothesis is None:
        return "No positive examples found."

    # Generalize the hypothesis over each positive example
    for i, label in enumerate(labels):
        if label == "Yes":  # Process only positive examples
            for j in range(len(hypothesis)):
                if hypothesis[j] != features[i][j]:  # If attributes differ, generalize to '?'
                    hypothesis[j] = '?'

    return hypothesis

# Load training data from the CSV file
file_path = "training_data.csv"  # Replace with your CSV file path
data = pd.read_csv(file_path)

# Display the training data
print("Training Data:")
print(data)

# Run the Find-S algorithm
final_hypothesis = find_s_algorithm(data)

# Output the final hypothesis
print("\nFinal Hypothesis:")
print(final_hypothesis)
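With the four training rows above, the three positive examples agree on Weather = Sunny and Temperature = Warm but differ on Humidity and Wind, so Find-S should converge as follows:

# Expected trace (derived from the data above):
# start (row 1):   ['Sunny' 'Warm' 'Normal' 'Strong']
# after row 2:     ['Sunny' 'Warm' '?' 'Strong']
# after row 4:     ['Sunny' 'Warm' '?' '?']   <- final hypothesis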

Output

Experiment 5: Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following based on the generated dataset:

a. Label the first 50 points {x1, ..., x50} as follows: if (xi <= 0.5), then xi belongs to Class1, else xi belongs to Class2
b. Classify the remaining points, x51, ..., x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30
Code:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Step 1: Generate random dataset
np.random.seed(42)  # For reproducibility
x = np.random.rand(100)  # 100 random values in the range [0, 1]

# Step 2: Label the first 50 points
labels = np.where(x[:50] <= 0.5, 1, 2)  # Class1 = 1, Class2 = 2

# Combine the labeled data into a training set
x_train = x[:50].reshape(-1, 1)
y_train = labels

# The remaining 50 points to classify
x_test = x[50:].reshape(-1, 1)

# Step 3: k-NN classification
k_values = [1, 2, 3, 4, 5, 20, 30]
print("k-NN Classification Results:")

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)  # Train the k-NN model
    y_pred = knn.predict(x_test)  # Predict classes for the test points

    # Output the predictions for this k value
    print(f"\nk = {k}:")
    print(f"Predicted Classes: {y_pred}")
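Since the rule that generated the training labels is known, the test predictions can also be scored against rule-derived labels. A minimal sketch, to be placed inside the loop above (an assumption for evaluation only; the lab statement itself does not define test labels):

    # Hypothetical check: compare predictions with the labelling rule
    y_true = np.where(x[50:] <= 0.5, 1, 2)
    print(f"Accuracy for k={k}: {np.mean(y_pred == y_true):.2f}")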

Output


Experiment 6: Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.
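For a query point x_q, LWR solves a weighted least-squares problem: each training point x_i receives a Gaussian weight w_i = exp(-(x_i - x_q)^2 / (2 * tau^2)), and the local parameters are theta = (X^T W X)^(-1) X^T W y, where W = diag(w_1, ..., w_n) and X includes a bias column. The prediction at x_q is then [1, x_q] . theta. The bandwidth tau controls how local the fit is; the code below implements exactly this.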
Code:
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = np.sin(X) + np.random.normal(scale=0.2, size=X.shape)  # Add some noise

# Reshape X for matrix operations
X = X[:, np.newaxis]

# Locally Weighted Regression function
def locally_weighted_regression(X_train, y_train, query_point, tau):
    # Compute weights using a Gaussian kernel
    weights = np.exp(-np.sum((X_train - query_point)**2, axis=1) / (2 * tau**2))

    # Create a diagonal weight matrix
    W = np.diag(weights)

    # Solve the weighted normal equation
    X_bias = np.hstack([np.ones_like(X_train), X_train])  # Add bias term
    theta = np.linalg.inv(X_bias.T @ W @ X_bias) @ X_bias.T @ W @ y_train

    # Predict for the query point
    query_point_bias = np.array([1, query_point])  # Use scalar query_point
    prediction = query_point_bias @ theta
    return prediction

# Predict for multiple points using LWR
def predict_lwr(X_train, y_train, X_test, tau):
    predictions = np.array([locally_weighted_regression(X_train, y_train, x[0], tau) for x in X_test])
    return predictions

# Hyperparameter: bandwidth (tau)
tau = 0.5

# Predict values for the test set
y_pred = predict_lwr(X, y, X, tau)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, label="Data Points", color="blue", s=10)
plt.plot(X, y_pred, label=f"LWR Prediction (tau={tau})", color="red", linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Locally Weighted Regression (LWR)")
plt.legend()
plt.grid(True)
plt.show()
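The bandwidth strongly affects the fit: a small tau tracks the noise, while a large tau approaches a single global line. A minimal sketch, reusing predict_lwr from above, that overlays several bandwidths:

# Hypothetical comparison of bandwidths
plt.scatter(X, y, s=10, alpha=0.4, label="Data Points")
for t in [0.1, 0.5, 2.0]:
    plt.plot(X, predict_lwr(X, y, X, t), label=f"tau={t}", linewidth=2)
plt.legend()
plt.show()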

Output


Experiment 7: Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# --- Linear Regression with Boston Housing Dataset ---
def linear_regression_boston():
    # Load Boston Housing Dataset
    boston = fetch_openml(name="boston", version=1, as_frame=True)
    X = boston.data.to_numpy()    # Convert DataFrame to NumPy array
    y = boston.target.to_numpy()  # Convert Series to NumPy array

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a linear regression model
    lin_reg = LinearRegression()
    lin_reg.fit(X_train, y_train)

    # Predictions
    y_pred = lin_reg.predict(X_test)

    # Performance metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Plotting: predicted vs. true values, with the ideal y = x line in red
    plt.figure(figsize=(10, 6))
    plt.scatter(y_test, y_pred, alpha=0.6)
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color="red", linewidth=2)
    plt.xlabel("True Values (Boston Housing)")
    plt.ylabel("Predicted Values")
    plt.title("Linear Regression on Boston Housing Dataset")
    plt.grid(True)
    plt.show()

    print("Linear Regression Results:")
    print(f"Mean Squared Error: {mse:.2f}")
    print(f"R^2 Score: {r2:.2f}")

# --- Polynomial Regression with Auto MPG Dataset ---
def polynomial_regression_auto_mpg():
    # Load Auto MPG Dataset
    auto_mpg = fetch_openml(name="autoMpg", version=1, as_frame=True)
    data = auto_mpg.data
    target = auto_mpg.target.astype(float)  # Convert target (MPG) to float

    # Remove rows with missing 'horsepower' values from both data and target
    data = data.dropna(subset=["horsepower"])
    target = target.loc[data.index]  # Keep target aligned with filtered data

    # Select feature (horsepower) and target (MPG)
    X_hp = data[["horsepower"]].astype(float)  # Convert horsepower to float
    y_mpg = target

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X_hp, y_mpg, test_size=0.2, random_state=42)

    # Polynomial regression (degree=3)
    poly_features = PolynomialFeatures(degree=3)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)

    # Train a linear regression model on the transformed polynomial features
    lr_poly = LinearRegression()
    lr_poly.fit(X_train_poly, y_train)

    # Predictions
    y_pred_poly = lr_poly.predict(X_test_poly)

    # Performance metrics
    mse_poly = mean_squared_error(y_test, y_pred_poly)
    r2_poly = r2_score(y_test, y_pred_poly)

    # Sort test points so the fitted curve plots smoothly
    X_test_sorted, y_test_sorted = zip(*sorted(zip(X_test.values.flatten(), y_test)))
    y_pred_sorted = lr_poly.predict(poly_features.transform(np.array(X_test_sorted).reshape(-1, 1)))

    # Plot
    plt.figure(figsize=(10, 6))
    plt.scatter(X_test, y_test, color="blue", label="True values", alpha=0.6)
    plt.plot(X_test_sorted, y_pred_sorted, color="red", label="Polynomial fit (degree=3)", linewidth=2)
    plt.xlabel("Horsepower")
    plt.ylabel("Miles Per Gallon (MPG)")
    plt.title("Polynomial Regression on Auto MPG Dataset")
    plt.legend()
    plt.grid(True)
    plt.show()

    print("Polynomial Regression Results (Degree=3):")
    print(f"Mean Squared Error: {mse_poly:.2f}")
    print(f"R^2 Score: {r2_poly:.2f}")

# Run both demonstrations
def run_models():
    linear_regression_boston()
    polynomial_regression_auto_mpg()

# Execute the functions
run_models()
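Note: scikit-learn removed its built-in load_boston loader over ethical concerns with the dataset, which is why this program pulls it from OpenML; depending on the installed versions, fetch_openml may still emit warnings for it. A minimal sketch of a drop-in substitute for the linear part, assuming the California Housing data is acceptable for the exercise:

# Hypothetical substitute features/target for the linear model
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
X = housing.data.to_numpy()
y = housing.target.to_numpy()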

Output


Experiment 8: Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer Data set for building the decision tree and apply this knowledge to classify a new sample.
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report

# Load the Breast Cancer dataset
def load_and_preprocess_data():
    cancer_data = load_breast_cancer()
    X = cancer_data.data
    y = cancer_data.target
    feature_names = cancer_data.feature_names
    target_names = cancer_data.target_names
    return X, y, feature_names, target_names

# Train a Decision Tree classifier
def train_decision_tree(X_train, y_train):
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_train, y_train)
    return clf

# Plot the Decision Tree
def plot_decision_tree(clf, feature_names):
    plt.figure(figsize=(12, 8))
    plot_tree(clf, filled=True, feature_names=feature_names,
              class_names=["Malignant", "Benign"], rounded=True, proportion=True)
    plt.title("Decision Tree for Breast Cancer Classification")
    plt.show()

# Classify a new sample
def classify_new_sample(clf, sample):
    sample = np.array(sample).reshape(1, -1)
    prediction = clf.predict(sample)
    return prediction

# Main function to demonstrate the decision tree algorithm
def main():
    # Load and preprocess data
    X, y, feature_names, target_names = load_and_preprocess_data()

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the Decision Tree model
    clf = train_decision_tree(X_train, y_train)

    # Predict on the test set
    y_pred = clf.predict(X_test)

    # Print performance metrics
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=target_names))

    # Plot the decision tree
    plot_decision_tree(clf, feature_names)

    # Classify a new sample (here, the first sample from the test set)
    sample = X_test[0]  # You can replace this with any sample
    prediction = classify_new_sample(clf, sample)
    print(f"\nClassified sample: {'Benign' if prediction[0] == 1 else 'Malignant'}")

# Run the main function
if __name__ == "__main__":
    main()
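An unconstrained decision tree typically grows until it memorizes the training data, which makes the plotted tree hard to read. A minimal sketch, assuming the functions above and the variables produced inside main(), that caps the depth for a more interpretable (and often better-generalizing) tree:

# Hypothetical variant: limit the tree depth to 3
clf_small = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_small.fit(X_train, y_train)  # X_train/y_train as produced in main()
plot_decision_tree(clf_small, feature_names)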


Output


Experiment 9: Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data set for training. Compute the accuracy of the classifier, considering a few test data sets.
Code:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Function to load and preprocess the Olivetti Face Dataset
def load_and_preprocess_data():
    faces_data = datasets.fetch_olivetti_faces(shuffle=True, random_state=42)
    X = faces_data.data    # 4096 features per face (64x64 pixel images, flattened)
    y = faces_data.target  # Labels for each person (40 classes)
    return X, y

# Function to split the data into training and testing sets
def split_data(X, y, test_size=0.2):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    return X_train, X_test, y_train, y_test

# Function to train a Naive Bayes classifier
def train_naive_bayes(X_train, y_train):
    nb_classifier = GaussianNB()
    nb_classifier.fit(X_train, y_train)
    return nb_classifier

# Function to predict and calculate accuracy
def predict_and_evaluate(nb_classifier, X_test, y_test):
    y_pred = nb_classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return y_pred, accuracy

# Function to visualize a few test samples with predicted labels
def visualize_predictions(X_test, y_pred, n_samples=5):
    plt.figure(figsize=(10, 5))
    for i in range(n_samples):
        plt.subplot(1, n_samples, i + 1)
        plt.imshow(X_test[i].reshape(64, 64), cmap='gray')
        plt.title(f"Pred: {y_pred[i]}")
        plt.axis('off')
    plt.show()

# Main function to run the program
def main():
    # Load and preprocess data
    X, y = load_and_preprocess_data()

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = split_data(X, y)

    # Train the Naive Bayes classifier
    nb_classifier = train_naive_bayes(X_train, y_train)

    # Predict and evaluate the model
    y_pred, accuracy = predict_and_evaluate(nb_classifier, X_test, y_test)
    print(f"Accuracy of the Naive Bayes Classifier on the test set: {accuracy * 100:.2f}%")

    # Visualize predictions on the test set
    visualize_predictions(X_test, y_pred)

# Run the program
if __name__ == "__main__":
    main()
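GaussianNB treats each of the 4,096 pixels as an independent feature, which is a strong assumption for images. A minimal sketch (a hypothetical variant, not part of the lab statement) that reduces the pixels with PCA before the classifier, which often helps on this dataset; the choice of 100 components is an assumption:

# Hypothetical pipeline: PCA followed by GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
model = make_pipeline(PCA(n_components=100, random_state=42), GaussianNB())
model.fit(X_train, y_train)  # X_train/y_train as produced in main()
print("PCA + NB accuracy:", model.score(X_test, y_test))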


Output


Experiment 10: Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set and visualize the clustering result.
Code:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Function to load and preprocess the dataset
def load_and_preprocess_data():
    # Load the Wisconsin Breast Cancer dataset
    data = load_breast_cancer()

    # Extract features (X) and labels (y)
    X = data.data
    y = data.target

    # Standardize the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    return X_scaled, y

# Function to apply K-Means clustering
def apply_kmeans(X, n_clusters=2):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(X)

    # Return the cluster centers and labels
    return kmeans.cluster_centers_, kmeans.labels_

# Function to visualize the clustering result
def visualize_clusters(X, labels, centers):
    plt.figure(figsize=(8, 6))

    # Plot the data points (first two features), colored by cluster label
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)

    # Plot the cluster centers
    plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200, label='Centroids')

    plt.title('K-Means Clustering on Wisconsin Breast Cancer Dataset')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()

# Main function to execute the program
def main():
    # Load and preprocess the data
    X, y = load_and_preprocess_data()

    # Apply K-Means clustering (using 2 clusters for simplicity)
    centers, labels = apply_kmeans(X, n_clusters=2)

    # Visualize the clustering results
    visualize_clusters(X, labels, centers)

# Run the program
if __name__ == "__main__":
    main()
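The plot above shows only the first two of thirty standardized features, so the clusters can look tangled. A minimal sketch (a hypothetical extension) that projects the data onto two principal components before plotting, and scores the clusters against the true labels:

# Hypothetical: visualize in PCA space and compare clusters to labels
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
X_2d = PCA(n_components=2, random_state=42).fit_transform(X)  # X, labels, y as produced in main()
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', s=50)
plt.title('K-Means Clusters in PCA Space')
plt.show()
print("Adjusted Rand Index vs. true labels:", adjusted_rand_score(y, labels))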

Output


Extra Program 1: Develop a program to implement the Support Vector Machine (SVM) algorithm
for binary classification. Use the Pima Indians Diabetes dataset for training and evaluation.
Experiment with different kernel functions (linear, RBF, polynomial) and regularization parameters
to find the best model.
Code:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Pima Indians Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
           'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=columns)

# Split the data into features (X) and target (y)
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Normalize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Define the SVM models with different kernel functions
kernels = ['linear', 'rbf', 'poly']
C_values = [0.1, 1, 10]

# Store results
results = []

# Train and evaluate the models
for kernel in kernels:
    for C in C_values:
        print(f"Training SVM with {kernel} kernel and C={C}...")

        # Create and train the SVM model
        model = svm.SVC(kernel=kernel, C=C)
        model.fit(X_train, y_train)

        # Predict on the test set
        y_pred = model.predict(X_test)

        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        results.append((kernel, C, accuracy))

# Find the best model
best_model = max(results, key=lambda x: x[2])

# Print results
print("\nModel Evaluation Results:")
for result in results:
    print(f"Kernel: {result[0]}, C: {result[1]}, Accuracy: {result[2]:.4f}")

print("\nBest Model:")
print(f"Kernel: {best_model[0]}, C: {best_model[1]}, Accuracy: {best_model[2]:.4f}")
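The import list includes GridSearchCV, which can replace the manual double loop above with built-in cross-validation on the training set. A minimal sketch of the equivalent search (the param_grid values mirror the lists above):

# Hypothetical equivalent using GridSearchCV (5-fold CV on the training set)
param_grid = {'kernel': ['linear', 'rbf', 'poly'], 'C': [0.1, 1, 10]}
grid = GridSearchCV(svm.SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test accuracy of best model:", grid.best_estimator_.score(X_test, y_test))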

Output


Extra Program 2: Develop a program to implement the AdaBoost algorithm for classification. Use
the Pima Indians Diabetes dataset for training and evaluation. Analyze the impact of the number of
weak learners on the model's performance and compare the performance of AdaBoost with a single
decision tree classifier.
Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load the dataset (Pima Indians Diabetes dataset)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=column_names)

# Split into features and target
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Normalize the data (not strictly required for tree-based AdaBoost, but kept for consistency)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Initialize the base classifier (a decision stump)
base_estimator = DecisionTreeClassifier(max_depth=1)

# AdaBoost with different numbers of weak learners (n_estimators)
n_estimators_list = [10, 50, 100, 200]
adaboost_scores = []

for n_estimators in n_estimators_list:
    # Train AdaBoost (scikit-learn >= 1.2 uses the parameter name 'estimator')
    adaboost = AdaBoostClassifier(estimator=base_estimator, n_estimators=n_estimators, random_state=42)
    adaboost.fit(X_train, y_train)

    # Evaluate the performance
    y_pred_adaboost = adaboost.predict(X_test)
    report = classification_report(y_test, y_pred_adaboost, output_dict=True)
    adaboost_scores.append(report['accuracy'])  # Collect accuracy for comparison

# Evaluate performance of a single decision tree
single_tree = DecisionTreeClassifier(max_depth=1, random_state=42)
single_tree.fit(X_train, y_train)
y_pred_tree = single_tree.predict(X_test)
single_tree_report = classification_report(y_test, y_pred_tree, output_dict=True)

# Display performance comparison between AdaBoost and the single decision tree
print(f"Single Decision Tree Performance:\n{classification_report(y_test, y_pred_tree)}")
print("\nAdaBoost Performance with different numbers of weak learners:")
for i, n_estimators in enumerate(n_estimators_list):
    print(f"\nAdaBoost with {n_estimators} weak learners:")
    print(f"Accuracy: {adaboost_scores[i]}")

# Plot the impact of the number of weak learners on AdaBoost's performance
plt.plot(n_estimators_list, adaboost_scores, marker='o', label='AdaBoost')
plt.axhline(y=single_tree_report['accuracy'], color='r', linestyle='--', label='Single Decision Tree')
plt.xlabel('Number of Weak Learners')
plt.ylabel('Accuracy')
plt.title('Impact of Number of Weak Learners on AdaBoost Performance')
plt.legend()
plt.show()

Output
