Machine Learning Lab Manual
Semester: VI
Teaching Hours/Week: 2
CIE Marks: 50
SEE Marks: 50
Credits: 04
Exam Hours:
CLO 1: To become familiar with data and visualize univariate, bivariate, and multivariate data using
statistical techniques and dimensionality reduction.
CLO 2: To understand various machine learning algorithms such as similarity-based learning,
regression, decision trees, and clustering.
CLO 3: To become familiar with learning theories and probability-based models, and to develop the skills
required for decision-making in dynamic environments.
Pedagogy: For the experiments listed below, the following pedagogies can be considered: problem-based
learning, active learning, MOOCs, and chalk & talk.
PART A – List of problems for which students should develop programs and execute in the
Laboratory.
Course outcomes (Course Skill Set):
At the end of the course, the student will be able to:
CO1: Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
CO2: Demonstrate similarity-based learning methods and perform regression analysis.
CO3: Develop decision trees for classification and regression problems, and Bayesian models for
probabilistic learning.
CO4: Implement clustering algorithms to discover natural groupings in unlabeled data.
List of Problems/Experiments
1 Develop a program to create histograms for all numerical features and analyze the distribution of
each feature. Generate box plots for all numerical features and identify any outliers. Use California
Housing dataset.
2 Develop a program to compute the correlation matrix to understand the relationships between pairs
of features. Visualize the correlation matrix using a heatmap to identify which variables have strong
positive/negative correlations. Create a pair plot to visualize pairwise relationships between
features. Use California Housing dataset.
3 Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
4 For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-
S algorithm to output the most specific hypothesis consistent with the training examples.
5 Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly
generated values of x in the range [0, 1]. Perform the following based on the generated dataset:
a. Label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5), then xi ∈ Class1, else xi ∈ Class2
b. Classify the remaining points x51, ..., x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30
6 Implement the non-parametric Locally Weighted Regression algorithm to fit data points. Select an
appropriate data set for your experiment and draw graphs.
7 Develop a program to demonstrate the working of Linear Regression and Polynomial Regression.
Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel
efficiency prediction) for Polynomial Regression.
8 Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer
Data set for building the decision tree and apply this knowledge to classify a new sample.
9 Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data set for
training. Compute the accuracy of the classifier, considering a few test data sets.
10 Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set and
visualize the clustering result.
Experiment 1: Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify any
outliers. Use California Housing dataset.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a DataFrame
data = fetch_california_housing(as_frame=True)
df = data.frame
print("Dataset Sample:")
print(df.head())

def plot_histograms(df):
    # One histogram per numerical feature
    df.hist(figsize=(14, 10), bins=30, edgecolor='black')
    plt.suptitle("Histograms of Numerical Features")
    plt.tight_layout(rect=[0, 0, 1, 0.97])
    plt.show()

def plot_boxplots(df):
    # One box plot per numerical feature to expose outliers
    plt.figure(figsize=(14, 10))
    for i, column in enumerate(df.columns, 1):
        plt.subplot(3, 3, i)
        plt.boxplot(df[column])
        plt.title(column)
    plt.tight_layout()
    plt.show()

def analyze_features(df):
    # Report the IQR-based outlier count for each feature
    print("\nFeature Analysis:")
    for column in df.columns:
        print(f"\nFeature: {column}")
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        outliers = df[(df[column] < q1 - 1.5 * iqr) | (df[column] > q3 + 1.5 * iqr)]
        print(f"Outliers (IQR rule): {len(outliers)}")

plot_histograms(df)
plot_boxplots(df)
analyze_features(df)
Output:
Experiment 2: Develop a program to compute the correlation matrix to understand the relationships
between pairs of features. Visualize the correlation matrix using a heatmap to identify which
variables have strong positive/negative correlations. Create a pair plot to visualize pairwise
relationships between features. Use California Housing dataset.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
df = data.frame
print("Dataset Sample:")
print(df.head())

# Correlation matrix for all numerical features
correlation_matrix = df.corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)

def plot_heatmap(corr_matrix):
    # Heatmap of pairwise correlations
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title("Correlation Heatmap")
    plt.show()

def plot_pairplot(df):
    # Pair plot showing pairwise relationships and per-feature distributions
    sns.pairplot(df, diag_kind='kde')
    plt.show()

plot_heatmap(correlation_matrix)
plot_pairplot(df)
Output:
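As a complement to the heatmap, the strongest feature pairs can also be listed directly. A minimal sketch (an addition, not part of the original listing, reusing the same dataset):
import pandas as pd
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame
# Flatten the correlation matrix, keep each pair once, sort by absolute strength
corr = df.corr().unstack()
corr = corr[corr.index.get_level_values(0) < corr.index.get_level_values(1)]
print(corr.abs().sort_values(ascending=False).head(5))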
Experiment 3: Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Standardize the features before PCA
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Project the 4-dimensional data onto 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_standardized)
pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
pca_df['Target'] = y

plt.figure(figsize=(8, 6))
for target, color, label in zip([0, 1, 2], ['r', 'g', 'b'], target_names):
    subset = pca_df[pca_df['Target'] == target]
    plt.scatter(subset['PC1'], subset['PC2'],
                color=color,
                alpha=0.6,
                label=label)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Target', loc='best')
plt.grid(alpha=0.3)
plt.show()
Output:
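How much information the two components retain can be checked via the explained variance ratio. A short sketch (an addition, not part of the original listing):
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X)
# Fraction of total variance captured by each principal component, and their sum
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())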
Experiment 4: For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Find-S algorithm to output the most specific hypothesis consistent with the training
examples.
Code:
import pandas as pd

# Sample training data (the classic EnjoySport examples)
data = {
    'Sky': ['Sunny', 'Sunny', 'Rainy', 'Sunny'],
    'AirTemp': ['Warm', 'Warm', 'Cold', 'Warm'],
    'Humidity': ['Normal', 'High', 'High', 'High'],
    'Wind': ['Strong', 'Strong', 'Strong', 'Strong'],
    'Water': ['Warm', 'Warm', 'Warm', 'Cool'],
    'Forecast': ['Same', 'Same', 'Change', 'Change'],
    'EnjoySport': ['Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(data)
file_path = "training_data.csv"
df.to_csv(file_path, index=False)

def find_s_algorithm(data):
    # Attributes are all columns except the class label (last column)
    features = data.iloc[:, :-1].values
    labels = data.iloc[:, -1].values
    hypothesis = None
    for i in range(len(features)):
        if labels[i] == 'Yes':  # Find-S uses only positive examples
            if hypothesis is None:
                hypothesis = features[i].copy()  # start from the first positive example
            else:
                for j in range(len(hypothesis)):
                    if hypothesis[j] != features[i][j]:
                        hypothesis[j] = '?'  # generalize mismatching attributes
    return hypothesis

data = pd.read_csv(file_path)
print("Training Data:")
print(data)
final_hypothesis = find_s_algorithm(data)
print("\nFinal Hypothesis:")
print(final_hypothesis)
Output:
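To see how Find-S generalizes step by step, the hypothesis can be printed after each positive example. A small illustrative sketch (an addition; assumes the training_data.csv written by the listing above):
import pandas as pd

data = pd.read_csv("training_data.csv")
hypothesis = None
for _, row in data.iterrows():
    if row.iloc[-1] == 'Yes':  # only positive examples update the hypothesis
        example = list(row.iloc[:-1])
        if hypothesis is None:
            hypothesis = example
        else:
            hypothesis = [h if h == e else '?' for h, e in zip(hypothesis, example)]
        print("After", example, "->", hypothesis)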
Experiment 5: Develop a program to implement the k-Nearest Neighbour algorithm to classify 100
randomly generated values of x in the range [0, 1]. Perform the following based on the generated dataset:
a. Label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5), then xi ∈ Class1, else xi ∈ Class2
b. Classify the remaining points x51, ..., x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30
Code:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(42)
x = np.random.rand(100)  # 100 random values in [0, 1]
labels = np.where(x[:50] <= 0.5, 'Class1', 'Class2')  # label the first 50 points
x_train = x[:50].reshape(-1, 1)
y_train = labels
x_test = x[50:].reshape(-1, 1)

for k in [1, 2, 3, 4, 5, 20, 30]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    print(f"\nk = {k}:")
    print(knn.predict(x_test))
Output:
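Because the true class of the test points follows the same threshold rule, the accuracy for each k can also be computed. A short self-contained sketch (an addition to the original listing):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

np.random.seed(42)
x = np.random.rand(100)
y_true = np.where(x <= 0.5, 'Class1', 'Class2')  # threshold rule applied to all points
x_train, x_test = x[:50].reshape(-1, 1), x[50:].reshape(-1, 1)
for k in [1, 2, 3, 4, 5, 20, 30]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_true[:50])
    print(f"k={k}: accuracy={accuracy_score(y_true[50:], knn.predict(x_test)):.2f}")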
Experiment 6: Implement the non-parametric Locally Weighted Regression algorithm to fit data
points. Select an appropriate data set for your experiment and draw graphs.
Code:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
X = np.linspace(0, 10, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)  # noisy sine-curve data
X = X[:, np.newaxis]

def lwr_predict(x_query, X, y, tau):
    # Gaussian weights centred on the query point
    weights = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(weights)
    Xb = np.hstack([np.ones((len(X), 1)), X])  # add a bias column
    # Weighted least squares solved in closed form
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return np.r_[1, x_query] @ theta

tau = 0.5
predictions = np.array([lwr_predict(x, X, y, tau) for x in X])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label="Data")
plt.plot(X, predictions, color="red", label=f"LWR fit (tau={tau})")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.grid(True)
plt.show()
Output:
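The bandwidth tau controls how local the fit is: smaller values track the data more closely, larger values smooth more. A brief sketch comparing several bandwidths (an addition; assumes X, y and lwr_predict from the listing above are in scope):
import numpy as np
import matplotlib.pyplot as plt

for tau, color in [(0.1, 'green'), (0.5, 'red'), (2.0, 'purple')]:
    preds = np.array([lwr_predict(x, X, y, tau) for x in X])
    plt.plot(X, preds, color=color, label=f"tau={tau}")
plt.scatter(X, y, alpha=0.3, label="Data")
plt.legend()
plt.show()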
Experiment 7: Develop a program to demonstrate the working of Linear Regression and
Polynomial Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG
Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# ---- Linear Regression on the Boston Housing dataset (fetched from OpenML) ----
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data.select_dtypes(include=[np.number])  # keep numeric columns only
y = boston.target.astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predictions
y_pred = lin_reg.predict(X_test)

# Performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression - MSE: {mse:.2f}, R^2: {r2:.2f}")

# Plotting predicted vs. true house prices
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.xlabel("True values")
plt.ylabel("Predicted values")
plt.title("Linear Regression on Boston Housing Dataset")
plt.grid(True)
plt.show()

# ---- Polynomial Regression on the Auto MPG dataset (fetched from OpenML) ----
auto = fetch_openml(name="autoMpg", version=1, as_frame=True)
data = auto.data
target = auto.target.astype(float)

# Remove rows with missing 'horsepower' values from both data and target
data = data.dropna(subset=["horsepower"])
target = target.loc[data.index]  # ensure target is aligned with filtered data

X_hp = data[["horsepower"]].astype(float).values
y_mpg = target.values

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_hp, y_mpg, test_size=0.2, random_state=42)

# Degree-3 polynomial features
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
lr_poly = LinearRegression()
lr_poly.fit(X_train_poly, y_train)

# Predictions
y_pred_poly = lr_poly.predict(X_test_poly)

# Performance metrics
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)
print(f"Polynomial Regression - MSE: {mse_poly:.2f}, R^2: {r2_poly:.2f}")

# Sort test points so the fitted curve plots smoothly
order = X_test[:, 0].argsort()
X_test_sorted = X_test[order]
y_pred_sorted = y_pred_poly[order]

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color="blue", label="True values", alpha=0.6)
plt.plot(X_test_sorted, y_pred_sorted, color="red", label="Polynomial fit (degree=3)", linewidth=2)
plt.xlabel("Horsepower")
plt.ylabel("Miles Per Gallon (MPG)")
plt.title("Polynomial Regression on Auto MPG Dataset")
plt.legend()
plt.grid(True)
plt.show()
Output:
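The choice of polynomial degree matters: too low underfits, too high oscillates at the extremes of the horsepower range. A small sketch comparing degrees (an addition; assumes X_train, X_test, y_train, y_test from the polynomial-regression section above):
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

for degree in [1, 2, 3, 5]:
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    score = r2_score(y_test, model.predict(poly.transform(X_test)))
    print(f"degree={degree}: R^2={score:.3f}")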
Experiment 8: Develop a program to demonstrate the working of the decision tree algorithm.
Use Breast Cancer Data set for building the decision tree and apply this knowledge to classify a
new sample.
Code:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

# Visualize the top levels of the fitted tree
plt.figure(figsize=(12, 6))
plot_tree(clf, feature_names=data.feature_names, filled=True, max_depth=2)
plt.show()

def classify_new_sample(clf, sample):
    return clf.predict(sample.reshape(1, -1))[0]

# Classify a new sample (e.g., a sample from the test set)
sample = X_test[0]  # you can replace this with any sample
prediction = classify_new_sample(clf, sample)
print(f"\nClassified sample: {'Benign' if prediction == 1 else 'Malignant'}")
Output:
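Which measurements drive the splits can be read from the fitted tree's feature importances. A minimal sketch (an addition; assumes clf and data from the listing above):
import numpy as np

# Top five features by importance in the fitted tree
top = np.argsort(clf.feature_importances_)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {clf.feature_importances_[i]:.3f}")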
Experiment 9: Develop a program to implement the Naive Bayesian classifier considering
Olivetti Face Data set for training. Compute the accuracy of the classifier, considering a few test
data sets.
Code:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Olivetti faces dataset (40 subjects, 10 images each)
faces = datasets.fetch_olivetti_faces()
X_train, X_test, y_train, y_test = train_test_split(
    faces.data, faces.target, test_size=0.2, random_state=42, stratify=faces.target)
gnb = GaussianNB().fit(X_train, y_train)  # Gaussian NB on the raw pixel features
print(f"Accuracy: {accuracy_score(y_test, gnb.predict(X_test)):.4f}")
Output:
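To sanity-check the classifier visually, a few test faces can be displayed with predicted and true subject IDs. A brief sketch (an addition; assumes X_test, y_test and gnb from the listing above):
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 5, figsize=(12, 3))
for ax, img, true, pred in zip(axes, X_test[:5], y_test[:5], gnb.predict(X_test[:5])):
    ax.imshow(img.reshape(64, 64), cmap='gray')  # Olivetti images are 64x64 pixels
    ax.set_title(f"true {true} / pred {pred}")
    ax.axis('off')
plt.show()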
Experiment 10: Develop a program to implement k-means clustering using Wisconsin Breast
Cancer data set and visualize the clustering result.
Code:
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

def load_and_scale():
    data = load_breast_cancer()  # Wisconsin Breast Cancer data
    X_scaled = StandardScaler().fit_transform(data.data)
    return X_scaled, data.target

X_scaled, y = load_and_scale()
clusters = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_scaled)

# Visualize the clusters in the first two (scaled) feature dimensions
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.title("k-means Clustering on Breast Cancer Data")
plt.show()
Output:
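How well the two clusters line up with the actual benign/malignant labels can be quantified with the adjusted Rand index. A minimal sketch (an addition; assumes clusters and y from the listing above):
from sklearn.metrics import adjusted_rand_score

# 1.0 means perfect agreement with the true labels, 0 means chance level
print(f"Adjusted Rand index: {adjusted_rand_score(y, clusters):.3f}")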
Extra Program 1: Develop a program to implement the Support Vector Machine (SVM) algorithm
for binary classification. Use the Pima Indians Diabetes dataset for training and evaluation.
Experiment with different kernel functions (linear, RBF, polynomial) and regularization parameters
to find the best model.
Code:
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Pima Indians Diabetes dataset (assumes a local CSV copy;
# the last column is the 0/1 diabetes outcome)
df = pd.read_csv("pima-indians-diabetes.csv", header=None)
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
X_scaled = StandardScaler().fit_transform(X)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)

# Store results for every kernel / regularization combination
results = []
for kernel in ["linear", "rbf", "poly"]:
    for C in [0.1, 1, 10]:
        model = svm.SVC(kernel=kernel, C=C)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        results.append((kernel, C, accuracy))

# Print results
print("\nModel Evaluation Results:")
for result in results:
    print(f"Kernel: {result[0]}, C: {result[1]}, Accuracy: {result[2]:.4f}")

best_model = max(results, key=lambda r: r[2])
print("\nBest Model:")
print(f"Kernel: {best_model[0]}, C: {best_model[1]}, Accuracy: {best_model[2]:.4f}")
Output:
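The original listing also imports GridSearchCV; the same kernel/C search can be done with cross-validation instead of a manual loop. A minimal sketch (an addition; assumes X_train, y_train from the listing above):
from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {"kernel": ["linear", "rbf", "poly"], "C": [0.1, 1, 10]}
grid = GridSearchCV(svm.SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.4f}")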
Extra Program 2: Develop a program to implement the AdaBoost algorithm for classification. Use
the Pima Indians Diabetes dataset for training and evaluation. Analyze the impact of the number of
weak learners on the model's performance and compare the performance of AdaBoost with a single
decision tree classifier.
Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Load and scale the Pima Indians Diabetes data (assumes a local CSV copy)
df = pd.read_csv("pima-indians-diabetes.csv", header=None)
X_scaled = StandardScaler().fit_transform(df.iloc[:, :-1].values)
y = df.iloc[:, -1].values

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)

# Baseline: a single decision tree
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(f"Decision Tree accuracy: {tree.score(X_test, y_test):.4f}")

# AdaBoost with an increasing number of weak learners
for n in [10, 50, 100, 200]:
    ada = AdaBoostClassifier(n_estimators=n, random_state=42).fit(X_train, y_train)
    print(f"AdaBoost (n_estimators={n}) accuracy: {ada.score(X_test, y_test):.4f}")
print(classification_report(y_test, ada.predict(X_test)))
Output:
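To make the impact of the ensemble size easier to see, the accuracies can be plotted against the number of weak learners. A short sketch (an addition; assumes X_train, X_test, y_train, y_test from the listing above):
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier

n_values = [10, 50, 100, 200]
accs = [AdaBoostClassifier(n_estimators=n, random_state=42)
        .fit(X_train, y_train).score(X_test, y_test) for n in n_values]
plt.plot(n_values, accs, marker='o')
plt.xlabel("Number of weak learners")
plt.ylabel("Test accuracy")
plt.title("AdaBoost: accuracy vs. ensemble size")
plt.show()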