
Machine Learning Lab (UGCS 213)

Experiment: 01
1. Introduction to Scikit-learn
Scikit-learn is a popular machine learning library in Python. It provides a wide range of tools for building machine learning models and for preprocessing data. It is built on top of NumPy, SciPy, and matplotlib, making it efficient and easy to integrate with other Python data tools.

Key Features of Scikit-learn

 Classification – Identifying which category an object belongs to.
Example: Email spam detection.
 Regression – Predicting a continuous-valued attribute.
Example: Predicting house prices.
 Clustering – Grouping similar items.
Example: Customer segmentation.
 Dimensionality Reduction – Reducing the number of features.
Example: PCA (Principal Component Analysis).
 Model Selection – Comparing, validating, and selecting models.
 Preprocessing – Feature extraction and normalization.

Installation:

pip install scikit-learn

2. Classification
Classification is a supervised learning technique where the output is a label or category.

Example: K-Nearest Neighbors (KNN)

Theory:
KNN classifies a data point based on how its neighbors are classified. It stores all available
cases and classifies new ones based on a similarity measure (e.g., distance functions).
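
A tiny from-scratch sketch of this idea (plain NumPy, for intuition only; the scikit-learn code below is the practical version):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every stored training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Take the k closest points and let their labels vote
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> 0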

Functions Used:

 KNeighborsClassifier() - Initialize the model
 fit() - Train the model
 predict() - Predict the classes
 score() or accuracy_score() - Evaluate the model

Code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

3. Regression
Regression is used when the target variable is continuous.

Example: Linear Regression

Theory:
Linear regression fits a line (y = mx + b) to predict a continuous output based on input
features.

Functions Used:

 LinearRegression() - Initialize the model
 fit() - Train the model
 predict() - Predict values
 mean_squared_error() - Evaluate performance

Code:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 6, 9, 12])

# Model
model = LinearRegression()
model.fit(X, y)

# Predict
y_pred = model.predict([[5]])
print("Predicted value:", y_pred)

4. Clustering
Clustering is an unsupervised learning method that groups data based on similarity.

Example: K-Means Clustering

Theory:
K-Means groups data into k clusters by minimizing the variance within each cluster.

Functions Used:

 KMeans() - Initialize with number of clusters
 fit() - Compute clusters
 predict() or labels_ - Access cluster assignments
 cluster_centers_ - Get coordinates of cluster centers

Code:

from sklearn.cluster import KMeans
import numpy as np

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Model
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)

# Outputs
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)

5. Dimensionality Reduction
Used to reduce the number of input variables in a dataset.

Example: PCA (Principal Component Analysis)

Theory:
PCA transforms the data to a new coordinate system, keeping only the components (axes)
that contribute most to variance.

Functions Used:

 PCA(n_components=2) - Specify number of reduced features
 fit_transform() - Fit PCA and apply transformation

Code:

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load data
X, _ = load_iris(return_X_y=True)

# PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Reduced shape:", X_reduced.shape)

6. Preprocessing
Example: Standardization using StandardScaler

Theory:
StandardScaler standardizes features by removing the mean and scaling to unit variance.

Functions Used:

 StandardScaler() - Initialize
 fit_transform() - Compute and apply standardization

Code:

from sklearn.preprocessing import StandardScaler

# Data
data = [[1, 2], [3, 4], [5, 6]]

# Standardize
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Scaled Data:", scaled_data)

7. Model Selection and Evaluation


Example: Cross-validation

Theory:
Cross-validation is used to evaluate the model’s ability to generalize. It splits data into
training and validation sets multiple times.

Functions Used:

 cross_val_score() - Performs k-fold cross-validation

Code:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_digits

# Load data
X, y = load_digits(return_X_y=True)

# Model
svc = SVC(kernel='linear')

# Cross-validation
scores = cross_val_score(svc, X, y, cv=5)
print("CV Scores:", scores)
print("Average Accuracy:", scores.mean())

Advantages of Scikit-learn

 Simple API for beginners
 Integrates well with NumPy and pandas
 Strong community support
 Wide variety of algorithms
 Cross-validation and hyperparameter tuning built-in (see the sketch below)
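
As a brief sketch of the built-in hyperparameter tuning (reusing the SVC and digits data from the cross-validation example; the parameter grid here is only illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Search a small grid of C values, scoring each by 5-fold cross-validation
grid = GridSearchCV(SVC(kernel='linear'), param_grid={'C': [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)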

Limitations

 Not suitable for deep learning (use TensorFlow or PyTorch instead)
 Not ideal for very large datasets
 Limited support for GPU acceleration

Experiment: 02
K-Means clustering with code and examples.
1. Introduction to K-Means Clustering
K-Means is an unsupervised machine learning algorithm used to partition data into K
clusters, where each data point belongs to the cluster with the nearest mean (centroid). It is
widely used for tasks like customer segmentation, image compression, and anomaly
detection.

How K-Means Works

1. Choose the number of clusters K.
2. Randomly initialize K centroids.
3. Assign each point to the nearest centroid (cluster assignment).
4. Compute new centroids (the mean of the points in each cluster).
5. Repeat steps 3–4 until the centroids stop changing or the maximum number of iterations is reached (a minimal code sketch of this loop follows).
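
A minimal NumPy sketch of these steps (illustrative only; the scikit-learn KMeans used below is the practical tool):

import numpy as np

def kmeans_simple(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (a production implementation must also handle empty clusters)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids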

Important Parameters of KMeans()

Parameter      Description
n_clusters     Number of clusters (K) to form.
init           Initialization method ('k-means++' by default, which improves convergence).
n_init         Number of times the algorithm is run with different centroid seeds.
max_iter       Maximum number of iterations for a single run.
random_state   Ensures reproducibility of results.
tol            Relative tolerance with regard to inertia used to declare convergence.
algorithm      Variant to use: 'lloyd' (the default) or 'elkan'; older scikit-learn versions accepted 'auto' and 'full'.

Common Methods in KMeans

Method             Description
.fit(X)            Fits the KMeans model to the data X.
.predict(X)        Assigns each sample in X to a cluster.
.fit_predict(X)    Combines fit and predict in one step.
.transform(X)      Returns the distance of each point to each centroid.
.fit_transform(X)  Fits, then returns the distances to the centroids.

Attribute          Description
.cluster_centers_  Coordinates of the cluster centers.
.labels_           Label of each point.
.inertia_          Sum of squared distances of samples to their closest cluster center.
.n_iter_           Number of iterations run.

Implementation of code:

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the dataset
dataset = pd.read_csv("Mall_Customers.csv")

# Select features: Annual Income and Spending Score
x = dataset.iloc[:, [3, 4]].values  # or dataset[['Annual Income (k$)', 'Spending Score (1-100)']].values

# Elbow Method to find the optimal number of clusters
wcss_list = []  # Within-cluster sum of squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)  # inertia_ = WCSS

# Plot the Elbow graph
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), wcss_list, marker='o')
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()

# Apply KMeans with the optimal number of clusters (k = 5)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(x)

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(x[y_kmeans == 3, 0], x[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(x[y_kmeans == 4, 0], x[y_kmeans == 4, 1], s=100, c='magenta', label='Cluster 5')

# Plot centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids', edgecolor='black')

# Labels
plt.title('Customer Segments (KMeans Clustering)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid(True)
plt.show()

Experiment: 03

What is Linear Regression?


Linear Regression is a supervised learning algorithm used for predicting a continuous
dependent variable based on one or more independent variables.

🔸 Simple Linear Regression:

Model: y = mx + c

Where:

 y = predicted value (dependent variable)

 x = independent variable

 m = slope (coefficient)

 c = intercept
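
For simple linear regression, m and c have closed-form least-squares solutions: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and c = ȳ − m·x̄. A quick NumPy check with made-up numbers:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Closed-form least-squares estimates of slope and intercept
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
c = y.mean() - m * x.mean()
print("slope:", m, "intercept:", c)  # the same values LinearRegression learns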

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load the dataset
# dataset = pd.read_csv('Salary_Data.csv')
url = "https://raw.githubusercontent.com/AdiPersonalWorks/Random/master/Salary_Data.csv"
dataset = pd.read_csv(url)

print(dataset.head())

X = dataset[['YearsExperience']]  # Independent variable
y = dataset['Salary']             # Dependent variable

# 2. Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Create the Linear Regression model


model = LinearRegression()

# 4. Train the model


model.fit(X_train, y_train)

# 5. Predict the test set results


y_pred = model.predict(X_test)

# 6. Evaluate the model


print("Coefficient (slope):", model.coef_[0])
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score (accuracy):", r2_score(y_test, y_pred))

# 7. Plotting the training data


plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.plot(X_train, model.predict(X_train), color='red', label='Regression line')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()

#8. Plotting the test data


plt.scatter(X_test, y_test, color='green', label='Test data')
plt.plot(X_train, model.predict(X_train), color='red', label='Regression line')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()

Functions and Methods Used:

Function/Method                       Description
LinearRegression()                    Creates a linear regression model.
fit(X, y)                             Trains the model using features X and target y.
predict(X)                            Predicts the output for new input values.
mean_squared_error(y_true, y_pred)    Evaluates the average squared difference between actual and predicted values.
r2_score(y_true, y_pred)              Computes the R^2 (coefficient of determination) score; closer to 1 is better.

Experiment: 04
Binary classification with logistic regression
What is Binary Classification?

Binary classification is used when the output has two classes (e.g., Yes/No, 0/1, Spam/Not
Spam).

Logistic Regression Overview

 It's a classification algorithm, not regression, despite the name.
 It predicts the probability that a given input belongs to a certain class.
 The output is between 0 and 1, produced by the sigmoid function (sketched below).
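
The sigmoid maps any real-valued score z (a weighted sum of the inputs) to a probability: σ(z) = 1 / (1 + e^(−z)). A small illustrative sketch:

import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))   # 0.5  -> the decision boundary
print(sigmoid(3))   # ~0.95 -> confident class 1
print(sigmoid(-3))  # ~0.05 -> confident class 0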

Logistic Regression Key Functions Explained:

Function Description
LogisticRegression() Initializes the model.
fit(X_train, y_train) Trains the model.
predict(X_test) Predicts 0 or 1.
predict_proba(X) Gives probabilities for class 0 and 1.
accuracy_score(y_test, y_pred) Calculates accuracy.
confusion_matrix() Displays TP, TN, FP, FN.
classification_report() Shows precision, recall, f1-score.

Implementation of code:

# Import required libraries


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset


data = pd.read_csv("Titanic-Dataset.csv")

# Select relevant features


features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
data = data[features + ['Survived']]

# Handle missing values (assign the filled values back; chained inplace
# fillna is deprecated in recent pandas)
data['Age'] = data['Age'].fillna(data['Age'].median())

# Encode categorical feature 'Sex'


le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex']) # male=1, female=0

# Define input and target


X = data[features]
y = data['Survived']

# Split into training and testing datasets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train logistic regression model


model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Accuracy and report


print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix Visualization


cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Did not Survive', 'Survived'],
yticklabels=['Did not Survive', 'Survived'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
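
Since the table above lists predict_proba(), a short follow-on (using the model fitted above) shows the probabilities behind the 0/1 predictions:

# Probabilities of [class 0, class 1] for the first five test passengers;
# predict() simply thresholds the class-1 probability at 0.5
print(model.predict_proba(X_test[:5]))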

Experiment: 05
Explain the Decision Tree algorithm and implement it with code.

What is a Decision Tree?


A Decision Tree is a supervised learning algorithm used for classification and regression
problems. It splits the dataset into branches based on feature values, helping to make
predictions by learning simple decision rules inferred from the data.

How It Works:
 The algorithm selects the best feature to split on using criteria such as Gini impurity or Information Gain (Entropy); a small sketch of these measures follows this list.
 It splits the dataset recursively into subsets.
 It continues until a stopping criterion is met (e.g., maximum depth, pure nodes, or too few samples).
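
A small illustrative sketch of the two impurity measures (not scikit-learn's internal code):

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2); equals 0 for a pure node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

def entropy(labels):
    # Entropy: -sum(p_i * log2(p_i)); also 0 for a pure node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

labels = np.array([0, 0, 1, 1, 1, 1])
print("Gini:", gini(labels))       # 1 - (1/3)^2 - (2/3)^2 = 0.444...
print("Entropy:", entropy(labels)) # about 0.918 bits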

Functions/Methods used in this algorithm:

Function/Method Purpose
DecisionTreeClassifier() Create Decision Tree model
.fit() Train the model on dataset
.predict() Predict using trained model
train_test_split() Split data into train and test sets
accuracy_score() Calculate prediction accuracy
classification_report() Show precision, recall, F1-score
confusion_matrix() Matrix showing TP, TN, FP, FN
sns.heatmap() Visualize confusion matrix
plot_tree() Visual representation of decision rules
LabelEncoder() Convert categorical to numeric (e.g., male → 1)

Implementation of code:

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
data = pd.read_csv("Titanic-Dataset.csv")
print(data.head())  # preview the first rows

# Select features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
data = data[features + ['Survived']]

# Handle missing values (assign the filled values back)
data['Age'] = data['Age'].fillna(data['Age'].median())

# Encode categorical variables
le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex'])  # male = 1, female = 0

# Split data into input and output
X = data[features]
y = data['Survived']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Decision Tree model
model = DecisionTreeClassifier(criterion='entropy', random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens',
            xticklabels=['Did not Survive', 'Survived'],
            yticklabels=['Did not Survive', 'Survived'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Decision Tree')
plt.show()

# Plot the Decision Tree
plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=features, class_names=['Not Survived', 'Survived'],
          filled=True, rounded=True, fontsize=12)
plt.title('Decision Tree Visualization')
plt.show()
