Machine Learning Lab (UGCS 213)
Experiment: 01
1. Introduction to Scikit-learn
Scikit-learn is a popular machine learning library in Python. It provides a wide range of tools
for building machine learning models and preprocessing data. It is built on top of NumPy,
SciPy, and matplotlib, making it efficient and easy to integrate with other Python data tools.
Installation:
pip install scikit-learn
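To verify the installation, import the library (its import name is sklearn) and print the version:
import sklearn
print(sklearn.__version__)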
2. Classification
Classification is a supervised learning technique where the output is a label or category.
Theory:
KNN classifies a data point based on how its neighbors are classified. It stores all available
cases and classifies new ones based on a similarity measure (e.g., distance functions).
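For intuition, the most common distance function is the Euclidean distance. A small sketch of computing it with NumPy (the point values are illustrative):
import numpy as np
a = np.array([5.1, 3.5])
b = np.array([4.9, 3.0])
# Euclidean distance = square root of the sum of squared differences
print(np.linalg.norm(a - b))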
Functions Used:
load_iris() - Load the Iris dataset
train_test_split() - Split data into train and test sets
KNeighborsClassifier() - Initialize the KNN model
fit() / predict() - Train the model and predict labels
accuracy_score() - Calculate prediction accuracy
Code:
# Imports
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load data
X, y = load_iris(return_X_y=True)
# Split (random_state fixes the split for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
3. Regression
Regression is used when the target variable is continuous.
Theory:
Linear regression fits a line (y = mx + b) to predict a continuous output based on input
features.
Functions Used:
LinearRegression() - Initialize the linear regression model
fit() - Fit the line to the training data
predict() - Predict the output for new inputs
Code:
# Imports
import numpy as np
from sklearn.linear_model import LinearRegression
# Sample data (follows y = 3x exactly)
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 6, 9, 12])
# Model
model = LinearRegression()
model.fit(X, y)
# Predict (expected output: 15, since y = 3x)
y_pred = model.predict([[5]])
print("Predicted value:", y_pred)
4. Clustering
Clustering is an unsupervised learning method that groups data based on similarity.
Theory:
K-Means groups data into k clusters by minimizing the variance within each cluster.
Functions Used:
KMeans(n_clusters, random_state) - Initialize the K-Means model
fit() - Compute the clustering
.cluster_centers_ / .labels_ - Inspect the centers and cluster assignments
Code:
# Imports
import numpy as np
from sklearn.cluster import KMeans
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
# Model
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
kmeans.fit(X)
# Outputs
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
5. Dimensionality Reduction
Dimensionality reduction is used to reduce the number of input variables (features) in a dataset.
Theory:
PCA transforms the data to a new coordinate system, keeping only the components (axes)
that contribute most to variance.
Functions Used:
PCA(n_components) - Initialize PCA with the desired number of components
fit_transform() - Fit the model and apply the dimensionality reduction
Code:
# Imports
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
# Load data
X, _ = load_iris(return_X_y=True)
# PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Reduced shape:", X_reduced.shape)
6. Preprocessing
Example: Standardization using StandardScaler
Theory:
StandardScaler standardizes features by removing the mean and scaling to unit variance:
for each feature, the transformed value is z = (x - mean) / standard_deviation.
Functions Used:
StandardScaler() - Initialize
fit_transform() - Compute and apply standardization
Code:
# Import
from sklearn.preprocessing import StandardScaler
# Data
data = [[1, 2], [3, 4], [5, 6]]
# Standardize
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Scaled Data:", scaled_data)
7. Cross-Validation
Theory:
Cross-validation is used to evaluate the model’s ability to generalize. It splits data into
training and validation sets multiple times.
Functions Used:
SVC(kernel) - Initialize a support vector classifier
cross_val_score(model, X, y, cv) - Run k-fold cross-validation and return the scores
Code:
# Imports
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
# Load data
X, y = load_digits(return_X_y=True)
# Model
svc = SVC(kernel='linear')
# Cross-validation
scores = cross_val_score(svc, X, y, cv=5)
print("CV Scores:", scores)
Advantages of Scikit-learn
- Consistent, simple API (fit / predict / transform) across all models.
- Broad coverage of classification, regression, clustering, and preprocessing tools.
- Strong documentation and smooth integration with NumPy, SciPy, pandas, and matplotlib.
Limitations
- No built-in deep learning support or GPU acceleration.
- Designed for in-memory datasets, so very large data requires other tools.
Experiment: 02
K-Means clustering with code and examples.
1. Introduction to K-Means Clustering
K-Means is an unsupervised machine learning algorithm used to partition data into K
clusters, where each data point belongs to the cluster with the nearest mean (centroid). It is
widely used for tasks like customer segmentation, image compression, and anomaly
detection.
Attribute Description
.cluster_centers_ Coordinates of cluster centers.
.labels_ Labels of each point.
.inertia_ Sum of squared distances of samples to their closest cluster center.
.n_iter_ Number of iterations run.
Implementation of code (the income/spending-score feature columns and k = 5 follow the standard Mall Customers example and are assumptions here):
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load dataset
dataset = pd.read_csv("Mall_Customers.csv")
# Assumed feature columns: Annual Income and Spending Score
x = dataset.iloc[:, [3, 4]].values

# Elbow method: WCSS (within-cluster sum of squares) for k = 1 to 10
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42, n_init=10)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()

# Fit the final model (k = 5 is the usual elbow for this dataset)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(x)

plt.figure(figsize=(8, 6))
# Clusters 1 to 5
for i in range(5):
    plt.scatter(x[y_kmeans == i, 0], x[y_kmeans == i, 1], label=f'Cluster {i + 1}')
# Plot centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='black', marker='X', label='Centroids')
# Labels
plt.title('Customer Segments')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.grid(True)
plt.show()
Experiment: 03
Simple Linear Regression with code.
Model: y = mx + c
Where:
y = dependent variable (the predicted output)
x = independent variable
m = slope (coefficient)
c = intercept
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset ("data.csv" is a placeholder filename; use the lab's CSV)
dataset = pd.read_csv("data.csv")
print(dataset.head())
Function/Method Description
LinearRegression() Creates a linear regression model.
fit(X, y) Trains the model using features X and target y.
predict(X) Predicts the output for new input values.
mean_squared_error(y_true, y_pred) Evaluates the average squared difference between actual and predicted values.
r2_score(y_true, y_pred) Gives the R² (coefficient of determination) score; closer to 1 is better.
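Since the notes stop after loading the data, here is a minimal end-to-end sketch of the workflow the table describes, using small synthetic data in place of the original CSV (an assumption):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
X = np.arange(1, 21).reshape(-1, 1).astype(float)
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, size=20)

# Split, train, predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate: the fitted slope and intercept should be close to 2 and 1
print("Slope (m):", model.coef_[0])
print("Intercept (c):", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))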
Experiment: 04
Binary classification with logistic regression.
What is Binary Classification?
Binary classification is used when the output has two classes (e.g., Yes/No, 0/1, Spam/Not
Spam).
Function Description
LogisticRegression() Initializes the model.
fit(X_train, y_train) Trains the model.
predict(X_test) Predicts 0 or 1.
predict_proba(X) Gives probabilities for class 0 and 1.
accuracy_score(y_test, y_pred) Calculates accuracy.
confusion_matrix() Displays TP, TN, FP, FN.
classification_report() Shows precision, recall, f1-score.
Implementation of code:
# Make predictions
y_pred = model.predict(X_test)
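Only the prediction step survives in the notes above; below is a minimal end-to-end sketch. The built-in breast cancer dataset and the scaling step are assumptions, chosen to make the example self-contained:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load a built-in binary dataset (any two-class data works the same way)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features so the solver converges quickly
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict classes and probabilities
y_pred = model.predict(X_test)
print("Class probabilities (first row):", model.predict_proba(X_test[:1]))

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))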
Experiment: 05
Explain Decision Tree and implement with code.
How It Works:
1. The algorithm selects the best feature to split on using criteria like Gini impurity or Information Gain (Entropy); a small worked example follows this list.
2. It splits the dataset recursively into subsets.
3. It continues until it meets a stopping criterion (like max depth, pure nodes, or few samples).
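As a worked example of the Gini criterion: for a node with class proportions p and 1 - p, Gini impurity is 1 - p^2 - (1 - p)^2, so a pure node scores 0 and a 50/50 node scores 0.5. A quick check in Python (the class counts are illustrative):
def gini(counts):
    # Gini impurity: 1 minus the sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))  # pure node -> 0.0
print(gini([5, 5]))   # perfectly mixed -> 0.5
print(gini([8, 2]))   # mostly one class -> 0.32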
Function/Method Purpose
DecisionTreeClassifier() Create Decision Tree model
.fit() Train the model on dataset
.predict() Predict using trained model
train_test_split() Split data into train and test sets
accuracy_score() Calculate prediction accuracy
classification_report() Show precision, recall, F1-score
confusion_matrix() Matrix showing TP, TN, FP, FN
sns.heatmap() Visualize confusion matrix
plot_tree() Visual representation of decision rules
LabelEncoder() Convert categorical to numeric (e.g., male → 1)
Implementation of code (the feature list below is an assumed subset of the Titanic columns):
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
data = pd.read_csv("Titanic-Dataset.csv")
print(data.head())

# Select features (assumed subset of the available columns)
features = ['Pclass', 'Sex', 'Age', 'Fare']

# Fill missing ages with the median
data['Age'] = data['Age'].fillna(data['Age'].median())

# Convert categorical to numeric (e.g., male -> 1)
le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex'])

X = data[features]
y = data['Survived']

# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model (max_depth limits tree size so the plotted tree stays readable)
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Visualize the decision rules
plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=features, class_names=['Died', 'Survived'], filled=True)
plt.show()