Model Evaluation Report for the UCI SECOM Dataset


1. Introduction
This report details the data preprocessing, model training, and evaluation using a Voting Classifier on the UCI SECOM dataset. The objective was to classify the quality of manufacturing processes based on sensor data, predicting pass or fail outcomes.

2. Data Preprocessing
Dataset Overview
The UCI SECOM dataset contains measurements from various sensors used in a manufacturing process, along with a binary target variable indicating whether a product passed or failed quality control.
1. Importing Necessary Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.decomposition import PCA
import warnings
2. Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')
This step is specific to Google Colab: it mounts the user's Google Drive so the dataset stored there can be read.

3. Loading the Dataset


df = pd.read_csv('/content/drive/MyDrive/uci-secom.csv')
x = df.drop(columns=['Pass/Fail'])
y = df['Pass/Fail'].replace({-1: 0})
• The dataset is read from the file stored on Google Drive.
• The target variable is Pass/Fail; the remaining columns are used as features. The value -1 in the target variable is replaced with 0 to denote the negative class.
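Before splitting, it can be helpful to confirm the class imbalance that motivates the SMOTE step later in the pipeline. A minimal sketch, assuming the y defined above:
# Count how many samples fall into each class of the Pass/Fail target
print(y.value_counts())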
4. Train-Test Split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
The dataset is split into training and testing sets, with 20% of the data reserved for testing. stratify=y ensures the same class distribution in both the training and test sets.
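A quick way to verify that stratification preserved the class proportions is to compare the normalized counts in both splits; a minimal sketch:
# Class proportions should be nearly identical in the training and test sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))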

5. Handling Missing Values


threshold = len(x_train) * 0.2
train_cleaned = x_train.dropna(axis=1, thresh=threshold)
• Columns with fewer than 20% non-missing values (i.e., more than 80% missing) are dropped from the training set.
train_cleaned = train_cleaned.select_dtypes(include=[np.number])
train_filled = train_cleaned.fillna(train_cleaned.mean())
• Non-numeric columns (such as timestamps) are removed, and missing values in the remaining numeric columns are filled with the column means.
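The held-out test set is not cleaned in the code above. If needed, the same column selection and training-set means could be reused on it so that no test statistics leak into the imputation. A minimal sketch (test_cleaned and test_filled are hypothetical names, not part of the original code):
# Keep only the columns retained for training, then impute with the training means
test_cleaned = x_test[train_cleaned.columns]
test_filled = test_cleaned.fillna(train_cleaned.mean())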

6. Handling Class Imbalance using SMOTE

smote = SMOTE(sampling_strategy='minority', random_state=42)
x_resampled, y_resampled = smote.fit_resample(train_filled, y_train)
• SMOTE (Synthetic Minority Over-sampling Technique) is used to balance the classes by generating synthetic examples for the minority class.
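To confirm the effect of resampling, the class counts can be compared before and after SMOTE; a minimal sketch:
from collections import Counter
# Both classes should have the same count after SMOTE
print("Before SMOTE:", Counter(y_train))
print("After SMOTE:", Counter(y_resampled))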

7. Feature Standardization
scaler = StandardScaler()
x_resampled_scaled = scaler.fit_transform(x_resampled)
• The features are standardized (scaled to zero mean and unit standard deviation), which is essential for models like SVM to perform well.
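Note that the scaler is fitted only on the resampled training features. To scale any held-out data consistently, the already-fitted scaler would be reused with transform rather than fit_transform; a minimal sketch, assuming the hypothetical test_filled from step 5:
# Reuse the fitted scaler so held-out data is scaled with training statistics
x_test_scaled = scaler.transform(test_filled)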

8. Dimensionality Reduction with PCA


pca = PCA(n_components=0.90)
x_resampled_pca = pca.fit_transform(x_resampled_scaled)
• PCA (Principal Component Analysis) is used to reduce the dimensionality of the dataset while retaining 90% of the variance. This step helps remove noise and redundancy in the data.
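The fitted PCA object can be inspected to see how many components were needed to reach the 90% variance target; a minimal sketch:
# Number of principal components retained and the variance they explain
print("Components retained:", pca.n_components_)
print("Cumulative variance explained:", pca.explained_variance_ratio_.sum())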

9. Second Train-Test Split


x_train_pca, x_val_pca, y_train_pca, y_val_pca = train_test_split(x_resampled_pca, y_resampled, test_size=0.2, random_state=45, stratify=y_resampled)
• After resampling and PCA, the data is split again into training and validation sets so the model can be evaluated on data it was not fitted on.

10. Building the Voting Classifier


log_clf = LogisticRegression(max_iter=1000)
rf_clf = RandomForestClassifier()
svc_clf = SVC(probability=True)
• Logistic Regression, Random Forest, and Support Vector Classifier are defined as the base models. The probability=True parameter in SVC allows it to return probability estimates for class predictions, which is required for soft voting.
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rf_clf), ('svc', svc_clf)],
    voting='soft'
)
• A Voting Classifier is created that combines these models. Soft voting is used, meaning the classifier predicts based on the average predicted probabilities of each class from the base models.
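For intuition, the soft-voting decision can be reproduced manually by averaging the per-class probabilities of the fitted base estimators. A minimal sketch, assuming voting_clf has already been fitted as in step 11 below:
# Average the predicted probabilities of the fitted base models
avg_proba = np.mean([clf.predict_proba(x_val_pca) for clf in voting_clf.estimators_], axis=0)
# Taking the most probable class reproduces voting_clf.predict(x_val_pca)
manual_predictions = voting_clf.classes_[np.argmax(avg_proba, axis=1)]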

11. Model Training and Evaluation


voting_clf.fit(x_train_pca, y_train_pca)
val_predictions = voting_clf.predict(x_val_pca)
• The model is trained on the PCA-transformed data and then used to predict the validation set.
print("Accuracy on Validation Set:", accuracy_score(y_val_pca, val_predictions))
print("Classification Report on Validation Set:")
print(classification_report(y_val_pca, val_predictions))
• The accuracy and classification report (precision, recall, F1-score) are printed to evaluate model performance.

12. Confusion Matrix


cm = confusion_matrix(y_val_pca, val_predictions)
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
• A confusion matrix is plotted to visually analyze the model's performance in terms of true positives, true negatives, false positives, and false negatives.
OUTPUT:

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Accuracy on Validation Set: 0.9978632478632479

Classification Report on Validation Set:
              precision    recall  f1-score   support
     Class 0      1.00      0.996     0.998      234
     Class 1      0.997     1.00      0.998      234

    accuracy                          0.998      468
   macro avg      0.998     0.998     0.998      468
weighted avg      0.998     0.998     0.998      468

Confusion Matrix (figure)

Conclusion
The code takes the SECOM dataset, handles missing data and class imbalance, and applies PCA for dimensionality reduction. Three classifiers (Logistic Regression, Random Forest, and SVC) are combined using a Voting Classifier to make the final predictions. The model is evaluated using accuracy, a confusion matrix, and a classification report.
