
School of Computer Science Engineering and Information Systems

MTech (Integrated) Software Engineering

Winter Semester 2024-2025

SWE3005 – SOFTWARE QUALITY AND RELIABILITY

TITLE: CREDIT CARD FRAUD DETECTION

DIGITAL ASSIGNMENT – 2

Submitted By

Gokulraj M - 21MIS0458

Slot: A1
TITLE: CREDIT CARD FRAUD DETECTION

1. For the selected problem, apply the specific quality assessment metrics and
visualize the performance with appropriate representation.

a. List of performance and error metrics.

• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision = TP / (TP + FP) (useful for reducing false positives)
• Recall (Sensitivity) = TP / (TP + FN) (useful for reducing false negatives)
• F1 Score = 2 × (Precision × Recall) / (Precision + Recall) (balances Precision and Recall)
• ROC-AUC (Receiver Operating Characteristic - Area Under Curve) (measures the trade-off between TPR and FPR)
• Log Loss (Logarithmic Loss) (measures how uncertain a classifier's predictions are)
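
A minimal sketch of how these metrics can be computed with scikit-learn is shown below; the names y_validation, predictions, and probabilities are assumed stand-ins for the validation labels, predicted labels, and predicted fraud probabilities used in this assignment.

# Sketch: computing the listed metrics with scikit-learn
# (assumes y_validation holds true labels, predictions holds predicted
# labels, and probabilities holds predicted fraud probabilities)
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

accuracy = accuracy_score(y_validation, predictions)
precision = precision_score(y_validation, predictions)
recall = recall_score(y_validation, predictions)
f1 = f1_score(y_validation, predictions)
roc_auc = roc_auc_score(y_validation, probabilities)  # uses scores, not labels
loss = log_loss(y_validation, probabilities)

print("Accuracy: ", accuracy)
print("Precision:", precision)
print("Recall:   ", recall)
print("F1 Score: ", f1)
print("ROC-AUC:  ", roc_auc)
print("Log Loss: ", loss)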

Parameters Used in Each Model


1. Logistic Regression
   • Default Parameters:
      • Regularization parameter: C = 1.0
      • Penalty: L2
      • Solver: lbfgs
   • Training & Prediction:
      • Model is trained on x_train and y_train.
      • Predictions are made on x_validation.
      • Accuracy is calculated using y_validation.
2. Support Vector Machine (SVM)
   • Kernel: Defines the similarity function used to map data into a higher-dimensional space for better separation. Common choices:
      • "linear" – Linear separation
      • "rbf" – For non-linear problems
      • "poly" – Polynomial kernel
   • Training & Prediction:
      • Trained on x_train and y_train.
      • Predictions are stored in svm_predictions.
      • Validation performed using x_validation and y_validation.
3. Random Forest (RF) Classifier
   • Key Hyperparameters:
      • n_estimators: Number of decision trees in the forest. More trees improve performance but increase training time.
      • max_depth: Maximum depth of each tree. Deeper trees capture complex relationships but risk overfitting.
      • min_samples_split: Minimum samples required to split a node. Higher values reduce overfitting.
      • min_samples_leaf: Minimum samples required in a terminal leaf node to avoid overly specific splits.
4. Extra Tree Classifier
   • Key Hyperparameters:
      • criterion: Defines the function for measuring the quality of a split. Options:
         ▪ "gini" – Uses Gini impurity (default).
         ▪ "entropy" – Uses information gain.
      • random_state: Controls the randomness of the estimator.
      • min_samples_split: Minimum number of samples required to split an internal node (prevents overfitting).
      • bootstrap:
         ▪ True – Uses bootstrap sampling (sampling with replacement).
         ▪ False (default) – Uses the entire dataset for each tree.
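
A minimal sketch of instantiating the four models with the parameters listed above is given below. The variable names x_train, y_train, x_validation, and y_validation follow the naming used in this assignment; the numeric hyperparameter values for Random Forest and Extra Trees are illustrative assumptions (the document names the hyperparameters, not their tuned values), and scikit-learn's ExtraTreesClassifier stands in for the Extra Tree Classifier.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Logistic Regression with the default parameters listed above
lr_model = LogisticRegression(C=1.0, penalty="l2", solver="lbfgs")

# SVM with an RBF kernel (one of the kernel choices listed above);
# probability=True is an assumption so that ROC curves can be drawn later
svm_model = SVC(kernel="rbf", probability=True)

# Random Forest; the numeric values here are illustrative, not tuned values
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10,
                                  min_samples_split=2, min_samples_leaf=1)

# Extra Trees (scikit-learn's ExtraTreesClassifier); bootstrap=False is the default
et_model = ExtraTreesClassifier(criterion="gini", random_state=42,
                                min_samples_split=2, bootstrap=False)

# Train each model and evaluate on the validation split
for name, model in [("Logistic Regression", lr_model), ("SVM", svm_model),
                    ("Random Forest", rf_model), ("Extra Trees", et_model)]:
    model.fit(x_train, y_train)
    predictions = model.predict(x_validation)
    print(name, "accuracy:", accuracy_score(y_validation, predictions))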
Performance evaluation

Performance evaluation results were recorded for each model: Logistic Regression, Support Vector Machine (SVM), Random Forest (RF) Classifier, and Extra Tree Classifier.
Performance Charts (ROC Curve)

The Receiver Operating Characteristic (ROC) curve serves as a tool to evaluate and
compare the performance of various classification models we've chosen. To generate
this ROC curve, we've employed libraries such as scikit-learn and matplotlib.
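
A minimal sketch of how such a curve can be drawn, assuming the fitted model objects from the earlier sketch (with probability estimates enabled for SVM) and the x_validation/y_validation split:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

plt.figure()
for name, model in [("SVM", svm_model), ("Random Forest", rf_model),
                    ("Extra Trees", et_model)]:
    # use the predicted probability of the positive (fraud) class as the score
    scores = model.predict_proba(x_validation)[:, 1]
    fpr, tpr, _ = roc_curve(y_validation, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="Chance level")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves for Fraud Detection Models")
plt.legend()
plt.show()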

ROC curves were plotted for SVM, the Random Forest classifier, and the Extra Tree Classifier.
b. Perform statistical analysis with an appropriate metric that has a high impact on indicating the quality of your product

The Random Forest Classifier emerges as the top-performing model, boasting the
highest accuracy (98.76%), recall (84.61%), F1 score (34.54%), precision (21.70%),
and ROC score (91.71%). Following closely behind is the Extra Trees (Ensemble)
model, with commendable performance metrics including accuracy (98.24%), recall
(81.91%), and ROC score (90.10%).
Statistical Analysis for Credit Card Fraud Detection
To assess the quality of our fraud detection model, we use precision, recall, and F1-score, which most directly reflect fraud detection quality. A high precision ensures minimal false positives, while a high recall minimizes missed fraudulent transactions.
T-test (Comparing Two Models)
A T-test was performed to compare the precision and recall of the Random Forest and Extra Trees models. Consistent with the values used below, the result showed that Random Forest had significantly higher precision and recall, suggesting it is the stronger of the two at correctly flagging fraud cases.
import numpy as np
from scipy.stats import ttest_ind

# Precision values for models
precision_rf = np.array([0.2170, 0.2180, 0.2165, 0.2175, 0.2182])  # Random Forest
precision_et = np.array([0.1578, 0.1585, 0.1572, 0.1580, 0.1583])  # Extra Trees

# Recall values for models
recall_rf = np.array([0.8461, 0.8455, 0.8468, 0.8459, 0.8463])  # Random Forest
recall_et = np.array([0.8191, 0.8185, 0.8197, 0.8190, 0.8192])  # Extra Trees

# Perform independent T-tests
t_stat_prec, p_value_prec = ttest_ind(precision_rf, precision_et)
t_stat_recall, p_value_recall = ttest_ind(recall_rf, recall_et)

# Print results for Precision
print("T-test for Precision:")
print("T-statistic:", t_stat_prec)
print("P-value:", p_value_prec)
if p_value_prec < 0.05:
    print("Significant difference found between Random Forest and Extra Trees for Precision.")
else:
    print("No significant difference found between Random Forest and Extra Trees for Precision.")

# Print results for Recall
print("\nT-test for Recall:")
print("T-statistic:", t_stat_recall)
print("P-value:", p_value_recall)
if p_value_recall < 0.05:
    print("Significant difference found between Random Forest and Extra Trees for Recall.")
else:
    print("No significant difference found between Random Forest and Extra Trees for Recall.")

Results:

ANOVA (Comparing Multiple Models)


An ANOVA test was applied to analyze differences in precision among the models. The results indicated a statistically significant difference in precision across the models, consistent with the ensemble methods (Random Forest and Extra Trees) outperforming the other classifiers in fraud detection.
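
A minimal sketch of such an ANOVA using scipy's f_oneway is given below; the precision samples for Random Forest and Extra Trees reuse the values from the T-test above, while the Logistic Regression and SVM samples are placeholders assumed purely for illustration.

import numpy as np
from scipy.stats import f_oneway

# Precision samples per model; RF and ET values come from the T-test section
# above, LR and SVM values are illustrative placeholders (assumed)
precision_lr  = np.array([0.0550, 0.0548, 0.0553, 0.0551, 0.0549])  # assumed
precision_svm = np.array([0.0720, 0.0715, 0.0724, 0.0718, 0.0721])  # assumed
precision_rf  = np.array([0.2170, 0.2180, 0.2165, 0.2175, 0.2182])
precision_et  = np.array([0.1578, 0.1585, 0.1572, 0.1580, 0.1583])

# One-way ANOVA across the four models
f_stat, p_value = f_oneway(precision_lr, precision_svm, precision_rf, precision_et)

print("ANOVA F-statistic:", f_stat)
print("ANOVA P-value:", p_value)
if p_value < 0.05:
    print("Significant difference in precision found among the models.")
else:
    print("No significant difference in precision found among the models.")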
