Purnima Gosain
921275
Assignment 1
1. Data Preparation and Preprocessing Steps
Dataset Overview
The dataset for this project is the Credit Card Fraud Detection dataset from Kaggle. It
consists of 284,807 transactions, of which only 0.172% are fraudulent. Each transaction has 30
numerical features: 'Time', 'Amount', and 28 anonymized PCA components (V1-V28).
The target variable ('Class') is 1 for fraudulent transactions and 0 for legitimate transactions.
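To make the imbalance concrete, the short calculation below turns the stated percentage into an approximate count. It is a rough sketch: the count is derived from the rounded 0.172% figure, so it is an estimate rather than the exact number in the dataset, and the variable names are illustrative only.

```python
total_transactions = 284_807
fraud_rate = 0.172 / 100  # 0.172% as stated in the dataset overview

# Approximate class counts implied by the rounded percentage
approx_fraud = round(total_transactions * fraud_rate)
approx_legit = total_transactions - approx_fraud

# Accuracy of a trivial model that labels every transaction as legitimate
baseline_accuracy = approx_legit / total_transactions

print(approx_fraud)                  # roughly 490 fraudulent transactions
print(f"{baseline_accuracy:.2%}")    # accuracy of the "always legitimate" baseline
```

Because a model that never predicts fraud already scores above 99% accuracy, accuracy alone says little here, which is why precision, recall, F1 and ROC-AUC are the more informative metrics for this dataset.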
Preprocessing Steps
Managing Missing Values: The dataset contained no missing values, so no imputation was needed.
Feature Scaling:
To improve model performance, 'Amount' and 'Time' were standardized using StandardScaler.
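Standardization rescales a feature to zero mean and unit variance. The sketch below reproduces that calculation with the Python standard library only; it is an illustration of the idea rather than the code used in the project, and the sample amounts are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (divides by n)
    if sigma == 0:
        # A constant feature carries no spread; map every value to 0
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Hypothetical transaction amounts, for illustration only
amounts = [2.69, 149.62, 378.66, 123.50, 69.99]
scaled_amounts = standardize(amounts)
print(scaled_amounts)
```

In practice the mean and standard deviation should be estimated on the training data and then reused to transform the test data, so that no information from the test set leaks into training.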
Class Imbalance Handling:
Synthetic Minority Over-sampling Technique (SMOTE) was used to handle the class imbalance,
since fraudulent transactions are heavily underrepresented in the dataset.
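SMOTE creates new minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The function below is a simplified sketch of that core idea, not the full algorithm and not any library's implementation; the function name, parameters and sample points are hypothetical.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    Simplified illustration: needs at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical 2-D minority (fraud) samples, for illustration only
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like_oversample(minority, n_new=3)
print(synthetic)
```

Oversampling is normally applied to the training portion only, so that the test set keeps the original class distribution.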
Data Splitting:
The dataset was split into 80% training and 20% testing.
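With so few fraud cases, a purely random split can leave the test set with very few frauds. The sketch below shows an 80/20 split that preserves the class ratio in both parts. Stratification is an assumption made for this example (the report only states the 80/20 ratio), and the function name and toy labels are hypothetical.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with the class ratio kept in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 95 legitimate (0) and 5 fraudulent (1) transactions
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```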
2. Model Performance Metrics
Logistic Regression Model
The trained Logistic Regression model was assessed using standard classification
performance metrics:
Precision: The fraction of predicted frauds that were actually fraudulent.
Recall: The fraction of actual fraud cases that were correctly identified.
F1 Score: The harmonic mean of precision and recall.
ROC-AUC Score: Summarizes overall classification performance across all decision thresholds.
Precision: 0.92
Recall: 0.85
F1 Score: 0.88
ROC-AUC: 0.98
These scores suggest that the model identifies most fraudulent transactions while producing
relatively few false positives.
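The metric definitions can be written directly in terms of true positives (TP), false positives (FP) and false negatives (FN). The sketch below does so and also checks that the reported F1 score is consistent with the reported precision and recall. The confusion counts in the example are hypothetical and are not taken from the project's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(precision, recall, f1)

# Consistency check on the reported values: precision 0.92, recall 0.85
reported_f1 = f1_from(0.92, 0.85)
print(round(reported_f1, 2))
```

The harmonic mean of 0.92 and 0.85 rounds to 0.88, which agrees with the F1 score reported above.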
3. Key Features of the Front-End Application
An interactive web application built with Streamlit was created to accept transaction details and
predict whether a transaction is fraudulent.
Key Features
User Inputs:
Users enter values for 'Time', 'Amount', and the anonymized PCA features (V1-V28).
Fraud Prediction in Real Time: The model labels the transaction as Fraud (Class 1) or
Legitimate (Class 0).
Probability Display: Alongside the predicted label, the app shows the probabilities of the
transaction being fraudulent and legitimate.
Input Validation: The app validates the entered values before a prediction is made.
Easy to Use Interface: A simple UI with buttons and color-coded messages (green for legitimate,
red for fraudulent).
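The prediction and probability display described above can be sketched as follows. A logistic regression model maps a weighted sum of the inputs through the sigmoid function to obtain a fraud probability, and the legitimate probability is its complement. The weights, bias, 0.5 threshold and function names are hypothetical and chosen for illustration; they are not the trained model's values or the app's actual code.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, weights, bias):
    """Return the probabilities of both classes for one transaction."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p_fraud = sigmoid(z)
    return {"legitimate": 1.0 - p_fraud, "fraud": p_fraud}

def format_result(proba, threshold=0.5):
    """Map probabilities to a label and a message color (green/red)."""
    label = "Fraud" if proba["fraud"] >= threshold else "Legitimate"
    color = "red" if label == "Fraud" else "green"
    return label, color

# Hypothetical two-feature example
proba = predict_proba(features=[1.0, 2.0], weights=[0.8, -0.5], bias=-0.2)
label, color = format_result(proba)
print(label, color, proba)
```

In a Streamlit app, the returned label and color would typically drive which message component is shown to the user.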
4. Conclusion
This project delivers a machine learning pipeline for fraud detection, consisting of a trained
Logistic Regression model and an interactive web app that can be used for real-time predictions.
Possible improvements include:
Experimenting with more complex models (Random Forest, XGBoost, Neural Networks, etc.).
Adding real-time transaction monitoring in a production system.
Overall, the project offers a practical approach to efficient and accurate detection of
fraudulent credit card transactions.