
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore.

Digital Assignment -1 & 2


Foundations of Data Science
Submitted by: Mansi Kalawar
Registration: 21BCE3954

Submitted to: Aunradha J Ma’am

Topic:
Fraud Detection in Banking: A Data-Driven Analytical Perspective
1. Data Analytics Life Cycle

1.1 Data Discovery

Data Sources:
● Transaction Data: The dataset contains transactions made by credit cards in September 2013 by
European cardholders. The data consists of 284,807 transactions over two days, with 492 identified
as fraudulent (about 0.172%).
● Features: The dataset has 30 features, including:
○ Time: The number of seconds elapsed between this transaction and the first transaction in the
dataset.
○ V1 to V28: These are the result of a PCA (Principal Component Analysis) transformation, so the dataset does not expose the original features due to confidentiality issues.
○ Amount: The transaction amount.
○ Class: The target variable, where 1 indicates fraud and 0 indicates no fraud.

Aggregated Data Sources:


● Historical Transaction Data: Aggregating multiple periods (days, weeks, months) of transaction data
to identify patterns.
● Customer Profiles: Demographics, behaviour, and transaction history of customers.
● External Threat Intelligence: Data on known fraudulent schemes, compromised card lists, etc.

Raw Data Review:
● Time-Series Nature: Transactions are sequential, and patterns over time can indicate fraud.
● Imbalanced Data: There is a significant imbalance between fraudulent and non-fraudulent
transactions. Techniques such as oversampling (SMOTE) or undersampling may be required.
● PCA-transformed Features: The data is already processed to anonymize sensitive details, which
means further feature engineering will be limited.

Required Data Structures and Resources:


● Data Storage: A scalable database capable of handling high-velocity, high-volume transaction data.
Preferably using a distributed system like HDFS, or cloud-based storage such as AWS S3.
● Processing Framework: A big data processing framework like Apache Spark or Hadoop would be
suitable for handling large datasets and real-time processing.
● Data Pipeline: ETL (Extract, Transform, Load) processes to clean, transform, and load data for
analysis. Apache NiFi or AWS Glue can be used to automate data flows.
● Data Schema: Relational schema for structured transaction data, potentially using a star schema for
better analytics.
● Machine Learning Infrastructure: GPU/CPU clusters for training models. Libraries like TensorFlow,
PyTorch, or Scikit-learn for implementing detection algorithms.

Scope of Data Infrastructure Needed:


● Real-time Streaming: For detecting fraud as transactions occur, using tools like Apache Kafka for
streaming data and Apache Flink or Spark Streaming for real-time processing.
● Batch Processing: For retrospective analysis and model training, leveraging batch processing
frameworks.
● Data Lake: To store raw, semi-structured, and structured data, allowing for flexible data analysis and
machine learning model training.
● Monitoring and Alerting Systems: To monitor transactions and trigger alerts for suspected fraud
cases using rule-based systems or predictive models.

1.2 Data Preparation


This step involves cleaning, transforming, and organising the data to make it ready for analysis and
model building.

1. Data Collection and Understanding


● Data Source: The dataset is a collection of anonymized credit card transactions. It includes both
legitimate and fraudulent transactions.
● Features: 30 features, including Time, Amount, and Class (target variable indicating fraud or not).
2. Data Cleaning
● Handling Missing Values:
○ Check for missing values in the dataset. Although this specific dataset does not have missing values, if they were present, strategies like mean/mode/median imputation or dropping rows/columns might be necessary.
● Outlier Detection:
○ Analyze features like Amount for outliers using box plots or Z-scores. Outliers in the transaction amount could be indicative of fraud but might need special handling depending on the analysis (see the sketch after this list).
● Data Consistency:
○ Ensure that the Time feature is consistent and correctly reflects the transaction order. Although this dataset has no missing time data, in practice, any missing or incorrect timestamps should be corrected or imputed.
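
As an illustration of the cleaning checks above, the following is a minimal sketch, assuming the Kaggle creditcard.csv file used in the code section at the end of this report; the Z-score cutoff of 3 is illustrative only:

import pandas as pd
import numpy as np

# Load the credit card dataset (file path is an assumption; adjust as needed)
data = pd.read_csv('creditcard.csv')

# Count missing values per column (expected to be zero for this dataset)
print(data.isnull().sum())

# Flag potential Amount outliers with a simple Z-score rule
z_scores = (data['Amount'] - data['Amount'].mean()) / data['Amount'].std()
outliers = data[np.abs(z_scores) > 3]
print(f"Potential Amount outliers: {len(outliers)} of {len(data)} transactions")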
3. Data Transformation
● Feature Scaling:
○ Standardise or normalise the Amount feature because the dataset might have varying scales. For instance, apply a Min-Max Scaler or Standard Scaler to normalise the data for algorithms sensitive to feature scales (e.g., SVM, k-NN); a short sketch follows this list.
● Dimensionality Reduction:
○ The dataset already includes Principal Component Analysis (PCA) transformed features (V1 to V28). This step has been applied to reduce dimensionality and protect confidentiality, so no further dimensionality reduction is typically required.
● Encoding Categorical Variables:
○ This dataset does not have categorical variables, but in cases where categorical data exists, use techniques like One-Hot Encoding or Label Encoding to convert them into numerical format.
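
A minimal sketch of the scaling step referenced above, assuming scikit-learn and the same creditcard.csv file; the choice of StandardScaler over MinMaxScaler here is illustrative only:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('creditcard.csv')

# Standardise Amount so that scale-sensitive models are not dominated by raw magnitudes
scaler = StandardScaler()
data['Amount_scaled'] = scaler.fit_transform(data[['Amount']])

print(data[['Amount', 'Amount_scaled']].describe())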

4. Data Integration
● Combining Data Sources:
○ If using multiple datasets, integrate them into a single coherent dataset, for example by combining transaction data with customer demographic data, although this dataset is self-contained.
● Handling Duplicates:
○ Check for and remove any duplicate transactions to ensure data integrity. Duplicates could
skew results, especially in fraud detection where identical transactions could be a red flag.

5. Data Reduction
● Sampling:
○ Given the dataset's imbalance (only 0.172% of transactions are fraudulent), consider using
techniques like undersampling the majority class (non-fraud) or oversampling the minority class
(fraud) to balance the dataset.
○ Techniques like SMOTE (Synthetic Minority Over-sampling Technique) can be applied to generate synthetic examples of the minority class (see the sketch after this list).
● Feature Selection:
○ While PCA has already been applied, further feature selection might involve removing low-variance features or using algorithms like Random Forests to rank feature importance and select the top features.
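
The sampling idea above could be sketched as follows; this assumes the optional imbalanced-learn package, which is not used elsewhere in this report, and SMOTE is applied to the training split only so that synthetic samples never leak into evaluation:

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

data = pd.read_csv('creditcard.csv')
X, y = data.drop(columns=['Class']), data['Class']

# Hold out a stratified test set first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample only the training data
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(y_train.value_counts())
print(y_res.value_counts())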
6. Data Splitting
● Train-Test Split:
○ Split the dataset into training and testing sets. Common practice is an 80-20 or 70-30 split. The
split should be stratified to ensure the class distribution (fraud vs. non-fraud) is similar in both
sets.
● Validation Set:
○ Further split the training data into training and validation sets to tune hyperparameters and
prevent overfitting. Alternatively, cross-validation can be used for model evaluation.
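
A minimal sketch of the stratified 80-20 split plus a validation split carved out of the training data, assuming scikit-learn; the 10% validation fraction is an illustrative choice:

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('creditcard.csv')
X, y = data.drop(columns=['Class']), data['Class']

# Stratified 80/20 train-test split keeps the fraud ratio identical in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Carve a validation set out of the training data for hyperparameter tuning
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, stratify=y_train, random_state=42)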
7. Data Augmentation (Optional)
● Synthetic Data Generation:
○ In scenarios where the fraud cases are extremely low, generating synthetic fraud cases similar
to real ones could improve model robustness.
8. Data Exploration
● Exploratory Data Analysis (EDA):
○ Use visualisation techniques like histograms, scatter plots, and correlation matrices to
understand relationships between features and the target variable.
○ Perform a detailed analysis to identify any underlying patterns or anomalies in the data.
1.3 Model Planning
In the credit card fraud detection project, XGBoost (Extreme Gradient Boosting) is selected as the most
suitable model. XGBoost is known for its high performance, particularly in handling imbalanced datasets like
the one at hand. Below is the model planning approach tailored for implementing XGBoost:

1. Assessing the Structure of the Datasets


● Data Composition:
The dataset comprises 284,807 credit card transactions over two days, with 30 features including
Time, Amount, and Class. The Class feature is the target variable, indicating whether a transaction is
fraudulent (1) or not (0). Features V1 to V28 are PCA-transformed variables, which makes them
linearly uncorrelated.
● Imbalance in Data:
The dataset is highly imbalanced, with fraudulent transactions constituting only about 0.172% of the
total data. This imbalance requires careful handling, and XGBoost is particularly effective here due to
its ability to apply different weights to different classes, which helps mitigate bias towards the majority
class.
● Tool Selection:
XGBoost will be implemented using Python's XGBoost library, which is optimized for performance and
scalability. Additional tools like Scikit-learn will be used for data preprocessing, and Tableau will be
employed for visualizing the results.

2. Ensuring Alignment with Project Goals


● Objective Revalidation:
The goal is to build a model that accurately detects fraudulent transactions while minimizing false
negatives. XGBoost's robustness in managing imbalanced datasets and its ability to handle complex
data relationships make it an ideal choice for this objective.
● Avoiding Scope Creep:
The analysis remains focused strictly on detecting fraudulent credit card transactions. The use of
XGBoost helps ensure that the project stays within scope, as it is specifically suited to handle the
challenges presented by this dataset.

3. Data Exploration and Variable Selection


● Relationship Exploration:
Analyzing the relationship between Amount, Time, and the PCA-transformed features (V1 to V28) with
the Class target variable is essential. Visualization tools and statistical analysis will help understand
these relationships, ensuring that the model leverages the most informative features.
● Feature Importance:
XGBoost inherently provides feature importance scores, which can help identify the most significant
features contributing to fraud detection. Although PCA has already reduced dimensionality, XGBoost
will help further refine the feature set by highlighting those with the highest predictive power.
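
As a sketch of how XGBoost exposes these importance scores, assuming a model fitted as in the code section at the end of this report; the helper name and the choice of the 'gain' importance type are illustrative:

import xgboost as xgb
import matplotlib.pyplot as plt

def show_importance(xgb_model, top_n=10):
    # xgb_model is assumed to be an already-fitted xgb.XGBClassifier
    # Plot the top features ranked by gain
    xgb.plot_importance(xgb_model, max_num_features=top_n, importance_type='gain')
    plt.title("Top features by gain")
    plt.show()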

4. Model Selection
● Why XGBoost?
○ Imbalanced Data Handling: XGBoost’s ability to apply class weights and focus on difficult-to-
classify instances makes it particularly effective for this highly imbalanced dataset.
○ High Performance: XGBoost is known for its efficiency and speed, particularly in large datasets,
making it ideal for real-time fraud detection.
○ Interpretability: XGBoost offers insights into feature importance, allowing for a better
understanding of the factors influencing fraud detection.
● Tool Utilization:
○ Python (XGBoost Library): XGBoost will be implemented for training the model, with
additional preprocessing and evaluation using Scikit-learn.
○ Cross-Validation: To ensure the model generalizes well, a 5-fold cross-validation strategy will
be employed.
○ Hyperparameter Tuning: Hyperparameters such as learning rate, max depth, and the number
of trees will be optimized using grid search or Bayesian optimization.
○ Threshold Adjustment: After training, the decision threshold for classifying a transaction as
fraudulent will be fine-tuned to balance precision and recall, depending on business
requirements.
● Model Evaluation:
○ Precision-Recall Curve: Given the class imbalance, the precision-recall curve will be a key
metric for evaluating model performance.
○ Confusion Matrix: To assess the true positives, false positives, true negatives, and false
negatives.
○ F1 Score: This harmonic mean of precision and recall will be used as a primary performance
metric to ensure a balanced approach to fraud detection.
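
A hedged sketch of the 5-fold cross-validated grid search outlined above; the parameter grid values are illustrative placeholders, not the final tuned settings:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
import xgboost as xgb

param_grid = {
    'learning_rate': [0.05, 0.1],
    'max_depth': [4, 6],
    'n_estimators': [200, 400],
}

# 5-fold stratified cross-validation with F1 as the selection metric, matching the plan above
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    xgb.XGBClassifier(eval_metric='logloss'),
    param_grid,
    scoring='f1',
    cv=cv,
    n_jobs=-1,
)
# search.fit(X_train, y_train)  # X_train/y_train come from the data splitting step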

1.4 Model Building


In the model-building phase, the selected analytical model, XGBoost, will be trained and evaluated using the
prepared credit card transaction dataset. This process will involve applying XGBoost to the training data,
validating its performance using test data, and ensuring that the model aligns with the project's goals. Below
is a detailed approach to model building for the credit card fraud detection project:

1. Training the Model


● Data Splitting:
○ The dataset has been split into training (80%) and testing (20%) sets, ensuring that the class
distribution (fraud vs. non-fraud) is consistent across both sets.
○ A further split of the training set will be used to create a validation set for hyperparameter tuning.
● Model Training:
○ The XGBoost model will be trained on the training data. XGBoost’s ability to handle imbalanced
datasets will be leveraged by applying class weights to give more importance to the minority
class (fraudulent transactions).
○ Hyperparameters like learning rate, max depth, and the number of estimators will be optimized
using grid search or Bayesian optimization on the validation set.
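
One common way to apply the class weighting mentioned above is XGBoost's scale_pos_weight parameter. The following is a minimal sketch, assuming y_train from the split described earlier; the other hyperparameter values are illustrative and would be tuned as planned:

import xgboost as xgb

def weighted_xgb(y_train):
    # Ratio of non-fraud to fraud examples; roughly 1 / 0.00172, i.e. several hundred, for this dataset
    ratio = (y_train == 0).sum() / (y_train == 1).sum()
    return xgb.XGBClassifier(
        scale_pos_weight=ratio,   # up-weights the rare fraud class
        eval_metric='logloss',
        learning_rate=0.1,        # illustrative values, refined later via grid search
        max_depth=6,
        n_estimators=300,
    )

# model = weighted_xgb(y_train).fit(X_train, y_train)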

2. Model Validation
● Cross-Validation:
○ A 5-fold cross-validation strategy will be employed to ensure that the model generalizes well
across different subsets of the data. This helps prevent overfitting and ensures that the model's
performance is consistent.
● Evaluation Metrics:
○ Precision and Recall: Since the dataset is imbalanced, precision and recall are critical metrics.
Precision measures how many of the predicted fraudulent transactions are actually fraudulent,
while recall measures how many actual fraudulent transactions were correctly identified.
○ F1 Score: The F1 score, which is the harmonic mean of precision and recall, will be the primary
metric for model performance evaluation.
○ Confusion Matrix: This will be used to assess the true positives, false positives, true negatives,
and false negatives, providing a comprehensive view of the model's performance.
○ AUC-ROC Curve: The Area Under the Curve (AUC) of the Receiver Operating Characteristic
(ROC) curve will be used to evaluate the model's ability to distinguish between the fraudulent
and non-fraudulent transactions.
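
To make the metric definitions above concrete, a small helper computing them directly from confusion-matrix counts (equivalent to what scikit-learn's classification_report returns):

def precision_recall_f1(tp, fp, fn):
    # Precision: fraction of flagged transactions that are truly fraudulent
    precision = tp / (tp + fp)
    # Recall: fraction of actual frauds that were caught
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1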

3. Model Evaluation and Adjustment


● Model Validation on Test Data:
○ After training, the model will be applied to the test data to evaluate its performance on unseen data. This step ensures that the model’s performance is not only good on the training data but also generalizes well to new data.
● Parameter Review:
○ The model’s parameters and hyperparameters will be reviewed to ensure that they make sense in the context of the domain. For instance, if the model assigns unusually high importance to certain features, these will be reviewed to understand their relevance to fraud detection.
● Error Analysis:
○ The predictions will be analyzed to identify any patterns in the errors made by the model,
particularly focusing on false negatives (missed frauds) and false positives (legitimate
transactions flagged as fraud). This analysis will guide any necessary adjustments to the model
or feature engineering.

4. Finalizing the Model


● Threshold Tuning:
○ The decision threshold for classifying transactions as fraudulent will be adjusted based on the
business requirements, balancing the trade-off between precision and recall.
● Scalability Consideration:
○ The XGBoost model will be assessed for its runtime performance, ensuring it meets the
requirements for real-time fraud detection. XGBoost’s parallel processing capabilities will be
leveraged to maintain efficiency.
● Model Export and Deployment:
○ Once the model is finalized, it will be saved and prepared for deployment. The model will be
integrated into the real-time transaction monitoring system, where it can process incoming
transactions and flag potential frauds.
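
A hedged sketch of the threshold tuning step described above, assuming predicted probabilities from the fitted model; selecting the F1-maximising threshold is one possible business rule among several:

import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_scores):
    # Sweep candidate thresholds and return the one that maximises F1
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    # precision/recall arrays are one element longer than thresholds; drop the last point
    return thresholds[np.argmax(f1[:-1])]

# threshold = best_f1_threshold(y_test, y_pred_proba)
# y_pred = (y_pred_proba >= threshold).astype(int)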

5. Tools Used
● XGBoost: For model training and prediction, chosen for its robustness and performance, particularly
with imbalanced datasets.
● Scikit-learn: Used for data preprocessing, cross-validation, and evaluation of the model.
● Apache Spark: Could be used if the dataset were much larger, to leverage distributed computing for
model training and evaluation.
● Tableau: For visualizing the results of the model and communicating insights to stakeholders.
1.5 Communicate Findings

1. Identification and Classification of Fraud


● Result: Successfully identify and classify fraudulent transactions within the dataset with high accuracy.
● Measurement: For example, detect over 90% of fraudulent activities with less than 5% false positives.
● Potential visualizations: Confusion matrix, showing counts of true positives, true negatives, false
positives, and false negatives; ROC curve, visualizing the trade-off between sensitivity and specificity and
selecting an optimal threshold for fraud detection; Precision-Recall Curve, focusing on performance of the
classifier

Fig 1. Confusion Matrix and Scatter Plot: to explore relationships between variables

2. Reduction in False Positives and False Negatives


● Result: Achieve a significant reduction in false positive (legitimate transactions flagged as fraud) and
false negative (fraudulent transactions not detected) rates.
● Measurement: Reduce false positives by 20% and false negatives by 15% compared to existing
systems.
● Potential visualizations: ROC curve; Threshold tuning plot; Cost-Benefit Matrix

Fig 2. ROC curve for different models

3. Detection of Anomalies
● Result: Identify anomalies in transaction patterns that could indicate new types of fraud.
● Measurement: Uncover at least 10 previously undetected patterns of fraudulent behavior.
● Potential visualizations: Time-series plot, to detect anomalies in amounts over time; Boxplots, to help identify outliers in transaction amounts or frequencies; Heatmap, visualizing the correlations between different features to help identify anomalous patterns.

Fig 3. Boxplots (Class 0: Not-Fraud)

Fig 4. Heatmap of feature correlations and Time Series plot

4. Improvement in Fraud Detection Speed


● Result: Enhance the speed of fraud detection to enable real-time or near-real-time identification of
fraudulent activities.
● Measurement: Reduce the time taken to flag a suspicious transaction from several hours to a few minutes.
● Potential visualizations: Feature Importance Plot, showing which features contribute most to the prediction, allowing the feature set to be optimized for faster processing.

Fig 5. Sequence Feature Importance plot

5. Enhanced Predictive Accuracy


● Result: Develop models that accurately predict potential fraud before it occurs.
● Measurement: Achieve a predictive accuracy rate of 85% or higher, allowing preemptive action on
flagged transactions.
● Potential visualizations: Feature importance plot; Cross-Validation Performance plot, visualizing
the performance of models across different folds

Fig 6. K-fold cross-validation results for all machine learning approaches and several values of K

6. Data Insights and Reporting


● Result: Generate actionable insights from the data that can be used to refine fraud prevention
strategies.
● Measurement: Produce detailed reports that highlight trends, high-risk areas, and the effectiveness
of current strategies.
● Potential visualizations: Interactive Dashboards, integrating various visualizations.
Fig 7. Lineplots: Comparing avg and max transaction amounts, per hour (Fraud vs Not-Fraud)

Fig 8. Sample Dashboard for fraud detection


2. Challenges Encountered
2.1 Access to Real Data

Due to privacy regulations, researchers often have to rely on synthetic data, which may not fully
capture the complexity of real-world fraud scenarios. This can limit the generalizability and
applicability of research findings.

To address this challenge:

1. Use Synthetic Data with Realistic Features: Enhance synthetic data by incorporating characteristics of
real-world fraud patterns and transaction behaviours.

2. Collaborate with Financial Institutions: Partner with banks to access anonymized or aggregated real data
under strict privacy agreements.

3. Simulate Diverse Scenarios: Generate a wide range of synthetic fraud scenarios to better capture potential
complexities.

4. Benchmark with Real Data: Validate models using real data when possible, or compare synthetic results
with findings from real-world studies.

2.2 Data Quality and Consistency

Research on fraud detection often involves data that can be incomplete, inconsistent, or noisy.
Ensuring that the data used in research is of high quality is essential, but challenging, especially
when dealing with synthetic or anonymized datasets.

To handle data quality and consistency issues:

1. Implement Data Cleaning Processes: Use techniques to handle missing values, correct inconsistencies,
and remove noise.

2. Enhance Synthetic Data Quality: Ensure synthetic data generation mimics real-world data distributions
and anomalies accurately.

3. Use Data Validation Tools: Apply validation checks and automated scripts to ensure data integrity.

4. Apply Robust Preprocessing: Use normalisation, transformation, and feature engineering to improve
data quality.

5. Regular Audits: Conduct regular reviews and audits of data sources and processes to maintain quality
standards.
2.3 Class Imbalance

Fraudulent transactions are typically much rarer than legitimate ones, leading to highly imbalanced
datasets. This imbalance can skew research results, making it difficult to accurately assess the
effectiveness of proposed models or techniques.

To address class imbalance:


1. Resampling Techniques: Use oversampling (e.g., SMOTE) or undersampling to balance the dataset.
2. Cost-sensitive Learning: Apply algorithms that penalise misclassification of the minority class more
heavily.

3. Anomaly Detection Methods: Focus on techniques designed for detecting rare events.

4. Ensemble Methods: Use models like Random Forests or Gradient Boosting that can handle imbalanced
data better.

5. Synthetic Data Generation: Create synthetic fraudulent transactions to augment the minority class.

6. Evaluation Metrics: Use metrics like Precision, Recall, F1 Score, and ROC AUC, rather than accuracy,
to assess model performance.

2.4 Synthetic Data Limitations


While synthetic data is often used in fraud detection research, it may not fully replicate the nuances of real-world data. This can lead to research findings that do not translate well into practical applications.

To mitigate synthetic data limitations:

1. Improve Synthetic Data Generation: Use advanced techniques and domain knowledge to better capture
real-world fraud patterns and variations.

2. Combine Synthetic and Real Data: Integrate real-world data when available to validate and refine models
trained on synthetic data.

3. Use Domain Expertise: Collaborate with experts to ensure synthetic data reflects realistic fraud scenarios
and complexities.

4. Benchmark with Real-World Data: Regularly compare synthetic data results with findings from actual data
to assess generalizability.

5. Test in Real Environments: Validate models in real-world settings or with real data samples to ensure
practical applicability.
Digital Assignment-2
Big data life cycle process and its techniques adapted for the case study.
1. Data Discovery

• Data Sources:
o Transaction Data: Includes anonymized credit card transactions over a two-day period, with features indicating transaction timing, amount, and class (fraud/non-fraud).
o Aggregated Sources:
 Historical data across longer periods (weeks or months) to detect transaction trends.
 Customer profiles detailing demographics and behavior to profile high-risk groups.
 External fraud intelligence (e.g., known fraud patterns) to improve detection
robustness.
• Data Structure and Infrastructure:
o Storage: Scalable databases, such as HDFS or AWS S3, are essential for large-volume storage.
o Processing Framework: Apache Spark or Hadoop provides distributed processing, capable of handling real-time data for high-velocity applications.
o ETL Pipeline: Automated processes with tools like Apache NiFi or AWS Glue help transform and load data continuously, preparing it for analysis.
• Challenges and Solutions:
o Data Imbalance: With fraud being rare, techniques like SMOTE or undersampling
help balance the classes.
o Real-Time Streaming: Tools like Apache Kafka, Spark Streaming, or Flink
facilitate real-time fraud detection, which is critical for timely action.

2. Data Preparation

• Data Cleaning:
o Detect and handle missing values; as noted earlier, this particular dataset contains none.
o Outliers in transaction amounts are analyzed for fraud relevance but handled with care as they might also be legitimate.
• Data Transformation:
o Scaling: The Amount feature may require scaling (e.g., StandardScaler or MinMaxScaler) to ensure it aligns with the PCA-transformed features.
o Dimensionality Reduction: PCA has already reduced dimensionality for the original dataset, but further techniques could be used if additional features are added.
• Data Reduction Techniques:
o Sampling: Methods like undersampling the majority class or using SMOTE for
oversampling can be crucial to handle the 0.172% fraud prevalence effectively.
o Feature Selection: Use Random Forests or other techniques to rank feature
importance and drop less impactful ones, reducing computational load.
3. Model Planning

• Objective Alignment: Focus is on reducing false positives and accurately detecting fraud
in real-time, aligning with operational and business needs.
• Model Selection:
o Algorithm: XGBoost is chosen for its high performance on imbalanced data and
ability to tune weights by class.
o Exploratory Data Analysis (EDA): Analyze relationships between the Class (fraud
or non-fraud) and PCA-transformed features (V1-V28), Time, and Amount.
o Feature Importance: Use XGBoost's feature importance scores to determine which
features contribute most to fraud prediction, particularly relevant when combining
this dataset with customer profile or historical data.
• Hyperparameter Tuning: Techniques like grid search or Bayesian optimization can fine-tune parameters like learning rate, max depth, and number of trees for better precision and recall.

4. Model Building

• Data Splitting:
o The data is divided into 80% training and 20% testing sets. Stratified sampling
ensures that fraud and non-fraud cases maintain their original proportion.
• Training with XGBoost:
o Weighting for Imbalance: The minority class (fraudulent transactions) can be assigned higher weights to mitigate imbalance.
o Cross-Validation: A 5-fold cross-validation strategy evaluates model stability across different data subsets, preventing overfitting.
• Evaluation Metrics:
o Precision, Recall, and F1 Score: Focused on as primary metrics due to the dataset’s imbalance.
o Confusion Matrix: Provides insight into true positives, false positives, true negatives, and false negatives.
o ROC and Precision-Recall Curves: These curves help adjust thresholds based on the business’s tolerance for false positives vs. missed fraud.

5. Model Evaluation and Communication

• Final Evaluation:
o Apply the model on the test dataset, focusing on minimizing false negatives (missed frauds) and false positives (flagged legitimate transactions).
• Threshold Adjustment: Fine-tuning decision thresholds for fraud classification based on business impact.
• Scalability: Ensuring the model can handle real-time predictions within transaction flow, essential for a live fraud detection system.
• Visualizations and Insights:
o Confusion Matrix, Precision-Recall Curves, and ROC Curves to illustrate performance.
o Feature Importance Plots to identify which variables most influence the fraud prediction.
o Time-Series Analysis for detecting unusual spikes in transaction amounts, aiding anomaly detection.

6. Deployment and Monitoring

• Real-Time Model Deployment: Integrating the model into a real-time transaction system.
• Monitoring and Alerts:
o Set up systems to trigger alerts for high-confidence fraud predictions.
o Continuous model monitoring to assess drift over time, especially as fraud patterns evolve.
• Feedback Loop:
o Collect false positives and false negatives for retraining, allowing the model to adapt to new fraud patterns (see the sketch below).
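
A heavily hedged sketch of what the real-time deployment could look like, assuming the kafka-python client, a hypothetical 'transactions' topic carrying JSON records, and a model saved with joblib; none of these names are part of the actual assignment code:

import json
import joblib
from kafka import KafkaConsumer

# Load the trained XGBoost model (file name is hypothetical)
model = joblib.load('xgb_fraud_model.joblib')

consumer = KafkaConsumer(
    'transactions',                     # hypothetical topic name
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)

FEATURES = ['Time'] + [f'V{i}' for i in range(1, 29)] + ['Amount']
THRESHOLD = 0.5  # replace with the tuned decision threshold

for message in consumer:
    txn = message.value
    score = model.predict_proba([[txn[f] for f in FEATURES]])[0, 1]
    if score >= THRESHOLD:
        # In production this would raise an alert or block the transaction
        print(f"ALERT: suspected fraud, score={score:.3f}, txn_id={txn.get('id')}")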
Download a dataset on the same and apply any one Machine Learning Model with exploratory data analysis. Display the result with Tableau visualization.

Code:

Applying the XGBoost machine learning model to the dataset:


https://www.kaggle.com/datasets/mlgulb/creditcardfraud
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, precision_recall_curve
import xgboost as xgb

# Load the dataset
data = pd.read_csv('creditcard.csv')

# Display the first few rows of the dataset to understand its structure
data.head()

1. Basic Data Information and Class Distribution


# Checking data types, missing values, and class distribution
print(data.info())
print(data['Class'].value_counts(normalize=True))

# Visualizing class distribution
sns.countplot(x='Class', data=data)
plt.title("Class Distribution (0: Non-Fraud, 1: Fraud)")
plt.show()
2. Exploratory Data Analysis (EDA)
# Visualizing Transaction Amount distribution
plt.figure(figsize=(10, 6))
sns.histplot(data['Amount'], bins=50, kde=True)
plt.title("Distribution of Transaction Amount")
plt.xlabel("Amount")
plt.show()

# Visualizing Time vs Amount with Class
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Time', y='Amount', hue='Class', data=data, alpha=0.6)
plt.title("Transaction Amount and Time Distribution by Class")
plt.show()

# Correlation heatmap for the dataset
plt.figure(figsize=(15, 10))
corr = data.corr()
sns.heatmap(corr, cmap="coolwarm", annot=False)
plt.title("Correlation Heatmap of Features")
plt.show()
3. Data Splitting and Preprocessing

# Checking data types of columns
print(data.dtypes)

# Converting any non-numeric columns to numeric, if possible
# We'll attempt to convert columns to numeric and coerce errors (set invalid parsing as NaN)
data = data.apply(pd.to_numeric, errors='coerce')

# Re-checking and handling missing values after conversion
data = data.dropna()

# Re-splitting the data after cleaning
X = data.drop(columns=['Class'])
y = data['Class']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

4. Training the XGBoost Model

# Now training the XGBoost model
import xgboost as xgb

xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_train, y_train)

# Predicting on the test set
y_pred = xgb_model.predict(X_test)
y_pred_proba = xgb_model.predict_proba(X_test)[:, 1]

# Confusion Matrix and Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# ROC Curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, label=f"XGBoost (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], 'r--')
plt.title("ROC Curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()

# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, marker='.')
plt.title("Precision-Recall Curve")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
Visualization in Tableau:
Inferences:
• Model's Effectiveness: The model’s ability to differentiate between fraudulent and non-fraudulent transactions will be assessed through metrics like F1-score and precision-recall curves. Positive inferences can be drawn if the model shows a high ability to recall fraud cases without a significant drop in precision.
• Trends and Patterns: If the model finds specific patterns, such as fraud being more likely at
certain times of the day or certain transaction amounts, these insights can help refine
detection rules in a live system.
• False Positive Reduction: One of the key takeaways might be the model's ability to reduce
false positives, ensuring that legitimate transactions are not flagged as fraud. This
improvement can help maintain customer satisfaction and reduce operational costs.
• Real-time Detection: Inferences could include the model's potential for deployment in real-
time systems to detect fraud as transactions occur, providing proactive alerts and
minimizing damage from fraudulent activities.
