21BCE3954 FraudDetectionInBanking
Topic:
Fraud Detection in Banking: A Data-Driven Analytical Perspective
1. Data Analytics Life Cycle
1.1 Data Discovery
Data Sources:
● Transaction Data: The dataset contains transactions made by credit cards in September 2013 by
European cardholders. The data consists of 284,807 transactions over two days, with 492 identified
as fraudulent (about 0.172%).
● Features: The dataset has 30 features, including:
○ Time: The number of seconds elapsed between this transaction and the first transaction in the
dataset.
○ V1 to V28: These are the result of a PCA (Principal Component Analysis) transformation, so the
dataset does not expose the original features due to confidentiality issues.
○ Amount: The transaction amount.
○ Class: The target variable, where 1 indicates fraud and 0 indicates no fraud.
4. Data Integration
● Combining Data Sources:
○ If using multiple datasets, integrate them into a single coherent dataset. For example,
combining transaction data with customer demographic data, although this dataset is
self-contained.
● Handling Duplicates:
○ Check for and remove any duplicate transactions to ensure data integrity. Duplicates could
skew results, especially in fraud detection where identical transactions could be a red flag.
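A minimal sketch of this duplicate check with pandas; the creditcard.csv filename and the DataFrame name data are assumptions:

import pandas as pd

# Load the transaction data (assumed filename)
data = pd.read_csv("creditcard.csv")

# Count and remove exact duplicate rows; duplicates can both skew training and signal fraud
print("Duplicate transactions:", data.duplicated().sum())
data = data.drop_duplicates().reset_index(drop=True)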
5. Data Reduction
● Sampling:
○ Given the dataset's imbalance (only 0.172% of transactions are fraudulent), consider using
techniques like undersampling the majority class (non-fraud) or oversampling the minority class
(fraud) to balance the dataset.
○ Techniques like SMOTE (Synthetic Minority Over-sampling Technique) can be applied to
generate synthetic examples of the minority class (a sketch follows this list).
● Feature Selection:
○ While PCA has already been applied, further feature selection might involve removing
low-variance features or using algorithms like Random Forests to rank feature importance and
select the top features.
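The sampling step above can be sketched with the imbalanced-learn library; applying SMOTE to the training split only, and the variable names X_train and y_train, are assumptions carried over from a standard train-test split:

from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample the fraud class on the training split only, so no synthetic
# points leak into the test set
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print("Class counts before:", Counter(y_train))
print("Class counts after:", Counter(y_resampled))

The RandomUnderSampler class from the same library covers the undersampling alternative mentioned above.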
6. Data Splitting
● Train-Test Split:
○ Split the dataset into training and testing sets. Common practice is an 80-20 or 70-30 split. The
split should be stratified to ensure the class distribution (fraud vs. non-fraud) is similar in both
sets.
● Validation Set:
○ Further split the training data into training and validation sets to tune hyperparameters and
prevent overfitting. Alternatively, cross-validation can be used for model evaluation.
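A minimal sketch of the stratified split described above, assuming the features and the Class label have already been separated into X and y:

from sklearn.model_selection import train_test_split

# 80-20 split, stratified so the ~0.172% fraud rate is preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Optional further split of the training data for hyperparameter tuning
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)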
7. Data Augmentation (Optional)
● Synthetic Data Generation:
○ In scenarios where the fraud cases are extremely low, generating synthetic fraud cases similar
to real ones could improve model robustness.
8. Data Exploration
● Exploratory Data Analysis (EDA):
○ Use visualisation techniques like histograms, scatter plots, and correlation matrices to
understand relationships between features and the target variable.
○ Perform a detailed analysis to identify any underlying patterns or anomalies in the data.
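An illustrative EDA sketch for the steps above, assuming the DataFrame is named data; the specific plots are examples rather than a fixed recipe:

import matplotlib.pyplot as plt
import seaborn as sns

# Class distribution: makes the severe imbalance visible
data["Class"].value_counts().plot(kind="bar", title="Fraud vs. non-fraud counts")
plt.show()

# Transaction amounts per class, to compare fraudulent and legitimate behaviour
sns.boxplot(x="Class", y="Amount", data=data)
plt.title("Transaction amount by class")
plt.show()

# Correlation matrix across all features and the target
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), cmap="coolwarm", center=0)
plt.title("Feature correlation heatmap")
plt.show()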
1.3 Model Planning
In the credit card fraud detection project, XGBoost (Extreme Gradient Boosting) is selected as the most
suitable model. XGBoost is known for its high performance, particularly in handling imbalanced datasets like
the one at hand. Below is the model planning approach tailored for implementing XGBoost:
4. Model Selection
● Why XGBoost?
○ Imbalanced Data Handling: XGBoost’s ability to apply class weights and focus on difficult-to-classify instances makes it particularly effective for this highly imbalanced dataset.
○ High Performance: XGBoost is known for its efficiency and speed, particularly in large datasets,
making it ideal for real-time fraud detection.
○ Interpretability: XGBoost offers insights into feature importance, allowing for a better
understanding of the factors influencing fraud detection.
● Tool Utilization:
○ Python (XGBoost Library): XGBoost will be implemented for training the model, with
additional preprocessing and evaluation using Scikit-learn.
○ Cross-Validation: To ensure the model generalizes well, a 5-fold cross-validation strategy will
be employed.
○ Hyperparameter Tuning: Hyperparameters such as learning rate, max depth, and the number
of trees will be optimized using grid search or Bayesian optimization.
○ Threshold Adjustment: After training, the decision threshold for classifying a transaction as
fraudulent will be fine-tuned to balance precision and recall, depending on business
requirements.
● Model Evaluation:
○ Precision-Recall Curve: Given the class imbalance, the precision-recall curve will be a key
metric for evaluating model performance.
○ Confusion Matrix: To assess the true positives, false positives, true negatives, and false
negatives.
○ F1 Score: This harmonic mean of precision and recall will be used as a primary performance
metric to ensure a balanced approach to fraud detection.
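A hedged sketch of this XGBoost setup; the hyperparameter values are placeholders (tuning is covered below), and scale_pos_weight is derived from the class ratio in the training split:

from xgboost import XGBClassifier

# Weight the fraud class by the ratio of non-fraud to fraud examples
ratio = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.1,
    scale_pos_weight=ratio,  # compensates for the class imbalance
    eval_metric="aucpr",     # area under the precision-recall curve
    random_state=42,
)
model.fit(X_train, y_train)

# Fraud probabilities on the test set, used later for threshold adjustment
fraud_scores = model.predict_proba(X_test)[:, 1]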
2. Model Validation
● Cross-Validation:
○ A 5-fold cross-validation strategy will be employed to ensure that the model generalizes well
across different subsets of the data. This helps prevent overfitting and ensures that the model's
performance is consistent.
● Evaluation Metrics:
○ Precision and Recall: Since the dataset is imbalanced, precision and recall are critical metrics.
Precision measures how many of the predicted fraudulent transactions are actually fraudulent,
while recall measures how many actual fraudulent transactions were correctly identified.
○ F1 Score: The F1 score, which is the harmonic mean of precision and recall, will be the primary
metric for model performance evaluation.
○ Confusion Matrix: This will be used to assess the true positives, false positives, true negatives,
and false negatives, providing a comprehensive view of the model's performance.
○ AUC-ROC Curve: The Area Under the Curve (AUC) of the Receiver Operating Characteristic
(ROC) curve will be used to evaluate the model's ability to distinguish between the fraudulent
and non-fraudulent transactions.
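A minimal sketch of the 5-fold cross-validation and metrics above using Scikit-learn; model refers to the XGBoost classifier from the earlier sketch:

from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    model, X_train, y_train, cv=cv,
    scoring=["precision", "recall", "f1", "roc_auc"],
)

for metric in ("test_precision", "test_recall", "test_f1", "test_roc_auc"):
    print(metric, round(scores[metric].mean(), 4))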
5. Tools Used
● XGBoost: For model training and prediction, chosen for its robustness and performance, particularly
with imbalanced datasets.
● Scikit-learn: Used for data preprocessing, cross-validation, and evaluation of the model.
● Apache Spark: Could be used if the dataset were much larger, to leverage distributed computing for
model training and evaluation.
● Tableau: For visualizing the results of the model and communicating insights to stakeholders.
1.5 Communicate Findings
Fig 1. Confusion Matrix and Scatter Plot: used to explore relationships between variables
3. Detection of Anomalies
● Result: Identify anomalies in transaction patterns that could indicate new types of fraud.
● Measurement: Uncover at least 10 previously undetected patterns of fraudulent behavior.
● Potential visualizations: a time-series plot to detect anomalies in transaction amounts over time; boxplots to identify outliers in transaction amounts or frequencies; and a heatmap of feature correlations to reveal anomalous patterns (a plotting sketch follows below).
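An illustrative plotting sketch for these visualizations, assuming data is the transactions DataFrame; the plot choices follow the dataset's Time, Amount, and Class columns:

import matplotlib.pyplot as plt

# Time-series view of amounts with fraud cases highlighted
fraud = data[data["Class"] == 1]
legit = data[data["Class"] == 0]
plt.scatter(legit["Time"], legit["Amount"], s=2, alpha=0.3, label="Legitimate")
plt.scatter(fraud["Time"], fraud["Amount"], s=10, color="red", label="Fraud")
plt.xlabel("Time (seconds since first transaction)")
plt.ylabel("Amount")
plt.legend()
plt.show()

# Boxplot of amounts per class to expose outliers
data.boxplot(column="Amount", by="Class")
plt.show()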
Fig 6. K-fold cross-validation results for all machine learning approaches and several values of K
Due to privacy regulations, researchers often have to rely on synthetic data, which may not fully
capture the complexity of real-world fraud scenarios. This can limit the generalizability and
applicability of research findings.
1. Use Synthetic Data with Realistic Features: Enhance synthetic data by incorporating characteristics of
real-world fraud patterns and transaction behaviours.
2. Collaborate with Financial Institutions: Partner with banks to access anonymized or aggregated real data
under strict privacy agreements.
3. Simulate Diverse Scenarios: Generate a wide range of synthetic fraud scenarios to better capture potential
complexities.
4. Benchmark with Real Data: Validate models using real data when possible, or compare synthetic results
with findings from real-world studies.
Research on fraud detection often involves data that can be incomplete, inconsistent, or noisy.
Ensuring that the data used in research is of high quality is essential, but challenging, especially
when dealing with synthetic or anonymized datasets.
1. Implement Data Cleaning Processes: Use techniques to handle missing values, correct inconsistencies,
and remove noise.
2. Enhance Synthetic Data Quality: Ensure synthetic data generation mimics real-world data distributions
and anomalies accurately.
3. Use Data Validation Tools: Apply validation checks and automated scripts to ensure data integrity.
4. Apply Robust Preprocessing: Use normalisation, transformation, and feature engineering to improve
data quality.
5. Regular Audits: Conduct regular reviews and audits of data sources and processes to maintain quality
standards.
2.3 Class Imbalance
Fraudulent transactions are typically much rarer than legitimate ones, leading to highly imbalanced
datasets. This imbalance can skew research results, making it difficult to accurately assess the
effectiveness of proposed models or techniques.
3. Anomaly Detection Methods: Focus on techniques designed for detecting rare events.
4. Ensemble Methods: Use models like Random Forests or Gradient Boosting that can handle imbalanced
data better.
5. Synthetic Data Generation: Create synthetic fraudulent transactions to augment the minority class.
6. Evaluation Metrics: Use metrics like Precision, Recall, F1 Score, and ROC AUC, rather than accuracy,
to assess model performance.
1. Improve Synthetic Data Generation: Use advanced techniques and domain knowledge to better capture
real-world fraud patterns and variations.
2. Combine Synthetic and Real Data: Integrate real-world data when available to validate and refine models
trained on synthetic data.
3. Use Domain Expertise: Collaborate with experts to ensure synthetic data reflects realistic fraud scenarios
and complexities.
4. Benchmark with Real-World Data: Regularly compare synthetic data results with findings from actual data
to assess generalizability.
5. Test in Real Environments: Validate models in real-world settings or with real data samples to ensure
practical applicability.
Digital Assignment-2
Big data life cycle process and its techniques adapted for the case study.
1. Data Discovery
• Data Sources:
o Transaction Data: Includes anonymized credit card transactions over a two-day
period, with features indicating transaction timing, amount, and class (fraud/non-fraud).
o Aggregated Sources:
Historical data across longer periods (weeks or months) to detect transaction trends.
Customer profiles detailing demographics and behavior to profile high-risk groups.
External fraud intelligence (e.g., known fraud patterns) to improve detection robustness.
• Data Structure and Infrastructure:
o Storage: Scalable storage systems, such as HDFS or AWS S3, are essential for large-volume data.
o Processing Framework: Apache Spark or Hadoop provides distributed processing, capable of handling real-time data for high-velocity applications.
o ETL Pipeline: Automated processes with tools like Apache NiFi or AWS Glue help
transform and load data continuously, preparing it for analysis.
• Challenges and Solutions:
o Data Imbalance: With fraud being rare, techniques like SMOTE or undersampling
help balance the classes.
o Real-Time Streaming: Tools like Apache Kafka, Spark Streaming, or Flink
facilitate real-time fraud detection, which is critical for timely action.
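A hedged sketch of how the real-time ingestion mentioned here could look with Spark Structured Streaming reading from Kafka; the broker address, topic name, and reduced schema are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.appName("FraudStream").getOrCreate()

# Simplified schema; a real feed would also carry the V1-V28 features
schema = StructType([
    StructField("Time", DoubleType()),
    StructField("Amount", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
       .option("subscribe", "transactions")                  # assumed topic
       .load())

# Parse the Kafka message value into columns
txns = raw.select(from_json(col("value").cast("string"), schema).alias("t")).select("t.*")

# Console sink for illustration; in practice this stream would feed the scoring model
query = txns.writeStream.outputMode("append").format("console").start()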
2. Data Preparation
• Data Cleaning:
o Detect and handle missing values, though in this dataset PCA anonymization minimizes these.
o Outliers in transaction amounts are analyzed for fraud relevance but handled with care, as they might also be legitimate.
• Data Transformation:
o Scaling: The Amount feature may require scaling (e.g., StandardScaler or MinMaxScaler) to ensure it aligns with the PCA-transformed features (a scaling sketch follows after this list).
o Dimensionality Reduction: PCA has already reduced dimensionality for the original dataset, but further techniques could be used if additional features are added.
• Data Reduction Techniques:
o Sampling: Methods like undersampling the majority class or using SMOTE for
oversampling can be crucial to handle the 0.172% fraud prevalence effectively.
o Feature Selection: Use Random Forests or other techniques to rank feature
importance and drop less impactful ones, reducing computational load.
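A minimal sketch of the Amount scaling flagged under Data Transformation above, fitted on the training split only to avoid leakage; the X_train and X_test names are assumptions:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit on the training split, then reuse the same transform on the test split
X_train["Amount"] = scaler.fit_transform(X_train[["Amount"]]).ravel()
X_test["Amount"] = scaler.transform(X_test[["Amount"]]).ravel()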
3. Model Planning
• Objective Alignment: Focus is on reducing false positives and accurately detecting fraud
in real-time, aligning with operational and business needs.
• Model Selection:
o Algorithm: XGBoost is chosen for its high performance on imbalanced data and
ability to tune weights by class.
o Exploratory Data Analysis (EDA): Analyze relationships between the Class (fraud
or non-fraud) and PCA-transformed features (V1-V28), Time, and Amount.
o Feature Importance: Use XGBoost's feature importance scores to determine which
features contribute most to fraud prediction, particularly relevant when combining
this dataset with customer profile or historical data.
• Hyperparameter Tuning: Techniques like grid search or Bayesian optimization can fine-tune
parameters like learning rate, max depth, and number of trees for better precision and recall.
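A hedged sketch of this tuning step with Scikit-learn's GridSearchCV; the parameter grid values are illustrative rather than recommendations, and ratio is the class-weight ratio from the earlier model sketch:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "n_estimators": [200, 400],
}

search = GridSearchCV(
    XGBClassifier(scale_pos_weight=ratio, random_state=42),
    param_grid,
    scoring="f1",  # balances precision and recall
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)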
4. Model Building
• Data Splitting:
o The data is divided into 80% training and 20% testing sets. Stratified sampling
ensures that fraud and non-fraud cases maintain their original proportion.
• Training with XGBoost:
o Weighting for Imbalance: The minority class (fraudulent transactions) can be assigned higher weights to mitigate imbalance.
o Cross-Validation: A 5-fold cross-validation strategy evaluates model stability across different data subsets, preventing overfitting.
• Evaluation Metrics:
o Precision, Recall, and F1 Score: Used as the primary metrics due to the dataset’s imbalance.
o Confusion Matrix: Provides insight into true positives, false positives, true negatives, and false negatives.
o ROC and Precision-Recall Curves: These curves help adjust thresholds based on the business’s tolerance for false positives versus missed fraud.
o Apply the model to the test dataset, focusing on minimizing false negatives (missed frauds) and false positives (flagged legitimate transactions).
• Threshold Adjustment: Fine-tuning decision thresholds for fraud classification based on business impact (a sketch follows at the end of this section).
• Scalability: Ensuring the model can handle real-time predictions within transaction flow,
essential for a live fraud detection system.
• Visualizations and Insights:
o Confusion Matrix, Precision-Recall Curves, and ROC Curves to illustrate performance.
o Feature Importance Plots to identify which variables most influence the fraud prediction.
o Time-Series Analysis for detecting unusual spikes in transaction amounts, aiding anomaly detection.
• Real-Time Model Deployment: Integrating the model into a real-time transaction system.
• Monitoring and Alerts:
o Set up systems to trigger alerts for high-confidence fraud predictions.
o Continuous model monitoring to assess drift over time, especially as fraud patterns evolve.
• Feedback Loop:
o Collect false positives and false negatives for retraining, allowing the model to
adapt to new fraud patterns.
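As referenced under Threshold Adjustment above, a minimal sketch of picking an operating point from the precision-recall curve instead of the default 0.5 cutoff; the 90% recall target is an illustrative business choice, and fraud_scores are the test-set probabilities from the earlier model sketch:

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, fraud_scores)

# Highest threshold that still achieves the assumed recall target
target_recall = 0.90
candidates = np.where(recall[:-1] >= target_recall)[0]
best_idx = candidates[-1]
chosen = thresholds[best_idx]

print(f"Threshold {chosen:.3f} -> precision {precision[best_idx]:.3f}, recall {recall[best_idx]:.3f}")

# Apply the tuned threshold to flag transactions
y_pred = (fraud_scores >= chosen).astype(int)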
Download a dataset on the same topic and apply any one Machine Learning Model with exploratory data analysis.
Code:
import pandas as pd

# Load the credit card transactions dataset (assumed local CSV filename)
data = pd.read_csv("creditcard.csv")

# Display the first few rows of the dataset to understand its structure
data.head()