Fraud Detection on Banking Data using Python with Google Colab
ABSTRACT
Fraudulent activity is observed in a wide range of industries, including
banking, e-commerce, healthcare, and payments.
Half (49 percent) of the 7,200 companies surveyed in the 2018 PwC Global
Economic Crime Survey reported having dealt with fraud of some kind.
Although fraud may seem frightening to businesses, it can be identified with the
use of intelligent systems like machine learning or rules engines.
Most people on Kaggle are familiar with machine learning, so here is a quick
overview of rules engines. A rules engine is a software system that executes one or
more business rules in a production runtime environment. Experts in the field
typically write these rules to transfer their knowledge of the problem to the rules
engine and, through it, to production.
Two examples of rules for fraud detection are velocity rules, which set a limit
on the number of transactions allowed within a given time period, and denial of
transactions originating from previously identified fraudulent IP addresses
and/or domain names.
Rules are excellent at identifying some forms of fraud, but in certain
situations they can produce a large number of false positives or false negatives
because of their preset threshold values. As an illustration, consider a rule that
blocks any transaction by a given user whose amount exceeds 100,000 rupees. A
seasoned scammer may know that the system has such a threshold and simply make a
transaction just below it. Machine learning (ML) helps with such cases and lowers
the risk of fraud and financial loss for businesses. Combining rules and machine
learning would enable more confident and accurate fraud detection.
1. INTRODUCTION
1.1. OBJECTIVE
-Early Detection of Fraud:
Identify and detect fraudulent activities as early as possible to minimize financial
losses and protect customers.
-Accuracy and Precision:
Develop models with high accuracy and precision to reduce false positives and
negatives, ensuring that genuine transactions are not flagged as fraudulent and vice
versa.
-Adaptability and Scalability:
Design models that can adapt to evolving fraud patterns and are scalable to handle the
increasing volume of transactions as the bank grows.
-User-Friendly:
Ensure that fraud detection systems are user-friendly for both customers and internal
users, minimizing disruptions to legitimate transactions.
-Explainability:
Develop models that are interpretable and provide explanations for their decisions to
enhance transparency and trust among stakeholders.
1.2 Motivation
-Relevance to the real world: Fraud detection in banking is a serious problem with
large financial ramifications. Working on a project that involves bank data exposes you
to an issue that affects financial institutions and their clients on a personal level.
-Practical experience: Constructing a fraud detection model offers hands-on practice in
feature engineering, data preprocessing, model selection, and evaluation. These skills
are highly valued in data science and machine learning roles.
-Python's applicability: With libraries such as Pandas, NumPy, and scikit-learn, Python
is a flexible language that is widely used in data science and machine learning
projects. Running the project in Python lets you make good use of these tools.
-Benefits of Google Colab: Google Colab provides TensorFlow and other well-known
frameworks free of charge, along with computing resources (CPU, GPU, and TPU).
-Accessibility: Google Colab eliminates the need for local setup and configuration by
letting you access your project from any location with an internet connection. This
accessibility makes iteration and collaboration easier.
-Possibility for learning: Working on a fraud detection project gives you the chance to
try a variety of machine learning approaches, including ensemble methods, supervised
learning (classification), and anomaly detection. It also introduces techniques for
handling imbalanced datasets, a prevalent problem in fraud detection.
-Implications for ethics: Fraud detection initiatives raise questions of fairness,
privacy, and transparency. Participating in a project like this can help you
understand these problems better and look into responsible solutions.
1.3 Background
-Data Availability: Bank data is commonly made available for research through public
datasets or simulated datasets developed for teaching and research purposes. Because of
this availability, aspiring data scientists and researchers can work on practical
problems without requiring access to private or sensitive data.
-Relevance: Banks and other financial organizations around the world are very
concerned about detecting fraud. Identifying fraudulent activity reduces financial
losses for both individuals and organizations, protects consumers, and upholds public
confidence in financial systems. Working on a project that uses bank data for fraud
detection means you're taking on a challenge with big real-world ramifications.
-Value as an Educational Tool: Developing a fraud detection model offers practical
experience with feature engineering, exploratory data analysis, data preparation, model
selection, and evaluation—all crucial steps in a machine learning endeavor.
- Research & Innovation: The banking industry is a continuous source of research and
innovation, and working on a project involving bank data can help develop new
approaches, algorithms, and strategies for identifying and stopping fraudulent activity.
Your efforts could also lead to improvements in fraud detection systems, which would
ultimately help fight financial fraud more successfully.
2. LITERATURE SURVEY
S.NO Author & Paper Merits De-Merits
1. Fraud Detection Using the Fraud Triangle Theory and Data Mining Techniques.
-by Marco Sánchez-Aguayo, Luis Urquiza-Aguiar and José Estrada-Jiménez (2021)
Merits:
• Integration of Theoretical Framework: The paper may be deemed strong if it
successfully incorporates the widely recognized criminological framework known as the
Fraud Triangle Theory into the context of fraud detection through the use of data
mining tools.
• Innovative Data Mining approaches: A study adds value to the body of literature if it
presents new or useful data mining approaches for fraud detection and offers proof of
their efficacy.
De-Merits:
• Absence of Empirical Validation: The usefulness of the suggested fraud detection
strategy in the actual world may be questioned if the study lacks empirical support or
a realistic implementation of the suggested approach.
• Limited generalizability: The results of the research may not be as applicable to
larger contexts if they depend on a small dataset or have a limited scope.
2. Learning Fraud Detection from Big Data in Online Banking Transactions.
-by Indrajani, Harjanto Prabowo and Meyliana (2016)
Merits:
• Data-driven strategy: Using big data suggests a data-driven strategy, which enables
the model to pick up patterns and abnormalities from a lot of transactions, possibly
increasing the accuracy of fraud detection.
• Real-time Detection: The paper may have a substantial advantage if it focuses on
real-time fraud detection. Early detection of fraudulent activity can aid in the
avoidance of financial losses.
De-Merits:
• Data Privacy Issues: Managing large amounts of data in the context of online banking
transactions could give rise to data privacy issues. The paper can receive a demerit if
these issues are not sufficiently addressed.
• Unbalanced Datasets: Fraud detection frequently has to cope with datasets that have a
notably lower percentage of fraudulent transactions than lawful ones. The performance
of the model may suffer if the paper does not adequately handle the problems caused by
unbalanced data.
3. Intelligent Fraud Detection in Financial Statements Using Machine Learning and Data
Mining.
-by Matin N. Ashtiani and Bijan Raahemi (2021)
Merits:
• Innovation and Contribution: By introducing novel techniques or strategies for fraud
detection, the research may advance our understanding of the subject.
• Accuracy and Effectiveness: Good outcomes showing the suggested machine learning and
data mining approaches' accuracy and efficacy in identifying financial fraud can be
regarded as a major merit.
De-Merits:
• Data Bias: Results may be distorted and have limited application in a variety of
financial circumstances if the dataset used to train the machine learning models is
biased.
• Absence of Benchmarking: It can be detrimental if there are no comparisons with
commonly used standards or benchmark techniques for fraud detection.
4. The application of data mining techniques in financial fraud detection.
-by E.W.T. Ngai, Yong Hu, Y.H. Wong, Yijun Chen and Xin Sun (2010)
Merits:
• Pattern Recognition: Data mining facilitates the identification of intricate linkages
and patterns in financial data, which makes it simpler to spot odd behaviour that may
be a sign of fraud.
• Automation: By reducing the need for human intervention, automated data mining
techniques increase productivity and make it possible to analyse large amounts of data
rapidly.
De-Merits:
• False Positives: The creation of false positives, in which valid transactions are
reported as fraudulent, is a frequent problem. Customer unhappiness and more manual
reviews may result from this.
• Problems with Data Quality: The caliber of the input data has a significant impact on
how effective data mining is. Incomplete or inaccurate data can skew results and
jeopardize the detection process.
5. Financial fraud detection applying data mining techniques.
-by Khaled Gubran Al-Hashedi and Pritheega Magalingam (2021)
Merits:
• Performance metrics: It's critical to assess how well data mining methods work. A
robust evaluation process is demonstrated by the paper if it offers a comprehensive
study of performance indicators including area under the ROC curve, F1 score,
precision, and recall.
• Data Quantity and Quality: The dependability and generalizability of the results are
improved if the paper makes use of suitably large and high-quality datasets. Sufficient
data is essential for efficient model testing and training.
De-Merits:
• Data Bias and Imbalance: Results may be distorted if the study ignores problems with
data bias and class imbalance in fraud detection datasets. To enable generalization of
the model, the data must be represented objectively.
• Transparency and Reproducibility: The wider acceptability and application of the
suggested procedures may be hampered if the publication lacks transparency in
describing the methodologies, making it difficult for others to reproduce the results.
6. Fraud detection system.
-by Aisha Abdallah, Mohd Aizaini Maarof and Anazida Zainal (2016)
Merits:
• Scalability: A system is advantageous for real-world applications if it is scalable
and able to manage massive amounts of data or transactions.
• Adaptability: Since fraudsters constantly change their tactics, it is beneficial for
a system to be able to adjust to shifting fraud trends or strategies.
De-Merits:
• False positives/negatives: The system's practical utility may be limited if it
produces a large number of false positives or false negatives.
• Computational complexity: The suggested system may not be practical in some settings
if it is computationally demanding and needs a lot of resources.
7. Detecting Financial Fraud Using Data Mining Techniques.
-by Mousa Albashrawi (2016)
Merits:
• Relevance of Features: The data mining model's feature selection and relevance play a
crucial role. It is a plus if the paper clearly explains why the features were chosen
and shows how important they are for detecting fraud.
• Comparative Analysis: The paper helps readers comprehend the suggested approach
better if it presents a detailed analysis of the advantages and disadvantages of the
suggested methodology in comparison to other approaches.
De-Merits:
• Assumption Validity: The applicability of the suggested procedures may be called into
question if the paper is predicated on assumptions that are unrealistic or not
applicable to real-world situations.
• Absence of Validation: It casts doubt on the suggested methods' generalizability if
the research isn't supported by independent evaluation or external datasets.
8. A Review of Machine Learning Algorithms for Fraud Detection in Credit Card
Transaction.
-by Lim, Kha Shing; Lee, Lam Hong and Sim, Yee-Wai (2021)
Merits:
• Assessment of Multiple Algorithms: A quality article should assess and contrast the
effectiveness of different machine learning algorithms, offering information on the
advantages and disadvantages of each in the context of detecting credit card fraud.
• Robust approach: The results are given more trust if the study uses a rigorous
experimental approach that makes use of the right datasets, measurements, and
statistical analysis.
De-Merits:
• Narrow Scope: The usefulness and generalizability of a study may be limited if it
solely focuses on a few algorithms or datasets without taking into account a wider
context.
• Overemphasis on Particular Algorithms: The effectiveness of certain algorithms may be
distorted if a paper places too much emphasis on one or a small number of them without
taking into account any of their shortcomings or downsides.
9. A systematic literature review on frauds in banking sector.
-by Deepa Mangala, Lalita Soni (2022)
Merits:
• Extensive Coverage: A well-conducted systematic literature review seeks to give a
thorough and objective summary of the literature on banking frauds that has already
been published, providing insights into a number of related areas.
• Methodological Rigor: A thorough literature review should adhere to a strict set of
guidelines, which should include transparent techniques for data collecting and
synthesis, well-defined inclusion and exclusion criteria, and systematic search
tactics.
De-Merits:
• Limited Inclusion of Diverse Views: The review may lack diversity and fall short of
offering a comprehensive understanding of the subject if it mostly covers papers from a
certain geographic area or ignores particular types of frauds.
• Inadequate Methodology Details: The validity of the results may be called into
question if the literature review's methodology is not transparently stated.
10. An Analysis of the Most Used Machine Learning Algorithms for Online Fraud
Detection.
-by Elena-Adriana and Gabriela (2019)
Merits:
• Performance Assessment: A detailed analysis of the effectiveness of various machine
learning algorithms, stressing their advantages and disadvantages in relation to online
fraud detection, may be included in the article.
De-Merits:
• Lack of originality: The work may face criticism for lacking originality if it does
not present unique ideas or methods, particularly if it merely restates well-known
conclusions without making a significant contribution to the body of existing
knowledge.
• Insufficient Methodology: The credibility of the stated results may be compromised if
the methodology used for algorithm evaluation is not sufficiently transparent, robust,
or repeatable, which could lead to criticism of the publication.
3. PROJECT DESCRIPTION AND GOALS
Fraudulent activity is observed in a wide range of industries, including banking,
e-commerce, healthcare, and payment systems. According to the 2018 PwC Global
Economic Crime Survey, about half (49 percent) of the 7,200 organizations polled had
experienced fraud of some kind. Although fraud may seem frightening to businesses, it
can be detected with intelligent systems such as machine learning or rules engines.
Most people on Kaggle are familiar with machine learning, so this brief overview
covers rules engines. A rules engine is a software system that executes one or more
business rules in a production runtime environment. Typically, subject matter experts
write these rules to transfer their understanding of the problem to the rules engine
and ultimately to production. Two examples of rules for detecting fraud are velocity
rules, which limit the number of transactions allowed within a given time period, and
the denial of transactions originating from previously identified fraudulent IP
addresses and/or domain names.
Rules are excellent at identifying various forms of fraud, but in certain situations
they can produce a large number of false positives or false negatives because of their
preset threshold values. Consider a rule that blocks any transaction by a given user
whose amount exceeds 100,000 rupees. A seasoned scammer may know that the system has
such a threshold and simply make a transaction just below it (e.g., 99,999 rupees).
Machine learning (ML) helps with such cases and lowers the risk of fraud and
financial loss for businesses. Combining rules and machine learning would enable
more confident and accurate fraud detection.
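The three kinds of rules mentioned above (an amount threshold, a velocity rule, and an IP deny list) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the limits, and the IP addresses are hypothetical and are not taken from any production rules engine.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

# All names and thresholds below are hypothetical, chosen only for illustration.
AMOUNT_LIMIT = 100_000          # rupees, the threshold from the example in the text
MAX_TXNS_PER_WINDOW = 3         # velocity rule: at most 3 transactions ...
WINDOW_SECONDS = 60             # ... within any 60-second window
BLOCKED_IPS = {"203.0.113.7"}   # stand-in for previously identified fraudulent IPs


@dataclass
class Transaction:
    user: str
    amount: float
    ip: str
    timestamp: float  # seconds


class RulesEngine:
    def __init__(self):
        # Recent transaction timestamps per user, used by the velocity rule.
        self._recent = defaultdict(deque)

    def check(self, txn):
        """Return the names of the rules that the transaction violates."""
        violations = []
        if txn.amount > AMOUNT_LIMIT:
            violations.append("amount_limit")
        if txn.ip in BLOCKED_IPS:
            violations.append("blocked_ip")

        window = self._recent[txn.user]
        # Drop timestamps that have fallen out of the time window.
        while window and txn.timestamp - window[0] >= WINDOW_SECONDS:
            window.popleft()
        window.append(txn.timestamp)
        if len(window) > MAX_TXNS_PER_WINDOW:
            violations.append("velocity")
        return violations


engine = RulesEngine()
over_limit = engine.check(Transaction("alice", 150_000, "198.51.100.1", 0.0))
just_under = engine.check(Transaction("bob", 99_999, "198.51.100.2", 0.0))
bad_ip = engine.check(Transaction("carol", 500, "203.0.113.7", 0.0))
burst = [engine.check(Transaction("dave", 10, "198.51.100.3", float(t))) for t in range(5)]
```

Note that `just_under` comes back empty: the 99,999-rupee transaction passes every rule, which is exactly the gap that a learned model is meant to cover.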
-Data Preprocessing: To produce a consistent and dependable dataset for ML model
training and testing, clean, normalize, and preprocess raw bank data.
-Feature Engineering: Extract relevant features from the data to help distinguish
legitimate from fraudulent transactions. These may include the transaction amount,
frequency, location, and device information.
-Model Selection: Based on the type and complexity of the fraud patterns in the bank
data, select suitable machine learning (ML) techniques (e.g., decision trees, random
forests, neural networks).
-Training and Testing Models: Train the chosen model on historical data, verify its
performance on a separate dataset, and fine-tune it for accuracy.
-Integration with Banking Systems: To enable real-time monitoring and decision-
making during transactions, integrate the generated model into the bank's current
systems.
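As one illustration of the feature engineering goal listed above, the sketch below derives a frequency feature and a relative-amount feature from raw transactions. The records and field names are made up for the example, and the report does not state that these particular features were used.

```python
from collections import defaultdict
from statistics import mean

# Made-up transactions: (customer, day, amount). Field names are hypothetical.
transactions = [
    ("C1", 1, 20.0), ("C1", 1, 25.0), ("C1", 2, 30.0),
    ("C2", 1, 500.0), ("C2", 3, 15.0),
]

amounts_by_customer = defaultdict(list)
count_by_customer_day = defaultdict(int)
for customer, day, amount in transactions:
    amounts_by_customer[customer].append(amount)
    count_by_customer_day[(customer, day)] += 1

features = []
for customer, day, amount in transactions:
    # Average over all of the customer's transactions, for simplicity.
    avg = mean(amounts_by_customer[customer])
    features.append({
        "customer": customer,
        "amount": amount,
        # Frequency: how many transactions this customer made on the same day.
        "txns_same_day": count_by_customer_day[(customer, day)],
        # Amount relative to the customer's own average spend.
        "amount_to_avg": amount / avg,
    })
```

A ratio well above 1, as for the 500.0 payment of customer C2, marks a transaction that is unusual for that customer even if the absolute amount is below any fixed threshold.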
4. DESIGN APPROACH AND DETAILS
Flow diagram
We detect fraudulent transactions in the BankSim dataset. This synthetically
generated dataset consists of payments made by various customers at different
times and for different amounts.
Here's what we'll do:
1.Exploratory Data Analysis (EDA)
✓ We will perform an EDA on the data and try to gain some insight from it.
2.Data Preprocessing
✓ We will preprocess the data and prepare it for training.
3.Oversampling with SMOTE.
✓ We use SMOTE (Synthetic Minority Oversampling Technique) to balance the
dataset. The resulting counts show that both classes (1 and 0) now have the
same number of instances.
4.K-Neighbours Classifier
✓ The K-Neighbours Classifier is a non-parametric, supervised learning
classifier that uses proximity to classify or predict the grouping of an
individual data point.
5.Random Forest Classifier
✓ It works by building a large number of decision trees during the training phase.
For classification problems, the output of the random forest is the class chosen
by the majority of the trees.
6.XGBoost Classifier
✓ It is an ensemble learning algorithm based on gradient-boosted decision trees, in
which trees are added one after another to correct the errors of the previous ones.
7.Conclusion
5. IMPLEMENTATION & METHODOLOGY
✓ Exploratory Data Analysis (EDA): We analyse the data using EDA in an
effort to extract some insights. As the first rows of the data show, the dataset
comprises nine feature columns and one target column. The feature columns are
step, customer, zipCodeOrigin, merchant, zipMerchant, age, gender, category,
and amount; the target column is fraud.
✓ Preprocessing the Data: We prepare the data for training by preprocessing
it. Because the zip code columns contain only one distinct value, we drop
them. We then convert the categorical features to numerical values. It is
usually preferable to convert such categories to dummy variables because they
have no ordinal relationship (customer 1 is not greater than customer 2).
However, because there are so many of them (above 500k merchants and
customers), the number of features would grow by a factor of 10^5, rendering
training unfeasible.
Let's define our independent variables, X, and our dependent/target variable, y.
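The preprocessing steps described above can be sketched with pandas as follows. The tiny frame is made up and merely stands in for the real data, and the column names are assumptions based on the feature list in this section; the report's actual code is not reproduced here.

```python
import pandas as pd

# Tiny made-up stand-in for the bank payments data (values are illustrative only).
data = pd.DataFrame({
    "step": [0, 0, 1, 1],
    "customer": ["C1", "C2", "C1", "C3"],
    "age": ["4", "2", "4", "3"],
    "gender": ["M", "F", "M", "F"],
    "zipCodeOrigin": ["28007"] * 4,
    "merchant": ["M1", "M2", "M1", "M2"],
    "zipMerchant": ["28007"] * 4,
    "category": ["es_transportation", "es_health", "es_transportation", "es_health"],
    "amount": [4.55, 39.68, 26.89, 540.10],
    "fraud": [0, 0, 0, 1],
})

# Drop columns that hold a single distinct value: they carry no information.
constant_cols = [col for col in data.columns if data[col].nunique() == 1]
data = data.drop(columns=constant_cols)

# Encode each non-numeric column as integer category codes instead of dummies,
# so the number of columns stays the same.
for col in data.columns:
    if not pd.api.types.is_numeric_dtype(data[col]):
        data[col] = data[col].astype("category").cat.codes

X = data.drop(columns="fraud")  # independent variables
y = data["fraud"]               # dependent/target variable
```

Integer codes keep the feature count fixed, at the cost of implying an order between categories that does not really exist; that is the trade-off the paragraph above accepts.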
✓ Oversampling with SMOTE: We use SMOTE (Synthetic Minority
Oversampling Technique) to balance the dataset. The resulting counts show
that both classes (1 and 0) now have the same number of instances. We will
measure performance with a train/test split. Cross-validation should be
preferred most of the time, but it was not used here because training takes
a long time with so many instances.
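To make the idea behind SMOTE concrete, here is a simplified, hand-rolled sketch: each synthetic minority point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. It is not the library implementation (in practice a package such as imbalanced-learn would typically be used), and the data, function name, and parameters are made up.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified illustration of the SMOTE idea, not the library implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up two-feature data: 6 legitimate rows (class 0) and 3 fraud rows (class 1).
legit = [(1.0, 1.0), (1.5, 0.5), (2.0, 1.2), (0.8, 1.9), (1.1, 0.3), (1.7, 1.6)]
fraud = [(8.0, 9.0), (9.0, 8.5), (8.5, 9.5)]

new_fraud = smote_sketch(fraud, n_new=len(legit) - len(fraud))
balanced_fraud = fraud + new_fraud
```

After oversampling, `balanced_fraud` has as many rows as `legit`, and every synthetic point lies between existing fraud samples rather than duplicating one of them.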
We will define a function for plotting the ROC curve and reporting its AUC, which
offers a useful graphical depiction of the classification results. As already
mentioned, fraud datasets are imbalanced, and most instances are not
fraudulent. Suppose we always predicted non-fraudulent for this dataset: the
accuracy would be high because fraud is rare, yet no fraud would be detected,
so accuracy alone is not a reliable measure.
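The following sketch shows why accuracy is a weak measure on imbalanced data and how the area under the ROC curve is computed from model scores. It does not plot the curve; the labels and scores are made up, and `roc_auc` is a hypothetical helper written for this example.

```python
def roc_auc(y_true, scores):
    """Probability that a random fraud case scores higher than a random legitimate one.

    Ties count as half; this pairwise definition equals the area under the ROC curve.
    """
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up labels: 2 fraud cases among 10 transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A "classifier" that always predicts non-fraudulent.
always_legit = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, always_legit)) / len(y_true)
frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, always_legit))

# Hypothetical model scores (higher means more likely fraud).
scores = [0.1, 0.2, 0.1, 0.3, 0.7, 0.2, 0.1, 0.4, 0.9, 0.6]
auc = roc_auc(y_true, scores)
baseline_auc = roc_auc(y_true, always_legit)
```

The always-legitimate baseline reaches an accuracy of 0.8 while catching no fraud and scoring an AUC of only 0.5, whereas the hypothetical scores give an AUC of 0.9375.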
✓ K-Neighbor Classifier
We train the K-Neighbors Classifier and report its results as a classification
report and a confusion matrix.
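A minimal pure-Python sketch of the nearest-neighbour idea and of the confusion matrix used to evaluate it is given below. The data points are made up, and the helper names `knn_predict` and `confusion_matrix` are hypothetical; a library classifier would normally be used on the real data.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 (legitimate) and 1 (fraud)."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return matrix


# Made-up, well-separated training data: label 1 marks fraud.
train_X = [(1.0, 1.0), (1.5, 0.5), (2.0, 1.2), (8.0, 9.0), (9.0, 8.5), (8.5, 9.5)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(1.2, 0.9), (8.7, 9.1), (2.1, 1.0), (9.2, 8.8)]
test_y = [0, 1, 0, 1]

predictions = [knn_predict(train_X, train_y, p) for p in test_X]
matrix = confusion_matrix(test_y, predictions)
(tn, fp), (fn, tp) = matrix
# Precision and recall for the fraud class, as listed in a classification report.
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
```

Rows of the matrix are true classes and columns are predicted classes, so false negatives (missed fraud) appear in the bottom-left cell.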
✓ Classification via Random Forest
We train the Random Forest Classifier and report its results as a classification
report and a confusion matrix.
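As a hedged sketch of this step with scikit-learn, the example below trains a random forest on synthetic imbalanced data and prints the two reports. The data comes from `make_classification` rather than the bank payments, no oversampling is applied, and the hyperparameters are illustrative assumptions, not the values used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the bank payments (about 10% positives).
X, y = make_classification(
    n_samples=1000, n_features=7, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Hyperparameters are illustrative, not the values used in the report.
forest = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

report = classification_report(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(report)
print(matrix)
```

The same fit, predict, and report pattern applies to the other classifiers in this section, since they expose the same estimator interface.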
✓ XGBoost Classifier
We train the XGBoost Classifier and report its results as a classification report
and a confusion matrix.
✓ Conclusion
We have attempted to detect fraud using bank payment data, and our classifiers
have produced excellent results. Because fraud datasets suffer from class
imbalance, we used an oversampling approach called SMOTE to create new
minority class cases.
➢ Exploratory Data Analysis (EDA)
➢ Preprocessing the Data
➢ Oversampling with SMOTE
➢ K-Neighbour Classifier
➢ Classification via Random Forest
➢ XGBoost Classifier
6. RESULT & SUMMARY
The report describes how common fraudulent activity is in a number of industries,
including banking, e-commerce, healthcare, and payment systems. It refers to the 2018
PwC Global Economic Crime Survey, which shows that almost half of the organizations
surveyed had experienced some kind of fraud. The report highlights the benefits and
drawbacks of using intelligent systems for fraud detection, including rules engines and
machine learning.
With examples such as velocity rules and IP/domain checks, it gives a general
understanding of rules engines and how they are used in fraud detection. It does,
however, recognize that the use of rules alone could result in false positives or
negatives because of predetermined thresholds, which calls for the incorporation of
machine learning.
The report then turns to data analysis, covering exploratory data analysis (EDA)
and data preprocessing techniques such as handling categorical variables and deleting
redundant features. To balance the dataset and prepare it for training, it introduces
oversampling with the Synthetic Minority Oversampling Technique (SMOTE).
After data preparation, the report addresses model selection, covering the
K-Neighbours Classifier, Random Forest Classifier, and XGBoost Classifier. It describes
the evaluation measures used for each model, namely confusion matrices and
classification reports.
The effective implementation of classifier-based fraud detection approaches on
bank payment data is finally summarized in the conclusion. It discusses the creation of
new minority class examples to enhance model performance and emphasizes the
application of SMOTE to address the problem of class imbalance.
References
[1] Albashrawi, M. Detecting Financial Fraud Using Data Mining Techniques: A
Decade Review from 2004 to 2015.
[2] Ngai, E.W.T.; Hu, Y.; Wong, Y.H.; Chen, Y.; Sun, X. The application of data mining
techniques in financial fraud detection: A classification framework and an academic
review of literature.
[3] Ryman-Tubb, N.F.; Krause, P.; Garn, W. How Artificial Intelligence and machine
learning research impacts payment card fraud detection: A survey and industry
benchmark.
[4] Abdallah, A.; Maarof, M.A.; Zainal, A. Fraud detection system: A survey.
[5] Chaquet-Ulldemolins, J.; Moral-Rubio, S.; Muñoz-Romero, S. On the Black-Box
Challenge for Fraud Detection Using Machine Learning (II): Nonlinear Analysis
through Interpretable Autoencoders.
[6] Nassif, A.B.; Abu Talib, M.; Nasir, Q.; Dakalbab, F.M. Machine Learning for Fraud
Detection: A Systematic Review.
[7] Bhattacharyya, S.; Jha, S.; Tharakunnel, K.; Westland, J.C. Data mining for credit
card fraud: A comparative study.
[8] Hilal, W.; Gadsden, S.A.; Yawney, J. Financial Fraud: A Review of Anomaly
Detection Techniques and Recent Advances.
[9] Ashtiani, M.N.; Raahemi, B. Intelligent Fraud Detection in Financial Statements
Using Machine Learning and Data Mining: A Systematic Literature Review.
[10] Al-Hashedi, K.G.; Magalingam, P. Financial fraud detection applying data mining
techniques: A comprehensive review from 2009 to 2019.