Credit Card Fraud Detection Project Report
A Project Report
Submitted in partial fulfilment of the requirement for the award of the
degree
OF
BACHELOR OF TECHNOLOGY
in
(Information Technology)
SUBMITTED BY
August 2024
DECLARATION
I declare that this written submission represents my work and ideas in my own
words and where others' ideas or words have been included, I have adequately
cited and referenced the original sources. I also declare that I have adhered to
all principles of academic honesty and integrity and have not misrepresented,
fabricated or falsified any idea, data, fact or source in my submission. I
understand that any violation of the above will be cause for disciplinary action
by the University and can also evoke penal action from the sources which have
thus not been properly cited or from whom proper permission has not been taken
when needed. This project represents our own work conducted under the guidance
of Dr. Vineet Kumar Singh (Assistant Professor, Department of Information
Technology).
CERTIFICATE
This is to certify that this project entitled SMART HOME AUTOMATION, submitted
by Siddhi Vinayak Singh and Rameshwar Pratap Singh in partial fulfilment of the
requirement for the award of the Degree of Bachelor of Technology in Information
Technology of the Institute of Engineering and Technology, Dr. Ram Manohar Lohia
Avadh University, Ayodhya, is a record of the students' own work carried out
under my supervision and guidance. The project embodies the results of original
work and studies carried out by the students, and its contents do not form the
basis for the award of any other degree to the candidates or to anybody else.
Signature of Supervisor
ACKNOWLEDGEMENT
Before we present our work, we would like to gratefully acknowledge the
contribution of all those who helped in the work described in this project
report. I would like to thank all my team members for their hard work in
building this project.
We gratefully acknowledge our HOD, Mr. Rajesh Kumar Singh, Head of the
Department of Information Technology, Institute of Engineering and Technology,
Ayodhya, for his unconditional support and encouragement to pursue our field of
interest, Information Technology.
We express our sincere and profound gratitude to our respected supervisor,
Dr. Vineet Kumar Singh, Assistant Professor in the Department of Information
Technology, Institute of Engineering and Technology, Ayodhya, for expert
guidance and constant inspiration throughout this work, which paved the way for
the successful completion of this endeavour.
Approval Sheet
Supervisor
Head of Department
Date:
Place:
ABSTRACT
The purpose of this project is to detect fraudulent credit card transactions
using machine learning techniques, in order to stop fraudsters from making
unauthorized use of customers’ accounts. Credit card fraud is growing rapidly
worldwide, which is why action should be taken to stop it. Limiting fraud would
have a positive impact on customers: their money would be recovered and returned
to their accounts, and they would not be charged for items or services that they
did not purchase, which is the main goal of the project. Fraudulent transactions
will be detected using three machine learning techniques, KNN, SVM and Logistic
Regression, applied to a credit card transaction dataset.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
1.1 INTRODUCTION
1.2 PROJECT GOALS
1.3 RESEARCH METHODOLOGY
1.3.1 CRISP-DM
CHAPTER 4 DATA ANALYSIS
4.1 SYSTEM REQUIREMENT SPECIFICATION
4.2 HARDWARE SPECIFICATION
4.3 SOFTWARE SPECIFICATION
4.4 FUNCTIONAL REQUIREMENTS
4.5 NON-FUNCTIONAL REQUIREMENTS
4.6 PERFORMANCE REQUIREMENT
CHAPTER 7 IMPLEMENTATION
CODE
CHAPTER 9 ALGORITHM
CHAPTER 10 CONCLUSION
REFERENCES
CHAPTER:01
INTRODUCTION
1.1 INTRODUCTION
With more people using credit cards in their daily lives, credit card companies
should take special care over the security and safety of their customers.
According to (Credit card statistics 2021), 2.8 billion people around the world
were using credit cards in 2019, and 70% of those users own at least a single
card.
Reports of credit card fraud in the US rose by 44.7%, from 271,927 in 2019 to
393,207 in 2020. There are two kinds of credit card fraud. In the first, an
identity thief opens a credit card account under your name; reports of this
behavior increased 48% from 2019 to 2020. In the second, an identity thief uses
an existing account that you created, usually by stealing the credit card
information; reports of this type of fraud increased 9% from 2019 to 2020
(Daly, 2021). These statistics caught my attention because the numbers are
increasing drastically and rapidly over the years, which motivated me to
approach the issue analytically, using different machine learning methods to
detect fraudulent credit card transactions among numerous transactions.
The main aim of this project is to detect fraudulent credit card transactions,
since identifying them ensures that customers are not charged for products that
they did not buy. Detection will be performed with multiple ML techniques, and
the outcomes and results of each technique will then be compared to determine
which is best suited to the task.
Research question: Which machine learning model is best suited to detecting
fraudulent credit card transactions?
I believe that following CRISP-DM will make it easier to obtain efficient,
high-quality results, as it takes the project through the whole journey:
understanding the business and the data, preparing the data, modeling it, and
finally evaluating the model to make sure it performs well.
As stated before, credit card fraud is increasing drastically every year. Many
people face the problem of having their credit breached by fraudsters, which
affects their daily lives, since paying with a credit card is similar to taking
a loan. If the problem is not solved, many people will be left with large loans
that they cannot pay back, which will make their lives hard and leave them
unable to afford necessary products; in the long run, not being able to pay back
the amount might lead to them going to jail. In short, the problem addressed is
the detection of fraudulent credit card transactions made by fraudsters, in
order to stop those breaches and ensure customers’ security.
After the most suitable dataset is chosen, the preparation phase begins.
Preparing the dataset includes selecting the wanted attributes or variables,
cleaning the data by excluding null rows, deleting duplicated variables,
treating outliers if necessary, and transforming data types to the wanted type.
Data merging, where two or more attributes are merged, can be performed as
well. All these alterations lead to the wanted result, which is to make the
data ready to be modeled.
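The cleaning steps described above (excluding rows with nulls and removing
duplicates) can be sketched in a few lines of Python. This is a minimal
illustration using only the standard library; the function name clean_rows and
the toy records are hypothetical and are not part of the project's code.

```python
def clean_rows(rows):
    """Drop rows containing None and exact duplicate rows (first copy is kept)."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # exclude rows with missing values
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exclude exact duplicates
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical toy records; the real dataset has many more columns.
rows = [
    {"Amount": 10.0, "Class": 0},
    {"Amount": 10.0, "Class": 0},  # duplicate of the first row
    {"Amount": None, "Class": 1},  # missing value
    {"Amount": 99.5, "Class": 1},
]
print(clean_rows(rows))
```

In practice a data-frame library would usually do this in one or two calls; the
loop is written out here only to make each step visible.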
The dataset chosen for this project did not need to go through all of the
alterations mentioned earlier, as there were no missing or duplicated
variables and no merging was needed. Some data types were changed to make it
possible to create graphs, and the application Sublime Text was used to alter
the data so that it could be loaded into Weka for analysis.
Phase 4: Modelling
Four machine learning models were created in the modelling phase: KNN, SVM,
Logistic Regression and Naïve Bayes. A comparison of the results is presented
later in the paper to show which technique is best suited to detecting credit
card fraud.
The final phase evaluates the models by presenting their efficiency: the
accuracy of each model is reported, along with any observations, to find the
model best suited to detecting fraudulent credit card transactions.
CHAPTER: 02
LITERATURE REVIEW
2.1 INTRODUCTION
It is essential for credit card companies to distinguish fraudulent credit card
transactions from non-fraudulent ones, so that their customers’ accounts are
not affected and charged for products that the customers did not buy (Maniraj
et al., 2019). Many financial companies and institutions lose massive amounts
of money to fraud, and fraudsters continuously seek new ways to violate the
rules and commit illegal actions; fraud detection systems are therefore
essential for all banks that issue credit cards, to decrease their losses
(Zareapoor et al., 2012). Multiple methods are used to detect fraudulent
behavior, such as Neural Networks (NN), Decision Trees, K-Nearest Neighbor
algorithms, and Support Vector Machines (SVM). These ML methods can be applied
independently or combined using ensemble or meta-learning techniques to develop
classifiers (Zareapoor et al., 2012).
Zareapoor and his research team used multiple techniques to determine the best
performing model for detecting fraudulent transactions, judged by the accuracy
of the model, the speed of detection and the cost. The models used were Neural
Network, Bayesian Network, SVM, KNN and more. The comparison table provided in
the research paper showed that Bayesian Network was very fast in finding
fraudulent transactions, with high accuracy. The NN also performed well, with
fast detection and medium accuracy. KNN’s speed was good, with medium accuracy,
and SVM scored one of the lower scores, with low speed and medium accuracy. As
for the cost, all models built were expensive (Zareapoor et al., 2012).
The model used by Alenzi and Aljehane to detect credit card fraud was Logistic
Regression; their model scored 97.2% accuracy, 97% sensitivity and a 2.8% error
rate. Their model was compared with two other classifiers, a Voting Classifier
(VC) and KNN. VC scored 90% accuracy, 88% sensitivity and a 10% error rate. For
KNN, with k ranging from 1 to 10, the accuracy was 93%, the sensitivity 94% and
the error rate 7% (Alenzi & Aljehane, 2020).
Maniraj’s team built a model that can recognize whether a new transaction is
fraudulent or not. Their goal was to detect 100% of fraudulent transactions
while minimizing the number of incorrectly classified fraud instances. Their
model performed well, detecting 99.7% of the fraudulent transactions (Maniraj
et al., 2019).
Dheepa and Dhanapal used a behavior-based classification approach built on
Support Vector Machine, in which customers’ behavioral patterns, such as the
amount, date, time, place, and frequency of card usage, were analyzed to
distinguish credit card fraud. The accuracy achieved by their approach was more
than 80% (Dheepa & Dhanapal, 2012).
Malini and Pushpa proposed using KNN and outlier detection to identify credit
card fraud. After running their model over sampled data, the authors found that
KNN is the most suitable method for detecting target instance anomalies and for
detecting fraud under memory limitations. Outlier detection requires much less
computation and memory for credit card fraud detection, and works faster and
better on large online datasets. Overall, however, their results showed that
KNN was more accurate and efficient (Malini & Pushpa, 2017).
Maes and his team proposed using Bayesian belief networks (BBN) and artificial
neural networks (ANN) for credit card fraud detection. Their results showed
that BBN is 8% more effective at detecting fraud than ANN, meaning that in some
cases BBN detects 8% more of the fraudulent transactions. As for learning
times, ANN can take up to several hours whereas BBN takes only 20 minutes (Maes
et al., 2002).
For the 10:90 distribution, the highest accuracy was scored by Naïve Bayes with
97.5%, followed by KNN with 97.1%; Logistic Regression performed poorly, with
an accuracy of 36.4%. For the 34:66 distribution, KNN topped the chart with a
slight increase in accuracy to 97.9%, followed by Naïve Bayes with 97.6%;
Logistic Regression performed better in this distribution, as its accuracy rose
to 54.8% (Awoyemi et al., 2017).
Jain’s team used several ML techniques to distinguish credit card fraud, three
of which are SVM, ANN and KNN. To compare the outcome of each model, they
calculated the true positives (TP), false negatives (FN), false positives (FP),
and true negatives (TN) generated. ANN scored 99.71% accuracy, 99.68% precision,
and a 0.12% false alarm rate. SVM scored 94.65% accuracy, 85.45% precision, and
a 5.2% false alarm rate. Finally, KNN scored 97.15% accuracy, 96.84% precision
and a 2.88% false alarm rate (Jain et al., 2019).
Adepoju and his team used all of the ML methods that are used in this paper:
Logistic Regression, Support Vector Machine (SVM), Naïve Bayes, and K-Nearest
Neighbor (KNN). These methods were applied to skewed credit card fraud data.
The accuracies scored were 99.07% for Logistic Regression, 95.98% for Naïve
Bayes, 96.91% for K-Nearest Neighbor, and 97.53% for Support Vector Machine
(Adepoju et al., 2019).
Safa and Ganga investigated how well Logistic Regression, K-Nearest Neighbor
(KNN), and Naïve Bayes work on a highly skewed credit card dataset. They
implemented their work in Python, and the best method was selected through
evaluation. The accuracies of their models were 83% for Naïve Bayes, 97.69% for
Logistic Regression and, in last place, 54.86% for K-Nearest Neighbor (Safa &
Ganga, 2019).
Sailusha and his team introduced a system to detect fraudulent credit card
activity. The algorithms used in their model are AdaBoost and Random Forest.
Random Forest scored an accuracy of 93.99% and AdaBoost scored 99.90%, which
shows that AdaBoost did better than Random Forest in terms of accuracy
(Sailusha et al.).
The paper of Kiran and his team presents a Naïve Bayes (NB) improved K-Nearest
Neighbor (KNN) method for credit card fraud detection, abbreviated NBKNN. The
outcome of the experiment illustrates the difference in how each classifier
behaves on the same dataset. Naïve Bayes performed better than K-Nearest
Neighbor, scoring an accuracy of 95% while KNN scored 90% (Kiran et al., 2018).
The paper of Saheed and his group focuses on credit card fraud detection using
a Genetic Algorithm (GA) as a feature selection technique. In feature
selection, the data is split into two parts, first-priority features and
second-priority features. The ML techniques that the group used are Naïve Bayes
(NB), Random Forest (RF) and Support Vector Machine (SVM). Naïve Bayes scored
94.3%, SVM scored 96.3%, and Random Forest scored 96.40%, which is the highest
accuracy (Saheed et al., 2020).
The work of Itoo and his group uses three different ML methods: Logistic
Regression, Naïve Bayes and K-Nearest Neighbor. The group recorded their work
and a comparative analysis, implemented in Python. The accuracy of Logistic
Regression is 91.2%, that of Naïve Bayes is 85.4%, and K-Nearest Neighbor is
last with an accuracy of 66.9% (Itoo et al., 2020).
Dighe and his team used KNN, Naïve Bayes, Logistic Regression, Neural Network,
Multi-Layer Perceptron and Decision Tree in their work, then evaluated the
results in terms of numerous accuracy metrics. Of all the models created, the
best performing one is KNN, which scored 99.13%; in second place is Naïve
Bayes, which scored 96.98%; the third best performing model scored 96.40%; and
in last place is Logistic Regression with 96.27% (Dighe et al., 2018).
Sahin and Duman used four Support Vector Machine (SVM) methods to detect credit
card fraud: SVM with RBF, Polynomial, Sigmoid, and Linear kernels. All models
scored 99.87% in training and 83.02% in testing (Sahin & Duman, 2011).
Throughout the search I found many models created by other researchers, which
shows that people have been trying to solve the credit card fraud problem.
Najdat’s team used an approach based on bidirectional long short-term memory to
build their model, and other researchers have tried different data splitting
ratios to obtain different accuracies. The team of Sahin and Duman used
different Support Vector Machine (SVM) methods: SVM with RBF, Polynomial,
Sigmoid, and Linear kernels.
Among the four models studied in this research, the lowest reported accuracies
are 54.86% for KNN, reported by Safa and Ganga, and 36.40% for Logistic
Regression, reported by Awoyemi and his team. For Naïve Bayes the lowest
accuracy, 80.4%, was reported by Gupta and his team, and for SVM the lowest
score, 94.65%, was reported by Jain’s team. To determine the best of the four
models studied in this research, the average of the three best accuracies
reported for each model was calculated: 98.72% for KNN, 98.11% for Logistic
Regression, 98.85% for Naïve Bayes and 96.16% for Support Vector Machine. So,
the best performing credit card fraud detection model within the literature
review is the Logistic Regression model.
CHAPTER: 03
Project Description
3.1 Introduction
To accomplish the goal of the project, which is to find the model best suited
to detecting credit card fraud, several steps need to be taken. Finding the
most suitable data and preparing/preprocessing it are the first and second
steps. Once the data is ready, the modeling phase starts, in which four models
are created: K-Nearest Neighbor (KNN), Naïve Bayes, SVM and Logistic
Regression. For the KNN model, two values of K were chosen, K=3 and K=7. All
models were created in both R and Weka except SVM, which was created in Weka
only; all visualizations are taken from both applications.
CHAPTER: 04
SYSTEM REQUIREMENTS AND SPECIFICATION
➢ Jupyter Notebook
➢ Language: Python
CHAPTER: 05
DATA ANALYSIS
The first figure below shows the structure of the dataset: all attributes are
listed with their types, along with a glimpse of the values within each
attribute. As shown at the end of the figure, the Class type is integer, which
I needed to change to factor, identifying 0 as Not Fraud and 1 as Fraud, to
ease creating the model and obtaining visualizations.
The second figure shows the distribution of the class: the red bar, which
contains 284,315 instances, represents the non-fraudulent transactions, and the
blue bar, with 492 instances, represents the fraudulent transactions.
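The class counts above imply a heavily imbalanced dataset. The short
calculation below is a sketch that uses only the two counts quoted in the text:
it works out the share of fraudulent transactions and the accuracy that a
trivial model would reach by labelling every transaction as Not Fraud. The
variable names are illustrative.

```python
non_fraud = 284_315
fraud = 492
total = non_fraud + fraud

fraud_share = fraud / total
# Accuracy of a trivial model that always predicts "Not Fraud".
baseline_accuracy = non_fraud / total

print(f"total transactions: {total}")
print(f"fraud share: {fraud_share:.4%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4%}")
```

Fraud makes up roughly 0.17% of the transactions, so a high accuracy figure on
this dataset is best read alongside the number of fraudulent transactions that
were actually caught.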
The correlations between all of the attributes within the dataset are presented
in the figure below.
Figure 3 - Correlations
Figure 4 below shows attribute 18, the attribute with the most fraudulent
credit card transactions; the blue line represents class 1, the fraudulent
transactions.
Figure 4 - Variable 18
The figure below shows the variable that has the lowest number of fraudulent
transactions; as mentioned earlier, the blue line represents the fraudulent
instances within the data.
Figure - Variable 28
As there are no NAs or duplicated variables, the preparation of the dataset was
simple. The first alteration, made so that the dataset could be opened in Weka,
was changing the type of the class attribute from Numeric to Class and
identifying the class as {1,0} using the program Sublime Text. The type was
also altered in R to make it possible to create the model and the
visualizations.
Once the data was ready to be modeled, the four models were created using both
Weka and R. The SVM model was created using Weka only, whereas KNN, Logistic
Regression and Naïve Bayes were created using both R and Weka.
5.3.1 KNN
regression instances (Mahesh, 2020). To find the best KNN model, two values of
K were used, K=3 and K=7; both are presented with figures from both Weka and
R.
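To illustrate how the choice of K changes a KNN prediction, the sketch below
implements a plain majority-vote KNN classifier with Euclidean distance. It is
not the R or Weka implementation used in the project; the function knn_predict
and the two-dimensional toy points are assumptions made for illustration only.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: four "Not Fraud" points (0) and three "Fraud" points (1).
train_X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 0, 1, 1, 1]
query = (5.2, 5.1)

print(knn_predict(train_X, train_y, query, k=3))  # the 3 nearest points are all class 1
print(knn_predict(train_X, train_y, query, k=7))  # all 7 points vote: 4 zeros vs 3 ones
```

With K=3 the query is labelled by its immediate neighbours, while with K=7 the
majority class dominates the vote. On imbalanced data a larger K can therefore
make it harder to flag the rare class.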
• K=3
During the making of the KNN model, I decided to create two models, with K=3
and K=7. Figure 5 shows the model created in R: it scored an accuracy of
99.83%, correctly identifying 91,719 transactions and missing 155. In Weka, the
model scored an accuracy of 99.94% and misclassified 52 transactions.
• K=7
There was a slight decrease in accuracy in the model created in R (Figure 6):
it scored 99.82% when K is 7 and misclassified 166 fraudulent transactions as
non-fraudulent. As for Weka (Figure 7) the accuracy
The last model created using both R and Weka is Logistic Regression; the
model scored an accuracy of 99.92% in R (Figure 11) with 70
such a fashion that the space between the margin and the classes is maximum
which minimizes the error of the classification (Mahesh, 2020).
The last stage of the CRISP-DM model is the evaluation and deployment stage. As
presented in Table 2 below, all models are compared with each other to find the
best model for identifying fraudulent credit card transactions.
                   Predicted Positive   Predicted Negative
Actual Positive    TP                   FN
Actual Negative    FP                   TN
The table above shows all the components needed to calculate the accuracy of a
model, as given in the equation below.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
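As a worked example of the accuracy calculation, the sketch below computes
accuracy from the four confusion matrix counts and then checks the figure
reported for the K=3 model in R (91,719 transactions identified correctly and
155 missed). The function name and the first set of counts are hypothetical;
the second calculation uses only the two totals quoted in the text, since the
split into TP and TN is not given.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts, for illustration only.
print(accuracy(tp=80, tn=900, fp=5, fn=15))

# K=3 model in R: correct and incorrect totals as quoted in the report.
correct, incorrect = 91_719, 155
knn_r_accuracy = correct / (correct + incorrect)
print(round(knn_r_accuracy * 100, 2))
```

The second value rounds to the 99.83% reported for that model.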
Table 2 shows the accuracies of all the models created in the project. All
models performed well in detecting fraudulent transactions and scored high
accuracies. The best-scoring model is Support Vector Machine, with an accuracy
of 99.94%; the second best is Logistic Regression; KNN is in third place, as
both values of K scored similar accuracies; and the model with the lowest
accuracy is Naïve Bayes, with a score of 97.76%.
CHAPTER: 06
SYSTEM DESIGN
• The preprocessed data is split into training and testing datasets in an
80:20 ratio to avoid the problems of over-fitting and under-fitting.
• A model is trained on the training dataset with each of the following
algorithms: SVM, Random Forest Classifier and Decision Tree.
• The trained models are tested with the testing data, and the results are
visualized using bar graphs and scatter plots.
• The performance of each algorithm is measured using different metrics such
as accuracy, F1 score, precision and recall. The results are then displayed
using various data visualization tools for analysis.
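The 80:20 split in the first step can be sketched as follows. This version
splits each class separately (a stratified split), which is an assumption added
here so that the rare fraud class appears in both parts; the project's own
listing does not state that it stratifies. The function name, seed and toy
labels are illustrative.

```python
import random


def stratified_split(labels, test_ratio=0.2, seed=1):
    """Return (train_idx, test_idx), splitting every class in the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 95 legitimate transactions (0) and 5 fraudulent ones (1).
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "fraud case(s) in the test part")
```

Without stratification, a random 20% sample of a dataset this imbalanced can
end up with very few fraud cases, which makes the test metrics unstable.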
SYSTEM ARCHITECTURE
The main purpose of our project is to make people aware of online credit card
fraud through credit card fraud detection. A credit card fraud detection system
is necessary to keep our transactions safe and secure. With this system,
fraudsters do not have the chance to make multiple transactions on a stolen or
counterfeit card before the cardholder is aware of the fraudulent activity. The
model is then used to identify whether a new transaction is fraudulent or not.
Our aim here is to detect 100% of the fraudulent transactions while minimizing
incorrect fraud classifications.
Activity diagram
Sequence Diagram
The sequence diagram represents the flow of messages in the system and is also
termed an event diagram. It helps in envisioning several dynamic scenarios. It
portrays the communication between any two lifelines as a time-ordered sequence
of events, showing how these lifelines take part at run time. In UML, a
lifeline is represented by a vertical dashed line, whereas the message flow is
represented by horizontal arrows drawn between lifelines. It incorporates
iterations as well as branching.
Fig 5.4 Sequence Diagram
MODULES
• Data collection
• Data pre-processing
• Feature extraction
• Evaluation model
Data Collection
The data used in this paper is a set of credit card transaction records. This
step is concerned with selecting the subset of all available data that you will
be working with. ML problems start with data, preferably lots of data (examples
or observations) for which you already know the target answer. Data for which
you already know the target answer is called labelled data.
Data pre-processing
Data Exploration
STEP 1:
Fig. 4: Pre-Processing
STEP2:
STEP 3: Acquire training and testing datasets from the large dataset
Data visualization
Feature extraction
Feature extraction is the process of studying the behavior and patterns of the
analyzed data and drawing out features for further testing and training.
Finally, our models are trained using the classifier algorithm. We use the
classify module of the Natural Language Toolkit library in Python, together
with the labelled dataset gathered. The rest of our labelled data is used to
evaluate the models. Machine learning algorithms were used to classify the
pre-processed data. The chosen classifier was Random Forest. These algorithms
are very popular in text classification tasks.
Evaluation model
Evaluation methods such as hold-out and cross-validation are used to evaluate
model performance. The results are presented in visual form, with the
classified data represented as graphs. Accuracy is defined as the proportion of
correct predictions for the test data. It can be calculated easily by dividing
the number of correct predictions by the number of total predictions.
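Hold-out evaluation uses a single train/test split, whereas k-fold
cross-validation rotates the test part so that every sample is tested exactly
once. The sketch below only generates the index sets for k-fold
cross-validation; the function name is hypothetical, and shuffling is left out
for brevity, which is a simplifying assumption.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A model would be trained and scored once per fold, and the scores averaged to
give a more stable estimate than a single hold-out split.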
CHAPTER: 07
IMPLEMENTATION
7.1 Algorithm
Step7: Choose the algorithm among 3 different algorithms and create the
model
Step11: Compare the algorithms for all the variables and find out the best
algorithm.
import numpy as np
import pandas as pd
# evaluation
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             precision_score, recall_score, f1_score, roc_auc_score)
import systemcheck
Data Acquisition
data = pd.read_csv('[Link]')
data
Data Analysis
[Link]
[Link]()
[Link]()
[Link](x='Class', data=data)
print("Fraud: ", [Link]()/d[Link]())
Fraud_class = [Link]({'Fraud': data['Class']})
Fraud_class.apply(pd.value_counts).plot(kind='pie', subplots=True)
fraud = data[data['Class'] == 1]
valid = data[data['Class'] == 0]
[Link]()
[Link](figsize=(20,20))
[Link]('Correlation Matrix', y=1.05, size=15)
[Link]([Link](float).corr(), linewidths=0.1, vmax=1.0,
       square=True, linecolor='white', annot=True)
Data Normalization
rs = RobustScaler()
data['Amount'] =
Y = data["Class"]
Data splitting
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
X_train
X_test
Y_test
# Testing SVC
Y_pred_svm = [Link](X_test)
# Random forest model creation
rfc = RandomForestClassifier()
# training
[Link](X_train, Y_train)
# Testing
Y_pred_rf = [Link](X_test)
# Evaluation
evaluate(Y_pred_rf, Y_test)
# predictions
RandomForestClassifier(class_weight='balanced')
[Link](X_train, Y_train)
# predictions
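The Data Normalization step in the listing above creates a RobustScaler, but
the assignment to the Amount column is cut off in the source. As a hedged
illustration of what robust scaling computes in general, the sketch below
centres values on their median and divides by the interquartile range (IQR),
which is the default behaviour of scikit-learn's RobustScaler. It uses only the
standard library; the function name and sample amounts are hypothetical, and
the sketch is not the project's actual code.

```python
import statistics


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR).

    Assumes the IQR is non-zero, i.e. the values are not nearly all identical.
    """
    median = statistics.median(values)
    # "inclusive" uses linear interpolation between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]


# Hypothetical transaction amounts with one large outlier.
amounts = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(amounts)
print(scaled)
```

Because the median and IQR are barely affected by the single large amount, the
typical values stay in a narrow range while the outlier remains clearly
visible, which is why robust scaling suits skewed columns such as transaction
amounts.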
CHAPTER: 08
TESTING
Testing is the process of executing a program with the intent of finding
errors. Testing presents an interesting anomaly for software engineering. The
goal of software testing is to convince system developers and customers that
the software is good enough for operational use. Testing is a process intended
to build confidence in the software. It is a set of activities that can be
planned in advance and conducted systematically. Software testing is often
referred to as verification and validation.
In this testing, we test each module individually and then integrate it with
the overall system. Unit testing focuses verification efforts on the smallest
unit of software design, the module; this is also known as module testing. Each
module of the system is tested separately, and this testing is carried out
during the programming stage itself. In this testing step, each module was
found to be working satisfactorily with regard to its expected output. There
are also some validation checks for fields, which make it easy to find and
debug errors in the system.
User Acceptance Testing is a critical phase of any project and requires
significant participation by the end user. It also ensures that the system
meets the functional requirements. Some of my friends who tested this module
suggested that this was a user-friendly application with good processing speed.
CHAPTER: 09
ALGORITHM
• The random forest algorithm is not biased, as it depends on multiple trees,
each trained separately on the data; biasedness is therefore reduced overall.
• It is a very stable algorithm. Even if a new data point is introduced into
the dataset, it does not affect the overall algorithm much; rather, it affects
only a single tree.
• It works well when one has both categorical and numerical features.
• The random forest algorithm also works well when the data has missing values
or has not been scaled properly. Thus, using the random forest and decision
tree algorithms, we have extracted the percentage of fraud detected in the
given dataset by studying its behavior. A confusion matrix is a summary of
prediction results: a table used to describe the performance of a classifier on
a set of test data for which the true values are known. It provides a
visualization of an algorithm’s performance and allows easy identification of
confusion between classes. Most performance measures are computed from it, as
it gives insight not only into the errors being made by the classification
model but also into the types of errors being made. The training and testing
data are represented in a confusion matrix, which portrays:
• TP: True Positive denotes fraudulent transactions that were correctly
predicted as fraud.
• TN: True Negative denotes legitimate transactions that were correctly
predicted as not fraud.
• FP: False Positive denotes legitimate transactions that were incorrectly
predicted as fraud.
• FN: False Negative denotes fraudulent transactions that were incorrectly
predicted as not fraud.
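The four counts above are enough to derive precision, recall and the F1 score
mentioned in the system design. The sketch below shows the standard formulas;
the function name and the example counts are hypothetical and are not results
from this project.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall and F1 for the positive (fraud) class.

    tn is accepted for completeness but does not enter these three metrics.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 100 actual fraud cases, 80 of them caught, 10 false alarms.
metrics = classification_metrics(tp=80, tn=9_900, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall answers how many of the fraudulent transactions were caught, and
precision answers how many of the fraud alerts were correct; both are more
informative than accuracy when fraud is rare.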
Fig. 10: Accurate result extracted from the random forest classification and
regression model using decision tree.
CHAPTER: 10
CONCLUSION
5.1 Conclusion
In conclusion, the main objective of this project was to find the model best
suited to credit card fraud detection among the machine learning techniques
chosen for the project. This was met by building the four models and finding
the accuracy of each. The best model in terms of accuracy is Support Vector
Machine, which scored 99.94% with only 51 misclassified instances. I believe
that using the model will help decrease the amount of credit card fraud and
increase customer satisfaction, as it will give customers a better experience
and a feeling of security.
5.2 Recommendations
There are many ways to improve the model, such as using it on datasets of
various sizes and data types, changing the data splitting ratio, or viewing it
from the perspective of a different algorithm. One example is merging telecom
data to determine the location of the card owner while his/her credit card is
being used. This will ease detection, because if the card owner is in Dubai and
a transaction with the card is made in Abu Dhabi, it will easily be detected as
fraud.
REFERENCES
[4] K. Chaudhary, B. Mallick, "Credit Card Fraud: The study of its impact
and detection techniques", International Journal of Computer Science
and Network (IJCSN), vol. 1, no. 4, pp. 31-35, 2019, ISSN: 2277-5420.
[10] Y. Sahin, E. Duman, "Detecting credit card fraud by ANN and logistic
regression",