Loan Approval Predictor Using Data Science and Machine Learning Project
By
I hereby declare that the work presented in this report entitled “Loan Approval Predictor”, in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering, submitted in the Department of Computer Science & Engineering, Bengal College of Engineering & Technology, is an authentic record of my own work carried out over the period from October 2021 to December 2021 under the supervision of Mr. Tapas Pal.
The matter embodied in the report has not been submitted for the award of any other degree.
This is to certify that the above statement made by the candidate is true to the best of my knowledge.
ACKNOWLEDGEMENT
First of all, I would like to express my deep gratitude to my project guide Mr. Tapas Pal (Assistant Professor, Department of Computer Science & Engineering) for providing me an opportunity to work under his supervision and guidance. His constant encouragement at every step was a precious asset during the project work.
I am thankful to the faculty and staff of the Department of Computer Science & Engineering, Bengal College of Engineering & Technology for providing me with all the facilities required for the experimental work.
I would like to thank my family for their continuous support and motivation. Finally, I would like
to thank those who directly or indirectly helped me in completing this project.
TABLE OF CONTENTS
CHAPTER -1 INTRODUCTION 1
1.1 General Introduction 1
1.2 Problem Statement 4
1.3 Objectives 5
1.4 Methodology 6
CHAPTER-5 CONCLUSION 54
References
LIST OF FIGURES
Fig 17 Counting values in target variable
Fig 18 Normalizing
Fig 19 Generating plot of features of target variable
Fig 20 Distribution of age
Fig 21 Frequency of job
Fig 22 Plot of features of job
Fig 23 Plot of default
Fig 24 Job-subscribed crosstab
ABSTRACT
Nowadays, people depend on bank loans to meet their needs. The number of loan applications has increased very rapidly in recent years. Risk is always involved in the approval of loans. Banking officials are acutely aware of the risk of repayment of the loan amount by customers. Even after taking many precautions and analysing the loan applicant's information, loan approval decisions are not always correct. There is a need to automate this process so that loan approval becomes less risky and incurs less loss for banks.
Since it is a major activity for banks to identify whether a loan of the desired amount should be approved for an applicant or not, Computer Science is capable of building such a system using Artificial Intelligence, which can make this tough decision accurately and quickly.
Using data science, which deals with large amounts of data efficiently, and some Machine Learning algorithms, a prediction system is built which, on the basis of a training data set, is capable of identifying whether a loan applicant is suitable for loan approval or not.
Machine Learning algorithms like Decision Tree, Logistic Regression, Random Forest, etc. are used for the analysis. These are efficient algorithms that are widely followed for data analysis and prediction.
The system will look into some basic information about the applicant, such as his/her profession, age, gender, marital status, etc., and after analysing all this information using visualization and machine learning algorithms, it will come to a decision.
CHAPTER 1: INTRODUCTION
1.1 General Introduction:
In a world that is progressively turning into a digital space, organizations handle zettabytes and yottabytes of structured and unstructured data daily. Advancing technologies have enabled cost savings and smarter storage spaces to store this critical data. Currently, within the industry, there is a huge requirement for skilled and qualified Data Scientists, and they are among the most generously compensated professionals in the IT business. According to Forbes, 'the best job in America is that of a Data Scientist, with an average yearly pay of $110,000'. Only a few people can actually process this data and derive important insights out of it.
Organizations are overwhelmed with massive amounts of data. Accordingly, it is essential to understand what to do with this exploding data and how to use it. It is here that the idea of Data Science comes into the picture. Data science, along with mathematics and organisational knowledge, allows an organisation to explore approaches to:
● Reduce costs as much as possible
● Explore new markets and enter them
Thus, regardless of the business vertical, Data Science is likely to play a key role in an organisation's success.
In recent years, data science has emerged as a new and important discipline. It can be considered a combination of several technologies such as data mining, statistics, databases, etc. Existing approaches need to be combined to turn the abundantly available data into value for individuals, organizations, and society.
Science means knowledge acquired by systematic study. Data science basically focuses on data, which may be very large in size, and on handling this data using modern technology. A very big amount of data is generated in companies every second. This data definitely needs to be handled properly and efficiently; if not handled properly, it could prove dangerous for the company.
Hence every company employs the required number of data scientists, who work constantly on handling the records which are essential to the company. These data scientists are very heavily paid. Data scientists know statistics and have knowledge of the data, which makes it easy for them to handle it.
Traditional database techniques are not suitable for knowledge discovery, because they are optimized for fast access and summarization of data given that the user knows what to ask (a query), not for discovery of patterns in big swaths of data when users lack a well-formulated question. Unlike database querying, which asks "What records satisfy this pattern (query)?", discovery asks "What patterns satisfy this data?" Specifically, our concern is finding interesting and robust patterns that satisfy the data, where "interesting" usually means something unexpected and "robust" means a pattern expected to occur in the future.
Machine learning uses data science and makes it feasible to generate models which are able to make accurate predictions, classify things into categories, interpret images, etc. This makes machines and robots intelligent, as they can learn on their own and there is little need to worry about their accuracy. In today's world, where everything is getting automated, there surely is a need for mechanisms that can be trusted for their accuracy. Machine learning along with data science makes this possible. Machine learning is already helping several industries and is well praised for easing human effort.
Machine learning is a subset of artificial intelligence (AI) wherein algorithms learn by example from historical data to predict results and uncover patterns which humans cannot spot easily. For instance, ML can reveal customers who are likely to churn, possibly fraudulent insurance claims, etc. While ML has been around since the 1950s, recent breakthroughs in low-cost compute resources like cloud storage, simpler collection of data, and the proliferation of data science have made it very much "the next big thing" in business analytics. ML algorithms learn by example, and users then apply these self-learning algorithms to find insights, determine relationships, and make predictions about future trends. ML has practical implications across industry sectors, including healthcare, insurance, advertising, manufacturing, etc. When implemented successfully, ML allows organizations to find the most appropriate solutions to practical issues, which results in real, tangible business value.
Google is so far one of the biggest organizations on a hiring spree for trained Data Scientists. Since Google is largely driven by Data Science nowadays, it offers probably the best Data Science salaries to its employees. Amazon is a worldwide e-commerce and cloud computing giant that is recruiting Data Scientists on a very large scale. It needs Data Scientists to understand customer mentality and improve the geographical reach of both its e-commerce and cloud domains, among other business-driven objectives.
1.2 Problem Statement:
Loan approval is a very important procedure for banking businesses. The system approves or rejects loan applications. Recovery of loans is a major contributing parameter in the financial statements of a bank. It is very hard to predict the possibility of repayment of a loan by the customer. In recent years many researchers have worked on loan approval prediction systems. ML techniques are very useful in predicting outcomes for large amounts of data. In this project some ML algorithms like Logistic Regression, Decision Tree, Random Forest, etc. are implemented to predict loan approval for customers. The experimental results conclude that the accuracy of the Decision Tree ML algorithm is better in comparison to the other algorithms.
1.3 Objectives:
Nowadays, people depend on bank loans to meet their needs. The number of loan applications has increased very rapidly in recent years. Risk is always involved in the approval of loans. Banking officials are acutely aware of the risk of repayment of the loan amount by customers. Even after taking many precautions and analysing the loan applicant's information, loan approval decisions are not always correct. There is a need to automate this process so that loan approval becomes less risky and incurs less loss for banks.
Artificial Intelligence (AI) is a rising technology. The application of AI solves many real-world problems. Machine Learning is an AI technique which is very useful in prediction systems. A model is created from training data. While making a prediction, the model that is developed by the training (ML) algorithm is used. The ML algorithm trains the machine using a fraction of the available data, and the remaining data is used for testing.
Distribution of loans is the core business of almost all banks. The main portion of a bank's assets comes directly from the profit earned on the loans distributed by the bank. The prime objective in the banking environment is to invest these assets in safe hands. Today many banks and financial corporations approve loans after a rigorous process of verification and validation, but there is still no surety that the selected applicant is the most deserving applicant of all. Through this system we are able to predict whether a specific applicant is safe or not, and the whole process of validation of features is automated through ML techniques. The drawback of this model is that it assigns different weights to each factor, whereas in real life a loan can sometimes be approved on the basis of a single strong factor alone, which is not possible through this system. Loan prediction is very helpful for bank employees as well as for applicants. The goal of this project is to provide a quick and easy way to identify deserving applicants. It can offer unique benefits to the financial institution. The Loan Prediction System can automatically calculate the weight of every feature participating in loan processing, and on new test data the same features are processed with respect to their associated weights. A time limit can be set for the applicant to check whether his/her loan may be sanctioned or not. The Loan Prediction System also allows jumping to a specific application so that it can be checked on a priority basis.
The ML strategies can be implemented on sample data first and then used in making prediction-related decisions. This project applies ML procedures to solving the loan approval problem of the banking sector.
1.4 Methodology:
Since it is a major activity for banks to identify whether a loan of the desired amount should be approved for an applicant or not, Computer Science is capable of building such a system using Artificial Intelligence, which can make this tough decision accurately and quickly.
Using data science, which deals with large amounts of data efficiently, and some Machine Learning algorithms, a prediction system is built which, on the basis of a training data set, is capable of identifying whether a loan applicant is suitable for loan approval or not.
Machine Learning algorithms like Decision Tree, Logistic Regression, Random Forest, etc. are used for the analysis. These are efficient algorithms that are widely followed for data analysis and prediction.
The system will look into some basic information about the applicant, such as his/her profession, age, gender, marital status, etc., and after analysing all this information using visualization and machine learning algorithms, it will come to a decision.
Data science is the field of study that combines domain knowledge, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. Data scientists apply ML algorithms to numbers, text, images, video, audio, etc. to generate AI systems that carry out tasks which generally require human intelligence. In turn, those systems generate insights which analysts and users in related fields can turn into tangible business value.
Big data is a blanket term for any collection of data so large or complex that it becomes difficult to process it using traditional data management techniques such as, for instance, relational database management systems (RDBMS). The widely adopted RDBMS has long been regarded as a one-size-fits-all solution, but the demands of handling big data have proven otherwise. Data science involves using techniques to analyse big quantities of data and extract the knowledge it contains. You can think of the relationship between big data and data science as being similar to the relationship between crude oil and an oil refinery. Data science and big data evolved from statistics and traditional data management but are now considered to be distinct disciplines.
There are data scientists who are professionally sound and able to handle big data easily using data science.
Big data and data science are used almost everywhere, in both commercial and noncommercial settings, and the variety of use cases is large. Commercial organizations in nearly every industry use big data and data science to gain insights into their customers, processes, staff, competition, and products. Data Science is used by many businesses to offer clients a better user experience, as well as to cross-sell, up-sell, and personalize their offerings.
An example of this is Google AdSense. It generates advertisements for users based on their interests and past searches, which makes it easier for users to find the items they are interested in.
Human resource specialists analyse employees by analysing their moods and behaviours. Relations with co-workers can also be analysed this way.
Data science is used by financial institutions to predict stock markets, decide the risk of lending money, and discover ways to attract new customers for the institution's services.
Most trading nowadays takes place with the help of mechanisms which largely operate on machine learning algorithms. These are reliable machines and their outputs can be trusted without question.
The corporate sector also uses data science. There are a number of governmental bodies where data scientists hold important positions, as they have to deal with confidential records which are important to the organisation. These records can further be used to gain insights and also to develop data-driven applications.
Not only governmental, but certain non-governmental organisations are also responsible for dealing with big data. There are thousands of NGOs in every country, and all of them have to maintain a number of records, irrespective of their field of work. Hence data science and data scientists are required by these non-governmental organisations as well.
Universities use data science in their research and also to enhance the study experience of their students. The rise of massive open online courses (MOOCs) produces a lot of data, which allows universities to study how this form of learning can complement traditional teaching.
Data comes in several forms, each requiring different tools and techniques:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Streaming
Structured data is data that depends on a data model and resides in a fixed field within a record. As such, it is easy to store structured data in tables within databases or Excel files.
SQL (Structured Query Language) is the language which helps us deal with the records contained in tables within databases. SQL makes the manipulation of data contained in databases much easier, which would otherwise be a tough task.
Unstructured data is data whose content varies, and hence it does not fit neatly into a data model. An example of unstructured data is email. There are several structured components in an email, like the sender name and body, but it is difficult to identify the emails that refer to the same thing when they are sent by a number of people, because there are many different ways to refer to someone or something.
Natural language is a special form of unstructured data; processing it is hard, as it requires knowledge of specific data science techniques as well as linguistics. The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, however models trained in one domain do not generalize well to other domains.
The analysis of machine-generated data relies on highly scalable tools, due to its high volume and generation speed. Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
Graph data may be a confusing term, because any data can be shown in a graph. "Graph" in this case refers to mathematical graph theory, in which a graph is a mathematical structure used to model pairwise relationships. Graph or network data is, in short, data that focuses on the relationship or adjacency of objects. Graph structures use nodes, edges, and properties to represent and store graphical data. Graph-based data is a natural way to represent social networks, and its structure lets you calculate specific metrics such as the influence of a person and the shortest path between two people.
Audio, image, and video are types of data that pose specific challenges to a data scientist. Tasks that are easy for humans, such as recognizing objects in pictures, turn out to be challenging for computers. MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video capture to about 7 TB per game for live, in-game analytics. High-speed cameras at stadiums capture the movements of players and the ball to calculate, in real time, for instance, the path taken by a defender relative to two baselines.
Recently an organisation called DeepMind succeeded at developing an algorithm that is capable of learning how to play video games. The video game screen acts as the input, and the algorithm learns to interpret everything via a complex process of deep learning. It was an exceptional feat that prompted Google to buy the company for its own Artificial Intelligence (AI) development plans. The learning algorithm takes in data as it is produced by the computer game; it is streaming data.
While streaming data can take almost any of the previous forms, it has an additional distinguishing property: the data flows into the system when an event happens instead of being loaded into a data store in a batch. Although this is not really a different form of data, we treat it here as such because you need to adapt your process to deal with this type of information. Examples are the "What's trending" list on Twitter, live events like matches and concerts, and the stock market.
The data science process typically consists of the following steps:
1. Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
Data science is in most cases applied in the context of an organization. When the business asks you to carry out a data science project, you first prepare a project charter. This charter contains information such as what you are going to research, how the organization benefits from it, what data and resources you need, a timetable, and deliverables.
The second step is to collect data. The project charter states which data is needed and where to find it. In this step you ensure that you can use the data in your program, which means checking the existence of, quality of, and access to the data. Data can also be delivered by third-party companies and takes many forms, ranging from Excel spreadsheets to different types of databases.
Collecting data is an error-prone process; in this phase you enhance the quality of the data and prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing removes false values from a data source and inconsistencies across sources, data integration enriches data sources by combining data from multiple sources, and data transformation ensures that the data is in a suitable format for use in the model.
Data exploration is concerned with building a deeper understanding of the data. You try to understand how variables interact with each other, the distribution of the data, and whether there are outliers. To achieve this you mainly use descriptive statistics, visual techniques, and simple modeling. This step is known as Exploratory Data Analysis and often goes by the abbreviation EDA.
In the modeling phase you use models, domain knowledge, and the insights about the data found in the preceding steps to answer the research question. You select a technique from the fields of statistics, ML, operations research, and so on. Building a model is an iterative process that involves selecting the variables for the model, executing the model, and model diagnostics.
Finally, you present the results to your business. These results can take many forms, ranging from presentations to analysis reports. Sometimes you will need to automate the execution of the process, because the business will want to reuse the insights gained in another project or enable an operational process to use the outcome of your model.
ML is an application of AI that gives systems the ability to learn automatically and improve from experience without being programmed explicitly. ML centers on the development of computer programs that can access data and use it to learn for themselves.
To accomplish ML, specialists create general-purpose algorithms that can be used on large classes of learning problems. When you want to solve a specific task, you only need to feed the algorithm more specific data; in a sense, you are programming by example. In general, a computer will use data as its source of information, compare its output to a desired output, and then correct for the difference. The more data or "experience" the computer gets, the better it becomes at its designated task, just like a human does.
For instance, as a user writes more text messages on a mobile phone, the phone learns the vocabulary that is used most often in the messages and can suggest these words instantly and with high accuracy. In the broader field of technology, ML is a subfield of AI and is closely related to mathematics and statistics.
Learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in the data and make better decisions in the future based on the examples provided. The primary aim is to allow computers to learn automatically without human intervention or assistance and to adjust their actions accordingly.
• Supervised ML algorithms can apply what has been learned in the past to new data, using labelled examples to predict future events. Starting from the analysis of a known training dataset, the ML algorithm produces an inferred function to predict the output values. The system is capable of providing targets for any new input after adequate training. The algorithm can also compare its output with the correct, expected output and find errors in order to adjust the model accordingly.
• In contrast, unsupervised ML algorithms are used when the data used for training is neither classified nor labelled. Unsupervised learning studies how systems can infer a function to describe a hidden structure from unlabelled data. The system does not figure out the right output, but it explores the data and can draw inferences from datasets to describe hidden structures in unlabelled data.
Labelled data requires skilled and relevant resources in order to train a model or learn from it, whereas unlabelled data typically does not require extra resources.
Organizations that correctly implement ML and other AI technologies gain great benefits over others. According to a recent report by McKinsey & Company, AI will create $50 trillion of value by 2025. Organizations failing to do so may be unable to compete with those who embrace the new frontier.
Historically, ML has proved to be an effort-intensive method that calls for manual programming, limiting the ability of corporations to take full advantage of ML algorithms. Without teams of hard-to-find data scientists at their disposal, organizations are restricted in the range of models they are able to develop and test, and often those models take so long to develop that they are outdated by the time they are finished.
Two important concepts needed by data scientists are classification and regression. Hence, in order to gain the advantages of these two concepts, data scientists use ML. Some popular practical uses of regression and automatic classification are described below:
• Finding unknown places such as oil fields, gold mines, archaeological sites, etc. based on existing sites (classification and regression)
• Identifying the correct person with the help of pictures or even voice recordings (classification)
• Predicting the number of volcanic eruptions within a certain time period (regression)
From time to time, data scientists build models (an abstraction of reality) that describe how certain phenomena actually work. Sometimes the intention of a model is interpretation instead of prediction. This scenario is called root cause analysis.
• Understanding certain processes of organizations and optimising the required processes. For instance, identifying the products that add value to a product line.
This list of machine learning applications can only be seen as an appetizer, because ML is ever-present within data science. As discussed earlier, regression and classification are two important techniques, but the repertoire and the applications do not end there. Another important and beneficial technique is clustering. ML techniques prove beneficial throughout the process of data science.
ML is mostly associated with the data modeling step of data science. Still, ML can be used in almost every other step of the data science process as well.
It is necessary to have qualitative raw data before the data modeling phase can start, and ML can already be used before this step, during data preparation. For instance, suppose a list of strings needs to be cleansed: comparing the strings to spot spelling errors can be done by placing similar strings together, and this grouping can be done by ML algorithms. More generally, ML algorithms are capable of finding patterns in data and bringing the required data together, which would otherwise have been a difficult errand.
It would not be correct to give the full credit to ML alone. Without certain Python libraries these activities would not be possible. Python libraries allow the manipulation and processing of large sets of data. Python provides a large number of packages, containing several libraries, that make ML efficient and beneficial to use.
• The first type of package supports basic tasks and the fitting of data into memory.
• The second type of package enables code optimization, since after prototyping is done, severe issues related to speed and memory can arise.
• The third type of package enables the use of Python with big data technologies.
These packages provide a large number of functionalities in very few lines of code.
• SciPy: a library that integrates fundamental packages such as NumPy, matplotlib, Pandas, and SymPy.
• NumPy: gives access to array functions and linear algebra.
• Matplotlib: often described as a 2D plotting package with some 3D functionality.
• SymPy: a package which helps with computational algebra and symbolic mathematics.
• StatsModels: a package for statistical methods.
• NLTK (Natural Language Toolkit): a Python toolkit that focuses on text analytics.
It is a wise choice to begin with Python using these libraries. The real performance requirements come into play when Python code has to be run at regular intervals in production. As we move into the production phase of an application, the libraries listed below help attain the required speed; this can even include connecting to big data infrastructures like Hadoop and Spark.
• PyCUDA: lets us write programs that will be executed on the GPU instead of the central processing unit. This makes it suitable for applications that require a lot of calculation, for example examining the strength of predictions by calculating several different outcomes based on a single start state.
• Blaze: allows working with massive sets of data by providing appropriate data structures which can be very big (even bigger than main memory).
• Dispy and IPCluster: these packages allow the programmer to write programs that can be distributed over a network of computers.
• PySpark: links Python with the big data framework Spark.
The first three steps (feature engineering and model selection, training the model, and model validation) are iterated until an appropriate model is found.
In some cases the final objective is explanation instead of prediction. In such cases the last step, applying the trained model to unseen data, is not present. For example, we may want to find the reasons for the extinction of animal species, not predict which species might go extinct in the near future.
Models can also be chained by combining more than one technique: in chaining, the output of one model is treated as an input to another model. Another way of combining multiple models is to train them separately and then combine only their results; this is called ensemble learning.
There are two major components in a model: the features and the target variable. The main objective of the model is the prediction of the target variable, for instance tomorrow's high temperature. The features are the variables that help in making this prediction; some example features for the temperature case are wind speed, today's temperature, and the movement of clouds. The models that predict most accurately are considered the best, and for this, feature engineering is the most helpful part of modeling.
Engineering features means creating appropriate predictors for the model. Since in this step the model recombines these features to attain the required predictions, this step is considered one of the most important.
Some features are variables you get directly from a data set. In practice you often need to derive features yourself, and they might be scattered among different data sets. In several projects we had to bring together more than 20 different data sources before we had the raw data we required. Many times an input needs to be transformed before it becomes a good predictor, or multiple inputs have to be combined. An example of combining multiple inputs would be interaction variables: the impact of each single variable is low, but if both are present their combined effect becomes large. This is especially true in chemical and medical environments. For instance, even though vinegar and bleach are harmless common household products by themselves, mixing them results in poisonous chlorine gas, a gas that killed thousands during World War I.
Many times, modeling techniques themselves are used to derive features, with one model's output used as input to a second model. This is commonly done in text mining: in order to categorize the content, documents first need to be annotated, and the number of places or people mentioned in the text may need to be counted. This process is not as easy as it sounds. First, a model is made to recognize certain words, such as the names of places or people; this information is then given to another model that is to be developed. Availability bias is a major mistake in the process of model making: if the model has availability bias, it will fail during validation, since it turns out not to be an accurate model.
Model training can be done once an efficient modeling technique has been chosen and the correct predictors are used in the correct place. In this step, training data is given to the model so that it can learn from this data. The popular modeling techniques have ready-made implementations available in nearly all programming languages, which makes it easy to train a model by running just a few lines of code. Other, more advanced data science techniques require heavy calculations and need to be implemented using modern techniques. After the model is trained successfully, it has to be checked whether it is capable of dealing with real-world problems or not.
There are several data modeling techniques in data science; one just has to choose the correct and efficient one. Basically, there are two distinguishing properties of a good model:
• It has good predictive power
• It works efficiently and accurately with new data (test data)
In order to get these properties right, an error measure needs to be defined which tells the extent to which the model is inaccurate, together with a strategy for validating it. Two common error measures are:
• The classification error rate for classification problems: the percentage of observations in the test data that the model mislabelled. The lower this rate, the better the model is considered to be.
• The mean squared error for regression problems: it measures how big the average error of the prediction is. Squaring the error has two consequences: an error in one direction cannot cancel out an error in the other direction (for instance, overestimating next month's turnover by 5,000 does not cancel out underestimating it by 5,000 the following month), and bigger errors get weighted more heavily.
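For reference, the two error measures described above can be written down explicitly (these are the standard definitions, where y_i is the true value, \hat{y}_i the predicted value, and n the number of observations in the test set):

    \text{classification error rate} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\hat{y}_i \neq y_i]

    \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2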
Some of the validation techniques are:
• Splitting the data: holding out a certain percentage of the data as a test set and training on the rest. This is a very common technique.
• K-fold cross-validation: the data is divided into k parts, and each part is used once as a test set while the remaining parts form the training set. The advantage of this technique is that all the data in the data set is used.
• Leave-one-out: this approach is similar to k-fold cross-validation with k equal to the number of observations. One observation is always left out and training is done on the rest of the data. This technique is used only on datasets that are not too big.
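A minimal sketch of the hold-out and k-fold techniques described above, assuming scikit-learn is available; the variable names and the classifier choice are only illustrative, not the exact code used in this project:

    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # X and y are assumed to already hold the feature matrix and the target vector
    # Hold-out validation: keep 20% of the data aside as a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = DecisionTreeClassifier().fit(X_train, y_train)
    print("hold-out accuracy:", model.score(X_test, y_test))

    # 5-fold cross-validation: every observation is used once for testing
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
    print("mean cross-validation accuracy:", scores.mean())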
• With L1 regularization, a penalty is given for using extra variables during the making of the model.
• With L2 regularization, the variance between the coefficients of the predictors is kept as small as possible. Overlapping variance between predictors makes it difficult to see the actual impact of each predictor; when there is no such overlap, the variance is interpreted more easily and clearly.
• Basically, regularization prevents a model from using too many features, and this in turn prevents over-fitting.
• Validation checks whether the model also works properly with real-world data. This is what makes the validation step important.
The following ML algorithms are used in this project:
• Logistic Regression
• Decision Tree
• Random Forest
• KNN
• Naïve Bayes
2.11.1 Logistic Regression:
• The goal is to find the best-fitting model to describe the relationship between the independent and dependent variables.
• It deals with probability to measure the relation between the dependent and independent variables.
2.11.2 Decision Trees:
One of the simplest and most famous classification algorithms is the Decision Tree algorithm. This algorithm is easy to interpret and understand.
The decision tree algorithm is capable of solving both classification and regression problems, which distinguishes it from many other supervised learning algorithms.
The main objective of using this algorithm is to predict the value/class of the target variable by learning simple decision rules.
To make a prediction using this algorithm, we begin at the root node. The value of the record's attribute is compared with the root node, and this comparison tells which node should be followed next.
• Categorical Variable Decision Tree: if the target variable is of categorical type, then the tree is called a Categorical Variable Decision Tree.
• Continuous Variable Decision Tree: if the target variable is of continuous type, then the tree is called a Continuous Variable Decision Tree.
Terminology:
1. Root Node: the first and topmost node of the decision tree, from which the other edges and nodes emerge downwards.
2. Splitting: the process of dividing a node into two or more sub-nodes.
3. Decision Node: a node that splits into sub-nodes after a decision.
4. Leaf / Terminal Node: the bottom-most nodes of the tree, which do not split any further.
5. Pruning: the process exactly opposite to splitting; in this process the sub-nodes of a node are removed.
7. Parent and Child Node: a node from which other nodes emerge is called a parent node, and the nodes so emerged are called child nodes.
Fig 6:Decision tree terms visualized
Every internal node acts as a test case for some attribute, and every edge running down from a node corresponds to a possible answer to that test case.
At the start, the entire training dataset is considered as the root of the decision tree. Before building the model, continuous variables are converted into categorical ones.
Our focus is on the ID3 algorithm used in the decision tree approach. This algorithm follows a top-down approach and is a greedy method.
1. The algorithm starts with the original dataset S as the root node.
2. In each iteration an unused attribute of S is considered and the Entropy and Information Gain of that attribute are calculated.
3. The attribute having either the minimum entropy or the maximum information gain is selected.
4. The set S is then split by the selected attribute to produce subsets of the data.
5. The same process is applied to each subset, considering only those attributes that have not been used previously.
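The entropy and information gain used in steps 2 and 3 are the standard ID3 quantities; for a set S with class proportions p_i, and an attribute A that splits S into subsets S_v:

    \text{Entropy}(S) = -\sum_{i} p_i \log_2 p_i

    \text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)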
2.11.3 Random Forest:
• Ensemble learning is an approach which says not to rely on only one model to make a prediction. Rather, a number of models are taken into consideration and, on the basis of the outputs of all these models, a conclusion is reached.
• The prediction made using this approach is far more accurate than it would be using only one model.
• Random forest is a kind of ensemble classifier which uses the decision tree algorithm in a randomized fashion.
• Initially we are provided with a training data set, i.e. the original data set (OD).
• Using this OD, a new data set is generated, i.e. the bootstrap data set (BD). This data set is generated by randomly sampling records from the OD and putting them in the BD. Duplicate records are allowed in the BD; however, it is better to have more unique data in the bootstrap data set.
• Now, considering the bootstrap data set, a decision tree is built. To decide the root node, a subset of the total number of variables present in the bootstrap data set is considered. For instance, if there are three variables in total apart from the target variable, then any two of these three variables are considered randomly for the root, and out of these, one becomes the root node of the decision tree.
• Similarly, a new bootstrap data set is generated for each decision tree to be built, and a number of decision trees are built in the same way using subsets of the total variables present in the bootstrap data set.
• After building a number of decision trees, test data is given for which the value of the target variable is to be predicted. This data is run through each and every decision tree and, keeping track of all the outputs generated by these trees, a prediction is made.
• The prediction will definitely be far more accurate than it would have been using only one decision tree.
2.11.4 KNN:
KNN is a rule that learns by memorizing. This algorithm requires the training data to be stored, and the neighbours are found at testing time from the stored training data.
The time complexity of answering a query with this algorithm is Θ(d·m), where d is the number of features and m the number of stored training examples, which makes the computation expensive at testing time. In cases where d is small, there are a number of data structures proposed in the computational geometry literature which bring the query time down to d^O(1) · log(m). However, these data structures have a space complexity of m^O(d), making the approach inappropriate for cases where d is large. To get around this problem, approximate search can be allowed, which results in improved search times.
• In the KNN algorithm, the Euclidean distance between the test data and the training data is measured.
• Suppose we are provided with a data set having a number of values, and we are given a value whose category is to be determined.
• We are also provided with a value K, which tells us how many of the closest neighbours of the test value to consider.
• To do so, the Euclidean distance between the test value and all the values in the table is calculated.
• The K values having the least distance from the test value are considered and their categories are checked.
• Based on the categories of these K nearest values, the category of the test data is predicted.
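A minimal sketch of this distance-and-vote procedure using only NumPy; the toy data, the feature values, and the knn_predict name are hypothetical and only illustrate the steps listed above:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_test, k=3):
        # Euclidean distance between the test point and every training point
        distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # Majority vote among the categories of the k nearest neighbours
        return Counter(y_train[nearest]).most_common(1)[0][0]

    X_train = np.array([[25, 30000], [40, 60000], [35, 52000], [22, 18000]])  # age, income
    y_train = np.array(["no", "yes", "yes", "no"])                            # category
    print(knn_predict(X_train, y_train, np.array([30, 45000]), k=3))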
2.11.5 Naïve Bayes:
This algorithm serves as a primitive demonstration of how the learning process can be simplified by estimating parameters and by taking some generative assumptions into consideration. Assume we want to predict a label y ∈ {0, 1} on the basis of a vector of features x = (x1, . . . , xd), where each xi is in {0, 1}. Recall that the Bayes optimal classifier predicts the label with the highest posterior probability:

    h(x) = argmax_{y ∈ {0,1}} P[Y = y | X = x]

In order to describe the probability function P[Y = y | X = x] in general we need 2^d parameters, each of which corresponds to P[Y = 1 | X = x] for a particular value of x ∈ {0, 1}^d. This means that the number of required examples grows exponentially with the number of features.
In the Naive Bayes approach, the generative assumption is made that the features are conditionally independent of each other given the label. That is,

    P[X = x | Y = y] = ∏_{i=1}^{d} P[Xi = xi | Y = y]

With this assumption and using Bayes' rule, the Bayes optimal classifier can be further simplified to

    h(x) = argmax_{y ∈ {0,1}} P[Y = y] ∏_{i=1}^{d} P[Xi = xi | Y = y]

so the total number of parameters that need to be estimated is 2d + 1. The generative assumption therefore significantly reduces the number of parameters we need to learn. When the parameters are estimated using the maximum likelihood principle, the resulting classifier is called the Naive Bayes classifier.
Neural Networks:
An artificial neural network is a computer science algorithm that is loosely comparable to a real biological nervous system. A network similar to the network of nerves in the body is created, which helps in learning, memorizing, and generating outputs for given inputs.
Biological Neuron:
• The axon endings have synaptic junctions, which form a sort of contact with other neurons; this contact is electrochemical in nature.
Different NN Types:
• Single-layer NNs
• Temporal NNs
• Self-organizing NNs
Typical applications of NNs include:
• Classification
• Pattern matching
• Pattern completion
• Optimization
• Control
• Data mining
Feedforward networks:
Input layer: the number of nodes in this layer is equal to the number of input values. Since the neurons of this layer do not participate in modifying the signal, these nodes are referred to as passive nodes; they are responsible for transferring the signal to the hidden layer.
Hidden layer: the network can have any number of hidden layers, and each hidden layer can have any number of neurons. The nodes present in the hidden layers are called active nodes, as they are responsible for modifying the signal.
Output layer: the number of values generated by the network as output is equal to the number of neurons present in this layer. These nodes are also active nodes.
Fig 8:Feedforward Network
Feedback Networks:
In feedback networks, the neurons used for recognizing a pattern provide a path for the output to be fed back as input, either directly or indirectly.
Lateral Networks:
• These can be treated as a compromise between feedforward and feedback networks.
The system is made to learn how it should generate correct outputs for each and every input provided to it.
After the learning procedure the whole network is trained. With the newly updated weights, the trained neural network should now generate outputs for new input values, and these outputs should be acceptably accurate.
Learning methods
• Supervised learning
• Unsupervised learning
• Reinforcement learning
CHAPTER-3: SYSTEM DEVELOPMENT
The system requirements for the algorithms to run efficiently and for the implementation of the
whole idea are:
• Windows 10 (64-bit)
• 8 GB RAM
• ANACONDA
• Python
3.2.1 NumPy:
• NumPy supports linear algebra operations; it has built-in functions for linear algebra and random number generation.
Packages such as SciPy and Matplotlib are also used quite frequently together with NumPy. SciPy refers to Scientific Python, and Matplotlib serves as a plotting library for visualization. When these packages are combined, they can provide functionality similar to Matlab, hence proving to be a replacement for Matlab. This option is now used more frequently than Matlab, which shows how capable Python truly is.
The NumPy package allows mathematical operators to be applied across entire arrays of data at once. NumPy proves beneficial for manipulating and performing calculations over the columns of datasets.
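A small illustrative sketch of the kind of column-wise NumPy operations referred to above; the arrays and their values are hypothetical:

    import numpy as np

    incomes = np.array([25000, 54000, 38000, 61000])          # one column of a dataset
    loan_amounts = np.array([120000, 300000, 150000, 280000])

    ratio = loan_amounts / incomes        # element-wise arithmetic over whole columns
    print(ratio.round(2))
    print(incomes.mean(), incomes.std())  # aggregate statistics without explicit loops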
3.2.2 Pandas:
One of the most famous Python libraries for the study and interpretation of data is Pandas.
It is well known that before the preparation of any model, it is necessary to set up the data sets for the model. For this very reason Pandas comes into play: the Pandas library is responsible for the preparation and extraction of adequate data for a data model.
Pandas provides useful functionality which helps in importing files, manipulating those files, and implementing certain significant concepts. It also enables grouping and separation of datasets.
• Pandas proves to be an adequate tool for processing and cleaning tabular data. For instance, data stored in a database or in spreadsheets can easily be processed using the Pandas library.
• To import a file into the system, a function of the read_* family is used.
• Pandas also provides methods for grouping, dividing, and extracting the data from a previously imported file.
• It is also possible to filter out a specific number of rows of records out of a large set of records using Pandas.
• Pandas, using matplotlib, enables visualizing the data in a desired way. Many kinds of graphs can be generated, be it a bar graph or a scatter plot.
• While making calculations, Pandas saves the effort of going through all the records of the dataframe one by one; calculations can also be made column-wise.
• The basic structure of a table containing information can be changed using Pandas in several ways. For instance, melt() can help make a wide table tidy by making it long, and pivot() can change a table from long to wide.
• Pandas provides a number of tools for working with dates and times.
• Not only numeric data: Pandas also proves helpful for operations on textual data contained in a table.
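A short sketch of a few of the Pandas operations listed above (reading, filtering, a column-wise calculation, grouping, and date handling); the file name loans.csv and the column names are hypothetical:

    import pandas as pd

    # Hypothetical file and column names, used only to illustrate the operations above
    df = pd.read_csv("loans.csv", parse_dates=["signup_date"])

    adults = df[df["age"] >= 18]                 # filter specific rows
    df["emi_ratio"] = df["emi"] / df["income"]   # column-wise calculation, no explicit loop
    print(df.groupby("job")["income"].mean())    # grouping and aggregation
    print(df["signup_date"].dt.year.head())      # date/time handling on a datetime column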
3.2.3 Matplotlib:
Matplotlib is a very popular library for visualization of data. This library allows plotting the data in various formats, and visualizing these plots is an easy task. This visualization of the data makes things more understandable; data interpretation becomes easier through visualization.
Hence we can say that Matplotlib is valuable for a developer to picture the information for better understanding.
The pyplot module enables developers to take control over the style of lines, the properties of text, etc. This module helps in plotting various kinds of graphs such as bars, histograms, and many more.
• Certain changes to a figure can be made using pyplot: for instance, creating a figure, specifying a particular plotting area in the figure, generating lines in the specified plotting area, making the plot more readable by adding labels, etc.
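A minimal pyplot sketch of the kind of figure handling described above; the data points are made up:

    import matplotlib.pyplot as plt

    ages = [22, 25, 30, 35, 40, 45]
    approvals = [2, 5, 9, 12, 8, 4]

    plt.figure()                          # create a figure
    plt.plot(ages, approvals, marker="o") # draw a line in the plotting area
    plt.xlabel("Age")                     # labels make the plot more readable
    plt.ylabel("Number of approvals")
    plt.title("Hypothetical approvals by age")
    plt.show()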
3.2.4 Seaborn:
Another important library, Seaborn, enables generating statistical graphics. The Seaborn library works well in cooperation with the data structures of Pandas and is built on top of the Matplotlib library. Its features include:
• An API that works on whole datasets to examine the inter-relation of the variables.
• Support for categorical variables when visualizing statistics.
• The choice of visualizing univariate or bivariate distributions and comparing them across subsets of the data.
• Support for linear regression models, estimating and plotting them for different kinds of features.
• Making complicated datasets look understandable and readable.
• Easy generation of even complex visualizations through sensible abstractions.
• Styling of figures generated by matplotlib using a number of default themes.
• Tools for choosing color palettes, which help in making plots readable.
The major objective of the Seaborn library is to demonstrate the importance of visualization in understanding data. Seaborn's functions are applicable to dataframes and data structures containing whole datasets, and they implicitly do the mapping and aggregation required to generate valuable plots.
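A small Seaborn sketch built on top of a Pandas dataframe, showing a categorical count plot and a distribution plot; the dataframe and its columns are hypothetical:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"job": ["admin", "technician", "admin", "student"],
                       "age": [34, 41, 29, 22]})

    sns.countplot(x="job", data=df)    # frequencies of a categorical variable
    plt.show()
    sns.histplot(df["age"], kde=True)  # univariate distribution
    plt.show()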
3.3 Scikit Learn:
Scikit Learn is a very important machine learning library. This library is used for traditional machine learning computations. Scikit Learn is built upon two major Python libraries, namely NumPy and SciPy.
Scikit Learn can be used for data mining and data processing, which makes it an essential tool for machine learning and development.
3.4 Anaconda:
Anaconda provides more than 300 libraries for data science, which makes it popular among developers working with data science.
Anaconda provides a simple and easily manageable environment which makes deployment of any project just a click away.
The notebook extends the console-based approach to interactive computing in a qualitatively new direction. The notebook provides a web-based application suited to the whole computational process: developing, documenting, and executing code, as well as communicating the results.
The Jupyter Notebook combines the following two components:
• A web application: a browser-based tool for authoring documents, which combines explanatory text, mathematics, computations, and their rich output.
• Notebook documents: a representation of all the content visible in the web application, including the inputs and outputs of the computations, explanatory text, mathematics, and images.
CHAPTER 4: PERFORMANCE ANALYSIS
In this project:
The first step of the project development is to import the required libraries:
• pandas
• numpy
• seaborn
• matplotlib.pyplot
• warnings
All these libraries have been discussed in detail in the literature survey.
Objects of all these libraries are created; these objects will help in accessing the methods present in these libraries.
The next step is to import the training data set into an object named train and the test data set into an object named test. This import is done with the pandas method for reading .csv files.
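A sketch of these first two steps, importing the libraries under their usual aliases and loading the two files; the file names train.csv and test.csv are assumptions, since the actual paths are shown only in the report's screenshots:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    import warnings
    warnings.filterwarnings("ignore")   # suppress warning messages in the notebook

    train = pd.read_csv("train.csv")    # training data, includes the target 'subscribed'
    test = pd.read_csv("test.csv")      # test data, without the target
    print(train.shape, test.shape)      # compare the shapes of the two datasets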
Next we check the features present in the loaded data sets.
The column subscribed is present only in the training dataset, and hence it is the target variable.
Fig 14:Shape of datasets
In the training data set there are 17 independent variables and 1 target variable (subscribed). The rest of the features of both datasets are identical.
The target variable subscribed will be predicted using a model trained on the train data.
Further, we will see which categorical and numerical variables are present in our dataset, and the data types of the variables will be checked.
To display some of the records of a dataset for reference, .head() is used. If a number is specified as the parameter of this function, that many rows are displayed; otherwise the top five records are displayed.
Univariate analysis:
Let us check the distribution of the target variable subscribed. Since this variable is of categorical type, let us generate a frequency table, its distribution, and a bar plot.
First of all, let us see how many 'yes' and how many 'no' values there are in the target variable subscribed.
Fig 18:Normalizing
Now we will generate a plot of the frequencies of the target variable subscribed using matplotlib.
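A sketch of the frequency table, the normalized version, and the bar plot referred to in Figs 17–19, continuing from the train dataframe and the imports of the earlier sketch:

    # Frequency counts of the target variable
    print(train["subscribed"].value_counts())

    # Same counts normalized to proportions
    print(train["subscribed"].value_counts(normalize=True))

    # Bar plot of the frequencies using matplotlib via pandas
    train["subscribed"].value_counts().plot(kind="bar")
    plt.title("Distribution of the target variable 'subscribed'")
    plt.show()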
Fig 19:Generating plot of features of target variable
It is observed that there are 3715 'yes' entries, which means that this many users have subscribed. This is about 12% of the total number of users.
Different variables will be explored so that the entire data set is better understood. Univariate analysis will help in analysing the variables individually, while bivariate analysis will enable us to see the relation of the different variables with the target variable. In order to see which variable influences the target variable the most, a correlation plot will be used.
The distribution of age will show the number of people belonging to each age group. Next, we look at the professions of the clients using the job variable.
Frequency table of job:
It is also seen that the number of students is very small; this is because students generally do not apply for a loan.
Now let us see how many users have a default history.
Fig 23:Plot of default
Bivariate analysis:
In bivariate analysis, we will see the relationship of the target variable subscribed with the other independent variables.
Fig 24:Job-Subscribed Crosstab
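A sketch of the job–subscribed crosstab shown in Fig 24, again continuing from the train dataframe of the earlier sketches; the same call with 'default' in place of 'job' gives the crosstab of Fig 26:

    job_sub = pd.crosstab(train["job"], train["subscribed"])
    print(job_sub)

    # Stacked bar chart of the row-normalized crosstab for easier comparison
    job_sub.div(job_sub.sum(axis=1), axis=0).plot(kind="bar", stacked=True)
    plt.ylabel("Proportion of clients")
    plt.show()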
Fig 26: Default-Subscribed Crosstab
Now we will look into the correlation between the numerical variables of the data set. This can tell us which variables have a greater tendency to affect the target variable subscribed. For this analysis, the target variable first needs to be converted into a numerical type.
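A sketch of this step, roughly what Fig 28 shows: the target is mapped to 0/1 on a copy of the data and the correlation matrix of the numeric columns is drawn as a heatmap; the exact styling is illustrative:

    train_num = train.copy()
    train_num["subscribed"] = train_num["subscribed"].map({"no": 0, "yes": 1})

    corr = train_num.select_dtypes(include="number").corr()  # correlation of numeric variables
    sns.heatmap(corr, annot=True, cmap="YlGnBu")
    plt.show()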
Fig 28:Correlation between numeric variables
It is visible that the target variable subscribed and duration are highly correlated. This makes sense, as a client with a longer call duration is probably showing more interest in the loan scheme, which implies that that particular client might be ready to apply for the loan.
Model Building:
Now the development of a model for predicting whether a user will apply for a loan or not will start.
Dummy variables will be used for converting the categorical variables into numerical variables, because sklearn models accept only numerical inputs.
ID, being a unique-valued variable, will be removed before applying dummies. The target variable subscribed will also be removed.
Fig 29:the .get_dummies() method
The train data set will be divided into two parts: 80% of the data will act as training data and the remaining 20% will be the validation data.
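A sketch of the preparation shown in Fig 29 together with the 80/20 split described above, continuing from the train dataframe; the random_state value is an arbitrary illustrative choice:

    from sklearn.model_selection import train_test_split

    target = train["subscribed"].map({"no": 0, "yes": 1})   # numeric target
    features = train.drop(columns=["ID", "subscribed"])     # drop ID and the target
    features = pd.get_dummies(features)                     # one-hot encode the categoricals

    X_train, X_val, y_train, y_val = train_test_split(
        features, target, test_size=0.2, random_state=12)   # 80% train, 20% validation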
Logistic Regression:
We will first build a Logistic Regression model, since logistic regression is well suited to classification problems.
Fig 31: Fitting the model
Now the accuracy of the predictions will be checked by calculating the accuracy on the validation set.
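A sketch of fitting the logistic regression model and checking accuracy on the validation split created above (roughly what Fig 31 shows); max_iter is an illustrative setting, not necessarily the one used originally:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    lreg = LogisticRegression(max_iter=1000)   # higher max_iter helps convergence
    lreg.fit(X_train, y_train)                 # train on the 80% split
    pred_val = lreg.predict(X_val)             # predict on the validation data
    print(accuracy_score(y_val, pred_val))     # validation accuracy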
Let us now try the decision tree algorithm to check whether it gives better accuracy.
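The same pattern with a decision tree, reusing the split from above; max_depth is an arbitrary illustrative hyper-parameter:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    clf.fit(X_train, y_train)
    print(accuracy_score(y_val, clf.predict(X_val)))   # validation accuracy of the tree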
Now let us check what accuracy is obtained with the Naïve Bayes algorithm.
Fig 35:Defining and fitting Naïve Bayes Model
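A sketch corresponding to Fig 35, fitting a Naïve Bayes model on the same split; the choice of GaussianNB (rather than another Naïve Bayes variant) is an assumption:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    nb = GaussianNB()
    nb.fit(X_train, y_train)
    print(accuracy_score(y_val, nb.predict(X_val)))   # validation accuracy of Naïve Bayes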
KNN algorithm:
Now we will see the prediction results made by the KNN algorithm.
Fig 37:Setting up for KNN
Using the KNN algorithm, we got an accuracy of nearly 89.9 percent.
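A sketch of the KNN setup behind Fig 37; n_neighbors=5 is the scikit-learn default and only an illustrative choice:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    print(accuracy_score(y_val, knn.predict(X_val)))   # reported in this project as roughly 0.899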
Now let us calculate the accuracy using another algorithm called Random Forest.
Fig 41:Setting up for Random Forest algorithm
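A sketch for the random forest step of Fig 41; n_estimators and max_depth are illustrative hyper-parameters, not necessarily those used in the report:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    rf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
    rf.fit(X_train, y_train)
    print(accuracy_score(y_val, rf.predict(X_val)))   # validation accuracy of the forest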
Then we will make two columns in this data frame: ID and subscribed. The values of ID will be the same as those of the test data set, and the values of subscribed will be taken from the output test_prediction.
This newly created data frame will be converted to a csv file using the .to_csv() method.
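A sketch of building and saving the result file (Fig 44), continuing from the variables of the earlier sketches: the fitted model is applied to the processed test set and the predictions are written out with the corresponding IDs. The output file name, the use of the random forest model here, and the mapping back to yes/no are all assumptions:

    # Prepare the test set the same way as the training features
    test_features = pd.get_dummies(test.drop(columns=["ID"]))
    test_features = test_features.reindex(columns=features.columns, fill_value=0)

    test_prediction = rf.predict(test_features)   # any of the fitted models could be used

    submission = pd.DataFrame({"ID": test["ID"],
                               "subscribed": test_prediction})
    submission["subscribed"] = submission["subscribed"].map({0: "no", 1: "yes"})
    submission.to_csv("submission.csv", index=False)   # save the result as a csv file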
Fig 44:Saving the result
Similarly, after checking the accuracy of every algorithm, predictions on the test data will be made and the results stored in new csv files.
CHAPTER 5: CONCLUSION
Nowadays, people depend on bank loans to meet their needs. The number of loan applications has increased very rapidly in recent years. Risk is always involved in the approval of loans. Banking officials are acutely aware of the risk of repayment of the loan amount by customers. Even after taking many precautions and analysing the loan applicant's information, loan approval decisions are not always correct. There is a need to automate this process so that loan approval becomes less risky and incurs less loss for banks.
Artificial Intelligence (AI) is a rising technology. The application of AI solves many real-world problems. Machine Learning is an AI technique which is very useful in prediction systems. A model is created from training data. While making a prediction, the model that is developed by the training (ML) algorithm is used. The ML algorithm trains the machine using a fraction of the available data, and the remaining data is used for testing.
In this project some ML algorithms like Logistic Regression, Decision Tree, Random Forest, etc. are implemented to predict loan approval for customers. The experimental results conclude that the accuracy of the Decision Tree ML algorithm is better in comparison to the other algorithms.
As it is certainly a very important procedure for a bank to check whether an applicant applying for a term loan should be approved or not, this project focuses on mechanizing this tedious task.
All the required algorithms were implemented successfully and accurate results were generated.