ST JOSEPH’S UNIVERSITY
BENGALURU-560027
LOAN ELIGIBILITY PREDICTION
MACHINE LEARNING PROJECT REPORT
Submitted by:
PRIYADARSHINI P
REG. NO: 23PGDDA03
Under the supervision of
AARAN LAWRENCE DLIMA
Department of Advanced Computing
St. Joseph’s University
36, Lalbagh Road
Bengaluru - 560027
Contents
Chapter 1: Introduction 1
1.1 About 1
1.2 Definition 1
1.3 Need for Study 1
1.4 Objectives 2
Chapter 2: Dataset 2
2.1 Information Regarding Dataset 2
2.2 Summary of the Dataset 3
2.3 Data Cleaning 5
Chapter 3: Exploratory Data Analysis 6
3.1 Univariate Analysis Observation 7
3.2 Independent variables (Categorical and Ordinal) 7
3.3 Log Transformation of Continuous Variables 8
3.4 Heatmap 9
Chapter 4: Model 10
4.1 Logistic Regression for Prediction 11
4.1.1 Data Building and model training 11
4.2 Random Forest Classifier 12
4.3 Confusion Matrix 12
Chapter 5: Results and Analysis 14
Chapter 6: Conclusion 15
LOAN ELIGIBILITY PREDICTION USING MACHINE LEARNING
Chapter 1: INTRODUCTION
ABOUT
Loan Prediction
Loans account for a large portion of bank profits. Although many people apply for loans, identifying a genuine applicant who will repay the money is difficult, and a manual screening process is prone to errors of judgement. We therefore build a machine learning-based loan prediction system that shortlists qualified applicants automatically. This benefits both the applicants and the bank's employees, and it significantly reduces the loan sanctioning time. In this work, we apply a few machine learning techniques to predict loan approval from applicant data.
Definition:
Loans are the main line of business for banks. The majority of a bank's earnings comes directly from the interest received on loans. Even when a bank authorizes a loan after a stringent verification and validation procedure, there is still uncertainty about whether the selected candidate is the right one, and completing this step manually takes considerable time. With machine learning we can predict whether a given applicant is safe, and the entire validation process is automated. Loan prediction is therefore very beneficial for bank employees.
Need For Study
Loans are a primary necessity in the modern world, and banks derive a large portion of their overall profit from them. Loans help people purchase homes, vehicles, and other goods, and help students manage their living and educational costs.
However, banks carry a lot of responsibility when assessing whether an applicant's profile is suitable for a loan.
Therefore, to make this job easier and to determine whether a candidate's profile is suitable, we employ machine learning with Python, making use of important features such as applicant income, credit history, marital status, and education.
The distribution of loans is the primary business of almost every bank, and the majority of a bank's assets are directly attributable to the profits made on the loans it provides. Placing these assets in safe hands is the main goal of the banking system. Today, many banks and financial institutions grant loans only after a lengthy procedure of verification and validation, yet it is still unclear whether the selected applicant is the worthiest of all the applicants. The proposed approach allows us to forecast whether a particular application is safe or not, and machine learning techniques automate the entire feature validation procedure.
Objectives
The main objective of this project is to identify patterns in a loan-approval dataset. Based on these patterns, a model will be constructed using classification algorithms to forecast the likelihood of loan default.
The analysis is conducted on customers' historical data. Subsequently, the most significant features, that is, the variables with the biggest impact on the prediction outcome, are identified.
CHAPTER 2: DATASET
Information Regarding Dataset:
The goal of the company is to automate the loan qualification process in real time, using the client information submitted on the online application form. This includes the applicant's gender, credit history, number of dependents, marital status, education, and income. To automate this procedure and target these clients specifically, the company wants to identify the customer segments that qualify for a given loan amount.
This is a standard supervised classification problem in which we must predict whether a loan will be approved or denied. The dataset attributes and their descriptions are shown below.
Preprocessing:
The gathered data may contain missing values, which can lead to inconsistent results. Preprocessing is therefore necessary to improve the results and increase the algorithm's efficiency. The variables need to be converted to suitable types and the outliers should be handled; to examine these problems we make use of charts and plots.
Variable: Description
Loan ID: Unique Loan ID
Gender: Male / Female
Married: Applicant married (Y/N)
Dependents: Number of dependents
Education: Applicant education (Graduate / Under Graduate)
Self Employed: Self-employed (Y/N)
Applicant Income: Applicant income
Coapplicant Income: Coapplicant income
Loan Amount: Loan amount in thousands
Loan Amount Term: Term of loan in months
Credit History: Credit history meets guidelines
Property Area: Urban / Semi Urban / Rural
Loan Status: Loan approved (Y/N)
Summary of the Dataset
We have 12 independent variables and 1 target variable, i.e. Loan Status in the training dataset.
Here in the above dataset, we have three formats of data types:
1. Object: Object format means variables are categorical. Categorical variables in our
dataset are Loan ID, Gender, Married, Dependents, Education, Self Employed,
Property Area, Loan Status.
2. int64: It represents the integer variables. Applicant Income is of this format.
3. float64: It represents variables that have decimal values; these are also numerical.
Let us now look at the different types of variables in the dataset: categorical, ordinal, and numerical.
Categorical features: These features have categories (Gender, Married, Self Employed,
Credit History, Loan Status)
Ordinal features: Variables in categorical features having some order involved (Dependents,
Education, Property Area)
Numerical features: These features have numerical values (Applicant Income, Co-applicant
Income, Loan Amount, Loan Amount Term)
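As a minimal sketch of this inspection (the DataFrame name loan_train matches the snippet used later in this report; the file name train.csv is an assumption), the shape and data types can be checked as follows:
import pandas as pd
# Load the training data (the file name is an assumption)
loan_train = pd.read_csv('train.csv')
# Inspect the number of rows/columns and the data type of each column
print(loan_train.shape)
print(loan_train.dtypes)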
DATA CLEANING
Machine learning works on the principle of garbage in, garbage out: if we feed useless junk data to the algorithm, its results will also be junk.
This is why the data must be cleaned before training. Occasionally you may be fortunate enough to obtain accurate results on a particular snapshot even with dirty data, but the true performance becomes apparent when the model is used to forecast on newer instances of data. The data must therefore be cleaned before the model is trained.
Let’s list out feature-wise count of missing values.
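A quick way to obtain this count (using the same loan_train DataFrame) is:
# Count of missing values per column
loan_train.isnull().sum()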
Loan ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self Employed 32
Applicant Income 0
Co-applicant Income 0
Loan Amount 22
Loan Amount Term 14
Credit History 50
Property Area 0
Loan Status 0
There are missing values in Gender, Married, Dependents, Self Employed, Loan Amount,
Loan Amount Term, and Credit History features.
Correlating attributes:
The relationships between attributes indicate which borrowers are more likely to repay their debts. Attributes that are individually significant include Property Area, Education, Loan Amount, and Credit History, which intuition also suggests is important.
In Python, boxplots can be used to examine the relationship between attributes.
We need to fill the null values. Since several columns have only a small number of missing values, we replace them using the mean for numerical columns and the mode for categorical columns.
For Credit History we fill the NaN values with the mode.
Loan Status contains Boolean values; to handle this we substitute 1 for Y and 0 for N. The same treatment is applied to the other Boolean-type columns.
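A minimal sketch of these imputation and encoding steps is shown below; the column names follow the table in Chapter 2 and may need to be adjusted to the actual file's headers:
# Fill missing numerical values with the mean
loan_train['Loan Amount'] = loan_train['Loan Amount'].fillna(loan_train['Loan Amount'].mean())
# Fill missing categorical values with the mode
for col in ['Gender', 'Married', 'Dependents', 'Self Employed', 'Loan Amount Term', 'Credit History']:
    loan_train[col] = loan_train[col].fillna(loan_train[col].mode()[0])
# Replace Y/N values with 1/0 in the Boolean-type columns
loan_train['Loan Status'] = loan_train['Loan Status'].map({'Y': 1, 'N': 0})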
Now all the features appear as numerical values.
We run the following code to check whether any null values remain.
loan_train.isnull().sum()
Gender 0
Married 0
Dependents 0
Education 0
Self Employed 0
Applicant Income 0
Co-applicant Income 0
Loan Amount 0
Loan Amount Term 0
Credit History 0
Property Area 0
Loan Status 0
dtype: int64
Chapter 3: Exploratory Data Analysis
Exploratory Data Analysis, known as EDA for short, is the stage where you fully comprehend the data.
We visualize the distributions, compute frequency counts, and gain an understanding of each variable separately. Additionally, we explore the relationships between the different combinations of predictor and response variables by making scatterplots, correlation plots, and other visualizations.
Every machine learning or predictive modelling project usually includes EDA, especially
when working with tabular datasets.
The main objectives of the EDA process are:
1. Recognize the factors that can be important in forecasting the Y (response).
2. Provide insights that help us comprehend the performance and business environment
better.
Below are some standard examples of how to perform EDA on the modelling dataset.
First, we check frequency counts in the dataset.
For example, 63.5% of the applicants are married and the remaining 36.5% are not.
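A small sketch of such a frequency count (again using the loan_train DataFrame) is:
# Absolute counts and percentage share for the Married column
print(loan_train['Married'].value_counts())
print(loan_train['Married'].value_counts(normalize=True) * 100)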
Univariate Analysis Observations:
1. More loans are approved than rejected.
2. The count of male applicants is higher than that of female applicants.
3. The count of married applicants is higher than that of unmarried applicants.
4. The count of graduates is higher than that of non-graduates.
5. The count of self-employed applicants is lower than that of non-self-employed applicants.
6. Most of the properties are located in semiurban areas.
7. Credit history is present for many applicants.
8. The count of applicants with number of dependents = 0 is the highest.
Independent variables (Categorical):
We check the value counts and plot bar graphs for ‘Gender’, ‘Married’, ‘Self Employed’ and ‘Credit History’.
From these four bar plots, we can make the following observations:
80% of the applicants in the dataset are male.
Around 65% of the applicants in the dataset are married.
Around 15% of the applicants in the dataset are self-employed.
Around 85% of the applicants have repaid their debts.
Independent variables (Ordinal):
We check value counts and plot bar graph for Dependents, Education and Property area.
The following inferences can be made from the above bar plots:
• Most of the applicants don't have any dependents.
• Around 80% of the applicants are Graduate.
• Most of the applicants are from the Semiurban area.
3.3 Log Transformation of Continuous Variables
In this step, we will split the data into training and testing sets and then preprocess the training data.
After that, we create a new attribute, Total Income:
# total income
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']
Next, we apply log transformation to the attributes.
Applicant Income Log
Co-applicant Income Log
Loan Amount Log
Loan Amount Term Log
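A minimal sketch of these transformations is given below; it uses numpy's log1p to avoid problems with zero values, and the source and new column names for the loan amount and term are assumptions:
import numpy as np
# Log-transform the skewed numerical attributes
df['ApplicantIncomeLog'] = np.log1p(df['ApplicantIncome'])
df['CoapplicantIncomeLog'] = np.log1p(df['CoapplicantIncome'])
df['LoanAmountLog'] = np.log1p(df['LoanAmount'])
df['LoanAmountTermLog'] = np.log1p(df['Loan_Amount_Term'])
df['TotalIncomeLog'] = np.log1p(df['Total_Income'])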
Total Income distribution:
We can see that it is shifted towards the left, i.e., the distribution is right-skewed. So, let's take the log transformation to make the distribution closer to normal.
Heatmap:
Let's now examine the correlation between the numerical variables. To see the correlation, we make use of a heatmap, which displays data through colour variation. Cells with a darker tint indicate a higher correlation.
The above heatmap shows the correlation between Loan Amount and Applicant Income. It also shows that Credit History has a high impact on Loan Status.
Data visualization is accomplished with the help of the Python package Seaborn, which is built on top of matplotlib. It offers a way to display data as statistical graphs, an engaging and informative way to share information. One of the plot types Seaborn supports is the heatmap, which uses a colour scheme to depict variation in related data; here we use it to draw a correlation heatmap for the data frame using Seaborn, pandas, and matplotlib.
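A minimal sketch of drawing such a correlation heatmap with Seaborn (using the df DataFrame from the earlier snippet) is:
import matplotlib.pyplot as plt
import seaborn as sns
# Correlation matrix of the numerical columns (numeric_only requires pandas 1.5 or newer)
corr = df.corr(numeric_only=True)
# Colour-coded heatmap with the correlation values annotated in each cell
sns.heatmap(corr, annot=True, cmap='BuPu')
plt.show()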
Chapter 4: Model
4.1 Logistic Regression for Prediction
In this step, we have many machine learning models available in the sklearn package, and we need to decide which model gives the better performance; that model is then used in the final stage.
First, we use LogisticRegression from the sklearn.linear_model package. A brief overview of logistic regression follows.
Logistic regression is a classification algorithm. Given a set of independent variables, it is used to predict a binary outcome (1/0, Yes/No, True/False). Dummy variables are used to represent binary or categorical outcomes. Logistic regression can also be thought of as a special case of linear regression in which the outcome variable is categorical and the quantity modelled is the log of odds, log(p / (1 - p)).
We split the data into a training set and a test set, apply K-fold cross validation, and then fit the logistic regression model. We then look at the logistic regression accuracy and the average cross-validation score.
Determining the number of features available for training and testing is a necessary step
before fitting the model.
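A minimal sketch of this step is shown below; it assumes the feature matrix X and the target y have already been prepared from the cleaned data, and the split ratio is an assumption:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit logistic regression and evaluate with 5-fold cross validation
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print('Validation accuracy:', model.score(X_test, y_test))
print('Average cross-validation accuracy:', cv_scores.mean())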
4.1.1 Data Building and model training
Here we determine the accuracy for the SVC, Decision Tree Classifier, and Random Forest Classifier.
SVM – SUPPORT VECTOR MACHINE
A supervised machine learning approach called Support Vector Machine (SVM) is used for
regression as well as classification. The SVM algorithm's primary goal is to locate the best
hyperplane in an N-dimensional space.
Decision Tree Classifier
Decision trees are a supervised learning technique; they are primarily employed for classification problems, although they can also be used for regression. This classifier is tree-structured, with internal nodes representing dataset attributes, branches representing decision rules, and leaf nodes representing outcomes.
The decisions or tests are performed on the basis of the features of the given dataset.
It is a graphical tool that shows all of the options for solving a problem or making a decision
given certain parameters.
Random Forest Classifier
A Random Forest classifier trains many decision trees on different subsets of the given dataset and combines them to increase the predictive accuracy.
Rather than depending on a single decision tree, the random forest forecasts the outcome based on the majority of the predictions from the individual trees.
The greater the number of trees in the forest, the higher the accuracy and the lower the risk of overfitting.
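A small sketch of how these classifiers can be compared (reusing X_train and y_train from the previous sketch; the hyperparameters are illustrative) is:
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Compare candidate models by mean 5-fold cross-validation accuracy
models = {
    'SVC': SVC(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    print(name, 'mean CV accuracy:', scores.mean())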
Among all of the above, the best performing model is Logistic Regression: it performs best on the validation data, with an accuracy score of 80.48%.
4.3 Confusion Matrix
A confusion matrix provides an overview of a machine learning model's performance on a set of test data. It shows how many instances were predicted correctly and incorrectly, and it is frequently used to assess classification models, which assign a categorical label to each input instance.
The matrix counts the number of instances of each outcome that the model produced on the test data.
True Positive (TP): The model predicts a positive outcome and the actual result is also positive.
True Negative (TN): The model predicts a negative outcome and the actual result is also negative.
False Positive (FP): The model predicts a positive outcome but the actual result is negative. FP is also known as a Type I error.
False Negative (FN): The model predicts a negative outcome but the actual result is positive. FN is also known as a Type II error.
Confusion Matrix for y_test and y_pred
Confusion Matrix for training dataset
Metrics based on Confusion Matrix Data
1. Accuracy
2. Precision
3. Recall
4. F1-score
5. Support
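A minimal sketch of producing the confusion matrix and the metrics listed above (using the fitted model and the splits from the earlier sketches) is:
from sklearn.metrics import confusion_matrix, classification_report
# Confusion matrix for the test set predictions
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
# Accuracy, precision, recall, F1-score and support in a single report
print(classification_report(y_test, y_pred))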
Further, we perform feature engineering.
Using domain knowledge, we can generate new features that may have an impact on the target variable. We create the following three new features (a code sketch follows the list):
1. Total Income: We aggregate the applicant's and co-applicant's income. If the total income is high, the chances of loan approval might also be high.
2. EMI: The EMI is the amount the borrower must pay each month to repay the loan. We add this variable because borrowers with large EMIs may find it challenging to repay their debt. We estimate the EMI as the ratio of the loan amount to the loan term.
3. Balance Income: This is the amount of money that remains after the EMI is paid. The idea behind this variable is that a high value indicates a high likelihood of loan repayment, which raises the chance that the loan will be approved.
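A minimal sketch of these derived features is given below; the EMI is approximated as a simple ratio that ignores interest, and the column names for the loan amount and term are assumptions:
# Total income of applicant and co-applicant
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']
# EMI approximated as loan amount divided by loan term (interest is ignored)
df['EMI'] = df['LoanAmount'] / df['Loan_Amount_Term']
# Income remaining after the EMI is paid (the loan amount is in thousands)
df['Balance_Income'] = df['Total_Income'] - df['EMI'] * 1000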
CHAPTER 5: RESULTS AND ANALYSIS
The loan applications are either accepted or denied by the system.
One important factor that influences a bank's financial statements is loan recovery. It is
exceedingly difficult to forecast whether a consumer will be able to repay a loan. For huge
data sets, machine learning (ML) approaches are highly helpful in predicting results. In our
study, the loan acceptance of consumers is predicted using three machine learning
algorithms: Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR).
The experimental results show that the accuracy of Logistic Regression is better than that of the Random Forest and Decision Tree approaches.
In this project, we learned how to create models to predict the target variable, i.e. whether the applicant will be able to repay the loan or not.
CHAPTER 6: CONCLUSION
For training, we have employed a variety of algorithms, including Random Forest, SVC,
Decision Tree, and Logistic Regression.
With an accuracy score of 80.48%, logistic regression outperforms all other algorithms on the
validation data.
My accuracy score following the test data's final submission was 80.18%.
Accuracy improved with the aid of feature engineering.
Remarkably, Logistic Regression performed better than all the ensemble models.