ST JOSEPH’S UNIVERSITY
BENGALURU-560027
LOAN ELIGIBILITY PREDICTION
MACHINE LEARNING PROJECT REPORT
Submitted by:
PRIYADARSHINI P
REG. NO: 23PGDDA03
Under the supervision of
AARAN LAWRENCE DLIMA
Department of Advanced Computing
St. Joseph’s University
36, Lalbagh Road
Bengaluru - 560027
Contents
Chapter 1: Introduction 1
1.1 About 1
1.2 Definition 1
1.3 Need for Study 1
1.4 Objectives 2
Chapter 2: Dataset 2
2.1 Information Regarding Dataset 2
2.2 Summary of the Dataset 3
2.3 Data Cleaning 5
Chapter 3: Exploratory Data Analysis 6
3.1 Univariate Analysis Observation 7
3.2 Independent variables (Categorical and Ordinal) 7
3.3 Log Transformation of Continuous Variables 8
3.4 Heatmap 9
Chapter 4: Model 10
4.1 Logistic Regression for Prediction 11
4.1.1 Data Building and model training 11
4.2 Random Forest Classifier 12
4.3 Confusion Matrix 12
Chapter 5: Results and Analysis 14
Chapter 6: Conclusion 15
LOAN ELIGIBILITY PREDICTION USING MACHINE LEARNING
Chapter 1: INTRODUCTION
ABOUT
Loan Prediction
Loans account for a large portion of bank profits. Although many people apply for loans, identifying a genuine applicant who will repay the money is difficult, and a manual screening process is prone to errors of judgement. We therefore build a machine learning-based loan prediction system that shortlists qualified applicants automatically. This benefits both the applicants and the bank's employees, and it significantly reduces the loan sanctioning time. In this work, we apply a few machine learning techniques to predict loan approval from applicant data.
Definition:
Loans are the main line of business for banks. The majority of a bank's earnings comes directly from the interest received on loans. Even when a bank authorizes a loan after a stringent verification and validation procedure, there is still uncertainty about whether the selected candidate is the right one, and completing this step manually takes considerable time. With machine learning we can predict whether a given applicant is safe, and the entire validation process is automated. Loan prediction is therefore very beneficial for bank employees.
Need For Study
Loans are a primary necessity in the modern world, and banks derive a large portion of their overall profit from them. Loans help people purchase homes, vehicles, and other goods, and help students manage their living and educational costs.
However, banks carry a lot of responsibility when assessing whether an applicant's profile is suitable for a loan.
Therefore, to make this job easier and to determine whether a candidate's profile is suitable, we employ machine learning with Python, making use of important features such as applicant income, credit history, marital status, and education.
The distribution of loans is the primary business of almost every bank, and the majority of a bank's assets are directly attributable to the profits made on the loans it provides. Placing these assets in safe hands is the main goal of the banking system. Today, many banks and financial institutions grant loans only after a lengthy procedure of verification and validation, yet it is still unclear whether the selected applicant is the worthiest of all the applicants. The proposed approach allows us to forecast whether a particular application is safe or not, and machine learning techniques automate the entire feature validation procedure.
Objectives
The main objective of this project is to identify patterns in a loan-approval dataset. Based on these patterns, a model will be constructed using classification algorithms to forecast the likelihood of loan default.
The analysis is conducted on customers' historical data. Subsequently, the most significant features, that is, the variables with the biggest impact on the prediction outcome, are identified.
CHAPTER 2: DATASET
Information Regarding Dataset:
The goal of the company is to automate the loan qualification process in real time, using the client information submitted on the online application form. This includes the applicant's gender, credit history, number of dependents, marital status, education, and income. To automate this procedure and target these clients specifically, the company wants to identify the customer segments that qualify for a given loan amount.
This is a standard supervised classification problem in which we must predict whether a loan will be approved or denied. The dataset attributes and their descriptions are shown below.
Preprocessing:
The gathered data may contain missing values, which can lead to inconsistent results. Preprocessing is therefore necessary to improve the results and increase the algorithm's efficiency. The variables need to be converted to suitable types and the outliers should be handled; to examine these problems we make use of charts and plots.
Variable: Description
Loan ID: Unique Loan ID
Gender: Male / Female
Married: Applicant married (Y/N)
Dependents: Number of dependents
Education: Applicant education (Graduate / Under Graduate)
Self Employed: Self-employed (Y/N)
Applicant Income: Applicant income
Coapplicant Income: Coapplicant income
Loan Amount: Loan amount in thousands
Loan Amount Term: Term of loan in months
Credit History: Credit history meets guidelines
Property Area: Urban / Semi Urban / Rural
Loan Status: Loan approved (Y/N)
Summary of the Dataset
We have 12 independent variables and 1 target variable, i.e. Loan Status in the training dataset.
Here in the above dataset, we have three formats of data types:
1. Object: Object format means variables are categorical. Categorical variables in our
dataset are Loan ID, Gender, Married, Dependents, Education, Self Employed,
Property Area, Loan Status.
2. int64: It represents the integer variables. Applicant Income is of this format.
3. float64: It represents variables that have decimal values; these are also numerical.
Let us now look at the different types of variables in the dataset: categorical, ordinal, and numerical.
Categorical features: These features have categories (Gender, Married, Self Employed,
Credit History, Loan Status)
Ordinal features: Variables in categorical features having some order involved (Dependents,
Education, Property Area)
Numerical features: These features have numerical values (Applicant Income, Co-applicant
Income, Loan Amount, Loan Amount Term)
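As a minimal sketch of this inspection (the DataFrame name loan_train matches the snippet used later in this report; the file name train.csv is an assumption), the shape and data types can be checked as follows:
import pandas as pd
# Load the training data (the file name is an assumption)
loan_train = pd.read_csv('train.csv')
# Inspect the number of rows/columns and the data type of each column
print(loan_train.shape)
print(loan_train.dtypes)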
DATA CLEANING
Machine learning works on the principle of garbage in, garbage out: if we feed useless junk data to the algorithm, its results will also be junk.
This is why the data must be cleaned before training. Occasionally you may be fortunate enough to obtain accurate results on a particular snapshot even with dirty data, but the true performance becomes apparent when the model is used to forecast on newer instances of data. The data must therefore be cleaned before the model is trained.
Let’s list out feature-wise count of missing values.
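A quick way to obtain this count (using the same loan_train DataFrame) is:
# Count of missing values per column
loan_train.isnull().sum()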
Loan ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self Employed 32
Applicant Income 0
Co-applicant Income 0
Loan Amount 22
Loan Amount Term 14
Credit History 50
Property Area 0
Loan Status 0
There are missing values in Gender, Married, Dependents, Self Employed, Loan Amount,
Loan Amount Term, and Credit History features.
Correlating attributes:
The relationships between attributes indicate which borrowers are more likely to repay their debts. Attributes that are individually significant include Property Area, Education, Loan Amount, and Credit History, which intuition also suggests is important.
In Python, boxplots can be used to examine the relationship between attributes.
We need to fill the null values. Since several columns have only a small number of missing values, we replace them using the mean for numerical columns and the mode for categorical columns.
For Credit History we fill the NaN values with the mode.
Loan Status contains Boolean values; to handle this we substitute 1 for Y and 0 for N. The same treatment is applied to the other Boolean-type columns.
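A minimal sketch of these imputation and encoding steps is shown below; the column names follow the table in Chapter 2 and may need to be adjusted to the actual file's headers:
# Fill missing numerical values with the mean
loan_train['Loan Amount'] = loan_train['Loan Amount'].fillna(loan_train['Loan Amount'].mean())
# Fill missing categorical values with the mode
for col in ['Gender', 'Married', 'Dependents', 'Self Employed', 'Loan Amount Term', 'Credit History']:
    loan_train[col] = loan_train[col].fillna(loan_train[col].mode()[0])
# Replace Y/N values with 1/0 in the Boolean-type columns
loan_train['Loan Status'] = loan_train['Loan Status'].map({'Y': 1, 'N': 0})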
Now all the features appear as numerical values.
We run the following code to check whether any null values remain.
loan_train.isnull().sum()
Gender 0
Married 0
Dependents 0
Education 0
Self Employed 0
Applicant Income 0
Co-applicant Income 0
Loan Amount 0
Loan Amount Term 0
Credit History 0
Property Area 0
Loan Status 0
dtype: int64
Chapter 3: Exploratory Data Analysis
Exploratory Data Analysis, known as EDA for short, is the stage where you fully comprehend the data.
We visualize the distributions, compute frequency counts, and gain an understanding of each variable separately. Additionally, we explore the relationships between the different combinations of predictor and response variables by making scatterplots, correlation plots, and other visualizations.
Every machine learning or predictive modelling project usually includes EDA, especially
when working with tabular datasets.
The main objectives of the EDA process are:
1. Recognize the factors that can be important in forecasting the Y (response).
2. Provide insights that help us comprehend the performance and business environment
better.
Below are some standard examples of how to perform EDA on the modelling dataset.
First, we check frequency counts in the dataset.
For example, 63.5% of the applicants are married and the remaining 36.5% are not.
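A small sketch of such a frequency count (again using the loan_train DataFrame) is:
# Absolute counts and percentage share for the Married column
print(loan_train['Married'].value_counts())
print(loan_train['Married'].value_counts(normalize=True) * 100)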
Univariate Analysis Observations:
1. More loans are approved than rejected.
2. The count of male applicants is higher than that of female applicants.
3. The count of married applicants is higher than that of unmarried applicants.
4. The count of graduates is higher than that of non-graduates.
5. The count of self-employed applicants is lower than that of non-self-employed applicants.
6. Most of the properties are located in semiurban areas.
7. Credit history is present for many applicants.
8. The count of applicants with number of dependents = 0 is the highest.
Independent variables (Categorical):
We check the value counts and plot bar graphs for ‘Gender’, ‘Married’, ‘Self Employed’ and ‘Credit History’.
From these four bar plots, we can make the following observations:
80% of the applicants in the dataset are male.
Around 65% of the applicants in the dataset are married.
Around 15% of the applicants in the dataset are self-employed.
Around 85% of the applicants have repaid their debts.
Independent variables (Ordinal):
We check value counts and plot bar graph for Dependents, Education and Property area.
The following inferences can be made from the above bar plots:
• Most of the applicants don't have any dependents.
• Around 80% of the applicants are Graduate.
• Most of the applicants are from the Semiurban area.
3.3 Log Transformation of Continuous Variables
In this step, we will split the data into training and testing sets and then preprocess the training data.
After that, we create a new attribute, Total Income:
# total income
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']
Next, we apply log transformation to the attributes.
Applicant Income Log
Co-applicant Income Log
Loan Amount Log
Loan Amount Term Log
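A minimal sketch of these transformations is given below; it uses numpy's log1p to avoid problems with zero values, and the source and new column names for the loan amount and term are assumptions:
import numpy as np
# Log-transform the skewed numerical attributes
df['ApplicantIncomeLog'] = np.log1p(df['ApplicantIncome'])
df['CoapplicantIncomeLog'] = np.log1p(df['CoapplicantIncome'])
df['LoanAmountLog'] = np.log1p(df['LoanAmount'])
df['LoanAmountTermLog'] = np.log1p(df['Loan_Amount_Term'])
df['TotalIncomeLog'] = np.log1p(df['Total_Income'])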
Total Income distribution:
We can see that it is shifted towards the left, i.e., the distribution is right-skewed. So, let's take the log transformation to make the distribution closer to normal.
Heatmap:
Let's now examine the correlation between the numerical variables. To see the correlation, we make use of a heatmap, which displays data through colour variation. Cells with a darker tint indicate a higher correlation.
The above heatmap shows the correlation between Loan Amount and Applicant Income. It also shows that Credit History has a high impact on Loan Status.
Data visualization is accomplished with the help of the Python package Seaborn, which is built on top of matplotlib. It offers a way to display data as statistical graphs, an engaging and informative way to share information. One of the plot types Seaborn supports is the heatmap, which uses a colour scheme to depict variation in related data; here we use it to draw a correlation heatmap for the data frame using Seaborn, pandas, and matplotlib.
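A minimal sketch of drawing such a correlation heatmap with Seaborn (using the df DataFrame from the earlier snippet) is:
import matplotlib.pyplot as plt
import seaborn as sns
# Correlation matrix of the numerical columns (numeric_only requires pandas 1.5 or newer)
corr = df.corr(numeric_only=True)
# Colour-coded heatmap with the correlation values annotated in each cell
sns.heatmap(corr, annot=True, cmap='BuPu')
plt.show()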
Chapter 4: Model
4.1 Logistic Regression for Prediction
In this step, we have many machine learning models available in the sklearn package, and we need to decide which model gives the better performance; that model is then used in the final stage.
First, we use LogisticRegression from the sklearn.linear_model package. A brief overview of logistic regression follows.
Logistic regression is a classification algorithm. Given a set of independent variables, it is used to predict a binary outcome (1/0, Yes/No, True/False). Dummy variables are used to represent binary or categorical outcomes. Logistic regression can also be thought of as a special case of linear regression in which the outcome variable is categorical and the quantity modelled is the log of odds, log(p / (1 - p)).
We split the data into a training set and a test set, apply K-fold cross validation, and then fit the logistic regression model. We then look at the logistic regression accuracy and the average cross-validation score.
Determining the number of features available for training and testing is a necessary step
before fitting the model.
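A minimal sketch of this step is shown below; it assumes the feature matrix X and the target y have already been prepared from the cleaned data, and the split ratio is an assumption:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit logistic regression and evaluate with 5-fold cross validation
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print('Validation accuracy:', model.score(X_test, y_test))
print('Average cross-validation accuracy:', cv_scores.mean())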
4.1.1 Data Building and model training
Here we determine the accuracy for the SVC, Decision Tree Classifier, and Random Forest Classifier.
SVM – SUPPORT VECTOR MACHINE
A supervised machine learning approach called Support Vector Machine (SVM) is used for
regression as well as classification. The SVM algorithm's primary goal is to locate the best
hyperplane in an N-dimensional space.
Decision Tree Classifier
Decision trees are a supervised learning technique; they are primarily employed for classification problems, although they can also be used for regression. This classifier is tree-structured, with internal nodes representing dataset attributes, branches representing decision rules, and leaf nodes representing outcomes.
The decisions or tests are performed on the basis of the features of the given dataset.
It is a graphical tool that shows all of the options for solving a problem or making a decision
given certain parameters.
Random Forest Classifier
A Random Forest classifier trains many decision trees on different subsets of the given dataset and combines them to increase the predictive accuracy.
Rather than depending on a single decision tree, the random forest forecasts the outcome based on the majority of the predictions from the individual trees.
The greater the number of trees in the forest, the higher the accuracy and the lower the risk of overfitting.
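A small sketch of how these classifiers can be compared (reusing X_train and y_train from the previous sketch; the hyperparameters are illustrative) is:
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Compare candidate models by mean 5-fold cross-validation accuracy
models = {
    'SVC': SVC(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    print(name, 'mean CV accuracy:', scores.mean())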
Among all of the above, the best performing model is Logistic Regression: it performs best on the validation data, with an accuracy score of 80.48%.
4.3 Confusion Matrix
A confusion matrix provides an overview of a machine learning model's performance on a set of test data. It shows how many instances were predicted correctly and incorrectly, and it is frequently used to assess classification models, which assign a categorical label to each input instance.
The matrix counts the number of instances of each outcome that the model produced on the test data.
True Positive (TP): The model predicts a positive outcome and the actual result is also positive.
True Negative (TN): The model predicts a negative outcome and the actual result is also negative.
False Positive (FP): The model predicts a positive outcome but the actual result is negative. FP is also known as a Type I error.
False Negative (FN): The model predicts a negative outcome but the actual result is positive. FN is also known as a Type II error.
Confusion Matrix for y_test and y_pred
Confusion Matrix for training dataset
Metrics based on Confusion Matrix Data
1. Accuracy
2. Precision
3. Recall
4. F1-score
5. Support
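A minimal sketch of producing the confusion matrix and the metrics listed above (using the fitted model and the splits from the earlier sketches) is:
from sklearn.metrics import confusion_matrix, classification_report
# Confusion matrix for the test set predictions
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
# Accuracy, precision, recall, F1-score and support in a single report
print(classification_report(y_test, y_pred))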
Further, we perform feature engineering.
Using domain knowledge, we can generate new features that may have an impact on the target variable. We create the following three new features (a code sketch follows the list):
1. Total Income: We aggregate the applicant's and co-applicant's income. If the total income is high, the chances of loan approval might also be high.
2. EMI: The EMI is the amount the borrower must pay each month to repay the loan. We add this variable because borrowers with large EMIs may find it challenging to repay their debt. We estimate the EMI as the ratio of the loan amount to the loan term.
3. Balance Income: This is the amount of money that remains after the EMI is paid. The idea behind this variable is that a high value indicates a high likelihood of loan repayment, which raises the chance that the loan will be approved.
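A minimal sketch of these derived features is given below; the EMI is approximated as a simple ratio that ignores interest, and the column names for the loan amount and term are assumptions:
# Total income of applicant and co-applicant
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']
# EMI approximated as loan amount divided by loan term (interest is ignored)
df['EMI'] = df['LoanAmount'] / df['Loan_Amount_Term']
# Income remaining after the EMI is paid (the loan amount is in thousands)
df['Balance_Income'] = df['Total_Income'] - df['EMI'] * 1000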
CHAPTER 5: RESULTS AND ANALYSIS
The loan applications are either accepted or denied by the system.
One important factor that influences a bank's financial statements is loan recovery. It is
exceedingly difficult to forecast whether a consumer will be able to repay a loan. For huge
data sets, machine learning (ML) approaches are highly helpful in predicting results. In our
study, the loan acceptance of consumers is predicted using three machine learning
algorithms: Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR).
The experimental results show that the accuracy of Logistic Regression is better than that of the Random Forest and Decision Tree approaches.
In this project, we learned how to create models to predict the target variable, i.e. whether the applicant will be able to repay the loan or not.
CHAPTER 6: CONCLUSION
For training, we have employed a variety of algorithms, including Random Forest, SVC,
Decision Tree, and Logistic Regression.
With an accuracy score of 80.48%, logistic regression outperforms all other algorithms on the
validation data.
My accuracy score following the test data's final submission was 80.18%.
Accuracy improved with the aid of feature engineering.
Remarkably, Logistic Regression performed better than all the ensemble models.