The document discusses analyzing loan data from banks to predict loan defaults. It introduces the problem of increasing loan defaults impacting bank revenues. The objectives are to understand how consumer and loan attributes influence default tendencies and build a machine learning model to predict defaulters. This will help banks maximize revenue and profit by reducing defaults. The data contains over 200,000 loan applications with attributes on borrowers, loans, banking history, and loan status. Exploratory analysis will examine relationships between variables to understand default drivers.
1) Introduction

a) Defining problem statement

Many banks believed that lending to individuals was nearly risk-free, since such borrowers tend to have
good credit scores and the loans are sometimes backed by collateral. Recently, however, the banking
system has witnessed an increase in loan defaults, i.e. borrowers who are not able to pay their
installments on time. These loan defaults directly reduce the revenues of the banking system.
Nowadays, banks scrutinize each loan application to identify potential default cases, so that
they can predict which clients are likely to default on repayment and at which stage.

b) Need of the study/project

The major objective of this study is to understand how consumer attributes and loan attributes
influence the tendency to default. We will build, step by step, a machine learning model that
predicts loan defaulters from the variables present in the dataset. Our main objective is
to predict defaulters correctly, so that the lending organization is in a position to decide
whether or not to lend to a particular person. The bank should maximize revenue and profit by
minimizing defaults.

c) Understanding business/social opportunity:

The business of a bank is to lend to people who can repay their loans on time. The bank should not
refuse a loan to a person who can repay it, because each such refusal is lost revenue. At the same
time, the bank should not approve the application of a person who cannot repay the loan, because
each such approval leads to losses for the organization.
Based on the predicted probability of default, the bank should capture every opportunity
to maximize revenue.

2) Data Report

a) Understanding how data was collected in terms of time, frequency and methodology

The data set contains loan details from 1 Jun 2007 to 1 Dec 2015 at monthly frequency, i.e. 103 months
in total. We have data for all 12 months of every year except 2007, which has only 7 months
(Jun to Dec). The data contains both categorical and numerical values. We need to analyze the impact
of both categorical and numerical variables on the dependent variable, which is loan status. The data
describes the loans sanctioned and their current status, such as repayment status, charged off,
fully paid, principal outstanding, etc.
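
The month count stated above can be checked with a short base R sketch. The variable name `months` is only illustrative, and the start and end dates are taken from the paragraph above.

```r
# Monthly sequence from the first to the last month covered by the data
months <- seq(as.Date("2007-06-01"), as.Date("2015-12-01"), by = "month")

length(months)                 # total number of monthly periods: 103
table(format(months, "%Y"))    # months available per calendar year
```

The per-year table shows 7 months for 2007 and 12 months for each year from 2008 to 2015.
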
b) Visual inspection of data (rows, columns, descriptive details)

The data has 41 variables and 226786 observations. There are missing values in last credit pulled
date, revolving line utilization rate, last payment date, months since last delinquency, description and
next payment date; overall, 5.2% of observations are missing. The data has both numeric and
categorical variables: 39% of the columns are discrete and 61% are continuous. In total there are
25 numeric variables, 11 categorical variables and 5 date variables.

c) Understanding of attributes (variable info, renaming if required)

The data has several groups of attributes. Borrower attributes include annual income, debt-to-income
ratio, state, home ownership and employment length. Loan attributes include loan amount,
interest rate and loan term. Borrower banking attributes include number of open credit lines,
revolving line utilization rate and number of installment accounts. Loan status is the dependent
variable. To apply the models, we need to create a binary variable in which default equals 1 and
fully paid equals 0. As noted in the visual inspection, there are missing values in last credit pulled
date, revolving line utilization rate, last payment date, months since last delinquency, description
and next payment date.
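
A minimal sketch of the binary encoding described above, in base R. The vector `loan_status` here is a toy stand-in with illustrative values, and the name `loan_default` is an assumption, not necessarily the name used in the project.

```r
# Toy vector standing in for the loan_status column (values are illustrative)
loan_status <- c("Fully Paid", "Default", "Fully Paid", "Fully Paid", "Default")

# 1 = Default, 0 = Fully Paid
loan_default <- ifelse(loan_status == "Default", 1L, 0L)

table(loan_status, loan_default)   # cross-check the encoding
mean(loan_default) * 100           # share of defaults in percent
```

Storing the flag as an integer rather than as text keeps it directly usable for averages and for most modelling functions.
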

3) Exploratory data analysis

a) Univariate analysis (distribution and spread for every continuous attribute, distribution of data in
categories for categorical ones)

Distribution of categorical variables.


Insights from Distribution plot for categorical variables.
Table for Number of Defaulted and Fully paid

Default Fully Paid


19063 207723

> 19063/(19063+207723)*100

[1] 8.405722

Sl. No  Variable             Insight
1       term                 Most loans are taken for a 36-month term
2       grade                Most loans fall under grades B and C
3       emp_length           Banks lend mainly to borrowers with 10+ years of experience
4       home_ownership       Most borrowers have home ownership of Mortgage or Rent
5       verification_status  Most loan applications are Verified or Source Verified
6       pymnt_plan           Payment plan is "no" for almost all loan applications
7       purpose              Most loans are for debt consolidation and credit card
8       addr_state           The highest number of loan applications comes from California
9       application_type     Almost all applications are from individuals
10      loan_status          91.59% of the loans are fully paid and the remaining 8.41% are defaulted

Distribution for numeric variables:


Summary of the numerical variables

member_id loan_amnt funded_amnt funded_amnt_inv int_rate


Min. : 70699 Min. : 500 Min. : 500 Min. : 0 Min. : 5.32
1st Qu.: 1758068 1st Qu.: 7200 1st Qu.: 7200 1st Qu.: 7200 1st Qu.:10.25
Median : 8440297 Median :12000 Median :12000 Median :11975 Median :13.11
Mean :15517393 Mean :13543 Mean :13507 Mean :13427 Mean :13.49
3rd Qu.:23001889 3rd Qu.:18194 3rd Qu.:18000 3rd Qu.:18000 3rd Qu.:16.29
Max. :73507418 Max. :35000 Max. :35000 Max. :35000 Max. :28.99

installment annual_inc dti delinq_2yrs inq_last_6mths


Min. : 15.69 Min. : 3000 Min. : 0.00 Min. : 0.000 Min. :0.0000
1st Qu.: 239.55 1st Qu.: 45000 1st Qu.:10.62 1st Qu.: 0.000 1st Qu.:0.0000
Median : 364.96 Median : 64000 Median :16.03 Median : 0.000 Median :0.0000
Mean : 417.99 Mean : 73965 Mean :16.44 Mean : 0.259 Mean :0.8244
3rd Qu.: 547.43 3rd Qu.: 90000 3rd Qu.:21.86 3rd Qu.: 0.000 3rd Qu.:1.0000
Max. :1409.99 Max. :8900060 Max. :59.26 Max. :29.000 Max. :8.0000

mths_since_last_delinq open_acc revol_bal revol_util total_acc


Min. : 0.00 Min. : 0.00 Min. : 0 Min. : 0.00 Min. : 2.00
1st Qu.: 17.00 1st Qu.: 7.00 1st Qu.: 5812 1st Qu.: 35.40 1st Qu.: 17.00
Median : 32.00 Median :10.00 Median : 10868 Median : 55.00 Median : 24.00
Mean : 35.04 Mean :10.99 Mean : 15241 Mean : 53.67 Mean : 25.22
3rd Qu.: 51.00 3rd Qu.:14.00 3rd Qu.: 19065 3rd Qu.: 73.20 3rd Qu.: 32.00
Max. :151.00 Max. :76.00 Max. :1743266 Max. :892.30 Max. :150.00
NA's :124638 (mths_since_last_delinq) NA's :164 (revol_util)

out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp
Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 7195 1st Qu.: 7110 1st Qu.: 6000
Median : 0.0 Median : 0.0 Median :12290 Median :12208 Median :10500
Mean : 982.7 Mean : 982.3 Mean :14455 Mean :14358 Mean :12503
3rd Qu.: 0.0 3rd Qu.: 0.0 3rd Qu.:19728 3rd Qu.:19629 3rd Qu.:17075
Max. :35000.0 Max. :35000.0 Max. :57778 Max. :57778 Max. :35000

total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_amnt


Min. : 0.0 Min. : 0.0000 Min. :0 Min. :0 Min. : 0.0
1st Qu.: 629.7 1st Qu.: 0.0000 1st Qu.:0 1st Qu.:0 1st Qu.: 732.7
Median : 1311.1 Median : 0.0000 Median :0 Median :0 Median : 4956.3
Mean : 1951.0 Mean : 0.5893 Mean :0 Mean :0 Mean : 7159.7
3rd Qu.: 2485.4 3rd Qu.: 0.0000 3rd Qu.:0 3rd Qu.:0 3rd Qu.:10931.0
Max. :22777.6 Max. :286.7476 Max. :0 Max. :0 Max. :36475.6

Insights from distribution of Numerical Variables


b) Bivariate analysis (relationship between different variables , correlations)

Relationship between variables

Insight from above (Grade vs Loan Amount): Higher loan amounts are distributed to Grade B and Grade C.

Insights from above (Loan Amount vs Interest Rate): Higher loan amounts have been distributed at
interest rates of 8.9%, 10.99%, 12.12%, 7.9%, 13.11%, 15.61% and 16.29%. The major portion of the
loans has been distributed at interest rates between 6.03% and 20.99%.

Insights from above (Grade vs Interest Rate): The lowest average interest rate is charged to Grade A
customers and the highest average interest rate to Grade G customers. However, the major portion of
the loans has been given to Grades B and C, where the average interest rates are 11.50% and 14.53%.

Insights from above (Loan Amount vs Purpose): The highest loan amounts have been taken for debt
consolidation and credit card.

Insights from above (Loan Amount vs Home Ownership): The highest loan amounts go to borrowers whose
home ownership is Mortgage or Rent.

Insights from above (Loan Status vs Home Ownership): More defaults come from Mortgage and
Rent. Banks should take more care when sanctioning loans to borrowers with mortgaged or rented homes.

Insights from above (Work Exp vs Loan Status vs Loan Amount): More loans have been given to
customers with 10+ years of experience, and more defaults also come from the same
customers. Banks should take more care when sanctioning loans to customers with 10+ years of experience.
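
The insights above compare the number of defaults across categories. A category that receives the most loans will usually also show the most defaults, so it can help to look at the default rate next to the count. The following base R sketch shows the idea on made-up toy data; the data frame `toy` and the column name `loan_default` are assumptions, and the numbers are not taken from the loan data set.

```r
# Illustrative toy data, not taken from the loan data set
toy <- data.frame(
  home_ownership = c("MORTGAGE", "MORTGAGE", "MORTGAGE", "MORTGAGE",
                     "RENT", "RENT", "RENT", "OWN", "OWN"),
  loan_default   = c(1, 0, 0, 0, 1, 0, 0, 1, 0)
)

# Number of defaults per category (driven largely by how many loans each has)
default_count <- tapply(toy$loan_default, toy$home_ownership, sum)

# Default rate per category (defaults divided by loans in that category)
default_rate <- tapply(toy$loan_default, toy$home_ownership, mean)

cbind(default_count, default_rate)
```

In this toy example every category has one default, yet the rates differ because the categories contain different numbers of loans.
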

Box plot for numeric values by Loan Status

Insights from boxplot of numerical values by loan status: More defaults occur where we have higher
loan amounts, more credit line accounts, higher interest cost, higher debt-to-income ratio, lower
annual income, higher installments and higher outstanding principal.

Correlation Plots

Correlation plot for continuous variables: loan_amnt, funded_amnt, funded_amnt_inv, int_rate,
installment, annual_inc, dti, delinq_2yrs, inq_last_6mths, mths_since_last_delinq, open_acc,
revol_bal, revol_util, total_acc, out_prncp, out_prncp_inv, total_pymnt, total_pymnt_inv,
total_rec_prncp, total_rec_int, total_rec_late_fee, recoveries, collection_recovery_fee and
last_pymnt_amnt (color scale from -1 to 1).

Insights: Loan amount, funded amount, funded amount invested, total payment received, total
payment received by investors, total principal received and installment are highly correlated
with one another.
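
A correlation matrix of this kind can be produced with base R's `cor()`. The sketch below is only an illustration: it uses the first ten values of four columns as printed in the `str()` output of this report, and the 0.8 cutoff for a "high" correlation is an assumption.

```r
# First ten values of four columns as printed in the str() output of the report
sample_loans <- data.frame(
  loan_amnt   = c(5000, 2400, 10000, 5000, 3000, 6500, 12000, 3000, 1000, 10000),
  funded_amnt = c(5000, 2400, 10000, 5000, 3000, 6500, 12000, 3000, 1000, 10000),
  revol_util  = c(83.7, 98.5, 21, 28.3, 87.5, 20.6, 67.1, 43.1, 81.5, 70.2),
  total_acc   = c(9, 10, 37, 12, 4, 23, 34, 11, 23, 28)
)

# Pairwise Pearson correlations; "pairwise.complete.obs" skips pairs with NA
corr_matrix <- cor(sample_loans, use = "pairwise.complete.obs")
round(corr_matrix, 2)

# Variable pairs whose absolute correlation exceeds the cutoff (0.8 is an assumption)
high <- which(abs(corr_matrix) > 0.8 & upper.tri(corr_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(corr_matrix)[high[, "row"]],
           var2 = colnames(corr_matrix)[high[, "col"]])
```

On these ten rows loan_amnt and funded_amnt are identical, so their correlation is 1, which is in line with the insight above.
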

Correlation plot for all features, with the categorical variables expanded into dummy variables
(term, grade, emp_length, home_ownership, verification_status, pymnt_plan, purpose, next_pymnt_d,
application_type and loan_status) alongside member_id and the continuous variables. Both axes
show the features; the correlation meter runs from -1.0 to 1.0.

Removal of unwanted variables: member ID, date columns, columns with the most missing
values such as desc and months since last delinquency, variables that are mostly zero such as
total_rec_late_fee and recoveries, and near-constant variables such as payment plan.

Structure of the data after removal of unwanted data.


> str(loandata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 226786 obs. of 29 variables:
$ loan_amnt : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
$ funded_amnt : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
$ funded_amnt_inv : num 4975 2400 10000 5000 3000 ...
$ term : chr "36 months" "36 months" "36 months" "36 months" ...
$ int_rate : num 10.6 16 13.5 7.9 18.6 ...
$ installment : num 162.9 84.3 339.3 156.5 109.4 ...
$ grade : chr "B" "C" "C" "A" ...
$ emp_length : chr "10+ years" "10+ years" "10+ years" "3 years" ...
$ home_ownership : chr "RENT" "RENT" "RENT" "RENT" ...
$ annual_inc : num 24000 12252 49200 36000 48000 ...
$ verification_status: chr "Verified" "Not Verified" "Source Verified" "Source Verified" ...
$ purpose : chr "credit_card" "small_business" "other" "wedding" ...
$ addr_state : chr "AZ" "IL" "CA" "AZ" ...
$ dti : num 27.65 8.72 20 11.2 5.35 ...
$ delinq_2yrs : num 0 0 0 0 0 0 0 0 0 0 ...
$ inq_last_6mths : num 1 2 1 3 2 2 0 2 1 2 ...
$ open_acc : num 3 2 10 9 4 14 12 11 11 14 ...
$ revol_bal : num 13648 2956 5598 7963 8221 ...
$ revol_util : num 83.7 98.5 21 28.3 87.5 20.6 67.1 43.1 81.5 70.2 ...
$ total_acc : num 9 10 37 12 4 23 34 11 23 28 ...
$ out_prncp : num 0 0 0 0 0 0 0 0 0 0 ...
$ out_prncp_inv : num 0 0 0 0 0 0 0 0 0 0 ...
$ total_pymnt : num 5861 3004 12226 5631 3938 ...
$ total_pymnt_inv : num 5832 3004 12226 5631 3938 ...
$ total_rec_prncp : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
$ total_rec_int : num 861 604 2209 631 938 ...
$ total_rec_late_fee : num 0 0 17 0 0 ...
$ last_pymnt_amnt : num 172 650 357 161 111 ...
$ loan_status : chr "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
...

Missing value plot (features vs. missing rows): every feature has 0% missing rows except
revol_util, which has 0.07%; all features fall in the "Good" band.

There are 164 missing values in the new data set, all in the revol_util variable.
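
The report does not state how these missing values were treated. One simple option, shown here purely as an assumption, is to replace them with the median of the observed values; dropping the 164 rows would be another. The toy vector below is illustrative.

```r
# Toy revol_util values with missing entries (illustrative only)
revol_util <- c(83.7, 98.5, NA, 28.3, 87.5, NA, 67.1)

sum(is.na(revol_util))                     # number of missing values

# Replace missing entries with the median of the observed values
med <- median(revol_util, na.rm = TRUE)
revol_util_filled <- ifelse(is.na(revol_util), med, revol_util)

sum(is.na(revol_util_filled))              # 0 after imputation
```
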

Missing value plot after treatment of missing values: all 29 features show 0% missing rows
(band: Good).


Transformation of all categorical variables into factor variables (structure with factor variables)
> str(Newloandata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 226786 obs. of 29 variables:
$ loan_amnt : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
$ funded_amnt : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
$ funded_amnt_inv : num 4975 2400 10000 5000 3000 ...
$ term : Factor w/ 2 levels "36 months","60 months": 1 1 1 1 1 2 1 1 1 1 ...
$ int_rate : num 10.6 16 13.5 7.9 18.6 ...
$ installment : num 162.9 84.3 339.3 156.5 109.4 ...
$ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 3 2 2 4 3 ...
$ emp_length : Factor w/ 12 levels "< 1 year","1 year",..: 3 3 3 5 11 7 3 5 1 6 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 6 6 5 5 6 6 6 ...
$ annual_inc : num 24000 12252 49200 36000 48000 ...
$ verification_status: Factor w/ 3 levels "Not Verified",..: 3 1 2 2 2 1 2 2 1 1 ...
$ purpose : Factor w/ 14 levels "car","credit_card",..: 2 12 10 14 1 3 3 2 3 5 ...
$ addr_state : Factor w/ 51 levels "AK","AL","AR",..: 4 15 5 4 5 4 5 15 25 5 ...
$ dti : num 27.65 8.72 20 11.2 5.35 ...
$ delinq_2yrs : num 0 0 0 0 0 0 0 0 0 0 ...
$ inq_last_6mths : num 1 2 1 3 2 2 0 2 1 2 ...
$ open_acc : num 3 2 10 9 4 14 12 11 11 14 ...
$ revol_bal : num 13648 2956 5598 7963 8221 ...
$ revol_util : num 83.7 98.5 21 28.3 87.5 20.6 67.1 43.1 81.5 70.2 ...
$ total_acc : num 9 10 37 12 4 23 34 11 23 28 ...
$ out_prncp : num 0 0 0 0 0 0 0 0 0 0 ...
$ out_prncp_inv : num 0 0 0 0 0 0 0 0 0 0 ...
$ total_pymnt : num 5861 3004 12226 5631 3938 ...
$ total_pymnt_inv : num 5832 3004 12226 5631 3938 ...
$ total_rec_prncp : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
$ total_rec_int : num 861 604 2209 631 938 ...
$ total_rec_late_fee : num 0 0 17 0 0 ...
$ last_pymnt_amnt : num 172 650 357 161 111 ...
$ loan_status : Factor w/ 2 levels "Default","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
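
A conversion like the one reflected in the structure above can be sketched in base R as follows. The data frame `toy` and its values are illustrative; the actual project may have used a different approach.

```r
# Toy data frame with two character columns and one numeric column (illustrative)
toy <- data.frame(
  term  = c("36 months", "36 months", "60 months"),
  grade = c("B", "C", "A"),
  dti   = c(27.65, 8.72, 20),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leave numeric columns unchanged
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

Factor levels are sorted alphabetically by default, which matches the level order shown above (for example "A","B","C","D" for grade).
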

Structure of the data after adding a new binary variable for loan status and a new
variable for the ratio of funded amount to annual income.
> str(Newloandata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 226786 obs. of 31 variables:
$ loan_amnt : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
$ funded_amnt : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
$ funded_amnt_inv : num 4975 2400 10000 5000 3000 ...
$ term : Factor w/ 2 levels "36 months","60 months": 1 1 1 1 1 2 1 1 1 1 ...
$ int_rate : num 10.6 16 13.5 7.9 18.6 ...
$ installment : num 162.9 84.3 339.3 156.5 109.4 ...
$ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 3 2 2 4 3 ...
$ emp_length : Factor w/ 12 levels "< 1 year","1 year",..: 3 3 3 5 11 7 3 5 1 6 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 6 6 5 5 6 6 6 ...
$ annual_inc : num 24000 12252 49200 36000 48000 ...
$ verification_status: Factor w/ 3 levels "Not Verified",..: 3 1 2 2 2 1 2 2 1 1 ...
$ purpose : Factor w/ 14 levels "car","credit_card",..: 2 12 10 14 1 3 3 2 3 5 ...
$ addr_state : Factor w/ 51 levels "AK","AL","AR",..: 4 15 5 4 5 4 5 15 25 5 ...
$ dti : num 27.65 8.72 20 11.2 5.35 ...
$ delinq_2yrs : num 0 0 0 0 0 0 0 0 0 0 ...
$ inq_last_6mths : num 1 2 1 3 2 2 0 2 1 2 ...
$ open_acc : num 3 2 10 9 4 14 12 11 11 14 ...
$ revol_bal : num 13648 2956 5598 7963 8221 ...
$ revol_util : num 83.7 98.5 21 28.3 87.5 20.6 67.1 43.1 81.5 70.2 ...
$ total_acc : num 9 10 37 12 4 23 34 11 23 28 ...
$ out_prncp : num 0 0 0 0 0 0 0 0 0 0 ...
$ out_prncp_inv : num 0 0 0 0 0 0 0 0 0 0 ...
$ total_pymnt : num 5861 3004 12226 5631 3938 ...
$ total_pymnt_inv : num 5832 3004 12226 5631 3938 ...
$ total_rec_prncp : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
$ total_rec_int : num 861 604 2209 631 938 ...
$ total_rec_late_fee : num 0 0 17 0 0 ...
$ last_pymnt_amnt : num 172 650 357 161 111 ...
$ loan_status : Factor w/ 2 levels "Default","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
$ loan_Default : chr "0" "0" "0" "0" ...
$ FundtoAnn : num 0.2083 0.1959 0.2033 0.1389 0.0625 ...
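
The FundtoAnn values printed above are consistent with dividing funded_amnt by annual_inc. The sketch below checks this on the first five rows shown in the structure output; the name `fund_to_ann` is only illustrative.

```r
# First five funded_amnt and annual_inc values from the str() output above
funded_amnt <- c(5000, 2400, 10000, 5000, 3000)
annual_inc  <- c(24000, 12252, 49200, 36000, 48000)

# Ratio of funded amount to annual income
fund_to_ann <- funded_amnt / annual_inc
round(fund_to_ann, 4)   # 0.2083 0.1959 0.2033 0.1389 0.0625
```

Note that loan_Default is stored as text (chr) in the structure above; converting it with `as.integer()` or `factor()` would probably be needed before modelling.
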

4) Insights from EDA

a) Is the data unbalanced? If so, what can be done?

The data is unbalanced: only 8.41% of the loans are defaults. How far to rebalance it before
developing the model depends on what banks consider an average default rate on total loans distributed.
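
One common way to handle such imbalance, shown here as an assumption rather than as the method used in this project, is to downsample the majority class in the training data. The base R sketch below uses made-up toy data; `toy`, `loan_default` and the 10/90 split are illustrative. Class weights or a lower decision threshold are alternatives.

```r
set.seed(42)  # for a reproducible sample

# Toy data with a similar kind of imbalance (values are illustrative)
toy <- data.frame(
  loan_default = c(rep(1L, 10), rep(0L, 90)),
  loan_amnt    = round(runif(100, 500, 35000))
)

defaults   <- toy[toy$loan_default == 1L, ]
fully_paid <- toy[toy$loan_default == 0L, ]

# Keep all defaults and draw an equally sized random sample of fully paid loans
keep     <- sample(nrow(fully_paid), size = nrow(defaults))
balanced <- rbind(defaults, fully_paid[keep, ])

table(balanced$loan_default)   # 10 rows of each class
```

Any such resampling would normally be applied to the training set only, so that the test set keeps the original default rate.
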

b) Any insights using clustering (if applicable)

Clustering is an unsupervised technique, whereas here we are dealing with a supervised problem
with a known target, hence clustering was not applied.

c) Any other insights: All insights are presented with the graphs above.
