1) Introduction
A) Defining Problem Statement
Many banks have regarded lending to individuals as nearly risk-free, because borrowers are screened
with credit scores and some loans are backed by collateral. Recently, however, the banking system has
seen an increase in loan defaults, i.e. borrowers who are unable to pay their installments on time.
These defaults directly reduce the revenues of the banking system.
Nowadays, banks scrutinize each loan application to identify potential default cases, so that they can
predict which clients are likely to default on repayment and at which stage.
The main objective of this study is to understand how consumer attributes and loan attributes
influence the tendency to default. We proceed step by step to build a machine learning model that
predicts loan defaulters from the variables present in the dataset. The aim is to predict defaulters
correctly, so that the lending organization is in a position to decide whether or not to lend to a
particular person. The bank should maximize revenue and profit by minimizing defaults.
The business of a bank is to lend to people who can repay the loan on time. Two kinds of error are
costly. If the bank rejects an applicant who would have repaid, it loses revenue. If the bank approves
an applicant who cannot repay, it incurs a loss. Based on the predicted probability of default, the
bank should capture every opportunity to maximize revenue.
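The trade-off between the two errors can be illustrated with a small expected-value calculation. This is only a sketch: the function names expected_profit and approve_loan, the interest income and the loss on default are hypothetical figures chosen for illustration and do not come from the data set.

```r
# Hypothetical illustration: expected profit of approving a loan,
# given a predicted probability of default (p_default).
# interest_income and loss_if_default are assumed amounts, not taken from the data.
expected_profit <- function(p_default, interest_income, loss_if_default) {
  (1 - p_default) * interest_income - p_default * loss_if_default
}

# Approve only when the expected profit is positive.
approve_loan <- function(p_default, interest_income, loss_if_default) {
  expected_profit(p_default, interest_income, loss_if_default) > 0
}

# Example: a loan expected to earn 1,500 in interest, with 8,000 lost on default.
p <- c(0.05, 0.10, 0.20)
profit <- expected_profit(p, interest_income = 1500, loss_if_default = 8000)
decision <- approve_loan(p, interest_income = 1500, loss_if_default = 8000)
print(data.frame(p_default = p, expected_profit = profit, approve = decision))
```

Under these assumed figures the loan stops being worthwhile somewhere between a 10% and a 20% probability of default, which is why the accuracy of the predicted probabilities matters.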
2) Data Report
a) Understanding how data was collected in terms of time, frequency and methodology
The data set contains loan details from 1 June 2007 to 1 December 2015, recorded monthly, which
gives 103 months in total. All twelve months are present for every year except 2007, which has only
seven months (June to December). The data contains both categorical and numerical values, and we
need to analyze the impact of both on the dependent variable, loan status. The data describes the
loans sanctioned and their current status, such as repayment status (charged off, fully paid),
principal outstanding, etc.
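The month count can be checked with a short calculation. The sketch below only uses the start and end dates stated above; the variable names are illustrative.

```r
# Monthly sequence from the first to the last month covered by the data.
months <- seq(as.Date("2007-06-01"), as.Date("2015-12-01"), by = "month")
n_months <- length(months)

# Number of months per calendar year.
months_per_year <- table(format(months, "%Y"))

print(n_months)
print(months_per_year)
```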
b) Visual inspection of data (rows, columns, descriptive details)
The data has 41 variables and 226,786 observations. There are missing values in last credit pulled
date, revolving line utilization rate, last payment date, months since last delinquency, description
and next payment date. Overall, 5.2% of the observations are missing. The data contains both numeric
and categorical variables: 39% of the columns are discrete and 61% are continuous, made up of
25 numeric variables, 11 categorical variables and 5 date variables.
The attributes fall into several groups. Borrower attributes include annual income, debt-to-income
ratio, state, home ownership and employment length. Loan attributes include loan amount, interest
rate and loan term. Borrower banking attributes include number of open credit lines, revolving line
utilization rate and number of installment accounts. Loan status is the dependent variable. To apply
the models we need to create a binary variable in which default equals 1 and fully paid equals 0.
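A minimal sketch of how the missing-value percentages and the variable-type counts could be computed in base R. The data frame loans and its values are made up for illustration; only the column names are borrowed from the data set.

```r
# Toy data frame standing in for the loan data; values are invented.
loans <- data.frame(
  loan_amnt  = c(5000, 2400, 10000, 5000),
  revol_util = c(83.7, NA, 21.0, 28.3),
  grade      = c("B", "C", "C", "A"),
  desc       = c(NA, NA, "car repair", NA),
  stringsAsFactors = FALSE
)

# Percentage of missing values per column.
missing_pct <- round(colMeans(is.na(loans)) * 100, 2)

# Overall percentage of missing cells.
overall_missing_pct <- round(mean(is.na(loans)) * 100, 2)

# Count of numeric versus non-numeric columns.
is_num <- vapply(loans, is.numeric, logical(1))
type_counts <- c(numeric = sum(is_num), categorical = sum(!is_num))

print(missing_pct)
print(overall_missing_pct)
print(type_counts)
```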
a) Univariate analysis (distribution and spread for every continuous attribute, distribution of data in
categories for categorical ones)
> 19063/(19063+207723)*100
[1] 8.405722
Sl. No.  Variable             Insight
1        term                 Most loans are taken for a 36-month term
2        grade                Most loans fall under grades B and C
3        emp_length           Banks mostly lend to borrowers with more than 10 years of experience
4        home_ownership       Most borrowers have home ownership of Mortgage or Rent
5        verification_status  Most loan applications fall under Verified and Source Verified
6        pymnt_plan           Payment plan is "no" for almost all loan applications
7        purpose              Most loans are for debt consolidation and credit card
8        addr_state           The largest number of loan applications come from California
9        application_type     Almost all applications are from individuals
10       loan_status          91.59% of the loans are fully paid and the remaining 8.41% are in default
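The loan status split follows directly from the class counts used in the console calculation earlier in this section. A small sketch, in which the vector names are illustrative:

```r
# Class counts taken from the console calculation in this section.
status_counts <- c(Default = 19063, `Fully Paid` = 207723)

# Percentage of each class, rounded to two decimals.
status_pct <- round(status_counts / sum(status_counts) * 100, 2)
print(status_pct)

# With the full data frame the equivalent would be (assuming a column loan_status):
# round(prop.table(table(Newloandata$loan_status)) * 100, 2)
```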
NA's :164
   out_prncp         out_prncp_inv       total_pymnt     total_pymnt_inv  total_rec_prncp
 Min.   :    0.0   Min.   :    0.0   Min.   :    0   Min.   :    0   Min.   :    0
 1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.: 7195   1st Qu.: 7110   1st Qu.: 6000
 Median :    0.0   Median :    0.0   Median :12290   Median :12208   Median :10500
 Mean   :  982.7   Mean   :  982.3   Mean   :14455   Mean   :14358   Mean   :12503
 3rd Qu.:    0.0   3rd Qu.:    0.0   3rd Qu.:19728   3rd Qu.:19629   3rd Qu.:17075
 Max.   :35000.0   Max.   :35000.0   Max.   :57778   Max.   :57778   Max.   :35000
Insight from above (Grade vs Loan Amount): Higher loan amounts were distributed to Grade B and Grade C.
Insights from above (Loan Amount vs Interest Rate): Higher loan amounts were distributed at interest
rates of 8.9%, 10.99%, 12.12%, 7.9%, 13.11%, 15.61% and 16.29%. The major portion of the loans was
distributed at interest rates between 6.03% and 20.99%.
Insights from above (Grade vs Interest Rate): The lowest average interest rate is charged to Grade A
customers and the highest to Grade G customers. However, the major portion of the loans has been
given to Grades B and C, where the average interest rates are 11.50% and 14.53% respectively.
Insights from above (Loan Amount vs Purpose): The highest loan amounts have been taken for debt
consolidation and credit card.
Insights from above (Loan Amount vs Home Ownership): The highest loan amounts go to borrowers whose
home ownership is Mortgage or Rent.
Insights from above (Loan Status vs Home Ownership): More defaults come from Mortgage and Rent.
Banks should take more care when sanctioning loans to borrowers with mortgaged or rented homes.
Insights from above (Work Experience vs Loan Status vs Loan Amount): More loans have been given to
customers with 10+ years of experience, and more defaults also come from the same customers. Banks
should take more care when sanctioning loans to customers with 10+ years of experience.
Insights from boxplots of numerical values by loan status: More defaults occur where loan amounts,
the number of credit line accounts, interest cost, debt-to-income ratio, installments and outstanding
principal are higher, and where annual income is lower.
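Because the categories with the most defaults are also the ones with the most loans, it may help to look at default rates alongside default counts. The sketch below uses an invented toy data frame (loans) purely to show the calculation; none of the numbers are results from the report.

```r
# Toy data; values are invented for illustration only.
loans <- data.frame(
  home_ownership = c("MORTGAGE", "MORTGAGE", "MORTGAGE", "MORTGAGE",
                     "RENT", "RENT", "RENT", "OWN", "OWN"),
  loan_status    = c("Default", "Fully Paid", "Fully Paid", "Fully Paid",
                     "Default", "Fully Paid", "Fully Paid", "Default", "Fully Paid"),
  int_rate       = c(15.6, 8.9, 10.99, 12.12, 16.29, 7.9, 13.11, 20.99, 6.03)
)

is_default <- loans$loan_status == "Default"

# Number of defaults per category (what a count plot shows).
default_count <- tapply(is_default, loans$home_ownership, sum)

# Share of loans that defaulted per category.
default_rate <- tapply(is_default, loans$home_ownership, mean)

# Average interest rate by loan status (the quantity a boxplot compares visually).
mean_rate_by_status <- tapply(loans$int_rate, loans$loan_status, mean)

print(default_count)
print(default_rate)
print(mean_rate_by_status)
```

In this toy example every category has the same number of defaults, yet the default rates differ, which is the distinction the comparison is meant to expose.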
Correlation Plots
[Figure: correlation matrix of the numeric variables loan_amnt, funded_amnt, funded_amnt_inv,
int_rate, installment, annual_inc, dti, delinq_2yrs, inq_last_6mths, mths_since_last_delinq,
open_acc, revol_bal, revol_util, total_acc, out_prncp, out_prncp_inv, total_pymnt, total_pymnt_inv,
total_rec_prncp, total_rec_int, total_rec_late_fee, recoveries, collection_recovery_fee and
last_pymnt_amnt; color scale from -1 to 1]
Insights: Loan amount, funded amount, funded amount invested, total payment received, total payment
received on the invested amount, total principal received and installment are highly correlated with
one another.
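Highly correlated pairs can also be listed numerically rather than read off a plot. The following is a minimal sketch with invented toy data; the 0.8 cutoff and the object names are assumptions, not values taken from the report.

```r
# Toy numeric data; in the report these would be the numeric loan columns.
loan_amnt <- c(5000, 2400, 10000, 5000, 3000, 6500, 12000, 3000)
num_data <- data.frame(
  loan_amnt   = loan_amnt,
  funded_amnt = loan_amnt,  # identical to loan_amnt here by construction
  installment = loan_amnt * 0.033 + c(2, -3, 5, -1, 4, -2, 1, -4),
  dti         = c(27.65, 8.72, 20, 11.2, 5.35, 18.1, 9.4, 14.3)
)

# Pairwise correlations, ignoring missing values pair by pair.
cor_mat <- cor(num_data, use = "pairwise.complete.obs")

# List each pair once (upper triangle) when |r| exceeds the assumed cutoff.
cutoff <- 0.8
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

Pairs found this way are candidates for dropping one member before modeling, since they carry largely the same information.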
[Figure: correlation heatmap of all features, covering the numeric variables (including member_id)
and the dummy-coded levels of term, grade, emp_length, home_ownership, verification_status,
pymnt_plan, purpose, next_pymnt_d, application_type and loan_status; both axes labelled "Features";
legend "Correlation Meter" from -1.0 to 1.0]
Removal of unwanted variables: member ID, the date columns, the columns with the most missing values
(desc and months since last delinquency), variables that are mostly zero (such as
total_rec_late_fee and recoveries) and variables dominated by a single value (such as payment plan).
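Rules of this kind can be applied programmatically. The sketch below uses an invented toy data frame, and the thresholds (more than 50% missing, more than 80% of rows sharing one value) are assumptions chosen for illustration; the report does not state the cutoffs that were used.

```r
# Toy data frame; values are invented.
loans <- data.frame(
  member_id  = 1:6,
  loan_amnt  = c(5000, 2400, 10000, 5000, 3000, 6500),
  desc       = c(NA, NA, NA, "text", NA, NA),
  recoveries = c(0, 0, 0, 0, 0, 120),
  pymnt_plan = c("n", "n", "n", "n", "n", "n"),
  stringsAsFactors = FALSE
)

# Share of missing values per column.
share_missing <- colMeans(is.na(loans))

# Share of rows taken by the most frequent non-missing value per column.
share_top_value <- vapply(loans, function(x) max(table(x)) / length(x), numeric(1))

# Identifier columns are dropped by name; the rest by the assumed thresholds.
drop_cols <- union(
  "member_id",
  names(loans)[share_missing > 0.5 | share_top_value > 0.8]
)
reduced <- loans[, setdiff(names(loans), drop_cols), drop = FALSE]

print(drop_cols)
print(names(reduced))
```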
[Figure: missing values per feature in the reduced data set. Every feature shows 0% missing except
revol_util (0.07%). x-axis: "Missing Rows"; y-axis: "Features"; band: "Good"]
There are 164 missing values in the new data set, all in the revol_util variable.
[Figure: second missing-value plot. Every feature shown has 0% missing. x-axis: "Missing Rows";
band: "Good"]
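The report does not say how the 164 missing revol_util values were handled. One common option is median imputation, sketched below on an invented vector; this is an assumption for illustration, not a description of the method actually applied.

```r
# Toy vector with two missing values; the numbers are for illustration only.
revol_util <- c(83.7, 98.5, NA, 28.3, 87.5, 20.6, NA, 43.1)
n_missing_before <- sum(is.na(revol_util))

# Replace each missing value with the median of the observed values.
revol_util_filled <- ifelse(is.na(revol_util),
                            median(revol_util, na.rm = TRUE),
                            revol_util)

print(n_missing_before)
print(revol_util_filled)

# An alternative is to drop incomplete rows, e.g. df[complete.cases(df), ],
# which costs little here because so few rows are affected.
```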
Transformation of all categorical variables into factor variables (structure with factor variables):
> str(Newloandata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 226786 obs. of 29 variables:
 $ loan_amnt          : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
 $ funded_amnt        : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
 $ funded_amnt_inv    : num 4975 2400 10000 5000 3000 ...
 $ term               : Factor w/ 2 levels "36 months","60 months": 1 1 1 1 1 2 1 1 1 1 ...
 $ int_rate           : num 10.6 16 13.5 7.9 18.6 ...
 $ installment        : num 162.9 84.3 339.3 156.5 109.4 ...
 $ grade              : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 3 2 2 4 3 ...
 $ emp_length         : Factor w/ 12 levels "< 1 year","1 year",..: 3 3 3 5 11 7 3 5 1 6 ...
 $ home_ownership     : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 6 6 5 5 6 6 6 ...
 $ annual_inc         : num 24000 12252 49200 36000 48000 ...
 $ verification_status: Factor w/ 3 levels "Not Verified",..: 3 1 2 2 2 1 2 2 1 1 ...
 $ purpose            : Factor w/ 14 levels "car","credit_card",..: 2 12 10 14 1 3 3 2 3 5 ...
 $ addr_state         : Factor w/ 51 levels "AK","AL","AR",..: 4 15 5 4 5 4 5 15 25 5 ...
 $ dti                : num 27.65 8.72 20 11.2 5.35 ...
 $ delinq_2yrs        : num 0 0 0 0 0 0 0 0 0 0 ...
 $ inq_last_6mths     : num 1 2 1 3 2 2 0 2 1 2 ...
 $ open_acc           : num 3 2 10 9 4 14 12 11 11 14 ...
 $ revol_bal          : num 13648 2956 5598 7963 8221 ...
 $ revol_util         : num 83.7 98.5 21 28.3 87.5 20.6 67.1 43.1 81.5 70.2 ...
 $ total_acc          : num 9 10 37 12 4 23 34 11 23 28 ...
 $ out_prncp          : num 0 0 0 0 0 0 0 0 0 0 ...
 $ out_prncp_inv      : num 0 0 0 0 0 0 0 0 0 0 ...
 $ total_pymnt        : num 5861 3004 12226 5631 3938 ...
 $ total_pymnt_inv    : num 5832 3004 12226 5631 3938 ...
 $ total_rec_prncp    : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
 $ total_rec_int      : num 861 604 2209 631 938 ...
 $ total_rec_late_fee : num 0 0 17 0 0 ...
 $ last_pymnt_amnt    : num 172 650 357 161 111 ...
 $ loan_status        : Factor w/ 2 levels "Default","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
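A minimal sketch of converting every character column to a factor in base R. The toy data frame loans is invented; in the report the same step would apply to the full data set.

```r
# Toy data frame with two character columns and one numeric column.
loans <- data.frame(
  term      = c("36 months", "36 months", "60 months"),
  grade     = c("B", "C", "A"),
  loan_amnt = c(5000, 2400, 10000),
  stringsAsFactors = FALSE
)

# Identify the character columns and convert only those to factors.
char_cols <- vapply(loans, is.character, logical(1))
loans[char_cols] <- lapply(loans[char_cols], factor)

str(loans)
```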
Structure of the data after adding a new binary variable for loan status (loan_Default) and a new
variable for the ratio of funded amount to annual income (FundtoAnn):
> str(Newloandata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 226786 obs. of 31 variables:
 $ loan_amnt          : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
 $ funded_amnt        : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
 $ funded_amnt_inv    : num 4975 2400 10000 5000 3000 ...
 $ term               : Factor w/ 2 levels "36 months","60 months": 1 1 1 1 1 2 1 1 1 1 ...
 $ int_rate           : num 10.6 16 13.5 7.9 18.6 ...
 $ installment        : num 162.9 84.3 339.3 156.5 109.4 ...
 $ grade              : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 3 2 2 4 3 ...
 $ emp_length         : Factor w/ 12 levels "< 1 year","1 year",..: 3 3 3 5 11 7 3 5 1 6 ...
 $ home_ownership     : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 6 6 5 5 6 6 6 ...
 $ annual_inc         : num 24000 12252 49200 36000 48000 ...
 $ verification_status: Factor w/ 3 levels "Not Verified",..: 3 1 2 2 2 1 2 2 1 1 ...
 $ purpose            : Factor w/ 14 levels "car","credit_card",..: 2 12 10 14 1 3 3 2 3 5 ...
 $ addr_state         : Factor w/ 51 levels "AK","AL","AR",..: 4 15 5 4 5 4 5 15 25 5 ...
 $ dti                : num 27.65 8.72 20 11.2 5.35 ...
 $ delinq_2yrs        : num 0 0 0 0 0 0 0 0 0 0 ...
 $ inq_last_6mths     : num 1 2 1 3 2 2 0 2 1 2 ...
 $ open_acc           : num 3 2 10 9 4 14 12 11 11 14 ...
 $ revol_bal          : num 13648 2956 5598 7963 8221 ...
 $ revol_util         : num 83.7 98.5 21 28.3 87.5 20.6 67.1 43.1 81.5 70.2 ...
 $ total_acc          : num 9 10 37 12 4 23 34 11 23 28 ...
 $ out_prncp          : num 0 0 0 0 0 0 0 0 0 0 ...
 $ out_prncp_inv      : num 0 0 0 0 0 0 0 0 0 0 ...
 $ total_pymnt        : num 5861 3004 12226 5631 3938 ...
 $ total_pymnt_inv    : num 5832 3004 12226 5631 3938 ...
 $ total_rec_prncp    : num 5000 2400 10000 5000 3000 6500 12000 3000 1000 10000 ...
 $ total_rec_int      : num 861 604 2209 631 938 ...
 $ total_rec_late_fee : num 0 0 17 0 0 ...
 $ last_pymnt_amnt    : num 172 650 357 161 111 ...
 $ loan_status        : Factor w/ 2 levels "Default","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
 $ loan_Default       : chr "0" "0" "0" "0" ...
 $ FundtoAnn          : num 0.2083 0.1959 0.2033 0.1389 0.0625 ...
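The two derived variables can be reproduced as follows. The FundtoAnn values in the output above are consistent with funded_amnt divided by annual_inc, which is what this sketch assumes. The toy rows reuse the first five funded amounts and incomes shown above, but the fourth loan status is changed to "Default" purely for illustration. Storing loan_Default as an integer rather than as character (as in the output above) is a choice made here, not the report's.

```r
# First five funded amounts and incomes from the output above;
# the "Default" status in row 4 is invented for illustration.
loans <- data.frame(
  funded_amnt = c(5000, 2400, 10000, 5000, 3000),
  annual_inc  = c(24000, 12252, 49200, 36000, 48000),
  loan_status = factor(c("Fully Paid", "Fully Paid", "Fully Paid",
                         "Default", "Fully Paid"))
)

# Binary target: 1 for default, 0 for fully paid.
loans$loan_Default <- as.integer(loans$loan_status == "Default")

# Ratio of funded amount to annual income.
loans$FundtoAnn <- loans$funded_amnt / loans$annual_inc

print(loans)
```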
The data is unbalanced. How much imbalance is acceptable when developing the model depends on what
banks consider to be the average default rate on the total loans distributed.
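If rebalancing is wanted, one simple option is to downsample the majority class to a chosen default share. The sketch below is an assumption for illustration: the report does not state that downsampling was used, and the 25% target share and the toy data are invented.

```r
set.seed(42)  # makes the random sample reproducible

# Toy data with 8% defaults, similar in proportion to the loan data.
loans <- data.frame(
  id           = 1:100,
  loan_Default = c(rep(1L, 8), rep(0L, 92))
)

# Assumed target share of defaults in the training sample.
target_default_share <- 0.25

defaults     <- loans[loans$loan_Default == 1, ]
non_defaults <- loans[loans$loan_Default == 0, ]

# Number of non-defaults to keep so that defaults make up the target share.
n_keep <- round(nrow(defaults) * (1 - target_default_share) / target_default_share)
n_keep <- min(n_keep, nrow(non_defaults))

balanced <- rbind(defaults,
                  non_defaults[sample(nrow(non_defaults), n_keep), ])

print(prop.table(table(balanced$loan_Default)))
```

Downsampling discards data, so predicted probabilities from a model trained this way would need recalibrating before being compared with the original default rate.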
Clustering is an unsupervised technique. Because this is a supervised problem with a labelled
outcome, clustering is not applicable here.