INTERNSHIP PROJECT REPORT ON
LOAN APPROVAL
PREDICTION
SUBMITTED BY
PREETHIKA SEQUEIRA
AT
ZEPHYR TECHNOLOGIES AND SOLUTIONS PVT LTD
5th Floor, Oberle Towers, Balmatta road, Bendoor,
Mangalore, Karnataka-575002, India
ACKNOWLEDGEMENT
We would like to express our sincere gratitude to Zephyr
Technologies for their invaluable support and guidance throughout
our course. Their expertise and dedication have greatly contributed
to our learning experience, helping us acquire the skills necessary
to successfully undertake and complete our project. The resources
and insights provided by Zephyr Technologies have been
instrumental in shaping our understanding and approach, and we are
deeply thankful for their commitment to our development.
Thank you, Zephyr Technologies, for being an integral part of our
journey.
ABSTRACT
The growing importance of data-driven decision-making has led to significant
advancements in the field of machine learning, particularly in domains like finance and
banking. One of the critical tasks in this field is automating loan approval processes,
which traditionally involve manual scrutiny of numerous factors. This project focuses
on predicting loan approval status based on various attributes such as income,
education, employment status, and credit score using machine learning algorithms. The
dataset used contains demographic and financial information about applicants,
including features such as number of dependents, income, loan amount, self-employment
status, and more. Multiple machine learning models, including Random Forest, Decision
Tree, Support Vector Machine (SVM), and Logistic Regression, are employed to predict
whether a loan application will be approved or rejected. The data undergoes
preprocessing steps such as missing value imputation, encoding of categorical
variables, and feature engineering before model training. Each model is trained on a
training set and evaluated on a held-out test set using accuracy, precision, recall,
and F1-score from classification reports, with confusion matrices visualizing the
results.
This project aims to not only predict loan approval outcomes efficiently but also offer
a comparison of different machine learning models in terms of performance and
interpretability. By leveraging the power of machine learning, this system provides an
automated and scalable solution to the loan approval process, which can help financial
institutions make quicker and more accurate decisions.
INDEX
1. INTRODUCTION
2. BACKGROUND STUDY
3. PROBLEM STATEMENT
4. DATASET
5. METHODOLOGY
6. IMPLEMENTATION
7. RESULT AND DISCUSSION
8. CONCLUSION
9. APPENDIX
10. BIBLIOGRAPHY
1. INTRODUCTION
Loan approval prediction is a crucial process in the financial sector that helps banks
and lending institutions make informed decisions regarding the eligibility of
individuals for loans. Traditionally, loan approval was based on manual processes
involving subjective assessments. However, with the advancement of data science
and machine learning techniques, this process has evolved to leverage predictive
models, allowing financial institutions to assess loan applications more accurately
and efficiently.
In this project, we aim to develop a machine learning model that can predict the
approval status of loan applications based on various input features. These features
include applicant information such as income, education, employment status, credit
score, and asset values, among others. The goal is to predict whether a loan
application will be approved or rejected based on these attributes.
The project involves data preprocessing, including handling missing values,
encoding categorical variables, and feature engineering to transform raw data into a
usable format. Multiple machine learning algorithms, such as Random Forest,
Decision Tree, Support Vector Machine (SVM), and Logistic Regression, are trained
on the dataset to build predictive models. The models are then evaluated based on
their accuracy and performance metrics, such as confusion matrices and
classification reports.
By using machine learning models, this project demonstrates the power of data-
driven decision-making in the loan approval process. It highlights how predictive
analytics can enhance efficiency, reduce human error, and make loan approval
processes more consistent and transparent.
The models trained and evaluated in this project provide a foundation for developing
automated loan approval systems, which could be integrated into banking systems to
streamline operations, enhance customer experiences, and reduce risks associated
with loan disbursements.
2. BACKGROUND STUDY
Loan approval prediction is a critical process in the banking and financial industry,
where the primary goal is to assess whether an applicant is eligible for a loan based on
various attributes such as income, credit score, employment status, and other financial
indicators. Traditionally, loan approvals were based on manual evaluations by bank
officers, which could be slow, prone to human error, and susceptible to bias. However,
with the advent of machine learning (ML) and data science, financial institutions are
now increasingly leveraging predictive models to automate and enhance the decision-
making process.
Predictive modeling for loan approval involves analyzing historical data from past loan
applicants and extracting patterns that can be used to predict the approval status of new
applicants. A variety of machine learning algorithms can be employed for this task,
including Random Forests, Decision Trees, Support Vector Machines (SVM), and
Logistic Regression. These algorithms can learn from historical data and make
predictions on unseen data based on features such as income, credit score, loan amount,
and employment status.
In this project, the focus is on predicting loan approval outcomes by employing
machine learning techniques to classify applicants into two categories: approved and
rejected. The dataset used for this task contains several features that influence the loan
approval decision, such as income, education, self-employment status, credit score
(CIBIL score), loan amount, and other relevant financial indicators.
The process involves several stages, including data preprocessing, where missing
values are handled and categorical variables are encoded into numerical values for
model training. Exploratory Data Analysis (EDA) is performed to understand the
relationships between the features and the target variable, which is the loan approval
status. After data preprocessing, the dataset is split into training and testing subsets,
and various machine learning models are trained and evaluated based on their accuracy
and performance.
By using these techniques, the goal of this project is to improve the efficiency and
accuracy of loan approval predictions, reducing the potential for human error, speeding
up the loan approval process, and making it more objective. Furthermore, automating
this process can help financial institutions manage large volumes of loan applications
more effectively, ensuring a quicker and more reliable evaluation of applicants.
3. PROBLEM STATEMENT
The problem at hand is the prediction of loan approval status based on various
applicant characteristics and financial information. In the current scenario, lending
institutions face challenges in accurately assessing the risk associated with loan
applicants, which can lead to either the rejection of creditworthy applicants or the
approval of loans to individuals with high default risks. This project aims to develop
a machine learning model that can predict the approval status of a loan application
(approved or rejected) based on input features such as income, credit score,
employment status, loan amount, and other financial factors. By leveraging various
classification algorithms, including Random Forest, Decision Tree, Support Vector
Machine (SVM), and Logistic Regression, this project seeks to create an efficient
system that can automate the loan approval process, reduce human biases, and
improve decision-making accuracy. The goal is to build a model with high predictive
accuracy and generalization capability to effectively predict loan approval outcomes
in real-world scenarios.
4. DATASET
This project uses the Loan Approval Prediction dataset from Kaggle.
About the Dataset:
The Loan Approval Prediction dataset contains various features that capture
information about an individual's financial and personal details, which are used to
predict whether a loan application will be approved or rejected. These features
represent aspects such as income, credit score, loan amount, and employment status.
The columns of the dataset are listed below:
• Loan ID: Unique identifier for each loan application.
• Number of Dependents: The number of dependents the applicant has, which
could influence their ability to repay the loan.
• Education: The educational qualification of the applicant (e.g., Graduate, Not
Graduate).
• Self Employed: Indicates whether the applicant is self-employed (Yes/No).
• Income (Annually): The annual income of the applicant, which helps determine
their ability to repay the loan.
• Loan Amount: The total amount of money requested by the applicant.
• Loan Term: The duration of the loan (typically in years).
• CIBIL Score: A credit score indicating the applicant's creditworthiness.
• Residential Assets Value: The value of assets owned by the applicant in their
residential property.
• Commercial Assets Value: The value of assets owned by the applicant in their
commercial property.
• Luxury Assets Value: The value of luxury assets owned by the applicant, which
could provide additional financial security.
• Bank Asset Value: The total value of assets held in the applicant's bank account.
• Loan Status (Target Variable): Indicates whether the loan was approved or
rejected. After label encoding, Approved is coded as 0 and Rejected as 1.
5. METHODOLOGY
WORKING MECHANISM:
a) DATA COLLECTION
Data collection is the foundation of building any machine learning model. For this
project, the dataset is taken from a publicly available repository (Kaggle) and
contains loan application details.
Key attributes in the dataset include:
• Applicant details (e.g., number of dependents, education level, self-employment
status)
• Financial details (e.g., income, loan amount, loan term)
• Credit history (e.g., CIBIL score)
• Asset information (e.g., residential/commercial asset values)
The quality and diversity of the data significantly impact the accuracy and robustness
of the loan prediction model.
b) DATA PREPROCESSING
Raw data often contains inconsistencies and missing values, which must be addressed
to ensure reliable analysis and modeling. The preprocessing steps include:
• Handling Missing Values: Missing values in categorical variables (e.g.,
education, employment status) are replaced with the mode. Missing values in
numerical variables (e.g., income, loan amount) are filled with the median.
• Encoding Categorical Variables: Transforming non-numeric variables into
numeric representations using techniques like Label Encoding.
• Feature Selection and Engineering: Dropping irrelevant columns like unique
identifiers (e.g., loan ID). Creating new features such as debt-to-income ratio or
net assets value.
• Normalization: Scaling numerical features to a common range (e.g., 0 to 1) to
ensure fair weight distribution during modeling.
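The imputation and scaling steps above can be sketched as follows. This is a minimal illustration on a small made-up table, not the project's dataset; the column names and values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical applicant table with gaps; all values are made up for illustration.
df = pd.DataFrame({
    "education": ["Graduate", None, "Not Graduate", "Graduate"],
    "income_annum": [4_000_000.0, 6_500_000.0, None, 9_000_000.0],
    "loan_amount": [10_000_000.0, None, 15_000_000.0, 30_000_000.0],
})

categorical_cols = ["education"]
numerical_cols = ["income_annum", "loan_amount"]

# Categorical gaps are filled with the mode (most frequent value).
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numerical gaps are filled with the median of the column.
for col in numerical_cols:
    df[col] = df[col].fillna(df[col].median())

# Min-max scaling maps each numerical column to the range 0 to 1.
for col in numerical_cols:
    col_min, col_max = df[col].min(), df[col].max()
    df[col] = (df[col] - col_min) / (col_max - col_min)

print(df)
```

In a real workflow the median, mode, minimum, and maximum would normally be computed on the training set only and then reused for the test set, so that no information from the test data leaks into preprocessing.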
c) MODEL SELECTION AND REQUIRED ALGORITHMS
Several machine learning algorithms are utilized to predict loan approval status. They
are selected for their ability to handle classification problems and for their
interpretability.
• Random Forest: Random Forest is an ensemble learning method that builds
multiple decision trees and merges their outputs for more accurate and stable
predictions. It reduces overfitting and improves generalization.
• Decision Tree: Decision Trees are used to split data into subsets based on the
most significant features. Nodes represent decisions, and branches represent
outcomes, making the process interpretable and efficient.
• Support Vector Machine (SVM): SVM aims to find a hyperplane that best
separates classes in a high-dimensional space. Kernel functions are applied to
handle non-linear relationships.
• Logistic Regression: Logistic Regression predicts the probability of loan
approval by modeling the relationship between independent variables and a
binary outcome.
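To make the Logistic Regression description concrete, the sketch below computes an approval probability as the sigmoid of a weighted sum of features. The weights, bias, and feature values are hypothetical and are not fitted to the project's data; the function name is also an assumption made for this example.

```python
import math

def approval_probability(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for two scaled features
# (for example a credit score and a loan-to-income ratio).
weights = [3.0, -1.5]
bias = -0.5

p = approval_probability([0.8, 0.4], weights, bias)
label = "approved" if p >= 0.5 else "rejected"
print(f"P(approved) = {p:.3f} -> {label}")
```

A probability of 0.5 is the usual decision threshold; a lender could raise it to approve only applications the model is more confident about.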
d) MODEL TRAINING AND EVALUATION
1. Dataset Splitting: The dataset is divided into training and testing sets (80%/20%
split) to evaluate model performance.
2. Training: Each model is trained using the training dataset to learn patterns
and relationships.
3. Evaluation Metrics:
o Accuracy, Precision, Recall, and F1-Score are calculated to assess the
effectiveness of each model.
o Confusion matrices are plotted for visual evaluation of predicted
versus actual results.
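The four metrics named above all follow from the counts in a binary confusion matrix. A minimal sketch, using made-up counts rather than results from this project:

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision answers "of the applications predicted positive, how many really were", while recall answers "of the applications that really were positive, how many were found". F1 is the harmonic mean of the two.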
e) MODEL SELECTION
After evaluating all models, the one with the best trade-off between performance
metrics and interpretability is selected for deployment. Random Forest, being robust
and accurate, is a common choice.
f) DEPLOYMENT AND INTEGRATION
The final model is integrated into an application or system that takes user input (e.g.,
financial details, credit history) and predicts loan approval status in real time. The
model is periodically retrained with new data to ensure its relevance and accuracy.
6. IMPLEMENTATION
Implementation refers to the process of integrating various elements of a program to
execute a specific task effectively. In computer programming, implementation involves
coding, testing, and running the program to ensure it performs as intended.
IMPORT LIBRARIES
• Pandas for managing and analyzing structured data.
• NumPy for numerical computations and array handling.
• Matplotlib and Seaborn for creating visualizations to explore and interpret
the data.
• Scikit-learn for splitting datasets, preprocessing data, building machine
learning models (Random Forest, Decision Tree, SVM, Logistic Regression),
and evaluating their performance using metrics like accuracy, classification
reports, and confusion matrices.
UPLOAD THE DATASET
• pd.read_csv('loan_approval_dataset.csv'): Reads the CSV file named
'loan_approval_dataset.csv' into a Pandas DataFrame called dataset.
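The source code in the Appendix refers to columns such as ' education' and ' loan_status' with a leading space, which suggests that the headers in the CSV file carry that space. The sketch below shows one possible way to normalize such headers and values. It uses a small in-memory CSV with made-up rows as a stand-in for the real file.

```python
import io
import pandas as pd

# Stand-in for loan_approval_dataset.csv; the rows are made up for illustration.
csv_text = (
    "loan_id, education, self_employed, loan_status\n"
    "1, Graduate, No, Approved\n"
    "2, Not Graduate, Yes, Rejected\n"
)

dataset = pd.read_csv(io.StringIO(csv_text))
print(list(dataset.columns))   # headers still carry the leading space

# Strip surrounding whitespace from the headers ...
dataset.columns = dataset.columns.str.strip()

# ... and from the text columns, so that ' Approved' becomes 'Approved'.
for col in ["education", "self_employed", "loan_status"]:
    dataset[col] = dataset[col].str.strip()

print(list(dataset.columns))
```

If the headers are cleaned this way, every later column reference has to use the stripped names, for example 'education' instead of ' education'.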
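DATA EXPLORATION
The first rows, the column information, and the summary statistics of the dataset are
inspected with dataset.head(), dataset.info(), and dataset.describe(). The full code
is listed in the Appendix.
DATA PREPROCESSING
Missing values are counted per column with dataset.isnull().sum().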
ENCODING CATEGORICAL VARIABLES
In this step, we handle categorical data by converting it into numerical form using
LabelEncoder from the sklearn.preprocessing library.
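LabelEncoder assigns integer codes in the sorted order of the class labels, which determines which class ends up as 0 and which as 1. A minimal sketch with made-up target values:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target values for illustration.
loan_status = ["Approved", "Rejected", "Approved", "Approved", "Rejected"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(loan_status)

# classes_ lists the original labels in the order of their integer codes.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original labels from the codes.
print(encoder.inverse_transform([0, 1]).tolist())
```

Refitting one encoder object on several columns in turn overwrites its classes_ attribute each time. Keeping a separate encoder per column is one way to preserve every mapping for later decoding.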
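FEATURE ENGINEERING
Feature engineering involves selecting and preparing the independent variables
(features) and the dependent variable (target) for training the machine learning model.
Here, the loan_id column is dropped, the remaining columns form the feature matrix X,
and loan_status is the target y.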
SPLITTING THE DATASET
The training set (X_train, y_train) is used to train the model. The testing set (X_test,
y_test) is used to evaluate the model's performance on unseen data. test_size=0.2:
Allocates 20% of the data for testing and 80% for training. random_state=42: Ensures
reproducibility by using a fixed random seed for the split.
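The split described above can be tried on toy data. The sketch below also passes stratify=y, which is not part of the project's code; it is an optional addition that keeps the class proportions the same in both subsets. The data is made up.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 100 applications, 60 labelled 0 and 40 labelled 1.
X = [[i] for i in range(100)]
y = [0] * 60 + [1] * 40

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(sum(y_test) / len(y_test))   # share of class 1 in the test set
```

Without stratification the class shares in the test set can drift from those of the full dataset, which matters most when one class is much rarer than the other.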
TRAIN THE MODEL
In this step, multiple machine learning models are initialized, trained on the training
data, and evaluated on the testing data. The goal is to compare the performance of
different algorithms for loan approval prediction.
1. Models Used:
o Random Forest: A robust ensemble method combining multiple decision
trees.
o Decision Tree: A simple and interpretable algorithm.
o SVM (Support Vector Machine): A classifier that finds the optimal
hyperplane for data separation.
o Logistic Regression: A statistical method for binary classification.
2. Training and Evaluation: Each model is trained using the training set (X_train,
y_train). Predictions are made on the testing set (X_test). Performance metrics
include:
▪ Accuracy: Proportion of correctly predicted labels.
▪ Classification Report: Precision, recall, F1-score for each class.
3. Confusion Matrix Visualization: A confusion matrix is plotted for each model.
Heatmaps provide insights into the number of correct and incorrect predictions
for both classes (Approved and Rejected).
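CONFUSION MATRIX
For each model, the confusion matrix of the test-set predictions is drawn as a Seaborn
heatmap, with the actual class on the vertical axis and the predicted class on the
horizontal axis.
EVALUATE THE MODEL
The accuracy and the classification report of each model are printed.
PLOT
A bar plot compares the accuracy of the four models.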
7. RESULT AND DISCUSSION
The loan approval prediction project evaluated four machine learning models:
Random Forest, Decision Tree, Support Vector Machine (SVM), and Logistic
Regression. The performance of these models was assessed using metrics such as
accuracy, precision, recall, and F1-score, as well as confusion matrices.
The Random Forest model emerged as the best-performing algorithm with an
accuracy of 98.1%. It demonstrated excellent precision, recall, and F1-scores for
both classes, indicating its robustness and effectiveness in accurately predicting both
approved and rejected loan applications. The balanced classification metrics suggest
that Random Forest effectively minimized false positives and false negatives.
The Decision Tree model also performed well, achieving an accuracy of 97.5%.
While slightly less accurate than Random Forest, it still delivered strong precision
and recall values, showcasing its ability to handle both classes effectively.
The SVM model, however, performed poorly with an accuracy of 62.8%. It
exhibited high precision for classifying approved loans (class 0) but completely
failed to identify rejected loans (class 1). This significant imbalance in recall and
F1-score indicates that SVM struggled with the dataset, likely due to its sensitivity
to feature scaling: the source code in the Appendix passes the features to the model
unscaled and uses the default SVC settings.
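A model that never predicts one class can still report a moderate accuracy, because its accuracy then cannot exceed the share of the other class in the test set. The sketch below illustrates this with made-up labels, not with the project's test set; the 63/37 class shares are an assumption.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(y_true)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(y_true)

def recall_for(y_true, y_pred, label):
    """Share of samples of `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)

# Made-up test labels: 0 = approved, 1 = rejected.
y_true = [0] * 63 + [1] * 37
label, baseline = majority_baseline_accuracy(y_true)
y_pred = [label] * len(y_true)   # a model that always predicts the majority class

print(f"always predict {label}: accuracy = {baseline:.2f}")
print(f"recall for class 1 = {recall_for(y_true, y_pred, 1):.2f}")
```

Comparing each model against this majority-class baseline shows whether its accuracy reflects anything learned from the features.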
The Logistic Regression model achieved a moderate accuracy of 72.8%. While it
showed good recall for approved loans (94%), its performance for rejected loans was
subpar, with a recall of only 36%. This imbalance suggests that Logistic Regression
struggled to generalize well, likely due to the dataset's complexity and non-linear
patterns.
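Sensitivity to feature scaling can be explored by comparing a model trained on raw features with the same model behind a StandardScaler. The sketch below does this for an SVM on synthetic data that was constructed for the illustration (one informative small-scale feature, one large-scale noise feature). It is not a rerun of the project's experiment, and whether the loan dataset behaves the same way was not tested here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration): one small-scale feature that
# separates the classes and one large-scale feature that is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
informative = (4 * y - 2) + rng.normal(0.0, 0.5, size=400)
noise = rng.normal(0.0, 1_000_000.0, size=400)
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same classifier twice: once on raw features, once behind a scaler.
unscaled = SVC().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

print("unscaled accuracy:", unscaled.score(X_test, y_test))
print("scaled accuracy:  ", scaled.score(X_test, y_test))
```

Wrapping the scaler and the classifier in one pipeline ensures that the scaling statistics are learned from the training data only and then applied unchanged to the test data.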
8. CONCLUSION
The study demonstrates the power of machine learning algorithms in automating
loan approval processes. The Random Forest model, with its robust performance,
emerges as the most reliable option for practical deployment. It effectively captures
complex relationships and interactions among features, making it suitable for real-
world financial applications. The Decision Tree model, while slightly less accurate,
offers simplicity and interpretability, making it a viable alternative for smaller-scale
implementations.
The findings also highlight the limitations of Logistic Regression and SVM as
configured here, with default settings and unscaled features, in handling the
non-linear patterns and class imbalance within the dataset. To further enhance
prediction accuracy and model robustness, feature scaling, hyperparameter tuning,
ensemble techniques like Gradient Boosting, or integrating more granular data points
such as borrower credit history and economic indicators could be explored.
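As a starting point for the hyperparameter tuning and Gradient Boosting suggested above, a grid search could look roughly like the sketch below. The synthetic data and the small parameter grid are assumptions made for the illustration; suitable ranges would have to be chosen for the actual loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical, deliberately small grid; sensible ranges depend on the data.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

# Each parameter combination is scored with 3-fold cross-validation on F1.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the score is averaged over cross-validation folds, the selected parameters depend less on one particular train/test split than a single hold-out comparison would.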
The project underscores the significance of advanced machine learning algorithms
in creating efficient, scalable, and accurate solutions for loan approval systems,
paving the way for streamlined decision-making in financial institutions.
9. APPENDIX
Source code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
dataset = pd.read_csv('loan_approval_dataset.csv')
print(dataset.head())          # View the first few rows of the dataset
dataset.info()                 # Column information (info() prints its own output)
print(dataset.describe())      # Summary statistics for numerical columns
print(dataset.isnull().sum())  # Number of missing values per column
# Encode the categorical columns as integers. LabelEncoder assigns the codes in
# sorted order of the labels. The column names are written with a leading space
# throughout this code.
encoder = LabelEncoder()
dataset[' education'] = encoder.fit_transform(dataset[' education'])
dataset[' self_employed'] = encoder.fit_transform(dataset[' self_employed'])
dataset[' loan_status'] = encoder.fit_transform(dataset[' loan_status'])
X = dataset.drop(['loan_id', ' loan_status'], axis=1) # Features
y = dataset[' loan_status'] # Target variable
# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Candidate classifiers, all with scikit-learn default settings
models = {
    'Random Forest': RandomForestClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(),
    'Logistic Regression': LogisticRegression()
}
results = {}
for name, model in models.items():
    # Train on the training set and predict on the held-out test set
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    results[name] = {'accuracy': accuracy, 'classification_report': class_report}
    # Plot the confusion matrix of this model as a heatmap
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='coolwarm',
                xticklabels=['Approved', 'Rejected'],
                yticklabels=['Approved', 'Rejected'])
    plt.title(f'{name} - Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()
# Print the accuracy and the classification report of each model
for name, result in results.items():
    print(f"\n{name} Model:")
    print(f"Accuracy: {result['accuracy']}")
    print(f"Classification Report:\n{result['classification_report']}")
model_names = list(results.keys())
accuracies = [result['accuracy'] for result in results.values()]
plt.figure(figsize=(10, 5))
sns.barplot(x=model_names, y=accuracies)
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.title('Model Comparison: Loan Approval Prediction')
plt.show()
10. BIBLIOGRAPHY
Websites Referred:
https://www.kaggle.com/datasets/architsharma01/loan-approval-prediction-dataset
https://www.tutorialspoint.com/loan-approval-prediction-using-machine-learning
https://www.geeksforgeeks.org/loan-approval-prediction-using-machine-learning/