Manali Andyal 26 05 2025 FRA Part A Guided Project Report PDF
DSBA
By:
Manali Andyal
Contents
Problem Statement
Data Overview
o Data Description
o Data Dictionary
Exploratory Data Analysis
o Univariate Analysis
o Bivariate Analysis
Data Pre-processing
Logistic Regression Model
Random Forest Model
Model Comparison after Model Performance Improvement post Hyperparameter Tuning
Final Model Selection
Features Importance
Conclusion on Default Risk
Recommendations/Mitigation Strategies
Problem Statement:
Context
In the realm of modern finance, businesses encounter the perpetual challenge of managing debt
obligations effectively to maintain a favorable credit standing and foster sustainable growth.
Investors keenly scrutinize companies capable of navigating financial complexities while ensuring
stability and profitability. A pivotal instrument in this evaluation process is the balance sheet, which
provides a comprehensive overview of a company's assets, liabilities, and shareholder equity,
offering insights into its financial health and operational efficiency. In this context, leveraging
available financial data, particularly from preceding fiscal periods, becomes imperative for informed
decision-making and strategic planning.
Objective
A renowned credit rating organization wants to develop a Financial Health Assessment Tool.
With the help of the tool, it endeavors to empower businesses and investors with a robust
mechanism for evaluating the financial well-being and creditworthiness of companies. By
harnessing machine learning techniques, the organization aims to analyze historical financial
statements and extract pertinent insights to facilitate informed decision-making via the tool.
Specifically, the organization foresees facilitating the following with the help of the tool:
Data Overview
1) Data Description
The data consists of financial metrics from the balance sheets of different companies
2) Data Dictionary
Observations Features
2058 58
The dataset contains 2058 observations and 58 features in the form of the variables listed below.
• _Total_debt_to_Total_net_worth: Total debt/Total net worth: Total Liability/Equity
Ratio
• _Long_term_fund_suitability_ratio_A: Long-term fund suitability ratio (A): (Long-
term Liability+Equity)/Fixed Assets
• _Net_profit_before_tax_to_Paid_in_capital: Net profit before tax/Paid-in capital:
Pretax Income/Capital
• _Total_Asset_Turnover: Total Asset Turnover. Net Sales/Average Total Assets
• _Accounts_Receivable_Turnover: Accounts Receivable Turnover. The accounts
receivable turnover ratio, or receivables turnover, is used in business accounting to
quantify how well companies are managing the credit that they extend to their
customers by evaluating how long it takes to collect the outstanding debt throughout
the accounting period.
• _Average_Collection_Days: Average Collection Days: Days Receivable Outstanding
• _Inventory_Turnover_Rate_times: Inventory Turnover Rate (times). The inventory
turnover ratio is the number of times a company has sold and replenished its
inventory over a specific amount of time. The formula can also be used to calculate
the number of days it will take to sell the inventory on hand.
• _Fixed_Assets_Turnover_Frequency: Fixed Assets Turnover Frequency. Fixed Asset
Turnover (FAT) is an efficiency ratio that indicates how well or efficiently a business
uses fixed assets to generate sales. This ratio divides net sales by net fixed assets,
calculated over an annual period.
• _Net_Worth_Turnover_Rate_times: Net Worth Turnover Rate (times): Equity
Turnover. Equity turnover is a ratio that measures the proportion of a company's
sales to its stockholders' equity. The intent of the measurement is to determine the
efficiency with which management is using equity to generate revenue.
• _Operating_profit_per_person: Operating profit per person: Operation Income Per
Employee
• _Allocation_rate_per_person: Allocation rate per person: Fixed Assets Per Employee
• _Quick_Assets_to_Total_Assets: Quick Assets/Total Assets
• _Cash_to_Total_Assets: Cash/Total Assets
• _Quick_Assets_to_Current_Liability: Quick Assets/Current Liability
• _Cash_to_Current_Liability: Cash/Current Liability
• _Operating_Funds_to_Liability: Operating Funds to Liability
• _Inventory_to_Working_Capital: Inventory/Working Capital
• _Inventory_to_Current_Liability: Inventory/Current Liability
• _Long_term_Liability_to_Current_Assets: Long-term Liability to Current Assets
• _Retained_Earnings_to_Total_Assets: Retained Earnings to Total Assets
• _Total_income_to_Total_expense: Total income/Total expense
• _Total_expense_to_Assets: Total expense/Assets
• _Current_Asset_Turnover_Rate: Current Asset Turnover Rate: Current Assets to
Sales. The current assets turnover ratio indicates how many times the current assets
are turned over in the form of sales within a specific period of time. A higher asset
turnover ratio means a better percentage of sales.
• _Quick_Asset_Turnover_Rate: Quick Asset Turnover Rate: Quick Assets to Sales. The
asset turnover ratio measures the efficiency of a company's assets in generating
revenue or sales.
• _Cash_Turnover_Rate: Cash Turnover Rate: Cash to Sales. The cash turnover ratio
is an efficiency ratio that reveals the number of times that cash is turned over in an
accounting period.
• _Fixed_Assets_to_Assets: Fixed Assets to Assets. Fixed assets are also known as non-
current assets—assets that can't be easily converted into cash.
• _Cash_Flow_to_Total_Assets: Cash Flow to Total Assets. This ratio indicates the cash
a company can generate in relation to its size.
• _Cash_Flow_to_Liability: Cash Flow to Liability. The amount of money available to
run business operations and complete transactions. This is calculated as current
assets (cash or near-cash assets, like notes receivable) minus current liabilities
(liabilities due during the upcoming accounting period)
• _CFO_to_Assets: CFO to Assets. Cash flow on total assets is an efficiency ratio that
rates cash flows to the company assets without being affected by income recognition
or income measurements.
• _Cash_Flow_to_Equity: Cash Flow to Equity. Cash flow to equity is a measure of how
much cash is available to the equity shareholders of a company after all expenses,
reinvestment, and debt are paid.
• _Current_Liability_to_Current_Assets: Current Liability to Current Assets. Current
liabilities are a company's financial commitments that are due and payable within a
year. Current assets are projected to be consumed, sold, or converted into cash
within a year or within the operational cycle.
• _Liability_Assets_Flag: Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0
otherwise
• _Total_assets_to_GNP_price: Total assets to GNP price. Gross National Product (GNP)
is the total value of all finished goods and services produced by a country’s citizens in
a given financial year, irrespective of their location.
• _No_credit_Interval: No-credit Interval
• _Degree_of_Financial_Leverage_DFL: Degree of Financial Leverage (DFL). The degree
of financial leverage is a financial ratio that measures the sensitivity in fluctuations of
a company's overall profitability to the volatility of its operating income caused by
changes in its capital structure.
• _Interest_Coverage_Ratio_Interest_expense_to_EBIT: Interest Coverage Ratio
(Interest expense to EBIT). The interest coverage ratio is a debt and profitability ratio
used to determine how easily a company can pay interest on its outstanding debt.
The interest coverage ratio is calculated by dividing a company's earnings before
interest and taxes (EBIT) by its interest expense during a given period.
• _Net_Income_Flag: Net Income Flag: 1 if Net Income is Negative for the last two
years, 0 otherwise
• _Equity_to_Liability: Equity to Liability Ratio.
• Default: Whether the company has defaulted (gone bankrupt) or not. 1 - Defaulted, 0 -
Not Defaulted.
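Several of the ratios in the dictionary are defined by simple formulas. As a minimal sketch, the Python snippet below computes a few of them from made-up statement figures. The function name, the argument names, and all numbers are illustrative assumptions and do not come from the dataset.

```python
def example_ratios(net_sales, avg_total_assets, net_fixed_assets,
                   equity, ebit, interest_expense,
                   total_liability, total_assets):
    """Compute a handful of ratios using the definitions in the data dictionary."""
    return {
        # Total Asset Turnover: net sales / average total assets
        "total_asset_turnover": net_sales / avg_total_assets,
        # Fixed Assets Turnover: net sales / net fixed assets
        "fixed_assets_turnover": net_sales / net_fixed_assets,
        # Net Worth (equity) Turnover: sales / stockholders' equity
        "net_worth_turnover": net_sales / equity,
        # Interest coverage: EBIT / interest expense
        "interest_coverage": ebit / interest_expense,
        # Liability-Assets Flag: 1 if total liability exceeds total assets
        "liability_assets_flag": int(total_liability > total_assets),
    }


# Hypothetical figures for a single company (not taken from the report).
ratios = example_ratios(
    net_sales=500.0, avg_total_assets=1000.0, net_fixed_assets=250.0,
    equity=400.0, ebit=60.0, interest_expense=15.0,
    total_liability=600.0, total_assets=1000.0,
)
print(ratios)
```

With these inputs the company turns its assets over 0.5 times, covers its interest 4 times, and is not flagged, since its liabilities are below its assets.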
3) Data Contents
• There are 53 features of float data type, 4 features of integer data type and 1
feature of object data type
• There were no duplicate values
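Checks of this kind can be reproduced with pandas. The sketch below runs on a tiny made-up frame rather than the real file: the column "Co_Name" is a hypothetical stand-in for the single text feature, and all values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data. Column names other than "Co_Name"
# are borrowed from the data dictionary; every value here is made up.
df = pd.DataFrame({
    "Co_Name": ["A", "B", "C"],
    "_Cash_to_Total_Assets": [0.12, 0.05, 0.30],
    "_Liability_Assets_Flag": [0, 0, 1],
    "Default": [0, 0, 1],
})

dtype_counts = df.dtypes.astype(str).value_counts()  # number of features per data type
n_duplicates = int(df.duplicated().sum())            # number of fully duplicated rows

print(df.shape)        # (observations, features)
print(dtype_counts)
print(n_duplicates)
```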
Exploratory Data Analysis
1. Univariate Analysis
a. Count of default
Observations:
• The count of instances where default did not occur (category ‘0’) is much higher
than the count of instances where default did occur (category ‘1’).
• In summary, the chart indicates that non-default cases are more common than
default cases within this dataset
Observations:
• The first chart represents the operating expense rate, which indicates the
proportion of an organization’s total expenses allocated to general operational
costs (e.g., salaries, utilities, rent, etc.).
o Since the bar extends halfway, it suggests that the operating expense rate
is around 0.5 (or 50%).
• The second chart is labeled “Research and development expense rate.”
o It also shows a single blue bar, but this one extends only a very small
fraction of the way across the axis (closer to 0).
o This bar represents the R&D expense rate, indicating the proportion of
expenses allocated specifically to research and development activities.
o The short length of the bar suggests that the R&D expense rate is
significantly lower than the operating expense rate.
Observations:
2. Bivariate Analysis
a. Quick Asset and Cash Turnover Rate vs Default
Observations:
• The second graph is labeled “Net Value Growth Rate.”
o The exact values of these data points are not provided, but they are
positioned at different heights along the y-axis.
o This suggests variations in net value growth rates at different points on
the x-axis.
Observations:
c. Correlation Matrix
Observations:
• Looking at the correlation plot, it seems that the metrics are weakly correlated with
each other
3. Data Pre-processing
• Dropped the columns Net_Income_Flag and Liability_Assets_Flag as they had very
few unique values
• Separated the target variable (the Default column) from the rest of the data
• Split the data into train and test sets
• There were missing values in the following features in the train data:
o Cash_Flow_Per_Share = 126 missing values
o Total_debt_to_Total_net_worth = 18 missing values
o Cash_to_Total_Assets = 71 missing values
o Current_Liability_to_Current_Assets = 11 missing values
• There were missing values in the following features in the test data:
o Cash_Flow_Per_Share = 41 missing values
o Total_debt_to_Total_net_worth = 3 missing values
o Cash_to_Total_Assets = 25 missing values
o Current_Liability_to_Current_Assets = 3 missing values
• Replaced the missing values using KNN Imputer
• Scaled the features to a common scale
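The pre-processing steps listed above can be sketched with pandas and scikit-learn as follows. This is a minimal illustration on synthetic data, not the report's actual code. The split ratio, the stratification, the random seed, the number of neighbours, and the choice of StandardScaler are all assumptions, since the report does not state them.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (random values, not from the report).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "_Cash_to_Total_Assets": rng.random(n),
    "_Equity_to_Liability": rng.random(n),
    "_Net_Income_Flag": np.ones(n, dtype=int),
    "_Liability_Assets_Flag": np.zeros(n, dtype=int),
    "Default": (np.arange(n) % 10 == 0).astype(int),  # 10% defaulters
})
# Knock out a few values so there is something to impute.
df.loc[rng.choice(n, size=15, replace=False), "_Cash_to_Total_Assets"] = np.nan

# 1) Drop the near-constant flag columns and 2) separate the target.
X = df.drop(columns=["_Net_Income_Flag", "_Liability_Assets_Flag", "Default"])
y = df["Default"]

# 3) Split; test_size, stratify and random_state are assumed settings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 4) Impute and 5) scale, fitting on the training split only so that no
#    information from the test split leaks into the fitted transformers.
imputer = KNNImputer(n_neighbors=5)
scaler = StandardScaler()
X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_prep = scaler.transform(imputer.transform(X_test))

print(X_train_prep.shape, X_test_prep.shape)
```

Fitting the imputer and scaler on the training split and only applying them to the test split keeps the test data unseen until evaluation.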
4. Logistic Regression Model
a. Logistic Regression Model - Train Performance
Observations:
• High True Negatives: The model performs well in identifying negative instances,
with a high proportion (87.30%) of the total instances being true negatives.
• False Negatives and Positives: The model has a moderate number of false
negatives (5.77%) and a relatively low number of false positives (2.01%). This
suggests that while the model is good at avoiding false positives, it misses a fair
number of actual positives.
• Precision and Recall: The precision for positive predictions is fairly high at about
71.03%, indicating that when the model predicts a positive instance, it is correct
most of the time. However, the recall for positive instances is relatively low at
about 46.06%, indicating that the model misses a significant portion of the actual
positive instances.
• Overall Accuracy: The model's overall accuracy is quite high (92.22%), indicating
good performance on the training data. However, accuracy alone may not be the
best metric if the classes are imbalanced or if false negatives/positives have
different costs.
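The precision, recall and accuracy quoted above follow directly from the confusion-matrix shares. The short sketch below recomputes them from the three percentages given in the text, taking the true-positive share as the remainder. The helper name is an assumption, and because the shares are rounded to two decimals, the results can differ from the reported figures in the last digit.

```python
def metrics_from_shares(tn, fp, fn, tp):
    """Accuracy, precision and recall from confusion-matrix cells (counts or shares)."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Shares (in % of all training instances) quoted in the observations above.
tn, fn, fp = 87.30, 5.77, 2.01
tp = 100.0 - tn - fn - fp  # remaining share of true positives, about 4.92%

train_metrics = metrics_from_shares(tn=tn, fp=fp, fn=fn, tp=tp)
print({name: round(value, 4) for name, value in train_metrics.items()})
```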
b. Logistic Regression Model - Test Performance
Observations:
• High True Negatives: The model performs well in identifying negative instances,
with a high proportion (86.99%) of the total instances being true negatives.
• False Negatives and Positives: The model has a moderate number of false
negatives (6.02%) and a relatively low number of false positives (2.33%). This
suggests that while the model is good at avoiding false positives, it misses a fair
number of actual positives.
• Precision and Recall: The precision for positive predictions is fairly high at about
66.67%, indicating that when the model predicts a positive instance, it is correct
most of the time. However, the recall for positive instances is relatively low at
about 43.64%, indicating that the model misses a significant portion of the actual
positive instances.
• Overall Accuracy: The model's overall accuracy is quite high (91.65%), indicating
good performance on the test data. However, accuracy alone may not be the best
metric if the classes are imbalanced or if false negatives/positives have different
costs.
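To see why accuracy alone can mislead here, it helps to compare the model with a trivial baseline that always predicts "no default". The sketch below derives that baseline from the test-set shares quoted above. It is an illustrative calculation only, and the true-positive share is taken as the remainder of the three rounded percentages.

```python
# Shares (in % of all test instances) quoted in the observations above.
tn, fn, fp = 86.99, 6.02, 2.33
tp = 100.0 - tn - fn - fp                    # remaining share of true positives

positive_share = tp + fn                     # share of actual defaulters
baseline_accuracy = 100.0 - positive_share   # accuracy of always predicting "no default"
model_accuracy = tp + tn

print(f"defaulters in the test set: {positive_share:.2f}%")
print(f"baseline accuracy: {baseline_accuracy:.2f}%")
print(f"model accuracy: {model_accuracy:.2f}%")
```

The gap between the baseline and the model is only a couple of percentage points, which is why recall and precision on the default class are the more informative metrics.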
5. Random Forest Model
a. Random Forest Model - Train Performance
Observations:
• Perfect Performance: The model shows perfect performance on the training data,
with accuracy, precision, recall, F1 score, and specificity all at 100%. This
indicates that the model correctly classified all instances in the training set.
• No Misclassifications: There are no false positives (FP) or false negatives (FN),
meaning the model made no errors in its predictions on the training data.
b. Random Forest Model - Test Performance
Observations:
• Accuracy (0.928155): This indicates that the model correctly predicted the class
labels for 92.82% of the instances in the test set. While this is high, accuracy alone
can be misleading, especially if the classes are imbalanced.
• Recall (0.490909): Also known as sensitivity or true positive rate, this metric
indicates that the model correctly identified 49.09% of the actual positive instances.
This relatively low recall suggests that the model is missing a significant number of
positive cases.
• Precision (0.75): This indicates that when the model predicts a positive class, it is
correct 75% of the time. High precision with low recall suggests that the model is
conservative in its positive predictions, preferring to avoid false positives at the
expense of missing true positives.
• F1 Score (0.593407): The F1 score is the harmonic mean of precision and recall,
providing a single metric that balances both. An F1 score of 0.593407 indicates that
the model has moderate performance when considering both false positives and
false negatives.
In summary, while the model shows high accuracy and precision, its low recall and
moderate F1 score indicate that it struggles to identify positive instances. This points to a
potential class-imbalance issue and a need for strategies that improve its performance on
the minority class.
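As a quick check on the metrics above, the F1 score can be recomputed from the reported precision and recall, since it is their harmonic mean. The helper name below is an assumption; the two inputs are the test-set values quoted in the observations.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Test-set precision and recall reported for the random forest model.
f1 = f1_from(precision=0.75, recall=0.490909)
print(f"F1 = {f1:.4f}")
```

The result agrees with the reported F1 score of 0.593407 to within rounding.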
6. Model Comparison after Model Performance Improvement post Hyperparameter Tuning
a. Train performance
o Therefore, the tuned random forest model is likely the best-performing model
on the training data and has the potential to generalize better to unseen data
compared to the other models.
b. Test performance
7. Final Model Selection
1. Logistic Regression Models:
• The original logistic regression model has high accuracy and precision on both train
and test data, but its recall and F1 scores are moderate, indicating it may miss many
positive instances.
• The tuned logistic regression model has lower accuracy but higher precision, with
reduced recall and F1 scores compared to the original.
• Final Recommendation
o Given the overall performance on both training and test data, the original
random forest model appears to be the best choice. Despite the potential
overfitting indicated by its perfect training metrics, it still generalizes well to the
test data, with the highest accuracy (0.928155) among all models, alongside a
recall of 0.490909, a precision of 0.750000, and an F1 score of 0.593407. This
indicates that it captures the patterns in the data effectively, although a recall
below 0.5 means that roughly half of the actual defaults are still missed.
8. Features Importance
Observations:
• Retained Earnings to Total Assets, Net profit before tax, and Per share net profit before tax
are the most important features.
• Operating Expense Rate, Inventory Turnover Rate (times), Quick Asset Turnover Rate,
Current Asset Turnover Rate, and Operating Profit Growth Rate have the lowest
impact in the random forest model.
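A ranking like the one above can be obtained from a fitted random forest. The sketch below assumes the importances are scikit-learn's impurity-based `feature_importances_`, which the report does not confirm. It uses synthetic data with hypothetical feature names, where only the first column carries any signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first column determines the label (random values,
# hypothetical feature names, nothing here comes from the report's dataset).
rng = np.random.default_rng(0)
feature_names = ["informative_ratio", "noise_a", "noise_b", "noise_c"]
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances tend to favour features with many distinct values, so a permutation-based check on held-out data is a common complement.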
9. Conclusion on Default Risk
• Based on the lower equity-to-liability ratio and the presence of significant outliers,
the company appears to have a higher risk of defaulting in the next two quarters. The
constant net income flag, moderate asset growth, and operating and R&D expense
rates further contextualize this risk but are not as directly indicative of default risk as
the equity-to-liability ratio.
10. Recommendations/Mitigation Strategies
To mitigate the risk of default, the company should consider the following strategies: