
Manali Andyal 26 05 2025 FRA Part A Guided Project Report PDF


Finance and Risk Analytics

Guided Project Report – Part A

DSBA

By:
Manali Andyal

Contents
Problem Statement
Data Overview
o Data Description
o Data Dictionary
Exploratory Data Analysis
o Univariate Analysis
o Bivariate Analysis
Data Pre-processing
Logistic Regression Model
Random Forest Model
Model Comparison after Model Performance improvement post Hyperparameter tuning
Final Model Selection
Features Importance
Conclusion on Default Risk
Recommendations/Mitigation Strategies

Problem Statement:
Context
In the realm of modern finance, businesses encounter the perpetual challenge of managing debt
obligations effectively to maintain a favorable credit standing and foster sustainable growth.
Investors keenly scrutinize companies capable of navigating financial complexities while ensuring
stability and profitability. A pivotal instrument in this evaluation process is the balance sheet, which
provides a comprehensive overview of a company's assets, liabilities, and shareholder equity,
offering insights into its financial health and operational efficiency. In this context, leveraging
available financial data, particularly from preceding fiscal periods, becomes imperative for informed
decision-making and strategic planning.

Objective
A renowned credit rating organization wants to develop a Financial Health Assessment Tool.
With the help of the tool, it endeavors to empower businesses and investors with a robust
mechanism for evaluating the financial well-being and creditworthiness of companies. By
harnessing machine learning techniques, the organization aims to analyze historical financial
statements and extract pertinent insights to facilitate informed decision-making via the tool.
Specifically, the organization foresees facilitating the following with the help of the tool:

• Debt Management Analysis: Identify patterns and trends in debt management
practices to assess the ability of businesses to fulfill financial obligations promptly
and efficiently, and identify potential cases of default.
• Credit Risk Evaluation: Evaluate credit risk exposure by analyzing liquidity ratios,
debt-to-equity ratios, and other key financial indicators to ascertain the likelihood of
default and inform investment decisions.
As a part of the data science team in the organization, you have been provided with the
financial metrics of different companies. The task is to analyze the data provided and
develop a predictive model leveraging machine learning techniques to identify whether a
given company will default on its debt repayments in the next two quarters. The predictive
model will help the organization anticipate potential challenges with the financial
performance of the companies and enable proactive risk mitigation strategies.

Data Overview
1) Data Description
The data consists of financial metrics from the balance sheets of different companies

2) Data Dictionary
Observations: 2058
Features: 58

The dataset contains 2058 observations and 58 features, described by the variables listed below.

• Co_Code: Company Code
• Co_Name: Company Name
• _Operating_Expense_Rate: Operating Expense Rate: Operating Expenses/Net Sales.
The operating expense ratio (OER) is the cost to operate a piece of property
compared to the income the property brings in.
• _Research_and_development_expense_rate: Research and development expense
rate: (Research and Development Expenses)/Net Sales. Research and development
(R&D) expenses are direct expenditures relating to a company's efforts to develop,
design, and enhance its products, services, technologies, or processes.
• _Cash_flow_rate: Cash flow rate: Cash Flow from Operating/Current Liabilities. Cash
flow is a measure of how much cash a business brought in or spent in total over a
period of time.
• _Interest_bearing_debt_interest_rate: Interest-bearing debt interest rate: Interest-
bearing Debt/Equity
• _Tax_rate_A: Effective Tax Rate. Effective tax rate represents the percentage of their
taxable income that individuals pay in taxes. For corporations, the effective corporate
tax rate is the rate they pay on their pre-tax profits.
• _Cash_Flow_Per_Share: Cash Flow Per Share. It is the after-tax earnings plus
depreciation on a per-share basis that functions as a measure of a firm's financial
strength
• _Per_Share_Net_profit_before_tax_Yuan_: Per Share Net profit before tax (Yuan ¥):
Pretax Income Per Share. Pretax income, also known as earnings before tax or pretax
earnings, is the net income earned by a business before taxes are
subtracted/accounted for.
• _Realized_Sales_Gross_Profit_Growth_Rate: Realized Sales Gross Profit Growth Rate.
• _Operating_Profit_Growth_Rate: Operating Profit Growth Rate: Operating Income
Growth. It is the rate of increase in operating income over the last year.
• _Continuous_Net_Profit_Growth_Rate: Continuous Net Profit Growth Rate: Net
Income-Excluding Disposal Gain or Loss Growth
• _Total_Asset_Growth_Rate: Total Asset Growth Rate: Total Asset Growth. It is the
rate at which the company has been growing its assets.
• _Net_Value_Growth_Rate: Net Value Growth Rate: Total Equity Growth
• _Total_Asset_Return_Growth_Rate_Ratio: Total Asset Return Growth Rate Ratio:
Return on Total Asset Growth
• _Cash_Reinvestment_perc: Cash Reinvestment %: Cash Reinvestment Ratio. It is the
valuation ratio that is used to measure the percentage of annual cash flow that the
company invests back into the business as a new investment.
• _Current_Ratio: Current Ratio. The current ratio describes the relationship between a
company's current assets and its current liabilities.
• _Quick_Ratio: Quick Ratio: Acid Test. Acid-test ratio (also known as quick ratio) is a
measure of a company's liquidity, which is its ability to pay its short-term obligations
using only its most liquid assets.
• _Interest_Expense_Ratio: Interest Expense Ratio: Interest Expenses/Total Revenue

• _Total_debt_to_Total_net_worth: Total debt/Total net worth: Total Liability/Equity
Ratio
• _Long_term_fund_suitability_ratio_A: Long-term fund suitability ratio (A): (Long-
term Liability+Equity)/Fixed Assets
• _Net_profit_before_tax_to_Paid_in_capital: Net profit before tax/Paid-in capital:
Pretax Income/Capital
• _Total_Asset_Turnover: Total Asset Turnover. Net Sales/Average Total Assets
• _Accounts_Receivable_Turnover: Accounts Receivable Turnover. The accounts
receivable turnover ratio, or receivables turnover, is used in business accounting to
quantify how well companies are managing the credit that they extend to their
customers by evaluating how long it takes to collect the outstanding debt throughout
the accounting period.
• _Average_Collection_Days: Average Collection Days: Days Receivable Outstanding
• _Inventory_Turnover_Rate_times: Inventory Turnover Rate (times). The inventory
turnover ratio is the number of times a company has sold and replenished its
inventory over a specific amount of time. The formula can also be used to calculate
the number of days it will take to sell the inventory on hand.
• _Fixed_Assets_Turnover_Frequency: Fixed Assets Turnover Frequency. Fixed Asset
Turnover (FAT) is an efficiency ratio that indicates how well or efficiently a business
uses fixed assets to generate sales. This ratio divides net sales by net fixed assets,
calculated over an annual period.
• _Net_Worth_Turnover_Rate_times: Net Worth Turnover Rate (times): Equity
Turnover. Equity turnover is a ratio that measures the proportion of a company's
sales to its stockholders' equity. The intent of the measurement is to determine the
efficiency with which management is using equity to generate revenue.
• _Operating_profit_per_person: Operating profit per person: Operation Income Per
Employee
• _Allocation_rate_per_person: Allocation rate per person: Fixed Assets Per Employee
• _Quick_Assets_to_Total_Assets: Quick Assets/Total Assets
• _Cash_to_Total_Assets: Cash/Total Assets
• _Quick_Assets_to_Current_Liability: Quick Assets/Current Liability
• _Cash_to_Current_Liability: Cash/Current Liability
• _Operating_Funds_to_Liability: Operating Funds to Liability
• _Inventory_to_Working_Capital: Inventory/Working Capital
• _Inventory_to_Current_Liability: Inventory/Current Liability
• _Long_term_Liability_to_Current_Assets: Long-term Liability to Current Assets
• _Retained_Earnings_to_Total_Assets: Retained Earnings to Total Assets
• _Total_income_to_Total_expense: Total income/Total expense
• _Total_expense_to_Assets: Total expense/Assets
• _Current_Asset_Turnover_Rate: Current Asset Turnover Rate: Current Assets to
Sales. The current assets turnover ratio indicates how many times the current assets
are turned over in the form of sales within a specific period of time. A higher asset
turnover ratio means a better percentage of sales.
• _Quick_Asset_Turnover_Rate: Quick Asset Turnover Rate: Quick Assets to Sales. The
asset turnover ratio measures the efficiency of a company's assets in generating
revenue or sales.
• _Cash_Turnover_Rate: Cash Turnover Rate: Cash to Sales. The cash turnover ratio
is an efficiency ratio that reveals the number of times that cash is turned over in an
accounting period.
• _Fixed_Assets_to_Assets: Fixed Assets to Assets. Fixed assets are also known as non-
current assets—assets that can't be easily converted into cash.
• _Cash_Flow_to_Total_Assets: Cash Flow to Total Assets. This ratio indicates the cash
a company can generate in relation to its size.
• _Cash_Flow_to_Liability: Cash Flow to Liability. This ratio relates a company's cash
flow to its liabilities, indicating how well cash flow covers outstanding obligations.
• _CFO_to_Assets: CFO to Assets. Cash flow on total assets is an efficiency ratio that
rates cash flows to the company assets without being affected by income recognition
or income measurements.
• _Cash_Flow_to_Equity: Cash Flow to Equity. Cash flow to equity is a measure of how
much cash is available to the equity shareholders of a company after all expenses,
reinvestment, and debt are paid.
• _Current_Liability_to_Current_Assets: Current Liability to Current Assets. Current
liabilities are a company's financial commitments that are due and payable within a
year. Current assets are projected to be consumed, sold, or converted into cash
within a year or within the operational cycle.
• _Liability_Assets_Flag: Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0
otherwise
• _Total_assets_to_GNP_price: Total assets to GNP price. Gross National Product (GNP)
is the total value of all finished goods and services produced by a country’s citizens in
a given financial year, irrespective of their location.
• _No_credit_Interval: No-credit Interval
• _Degree_of_Financial_Leverage_DFL: Degree of Financial Leverage (DFL). The degree
of financial leverage is a financial ratio that measures the sensitivity in fluctuations of
a company's overall profitability to the volatility of its operating income caused by
changes in its capital structure.
• _Interest_Coverage_Ratio_Interest_expense_to_EBIT: Interest Coverage Ratio
(Interest expense to EBIT). The interest coverage ratio is a debt and profitability ratio
used to determine how easily a company can pay interest on its outstanding debt.
The interest coverage ratio is calculated by dividing a company's earnings before
interest and taxes (EBIT) by its interest expense during a given period.
• _Net_Income_Flag: Net Income Flag: 1 if Net Income is Negative for the last two
years, 0 otherwise
• _Equity_to_Liability: Equity to Liability Ratio.
• Default: Whether the company has defaulted (gone bankrupt) or not. 1 - Defaulted, 0 -
Not Defaulted.

3) Data Contents
• There are 53 features of float data type, 4 features of integer data type, and 1
feature of object data type
• There were no duplicate values

Exploratory Data Analysis
1. Univariate Analysis
a. Count of default

Observations:

• The count of instances where default did not occur (category ‘0’) is much higher
than the count of instances where default did occur (category ‘1’).
• In summary, the chart indicates that non-default cases are more common than
default cases within this dataset
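
The imbalance noted above can be quantified before modelling. The sketch below is illustrative only: it uses a small hypothetical label list (not the project data) to compute the class shares and the accuracy of a baseline that always predicts "no default". The names `labels` and `majority_baseline_accuracy` are assumptions introduced for this example.

```python
from collections import Counter

# Hypothetical Default labels (1 = defaulted, 0 = not defaulted); not the project data.
labels = [0] * 18 + [1] * 2

counts = Counter(labels)
total = len(labels)

# Share of each class among all observations.
shares = {cls: n / total for cls, n in counts.items()}

# Accuracy obtained by always predicting the most common class.
majority_baseline_accuracy = max(counts.values()) / total

print(counts)
print(shares)
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.2f}")
```

When one class dominates, a model's accuracy is best read against this baseline rather than against 50%.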

b. Operating expense rates and research and development (R&D) expense rates

Observations:

• The first chart represents the operating expense rate, which indicates the
proportion of an organization’s total expenses allocated to general operational
costs (e.g., salaries, utilities, rent, etc.).
o Since the bar extends halfway, it suggests that the operating expense rate
is around 0.5 (or 50%).
• The second chart is labeled “Research and development expense rate.”
o It also shows a single blue bar, but this one extends only a very small
fraction of the way across the axis (closer to 0).

o This bar represents the R&D expense rate, indicating the proportion of
expenses allocated specifically to research and development activities.
o The short length of the bar suggests that the R&D expense rate is
significantly lower than the operating expense rate.

c. Total Asset Growth Rate and Net Value Growth Rate

Observations:

• The first graph is labeled “Total Asset Growth Rate.”
o It displays a single blue vertical bar within a range of 0 to 0.6 on the x-axis.
o Unfortunately, the exact value of the total asset growth rate is not
specified due to the lack of a numerical scale on the y-axis.
o However, we can infer that the total asset growth rate corresponds to the
height of the blue bar.
• The second graph is labeled “Net Value Growth Rate.”
o The exact values of these data points are not provided, but they are
positioned at different heights along the y-axis.
o This suggests variations in net value growth rates at different points on
the x-axis.

2. Bivariate Analysis
a. Quick Asset and Cash Turnover Rate vs Default

Observations:

• The first graph is labeled “Total Asset Growth Rate.”
o We can infer that the total asset growth rate corresponds to the height of
the blue bar.
• The second graph is labeled “Net Value Growth Rate.”
o The exact values of these data points are not provided, but they are
positioned at different heights along the y-axis.
o This suggests variations in net value growth rates at different points on
the x-axis.

b. Net Income Flag and Equity to Liability vs Default

Observations:

• The left box plot represents the “Net Income Flag.”
o The x-axis has two categories: Default = 0 and Default = 1.
o The y-axis ranges approximately from 0.95 to 1.05.
o Both Default groups (0 and 1) have a constant value of 1 for the net
income flag.
o There is no variability; all data points are at the same level.
o This suggests that the net income flag might be a constant value across
the dataset.
• The right box plot represents the “Equity to Liability.”
o Again, the x-axis has two categories: Default = 0 and Default = 1.
o The y-axis ranges approximately from 0 to 1.
o The median (middle line within each box) is closer to zero for Default = 1
(companies that defaulted).
o There are several outliers (dots above and below the whiskers) for Default
= 1.
o The spread of values is wider for Default = 0 (non-defaulting companies).
o Overall, the equity-to-liability ratio appears lower for companies that
defaulted.
• Net Income Flag seems constant (value of 1) across both default groups.
• Equity to Liability shows more variability, with lower ratios for defaulting
companies.

c. Correlation Matrix

Observations:

• Looking at the correlation plot, it seems that the metrics are weakly correlated with
each other

3. Data Pre-processing
• Dropped the columns Net_Income_Flag and Liability_Assets_Flag as they had very
few unique values
• Separated the target variable, i.e., the Default column, from the rest of the data
• Split the data into train and test
• There were missing values in the following features in the train data

o Cash_Flow_Per_Share = 126 missing values
o Total_debt_to_Total_net_worth = 18 missing values
o Cash_to_Total_Assets = 71 missing values
o Current_Liability_to_Current_Assets = 11 missing values
• There were missing values in the following features in the test data
o Cash_Flow_Per_Share = 41 missing values
o Total_debt_to_Total_net_worth = 3 missing values
o Cash_to_Total_Assets = 25 missing values
o Current_Liability_to_Current_Assets = 3 missing values
• Replaced the missing values using KNN Imputer
• Scaled all features to a common scale
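
The imputation step above can be illustrated with a simplified sketch. The function below is not the KNN Imputer used in the project; it is a minimal stand-in, assuming missing values are stored as `None`, that fills each gap with the mean of that feature over the `k` nearest rows. The name `knn_impute` and the toy rows are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance uses only the features observed in both rows. This is a
    simplified stand-in for a library KNN imputer, not a reimplementation.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where feature j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor row exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Hypothetical toy rows: (Cash_Flow_Per_Share, Cash_to_Total_Assets)
train = [
    [0.30, 0.10],
    [0.32, 0.12],
    [0.90, 0.80],
    [None, 0.11],
]
imputed = knn_impute(train, k=2)
print(imputed[3])
```

In this toy case the missing value is filled from the two rows whose observed feature is closest, and the distant third row is ignored. Applying the same idea to the test split would normally draw donor rows from the training split, so that no test information leaks into the fit.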

4. Logistic Regression Model
a. Logistic Regression Model - Training Performance

Observations:

• High True Negatives: The model performs well in identifying negative instances,
with a high proportion (87.30%) of the total instances being true negatives.

• False Negatives and Positives: The model has a moderate number of false
negatives (5.77%) and a relatively low number of false positives (2.01%). This
suggests that while the model is good at avoiding false positives, it misses a fair
number of actual positives.

• Precision and Recall: The precision for positive predictions is fairly high at about
71.03%, indicating that when the model predicts a positive instance, it is correct
most of the time. However, the recall for positive instances is relatively low at
about 46.06%, indicating that the model misses a significant portion of the actual
positive instances.

• Overall Accuracy: The model's overall accuracy is quite high (92.22%), indicating
good performance on the training data. However, accuracy alone may not be the
best metric if the classes are imbalanced or if false negatives/positives have
different costs.
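
The relationship between the confusion-matrix shares and the reported metrics can be checked with a short calculation. The sketch below takes the three shares quoted above and derives the true-positive share as the remainder, which is an assumption of this example since that share is not stated in the observations. Small differences from the reported figures are expected because the inputs are rounded to two decimals.

```python
# Shares of all training instances, as quoted in the observations above (percent).
tn, fn, fp = 87.30, 5.77, 2.01

# The true-positive share is not stated; it is derived here as the remainder.
tp = 100.0 - (tn + fn + fp)

precision = tp / (tp + fp)   # correct positive predictions / all positive predictions
recall = tp / (tp + fn)      # correct positive predictions / all actual positives
accuracy = (tp + tn) / 100.0
f1 = 2 * precision * recall / (precision + recall)

print(f"TP share:  {tp:.2f}%")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"accuracy:  {accuracy:.4f}")
print(f"F1 score:  {f1:.4f}")
```

The same calculation applies to the test-set shares reported in the next subsection.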

b. Logistic Regression Model - Test Performance

Observations:

• High True Negatives: The model performs well in identifying negative instances,
with a high proportion (86.99%) of the total instances being true negatives.
• False Negatives and Positives: The model has a moderate number of false
negatives (6.02%) and a relatively low number of false positives (2.33%). This
suggests that while the model is good at avoiding false positives, it misses a fair
number of actual positives.
• Precision and Recall: The precision for positive predictions is fairly high at about
66.67%, indicating that when the model predicts a positive instance, it is correct
most of the time. However, the recall for positive instances is relatively low at
about 43.64%, indicating that the model misses a significant portion of the actual
positive instances.
• Overall Accuracy: The model's overall accuracy is quite high (91.65%), indicating
good performance on the test data. However, accuracy alone may not be the best
metric if the classes are imbalanced or if false negatives/positives have different
costs.

5. Random Forest Model
a. Random Forest Model - Training Performance

Observations:

• Perfect Performance: The model shows perfect performance on the training data
with an accuracy, precision, recall, F1 score, and specificity all being 100%. This
indicates that the model correctly classified all instances in the training set.
• No Misclassifications: There are no false positives (FP) or false negatives (FN),
meaning the model made no errors in its predictions on the training data.

b. Random Forest Model - Test Performance

Observations:

• Accuracy (0.928155): This indicates that the model correctly predicted the class
labels for 92.82% of the instances in the test set. While this is high, accuracy alone
can be misleading, especially if the classes are imbalanced.
• Recall (0.490909): Also known as sensitivity or true positive rate, this metric
indicates that the model correctly identified 49.09% of the actual positive instances.
This relatively low recall suggests that the model is missing a significant number of
positive cases.
• Precision (0.75): This indicates that when the model predicts a positive class, it is
correct 75% of the time. High precision with low recall suggests that the model is
conservative in its positive predictions, preferring to avoid false positives at the
expense of missing true positives.
• F1 Score (0.593407): The F1 score is the harmonic mean of precision and recall,
providing a single metric that balances both. An F1 score of 0.593407 indicates that
the model has moderate performance when considering both false positives and
false negatives.
In summary, while the model shows high accuracy and precision, the low recall and
moderate F1 score indicate it struggles to identify all positive instances, suggesting a
potential issue with class imbalance and a need for strategies to improve its performance on
the minority class.
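
As a worked check of the harmonic-mean definition above, the sketch below recomputes the F1 score from the reported test precision and recall. The helper name `f1_score` is introduced here for illustration and is not taken from the project code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reported random forest test metrics.
rf_test_f1 = f1_score(precision=0.75, recall=0.490909)
print(f"{rf_test_f1:.6f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall keeps the F1 score well below the precision of 0.75.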

6. Model Comparison after Model Performance improvement post Hyperparameter tuning
a. Training performance
• Random Forest Original:
o All metrics are perfect (1.0), which suggests that the model has perfectly fit
the training data. This often indicates overfitting, meaning the model might
not generalize well to unseen test data.
• Random Forest Tuned:
o Accuracy: 0.917045
o Recall: 0.569288
o Precision: 0.921212
o F1 Score: 0.703704
o These metrics are more realistic than those of the original random forest
model. Precision is high while recall is moderate, giving an F1 score of about
0.70 on the training data.

• Logistic Regression Original:
o The model shows good accuracy (0.922229) and reasonable precision (0.710280),
but lower recall (0.460606) and a moderate F1 score (0.558824). This suggests
that the model performs well in identifying the majority class but misses a
significant number of positive cases.

• Logistic Regression Tuned:
o The tuning seems to have improved precision (0.927273) at the cost of
accuracy (0.810110) and recall (0.352535). The F1 score is slightly lower
(0.510851), indicating a trade-off where the model is more conservative in
making positive predictions.
• Overall, based on the training data performance:
o Random Forest Tuned model appears to be performing the best, striking a
good balance between accuracy, precision, recall, and F1 score.
o While the original random forest shows perfect metrics, it is likely overfitting,
so its generalizability is questionable.
o The logistic regression models, while decent, do not perform as well as the
tuned random forest model in terms of overall balanced metrics.

o Therefore, the tuned random forest model is likely the best-performing model
on the training data and has the potential to generalize better to unseen data
compared to the other models.

b. Test performance

• Logistic Regression Original:
o The model has good accuracy and moderate precision, but its recall is low, so
the F1 score indicates only moderate performance.
• Logistic Regression Tuned:
o Tuning has increased precision but reduced accuracy, recall, and F1 score,
indicating the model is conservative in making positive predictions but may
miss many true positives.
• Random Forest Original:
o This model shows the highest accuracy and the highest F1 score among all
models, with good precision and moderate recall.
• Random Forest Tuned:
o Tuning has slightly reduced all metrics compared to the original random
forest but still shows a balanced performance
• Conclusion
o Random Forest Models: The original random forest model shows the best
performance on the test data, with the highest accuracy (0.928155), balanced
recall (0.490909), good precision (0.750000), and the highest F1 score
(0.593407). The tuned random forest model performs well but not as well as
the original version.
o Logistic Regression Models: The original logistic regression model has good
accuracy (0.916505) but lower recall (0.436364) and F1 score (0.527473). The
tuned logistic regression model has reduced accuracy, recall, and F1 score,
though it has higher precision.
o Overall, the original random forest model is performing the best on the test
data, as indicated by its higher accuracy, balanced recall and precision, and
the highest F1 score. This suggests that despite potential overfitting observed
in training data, the original random forest model generalizes well to the test
data compared to the other models.
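
The overfitting concern can be made concrete by comparing training and test scores side by side. The sketch below uses only the F1 scores reported in this report for the two original models. The dictionary name `reported_f1` and the idea of reading the train-test gap as an overfitting indicator are assumptions of this example, not part of the project code.

```python
# F1 scores as reported in this report: (training, test).
reported_f1 = {
    "Random Forest Original": (1.0, 0.593407),
    "Logistic Regression Original": (0.558824, 0.527473),
}

# A large drop from training to test F1 is a common sign of overfitting.
gaps = {name: train_f1 - test_f1
        for name, (train_f1, test_f1) in reported_f1.items()}

for name, (train_f1, test_f1) in reported_f1.items():
    print(f"{name}: train={train_f1:.3f} test={test_f1:.3f} gap={gaps[name]:.3f}")

best_on_test = max(reported_f1, key=lambda name: reported_f1[name][1])
print("highest test F1:", best_on_test)
```

The random forest shows a much larger train-test gap than the logistic regression, yet still has the higher test F1, which matches the conclusion above.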

7. Final Model Selection
1. Logistic Regression Models:
• The original logistic regression model has high accuracy and precision on both train
and test data, but its recall and F1 scores are moderate, indicating it may miss many
positive instances.
• The tuned logistic regression model has lower accuracy but higher precision, with
reduced recall and F1 scores compared to the original.

2. Random Forest Models:
• The original random forest model shows perfect performance on training data, which
suggests overfitting. However, it still performs well on the test data with high
accuracy, balanced precision, recall, and the highest F1 score among all models.
• The tuned random forest model shows more realistic training performance
compared to the original, with good test performance, though slightly lower than the
original in all metrics.

• Final Recommendation
o Given the overall performance on both training and test data, the original
random forest model appears to be the best choice. Despite potential
overfitting indicated by perfect training metrics, it still generalizes well to the
test data with the highest accuracy (0.928155), balanced recall (0.490909),
precision (0.750000), and F1 score (0.593407) among all models. This
indicates that it captures the patterns in the data effectively and performs
reliably on new data.

8. Features Importance

Observations:

• Retained Earnings to Total Assets, Net Profit Before Tax to Paid-in Capital, and Per Share
Net Profit Before Tax are the most important features
• Operating Expense Rate, Inventory Turnover Rate (times), Quick Asset Turnover Rate,
Current Asset Turnover Rate, and Operating Profit Growth Rate are the features with the
lowest impact in the Random Forest model

9. Conclusion on Default Risk
• Companies with a lower equity-to-liability ratio, including the significant outliers
observed in that group, appear to have a higher risk of defaulting in the next two quarters.
The constant net income flag, asset growth, and operating and R&D expense
rates add context but are not as directly indicative of default risk as
the equity-to-liability ratio.
10. Recommendations/Mitigation Strategies
To mitigate the risk of default, companies at risk should consider the following strategies:

• Strengthen Equity Position:
o Increase equity through new equity financing, retained earnings, or
converting debt to equity. This will improve the equity-to-liability ratio and
reduce leverage.
• Debt Restructuring:
o Negotiate with creditors to restructure existing debt, potentially extending
maturities, reducing interest rates, or converting debt to equity to alleviate
short-term financial pressure.
• Cost Management:
o Implement cost control measures to reduce operating expenses. Evaluate and
prioritize essential expenses, and look for areas where costs can be cut
without significantly impacting operations.
• Enhance Revenue Generation:
o Focus on increasing sales and revenue through marketing efforts, expanding
product lines, or entering new markets. Diversifying revenue streams can
stabilize cash flows.
• Liquidity Management:
o Ensure sufficient liquidity to meet short-term obligations by improving cash
flow management, delaying non-essential expenditures, and optimizing
working capital.
• Investment in Innovation:
o While the R&D expense rate is low, a balanced investment in innovation can
lead to long-term growth. Explore cost-effective ways to foster innovation,
such as partnerships or grants.
• Risk Monitoring:
o Establish a risk monitoring system to continuously assess financial health and
anticipate potential issues. Regularly review key financial ratios and metrics to
detect early warning signs of financial distress