FRA Extended

The document outlines a project focused on developing a Bankruptcy Prediction Tool using machine learning to assess the bankruptcy risk of US publicly traded corporations. It details the process of exploratory data analysis, data preprocessing, model building, and performance evaluation, ultimately recommending the tuned Random Forest model for its balanced performance metrics. Key predictors of financial distress include market value, total long-term debt, and retained earnings, with actionable insights provided for stakeholders in risk management and strategic planning.

Uploaded by

aurorajashri

FRA Project

(Extended)
DSBA

By:
E. AuroRajashri

List of Contents

1.1 Define the problem and perform Exploratory Data Analysis
1.1.1 Problem definition
1.1.2 Check shape, Data types, statistical summary
1.1.3 Univariate analysis and Bivariate analysis. Key meaningful observations on individual variables and the relationship between variables
1.2 Data Preprocessing
1.2.1 Outlier Treatment
1.2.2 Missing Value Treatment
1.2.3 Data Split
1.2.4 Scaling
1.3 Model Building
1.3.1 Metrics of Choice (Justify the evaluation metrics)
1.3.2 Model Building (KNN, Naive bayes, Bagging, Boosting)
1.4 Model Performance evaluation
1.4.1 Check the confusion matrix and classification metrics for all the models (for both train and test dataset)
1.5 Model Performance improvement
1.5.1 Dealing with multicollinearity using VIF
1.5.2 Identify optimal threshold for Logistic Regression using ROC curve
1.5.3 Model performance check across different metrics
1.6 Model Performance Comparison and Final Model Selection
1.7 Actionable Insights & Recommendations
1.7.1 Key takeaway

List of Figures

Fig 1: Dataset Head rows


Fig 2: Dataset Info
Fig 3: Dataset Statistical Summary
Fig 4: Bankruptcy Analysis
Fig 5: Boxplot of all numeric variables
Fig 6: Distplot of all numeric variables
Fig 7: Heat Map of numeric variables
Fig 8: Count of Outliers
Fig 9: Boxplot – Post outlier treatment
Fig 10: Missing Values
Fig 11: Post Scaling
Fig 12: Logistic Regression
Fig 13: Random Forest Classifier
Fig 14: LR Training set – Confusion Matrix
Fig 15: LR Training set – Report
Fig 16: LR Test set – Confusion Matrix
Fig 17: LR Test set – Report
Fig 18: RF Training set – Confusion Matrix
Fig 19: RF Training set – Report
Fig 20: RF Test set – Confusion Matrix
Fig 21: RF Test set – Report
Fig 22: multicollinearity using VIF
Fig 23: Optimal Threshold using ROC
Fig 24: LR Tuned – Training set
Fig 25: LR Tuned – Training set
Fig 26: LR Tuned – Test set
Fig 27: LR Tuned – Test set
Fig 28: RF Tuned – Training set
Fig 29: RF Tuned – Training set
Fig 30: RF Tuned – Test set
Fig 31: RF Tuned – Test set
Fig 32: Model Performance comparison
Fig 33: Feature Importance Logistic regression coefficients
Fig 34: Feature Importance

Context
Bankruptcy prediction is a crucial component of financial risk management that protects the interests of creditors,
investors, and other stakeholders. Predicting a company's impending bankruptcy can help with timely interventions and
smart decision-making, which can reduce losses and promote stability in the economy. Predictive modeling can benefit
from the abundance of financial data provided by US corporations listed on major exchanges such as the New York Stock
Exchange (NYSE) and NASDAQ, which are subject to regulatory scrutiny and strict financial reporting requirements. A
firm is considered bankrupt, according to the Securities and Exchange Commission (SEC), if it files for bankruptcy under the
Bankruptcy Code's Chapter 11 (reorganization) or Chapter 7 (liquidation) provisions.

Objective
A well-known financial analytics company wants to create a Bankruptcy Prediction Tool to help regulators, investors,
and financial institutions assess the bankruptcy risk of US publicly traded corporations. The program will evaluate past
financial data using cutting-edge machine learning algorithms to find important signs and trends related to bankruptcy.
The following are this tool's main goals:
1. Bankruptcy Risk Assessment: Provide a probabilistic estimate of a company's likelihood of filing for bankruptcy
within a specified time frame (e.g., one year), allowing stakeholders to make informed decisions and take
preventive measures.
2. Early Warning System: Develop an early warning system that flags companies exhibiting financial distress
signals, enabling proactive risk management and strategic planning.
3. Financial Health Analysis: Analyze various financial metrics to offer a comprehensive assessment of a company's
financial health, highlighting areas of concern and potential vulnerabilities.

Data Dictionary
• Company_id: Unique identifier for each company
• Current_assets: Total current assets (in millions)
• Cost_of_goods_sold: Cost of goods sold (in millions)
• Depreciation_and_amortization: Depreciation and amortization expenses (in millions)
• EBITDA: Earnings Before Interest, Taxes, Depreciation, and Amortization (in millions)
• Inventory: Value of inventory (in millions)
• Net_income: Net income (profit or loss) (in millions)
• Total_receivables: Total receivables (in millions)
• Market_value: Market value of the company (in millions)
• Net_sales: Net sales or revenue (in millions)
• Total_assets: Total assets (in millions)
• Total_long_term_debt: Total long-term debt (in millions)
• EBIT: Earnings Before Interest and Taxes (in millions)
• Gross_profit: Gross profit (in millions)
• Total_current_liabilities: Total current liabilities (in millions)
• Retained_earnings: Retained earnings (in millions)
• Total_revenue: Total revenue (in millions)
• Total_liabilities: Total liabilities (in millions)
• Total_operating_expenses: Total operating expenses (in millions)
• Bankrupt: Bankruptcy status (1 = Bankrupt, 0 = Not Bankrupt)

1.1 Define the problem and perform Exploratory Data Analysis
1.1.1 Problem Definition
• Imported the necessary libraries: NumPy, pandas, Matplotlib, and seaborn.
• Loaded the given dataset into the dataframe election.


Fig 1: Dataset Head rows

1.1.2 Check shape, Data types, statistical summary

• The dataset has 1983 rows and 20 columns: 19 columns of integer datatype and 1 column of object datatype.

Fig 2: Dataset Info

• The statistical summary of the dataset is shown below.

Fig 3: Dataset Statistical Summary

• There are no duplicates in the dataset.

1.1.3 Univariate analysis and Bivariate analysis

• Univariate analysis

Fig 4: Bankruptcy Analysis

Fig 5: Boxplot of all numeric variables

Fig 6: Distplot of all numeric variables
1. The dataset is imbalanced: only about 20.88% of the companies are marked as bankrupt. This imbalance calls for techniques such as oversampling (e.g., SMOTE) or adjusted class weights so that the models handle the minority class effectively.
2. Variables such as Current Assets, EBITDA, and Net Income show a wide range of values, including negative numbers, which indicates financial distress in some companies. The boxplots of the numerical features reveal significant outliers, particularly in Net Income, Total Liabilities, and EBIT. These outliers may represent companies under extreme financial distress and are therefore critical for bankruptcy prediction.
3. Observations on individual variables:
• Current Assets: Shows a significant spike followed by a decline, indicating fluctuations in liquidity.
• Cost of Goods Sold (COGS): Displays a similar pattern, suggesting changes in production costs.
• EBITDA: Notable peaks indicate periods of strong operational performance.
• Net Income: Reflects profitability trends over time, essential for assessing financial health.
• Total Long-Term Debt: Provides insights into the company's leverage and financial obligations.
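
The class-weight adjustment mentioned in observation 1 can be sketched as follows. This is a minimal illustration, not the report's code: it uses the common "balanced" heuristic (n_samples / (n_classes * class_count)), and the label list is hypothetical, chosen only to mimic a roughly 20/80 split.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical labels (assumption): 1 = bankrupt, 0 = not bankrupt.
labels = [1] * 20 + [0] * 80
weights = balanced_class_weights(labels)
print(weights)
```

The minority class receives the larger weight, so each misclassified bankrupt company costs the model more than a misclassified healthy one.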

• Bivariate analysis

Fig 7: Heat Map of numeric variables

Strong Positive Correlations: Variables like Net Sales and Total Revenue, as well as Gross
Profit and EBITDA, show strong positive correlations, indicating redundancy and potential
for multicollinearity in predictive models.
Weak or Negative Correlations with "Bankrupt": The "Bankrupt" variable has weak or
slightly negative correlations with most financial metrics (e.g., Net Income, EBITDA),
suggesting bankruptcy is influenced by more complex or nonlinear factors.
Cost of Goods Sold (COGS) vs. Net Income: A negative correlation between COGS and Net
Income highlights the expected relationship where higher costs reduce profitability.
Market Value and Net Sales: A strong correlation indicates that a company's sales
performance significantly impacts its market valuation, a key insight for financial analysis.
Multicollinearity Risk: Variables such as Total Revenue, Net Sales, and Gross Profit are
highly correlated, suggesting the need for dimensionality reduction or careful feature
selection in modelling.
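
One way to turn the heat-map reading into a feature-selection shortlist is to list every pair of variables whose absolute Pearson correlation exceeds a cut-off. The sketch below is illustrative only: the 0.9 cut-off and the column values are assumptions, not figures from the dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical values (assumption), named after columns in the data dictionary.
columns = {
    "Net_sales": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Total_revenue": [11.0, 21.0, 29.0, 41.0, 52.0],
    "Inventory": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = correlated_pairs(columns)
print(pairs)
```

Each reported pair is a candidate for dropping one of its two members before fitting a linear model.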

1.2 Data Preprocessing


1.2.1 Outlier treatment
• The outlier counts and the boxplots after outlier treatment are shown below:


Fig 8: Count of Outliers


Fig 9: Boxplot – Post outlier treatment
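
The report does not state which outlier treatment was applied. A common choice, assumed here purely for illustration, is to cap values at the boxplot whiskers (1.5 times the interquartile range beyond the quartiles). The sample values are hypothetical.

```python
from statistics import quantiles


def cap_outliers_iqr(values, whisker=1.5):
    """Cap values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    # method="inclusive" interpolates between data points, treating the
    # sample as covering the full range of the data.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [min(max(v, lower), upper) for v in values]


# Hypothetical column (assumption) with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
capped = cap_outliers_iqr(values)
print(capped)
```

Capping keeps every row in the dataset, which matters here because extreme values may belong to exactly the distressed companies the model needs to learn from.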

1.2.2 Missing Value treatment


The dataset has no missing values, as shown below.

Fig 10: Missing Values

1.2.3 Data Split


• The data was split into train and test sets with a test size of 0.30.
• train_test_split was imported from the sklearn.model_selection module.
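
The report used scikit-learn's train_test_split for this step. The dependency-free sketch below only illustrates the idea of a 70/30 split. Stratifying by class is an assumption added here because the target is imbalanced; the report does not say whether stratification was used, and the labels and seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.30, seed=42):
    """Return (train_idx, test_idx), preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels (assumption): 20 bankrupt, 80 not bankrupt.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With stratification, the test set keeps the same share of bankrupt companies as the full data, so test metrics are not distorted by an unlucky draw.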

1.2.4 Scaling
• After applying the standard scaler, the head of the dataset is shown below.

Fig 11: Post Scaling
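
Standard scaling subtracts the mean and divides by the standard deviation of each column. The sketch below shows the arithmetic on a single hypothetical column. Fitting the statistics on the training data only and reusing them for the test data is the usual practice and is assumed here; the report does not show how the scaler was fitted.

```python
from statistics import fmean, pstdev


def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training data."""
    return fmean(train_column), pstdev(train_column)


def transform(column, mean, std):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    return [(v - mean) / std for v in column]


# Hypothetical values (assumption) for one numeric column.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 11.0]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)
print(mean, std, test_scaled)
```

Reusing the training statistics on the test set avoids leaking information about the test data into the model.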

1.3 Model Building
1.3.1 Metrics of choice
1) Logistic regression

Fig 12: Logistic Regression

2) Random Forest Classifier




Fig 13: Random Forest Classifier





1.4 Model Performance evaluation
Logistic Regression Model - Training Performance

Fig 14: LR Training set – Confusion Matrix

Fig 15: LR Training set – Report

Logistic Regression Model - Test Performance

Fig 16: LR Test set – Confusion Matrix

Fig 17: LR Test set – Report

Random Forest Model - Training Performance


Fig 18: RF Training set – Confusion Matrix


Fig 19: RF Training set – Report


Random Forest Model - Test Performance


Fig 20: RF Test set – Confusion Matrix



Fig 21: RF Test set – Report



1.5 Model Performance Improvement
1.5.1 Dealing with multicollinearity using VIF


Fig 22: multicollinearity using VIF
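
The variance inflation factor (VIF) of a feature equals the corresponding diagonal entry of the inverse of the feature correlation matrix. The sketch below uses that identity with the standard library only. It is not the report's code, the column values are hypothetical, and the rule of thumb that a VIF above roughly 5 to 10 signals problematic multicollinearity is a convention, not a threshold taken from the report.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    unit = []
    for col in columns:
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in unit] for ci in unit]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per column: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical values (assumption): the first two columns are near-duplicates.
names = ["Net_sales", "Total_revenue", "Inventory"]
columns = [
    [10.0, 20.0, 30.0, 40.0, 50.0],
    [11.0, 21.0, 29.0, 41.0, 52.0],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
vifs = vif(columns)
for name, value in zip(names, vifs):
    print(name, round(value, 2))
```

In this toy data the two near-duplicate columns get very large VIFs while the unrelated column stays close to 1, which is the pattern that motivates dropping one of a redundant pair.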



1.5.2 Identifying optimal threshold using ROC curve


Fig 23: Optimal Threshold using ROC
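
The report does not say which criterion was used to pick the threshold from the ROC curve. One widely used option, assumed here for illustration, is Youden's J statistic: choose the cut-off that maximizes TPR minus FPR. The labels and predicted probabilities below are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (threshold, tpr, fpr) using each distinct score as a cut-off."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((threshold, tp / positives, fp / negatives))
    return points


def best_threshold(y_true, scores):
    """Pick the threshold that maximizes Youden's J = TPR - FPR."""
    return max(roc_points(y_true, scores), key=lambda p: p[1] - p[2])[0]


# Hypothetical labels and predicted bankruptcy probabilities (assumption).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.80]
threshold = best_threshold(y_true, scores)
print(threshold)
```

On imbalanced data the selected cut-off often falls below the default of 0.5, which raises recall for the bankrupt class at the cost of some precision.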


1.5.3 Model performance check across different metrics
Logistic Regression Performance - Training Set


Fig 24: LR Tuned – Training set


Fig 25: LR Tuned – Training set

Logistic Regression Performance - Test Set


Fig 26: LR Tuned – Test set


Fig 27: LR Tuned – Test set


Random Forest Performance - Train Set


Fig 28: RF Tuned – Training set


Fig 29: RF Tuned – Training set


Random Forest Performance - Test Set


Fig 30: RF Tuned – Test set



Fig 31: RF Tuned – Test set

1.6 Model Performance Comparison and Final Model Selection


Fig 32: Model Performance comparison

Key Metrics:
1. Recall:
◦ High recall is critical because false negatives (missed bankruptcies) can have severe consequences.
◦ The tuned logistic regression and tuned random forest models have significantly higher recall values compared to others.
2. Precision:
◦ Precision indicates how often predicted bankruptcies are correct. A balance between precision and recall is essential to avoid unnecessary alarms.
◦ Tuned logistic regression has a lower precision compared to tuned random forest.
3. F1 Score:
◦ This metric balances precision and recall and is often a good indicator for imbalanced datasets.
◦ Tuned random forest has a better F1 score compared to tuned logistic regression, especially on testing data.
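
The three metrics above all derive from the confusion matrix. The sketch below computes them for a small hypothetical set of predictions (not results from the report), with 1 denoting the bankrupt class.

```python
def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions (assumption).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here one missed bankruptcy (a false negative) lowers recall, while two false alarms lower precision; F1 summarizes both in a single number.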
Observations:
• Random Forest (untuned): Although it achieves perfect accuracy, recall, precision, and F1 on training data, its performance on testing data suggests severe overfitting.
• Tuned Random Forest: Offers a good trade-off between recall, precision, and F1 on both training and testing datasets.
• Tuned Logistic Regression: Achieves high recall but suffers from relatively low precision and F1 scores.
Recommendation:
• Tuned Random Forest is the better choice based on its relatively high recall, balanced precision, and a stronger F1 score on the testing set. It effectively balances the risk of missed bankruptcies and false alarms compared to other models.


Fig 33: Feature Importance Logistic regression coefficients


Fig 34: Feature Importance 

Fig 33 highlights a few features with their respective importance values (logistic regression coefficients):
1. Market_value: The most significant feature, with the highest importance.
2. Total_long_term_debt: The second most important feature.
3. Total_receivables, Total_operating_expenses, Retained_earnings, Inventory, and Net_income are ranked progressively lower.
Fig 34 provides a broader overview of feature importance:
1. Market_value, Total_long_term_debt, and Retained_earnings have the highest importance.
2. Other features such as Net_income, Total_liabilities, and Total_receivables contribute relatively less but are still considered.
3. The order of features suggests a wider, more holistic feature comparison, with normalized relative importance values.

1.7 Actionable Insights and Recommendations
Business Insights
1. Key Predictors of Financial Distress:
◦ Market Value: Companies with declining market value are at higher bankruptcy risk. Market value reflects investor confidence and financial health.
◦ Total Long-Term Debt: High levels of long-term debt indicate a burdened financial structure, increasing default risk.
◦ Retained Earnings: Low or negative retained earnings highlight long-term underperformance, raising concerns about the firm's ability to sustain operations.
2. Sector-Wide Observations:
◦ The dataset reveals significant outliers in metrics like Net Income and EBIT, suggesting specific industries or companies may face extreme financial stress.
◦ Imbalance in bankrupt versus non-bankrupt companies highlights that bankruptcy is relatively rare but impactful, requiring precise identification.
3. Strategic Indicators:
◦ Variables like Gross Profit and EBITDA correlate strongly with revenue, emphasizing operational efficiency as a critical survival factor.
◦ Financial metrics such as Total Liabilities and Cost of Goods Sold (COGS) significantly influence profitability and distress signals.
4. Early Warning Signals:
◦ Companies with low EBITDA, declining profitability, and increasing liabilities are likely to move toward financial distress. These signals can trigger preventive measures.

Business Recommendations
1. For Financial Institutions and Investors:
• Risk Management: Use the model to monitor high-risk companies and adjust credit exposure or investment strategies proactively.
• Portfolio Diversification: Reduce concentration in sectors or companies showing consistent distress signals (e.g., high debt-to-equity ratios or falling market values).
• Early Interventions: Offer restructuring plans or debt renegotiation options for at-risk clients flagged by the model.
2. For Regulators:
• Strengthen Monitoring Systems: Leverage the early warning system to identify firms requiring closer regulatory scrutiny, reducing systemic risks in financial markets.
• Encourage Transparency: Promote accurate and timely financial disclosures to enhance predictive accuracy and market stability.
3. For Companies:
• Debt Management: Reduce high levels of long-term debt through refinancing or equity funding to improve financial stability.
• Operational Efficiency: Focus on improving EBITDA and reducing COGS to enhance profitability.
• Liquidity Management: Prioritize maintaining healthy liquidity ratios (e.g., current assets to current liabilities) to address short-term obligations effectively.
4. Enhance Predictive Monitoring Tools:
• Develop dashboards using the prediction tool to provide real-time bankruptcy risk insights for stakeholders.
• Offer financial health scorecards to benchmark companies against industry peers, encouraging self-assessment and improvement.
5. Adopt Strategic Partnerships:
• Collaborate with consulting firms to help distressed companies restructure operations, improve cash flow, and regain profitability.
• Build alliances with insurance providers to design bankruptcy protection products for high-risk clients.
6. Crisis Preparedness:
• Establish contingency plans for high-risk scenarios, including workforce management, asset divestment, and creditor negotiations.
• Develop a proactive communication strategy to reassure stakeholders during times of financial distress.
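
The liquidity and leverage ratios mentioned in the recommendations can be derived from columns in the data dictionary. The sketch below is illustrative: it approximates equity as total assets minus total liabilities and uses total liabilities over that book equity as one common variant of debt-to-equity. Both choices are assumptions, and the company figures are hypothetical.

```python
def liquidity_and_leverage(row):
    """Compute the current ratio and a debt-to-equity variant for one company."""
    equity = row["Total_assets"] - row["Total_liabilities"]
    current_ratio = row["Current_assets"] / row["Total_current_liabilities"]
    # Debt-to-equity is undefined when book equity is zero or negative.
    debt_to_equity = row["Total_liabilities"] / equity if equity > 0 else None
    return {"current_ratio": current_ratio, "debt_to_equity": debt_to_equity}


# Hypothetical company (assumption), values in millions.
company = {
    "Current_assets": 150.0,
    "Total_current_liabilities": 100.0,
    "Total_assets": 500.0,
    "Total_liabilities": 300.0,
}
ratios = liquidity_and_leverage(company)
print(ratios)
```

Ratios like these could feed a financial health scorecard alongside the model's predicted bankruptcy probability.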
