RANDOM FOREST (Binary Classification)

The document describes a machine learning workflow for binary classification of honey samples using spectral data. It includes: 1) Importing common Python libraries for data processing, modeling, and visualization. 2) Loading a CSV dataset, splitting it into features (spectral data) and a target (adulterated or not). 3) Training a random forest model on 80% of the data and evaluating its predictions on the remaining 20%. Key steps are data preprocessing, model training and tuning, and evaluating performance using various metrics to identify an accurate model. Feature importance plots provide insights into the most predictive spectral bands.


CODE:

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

EXPLANATION:
1. import numpy as np: This line imports the NumPy library as np. NumPy is a fundamental library
for numerical computations in Python, and it provides support for arrays and matrices, which are
commonly used in machine learning.

2. import pandas as pd: This line imports the Pandas library as pd. Pandas is another essential
library for data manipulation and analysis in Python, often used to work with structured data,
such as CSV files or data tables.

3. Comments (# data processing, CSV file I/O...): These lines are comments that briefly explain what each imported library is used for.

4. import os: This line imports the os module, which provides a way to interact with the operating
system. It is used to perform file and directory operations.

5. for dirname, _, filenames in os.walk('/kaggle/input'):: This line initiates a loop using the os.walk
function to traverse the directory tree starting from the '/kaggle/input' directory. It retrieves
three values in each iteration:

 dirname: The current directory being explored.

 _: A list of subdirectory names in the current directory (named _ because it is not used in this loop).

 filenames: A list of filenames in the current directory.

6. for filename in filenames:: This line starts another loop to iterate over the list of filenames
obtained in the previous step.

7. print(os.path.join(dirname, filename)): In this line, os.path.join() is used to combine the current dirname and filename into a full path, and print() displays that path on the console. This effectively lists all the files under the '/kaggle/input' directory and its subdirectories.
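For instance, joining the directory and file name of the dataset used later in this document would produce the following (a minimal illustration, assuming that file is present in the Kaggle input folder):

import os

# os.path.join inserts the correct path separator between its arguments
path = os.path.join('/kaggle/input/honey-adulteration', 'adulteration.csv')
print(path)  # /kaggle/input/honey-adulteration/adulteration.csv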

CODE:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Load the dataset (replace the path with your actual file location)
data = pd.read_csv('/kaggle/input/honey-adulteration/adulteration.csv')

# Split the data into features (X) and target (y)
X = data.iloc[:, 4:-1]  # Spectral band columns (drops the first four and the last column)
y = data['Class']       # Target variable (adulterated or not)

# Map 'Class' to binary labels (e.g., 'Clover' to 1 and all other classes to 0)
y = (y == 'Clover').astype(int)

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (optional but often recommended)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize and train a classification model (Random Forest in this example)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the evaluation results
print(f'Accuracy: {accuracy:.2f}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{classification_rep}')

# Plot feature importances (if applicable to your model)
feature_importances = clf.feature_importances_
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importances)), feature_importances)
plt.xlabel('Spectral Bands')
plt.ylabel('Feature Importance')
plt.title('Feature Importance for Binary Classification')
plt.show()

EXPLANATION:
This code snippet demonstrates a workflow for building and evaluating a binary classification model
using a dataset with spectral band features. Here's a step-by-step explanation:

1. Importing Libraries:

 Necessary libraries such as pandas, numpy, scikit-learn, and matplotlib are imported to
perform data manipulation, model building, and visualization tasks.

2. Loading the Dataset:


 The dataset is loaded from a CSV file ('adulteration.csv') using pandas and stored in a
DataFrame named data.

3. Splitting Features and Target:

 The features (X) are selected with data.iloc[:, 4:-1], which drops the first four columns and the last column (assumed not to be spectral measurements). The remaining columns are assumed to represent spectral band data.

 The target variable (y) is extracted from the 'Class' column, where binary labels are
created. For example, 'Clover' is mapped to 1 (indicating adulterated) and other classes
to 0 (indicating not adulterated).

4. Splitting the Dataset:

 The dataset is split into training and testing sets using the train_test_split function from
scikit-learn. This is a common practice for evaluating machine learning models. Here,
80% of the data is used for training, and 20% is used for testing.

5. Standardizing Features (Optional):

 The features are standardized using the StandardScaler from scikit-learn. Standardization rescales each feature to a mean of 0 and a standard deviation of 1, i.e. z = (x - mean) / std. This mainly benefits scale-sensitive algorithms; tree-based models such as Random Forests are largely unaffected by feature scaling, so the step is optional here but often recommended as a general habit.

6. Initializing and Training a Classification Model (Random Forest):

 A binary classification model is initialized using the RandomForestClassifier from scikit-learn. This model learns the relationship between the spectral band features and the binary target variable (adulterated or not). It is created here with default hyperparameters; a hedged tuning sketch is given after the summary paragraph that closes this explanation.

 The model is trained on the standardized training data using the fit method.

7. Making Predictions:

 The trained model is used to make predictions on the test set using the predict method.

8. Evaluating Model Performance:

 The code calculates the accuracy of the model's predictions using the accuracy_score function from scikit-learn.

 The confusion matrix is computed using the confusion_matrix function, providing the counts of true negatives, false positives, false negatives, and true positives (a sketch for unpacking these counts follows this list).

 A classification report is generated using the classification_report function, which includes precision, recall, F1-score, and support for both classes.

9. Printing Evaluation Results:

 The accuracy, confusion matrix, and classification report are printed to assess the model's performance.

10. Plotting Feature Importances (if applicable):

 If the model supports feature importance analysis (as Random Forest does), the code
calculates and plots feature importances. This helps understand which spectral bands
contribute most to the classification decision.
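As referenced in step 8, here is a minimal sketch of how to unpack the printed confusion matrix and recompute two of the report's metrics by hand, assuming the same y_test and y_pred from the code above:

# For binary labels, ravel() returns the counts in the order
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')

precision = tp / (tp + fp)  # fraction of predicted positives that are correct
recall = tp / (tp + fn)     # fraction of actual positives that are recovered
print(f'Precision: {precision:.2f}, Recall: {recall:.2f}')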

Overall, this code provides a complete example of a binary classification workflow, including data
preprocessing, model training, evaluation, and feature importance analysis.
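The summary at the top of this document also mentions model tuning, which the code itself does not perform. Below is a minimal, hedged sketch of one common approach using scikit-learn's GridSearchCV; the grid values are illustrative assumptions, not settings taken from the original code:

from sklearn.model_selection import GridSearchCV

# Illustrative hyperparameter grid; adjust for your dataset
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training set
    scoring='accuracy',  # same metric the code reports on the test set
)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated accuracy: {search.best_score_:.2f}')
clf = search.best_estimator_  # the tuned model can then replace clf above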

GRAPH EXPLANATION:
The graph in the code is used to visualize the feature importances when using a Random Forest classifier
for binary classification. This visualization helps you understand which spectral bands (features) are the
most important for making classification decisions. Here's an explanation of the graph:

1. Feature Importances: In a machine learning model like Random Forest, feature importances represent how much each feature (each spectral band, in this case) contributes to the model's predictions; scikit-learn reports the mean decrease in impurity across all trees. Higher feature importance indicates that the feature is more influential in making classification decisions.

2. x-axis (Spectral Bands): The x-axis of the graph indexes the spectral bands used as features (the code plots them by column position rather than by name). Each band corresponds to a specific wavelength in the hyperspectral data, such as 399.40nm, 404.39nm, and so on. These bands are the input features for the model.

3. y-axis (Feature Importance): The y-axis represents the feature importance scores. It quantifies
the importance of each spectral band in the classification process. Higher values indicate more
important features.

4. Bars: Each bar in the graph corresponds to a specific spectral band. The height of the bar
represents the feature importance score for that band. The taller the bar, the more important
that particular band is in making classification decisions.

5. Interpretation: By looking at this graph, you can identify which spectral bands have the most
significant impact on whether honey is classified as 'Clover' (positive class) or not. Bands with
higher feature importance are more informative for distinguishing between the two classes.

6. Usage: You can use this information to reduce the number of features (spectral bands) in your model if some bands contribute little; a sketch of this follows at the end of this section. It can also provide insight into the underlying characteristics of the data and help focus further analysis on the specific wavelengths that matter most for classification.

The graph is a valuable tool for feature selection and for understanding the key factors behind the model's decisions. It can guide feature engineering, model improvement, and domain-specific interpretation of the dataset.
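As referenced in point 6, here is a minimal sketch that ranks the bands by importance and keeps only the most informative ones, assuming X still holds the spectral DataFrame from the code and that its column headers are the wavelength labels; the 0.01 threshold is purely illustrative:

# Rank bands by importance; X.columns holds the wavelength names from the CSV
band_names = np.array(X.columns)
order = np.argsort(feature_importances)[::-1]  # most important first

# Show the ten most informative wavelengths
for name, score in zip(band_names[order][:10], feature_importances[order][:10]):
    print(f'{name}: {score:.4f}')

# Keep only bands whose importance exceeds an assumed threshold
keep = feature_importances > 0.01
X_reduced = X.loc[:, keep]
print(f'Kept {keep.sum()} of {len(band_names)} bands')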
